Thesaurus

From glossaLAB
Collection GlossariumBITri
Author Eva Carbonero
Jorge Morato-Lara
Editor Sonia Sánchez-Cuadrado
Year 2010
Volume 1
Number 1
ID 89
Object type Concept
Domain Information Retrieval
es tesauro
fr thésaurus
de thesaurus

A thesaurus is a controlled vocabulary used to represent the concepts of a specific domain systematically. The thesaurus identifies the relationships between concepts. Every concept is represented by a single term, called a descriptor. Thesauri are resources developed to index documents by these descriptors.

Thesaurus elements

The thesaurus consists of:

Descriptors: normalized terms. Each descriptor represents a relevant concept in the domain.

Non-descriptors: some descriptors have an equivalent term, called a non-descriptor. A non-descriptor can point to only one descriptor in the thesaurus. Because it represents an equivalence relationship with a single descriptor, it can be used to expand a query.

Hierarchical relationships: these represent the relation between a generic concept and a specific concept. They include Broader-Narrower Concepts, Genus-Species, Whole-Parts, and Class-Instances. Polyhierarchies, in which a specific concept has two or more generic concepts, are allowed.

Associative relationships: these link concepts semantically. They are used when there is no hierarchical or equivalence relation between the concepts.

Scope notes: explanatory notes about the scope and use of a descriptor.

Example:

CAR

BT automotive vehicle

NT ambulance

NT cab

RT driver

RT road

UF automobile

SN Regarding part of a train, see railcar

where BT stands for Broader Term, NT for Narrower Term, RT for Related Term, UF for Used For, and SN for Scope Note.
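As an illustration only, the CAR entry could be held in a small data structure and used to normalize and expand query terms. The class name `Entry`, the helper names, and the expansion policy (descriptor plus its non-descriptors and narrower terms) are assumptions made for this sketch; they do not come from any thesaurus standard or tool.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One descriptor with the relationship fields used in the CAR example."""
    descriptor: str
    bt: list = field(default_factory=list)  # broader terms
    nt: list = field(default_factory=list)  # narrower terms
    rt: list = field(default_factory=list)  # related terms
    uf: list = field(default_factory=list)  # non-descriptors (used for)
    sn: str = ""                            # scope note


car = Entry(
    descriptor="car",
    bt=["automotive vehicle"],
    nt=["ambulance", "cab"],
    rt=["driver", "road"],
    uf=["automobile"],
    sn="Regarding part of a train, see railcar",
)

# Each non-descriptor maps to exactly one descriptor.
use_index = {alt: car.descriptor for alt in car.uf}


def normalize(term):
    """Map a query term to its descriptor; unknown terms are returned lowercased."""
    term = term.lower()
    return use_index.get(term, term)


def expand(term):
    """Expand a query term with equivalent and narrower terms (assumed policy)."""
    descriptor = normalize(term)
    if descriptor != car.descriptor:
        return [descriptor]
    return [car.descriptor, *car.uf, *car.nt]


print(expand("automobile"))
```

A query for "automobile" is first mapped to the descriptor "car" and then widened with its narrower terms, which is the kind of query expansion that non-descriptors make possible.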

Thesaurus features

Domain coverage: some thesauri are multidisciplinary; others cover a single domain. Multidisciplinary contexts increase ambiguity because polysemes and homonyms are more likely.

Output formats: on paper, a thesaurus usually has two layouts, alphabetic and systematic (hierarchical). The rise of the Web has brought web formats such as XML or RDF, both used with the SKOS metadata vocabulary. Other vocabularies have also been proposed, such as Zthes, BS 8723, MADS, or Topic Maps' PSI.

Monolingual/Multilingual: multilingual contexts have the same problems as multidisciplinary contexts.

Polyhierarchies/Monohierarchies: in a polyhierarchy a concept has more than one broader term, so query expansion can follow several branches and behave unpredictably.

Uniterms/Compound terms. Compound terms are usually nouns, but some thesauri have adjectives (as part of compound terms), acronyms, verbs and proper nouns.
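The polyhierarchy issue mentioned above can be sketched with a toy hierarchy. The terms "emergency vehicle" and "fire engine" and both function names are hypothetical and added only for illustration. Expanding downwards has to list a shared term once, and expanding upwards from a term with two parents pulls in two different branches.

```python
# Hypothetical narrower-term links; "ambulance" has two broader terms.
NARROWER = {
    "automotive vehicle": ["car", "emergency vehicle"],
    "car": ["ambulance", "cab"],
    "emergency vehicle": ["ambulance", "fire engine"],
}


def narrower_closure(term):
    """Return all terms below `term`, each listed once, in breadth-first order."""
    seen = {term}
    queue = [term]
    result = []
    while queue:
        current = queue.pop(0)
        for child in NARROWER.get(current, []):
            if child not in seen:  # a polyhierarchical term is reached twice
                seen.add(child)
                result.append(child)
                queue.append(child)
    return result


def broader_terms(term):
    """Return every direct broader term of `term`, sorted alphabetically."""
    return sorted(parent for parent, kids in NARROWER.items() if term in kids)


print(narrower_closure("automotive vehicle"))
print(broader_terms("ambulance"))
```

In a monohierarchy `broader_terms` would return at most one term, so an upward expansion would stay on a single path.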

Differences between Ontologies and Thesauri

A thesaurus has a few predefined elements. It has a lexical nature, and its main applications are in natural language. Thesauri originated on paper, but they have now moved to digital media. This implies encoding thesauri in web languages, such as RDF or XML, and expressing thesaurus elements with metadata vocabularies, such as SKOS.
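To make the encoding step concrete, the following sketch writes the CAR entry from the earlier example as SKOS in Turtle syntax. The SKOS property names (skos:prefLabel, skos:altLabel, skos:broader, skos:narrower, skos:related, skos:scopeNote) are part of the SKOS vocabulary; the function name, the `ex:` namespace and the IRI scheme are assumptions made for illustration, and the code builds plain strings rather than using an RDF library.

```python
def to_skos_turtle(label, broader=(), narrower=(), related=(), alt=(), note=""):
    """Serialize one concept as SKOS Turtle.

    Labels are inserted verbatim, so they must not contain double quotes.
    """

    def iri(term):
        # Assumed IRI scheme: spaces become underscores under the ex: prefix.
        return "ex:" + term.replace(" ", "_")

    lines = [f"{iri(label)} a skos:Concept", f'    skos:prefLabel "{label}"@en']
    lines += [f'    skos:altLabel "{a}"@en' for a in alt]    # UF
    lines += [f"    skos:broader {iri(b)}" for b in broader]   # BT
    lines += [f"    skos:narrower {iri(n)}" for n in narrower] # NT
    lines += [f"    skos:related {iri(r)}" for r in related]   # RT
    if note:
        lines.append(f'    skos:scopeNote "{note}"@en')         # SN
    header = (
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
        "@prefix ex: <http://example.org/thesaurus/> .\n\n"
    )
    return header + " ;\n".join(lines) + " .\n"


print(to_skos_turtle(
    "car",
    broader=["automotive vehicle"],
    narrower=["ambulance", "cab"],
    related=["driver", "road"],
    alt=["automobile"],
    note="Regarding part of a train, see railcar",
))
```

Each thesaurus element maps onto one SKOS property, which is why SKOS is a natural target when moving a printed thesaurus to the Web.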

An ontology has a semantic nature. Its origins lie in philosophy, mathematical logic and artificial intelligence. It enables inference through a set of rules, axioms, and restrictions. The current success of ontologies is due to their presence in the Semantic Web, where they provide a necessary way to share knowledge. One of the main concerns is interoperability, the property that ensures that unknown software will be able to work with ontologies all over the Web. Interoperability requires knowledge to be represented in a formalized way, for example in RDF or OWL. The primitives of ontologies are properties (slots), instances, hierarchies, and relationships.

An important difference is that thesauri are built to satisfy an existing information need, whereas ontologies have a proactive origin: they are often built before the need arises.

Both ontologies and thesauri represent the main concepts of a domain and the relationships between them. Methodologies to build ontologies and thesauri share their first steps, but the heavier semantic and logical load of ontologies makes the later stages of their development diverge. In the ontology literature, thesauri are called lightweight ontologies. Building an ontology is a laborious task, and working with a natural language thesaurus is a simpler and more efficient approach.

Methodology to build a thesaurus

  1. First, identify the information needs that the thesaurus will satisfy and the domain to be covered.
  2. Next, analyze similar thesauri and resources to see whether they can be reused.
  3. Select software to edit and encode the thesaurus. The user interfaces for managing and querying it must be as intuitive and friendly as possible.
  4. Select the main terms. Typically, these terms are identified with the help of domain experts or specialized literature.
  5. Define a small number of seed terms in the thesaurus. Usually, around 10 is enough.
  6. Arrange the terms hierarchically. Usually, new terms are added to avoid gaps in the hierarchical structure.
  7. Define the relationships between concepts.
  8. Train indexers to use the thesaurus.
  9. Maintain, update and improve the thesaurus.
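Steps 6 and 9 above lend themselves to simple automated checks. The sketch below is one possible check, not part of any published methodology: it reports broader terms that are not themselves descriptors (gaps in the hierarchy) and non-descriptors that point to more than one descriptor or to an unknown one. The function name and the input layout are assumptions.

```python
def check_thesaurus(broader, use):
    """Return a list of problems found in a small thesaurus.

    broader: dict mapping each descriptor to a list of its broader terms.
    use:     list of (non_descriptor, descriptor) pairs.
    """
    problems = []
    descriptors = set(broader)

    # Every broader term should itself be a descriptor (no hierarchy gaps).
    for term, parents in broader.items():
        for parent in parents:
            if parent not in descriptors:
                problems.append(f"missing descriptor: {parent} (BT of {term})")

    # A non-descriptor may point to one existing descriptor only.
    targets = {}
    for alt, descriptor in use:
        targets.setdefault(alt, set()).add(descriptor)
    for alt, found in sorted(targets.items()):
        if len(found) > 1:
            problems.append(f"non-descriptor {alt} points to {sorted(found)}")
        for descriptor in sorted(found - descriptors):
            problems.append(f"non-descriptor {alt} points to unknown {descriptor}")
    return problems


broader = {
    "automotive vehicle": [],
    "car": ["automotive vehicle"],
    "ambulance": ["car", "emergency vehicle"],  # "emergency vehicle" is undefined
}
use = [("automobile", "car"), ("auto", "car"), ("auto", "automotive vehicle")]

for problem in check_thesaurus(broader, use):
    print(problem)
```

Running such a check after each update is one way to keep the hierarchy free of gaps while the thesaurus is maintained.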

Some thesauri online

Agrovoc: multilingual thesaurus developed by FAO and focused on agriculture. It has an equivalent thesaurus in NALT.

CAB Thesaurus: focused on life sciences

Canadian literacy thesaurus: literature thesaurus, bilingual.

Eurovoc: multilingual thesaurus, developed by the EU to manage administrative documents.

MeSH: Medical Subject Headings, one of the largest thesauri, centered on the medical domain and used to index the Medline database.

WordNet: lexical database, centered on the English language. There are versions in other languages. It is widely used in ontology construction, word sense disambiguation (WSD), merging, retrieval, translation, and other Natural Language Processing (NLP) applications.

Software to edit and manage thesauri

TCS: thorough and flexible software. It has suitable features to adapt thesauri to the Web and a good set of export formats.

Domain Reuse: This suite has some tools to perform term filtering and to identify relationships between terms.

TemaTres: A free platform to edit and manage thesauri on the Web. It can export to several Web formats and use different metadata vocabularies.

ThManager: a tool to edit and manage thesauri, free, and multilingual. It exports with Dublin Core and SKOS formats. This software can extract terms with WordNet.

Standards

ISO 2788 (1986) Guidelines for the Establishment and Development of Monolingual Thesauri.

Z39.19 (2005) Guidelines for the Construction, Format, and Management of Monolingual Thesauri.

ISO 5964 (1985) Guidelines for the Establishment and Development of Multilingual Thesauri. It is one of the first standards to address alignment problems.

ISO 13250 (2003) Topic Maps. Topic Maps were developed to merge word indexes, and their Published Subject Indicators (PSIs) for thesauri are strongly related to the SKOS proposal.

References

  • AITCHISON, J. and DEXTRE CLARKE, S. (2004). The Thesaurus: A Historical Viewpoint, with a Look to the Future. Cataloging & Classification Quarterly, vol. 37, no. 3/4, pp. 5-21.
  • LANCASTER, F. W. (1995). El control del vocabulario en la recuperación de la información. València: Universitat de València, p. 286.
  • SLYPE, G. van. (1991). Lenguajes de indización: concepción, construcción y utilización en los sistemas documentales. Fundación Germán Sánchez Ruipérez, p. 200.
  • PÉREZ AGÜERA, J.R. (2004). Automatización de tesauros y su utilización en la Web Semántica. BID: textos universitaris de biblioteconomia i documentació, Desembre 2004, no. 13. [Online] <http://www.ub.es/bid/13perez2.htm> [Accessed: 11/2010]
  • ISO-2788: 1986. Guidelines for the Establishment and Development of Monolingual Thesauri. International Organization for Standardization, Second edition -11-15 UDC 025.48. Geneva: ISO, 1986.
  • SÁNCHEZ-CUADRADO, S., MORATO, J., [et al.] (2007). Definición de una metodología para la construcción de Sistemas de Organización del Conocimiento a partir de un corpus documental en Lenguaje Natural. Procesamiento del Lenguaje Natural, no. 39, pp. 213-220.