Information Retrieval

Collection	GlossariumBITri
Author	Jorge Morato-Lara
Editor	Jorge Morato-Lara
Year	2010
Volume	1
Number	1
ID	69
Object type	Concept
Domain	Informatics, Information Management, Linguistics, LIS
es	recuperación de información
fr	recherche d'information
de	Informationswiedergewinnung

Information retrieval is the set of activities that facilitates the searching and retrieval of data. Information retrieval comprises techniques from linguistics, computer science, information science, and text mining.

1. Changes in the meaning of the term

Initially, the term denoted only the set of techniques and processes for retrieving data from databases in computer systems. In the early nineties, as the number of text documents on the Web grew, text retrieval became the main goal of these techniques. Most of these tools look for words shared by the textual query and the textual documents. As multimedia resources grew on the Internet, search engines began to search audio, image, and video resources as well. In the literature, document retrieval, text retrieval, information retrieval, and data retrieval are often used as equivalents, although each has a specific meaning.

Traditionally, in the Web context, the answer to a query is a set of documents that probably contain data relevant to the topic. A related area is question-answering systems, which answer a query with a specific piece of data rather than with a set of documents.

2. Information Retrieval and Knowledge Retrieval

Usually, information concerns what something is and which properties it has. In other words, information is related to definitions. But information seldom addresses how it relates to other information elements in a specific context. The integration of information items with one another is what is regarded as knowledge. So, making know-how explicit requires defining how the items are related and how the process unfolds. This approach assumes two important concepts for performing a task: the existence of a goal and the existence of relationships among the concepts in the system. On the one hand, the existence of a goal implies a purpose and a need to achieve it. Such a goal exists only in living beings. Therefore, knowledge retrieval makes sense only in the brain of the human being who performs the query. On the other hand, knowledge implies that the information is interrelated in order to achieve the goal. So, the information is related by means of a set of rules and restrictions. The inclusion of these rules in computer applications is the reason for the change of name from Information Retrieval Systems to Knowledge Retrieval Systems. These systems have their origin in the field of Artificial Intelligence (AI). AI tries to emulate human reasoning, and this involves having purposes, rules, and relationships. Intelligent agents and ontologies are necessary resources for emulating the human brain. These resources have led to the renaming of information retrieval as knowledge retrieval. Knowledge Retrieval Systems try to implement search engines that not only search for words in documents, but also process and even infer data.

3. Information Retrieval Language and Information Retrieval Systems. Because information retrieval refers to computer systems (in contrast with library methods, which have a wider meaning), some retrieval languages are tied to a specific technology or system. Some well-known retrieval languages are SQL, SPARQL, and Boolean query languages.
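As an illustration of a Boolean retrieval language, the following minimal sketch evaluates two Boolean queries over a small inverted index. The documents, their identifiers, and the helper names build_index and postings are assumptions made for this example; real systems parse a query syntax rather than using set operators directly.

```python
# Toy collection; document ids and texts are invented for illustration.
docs = {
    1: "information retrieval with boolean queries",
    2: "database systems and sql queries",
    3: "semantic web and sparql",
}


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(docs)


def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())


# Boolean operators map onto set operations:
# "queries AND NOT sql" is a set difference, "sql OR sparql" is a union.
result_and_not = postings("queries") - postings("sql")
result_or = postings("sql") | postings("sparql")

print(sorted(result_and_not))
print(sorted(result_or))
```

The results are plain sets, which reflects the fact that a pure Boolean query returns matching documents without any ranking.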

4. Metadata, descriptors, and indexing. In the 1960s and 1970s, computers had limited storage capacity and low processing speed. Documents in these systems had to represent their content with metadata and a small set of terms, called descriptors. Typical metadata were author, title, source, and date. Metadata and descriptors were assigned by hand.

Nowadays, these metadata are used in the Semantic Web because of their simplicity, which facilitates interoperability and navigation on the Web.

Automatic indexing deals with techniques for automatically assigning relevant terms to a document. Relevance is computed from statistics and from the location of the term in the document. Examples are term frequency combined with inverse document frequency (known as TF-IDF), stop-word removal, and a higher weight for words that appear in the title or are typographically emphasized (e.g. bold letters). Most of these factors are used in web search engines.
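The weighting described above can be sketched in a few lines. This is a minimal illustration, assuming one common TF-IDF variant (relative term frequency multiplied by the natural logarithm of N/df); the sample documents, the stop-word list, and the function names are invented for the example, and other variants of the formula exist.

```python
import math
from collections import Counter

# Toy collection and stop-word list, invented for illustration.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "search engines index the web",
}
STOP_WORDS = {"the", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using tf * ln(N / df)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    weights = {}
    for doc_id, words in tokens.items():
        tf = Counter(words)
        weights[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights


weights = tf_idf(docs)
for term, weight in sorted(weights["d1"].items()):
    print(term, round(weight, 3))
```

In this collection, "sat" occurs in only one document and therefore receives a higher weight in d1 than "cat", which occurs in two. A title or bold-text boost could be added by multiplying the weight of such terms by a constant factor.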

5. Information retrieval by controlled vocabularies. In Information Science, the terms of a specific domain are often listed in a normalized form. This list is called a controlled vocabulary, and each of its terms is known as a descriptor. The vocabulary may include relationships among terms. Vocabulary control tries to avoid typical problems of natural language: polysemy, homonymy, and synonymy.

The relationships in these vocabularies may be of different kinds. In a thesaurus, the relationships are equivalence, hierarchy, and semantic relatedness. A faceted thesaurus presents different scopes (facets) to facilitate retrieval.
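To show how the three thesaurus relationships can support retrieval, here is a minimal sketch of query expansion. The mini-thesaurus, its terms, and the functions normalize and expand are hypothetical and were invented for this example.

```python
# Hypothetical mini-thesaurus, invented for illustration.
# USE: equivalence (non-preferred term -> descriptor).
# NT:  hierarchy (descriptor -> narrower terms).
# RT:  semantic relatedness (descriptor -> related terms).
USE = {"automobile": "car", "motorcar": "car"}
NT = {"vehicle": ["car", "bicycle"]}
RT = {"car": ["road traffic"]}


def normalize(term):
    """Replace a synonym with its preferred descriptor, if one is defined."""
    term = term.lower()
    return USE.get(term, term)


def expand(term, include_related=False):
    """Return the descriptor plus its narrower (and optionally related) terms."""
    descriptor = normalize(term)
    expanded = [descriptor]
    expanded += NT.get(descriptor, [])
    if include_related:
        expanded += RT.get(descriptor, [])
    return expanded


print(expand("vehicle"))
print(expand("motorcar", include_related=True))
```

Mapping synonyms to a single descriptor addresses the synonymy problem, while expansion through narrower terms lets a query on a broad concept also reach documents indexed under more specific descriptors.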

6. Relevance. Relevance is a measure of the degree to which a certain element answers a query. This measure is subjective, in the sense that it depends on the knowledge of the person who assesses the relevance.

7. Retrieval Measures. The performance of an information retrieval system can be measured from the data or documents it retrieves. There are two coefficients:

Precision: proportion of the retrieved data that is relevant.
Recall: proportion of the relevant data in the database that is retrieved.

The two measures have an inverse relationship (Cleverdon's Law): increasing precision tends to decrease recall. These coefficients reflect two different factors, noise and silence.
Noise: non-relevant data that have been retrieved.
Silence: relevant data that have not been retrieved from the database.
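The four notions above can be computed directly from two sets: the retrieved items and the relevant items. The following is a minimal sketch; the function name evaluate and the document identifiers are assumptions made for the example.

```python
def evaluate(retrieved, relevant):
    """Compute precision, recall, noise, and silence for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant      # relevant items that were retrieved
    noise = retrieved - relevant     # retrieved but not relevant
    silence = relevant - retrieved   # relevant but not retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "noise": noise,
        "silence": silence,
    }


# Invented example: 4 documents retrieved, 5 relevant in the collection.
result = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(result["precision"], result["recall"])
print(sorted(result["noise"]), sorted(result["silence"]))
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4). Noise lowers precision, and silence lowers recall.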

Computing recall requires knowing how many elements in the database are relevant to a specific query. This relevance list is called a test collection, and it is built by hand. Test collections are used in international competitions to evaluate retrieval systems. TREC (Text REtrieval Conference) is the best-known conference on retrieval.

8. Retrieval Models. Retrieval models compute the degree to which certain elements answer a query. As a general rule, this is computed by means of a similarity coefficient (cosine, phi, etc.). The most popular models are:

Boolean: only two values are computed, relevant and non-relevant. Only relevant documents are retrieved, in no particular order. An example is SQL in relational databases. There is, however, an extended Boolean model that provides a way to sort results.
Vectorial: a vector is built to represent the terms that each item contains. The query vector is compared with every document vector by measuring the angle between them.
Probabilistic: the probability that a document answers a query is computed. Relevance feedback is often used to improve the probability estimate. Feedback is based on user judgments about the set of retrieved documents. Words from positive results are given a higher value when the query is recomputed.
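As an illustration of the vectorial model, the sketch below ranks documents by the cosine of the angle between the query vector and each document vector. It assumes raw term counts as vector components (a real system would typically use weights such as TF-IDF); the documents and function names are invented for the example.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy collection, invented for illustration.
docs = {
    "d1": "information retrieval models",
    "d2": "cooking recipes for pasta",
    "d3": "retrieval of information from databases",
}


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    q = to_vector(query)
    scores = {doc_id: cosine(q, to_vector(text)) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("information retrieval", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Unlike the Boolean model, every document receives a graded score, so results can be sorted. In this example d1 ranks above d3 because both contain the two query terms but d1 is shorter, and d2 scores zero because it shares no term with the query.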

References

ANTONIOU, G., VAN HARMELEN, F. (2004). A Semantic Web Primer. Massachusetts: MIT Press.
BAEZA-YATES, R., RIBEIRO-NETO, B. (1999). Modern information retrieval. New York: ACM Press; New York: Addison-Wesley.
CLEVERDON, C.W. (1972). “On the inverse relationship of recall and precision”. Journal of Documentation, Vol. 28, pp. 195-201.
SPARCK JONES, K., WILLETT, P. (eds.) (1997). Readings in information retrieval. San Francisco: Morgan Kaufmann.