Information retrieval
“ | But do you know that, although I have kept the diary [on a phonograph] for months past, it never once struck me how I was going to find any particular part of it in case I wanted to look it up? | ” |
—Dr Seward Bram Stoker's Dracula, 1897 |
Information retrieval is a field of Computer science that looks at how non-trivial data can be obtained from a collection of information resources. Commonly, either a full-text search is done, or the metadata which describes the resources is searched. Depending on the content, there may also be other indices. Information retrieval is about finding existing information; it is different from knowledge discovery in databases which is about finding new relationships between different datasets, which were unknown.
Techniques from information retrieval are commonly used in internet search engines, but also when looking for information on a subject in a library, or when searching complex content such as images in a database. This kind of content is usually described using metadata.
Problem description
When looking for information, the following problems are commonly encountered:
- The database contains an insufficient amount of information about the content of the documents it contains. A search may yield an incorrect answer, or no answer at all.
- The search is done with terms and keywords that are vague, so that the results returned do not exactly match the needs of the request.
A system for information retrieval will attach a score to each document returned. This score reflects how well the document matches the query of the user. The documents with the best scores are shown to the user, who has the possibility to refine the query.
Different models
There are different kinds of models that are used in information retrieval
First dimension: the mathematical model
- Set-theoretic models represent documents as a set of words or features.
- Algebraic models use vectors, matrices and tuples.
- Probabilistic models treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like the Bayes' theorem are often used in these models.
- Feature-based retrieval models view documents as vectors of values of feature functions (or just features) and seek the best way to combine these features into a single relevance score, typically by learning to rank methods. Feature functions are arbitrary functions of document and query, and as such can easily incorporate almost any other retrieval model as just a yet another feature.
Second dimension: the properties of the model
- Models without term-interdependencies treat different terms/words as independent. This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independency assumption for term variables.
- Models with immanent term interdependencies allow a representation of interdependencies between terms. However the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.
- Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. They rely an external source for the degree of interdependency between two terms. (For example a human or sophisticated algorithms.)
In addition, each model has parameters, which influence its performance. Some of the models make a number of assumption about the data; they were developed for a very special purpose. Using such a model for a different purpose may not yield good results.