Controlled vocabulary

From New World Encyclopedia

A controlled vocabulary is a set of preselected terms from which a cataloger or indexer selects when assigning subject headings or descriptors to a work in a library catalog or a bibliographic database.


Controlled vocabulary schemes mandate the use of predefined, authorized terms that have been preselected by the designer of the controlled vocabulary, as opposed to natural language vocabularies, where there is no restriction on the terms that can be used. Descriptors and Library of Congress Subject Headings are examples of controlled vocabularies.

Definition and purpose

Definition

In library and information science, a controlled vocabulary is a carefully selected list of words and phrases that are used to tag units of information (documents or works) so that they may be more easily retrieved by a search.[1][2]

In Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabulary, NISO (the U.S. National Information Standards Organization) explains the purposes of vocabulary control:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.[3]

Purpose

The purpose of vocabulary control is to achieve consistency of bibliographic record management and increase efficiency in information retrieval. NISO lists five purposes:

The purpose of controlled vocabularies is to provide a means for organizing information. Through the process of assigning terms selected from controlled vocabularies to describe documents and other types of content objects, the materials are organized according to the various elements that have been chosen to describe them. Controlled vocabularies serve five purposes:

  1. Translation: Provide a means for converting the natural language of authors, indexers, and users into a vocabulary that can be used for indexing and retrieval.
  2. Consistency: Promote uniformity in term format and in the assignment of terms.
  3. Indication of relationships: Indicate semantic relationships among terms.
  4. Label and browse: Provide consistent and clear hierarchies in a navigation system to help users locate desired content objects.
  5. Retrieval: Serve as a searching aid in locating content objects.[4]

Controlled vocabularies solve the problems of homographs, synonyms and polysemes by ensuring that each concept is described using only one authorized term and that each authorized term in the controlled vocabulary describes only one concept. In short, controlled vocabularies ensure consistency by reducing the ambiguity inherent in natural human languages, where the same concept can be given different names.

For example, in the Library of Congress Subject Headings (a subject heading system that uses controlled vocabulary), authorised terms (subject headings in this case) have to be chosen to handle choices between variant spellings of the same concept (American versus British), choices between scientific and popular terms (Cockroaches versus Periplaneta americana), and choices between synonyms (automobiles versus cars), among other difficult issues.

Choices of authorised terms are based on the principles of user warrant (what terms users are likely to use), literary warrant (what terms are generally used in the literature and documents), and structural warrant (terms chosen by considering the structure and scope of the controlled vocabulary).

Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term "pool" has to be qualified to refer to either a swimming pool or the game of pool, to ensure that each authorised term or heading refers to only one concept.
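The one-term-per-concept rule and the use of qualifiers can be sketched as a simple lookup. This is only an illustration: the table contents, the qualified headings, and the function name `authorize` are hypothetical and are not taken from any actual subject heading list.

```python
# Hypothetical entry vocabulary: every variant maps to one authorized term.
ENTRY_TERMS = {
    "automobile": "Automobiles",
    "automobiles": "Automobiles",
    "car": "Automobiles",
    "cars": "Automobiles",
    "colour": "Color",
    "color": "Color",
}

# Homographs have several senses; each sense gets its own qualified heading.
HOMOGRAPHS = {
    "pool": ["Pool (Game)", "Pool (Swimming)"],
}


def authorize(term):
    """Return the authorized headings that a searcher's term may map to."""
    key = term.strip().lower()
    if key in HOMOGRAPHS:
        # The caller must choose one of the qualified senses.
        return list(HOMOGRAPHS[key])
    if key in ENTRY_TERMS:
        return [ENTRY_TERMS[key]]
    return []  # the term is not part of this vocabulary


print(authorize("Cars"))
print(authorize("pool"))
```

A synonym resolves to exactly one heading, while a homograph returns several qualified candidates so that no single heading stands for two concepts.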

Subject headings and thesauri

There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, there are still some minor differences.

Historically, subject headings were designed for catalogers to describe books in library catalogs, while thesauri were used by indexers to apply index terms to documents and articles. Subject headings tend to be broader in scope, describing whole books, while thesauri tend to be more specialised, covering very specific disciplines. Because of the card catalog system, subject headings also tend to have terms that are in indirect order (though with the rise of automated systems this practice is being phased out), while thesaurus terms are always in direct order. Subject headings also tend to use more pre-co-ordination of terms, in which the designer of the controlled vocabulary combines various concepts to form one authorised subject heading (e.g., "children and terrorism"), while thesauri tend to use single direct terms. Lastly, thesauri list not only equivalent terms but also narrower, broader, and related terms among the various authorised and non-authorised terms, while historically most subject headings did not.

For example, the Library of Congress Subject Headings did not have much syndetic structure until 1943, and it was not until 1985 that it began to adopt the thesaurus-type terms "Broader term" and "Narrower term".

The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well-known subject heading systems include the Library of Congress Subject Headings, MeSH, and Sears. Well-known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.

Choosing the authorized terms to be used is a tricky business. Besides the areas already considered above, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter-consistency, and the stability of the language. Lastly, the amount of pre-co-ordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-co-ordination in the system is another important issue.
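The syndetic structure of a thesaurus (equivalent, broader, narrower, and related terms) can be modelled as a small linked data structure. The sketch below is an assumption-laden illustration: the `Term` class, the sample terms, and the `ancestors` helper are all hypothetical and do not reproduce any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    broader: list = field(default_factory=list)   # "Broader term" links
    narrower: list = field(default_factory=list)  # "Narrower term" links
    related: list = field(default_factory=list)   # related-term links
    used_for: list = field(default_factory=list)  # non-authorised equivalents


# Hypothetical sample entries keyed by authorised label.
THESAURUS = {
    "Vehicles": Term("Vehicles", narrower=["Automobiles"]),
    "Automobiles": Term(
        "Automobiles",
        broader=["Vehicles"],
        narrower=["Sports cars"],
        related=["Roads"],
        used_for=["Cars"],
    ),
    "Sports cars": Term("Sports cars", broader=["Automobiles"]),
    "Roads": Term("Roads", related=["Automobiles"]),
}


def ancestors(label):
    """Follow broader-term links upward and list each ancestor once."""
    seen = []
    stack = list(THESAURUS[label].broader)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(THESAURUS[current].broader)
    return seen


print(ancestors("Sports cars"))
```

Walking the broader-term links lets a search system widen a query from a specific term to its more general parents, which is one practical use of the hierarchy.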

Controlled vocabularies tagged to documents are metadata.

Subject indexing is the act of describing a document by index terms to indicate what the document is about or to summarize its content. The index terms are often selected from some form of controlled vocabulary.[5] Subject indexing is used in information retrieval, especially to create bibliographic databases for retrieving documents on a particular subject. Examples of academic indexing services are Zentralblatt MATH, Chemical Abstracts and PubMed. The index terms were mostly assigned by experts, but author keywords are also common.

With the ability to conduct a full text search widely available, many people have come to rely on their own expertise in conducting information searches, and full text search has become very popular. Subject indexing and its experts, professional indexers and librarians, remain crucial to information organization and retrieval. Indexers and librarians understand controlled vocabularies and are able to find information that cannot be located by full text search. The cost of expert analysis to create subject indexing is not easily compared to the cost of hardware, software and labor to manufacture a comparable set of full-text, fully searchable materials. With new web applications that allow every user to annotate documents, social tagging has gained popularity, especially on the Web.

Types of indexing language

There are three main types of indexing languages.

  • Controlled indexing language - Only approved terms can be used by the indexer to describe the document
  • Natural language indexing language - Any term from the document in question can be used to describe the document.
  • Free indexing language - Any term (not only from the document) can be used to describe the document.
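The difference between the three types can be sketched as a filter over the terms an indexer proposes. This is a simplified illustration under stated assumptions: the function name `allowed_terms`, the sample vocabulary, and the sample text are hypothetical, and natural language matching is reduced to single words split on whitespace.

```python
def allowed_terms(mode, document_text, candidate_terms, vocabulary):
    """Keep only the candidate terms that the indexing language permits."""
    words = set(document_text.lower().split())
    if mode == "controlled":
        # Only approved terms may be used.
        return [t for t in candidate_terms if t in vocabulary]
    if mode == "natural":
        # Only terms that occur in the document itself may be used.
        return [t for t in candidate_terms if t.lower() in words]
    if mode == "free":
        # Any term may be used.
        return list(candidate_terms)
    raise ValueError(f"unknown mode: {mode}")


text = "the striker scored twice in the soccer final"
candidates = ["soccer", "Association football", "sport"]
vocabulary = {"Association football"}

for mode in ("controlled", "natural", "free"):
    print(mode, allowed_terms(mode, text, candidates, vocabulary))
```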

When indexing a document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example, with low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general, the higher the indexing exhaustivity, the more terms are indexed for each document.

In recent years free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is indexed). Many studies have been done to compare the efficiency and effectiveness of free text searches against documents that have been indexed by experts using a few well chosen controlled vocabulary descriptors.

Controlled vocabularies are often claimed to improve on the accuracy of free text searching, for example by reducing the number of irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football for example. Football is the name given to a number of different team sports. Worldwide the most popular of these team sports is Association football, which also happens to be called soccer in several countries. The English language word football is also applied to Rugby football (Rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for football therefore will retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.
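Indexing exhaustivity can be sketched as a cap on the number of terms kept per document. The example below is a rough illustration only: it assumes that word frequency stands in for the importance an indexer would judge by hand, and the function name `index_terms` and the sample text are hypothetical.

```python
from collections import Counter


def index_terms(text, exhaustivity=None):
    """Return index terms for a text, most frequent first.

    exhaustivity=None models maximum exhaustivity (every distinct word is
    indexed, as in free text search); a small integer keeps only the top terms.
    """
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked if exhaustivity is None else ranked[:exhaustivity]


text = "cricket rules cricket grounds cricket history football"
print(index_terms(text, exhaustivity=1))
print(index_terms(text))
```

With an exhaustivity of one, only the main subject survives and the passing mention of football is not indexed; with maximum exhaustivity every distinct word becomes a term.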

Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic).

In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct authorised term is searched, you don't need to worry about searching for other terms that might be synonyms of that term.

However, a controlled vocabulary search may also lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question.
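Precision and recall can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses made-up document identifiers purely to show the arithmetic; the numbers are illustrative and are not measurements from any study.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Share of relevant documents that were actually retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical result lists for one search question.
relevant = {"d1", "d2", "d3", "d4"}
free_text_hits = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
controlled_hits = {"d1", "d2", "d3"}

print(precision(free_text_hits, relevant), recall(free_text_hits, relevant))
print(precision(controlled_hits, relevant), recall(controlled_hits, relevant))
```

In this invented case the free text search finds everything relevant but half of its list is noise, while the controlled search returns only relevant items but misses one of them.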

This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer might have decided to tag the document using a different term, one that the searcher would nevertheless consider equivalent. Essentially, this can be avoided only by an experienced user of the controlled vocabulary whose understanding of the vocabulary coincides with the way it is used by the indexer.

Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity is low. For example an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless.

On the other hand, free text searches have high exhaustivity (you search on every word), so they have the potential for high recall (assuming you solve the problem of synonyms by entering every combination) but will have much lower precision.

Controlled vocabularies also become outdated quickly, and in fast-developing fields of knowledge the appropriate authorised terms might not be available if the vocabulary is not updated regularly. Even in the best case scenario, controlled language is often not as specific as using the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while a free text search is in no danger of doing so, because it uses the author's own words.

The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.
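The trade-off between the two approaches can be seen on a toy corpus. Everything in this sketch is invented for illustration: the documents, the indexer-assigned tags, and the judgment of which documents are relevant to a searcher interested in American football.

```python
# Hypothetical corpus: each document has free text and indexer-assigned tags.
DOCS = {
    "d1": {"text": "american football playoffs preview",
           "tags": {"American football"}},
    "d2": {"text": "gaelic football county final report",
           "tags": {"Gaelic football"}},
    "d3": {"text": "stadium funding debate with the american football team as a side issue",
           "tags": {"Public finance"}},
    "d4": {"text": "gridiron tactics explained",
           "tags": {"American football"}},
}

# Documents this searcher would consider relevant (an assumption).
RELEVANT = {"d1", "d3", "d4"}


def free_text_search(word):
    """Retrieve every document whose text contains the word."""
    word = word.lower()
    return {doc_id for doc_id, rec in DOCS.items() if word in rec["text"].split()}


def controlled_search(heading):
    """Retrieve every document tagged with the authorised heading."""
    return {doc_id for doc_id, rec in DOCS.items() if heading in rec["tags"]}


for name, hits in (("free text", free_text_search("football")),
                   ("controlled", controlled_search("American football"))):
    print(name, "retrieved:", sorted(hits),
          "false positives:", sorted(hits - RELEVANT),
          "missed:", sorted(RELEVANT - hits))
```

The free text search picks up the unrelated Gaelic football report and misses the article that says "gridiron" instead of "football", while the controlled search returns no irrelevant items but misses the article in which the sport was too minor to be tagged.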

Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.
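Faceted classification can be sketched as records that carry one value per facet, so that the same record can be reached through any combination of facets. The facet names, the sample records, and the helper `filter_by_facets` below are hypothetical and chosen only for illustration.

```python
# Hypothetical records, each described along three independent facets.
RECORDS = [
    {"title": "Oak dining table", "material": "Wood",
     "period": "Victorian", "object_type": "Table"},
    {"title": "Steel office desk", "material": "Metal",
     "period": "Modern", "object_type": "Desk"},
    {"title": "Mahogany writing desk", "material": "Wood",
     "period": "Victorian", "object_type": "Desk"},
]


def filter_by_facets(records, **facets):
    """Return the titles of records matching every requested facet value."""
    return [r["title"] for r in records
            if all(r.get(name) == value for name, value in facets.items())]


print(filter_by_facets(RECORDS, material="Wood"))
print(filter_by_facets(RECORDS, period="Victorian", object_type="Desk"))
```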

Applications

Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed, based on dial-up X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services, and some may be accessible without charge at a public library.

In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.

Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative.
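As a rough illustration of machine-readable page metadata, the sketch below renders a few Dublin Core elements as HTML meta tags, using the widely seen `DC.` name prefix. The helper `dublin_core_meta` and the sample values are hypothetical, and the output is a simplified example rather than a complete or validated encoding.

```python
from html import escape


def dublin_core_meta(fields):
    """Render a dict of Dublin Core element names and values as meta tags."""
    # Declares the namespace that the "DC." prefix refers to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, values in fields.items():
        for value in values:
            content = escape(value, quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Hypothetical page description; the subject values would ideally come
# from a controlled vocabulary rather than free keywords.
page = {
    "title": ["Match report"],
    "subject": ["Association football", "Sports journalism"],
}
print(dublin_core_meta(page))
```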

It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web.[6] To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles.[7]

See also

  • Authority control
  • Controlled natural language
  • Faceted classification
  • Full text search
  • Information retrieval
  • Metadata
    • Metadata registry
  • Ontology (computer science)
  • Semantic spectrum
  • Terminology
    • Technical terminology
  • Text retrieval
  • Thesaurus
  • Vocabulary-based transformation

Notes

  1. Amy J. Warner, Ph.D. A Taxonomy Primer. Retrieved April 25, 2008.
  2. Fred Leise, Karl Fast, and Mike Steckel. What Is A Controlled Vocabulary?, December 16, 2002. Retrieved April 25, 2008.
  3. ANSI/NISO Z39.19-2005 p. 1. Retrieved April 28, 2008.
  4. Ibid. pp. 9-10
  5. F. W. Lancaster (2003). Indexing and Abstracting in Theory and Practice. Third edition. London: Facet. ISBN 1-85604-482-3. p. 6.
  6. Cory Doctorow. Metacrap: Putting the torch to seven straw-men of the meta-utopia. Retrieved April 25, 2008.
  7. Mark Pilgrim. This is XFML. Tuesday, December 3, 2002. Retrieved April 25, 2008.

References
  • F. W. Lancaster (2003). Indexing and Abstracting in Theory and Practice. Third edition. London: Facet. ISBN 1-85604-482-3.
  • Voss, Jakob (2007). "Tagging, Folksonomy & Co - Renaissance of Manual Indexing?". Proceedings of the International Symposium of Information Science: 234–254.

Credits

New World Encyclopedia writers and editors rewrote and completed the Wikipedia article in accordance with New World Encyclopedia standards. This article abides by terms of the Creative Commons CC-by-sa 3.0 License (CC-by-sa), which may be used and disseminated with proper attribution. Credit is due under the terms of this license that can reference both the New World Encyclopedia contributors and the selfless volunteer contributors of the Wikimedia Foundation.