Collation is the assembly of written information into a standard order. This is commonly called alphabetization, though collation is not limited to ordering according to letters of the alphabet. Collating lists of words or names into alphabetical order is the basis of most office filing systems, library catalogs and reference books. Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the ordering of those categories.
Advantages of sorted lists include:
- one can easily find the first n elements (e.g. the five smallest countries) and the last n elements (e.g. the three largest countries)
- one can easily find the elements in a given range (e.g. countries with an area between .. and .. square km)
- one can easily search for an element, and conclude whether it is in the list, e.g. with the binary search algorithm or interpolation search either automatically or manually.
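The three advantages above can be sketched in a few lines of Python. This is only an illustration: the `areas` values are made-up numbers, not real country areas, and the helper names `in_range` and `contains` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Made-up areas in square km, already sorted in ascending order.
areas = [2, 61, 160, 316, 468, 2586, 9251, 30528]

# First n and last n elements are simple slices of a sorted list.
smallest_three = areas[:3]
largest_two = areas[-2:]


def in_range(sorted_values, low, high):
    """Return every value v with low <= v <= high, using binary search."""
    start = bisect_left(sorted_values, low)
    end = bisect_right(sorted_values, high)
    return sorted_values[start:end]


def contains(sorted_values, target):
    """Binary search: report whether target occurs in the sorted list."""
    i = bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target
```

With this data, `in_range(areas, 100, 500)` returns `[160, 316, 468]`, and `contains(areas, 316)` is true while `contains(areas, 317)` is false.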
No adequate information system is possible without precise formulas and rules of collation. Today, information systems often have to deal with the writing systems of many different languages, raising new challenges in collation.
Numerical sorting, sorting of single characters
One collation system is numerical sorting. For example, the list of numbers 4 • 17 • 3 • -5 collates to -5 • 3 • 4 • 17.
While this might appear to work only for numbers, computers can use this method for any textual information since computers internally use character sets which assign a numeric code point to each letter or glyph. For example, a computer using ASCII code (or any of its supersets such as Unicode) and numerical sorting would collate the list of characters a • b • C • d • $ to $ • C • a • b • d.
The numerical values that ASCII uses are $ = 36, a = 97, b = 98, C = 67, and d = 100, resulting in what is called "ASCIIbetical order."
This style of collation is commonly used, often with the refinement of converting uppercase letters to lowercase before comparing ASCII values, since most people do not expect capitalized words to jump to the head of the list.
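As a small sketch of the behavior described above, Python's built-in `sorted` compares numbers by value and strings by code point, so it reproduces both examples. The variable names are illustrative only.

```python
# Numerical sorting of the example list.
numbers = [4, 17, 3, -5]
sorted_numbers = sorted(numbers)            # [-5, 3, 4, 17]

# Sorting single characters by their code points ("ASCIIbetical order").
chars = ["a", "b", "C", "d", "$"]
code_points = {c: ord(c) for c in chars}    # {'a': 97, 'b': 98, 'C': 67, 'd': 100, '$': 36}
by_code_point = sorted(chars)               # ['$', 'C', 'a', 'b', 'd']

# The common refinement: compare lowercase forms so that "C" no longer
# jumps ahead of the lowercase letters.
case_insensitive = sorted(chars, key=str.lower)   # ['$', 'a', 'b', 'C', 'd']
```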
A collation system for multiple-character words is alphabetical order, based on the conventional order of letters in an alphabet or abjad (most of which have a single conventional order). Each nth letter is compared with the nth letter of other words in the list, starting at the first letter of each word and advancing to the second, third, fourth, and so on, until the order is established.
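The order of the basic Latin alphabet is: A • B • C • D • E • F • G • H • I • J • K • L • M • N • O • P • Q • R • S • T • U • V • W • X • Y • Z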
The principle behind extending alphabetical order to words (lexicographical order) is that all words in a list beginning with the same letter should be grouped together; within a grouping starting with a single letter, all words beginning with the same two letters shall be grouped together; and so on, maximizing the number of common letters between adjacent words. The ordering principle is applied at the point where the letters differ. For instance, in the sequence:
Astrolabe • Astronomy • Astrophysics
The order of the words is determined by the first letter at which they differ, here the sixth (l, n, and p). Since n follows l in the alphabet, but precedes p, Astronomy comes after Astrolabe, but before Astrophysics.
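The letter-by-letter comparison can be written out explicitly. The following is a minimal sketch, assuming case is ignored and that a word which is a prefix of another sorts first; the function names `first_difference` and `precedes` are hypothetical.

```python
def first_difference(a, b):
    """Return the index of the first position at which a and b differ,
    or None if one word is a prefix of the other (or they are equal)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


def precedes(a, b):
    """Return True if word a sorts strictly before word b."""
    a, b = a.lower(), b.lower()
    i = first_difference(a, b)
    if i is None:
        # No differing letter: the shorter word comes first.
        return len(a) < len(b)
    return a[i] < b[i]
```

For "Astrolabe" and "Astronomy", `first_difference` returns 5 (the sixth letter, counting from zero), and `precedes("Astrolabe", "Astronomy")` is true because l comes before n.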
There has historically been some variation in the application of these rules. For instance, the prefixes Mc and M' in Irish and Scottish surnames were taken to be abbreviations for Mac, and alphabetized as if they were spelled out as Mac in full. Thus one might find in a catalog McKinley preceding Mackintosh, as if it had been spelled "MacKinley." Since the advent of computer-sorted lists, this type of alphabetization has fallen out of favor.
A variation in alphabetical principles applies to names consisting of two words. In some cases, names with identical first words are all alphabetized together under the first word, e.g., grouping together all names beginning with San, all those beginning with Santa, and those beginning with Santo:
San • San Cristobal • San Juan • San Teodoro • San Tomas • Santa Barbara • Santa Clara • Santa Cruz • Santo Domingo
But in another system, the names are alphabetized as if they had no spaces, e.g. as follows:
San • San Cristobal • San Juan • Santa Barbara • Santa Clara • Santa Cruz • San Teodoro • Santo Domingo • San Tomas
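The two systems for two-word names differ only in the sort key. The sketch below assumes case-insensitive comparison; the names `word_by_word` and `letter_by_letter` are labels chosen for this example.

```python
names = [
    "Santo Domingo", "San Tomas", "Santa Cruz", "San", "Santa Barbara",
    "San Juan", "Santa Clara", "San Teodoro", "San Cristobal",
]

# System 1: compare word by word, so all "San ..." names stay together.
word_by_word = sorted(names, key=lambda s: s.lower().split())

# System 2: compare as if the names had no spaces.
letter_by_letter = sorted(names, key=lambda s: s.lower().replace(" ", ""))
```

Under the first key "San Tomas" precedes "Santa Barbara"; under the second it falls to the end of the list, after "Santo Domingo".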
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may incorrectly order ñ after z, and it treats ch as c + h, which is also incorrect under the pre-1994 alphabetization.
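The ñ problem can be reproduced and worked around with a custom alphabet table. This is a minimal sketch, assuming lowercase or uppercase words made only of the letters in the table (no accented vowels) and the post-1994 treatment of ch and ll as ordinary letter pairs; `SPANISH`, `RANK`, and `spanish_key` are hypothetical names.

```python
# Alphabet with ñ (U+00F1) placed directly after n.
SPANISH = "abcdefghijklmn\u00f1opqrstuvwxyz"
RANK = {letter: i for i, letter in enumerate(SPANISH)}


def spanish_key(word):
    """Map each letter to its position in the Spanish alphabet.
    Raises KeyError for characters outside the table."""
    return [RANK[ch] for ch in word.lower()]


words = ["zorro", "\u00f1u", "nuez", "nube"]

# Code-point order puts ñ (241) after z (122).
by_code_point = sorted(words)                  # ['nube', 'nuez', 'zorro', 'ñu']

# Alphabet-table order puts ñ between n and o.
by_alphabet = sorted(words, key=spanish_key)   # ['nube', 'nuez', 'ñu', 'zorro']
```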
Similar differences between computer numeric sorting and alphabetic sorting occur in Danish and Norwegian (aa is ordered at the end of the alphabet when it is pronounced like å, and at the start of the alphabet when it is pronounced like a), German (ß is ordered as s + s; ä, ö, ü are ordered as a + e, o + e, u + e in phone books, but as a, o, u elsewhere, and after a, o, u in Austria), Icelandic (ð follows d), Dutch (ij is sometimes ordered as y), English (æ is ordered as a + e), and many other languages.
Usually the spaces or hyphens between words are ignored.
Languages that use a syllabary or abugida instead of an alphabet (for example, Cherokee) can use approximately the same system if there is a set ordering for the symbols.
Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as Chinese hanzi and Japanese kanji, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character for "mother" (媽) is sorted as a thirteen-stroke character under the three-stroke primary radical (女).
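Radical-and-stroke ordering amounts to sorting on a pair: the radical first, then the stroke count. The sketch below uses a tiny hand-entered table for four characters, with Kangxi radical numbers standing in for the order of the radicals; a real system would take this data from a dictionary or character database. The names `RADICAL_STROKE` and `radical_stroke_key` are hypothetical.

```python
# Hand-entered for illustration only:
# character -> (Kangxi radical number, total stroke count)
RADICAL_STROKE = {
    "媽": (38, 13),  # radical 女, as in the example above
    "好": (38, 6),   # radical 女
    "他": (9, 5),    # radical 人
    "林": (75, 8),   # radical 木
}


def radical_stroke_key(ch):
    """Sort by primary radical first, then by number of strokes."""
    return RADICAL_STROKE[ch]


chars = ["媽", "林", "好", "他"]
ordered = sorted(chars, key=radical_stroke_key)   # ['他', '好', '媽', '林']
```

The two characters filed under 女 end up adjacent, with the six-stroke 好 ahead of the thirteen-stroke 媽.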
The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph constitute separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京), the Japanese name of Tokyo, can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.
Nevertheless, the radical-and-stroke system is the only practical method for constructing dictionaries that someone may use to look up a logograph whose pronunciation is unknown.
In addition, in Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.
When lists of names or words need to be ordered, but the context does not define a particular single language or alphabet, the Unicode Collation Algorithm provides a way to put them in sequence.
Conventions in typography and in sorting systems
In typography and in the writing of scientific articles, items such as headers, sections, lists, and pages may be labeled with letters instead of numerals. However, this does not always mean that the full alphabet of a particular language is used; alphabetical enumeration often uses only a subset. For example, the Russian alphabet has 33 letters, but typically only 28 are used in typographical enumeration (Ukrainian, Belarusian, and Bulgarian Cyrillic enumeration shows similar features). Two Russian letters, Ъ and Ь, are used only to modify the preceding consonant, so they naturally fall out. Three others could be used, but mostly are not: Ы never begins a Russian word; Й almost never begins a word either, is perhaps too similar to И, and is a relatively new character; Ё is also relatively new and much debated, and in proper alphabetical sorting words beginning with Ё are sometimes listed under Е. (These "rules" are relaxed in some contexts, such as phone catalogs, where foreign (non-Russian) names may frequently begin with Й or Ы.)
This points to a simple fact: alphabets are not only tools for writing. Letters are often kept in the alphabet of a language even though they are not used in writing it, not least because they are used in alphabetical enumeration. For instance, X, W, and Z are not used in writing Norwegian except in loanwords, yet they are kept in the Norwegian alphabet and used in alphabetical lists. Likewise, earlier versions of the Russian alphabet contained letters that had only two purposes: writing Greek words and representing the Greek counting system in its Cyrillic form.
Compound words and special characters
A complication in alphabetical sorting can arise due to disagreements over how groups of words (separated compound words, names, titles, etc.) should be ordered. One rule is to remove spaces for purposes of ordering, another is to consider a space as a character that is ordered before numbers and letters (this method is consistent with ordering by ASCII or Unicode codepoint), and a third is to order a space after numbers and letters. Given the following strings to alphabetize—"catch," "cattle," "cat food"—the first rule produces "catch" "cat food" "cattle," the second "cat food" "catch" "cattle," and the third "catch" "cattle" "cat food." The first rule is used in most (but not all) dictionaries, the second in telephone directories (so that Wilson, Jim K appears with other people named Wilson, Jim and not after Wilson, Jimbo). The third rule is rarely used.
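The three rules for spaces can each be expressed as a sort key. This is a minimal sketch assuming plain code-point comparison of the resulting keys; the function names are hypothetical, and the third rule is implemented by swapping each space for the highest Unicode code point so that it sorts after every letter and digit.

```python
def ignore_spaces(s):
    """Rule 1: remove spaces before comparing."""
    return s.replace(" ", "")


def space_first(s):
    """Rule 2: a space sorts before digits and letters.
    Code-point order already behaves this way (space is U+0020)."""
    return s


def space_last(s):
    """Rule 3: a space sorts after digits and letters."""
    return s.replace(" ", "\U0010FFFF")


phrases = ["catch", "cattle", "cat food"]

rule1 = sorted(phrases, key=ignore_spaces)   # ['catch', 'cat food', 'cattle']
rule2 = sorted(phrases, key=space_first)     # ['cat food', 'catch', 'cattle']
rule3 = sorted(phrases, key=space_last)      # ['catch', 'cattle', 'cat food']

# The telephone directory case under rule 2.
wilsons = sorted(["Wilson, Jimbo", "Wilson, Jim K", "Wilson, Jim"], key=space_first)
```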
A similar complication arises when special characters such as hyphens or apostrophes appear in words or names. Any of the same rules as above can be used in this case as well; however, the strict ASCII sorting no longer corresponds exactly to any of the rules.
The telephone directory example raises another complication. In cultures where family names are written after given names, it is usually still desired to sort by family name first. In this case, names need to be reordered to be sorted properly. For example, Juan Hernandes and Brian O'Leary should be sorted as Hernandes, Juan and O'Leary, Brian even if they are not written this way. Capturing this rule in a computer collation algorithm is difficult, and simple attempts will necessarily fail. For example, unless the algorithm has at its disposal an extensive list of family names, there is no way to decide if "Gillian Lucille van der Waal" is "van der Waal, Gillian Lucille," "Waal, Gillian Lucille van der," or even "Lucille van der Waal, Gillian."
In telephone directories in English-speaking countries, surnames beginning with Mc are sometimes sorted as if starting with Mac and placed between "Mabxxx" and "Madxxx." In Australian directories, and possibly others, surnames beginning with St are treated as though spelled Saint. Under these rules, the telephone directory order of the following names would be: Maam, McAllan, Macbeth, MacCarthy, McDonald, Macy, Mboko and Sainsbury, Saint, St Clair, Salerno.
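A sort key that applies the Mc and St conventions might look like the following. It is only a sketch of the two rules named above: the function name `directory_key` is hypothetical, the M' prefix is not handled, and a real directory would need further rules.

```python
import re


def directory_key(surname):
    """Build a comparison key that treats a leading 'Mc' as 'Mac'
    and a leading 'St' or 'St.' (followed by a space) as 'Saint'."""
    key = surname.lower()
    key = re.sub(r"^mc", "mac", key)
    key = re.sub(r"^st\.? ", "saint ", key)
    return key


m_names = ["Mboko", "Macy", "McDonald", "MacCarthy", "Macbeth", "McAllan", "Maam"]
s_names = ["Salerno", "St Clair", "Saint", "Sainsbury"]

m_sorted = sorted(m_names, key=directory_key)
s_sorted = sorted(s_names, key=directory_key)
```

The names are displayed as written; only the key used for comparison is expanded.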
Abbreviations and common words
When abbreviations are used, it is sometimes desired to expand the abbreviations for sorting. In this case, "St. Paul" comes before "Shanghai." To capture this behavior in a collation algorithm, a list of abbreviations is needed. It may be more practical in some cases to store two sets of strings, one for sorting and one for display. A similar problem arises when letters are replaced by numbers or special symbols in an irregular manner, for example 1337 for leet or the movie Se7en. In this case, proper sorting necessitates keeping two sets of strings.
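Keeping two sets of strings can be as simple as storing each entry as a pair. In this sketch the sort strings are written by hand, which is the assumption that makes irregular cases such as Se7en workable; the variable names are illustrative.

```python
# Each entry is (string used for sorting, string shown to the reader).
entries = [
    ("shanghai", "Shanghai"),
    ("saint paul", "St. Paul"),
    ("seven", "Se7en"),
    ("leet", "1337"),
]

# Tuples compare on their first element, so sorting uses the sort string.
display_order = [display for sort_key, display in sorted(entries)]
# ['1337', 'St. Paul', 'Se7en', 'Shanghai']

# Sorting the display strings directly gives a different order.
naive_order = sorted(display for _, display in entries)
# ['1337', 'Se7en', 'Shanghai', 'St. Paul']
```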
In certain contexts, very common words (such as articles) at the beginning of a sequence of words are not considered for ordering, or are moved to the end. So "The Shining" is considered "Shining" or "Shining, The" when alphabetizing and therefore is ordered before "Summer of Sam." This rule is fairly easy to capture in an algorithm, but many programs rely instead on simple lexicographic ordering. One fairly quaint exception to this rule is the flying of the flag of The Former Yugoslav Republic of Macedonia at the United Nations between those of Thailand and Timor Leste.
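The article rule can be captured with a short sort key. The sketch below assumes English titles and a fixed list of articles; `ARTICLES`, `title_key`, and `catalog_form` are hypothetical names.

```python
ARTICLES = ("the ", "a ", "an ")


def title_key(title):
    """Ignore a leading article when comparing titles."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


def catalog_form(title):
    """Move a leading article to the end, e.g. 'Shining, The'."""
    for article in ARTICLES:
        if title.lower().startswith(article):
            n = len(article)
            # title[:n - 1] is the article without its trailing space.
            return f"{title[n:]}, {title[:n - 1]}"
    return title


titles = ["Summer of Sam", "The Shining"]
plain = sorted(titles)                        # ['Summer of Sam', 'The Shining']
with_rule = sorted(titles, key=title_key)     # ['The Shining', 'Summer of Sam']
```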
Sorting of numbers
Ascending order of numbers differs from alphabetical order, e.g. 11 comes alphabetically before 2. This can be fixed with leading zeros: 02 comes alphabetically before 11. See e.g. ISO 8601.
Also -13 comes alphabetically after -12 although it is less. With negative numbers, to make ascending order correspond with alphabetical sorting, more drastic measures are needed such as adding a constant to all numbers to make them all positive.
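Both fixes can be sketched as functions that turn a number into a string whose alphabetical order matches its numerical order. The names `padded` and `offset_key` are hypothetical, and the width and offset must be chosen in advance to cover the whole range of values.

```python
def padded(n, width):
    """Leading zeros for a non-negative integer, e.g. 2 -> '02'."""
    return str(n).zfill(width)


def offset_key(n, offset, width):
    """Add a constant so every value is non-negative, then pad."""
    shifted = n + offset
    if shifted < 0:
        raise ValueError("offset too small for this value")
    return str(shifted).zfill(width)


# Plain string comparison puts "11" before "2".
unpadded = sorted(["11", "2"])                            # ['11', '2']
zero_padded = sorted(padded(n, 2) for n in [11, 2])       # ['02', '11']

# "-13" sorts after "-12" as a string, although -13 is smaller.
negatives_as_text = sorted(["-13", "-12"])                # ['-12', '-13']

values = [-12, 5, -13, 0]
by_offset = sorted(values, key=lambda n: offset_key(n, 100, 3))
# keys '088', '105', '087', '100' give [-13, -12, 0, 5]
```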
Numerical sorting of strings
Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a," even though '7' comes after '1' in Unicode. This can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly.
For example, Windows XP does this when sorting file names. Sorting decimals properly is a bit more difficult, due to the fact that different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5." There is no universal answer for how to sort such strings; any rules are application dependent.
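For embedded integers, one common approach is to split each string into runs of digits and runs of other characters and compare the digit runs as numbers. The following is a minimal sketch of that idea, not a description of how any particular operating system does it; it handles integers only, ignores case, and the name `natural_key` is hypothetical.

```python
import re


def natural_key(text):
    """Split text into alternating non-digit and digit runs.

    re.split with a capturing group always returns non-digit text at
    even indices and the captured digit runs at odd indices, so keys
    from different strings compare str with str and int with int.
    """
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


labels = ["Figure 11a", "Figure 7b", "Figure 7a"]

by_code_point = sorted(labels)               # ['Figure 11a', 'Figure 7a', 'Figure 7b']
natural = sorted(labels, key=natural_key)    # ['Figure 7a', 'Figure 7b', 'Figure 11a']
```

A string such as "Section 3.2.5" is split into the integers 3, 2, and 5 separated by periods, which may or may not be the intended reading; as noted above, that choice depends on the application.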
New World Encyclopedia writers and editors rewrote and completed the Wikipedia article in accordance with New World Encyclopedia standards. This article abides by terms of the Creative Commons CC-by-sa 3.0 License (CC-by-sa), which may be used and disseminated with proper attribution. Credit is due under the terms of this license that can reference both the New World Encyclopedia contributors and the selfless volunteer contributors of the Wikimedia Foundation.