Zipf's law
Zipf's law is an empirical law, formulated using mathematical statistics, named after the linguist George Kingsley Zipf, who first proposed it.[1][2] Zipf's law states that, in a large sample of text, the frequency of any word is inversely proportional to its rank in the frequency table. So the nth most frequent word has a frequency proportional to 1/n.
Thus the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. For example, in one sample of words in the English language, the most frequently occurring word, "the", accounts for nearly 7% of all the words (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only about 135 distinct words are needed to account for half of all the word occurrences in this sample.[3]
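The counts quoted above can be checked against the 1/n rule directly. The following is a minimal sketch, not a definitive analysis: it uses only the three counts given in the text, and the helper name `zipf_prediction` is a hypothetical one chosen for illustration. It assumes the law is anchored on the count of the top-ranked word.

```python
# Counts quoted in the text for the three most frequent words in the sample.
observed = {"the": 69_971, "of": 36_411, "and": 28_852}


def zipf_prediction(top_count, rank):
    """Expected count at `rank` if frequency is proportional to 1/rank.

    Hypothetical helper: the prediction is anchored on the rank-1 count.
    """
    return top_count / rank


top_count = observed["the"]
rows = []
# Dictionaries keep insertion order, so enumerate() yields ranks 1, 2, 3.
for rank, (word, count) in enumerate(observed.items(), start=1):
    predicted = zipf_prediction(top_count, rank)
    rows.append((rank, word, count, predicted, count / predicted))

for rank, word, count, predicted, ratio in rows:
    print(f"{rank}  {word:<4} observed={count:>6}  "
          f"predicted={predicted:>8.0f}  ratio={ratio:.2f}")
```

With these numbers, "of" comes out roughly 4% above the prediction and "and" roughly 24% above it, which illustrates that the law is an approximation rather than an exact rule.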
The same relationship occurs in many other rankings, unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, etc. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.[4]
It is not known why Zipf's law holds for most languages.[5]
Zipf's Law Media
- Zipf's law applied to War and Peace. The lower plot shows the remainder when the Zipf law is divided away; a significant pattern remains that the law does not fit.
- A plot of the frequency of each word as a function of its frequency rank for two English-language texts, Culpeper's Complete Herbal (1652) and H. G. Wells's The War of the Worlds (1898), on a log-log scale. The dotted line is the ideal law y ∝ 1/x.
- Plot of the Zipf CDF for N = 10.
- Zipf's law plot for the first 10 million words in 30 Wikipedias (as of October 2015) on a log-log scale.
- Wells's The War of the Worlds in plain text, in a book code, and in a Vigenère cipher.
- Lhasa Tibetan, Chinese and Vietnamese, all with separated syllables.
- Cervantes's Don Quixote, Part I (1605) and Part II (1615).
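Several of the plots described above use a log-log scale, on which the ideal law y ∝ 1/x appears as a straight line of slope -1. A minimal sketch of how such a slope could be estimated from a ranked list of frequencies is given here, using ordinary least squares on the logarithms. The function name `loglog_slope` is hypothetical, and the data is synthetic, not taken from any of the texts named in the captions.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be positive, contain at least two values, and be
    sorted from most to least frequent; ranks start at 1.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic data that follows the ideal law exactly: frequency = 1000 / rank.
ideal = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
print(round(slope, 3))
```

For real word counts the fitted slope would generally differ from -1, and a single slope hides the systematic deviations that the War and Peace plot is said to show.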
References
- ↑ Zipf, George K. 1935. The psychology of language. Houghton-Mifflin.
- ↑ Zipf, George K. 1949. Human behavior and the principle of least effort. Addison-Wesley.
- ↑ P. 139: "For example, in the Brown Corpus, consisting of over one million words, half of the word volume consists of repeated uses of only 135 words."
- ↑ Auerbach, F. 1913. Das Gesetz der Bevölkerungskonzentration. Petermann’s Geographische Mitteilungen 59, 74–76.
- ↑ Brillouin, Léon [1959] 2004. La science et la théorie de l'information.