File:Zipf-asia-1 Chinese, Tibetan, Vietnamese.svg

Original file (SVG file, nominally 512 × 504 pixels, file size: 715 KB)

This is a file from the Wikimedia Commons. The description on its description page there is shown below.

Summary

Description
English: Zipf's law plot (word frequency as a function of frequency rank) for texts in three East Asian languages: Tibetan, Chinese (Mandarin), and Vietnamese. Each syllable (each character, in the case of Chinese) was counted as a separate word.

The languages, texts and the frequency files are:

  • Tibetan. Text of the Play of Mistaken Illusion by Kyabje Trijang Rinpoche (mid 1900s). Item 95306 from the Asian Classics Input Project (ACIP) collection. Sample: LA 'THAMS PAS 'DI SNANG SPRE'U'I GAR BAR MED YON POR BSGYUR BA'I RNAM G [...] LOG GIS NYAG RONG GI SA. File tibe/pmi/tot.1/gud.wfr (original 143289 words, truncated/filtered to 35027 words, N = 1963 distinct).
  • Chinese (Mandarin). The classical Chinese novel Dream of the Red Chamber or Dream of the Red Mansion (Hong2 Lou2 Meng4) by Cao2 Xue3 Qin2 and Gao E (~1750), with some errors and omissions. Chinese characters were mapped 1:1 from GB (Guo Biao) to pinyin with tone marks and disambiguating suffixes, e.g. 'zuo4', 'zuo4.1', 'zuo4.2', so as to distinguish characters with the same pinyin. Sample: ci3 kai1 juan3 di4.2 yi1 hui2 ye3 zuo4.2 zhe3 zi4 yun2 yin1 ceng2 li4.4 [...] dong1 bian1 wu1 nei4.1 guo4 lai2 dai4.1 le5 liu2.1. File chin/red/tot.1/gud.wfr (original 706889 words, truncated/filtered to 35027 words, N = 2420 distinct).
  • Vietnamese. The first five books (the Pentateuch) from the Cadman Vietnamese Bible (1934), probably translated from the English King James Bible. In the ASCII VIQR encoding, mapped to lowercase, without hyphens. Sample: ban dda^`u ddu+'c chu'a tro+`i du+.ng ne^n tro+`i dda^'t va? dda^'t la` [...] da.y la.i cho dde^? ca'c ngu+o+i la`m theo no' trong xu+' ma` ca'c. File viet/ptt/tot.1/gud.wfr (original 169480 words, truncated/filtered to 35027 words, N = 1631 distinct).
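The 1:1 character-to-pinyin mapping described for the Chinese text can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `disambiguate`, the toy lookup table `pinyin_of`, and in particular the rule that suffixes are handed out in order of first appearance. The rule actually used to build the frequency files is not stated, so this sketch would not reproduce the original suffix numbers.

```python
def disambiguate(chars, pinyin_of):
    """Map each character to a unique token: plain pinyin for the first
    character seen with a given reading, then '.1', '.2', ... for later ones.
    (Assumed rule: suffixes follow order of first appearance.)"""
    token_of = {}  # character -> token already assigned to it
    seen = {}      # pinyin reading -> number of distinct characters so far
    out = []
    for ch in chars:
        if ch not in token_of:
            py = pinyin_of[ch]
            n = seen.get(py, 0)
            token_of[ch] = py if n == 0 else f"{py}.{n}"
            seen[py] = n + 1
        out.append(token_of[ch])
    return out

# Hypothetical toy table: three distinct characters sharing the reading 'zuo4'.
pinyin_of = {"作": "zuo4", "做": "zuo4", "坐": "zuo4", "者": "zhe3"}
tokens = disambiguate("作者做坐作", pinyin_of)
print(tokens)
```

Because every distinct character receives its own token, counting tokens is equivalent to counting characters, which is what the plot requires.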
The word frequency files '*/*/*/gud.wfr' are available at the UNICAMP website. The original annotated full texts, before truncation/filtering, are in the companion files */*/org/main.src. The truncated/filtered texts -- one word per line, without punctuation -- are in */*/*/gud.tlw.
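As a rough illustration of how a rank-frequency table of the kind plotted here could be derived from a one-word-per-line text such as gud.tlw, the following sketch counts words and ranks them by decreasing frequency. The function names, the alphabetical tie-breaking rule, and the toy sample are assumptions for illustration only; the internal layout of the gud.wfr files is not described here, and this is not necessarily how they were produced.

```python
from collections import Counter

def read_words(path):
    """Read a one-word-per-line text file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def rank_frequency(words):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(words)
    # Sort by decreasing count; break ties alphabetically (an assumed rule)
    # so that the ranking is reproducible.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

# Toy sample assembled from VIQR tokens; not an excerpt of the actual text.
sample = "la` dda^'t va? dda^'t la` tro+`i dda^'t".split()
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

Plotting count against rank on logarithmic axes gives a Zipf plot; for a real file, `rank_frequency(read_words(path))` would be used in place of the toy sample, with `path` pointing at a local copy of the text.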
Date
Source Own work
Author Jorge Stolfi

Licensing

I, the copyright holder of this work, hereby publish it under the following license:
This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
You are free:
  • to share – to copy, distribute and transmit the work
  • to remix – to adapt the work
Under the following conditions:
  • attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

Captions

Zipf Law plots for three East Asian languages: Tibetan, Chinese, Vietnamese

Items portrayed in this file

9 May 2023

image/svg+xml

File history

Date/Time | Dimensions | User | Comment
current: 14:45, 15 May 2023 | 512 × 504 (715 KB) | Jorge Stolfi | Rebuilt the file with small changes in dataset, colors
