List of text corpora

From Wikipedia, the free encyclopedia

The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large, structured set of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, for example to check occurrences or to validate linguistic rules within a specific language territory. For a more comprehensive list of text corpora, see https://linguistlist.org/sp/GetWRListings.cfm?wrtypeid=1
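The occurrence checking mentioned above can be sketched in a few lines of Python. This is only a minimal illustration: the three-sentence sample corpus and the helper names (tokenize, word_frequencies, relative_frequency) are hypothetical and do not come from any corpus listed here, and real corpus tools use far more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical three-document corpus; a real corpus would be loaded from files.
CORPUS = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A cat is not a dog.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def relative_frequency(counts, word):
    """Return the occurrences of `word` divided by the total token count."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

freqs = word_frequencies(CORPUS)
```

Calling freqs.most_common(3) lists the most frequent tokens, and relative_frequency(freqs, "cat") gives a normalized figure that can be compared across corpora of different sizes.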

English language

European languages

Slavic

East Slavic

South Slavic

West Slavic

German

Middle Eastern Languages

  • Corpus Inscriptionum Semiticarum
  • Kanaanäische und Aramäische Inschriften
  • Hamshahri Corpus (Persian)
  • (Persian)[12]
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • TEP: Tehran English-Persian Parallel Corpus[13]
  • TMC: Tehran Monolingual Corpus, a standard corpus for Persian language modeling[13]
  • Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp. ISBN 964-8699-32-1
  • (Kurdish-corpus Sorani dialect) University of Kurdistan, Department of English Language and Linguistics
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
  • Neo-Assyrian Text Corpus Project
  • Quranic Arabic Corpus (Classical Arabic)
  • Electronic Text Corpus of Sumerian Literature
  • Open Richly Annotated Cuneiform Corpus
  • AsoSoft text corpus[14] (Central Kurdish, Sorani)

Devanagari

East Asian Languages

South Asian Languages

Parallel corpora of diverse languages

  • Europarl Corpus - proceedings of the European Parliament from 1996–201
  • EUR-Lex corpus - collection of texts in all official languages of the European Union, created from the EUR-Lex database[17]
  • OPUS: open source parallel corpus covering many languages[18]
  • Tatoeba: a parallel corpus containing over 8.9 million sentences in multiple languages; 107 languages have more than 1,000 sentences each, and a further 81 languages have between 100 and 1,000 sentences each.[19]
  • NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, mcn, vie)[20] (legacy repo)
  • SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[21]
  • GRALIS parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)
  • The ACTRES Parallel Corpus (P-ACTRES 2.0) is a bidirectional English-Spanish corpus consisting of original texts in one language and their translation into the other. P-ACTRES 2.0 contains over 6 million words considering both directions together.[22]
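As a rough illustration of what "parallel" and "bidirectional" mean for the corpora above, the following Python sketch stores aligned sentence pairs together with their translation direction. The AlignedPair class, the sample sentences, and the helper functions are hypothetical; they do not reflect the actual file format of P-ACTRES, OPUS, or any other corpus in this list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlignedPair:
    """One original sentence and its translation (hypothetical record layout)."""
    source_lang: str  # language code of the original text
    target_lang: str  # language code of the translation
    source: str
    target: str

# Invented sample data covering both translation directions.
PAIRS = [
    AlignedPair("eng", "spa", "Good morning.", "Buenos días."),
    AlignedPair("eng", "spa", "Thank you.", "Gracias."),
    AlignedPair("spa", "eng", "¿Dónde está la biblioteca?", "Where is the library?"),
]

def by_direction(pairs, source_lang, target_lang):
    """Select the pairs translated from source_lang into target_lang."""
    return [
        p for p in pairs
        if p.source_lang == source_lang and p.target_lang == target_lang
    ]

def word_count(pairs):
    """Count whitespace-separated words on both sides of every pair."""
    return sum(len(p.source.split()) + len(p.target.split()) for p in pairs)
```

Summing word_count over both directions mirrors how a bidirectional corpus reports its size "considering both directions together".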


Comparable Corpora

L2 (English) Corpora

  • Cambridge Learner Corpus[31]
  • Corpus of Academic Written and Spoken English (CAWSE),[32] a collection of Chinese students’ English language samples in academic settings. Freely downloadable online.  
  • English as a Lingua Franca in Academic Settings (ELFA),[33] an academic ELF corpus.[34][35]
  • International Corpus of Learner English (ICLE),[36] a corpus of learner written English.
  • Louvain International Database of Spoken English Interlanguage (LINDSEI),[37] a corpus of learner spoken English.
  • Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.[38][39]
  • University of Pittsburgh English Language Institute Corpus (PELIC)[40]
  • Vienna-Oxford International Corpus of English (VOICE),[41] an ELF corpus.[34]

References

  1. "Corpus Resource Database (CoRD)". Department of English, University of Helsinki.
  2. Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
  3. "PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  4. (in Spanish) "Molinolabs - corpus". molinolabs.com. Retrieved 12 January 2014.
  5. "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". coralit.lt. Retrieved 12 January 2014.
  6. "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". tnc.org.tr. Retrieved 12 January 2014.
  7. Glazkova, A (2020). "Topical Classification of Text Fragments Accounting for Their Nearest Context". Automation and Remote Control. 81 (12): 2262–2276. doi:10.1134/S0005117920120097.
  8. Rubtsova, Yu (2015). "Constructing a corpus for sentiment classification training". Software & Systems. 1: 72–78. doi:10.15827/0236-235X.109.072-078.
  9. "Under Update". search.dcl.bas.bg. Retrieved 12 January 2014.
  10. "Електронски корупус на македонски книжевни текстови".
  11. "Portál | Český národní korpus".
  12. Zdravkova, Katrina; Tufiş, Dan; Simov, Kiril; Radziszewski, Adam; Qasemizadeh, Behrang; Priest-Dorman, Greg; Petkevič, Vladimír; Oravecz, Csaba; Krstev, Cvetana; Kotsyba, Natalia; Kaalep, Heiki-Jaan; Ide, Nancy; Garabík, Radovan; Dimitrova, Ludmila; Derzhanski, Ivan; Barbu, Ana-Maria; Erjavec, Tomaž (2010-05-14). "Available from CLARIN". http://nl.ijs.si/me/v4/.
  13. "University of Tehran NLP Lab". ece.ut.ac.ir. Archived from the original on 28 January 2014. Retrieved 12 January 2014.
  14. Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074
  15. "KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言". kotonoha.gr.jp. Retrieved 12 January 2014.
  16. D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.
  17. "EUR-Lex Corpus". sketchengine.co.uk. Retrieved 27 October 2016.
  18. "OPUS - an open source parallel corpus". opus.lingfil.uu.se. Retrieved 12 January 2014.
  19. "Tatoeba - Number of sentences per language". tatoeba.org. Retrieved 23 November 2020.
  20. Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)" (PDF). International Journal of Asian Language Processing. 22 (4): 161–174. Archived from the original (PDF) on 16 January 2014. Retrieved 12 January 2014.
  21. Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  22. H. Sanjurjo-González and M. Izquierdo. 2019. P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications (pp. 215-231). John Benjamins Publishing.
  23. Ralf Steinberger; Bruno Pouliquen; Anna Widiger; Camelia Ignat; Tomaž Erjavec; Dan Tufiş; Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24–26 May 2006.
  24. Liling Tan, Marcos Zampieri, Nikola Ljubešic, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  25. Kilgarriff, Adam (2012). "Getting to Know Your Corpus". Text, Speech and Dialogue. Lecture Notes in Computer Science. 7499. pp. 3–15. CiteSeerX 10.1.1.452.8074. doi:10.1007/978-3-642-32790-2_1. ISBN 978-3-642-32789-6.
  26. Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTen-Ten: a new, vast corpus for Arabic. Proceedings of WACL.
  27. Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia - Social and Behavioral Sciences, 95, 12-19.
  28. Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов. In Материалы научной конференции "Интернет и современное общество" (pp. 74-77).
  29. Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  30. Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434).
  31. Wikipedia, 2019-09-27, retrieved 2020-01-07
  32. "CAWSE Corpus - The University of Nottingham Ningbo China - 宁波诺丁汉大学". nottingham.edu.cn. Retrieved 2020-01-07.
  33. "English as a Lingua Franca in Academic Settings". University of Helsinki. 2018-03-23. Retrieved 2020-01-07.
  34. Wikipedia, 2019-12-14, retrieved 2020-01-07
  35. Mauranen, A (2010). "English as an academic lingua franca: The ELFA project". English for Specific Purposes. 29 (3): 183–190. doi:10.1016/j.esp.2009.10.001.
  36. "ICLE". UCLouvain. Retrieved 2020-01-07.
  37. "LINDSEI". UCLouvain (in French). Retrieved 2020-01-07.
  38. "Trinity Lancaster Corpus | ESRC Centre for Corpus Approaches to Social Science (CASS)". Retrieved 2020-01-07.
  39. Gablasova, D (2019). "The Trinity Lancaster Corpus: Development, Description and Application". International Journal of Learner Corpus Research. 5 (2): 126–158. doi:10.1075/ijlcr.19001.gab.
  40. Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. http://doi.org/10.5281/zenodo.3991977
  41. "Project". univie.ac.at. Retrieved 2020-01-07.