NUM 5M Mongolian written corpus

492-817-146-504-9

ID: ELRA-W0120

This is a corpus of Mongolian text drawn mostly from online and printed daily newspapers, literature, and laws.

The collected raw text was reduced from 5 million to 4.8 million words after cleaning. The cleaned corpus comprises:
- 144 legal texts dating up to 2009,
- 288 literary texts currently used in primary and secondary school textbooks in Mongolia (including stories, novels, novelettes),
- 1,134 editorials from the printed newspaper "Unen" dating from 1984 to 1989,
- 2,477 online newswire texts dating from 2003 to 2009.

Part of this corpus, about 2,800 sentences with 100,000 words, has been POS-tagged manually and stored in XML TEI format.
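As a rough illustration of how POS-tagged text in a TEI-style XML file might be read, the sketch below uses Python's standard library. The actual schema of this corpus is not described here, so the layout is an assumption: sentences in `<s>` elements, tokens in `<w>` elements with a `pos` attribute, and the TEI P5 namespace. The sample sentence and the tag names (PRON, N, V) are invented for the example and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample, not corpus data. The <s>/<w pos="..."> layout and the
# tag names are hypothetical; the real corpus may encode tags differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s n="1">
      <w pos="PRON">Би</w>
      <w pos="N">ном</w>
      <w pos="V">уншина</w>
    </s>
  </p></body></text>
</TEI>"""


def read_tagged_sentences(xml_text):
    """Return a list of sentences, each a list of (word, tag) pairs."""
    root = ET.fromstring(xml_text)
    ns = {"tei": TEI_NS}
    sentences = []
    for s in root.iterfind(".//tei:s", ns):
        # Collect the direct <w> children of each sentence element.
        tokens = [(w.text or "", w.get("pos")) for w in s.iterfind("tei:w", ns)]
        sentences.append(tokens)
    return sentences


sentences = read_tagged_sentences(SAMPLE)
for sent in sentences:
    print(" ".join(f"{word}/{tag}" for word, tag in sent))
```

If the distributed files use different element or attribute names, only the two `iterfind` paths and the `get("pos")` call would need to change.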

MEMBER
- Licence: Non Commercial Use - ELRA END USER: 0.00 € (academic), 5000.00 € (commercial)
- Licence: Commercial Use - ELRA VAR: 5000.00 € (academic), 5000.00 € (commercial)

NON MEMBER
- Licence: Non Commercial Use - ELRA END USER: 0.00 € (academic), 7000.00 € (commercial)
- Licence: Commercial Use - ELRA VAR: 7000.00 € (academic), 7000.00 € (commercial)

12/07/2017