NUM 5M Mongolian written corpus – ELRA Catalogue

Last view: 2025-06-30


NUM 5M Mongolian written corpus

French name: Corpus NUM 5M de textes en mongol

ISLRN: 492-817-146-504-9

ID: ELRA-W0120

This is a corpus of Mongolian text drawn mostly from online and printed daily newspapers, literature, and laws.

The collected raw text was reduced from 5 million to 4.8 million words after cleaning. The cleaned corpus comprises:
- 144 legal texts dated up to 2009,
- 288 literary texts currently used in primary and secondary school textbooks in Mongolia (including stories, novels, novelettes),
- 1,134 editorials from the printed newspaper "Unen" dating from 1984 to 1989,
- 2,477 online newswire texts dating from 2003 to 2009.

Part of this corpus, about 2,800 sentences with 100,000 words, has been POS-tagged manually and stored in XML TEI format.
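As a rough illustration of how a POS-tagged TEI XML sample might be read, the sketch below parses sentence and word elements with Python's standard library. The catalogue entry does not describe the actual schema, so the `<s>` and `<w>` elements, the `pos` attribute, the tag values, and the two-word sample are all assumptions made for illustration and may not match the distributed files.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; the corpus files may or may not declare it (assumption).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: element names, the "pos" attribute, and the tag
# values "N"/"V" are illustrative, not taken from the actual corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s>
      <w pos="N">ном</w>
      <w pos="V">уншина</w>
    </s>
  </body></text>
</TEI>"""


def tagged_sentences(xml_text):
    """Return a list of sentences, each a list of (token, pos) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        # Collect every word element directly under this sentence.
        tokens = [(w.text or "", w.get("pos")) for w in s.iterfind("tei:w", TEI_NS)]
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for sentence in tagged_sentences(SAMPLE):
        print(sentence)
```

If the real files use a different attribute for the tag (for example `type` or `ana`), only the `w.get("pos")` call would need to change.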


MEMBER                                        academic     commercial
Licence: Non Commercial Use - ELRA END USER   0.00 €       5000.00 €
Licence: Commercial Use - ELRA VAR            5000.00 €    5000.00 €

NON MEMBER                                    academic     commercial
Licence: Non Commercial Use - ELRA END USER   0.00 €       7000.00 €
Licence: Commercial Use - ELRA VAR            7000.00 €    7000.00 €

Distribution

Availability start date: 12/07/2017

Contact Person: Valérie Mapelli

text

Monolingual text corpus

Languages

Mongolian

Linguality

Linguality type: Monolingual

Size

no size available

Annotation

Other

Standard practices conformance: TEI

Metadata

Created: 05/12/2005

Metadata Language: French, English (fr, en)

Version

Version: 1.0

Last Updated: 08/17/2017
