The EMILLE Lancaster Corpus – ELRA Catalogue

The EMILLE Lancaster Corpus

Name in French: Corpus EMILLE Lancaster

ISLRN: 438-045-014-925-0

ID: ELRA-W0038

The EMILLE Lancaster Corpus consists of three components: monolingual, parallel and annotated corpora.
There are monolingual corpora for seven South Asian languages: Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil, Urdu.
The EMILLE monolingual corpora contain approximately 58,880,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu).
The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu.
The annotated component includes the Urdu monolingual and parallel corpora, automatically annotated for parts of speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up in SGML compliant with the Corpus Encoding Standard (CES) and encoded in Unicode.

References: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. ‘Developing Asian language corpora: standards and practice’ in Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya.

This database is available for commercial use only. For research use by academic organisations, a more complete set of the EMILLE Lancaster Corpus is available under the reference ELRA-W0037 (The EMILLE/CIIL Corpus).

Licence: Commercial Use - ELRA VAR

              Academic    Commercial
Member        -           7500.00 €
Non-member    -           12000.00 €

Distribution

Availability start date: 15/09/2004

Contact Person: Valérie Mapelli

text

Monolingual text corpus

Languages

Sinhala; Sinhalese

Language Script: Sinhala

English

Language Script: Latin

Urdu

Language Script: Arabic

Gujarati

Language Script: Gujarati

Bengali

Language Script: Bengali

Panjabi; Punjabi

Language Script: Gurmukhi

Variety: Punjabi (Type: Dialect) (2 Gb)

Hindi

Language Script: Devanagari; Nagari

Tamil

Language Script: Tamil

Linguality

Linguality type: Monolingual

Size

no size available

Resource Creation

Funding Project

EMILLE (Enabling Minority Language Engineering) - UK EPSRC

Funding Type: Own Funds

Metadata

Created: 05/12/2005

Metadata Language: French, English (fr, en)

Version

Version: 1.0

Last Updated: 03/06/2009
