The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayala...
The EMILLE Lancaster Corpus consists of three components: monolingual, parallel and annotated corpora.
There are monolingual corpora for seven South Asian languages: Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil, Urdu.
The EMILLE monolingual corpora contain approximately 58,880,000 words (i...