1,140 language resources at your disposal
An increasing number of LRs in the various fields of Human Language Technology are distributed on behalf of ELRA via its operational body ELDA, thanks to the contribution of various players of the HLT community.
Through this repository, we aim to provide Language Resources so that researchers and developers need not invest effort in rebuilding resources that already exist, and to help them identify and access those resources.
The Arbobanko (Esperanto Treebank) is a 52,000 token dependency treebank of Esperanto with texts from the MONATO news magazine, consisting of random excerpts from the period 2000-2010. All words were annotated for lemma, part-of-speech, inflection, compounding and affixing, syntactic function, dependency links, NER types, semantic types of nouns and adjectives, ...
This database covers both regular and irregular Arabic plurals, and was developed by experts over a period of several years. The data includes various grammatical attributes such as part-of-speech, collectivity codes, gender codes, and full vocalization.
This database covers about 82,000 headwords, and includes part-of-speech codes as well as other grammatical/phonological data upon request.
Comprehensive monolingual word lists for both Simplified and Traditional Chinese, Japanese, Korean and Arabic, including a full-form Arabic word list. For Simplified and Traditional Chinese, Japanese and Korean, we provide readings as well, making them ideal for speech-related applications such as speech synthesis. The two Arabic databases include both vocalized ...
A large, comprehensive database of Korean-Chinese personal and place names, covering not only native Korean proper nouns but also Japanese, Chinese and Western proper nouns.
A resource of Arab personal names and variants in the original Arabic script. This database covers several hundred thousand Arabic script variants, along with common spelling mistakes. Every Arabic name is normalized and vocalized.
Bilingual, bidirectional database of technical terms covering fields including civil engineering, business and finance, mechanical engineering, IT/computer, and more.
This database covers general vocabulary, and includes part-of-speech codes and readings. This up-to-date dictionary is optimized for the convenience of users of electronic dictionaries and online translation tools. It has just the right amount of detail: enough equivalents to give an in-depth understanding, yet short enough not to clutter up ...
A unique resource that has been developed in cooperation with a team of native-speaker experts in Persian phonology. The data includes a confidence rank to indicate the relative likelihood that a variant will be encountered in the real world.
Comprehensive Japanese-English bilingual, bidirectional database of technical terms covering a broad spectrum of fields ranging from computer science to business and finance to biotechnology.
Comprehensive monolingual wordlist for Japanese. Readings are provided, making this database ideal for speech-related applications such as speech synthesis.
Compiled by experienced editors with in-depth knowledge of Japanese phonology and phonetics. Provides IPA phonetic transcriptions (SAMPA on request) that accurately indicate how Japanese words are pronounced in actual speech, as well as accent codes, for each entry. Includes accent information for personal names and place names, making the resource ...
SC<>TC mapping tables for Orthographic and Lexemic conversion levels together with a conversion engine. The mapping tables are comprehensive, and include approximately 700,000 items covering general vocabulary and some technical terms and proper nouns. They also include various other attributes, such as pinyin readings, grammatical information, part of speech, and ...
Covers Chinese full names of real people, including celebrities. Includes pinyin readings.
A large, comprehensive database of Korean-Japanese personal and place names, covering not only native Korean proper nouns but also Chinese, Japanese, and Western proper nouns.
A large, comprehensive database of Korean-English personal and place names, covering not only native Korean proper nouns but also Chinese, Japanese, and Western proper nouns.
Brings together six languages -- Simplified Chinese, Traditional Chinese, Japanese, Korean, English (Arabic upon request) -- in a multidirectional format. The database includes various data fields, such as readings in pinyin and zhuyin, hiragana, romanization in all major and most minor romanization systems, semantic classification codes, locale codes, and other ...
A large, comprehensive database of Chinese-Japanese personal and place names, covering not only native Chinese proper nouns but also Korean and Western proper nouns.
Comprehensive monolingual wordlist for Traditional Chinese. Zhuyin is provided, making this database ideal for speech-related applications such as speech synthesis.
Covers over 660,000 entries and includes various data fields such as hiragana and romanized readings, classification codes and locale codes, English equivalents, and more. Included are a large variety of both Japanese and non-Japanese personal and place names.
Monolingual lexical database with a rich set of grammatical attributes such as derivational attributes, suffixes and prefixes and bound morphemes.
Provides comprehensive coverage of the major Chinese romanization systems and their variants and, if needed, can be expanded considerably with dialectal variants (Cantonese, Hakka, Hokkien, etc.).
This database is important for translating 成語 chengyu (Chinese proverbs and idioms), which cannot be translated literally since they are often based on classical Chinese. For example, 臨陣磨槍, literally 'face battle sharpen spear', which means "do something at the last moment," cannot be correctly translated by MT or NMT systems ...
This database contains various morphological attributes such as derivational attributes, suffixes and prefixes, word elements (bound morphemes) and binding valency, and is designed to significantly enhance segmentation accuracy and tokenization.
This database covers non-Arabic names, their Arabic equivalents, and Arabic script variants for each name, with each variant ranked by frequency of occurrence.
A large-scale database of Japanese place names and POIs in Simplified Chinese, Japanese, Korean and English languages.
This is a comprehensive database of Chinese derivative affixes with adjacency attributes.
This resource is accurate and validated, and has been carefully proofread to ensure strict adherence to the complex rules of hamza orthography. It covers not only the linguistically correct standard MSA forms but also all common non-standard and incorrect versions, carefully flagged to distinguish between them.
Japanese company and organization names with English equivalents when available.
This resource covers four million Japanese names and their romanized variants, and includes gender codes, classification codes, and frequency rankings.
Comprehensive monolingual wordlist for Korean. Readings are provided, making this database ideal for speech-related applications such as speech synthesis.
Comprehensive monolingual wordlist for Simplified Chinese. Pinyin is provided, making this database ideal for speech-related applications such as speech synthesis.
80,000 headwords, expandable to 100,000, of general vocabulary and important proper names.
A large, comprehensive database of Chinese-English personal and place names, covering not only native Chinese proper nouns but also Japanese, Korean, and Western proper nouns.
A comprehensive monolingual lexical database of Chinese consisting of Simplified and Traditional Chinese modules, covering general vocabulary and important technical terms. Each entry is accompanied by various attributes, such as phonological, grammatical, and morphological information, as well as semantic classification codes.
Very comprehensive database of Arabic personal names and name variants mapped to the original Arabic script with a large variety of supplementary information.
Covers over 800,000 terms from over 20 science and technology domains, including computers/IT, mechanical engineering, biotechnology, chemistry, and medicine.
Chinese name components, accompanied by accurate pinyin readings, gender codes, and flags denoting whether a name is a given name, a surname, or both.
Monolingual lexical database which includes a significant number of affixes, particles, auxiliaries and conjugation patterns to account for all the inflectional and derivational morphology in Korean so as to enable recognition of inflected forms.
This is an extremely comprehensive Spanish full-form lexicon for general vocabulary in which all forms, including inflected, plural, feminine and affixed forms, are included. A bilingual version is also available for Spanish-English.
This Simplified Chinese-to-English Dictionary was compiled in collaboration with lexicographers from a leading Chinese university, and based on the world's most authoritative and comprehensive dictionaries that have been published in China. It has undergone extensive proofreading and validation by a team of native Chinese editors. Covers general vocabulary, technical terms, ...
Our Japanese Orthographical Database (JOD) plays a critical role in enhancing the accuracy of information retrieval, machine translation and morphological analysis applications as it helps identify and disambiguate the numerous Japanese orthographic variants that have identical meanings.
This is an extremely comprehensive Spanish-English lexicon for general vocabulary in which not only are all forms included (inflected, plural, feminine and affixed), but all English equivalents for each of these forms are given as well. A monolingual version is also available for Spanish.
Covers entries of general vocabulary, along with high-frequency technical terms and proper nouns. In addition to large coverage and a high level of accuracy, the database has several special features including explicit codes to indicate headword type and part-of-speech, coverage of all polyphones, and correct pinyin for the neutral tone ...
This database is not only comprehensive but also linguistically accurate. It is based on solid principles of Cantonese phonology and semantics, and takes into account the phenomena of polyphony as well as tone change, which is unpredictable and requires manual proofreading. It covers 300,000 entries, including 80,000 readings and romanized ...
A large-scale database of Chinese pinyin readings. Especially noteworthy are the differences in pronunciation between Taiwan and the PRC, for example 期待 qí dài (Taiwan) and qī dài (PRC).
A large comprehensive dictionary of Chinese-English technical terms, covering over 4 million terms from 65 domains, including chemical, computer/IT, medical, civil engineering, business/finance, and mechanical engineering.
In any text document, there are particular terms that represent specific entities that are more informative and have a unique context. These entities are known as named entities, which more specifically refer to terms that represent real-world objects like people, places, organizations, and so on. They are often denoted by ...
The English-Persian terminology database of management and economics consists of around 15,000 terms in the fields of management (including all branches) and economic sciences. It comes with software through which users can search a word, phrase or chunk in one language and receive all entries consisting of the ...
Glissando-sp includes more than 12 hours of speech in Spanish, recorded under optimal acoustic conditions, orthographically transcribed, phonetically aligned and annotated with prosodic information (location of the stressed syllables and prosodic phrasing). The corpus was recorded by 8 professional speakers and 20 non-professional speakers: 4 “news broadcaster” professional speakers (2 ...
The English-Persian terminology database of computer and IT consists of around 25,000 terms in the fields of computer engineering, computer science and information technology. It comes with software through which users can search a word, phrase or chunk in one language and receive all entries consisting of the ...
Glissando-ca includes more than 12 hours of speech in Catalan, recorded under optimal acoustic conditions, orthographically transcribed, phonetically aligned and annotated with prosodic information (location of the stressed syllables and prosodic phrasing). The corpus was recorded by 8 professional speakers and 20 non-professional speakers: 4 “news broadcaster” professional speakers (2 ...
The English-Persian database of idioms and expressions consists of about 30,000 bilingual parallel sentences and phrases in English and Persian (15,000 in each language). It comes with software through which users can search a word, phrase or chunk in one language and receive all idioms and expressions consisting ...
The Gram Vanni data set consists of 130 hours (21,000 different audio recordings) recorded by 4,000 unique Hindi speakers from the states of Bihar, Jharkhand, and Madhya Pradesh in India (20-25% female, 60% people under 30 years of age, mostly rural). The data set was collected via a voice-based community ...