ELRA CATALOGUE

1,612 language resources at your disposal

An increasing number of language resources (LRs) in the various fields of Human Language Technology (HLT) are distributed on behalf of ELRA via its operational body ELDA, thanks to the contributions of various players in the HLT community.
Through this repository, we aim to help researchers and developers identify and access existing language resources, so that they do not have to invest effort in rebuilding resources that already exist.

Latest Resources

EthioSpeech
EthioSpeech Corpora comprises over 391 hours of recorded read speech in six Ethiopian languages, with ca. 200 speakers per language: Amharic (68 hours), Tigrigna (62 hours), Oromo (70 hours), Somali (56 hours), Afar (68 hours), and Sidama (68 hours). The dominant domain is media (mainly newspapers), but ...
Comprehensive Arabic Phonetic Database
The Comprehensive Arabic Phonetic Database is a robust and detailed linguistic resource offering both phonemic and phonetic transcriptions, precisely reflecting how Modern Standard Arabic words are realized in actual speech. This database is ideally suited for speech technology applications. This is a highly comprehensive and accurate Arabic phonetic/phonemic database, covering ...
Portuguese Speech Recognition Corpus (Desktop+Mobile)
This corpus was recorded in a quiet office environment over 2 channels and collected from a total of 200 speakers, including 102 males and 98 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples ...
American English Speech Recognition Corpus (Desktop)
This corpus was recorded in both quiet and noisy environments over 2 channels and collected from a total of 50 speakers, including 24 males and 26 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as text messages ...
Thai Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office/home environment over 4 channels and collected from a total of 205 speakers, including 101 males and 104 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Argentina Spanish Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office environment over 3 channels and collected from a total of 300 speakers, including 132 males and 168 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, daily dialogues and ...
French English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 225 speakers, including 107 males and 118 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Telugu Speech Recognition corpus (Mobile)
This corpus was recorded in a quiet office/home environment and collected from a total of 130 speakers, including 67 males and 63 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, daily dialogues and tweets. Speech ...
Italian Speech Recognition Corpus (Desktop+Mobile)
This corpus was recorded in a quiet office/home environment over 2 channels and collected from a total of 201 speakers, including 101 males and 100 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples ...
UAE Arabic Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 2 channels and collected from a total of 168 speakers, including 94 males and 74 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily dialogues. ...
British English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 302 speakers, including 149 males and 153 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts come from news and tweets. Speech samples ...
Mexican Spanish Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office environment over 3 channels and collected from a total of 826 speakers, including 408 males and 418 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples ...
Japanese Speech Recognition Corpus (Telephone)
This corpus was recorded in a quiet office/home environment and collected from a total of 201 speakers, including 96 males and 105 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily dialogues. Speech samples ...
Portugal English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 201 speakers, including 90 males and 111 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
German Speech Recognition Corpus (Desktop+Mobile)
This corpus was recorded in a quiet office/home environment over 2 channels and collected from a total of 203 speakers, including 110 males and 93 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples ...
Hindi Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 196 speakers, including 95 males and 101 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Urdu Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 203 speakers, including 109 males and 194 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Indonesian Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 200 speakers, including 97 males and 103 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Chilean Spanish Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 300 speakers, including 138 males and 162 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, daily dialogues and ...
Argentina Spanish Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 200 speakers, including 81 males and 119 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples ...
Chilean Spanish Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 200 speakers, including 101 males and 99 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples ...
Hong Kong English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 200 speakers, including 99 males and 101 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, forums, text messages ...
Spain English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 200 speakers, including 99 males and 101 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts come from news, daily dialogues and tweets. ...
Hindi Speech Recognition Corpus (Mobile)
This corpus was recorded in both quiet and noisy environments over 3 channels and collected from a total of 180 speakers, including 99 males and 81 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech ...
Italian English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 213 speakers, including 103 males and 110 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Korean Speech Recognition corpus (Mobile)
This corpus was recorded in a quiet office/home environment and collected from a total of 500 speakers, including 246 males and 254 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples are stored as ...
German English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 196 speakers, including 88 males and 108 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts come from news and tweets. Speech samples ...
Australian English Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office/home environment over 4 channels and collected from a total of 198 speakers, including 85 males and 113 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Malaysian Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment and collected from a total of 131 speakers, including 65 males and 66 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily dialogues. Speech samples ...
ÌròyìnSpeech
A modern, high-fidelity, multi-speaker Yorùbá read speech corpus suitable for Speech Synthesis, Automatic Speech Recognition and Computational Linguistics research. The subject matter is drawn from the Broadcast News domain as well as fictional texts, delivering a multi-purpose, contemporary speech dataset. This corpus consists of 34,000 read sentences, 42 hours of ...
Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS)
Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS) contains 67 recorded sessions of interactions between two native Slovak speakers. In 37 sessions an autistic child interacts with a neurotypical adult experimenter, and in 30 control sessions a neurotypical child interacts with the same neurotypical adult experimenter. The children were 6-12 ...
DiaLEX – Emirati (DiaLEX-UA)
The Emirati Arabic Full-Form Lexicon (DiaLEX-UA) is a comprehensive computational lexicon covering the Emirati Arabic dialect. Featuring over 37,000,000 forms for 29,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-UA has several features that make it ideally suited to support natural language processing applications for Emirati ...
DiaLEX – Saudi Arabian Hijazi (DiaLEX-HA)
The Hijazi Arabic Full-Form Lexicon (DiaLEX-HA) is a comprehensive computational lexicon covering the Hijazi Arabic dialect. Featuring over 25,000,000 forms for 30,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-HA has several features that make it ideally suited to support natural language processing applications for Hijazi ...
DiaLEX – Egyptian (DiaLEX-EA)
The Egyptian Arabic Full-Form Lexicon (DiaLEX-EA) is a comprehensive computational lexicon covering the Egyptian Arabic dialect. Featuring over 93,000,000 forms for 33,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-EA has several features that make it ideally suited to support natural language processing applications for Egyptian ...
Corpus for fine-grained analysis and automatic detection of irony on Twitter
The Corpus for fine-grained analysis and automatic detection of irony on Twitter was carefully annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘situational irony’, ‘other verbal irony’ and ‘not ironic’. The ...
AUDIO Human Voice Pronunciations - Chinese (Simplified)
Human voice recordings of single-word lemmas and multiword expressions, along with IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – without diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Japanese
AUDIO Human Voice Pronunciations - Czech
AUDIO Human Voice Pronunciations - Swedish
AUDIO Human Voice Pronunciations - Thai
AUDIO Human Voice Pronunciations - Korean
AUDIO Human Voice Pronunciations - Italian
AUDIO Human Voice Pronunciations - Dutch
AUDIO Human Voice Pronunciations - Greek
AUDIO Human Voice Pronunciations - Hebrew
AUDIO Human Voice Pronunciations - Danish
AUDIO Human Voice Pronunciations - Norwegian
AUDIO Human Voice Pronunciations - Portuguese (Portugal)
AUDIO Human Voice Pronunciations - Arabic
AUDIO Human Voice Pronunciations - Spanish
AUDIO Human Voice Pronunciations - Portuguese (Brazil)
AUDIO Human Voice Pronunciations - Catalan
AUDIO Human Voice Pronunciations - Russian
AUDIO Human Voice Pronunciations - Polish
