- The Faetar Language
- Ground Rules
- Tracks
- Criteria for Judging Submissions
- Data and Licensing
- Timeline
- How to Participate
- Leaderboard
- Contact
- Organizers
- References
Low-resource speech recognition has gained substantial attention in recent years, particularly with the advent of large multilingual speech foundation models and language models. This is welcome, as there are thousands of languages for which small, partially transcribed collections of field or found recordings exist, but no ASR systems. Such technology would be transformative for linguists, educators, and the numerous minoritized communities worldwide who face challenges to the survival of their languages and cultures. It would allow recorded speech to be valorized and made more accessible, for the benefit of current and future speakers of these languages. However, a clear picture of best practices for developing ASR systems in very low-resource contexts has not yet emerged.
The Faetar Low-Resource ASR Challenge aims to focus researchers’ attention on several issues which are common to many archival collections of speech data:
- noisy field recordings
- lack of standard orthography, leading to noise in the transcriptions in the form of transcriber inconsistencies
- only a few hours of transcribed data
- a larger collection of untranscribed data
- no additional data in the language (textual or speech) that is easily available
- “dirty” transcriptions in documents, which contain matter that needs to be filtered out
By focusing multiple research groups on a single corpus of this kind, we aim to gain deeper insights into these problems than can be achieved otherwise.
The challenge will run from November 1st, 2024 to February 1st, 2025. We encourage participants to submit papers to Interspeech 2025 (main conference). See the Timeline section below for the detailed schedule.
The Faetar Language
Please see Ong et al. 2024 [7] for more details about the Faetar ASR Benchmark Corpus.
The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced [fajdar]) is a variety of the Franco-Provençal language which developed in isolation in Italy, far from other speakers of Franco-Provençal, and in close contact with Italian. Faetar has fewer than 1000 speakers around the world, in Italy and in the diaspora. It is endangered, and preservation, learning, and documentation are priorities for many community members. The benchmark data represents the majority of all archived speech recordings of Faetar in existence, and it is not available from any other source.
Data were extracted from the Faetar collection of the Heritage Language Variation and Change in Toronto (HLVC) corpus [1]. The corpus contains 184 recordings of native Faetar speakers collected in Italy between 1992 and 1994 (the Homeland subset) and 37 recordings of first- and second-generation heritage Faetar speakers collected in Toronto between 2009 and 2010 (the Heritage subset). All come from field recordings, generally noisy, of semi-spontaneous speech.
Faetar has no standard written form. The data set is transcribed quasi-phonetically for linguistic purposes in IPA. The transcriptions are not always consistent, as different parts of the data set were transcribed for different purposes: sometimes the transcription is narrow and phonetic, while at other times the transcription is broad and phonemic.
Ground Rules
- Participants will make use of the training data provided (~4.5 hrs) and will submit phone-level decodings for train, dev and test audio.
- Participants will not have access to the test transcriptions during the period of the challenge and organizers will perform the final evaluation on the test set.
- Participants are provided with a dev kit that allows them to calculate scores on the dev and train sets.
- To ensure that participants can be confident in their results before submission, while maintaining comparability across participants, we also propose standardized alternative splits within the train set, which participants can use to do held-out evaluations without relying only on the small dev set.
- Participants must sign a data agreement preventing redistribution before accessing the data set.
- Each research group may make no more than four submissions for evaluation.
Tracks
- Constrained ASR. Participants should focus on the challenge of improving ASR architectures to work with small, poor-quality data sets. Participants may not use any resources to train or fine-tune their models beyond the files contained in the provided train set: no external pre-trained acoustic models or language models are allowed, nor is the use of the unlabelled portion of the Faetar challenge data set.
Three other “thematic tracks” can be explored, and should not be considered mutually exclusive:
- Using pre-trained acoustic models or language models. Participants focus on the most effective way to make use of models pre-trained on other languages (see the illustrative sketch below).
- Using unlabelled data. The challenge data also includes ~20 hrs of unlabelled data. Participants focus on finding the most effective way to make use of it.
- Dirty data. The training data was extracted and automatically aligned from long-form audio and partial transcriptions in “cluttered” word processor files, relying on (error-prone) VAD, scraping, and alignment. Participants focus on improving the pipeline for extracting useful training data, with the ultimate goal of improving performance. Participants seeking to participate in the Dirty Data challenge should indicate this on the registration form.
Participants should indicate at the time of submission whether they are making a Constrained ASR submission, and, otherwise, which of the three thematic tracks they are submitting to (possibly more than one).
Each research group may make no more than four submissions total for evaluation, across all tracks.
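As a purely illustrative starting point for the first two thematic tracks (not one of the official baselines), the sketch below fine-tunes a multilingual pre-trained checkpoint with a CTC head over a Faetar phone inventory and decodes audio greedily; the same decoding function could also be used to pseudo-label the unlab set for self-training. The checkpoint name, the vocab.json file, and the decode helper are placeholders invented for the example, not materials provided with the challenge.

```python
# Illustrative sketch only: fine-tune a multilingual wav2vec 2.0-style checkpoint
# with a CTC head over the Faetar phone inventory (assumed setup, not the baseline recipe).
import torch
import torchaudio
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
)

# vocab.json: one entry per IPA phone appearing in the train transcripts,
# plus [PAD] (the CTC blank) and [UNK] -- built by the participant (placeholder path).
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Any multilingual checkpoint could stand in here (MMS, mHuBERT-147, XLS-R, ...);
# "facebook/mms-300m" is just a placeholder choice.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-300m",
    vocab_size=len(tokenizer),
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
)
model.freeze_feature_encoder()  # common practice when only ~4.5 h of labels are available

# Training loop / Trainer configuration omitted. After fine-tuning, greedy CTC
# decoding turns audio into a phone string, usable for scoring or for
# pseudo-labelling the unlab set:
def decode(path: str) -> str:
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)
    inputs = processor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```

How to build the phone vocabulary from the quasi-phonetic transcriptions, which parts of the model to freeze, and how to filter pseudo-labels before self-training are exactly the kinds of decisions the thematic tracks are meant to probe.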
Co-submission to the ML-SUPERB Challenge
Many participants in the Faetar Challenge may also wish to participate in the ML-SUPERB 2.0 2025 Challenge, which has been tentatively accepted as a special session at Interspeech 2025. The ML-SUPERB 2.0 benchmark focuses on developing ASR approaches that are robust across languages and language varieties. Participants in the Faetar Challenge whose systems can also be fruitfully evaluated on the ML-SUPERB 2.0 benchmark are strongly encouraged to submit their papers to the ML-SUPERB session at Interspeech. For example, a submission might use the unlabelled set as additional pre-training data for a multilingual speech foundation model while also improving the architecture of the underlying model, in which case the participants may want to evaluate the improved model on the ML-SUPERB benchmark as well. Other potential points of contact include speech enhancement, language model fusion, speaker normalization, and other traditional ASR techniques relevant to the challenges posed by the Faetar corpus.
For more information, see the ML-SUPERB 2.0 website at: https://multilingual.superbbenchmark.org/
Criteria for Judging Submissions
Submissions will be evaluated on phone error rate (PER) on the test set. Participants are provided with a dev kit allowing them to calculate the PER on dev and train, as well as reproduce the baselines. Bootstrap confidence intervals can also be calculated using the dev kit to demonstrate robustness.
A winner or tie will be declared based on PER and confidence intervals. All submissions falling within a 95% confidence interval of the submission with the lowest PER will be considered to have won, with the submission having the numerically lowest PER being awarded a special distinction.
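For reference, PER is the Levenshtein (edit) distance between hypothesis and reference phone sequences, normalized by the total number of reference phones, and the confidence intervals referred to above can be obtained by a bootstrap over utterances. The following is a minimal sketch of both computations under those assumptions; the dev kit's scorer, not this sketch, is authoritative.

```python
# Minimal sketch of corpus-level PER and a percentile bootstrap over utterances.
# Use the dev kit (https://github.com/perceptimatic/faetar-dev-kit) for official scores.
import random

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion (phone missing from hyp)
                            curr[j - 1] + 1,          # insertion (extra phone in hyp)
                            prev[j - 1] + (r != h)))  # substitution (or match)
        prev = curr
    return prev[-1]

def per(pairs: list[tuple[list[str], list[str]]]) -> float:
    """Corpus-level PER in %: total edits / total reference phones."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return 100.0 * errors / sum(len(ref) for ref, _ in pairs)

def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for PER, resampling utterances with replacement."""
    rng = random.Random(seed)
    stats = sorted(per([rng.choice(pairs) for _ in pairs]) for _ in range(n_resamples))
    return stats[int(alpha / 2 * n_resamples)], stats[int((1 - alpha / 2) * n_resamples) - 1]

# pairs: a list of (reference_phones, hypothesis_phones) tuples, one per utterance,
# with each element a list of phone symbols.
```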
One or more overall winners will be declared, as well as one or more winners of the Constrained ASR track. The results of the challenge will also indicate the best approaches within the three other thematic tracks.
Data and Licensing
Please see Ong et al. 2024 [7] for a more detailed description and breakdown of the Faetar ASR Benchmark Corpus.
The Faetar ASR Benchmark Corpus data used in the challenge is available without cost under a restrictive license that prohibits redistribution, among other things. Please see Registering/requesting data access under How to Participate below to request access.
The following table shows the distribution of data in the corpus, which consists of a train set, a test set, a small dev set, and an unlabelled set.
Split | Usage in the challenge | Duration (h:mm:ss) |
---|---|---|
train | Training set (all tracks); Constrained ASR track: no data beyond this set can be used for training | 4:30:17 |
dev | Validation; held-out evaluation before submission to the challenge | 0:11:49 |
unlab | Additional resource in the Unlabelled data track | 19:55:21 |
test | Final evaluation; transcripts are unavailable to participants | 0:46:54 |
Alternate splits
Since the test set is unavailable to challenge participants for the duration of the challenge, we recommend that participants not rely entirely on dev for held-out evaluation. In order to increase the amount of data available for held-out evaluation, we have created an alternate split of the train set comprising the sets 1h and reduced_train (train minus 1h), so that the set 1h can be used as a held-out evaluation set. We will also provide benchmark results (to come) for the alternative subsets within train. The following table shows the distribution of the alternative splits:
Split | Suggested usage | Duration (h:mm:ss) |
---|---|---|
1h | Hold out to use as additional validation/development data; or use as an alternate train set to evaluate lower-data circumstances | 0:58:34 |
reduced_train (i.e. train minus 1h) | Use as the alternate train set when evaluating on the above alternate split(s) | 3:40:32 |
Dirty data
The benchmark corpus was extracted and automatically aligned from long-form audio and (incomplete) transcriptions that were scraped from word processor files that often contained other, irrelevant material, then aligned to the utterance level (see Ong et al. 2024 for more details). Participants in the Dirty data track will seek to improve on the process (scraping, segmenting, aligning), with the goal of improving the quality of the train set. The ultimate goal remains the same, of improving PER on the test set.
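As one purely illustrative entry point (not the organizers' pipeline), the long-form recordings can be re-segmented with an off-the-shelf voice activity detector such as Silero VAD before re-aligning the scraped transcriptions; the file name below is a placeholder.

```python
# Illustrative re-segmentation of a long-form field recording with Silero VAD.
# This is not the organizers' extraction pipeline; the file path is a placeholder.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("homeland_recording.wav", sampling_rate=16000)  # placeholder path
segments = get_speech_timestamps(
    wav,
    model,
    sampling_rate=16000,
    min_speech_duration_ms=250,   # worth tuning against the noisy field recordings
    min_silence_duration_ms=300,
)
for seg in segments:
    start, end = seg["start"] / 16000, seg["end"] / 16000
    print(f"speech segment: {start:.2f}s - {end:.2f}s")
```

How the VAD thresholds are tuned for these noisy recordings, and how the scraped transcriptions are then re-aligned to the resulting segments, are where participants in this track are likely to find the most room for improvement.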
The dirty data collection consists of the original source files (audio and transcriptions) for a subset of train that does not overlap with test or dev (in the standard benchmark corpus, some of the full audio files were split by speaker between train and dev/test), along with some of the mapping files we used during the first stages of extracting and filtering the data, and a summary of the process used to extract the data. Please request the dirty data set on the registration form if you intend to use it.
For participants using the dirty data, we have also created a reference subset dirty_data_train of train which only contains utterances taken from the dirty data files. Baseline results will be provided for this subset for reference. The improved training sets that participants will create from the dirty data are expected to be different from dirty_data_train.
Timeline
- November 1st, 2024: Release of data, opening of challenge
- February 1st, 2025: Participants must submit system description(s) and test decodings
- February 5th, 2025: Participants receive test scores from organizers, winners announced
- February 12th, 2025: Interspeech paper submission deadline
- February 19th, 2025: Interspeech paper update deadline
- May 21st, 2025: Interspeech paper acceptance notification
- August 17-25, 2025: Interspeech conference
How to Participate
Dev kit
The dev kit, which allows you to evaluate your system and replicate the baselines (it does not include the data), is available at https://github.com/perceptimatic/faetar-dev-kit.
Registering/requesting data access
Requests for access to the data will be responded to within 24 hrs. You will receive a download link if your request is approved.
How to submit
Each research group may make no more than four submissions for evaluation.
Details of the submission process will be announced.
Leaderboard
At the outset of the challenge, this will contain only the baseline model results. Please see Ong et al. 2024 for details about the baseline models.
Research group | Description | Training set | Constrained | External pre-trained AM/LM | Uses unlab | Dirty data challenge | PER on test |
---|---|---|---|---|---|---|---|
Organizers (baseline) | Kaldi HMM-GMM Mono + 5-gram Kneser-Ney [6,7] | train | x | | | | 62.6 |
Organizers (baseline) | Kaldi HMM-GMM Tri + 5-gram Kneser-Ney [6,7] | train | x | | | | 56.7 |
Organizers (baseline) | ESPnet ML-SUPERB [2,3] | train | | x | | | 35.8 |
Organizers (baseline) | ESPnet ML-SUPERB [2,3] | 1h | | x | | | 37.4 |
Organizers (baseline) | ESPnet ML-SUPERB [2,3] | 10m | | x | | | 45.1 |
Organizers (baseline) | MMS [4] | train | | x | | | 33.0 |
Organizers (baseline) | mHuBERT-147 [5] | train | | x | | | 33.6 |
Organizers (baseline) | MMS [4] continued pre-training | train | | x | x | | 31.5 |
Organizers (baseline) | MMS [4] self-training | train | | x | x | | 31.0 |
Organizers (baseline) | MMS [4] pre-training + self-training | train | | x | x | | 30.4 |
Contact
Questions should be directed to faetarasrchallenge at gmail dot com.
Organizers
- Ewan Dunbar, University of Toronto
- Michael Ong, University of Toronto
- Leo Peckham, University of Toronto
- Naomi Nagy, University of Toronto
References
[1] N. Nagy, “A multilingual corpus to explore variation in language contact situations,” RILA, pp. 65–84, 2011.
[2] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018, pp. 2207–2211.
[3] J. Shi, W. Chen, D. Berrebbi, H.-H. Wang, W.-P. Huang, E.-P. Hu, H.-L. Chuang, X. Chang, Y. Tang, S.-W. Li, A. Mohamed, H.-Y. Lee, and S. Watanabe, “Findings of the 2023 ML-SUPERB challenge: Pretraining and evaluation over more languages and beyond,” in ASRU, 2023, pp. 1–8.
[4] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,” 2023.
[5] M. Z. Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A compact multilingual HuBERT model,” arXiv preprint arXiv:2406.06371, 2024.
[6] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in ASRU. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society, 2011.
[7] M. Ong, S. Robertson, L. Peckham, A. J. J. de Aberasturi, P. Arkhangorodsky, R. Huo, A. Sakhardande, M. Hallap, N. Nagy, and E. Dunbar, “The Faetar benchmark: Speech recognition in a very under-resourced language,” arXiv preprint arXiv:2409.08103, 2024.