The Faetar Low-Resource ASR Challenge 2025

Low-resource speech recognition has gained substantial attention in recent years, particularly with the advent of large multilingual speech foundation models and language models. This is welcome, as there are thousands of languages for which small, partially transcribed collections of field or found recordings exist, but no ASR systems. Such technology would be transformative for linguists, educators, and the many minoritized communities worldwide who face challenges to the survival of their languages and cultures: it would allow recorded speech to be valorized and made more accessible, for the benefit of current and future speakers of these languages. However, a clear picture of best practices for developing ASR systems in very low-resource contexts has not yet emerged.

The Faetar Low-Resource ASR Challenge aims to focus researchers’ attention on several issues common to many archival collections of speech data: a very small amount of labelled training data, noisy field recordings, quasi-phonetic transcriptions that are not always consistent, the absence of a standard orthography, a larger pool of untranscribed audio, and messy source files from which the data must be extracted.

By focusing multiple research groups on a single corpus of this kind, we aim to gain deeper insights into these problems than can be achieved otherwise.

The challenge will run from November 1st 2024 to February 1st 2025. We encourage participants to submit papers to Interspeech 2025 (main conference). See the Timeline section below for detailed dates.

The Faetar Language

Please see Ong et al. 2024 for more details about the Faetar ASR Benchmark Corpus.

The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced [fajdar]) is a variety of the Franco-Provençal language which developed in isolation in Italy, far from other speakers of Franco-Provençal, and in close contact with Italian. Faetar has fewer than 1000 speakers around the world, in Italy and in the diaspora. It is endangered, and preservation, learning, and documentation are a priority for many community members. The benchmark data represents the majority of all archived speech recordings of Faetar in existence, and it is not available from any other source.

Data were extracted from the Faetar collection of the Heritage Language Variation and Change in Toronto (HLVC) corpus [1]. The corpus contains 184 recordings of native Faetar speakers collected in Italy between 1992 and 1994 (the Homeland subset) and 37 recordings of first- and second-generation heritage Faetar speakers collected in Toronto between 2009 and 2010 (the Heritage subset). All of the data comes from field recordings of semi-spontaneous speech, which are generally noisy.

Faetar has no standard written form. The data set is transcribed quasi-phonetically in IPA, for linguistic purposes. The transcriptions are not always consistent, as different parts of the data set were transcribed for different purposes: sometimes the transcription is narrow and phonetic, while at other times it is broad and phonemic.

Ground Rules

Tracks

In addition to the Constrained ASR track, three “thematic tracks” can be explored; these should not be considered mutually exclusive.

Participants should indicate at the time of submission whether they are making a Constrained ASR submission and, if not, which of the three thematic tracks they are submitting to (possibly more than one).

Each research group may make no more than four submissions total for evaluation, across all tracks.

Co-submission to the ML-SUPERB Challenge

Many participants in the Faetar Challenge may also wish to participate in the ML-SUPERB 2.0 2025 Challenge, which has been tentatively accepted as a special session at Interspeech 2025. The ML-SUPERB 2.0 benchmark focuses on developing approaches to ASR which are robust across languages and language varieties. Participants in the Faetar challenge with systems that can also be fruitfully evaluated on the ML-SUPERB 2.0 benchmark are strongly encouraged to submit their papers to the ML-SUPERB session at Interspeech. For example, a submission may attempt to make efficient use of the unlabelled set by using it as additional pre-training data for a multilingual speech foundation model, while also improving the architecture of the underlying model; in that case, the participants may wish to evaluate the improved model on the ML-SUPERB benchmark as well. Other potential points of contact include speech enhancement, language model fusion, speaker normalization, and other traditional ASR techniques relevant to the challenges posed by the Faetar corpus.
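As a concrete (if simplified) illustration of using the unlabelled set, the sketch below pseudo-labels unlab recordings with an already fine-tuned CTC model, in the spirit of the self-training baseline listed on the leaderboard. It is only a sketch: the checkpoint path, directory layout, and confidence threshold are placeholders, and it assumes a Hugging Face wav2vec 2.0-style checkpoint rather than describing the organizers’ actual baseline recipe.

```python
# Minimal pseudo-labelling ("self-training") sketch for the unlab set.
# Assumptions: a wav2vec 2.0-style CTC checkpoint fine-tuned on train (path is a
# placeholder), one .wav file per utterance, and an arbitrary confidence
# threshold. This is an illustration, not the official baseline recipe.
from pathlib import Path

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CKPT = "path/to/finetuned-mms-checkpoint"  # placeholder
UNLAB_DIR = Path("data/unlab")             # placeholder layout

processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

pseudo_labels = {}
for wav_path in sorted(UNLAB_DIR.glob("*.wav")):
    waveform, sr = torchaudio.load(str(wav_path))
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits       # (1, frames, vocab)
    probs = logits.softmax(dim=-1)
    ids = probs.argmax(dim=-1)                           # greedy CTC decoding
    confidence = probs.max(dim=-1).values.mean().item()  # crude per-utterance score
    text = processor.batch_decode(ids)[0]
    if text.strip() and confidence > 0.9:                # arbitrary threshold
        pseudo_labels[wav_path.stem] = text

# pseudo_labels can then be added to the labelled training data for a further
# round of fine-tuning, as in the self-training baselines on the leaderboard.
```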

For more information, see the ML-SUPERB 2.0 website at: https://multilingual.superbbenchmark.org/

Criteria for Judging Submissions

Submissions will be evaluated on phone error rate (PER) on the test set. Participants are provided with a dev kit allowing them to calculate the PER on dev and train, as well as reproduce the baselines. Bootstrap confidence intervals can also be calculated using the dev kit to demonstrate robustness.
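The dev kit implements the official scoring; for orientation only, here is a rough, self-contained sketch of how a PER and a percentile bootstrap confidence interval over utterances can be computed. The function names and the toy phone strings are illustrative and are not taken from the dev kit.

```python
import random

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Standard Levenshtein distance over phone symbols.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,              # deletion
                          curr[j - 1] + 1,          # insertion
                          prev[j - 1] + (r != h))   # substitution
        prev = curr
    return prev[-1]

def per(pairs: list[tuple[list[str], list[str]]]) -> float:
    # Phone error rate: total edits divided by total reference phones, in percent.
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    phones = sum(len(ref) for ref, _ in pairs)
    return 100.0 * edits / phones

def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    # Percentile bootstrap over utterances: resample utterance pairs with
    # replacement and take the empirical (alpha/2, 1 - alpha/2) quantiles.
    rng = random.Random(seed)
    stats = sorted(per([rng.choice(pairs) for _ in pairs]) for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy usage with space-separated phone strings (illustrative only):
refs = ["a b c d", "e f g"]
hyps = ["a b d", "e f g h"]
pairs = [(r.split(), h.split()) for r, h in zip(refs, hyps)]
print(per(pairs), bootstrap_ci(pairs, n_resamples=200))
```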

A winner or tie will be declared based on PER and confidence intervals. All submissions falling within a 95% confidence interval of the submission with the lowest PER will be considered to have won, with the submission having the numerically lowest PER being awarded a special distinction.

One or more overall winners will be declared, as well as one or more winners of the Constrained ASR track. The results of the challenge will also indicate the best approaches within the three thematic tracks.
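For concreteness, one way to read this rule is sketched below; the system names, PERs, and confidence interval are made up, and this is not the organizers’ scoring script.

```python
def winners(per_by_system: dict[str, float], best_ci: tuple[float, float]) -> list[str]:
    # Every system whose test PER falls inside the 95% bootstrap confidence
    # interval of the lowest-PER system counts as a winner.
    lo, hi = best_ci
    return sorted(name for name, p in per_by_system.items() if lo <= p <= hi)

# Hypothetical example: sys_a has the numerically lowest PER (special distinction);
# sys_b falls inside its confidence interval and ties; sys_c does not.
print(winners({"sys_a": 30.4, "sys_b": 30.9, "sys_c": 33.0}, best_ci=(29.6, 31.2)))
# -> ['sys_a', 'sys_b']
```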

Data and Licensing

Please see Ong et al. 2024 for a more detailed description and breakdown of the Faetar ASR Benchmark Corpus.

The Faetar ASR Benchmark Corpus data used in the challenge is available without cost under a restrictive license that prohibits re-distribution, among other things. Please see the Registration section below to request access.

The following table shows the distribution of data in the corpus, which consists of a train set, a test set, a small dev set, and an unlabelled set.

Split | Usage in the challenge | Amount (h:mm:ss)
train | Training set (all tracks); in the Constrained ASR track, no data beyond this set can be used for training | 4:30:17
dev | Validation; held-out evaluation before submission to the challenge | 0:11:49
unlab | Additional resource in the Unlabelled data track | 19:55:21
test | Final evaluation; transcripts are unavailable to participants | 0:46:54

Alternate splits

Since the test set is unavailable to challenge participants for the duration of the challenge, we recommend that participants not rely entirely on dev for held-out evaluation. In order to increase the amount of data available for held-out evaluation, we have created an alternate split of the train set comprising the sets 1h and reduced_train (train minus 1h), so that the set 1h can be used as a held-out evaluation set. We will also provide benchmark results (to come) for these alternative subsets within train. The following table shows the distribution of the alternative splits:

Split | Suggested usage | Amount (h:mm:ss)
1h | Hold out as additional validation/development data, or use as an alternate train set to evaluate lower-data conditions | 0:58:34
reduced_train (train minus 1h) | Use as the alternate train set when evaluating on the above alternate split(s) | 3:40:32

Dirty data

The benchmark corpus was extracted and automatically aligned from long-form audio and (incomplete) transcriptions. The transcriptions were scraped from word-processor files that often contained other, irrelevant material, and were then aligned to the audio at the utterance level (see Ong et al. 2024 for more details). Participants in the Dirty data track will seek to improve on this process (scraping, segmenting, aligning), with the goal of improving the quality of the train set. The ultimate goal remains the same: improving PER on the test set.

The dirty data collection consists of: the original source files (audio and transcriptions) for a subset of train that does not overlap with test or dev (in the standard benchmark corpus, some of the full audio files were split by speaker between train and dev/test); some of the mapping files that we used during the first stages of extraction and filtering of the data; and a summary of the process that was used to extract the data. Please request the dirty data set on the registration form if you intend to use it.

For participants using the dirty data, we have also created a reference subset of train, dirty_data_train, which contains only utterances taken from the dirty data files. Baseline results will be provided for this subset for reference. The improved training sets that participants create from the dirty data are expected to differ from dirty_data_train.
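As a rough illustration of the kind of pipeline step this track targets, the sketch below pulls candidate transcription lines out of a word-processor file and discards obviously irrelevant material using a crude character-based heuristic. The .docx assumption, the python-docx dependency, the file name, and the heuristic threshold are all placeholders; this is not the pipeline that was used to build the benchmark.

```python
# Hypothetical scraping/filtering step for the Dirty data track: extract candidate
# IPA transcription lines from a word-processor file. Everything here (the .docx
# format, the heuristic, the file name) is an assumption for illustration only.
import re
import unicodedata

from docx import Document  # pip install python-docx

# Characters from the IPA Extensions, Spacing Modifier Letters, and Combining
# Diacritical Marks blocks, which rarely occur in ordinary prose.
IPA_RE = re.compile(r"[\u0250-\u02AF\u02B0-\u02FF\u0300-\u036F]")

def looks_like_ipa(line: str, min_ratio: float = 0.05) -> bool:
    # Crude heuristic: keep lines in which at least min_ratio of the non-space
    # characters are IPA-specific. Plain Latin letters also occur in IPA
    # transcriptions, so the threshold is deliberately low.
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    return sum(bool(IPA_RE.match(c)) for c in chars) / len(chars) >= min_ratio

def scrape_transcriptions(path: str) -> list[str]:
    # Return the paragraphs of a .docx file that plausibly contain transcription.
    doc = Document(path)
    lines = []
    for para in doc.paragraphs:
        text = unicodedata.normalize("NFC", para.text).strip()
        if text and looks_like_ipa(text):
            lines.append(text)
    return lines

print(scrape_transcriptions("F01_homeland_interview.docx"))  # hypothetical file name
```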

Timeline

How to Participate

Dev kit

The dev kit, which allows you to evaluate your system and replicate the baselines (it does not include the data), is available at https://github.com/perceptimatic/faetar-dev-kit.

Registering/requesting data access

Requests for access to the data will be responded to within 24 hrs. You will receive a download link if your request is approved.

How to submit

Each research group may make no more than four submissions for evaluation.

Details of the submission process will be announced.

Leaderboard

At the outset of the challenge, this will contain only the baseline model results. Please see Ong et al. 2024 for details about the baseline models.

Research group Description Training set Constrained External pre-trained AM/LM Uses unlab Dirty data challenge PER on test
Organizers (baseline) Kaldi HMM-GMM Mono + 5-gram Kneser-Ney [6,7] train x 62.6
Organizers (baseline) Kaldi HMM-GMM Tri + 5-gram Kneser-Ney [6,7] train x 56.7
Organizers (baseline) ESPnet ML-SUPERB [2,3] train x 35.8
Organizers (baseline) ESPnet ML-SUPERB [2,3] 1hr x 37.4
Organizers (baseline) ESPnet ML-SUPERB [2,3] 10m x 45.1
Organizers (baseline) MMS [4] train x 33.0
Organizers (baseline) mHubert-147 [5] train x 33.6
Organizers (baseline) MMS [4] continued pre-training train x x 31.5
Organizers (baseline) MMS [4] self-training train x x 31.0
Organizers (baseline) MMS [4] pre-training + self-training train x x 30.4

Contact

Questions should be directed to faetarasrchallenge at gmail dot com.

Organizers

References

[1] N. Nagy, “A multilingual corpus to explore variation in language contact situations,” RILA, pp. 65–84, 2011.

[2] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018, pp. 2207–2211.

[3] J. Shi, W. Chen, D. Berrebbi, H.-H. Wang, W.-P. Huang, E.-P. Hu, H.-L. Chuang, X. Chang, Y. Tang, S.-W. Li, A. Mohamed, H.-Y. Lee, and S. Watanabe, “Findings of the 2023 ML-SUPERB challenge: Pretraining and evaluation over more languages and beyond,” in ASRU, 2023, pp. 1–8.

[4] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,” 2023.

[5] M. Z. Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A compact multilingual HuBERT model,” arXiv preprint arXiv:2406.06371, 2024.

[6] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in ASRU. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society, 2011.

[7] M. Ong, S. Robertson, L. Peckham, A. J. J. de Aberasturi, P. Arkhangorodsky, R. Huo, A. Sakhardande, M. Hallap, N. Nagy, and E. Dunbar, “The Faetar benchmark: Speech recognition in a very under-resourced language,” arXiv preprint arXiv:2409.08103, 2024.