The Faetar Low-Resource ASR Benchmark

Low-resource speech recognition has gained substantial attention in recent years, particularly with the advent of large multilingual speech foundation models and language models. This is welcome: thousands of languages have small, partially transcribed collections of field or found recordings, but no ASR systems. Such technology would be transformative for linguists, educators, and the many minoritized communities worldwide who face challenges to the survival of their languages and cultures. It would allow recorded speech to be valorized and made more accessible, for the benefit of current and future speakers of these languages. However, a clear picture of best practices for developing ASR systems in very low-resource contexts has not yet emerged.

The Faetar Low-Resource ASR Benchmark aims to focus researchers’ attention on several issues which are common to many archival collections of speech data:

By focusing multiple research groups on a single corpus of this kind, we aim to gain deeper insights into these problems than can be achieved otherwise.

The challenge phase (during which the test decodings are embargoed) has been extended in order to ensure that researchers have ample time to develop adequate systems. Please check back for further updates.

The Faetar Language

Please see Ong et al. 2024 for more details about the Faetar ASR Benchmark Corpus.

The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced [fajdar]) is a variety of the Franco-Provençal language which developed in isolation in Italy, far from other speakers of Franco-Provençal, and in close contact with Italian. Faetar has fewer than 1,000 speakers around the world, in Italy and in the diaspora. It is endangered, and preservation, learning, and documentation are a priority for many community members. The benchmark data represents the majority of all archived speech recordings of Faetar in existence, and it is not available from any other source.

Data were extracted from the Faetar collection of the Heritage Language Variation and Change in Toronto (HLVC) corpus [1]. The corpus contains 184 recordings of native Faetar speakers collected in Italy between 1992 and 1994 (the Homeland subset) and 37 recordings of first- and second-generation heritage Faetar speakers collected in Toronto between 2009 and 2010 (the Heritage subset). All are field recordings of semi-spontaneous speech, and they are generally noisy.

Faetar has no standard written form. The data set is transcribed quasi-phonetically for linguistic purposes in IPA. The transcriptions are not always consistent, as different parts of the data set were transcribed for different purposes: sometimes the transcription is narrow and phonetic, while at other times the transcription is broad and phonemic.

Ground Rules

Tracks

In order to appear on the leaderboard, participants should make official submissions. Sub-areas of interest are:

Participants who wish to take part in the Dirty data track should indicate this on the registration form.

Participants should indicate at the time of submission whether they are making a Constrained ASR submission, and, otherwise, which of the three thematic tracks they are submitting to (possibly more than one).

Each research group may make no more than four submissions total for evaluation, across all tracks.

Criteria for Judging Submissions

Submissions will be evaluated on phone error rate (PER) on the test set. Participants are provided with a dev kit allowing them to calculate the PER on dev and train, as well as reproduce the baselines. Bootstrap confidence intervals can also be calculated using the dev kit to demonstrate robustness.
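For reference, PER is typically computed as the Levenshtein (edit) distance between the hypothesis and reference phone sequences, divided by the total number of reference phones. The following is a minimal sketch of that computation and of a percentile bootstrap confidence interval over utterances; it is not the dev kit's implementation, and the space-separated phone strings in the example are hypothetical.

import random

def edit_distance(ref, hyp):
    # Levenshtein distance between two phone sequences
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(prev_row[j] + 1,              # deletion
                               cur_row[j - 1] + 1,           # insertion
                               prev_row[j - 1] + (r != h)))  # substitution or match
        prev_row = cur_row
    return prev_row[-1]

def per(pairs):
    # phone error rate: total edit distance / total number of reference phones
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_phones = sum(len(ref) for ref, _ in pairs)
    return total_errors / total_phones

def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    # percentile bootstrap CI for PER, resampling utterances with replacement
    rng = random.Random(seed)
    stats = sorted(
        per([pairs[rng.randrange(len(pairs))] for _ in range(len(pairs))])
        for _ in range(n_resamples)
    )
    return stats[int(alpha / 2 * n_resamples)], stats[int((1 - alpha / 2) * n_resamples) - 1]

# hypothetical references and hypotheses as space-separated phone strings
refs = {"utt1": "f a i t a n", "utt2": "k i dʒ ə"}
hyps = {"utt1": "f a i d a n", "utt2": "k i ə"}
pairs = [(refs[k].split(), hyps[k].split()) for k in refs]
print(f"PER = {per(pairs):.3f}, 95% CI = {bootstrap_ci(pairs)}")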

Data and Licensing

Please see Ong et al. 2024 for a more detailed description and breakdown of the Faetar ASR Benchmark Corpus.

The Faetar ASR Benchmark Corpus data is available without cost under a restrictive license that prohibits re-distribution, among other things. Please see the Registration section below to request access.

The following table shows the distribution of data in the corpus, which consists of a train set, a test set, a small dev set, and an unlabelled set.

Split | Usage in the challenge | Amount ([h:]mm:ss)
train | Training set (all tracks); in the Constrained ASR track, no data beyond this set may be used for training | 4:30:17
dev | Validation; held-out evaluation before submission to the challenge | 11:49
unlab | Additional resource in the Unlabelled data track | 19:55:21
test | Final evaluation; transcripts are unavailable to participants | 46:54

Alternate splits

Since the test set is unavailable to challenge participants during the challenge phase, we recommend that participants not rely entirely on dev for held-out evaluation. To increase the amount of data available for held-out evaluation, we have created an alternate split of the train set, consisting of the subsets 1h and reduced_train (train minus 1h), so that 1h can be used as a held-out evaluation set. We also provide baseline results for these alternative subsets of train.

The following table shows the distribution of the alternative splits:

Split | Suggested usage | Amount ([h:]mm:ss)
1h | Hold out as additional validation/development data, or use as an alternate train set to evaluate lower-data circumstances | 58:34
reduced_train (i.e. train minus 1h) | Use as the alternate train set when evaluating on the above split(s) | 3:40:32

Dirty data

The benchmark corpus was extracted from long-form audio and (incomplete) transcriptions scraped from word processor files that often contained other, irrelevant material, and was then automatically aligned at the utterance level (see Ong et al. 2024 for more details). Participants in the Dirty data track will seek to improve this process (scraping, segmenting, aligning), with the goal of improving the quality of the train set. The ultimate goal remains the same: improving PER on the test set.

The dirty data collection consists of: the original source files (audio and transcriptions) for a subset of train that does not overlap with test or dev (in the standard benchmark corpus, some of the full audio files were split by speaker between train and dev/test); some of the mapping files that we used during the first stages of extracting and filtering the data; and a summary of the process that was used to extract the data. Please request the dirty data set on the registration form if you intend to use it.

For participants using the dirty data, we have also created a reference subset dirty_data_train of train which only contains utterances taken from the dirty data files. Baseline results will be provided for this subset for reference. The improved training sets that participants will create from the dirty data are expected to be different from dirty_data_train.

Timeline

As of January 2025, the challenge phase has been extended. Please check back in spring of 2025 for further information.

How to Participate

Dev kit

The dev kit, which allows you to evaluate your system and replicate the baselines (it does not include the data), is available at https://github.com/perceptimatic/faetar-dev-kit.

Registering/requesting data access

Requests for access to the data will be responded to within 24 hours. You will receive a download link if your request is approved.

How to submit

We are happy to announce that submissions are now open for the Faetar ASR Challenge. Participants are responsible for submitting decodings for the test set and a model description. If possible, we ask participants to also upload a more detailed model description in the form of a paper draft in order to facilitate our writing a summary paper.

Each research group may submit up to four models for evaluation on the test set. Submissions must be received by Saturday, February 1st 2025 (AoE) in order to receive results on time.

Decodings for all files in the test set, for a single model, should be stored in a single plain-text file, with one utterance per line and with the utterance name (file name prefix) following the decoding:

i kidʒə teɪnə lə fotd əkra l (heF003_00000916_00001116_he011)
faitan fu d ra fi (heF003_00001353_00001466_he011)
...
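As an illustration, the following minimal sketch writes decodings in this format; the output file name and the decodings dict (here filled with the two example utterances above) are hypothetical.

# decodings: utterance name -> decoded phone string (hypothetical values)
decodings = {
    "heF003_00000916_00001116_he011": "i kidʒə teɪnə lə fotd əkra l",
    "heF003_00001353_00001466_he011": "faitan fu d ra fi",
}

with open("test_decodings.txt", "w", encoding="utf-8") as f:
    for utt, hyp in sorted(decodings.items()):
        # one utterance per line: decoding first, then the utterance name in parentheses
        f.write(f"{hyp} ({utt})\n")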

Leaderboard

At the outset of the challenge, this will contain only the baseline model results. Please see Ong et al. 2024 for details about the baseline models.

Research group | Description | Training set | Constrained | External pre-trained AM/LM | Uses unlab | Dirty data challenge | PER on test (%) | PER on dev (%) | PER on 1h (%)
Team A | Conformer-based CTC (sys1) | train | x | | | | 50.8 | NA | NA
Team A | Conformer-based CTC (sys2) | train | x | | | | 51.2 | NA | NA
Team A | Conformer-based CTC (sys3) | train | x | | | | 51.6 | NA | NA
Team A | Conformer-based CTC (sysA) | train | x | | | | 48.0 | NA | NA
Team A | Conformer-based CTC (sysB) | train | x | | | | 48.8 | NA | NA
Team A | Conformer-based CTC (sysC) | train | x | | | | 49.4 | NA | NA
Organizers (baseline) | Kaldi HMM-GMM Mono + 5-gram Kneser-Ney [6,7] | train | x | | | | 62.6 | 65.9 | NA
Organizers (baseline) | Kaldi HMM-GMM Tri + 5-gram Kneser-Ney [6,7] | train | x | | | | 56.7 | 58.2 | NA
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | train | | | | | 35.8 | 43.0 | NA
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | 1hr | | | | | 37.4 | 44.4 | NA
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | 10m | | | | | 45.1 | 50.2 | NA
Organizers (baseline) | MMS [4] | train | | x | | | 33.0 | 39.9 | NA
Organizers (baseline) | mHuBERT-147 [5] | train | | x | | | 33.6 | 41.1 | NA
Organizers (baseline) | MMS [4] self-training | train | | x | x | | 31.0 | 36.6 | NA
Organizers (baseline) | MMS [4] pre-training + self-training | train | | x | x | | 30.4 | 36.4 | NA
Organizers (baseline) | MMS [4] continued pre-training | train | | x | x | | 31.5 | 38.7 | NA
Organizers (baseline) | MMS [4] | reduced_train | | x | | | 33.8 | 37.9 | 34.5
Organizers (baseline) | MMS [4] self-training | reduced_train | | x | x | | 33.4 | 37.2 | 34.1
Organizers (baseline) | mHuBERT-147 [5] | reduced_train | | x | | | 35.5 | 42.8 | 36.3
Organizers (baseline) | mHuBERT-147 [5] self-training | reduced_train | | x | x | | 35.1 | 42.3 | 35.6

Contact

Questions should be directed to faetarasrchallenge at gmail dot com.

Organizers

References

[1] N. Nagy, “A multilingual corpus to explore variation in language contact situations,” RILA, pp. 65–84, 2011.

[2] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018, pp. 2207–2211.

[3] J. Shi, W. Chen, D. Berrebbi, H.-H. Wang, W.-P. Huang, E.-P. Hu, H.-L. Chuang, X. Chang, Y. Tang, S.-W. Li, A. Mohamed, H.-Y. Lee, and S. Watanabe, “Findings of the 2023 ML-SUPERB challenge: Pretraining and evaluation over more languages and beyond,” in ASRU, 2023, pp. 1–8.

[4] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023.

[5] M. Z. Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A compact multilingual HuBERT model,” arXiv preprint arXiv:2406.06371, 2024.

[6] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in ASRU. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society, 2011.

[7] M. Ong, S. Robertson, L. Peckham, A. J. J. de Aberasturi, P. Arkhangorodsky, R. Huo, A. Sakhardande, M. Hallap, N. Nagy, and E. Dunbar, “The Faetar benchmark: Speech recognition in a very under-resourced language,” arXiv preprint arXiv:2409.08103, 2024.