The Faetar Low-Resource ASR Benchmark

Low-resource speech recognition has gained substantial attention in recent years, particularly with the advent of large multilingual speech foundation models and language models. This is welcome: thousands of languages have small, partially transcribed collections of field or found recordings, but no ASR systems. Such technology would be transformative for linguists, educators, and the many minoritized communities worldwide who face challenges to the survival of their languages and cultures. It would allow recorded speech to be valorized and made more accessible, for the benefit of current and future speakers of these languages. However, a clear picture of best practices for developing ASR systems in very low-resource contexts has not yet emerged.

The Faetar Low-Resource ASR Benchmark aims to focus researchers’ attention on several issues which are common to many archival collections of speech data:

By focusing multiple research groups on a single corpus of this kind, we aim to gain deeper insights into these problems than can be achieved otherwise.

The challenge phase (during which the test decodings are embargoed) has been extended in order to ensure that researchers have ample time to develop adequate systems. Please check back for further updates.

The Faetar Language

Please see Ong et al. 2024 for more details about the Faetar ASR Benchmark Corpus.

The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced [fajdar]) is a variety of the Franco-Provençal language which developed in isolation in Italy, far from other speakers of Franco-Provençal, and in close contact with Italian. Faetar has fewer than 1,000 speakers around the world, in Italy and in the diaspora. It is endangered, and preservation, learning, and documentation are a priority for many community members. The benchmark data represents the majority of all archived speech recordings of Faetar in existence, and it is not available from any other source.

Data were extracted from the Faetar collection of the Heritage Language Variation and Change in Toronto (HLVC) corpus [1]. The corpus contains 184 recordings of native Faetar speakers collected in Italy between 1992 and 1994 (the Homeland subset) and 37 recordings of first- and second-generation heritage Faetar speakers collected in Toronto between 2009 and 2010 (the Heritage subset). All are field recordings of semi-spontaneous speech, and they are generally noisy.

Faetar has no standard written form. The data set is transcribed quasi-phonetically for linguistic purposes in IPA. The transcriptions are not always consistent, as different parts of the data set were transcribed for different purposes: sometimes the transcription is narrow and phonetic, while at other times the transcription is broad and phonemic.

Ground Rules

Tracks

In order to appear on the leaderboard, participants should make official submissions. Sub-areas of interest are:

Participants who wish to take part in the Dirty data track should indicate this on the registration form.

Participants should indicate at the time of submission whether they are making a Constrained ASR submission, and, otherwise, which of the three thematic tracks they are submitting to (possibly more than one).

Each research group may make no more than four submissions total for evaluation, across all tracks.

Criteria for Judging Submissions

Submissions will be evaluated on phone error rate (PER) on the test set. Participants are provided with a dev kit allowing them to calculate the PER on dev and train, as well as reproduce the baselines. Bootstrap confidence intervals can also be calculated using the dev kit to demonstrate robustness.
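For reference, PER is typically computed as the Levenshtein (edit) distance between the hypothesis and reference phone sequences, divided by the total number of reference phones. The following is a minimal sketch of that computation and of a percentile bootstrap confidence interval over utterances; it is not the dev kit's implementation, and the space-separated phone strings in the example are hypothetical.

import random

def edit_distance(ref, hyp):
    # Levenshtein distance between two phone sequences
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(prev_row[j] + 1,              # deletion
                               cur_row[j - 1] + 1,           # insertion
                               prev_row[j - 1] + (r != h)))  # substitution or match
        prev_row = cur_row
    return prev_row[-1]

def per(pairs):
    # phone error rate: total edit distance / total number of reference phones
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_phones = sum(len(ref) for ref, _ in pairs)
    return total_errors / total_phones

def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    # percentile bootstrap CI for PER, resampling utterances with replacement
    rng = random.Random(seed)
    stats = sorted(
        per([pairs[rng.randrange(len(pairs))] for _ in range(len(pairs))])
        for _ in range(n_resamples)
    )
    return stats[int(alpha / 2 * n_resamples)], stats[int((1 - alpha / 2) * n_resamples) - 1]

# hypothetical references and hypotheses as space-separated phone strings
refs = {"utt1": "f a i t a n", "utt2": "k i dʒ ə"}
hyps = {"utt1": "f a i d a n", "utt2": "k i ə"}
pairs = [(refs[k].split(), hyps[k].split()) for k in refs]
print(f"PER = {per(pairs):.3f}, 95% CI = {bootstrap_ci(pairs)}")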

Data and Licensing

Please see Ong et al. 2024 for a more detailed description and breakdown of the Faetar ASR Benchmark Corpus.

The Faetar ASR Benchmark Corpus data is available without cost under a restrictive license that prohibits re-distribution, among other things. Please see the Registration section below to request access.

The following table shows the distribution of data in the corpus, which consists of a train set, a test set, a small dev set, and an unlabelled set.

Split | Usage in the challenge | Amount ([h:]mm:ss)
train | Training set (all tracks); in the Constrained ASR track, no data beyond this set may be used for training | 4:30:17
dev | Validation; held-out evaluation before submission to the challenge | 11:49
unlab | Additional resource in the Unlabelled data track | 19:55:21
test | Final evaluation; transcripts are unavailable to participants | 46:54

Alternate splits

Since the test set is unavailable to challenge participants during the challenge phase, we recommend that participants not rely entirely on dev for held-out evaluation. To increase the amount of data available for held-out evaluation, we have created an alternate split of the train set, consisting of the subsets 1h and reduced_train (train minus 1h), so that 1h can be used as a held-out evaluation set. We also provide baseline results for these alternative subsets of train.

The following table shows the distribution of the alternative splits:

Split | Suggested usage | Amount ([h:]mm:ss)
1h | Hold out as additional validation/development data, or use as an alternate train set to evaluate lower-data circumstances | 58:34
reduced_train (i.e. train minus 1h) | Use as the alternate train set when evaluating on the above split(s) | 3:40:32

Dirty data

The benchmark corpus was extracted from long-form audio and (incomplete) transcriptions scraped from word processor files that often contained other, irrelevant material, and was then automatically aligned at the utterance level (see Ong et al. 2024 for more details). Participants in the Dirty data track will seek to improve this process (scraping, segmenting, aligning), with the goal of improving the quality of the train set. The ultimate goal remains the same: improving PER on the test set.

The dirty data collection consists of: the original source files (audio and transcriptions) for a subset of train that does not overlap with test or dev (in the standard benchmark corpus, some of the full audio files were split by speaker between train and dev/test); some of the mapping files that we used during the first stages of extracting and filtering the data; and a summary of the process that was used to extract the data. Please request the dirty data set on the registration form if you intend to use it.

For participants using the dirty data, we have also created a reference subset dirty_data_train of train which only contains utterances taken from the dirty data files. Baseline results will be provided for this subset for reference. The improved training sets that participants will create from the dirty data are expected to be different from dirty_data_train.

Timeline

As of January 2025, the challenge phase has been extended. Please check back in spring of 2025 for further information.

How to Participate

Dev kit

The dev kit, which allows you to evaluate your system and replicate the baselines (it does not include the data), is available at https://github.com/perceptimatic/faetar-dev-kit.

Registering/requesting data access

Requests for access to the data will be responded to within 24 hours. You will receive a download link if your request is approved.

How to submit

We are happy to announce that submissions are now open for the Faetar ASR Challenge. Participants are responsible for submitting decodings for the test set and a model description. If possible, we ask participants to also upload a more detailed model description in the form of a paper draft in order to facilitate our writing a summary paper.

Each research group may submit up to four models for evaluation on the test set. Submissions must be received by Saturday, February 1st 2025 (AoE) in order to receive results on time.

Decodings for all files in the test set, for a single model, should be stored in a single plain-text file, with one utterance per line and with the utterance name (file name prefix) following the decoding:

i kidʒə teɪnə lə fotd əkra l (heF003_00000916_00001116_he011)
faitan fu d ra fi (heF003_00001353_00001466_he011)
...
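As an illustration, the following minimal sketch writes decodings in this format; the output file name and the decodings dict (here filled with the two example utterances above) are hypothetical.

# decodings: utterance name -> decoded phone string (hypothetical values)
decodings = {
    "heF003_00000916_00001116_he011": "i kidʒə teɪnə lə fotd əkra l",
    "heF003_00001353_00001466_he011": "faitan fu d ra fi",
}

with open("test_decodings.txt", "w", encoding="utf-8") as f:
    for utt, hyp in sorted(decodings.items()):
        # one utterance per line: decoding first, then the utterance name in parentheses
        f.write(f"{hyp} ({utt})\n")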

Leaderboard

At the outset of the challenge, this will contain only the baseline model results. Please see Ong et al. 2024 for details about the baseline models.

Research group | Description | Training set | Constrained | External pre-trained AM/LM | Uses unlab | Dirty data challenge | PER on test (%) | PER on dev (%) | PER on 1h (%)
Team A | Conformer-based CTC (sys1) | train | x | | | | 50.8 | NA | NA
Team A | Conformer-based CTC (sys2) | train | x | | | | 51.2 | NA | NA
Team A | Conformer-based CTC (sys3) | train | x | | | | 51.6 | NA | NA
Team A | Conformer-based CTC (sysA) | train | x | | | | 48.0 | NA | NA
Team A | Conformer-based CTC (sysB) | train | x | | | | 48.8 | NA | NA
Team A | Conformer-based CTC (sysC) | train | x | | | | 49.4 | NA | NA
Organizers (baseline) | Kaldi HMM-GMM Mono + 5-gram Kneser-Ney [6,7] | train | x | | | | 62.6 | 65.9 | NA
Organizers (baseline) | Kaldi HMM-GMM Tri + 5-gram Kneser-Ney [6,7] | train | x | | | | 56.7 | 58.2 | NA
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | train | | | | | 35.8 | 43.0 | NA
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | 1hr | | | | | 37.4 | 44.4 | NA
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | 10m | | | | | 45.1 | 50.2 | NA
Organizers (baseline) | MMS [4] | train | | x | | | 33.0 | 39.9 | NA
Organizers (baseline) | mHuBERT-147 [5] | train | | x | | | 33.6 | 41.1 | NA
Organizers (baseline) | MMS [4] self-training | train | | x | x | | 31.0 | 36.6 | NA
Organizers (baseline) | MMS [4] pre-training + self-training | train | | x | x | | 30.4 | 36.4 | NA
Organizers (baseline) | MMS [4] continued pre-training | train | | x | x | | 31.5 | 38.7 | NA
Organizers (baseline) | MMS [4] | reduced_train | | x | | | 33.8 | 37.9 | 34.5
Organizers (baseline) | MMS [4] self-training | reduced_train | | x | x | | 33.4 | 37.2 | 34.1
Organizers (baseline) | mHuBERT-147 [5] | reduced_train | | x | | | 35.5 | 42.8 | 36.3
Organizers (baseline) | mHuBERT-147 [5] self-training | reduced_train | | x | x | | 35.1 | 42.3 | 35.6

Contact

Questions should be directed to faetarasrchallenge at gmail dot com.

Organizers

References

[1] N. Nagy, “A multilingual corpus to explore variation in language contact situations,” RILA, pp. 65–84, 2011.

[2] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018, pp. 2207–2211.

[3] J. Shi, W. Chen, D. Berrebbi, H.-H. Wang, W.-P. Huang, E.-P. Hu, H.-L. Chuang, X. Chang, Y. Tang, S.-W. Li, A. Mohamed, H.-Y. Lee, and S. Watanabe, “Findings of the 2023 ML-SUPERB challenge: Pretraining and evaluation over more languages and beyond,” in ASRU, 2023, pp. 1–8.

[4] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023.

[5] M. Z. Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A compact multilingual HuBERT model,” arXiv preprint arXiv:2406.06371, 2024.

[6] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in ASRU. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society, 2011.

[7] M. Ong, S. Robertson, L. Peckham, A. J. J. de Aberasturi, P. Arkhangorodsky, R. Huo, A. Sakhardande, M. Hallap, N. Nagy, and E. Dunbar, “The Faetar benchmark: Speech recognition in a very under-resourced language,” arXiv preprint arXiv:2409.08103, 2024.