- The Faetar Language
- Ground Rules
- Tracks
- Criteria for Judging Submissions
- Data and Licensing
- Timeline
- How to Participate
- Leaderboard
- Contact
- Organizers
- References
Low-resource speech recognition has gained substantial attention in recent years, particularly with the advent of large multilingual speech foundation models and language models. This is welcome: there are thousands of languages with small, partially transcribed collections of field or found recordings, but no ASR systems. Such technology would be transformative for linguists, educators, and the numerous minoritized communities worldwide who face challenges to the survival of their languages and cultures. It would allow recorded speech to be valorized and made more accessible, for the benefit of current and future speakers of these languages. However, a clear picture of best practices for developing ASR systems in very low-resource contexts has not yet emerged.
The Faetar Low-Resource ASR Benchmark aims to focus researchers’ attention on several issues which are common to many archival collections of speech data:
- noisy field recordings
- lack of a standard orthography, leading to noise in the transcriptions in the form of transcriber inconsistencies
- only a few hours of transcribed data
- a larger collection of untranscribed data
- no additional data in the language (textual or speech) that is easily available
- “dirty” transcriptions in documents, which contain extraneous material that needs to be filtered out
By focusing multiple research groups on a single corpus of this kind, we aim to gain deeper insights into these problems than can be achieved otherwise.
The challenge phase (during which the test transcriptions are embargoed) has been extended in order to ensure that researchers have ample time to develop adequate systems. Please check back for further updates.
The Faetar Language
Please see Ong et al. 2024 for more details about the Faetar ASR Benchmark Corpus.
The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced [fajdar]) is a variety of the Franco-Provençal language which developed in isolation in Italy, far from other speakers of Franco-Provençal, and in close contact with Italian. Faetar has fewer than 1,000 speakers around the world, in Italy and in the diaspora. It is endangered, and preservation, learning, and documentation are a priority for many community members. The benchmark data represents the majority of all archived speech recordings of Faetar in existence, and it is not available from any other source.
Data were extracted from the Faetar collection of the Heritage Language Variation and Change in Toronto (HLVC) corpus [1]. The corpus contains 184 recordings of native Faetar speakers collected in Italy between 1992 and 1994 (the Homeland subset) and 37 recordings of first- and second-generation heritage Faetar speakers collected in Toronto between 2009 and 2010 (the Heritage subset). All come from field recordings, generally noisy, of semi-spontaneous speech.
Faetar has no standard written form. The data set is transcribed quasi-phonetically for linguistic purposes in IPA. The transcriptions are not always consistent, as different parts of the data set were transcribed for different purposes: sometimes the transcription is narrow and phonetic, while at other times the transcription is broad and phonemic.
Ground Rules
- Participants will make use of the training data provided (~4.5 hrs) and will submit phone-level decodings for train, dev and test audio.
- Participants will not have access to the test transcriptions during the challenge period; the organizers will perform the final evaluation on the test set.
- Participants are provided with a dev kit that allows them to calculate scores on the dev and train sets.
- To ensure that participants can be confident in their results before submission, while maintaining comparability across participants, we also provide standardized alternative splits within the train set, which participants can use for held-out evaluation without relying only on the small dev set.
- Participants must sign a data agreement preventing redistribution before accessing the data set.
- Each research group may make no more than four submissions for evaluation.
Tracks
In order to appear on the leaderboard, participants should make official submissions. Sub-areas of interest are:
- Constrained ASR. Participants focus on the challenge of improving ASR architectures to work with small, poor-quality data sets. Participants may not use any resources to train / fine-tune their models beyond the files contained in the provided training data: no external pre-trained acoustic models or language models are allowed, and use of the unlabelled portion (unlab) of the Faetar Benchmark data set is not allowed. At submission time, participants should not check the External AM/LM or Unlab boxes on the model description form.
- Using pre-trained acoustic models or language models. Participants focus on the most effective way to make use of models pre-trained on other languages (see the sketch at the end of this section for one possible starting point). At submission time, participants should check External AM/LM on the model description form.
- Using unlabelled data. The challenge data also includes ~20 hrs of unlabelled data (unlab). Participants focus on finding the most effective way to make use of it. At submission time, participants should check Unlab on the model description form (this checkbox is not mutually exclusive with the External AM/LM checkbox).
- Dirty data. The training data was extracted and automatically aligned from long-form audio and partial transcriptions in “cluttered” word processor files, relying on (error-prone) VAD, scraping, and alignment. Participants focus on improving the pipeline for extracting useful training data, with the ultimate goal of improving performance. At submission time, participants should check Dirty Data on the model description form.
Participants who intend to take part in the Dirty Data track should indicate this on the registration form.
Participants should indicate at the time of submission whether they are making a Constrained ASR submission or, if not, which of the three thematic tracks they are submitting to (possibly more than one).
Each research group may make no more than four submissions total for evaluation, across all tracks.
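For the pre-trained acoustic model track, the sketch below shows one plausible way to set up phone-level CTC fine-tuning of a multilingual pre-trained model. It is only an illustration, not the organizers' baseline recipe: the checkpoint name facebook/mms-1b, the vocab.json phone inventory file, and the preprocessing conventions are all assumptions.

```python
# Illustrative sketch only (not the challenge baseline): set up CTC fine-tuning
# of a multilingual pre-trained acoustic model for phone-level Faetar ASR.
# Assumes a vocab.json mapping each phone symbol (plus [PAD], [UNK], and the "|"
# delimiter) to an index, built from the provided training transcriptions.
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained model with a freshly initialized CTC head sized to the
# Faetar phone inventory. Label preprocessing (e.g. mapping the spaces between
# phones to "|") is omitted here.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
    ignore_mismatched_sizes=True,  # the new CTC head does not match the checkpoint
)
model.freeze_feature_encoder()  # common practice when fine-tuning on a few hours of audio
```

Training would then proceed with a standard CTC fine-tuning loop (for example, the Hugging Face Trainer with a padding data collator). Systems for the unlabelled-data track often start from such a fine-tuned model and add pseudo-labelling of unlab, as in the self-training baselines on the leaderboard.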
Criteria for Judging Submissions
Submissions will be evaluated on phone error rate (PER) on the test set. Participants are provided with a dev kit allowing them to calculate the PER on dev and train, as well as reproduce the baselines. Bootstrap confidence intervals can also be calculated using the dev kit to demonstrate robustness.
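The dev kit is the authoritative scorer. Purely for illustration, the sketch below shows the underlying computation: PER is the total phone-level edit distance divided by the total number of reference phones, and a percentile bootstrap resamples utterances to obtain a confidence interval. It assumes references and hypotheses are space-separated phone strings, paired by utterance.

```python
# Illustrative only: use the dev kit for official scores.
import random

def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (lists of symbols)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution (or match)
        prev = curr
    return prev[-1]

def per(refs, hyps):
    """Phone error rate over paired lists of space-separated phone strings."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)

def bootstrap_ci(refs, hyps, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for PER, resampling utterances."""
    rng = random.Random(seed)
    pairs = list(zip(refs, hyps))
    stats = sorted(per(*zip(*rng.choices(pairs, k=len(pairs))))
                   for _ in range(n_resamples))
    return (stats[int(alpha / 2 * n_resamples)],
            stats[int((1 - alpha / 2) * n_resamples) - 1])
```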
Data and Licensing
Please see Ong et al. 2024 [7] for a more detailed description and breakdown of the corpus.
The Faetar ASR Benchmark Corpus data is available without cost under a restrictive license that prohibits re-distribution, among other things. Please see the Registration section below to request access.
The following table shows the distribution of data in the corpus, which consists of a train set, a test set, a small dev set, and an unlabelled set.
Split | Usage in the challenge | Duration (h:mm:ss) |
---|---|---|
train | Training set (all tracks); Constrained ASR track: no data beyond this set can be used for training | 4:30:17 |
dev | Validation; held-out evaluation before submission to the challenge | 0:11:49 |
unlab | Additional resource in the Unlabelled data track | 19:55:21 |
test | Final evaluation: transcriptions are unavailable to participants | 0:46:54 |
Alternate splits
Since the test set is unavailable to challenge participants during the challenge phase, we recommend that participants not rely entirely on dev for held-out evaluation. In order to increase the amount of data available for held-out evaluation, we have created an alternate split of the train set, comprising the subsets 1h and reduced_train (train minus 1h), so that 1h can be used as a held-out evaluation set. We also provide baseline results for these alternative subsets within train.
The following table shows the distribution of the alternative splits:
Split | Suggested usage | Duration (h:mm:ss) |
---|---|---|
1h | Hold out to use as additional validation/development data; or use as an alternate train set to evaluate lower-data circumstances | 0:58:34 |
reduced_train (i.e. train minus 1h) | Use as an alternate train set when evaluating on the above alternate split(s) | 3:40:32 |
Dirty data
The benchmark corpus was extracted and automatically aligned from long-form audio and (incomplete) transcriptions that were scraped from word processor files, which often contained other, irrelevant material, and then aligned to the utterance level (see Ong et al. 2024 [7] for more details). Participants in the Dirty data track will seek to improve this process (scraping, segmenting, aligning), with the goal of improving the quality of the train set. The ultimate goal remains the same: improving PER on the test set.
The dirty data collection consists of: the original source files (audio and transcriptions) for a subset of train that does not overlap with test or dev (in the standard benchmark corpus, some of the full audio files were split by speaker between train and dev/test); some of the mapping files that we used during the first stages of extraction and filtering of the data; and a summary of the process that was used to extract the data. Please request the dirty data set on the registration form if you intend to use it.
For participants using the dirty data, we have also created a reference subset dirty_data_train of train which only contains utterances taken from the dirty data files. Baseline results will be provided for this subset for reference. The improved training sets that participants will create from the dirty data are expected to be different from dirty_data_train.
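As an illustration of the scraping step alone, the sketch below pulls candidate transcription lines out of word processor files. It is a hypothetical heuristic, not the pipeline actually used: it assumes the sources are .docx files and that transcription paragraphs start with a label followed by a colon or tab and contain IPA characters.

```python
# Hypothetical scraping heuristic (not the organizers' pipeline): extract
# paragraphs that look like utterance-level IPA transcriptions from a .docx
# file, skipping headings, notes, and other "clutter".
import re
from docx import Document  # pip install python-docx

IPA_CHARS = re.compile(r"[ɐ-ʯ]")        # rough check for IPA Extensions characters
LABELLED = re.compile(r"^\S+\s*[:\t]")  # "label:" or "label<TAB>" at line start

def scrape_transcriptions(path):
    """Yield (label, transcription) pairs for paragraphs that look like transcriptions."""
    for para in Document(path).paragraphs:
        text = para.text.strip()
        if text and LABELLED.match(text) and IPA_CHARS.search(text):
            label, content = re.split(r"[:\t]", text, maxsplit=1)
            yield label.strip(), content.strip()
```

Segmenting and aligning the corresponding long-form audio (e.g. VAD followed by forced alignment) would then map these lines to utterance-level clips; improving any of these stages is in scope for the track.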
Timeline
As of January 2025, the challenge phase has been extended. Please check back in spring of 2025 for further information.
How to Participate
Dev kit
The dev kit, which allows you to evaluate your system and replicate the baselines (it does not include the data), is available at https://github.com/perceptimatic/faetar-dev-kit.
Registering/requesting data access
Requests for access to the data will be responded to within 24 hrs. You will receive a download link if your request is approved.
How to submit
We are happy to announce that submissions are now open for the Faetar ASR Challenge. Participants are responsible for submitting decodings for the test set and a model description. If possible, we ask participants to also upload a more detailed model description in the form of a paper draft in order to facilitate our writing a summary paper.
Each research group may submit up to four models for evaluation on the test set. Submissions must be received by Saturday, February 1st 2025 (AoE) in order to receive results on time.
- Test decodings: Each research team should already have received a personalized OneDrive link which should be used to upload submissions (test decodings). There should be one submission file per model. See below for the submission format. It is the responsibility of participants to run the evaluation (using the dev kit) on subsets other than test.
- Model description: Each research team should also have received a link to fill in a model description form. The short “model description” field will be added to the leaderboard section of the website. A separate form should be submitted for each submission file uploaded to the OneDrive.
- Draft paper (suggested): In order to support our timely submission of a summary paper to Interspeech, we invite participants who are preparing papers to share more detailed system information in the form of a draft paper as soon as this is feasible. This draft will not be shared with anyone other than the organizers and will only be used for the purposes of accurately characterizing your system in the summary paper. You may share your draft by including a link to a preprint in the model description or by uploading a PDF to the OneDrive as soon as a draft is ready.
Decodings for all files in the test set, for a single model, should be stored in a single plain-text file, with one utterance per line, and with the utterance name (file name prefix) in parentheses following the decoding:
i kidʒə teɪnə lə fotd əkra l (heF003_00000916_00001116_he011)
faitan fu d ra fi (heF003_00001353_00001466_he011)
...
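For illustration, a small helper that writes decodings in this format, given a dictionary mapping utterance names to space-separated phone strings (the dictionary contents and output file name below are only examples):

```python
# Write one model's decodings in the required submission format:
# "<phone string> (<utterance name>)" on each line.
def write_submission(hyps, path):
    with open(path, "w", encoding="utf-8") as f:
        for utt_id, phones in sorted(hyps.items()):
            f.write(f"{phones} ({utt_id})\n")

# Example (values taken from the format illustration above):
write_submission(
    {"heF003_00000916_00001116_he011": "i kidʒə teɪnə lə fotd əkra l"},
    "test_decodings_sys1.txt",
)
```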
Leaderboard
At the outset of the challenge, this will contain only the baseline model results. Please see Ong et al. 2024 for details about the baseline models.
Research group | Description | Training set | Constrained | External pre-trained AM/LM | Uses unlab | Dirty data challenge | PER on test | PER on dev | PER on 1h |
---|---|---|---|---|---|---|---|---|---|
Team A | Conformer-based CTC (sys1) | train | x | | | | 50.8 | NA | NA |
Team A | Conformer-based CTC (sys2) | train | x | | | | 51.2 | NA | NA |
Team A | Conformer-based CTC (sys3) | train | x | | | | 51.6 | NA | NA |
Team A | Conformer-based CTC (sysA) | train | x | | | | 48.0 | NA | NA |
Team A | Conformer-based CTC (sysB) | train | x | | | | 48.8 | NA | NA |
Team A | Conformer-based CTC (sysC) | train | x | | | | 49.4 | NA | NA |
Organizers (baseline) | Kaldi HMM-GMM Mono + 5-gram Kneser-Ney [6,7] | train | x | | | | 62.6 | 65.9 | NA |
Organizers (baseline) | Kaldi HMM-GMM Tri + 5-gram Kneser-Ney [6,7] | train | x | | | | 56.7 | 58.2 | NA |
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | train | | | | | 35.8 | 43.0 | NA |
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | 1h | | | | | 37.4 | 44.4 | NA |
Organizers (baseline) | ESPnet-MMS ML-SUPERB [2,3] | 10m | | | | | 45.1 | 50.2 | NA |
Organizers (baseline) | MMS [4] | train | | x | | | 33.0 | 39.9 | NA |
Organizers (baseline) | mHubert-147 [5] | train | | x | | | 33.6 | 41.1 | NA |
Organizers (baseline) | MMS [4] self-training | train | | x | x | | 31.0 | 36.6 | NA |
Organizers (baseline) | MMS [4] pre-training + self-training | train | | x | x | | 30.4 | 36.4 | NA |
Organizers (baseline) | MMS [4] continued pre-training | train | | x | x | | 31.5 | 38.7 | NA |
Organizers (baseline) | MMS [4] | reduced_train | | x | | | 33.8 | 37.9 | 34.5 |
Organizers (baseline) | MMS [4] self-training | reduced_train | | x | x | | 33.4 | 37.2 | 34.1 |
Organizers (baseline) | mHubert-147 [5] | reduced_train | | x | | | 35.5 | 42.8 | 36.3 |
Organizers (baseline) | mHubert-147 [5] self-training | reduced_train | | x | x | | 35.1 | 42.3 | 35.6 |
Contact
Questions should be directed to faetarasrchallenge at gmail dot com.
Organizers
- Ewan Dunbar, University of Toronto
- Michael Ong, University of Toronto
- Leo Peckham, University of Toronto
- Naomi Nagy, University of Toronto
References
[1] N. Nagy, “A multilingual corpus to explore variation in language contact situations,” RILA, pp. 65–84, 2011.
[2] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018, pp. 2207–2211.
[3] J. Shi, W. Chen, D. Berrebbi, H.-H. Wang, W.-P. Huang, E.-P. Hu, H.-L. Chuang, X. Chang, Y. Tang, S.-W. Li, A. Mohamed, H.-Y. Lee, and S. Watanabe, “Findings of the 2023 ML-SUPERB challenge: Pretraining and evaluation over more languages and beyond,” in ASRU, 2023, pp. 1–8.
[4] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,” 2023.
[5] M. Z. Boito, V. Iyer, N. Lagos, L. Besacier, and I. Calapodescu, “mHuBERT-147: A compact multilingual HuBERT model,” arXiv preprint arXiv:2406.06371, 2024.
[6] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in ASRU. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society, 2011.
[7] M. Ong, S. Robertson, L. Peckham, A. J. J. de Aberasturi, P. Arkhangorodsky, R. Huo, A. Sakhardande, M. Hallap, N. Nagy, and E. Dunbar, “The faetar benchmark: Speech recognition in a very under-resourced language,” arXiv preprint arXiv:2409.08103, 2024.