FAVE Automated Speech Recognition

The FAVE-asr package provides a system for the automated transcription of sociolinguistic interview data on local machines for use by aligners like FAVE or the Montreal Forced Aligner. The package provides functions to label different speakers in the same audio (diarization), transcribe speech, and output TextGrids with phrase- or word-level alignments.

Example Use Cases

You want a transcription of an interview for more detailed hand correction.
You want to transcribe a large corpus and your analysis can tolerate a small error rate.
You want to make an audio corpus into a text corpus.
You want to know the number of speakers in an audio file.

For examples of how to use the package, see the Usage pages.

Installation

To install fave-asr using pip, run the following command in your terminal:

pip install fave-asr
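
To confirm that the install succeeded, you can ask Python which version of the distribution it sees. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this illustration and is not part of fave-asr.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "fave-asr" is the distribution name used in the pip command above.
found = installed_version("fave-asr")
if found is None:
    print("fave-asr is not installed in this environment")
else:
    print(f"fave-asr {found} is installed")
```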

Not another transcription service

There are several services which automate the process of transcribing audio.

Unlike other services, fave-asr does not require uploading your data to other servers and instead focuses on processing audio on your own computer. Audio data can contain highly confidential information, and uploading this data to other services may not comply with ethical or legal data protection obligations. The goal of fave-asr is to serve those use cases where data protection makes local transcription necessary while making the process as seamless as cloud-based transcription services.

Example

As an example, we’ll transcribe an audio interview of Snoop Dogg by the 85 South Media podcast and output it as a TextGrid.

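import os
import fave_asr

# Transcribe the audio and label the speakers (diarization).
# The access token is read from the HF_TOKEN environment variable,
# which must be set before running this script.
data = fave_asr.transcribe_and_diarize(
    audio_file = 'usage/resources/SnoopDogg_85SouthMedia.wav',
    hf_token = os.environ["HF_TOKEN"],
    model_name = 'small.en',
    device = 'cpu'
    )
# Convert the result to a TextGrid and write it to disk.
tg = fave_asr.to_TextGrid(data)
tg.write('SnoopDogg_85SouthMedia.TextGrid')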

This writes the following TextGrid:

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0.0
xmax = 51.44
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "SPEAKER_00"
        xmin = 0.0
        xmax = 51.44
        intervals: size = 12
            intervals [1]:
                xmin = 0.0
                xmax = 6.24
                text = " So you know the pimpin fuck y'all I'm gonna go over to dev jam"
            intervals [2]:
                xmin = 6.24
                xmax = 8.78
                text = " And learn a little bit of corporate work cuz I don't know corporate shit"
            intervals [3]:
                xmin = 8.78
                xmax = 11.46
                text = " I only need a few months right give me a few months around the shit"
            intervals [4]:
                xmin = 11.46
                xmax = 18.94
                text = " I'm a fast learner go to dev jam get a job in a position drop a record get Benny the butcher song get hip-hop Harry's on"
            intervals [5]:
                xmin = 18.94
                xmax = 23.94
                text = " Learn a few tricks of the trade find out that the niggas that had it that wanted me to hold for them"
            intervals [6]:
                xmin = 23.94
                xmax = 26.12
                text = " Then sold it to some other people"
            intervals [7]:
                xmin = 26.12
                xmax = 29.9
                text = " So now one of my big wig buddies called me and say hey dog"
            intervals [8]:
                xmin = 29.9
                xmax = 34.38
                text = " I know the people that got there from and they don't know what to do with it. Mmm"
            intervals [9]:
                xmin = 34.38
                xmax = 40.72
                text = " Let me hide them. I know just what to do with it. So I hit them like let me um, we work for y'all"
            intervals [10]:
                xmin = 40.72
                xmax = 45.96
                text = " The play was cool, but it's like yeah fuck that how much how much to buy this shit?"
            intervals [11]:
                xmin = 45.96
                xmax = 50.22
                text = " How much to buy death row first how much for my masters?"
            intervals [12]:
                xmin = 50.22
                xmax = 51.44
                text = " How much for all of the masters?"
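
Two of the use cases listed earlier, turning an audio corpus into a text corpus and counting speakers, can be served by reading the TextGrid back in. The sketch below is a small regular-expression reader for the long TextGrid text format shown above, not part of fave-asr. It assumes each tier is named after one speaker label, as in the example output, and the function name parse_textgrid is hypothetical. The sample is an excerpt with two of the twelve intervals.

```python
import re

# Excerpt of the TextGrid shown above (intervals 6 and 7 only).
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0.0
xmax = 51.44
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "SPEAKER_00"
        xmin = 0.0
        xmax = 51.44
        intervals: size = 2
            intervals [1]:
                xmin = 23.94
                xmax = 26.12
                text = " Then sold it to some other people"
            intervals [2]:
                xmin = 26.12
                xmax = 29.9
                text = " So now one of my big wig buddies called me and say hey dog"
'''

TIER_SPLIT_RE = re.compile(r'item \[\d+\]:')
TIER_NAME_RE = re.compile(r'name = "(?P<name>[^"]*)"')
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[\d.]+)\s*'
    r'xmax = (?P<xmax>[\d.]+)\s*'
    r'text = "(?P<text>(?:[^"]|"")*)"'  # a doubled "" is an escaped quote
)


def parse_textgrid(content):
    """Return a list of (tier_name, [(xmin, xmax, text), ...]) tuples."""
    tiers = []
    # The first chunk is the file header, so skip it.
    for chunk in TIER_SPLIT_RE.split(content)[1:]:
        name_match = TIER_NAME_RE.search(chunk)
        if name_match is None:
            continue
        intervals = [
            (float(m['xmin']), float(m['xmax']),
             m['text'].replace('""', '"').strip())
            for m in INTERVAL_RE.finditer(chunk)
        ]
        tiers.append((name_match['name'], intervals))
    return tiers


tiers = parse_textgrid(SAMPLE)

# Assumption: one tier name per speaker label, as in the example output.
speakers = sorted({name for name, _ in tiers})
print(f"{len(speakers)} speaker(s): {', '.join(speakers)}")

# Tab-separated text corpus: speaker, start, end, transcript.
for name, intervals in tiers:
    for xmin, xmax, text in intervals:
        print(f"{name}\t{xmin:.2f}\t{xmax:.2f}\t{text}")
```

To run this on a real file, read the file's contents into a string and pass it to parse_textgrid in place of SAMPLE. For anything beyond quick inspection, a dedicated TextGrid library is likely a more robust choice than regular expressions.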

For more

To get started, check out the quickstart.
To learn how to set up and use the gated models, check out the gated model documentation.

You can also directly read up on the function and class references.

Authors

The speaker diarization pipeline is based on an article and code by Luís Roque released under the CC BY 4.0 license. Christian Brickhouse modified that work to use the whisper-timestamped model and to function as a library. For licensing of the test audio, see the README in that directory.

Recommended citation

Brickhouse, Christian (2024). FAVE-ASR: Offline transcription of interview data (Version 0.1.0) [computer software]. https://forced-alignment-and-vowel-extraction.github.io/fave-asr/

Reuse

GPLv3