Usage examples

Pipeline walkthrough

The fave-asr pipeline automates several steps, and you can run them individually depending on your needs. For example, if you only need a transcript and don’t care who said which words, you can run the transcribe step on its own and skip the others.

Raw transcription

import fave_asr

transcription = fave_asr.transcribe(
    audio_file = 'resources/SnoopDogg_85SouthMedia.wav',
    model_name = 'small.en',
    device = 'cpu'
    )

The output in transcription is a dictionary with the keys segments and language_code. segments is a List of Dicts, with each Dict having data on the speech in that segment.

transcription['segments'][0].keys()

dict_keys(['id', 'seek', 'start', 'end', 'text', 'tokens', 'temperature', 'avg_logprob', 'compression_ratio', 'no_speech_prob', 'confidence', 'words'])

If you want a text transcript of the entire file, you can iterate through segments and get the text field for each one.

text_list = []
for segment in transcription['segments']:
    text_list.append(segment['text'])
print("\n".join(text_list))

 So you know the pimpin fuck y'all I'm gonna go over to dev jam
 And learn a little bit of corporate work cuz I don't know corporate shit
 I only need a few months right give me a few months around the shit
 I'm a fast learner go to dev jam get a job in a position drop a record get Benny the butcher song get hip-hop Harry's on
 Learn a few tricks of the trade find out that the niggas that had it that wanted me to hold for them
 Then sold it to some other people
 So now one of my big wig buddies called me and say hey dog
 I know the people that got there from and they don't know what to do with it. Mmm
 Let me hide them. I know just what to do with it. So I hit them like let me um, we work for y'all
 The play was cool, but it's like yeah fuck that how much how much to buy this shit?
 How much to buy death row first how much for my masters?
 How much for all of the masters?

Each segment also has word-level data available in the words field including start and end times for each word.
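
As a minimal sketch of reading that word-level data, the snippet below walks a hand-built stand-in for one segment. The start and end keys follow the description above; the per-word text key and the timing values are assumptions made for illustration, so check `segment['words'][0].keys()` on your own output before relying on them.

```python
# Hand-built stand-in for one entry of transcription['segments'].
# Assumption: each word dict stores its string under 'text'.
# The times are illustrative values, not real output.
segment = {
    "text": " So you know",
    "words": [
        {"text": "So", "start": 0.36, "end": 0.52},
        {"text": "you", "start": 0.52, "end": 0.61},
        {"text": "know", "start": 0.61, "end": 0.70},
    ],
}

def word_durations(segment):
    """Return (word, start, end, duration) tuples for one segment."""
    rows = []
    for word in segment["words"]:
        duration = round(word["end"] - word["start"], 3)
        rows.append((word["text"], word["start"], word["end"], duration))
    return rows

# Print one line per word: start, end, duration, then the word itself.
for word, start, end, duration in word_durations(segment):
    print(f"{start:6.2f} {end:6.2f} {duration:5.2f}  {word}")
```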

Diarization

Gated model access

Diarization requires a HuggingFace access token, and you must agree to the terms of some gated models. See the documentation page on setting and using access tokens.

Some audio files have more than one speaker, and a raw transcript may not be useful if we don’t know who said what. The process of assigning speech to a speaker in an audio file is called diarization. fave-asr uses machine learning models which are gated, meaning that their creators might require you to agree to particular terms before using them. You can learn more and agree to the terms at the page for the diarization model.
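
Because the call below reads the token from the HF_TOKEN environment variable, it can help to fail early with a readable message when the variable is missing. The helper here is a hypothetical convenience function, not part of fave_asr.

```python
import os

def get_hf_token(var_name="HF_TOKEN"):
    """Return the token stored in the environment, or raise a clear error.

    Hypothetical helper for illustration; fave_asr does not provide it.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your "
            "HuggingFace access token before diarizing."
        )
    return token
```

You could then pass `hf_token=get_hf_token()` instead of indexing `os.environ` directly.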

import os
diarization = fave_asr.diarize(
    audio_file = 'resources/SnoopDogg_85SouthMedia.wav',
    hf_token=os.environ["HF_TOKEN"]
    )
print(diarization)

                              segment label     speaker      start        end
0   [ 00:00:00.365 -->  00:00:00.704]     A  SPEAKER_00   0.365025   0.704584
1   [ 00:00:01.044 -->  00:00:01.825]     B  SPEAKER_00   1.044143   1.825127
2   [ 00:00:02.741 -->  00:00:03.251]     C  SPEAKER_00   2.741935   3.251273
3   [ 00:00:03.947 -->  00:00:05.305]     D  SPEAKER_00   3.947368   5.305603
4   [ 00:00:06.273 -->  00:00:08.684]     E  SPEAKER_00   6.273345   8.684211
5   [ 00:00:08.752 -->  00:00:09.057]     F  SPEAKER_01   8.752122   9.057725
6   [ 00:00:09.057 -->  00:00:12.436]     G  SPEAKER_00   9.057725  12.436333
7   [ 00:00:09.074 -->  00:00:09.091]     H  SPEAKER_01   9.074703   9.091681
8   [ 00:00:13.064 -->  00:00:14.915]     I  SPEAKER_00  13.064516  14.915110
9   [ 00:00:15.271 -->  00:00:18.276]     J  SPEAKER_00  15.271647  18.276740
10  [ 00:00:18.955 -->  00:00:21.027]     K  SPEAKER_00  18.955857  21.027165
11  [ 00:00:21.451 -->  00:00:23.539]     L  SPEAKER_00  21.451613  23.539898
12  [ 00:00:23.998 -->  00:00:25.305]     M  SPEAKER_00  23.998302  25.305603
13  [ 00:00:25.611 -->  00:00:25.848]     N  SPEAKER_02  25.611205  25.848896
14  [ 00:00:26.188 -->  00:00:26.833]     O  SPEAKER_00  26.188455  26.833616
15  [ 00:00:26.460 -->  00:00:26.731]     P  SPEAKER_02  26.460102  26.731749
16  [ 00:00:27.071 -->  00:00:29.465]     Q  SPEAKER_00  27.071307  29.465195
17  [ 00:00:29.923 -->  00:00:31.400]     R  SPEAKER_00  29.923599  31.400679
18  [ 00:00:32.164 -->  00:00:33.098]     S  SPEAKER_00  32.164686  33.098472
19  [ 00:00:33.488 -->  00:00:33.845]     T  SPEAKER_03  33.488964  33.845501
20  [ 00:00:34.219 -->  00:00:35.288]     U  SPEAKER_03  34.219015  35.288625
21  [ 00:00:34.286 -->  00:00:37.224]     V  SPEAKER_00  34.286927  37.224109
22  [ 00:00:37.427 -->  00:00:38.276]     W  SPEAKER_00  37.427844  38.276740
23  [ 00:00:38.853 -->  00:00:39.753]     X  SPEAKER_00  38.853990  39.753820
24  [ 00:00:40.840 -->  00:00:42.707]     Y  SPEAKER_00  40.840407  42.707980
25  [ 00:00:42.843 -->  00:00:45.152]     Z  SPEAKER_00  42.843803  45.152801
26  [ 00:00:46.052 -->  00:00:47.461]    AA  SPEAKER_00  46.052632  47.461800
27  [ 00:00:48.225 -->  00:00:49.329]    AB  SPEAKER_00  48.225806  49.329372
28  [ 00:00:50.263 -->  00:00:51.553]    AC  SPEAKER_00  50.263158  51.553480

The diarization output is a Pandas DataFrame with several columns. The most important are speaker, start, and end, which give the speaker label for each segment and the segment’s start and end times in seconds.

For example, you can get the unique speaker labels using Python’s built-in set.

speakers = set(diarization['speaker'])

And you can use the len function to get the number of speakers.

len(speakers)

You can also filter the diarization output to only the segments from a particular speaker using Pandas’ DataFrame.loc indexer.

snoop_dogg = diarization.loc[diarization['speaker'] == 'SPEAKER_00']
print(snoop_dogg)

                              segment label     speaker      start        end
0   [ 00:00:00.365 -->  00:00:00.704]     A  SPEAKER_00   0.365025   0.704584
1   [ 00:00:01.044 -->  00:00:01.825]     B  SPEAKER_00   1.044143   1.825127
2   [ 00:00:02.741 -->  00:00:03.251]     C  SPEAKER_00   2.741935   3.251273
3   [ 00:00:03.947 -->  00:00:05.305]     D  SPEAKER_00   3.947368   5.305603
4   [ 00:00:06.273 -->  00:00:08.684]     E  SPEAKER_00   6.273345   8.684211
6   [ 00:00:09.057 -->  00:00:12.436]     G  SPEAKER_00   9.057725  12.436333
8   [ 00:00:13.064 -->  00:00:14.915]     I  SPEAKER_00  13.064516  14.915110
9   [ 00:00:15.271 -->  00:00:18.276]     J  SPEAKER_00  15.271647  18.276740
10  [ 00:00:18.955 -->  00:00:21.027]     K  SPEAKER_00  18.955857  21.027165
11  [ 00:00:21.451 -->  00:00:23.539]     L  SPEAKER_00  21.451613  23.539898
12  [ 00:00:23.998 -->  00:00:25.305]     M  SPEAKER_00  23.998302  25.305603
14  [ 00:00:26.188 -->  00:00:26.833]     O  SPEAKER_00  26.188455  26.833616
16  [ 00:00:27.071 -->  00:00:29.465]     Q  SPEAKER_00  27.071307  29.465195
17  [ 00:00:29.923 -->  00:00:31.400]     R  SPEAKER_00  29.923599  31.400679
18  [ 00:00:32.164 -->  00:00:33.098]     S  SPEAKER_00  32.164686  33.098472
21  [ 00:00:34.286 -->  00:00:37.224]     V  SPEAKER_00  34.286927  37.224109
22  [ 00:00:37.427 -->  00:00:38.276]     W  SPEAKER_00  37.427844  38.276740
23  [ 00:00:38.853 -->  00:00:39.753]     X  SPEAKER_00  38.853990  39.753820
24  [ 00:00:40.840 -->  00:00:42.707]     Y  SPEAKER_00  40.840407  42.707980
25  [ 00:00:42.843 -->  00:00:45.152]     Z  SPEAKER_00  42.843803  45.152801
26  [ 00:00:46.052 -->  00:00:47.461]    AA  SPEAKER_00  46.052632  47.461800
27  [ 00:00:48.225 -->  00:00:49.329]    AB  SPEAKER_00  48.225806  49.329372
28  [ 00:00:50.263 -->  00:00:51.553]    AC  SPEAKER_00  50.263158  51.553480
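
The start and end columns also make it straightforward to total the speaking time for each speaker. The sketch below uses plain dictionaries copied from three rows of the output above so that it runs without the audio file or the models; with the real DataFrame, `diarization.to_dict("records")` produces dictionaries of this shape (along with the other columns).

```python
from collections import defaultdict

# Rows 4-6 of the diarization output shown above, as plain dicts.
rows = [
    {"speaker": "SPEAKER_00", "start": 6.273345, "end": 8.684211},
    {"speaker": "SPEAKER_01", "start": 8.752122, "end": 9.057725},
    {"speaker": "SPEAKER_00", "start": 9.057725, "end": 12.436333},
]

def speaking_time(rows):
    """Sum segment durations (end - start) in seconds for each speaker label."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["speaker"]] += row["end"] - row["start"]
    return dict(totals)

for speaker, seconds in sorted(speaking_time(rows).items()):
    print(f"{speaker}: {seconds:.2f} s")
```

Overlapping segments are counted in full for each speaker here. Working on the DataFrame directly, the same totals can be computed with `(diarization['end'] - diarization['start']).groupby(diarization['speaker']).sum()`.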

Diarized transcription

The last stage of the pipeline is combining the diarization and the transcription by assigning speakers to segments.

diarized_transcript = fave_asr.assign_speakers(diarization,transcription)
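
fave-asr handles this step for you. Purely to illustrate the idea, one common way to assign a speaker to a transcribed segment is to pick the speaker whose diarization turns overlap it the most in time. The sketch below shows that heuristic under that assumption; it is not taken from fave-asr’s implementation, and the function names are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def pick_speaker(seg_start, seg_end, turns):
    """Return the speaker whose turns overlap the segment the most, or None."""
    totals = {}
    for turn in turns:
        shared = overlap(seg_start, seg_end, turn["start"], turn["end"])
        if shared > 0:
            totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
    return max(totals, key=totals.get) if totals else None

# Three diarization turns copied from the output in the previous section.
turns = [
    {"speaker": "SPEAKER_00", "start": 6.273345, "end": 8.684211},
    {"speaker": "SPEAKER_01", "start": 8.752122, "end": 9.057725},
    {"speaker": "SPEAKER_00", "start": 9.057725, "end": 12.436333},
]

# A made-up segment spanning 8.5 s to 9.5 s touches all three turns.
print(pick_speaker(8.5, 9.5, turns))
```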

The structure of diarized_transcript is very similar to the structure of transcription but the segments and words now have a speaker field.

diarized_transcript['segments'][0]['speaker']

'SPEAKER_00'
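
With a speaker on every segment, you can print a speaker-labeled transcript by merging consecutive segments from the same speaker. The sketch below runs on a hand-built stand-in for `diarized_transcript['segments']` that uses only the speaker and text fields described above; the text values are placeholders, not real output.

```python
from itertools import groupby

# Hand-built stand-in for diarized_transcript['segments'].
segments = [
    {"speaker": "SPEAKER_00", "text": " first segment"},
    {"speaker": "SPEAKER_00", "text": " second segment"},
    {"speaker": "SPEAKER_01", "text": " a reply"},
    {"speaker": "SPEAKER_00", "text": " third segment"},
]

def speaker_turns(segments):
    """Merge consecutive segments by the same speaker into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(segments, key=lambda seg: seg["speaker"]):
        text = " ".join(seg["text"].strip() for seg in group)
        turns.append((speaker, text))
    return turns

for speaker, text in speaker_turns(segments):
    print(f"{speaker}: {text}")
```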

Output

TextGrid

A diarized transcript can be converted to a TextGrid object and then navigated using the textgrid library.

tg = fave_asr.to_TextGrid(diarized_transcript)

You can write the output to a file by calling the TextGrid object’s write method with a file name for the output TextGrid.

tg.write('SnoopDogg_Interview.TextGrid')

Reuse

GPLv3