import fave_asr
transcription = fave_asr.transcribe(
    audio_file = 'resources/SnoopDogg_85SouthMedia.wav',
    model_name = 'small.en',
    device = 'cpu'
)
Usage examples
Pipeline walkthrough
The fave-asr pipeline automates a few different steps that can be broken down depending on your needs. For example, if you just need a transcript but don’t care about who said the words, you can just do the transcribe step and none of the others.
Raw transcription
The output in transcription is a dictionary with the keys segments and language_code. segments is a List of Dicts, with each Dict having data on the speech in that segment.
transcription['segments'][0].keys()
dict_keys(['id', 'seek', 'start', 'end', 'text', 'tokens', 'temperature', 'avg_logprob', 'compression_ratio', 'no_speech_prob', 'confidence', 'words'])
If you wanted a text transcript of the entire file, you can iterate through segments and get the text field for each one.
text_list = []
for segment in transcription['segments']:
    # Collect the text of each segment in order
    text_list.append(segment['text'])
print("\n".join(text_list))
So you know the pimpin fuck y'all I'm gonna go over to dev jam
And learn a little bit of corporate work cuz I don't know corporate shit
I only need a few months right give me a few months around the shit
I'm a fast learner go to dev jam get a job in a position drop a record get Benny the butcher song get hip-hop Harry's on
Learn a few tricks of the trade find out that the niggas that had it that wanted me to hold for them
Then sold it to some other people
So now one of my big wig buddies called me and say hey dog
I know the people that got there from and they don't know what to do with it. Mmm
Let me hide them. I know just what to do with it. So I hit them like let me um, we work for y'all
The play was cool, but it's like yeah fuck that how much how much to buy this shit?
How much to buy death row first how much for my masters?
How much for all of the masters?
Each segment also has word-level data available in the words field including start and end times for each word.
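As a rough sketch of how the word-level data might be used, the snippet below walks the words list of one segment and reports each word with its start time and duration. The sample dictionary is a hypothetical stand-in for a single entry of transcription['segments'] with invented timings, and the key names text, start, and end inside each word are assumptions; check transcription['segments'][0]['words'][0].keys() to confirm the names your version produces.

```python
# Hypothetical stand-in for one entry of transcription['segments'].
# The word-level keys ('text', 'start', 'end') are assumed, not confirmed,
# and the timings are invented for illustration.
segment = {
    "text": " So you know",
    "words": [
        {"text": "So", "start": 0.36, "end": 0.52},
        {"text": "you", "start": 0.52, "end": 0.61},
        {"text": "know", "start": 0.61, "end": 0.70},
    ],
}

def word_durations(segment):
    """Return (word, start, duration) tuples for every word in a segment."""
    return [
        (w["text"], w["start"], round(w["end"] - w["start"], 3))
        for w in segment["words"]
    ]

for word, start, duration in word_durations(segment):
    print(f"{start:6.2f}s  {word}  ({duration:.3f}s long)")
```

With a real transcription, the same function could be applied to each segment in transcription['segments'], provided the key names match.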
Diarization
Diarization requires a HuggingFace Access Token and agreement to the terms of some gated models. See the documentation page on setting and using access tokens.
Some audio files have more than one speaker, and a raw transcript may not be useful if we don’t know who said what. The process of assigning speech to a speaker in an audio file is called diarization. fave-asr uses machine learning models that are gated, meaning that their creators might require you to agree to particular terms before using them. You can learn more and agree to the terms at the page for the diarization model.
import os
diarization = fave_asr.diarize(
    audio_file = 'resources/SnoopDogg_85SouthMedia.wav',
    hf_token = os.environ["HF_TOKEN"]
)
print(diarization)
segment label speaker start end
0 [ 00:00:00.365 --> 00:00:00.704] A SPEAKER_00 0.365025 0.704584
1 [ 00:00:01.044 --> 00:00:01.825] B SPEAKER_00 1.044143 1.825127
2 [ 00:00:02.741 --> 00:00:03.251] C SPEAKER_00 2.741935 3.251273
3 [ 00:00:03.947 --> 00:00:05.305] D SPEAKER_00 3.947368 5.305603
4 [ 00:00:06.273 --> 00:00:08.684] E SPEAKER_00 6.273345 8.684211
5 [ 00:00:08.752 --> 00:00:09.057] F SPEAKER_01 8.752122 9.057725
6 [ 00:00:09.057 --> 00:00:12.436] G SPEAKER_00 9.057725 12.436333
7 [ 00:00:09.074 --> 00:00:09.091] H SPEAKER_01 9.074703 9.091681
8 [ 00:00:13.064 --> 00:00:14.915] I SPEAKER_00 13.064516 14.915110
9 [ 00:00:15.271 --> 00:00:18.276] J SPEAKER_00 15.271647 18.276740
10 [ 00:00:18.955 --> 00:00:21.027] K SPEAKER_00 18.955857 21.027165
11 [ 00:00:21.451 --> 00:00:23.539] L SPEAKER_00 21.451613 23.539898
12 [ 00:00:23.998 --> 00:00:25.305] M SPEAKER_00 23.998302 25.305603
13 [ 00:00:25.611 --> 00:00:25.848] N SPEAKER_02 25.611205 25.848896
14 [ 00:00:26.188 --> 00:00:26.833] O SPEAKER_00 26.188455 26.833616
15 [ 00:00:26.460 --> 00:00:26.731] P SPEAKER_02 26.460102 26.731749
16 [ 00:00:27.071 --> 00:00:29.465] Q SPEAKER_00 27.071307 29.465195
17 [ 00:00:29.923 --> 00:00:31.400] R SPEAKER_00 29.923599 31.400679
18 [ 00:00:32.164 --> 00:00:33.098] S SPEAKER_00 32.164686 33.098472
19 [ 00:00:33.488 --> 00:00:33.845] T SPEAKER_03 33.488964 33.845501
20 [ 00:00:34.219 --> 00:00:35.288] U SPEAKER_03 34.219015 35.288625
21 [ 00:00:34.286 --> 00:00:37.224] V SPEAKER_00 34.286927 37.224109
22 [ 00:00:37.427 --> 00:00:38.276] W SPEAKER_00 37.427844 38.276740
23 [ 00:00:38.853 --> 00:00:39.753] X SPEAKER_00 38.853990 39.753820
24 [ 00:00:40.840 --> 00:00:42.707] Y SPEAKER_00 40.840407 42.707980
25 [ 00:00:42.843 --> 00:00:45.152] Z SPEAKER_00 42.843803 45.152801
26 [ 00:00:46.052 --> 00:00:47.461] AA SPEAKER_00 46.052632 47.461800
27 [ 00:00:48.225 --> 00:00:49.329] AB SPEAKER_00 48.225806 49.329372
28 [ 00:00:50.263 --> 00:00:51.553] AC SPEAKER_00 50.263158 51.553480
The diarization output is a Pandas DataFrame with one row per segment. The most important columns are speaker, start, and end, which give the speaker label for the segment, its start time, and its end time (both in seconds).
For example, you can get the set of unique speaker labels using Python’s set function.
speakers = set(diarization['speaker'])
And you can use the len function to get the number of speakers.
len(speakers)
4
You can also filter the diarization output to keep only the segments from a particular speaker using Pandas’ DataFrame.loc method.
snoop_dogg = diarization.loc[diarization['speaker'] == 'SPEAKER_00']
print(snoop_dogg)
segment label speaker start end
0 [ 00:00:00.365 --> 00:00:00.704] A SPEAKER_00 0.365025 0.704584
1 [ 00:00:01.044 --> 00:00:01.825] B SPEAKER_00 1.044143 1.825127
2 [ 00:00:02.741 --> 00:00:03.251] C SPEAKER_00 2.741935 3.251273
3 [ 00:00:03.947 --> 00:00:05.305] D SPEAKER_00 3.947368 5.305603
4 [ 00:00:06.273 --> 00:00:08.684] E SPEAKER_00 6.273345 8.684211
6 [ 00:00:09.057 --> 00:00:12.436] G SPEAKER_00 9.057725 12.436333
8 [ 00:00:13.064 --> 00:00:14.915] I SPEAKER_00 13.064516 14.915110
9 [ 00:00:15.271 --> 00:00:18.276] J SPEAKER_00 15.271647 18.276740
10 [ 00:00:18.955 --> 00:00:21.027] K SPEAKER_00 18.955857 21.027165
11 [ 00:00:21.451 --> 00:00:23.539] L SPEAKER_00 21.451613 23.539898
12 [ 00:00:23.998 --> 00:00:25.305] M SPEAKER_00 23.998302 25.305603
14 [ 00:00:26.188 --> 00:00:26.833] O SPEAKER_00 26.188455 26.833616
16 [ 00:00:27.071 --> 00:00:29.465] Q SPEAKER_00 27.071307 29.465195
17 [ 00:00:29.923 --> 00:00:31.400] R SPEAKER_00 29.923599 31.400679
18 [ 00:00:32.164 --> 00:00:33.098] S SPEAKER_00 32.164686 33.098472
21 [ 00:00:34.286 --> 00:00:37.224] V SPEAKER_00 34.286927 37.224109
22 [ 00:00:37.427 --> 00:00:38.276] W SPEAKER_00 37.427844 38.276740
23 [ 00:00:38.853 --> 00:00:39.753] X SPEAKER_00 38.853990 39.753820
24 [ 00:00:40.840 --> 00:00:42.707] Y SPEAKER_00 40.840407 42.707980
25 [ 00:00:42.843 --> 00:00:45.152] Z SPEAKER_00 42.843803 45.152801
26 [ 00:00:46.052 --> 00:00:47.461] AA SPEAKER_00 46.052632 47.461800
27 [ 00:00:48.225 --> 00:00:49.329] AB SPEAKER_00 48.225806 49.329372
28 [ 00:00:50.263 --> 00:00:51.553] AC SPEAKER_00 50.263158 51.553480
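Because start and end are numeric columns, the same DataFrame can also be summarized per speaker. The following is a minimal sketch, not part of fave-asr itself: it rebuilds a few rows from the diarization output above as a small DataFrame and sums segment durations by speaker label. The names diarization_sample and speaking_time are hypothetical and introduced only for illustration.

```python
import pandas as pd

# A few rows copied from the diarization output shown earlier,
# keeping only the columns needed here.
diarization_sample = pd.DataFrame({
    "speaker": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_00", "SPEAKER_02"],
    "start": [0.365025, 8.752122, 9.057725, 25.611205],
    "end": [0.704584, 9.057725, 12.436333, 25.848896],
})

def speaking_time(df):
    """Total seconds of speech per speaker label."""
    durations = df["end"] - df["start"]
    # Group the per-segment durations by the speaker column and add them up
    return durations.groupby(df["speaker"]).sum()

totals = speaking_time(diarization_sample)
print(totals)
```

Passing the full diarization DataFrame to speaking_time should work the same way. Note that overlapping segments are counted once for each speaker involved, so the totals can add up to more than the length of the recording.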
Diarized transcription
The last stage of the pipeline is combining the diarization and the transcription by assigning speakers to segments.
diarized_transcript = fave_asr.assign_speakers(diarization,transcription)
The structure of diarized_transcript is very similar to the structure of transcription but the segments and words now have a speaker field.
diarized_transcript['segments'][0]['speaker']
'SPEAKER_00'
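One way to use the speaker field is to turn the segments into a readable, speaker-labelled script. The sketch below merges consecutive segments from the same speaker into a single turn. It runs on a small hypothetical dictionary (diarized_sample, with placeholder text) that mimics only the segments, speaker, and text fields described above; the helper speaker_turns is likewise an illustration and not a fave-asr function.

```python
from itertools import groupby

# Hypothetical stand-in for diarized_transcript, with only the fields used here.
diarized_sample = {
    "segments": [
        {"speaker": "SPEAKER_00", "text": " First segment."},
        {"speaker": "SPEAKER_00", "text": " Second segment."},
        {"speaker": "SPEAKER_01", "text": " A reply."},
    ]
}

def speaker_turns(transcript):
    """Merge consecutive same-speaker segments into (speaker, text) turns."""
    turns = []
    # groupby only merges adjacent items, which preserves the order of turns
    for speaker, group in groupby(transcript["segments"], key=lambda s: s["speaker"]):
        text = " ".join(seg["text"].strip() for seg in group)
        turns.append((speaker, text))
    return turns

for speaker, text in speaker_turns(diarized_sample):
    print(f"{speaker}: {text}")
```

If the real diarized_transcript follows the structure described above, it could be passed to speaker_turns in place of the sample.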
Output
TextGrid
A diarized transcript can be converted to a TextGrid object and navigated using the textgrid library.
tg = fave_asr.to_TextGrid(diarized_transcript)
You can write the output to a file using the textgrid.write method by specifying a file name for the output TextGrid.
tg.write('SnoopDogg_Interview.TextGrid')