import fave_asr

transcription = fave_asr.transcribe(
    audio_file = 'resources/SnoopDogg_85SouthMedia.wav',
    model_name = 'small.en',
    device = 'cpu'
)
Usage examples
Pipeline walkthrough
The fave-asr pipeline automates several steps, each of which can also be run on its own depending on your needs. For example, if you just need a transcript but don’t care about who said the words, you can run only the transcribe step and skip the others.
Raw transcription
The output in transcription is a dictionary with the keys segments and language_code. segments is a list of dicts, with each dict holding data on the speech in that segment.
transcription['segments'][0].keys()
dict_keys(['id', 'seek', 'start', 'end', 'text', 'tokens', 'temperature', 'avg_logprob', 'compression_ratio', 'no_speech_prob', 'confidence', 'words'])
If you want a text transcript of the entire file, you can iterate through segments and get the text field for each one.
text_list = []
for segment in transcription['segments']:
    text_list.append(segment['text'])
print("\n".join(text_list))
So you know the pimpin fuck y'all I'm gonna go over to dev jam
And learn a little bit of corporate work cuz I don't know corporate shit
I only need a few months right give me a few months around the shit
I'm a fast learner go to dev jam get a job in a position drop a record get Benny the butcher song get hip-hop Harry's on
Learn a few tricks of the trade find out that the niggas that had it that wanted me to hold for them
Then sold it to some other people
So now one of my big wig buddies called me and say hey dog
I know the people that got there from and they don't know what to do with it. Mmm
Let me hide them. I know just what to do with it. So I hit them like let me um, we work for y'all
The play was cool, but it's like yeah fuck that how much how much to buy this shit?
How much to buy death row first how much for my masters?
How much for all of the masters?
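Since each segment also carries start and end keys (see the key listing above), the same loop can prefix every line with its time span. The following is a minimal sketch under stated assumptions: sample_transcription is invented stand-in data with the same shape as transcription, and format_segments is a hypothetical helper, not part of fave-asr.

```python
# Stand-in data shaped like the `transcription` dictionary described above.
# The values are invented for illustration; real output has more keys.
sample_transcription = {
    "segments": [
        {"start": 0.36, "end": 1.82, "text": " First segment"},
        {"start": 2.74, "end": 5.31, "text": " Second segment"},
    ],
    "language_code": "en",
}


def format_segments(transcription):
    """Return one '[start-end] text' line per segment (hypothetical helper)."""
    lines = []
    for segment in transcription["segments"]:
        # Strip surrounding whitespace so the lines align cleanly.
        text = segment["text"].strip()
        lines.append(f"[{segment['start']:.2f}-{segment['end']:.2f}] {text}")
    return lines


print("\n".join(format_segments(sample_transcription)))
```

Passing the real transcription dictionary instead of the sample should work the same way, as long as its segments expose the start, end, and text keys listed above.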
Each segment also has word-level data available in the words field, including start and end times for each word.
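As one illustration of what the word-level timing allows, the sketch below looks for silent gaps between consecutive words. It relies only on the words, start, and end fields mentioned above, and it assumes that segments and words are stored in chronological order. The sample data and the find_pauses helper are hypothetical and not part of fave-asr.

```python
# Invented stand-in data: only the fields used below are included.
sample_transcription = {
    "segments": [
        {"words": [{"start": 0.0, "end": 0.5}, {"start": 0.5, "end": 1.0}]},
        {"words": [{"start": 2.0, "end": 2.5}]},
    ]
}


def find_pauses(transcription, min_pause=0.5):
    """Return (pause_start, pause_end) pairs for gaps of at least min_pause seconds."""
    # Flatten the per-segment word lists into a single list
    # (assumed to already be in chronological order).
    words = [w for seg in transcription["segments"] for w in seg["words"]]
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_pause:
            pauses.append((prev["end"], nxt["start"]))
    return pauses


print(find_pauses(sample_transcription))  # [(1.0, 2.0)]
```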
Diarization
Diarization requires a HuggingFace Access Token and requires that you agree to the terms of some gated models. See the documentation page on setting and using access tokens.
Some audio files have more than one speaker, and a raw transcript may not be useful if we don’t know who said what. The process of assigning speech to a speaker in an audio file is called diarization. fave-asr uses machine learning models which are gated, meaning that the creators might require you to agree to particular terms before using them. You can learn more and agree to the terms at the page for the diarization model.
import os

diarization = fave_asr.diarize(
    audio_file = 'resources/SnoopDogg_85SouthMedia.wav',
    hf_token = os.environ["HF_TOKEN"]
)
print(diarization)
segment label speaker start end
0 [ 00:00:00.365 --> 00:00:00.704] A SPEAKER_00 0.365025 0.704584
1 [ 00:00:01.044 --> 00:00:01.825] B SPEAKER_00 1.044143 1.825127
2 [ 00:00:02.741 --> 00:00:03.251] C SPEAKER_00 2.741935 3.251273
3 [ 00:00:03.947 --> 00:00:05.305] D SPEAKER_00 3.947368 5.305603
4 [ 00:00:06.273 --> 00:00:08.684] E SPEAKER_00 6.273345 8.684211
5 [ 00:00:08.752 --> 00:00:09.057] F SPEAKER_01 8.752122 9.057725
6 [ 00:00:09.057 --> 00:00:12.436] G SPEAKER_00 9.057725 12.436333
7 [ 00:00:09.074 --> 00:00:09.091] H SPEAKER_01 9.074703 9.091681
8 [ 00:00:13.064 --> 00:00:14.915] I SPEAKER_00 13.064516 14.915110
9 [ 00:00:15.271 --> 00:00:18.276] J SPEAKER_00 15.271647 18.276740
10 [ 00:00:18.955 --> 00:00:21.027] K SPEAKER_00 18.955857 21.027165
11 [ 00:00:21.451 --> 00:00:23.539] L SPEAKER_00 21.451613 23.539898
12 [ 00:00:23.998 --> 00:00:25.305] M SPEAKER_00 23.998302 25.305603
13 [ 00:00:25.611 --> 00:00:25.848] N SPEAKER_02 25.611205 25.848896
14 [ 00:00:26.188 --> 00:00:26.833] O SPEAKER_00 26.188455 26.833616
15 [ 00:00:26.460 --> 00:00:26.731] P SPEAKER_02 26.460102 26.731749
16 [ 00:00:27.071 --> 00:00:29.465] Q SPEAKER_00 27.071307 29.465195
17 [ 00:00:29.923 --> 00:00:31.400] R SPEAKER_00 29.923599 31.400679
18 [ 00:00:32.164 --> 00:00:33.098] S SPEAKER_00 32.164686 33.098472
19 [ 00:00:33.488 --> 00:00:33.845] T SPEAKER_03 33.488964 33.845501
20 [ 00:00:34.219 --> 00:00:35.288] U SPEAKER_03 34.219015 35.288625
21 [ 00:00:34.286 --> 00:00:37.224] V SPEAKER_00 34.286927 37.224109
22 [ 00:00:37.427 --> 00:00:38.276] W SPEAKER_00 37.427844 38.276740
23 [ 00:00:38.853 --> 00:00:39.753] X SPEAKER_00 38.853990 39.753820
24 [ 00:00:40.840 --> 00:00:42.707] Y SPEAKER_00 40.840407 42.707980
25 [ 00:00:42.843 --> 00:00:45.152] Z SPEAKER_00 42.843803 45.152801
26 [ 00:00:46.052 --> 00:00:47.461] AA SPEAKER_00 46.052632 47.461800
27 [ 00:00:48.225 --> 00:00:49.329] AB SPEAKER_00 48.225806 49.329372
28 [ 00:00:50.263 --> 00:00:51.553] AC SPEAKER_00 50.263158 51.553480
The diarization output is a Pandas DataFrame with various columns. The most important are speaker, start, and end, which give a speaker label for that segment, the start time of the segment, and the end time of the segment.
For example, you can get a collection of unique speaker labels using Python’s set function.
speakers = set(diarization['speaker'])
And you can use the len function to get the number of speakers.
len(speakers)
4
You can also filter the diarization output by selecting only segments with a particular speaker using Pandas’ DataFrame.loc indexer.
snoop_dogg = diarization.loc[diarization['speaker'] == 'SPEAKER_00']
print(snoop_dogg)
segment label speaker start end
0 [ 00:00:00.365 --> 00:00:00.704] A SPEAKER_00 0.365025 0.704584
1 [ 00:00:01.044 --> 00:00:01.825] B SPEAKER_00 1.044143 1.825127
2 [ 00:00:02.741 --> 00:00:03.251] C SPEAKER_00 2.741935 3.251273
3 [ 00:00:03.947 --> 00:00:05.305] D SPEAKER_00 3.947368 5.305603
4 [ 00:00:06.273 --> 00:00:08.684] E SPEAKER_00 6.273345 8.684211
6 [ 00:00:09.057 --> 00:00:12.436] G SPEAKER_00 9.057725 12.436333
8 [ 00:00:13.064 --> 00:00:14.915] I SPEAKER_00 13.064516 14.915110
9 [ 00:00:15.271 --> 00:00:18.276] J SPEAKER_00 15.271647 18.276740
10 [ 00:00:18.955 --> 00:00:21.027] K SPEAKER_00 18.955857 21.027165
11 [ 00:00:21.451 --> 00:00:23.539] L SPEAKER_00 21.451613 23.539898
12 [ 00:00:23.998 --> 00:00:25.305] M SPEAKER_00 23.998302 25.305603
14 [ 00:00:26.188 --> 00:00:26.833] O SPEAKER_00 26.188455 26.833616
16 [ 00:00:27.071 --> 00:00:29.465] Q SPEAKER_00 27.071307 29.465195
17 [ 00:00:29.923 --> 00:00:31.400] R SPEAKER_00 29.923599 31.400679
18 [ 00:00:32.164 --> 00:00:33.098] S SPEAKER_00 32.164686 33.098472
21 [ 00:00:34.286 --> 00:00:37.224] V SPEAKER_00 34.286927 37.224109
22 [ 00:00:37.427 --> 00:00:38.276] W SPEAKER_00 37.427844 38.276740
23 [ 00:00:38.853 --> 00:00:39.753] X SPEAKER_00 38.853990 39.753820
24 [ 00:00:40.840 --> 00:00:42.707] Y SPEAKER_00 40.840407 42.707980
25 [ 00:00:42.843 --> 00:00:45.152] Z SPEAKER_00 42.843803 45.152801
26 [ 00:00:46.052 --> 00:00:47.461] AA SPEAKER_00 46.052632 47.461800
27 [ 00:00:48.225 --> 00:00:49.329] AB SPEAKER_00 48.225806 49.329372
28 [ 00:00:50.263 --> 00:00:51.553] AC SPEAKER_00 50.263158 51.553480
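The speaker, start, and end columns are also enough to estimate how long each speaker talks. The sketch below works on plain records so that it runs without any dependencies. The values are invented, speaking_time is a hypothetical helper, and with real output the records could be obtained with pandas' diarization.to_dict("records").

```python
from collections import defaultdict

# Invented records with the same columns as the diarization output.
# With pandas, real records could come from diarization.to_dict("records").
records = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.5},
    {"speaker": "SPEAKER_01", "start": 1.5, "end": 2.0},
    {"speaker": "SPEAKER_00", "start": 2.0, "end": 4.0},
]


def speaking_time(records):
    """Sum segment durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for row in records:
        totals[row["speaker"]] += row["end"] - row["start"]
    return dict(totals)


print(speaking_time(records))  # {'SPEAKER_00': 3.5, 'SPEAKER_01': 0.5}
```

Note that overlapping segments, such as rows 6 and 7 in the output above, are counted once for each speaker involved, so the totals can add up to more than the length of the recording.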
Diarized transcription
The last stage of the pipeline is combining the diarization and the transcription by assigning speakers to segments.
diarized_transcript = fave_asr.assign_speakers(diarization, transcription)
The structure of diarized_transcript is very similar to the structure of transcription, but the segments and words now have a speaker field.
diarized_transcript['segments'][0]['speaker']
'SPEAKER_00'
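With a speaker field on every segment, a speaker-labeled transcript can be built by merging consecutive segments from the same speaker into turns. This is a minimal sketch: sample_diarized is invented data that only mimics the speaker and text fields of diarized_transcript, and speaker_turns is a hypothetical helper rather than a fave-asr function.

```python
# Invented stand-in data with the `speaker` and `text` fields only.
sample_diarized = {
    "segments": [
        {"speaker": "SPEAKER_00", "text": " Hello there"},
        {"speaker": "SPEAKER_00", "text": " how are you"},
        {"speaker": "SPEAKER_01", "text": " Fine thanks"},
    ]
}


def speaker_turns(diarized_transcript):
    """Merge consecutive same-speaker segments into (speaker, text) turns."""
    turns = []
    for segment in diarized_transcript["segments"]:
        speaker = segment["speaker"]
        text = segment["text"].strip()
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous segment: extend that turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


for speaker, text in speaker_turns(sample_diarized):
    print(f"{speaker}: {text}")
```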
Output
TextGrid
A diarized transcript can be converted to a TextGrid object, which can then be navigated using the textgrid library.
tg = fave_asr.to_TextGrid(diarized_transcript)
You can write the output to a file using the textgrid.write method by specifying a file name for the output TextGrid.
tg.write('SnoopDogg_Interview.TextGrid')