fave_asr
fave_asr
This module automates the transcription and diarization of linguistic data.
Functions
Name | Description |
---|---|
align_segments | Align the transcript segments using a pretrained alignment model (Wav2Vec2 by default). |
assign_speakers | Assign speakers to each transcript segment based on the speaker diarization result. |
diarize | Perform speaker diarization on an audio file. |
to_TextGrid | Convert a diarized transcription dictionary to a TextGrid |
transcribe | Transcribe an audio file using a whisper model. |
transcribe_and_diarize | Transcribe an audio file and perform speaker diarization. |
align_segments
fave_asr.align_segments(segments, language_code, audio_file, device='cpu')
Align the transcript segments using a pretrained alignment model (Wav2Vec2 by default).
Parameters
Name | Type | Description | Default |
---|---|---|---|
segments |
typing.List[typing.Dict[str, typing.Any]] | List of transcript segments to align. | required |
language_code |
str | Language code of the audio file. | required |
audio_file |
str | Path to the audio file containing the audio data. | required |
device |
str | The device to use for inference (e.g., “cpu” or “cuda”). | 'cpu' |
Returns
Type | Description |
---|---|
typing.Dict[str, typing.Any] | A dictionary representing the aligned transcript segments. |
assign_speakers
fave_asr.assign_speakers(diarization_result, aligned_segments)
Assign speakers to each transcript segment based on the speaker diarization result.
Parameters
Name | Type | Description | Default |
---|---|---|---|
diarization_result |
typing.Dict[str, typing.Any] | Dictionary of diarized audio file, including speaker embeddings and number of speakers. | required |
aligned_segments |
typing.Dict[str, typing.Any] | Dictionary representing the aligned transcript segments. | required |
Returns
Type | Description |
---|---|
typing.List[typing.Dict[str, typing.Any]] | A list of dictionaries representing each segment of the transcript, including |
typing.List[typing.Dict[str, typing.Any]] | the start and end times, the spoken text, and the speaker ID. |
diarize
fave_asr.diarize(audio_file, hf_token)
Perform speaker diarization on an audio file.
Parameters
Name | Type | Description | Default |
---|---|---|---|
audio_file |
str | Path to the audio file to diarize. | required |
hf_token |
str | Authentication token for accessing the Hugging Face API. | required |
Returns
Type | Description |
---|---|
typing.Dict[str, typing.Any] | A dictionary representing the diarized audio file, including the speaker embeddings and the number of speakers. |
to_TextGrid
fave_asr.to_TextGrid(diarized_transcription, by_phrase=True)
Convert a diarized transcription dictionary to a TextGrid
Parameters
Name | Type | Description | Default |
---|---|---|---|
diarized_transcription |
Output of pipeline.assign_speakers() | required | |
by_phrase |
Flag for whether the intervals should be by phrase (True) or word (False) | True |
Returns
Type | Description |
---|---|
A textgrid.TextGrid object populated with the diarized and | |
transcribed data. Tiers are by speaker and contain word-level | |
intervals not utterance-level. |
transcribe
fave_asr.transcribe(audio_file, model_name, device='cpu', detect_disfluencies=True)
Transcribe an audio file using a whisper model.
Parameters
Name | Type | Description | Default |
---|---|---|---|
audio_file |
str | Path to the audio file to transcribe. | required |
model_name |
str | Name of the model to use for transcription. | required |
device |
str | The device to use for inference (e.g., “cpu” or “cuda”). | 'cpu' |
detect_disfluencies |
bool | Flag for whether the transcription should include disfluencies, marked with [*] | True |
Returns
Type | Description |
---|---|
typing.Dict[str, typing.Any] | A dictionary representing the transcript segments and language code. |
transcribe_and_diarize
fave_asr.transcribe_and_diarize(audio_file, hf_token, model_name, device='cpu')
Transcribe an audio file and perform speaker diarization.
Parameters
Name | Type | Description | Default |
---|---|---|---|
audio_file |
str | Path to the audio file to transcribe and diarize. | required |
hf_token |
str | Authentication token for accessing the Hugging Face API. | required |
model_name |
str | Name of the model to use for transcription. | required |
device |
str | The device to use for inference (e.g., “cpu” or “cuda”). | 'cpu' |
Returns
Type | Description |
---|---|
typing.List[typing.Dict[str, typing.Any]] | A list of dictionaries representing each segment of the transcript, including |
typing.List[typing.Dict[str, typing.Any]] | the start and end times, the spoken text, and the speaker ID. |