fave_asr

This module automates the transcription and diarization of linguistic data.

Functions

| Name | Description |
|------|-------------|
| align_segments | Align the transcript segments using a pretrained alignment model (Wav2Vec2 by default). |
| assign_speakers | Assign speakers to each transcript segment based on the speaker diarization result. |
| diarize | Perform speaker diarization on an audio file. |
| to_TextGrid | Convert a diarized transcription dictionary to a TextGrid. |
| transcribe | Transcribe an audio file using a Whisper model. |
| transcribe_and_diarize | Transcribe an audio file and perform speaker diarization. |

align_segments

fave_asr.align_segments(segments, language_code, audio_file, device='cpu')

Align the transcript segments using a pretrained alignment model (Wav2Vec2 by default).

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| segments | typing.List[typing.Dict[str, typing.Any]] | List of transcript segments to align. | required |
| language_code | str | Language code of the audio file. | required |
| audio_file | str | Path to the audio file containing the audio data. | required |
| device | str | The device to use for inference (e.g., “cpu” or “cuda”). | 'cpu' |

Returns

| Type | Description |
|------|-------------|
| typing.Dict[str, typing.Any] | A dictionary representing the aligned transcript segments. |

assign_speakers

fave_asr.assign_speakers(diarization_result, aligned_segments)

Assign speakers to each transcript segment based on the speaker diarization result.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| diarization_result | typing.Dict[str, typing.Any] | Dictionary of diarized audio file, including speaker embeddings and number of speakers. | required |
| aligned_segments | typing.Dict[str, typing.Any] | Dictionary representing the aligned transcript segments. | required |

Returns

| Type | Description |
|------|-------------|
| typing.List[typing.Dict[str, typing.Any]] | A list of dictionaries representing each segment of the transcript, including the start and end times, the spoken text, and the speaker ID. |

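The output of assign_speakers can be iterated like any list of dictionaries. The sketch below illustrates one way to render segments as readable lines; note that the key names ("start", "end", "text", "speaker") are assumptions inferred from the return description above, not a documented schema.

```python
# Hypothetical sample of assign_speakers() output; the exact keys
# ("start", "end", "text", "speaker") are assumed, not documented.
sample_segments = [
    {"start": 0.00, "end": 2.31, "text": "Hello there.", "speaker": "SPEAKER_00"},
    {"start": 2.40, "end": 4.10, "text": "Hi, how are you?", "speaker": "SPEAKER_01"},
]

def format_segments(segments):
    """Render each segment as a '[start-end] speaker: text' line."""
    return [
        f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['speaker']}: {seg['text']}"
        for seg in segments
    ]

for line in format_segments(sample_segments):
    print(line)
```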
diarize

fave_asr.diarize(audio_file, hf_token)

Perform speaker diarization on an audio file.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| audio_file | str | Path to the audio file to diarize. | required |
| hf_token | str | Authentication token for accessing the Hugging Face API. | required |

Returns

| Type | Description |
|------|-------------|
| typing.Dict[str, typing.Any] | A dictionary representing the diarized audio file, including the speaker embeddings and the number of speakers. |

to_TextGrid

fave_asr.to_TextGrid(diarized_transcription, by_phrase=True)

Convert a diarized transcription dictionary to a TextGrid

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| diarized_transcription | | Output of pipeline.assign_speakers() | required |
| by_phrase | | Flag for whether the intervals should be by phrase (True) or word (False) | True |

Returns

| Type | Description |
|------|-------------|
| textgrid.TextGrid | A TextGrid object populated with the diarized and transcribed data. Tiers are by speaker and contain word-level intervals, not utterance-level. |
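The tier-per-speaker organization that to_TextGrid produces can be pictured as a grouping of intervals by speaker ID. This is a rough illustration using plain dictionaries rather than the textgrid library; the segment keys are assumptions, not taken from fave_asr.

```python
from collections import defaultdict

def group_by_speaker(diarized_segments):
    """Group (start, end, text) intervals into one list per speaker,
    mirroring the one-tier-per-speaker layout of a TextGrid."""
    tiers = defaultdict(list)
    for seg in diarized_segments:
        tiers[seg["speaker"]].append((seg["start"], seg["end"], seg["text"]))
    return dict(tiers)

# Hypothetical diarized segments (keys assumed, not documented).
segments = [
    {"start": 0.0, "end": 1.5, "text": "hello", "speaker": "SPEAKER_00"},
    {"start": 1.6, "end": 2.2, "text": "hi", "speaker": "SPEAKER_01"},
    {"start": 2.3, "end": 3.0, "text": "there", "speaker": "SPEAKER_00"},
]

tiers = group_by_speaker(segments)
print(sorted(tiers))  # one tier name per speaker
```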

transcribe

fave_asr.transcribe(audio_file, model_name, device='cpu', detect_disfluencies=True)

Transcribe an audio file using a Whisper model.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| audio_file | str | Path to the audio file to transcribe. | required |
| model_name | str | Name of the model to use for transcription. | required |
| device | str | The device to use for inference (e.g., “cpu” or “cuda”). | 'cpu' |
| detect_disfluencies | bool | Flag for whether the transcription should include disfluencies, marked with [*]. | True |

Returns

| Type | Description |
|------|-------------|
| typing.Dict[str, typing.Any] | A dictionary representing the transcript segments and language code. |

transcribe_and_diarize

fave_asr.transcribe_and_diarize(audio_file, hf_token, model_name, device='cpu')

Transcribe an audio file and perform speaker diarization.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| audio_file | str | Path to the audio file to transcribe and diarize. | required |
| hf_token | str | Authentication token for accessing the Hugging Face API. | required |
| model_name | str | Name of the model to use for transcription. | required |
| device | str | The device to use for inference (e.g., “cpu” or “cuda”). | 'cpu' |

Returns

| Type | Description |
|------|-------------|
| typing.List[typing.Dict[str, typing.Any]] | A list of dictionaries representing each segment of the transcript, including the start and end times, the spoken text, and the speaker ID. |
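A common downstream use of this return value is summarizing per-speaker talk time. The sketch below works over a hypothetical sample of the returned list; the key names are assumptions inferred from the return description, not a documented schema.

```python
def speaking_time(segments):
    """Sum (end - start) durations per speaker ID."""
    totals = {}
    for seg in segments:
        dur = seg["end"] - seg["start"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + dur
    return totals

# Hypothetical transcribe_and_diarize() output (keys assumed).
segments = [
    {"start": 0.0, "end": 2.0, "text": "good morning", "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 5.0, "text": "morning, all", "speaker": "SPEAKER_01"},
    {"start": 5.0, "end": 6.0, "text": "shall we start?", "speaker": "SPEAKER_00"},
]

print(speaking_time(segments))
```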

Reuse

GPLv3