fave_asr

This module automates the transcription and diarization of linguistic data.

Functions

Name	Description
align_segments	Align the transcript segments using a pretrained alignment model (Wav2Vec2 by default).
assign_speakers	Assign speakers to each transcript segment based on the speaker diarization result.
diarize	Perform speaker diarization on an audio file.
to_TextGrid	Convert a diarized transcription dictionary to a TextGrid.
transcribe	Transcribe an audio file using a Whisper model.
transcribe_and_diarize	Transcribe an audio file and perform speaker diarization.

fave_asr.align_segments(segments, language_code, audio_file, device='cpu')

Align the transcript segments using a pretrained alignment model (Wav2Vec2 by default).

Name	Type	Description	Default
`segments`	typing.List[typing.Dict[str, typing.Any]]	List of transcript segments to align.	required
`language_code`	str	Language code of the audio file.	required
`audio_file`	str	Path to the audio file containing the audio data.	required
`device`	str	The device to use for inference (e.g., “cpu” or “cuda”).	`'cpu'`

Type	Description
typing.Dict[str, typing.Any]	A dictionary representing the aligned transcript segments.

fave_asr.assign_speakers(diarization_result, aligned_segments)

Assign speakers to each transcript segment based on the speaker diarization result.

Name	Type	Description	Default
`diarization_result`	typing.Dict[str, typing.Any]	Dictionary of diarized audio file, including speaker embeddings and number of speakers.	required
`aligned_segments`	typing.Dict[str, typing.Any]	Dictionary representing the aligned transcript segments.	required

Type	Description
typing.List[typing.Dict[str, typing.Any]]	A list of dictionaries representing each segment of the transcript, including the start and end times, the spoken text, and the speaker ID.
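
The returned list can be post-processed with ordinary Python. The sketch below groups the spoken text by speaker. It assumes each segment dictionary uses the keys `start`, `end`, `text`, and `speaker`; these key names are an assumption based on the description above, not confirmed by this reference, and `text_by_speaker` and `example_segments` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def text_by_speaker(segments: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect the spoken text of each segment under its speaker ID."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for segment in segments:
        # "speaker" and "text" are assumed key names, not documented ones.
        speaker = segment.get("speaker", "UNKNOWN")
        grouped[speaker].append(segment["text"].strip())
    return dict(grouped)


# Hand-written stand-in for the output of assign_speakers().
example_segments = [
    {"start": 0.0, "end": 1.2, "text": " Hello there.", "speaker": "SPEAKER_00"},
    {"start": 1.4, "end": 2.0, "text": " Hi.", "speaker": "SPEAKER_01"},
    {"start": 2.1, "end": 3.5, "text": " How are you?", "speaker": "SPEAKER_00"},
]

print(text_by_speaker(example_segments))
```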

fave_asr.diarize(audio_file, hf_token)

Perform speaker diarization on an audio file.

Name	Type	Description	Default
`audio_file`	str	Path to the audio file to diarize.	required
`hf_token`	str	Authentication token for accessing the Hugging Face API.	required

Type	Description
typing.Dict[str, typing.Any]	A dictionary representing the diarized audio file, including the speaker embeddings and the number of speakers.
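
Because `hf_token` is a credential, it is usually better to read it from the environment than to write it into a script. The helper below is a minimal sketch of that pattern. The function name `get_hf_token` and the environment variable name `HF_TOKEN` are assumptions chosen for illustration; they are not part of fave_asr.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a Hugging Face access token."
        )
    return token


# Intended use (requires fave_asr and a valid token, so it is left commented out):
# diarization = fave_asr.diarize("interview.wav", get_hf_token())
```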

fave_asr.to_TextGrid(diarized_transcription, by_phrase=True)

Convert a diarized transcription dictionary to a TextGrid.

Name	Type	Description	Default
`diarized_transcription`		Output of fave_asr.assign_speakers()	required
`by_phrase`		Flag for whether the intervals should be by phrase (True) or word (False)	`True`

Type	Description
	A textgrid.TextGrid object populated with the diarized and transcribed data. Tiers are by speaker and contain word-level intervals, not utterance-level.

fave_asr.transcribe(audio_file, model_name, device='cpu', detect_disfluencies=True)

Transcribe an audio file using a Whisper model.

Name	Type	Description	Default
`audio_file`	str	Path to the audio file to transcribe.	required
`model_name`	str	Name of the model to use for transcription.	required
`device`	str	The device to use for inference (e.g., “cpu” or “cuda”).	`'cpu'`
`detect_disfluencies`	bool	Flag for whether the transcription should include disfluencies, marked with [*]	`True`

Type	Description
typing.Dict[str, typing.Any]	A dictionary representing the transcript segments and language code.
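
When `detect_disfluencies` is left on, downstream code may need to count or remove the `[*]` markers. The following is a minimal sketch that works on a plain transcript string. The helper names are hypothetical, and it assumes the marker appears literally as `[*]` in the text, as the parameter description states.

```python
import re

DISFLUENCY_MARKER = "[*]"


def count_disfluencies(text: str) -> int:
    """Count the disfluency markers in a transcript string."""
    return text.count(DISFLUENCY_MARKER)


def strip_disfluencies(text: str) -> str:
    """Remove disfluency markers and collapse the leftover whitespace."""
    without_markers = text.replace(DISFLUENCY_MARKER, " ")
    return re.sub(r"\s+", " ", without_markers).strip()


example_text = "I [*] went to the [*] store"
print(count_disfluencies(example_text))
print(strip_disfluencies(example_text))
```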

fave_asr.transcribe_and_diarize(audio_file, hf_token, model_name, device='cpu')

Transcribe an audio file and perform speaker diarization.

Name	Type	Description	Default
`audio_file`	str	Path to the audio file to transcribe and diarize.	required
`hf_token`	str	Authentication token for accessing the Hugging Face API.	required
`model_name`	str	Name of the model to use for transcription.	required
`device`	str	The device to use for inference (e.g., “cpu” or “cuda”).	`'cpu'`

Type	Description
typing.List[typing.Dict[str, typing.Any]]	A list of dictionaries representing each segment of the transcript, including the start and end times, the spoken text, and the speaker ID.
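
For finer control, the individual functions documented above can be chained by hand. The sketch below shows one plausible ordering: transcribe, align, diarize, then assign speakers. It is not a description of how `transcribe_and_diarize` is implemented. The dictionary keys `segments` and `language_code` are assumptions inferred from the `transcribe` return description, and `run_pipeline` and its `api` parameter are hypothetical; `api` exists only so the sketch can be exercised with a stand-in object when fave_asr is not installed.

```python
from typing import Any, Dict, List, Optional


def run_pipeline(
    audio_file: str,
    hf_token: str,
    model_name: str,
    device: str = "cpu",
    api: Optional[Any] = None,
) -> List[Dict[str, Any]]:
    """Chain the four documented steps and return the speaker-labelled segments."""
    if api is None:
        # Imported lazily so the function can be tested without fave_asr.
        import fave_asr as api

    transcription = api.transcribe(audio_file, model_name, device)
    aligned = api.align_segments(
        transcription["segments"],       # assumed key name
        transcription["language_code"],  # assumed key name
        audio_file,
        device,
    )
    diarization = api.diarize(audio_file, hf_token)
    return api.assign_speakers(diarization, aligned)
```

Calling the steps separately like this makes it possible to cache the intermediate results, for example to re-run diarization without repeating the transcription.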

GPLv3