How it works
A fave-extract on a single audio file/textgrid pair takes place in 4 steps.
Initial Formant Tracking
Initial formant tracking is carried out using fasttrackpy. The range of max-formants considered is broader in the new-fave configuration than in the default fasttrackpy settings, because early testing found the wider range necessary to reduce errors for some speakers and some vowels. You can review the fasttrack configuration file below.
fasttrack_config.yml
entry_classes:
  - Word
  - Phone
target_tier: "Phone"
target_labels: "[AEIOU]"
min_duration: 0.05
min_max_formant: 3200
max_max_formant: 10000
n_formants: 3
nstep: 40
window_length: 0.025
time_step: 0.002
smoother:
  method: dct_smooth_regression
  order: 5
The initial formant tracking uses the original labeling in the input textgrid, and targets intervals that match the regular expression in the target_labels field of the fasttrack config file.
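As a rough illustration of how a target_labels pattern selects intervals, the sketch below filters a list of CMU-style labels with Python's `re` module. The helper `is_target` is hypothetical, and the choice of `re.search` (a match anywhere in the label) is an assumption; fasttrackpy may apply the pattern differently. For CMU labels the distinction does not matter, since only vowel labels contain the letters A, E, I, O, or U.

```python
import re

TARGET_LABELS = "[AEIOU]"  # value of target_labels in the config above


def is_target(label, pattern=TARGET_LABELS):
    """Hypothetical helper: True when the label matches the pattern.

    Assumption: the pattern may match anywhere in the label (re.search).
    """
    return re.search(pattern, label) is not None


# Made-up interval labels from a Phone tier, including a pause and a blank.
labels = ["K", "AE1", "T", "", "DH", "AH0", "sp"]
targets = [lab for lab in labels if is_target(lab)]
print(targets)
```

Only the vowel intervals survive the filter, which is why consonants, pauses, and blank intervals are never formant-tracked.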
Label Recoding
In the next step, any label-recoding rules that were passed to fave-extract are applied. For example, if you wanted to distinguish between schwa and ʌ, but are using CMU labels that label them AH0 and AH1, respectively, you could include the following recoding rule file.
- rule: schwa
  conditions:
    - attribute: label
      relation: ==
      set: AH0
  return: "@"
- rule: wedge
  conditions:
    - attribute: label
      relation: rematches
      set: AH
  return: "^"
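To make the effect of these two rules concrete, here is a minimal sketch of how they might be evaluated. It is not the recoding library's implementation: the `recode` function and the flattened `RULES` structure (one condition per rule) are hypothetical, and it assumes that rules are tried in file order, that the first matching rule wins, and that `rematches` means a regular-expression match anchored at the start of the label.

```python
import re

# The two rules from the file above, flattened to one condition each.
RULES = [
    {"rule": "schwa", "relation": "==", "set": "AH0", "return": "@"},
    {"rule": "wedge", "relation": "rematches", "set": "AH", "return": "^"},
]


def recode(label, rules=RULES):
    """Hypothetical helper: return the recoded label, or the label unchanged."""
    for rule in rules:
        if rule["relation"] == "==":
            hit = label == rule["set"]
        elif rule["relation"] == "rematches":
            # Assumption: regex match anchored at the start of the label.
            hit = re.match(rule["set"], label) is not None
        else:
            raise ValueError(f"unsupported relation: {rule['relation']}")
        if hit:
            return rule["return"]  # assumption: first matching rule wins
    return label


for original in ["AH0", "AH1", "AH2", "IY1"]:
    print(original, "->", recode(original))
```

Under these assumptions the order of the rules matters: AH0 also matches the broader `AH` pattern, so the schwa rule has to come first for AH0 to be recoded as "@".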
Why bother with label recoding?
In the next stage, optimization, distributional statistics are estimated and iteratively updated for each vowel class. Vowel classes are determined by their label, so if there are two distinct vowel classes that you are interested in, but they both have the same label in the original textgrid, you will have to recode them to be separate for optimization.
Optimization
In the optimization stage, the following distributional properties are estimated for each vowel class.
- The multidimensional distribution over vowel log(formant track) DCT parameters, and formant bandwidths of the vowel class.
- The multidimensional distribution over the centroid position of the vowel class.
- The multidimensional distribution over formant ratios for the vowel class.
- The distribution over max-formant for the vowel class.
If a corpus of reference values (DCT parameters, or formant points) is provided, the multidimensional distribution over these reference values across the entire corpus is also estimated.
In cases where there are fewer than 10 tokens for a vowel class, these distributions are replaced with similar distributions over the speaker’s entire vowel space.
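The fallback for sparse vowel classes can be sketched as follows. This is a simplified illustration, not new-fave's code: it uses a single mean per class in place of a full multidimensional distribution, and the function name `class_means` and the token values are made up.

```python
from collections import defaultdict
from statistics import mean

MIN_TOKENS = 10  # threshold stated in the text


def class_means(tokens):
    """tokens: list of (label, value) pairs.

    Returns {label: mean}, using the speaker-wide mean for any class
    with fewer than MIN_TOKENS tokens.
    """
    by_label = defaultdict(list)
    for label, value in tokens:
        by_label[label].append(value)
    speaker_mean = mean(value for _, value in tokens)
    return {
        label: (mean(values) if len(values) >= MIN_TOKENS else speaker_mean)
        for label, values in by_label.items()
    }


# Made-up F1 values: 12 tokens of IY, only 3 tokens of AA.
tokens = [("IY", 300.0)] * 12 + [("AA", 700.0)] * 3
print(class_means(tokens))
```

Here IY keeps its own estimate, while AA is too sparse and falls back to the estimate pooled over all of the speaker's vowels.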
For these 4 distributions, the Mahalanobis distance is calculated for each candidate formant track for each vowel token. These Mahalanobis distances are converted to log-probabilities and summed, together with the smoothing errors from fasttrack processing, and F1 & F2 cutoff values, resulting in an overall log-probability for each candidate track. The candidate track with the largest log-probability is selected as the new winning track.
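The scoring step can be sketched in a few lines. The example below handles only one two-dimensional distribution (for instance a centroid in F1 and F2) and lumps every other term into a single `other_log_prob` value. The conversion from distance to log-probability is an assumption: it uses the Gaussian log-density up to an additive constant, which may differ from the conversion new-fave actually applies. All names and numbers are hypothetical.

```python
def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def log_prob_from_distance(d_sq):
    # Assumed conversion: Gaussian log-density up to an additive constant.
    return -0.5 * d_sq


def pick_winner(candidates, mean, cov):
    """Return (index of best candidate, list of summed log-probabilities)."""
    scores = [
        log_prob_from_distance(mahalanobis_sq_2d(c["centroid"], mean, cov))
        + c["other_log_prob"]  # stand-in for all remaining terms
        for c in candidates
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up class statistics and candidate tracks for one vowel token.
class_mean = (500.0, 1500.0)
class_cov = ((10000.0, 0.0), (0.0, 40000.0))
candidates = [
    {"centroid": (500.0, 1500.0), "other_log_prob": -3.0},
    {"centroid": (600.0, 1500.0), "other_log_prob": -0.1},
    {"centroid": (500.0, 1900.0), "other_log_prob": 0.0},
]
best, scores = pick_winner(candidates, class_mean, class_cov)
print(best, scores)
```

The first candidate sits exactly on the class centroid but loses because its other terms are poor, which shows why the log-probabilities are summed rather than judged one distribution at a time.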
This optimization is repeated for a maximum of 10 iterations, or until the difference between two successive optimization steps is sufficiently small.
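The iterate-until-stable pattern can be illustrated with a toy example. It reduces each candidate to a single number and each class distribution to a mean, and it assumes the stopping rule is "no winner changed"; the text does not say how the difference between steps is measured in new-fave, so treat both simplifications as assumptions. The function `optimize` and the data are hypothetical.

```python
def optimize(tokens, max_iter=10):
    """tokens: list of candidate lists, each candidate a float (e.g. an F1 value).

    Returns (winning candidate index per token, number of iterations run).
    """
    winners = [0] * len(tokens)  # start from the first candidate of each token
    for iteration in range(1, max_iter + 1):
        # Re-estimate the class statistic from the current winners.
        mean = sum(cands[w] for cands, w in zip(tokens, winners)) / len(tokens)
        # Re-select, for each token, the candidate closest to that estimate.
        new_winners = [
            min(range(len(cands)), key=lambda i: abs(cands[i] - mean))
            for cands in tokens
        ]
        if new_winners == winners:  # assumed stopping rule: nothing changed
            break
        winners = new_winners
    return winners, iteration


# Three made-up tokens of one vowel class, two candidate values each.
tokens = [[500.0, 900.0], [520.0, 300.0], [1200.0, 510.0]]
winners, n_iter = optimize(tokens)
print(winners, n_iter)
```

The poor initial winner for the third token (1200) drags the first estimate upward, and it takes a couple of rounds for the estimate and the selections to settle on a consistent set.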
Saving Data
The desired output data is then saved. Available output formats are:
- Formant point data
- Formant track data
- DCT parameters of formant track data
- DCT parameters of log(formant track) data
- The recoded textgrid.
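To give a feel for what the DCT-based outputs contain, the sketch below computes the first few discrete cosine transform coefficients of a formant track and of its logarithm. The orthonormal DCT-II scaling and the choice of five coefficients are assumptions made for illustration; the normalization and number of parameters that new-fave writes out may differ. The function `dct_coefficients` and the track values are hypothetical.

```python
import math


def dct_coefficients(track, n_coefs=5):
    """Orthonormal DCT-II of a track, truncated to the first n_coefs terms."""
    n = len(track)
    coefs = []
    for k in range(min(n_coefs, n)):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(track)
        )
        coefs.append(scale * total)
    return coefs


f1_track = [500.0, 520.0, 560.0, 590.0, 600.0, 590.0]  # made-up F1 values in Hz
coefs = dct_coefficients(f1_track)
log_coefs = dct_coefficients([math.log(x) for x in f1_track])
print(coefs)
print(log_coefs)
```

The zeroth coefficient tracks the overall level of the formant, while the higher coefficients describe its slope and curvature, so a handful of numbers summarizes the whole trajectory.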