Outputting and saving as a DataFrame

You can output and save any aligned_textgrid object as a polars dataframe with the to_df() function.

Outputting as a data frame

import polars as pl

from aligned_textgrid import AlignedTextGrid, Word, Phone, to_df

tg = AlignedTextGrid(
    textgrid_path="../resources/josef-fruehwald_speaker.TextGrid",
    entry_classes=[Word, Phone]
)

Single Intervals

Bottom of the hierarchy

If you pass a single interval from the bottom of the sequence hierarchy, you’ll get back a fairly minimal one-row dataframe with the interval's ID, its tier index, its label, and its start and end times.

one_interval = tg[0].Phone[1]
one_interval_df = to_df(one_interval)

one_interval_df
shape: (1, 5)
Phone_id Phone_tier_index Phone_label Phone_start Phone_end
str i64 str f64 f64
"0-0-1-0" 1 "HH" 0.11 1.97

Top of the hierarchy

If you pass to_df() an interval from higher up in the hierarchy, by default it will output its data, as well as the data for every interval below it in the hierarchy, concatenated horizontally.

word_interval = tg[0].Word[1]
word_interval_df = to_df(word_interval)

word_interval_df
shape: (4, 10)
Word_id Word_tier_index Word_label Word_start Word_end Phone_id Phone_tier_index Phone_label Phone_start Phone_end
str i64 str f64 f64 str i64 str f64 f64
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-0" 1 "HH" 0.11 1.97
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-1" 2 "W" 1.97 2.09
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-2" 3 "EH1" 2.09 2.13
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-3" 4 "N" 2.13 2.2

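If you want just a simplified, single-row output for an interval, regardless of its location within the hierarchy, pass to_df(..., with_subset=False). The column names then drop the entry class prefix, and the class is recorded in an entry_class column instead.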

word_interval_df2 = to_df(word_interval, with_subset=False)

word_interval_df2
shape: (1, 6)
id tier_index label start end entry_class
str i64 str f64 f64 str
"0-0-1" 1 "when" 0.11 2.2 "Word"

Tiers

If you pass a tier to to_df(), it will output a dataframe for every interval in the tier, concatenated vertically. By default, this means intervals high in the hierarchy will have their rows repeated for every interval they contain. If you want one row per interval in the output, pass to_df(..., with_subset=False).

tier_df1 = to_df(tg[0].Word)

tier_df1.shape
(1191, 10)
tier_df1.head(10)
shape: (10, 10)
Word_id Word_tier_index Word_label Word_start Word_end Phone_id Phone_tier_index Phone_label Phone_start Phone_end
str i64 str f64 f64 str i64 str f64 f64
"0-0-0" 0 "" 0.0 0.11 "0-0-0-0" 0 "" 0.0 0.11
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-0" 1 "HH" 0.11 1.97
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-1" 2 "W" 1.97 2.09
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-2" 3 "EH1" 2.09 2.13
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-3" 4 "N" 2.13 2.2
"0-0-2" 2 "the" 2.2 2.26 "0-0-2-0" 5 "DH" 2.2 2.22
"0-0-2" 2 "the" 2.2 2.26 "0-0-2-1" 6 "AH0" 2.22 2.26
"0-0-3" 3 "sunlight" 2.26 2.72 "0-0-3-0" 7 "S" 2.26 2.39
"0-0-3" 3 "sunlight" 2.26 2.72 "0-0-3-1" 8 "AH1" 2.39 2.44
"0-0-3" 3 "sunlight" 2.26 2.72 "0-0-3-2" 9 "N" 2.44 2.52
# 1 row per interval
tier_df2 = to_df(tg[0].Word, with_subset=False)

tier_df2.shape
(377, 6)
tier_df2.head(10)
shape: (10, 6)
id tier_index label start end entry_class
str i64 str f64 f64 str
"0-0-0" 0 "" 0.0 0.11 "Word"
"0-0-1" 1 "when" 0.11 2.2 "Word"
"0-0-2" 2 "the" 2.2 2.26 "Word"
"0-0-3" 3 "sunlight" 2.26 2.72 "Word"
"0-0-4" 4 "strikes" 2.72 3.22 "Word"
"0-0-5" 5 "raindrops" 3.22 3.79 "Word"
"0-0-6" 6 "in" 3.79 3.89 "Word"
"0-0-7" 7 "the" 3.89 4.02 "Word"
"0-0-8" 8 "air" 4.02 4.45 "Word"
"0-0-9" 9 "" 4.45 4.61 "Word"

TierGroups and TextGrids

The behavior for TierGroups and TextGrids is similar. By default, the to_df() function returns a dataframe representing the entire hierarchy structure. With with_subset=False, it returns one row for each interval in the TextGrid.

full_df1 = to_df(tg)

full_df1.shape
(1191, 10)
full_df1.head(10)
shape: (10, 10)
Word_id Word_tier_index Word_label Word_start Word_end Phone_id Phone_tier_index Phone_label Phone_start Phone_end
str i64 str f64 f64 str i64 str f64 f64
"0-0-0" 0 "" 0.0 0.11 "0-0-0-0" 0 "" 0.0 0.11
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-0" 1 "HH" 0.11 1.97
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-1" 2 "W" 1.97 2.09
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-2" 3 "EH1" 2.09 2.13
"0-0-1" 1 "when" 0.11 2.2 "0-0-1-3" 4 "N" 2.13 2.2
"0-0-2" 2 "the" 2.2 2.26 "0-0-2-0" 5 "DH" 2.2 2.22
"0-0-2" 2 "the" 2.2 2.26 "0-0-2-1" 6 "AH0" 2.22 2.26
"0-0-3" 3 "sunlight" 2.26 2.72 "0-0-3-0" 7 "S" 2.26 2.39
"0-0-3" 3 "sunlight" 2.26 2.72 "0-0-3-1" 8 "AH1" 2.39 2.44
"0-0-3" 3 "sunlight" 2.26 2.72 "0-0-3-2" 9 "N" 2.44 2.52
# 1 row per interval
full_df2 = to_df(tg, with_subset=False)

full_df2.shape
(1568, 6)
full_df2.head(5)
shape: (5, 6)
id tier_index label start end entry_class
str i64 str f64 f64 str
"0-0-0" 0 "" 0.0 0.11 "Word"
"0-0-1" 1 "when" 0.11 2.2 "Word"
"0-0-2" 2 "the" 2.2 2.26 "Word"
"0-0-3" 3 "sunlight" 2.26 2.72 "Word"
"0-0-4" 4 "strikes" 2.72 3.22 "Word"
full_df2.tail(5)
shape: (5, 6)
id tier_index label start end entry_class
str i64 str f64 f64 str
"0-0-374-1" 1186 "R" 111.83 111.92 "Phone"
"0-0-375-0" 1187 "B" 111.92 112.02 "Phone"
"0-0-375-1" 1188 "L" 112.02 112.08 "Phone"
"0-0-375-2" 1189 "UW1" 112.08 112.31 "Phone"
"0-0-376-0" 1190 "" 112.31 115.065034 "Phone"

Saving a DataFrame

To save one of these dataframes, use one of the write methods from polars, such as DataFrame.write_csv().

full_df1.write_csv("test.csv")

Reuse

GPLv3