Outputting and saving as a DataFrame

You can output and save any aligned_textgrid object as a polars dataframe with the to_df() function.

Outputting as a data frame

import polars as pl

from aligned_textgrid import AlignedTextGrid, Word, Phone, to_df

tg = AlignedTextGrid(
    textgrid_path="../resources/josef-fruehwald_speaker.TextGrid",
    entry_classes=[Word, Phone]
)

Single Intervals

Bottom of the hierarchy

If you pass a single interval from the bottom of the sequence hierarchy, you’ll get back a fairly minimal one-row dataframe with an ID for the interval, its index within its tier, its label, and its start and end times.

one_interval = tg[0].Phone[1]
one_interval_df = to_df(one_interval)

one_interval_df

shape: (1, 5)

Phone_id	Phone_tier_index	Phone_label	Phone_start	Phone_end
str	i64	str	f64	f64
"0-0-1-0"	1	"HH"	0.11	1.97

Top of the hierarchy

If you pass to_df() an interval from higher up in the hierarchy, by default it will output its data, as well as the data for every interval below it in the hierarchy, concatenated horizontally.

word_interval = tg[0].Word[1]
word_interval_df = to_df(word_interval)

word_interval_df

shape: (4, 10)

Word_id	Word_tier_index	Word_label	Word_start	Word_end	Phone_id	Phone_tier_index	Phone_label	Phone_start	Phone_end
str	i64	str	f64	f64	str	i64	str	f64	f64
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-0"	1	"HH"	0.11	1.97
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-1"	2	"W"	1.97	2.09
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-2"	3	"EH1"	2.09	2.13
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-3"	4	"N"	2.13	2.2

However, if you want just a simplified, single row output for an interval, regardless of its location within the hierarchy, pass to_df(..., with_subset = False).

word_interval_df2 = to_df(word_interval, with_subset=False)

word_interval_df2

shape: (1, 6)

id	tier_index	label	start	end	entry_class
str	i64	str	f64	f64	str
"0-0-1"	1	"when"	0.11	2.2	"Word"

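To make the relationship between the two shapes concrete, here is a minimal pure-Python sketch of how the wide rows could be assembled from single-row records. It is an illustration based only on the outputs shown above, not the library's implementation; the `prefixed` helper and the variable names are hypothetical.

```python
# Long-format records copied from the outputs above: one dict per interval.
word = {"id": "0-0-1", "tier_index": 1, "label": "when",
        "start": 0.11, "end": 2.2, "entry_class": "Word"}
phones = [
    {"id": "0-0-1-0", "tier_index": 1, "label": "HH",
     "start": 0.11, "end": 1.97, "entry_class": "Phone"},
    {"id": "0-0-1-1", "tier_index": 2, "label": "W",
     "start": 1.97, "end": 2.09, "entry_class": "Phone"},
    {"id": "0-0-1-2", "tier_index": 3, "label": "EH1",
     "start": 2.09, "end": 2.13, "entry_class": "Phone"},
    {"id": "0-0-1-3", "tier_index": 4, "label": "N",
     "start": 2.13, "end": 2.2, "entry_class": "Phone"},
]


def prefixed(record):
    """Rename each field to '<entry_class>_<field>' and drop entry_class."""
    prefix = record["entry_class"]
    return {
        f"{prefix}_{key}": value
        for key, value in record.items()
        if key != "entry_class"
    }


# One wide row per phone: the word's columns repeat beside each phone's columns.
wide_rows = [{**prefixed(word), **prefixed(phone)} for phone in phones]

for row in wide_rows:
    print(row["Word_label"], row["Phone_label"], row["Phone_start"], row["Phone_end"])
```

The four resulting rows carry the same ten column names as word_interval_df, with the Word columns repeated on every row.
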
Tiers

If you pass a tier to to_df(), it will output a dataframe for every interval in the tier, concatenated vertically. By default, this means intervals high in the hierarchy will have their rows repeated for every interval they contain, but if you want one row per interval in the output, you can pass to_df(..., with_subset=False).

tier_df1 = to_df(tg[0].Word)

tier_df1.shape

(1191, 10)

tier_df1.head(10)

shape: (10, 10)

Word_id	Word_tier_index	Word_label	Word_start	Word_end	Phone_id	Phone_tier_index	Phone_label	Phone_start	Phone_end
str	i64	str	f64	f64	str	i64	str	f64	f64
"0-0-0"	0	""	0.0	0.11	"0-0-0-0"	0	""	0.0	0.11
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-0"	1	"HH"	0.11	1.97
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-1"	2	"W"	1.97	2.09
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-2"	3	"EH1"	2.09	2.13
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-3"	4	"N"	2.13	2.2
"0-0-2"	2	"the"	2.2	2.26	"0-0-2-0"	5	"DH"	2.2	2.22
"0-0-2"	2	"the"	2.2	2.26	"0-0-2-1"	6	"AH0"	2.22	2.26
"0-0-3"	3	"sunlight"	2.26	2.72	"0-0-3-0"	7	"S"	2.26	2.39
"0-0-3"	3	"sunlight"	2.26	2.72	"0-0-3-1"	8	"AH1"	2.39	2.44
"0-0-3"	3	"sunlight"	2.26	2.72	"0-0-3-2"	9	"N"	2.44	2.52

# 1 row per interval
tier_df2 = to_df(tg[0].Word, with_subset=False)

tier_df2.shape

(377, 6)

tier_df2.head(10)

shape: (10, 6)

id	tier_index	label	start	end	entry_class
str	i64	str	f64	f64	str
"0-0-0"	0	""	0.0	0.11	"Word"
"0-0-1"	1	"when"	0.11	2.2	"Word"
"0-0-2"	2	"the"	2.2	2.26	"Word"
"0-0-3"	3	"sunlight"	2.26	2.72	"Word"
"0-0-4"	4	"strikes"	2.72	3.22	"Word"
"0-0-5"	5	"raindrops"	3.22	3.79	"Word"
"0-0-6"	6	"in"	3.79	3.89	"Word"
"0-0-7"	7	"the"	3.89	4.02	"Word"
"0-0-8"	8	"air"	4.02	4.45	"Word"
"0-0-9"	9	""	4.45	4.61	"Word"

TierGroups and TextGrids

The behavior for TierGroups and TextGrids is similar. By default, the to_df() function returns a dataframe representing the entire hierarchy structure; with with_subset=False, it returns one row for each interval in the TextGrid.

full_df1 = to_df(tg)

full_df1.shape

(1191, 10)

full_df1.head(10)

shape: (10, 10)

Word_id	Word_tier_index	Word_label	Word_start	Word_end	Phone_id	Phone_tier_index	Phone_label	Phone_start	Phone_end
str	i64	str	f64	f64	str	i64	str	f64	f64
"0-0-0"	0	""	0.0	0.11	"0-0-0-0"	0	""	0.0	0.11
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-0"	1	"HH"	0.11	1.97
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-1"	2	"W"	1.97	2.09
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-2"	3	"EH1"	2.09	2.13
"0-0-1"	1	"when"	0.11	2.2	"0-0-1-3"	4	"N"	2.13	2.2
"0-0-2"	2	"the"	2.2	2.26	"0-0-2-0"	5	"DH"	2.2	2.22
"0-0-2"	2	"the"	2.2	2.26	"0-0-2-1"	6	"AH0"	2.22	2.26
"0-0-3"	3	"sunlight"	2.26	2.72	"0-0-3-0"	7	"S"	2.26	2.39
"0-0-3"	3	"sunlight"	2.26	2.72	"0-0-3-1"	8	"AH1"	2.39	2.44
"0-0-3"	3	"sunlight"	2.26	2.72	"0-0-3-2"	9	"N"	2.44	2.52

# 1 row per interval
full_df2 = to_df(tg, with_subset=False)

full_df2.shape

(1568, 6)

full_df2.head(5)

shape: (5, 6)

id	tier_index	label	start	end	entry_class
str	i64	str	f64	f64	str
"0-0-0"	0	""	0.0	0.11	"Word"
"0-0-1"	1	"when"	0.11	2.2	"Word"
"0-0-2"	2	"the"	2.2	2.26	"Word"
"0-0-3"	3	"sunlight"	2.26	2.72	"Word"
"0-0-4"	4	"strikes"	2.72	3.22	"Word"

full_df2.tail(5)

shape: (5, 6)

id	tier_index	label	start	end	entry_class
str	i64	str	f64	f64	str
"0-0-374-1"	1186	"R"	111.83	111.92	"Phone"
"0-0-375-0"	1187	"B"	111.92	112.02	"Phone"
"0-0-375-1"	1188	"L"	112.02	112.08	"Phone"
"0-0-375-2"	1189	"UW1"	112.08	112.31	"Phone"
"0-0-376-0"	1190	""	112.31	115.065034	"Phone"

Saving a DataFrame

To save one of these dataframes, use one of the writer methods from polars, like DataFrame.write_csv().

full_df1.write_csv("test.csv")

Reuse

GPLv3