Nesting & list columns
interfacer is designed to work with list columns, as generated by purrr. purrr-style list columns may contain any arbitrary data type within a list. Consider, for example, the following complex dataframe, which includes a regular factor column, a nested dataframe as a list column, a nested S3 lm object as a list column, and a nested matrix as a list column:
tmp = iris %>%
  tidyr::nest(by_species = -Species) %>%
  dplyr::mutate(
    model = purrr::map(by_species, ~ stats::lm(Sepal.Length ~ Sepal.Width, .x)),
    quantiles = purrr::map(by_species, ~ sapply(.x, quantile))
  )
tmp %>% dplyr::glimpse()
#> Rows: 3
#> Columns: 4
#> $ Species <fct> setosa, versicolor, virginica
#> $ by_species <list> [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>]
#> $ model <list> [2.6390012, 0.6904897, 0.04428474, 0.18952960, -0.14856834,…
#> $ quantiles <list> <<matrix[5 x 4]>>, <<matrix[5 x 4]>>, <<matrix[5 x 4]>>
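To make the point about arbitrary element types concrete, here is a minimal sketch in base R only. The small dataframe `df` and its column names are made up for illustration and are not part of interfacer or of the example above. It builds three list columns by hand and reports the class of the first element of each.

```r
# Illustrative data only: a two-row data frame with three list columns.
df <- data.frame(group = c("a", "b"))

# A list column of nested data frames.
df$data <- list(data.frame(x = 1:3), data.frame(x = 4:6))

# A list column of S3 `lm` objects, one intercept-only model per nested frame.
df$model <- lapply(df$data, function(d) stats::lm(x ~ 1, data = d))

# A list column of matrices.
df$mat <- lapply(df$data, as.matrix)

# Class of the first element of every list column (the `group` column is dropped).
element_classes <- sapply(df[-1], function(col) class(col[[1]])[1])
print(element_classes)
```

Each list column reports a different element class (a dataframe, an `lm` object, and a matrix), which is the same kind of heterogeneity the `tmp` dataframe above carries in its `by_species`, `model`, and `quantiles` columns.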
interfacer can be used to both represent and validate this data structure. Here the initial specifications were generated using iclip(tmp) and then modified by hand:
# Pasted from `iclip(tmp)` with minor modification:
i_tmp = interfacer::iface(
  Species = enum(`setosa`,`versicolor`,`virginica`) ~ "the Species column",
  by_species = list(i_by_species) ~ "the by_species column",
  model = list(of_type(lm)) ~ "the model column",
  quantiles = list(matrix) ~ "the quantiles column",
  .groups = NULL
)
i_by_species = interfacer::iface(
  Sepal.Length = numeric ~ "the Sepal.Length column",
  Sepal.Width = numeric ~ "the Sepal.Width column",
  Petal.Length = numeric ~ "the Petal.Length column",
  Petal.Width = numeric ~ "the Petal.Width column",
  .groups = NULL
)
We can then test that the input matches this specification:
tmp %>% iconvert(i_tmp) %>% dplyr::glimpse()
#> Rows: 3
#> Columns: 4
#> $ Species <fct> setosa, versicolor, virginica
#> $ by_species <list> [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>], [<tbl_df[50 x 4]>]
#> $ model <list> [2.6390012, 0.6904897, 0.04428474, 0.18952960, -0.14856834,…
#> $ quantiles <list> <<matrix[5 x 4]>>, <<matrix[5 x 4]>>, <<matrix[5 x 4]>>
Such specifications can be used for validation or for controlling function dispatch. However, validating nested dataframes is potentially computationally expensive, because every nested dataframe must be validated in full. This can create a high overhead when there are a large number of small nested dataframes.
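The scaling argument can be illustrated with a rough sketch in base R. The helper `check_frame()` and the `spec` list below are hypothetical stand-ins written for this example; they are not interfacer's implementation. The sketch only counts how many column checks a naive validator performs on one flat dataframe compared with the same data split into nested pieces.

```r
# Hypothetical per-frame check (NOT interfacer's implementation).
# Verifies required columns and their types, and returns the number of
# column checks it performed.
check_frame <- function(d, spec) {
  missing <- setdiff(names(spec), names(d))
  if (length(missing) > 0) {
    stop("missing columns: ", paste(missing, collapse = ", "))
  }
  ok <- vapply(names(spec), function(col) spec[[col]](d[[col]]), logical(1))
  if (!all(ok)) {
    stop("wrong type in columns: ", paste(names(spec)[!ok], collapse = ", "))
  }
  length(ok)
}

# Assumed specification: each column name maps to a type predicate.
spec <- list(
  Sepal.Length = is.numeric,
  Sepal.Width = is.numeric,
  Petal.Length = is.numeric,
  Petal.Width = is.numeric
)

# One pass over the flat data frame.
flat_checks <- check_frame(iris, spec)

# One pass per nested data frame (three species, so three passes).
nested <- split(iris[, 1:4], iris$Species)
nested_checks <- sum(vapply(nested, check_frame, integer(1), spec = spec))

print(c(flat = flat_checks, nested = nested_checks))
```

The number of checks grows with the number of nested dataframes rather than with the total number of rows, which is why many small nested dataframes are the costly case.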
A second example of a nested list column, using the diamonds dataframe, demonstrates this overhead: 276 nested dataframes need to be validated individually, which takes about 3 seconds on my machine.
i_diamonds_cat = interfacer::iface(
  cut = enum(`Fair`,`Good`,`Very Good`,`Premium`,`Ideal`, .ordered=TRUE) ~ "the cut column",
  color = enum(`D`,`E`,`F`,`G`,`H`,`I`,`J`, .ordered=TRUE) ~ "the color column",
  clarity = enum(`I1`,`SI2`,`SI1`,`VS2`,`VS1`,`VVS2`,`VVS1`,`IF`, .ordered=TRUE) ~ "the clarity column",
  data = list(i_diamonds_data) ~ "A nested data column must be specified as a list",
  .groups = FALSE
)
i_diamonds_data = interfacer::iface(
  carat = numeric ~ "the carat column",
  depth = numeric ~ "the depth column",
  table = numeric ~ "the table column",
  price = integer ~ "the price column",
  x = numeric ~ "the x column",
  y = numeric ~ "the y column",
  z = numeric ~ "the z column",
  .groups = FALSE
)
nested_diamonds = ggplot2::diamonds %>%
  tidyr::nest(data = c(-cut,-color,-clarity))
system.time(
  nested_diamonds %>%
    iconvert(i_diamonds_cat) %>%
    dplyr::glimpse()
)
#> Rows: 276
#> Columns: 4
#> $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
#> $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, I, E, G,…
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
#> $ data <list> [<tbl_df[469 x 7]>], [<tbl_df[614 x 7]>], [<tbl_df[89 x 7]>],…
#> user system elapsed
#> 2.948 0.006 2.954
Errors in the validation of nested columns are bubbled up to the top level. In this example the price column is removed before nesting, so validation of the nested data column fails:
try(
  ggplot2::diamonds %>%
    dplyr::select(-price) %>%
    tidyr::nest(data = c(-cut,-color,-clarity)) %>%
    iconvert(i_diamonds_cat) %>%
    dplyr::glimpse()
)
#> Error : input column `data` in function parameter `<unknown>(<unknown> = ?)` cannot be coerced to a list(i_diamonds_data): nested dataframe problem - missing columns: price
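The general idea of bubbling a nested error up with added context can be sketched in base R with `tryCatch()`. The functions `validate_inner()` and `validate_list_column()` are hypothetical helpers written for this illustration, and the message format only loosely imitates the one shown above. This is an assumption about the pattern, not a description of how interfacer is implemented.

```r
# Hypothetical validator for a single nested data frame (not part of interfacer).
validate_inner <- function(d, required) {
  missing <- setdiff(required, names(d))
  if (length(missing) > 0) {
    stop("missing columns: ", paste(missing, collapse = ", "), call. = FALSE)
  }
  d
}

# Hypothetical outer validator: checks every element of a list column and
# re-raises any inner error with the name of the offending column prepended.
validate_list_column <- function(df, column, required) {
  for (d in df[[column]]) {
    tryCatch(
      validate_inner(d, required),
      error = function(e) {
        stop(
          sprintf(
            "input column `%s`: nested dataframe problem - %s",
            column, conditionMessage(e)
          ),
          call. = FALSE
        )
      }
    )
  }
  invisible(df)
}

# Illustrative data: nested frames that lack a `price` column.
outer <- data.frame(id = 1:2)
outer$data <- list(data.frame(carat = 0.3), data.frame(carat = 0.5))

# Capture the top-level message instead of stopping the script.
msg <- tryCatch(
  {
    validate_list_column(outer, "data", c("carat", "price"))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The caller sees a single top-level error that names both the list column and the underlying problem, so the nested structure does not need to be unpacked by hand to find out which requirement failed.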