The population description is a simple summary of the co-variates in a data set, with no reference to outcome and no comparison between interventions (although it might contain intervention rates). It reports summary statistics for continuous data and counts for categorical data.
Usage
describe_population(
  df,
  ...,
  label_fn = label_extractor(df),
  units = extract_units(df),
  override_type = list(),
  layout = "single",
  override_percent_dp = list(),
  override_real_dp = list(),
  font_size = getOption("tableone.font_size", 8),
  font = getOption("tableone.font", "Arial"),
  footer_text = NULL,
  show_binary_value = NULL,
  raw_output = FALSE
)
Arguments
- df
a dataframe of individual observations. Grouping, if present, is ignored. (N.b. if you wanted to construct multiple summary tables, a dplyr::group_map() call could be used.)
- ...
the columns of variables we wish to summarise. This can be given as a tidyselect specification (see utils::vignette("syntax", package = "tidyselect")), identifying the columns. Alternatively it can be given as a formula of the form outcome ~ intervention + covariate_1 + covariate_2 + ..., which may be more convenient if you are going on to do a model fit. In the latter format the left hand side is ignored (outcomes are not usual in this kind of table).
- label_fn
(optional) a function for mapping a co-variate column name to a printable label. This is by default a no-operation and the output table will contain the dataframe column names as labels. A simple alternative would be some form of dplyr::case_when lookup, or a string function such as stringr::str_to_sentence. (N.b. this function must be vectorised.) Any value provided here will be overridden by options("tableone.labeller" = my_label_fn), which allows global setting of the labeller.
- units
(optional) a named list of units, following a c(<colname_1> = "<unit_1>", <colname_2> = "<unit_2>", ...) format. Columns not present in this list are assumed to have no units. Units may be involved in the formatting of the summary output.
- override_type
(optional) a named list of data summary types. The default type for a column in a data set is calculated using heuristics depending on the nature of the data (categorical or continuous) and the result of normality tests. If you want to override this, the options are "subtype_count", "median_iqr", "mean_sd" and "skipped", and you specify this on a column-by-column basis with a named list (e.g. c("Petal.Width"="mean_sd")). Overriding the default does not check that the type of data is correct for the summary type, and will potentially cause errors if this is not done correctly.
- layout
(optional) various layouts are defined as default. As of this version of tableone they are "relaxed", "compact", "micro", "simple", "single", "missing". The layouts can be customised using options("tableone.format_list"=list(...)), and this is described in more detail in the vignettes.
- override_percent_dp
(optional) a named list of overrides for the default precision of formatting percentages, following a c(<colname_1> = 2, <colname_2> = 4, ...) format. Columns not present in this list will use the defaults defined in the layout. See the vignette on customisation.
- override_real_dp
(optional) a named list of overrides for the default precision of formatting real values, following a c(<colname_1> = 2, <colname_2> = 4, ...) format. Columns not present in this list will use the defaults defined in the layout. See utils::vignette("customisation", package="tableone").
- font_size
(optional) the font size for the table in points
- font
(optional) the font family for the table (which will be matched to closest on your system)
- footer_text
any text that needs to be added at the end of the table. Setting this to FALSE disables the whole footer (as does options("tableone.hide_footer"=TRUE)).
- show_binary_value
if set, the display of covariates with exactly 2 possible values is filtered to show only this value.
- raw_output
return comparison as a t1_signif dataframe rather than a formatted table
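The label_fn, units and override_type arguments all take per-column information, so a small sketch may help. The names my_label_fn, iris_labels, iris_units and iris_types below are hypothetical, chosen only for illustration, and the labeller uses a base R named-vector lookup rather than the dplyr::case_when approach mentioned above. The final call is guarded so that it only runs if describe_population() is already available in the session.

```r
# Hypothetical lookup table: column name -> printable label.
iris_labels <- c(
  Sepal.Length = "Sepal length",
  Sepal.Width  = "Sepal width",
  Petal.Length = "Petal length",
  Petal.Width  = "Petal width"
)

# A vectorised labeller: takes a character vector of column names and returns
# a character vector of labels. Names missing from the lookup are returned
# unchanged.
my_label_fn <- function(cols) {
  labels <- unname(iris_labels[cols])
  ifelse(is.na(labels), cols, labels)
}

# Named vectors in the c(<colname> = <value>, ...) format described above.
iris_units <- c(
  Sepal.Length = "cm", Sepal.Width = "cm",
  Petal.Length = "cm", Petal.Width = "cm"
)
iris_types <- c(Petal.Length = "mean_sd", Petal.Width = "mean_sd")

# Only attempted when describe_population() has been loaded.
if (exists("describe_population", mode = "function")) {
  describe_population(
    iris,
    tidyselect::everything(),
    label_fn = my_label_fn,
    units = iris_units,
    override_type = iris_types
  )
}
```

Because the labeller must be vectorised, it is worth checking it on a vector of names, including one that is not in the lookup, before passing it in.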
Examples
# the heuristics detect that Petals in the iris data set are not normally
# distributed and hence report median and IQR:
iris %>% describe_population(tidyselect::everything())
#> mean_sd summary for Sepal.Length
#> mean_sd summary for Sepal.Width
#> median_iqr summary for Petal.Length
#> median_iqr summary for Petal.Width
#> subtype_count summary for Species
#> ──────────────────────────────────────────────────────────────────────
#> Variable Characteristic Value Count (N=150)
#> ──────────────────────────────────────────────────────────────────────
#> Sepal.Length Mean ± SD 5.84 ± 0.828 150
#> ──────────────────────────────────────────────────────────────────────
#> Sepal.Width Mean ± SD 3.06 ± 0.436 150
#> ──────────────────────────────────────────────────────────────────────
#> Petal.Length Median [IQR] 4.35 [1.6—5.1] 150
#> ──────────────────────────────────────────────────────────────────────
#> Petal.Width Median [IQR] 1.3 [0.3—1.8] 150
#> ──────────────────────────────────────────────────────────────────────
#> Species setosa % [95% 33.3% 50/150
#> CI] [26.3%—41.2%]
#> versicolor % 33.3% 50/150
#> [95% CI] [26.3%—41.2%]
#> virginica % [95% 33.3% 50/150
#> CI] [26.3%—41.2%]
#> ──────────────────────────────────────────────────────────────────────
#> Normal distributions determined by the Anderson-Darling
#> test (P>0.005)
#>
#> Column names: Variable, Characteristic, 1, 2
# Overriding the heuristics is possible:
iris %>% describe_population(
  tidyselect::everything(),
  override_type = c(Petal.Length = "mean_sd", Petal.Width = "mean_sd")
)
#> mean_sd summary for Sepal.Length
#> mean_sd summary for Sepal.Width
#> mean_sd summary for Petal.Length
#> mean_sd summary for Petal.Width
#> subtype_count summary for Species
#> ──────────────────────────────────────────────────────────────────────
#> Variable Characteristic Value Count (N=150)
#> ──────────────────────────────────────────────────────────────────────
#> Sepal.Length Mean ± SD 5.84 ± 0.828 150
#> ──────────────────────────────────────────────────────────────────────
#> Sepal.Width Mean ± SD 3.06 ± 0.436 150
#> ──────────────────────────────────────────────────────────────────────
#> Petal.Length Mean ± SD 3.76 ± 1.77 150
#> ──────────────────────────────────────────────────────────────────────
#> Petal.Width Mean ± SD 1.2 ± 0.762 150
#> ──────────────────────────────────────────────────────────────────────
#> Species setosa % [95% 33.3% 50/150
#> CI] [26.3%—41.2%]
#> versicolor % 33.3% 50/150
#> [95% CI] [26.3%—41.2%]
#> virginica % [95% 33.3% 50/150
#> CI] [26.3%—41.2%]
#> ──────────────────────────────────────────────────────────────────────
#> Normal distributions determined by the Anderson-Darling
#> test (P>0.005)
#>
#> Column names: Variable, Characteristic, 1, 2
# The counts sometimes seem redundant if there is no missing information:
diamonds %>% describe_population(tidyselect::everything())
#> median_iqr summary for carat
#> subtype_count summary for cut
#> subtype_count summary for color
#> subtype_count summary for clarity
#> median_iqr summary for depth
#> median_iqr summary for table
#> median_iqr summary for price
#> median_iqr summary for x
#> median_iqr summary for y
#> median_iqr summary for z
#> subtype_count summary for is_colored
#> ──────────────────────────────────────────────────────────────────────
#> Variable Characteristic Value Count (N=53940)
#> ──────────────────────────────────────────────────────────────────────
#> carat Median [IQR] 0.7 [0.4—1.04] 53940
#> ──────────────────────────────────────────────────────────────────────
#> cut Fair % [95% CI] 3.0% [2.8%—3.1%] 1610/53940
#> Good % [95% CI] 9.1% [8.9%—9.3%] 4906/53940
#> Very Good % [95% 22.4% 12082/53940
#> CI] [22.0%—22.8%]
#> Premium % [95% 25.6% 13791/53940
#> CI] [25.2%—25.9%]
#> Ideal % [95% CI] 40.0% 21551/53940
#> [39.5%—40.4%]
#> ──────────────────────────────────────────────────────────────────────
#> color D % [95% CI] 12.6% 6775/53940
#> [12.3%—12.8%]
#> E % [95% CI] 18.2% 9797/53940
#> [17.8%—18.5%]
#> F % [95% CI] 17.7% 9542/53940
#> [17.4%—18.0%]
#> G % [95% CI] 20.9% 11292/53940
#> [20.6%—21.3%]
#> H % [95% CI] 15.4% 8304/53940
#> [15.1%—15.7%]
#> I % [95% CI] 10.1% 5422/53940
#> [9.8%—10.3%]
#> J % [95% CI] 5.2% [5.0%—5.4%] 2808/53940
#> ──────────────────────────────────────────────────────────────────────
#> clarity I1 % [95% CI] 1.4% [1.3%—1.5%] 741/53940
#> SI2 % [95% CI] 17.0% 9194/53940
#> [16.7%—17.4%]
#> SI1 % [95% CI] 24.2% 13065/53940
#> [23.9%—24.6%]
#> VS2 % [95% CI] 22.7% 12258/53940
#> [22.4%—23.1%]
#> VS1 % [95% CI] 15.1% 8171/53940
#> [14.8%—15.5%]
#> VVS2 % [95% CI] 9.4% [9.1%—9.6%] 5066/53940
#> VVS1 % [95% CI] 6.8% [6.6%—7.0%] 3655/53940
#> IF % [95% CI] 3.3% [3.2%—3.5%] 1790/53940
#> ──────────────────────────────────────────────────────────────────────
#> depth Median [IQR] 61.8 [61—62.5] 53940
#> ──────────────────────────────────────────────────────────────────────
#> table Median [IQR] 57 [56—59] 53940
#> ──────────────────────────────────────────────────────────────────────
#> price Median [IQR] 2.4e+03 53940
#> [950—5.32e+03]
#> ──────────────────────────────────────────────────────────────────────
#> x Median [IQR] 5.7 [4.71—6.54] 53940
#> ──────────────────────────────────────────────────────────────────────
#> y Median [IQR] 5.71 [4.72—6.54] 53940
#> ──────────────────────────────────────────────────────────────────────
#> z Median [IQR] 3.53 [2.91—4.04] 53940
#> ──────────────────────────────────────────────────────────────────────
#> is_colored clear % [95% CI] 30.7% 16572/53940
#> [30.3%—31.1%]
#> colored % [95% 69.3% 37368/53940
#> CI] [68.9%—69.7%]
#> ──────────────────────────────────────────────────────────────────────
#> Normal distributions determined by the Anderson-Darling
#> test (P>0.005)
#>
#> Column names: Variable, Characteristic, 1, 2
# however in a data set with missing values the denominators are important:
missing_diamonds %>% describe_population(tidyselect::everything())
#> median_iqr summary for carat
#> subtype_count summary for cut
#> subtype_count summary for color
#> subtype_count summary for clarity
#> median_iqr summary for depth
#> median_iqr summary for table
#> median_iqr summary for price
#> median_iqr summary for x
#> median_iqr summary for y
#> median_iqr summary for z
#> subtype_count summary for is_colored
#> ──────────────────────────────────────────────────────────────────────
#> Variable Characteristic Value Count (N=53940)
#> ──────────────────────────────────────────────────────────────────────
#> carat Median [IQR] 0.7 [0.4—1.04] 48682
#> ──────────────────────────────────────────────────────────────────────
#> cut Fair % [95% CI] 3.0% [2.8%—3.2%] 1454/48553
#> Good % [95% CI] 9.2% [8.9%—9.5%] 4462/48553
#> Very Good % [95% 22.3% 10816/48553
#> CI] [21.9%—22.6%]
#> Premium % [95% 25.7% 12460/48553
#> CI] [25.3%—26.1%]
#> Ideal % [95% CI] 39.9% 19361/48553
#> [39.4%—40.3%]
#> ──────────────────────────────────────────────────────────────────────
#> color D % [95% CI] 12.5% 6079/48569
#> [12.2%—12.8%]
#> E % [95% CI] 18.3% 8886/48569
#> [18.0%—18.6%]
#> F % [95% CI] 17.7% 8613/48569
#> [17.4%—18.1%]
#> G % [95% CI] 20.9% 10137/48569
#> [20.5%—21.2%]
#> H % [95% CI] 15.4% 7466/48569
#> [15.1%—15.7%]
#> I % [95% CI] 10.0% 4876/48569
#> [9.8%—10.3%]
#> J % [95% CI] 5.2% [5.0%—5.4%] 2512/48569
#> ──────────────────────────────────────────────────────────────────────
#> clarity I1 % [95% CI] 1.4% [1.3%—1.5%] 664/48527
#> SI2 % [95% CI] 17.0% 8265/48527
#> [16.7%—17.4%]
#> SI1 % [95% CI] 24.2% 11756/48527
#> [23.8%—24.6%]
#> VS2 % [95% CI] 22.7% 11020/48527
#> [22.3%—23.1%]
#> VS1 % [95% CI] 15.2% 7355/48527
#> [14.8%—15.5%]
#> VVS2 % [95% CI] 9.4% [9.2%—9.7%] 4570/48527
#> VVS1 % [95% CI] 6.8% [6.6%—7.0%] 3298/48527
#> IF % [95% CI] 3.3% [3.1%—3.5%] 1599/48527
#> ──────────────────────────────────────────────────────────────────────
#> depth Median [IQR] 61.8 [61—62.5] 48584
#> ──────────────────────────────────────────────────────────────────────
#> table Median [IQR] 57 [56—59] 48707
#> ──────────────────────────────────────────────────────────────────────
#> price Median [IQR] 2.41e+03 48675
#> [952—5.33e+03]
#> ──────────────────────────────────────────────────────────────────────
#> x Median [IQR] 5.69 [4.72—6.54] 48577
#> ──────────────────────────────────────────────────────────────────────
#> y Median [IQR] 5.71 [4.72—6.54] 48578
#> ──────────────────────────────────────────────────────────────────────
#> z Median [IQR] 3.52 [2.91—4.03] 48559
#> ──────────────────────────────────────────────────────────────────────
#> is_colored clear % [95% CI] 30.7% 16572/53940
#> [30.3%—31.1%]
#> colored % [95% 69.3% 37368/53940
#> CI] [68.9%—69.7%]
#> ──────────────────────────────────────────────────────────────────────
#> Normal distributions determined by the Anderson-Darling
#> test (P>0.005)
#>
#> Column names: Variable, Characteristic, 1, 2
# for factor levels we can make the missing values more explicit
missing_diamonds %>% explicit_na() %>%
  describe_population(tidyselect::everything())
#> median_iqr summary for carat
#> subtype_count summary for cut
#> subtype_count summary for color
#> subtype_count summary for clarity
#> median_iqr summary for depth
#> median_iqr summary for table
#> median_iqr summary for price
#> median_iqr summary for x
#> median_iqr summary for y
#> median_iqr summary for z
#> subtype_count summary for is_colored
#> ──────────────────────────────────────────────────────────────────────
#> Variable Characteristic Value Count (N=53940)
#> ──────────────────────────────────────────────────────────────────────
#> carat Median [IQR] 0.7 [0.4—1.04] 48682
#> ──────────────────────────────────────────────────────────────────────
#> cut Fair % [95% CI] 2.7% [2.6%—2.8%] 1454/53940
#> Good % [95% CI] 8.3% [8.0%—8.5%] 4462/53940
#> Very Good % [95% 20.1% 10816/53940
#> CI] [19.7%—20.4%]
#> Premium % [95% 23.1% 12460/53940
#> CI] [22.7%—23.5%]
#> Ideal % [95% CI] 35.9% 19361/53940
#> [35.5%—36.3%]
#> <missing> % [95% 10.0% 5387/53940
#> CI] [9.7%—10.2%]
#> ──────────────────────────────────────────────────────────────────────
#> color D % [95% CI] 11.3% 6079/53940
#> [11.0%—11.5%]
#> E % [95% CI] 16.5% 8886/53940
#> [16.2%—16.8%]
#> F % [95% CI] 16.0% 8613/53940
#> [15.7%—16.3%]
#> G % [95% CI] 18.8% 10137/53940
#> [18.5%—19.1%]
#> H % [95% CI] 13.8% 7466/53940
#> [13.6%—14.1%]
#> I % [95% CI] 9.0% [8.8%—9.3%] 4876/53940
#> J % [95% CI] 4.7% [4.5%—4.8%] 2512/53940
#> <missing> % [95% 10.0% 5371/53940
#> CI] [9.7%—10.2%]
#> ──────────────────────────────────────────────────────────────────────
#> clarity I1 % [95% CI] 1.2% [1.1%—1.3%] 664/53940
#> SI2 % [95% CI] 15.3% 8265/53940
#> [15.0%—15.6%]
#> SI1 % [95% CI] 21.8% 11756/53940
#> [21.4%—22.1%]
#> VS2 % [95% CI] 20.4% 11020/53940
#> [20.1%—20.8%]
#> VS1 % [95% CI] 13.6% 7355/53940
#> [13.3%—13.9%]
#> VVS2 % [95% CI] 8.5% [8.2%—8.7%] 4570/53940
#> VVS1 % [95% CI] 6.1% [5.9%—6.3%] 3298/53940
#> IF % [95% CI] 3.0% [2.8%—3.1%] 1599/53940
#> <missing> % [95% 10.0% 5413/53940
#> CI] [9.8%—10.3%]
#> ──────────────────────────────────────────────────────────────────────
#> depth Median [IQR] 61.8 [61—62.5] 48584
#> ──────────────────────────────────────────────────────────────────────
#> table Median [IQR] 57 [56—59] 48707
#> ──────────────────────────────────────────────────────────────────────
#> price Median [IQR] 2.41e+03 48675
#> [952—5.33e+03]
#> ──────────────────────────────────────────────────────────────────────
#> x Median [IQR] 5.69 [4.72—6.54] 48577
#> ──────────────────────────────────────────────────────────────────────
#> y Median [IQR] 5.71 [4.72—6.54] 48578
#> ──────────────────────────────────────────────────────────────────────
#> z Median [IQR] 3.52 [2.91—4.03] 48559
#> ──────────────────────────────────────────────────────────────────────
#> is_colored clear % [95% CI] 30.7% 16572/53940
#> [30.3%—31.1%]
#> colored % [95% 69.3% 37368/53940
#> CI] [68.9%—69.7%]
#> ──────────────────────────────────────────────────────────────────────
#> Normal distributions determined by the Anderson-Darling
#> test (P>0.005)
#>
#> Column names: Variable, Characteristic, 1, 2
# in the output above the price variable is not presented the way we would
# like, so here we override the number of decimal places shown for the price
# variable. While we are at it we will use a mid point for the decimal point,
# and make the variable labels sentence case.
old = options("tableone.dp"="\u00B7")
missing_diamonds %>%
  explicit_na() %>%
  describe_population(
    tidyselect::everything(),
    label_fn=stringr::str_to_sentence,
    override_real_dp=list(price=6)
  )
#> median_iqr summary for Carat
#> subtype_count summary for Cut
#> subtype_count summary for Color
#> subtype_count summary for Clarity
#> median_iqr summary for Depth
#> median_iqr summary for Table
#> median_iqr summary for Price
#> median_iqr summary for X
#> median_iqr summary for Y
#> median_iqr summary for Z
#> subtype_count summary for Is_colored
#> ──────────────────────────────────────────────────────────────────────
#> Variable Characteristic Value Count (N=53940)
#> ──────────────────────────────────────────────────────────────────────
#> Carat Median [IQR] 0·7 [0·4—1·04] 48682
#> ──────────────────────────────────────────────────────────────────────
#> Cut Fair % [95% CI] 2·7% [2·6%—2·8%] 1454/53940
#> Good % [95% CI] 8·3% [8·0%—8·5%] 4462/53940
#> Very Good % [95% 20·1% 10816/53940
#> CI] [19·7%—20·4%]
#> Premium % [95% 23·1% 12460/53940
#> CI] [22·7%—23·5%]
#> Ideal % [95% CI] 35·9% 19361/53940
#> [35·5%—36·3%]
#> <missing> % [95% 10·0% 5387/53940
#> CI] [9·7%—10·2%]
#> ──────────────────────────────────────────────────────────────────────
#> Color D % [95% CI] 11·3% 6079/53940
#> [11·0%—11·5%]
#> E % [95% CI] 16·5% 8886/53940
#> [16·2%—16·8%]
#> F % [95% CI] 16·0% 8613/53940
#> [15·7%—16·3%]
#> G % [95% CI] 18·8% 10137/53940
#> [18·5%—19·1%]
#> H % [95% CI] 13·8% 7466/53940
#> [13·6%—14·1%]
#> I % [95% CI] 9·0% [8·8%—9·3%] 4876/53940
#> J % [95% CI] 4·7% [4·5%—4·8%] 2512/53940
#> <missing> % [95% 10·0% 5371/53940
#> CI] [9·7%—10·2%]
#> ──────────────────────────────────────────────────────────────────────
#> Clarity I1 % [95% CI] 1·2% [1·1%—1·3%] 664/53940
#> SI2 % [95% CI] 15·3% 8265/53940
#> [15·0%—15·6%]
#> SI1 % [95% CI] 21·8% 11756/53940
#> [21·4%—22·1%]
#> VS2 % [95% CI] 20·4% 11020/53940
#> [20·1%—20·8%]
#> VS1 % [95% CI] 13·6% 7355/53940
#> [13·3%—13·9%]
#> VVS2 % [95% CI] 8·5% [8·2%—8·7%] 4570/53940
#> VVS1 % [95% CI] 6·1% [5·9%—6·3%] 3298/53940
#> IF % [95% CI] 3·0% [2·8%—3·1%] 1599/53940
#> <missing> % [95% 10·0% 5413/53940
#> CI] [9·8%—10·3%]
#> ──────────────────────────────────────────────────────────────────────
#> Depth Median [IQR] 61·8 [61—62·5] 48584
#> ──────────────────────────────────────────────────────────────────────
#> Table Median [IQR] 57 [56—59] 48707
#> ──────────────────────────────────────────────────────────────────────
#> Price Median [IQR] 2407 [952—5330] 48675
#> ──────────────────────────────────────────────────────────────────────
#> X Median [IQR] 5·69 [4·72—6·54] 48577
#> ──────────────────────────────────────────────────────────────────────
#> Y Median [IQR] 5·71 [4·72—6·54] 48578
#> ──────────────────────────────────────────────────────────────────────
#> Z Median [IQR] 3·52 [2·91—4·03] 48559
#> ──────────────────────────────────────────────────────────────────────
#> Is_colored clear % [95% CI] 30·7% 16572/53940
#> [30·3%—31·1%]
#> colored % [95% 69·3% 37368/53940
#> CI] [68·9%—69·7%]
#> ──────────────────────────────────────────────────────────────────────
#> Normal distributions determined by the Anderson-Darling
#> test (P>0.005)
#>
#> Column names: Variable, Characteristic, 1, 2
options(old)
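Since grouping on df is ignored, producing one table per group means calling the function once per group. The sketch below is one possible way to do that in base R, using split() and lapply() as a stand-in for the dplyr::group_map() call suggested in the Arguments section. The helper tables_by_group is hypothetical and not part of the package. The describe_population() call is guarded so it only runs when the function is available.

```r
# Hypothetical helper: split a data frame by one column and apply a
# table-building function to each piece. The grouping column is dropped from
# each piece so it is not summarised as a constant.
tables_by_group <- function(df, group_col, table_fn) {
  keep <- setdiff(names(df), group_col)
  pieces <- split(df[keep], df[[group_col]], drop = TRUE)
  lapply(pieces, table_fn)
}

# Sanity check with a trivial function: number of rows per species.
rows_per_species <- tables_by_group(iris, "Species", nrow)
print(rows_per_species)

# One population description per species, only if the function is loaded.
if (exists("describe_population", mode = "function")) {
  species_tables <- tables_by_group(
    iris, "Species",
    function(d) describe_population(d, tidyselect::everything())
  )
}
```

The result is a list named by group level, so an individual table can be picked out by name, for example species_tables[["setosa"]].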