Skip to contents

Formula versus tidyselect interface

tableone is implemented to allow a subset of columns in a large dataset to be pulled into a table without any fuss. It is also designed with a workflow in mind that involves building statistical models from the data later. We assume the data follows a general pattern in that there are one observation per row, individual columns are specific data points in those observations and may be one of:

  • outcome: something that we will be assessing in a statistical model, maybe a continuous outcome, or a time measure, or a logical measure.
  • intervention: the thing that is varied between the different observations
  • covariates: the other factors that may influence the outcome that we want to control for.

In the end we will want to construct a model that takes the following high level structure:

outcome ~ intervention + covariate_1 + covariate_2 + ... + covariate_n

Simple population description example

Before we build a model we need to firstly compare the distribution of the covariates in the population and secondly compare them in the intervention and non-intervention groups, usually done without reference to outcome. To demonstrate this we are using the survival::cgd data set.

cgd = survival::cgd %>% 
  # filter to include only the first visit
  dplyr::filter(enum==1) %>% 
  # make the steroids and propylac columns into a logical value
  # see later for a better way of doing this.
  dplyr::mutate(
    steroids = as.logical(steroids),
    propylac = as.logical(propylac)
  )
  

# A basic unstratified population description table is as follows:
cgd %>% describe_population(tidyselect::everything())
#> Warning: Unknown or uninitialised column: `level`.
Variable Characteristic Value Count (N=128)
id Median [IQR] 64.5 [32.8—96.2] 128
center Harvard Medical Sch % [95% CI] 3.1% [1.2%—7.8%] 4/128
Scripps Institute % [95% CI] 12.5% [7.8%—19.3%] 16/128
Copenhagen % [95% CI] 3.1% [1.2%—7.8%] 4/128
NIH % [95% CI] 20.3% [14.3%—28.1%] 26/128
L.A. Children's Hosp % [95% CI] 6.2% [3.2%—11.8%] 8/128
Mott Children's Hosp % [95% CI] 7.0% [3.7%—12.8%] 9/128
Univ. of Utah % [95% CI] 3.1% [1.2%—7.8%] 4/128
Univ. of Washington % [95% CI] 3.1% [1.2%—7.8%] 4/128
Univ. of Minnesota % [95% CI] 4.7% [2.2%—9.8%] 6/128
Univ. of Zurich % [95% CI] 12.5% [7.8%—19.3%] 16/128
Texas Children's Hosp % [95% CI] 6.2% [3.2%—11.8%] 8/128
Amsterdam % [95% CI] 14.8% [9.7%—22.0%] 19/128
Mt. Sinai Medical Ctr % [95% CI] 3.1% [1.2%—7.8%] 4/128
random 128
treat placebo % [95% CI] 50.8% [42.2%—59.3%] 65/128
rIFN-g % [95% CI] 49.2% [40.7%—57.8%] 63/128
sex male % [95% CI] 81.2% [73.6%—87.1%] 104/128
female % [95% CI] 18.8% [12.9%—26.4%] 24/128
age Median [IQR] 12 [7—22] 128
height Median [IQR] 141 [116—170] 128
weight Median [IQR] 34.8 [20.7—59.2] 128
inherit X-linked % [95% CI] 67.2% [58.7%—74.7%] 86/128
autosomal % [95% CI] 32.8% [25.3%—41.3%] 42/128
steroids false % [95% CI] 97.7% [93.3%—99.2%] 125/128
true % [95% CI] 2.3% [0.8%—6.7%] 3/128
propylac false % [95% CI] 13.3% [8.5%—20.2%] 17/128
true % [95% CI] 86.7% [79.8%—91.5%] 111/128
hos.cat US:NIH % [95% CI] 20.3% [14.3%—28.1%] 26/128
US:other % [95% CI] 49.2% [40.7%—57.8%] 63/128
Europe:Amsterdam % [95% CI] 14.8% [9.7%—22.0%] 19/128
Europe:other % [95% CI] 15.6% [10.3%—22.9%] 20/128
tstart Median [IQR] 0 [0—0] 128
enum Median [IQR] 1 [1—1] 128
tstop Median [IQR] 269 [197—304] 128
status Median [IQR] 0 [0—1] 128
Normal distributions determined by the Anderson-Darling test (P>0.005)

This could have been specified using the formula interface. In this example we have taken an example of the formula we might wish to use for a survival model and we reuse it to give us a more targetted descriptive table. It is also possible to supply tableone with a relabelling function that maps column names to printable labels, as demonstrated here:

# define a formula - this might be reused in model building later
formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat

# set a table relabelling function
rename_cols = function(col) {
  dplyr::case_when(
    col == "hos.cat" ~ "Location",
    col == "steroids" ~ "Steroid treatment",
    TRUE ~ stringr::str_to_sentence(col)
  )
}
options("tableone.labeller"=rename_cols)

# create a simple description
cgd %>% describe_population(formula)
Variable Characteristic Value Count (N=128)
Treat placebo % [95% CI] 50.8% [42.2%—59.3%] 65/128
rIFN-g % [95% CI] 49.2% [40.7%—57.8%] 63/128
Sex male % [95% CI] 81.2% [73.6%—87.1%] 104/128
female % [95% CI] 18.8% [12.9%—26.4%] 24/128
Age Median [IQR] 12 [7—22] 128
Height Median [IQR] 141 [116—170] 128
Weight Median [IQR] 34.8 [20.7—59.2] 128
Inherit X-linked % [95% CI] 67.2% [58.7%—74.7%] 86/128
autosomal % [95% CI] 32.8% [25.3%—41.3%] 42/128
Steroid treatment false % [95% CI] 97.7% [93.3%—99.2%] 125/128
true % [95% CI] 2.3% [0.8%—6.7%] 3/128
Location US:NIH % [95% CI] 20.3% [14.3%—28.1%] 26/128
US:other % [95% CI] 49.2% [40.7%—57.8%] 63/128
Europe:Amsterdam % [95% CI] 14.8% [9.7%—22.0%] 19/128
Europe:other % [95% CI] 15.6% [10.3%—22.9%] 20/128
Normal distributions determined by the Anderson-Darling test (P>0.005)

The relabelling function can either be passed to each invocation of tableone functions or as an option as shown here, which makes the labeller available to all subsequent calls. This is useful if you are generating many tables from a single dataset.

We will generally use the formula interface from here on but for exploration of larger datasets with more covariates the tidyselect interface may be more useful.

Comparing the population by intervention

In this example a more useful table compares the treatment groups. We can use the same formula syntax for this, but in this case the first predictor is assumed to be the intervention and the data set is compared by intervention (in this case the treat column). From this we can conclude that the population is well distributed between placebo and treatment groups and there is no major bias in the randomisation process:


# same as above
formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat

# labelling function is still active
cgd %>% compare_population(formula)
placebo rIFN-g
Variable Characteristic Value (N=65) Value (N=63) P value
Sex male % [95% CI] (n) 81.5% [70.4%—89.1%] (53) 81.0% [69.6%—88.8%] (51) 1 †
female % [95% CI] (n) 18.5% [10.9%—29.6%] (12) 19.0% [11.2%—30.4%] (12)
Age Median [IQR] 14 [7—24] 12 [7—19.5] 0.56 ††
Height Median [IQR] 143 [115—171] 139 [119—167] 0.45 †††
Weight Median [IQR] 36.1 [21.6—63.7] 34.4 [20.6—53.7] 0.4 †††
Inherit X-linked % [95% CI] (n) 63.1% [50.9%—73.8%] (41) 71.4% [59.3%—81.1%] (45) 0.35 †
autosomal % [95% CI] (n) 36.9% [26.2%—49.1%] (24) 28.6% [18.9%—40.7%] (18)
Steroid treatment false % [95% CI] (n) 96.9% [89.5%—99.2%] (63) 98.4% [91.5%—99.7%] (62) 1 †
true % [95% CI] (n) 3.1% [0.8%—10.5%] (2) 1.6% [0.3%—8.5%] (1)
Location US:NIH % [95% CI] (n) 16.9% [9.7%—27.8%] (11) 23.8% [15.0%—35.6%] (15) 0.7 †
US:other % [95% CI] (n) 49.2% [37.5%—61.1%] (32) 49.2% [37.3%—61.2%] (31)
Europe:Amsterdam % [95% CI] (n) 15.4% [8.6%—26.1%] (10) 14.3% [7.7%—25.0%] (9)
Europe:other % [95% CI] (n) 18.5% [10.9%—29.6%] (12) 12.7% [6.6%—23.1%] (8)
†, Fisher's exact test (categorical); ††, 2 sample Wilcoxon Rank Sum test (continuous); †††, 2 sample Kolmogorov-Smirnov test (continuous)
Normal distributions determined by the Anderson-Darling test (P>0.005)
An adjusted P value of 0.00714 may be considered significant.

Alternatively if we were using the tidyselect interface this alternate syntax would have given us the same table. Note that we must group the data by intervention, for the tidyselect to work as intended:

cgd %>% dplyr::group_by(treat) %>% 
  compare_population(sex,age,height,weight,inherit,steroids,hos.cat)

Analysis of missing data

We need to make sure that not only is the data equivalent between the intervention groups but also that missing data is not unevenly distributed or excessive. Reporting on the frequency of missing data stratified by intervention is also easy, and to demonstrate this we make a data set with 10% of the placebo arm having missing values, but 25% of the treatment arm:


# generate a dataset with values missing not at random compared to the intervention:
cgd_treat = cgd %>% dplyr::mutate(treat = as.character(treat)) %>% dplyr::filter(treat != "placebo")
cgd_placebo = cgd %>% dplyr::mutate(treat = as.character(treat)) %>% dplyr::filter(treat == "placebo")

set.seed(100)
mnar_cgd = dplyr::bind_rows(
  cgd_placebo %>% .make_missing(p_missing = 0.1),
  cgd_treat %>% .make_missing(p_missing = 0.25)
)

Comparing this new data set we see that there is significant differences in some of the data (but not the steroids variable). As this is quite a small dataset it is not sufficiently powered to reliably detect the difference in missingness at this level (15% difference).

# compare the MNAR dataset against the intervention:
formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat

mnar_cgd %>% compare_missing(formula)
placebo rIFN-g
variable missing % (N) missing % (N) P value
Sex 12.3% (8/65) 23.8% (15/63) 0.11
Age 4.6% (3/65) 23.8% (15/63) 0.002
Height 6.2% (4/65) 20.6% (13/63) 0.019
Weight 10.8% (7/65) 23.8% (15/63) 0.062
Inherit 12.3% (8/65) 28.6% (18/63) 0.028
Steroid treatment 13.8% (9/65) 28.6% (18/63) 0.052
Location 9.2% (6/65) 19.0% (12/63) 0.13
More than 10% of data is missing for variables Sex, Age, Height, Weight, Inherit, Steroid treatment, Location.
Data is missing not at random (compared to Treat) at a p-value<0.007 (0.05 over 7 comparisons) for variables Age.

with this analysis it is useful to be able to update the analysis formula removing the variables with missing data so that we are confident the models are based on reasonable data.


# formula can also be a list of formulae
new_formula = mnar_cgd %>% tableone::remove_missing(formula)
#> More than 10% of data is missing for variables Sex, Age, Height, Weight, Inherit, Steroid treatment, Location.
#> Data is missing not at random (compared to Treat) at a p-value<0.007 (0.05 over 7 comparisons) for variables Age.

print(new_formula)
#> [[1]]
#> Surv(tstart, tstop, status) ~ treat

Conversion of discrete data

Using this new data set with missing data it may be necessary to discretise some or all of the data, or convert logical values into properly named factors.


decade = function(x) sprintf("%d-%d",x-(x%%10),x-(x%%10)+9)

discrete_cgd = mnar_cgd %>% 
  # pick out the first episode
  dplyr::filter(enum == 1) %>%
  # convert data
  make_factors(
    steroids,propylac,age,weight,height,
    .logical = c("received","not received"),
    .numeric = list(
      age="{decade(value)}",
      weight="{ifelse(value<20,'<20','20+')}",
      height="{ifelse(value<mean(value, na.rm=TRUE),'below average','above average')}"
    )
  )

formula = Surv(tstart, tstop, status) ~ treat + 
  sex + age + height + weight + inherit + steroids + hos.cat


old = options("tableone.show_pvalue_method"=TRUE)
# This comparison implicitly ignores missing values.
t = discrete_cgd %>% compare_population(formula)
options(old)

t
placebo rIFN-g
Variable Characteristic Value (N=61) Value (N=49) P value
Sex male % [95% CI] (n) 77.8% [65.1%—86.8%] (42) 80.6% [65.0%—90.2%] (29) — †
female % [95% CI] (n) 22.2% [13.2%—34.9%] (12) 19.4% [9.8%—35.0%] (7)
Age 0-9 % [95% CI] (n) 31.0% [20.6%—43.8%] (18) 47.4% [32.5%—62.7%] (18) — ††
10-19 % [95% CI] (n) 31.0% [20.6%—43.8%] (18) 23.7% [13.0%—39.2%] (9)
20-29 % [95% CI] (n) 31.0% [20.6%—43.8%] (18) 13.2% [5.8%—27.3%] (5)
30-39 % [95% CI] (n) 6.9% [2.7%—16.4%] (4) 13.2% [5.8%—27.3%] (5)
40-49 % [95% CI] (n) 0.0% [0.0%—6.2%] (0) 2.6% [0.5%—13.5%] (1)
Height below average % [95% CI] (n) 43.9% [31.8%—56.7%] (25) 52.5% [37.5%—67.1%] (21) — ††
above average % [95% CI] (n) 56.1% [43.3%—68.2%] (32) 47.5% [32.9%—62.5%] (19)
Weight <20 % [95% CI] (n) 21.4% [12.7%—33.8%] (12) 20.5% [10.8%—35.5%] (8) — ††
20+ % [95% CI] (n) 78.6% [66.2%—87.3%] (44) 79.5% [64.5%—89.2%] (31)
Inherit X-linked % [95% CI] (n) 58.5% [45.1%—70.7%] (31) 68.4% [52.5%—80.9%] (26) — †
autosomal % [95% CI] (n) 41.5% [29.3%—54.9%] (22) 31.6% [19.1%—47.5%] (12)
Steroid treatment received % [95% CI] (n) 1.9% [0.3%—9.9%] (1) 0.0% [0.0%—10.4%] (0) — †
not received % [95% CI] (n) 98.1% [90.1%—99.7%] (52) 100.0% [89.6%—100.0%] (33)
Location US:NIH % [95% CI] (n) 19.3% [11.1%—31.3%] (11) 20.5% [10.8%—35.5%] (8) — †
US:other % [95% CI] (n) 50.9% [38.3%—63.4%] (29) 51.3% [36.2%—66.1%] (20)
Europe:Amsterdam % [95% CI] (n) 14.0% [7.3%—25.3%] (8) 12.8% [5.6%—26.7%] (5)
Europe:other % [95% CI] (n) 15.8% [8.5%—27.4%] (9) 15.4% [7.2%—29.7%] (6)
†, Not calculated due to missing values (categorical); ††, Not calculated due to missing values (ordered)
An adjusted P value of 0.00714 may be considered significant.
# N.B. The following option is involved when converting integer data
# which decides how many levels of integer data are considered discrete
# and when to decide integer data can be treated as continuous:
options("tableone.max_discrete_levels"=0)
# and is described in the documentation for make_factors().

Making missing factors explicit:

In the comparison above missing values were not included, and we should be cautious of the findings. Because of the missingness tableone will not calculate p-values. If factor values are missing (as in this case) then we can include them as a new group and get a more robust comparison which includes the distribution of missingness, and for which we can calculate a p-value. However previously ordered variables, are now regarded as unordered as we cannot determine the value of a missing level.

discrete_cgd %>% explicit_na() %>% compare_population(formula)
placebo rIFN-g
Variable Characteristic Value (N=61) Value (N=49) P value
Sex male % [95% CI] (n) 68.9% [56.4%—79.1%] (42) 59.2% [45.2%—71.8%] (29) 0.13 †
female % [95% CI] (n) 19.7% [11.6%—31.3%] (12) 14.3% [7.1%—26.7%] (7)
<missing> % [95% CI] (n) 11.5% [5.7%—21.8%] (7) 26.5% [16.2%—40.3%] (13)
Age 0-9 % [95% CI] (n) 29.5% [19.6%—41.9%] (18) 36.7% [24.7%—50.7%] (18) 0.0064 †
10-19 % [95% CI] (n) 29.5% [19.6%—41.9%] (18) 18.4% [10.0%—31.4%] (9)
20-29 % [95% CI] (n) 29.5% [19.6%—41.9%] (18) 10.2% [4.4%—21.8%] (5)
30-39 % [95% CI] (n) 6.6% [2.6%—15.7%] (4) 10.2% [4.4%—21.8%] (5)
40-49 % [95% CI] (n) 0.0% [-0.0%—5.9%] (0) 2.0% [0.4%—10.7%] (1)
<missing> % [95% CI] (n) 4.9% [1.7%—13.5%] (3) 22.4% [13.0%—35.9%] (11)
Height below average % [95% CI] (n) 41.0% [29.5%—53.5%] (25) 42.9% [30.0%—56.7%] (21) 0.12 †
above average % [95% CI] (n) 52.5% [40.2%—64.5%] (32) 38.8% [26.4%—52.8%] (19)
<missing> % [95% CI] (n) 6.6% [2.6%—15.7%] (4) 18.4% [10.0%—31.4%] (9)
Weight <20 % [95% CI] (n) 19.7% [11.6%—31.3%] (12) 16.3% [8.5%—29.0%] (8) 0.21 †
20+ % [95% CI] (n) 72.1% [59.8%—81.8%] (44) 63.3% [49.3%—75.3%] (31)
<missing> % [95% CI] (n) 8.2% [3.6%—17.8%] (5) 20.4% [11.5%—33.6%] (10)
Inherit X-linked % [95% CI] (n) 50.8% [38.6%—62.9%] (31) 53.1% [39.4%—66.3%] (26) 0.3 †
autosomal % [95% CI] (n) 36.1% [25.2%—48.6%] (22) 24.5% [14.6%—38.1%] (12)
<missing> % [95% CI] (n) 13.1% [6.8%—23.8%] (8) 22.4% [13.0%—35.9%] (11)
Steroid treatment received % [95% CI] (n) 1.6% [0.3%—8.7%] (1) 0.0% [0.0%—7.3%] (0) 0.019 †
not received % [95% CI] (n) 85.2% [74.3%—92.0%] (52) 67.3% [53.4%—78.8%] (33)
<missing> % [95% CI] (n) 13.1% [6.8%—23.8%] (8) 32.7% [21.2%—46.6%] (16)
Location US:NIH % [95% CI] (n) 18.0% [10.4%—29.5%] (11) 16.3% [8.5%—29.0%] (8) 0.33 †
US:other % [95% CI] (n) 47.5% [35.5%—59.8%] (29) 40.8% [28.2%—54.8%] (20)
Europe:Amsterdam % [95% CI] (n) 13.1% [6.8%—23.8%] (8) 10.2% [4.4%—21.8%] (5)
Europe:other % [95% CI] (n) 14.8% [8.0%—25.7%] (9) 12.2% [5.7%—24.2%] (6)
<missing> % [95% CI] (n) 6.6% [2.6%—15.7%] (4) 20.4% [11.5%—33.6%] (10)
†, Fisher's exact test (categorical)
An adjusted P value of 0.00714 may be considered significant.

Non biomedical data

Beyond the bio-medical example tableone can make any more general comparison between data that has a structure like:

~ group + observation_1 + observation_2 + ... + observation_n

We will use the iris and the diamonds datasets to demonstrate this more general use case for tableone.


# revert the labeller setting to the default
# and additionally hide the footer.
old = options(
  "tableone.labeller"=NULL,
  "tableone.show_pvalue_method"=FALSE,
  "tableone.hide_footer"=TRUE)

# the heuristics detect that Petals in the iris data set are not normally
# distributed and hence report median and IQR:
iris %>% dplyr::group_by(Species) %>% compare_population(tidyselect::everything())
setosa versicolor virginica
Variable Characteristic Value (N=50) Value (N=50) Value (N=50) P value
Sepal.Length Mean ± SD 5.01 ± 0.352 5.94 ± 0.516 6.59 ± 0.636 <0.001
Sepal.Width Mean ± SD 3.43 ± 0.379 2.77 ± 0.314 2.97 ± 0.322 <0.001
Petal.Length Median [IQR] 1.5 [1.4—1.58] 4.35 [4—4.6] 5.55 [5.1—5.88] <0.001
Petal.Width Median [IQR] 0.2 [0.2—0.3] 1.3 [1.2—1.5] 2 [1.8—2.3] <0.001

options(old)

The missing_diamonds data set which is included in this package has 10% of the values removed. This demonstrates the need for reporting the denominator.

# The counts sometimes seem redundant if there is no missing information:
# however in a data set with missing values the denominators are important:
missing_diamonds %>% describe_population(tidyselect::everything())
Variable Characteristic Value Count (N=53940)
Carat Median [IQR] 0.7 [0.4—1.04] 48682
Cut Fair % [95% CI] 3.0% [2.8%—3.2%] 1454/48553
Good % [95% CI] 9.2% [8.9%—9.5%] 4462/48553
Very Good % [95% CI] 22.3% [21.9%—22.6%] 10816/48553
Premium % [95% CI] 25.7% [25.3%—26.1%] 12460/48553
Ideal % [95% CI] 39.9% [39.4%—40.3%] 19361/48553
Color D % [95% CI] 12.5% [12.2%—12.8%] 6079/48569
E % [95% CI] 18.3% [18.0%—18.6%] 8886/48569
F % [95% CI] 17.7% [17.4%—18.1%] 8613/48569
G % [95% CI] 20.9% [20.5%—21.2%] 10137/48569
H % [95% CI] 15.4% [15.1%—15.7%] 7466/48569
I % [95% CI] 10.0% [9.8%—10.3%] 4876/48569
J % [95% CI] 5.2% [5.0%—5.4%] 2512/48569
Clarity I1 % [95% CI] 1.4% [1.3%—1.5%] 664/48527
SI2 % [95% CI] 17.0% [16.7%—17.4%] 8265/48527
SI1 % [95% CI] 24.2% [23.8%—24.6%] 11756/48527
VS2 % [95% CI] 22.7% [22.3%—23.1%] 11020/48527
VS1 % [95% CI] 15.2% [14.8%—15.5%] 7355/48527
VVS2 % [95% CI] 9.4% [9.2%—9.7%] 4570/48527
VVS1 % [95% CI] 6.8% [6.6%—7.0%] 3298/48527
IF % [95% CI] 3.3% [3.1%—3.5%] 1599/48527
Depth Median [IQR] 61.8 [61—62.5] 48584
Table Median [IQR] 57 [56—59] 48707
Price Median [IQR] 2.41e+03 [952—5.33e+03] 48675
X Median [IQR] 5.69 [4.72—6.54] 48577
Y Median [IQR] 5.71 [4.72—6.54] 48578
Z Median [IQR] 3.52 [2.91—4.03] 48559
Is_colored clear % [95% CI] 30.7% [30.3%—31.1%] 16572/53940
colored % [95% CI] 69.3% [68.9%—69.7%] 37368/53940
Normal distributions determined by the Anderson-Darling test (P>0.005)