Analyzing the Canadian Community Health Survey (CCHS) 2015 data with R: mean diet quality score

16 minute read


In a previous post, I showed how to account for the Canadian Community Health Survey (CCHS) complex survey design for a simple analysis in R. However, some nutrition analyses require multiple steps that are not built into statistical software. For example, the recommended approach to estimate a (mean) diet quality score based on 24-hour dietary recall data, the main dietary assessment instrument in surveys, is the population ratio method (Freedman et al. 2008).

The goal of this post is to provide R code showing how to account for the CCHS 2015 design to calculate a diet quality score using the population ratio method. Note that data for the CCHS 2015 - Nutrition (Public Use Micro Datafile or PUMF) are publicly available.

Population ratio vs. simple scoring

Briefly, to derive a diet quality score using the population ratio method, dietary intakes and ratios of intakes are first estimated at the population or group level and then scored. For example, to score the percentage of total energy intake (%E) from saturated fats (SFA):

  1. we calculate the mean calories from saturated fats ($\mu_{SFA}$) and the mean total energy intake in calories ($\mu_{E}$) in the overall population or group of interest.
  2. we divide the mean calories from saturated fats by the mean total energy intake: $\frac{\mu_{SFA}}{\mu_{E}}$.
  3. we apply the scoring algorithm, e.g., $f(\frac{\mu_{SFA}}{\mu_{E}})$, to derive the diet quality score.

Alternatively, one could use a simple scoring method. In a simple scoring method, the scoring algorithm is applied directly to each respondent’s dietary intakes, $f(\frac{SFA_i}{E_i})$. Each respondent then has a score, and the mean of these scores is used to estimate the mean diet quality score of the population or group: $\mu_{f(\frac{SFA_i}{E_i})}$. Unfortunately, the simple scoring method is suboptimal, since it assumes that each respondent’s reported intakes reflect usual dietary intakes, i.e., a long-term average. This is not the case when only one 24-h dietary recall per respondent is available, because of random errors (more details here). Of note, the main issue in this example is that the scoring is applied to the dietary intakes, and the “input” matters because scoring truncates the distribution of intakes at certain cut-points. In contrast, the population ratio method scores usual dietary intakes directly, because mean intakes at the group level do reflect usual dietary intakes.

Example summary

Mean score according to the population ratio method:

\[\mu_{Score} = f(x) = f(\frac{\mu_{SFA}}{\mu_{E}})\]

Mean score according to the simple scoring method:

\[\mu_{Score} = \mu_{f(x)} = \mu_{f(\frac{SFA_i}{E_i})}\]

To obtain a mean diet quality score, the population ratio method is preferable, since the scoring is applied to intakes that reflect long-term (usual) intakes rather than intakes on a given day.
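To make the distinction concrete, below is a minimal toy sketch in base R. The scoring rule (score_sfa), its cut-points and the intake values are all hypothetical (loosely inspired by a saturated fat component) and are not part of the HEFI-2019 algorithm.

# ********************************************** #
#  Toy example: population ratio vs simple score  #
# ********************************************** #

  # hypothetical scoring rule (illustration only): 10 points when %E from SFA <= 10,
  # 0 points when %E from SFA >= 20, and linear in between
  score_sfa <- function(pct_sfa) pmin(pmax(20 - pct_sfa, 0), 10)

  # hypothetical single-day intakes for 4 respondents
  sfa_kcal    <- c(150, 300, 250, 450)     # calories from saturated fats
  energy_kcal <- c(2000, 1800, 2500, 2200) # total energy intake, calories

  # population ratio method: score the ratio of mean intakes (approx. 6.5 points)
  score_sfa(100 * mean(sfa_kcal) / mean(energy_kcal))

  # simple scoring method: score each respondent, then average (approx. 5.8 points)
  mean(score_sfa(100 * sfa_kcal / energy_kcal))

The two methods differ because the simple scoring truncates individual ratios at the cut-points before averaging.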

Data preparation and overview

For demonstration purposes, I show how to calculate Healthy Eating Food Index (HEFI)-2019 scores according to smoking status (smokers vs. non-smokers, as well as the score difference between the two groups). The HEFI-2019 is a continuous score that reflects the degree of adherence to Canada’s Food Guide 2019 recommendations on healthy food choices (Brassard et al. 2022a, b). The data used for the present example are already “pre-processed”. Complete details are shown in the HEFI-2019 GitHub repository.

# ********************************************** #
#             Load required packages             #
# ********************************************** #

library(tidyverse)  # includes dplyr, tidyr, purrr and ggplot2
library(data.table) # fast reading of the bootstrap weight file
library(parallel)
library(furrr)      # optional parallel alternative to purrr
library(haven)

# HEFI-2019 scoring algorithm
  #if not installed, run: devtools::install_github("didierbrassard/hefi2019")
  library(hefi2019)

# ********************************************** #
#           Load and prepare CCHS data           #
# ********************************************** #

# 1) Load processed data: dietary intakes + sociodemographic characteristics
  load(file.path("post_data","intake_per24hr.rdata"))
  load(file.path("post_data","hs_nci.rdata"))
  
# 2) Combine dietary intakes of the first recall with sociodemographic data
  
  intake_and_sociodeom <-
    inner_join(intake_per24hr,hs_nci|>select(ADM_RNO,SUPPID,WTS_P,drig,sex,age,smoking)) |>
    # keep first 24-h recall only, with non-zero energy intake, for respondents aged 19 y+
    filter(energy>0 & SUPPID==1 & age>=19) |>
    mutate(
      # recode smoking as binary: collapse daily (1) and occasional (2) smokers into 1 = smoker
      smoking = ifelse(smoking %in% c(1,2),1,smoking)
    )
  
  #note: sample size of respondents 2y+ for first 24-h recall = 20,103

# 3) Load and prepare bootstrap replicate weights
  
  # Note: a hardcoded path (<external_drive>) is used since the files are on an external drive.
  # Usually we would want the analysis to be self-contained, but the survey files are a bit large.
  # Also, the data are set up exactly as obtained from Statistics Canada.
  
  bsw <-
    data.table::fread(file.path(external_drive,"CCHS_Nutrition_2015_PUMF","Bootstrap",
                                "Data_Donnee","b5.txt"),
                      stringsAsFactors = F, data.table= F, header=F,
                      col.names = c("ADM_RNO","WTS_P",paste0("BSW",seq(1,500)))) |>
    # keep only respondents in <intake_and_sociodeom>
    right_join(intake_and_sociodeom[,"ADM_RNO"]) |>
    # rename sampling weights
    rename(BSW0=WTS_P)
 
 # cchs_data overview
 head(intake_and_sociodeom)
## # A tibble: 6 × 24
##   ADM_RNO SUPPID    wg    rg  pfab  pfpb otherfoods    vf water otherbevs  milk
##     <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1       1      1 1.05   0.9  1.20   5.84      45.1  0     1917.        0    0  
## 2       2      1 0      1.24 4.45   0         12.3  2.21     0      2125.   0  
## 3       3      1 1.19   0    3.27   0          2.40 0.494  813.        0  122. 
## 4       4      1 0.477  1.45 1.08   0          3.27 2.11  1873.      685.  11.4
## 5       5      1 0      0.96 1.74   6.45      74.1  6.96  2226.      391.  61.0
## 6       6      1 0      1.07 0.540  0          1.24 4.04  4129.      513.   0  
## # … with 13 more variables: plantbev <dbl>, freesugars <dbl>, energy <dbl>,
## #   sfa <dbl>, mufa <dbl>, pufa <dbl>, sodium <dbl>, nonzero_energy <dbl>,
## #   WTS_P <dbl>, drig <dbl>, sex <dbl>, age <dbl>, smoking <dbl>
 # bootstrap weight overview
 head(bsw[,1:6])
##   ADM_RNO BSW0 BSW1 BSW2 BSW3 BSW4
## 1       1  859   31   33 2234   27
## 2       2  901  539   62   64   60
## 3       3 4431   41 5898 5498 8455
## 4       4  539   50   33   42 1941
## 5       5  703   50 1812   64   60
## 6       6   42  312  188  144  225
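
Before writing the analysis function, a quick sanity check can confirm that the bootstrap file lines up with the analytic sample. The snippet below is a minimal sketch: it only verifies that the row counts match and that no replicate weight is missing.

 # quick sanity check: one row of replicate weights per respondent kept above, no missing weights
 nrow(bsw) == nrow(intake_and_sociodeom)
 anyNA(bsw)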

‘Population ratio’ function

Typically, we would use a built-in function or package (e.g., the survey package in R) to obtain estimates and confidence intervals. However, there is rarely a built-in function for analyses that require multiple steps. In the present case, there are different approaches to estimate the variance of population ratio estimates with survey data. Perhaps the simplest approach - although not the most computationally efficient - is to iteratively loop the analysis through all bootstrap replicate weights. In other words, we calculate mean estimates for each bootstrap replicate weight and then calculate confidence intervals based on the (bootstrap) sampling distribution.

Since we will repeat the analysis 500+1 times (once for each bootstrap replicate weight plus once for the original sampling weights), we create a function that generates the desired statistics. This process effectively accounts for the complex survey design and allows us to derive means and differences.

# ********************************************** #
#      Function of mean, score, difference       #
# ********************************************** #

  # note: 'ADM_RNO', 'smoking' and  'hefi2019_vars' are hardcoded in this function and
  # should be modified if used for a different analysis.

  mean_score_diff <-
    function(bsw_number,indata,inbsw,bsw_suffix="BSW"){
    # Parameters:
      # bsw_number = number from 0 to 500 where 0 = original estimate and 1, 2, 3 ... 500 = bsw
      # indata = input data set
      # inbsw  = data set with bootstrap replicate weight
      # bsw_suffix = common suffix to all variables representing sampling and bootstrap weights
    
    # 1) Create weights variable
      weights <- paste0(bsw_suffix,bsw_number)

    # 2) calculate MEAN
      suppressMessages(
      estimate <-
        # combine weight value with indata
        dplyr::left_join(indata,inbsw |> select(ADM_RNO,!!weights)) |>
        # rename weights for use with weighted.mean
        dplyr::rename(CURRENT_BSW= !!weights ) |>
        # remove respondents with missing smoking status
        dplyr::filter(!is.na(smoking)) |>
        # group by smoking status
        dplyr::group_by(smoking) |>
        # calculate weighted mean
        dplyr::summarise(
          across(.cols=all_of(hefi2019_vars),
                 function(x) weighted.mean(x,w=CURRENT_BSW),
                 .names ="{col}_MEAN" )
          # note: suffix <_MEAN> added for labeling population-level values (vs. respondent-level)
          ) |>
      # 3) SCORE: apply the HEFI-2019 scoring algorithm to the population-level mean intakes
        hefi2019::hefi2019(#indata             = .,
          vegfruits          = vf_MEAN,
          wholegrfoods       = wg_MEAN,
          nonwholegrfoods    = rg_MEAN,
          profoodsanimal     = pfab_MEAN,
          profoodsplant      = pfpb_MEAN,
          otherfoods         = otherfoods_MEAN,
          waterhealthybev    = water_MEAN,
          unsweetmilk        = milk_MEAN,
          unsweetplantbevpro = plantbev_MEAN,
          otherbeverages     = otherbevs_MEAN ,
          mufat              = mufa_MEAN ,
          pufat              = pufa_MEAN ,
          satfat             = sfa_MEAN ,
          freesugars         = freesugars_MEAN,
          sodium             = sodium_MEAN,
          energy             = energy_MEAN)
      ) # end of suppress message

      # 4) DIFFERENCE: calculate difference in smokers vs. non-smokers
      estimate_diff <-
        estimate |>
        dplyr::select(smoking,starts_with("HEFI")) |>
        tidyr::pivot_longer(
          cols = starts_with("HEFI"),
          names_to = "HEFI2019_components"
        ) |>
        tidyr::pivot_wider(
          names_from = "smoking",
          names_prefix = "SMK_"
        ) |>
        dplyr::mutate(
          DIFF = SMK_1 - SMK_0,
          # add BSW id
          replicate = bsw_number
        )
      # 5) remove temporary objects and return final data with all statistics
      rm(estimate)
      rm(weights)
      return(estimate_diff)
    }

Calculate estimates with sampling weights

It is often best to verify that everything works as intended before running lengthy bootstrap analyses. For this test, we estimate HEFI-2019 scores based on the sampling weights alone (i.e., WTS_P, renamed BSW0), without variance estimation.

# ********************************************** #
#        Run <mean_score_diff> with BSW0         #
# ********************************************** #

# 1) Define vectors of HEFI-2019 dietary constituents

  hefi2019_vars <- names(intake_per24hr[,3:(ncol(intake_per24hr)-1)])

  hefi2019_vars
##  [1] "wg"         "rg"         "pfab"       "pfpb"       "otherfoods"
##  [6] "vf"         "water"      "otherbevs"  "milk"       "plantbev"  
## [11] "freesugars" "energy"     "sfa"        "mufa"       "pufa"      
## [16] "sodium"
  # note: this works since the <intake_per24hr> was specifically created to calculate HEFI-2019 scores.
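  # (optional sanity check; assumption for this post: every dietary constituent listed in
  #  <hefi2019_vars> should also be present in <intake_and_sociodeom>)
  stopifnot(all(hefi2019_vars %in% names(intake_and_sociodeom)))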
  
# 2) Apply function 

 popratio_hefi2019_bsw0 <-  
    mean_score_diff(bsw_number = 0,
              indata = intake_and_sociodeom,
              inbsw  = bsw)
 
 # 3) Show output
 
 popratio_hefi2019_bsw0
## # A tibble: 11 × 5
##    HEFI2019_components   SMK_0  SMK_1   DIFF replicate
##    <chr>                 <dbl>  <dbl>  <dbl>     <dbl>
##  1 HEFI2019C1_VF         10.1   7.85  -2.25          0
##  2 HEFI2019C2_WHOLEGR     1.24  0.832 -0.412         0
##  3 HEFI2019C3_GRRATIO     1.53  1.08  -0.452         0
##  4 HEFI2019C4_PROFOODS    5     4.90  -0.101         0
##  5 HEFI2019C5_PLANTPRO    1.75  1.31  -0.444         0
##  6 HEFI2019C6_BEVERAGES   7.90  7.16  -0.746         0
##  7 HEFI2019C7_FATTYACID   2.38  2.16  -0.225         0
##  8 HEFI2019C8_SFAT        4.19  3.78  -0.404         0
##  9 HEFI2019C9_FREESUGARS  8.99  7.12  -1.87          0
## 10 HEFI2019C10_SODIUM     5.01  4.72  -0.292         0
## 11 HEFI2019_TOTAL_SCORE  48.1  40.9   -7.21          0

The population ratio analysis shows that smokers and non-smokers have HEFI-2019 scores of 40.9 and 48.1 points (out of 80), respectively, which corresponds to a 7.2-point lower score in smokers.
In other words, smokers have lower adherence to Canada’s Food Guide 2019 recommendations on healthy food choices. This is expected, since non-smokers tend to have a higher propensity for health-seeking behaviours, including better overall diet quality and, hence, greater adherence to Canada’s Food Guide 2019 recommendations.

Calculate estimates of bootstrap replicate weights

A proper analysis requires an estimate of the variability associated with the difference estimated above. The code below repeats the analysis across all bootstrap replicate weights. For convenience, I use the purrr package to loop the analysis and output a single data frame including all results.

# ********************************************** #
#   Run <mean_score_diff> with BSW0 to BSW500    #
# ********************************************** #

# 1) Call the analysis for BSW0 to BSW500 with <seq(0,500)> and the <mean_score_diff> function
  tictoc::tic()
  
  hefi2019_smk <-
    purrr::map_dfr(seq(0,500), mean_score_diff,
                   indata = intake_and_sociodeom,
                   inbsw  = bsw)
  tictoc::toc()
## 34.634 sec elapsed
# 2) Save output for further analyses

  # note: useful to save the output of a bootstrap analysis as it can take some time to complete
  
  save(hefi2019_smk,
       file = file.path("post_data","hefi2019_smk_bootstrap.rdata"))
  
# 3) Show 10 estimates of total score output
  hefi2019_smk |> filter(HEFI2019_components=="HEFI2019_TOTAL_SCORE") |> head(n=10)
## # A tibble: 10 × 5
##    HEFI2019_components  SMK_0 SMK_1  DIFF replicate
##    <chr>                <dbl> <dbl> <dbl>     <int>
##  1 HEFI2019_TOTAL_SCORE  48.1  40.9 -7.21         0
##  2 HEFI2019_TOTAL_SCORE  48.3  41.6 -6.65         1
##  3 HEFI2019_TOTAL_SCORE  48.6  40.2 -8.40         2
##  4 HEFI2019_TOTAL_SCORE  48.2  40.8 -7.42         3
##  5 HEFI2019_TOTAL_SCORE  48.7  40.7 -7.97         4
##  6 HEFI2019_TOTAL_SCORE  48.7  40.3 -8.43         5
##  7 HEFI2019_TOTAL_SCORE  47.9  40.5 -7.42         6
##  8 HEFI2019_TOTAL_SCORE  48.6  41.5 -7.03         7
##  9 HEFI2019_TOTAL_SCORE  48.6  42.0 -6.54         8
## 10 HEFI2019_TOTAL_SCORE  48.3  41.1 -7.26         9
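
Of note, since furrr was loaded earlier, the same loop could be run in parallel. The version below is a sketch only: the number of workers is arbitrary, and results should match the sequential purrr call.

# ********************************************** #
#    Optional: parallel version with <furrr>     #
# ********************************************** #

  # set up parallel workers (arbitrary number), run the same loop, then reset the plan
  future::plan(future::multisession, workers = 4)

  hefi2019_smk <-
    furrr::future_map_dfr(seq(0,500), mean_score_diff,
                          indata = intake_and_sociodeom,
                          inbsw  = bsw,
                          .options = furrr::furrr_options(seed = TRUE))

  future::plan(future::sequential)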

Calculate confidence intervals

Having all bootstrap estimates, we can now calculate confidence intervals directly, e.g., with alpha = 0.05 for a 95% confidence interval. Again, different options are available to calculate confidence intervals, including the normal approximation and the percentile method. For simplicity, the normal approximation method is shown below.

# ********************************************** #
#          Calculate bootstrap variance          #
# ********************************************** #

# 1) Separate 'original' sample estimates from 'bootstrap' estimates

  hefi2019_smk0 <- 
    hefi2019_smk |> 
    filter(replicate==0)

  hefi2019_smkbs <-
    hefi2019_smk |>
    filter(replicate>0)

# 2) Check normality of the bootstrap estimates - recommended when using the normal approximation

  hefi2019_smkbs |>
    # note: I focus on difference for this example
    ggplot(aes(x=DIFF)) +
    geom_density(fill="gray",alpha=0.3) +
    labs(title="Bootstrap estimates distribution for HEFI-2019 score difference (CCHS 2015 - Nutrition)",
         subtitle = "Should be approximately normal to use normal approximation",
         y="Density",x="Difference, points") +
    facet_wrap(~HEFI2019_components,scales="free") +
    theme_bw() +
    theme( panel.grid.major.y = element_blank(),
           legend.title = element_blank(),
           legend.position = "top")

# 3) Calculate standard deviation of the sampling distribution (i.e., bootstrap Standard Error)

  hefi2019_smkbs_se <-
    hefi2019_smkbs |>
    group_by(HEFI2019_components) |>
    summarise(
      across(c("SMK_0","SMK_1","DIFF"), function(x) sd(x))
    )

# 4) Merge original sample estimates with bootstrap Standard Error, calculate 95CI

  # 4.1) Transpose base estimates (wide->long)
  hefi2019_smk0_long <-
    hefi2019_smk0 |>
    select(-replicate) |>
    pivot_longer(cols=c("SMK_0","SMK_1","DIFF"),
                 values_to="estimate")

  # 4.2) Transpose bootstraps Standard Error (wide->long)
  hefi2019_smkbs_se_long <-
    hefi2019_smkbs_se |>
    pivot_longer(cols=c("SMK_0","SMK_1","DIFF"),
                 values_to="se")

  # 4.3) Merge both data and calculate 95%CI using normal approximation
  hefi2019_smkf <-
    full_join(hefi2019_smk0_long, hefi2019_smkbs_se_long) |>
    mutate(
      # Calculate 95%CI
      nboot  = 500,
      alpha  = 0.05,
      tvalue = qt(1-(alpha/2),nboot-2),
      lcl    = estimate - (se * tvalue),
      ucl    = estimate + (se * tvalue)
    )

  # delete temporary data
  rm(list=c("hefi2019_smk0","hefi2019_smkbs",
            "hefi2019_smkbs_se","hefi2019_smk0_long",
            "hefi2019_smkbs_se_long"))
  
  # show final data
  head(hefi2019_smkf)
## # A tibble: 6 × 9
##   HEFI2019_components name  estimate     se nboot alpha tvalue    lcl    ucl
##   <chr>               <chr>    <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1 HEFI2019C1_VF       SMK_0   10.1   0.126    500  0.05   1.96  9.86  10.4  
## 2 HEFI2019C1_VF       SMK_1    7.85  0.224    500  0.05   1.96  7.41   8.29 
## 3 HEFI2019C1_VF       DIFF    -2.25  0.253    500  0.05   1.96 -2.75  -1.76 
## 4 HEFI2019C2_WHOLEGR  SMK_0    1.24  0.0341   500  0.05   1.96  1.18   1.31 
## 5 HEFI2019C2_WHOLEGR  SMK_1    0.832 0.0434   500  0.05   1.96  0.746  0.917
## 6 HEFI2019C2_WHOLEGR  DIFF    -0.412 0.0529   500  0.05   1.96 -0.516 -0.309
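
As an alternative to the normal approximation, the percentile method can be applied to the same bootstrap estimates. The code below is a minimal sketch restricted to the score difference; it re-derives the bootstrap replicates from hefi2019_smk since the temporary data were deleted above.

# ********************************************** #
#     Alternative: percentile-based 95%CI        #
# ********************************************** #

  # 2.5th and 97.5th percentiles of the bootstrap sampling distribution (difference only)
  hefi2019_smk |>
    filter(replicate > 0) |>
    group_by(HEFI2019_components) |>
    summarise(
      lcl_pct = quantile(DIFF, probs = 0.025),
      ucl_pct = quantile(DIFF, probs = 0.975)
    )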

We can then use the hefi2019_smkf data frame to generate a table or graph of the results. I show a simple table below:

# ********************************************** #
#        Simple draft table with results         #
# ********************************************** #

  hefi2019_smkf |>
    mutate(
      # show estimate (95%CI) for difference and estimate (SE) for mean
      estimate_fmt = 
        ifelse(name=="DIFF",
               paste0(round(estimate,1)," (", round(lcl,1),", ", round(ucl,1),")"),
               paste0(round(estimate,1)," (", round(se,2),")")
                )
    ) |>
    select(HEFI2019_components,name,estimate_fmt) |>
    # transpose data to wide format
    pivot_wider(values_from=estimate_fmt) |>
    # create table
    knitr::kable(
      col.names = c("HEFI-2019 components", "Non-smokers (SE)", "Smokers (SE)", "Difference (95%CI)"),
      caption = "Mean HEFI-2019 scores and difference in Smokers vs. Non-Smokers aged 19y+ (CCHS 2015 - Nutrition)")
| HEFI-2019 components | Non-smokers (SE) | Smokers (SE) | Difference (95%CI) |
|----------------------|------------------|--------------|--------------------|
| HEFI2019C1_VF | 10.1 (0.13) | 7.8 (0.22) | -2.3 (-2.8, -1.8) |
| HEFI2019C2_WHOLEGR | 1.2 (0.03) | 0.8 (0.04) | -0.4 (-0.5, -0.3) |
| HEFI2019C3_GRRATIO | 1.5 (0.04) | 1.1 (0.05) | -0.5 (-0.6, -0.3) |
| HEFI2019C4_PROFOODS | 5 (0.02) | 4.9 (0.11) | -0.1 (-0.3, 0.1) |
| HEFI2019C5_PLANTPRO | 1.8 (0.06) | 1.3 (0.09) | -0.4 (-0.7, -0.2) |
| HEFI2019C6_BEVERAGES | 7.9 (0.04) | 7.2 (0.1) | -0.7 (-1, -0.5) |
| HEFI2019C7_FATTYACID | 2.4 (0.05) | 2.2 (0.1) | -0.2 (-0.4, 0) |
| HEFI2019C8_SFAT | 4.2 (0.08) | 3.8 (0.14) | -0.4 (-0.7, -0.1) |
| HEFI2019C9_FREESUGARS | 9 (0.15) | 7.1 (0.38) | -1.9 (-2.6, -1.1) |
| HEFI2019C10_SODIUM | 5 (0.1) | 4.7 (0.16) | -0.3 (-0.7, 0.1) |
| HEFI2019_TOTAL_SCORE | 48.1 (0.34) | 40.9 (0.61) | -7.2 (-8.5, -5.9) |

Mean HEFI-2019 scores and difference in Smokers vs. Non-Smokers aged 19y+ (CCHS 2015 - Nutrition)

References

Brassard, D., Elvidge Munene, L. A., St-Pierre, S., Guenther, P. M., Kirkpatrick, S. I., Slater, J., Lemieux, S., Jessri, M., Haines, J., Prowse, R., Olstad, D. L., Garriguet, D., Vena, J., Vatanparast, H., L’Abbe, M. R., & Lamarche, B. (2022a). Development of the Healthy Eating Food Index (HEFI)-2019 measuring adherence to Canada’s Food Guide 2019 recommendations on healthy food choices. Appl Physiol Nutr Metab, 47(5), 595-610. https://doi.org/10.1139/apnm-2021-0415

Brassard, D., Elvidge Munene, L. A., St-Pierre, S., Gonzalez, A., Guenther, P. M., Jessri, M., Vena, J., Olstad, D. L., Vatanparast, H., Prowse, R., Lemieux, S., L’Abbe, M. R., Garriguet, D., Kirkpatrick, S. I., & Lamarche, B. (2022b). Evaluation of the Healthy Eating Food Index (HEFI)-2019 measuring adherence to Canada’s Food Guide 2019 recommendations on healthy food choices. Appl Physiol Nutr Metab, 47(5), 582-594. https://doi.org/10.1139/apnm-2021-0416

Freedman, L. S., Guenther, P. M., Krebs-Smith, S. M., & Kott, P. S. (2008). A population’s mean Healthy Eating Index-2005 scores are best estimated by the score of the population ratio when one 24-hour recall is available. J Nutr, 138(9), 1725-1729. https://doi.org/10.1093/jn/138.9.1725