Blog posts

2023

The most boring yet essential skill (part 2): reshaping data

17 minute read

Published:

A common data manipulation task involves transposing data from a long (portrait) to a wide (landscape) format, and vice versa. This manipulation is sometimes necessary to accommodate a statistical procedure, but also to aggregate and combine variables in a dataset. In nutrition, a typical example is combining repeated dietary intake data. For example, study participants may have completed 2 or 3 24-hour dietary recalls each, and we may wish to calculate average intakes across the repeated assessments. While this manipulation can be done manually in Excel, having reproducible code is advantageous for many reasons. The purpose of this blog is to introduce these common data formats and demonstrate how we can go from one format to the other using R, with a nutrition data example.
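As a rough illustration of the two formats (not the code from the post, which uses R), here is a minimal sketch in plain Python. The participant IDs, the `energy_kcal` variable, and the intake values are made up for the example, and the helper names `long_to_wide`, `wide_to_long` and `mean_intake` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Long format: one row per participant per 24-hour recall (made-up values).
long_rows = [
    {"id": 1, "recall": 1, "energy_kcal": 2100.0},
    {"id": 1, "recall": 2, "energy_kcal": 1900.0},
    {"id": 2, "recall": 1, "energy_kcal": 2500.0},
    {"id": 2, "recall": 2, "energy_kcal": 2300.0},
    {"id": 2, "recall": 3, "energy_kcal": 2400.0},
]


def long_to_wide(rows):
    """Return one entry per participant with one column per recall."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["id"]][f"energy_kcal_r{row['recall']}"] = row["energy_kcal"]
    return dict(wide)


def wide_to_long(wide):
    """Reverse of long_to_wide: one row per participant per recall."""
    rows = []
    for pid, columns in wide.items():
        for name, value in columns.items():
            recall = int(name.rsplit("_r", 1)[1])
            rows.append({"id": pid, "recall": recall, "energy_kcal": value})
    return rows


def mean_intake(rows):
    """Average intake per participant across the available recalls."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row["energy_kcal"])
    return {pid: mean(values) for pid, values in by_id.items()}


wide = long_to_wide(long_rows)
averages = mean_intake(long_rows)
print(wide)
print(averages)
```

Participants with a different number of recalls simply end up with fewer columns in the wide format, which is one reason the long format is often easier to aggregate.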

The most boring yet essential skill: organizing files and codes

23 minute read

Published:

The ability to reproduce an analysis is essential for any research project or study based on quantitative data. Reproducibility depends on the data (e.g., availability of raw data), software (e.g., statistical software and/or packages used) as well as hardware and operating system. A basic form of reproducibility is when you can repeat your own analysis and obtain the same results you had before. A key part of this process is being able to understand and track each step of your analysis. The purpose of this blog is to introduce simple tips and tricks that support reproducible science. More advanced practices are introduced at the end.

Nutrition data visualization: proportions and ratios

8 minute read

Published:

In this nutrition data visualization series, I aim to show how to visualize common statistics in health and nutrition. Or, at least, how I think it is best to visualize these data.
In this article, I focus on the case where we are interested in proportions. Often, we want to compare two proportions. In a survey, proportions (named prevalences) are compared using prevalence ratios; in a cohort study, proportions (named risks) are compared using risk ratios. Other examples of summary statistics for proportions include odds ratios (e.g., case-control study), hazard ratios (e.g., survival analysis), etc.

‘Statistical method you should know’: percentage difference

11 minute read

Published:

For descriptive purposes, we may want to plot and compare many differences for variables with varying units. Nutrient intakes are a common example where we may have data measured in calories (e.g., energy), grams (e.g., saturated fats) or servings (e.g., sugar-sweetened beverages). However, the extent of the difference (the effect size) cannot be easily compared when units and scales vary. Percentage differences and log-transformation of the data may help when presenting such results.
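To make the idea concrete, here is a minimal sketch in plain Python (not the code from the post). It assumes the percentage difference is defined relative to a reference group, and the intake values and variable names are made up for the example. It also shows that a difference on the log scale can be back-transformed to the same percentage difference.

```python
import math


def percentage_difference(reference, comparison):
    """Difference between two values, as a percentage of the reference."""
    return (comparison - reference) / reference * 100


# (reference, comparison) mean intakes in three different units (made-up values).
intakes = {
    "energy_kcal": (2000.0, 2200.0),
    "saturated_fat_g": (25.0, 20.0),
    "ssb_servings": (1.0, 1.5),
}

# Once expressed as percentages, differences share a common, unitless scale.
pct_diff = {
    name: percentage_difference(ref, comp)
    for name, (ref, comp) in intakes.items()
}

# A difference of logs is the log of the ratio; exponentiating it and
# subtracting 1 recovers the same percentage difference.
log_diff = {
    name: math.log(comp) - math.log(ref)
    for name, (ref, comp) in intakes.items()
}
back_transformed = {
    name: (math.exp(value) - 1) * 100 for name, value in log_diff.items()
}

for name in intakes:
    print(name, round(pct_diff[name], 1), round(back_transformed[name], 1))
```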

‘Statistical method you should know’: regression calibration

15 minute read

Published:

Random measurement errors associated with short-term dietary assessment instruments (e.g., 24-hour dietary recalls and food records) may cause unexpected bias depending on the study objective and target statistic. Fortunately, there are well-established methods to mitigate the impact of random errors. One of these methods is called regression calibration. In this blog, I introduce the method and show how it can be applied to a simple nutrition analysis.

‘Statistical method you should know’: restricted cubic spline

14 minute read

Published:

In this article, I provide a brief introduction to a statistical method that I find very useful: restricted cubic splines. During my PhD, I diligently learned regression model assumptions for my biostatistics class. One of these assumptions in the case of linear regression models is that the independent variable $X$ should be linearly related to the dependent variable, $Y$. After all, it makes sense that linear regression models estimate linear relationships. I nearly fell off my chair when I learned that the linearity assumption is not even required! We can relax this assumption by using simple statistical transformations. One of those transformations is the restricted cubic spline.

2022

Analyzing the Canadian Community Health Survey (CCHS) 2015 data with R: mean diet quality score

16 minute read

Published:

In a previous post, I showed how to account for the Canadian Community Health Survey (CCHS) complex survey design for a simple analysis in R. However, some nutrition analyses require multiple steps that are not “built-in” in statistical software. For example, the recommended approach to estimate a (mean) diet quality score based on 24-hour dietary recall data, the main dietary assessment instrument in surveys, is the population ratio method (Freedman et al. 2008).

Impact of random errors: two nutrition examples

14 minute read

Published:

In my previous blog, I explained the difference between systematic and random errors. While it is obvious that a systematic error (difference) between the “true” value and its measurement can be a problem, the impact of random errors is often more subtle. However, in many cases, random errors can be as problematic as systematic errors if they are ignored. In this post, I aim to provide a simple demonstration of how random errors may cause problems for two common analyses in nutrition.

‘Statistical concept you should know’: random and systematic measurement errors

11 minute read

Published:

In medicine, epidemiology or nutrition, we measure data on features of the world we are interested in. It is often the case that we cannot obtain a perfect measure. For example, we cannot observe many people’s diet every day over many months to determine usual dietary intakes. Instead, we use dietary assessment instruments to collect imperfect information about diet.

Analyzing the Canadian Community Health Survey (CCHS) 2015 data with R: linear regression example

12 minute read

Published:

A key component of survey analysis, including the CCHS 2015 - Nutrition (Health Canada 2017), is accounting for the survey design. Indeed, the design must be properly specified to obtain proper standard errors and variance estimates. In addition, using the sampling weights generates estimates representative of the target population. SAS is often the go-to statistical package for complex sampling survey analysis. In my experience, introductory workshops on survey analysis teach how to use SAS’ PROC SURVEYMEANS, PROC SURVEYREG, etc. However, examples in R are often not provided. Recently, I have been using R a lot more, but I struggled to find good examples for the CCHS 2015 - Nutrition.

‘Statistical method you should know’: the bootstrap

15 minute read

Published:

In this article, I describe a statistical method that I use very often: the bootstrap. Yet, I believe the method is rarely taught outside epidemiology or biostatistics graduate curricula. This is unfortunate because the bootstrap is (relatively) simple and extremely useful.

Nutrition data visualization: distribution

8 minute read

Published:

In this nutrition data visualization series, I aim to show how to visualize common statistics in health and nutrition. Or, at least, how I think it is best to visualize these data.
In this article, I focus on the case where we are interested in the distribution of intakes. You can skip the next section to see visualization code only.

Nutrition data visualization: means and differences

8 minute read

Published:

In this nutrition data visualization series, I aim to show how to visualize common statistics in health and nutrition. Or, at least, how I think it is best to visualize these data.
In this article, I focus on examples of means and differences in means.

‘Talented fighters are lazy!’: uncovering selection bias in combat sports

8 minute read

Published:

In combat sports like boxing or mixed martial arts, there is a saying that talented athletes lack work ethic compared with less talented ones. The idea is that athletes with the most “raw talent” have a natural tendency to become lazy over time and work less than athletes who were not as gifted.