C Dealing with Missing Data

In the article “Much ado about nothing”, several approaches, methods and best practices for dealing with missing data are discussed. The methods are diverse, both in number and in their effect on the results of analyses. Therefore, the first rule of dealing with missing data is: always report analysis results for the imputed data as well as for the data with missing values removed!

The CRAN task view on missing data is a good starting point for finding what you may need. In this vignette we will specifically discuss the packages imputeTS (Time Series Missing Value Imputation) and mice (Multivariate Imputation by Chained Equations).

Data with missing values

We’ll create some variables from which we artificially remove data points. Because the original values are known, this allows us to evaluate how well the imputation methods recover the true values.

set.seed(54321)
# Random normally distributed numbers
zscore <- rnorm(n = 122)
df_vars <- data.frame(zscore = zscore)
# Random discrete numbers 0..6 (rounding a continuous uniform makes the endpoints 0 and 6 half as likely as 1..5)
df_vars$unif_discrete <- unif_discrete  <- round(runif(NROW(df_vars),min = 0,max = 6))
df_vars$unif_discrete[c(5,10:15,74:78,102,111,120)] <- NA
# Unordered categorical
df_vars$cat_unordered <- cat_unordered  <- factor(round(runif(NROW(df_vars),min = 1,max = 7)))
df_vars$cat_unordered[c(5,10:15,74:78,102,111,120)] <- NA
# Ordered categorical
df_vars$cat_ordered <- cat_ordered <- ordered(round(runif(NROW(df_vars),min = 1,max = 20)))
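Because an unmodified copy of each variable is kept next to the version with missing values (for example `unif_discrete` next to `df_vars$unif_discrete`), recovery can be scored by comparing imputed values with the true values at the removed positions. The following is a minimal sketch on toy data. It assumes a simple median fill as a stand-in for a real imputation method, and the names `truth`, `observed`, `imputed` and `na_idx` are hypothetical, not part of the vignette.

```r
set.seed(1)
# Toy "true" series and a copy from which values are removed
truth    <- round(runif(50, min = 0, max = 6))
observed <- truth
na_idx   <- c(5, 10:15, 40)
observed[na_idx] <- NA

# Stand-in imputation (assumption): fill every NA with the median of the observed values
imputed <- observed
imputed[is.na(imputed)] <- median(observed, na.rm = TRUE)

# Score recovery only at the positions that were removed
rmse <- sqrt(mean((imputed[na_idx] - truth[na_idx])^2))
rmse
```

A lower root mean squared error means the imputed values are closer to the true values. For categorical variables a proportion of correctly recovered levels would be a more natural score.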

We’ll also load the data analysed by Bastiaansen et al. (2019) and select some variables which have missing values.

# # Load data from OSF https://osf.io/tcnpd/
# require(osfr)
# manyAnalystsESM <- rio::import(osfr::osf_download(osfr::osf_retrieve_file("tcnpd") , overwrite = TRUE)$local_path)

# Or use the internal data
data(manyAnalystsESM)

# We want to use these variables
# Note: the infix function '%ci%' is from package 'invctr', which must be loaded first, e.g. library(invctr)
vars    <- c("angry"%ci%manyAnalystsESM,"ruminate"%ci%manyAnalystsESM,"hours"%ci%manyAnalystsESM)

df_vars <-  cbind(df_vars,manyAnalystsESM[,vars])

# Give zscore and ordered categorical the same NAs as variable 'angry'
df_vars$zscore[is.na(df_vars$angry)] <- NA
df_vars$cat_ordered[is.na(df_vars$angry)] <- NA
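After copying a missingness pattern from one variable to another, it can be worth confirming that the pattern is what you expect. The sketch below uses a small toy data frame rather than the real data. The object `df_toy` and its values are made up for illustration, and only base R functions are used.

```r
# Toy data frame; 'angry' stands in for the variable of the same name
df_toy <- data.frame(
  angry  = c(3, NA, 5, 2, NA, 4),
  zscore = c(0.1, -1.2, 0.4, 2.0, -0.3, 0.8)
)

# Copy the NA pattern of 'angry' onto 'zscore'
df_toy$zscore[is.na(df_toy$angry)] <- NA

# Number of NAs per column
na_counts <- colSums(is.na(df_toy))
na_counts

# TRUE if both columns are missing at exactly the same positions
same_pattern <- identical(is.na(df_toy$angry), is.na(df_toy$zscore))
same_pattern
```

The same two checks, `colSums(is.na(...))` and `identical(is.na(...), is.na(...))`, can be applied to `df_vars` directly.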

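The function imputeTS::statsNA() reports summary statistics on the NAs in a series: their number and percentage, how they are distributed over bins of the series, and the sizes of the gaps (runs of consecutive NAs).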

library(imputeTS)

# The variable 'angry'
imputeTS::statsNA(df_vars$angry)
> [1] "Length of time series:"
> [1] 122
> [1] "-------------------------"
> [1] "Number of Missing Values:"
> [1] 9
> [1] "-------------------------"
> [1] "Percentage of Missing Values:"
> [1] "7.38%"
> [1] "-------------------------"
> [1] "Number of Gaps:"
> [1] 8
> [1] "-------------------------"
> [1] "Average Gap Size:"
> [1] 1.125
> [1] "-------------------------"
> [1] "Stats for Bins"
> [1] "  Bin 1 (31 values from 1 to 31) :      1 NAs (3.23%)"
> [1] "  Bin 2 (31 values from 32 to 62) :      0 NAs (0%)"
> [1] "  Bin 3 (31 values from 63 to 93) :      3 NAs (9.68%)"
> [1] "  Bin 4 (29 values from 94 to 122) :      5 NAs (17.2%)"
> [1] "-------------------------"
> [1] "Longest NA gap (series of consecutive NAs)"
> [1] "2 in a row"
> [1] "-------------------------"
> [1] "Most frequent gap size (series of consecutive NA series)"
> [1] "1 NA in a row (occurring 7 times)"
> [1] "-------------------------"
> [1] "Gap size accounting for most NAs"
> [1] "1 NA in a row (occurring 7 times, making up for overall 7 NAs)"
> [1] "-------------------------"
> [1] "Overview NA series"
> [1] "  1 NA in a row: 7 times"
> [1] "  2 NA in a row: 1 times"
# Uniform discrete numbers
imputeTS::statsNA(df_vars$unif_discrete)
> [1] "Length of time series:"
> [1] 122
> [1] "-------------------------"
> [1] "Number of Missing Values:"
> [1] 15
> [1] "-------------------------"
> [1] "Percentage of Missing Values:"
> [1] "12.3%"
> [1] "-------------------------"
> [1] "Number of Gaps:"
> [1] 6
> [1] "-------------------------"
> [1] "Average Gap Size:"
> [1] 2.5
> [1] "-------------------------"
> [1] "Stats for Bins"
> [1] "  Bin 1 (31 values from 1 to 31) :      7 NAs (22.6%)"
> [1] "  Bin 2 (31 values from 32 to 62) :      0 NAs (0%)"
> [1] "  Bin 3 (31 values from 63 to 93) :      5 NAs (16.1%)"
> [1] "  Bin 4 (29 values from 94 to 122) :      3 NAs (10.3%)"
> [1] "-------------------------"
> [1] "Longest NA gap (series of consecutive NAs)"
> [1] "6 in a row"
> [1] "-------------------------"
> [1] "Most frequent gap size (series of consecutive NA series)"
> [1] "1 NA in a row (occurring 4 times)"
> [1] "-------------------------"
> [1] "Gap size accounting for most NAs"
> [1] "6 NA in a row (occurring 1 times, making up for overall 6 NAs)"
> [1] "-------------------------"
> [1] "Overview NA series"
> [1] "  1 NA in a row: 4 times"
> [1] "  5 NA in a row: 1 times"
> [1] "  6 NA in a row: 1 times"
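The gap statistics in this output can be reproduced with base R, which helps clarify what a "gap" is: a run of consecutive NAs. The sketch below rebuilds the NA positions that were set for `unif_discrete` on a placeholder vector and summarises the runs with `rle()`. It illustrates the idea only and is not how `statsNA()` is implemented internally. The names `x`, `runs` and `gap_sizes` are hypothetical.

```r
# Rebuild the NA pattern used for 'unif_discrete' on a placeholder vector of length 122
x <- rep(1, 122)
x[c(5, 10:15, 74:78, 102, 111, 120)] <- NA

# Run-length encode the missingness indicator
runs      <- rle(is.na(x))
gap_sizes <- runs$lengths[runs$values]  # lengths of the consecutive-NA runs

n_missing <- sum(is.na(x))      # number of missing values
n_gaps    <- length(gap_sizes)  # number of gaps
mean_gap  <- mean(gap_sizes)    # average gap size
max_gap   <- max(gap_sizes)     # longest gap

# How often each gap size occurs
table(gap_sizes)
```

For this pattern the sketch gives 15 missing values in 6 gaps, an average gap size of 2.5 and a longest gap of 6, in agreement with the output above.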