logo

We will use different data to illustrate Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random. First, we will present the three missingness situations, and afterwards, we show some tools to analyse the missingness patterns. There will be some redundancy of the plots used, but so the reader can chose the preferred tool.

Here you see a list of the packages we use for these analyses (click on the code button if you don’t see the black boxes with the code).

This page is still work in progress, this is the version from 13-01-2022. If you want to report an error, you can do this here Click to send an e-mail to Roger Hilfiker.


knitr::opts_chunk$set(echo = TRUE)
r <- getOption("repos")
r["CRAN"] <- "https://stat.ethz.ch/CRAN/"
options(repos = r)
list.of.packages <- c("bookdown","rmarkdown" ,"knitr","rio", "psych","janitor", 
                      "tidyverse","jtools","summarytools", "qgraph",  "gtsummary" , "viridis", "wesanderson", "missMethods", "ggpubr", "ggrepel", "naniar", "finalfit", "missMethods", "rpart", "rpart.plot")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(summarytools)
library(psych)
library(janitor)
library(sjlabelled)
library(tidyverse)
library(gtsummary)
library(viridis)
library(wesanderson)
library(missMethods)
library(ggpubr)
library(ggrepel)
library(naniar)
library(finalfit)
library(rpart)
library(rpart.plot)
library(mice)
library(knitr)

1 Missing Completely at Random (MCAR)

We use some data from a published study and then we delete randomly some data.
This downloaded dataset has no missing values. So we will present the analysis without the missing values. Then, we will delete some values: We can simulate Missing Completely at Random values with the package missMethods, see here for a tutorial:.

Load the data directly from the web See the article here:

df1<-rio::import("https://doi.org/10.1371/journal.pone.0262238.s008", format="xlsx")
df1<-df1 %>% 
  rename(WalkingDistance_m_6min=Distance30) %>% 
  select(Number, ActiveSmoking, Age, Sex, COPDduration,FEV1, WalkingDistance_m_6min )

1.0.1 Dataset without missing values

Below you see a summary of the data, here still without missing data:

options(width = 300)
summary(df1)
##      Number      ActiveSmoking        Age             Sex        COPDduration        FEV1       WalkingDistance_m_6min
##  Min.   : 1.00   Min.   :  0.0   Min.   :53.00   Min.   :1.00   Min.   : 1.00   Min.   :0.440   Min.   :214.7         
##  1st Qu.:14.25   1st Qu.:  0.0   1st Qu.:64.00   1st Qu.:1.00   1st Qu.: 7.25   1st Qu.:1.015   1st Qu.:304.8         
##  Median :27.50   Median :  0.0   Median :68.50   Median :1.00   Median :16.00   Median :1.540   Median :371.4         
##  Mean   :28.10   Mean   :160.1   Mean   :69.08   Mean   :1.08   Mean   :25.40   Mean   :1.496   Mean   :359.8         
##  3rd Qu.:41.75   3rd Qu.:  1.0   3rd Qu.:76.00   3rd Qu.:1.00   3rd Qu.:44.25   3rd Qu.:1.805   3rd Qu.:408.0         
##  Max.   :56.00   Max.   :999.0   Max.   :80.00   Max.   :2.00   Max.   :72.00   Max.   :2.530   Max.   :555.3

1.0.2 Scatterplot FEV1 - Six Minute Walking Distance (no missing data)

We plot the association between FEV1 and six minute walking distance:

correlation_Sex1<-cor(df1[df1$Sex==1,6:7])
correlation_Sex2<-cor(df1[df1$Sex==2,6:7])

ggplot(data=df1, aes(FEV1, WalkingDistance_m_6min, group=factor(Sex), colour=factor(Sex)))+
  geom_point()+
  # stat_cor(method = "pearson", label.x = 1, label.y = 100)+ # This line does overplot both correlations
  geom_smooth(method="lm")+
  annotate("text", x = 1.8, y = 365,label = paste("r = " ,round(correlation_Sex1[2],2)))+
  annotate("text", x = 0.85, y = 255,label = paste("r = " ,round(correlation_Sex2[2],2)))

1.0.3 Correlation matrix (no missing data)

options(width = 300)
round(cor(df1[-1], use="pairwise.complete.obs"),2)
##                        ActiveSmoking   Age   Sex COPDduration  FEV1 WalkingDistance_m_6min
## ActiveSmoking                   1.00  0.10  0.47        -0.08 -0.31                  -0.18
## Age                             0.10  1.00  0.24         0.07 -0.27                  -0.57
## Sex                             0.47  0.24  1.00         0.08 -0.36                  -0.29
## COPDduration                   -0.08  0.07  0.08         1.00 -0.09                  -0.15
## FEV1                           -0.31 -0.27 -0.36        -0.09  1.00                   0.36
## WalkingDistance_m_6min         -0.18 -0.57 -0.29        -0.15  0.36                   1.00

1.1 Add missing values Missing Completely At Random

Create some missing values by randomly deleting 30% of the data-points in the six minute walking distance, FEV1, and Age:

set.seed(12345)
df1_mcar <- delete_MCAR(df1, 0.15, c("WalkingDistance_m_6min", "FEV1", "Age"))

1.1.1 Correlation matrix (Missing Completely At Random)

round(cor(df1_mcar[-1], use="pairwise.complete.obs"),2)
##                        ActiveSmoking   Age   Sex COPDduration  FEV1 WalkingDistance_m_6min
## ActiveSmoking                   1.00  0.16  0.47        -0.08 -0.31                  -0.13
## Age                             0.16  1.00  0.26         0.11 -0.38                  -0.58
## Sex                             0.47  0.26  1.00         0.08 -0.42                  -0.33
## COPDduration                   -0.08  0.11  0.08         1.00 -0.14                  -0.08
## FEV1                           -0.31 -0.38 -0.42        -0.14  1.00                   0.26
## WalkingDistance_m_6min         -0.13 -0.58 -0.33        -0.08  0.26                   1.00

We see already that the correlations are not the same anymore (for the variables with missing values). But the difference is only randomly different and we are allowed to use the data with the missing values. There is no bias.

1.1.2 Scatterplot (Missing Completely At Random)

And we plot again the same scatter plot as above:

summary(df1_mcar)
##      Number      ActiveSmoking        Age             Sex        COPDduration        FEV1       WalkingDistance_m_6min
##  Min.   : 1.00   Min.   :  0.0   Min.   :53.00   Min.   :1.00   Min.   : 1.00   Min.   :0.440   Min.   :214.7         
##  1st Qu.:14.25   1st Qu.:  0.0   1st Qu.:64.25   1st Qu.:1.00   1st Qu.: 7.25   1st Qu.:1.130   1st Qu.:304.9         
##  Median :27.50   Median :  0.0   Median :69.50   Median :1.00   Median :16.00   Median :1.645   Median :371.4         
##  Mean   :28.10   Mean   :160.1   Mean   :69.07   Mean   :1.08   Mean   :25.40   Mean   :1.551   Mean   :363.5         
##  3rd Qu.:41.75   3rd Qu.:  1.0   3rd Qu.:76.00   3rd Qu.:1.00   3rd Qu.:44.25   3rd Qu.:1.972   3rd Qu.:412.4         
##  Max.   :56.00   Max.   :999.0   Max.   :80.00   Max.   :2.00   Max.   :72.00   Max.   :2.530   Max.   :555.3         
##                                  NA's   :8                                      NA's   :8       NA's   :8
correlation_Sex1<-cor(df1_mcar[df1_mcar$Sex==1,6:7], use='pairwise.complete.obs')
correlation_Sex2<-cor(df1_mcar[df1_mcar$Sex==2,6:7], use='pairwise.complete.obs')

ggplot(data=df1_mcar, aes(FEV1, WalkingDistance_m_6min, group=factor(Sex), colour=factor(Sex)))+
  geom_point()+
  # stat_cor(method = "pearson", label.x = 1, label.y = 100)+
  #geom_text_repel(nudge_y=0.3)+
  geom_smooth(method="lm")+
  annotate("text", x = 1.8, y = 360,label = paste("r = " ,round(correlation_Sex1[2],2)))+
  annotate("text", x = 1.1, y = 290,label = paste("r = " ,round(correlation_Sex2[2],2)))
## `geom_smooth()` using formula 'y ~ x'
The correlations with the data set containing missing values are not exactly the same as those without missing values. However, the difference is just a random difference and the result is not biased

The correlations with the data set containing missing values are not exactly the same as those without missing values. However, the difference is just a random difference and the result is not biased

In this situation, we have Missing Completely At Random. We can say that data are Missing Completely At Random if the probability of a value being missing is not associated to other observed or unobserved data. Because the value of the variable or the values of other variables - whether we have them measured or not - are not influencing whether our data are missing, we can say that the results will still be representative of the population, i.e. we do not have a bias. But we lose statistical precision.

1.2 Test Missing Completely At Random or not

We can statistically test whether the data are Missing Completely at Random or whether they are not (i.e. whether they are either Missing At Random or Missing Not At Random). There is the Little’s test Click here for the original article. We can use the command mcar_test from the naniar package. If the p-value for this test is very low, we have a suspicion, that the data are not Missing Completely At Random. In our case, the p-value is high, so there is probably no problem. You find here the help page of the test.

littleltest<-naniar::mcar_test(df1_mcar)

The next code chunk just prepares a text conditional on the p-value. We use this text below the code chunk.

if (littleltest$p.value < 0.05) {
  interpretation<-"there is evidence that the data are not Missing Completely At Random, therefore, we assume that they are either Missing At Random or Missing Not At Random."
} else {
  interpretation<-"there is no evidence that the data are not Missing Completely At Random."
}

if (littleltest$p.value < 0.001) {
  pvalue="p < 0.001" 
} else {
    pvalue=round(littleltest$p.value, 3)
  }

We see that the p-value is: 0.356, therefore we can conclude that there is no evidence that the data are not Missing Completely At Random.


2 Missing At Random

2.1 Data generation for the example

We extract some data similar to the data presented in the book Multiple Imputation and its Application on page 13.

How did we simulate the data? We used Webplotdigitizer to extract the data from the figure and then adapted some values. The book presents the means, but not the standard deviations for the data. Therefore, we could not really use rnorm to simulate the data (or we would have to guess the standard deviations).

The data we have here are particular: They contain the values that are missing. In a real live situation, we would not see the missing values, but here, we show them to illustrate the problem. In a real live situation, we could only have the values that were missing, if we would go back to the patients who provided missing values and asked them, what the values were. So in this example, some participants would not provide information on the salary. We would then contact these persons, and ask them whether they would tell us their salary. Of course this is difficult - and there are reasons the participants did not tell the salaryin the first place, so probably they will not tell it even when we ask them directly.

2.1.1 Reshape the data

We need to reorganize the data - this process is often called reshape of the data, or in r they call it pivoting the data.

head(data)
##   Count Job_A_Observed Job_A_Missing Job_B_Observed Job_B_Missing
## 1     1       48.79310      48.79310       20.28736      22.57471
## 2     2       49.94253      50.86207       21.66667      21.72414
## 3     3       50.86207      52.70115       22.12644      22.72414
## 4     4       52.01149      52.70115       22.35632      27.87356
## 5     5       52.70115      53.62069       22.81609      27.87356
## 6     6       53.39080      54.54023       22.81609      29.86207
summary(data)
##      Count    Job_A_Observed  Job_A_Missing   Job_B_Observed  Job_B_Missing  
##  Min.   : 1   Min.   :48.79   Min.   :48.79   Min.   :19.49   Min.   :21.72  
##  1st Qu.:23   1st Qu.:57.24   1st Qu.:56.32   1st Qu.:26.49   1st Qu.:25.30  
##  Median :45   Median :60.29   Median :60.49   Median :30.63   Median :29.86  
##  Mean   :45   Mean   :60.08   Mean   :60.08   Mean   :30.00   Mean   :30.01  
##  3rd Qu.:67   3rd Qu.:63.16   3rd Qu.:63.33   3rd Qu.:33.16   3rd Qu.:34.20  
##  Max.   :89   Max.   :72.01   Max.   :71.36   Max.   :40.98   Max.   :38.91  
##               NA's   :21      NA's   :57                      NA's   :78

Don’t panic if you do not really understand the pivot_longer command- It takes the author of this script always several hours to understand it for a given dataset 😄. Maybe a video might help Click here.

data_long<-data %>% 
  pivot_longer(cols=-Count, 
               names_pattern = "[Job]_(.*)_(.*)",
               names_to=c( "Job", "Missingness"), 
               values_to="Salary_CHF") %>% 
  filter(!is.na(Salary_CHF))

The long data:

head(data_long)
## # A tibble: 6 x 4
##   Count Job   Missingness Salary_CHF
##   <int> <chr> <chr>            <dbl>
## 1     1 A     Observed          48.8
## 2     1 A     Missing           48.8
## 3     1 B     Observed          20.3
## 4     1 B     Missing           22.6
## 5     2 A     Observed          49.9
## 6     2 A     Missing           50.9

2.1.2 Summary Statistic

Here we calculate some summary statistics: Here we see the salaries separately for those who have missing values and for those who do not have missing values in the salaryvariable (again: in real live we would not have the mean salary for those who have missing information for salary, of course (because they are missing 🤠 ).)

summary_table<-data_long %>% 
  group_by(Missingness, Job) %>% 
  summarise(mean_Salary=mean(Salary_CHF, na.rm=TRUE), 
            sd_Salary=sd(Salary_CHF, na.rm=TRUE), 
            n_nonmissing=sum(!is.na(Salary_CHF)), 
            n=n())
## `summarise()` has grouped output by 'Missingness'. You can override using the `.groups` argument.
kable(summary_table)
Missingness Job mean_Salary sd_Salary n_nonmissing n
Missing A 60.08023 5.751603 32 32
Missing B 30.01254 5.996058 11 11
Observed A 60.07776 4.851583 68 68
Observed B 29.99948 5.050440 89 89

We could also just use the tbl_summary command from the gtsummary package: The mean salarywe see here is the salarycalculated with all values, i.e. even with the salaries for those who are said to be missing (again, a situation we don’t have in real live). So the means are the unbiased mean that we would obtain, if we would have all data.

tbl_summary(data_long, by=Job)
Characteristic A, N = 1001 B, N = 1001
Count 26 (13, 43) 40 (15, 64)
Missingness
Missing 32 (32%) 11 (11%)
Observed 68 (68%) 89 (89%)
Salary_CHF 60 (57, 63) 31 (26, 33)

1 Median (IQR); n (%)

2.1.3 Difference in salaries between observed and “true” salariers

True means here, if we would have all salaries). Here the mean salaryover all jobs: 45.04

And now the salary for the observed values: 43.03

The difference is due to the fact that the mean of the salaryof those participants with missing values in salaryare different from those who provided the salaryinformation: 52.39

This means, that the overall mean for the salary is biased downwards (i.e. too small compared to the true mean, which we would have only if we could analyse the missing values also).

So our data are not Missing Completely At Random. But are they at least Missing At Random?

2.2 Missing At Random conditional on job

  • Within both Jobs, the distribution of missing and observed salaries are similar.
means<-aggregate(Salary_CHF~Job+Missingness, data_long, mean)
ggplot(data_long, aes(factor(Job), Salary_CHF))+
         geom_point(data=data_long[data_long$Missingness=="Observed",], position=position_nudge(x=-0.02), aes(colour=Missingness))+
         geom_point(data=data_long[data_long$Missingness=="Missing",], position=position_nudge(x=0.02), aes(colour=Missingness))+
  theme_classic()+
  theme(
  legend.position=c(0.5, 0.5))+
  xlab("Job")+
  ylab("Salary (in 1000 CHF)")+
  geom_text(data=means[means$Missingness=="Observed",], aes(label=paste0("Mean\n Observed: \n ",round(Salary_CHF)), y= Salary_CHF), nudge_x=-0.15)+
  geom_text(data=means[means$Missingness=="Missing",], aes(label=paste0("Mean\n Missing:  \n", round(Salary_CHF)), y= Salary_CHF), nudge_x=0.15)+
  annotate("text", x=1.8, y=70,label=
             "Over both Jobs, the probability of missing values \n increases with the salary. Within Job A and within \n Job B, the probability of missing values does not \n depend on the salary. Therefore, we say  that \n values are Missing At Random dependent on the job." )+
  annotate("text", x=1, y=15, label="Observed: 68 \n Missing: 32")+
  annotate("text", x=2, y=15, label="Observed: 89 \n Missing: 11")+
  annotate("text", x=1, y=30, label="In Job A, the distribution of \n observed and unobserved \n salaries is similar")+
  annotate("text", x=2, y=55, label="In Job B, the distribution of \n observed and unobserved \n salaries is similar")+
  # scale_color_viridis(discrete = TRUE, option="turbo")+
  scale_color_manual(values = wes_palette("BottleRocket1", n = 2))
We see that within the two Jobs, the  missingness 
 does not depend on the salary.  In Job A, those with 
 higher salaries,  do not have a higher probability for missing values. The same is true for those in Job B. However, overall, the 
 probability for missing value is associated with the salary. Overall, the chance of beeing a missing value depends 
 on its true value (i.e. the higher the salary, the higher the probability of a missing value - in other words: People with higher salaries do more often not report their salary). Therefore, we say that the values are Missing At Random dependent on the Job (conditional on the job)

We see that within the two Jobs, the missingness does not depend on the salary. In Job A, those with higher salaries, do not have a higher probability for missing values. The same is true for those in Job B. However, overall, the probability for missing value is associated with the salary. Overall, the chance of beeing a missing value depends on its true value (i.e. the higher the salary, the higher the probability of a missing value - in other words: People with higher salaries do more often not report their salary). Therefore, we say that the values are Missing At Random dependent on the Job (conditional on the job)

2.2.1 The Problem

In real live, we can not test whether the distribution of the missing salaries is similar compared to the nonmissing salaries. Because in real live, the missing salaries are missing (i.e. we don’t know how they are, i.e. we have no information on the red points in the Figure above).

2.3 Summary statistics

Below just the code (click on the show button to see the code) to calculate the mean values for the different subgroups.

cat("Mean Job A: ", mean(data_long$Salary_CHF[data_long$Job=="A"]))
## Mean Job A:  60.07855
cat("Mean Job B: ", mean(data_long$Salary_CHF[data_long$Job=="B"]))
## Mean Job B:  30.00092
cat("Mean Job A, only observed: ", mean(data_long$Salary_CHF[data_long$Job=="A"& data_long$Missingness=="Observed"]))
## Mean Job A, only observed:  60.07776
cat("Mean Job B, only observed: ", mean(data_long$Salary_CHF[data_long$Job=="B"& data_long$Missingness=="Observed"]))
## Mean Job B, only observed:  29.99948
cat("Mean overall, only observed: ", mean(data_long$Salary_CHF[data_long$Missingness=="Observed"]))
## Mean overall, only observed:  43.02702
cat("Mean overall, all data (including missing) ", mean(data_long$Salary_CHF))
## Mean overall, all data (including missing)  45.03973

2.4 Proof

Proof that the missingness of the salaries is dependent on the salaries over both jobs but not dependent on the salaries within the Jobs:

Again, we need to remember that this “proof” is not possible in the real world, because the data we need to test the assumption that our data are Missing At Random dependent on the job is missing. 😠 We can only do this in this exercise because we now the salaries of those with missing values in salary. In the real world, this could potentially be achieved when we would contact people with missing data on salaryand ask them, whether they wouldn’t agree, for academic or didactic reasons, to provide the information on their salary.

ggplot(data_long, aes(x=Salary_CHF))+
  geom_density(aes(fill=Missingness), alpha=0.4)
We see that there are more missing values in people with higher salaries

We see that there are more missing values in people with higher salaries

ggplot(data_long, aes(x=Salary_CHF))+
  geom_density(aes(fill=Missingness), alpha=0.4)+
  facet_grid(~Job)
If we analyse both jobs seperately, we see that the salaries of the missing and the observed are similarly distributed

If we analyse both jobs seperately, we see that the salaries of the missing and the observed are similarly distributed

2.4.1 Logistic Regression on Missingness

We can now run logistic regression to evaluate, whether the salaryhas an association with the probability of having missing values. If the salaryhas an association with the probability of having missing values, then we say, that the missingness of the variable dependents of the true (i.e. unobserved) value of the variable. In this case we see that - if both jobs are analyses together - that the missingness of the salarydepends on the salary. Be aware: In a real situation, we would not know this, because we would not know the value of the missing salaries. We could find this out, if in a study, we would contact people with missing values on salaryand ask 🙏 them their salary. Or we ask some 👾, they know it all.

We create a variable that is 1 if the value is missing and 0 if the value is not missing. Then we run a logistic regression with this variable missing as the dependent variable and the salaryas the independent variable to see if the odds to have missing values increases with increasing salaries.

We see in the output below, that there is an increase in the odds of having missing values with increasing salaries. There is a 4% increase in the odds for having a missing values if you earn 1’000 CHF more (odds ratio of 1.04 per unit increase in salary, one unit here is 1’000 CHF).

data_long<-data_long %>% 
  mutate(missing=case_when(
    Missingness=="Missing"~1, 
    Missingness=="Observed"~0))

logreg_missing_overall<-glm(missing~Salary_CHF, data=data_long, family=binomial)
summary(logreg_missing_overall)
## 
## Call:
## glm(formula = missing ~ Salary_CHF, family = binomial, data = data_long)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0370  -0.8186  -0.5243  -0.4343   2.2003  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.18659    0.62884  -5.067 4.03e-07 ***
## Salary_CHF   0.03954    0.01193   3.314 0.000919 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 208.20  on 199  degrees of freedom
## Residual deviance: 196.11  on 198  degrees of freedom
## AIC: 200.11
## 
## Number of Fisher Scoring iterations: 4
jtools::summ(logreg_missing_overall, confint=TRUE, exp=TRUE)
Observations 200
Dependent variable missing
Type Generalized linear model
Family binomial
Link logit
χ²(1) 12.10
Pseudo-R² (Cragg-Uhler) 0.09
Pseudo-R² (McFadden) 0.06
AIC 200.11
BIC 206.70
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.04 0.01 0.14 -5.07 0.00
Salary_CHF 1.04 1.02 1.06 3.31 0.00
Standard errors: MLE

2.4.2 Plot MAR

We will use a plot to show that missingness is dependent on salary if we do not use the information on the type of job. First, we plot the salary for those mising and non-missing (again: in real world we do not know the salary for those where the salary is missing, of course)

p1<-ggplot(data_long, aes(Salary_CHF, missing))+
  geom_point(alpha=0.3)

# Here the probability over all, we see that the probabilty for having missing values in salaryincreases with salary
p2<-p1+
  geom_smooth(method=glm, method.args=list(family="binomial"))+
  labs(
  title="Logistic Regression Model", 
  x="Salary 1000 CHF",
  y="Probability of having Missing Values in Salary_CHF"
  )
print(p2)
## `geom_smooth()` using formula 'y ~ x'
The probability of missing values in salaryincreases with increasing salary. People with higher values do more often miss to provide information on salary.

The probability of missing values in salaryincreases with increasing salary. People with higher values do more often miss to provide information on salary.

2.4.3 Including Job information

We see here below, that conditional on the job, there is no association between the salary and the probability of having missing data on salary. Don’t be confused about the different methods used in the code-block below. In the first model, we assume that there is no interaction between the job and salary for the missingness of the salary. This is in fact not the best way to analyse it. We should include an interaction (or analyse the association separately for both jobs). In the second and third model, we analysed the association separately for those with job A and Job B. In the fourth model, we modeled an interaction between the job and the salary for the association with the probability to have missing values in the salary. The fifth approach just shows how we can do the models two and three with the purrr approach (this is really only for the very interested students).

logreg_missing_per_Job<-glm(missing~Salary_CHF+Job, data=data_long, family=binomial)
summary(logreg_missing_per_Job)
## 
## Call:
## glm(formula = missing ~ Salary_CHF + Job, family = binomial, 
##     data = data_long)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8792  -0.8780  -0.4829  -0.4824   2.1019  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7670618  2.1074489  -0.364    0.716
## Salary_CHF   0.0002212  0.0348951   0.006    0.995
## JobB        -1.3303162  1.1178086  -1.190    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 208.20  on 199  degrees of freedom
## Residual deviance: 194.68  on 197  degrees of freedom
## AIC: 200.68
## 
## Number of Fisher Scoring iterations: 4
data_A<-data_long %>% 
  filter(Job=="A")

logreg_missing_Job_A<-glm(missing~Salary_CHF, data=data_A, family=binomial)
summary(logreg_missing_Job_A)
## 
## Call:
## glm(formula = missing ~ Salary_CHF, family = binomial, data = data_A)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8787  -0.8783  -0.8781   1.5094   1.5101  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.595e-01  2.534e+00  -0.300    0.764
## Salary_CHF   9.508e-05  4.202e-02   0.002    0.998
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 125.37  on 99  degrees of freedom
## Residual deviance: 125.37  on 98  degrees of freedom
## AIC: 129.37
## 
## Number of Fisher Scoring iterations: 4
data_B<-data_long %>% 
  filter(Job=="B")

logreg_missing_Job_B<-glm(missing~Salary_CHF, data=data_B, family=binomial)
summary(logreg_missing_Job_B)
## 
## Call:
## glm(formula = missing ~ Salary_CHF, family = binomial, data = data_B)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4840  -0.4831  -0.4828  -0.4820   2.1028  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1057881  1.9069362  -1.104    0.269
## Salary_CHF   0.0005015  0.0626423   0.008    0.994
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 69.303  on 99  degrees of freedom
## Residual deviance: 69.303  on 98  degrees of freedom
## AIC: 73.303
## 
## Number of Fisher Scoring iterations: 4
logreg_missing_per_Job_interaction<-glm(missing~Salary_CHF*Job, data=data_long, family=binomial)
summary(logreg_missing_per_Job)
## 
## Call:
## glm(formula = missing ~ Salary_CHF + Job, family = binomial, 
##     data = data_long)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8792  -0.8780  -0.4829  -0.4824   2.1019  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7670618  2.1074489  -0.364    0.716
## Salary_CHF   0.0002212  0.0348951   0.006    0.995
## JobB        -1.3303162  1.1178086  -1.190    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 208.20  on 199  degrees of freedom
## Residual deviance: 194.68  on 197  degrees of freedom
## AIC: 200.68
## 
## Number of Fisher Scoring iterations: 4
# with purrr --> this is the same as doing both anaylsis seperatly (i.e. per Job)
library(purrr)
data_long %>% 
  split(.$Job) %>% 
  map(~glm(missing~Salary_CHF, family=binomial, data=.)) %>% 
  map(summary)
## $A
## 
## Call:
## glm(formula = missing ~ Salary_CHF, family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8787  -0.8783  -0.8781   1.5094   1.5101  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.595e-01  2.534e+00  -0.300    0.764
## Salary_CHF   9.508e-05  4.202e-02   0.002    0.998
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 125.37  on 99  degrees of freedom
## Residual deviance: 125.37  on 98  degrees of freedom
## AIC: 129.37
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $B
## 
## Call:
## glm(formula = missing ~ Salary_CHF, family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4840  -0.4831  -0.4828  -0.4820   2.1028  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1057881  1.9069362  -1.104    0.269
## Salary_CHF   0.0005015  0.0626423   0.008    0.994
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 69.303  on 99  degrees of freedom
## Residual deviance: 69.303  on 98  degrees of freedom
## AIC: 73.303
## 
## Number of Fisher Scoring iterations: 4

In the next output from the model with the interaction, we see that there is no interaction and no association of the salaryon the missingness on the salary, if we use the information on the job.

jtools::summ(logreg_missing_per_Job_interaction, confint=TRUE, exp=TRUE)
Observations 200
Dependent variable missing
Type Generalized linear model
Family binomial
Link logit
χ²(3) 13.53
Pseudo-R² (Cragg-Uhler) 0.10
Pseudo-R² (McFadden) 0.06
AIC 202.68
BIC 215.87
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.47 0.00 67.12 -0.30 0.76
Salary_CHF 1.00 0.92 1.09 0.00 1.00
JobB 0.26 0.00 130.19 -0.42 0.67
Salary_CHF:JobB 1.00 0.86 1.16 0.01 1.00
Standard errors: MLE
Probably this is easier to see when we analyse the jobs separately:

The association of the salaryon the probability of missing values in the salary for those with job A:

jtools::summ(logreg_missing_Job_A, confint=TRUE, exp=TRUE)
Observations 100
Dependent variable missing
Type Generalized linear model
Family binomial
Link logit
χ²(1) 0.00
Pseudo-R² (Cragg-Uhler) 0.00
Pseudo-R² (McFadden) 0.00
AIC 129.37
BIC 134.58
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.47 0.00 67.12 -0.30 0.76
Salary_CHF 1.00 0.92 1.09 0.00 1.00
Standard errors: MLE

The association of the salaryon the probability of missing values in the salary for those with job B:

jtools::summ(logreg_missing_Job_B, confint=TRUE, exp=TRUE)
Observations 100
Dependent variable missing
Type Generalized linear model
Family binomial
Link logit
χ²(1) 0.00
Pseudo-R² (Cragg-Uhler) 0.00
Pseudo-R² (McFadden) 0.00
AIC 73.30
BIC 78.51
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.12 0.00 5.11 -1.10 0.27
Salary_CHF 1.00 0.88 1.13 0.01 0.99
Standard errors: MLE
# now we put all together, but now we had seperate curves for both jobs
ggplot(data_long, aes(Salary_CHF, missing), group=Job)+
  geom_point(aes(colour=Job), alpha=0.3)+
  geom_smooth(aes(colour=Job), method=glm, method.args=list(family="binomial"))+
  labs(
    title="Logistic Regression Model", 
    x="Salary in CHF",
    y="Probability of having Missing Values in Salary_CHF"
  )
## `geom_smooth()` using formula 'y ~ x'
Analysed seperately per job, we see that there is no association between the salaryand the probability of missingness in salary. That's why we can say that the values in salaryare Missing At Random (dependent on the job)

Analysed seperately per job, we see that there is no association between the salaryand the probability of missingness in salary. That’s why we can say that the values in salaryare Missing At Random (dependent on the job)

For this example, we can say that the variable salary is Missing At Random if we take into consideration the job (“Salary is Missing At Random, dependent on the job); or we could even say, that within categories of job (i.e. Job A / Job B), salary is Missing Completely at Random.

In a real live situation, we can not test whether data are Missing At Random, because we do not have information on the values that are missing. It will be an assumption.

In practice, if we want to impute the missing salaries, our imputations would be wrong if we would discard the job information (i.e. if we would impute the mean of the salaries over all jobs), it would be important to consider the job variable in the imputation method.

2.5 Simple Imputation

Although simple imputations are not the best solution, we use it here to demonstrate that it is important that we take into account the relevant variables.

First, we delete the missing values in our dataset. Remember: In this didactic exercise, we had data on salary that we declared as being missing - but we still had these values - so we could analyse what the salaries of the missing salaries were - sorry, a bit complicated 🤔.

data_missing<-data_long %>% 
  mutate(Salary_CHF=ifelse(Missingness=="Missing", NA, Salary_CHF))
summary(data_missing)
##      Count           Job            Missingness          Salary_CHF       missing     
##  Min.   : 1.00   Length:200         Length:200         Min.   :19.49   Min.   :0.000  
##  1st Qu.:13.75   Class :character   Class :character   1st Qu.:30.17   1st Qu.:0.000  
##  Median :30.00   Mode  :character   Mode  :character   Median :35.52   Median :0.000  
##  Mean   :34.73                                         Mean   :43.03   Mean   :0.215  
##  3rd Qu.:54.00                                         3rd Qu.:59.14   3rd Qu.:0.000  
##  Max.   :89.00                                         Max.   :72.01   Max.   :1.000  
##                                                        NA's   :43
We could also just run our Little’s test to check whether the data are Missing Completely at Random. For this, we have to exclude the last variable (which is just an indicator for the missing of the data in the salary.

We see that the p-value is very low, therefore, we can not state the data are Missing Completely At Random.

naniar::mcar_test(data_missing[1:4])
## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1       200     3       0                2

Remember, there is no test to distinguish between Not Missing At Random and Missing At Random.

Now we impute the missing values of the salaries with the overall mean of the salaries (this is wrong, but we do it here anyway). Remember the true mean salary is 45.04.

data_missing_imputed1<-data_missing %>% 
  mutate(Salary_CHF=case_when(
    is.na(Salary_CHF)~mean(Salary_CHF, na.rm=TRUE), 
    TRUE~Salary_CHF))

After the imputation with the overall mean salary, the overall mean salary is: 43.03. This mean is biased. In this example we know this, because we can calculate the true mean, which is 45.04.

Now we impute the mean dependent of the job:

data_missing_imputed2<-data_missing %>% 
  group_by(Job) %>% 
  mutate(Salary_CHF=case_when(
    is.na(Salary_CHF)~mean(Salary_CHF, na.rm=TRUE), 
    TRUE~Salary_CHF ))

After the imputation with the job related mean, the overall mean salary is: 45.04, which corresponds to the true value.


3 Missing Not At Random

If data are Missing Not At Random, the probability of a missing value depends on the true (underlying) value of the variable and this association remains, even when controlling for all other observed variables. This means, in contrast to the salaryexample, we are not able to create a situation where the data are Missing At Random.

We use our data from the article we used for the Missint Completely At Random example and simulate data Missing Not At Random.

names(df1)
## [1] "Number"                 "ActiveSmoking"          "Age"                    "Sex"                    "COPDduration"           "FEV1"                   "WalkingDistance_m_6min"
df_mnar<-missMethods::delete_MNAR_censoring(df1, 0.3, c("ActiveSmoking", "Age", "Sex","COPDduration", "FEV1", "WalkingDistance_m_6min"))

Let’s have a look at the missingness pattern: We see that there are more missing values for those with a lower identification number, so patients included earlier in the study had more missing values.Then those with a higher walking distance in the six meter walking test had a higher probability of having missing values in other variables. What we can’t observe of course, is, whether the “true” value of a variable, for example the walking distance, had an influence of whether this value was observed or missing. But the probability is high that this could be the case (i.e. that the walking distance was missing in those patients with a good walking distance), as in general those with higher walking distance had more missing values.

df_mnar %>%
  add_prop_miss() %>% # Adds Column Containing Proportion Of Missing Data Values
  rpart(prop_miss_all ~ ., data = .,  model=TRUE) %>% # the rpart function does the regression tree 
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ") # plots the tree

df_mnar %>%
  missing_plot()

df_mnar %>% 
  missing_pairs(position = "fill", )

Let’s test whether data are Missing Completely At Random: We see that the p-value is very low, therefore, we can not consider the data as Missing Completely At Random.

littleltest<-naniar::mcar_test(df_mnar[1:4])
littleltest
## # A tibble: 1 x 4
##   statistic    df  p.value missing.patterns
##       <dbl> <dbl>    <dbl>            <int>
## 1      42.3    16 0.000357                8
if (littleltest$p.value < 0.05) {
  interpretation<-"there is evidence that the data are not Missing Completely At Random, therefore, we assume that they are either Missing At Random or Missing Not At Random."
} else {
  interpretation<-"there is no evidence that the data are not Missing Completely At Random."
}

if (littleltest$p.value < 0.001) {
  pvalue="p < 0.001" 
} else {
    pvalue=round(littleltest$p.value, 3)
  }

We see that the p-value is: p < 0.001, therefore we can conclude that there is evidence that the data are not Missing Completely At Random, therefore, we assume that they are either Missing At Random or Missing Not At Random.

4 Test for missing value patterns

4.1 Example Missing Completely At Random

Please read first here for some tips on how to test for missing value patterns Link.

You find here an extensive tutorial on how to analyse missing data in R

But let’s analyse our data. We use first our first dataset with missing values, the df1_mcar. This can help to identify some data coding errors.

finalfit::ff_glimpse(df1_mcar)
## $Continuous
##                                         label var_type  n missing_n missing_percent  mean    sd   min quartile_25 median quartile_75   max
## Number                                 Number    <dbl> 50         0             0.0  28.1  16.6   1.0        14.2   27.5        41.8  56.0
## ActiveSmoking                   ActiveSmoking    <dbl> 50         0             0.0 160.1 369.8   0.0         0.0    0.0         1.0 999.0
## Age                                       Age    <dbl> 42         8            16.0  69.1   7.6  53.0        64.2   69.5        76.0  80.0
## Sex                                       Sex    <dbl> 50         0             0.0   1.1   0.3   1.0         1.0    1.0         1.0   2.0
## COPDduration                     COPDduration    <dbl> 50         0             0.0  25.4  23.1   1.0         7.2   16.0        44.2  72.0
## FEV1                                     FEV1    <dbl> 42         8            16.0   1.6   0.6   0.4         1.1    1.6         2.0   2.5
## WalkingDistance_m_6min WalkingDistance_m_6min    <dbl> 42         8            16.0 363.5  79.0 214.7       304.9  371.4       412.4 555.3
## 
## $Categorical
## data frame with 0 columns and 50 rows

It is also nice to have some numbers for the missing data. The naniar package features some very helpful commands. You find here a nice tutorial

We can simply ask: are there any missing data?

any_na(df1_mcar)
## [1] TRUE

Yes, that’s true, there are some missing data. Then, we can look at the absolute frequency and the proportion of missing data:

n_miss(df1_mcar)
## [1] 24
prop_miss(df1_mcar)
## [1] 0.06857143

4.1.1 Tree with proportions missing

(Example Missing Completely At Random) But the next tree graph provides way more information. We see the proportion of missing values for different subgroups of the data. It’s cool, but you use three packages: naniar, rpart and rpart.plot. Have a look at this tutorial, you see even nicer trees

df1_mcar %>%
  add_prop_miss() %>% # Adds Column Containing Proportion Of Missing Data Values
  rpart(prop_miss_all ~ ., data = .,  model=TRUE) %>% # the rpart function does the regression tree 
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ") # plots the tree

4.1.2 Simple Table with Base R

(Example Missing Completely At Random) We can also simply produce a table with base R to get the variables that contain missing values:

df1_mcar %>% is.na() %>% colSums()
##                 Number          ActiveSmoking                    Age                    Sex           COPDduration                   FEV1 WalkingDistance_m_6min 
##                      0                      0                      8                      0                      0                      8                      8

But there are better ways to present this.

4.1.3 N and %

Absolute frequencies (N) and relative frequencies of missing values per variables (Example Missing Completely At Random)

# Get number of missings per variable (n and %)
miss_var_summary(df1_mcar)
## # A tibble: 7 x 3
##   variable               n_miss pct_miss
##   <chr>                   <int>    <dbl>
## 1 Age                         8       16
## 2 FEV1                        8       16
## 3 WalkingDistance_m_6min      8       16
## 4 Number                      0        0
## 5 ActiveSmoking               0        0
## 6 Sex                         0        0
## 7 COPDduration                0        0
miss_var_table(df1_mcar)
## # A tibble: 2 x 3
##   n_miss_in_var n_vars pct_vars
##           <int>  <int>    <dbl>
## 1             0      4     57.1
## 2             8      3     42.9

4.1.4 Visualising N & %

Visual presentation of absolute and relative frequencies (Example Missing Completely At Random) Do you prefer this visually?

gg_miss_var(df1_mcar)
Absolute numbers of missing values per variable

Absolute numbers of missing values per variable

4.1.5 Table N or %

Table of absolute frequencies (N) and relative frequencies of missing values per participant (Example Missing Completely At Random)

miss_case_summary(df1_mcar)
## # A tibble: 50 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1    11      2     28.6
##  2    24      2     28.6
##  3    30      2     28.6
##  4    32      2     28.6
##  5    38      2     28.6
##  6    39      2     28.6
##  7     1      1     14.3
##  8     2      1     14.3
##  9    10      1     14.3
## 10    12      1     14.3
## # ... with 40 more rows
miss_case_table(df1_mcar)
## # A tibble: 3 x 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0      32        64
## 2              1      12        24
## 3              2       6        12

4.1.6 Plot Patterns

(Example Missing Completely At Random) Next, we can visualize analyse the missing patterns. We can, for example, use the commands missing_plot() and missing_pattern() from the finalfit package.

df1_mcar %>%
  missing_plot()

4.1.7 Plot %

Plot with relative frequencies overall and per variable (Example Missing Completely At Random) The naniar packages has an alternative version

vis_miss(df1_mcar) + theme(axis.text.x = element_text(angle=80))

df1_mcar %>%
  missing_pattern()

##    Number ActiveSmoking Sex COPDduration Age FEV1 WalkingDistance_m_6min   
## 32      1             1   1            1   1    1                      1  0
## 5       1             1   1            1   1    1                      0  1
## 3       1             1   1            1   1    0                      1  1
## 2       1             1   1            1   1    0                      0  2
## 4       1             1   1            1   0    1                      1  1
## 1       1             1   1            1   0    1                      0  2
## 3       1             1   1            1   0    0                      1  2
##         0             0   0            0   8    8                      8 24

4.1.8 Combination of missing variables

Plot to evaluate combinations of Missingness (Example Missing Completely At Random) The naniar package has again a nice graph for the association between missing variables.

gg_miss_upset(df1_mcar)
We see how many times the combination of missing values are present.

We see how many times the combination of missing values are present.

4.1.9 Distributions Missinges

(Example Missing Completely At Random) The following figure is a bit more complex, but provides more information on the associations between values and missingness. In the following plot, we see, that the participants with missing FEV1 values are slightly younger than those with non-missing FEV1 values. Because we generated the missing data as Missing Completely At Random, we know that this difference is not a bias, but just a random error. The Little’s test was non-significant, i.e. indicating that the data can still be considered as Missing Completely At Random.

df1_mcar %>% 
  missing_pairs(position = "fill", )

4.1.10 Heatplot

(Example Missing Completely At Random) Naniar has a nice heatplot. We can decide which variable should be on the x-axis (horizontal axis)

gg_miss_fct(df1_mcar, fct = Sex)

### Comparison Missing / Nonmissing Comparison of values in some variables dependent on missingness of another variable (Example Missing Completely At Random) We can now compare the data. For this we need to define the dependent and independent variables. We can see that there is no significant difference in age, sex, COPD-duration and FEV1 between those who have missing values in the walking distance. We could now repeat the analyses with other dependent variables.

names(df1_mcar)
## [1] "Number"                 "ActiveSmoking"          "Age"                    "Sex"                    "COPDduration"           "FEV1"                   "WalkingDistance_m_6min"
explanatory = c("Age", "Sex", 
  "COPDduration", "FEV1")
dependent = "WalkingDistance_m_6min"
df1_mcar %>% 
  missing_compare(dependent, explanatory)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be incorrect
##  Missing data analysis: WalkingDistance_m_6min           Not missing     Missing     p
##                                            Age Mean (SD)  69.4 (7.6)  67.3 (8.1) 0.501
##                                            Sex         1   38 (82.6)    8 (17.4) 0.842
##                                                        2   4 (100.0)     0 (0.0)      
##                                   COPDduration Mean (SD) 24.9 (22.6) 27.9 (27.1) 0.744
##                                           FEV1 Mean (SD)   1.6 (0.6)   1.5 (0.4) 0.696

We use the same command as above, but this time we set Age as the dependent variable.

names(df1_mcar)
## [1] "Number"                 "ActiveSmoking"          "Age"                    "Sex"                    "COPDduration"           "FEV1"                   "WalkingDistance_m_6min"
explanatory = c("Sex", "WalkingDistance_m_6min",
  "COPDduration", "FEV1")
dependent = "Age"
df1_mcar %>% 
  missing_compare(dependent, explanatory)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be incorrect
##  Missing data analysis: Age            Not missing       Missing     p
##                         Sex         1    38 (82.6)      8 (17.4) 0.842
##                                     2    4 (100.0)       0 (0.0)      
##      WalkingDistance_m_6min Mean (SD) 368.9 (72.3) 336.6 (109.7) 0.329
##                COPDduration Mean (SD)  26.1 (24.9)    21.8 (8.9) 0.630
##                        FEV1 Mean (SD)    1.6 (0.6)     1.3 (0.4) 0.264

4.2 Same for Missing At Random

Now we use the same commands with our data for the Missing At Random example.

data_missing<-data_missing %>% 
  select(-Count, -Missingness, -missing)
finalfit::ff_glimpse(data_missing)
## $Continuous
##                 label var_type   n missing_n missing_percent mean   sd  min quartile_25 median quartile_75  max
## Salary_CHF Salary_CHF    <dbl> 157        43            21.5 43.0 15.7 19.5        30.2   35.5        59.1 72.0
## 
## $Categorical
##     label var_type   n missing_n missing_percent levels_n levels levels_count levels_percent
## Job   Job    <chr> 200         0             0.0        2      -            -              -

4.2.1 Simple Plot

data_missing %>%
  missing_plot()

data_missing %>%
  missing_pattern()

##     Job Salary_CHF   
## 157   1          1  0
## 43    1          0  1
##       0         43 43

4.2.2 Heatplot from the naniar package

Naniar has a nice heatplot

gg_miss_fct(data_missing, fct = Job)

4.2.3 Distribution Missingness (Example Missing At Random)

data_missing %>% 
  missing_pairs(position = "fill", )
## Warning: Removed 43 rows containing non-finite values (stat_boxplot).

## Warning: Removed 43 rows containing non-finite values (stat_boxplot).
We can see that people working in Job A have more often missing values in the salaryvariable. However, with this figure we can't see, whether higher salaries do have a higher probability of beeing missing data. This we could only evaluate, if we go visit some people with missing data in salaryand try to get the information from them to analyse whether higher salaries are associated with more or less missing data.

We can see that people working in Job A have more often missing values in the salaryvariable. However, with this figure we can’t see, whether higher salaries do have a higher probability of beeing missing data. This we could only evaluate, if we go visit some people with missing data in salaryand try to get the information from them to analyse whether higher salaries are associated with more or less missing data.

4.2.4 Decision Tree (Example Missing At Random)

Let’s again plot a tree with the proportion of missing values for subgroups:

data_missing %>%
  add_prop_miss() %>% # Adds Column Containing Proportion Of Missing Data Values
  rpart(prop_miss_all ~ ., data = .,  model=TRUE) %>% # the rpart function does the regression tree 
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ") # plots the tree

4.2.5 Comparison of Data (Example Missing At Random)

We can now compare the data. For this we need to define the dependent and independent variables. The table show us the missing in the salaryvariables stratified in the two jobs. We see that Job A has a higher proportion of missing values in salary.

names(data_missing)
## [1] "Job"        "Salary_CHF"
explanatory = c("Job")
dependent = "Salary_CHF"
data_missing %>% 
  missing_compare(dependent, explanatory)
##  Missing data analysis: Salary_CHF   Not missing   Missing     p
##                                Job A   68 (68.0) 32 (32.0) 0.001
##                                    B   89 (89.0) 11 (11.0)

4.3 How to test for missing value patterns (Example Missing Not At Random)

4.3.1 Finding Problems in Coding (Example Missing Not At Random)

But let’s analyse our data. We use first our first dataset with missing values, the df1_mcar. This can help to identify some data coding errors.

finalfit::ff_glimpse(df_mnar)
## $Continuous
##                                         label var_type  n missing_n missing_percent  mean    sd   min quartile_25 median quartile_75   max
## Number                                 Number    <dbl> 50         0             0.0  28.1  16.6   1.0        14.2   27.5        41.8  56.0
## ActiveSmoking                   ActiveSmoking    <dbl> 35        15            30.0 228.8 425.4   0.0         0.0    1.0         1.0 999.0
## Age                                       Age    <dbl> 35        15            30.0  72.7   5.2  65.0        67.5   72.0        77.0  80.0
## Sex                                       Sex    <dbl> 35        15            30.0   1.1   0.3   1.0         1.0    1.0         1.0   2.0
## COPDduration                     COPDduration    <dbl> 35        15            30.0  34.6  21.7   9.0        13.5   30.0        54.0  72.0
## FEV1                                     FEV1    <dbl> 35        15            30.0   1.8   0.4   1.3         1.5    1.7         2.1   2.5
## WalkingDistance_m_6min WalkingDistance_m_6min    <dbl> 35        15            30.0 399.0  53.3 315.2       367.1  390.4       423.2 555.3
## 
## $Categorical
## data frame with 0 columns and 50 rows

It is also nice to have some numbers for the missing data. The naniar package features some very helpful commands. You find here a nice tutorial

We can simply ask: are there any missing data?

any_na(df_mnar)
## [1] TRUE

Yes, that’s true, there are some missing data. Then, we can look at the absolute frequency and the proportion of missing data:

n_miss(df_mnar)
## [1] 90
prop_miss(df_mnar)
## [1] 0.2571429

4.3.2 Tree with proportions missing (Example Missing Not At Random)

But the next tree graph provides way more information. We see the proportion of missing values for different subgroups of the data. It’s cool, but you use three packages: naniar, rpart and rpart.plot. Have a look at this tutorial, you see even nicer trees

df_mnar %>%
  add_prop_miss() %>% # Adds Column Containing Proportion Of Missing Data Values
  rpart(prop_miss_all ~ ., data = .,  model=TRUE) %>% # the rpart function does the regression tree 
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ") # plots the tree

4.3.3 Simple Table with Base R (Example Missing Not At Random)

We can also simply produce a table with base R to get the variables that contain missing values:

df_mnar %>% is.na() %>% colSums()
##                 Number          ActiveSmoking                    Age                    Sex           COPDduration                   FEV1 WalkingDistance_m_6min 
##                      0                     15                     15                     15                     15                     15                     15

But there are better ways to present this.

4.3.4 Absolute frequencies (N) and relative frequencies of missing values per variables (Example Missing Not At Random)

# Get number of missings per variable (n and %)
miss_var_summary(df_mnar)
## # A tibble: 7 x 3
##   variable               n_miss pct_miss
##   <chr>                   <int>    <dbl>
## 1 ActiveSmoking              15       30
## 2 Age                        15       30
## 3 Sex                        15       30
## 4 COPDduration               15       30
## 5 FEV1                       15       30
## 6 WalkingDistance_m_6min     15       30
## 7 Number                      0        0
miss_var_table(df_mnar)
## # A tibble: 2 x 3
##   n_miss_in_var n_vars pct_vars
##           <int>  <int>    <dbl>
## 1             0      1     14.3
## 2            15      6     85.7

4.3.5 Visual presentation of absolute and relative frequencies (Example Missing Not At Random)

Do you prefer this visually?

gg_miss_var(df_mnar)
Absolute numbers of missing values per variable

Absolute numbers of missing values per variable

4.3.6 Table of absolute frequencies (N) and relative frequencies of missing values per participant (Example Missing Not At Random)

miss_case_summary(df_mnar)
## # A tibble: 50 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1     1      4     57.1
##  2    10      4     57.1
##  3    14      4     57.1
##  4     2      3     42.9
##  5     3      3     42.9
##  6     5      3     42.9
##  7     6      3     42.9
##  8     9      3     42.9
##  9    12      3     42.9
## 10    17      3     42.9
## # ... with 40 more rows
miss_case_table(df_mnar)
## # A tibble: 5 x 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0       7        14
## 2              1      13        26
## 3              2      16        32
## 4              3      11        22
## 5              4       3         6

4.3.7 Plot Patterns (Example Missing Not At Random)

Next, we can visualize analyse the missing patterns. We can, for example, use the commands missing_plot() and missing_pattern() from the finalfit package.

df_mnar %>%
  missing_plot()

4.3.8 Plot with relative frequencies overall and per variable (Example Missing Not At Random)

The naniar packages has an alternative version

vis_miss(df_mnar) + theme(axis.text.x = element_text(angle=80))

df_mnar %>%
  missing_pattern()

##   Number ActiveSmoking Age Sex COPDduration FEV1 WalkingDistance_m_6min   
## 7      1             1   1   1            1    1                      1  0
## 2      1             1   1   1            1    1                      0  1
## 2      1             1   1   1            1    0                      1  1
## 4      1             1   1   1            1    0                      0  2
## 3      1             1   1   1            0    1                      1  1
## 2      1             1   1   1            0    1                      0  2
## 1      1             1   1   1            0    0                      0  3
## 2      1             1   1   0            1    1                      1  1
## 4      1             1   0   1            1    1                      1  1
## 1      1             1   0   1            1    1                      0  2
## 2      1             1   0   1            1    0                      1  2
## 1      1             1   0   1            0    1                      1  2
## 2      1             1   0   0            1    1                      1  2
## 2      1             1   0   0            1    0                      1  3
## 1      1             0   1   1            1    1                      0  2
## 1      1             0   1   1            1    0                      1  2
## 1      1             0   1   1            1    0                      0  3
## 1      1             0   1   1            0    1                      1  2
## 1      1             0   1   1            0    0                      1  3
## 1      1             0   1   0            1    1                      1  2
## 1      1             0   1   0            1    1                      0  3
## 1      1             0   1   0            1    0                      1  3
## 2      1             0   1   0            0    1                      1  3
## 2      1             0   1   0            0    1                      0  4
## 1      1             0   0   1            0    1                      1  3
## 1      1             0   0   0            1    1                      1  3
## 1      1             0   0   0            0    1                      1  4
##        0            15  15  15           15   15                     15 90

4.3.9 Combinations of Missingness (Example Missing Not At Random)

The naniar package has again a nice graph for the association between missing variables.

gg_miss_upset(df_mnar)
We see how many times the combination of missing values are present.

We see how many times the combination of missing values are present.

4.3.10 Distributions Missinges (Example Missing Not At Random)

The following figure is a bit more complex, but provides more information on the associations between values and missingness. In the following plot, we see, that the participants with missing FEV1 values are slightly younger than those with non-missing FEV1 values. Because we generated the missing data as Missing Not At Random, we know that this difference is not a bias, but just a random error. The Little’s test was non-significant, i.e. indicating that the data can still be considered as Missing Not At Random.

df_mnar %>% 
  missing_pairs(position = "fill", )

4.3.11 Heatplot (Example Missing Not At Random)

Naniar has a nice heatplot. We can decide which variable should be on the x-axis (horizontal axis)

gg_miss_fct(df_mnar, fct = Sex)

4.3.12 Comparison of values in some variables dependent on missingness of another variable (Example Missing Not At Random)

We can now compare the data. For this we need to define the dependent and independent variables. We can see that there is a significant difference in age between those who have missing values in the walking distance. We could now repeat the analyses with other dependent variables.

names(df_mnar)
## [1] "Number"                 "ActiveSmoking"          "Age"                    "Sex"                    "COPDduration"           "FEV1"                   "WalkingDistance_m_6min"
explanatory = c("ActiveSmoking","Age", "Sex", 
  "COPDduration", "FEV1")
dependent = "WalkingDistance_m_6min"
df_mnar %>% 
  missing_compare(dependent, explanatory)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be incorrect
##  Missing data analysis: WalkingDistance_m_6min           Not missing     Missing     p
##                                  ActiveSmoking         0    9 (75.0)    3 (25.0) 0.299
##                                                        1   12 (80.0)    3 (20.0)      
##                                                      999    4 (50.0)    4 (50.0)      
##                                            Age Mean (SD)  71.2 (4.9)  75.1 (4.9) 0.028
##                                            Sex         1   22 (71.0)    9 (29.0) 0.207
##                                                        2    1 (25.0)    3 (75.0)      
##                                   COPDduration Mean (SD) 31.8 (20.6) 41.6 (23.9) 0.234
##                                           FEV1 Mean (SD)   1.8 (0.4)   1.6 (0.2) 0.185

We use the same command as above, but this time we set sex as the dependent variable. We see that there are significant differences in the walking distance, active smoking, COPD duration and FEV1 in between those with missing values in age and those without missing values in age.

names(df_mnar)
## [1] "Number"                 "ActiveSmoking"          "Age"                    "Sex"                    "COPDduration"           "FEV1"                   "WalkingDistance_m_6min"
explanatory = c("Sex", "WalkingDistance_m_6min","ActiveSmoking",
  "COPDduration", "FEV1")
dependent = "Age"
df_mnar %>% 
  missing_compare(dependent, explanatory)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be incorrect
##  Missing data analysis: Age            Not missing      Missing     p
##                         Sex         1    22 (71.0)     9 (29.0) 0.521
##                                     2    4 (100.0)      0 (0.0)      
##      WalkingDistance_m_6min Mean (SD) 378.9 (36.8) 429.2 (61.0) 0.005
##               ActiveSmoking         0    10 (83.3)     2 (16.7) 0.021
##                                     1     6 (40.0)     9 (60.0)      
##                                   999     7 (87.5)     1 (12.5)      
##                COPDduration Mean (SD)  41.1 (22.0)  22.2 (15.1) 0.012
##                        FEV1 Mean (SD)    1.7 (0.3)    2.0 (0.4) 0.007

5 Multiple Imputation

We will not talk in depth about multiple Imputation. You find nice tutorials on the web. For example in this online book

We can look at an example with our Missing At Random dataset

5.1 MICE Multiple Imputation Chained Equation

We will impute twenty datasets and then later combine them for the analysis. We use the standard imputation method (“pmm”) and a linear regression method with bootstrap (“norm.boot”).

data_missing<-data_missing %>% 
  mutate(Job=as.numeric(factor(Job))-1) # We need first to transform the character Variable Job into a numeric variable; if we do it 0 and 1, then the output from the regression will be easier to read.

imputed_Data <- mice::mice(data_missing, m= 20, maxit = 50, method = 'pmm', seed = 500) 



imputed_Data2 <- mice::mice(data_missing, m=    20, maxit = 50, method = 'norm.boot',  seed = 500) ## 

5.2 For the analyses, we run now the analyses on all datasets and use specific pooling methods to combine the results (we will not go into this today)

We could also extract one complete dataset and use this (but of course, this is not recommended)

imputed_dataset_3 <- complete(imputed_Data,3)

# we can again produce a summary table 
length2 <- function (x, na.rm=FALSE) {
  if (na.rm) sum(!is.na(x))
  else       length(x)
}

summary_imputed_dataset_3<-imputed_dataset_3 %>% 
  group_by(Job) %>% 
  summarise(across(where(is.numeric), list(mean=mean,sd=sd, n=length2), na.rm=TRUE))

5.2.1 now we can do the analyses on the all imputed datasets and then pool the results

Please consult this free online-book for more information.

We know that the “true” mean of the salary is 45, the biased mean is 43.

impdat <- complete(imputed_Data2,action="long",include = FALSE)

pool_mean <- with(impdat, by(impdat, .imp, function(x) c(mean(x$Salary_CHF, na.rm=TRUE),sd(x$Salary_CHF, na.rm=TRUE))))

Here we see the means and sds for the n imputed datasets:

pool_mean
## .imp: 1
## [1] 44.74463 15.56609
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 2
## [1] 44.90551 15.84167
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 3
## [1] 45.16237 15.72754
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 4
## [1] 45.21108 15.74316
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 5
## [1] 44.91854 15.70049
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 6
## [1] 44.97969 15.72645
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 7
## [1] 45.04073 15.87618
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 8
## [1] 45.28212 16.08505
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 9
## [1] 45.25419 15.95380
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 10
## [1] 44.86104 15.33384
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 11
## [1] 45.06406 15.93399
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 12
## [1] 44.94934 15.63871
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 13
## [1] 44.75034 15.89600
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 14
## [1] 44.90516 15.83382
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 15
## [1] 44.92813 15.96828
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 16
## [1] 45.05244 16.12921
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 17
## [1] 44.81725 15.92128
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 18
## [1] 44.66779 15.69874
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 19
## [1] 44.88738 15.78969
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
## .imp: 20
## [1] 44.89266 15.98175

Now we need to pool these:

Reduce("+",pool_mean)/length(pool_mean)
## [1] 44.96372 15.81729

We could also do regressions with the imputed data:

fitm <- with(imputed_Data, lm(Salary_CHF ~ 1))
lin.pool <- pool(fitm)
summary(lin.pool)
##          term estimate std.error statistic       df p.value
## 1 (Intercept)  44.8771  1.187918  37.77793 155.3157       0

We see that the first method we used with the predictive mean matching (which is the default method) provides good results.

fitm2 <- with(imputed_Data2, lm(Salary_CHF ~ Job))
lin.pool2 <- pool(fitm2)
summary(lin.pool2)
##          term  estimate std.error statistic       df p.value
## 1 (Intercept)  59.96643 0.5830683 102.84632 85.71535       0
## 2         Job -30.00541 0.8011106 -37.45477 99.36685       0

Also the second method, the linear regression imputation provides the same results.

For the comparison, below the analysis from the dataset where all values were available, i.e. the true result. We see that in this simple example the imputation method above produces the same results as compared to the analysis of the data set without missing values. However, we see that the standard errors are larger for the imputed analyses. So the imputation tries to correct the bias, but it will not reduce the loss in statistical precision.

res<-lm(Salary_CHF~1, data=data_long)
summary(res)
## 
## Call:
## lm(formula = Salary_CHF ~ 1, data = data_long)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5455 -14.4075  -0.1547  15.2620  26.9718 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.040      1.126   40.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.92 on 199 degrees of freedom

5.2.2 Analysis with the dataset with the missing data

We see that this result is biased. The true salary(i.e. if calculated with the full dataset) would be 45’000; here we see that the biased result is 43’000 when we omit the missing values.

res<-lm(Salary_CHF~1, data=data_missing)
summary(res)
## 
## Call:
## lm(formula = Salary_CHF ~ 1, data = data_missing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23.53 -12.85  -7.51  16.11  28.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   43.027      1.257   34.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.75 on 156 degrees of freedom
##   (43 observations deleted due to missingness)
fitm2 <- with(imputed_Data2, lm(Salary_CHF ~ 1))
lin.pool2 <- pool(fitm2)
summary(lin.pool2)
##          term estimate std.error statistic       df p.value
## 1 (Intercept) 44.96372  1.131903  39.72399 191.3292       0