2. Missing data

Before running an analysis - it’s important to check that the data are reasonably complete. Some knowledge of the data collection protocol is generally necessary to know what data are expected. For example, did you head out to the field one day and forget a critical piece of equipment so there are NAs in a particular column? Was it impossible to access the site in 2020 due to COVID restrictions? Or was there a sheet that just didn’t get entered and you need to head back to your files and key in those data?

library(tidyverse) # general data wrangling
library(summarytools) # summarizes missing values per column

If you have already completed the Scrambled data types module you will have created this file. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.

df <-read.csv (file.path("example_data", "sunflower_data_1.csv"))

We will be using the summarytools package can give us a nice overview of our dataset, including the percent of missing values (NAs) recorded in each variable.

note to mac users, you probably need X quartz installed for the summarytools package to work. And it may crash RStudio from time to time…

view(dfSummary(df))

Data Frame Summary

df

Dimensions: 2685 x 8
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

harvest_date [character]

1. 1999-10-13
2. 2006-10-23
3. 2005-10-25
4. 1997-09-30
5. 1993-10-21
6. 1994-10-12
7. 2000-10-17
8. 2004-10-27
9. 2010-10-21
10. 2009-11-16
[ 14 others ]

113	(	6.9%	)
107	(	6.6%	)
105	(	6.4%	)
99	(	6.1%	)
98	(	6.0%	)
98	(	6.0%	)
90	(	5.5%	)
80	(	4.9%	)
76	(	4.7%	)
73	(	4.5%	)
692	(	42.4%	)

1054 (39.3%)

harvest_moisture_pct [numeric]

Mean (sd) : 14.3 (6.1)

min ≤ med ≤ max:

5.8 ≤ 11.8 ≤ 43.4

IQR (CV) : 7.9 (0.4)

413 distinct values

656 (24.4%)

height_in [numeric]

Mean (sd) : 62.2 (19.6)

min ≤ med ≤ max:

31 ≤ 60 ≤ 188

IQR (CV) : 10 (0.3)

376 distinct values

398 (14.8%)

yield_lb_acre [numeric]

Mean (sd) : 1954.6 (497.5)

min ≤ med ≤ max:

230 ≤ 2001 ≤ 3426

IQR (CV) : 641.1 (0.3)

1635 distinct values

73 (2.7%)

year [integer]

Mean (sd) : 1999.5 (11.4)

min ≤ med ≤ max:

1978 ≤ 1999 ≤ 2022

IQR (CV) : 17 (0)

40 distinct values

0 (0.0%)

hybrid [character]

1. 894

2. Falcon

3. SF187

4. 141

5. Badger DMR

6. 8D310

7. Hornet

8. SF270

9. 3845 HO

10. 432 E

[ 1757 others ]

27	(	1.0%	)
12	(	0.4%	)
10	(	0.4%	)
9	(	0.3%	)
9	(	0.3%	)
8	(	0.3%	)
8	(	0.3%	)
8	(	0.3%	)
7	(	0.3%	)
7	(	0.3%	)
2580	(	96.1%	)

0 (0.0%)

emergence_date [character]

1. 1999-06-04
2. 1993-05-29
3. 2005-05-26
4. 2000-06-06
5. 2006-06-06
6. 1994-05-28
7. 1994-05-29
8. 1999-06-03
9. 2010-05-23
10. 2004-05-29
[ 140 others ]

34	(	2.5%	)
30	(	2.2%	)
30	(	2.2%	)
28	(	2.1%	)
28	(	2.1%	)
27	(	2.0%	)
27	(	2.0%	)
26	(	1.9%	)
26	(	1.9%	)
25	(	1.8%	)
1084	(	79.4%	)

1320 (49.2%)

planting_date [character]

1. 2006-06-01
2. 2005-05-23
3. 1997-05-29
4. 1993-05-26
5. 1994-05-25
6. 2000-06-02
7. 2004-05-26
8. 2010-05-20
9. 2009-05-27
10. 1991-05-28
[ 56 others ]

107	(	6.9%	)
105	(	6.7%	)
99	(	6.3%	)
98	(	6.3%	)
98	(	6.3%	)
90	(	5.8%	)
80	(	5.1%	)
76	(	4.9%	)
73	(	4.7%	)
72	(	4.6%	)
663	(	42.5%	)

1124 (41.9%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2024-02-28

Knowing more about the intended sampling design will tell us what other missing values should we look for. For example, were data collected every year? If so examining the years included in the dataset may be important.

Sometimes a visual check is easy to spot what is missing.

sort (unique (df$year))

 [1] 1978 1979 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
[16] 1994 1996 1997 1998 1999 2000 2001 2002 2004 2005 2006 2007 2008 2009 2010
[31] 2012 2013 2014 2015 2016 2017 2018 2020 2021 2022

Sometimes it’s hard to find exactly what is missing. setdiff (for simple checks for expected values based on complete sets) expand.grid (for checking pre-defined combinations) and the padr package (for padding out dataset based on regular temporal frequency) are your friends here. For this dataset, we’ll use the simple code below to figure out what years are missing. It is then up to you whether it is important to add those years with NA values to the analysis, go hunt further for the data, or just carry on with the dataset structured as is, but knowing they are missing.

setdiff(seq(min(df$year), max(df$year)), unique(df$year))

[1] 1980 1995 2003 2011 2019

Commonly, we expect a similar number of samples over some interval (e.g. per site or year). It can be helpful to calculate summaries of number of records by groups to figure out if the data collection efforts are distributed as expected. Viewing just the first 10 years, the number of data points is highly variable among years, is this expected?

ct_by_year = df %>%
  group_by(year) %>%
  tally()

year	n
1978	26
1979	36
1981	53
1982	64
1983	69
1984	64
1985	64
1986	57
1987	64
1988	49