Before running an analysis - it’s important to check that the data are reasonably complete. Some knowledge of the data collection protocol is generally necessary to know what data are expected. For example, did you head out to the field one day and forget a critical piece of equipment so there are NAs in a particular column? Was it impossible to access the site in 2020 due to COVID restrictions? Or was there a sheet that just didn’t get entered and you need to head back to your files and key in those data?
library(tidyverse) # general data wranglinglibrary(summarytools) # summarizes missing values per column
If you have already completed the Scrambled data types module you will have created this file. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.
We will be using the summarytools package can give us a nice overview of our dataset, including the percent of missing values (NAs) recorded in each variable.
note to mac users, you probably need X quartz installed for the summarytools package to work. And it may crash RStudio from time to time…
view(dfSummary(df))
Data Frame Summary
df
Dimensions: 2685 x 8
Duplicates: 0
Variable
Stats / Values
Freqs (% of Valid)
Graph
Missing
harvest_date
[character]
1. 1999-10-13
2. 2006-10-23
3. 2005-10-25
4. 1997-09-30
5. 1993-10-21
6. 1994-10-12
7. 2000-10-17
8. 2004-10-27
9. 2010-10-21
10. 2009-11-16
[ 14 others ]
113
(
6.9%
)
107
(
6.6%
)
105
(
6.4%
)
99
(
6.1%
)
98
(
6.0%
)
98
(
6.0%
)
90
(
5.5%
)
80
(
4.9%
)
76
(
4.7%
)
73
(
4.5%
)
692
(
42.4%
)
1054
(39.3%)
harvest_moisture_pct
[numeric]
Mean (sd) : 14.3 (6.1)
min ≤ med ≤ max:
5.8 ≤ 11.8 ≤ 43.4
IQR (CV) : 7.9 (0.4)
413 distinct values
656
(24.4%)
height_in
[numeric]
Mean (sd) : 62.2 (19.6)
min ≤ med ≤ max:
31 ≤ 60 ≤ 188
IQR (CV) : 10 (0.3)
376 distinct values
398
(14.8%)
yield_lb_acre
[numeric]
Mean (sd) : 1954.6 (497.5)
min ≤ med ≤ max:
230 ≤ 2001 ≤ 3426
IQR (CV) : 641.1 (0.3)
1635 distinct values
73
(2.7%)
year
[integer]
Mean (sd) : 1999.5 (11.4)
min ≤ med ≤ max:
1978 ≤ 1999 ≤ 2022
IQR (CV) : 17 (0)
40 distinct values
0
(0.0%)
hybrid
[character]
1. 894
2. Falcon
3. SF187
4. 141
5. Badger DMR
6. 8D310
7. Hornet
8. SF270
9. 3845 HO
10. 432 E
[ 1757 others ]
27
(
1.0%
)
12
(
0.4%
)
10
(
0.4%
)
9
(
0.3%
)
9
(
0.3%
)
8
(
0.3%
)
8
(
0.3%
)
8
(
0.3%
)
7
(
0.3%
)
7
(
0.3%
)
2580
(
96.1%
)
0
(0.0%)
emergence_date
[character]
1. 1999-06-04
2. 1993-05-29
3. 2005-05-26
4. 2000-06-06
5. 2006-06-06
6. 1994-05-28
7. 1994-05-29
8. 1999-06-03
9. 2010-05-23
10. 2004-05-29
[ 140 others ]
34
(
2.5%
)
30
(
2.2%
)
30
(
2.2%
)
28
(
2.1%
)
28
(
2.1%
)
27
(
2.0%
)
27
(
2.0%
)
26
(
1.9%
)
26
(
1.9%
)
25
(
1.8%
)
1084
(
79.4%
)
1320
(49.2%)
planting_date
[character]
1. 2006-06-01
2. 2005-05-23
3. 1997-05-29
4. 1993-05-26
5. 1994-05-25
6. 2000-06-02
7. 2004-05-26
8. 2010-05-20
9. 2009-05-27
10. 1991-05-28
[ 56 others ]
107
(
6.9%
)
105
(
6.7%
)
99
(
6.3%
)
98
(
6.3%
)
98
(
6.3%
)
90
(
5.8%
)
80
(
5.1%
)
76
(
4.9%
)
73
(
4.7%
)
72
(
4.6%
)
663
(
42.5%
)
1124
(41.9%)
Generated by summarytools 1.0.1 (R version 4.2.2) 2024-02-28
Knowing more about the intended sampling design will tell us what other missing values should we look for. For example, were data collected every year? If so examining the years included in the dataset may be important.
Sometimes a visual check is easy to spot what is missing.
Sometimes it’s hard to find exactly what is missing. setdiff (for simple checks for expected values based on complete sets) expand.grid (for checking pre-defined combinations) and the padr package (for padding out dataset based on regular temporal frequency) are your friends here. For this dataset, we’ll use the simple code below to figure out what years are missing. It is then up to you whether it is important to add those years with NA values to the analysis, go hunt further for the data, or just carry on with the dataset structured as is, but knowing they are missing.
Commonly, we expect a similar number of samples over some interval (e.g. per site or year). It can be helpful to calculate summaries of number of records by groups to figure out if the data collection efforts are distributed as expected. Viewing just the first 10 years, the number of data points is highly variable among years, is this expected?