Sensible data ranges are often easy to define based on biological principles (elephants are heavy!; mosquitoes are light!) or mathematical principles (percents must be bounded between 0 and 100).
library(tidyverse) # general data wranglinglibrary(summarytools)
If you have already completed the Scrambled data types module you will have created this file. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.
We can use the tidyverse package to make ourselves a quick table of ranges to scan.
# This code reads a dataframe 'df' and selects only the numeric columns. It then pivots the# data into a longer format, groups by the column names, and summarizes the minimum and# maximum values for each column. min_max <- df %>%select(where(is.numeric)) %>%pivot_longer(cols =everything()) %>%group_by(name) %>%summarize(across("value", list(min =~min(.x, na.rm =TRUE), max =~max(.x, na.rm =TRUE))))
name
value_min
value_max
harvest_moisture_pct
5.82
43.4
height_in
31.00
188.0
year
1978.00
2022.0
yield_lb_acre
230.00
3426.0
If you completed the Missing data module you will recall that the summarytools package also gives us this information (and more!) in the Stats/Values column.
view(dfSummary(df))
Data Frame Summary
df
Dimensions: 2685 x 8
Duplicates: 0
Variable
Stats / Values
Freqs (% of Valid)
Graph
Missing
harvest_date
[character]
1. 1999-10-13
2. 2006-10-23
3. 2005-10-25
4. 1997-09-30
5. 1993-10-21
6. 1994-10-12
7. 2000-10-17
8. 2004-10-27
9. 2010-10-21
10. 2009-11-16
[ 14 others ]
113
(
6.9%
)
107
(
6.6%
)
105
(
6.4%
)
99
(
6.1%
)
98
(
6.0%
)
98
(
6.0%
)
90
(
5.5%
)
80
(
4.9%
)
76
(
4.7%
)
73
(
4.5%
)
692
(
42.4%
)
1054
(39.3%)
harvest_moisture_pct
[numeric]
Mean (sd) : 14.3 (6.1)
min ≤ med ≤ max:
5.8 ≤ 11.8 ≤ 43.4
IQR (CV) : 7.9 (0.4)
413 distinct values
656
(24.4%)
height_in
[numeric]
Mean (sd) : 62.2 (19.6)
min ≤ med ≤ max:
31 ≤ 60 ≤ 188
IQR (CV) : 10 (0.3)
376 distinct values
398
(14.8%)
yield_lb_acre
[numeric]
Mean (sd) : 1954.6 (497.5)
min ≤ med ≤ max:
230 ≤ 2001 ≤ 3426
IQR (CV) : 641.1 (0.3)
1635 distinct values
73
(2.7%)
year
[integer]
Mean (sd) : 1999.5 (11.4)
min ≤ med ≤ max:
1978 ≤ 1999 ≤ 2022
IQR (CV) : 17 (0)
40 distinct values
0
(0.0%)
hybrid
[character]
1. 894
2. Falcon
3. SF187
4. 141
5. Badger DMR
6. 8D310
7. Hornet
8. SF270
9. 3845 HO
10. 432 E
[ 1757 others ]
27
(
1.0%
)
12
(
0.4%
)
10
(
0.4%
)
9
(
0.3%
)
9
(
0.3%
)
8
(
0.3%
)
8
(
0.3%
)
8
(
0.3%
)
7
(
0.3%
)
7
(
0.3%
)
2580
(
96.1%
)
0
(0.0%)
emergence_date
[character]
1. 1999-06-04
2. 1993-05-29
3. 2005-05-26
4. 2000-06-06
5. 2006-06-06
6. 1994-05-28
7. 1994-05-29
8. 1999-06-03
9. 2010-05-23
10. 2004-05-29
[ 140 others ]
34
(
2.5%
)
30
(
2.2%
)
30
(
2.2%
)
28
(
2.1%
)
28
(
2.1%
)
27
(
2.0%
)
27
(
2.0%
)
26
(
1.9%
)
26
(
1.9%
)
25
(
1.8%
)
1084
(
79.4%
)
1320
(49.2%)
planting_date
[character]
1. 2006-06-01
2. 2005-05-23
3. 1997-05-29
4. 1993-05-26
5. 1994-05-25
6. 2000-06-02
7. 2004-05-26
8. 2010-05-20
9. 2009-05-27
10. 1991-05-28
[ 56 others ]
107
(
6.9%
)
105
(
6.7%
)
99
(
6.3%
)
98
(
6.3%
)
98
(
6.3%
)
90
(
5.8%
)
80
(
5.1%
)
76
(
4.9%
)
73
(
4.7%
)
72
(
4.6%
)
663
(
42.5%
)
1124
(41.9%)
Generated by summarytools 1.0.1 (R version 4.2.2) 2024-02-28
Often a visual assessment is key to figuring out what is wrong. Most biological data are normally or lognormally distributed. What is that suspect lump of data in height_in?
too_tall <- df %>%# filter on values >3d greater than the mean#filter(height_in > 62.2 + 19.6 * 3)filter(height_in > (mean(height_in, na.rm =TRUE)+3*sd(height_in, na.rm =TRUE)))
harvest_date
harvest_moisture_pct
height_in
yield_lb_acre
year
hybrid
emergence_date
planting_date
2004-10-27
23.6
173
1310
2004
894
2004-05-29
2004-05-26
2004-10-27
26.9
160
1421
2004
9405
2004-05-30
2004-05-26
2004-10-27
24.4
170
1597
2004
9441
2004-05-28
2004-05-26
2004-10-27
25.2
157
1609
2004
CL 55-15
2004-05-31
2004-05-26
2004-10-27
22.2
155
1687
2004
Exp 15
2004-05-30
2004-05-26
2004-10-27
24.1
168
1603
2004
Exp T-1
NA
2004-05-26
2004-10-27
23.4
165
1529
2004
Exp T-2
2004-05-30
2004-05-26
2004-10-27
23.7
175
1474
2004
Exp T-3
2004-05-28
2004-05-26
2004-10-27
21.3
165
1594
2004
MSR 50
2004-05-29
2004-05-26
2004-10-27
21.2
188
2027
2004
MSR 51
2004-05-31
2004-05-26
All appear to be from a single year, which is either (a) the tallest year in sunflower history or (b) a coding error. Perhaps the data were recorded in centimeters instead of inches?
# does converting to cm to inches fix the problem?mean(too_tall$height_in, na.rm =TRUE) /2.54
[1] 61.46716
It is always best to double check your field notes, your assistants field notes etc to verify your assumptions when correcting data after the fact! But sometimes, the unit errors are pretty easy to detect and erroneous data can be easily corrected after the fact.