4. Data in range

Sensible data ranges are often easy to define based on biological principles (elephants are heavy!; mosquitoes are light!) or mathematical principles (percents must be bounded between 0 and 100).

library(tidyverse) # general data wrangling
library(summarytools)

If you have already completed the Scrambled data types module you will have created this file. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.

df <- read.csv(file.path("example_data", "sunflower_data_1.csv"))

We can use the tidyverse package to make ourselves a quick table of ranges to scan.

# This code reads a dataframe 'df' and selects only the numeric columns. It then pivots the
# data into a longer format, groups by the column names, and summarizes the minimum and
# maximum values for each column. 

min_max <- df %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarize(across("value", list(min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))))
name value_min value_max
harvest_moisture_pct 5.82 43.4
height_in 31.00 188.0
year 1978.00 2022.0
yield_lb_acre 230.00 3426.0

If you completed the Missing data module you will recall that the summarytools package also gives us this information (and more!) in the Stats/Values column.

view(dfSummary(df))

Data Frame Summary

df

Dimensions: 2685 x 8
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
harvest_date [character]
1. 1999-10-13
2. 2006-10-23
3. 2005-10-25
4. 1997-09-30
5. 1993-10-21
6. 1994-10-12
7. 2000-10-17
8. 2004-10-27
9. 2010-10-21
10. 2009-11-16
[ 14 others ]
113(6.9%)
107(6.6%)
105(6.4%)
99(6.1%)
98(6.0%)
98(6.0%)
90(5.5%)
80(4.9%)
76(4.7%)
73(4.5%)
692(42.4%)
1054 (39.3%)
harvest_moisture_pct [numeric]
Mean (sd) : 14.3 (6.1)
min ≤ med ≤ max:
5.8 ≤ 11.8 ≤ 43.4
IQR (CV) : 7.9 (0.4)
413 distinct values 656 (24.4%)
height_in [numeric]
Mean (sd) : 62.2 (19.6)
min ≤ med ≤ max:
31 ≤ 60 ≤ 188
IQR (CV) : 10 (0.3)
376 distinct values 398 (14.8%)
yield_lb_acre [numeric]
Mean (sd) : 1954.6 (497.5)
min ≤ med ≤ max:
230 ≤ 2001 ≤ 3426
IQR (CV) : 641.1 (0.3)
1635 distinct values 73 (2.7%)
year [integer]
Mean (sd) : 1999.5 (11.4)
min ≤ med ≤ max:
1978 ≤ 1999 ≤ 2022
IQR (CV) : 17 (0)
40 distinct values 0 (0.0%)
hybrid [character]
1. 894
2. Falcon
3. SF187
4. 141
5. Badger DMR
6. 8D310
7. Hornet
8. SF270
9. 3845 HO
10. 432 E
[ 1757 others ]
27(1.0%)
12(0.4%)
10(0.4%)
9(0.3%)
9(0.3%)
8(0.3%)
8(0.3%)
8(0.3%)
7(0.3%)
7(0.3%)
2580(96.1%)
0 (0.0%)
emergence_date [character]
1. 1999-06-04
2. 1993-05-29
3. 2005-05-26
4. 2000-06-06
5. 2006-06-06
6. 1994-05-28
7. 1994-05-29
8. 1999-06-03
9. 2010-05-23
10. 2004-05-29
[ 140 others ]
34(2.5%)
30(2.2%)
30(2.2%)
28(2.1%)
28(2.1%)
27(2.0%)
27(2.0%)
26(1.9%)
26(1.9%)
25(1.8%)
1084(79.4%)
1320 (49.2%)
planting_date [character]
1. 2006-06-01
2. 2005-05-23
3. 1997-05-29
4. 1993-05-26
5. 1994-05-25
6. 2000-06-02
7. 2004-05-26
8. 2010-05-20
9. 2009-05-27
10. 1991-05-28
[ 56 others ]
107(6.9%)
105(6.7%)
99(6.3%)
98(6.3%)
98(6.3%)
90(5.8%)
80(5.1%)
76(4.9%)
73(4.7%)
72(4.6%)
663(42.5%)
1124 (41.9%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2024-02-28

Often a visual assessment is key to figuring out what is wrong. Most biological data are normally or lognormally distributed. What is that suspect lump of data in height_in?

too_tall <- df %>%
  # filter on values >3d greater than the mean
  #filter(height_in > 62.2 + 19.6 * 3)
  filter(height_in > (mean(height_in, na.rm = TRUE)+3*sd(height_in, na.rm = TRUE)))
harvest_date harvest_moisture_pct height_in yield_lb_acre year hybrid emergence_date planting_date
2004-10-27 23.6 173 1310 2004 894 2004-05-29 2004-05-26
2004-10-27 26.9 160 1421 2004 9405 2004-05-30 2004-05-26
2004-10-27 24.4 170 1597 2004 9441 2004-05-28 2004-05-26
2004-10-27 25.2 157 1609 2004 CL 55-15 2004-05-31 2004-05-26
2004-10-27 22.2 155 1687 2004 Exp 15 2004-05-30 2004-05-26
2004-10-27 24.1 168 1603 2004 Exp T-1 NA 2004-05-26
2004-10-27 23.4 165 1529 2004 Exp T-2 2004-05-30 2004-05-26
2004-10-27 23.7 175 1474 2004 Exp T-3 2004-05-28 2004-05-26
2004-10-27 21.3 165 1594 2004 MSR 50 2004-05-29 2004-05-26
2004-10-27 21.2 188 2027 2004 MSR 51 2004-05-31 2004-05-26

All appear to be from a single year, which is either (a) the tallest year in sunflower history or (b) a coding error. Perhaps the data were recorded in centimeters instead of inches?

not_too_tall <- df %>%
  anti_join(., too_tall)
Joining with `by = join_by(harvest_date, harvest_moisture_pct, height_in,
yield_lb_acre, year, hybrid, emergence_date, planting_date)`
mean(not_too_tall$height_in, na.rm = TRUE)
[1] 58.8046
mean(too_tall$height_in, na.rm = TRUE)
[1] 156.1266
# does converting to cm to inches fix the problem?
mean(too_tall$height_in, na.rm = TRUE) / 2.54
[1] 61.46716

It is always best to double check your field notes, your assistants field notes etc to verify your assumptions when correcting data after the fact! But sometimes, the unit errors are pretty easy to detect and erroneous data can be easily corrected after the fact.

too_tall <- too_tall %>%
  mutate(height_in = height_in / 2.54)

df <- bind_rows(too_tall, not_too_tall)