4. Data in range

Sensible data ranges are often easy to define based on biological principles (elephants are heavy!; mosquitoes are light!) or mathematical principles (percents must be bounded between 0 and 100).

library(tidyverse) # general data wrangling
library(summarytools)

If you have already completed the Scrambled data types module you will have created this file. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.

df <- read.csv(file.path("example_data", "sunflower_data_1.csv"))

We can use the tidyverse package to make ourselves a quick table of ranges to scan.

# This code reads a dataframe 'df' and selects only the numeric columns. It then pivots the
# data into a longer format, groups by the column names, and summarizes the minimum and
# maximum values for each column. 

min_max <- df %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarize(across("value", list(min = ~ min(.x, na.rm = TRUE), max = ~ max(.x, na.rm = TRUE))))

name	value_min	value_max
harvest_moisture_pct	5.82	43.4
height_in	31.00	188.0
year	1978.00	2022.0
yield_lb_acre	230.00	3426.0

If you completed the Missing data module you will recall that the summarytools package also gives us this information (and more!) in the Stats/Values column.

view(dfSummary(df))

Data Frame Summary

df

Dimensions: 2685 x 8
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

harvest_date [character]

1. 1999-10-13
2. 2006-10-23
3. 2005-10-25
4. 1997-09-30
5. 1993-10-21
6. 1994-10-12
7. 2000-10-17
8. 2004-10-27
9. 2010-10-21
10. 2009-11-16
[ 14 others ]

113	(	6.9%	)
107	(	6.6%	)
105	(	6.4%	)
99	(	6.1%	)
98	(	6.0%	)
98	(	6.0%	)
90	(	5.5%	)
80	(	4.9%	)
76	(	4.7%	)
73	(	4.5%	)
692	(	42.4%	)

1054 (39.3%)

harvest_moisture_pct [numeric]

Mean (sd) : 14.3 (6.1)

min ≤ med ≤ max:

5.8 ≤ 11.8 ≤ 43.4

IQR (CV) : 7.9 (0.4)

413 distinct values

656 (24.4%)

height_in [numeric]

Mean (sd) : 62.2 (19.6)

min ≤ med ≤ max:

31 ≤ 60 ≤ 188

IQR (CV) : 10 (0.3)

376 distinct values

398 (14.8%)

yield_lb_acre [numeric]

Mean (sd) : 1954.6 (497.5)

min ≤ med ≤ max:

230 ≤ 2001 ≤ 3426

IQR (CV) : 641.1 (0.3)

1635 distinct values

73 (2.7%)

year [integer]

Mean (sd) : 1999.5 (11.4)

min ≤ med ≤ max:

1978 ≤ 1999 ≤ 2022

IQR (CV) : 17 (0)

40 distinct values

0 (0.0%)

hybrid [character]

1. 894

2. Falcon

3. SF187

4. 141

5. Badger DMR

6. 8D310

7. Hornet

8. SF270

9. 3845 HO

10. 432 E

[ 1757 others ]

27	(	1.0%	)
12	(	0.4%	)
10	(	0.4%	)
9	(	0.3%	)
9	(	0.3%	)
8	(	0.3%	)
8	(	0.3%	)
8	(	0.3%	)
7	(	0.3%	)
7	(	0.3%	)
2580	(	96.1%	)

0 (0.0%)

emergence_date [character]

1. 1999-06-04
2. 1993-05-29
3. 2005-05-26
4. 2000-06-06
5. 2006-06-06
6. 1994-05-28
7. 1994-05-29
8. 1999-06-03
9. 2010-05-23
10. 2004-05-29
[ 140 others ]

34	(	2.5%	)
30	(	2.2%	)
30	(	2.2%	)
28	(	2.1%	)
28	(	2.1%	)
27	(	2.0%	)
27	(	2.0%	)
26	(	1.9%	)
26	(	1.9%	)
25	(	1.8%	)
1084	(	79.4%	)

1320 (49.2%)

planting_date [character]

1. 2006-06-01
2. 2005-05-23
3. 1997-05-29
4. 1993-05-26
5. 1994-05-25
6. 2000-06-02
7. 2004-05-26
8. 2010-05-20
9. 2009-05-27
10. 1991-05-28
[ 56 others ]

107	(	6.9%	)
105	(	6.7%	)
99	(	6.3%	)
98	(	6.3%	)
98	(	6.3%	)
90	(	5.8%	)
80	(	5.1%	)
76	(	4.9%	)
73	(	4.7%	)
72	(	4.6%	)
663	(	42.5%	)

1124 (41.9%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2024-02-28

Often a visual assessment is key to figuring out what is wrong. Most biological data are normally or lognormally distributed. What is that suspect lump of data in height_in?

too_tall <- df %>%
  # filter on values >3d greater than the mean
  #filter(height_in > 62.2 + 19.6 * 3)
  filter(height_in > (mean(height_in, na.rm = TRUE)+3*sd(height_in, na.rm = TRUE)))

harvest_date	harvest_moisture_pct	height_in	yield_lb_acre	year	hybrid	emergence_date	planting_date
2004-10-27	23.6	173	1310	2004	894	2004-05-29	2004-05-26
2004-10-27	26.9	160	1421	2004	9405	2004-05-30	2004-05-26
2004-10-27	24.4	170	1597	2004	9441	2004-05-28	2004-05-26
2004-10-27	25.2	157	1609	2004	CL 55-15	2004-05-31	2004-05-26
2004-10-27	22.2	155	1687	2004	Exp 15	2004-05-30	2004-05-26
2004-10-27	24.1	168	1603	2004	Exp T-1	NA	2004-05-26
2004-10-27	23.4	165	1529	2004	Exp T-2	2004-05-30	2004-05-26
2004-10-27	23.7	175	1474	2004	Exp T-3	2004-05-28	2004-05-26
2004-10-27	21.3	165	1594	2004	MSR 50	2004-05-29	2004-05-26
2004-10-27	21.2	188	2027	2004	MSR 51	2004-05-31	2004-05-26

All appear to be from a single year, which is either (a) the tallest year in sunflower history or (b) a coding error. Perhaps the data were recorded in centimeters instead of inches?

not_too_tall <- df %>%
  anti_join(., too_tall)

Joining with `by = join_by(harvest_date, harvest_moisture_pct, height_in,
yield_lb_acre, year, hybrid, emergence_date, planting_date)`

mean(not_too_tall$height_in, na.rm = TRUE)

[1] 58.8046

mean(too_tall$height_in, na.rm = TRUE)

[1] 156.1266

# does converting to cm to inches fix the problem?
mean(too_tall$height_in, na.rm = TRUE) / 2.54

[1] 61.46716

It is always best to double check your field notes, your assistants field notes etc to verify your assumptions when correcting data after the fact! But sometimes, the unit errors are pretty easy to detect and erroneous data can be easily corrected after the fact.

too_tall <- too_tall %>%
  mutate(height_in = height_in / 2.54)

df <- bind_rows(too_tall, not_too_tall)