5. Spelling counts

Typing is hard. REALLY HARD. I get it. But computers don’t. So if you have typos in your data, your analysis is going to treat those creatively spelled variables as entirely different things. It’s unfortunate.

library(tidyverse) # general data wrangling
library(summarytools) # slick dataset summaries

If you have already completed the Scrambled data types module you will have created this file. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.

df <- read.csv("example_data/sunflower_data_1.csv")

The summarytools package introduced in previous modules includes information on the number most commonly encountered strings in non-numeric fields. If you have 10 or fewer categories, this can be a great way to quickly see what the most common categories are and if you have spelled them correctly.

Data Frame Summary

df

Dimensions: 2685 x 8
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

harvest_date [character]

1. 1999-10-13
2. 2006-10-23
3. 2005-10-25
4. 1997-09-30
5. 1993-10-21
6. 1994-10-12
7. 2000-10-17
8. 2004-10-27
9. 2010-10-21
10. 2009-11-16
[ 14 others ]

113	(	6.9%	)
107	(	6.6%	)
105	(	6.4%	)
99	(	6.1%	)
98	(	6.0%	)
98	(	6.0%	)
90	(	5.5%	)
80	(	4.9%	)
76	(	4.7%	)
73	(	4.5%	)
692	(	42.4%	)

1054 (39.3%)

harvest_moisture_pct [numeric]

Mean (sd) : 14.3 (6.1)

min ≤ med ≤ max:

5.8 ≤ 11.8 ≤ 43.4

IQR (CV) : 7.9 (0.4)

413 distinct values

656 (24.4%)

height_in [numeric]

Mean (sd) : 62.2 (19.6)

min ≤ med ≤ max:

31 ≤ 60 ≤ 188

IQR (CV) : 10 (0.3)

376 distinct values

398 (14.8%)

yield_lb_acre [numeric]

Mean (sd) : 1954.6 (497.5)

min ≤ med ≤ max:

230 ≤ 2001 ≤ 3426

IQR (CV) : 641.1 (0.3)

1635 distinct values

73 (2.7%)

year [integer]

Mean (sd) : 1999.5 (11.4)

min ≤ med ≤ max:

1978 ≤ 1999 ≤ 2022

IQR (CV) : 17 (0)

40 distinct values

0 (0.0%)

hybrid [character]

1. 894

2. Falcon

3. SF187

4. 141

5. Badger DMR

6. 8D310

7. Hornet

8. SF270

9. 3845 HO

10. 432 E

[ 1757 others ]

27	(	1.0%	)
12	(	0.4%	)
10	(	0.4%	)
9	(	0.3%	)
9	(	0.3%	)
8	(	0.3%	)
8	(	0.3%	)
8	(	0.3%	)
7	(	0.3%	)
7	(	0.3%	)
2580	(	96.1%	)

0 (0.0%)

emergence_date [character]

1. 1999-06-04
2. 1993-05-29
3. 2005-05-26
4. 2000-06-06
5. 2006-06-06
6. 1994-05-28
7. 1994-05-29
8. 1999-06-03
9. 2010-05-23
10. 2004-05-29
[ 140 others ]

34	(	2.5%	)
30	(	2.2%	)
30	(	2.2%	)
28	(	2.1%	)
28	(	2.1%	)
27	(	2.0%	)
27	(	2.0%	)
26	(	1.9%	)
26	(	1.9%	)
25	(	1.8%	)
1084	(	79.4%	)

1320 (49.2%)

planting_date [character]

1. 2006-06-01
2. 2005-05-23
3. 1997-05-29
4. 1993-05-26
5. 1994-05-25
6. 2000-06-02
7. 2004-05-26
8. 2010-05-20
9. 2009-05-27
10. 1991-05-28
[ 56 others ]

107	(	6.9%	)
105	(	6.7%	)
99	(	6.3%	)
98	(	6.3%	)
98	(	6.3%	)
90	(	5.8%	)
80	(	5.1%	)
76	(	4.9%	)
73	(	4.7%	)
72	(	4.6%	)
663	(	42.5%	)

1124 (41.9%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2024-02-27

This problem gets more complex when there are 1767 categories!! Careful inspection can help. For instance we could view the entire table, sorted alphabetically

hybrid_names <- df %>%
  select(hybrid) %>%
  distinct() %>%
  arrange(hybrid)

Let’s look at the top 40.

hybrid
02TH03896
03TH004205
03TH004251
06EX05
08EXP01
10
101
103
107
109
115
11G04
11G08
11G12
11G13
11N94
120
121
1266
1296
12E06
12E12
12E13
12E14
12G04
12G20
12G25
12G25 CL
12G25CL
12G28
12H92
12N92
13-652 CL
1300
131
134A
14-572 CL
140
141
1424

It often takes some common sense to figure out what is spelled wrong. Sometimes differing in only one digit is meaningful, sometimes that is a typo. Commonly, errors involve things like capitalization and spacing.

In our particular example, it seems that sometimes a space has been entered in the middle of some varieties by one person who keyed in data, but not another, e.g. 12G25 CL vs 12G25CL. You have some options here. First, you could go back and correct those typos in the data. For a small volume of data, that’s often a great option.

Another option would be to correct each variety name using R code. For instance, if we knew that all instances of 12G25 CL should actually have been entered as 12G25CL and 20-EXP05 is really 20-EXP5, we could fix those errors using a few lines of code

df <- df %>%
  # create a new variable
  mutate(
    hybrid_fixed =
      case_when(
        hybrid == "12G25 CL" ~ "12G25CL",
        hybrid == "20-EXP05" ~ "20-EXP5",
        TRUE ~ hybrid
      )
  )

Check your work. Did you fix everything you intended to? Did you inadvertently break anything else in the process?

View(df %>%
  select(hybrid, hybrid_fixed) %>%
  distinct() %>%
  # this allows us to easily view which ones changed to see if we did it right
  mutate(fixed = (hybrid != hybrid_fixed)) %>%
  arrange(desc(fixed)))

hybrid	hybrid_fixed	fixed
12G25 CL	12G25CL	TRUE
20-EXP05	20-EXP5	TRUE
AP2098	AP2098	FALSE
AP3470	AP3470	FALSE
AK7304	AK7304	FALSE
AK7305	AK7305	FALSE
AK7306	AK7306	FALSE
AK7311	AK7311	FALSE
AS3211	AS3211	FALSE
AS471	AS471	FALSE

Having reassured ourselves we fixed the problematic data without simultaneously breaking any other data we can remove the extra column.

df <- df %>%
  # out with the bad, in with the good
  select(-hybrid) %>%
  rename(hybrid = hybrid_fixed)

We might also postulate that spaces are causing us problems more generally, or other characters. Would it be appropriate to remove all the spaces, tabs, newlines and carriage returns in the variety names? Learning a little about regular expressions can help you here!

df_space_removal <- df %>%
  mutate(hybrid_no_space = gsub("\\s+", "", hybrid)) %>%
  select(hybrid_no_space, hybrid) %>%
  distinct()

dup_check <- janitor::get_dupes(df_space_removal, hybrid_no_space)

Before applying this we want to check that all sets of names we have ‘combined’ make sense. You would want to check the whole list, but just a few are displayed here for brevity. Using the get_dupes function we can see how after running our whitespace removal above has combined the names on the right (hybrid) into like categories without the spelling variants (hybrid_no_space). We can scan down this list to make sure the resulting sets are what we intended to check whether what we have done was sensible. Regular expressions are a powerful tool for editing and correcting string variables, but in my experience can take some trial and error to get right.

hybrid_no_space	dupe_count	hybrid
8H350DM	3	8H350 DM
8H350DM	3	8H 350 DM
8H350DM	3	8H350DM
8N337DM	3	8N 337 DM
8N337DM	3	8N337 DM
8N337DM	3	8N337DM
8N453DM	3	8N453 DM
8N453DM	3	8N 453 DM
8N453DM	3	8N453DM
DKF33-33NS	3	DKF33-33NS
DKF33-33NS	3	DKF 33-33NS
DKF33-33NS	3	DKF 33-33 NS
IS4704NS	3	IS 4704NS
IS4704NS	3	IS4704NS
IS4704NS	3	IS4704 NS
3080DMR,NS	2	3080DMR, NS
3080DMR,NS	2	3080 DMR, NS
4415HO/CLP/DM	2	4415HO/CLP/DM
4415HO/CLP/DM	2	4415 HO/CLP/DM
4421CL	2	4421CL

Last, you might curate a list that matches the misentered name to the correct name, or use someone a published list of synonyms. As one example, the taxoncleanr package can help simplify the world of pain that is aligning taxonomic names.