Typing is hard. REALLY HARD. I get it. But computers don’t. So if you have typos in your data, your analysis is going to treat those creatively spelled variables as entirely different things. It’s unfortunate.
library(tidyverse) # general data wranglinglibrary(summarytools) # slick dataset summaries
If you have already completed the Scrambled data types module you will have created this file. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.
The summarytools package introduced in previous modules includes information on the number most commonly encountered strings in non-numeric fields. If you have 10 or fewer categories, this can be a great way to quickly see what the most common categories are and if you have spelled them correctly.
Data Frame Summary
df
Dimensions: 2685 x 8
Duplicates: 0
Variable
Stats / Values
Freqs (% of Valid)
Graph
Missing
harvest_date
[character]
1. 1999-10-13
2. 2006-10-23
3. 2005-10-25
4. 1997-09-30
5. 1993-10-21
6. 1994-10-12
7. 2000-10-17
8. 2004-10-27
9. 2010-10-21
10. 2009-11-16
[ 14 others ]
113
(
6.9%
)
107
(
6.6%
)
105
(
6.4%
)
99
(
6.1%
)
98
(
6.0%
)
98
(
6.0%
)
90
(
5.5%
)
80
(
4.9%
)
76
(
4.7%
)
73
(
4.5%
)
692
(
42.4%
)
1054
(39.3%)
harvest_moisture_pct
[numeric]
Mean (sd) : 14.3 (6.1)
min ≤ med ≤ max:
5.8 ≤ 11.8 ≤ 43.4
IQR (CV) : 7.9 (0.4)
413 distinct values
656
(24.4%)
height_in
[numeric]
Mean (sd) : 62.2 (19.6)
min ≤ med ≤ max:
31 ≤ 60 ≤ 188
IQR (CV) : 10 (0.3)
376 distinct values
398
(14.8%)
yield_lb_acre
[numeric]
Mean (sd) : 1954.6 (497.5)
min ≤ med ≤ max:
230 ≤ 2001 ≤ 3426
IQR (CV) : 641.1 (0.3)
1635 distinct values
73
(2.7%)
year
[integer]
Mean (sd) : 1999.5 (11.4)
min ≤ med ≤ max:
1978 ≤ 1999 ≤ 2022
IQR (CV) : 17 (0)
40 distinct values
0
(0.0%)
hybrid
[character]
1. 894
2. Falcon
3. SF187
4. 141
5. Badger DMR
6. 8D310
7. Hornet
8. SF270
9. 3845 HO
10. 432 E
[ 1757 others ]
27
(
1.0%
)
12
(
0.4%
)
10
(
0.4%
)
9
(
0.3%
)
9
(
0.3%
)
8
(
0.3%
)
8
(
0.3%
)
8
(
0.3%
)
7
(
0.3%
)
7
(
0.3%
)
2580
(
96.1%
)
0
(0.0%)
emergence_date
[character]
1. 1999-06-04
2. 1993-05-29
3. 2005-05-26
4. 2000-06-06
5. 2006-06-06
6. 1994-05-28
7. 1994-05-29
8. 1999-06-03
9. 2010-05-23
10. 2004-05-29
[ 140 others ]
34
(
2.5%
)
30
(
2.2%
)
30
(
2.2%
)
28
(
2.1%
)
28
(
2.1%
)
27
(
2.0%
)
27
(
2.0%
)
26
(
1.9%
)
26
(
1.9%
)
25
(
1.8%
)
1084
(
79.4%
)
1320
(49.2%)
planting_date
[character]
1. 2006-06-01
2. 2005-05-23
3. 1997-05-29
4. 1993-05-26
5. 1994-05-25
6. 2000-06-02
7. 2004-05-26
8. 2010-05-20
9. 2009-05-27
10. 1991-05-28
[ 56 others ]
107
(
6.9%
)
105
(
6.7%
)
99
(
6.3%
)
98
(
6.3%
)
98
(
6.3%
)
90
(
5.8%
)
80
(
5.1%
)
76
(
4.9%
)
73
(
4.7%
)
72
(
4.6%
)
663
(
42.5%
)
1124
(41.9%)
Generated by summarytools 1.0.1 (R version 4.2.2) 2024-02-27
This problem gets more complex when there are 1767 categories!! Careful inspection can help. For instance we could view the entire table, sorted alphabetically
It often takes some common sense to figure out what is spelled wrong. Sometimes differing in only one digit is meaningful, sometimes that is a typo. Commonly, errors involve things like capitalization and spacing.
In our particular example, it seems that sometimes a space has been entered in the middle of some varieties by one person who keyed in data, but not another, e.g. 12G25 CL vs 12G25CL. You have some options here. First, you could go back and correct those typos in the data. For a small volume of data, that’s often a great option.
Another option would be to correct each variety name using R code. For instance, if we knew that all instances of 12G25 CL should actually have been entered as 12G25CL and 20-EXP05 is really 20-EXP5, we could fix those errors using a few lines of code
df <- df %>%# create a new variablemutate(hybrid_fixed =case_when( hybrid =="12G25 CL"~"12G25CL", hybrid =="20-EXP05"~"20-EXP5",TRUE~ hybrid ) )
Check your work. Did you fix everything you intended to? Did you inadvertently break anything else in the process?
View(df %>%select(hybrid, hybrid_fixed) %>%distinct() %>%# this allows us to easily view which ones changed to see if we did it rightmutate(fixed = (hybrid != hybrid_fixed)) %>%arrange(desc(fixed)))
hybrid
hybrid_fixed
fixed
12G25 CL
12G25CL
TRUE
20-EXP05
20-EXP5
TRUE
AP2098
AP2098
FALSE
AP3470
AP3470
FALSE
AK7304
AK7304
FALSE
AK7305
AK7305
FALSE
AK7306
AK7306
FALSE
AK7311
AK7311
FALSE
AS3211
AS3211
FALSE
AS471
AS471
FALSE
Having reassured ourselves we fixed the problematic data without simultaneously breaking any other data we can remove the extra column.
df <- df %>%# out with the bad, in with the goodselect(-hybrid) %>%rename(hybrid = hybrid_fixed)
We might also postulate that spaces are causing us problems more generally, or other characters. Would it be appropriate to remove all the spaces, tabs, newlines and carriage returns in the variety names? Learning a little about regular expressions can help you here!
Before applying this we want to check that all sets of names we have ‘combined’ make sense. You would want to check the whole list, but just a few are displayed here for brevity. Using the get_dupes function we can see how after running our whitespace removal above has combined the names on the right (hybrid) into like categories without the spelling variants (hybrid_no_space). We can scan down this list to make sure the resulting sets are what we intended to check whether what we have done was sensible. Regular expressions are a powerful tool for editing and correcting string variables, but in my experience can take some trial and error to get right.
hybrid_no_space
dupe_count
hybrid
8H350DM
3
8H350 DM
8H350DM
3
8H 350 DM
8H350DM
3
8H350DM
8N337DM
3
8N 337 DM
8N337DM
3
8N337 DM
8N337DM
3
8N337DM
8N453DM
3
8N453 DM
8N453DM
3
8N 453 DM
8N453DM
3
8N453DM
DKF33-33NS
3
DKF33-33NS
DKF33-33NS
3
DKF 33-33NS
DKF33-33NS
3
DKF 33-33 NS
IS4704NS
3
IS 4704NS
IS4704NS
3
IS4704NS
IS4704NS
3
IS4704 NS
3080DMR,NS
2
3080DMR, NS
3080DMR,NS
2
3080 DMR, NS
4415HO/CLP/DM
2
4415HO/CLP/DM
4415HO/CLP/DM
2
4415 HO/CLP/DM
4421CL
2
4421CL
Last, you might curate a list that matches the misentered name to the correct name, or use someone a published list of synonyms. As one example, the taxoncleanr package can help simplify the world of pain that is aligning taxonomic names.