library(naniar) # to homogenize missing value codes
library(tidyverse) # general data wrangling
library(supportR) # for dealing with non-numeric or non-date values1. Scrambled data types
This module addresses common errors where string and/or missing data encoded as strings are interwoven with numbers and dates, as well as non-standardized date formatting.
df <- read.csv(file.path("example_data", "sunflower_data_broken.csv"))To understand a little bit about the data, it can be helpful to get a brief summary of the contents
glimpse(df)Rows: 2,685
Columns: 9
$ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ harvest_date <int> 19970930, 19970930, 19970930, 19970930, 19970930,…
$ harvest_moisture_pct <dbl> 8.4, 9.9, 11.0, 10.0, 10.3, 9.7, 9.7, 8.1, 9.5, 9…
$ height_in <chr> "57", "62", "56", "60", "58", "58", "66", "60", "…
$ planting_date <chr> "19970529", "19970529", "19970529", "19970529", "…
$ yield_lb_acre <dbl> 2542, 2179, 2510, 2225, 2208, 2326, 2319, 2224, 2…
$ Year <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1…
$ Hybrid <chr> "AP2098", "AP3470", "AK7304", "AK7305", "AK7306",…
$ emergence_date <int> 19970533, 19970534, 19970532, 19970532, 19970533,…
A glimpse at our dataset tells us something about its structure. Ideally, you are the data collector and already know all of this. The first column (X) appears to be an index variable. Sometimes this is intentional; more often it occurs when people write out files in R and forget to include the row.names = F argument.
Some things should strike you as potentially problematic about this dataset:
- There are two date fields, but one is encoded as a character and one as an integer.
- Height (with units inches) is encoded as a character.
It is very easy to convert everything that should be numeric into a numeric field, but doing this blindly risks wiping out actual data. For example - what if all the low values were entered as ‘below detection level’. Setting all those values to ‘NA’ would bias the results. It’s important to suss out what those mystery values are, and why they are there in the first place.
One way to interrogate these values is through the num_check function in the supportR package.
num_check(df, 'height_in')[1] "na" "N/A"
I often find it easier to inspect the entire data record when I come across issues, so I wrote a short function that iterates across columns and pulls out all rows of data that cannot successfully be converted to a numeric value. This function (check_non_numeric) is designed to check if a specified column in a dataframe contains non-numeric values. The function takes two arguments: df, which is the dataframe to be checked, and var, which is the name of the column in the dataframe to be checked.
Copy the function below to use in the next set of code.
# Function to find non-numeric values in a column and inspect
# @param df data frame
# @param var variable to check
# @return data frame with non-numeric values
check_non_numeric <- function(df, var) {
row_probs <- df %>% filter(.data[[var]] %in% suppressMessages(supportR::num_check(., var)))
if (nrow(row_probs > 0)) {
warning(paste("Check file for", var, "\n"))
return(row_probs)
} else {
cat(paste("all values are numeric for ", var, "\n"))
return(NULL)
}
}To inspect our data for possible problems in numeric columns, we need to first define all columns that we think should be numeric. Since the majority of the data appear to be numeric, it’s often easier to just define which variables are NOT numeric, and test the remaining set.
# define numeric columns as all of those that aren't non-numeric
non_numeric_cols <- "Hybrid"
numeric_cols <- setdiff(names(df), non_numeric_cols)We can now iterate over all the columns we think should contain numeric data and look for problems.
# set up a list to put numeric check results into
all_checks <- list()
for (i in numeric_cols) {
all_checks[[i]] <- check_non_numeric(df, i)
}all values are numeric for X
all values are numeric for harvest_date
all values are numeric for harvest_moisture_pct
Warning in check_non_numeric(df, i): Check file for height_in
Warning in check_non_numeric(df, i): Check file for planting_date
all values are numeric for yield_lb_acre
all values are numeric for Year
all values are numeric for emergence_date
# resulting list will just have the things you still need to check
names(all_checks)[1] "height_in" "planting_date"
The function has told use we have some non-numeric values in the variables “height_in” and “planting date”. Let’s look at what these problems are to determine what to do about them.
View(all_checks[["height_in"]])| X | harvest_date | harvest_moisture_pct | height_in | planting_date | yield_lb_acre | Year | Hybrid | emergence_date |
|---|---|---|---|---|---|---|---|---|
| 40 | 19970930 | 9.5 | na | 19970529 | 2381 | 1997 | ST2117 | 19970532 |
| 475 | 20021031 | 13.3 | N/A | 20020529 | 2008 | 2002 | F00001 | 20020534 |
We can see from the above there are two different missing value codes included in the height column. Having confidence in the issues, we can now set those to a proper missing value code. Here I’m going to run on the assumption that these two missing value codes may be found elsewhere, and replace them everywhere using the replace_with_na_all function. Note we could have also addressed this by defining multiple missing value codes when we initially read in the data.
# replace non-standard missing value codes
df <- df %>% replace_with_na_all(
condition =
~ .x %in% c("N/A", "na")
)Next, let’s inspect the planting date problems.
View(all_checks[["planting_date"]])| X | harvest_date | harvest_moisture_pct | height_in | planting_date | yield_lb_acre | Year | Hybrid | emergence_date |
|---|---|---|---|---|---|---|---|---|
| 867 | 20101021 | 11.7 | 63 | 5/20/2010 | 1032 | 2010 | F30008NS,CL | 20100522 |
| 868 | 20101021 | 8.2 | 64 | 5/20/2010 | 1244 | 2010 | F51122NS,CL | 20100523 |
| 869 | 20101021 | 8.6 | 66 | 5/20/2010 | 1290 | 2010 | F51137NS,CL | 20100523 |
| 870 | 20101021 | 9.4 | 71 | 5/20/2010 | 1354 | 2010 | F51139NS,DM,CL | 20100522 |
| 871 | 20101021 | 11.0 | 67 | 5/20/2010 | 1236 | 2010 | F51313NS,DM,CL | 20100524 |
| 872 | 20101021 | 9.3 | 64 | 5/20/2010 | 664 | 2010 | F89036NS,DM,CL | 20100522 |
| 873 | 20101021 | 10.0 | 69 | 5/20/2010 | 230 | 2010 | F89057NS,SU | 20100521 |
| 874 | 20101021 | 8.4 | 58 | 5/20/2010 | 1153 | 2010 | F91033NS,SU | 20100522 |
| 875 | 20101021 | 9.9 | 65 | 5/20/2010 | 759 | 2010 | 306 DMR NS | 20100522 |
| 876 | 20101021 | 8.5 | 62 | 5/20/2010 | 1592 | 2010 | 3080 DMR NS | 20100524 |
It appears that some of the data have dates entered as M/D/YYYY. To determine what to do about this, we need to see all the date formats that can be encountered.
unique(sort(df[["planting_date"]], decreasing = TRUE))| planting_date |
|---|
| 5/20/2010 |
| 20220606 |
| 20210526 |
| 20200601 |
| 20180605 |
| 20170525 |
| 20160527 |
| 20150603 |
| 20140603 |
| 20130606 |
In this dataset, sometimes the date is encoded as an 8 digit integer YYYYMMDD while others have M/D/YYYY. Based on this information, we can use the lubridate package to interpret the dates entered in either format, and then consolidate into a single, proper date column. You will get some warnings! This is to be expected as we know that the date formatting differs among rows.
df <- df %>%
rename(planting_date_original = planting_date) %>%
mutate(
yyyymmdd = lubridate::ymd(planting_date_original),
mdy = lubridate::mdy(planting_date_original),
planting_date_corrected = coalesce(yyyymmdd, mdy)
)Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `yyyymmdd = lubridate::ymd(planting_date_original)`.
Caused by warning:
! 146 failed to parse.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
Always check your work!!
# always check your work
View(df %>%
filter(Year %in% c(2008:2012)) %>%
select(planting_date_original, planting_date_corrected) %>%
distinct())| planting_date_original | planting_date_corrected |
|---|---|
| 20090527 | 2009-05-27 |
| 5/20/2010 | 2010-05-20 |
| 20120606 | 2012-06-06 |
| NA | NA |
Finally, we can remove the temporary columns we created, and set all our remaining columns to the correct column types.
# re define numeric columns - this time leaving off date
numeric_cols <- c("harvest_moisture_pct", "height_in", "yield_lb_acre")
df <- df %>%
select(-yyyymmdd, -mdy, -planting_date_original, -X) %>%
rename(planting_date = planting_date_corrected) %>%
# while we are at it let's rename everything to lower case as
# variable capitalization is annoying and hard to remember
rename_all(tolower) %>%
# set harvest_date to a date rather than an integer
mutate(
harvest_date = lubridate::ymd(harvest_date),
emergence_date = lubridate::ymd(emergence_date)
) %>%
mutate_at(numeric_cols, as.numeric)Warning: There was 1 warning in `mutate()`.
ℹ In argument: `emergence_date = lubridate::ymd(emergence_date)`.
Caused by warning:
! 266 failed to parse.
Last, we might to view this corrected file and then save it for further downstream analyses. In this case, we’ll use this example for the next steps in the tutorial, so I would suggest saving it into the example_data, but with a new name. This is in order to not overwrite the raw data, in the event you find one of your processing steps was wrong and also so that the entire processing chain can be repeated start to finish, if necessary, which can’t be done if you save the outputs over the input files for a given step in your analysis.
glimpse(df)Rows: 2,685
Columns: 8
$ harvest_date <date> 1997-09-30, 1997-09-30, 1997-09-30, 1997-09-30, …
$ harvest_moisture_pct <dbl> 8.4, 9.9, 11.0, 10.0, 10.3, 9.7, 9.7, 8.1, 9.5, 9…
$ height_in <dbl> 57, 62, 56, 60, 58, 58, 66, 60, 60, 54, 54, 62, 5…
$ yield_lb_acre <dbl> 2542, 2179, 2510, 2225, 2208, 2326, 2319, 2224, 2…
$ year <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1…
$ hybrid <chr> "AP2098", "AP3470", "AK7304", "AK7305", "AK7306",…
$ emergence_date <date> NA, NA, NA, NA, NA, 1997-05-30, NA, NA, NA, 1997…
$ planting_date <date> 1997-05-29, 1997-05-29, 1997-05-29, 1997-05-29, …
write.csv(df, file.path("example_data", "sunflower_data_1.csv"), row.names = FALSE)