1. Scrambled data types

This module addresses common errors where string and/or missing data encoded as strings are interwoven with numbers and dates, as well as non-standardized date formatting.

library(naniar) # to homogenize missing value codes
library(tidyverse) # general data wrangling
library(supportR) # for dealing with non-numeric or non-date values
df <- read.csv(file.path("example_data", "sunflower_data_broken.csv"))

To understand a little bit about the data, it can be helpful to get a brief summary of the contents

glimpse(df)
Rows: 2,685
Columns: 9
$ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ harvest_date         <int> 19970930, 19970930, 19970930, 19970930, 19970930,…
$ harvest_moisture_pct <dbl> 8.4, 9.9, 11.0, 10.0, 10.3, 9.7, 9.7, 8.1, 9.5, 9…
$ height_in            <chr> "57", "62", "56", "60", "58", "58", "66", "60", "…
$ planting_date        <chr> "19970529", "19970529", "19970529", "19970529", "…
$ yield_lb_acre        <dbl> 2542, 2179, 2510, 2225, 2208, 2326, 2319, 2224, 2…
$ Year                 <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1…
$ Hybrid               <chr> "AP2098", "AP3470", "AK7304", "AK7305", "AK7306",…
$ emergence_date       <int> 19970533, 19970534, 19970532, 19970532, 19970533,…

A glimpse at our dataset tells us something about its structure. Ideally, you are the data collector and already know all of this. The first column (X) appears to be an index variable. Sometimes this is intentional; more often it occurs when people write out files in R and forget to include the row.names = F argument.

Some things should strike you as potentially problematic about this dataset:

  1. There are two date fields, but one is encoded as a character and one as an integer.
  2. Height (with units inches) is encoded as a character.

It is very easy to convert everything that should be numeric into a numeric field, but doing this blindly risks wiping out actual data. For example - what if all the low values were entered as ‘below detection level’. Setting all those values to ‘NA’ would bias the results. It’s important to suss out what those mystery values are, and why they are there in the first place.

One way to interrogate these values is through the num_check function in the supportR package.

num_check(df, 'height_in')
[1] "na"  "N/A"

I often find it easier to inspect the entire data record when I come across issues, so I wrote a short function that iterates across columns and pulls out all rows of data that cannot successfully be converted to a numeric value. This function (check_non_numeric) is designed to check if a specified column in a dataframe contains non-numeric values. The function takes two arguments: df, which is the dataframe to be checked, and var, which is the name of the column in the dataframe to be checked.

Copy the function below to use in the next set of code.

# Function to find non-numeric values in a column and inspect

# @param df data frame
# @param var variable to check
# @return data frame with non-numeric values
check_non_numeric <- function(df, var) {
  row_probs <- df %>% filter(.data[[var]] %in% suppressMessages(supportR::num_check(., var)))
  if (nrow(row_probs > 0)) {
    warning(paste("Check file for", var, "\n"))
    return(row_probs)
  } else {
    cat(paste("all values are numeric for ", var, "\n"))
    return(NULL)
  }
}

To inspect our data for possible problems in numeric columns, we need to first define all columns that we think should be numeric. Since the majority of the data appear to be numeric, it’s often easier to just define which variables are NOT numeric, and test the remaining set.

# define numeric columns as all of those that aren't non-numeric
non_numeric_cols <- "Hybrid"
numeric_cols <- setdiff(names(df), non_numeric_cols)

We can now iterate over all the columns we think should contain numeric data and look for problems.

# set up a list to put numeric check results into
all_checks <- list()

for (i in numeric_cols) {
  all_checks[[i]] <- check_non_numeric(df, i)
}
all values are numeric for  X 
all values are numeric for  harvest_date 
all values are numeric for  harvest_moisture_pct 
Warning in check_non_numeric(df, i): Check file for height_in 
Warning in check_non_numeric(df, i): Check file for planting_date 
all values are numeric for  yield_lb_acre 
all values are numeric for  Year 
all values are numeric for  emergence_date 
# resulting list will just have the things you still need to check
names(all_checks)
[1] "height_in"     "planting_date"

The function has told use we have some non-numeric values in the variables “height_in” and “planting date”. Let’s look at what these problems are to determine what to do about them.

View(all_checks[["height_in"]])
X harvest_date harvest_moisture_pct height_in planting_date yield_lb_acre Year Hybrid emergence_date
40 19970930 9.5 na 19970529 2381 1997 ST2117 19970532
475 20021031 13.3 N/A 20020529 2008 2002 F00001 20020534

We can see from the above there are two different missing value codes included in the height column. Having confidence in the issues, we can now set those to a proper missing value code. Here I’m going to run on the assumption that these two missing value codes may be found elsewhere, and replace them everywhere using the replace_with_na_all function. Note we could have also addressed this by defining multiple missing value codes when we initially read in the data.

# replace non-standard missing value codes
df <- df %>% replace_with_na_all(
  condition =
    ~ .x %in% c("N/A", "na")
)

Next, let’s inspect the planting date problems.

View(all_checks[["planting_date"]])
X harvest_date harvest_moisture_pct height_in planting_date yield_lb_acre Year Hybrid emergence_date
867 20101021 11.7 63 5/20/2010 1032 2010 F30008NS,CL 20100522
868 20101021 8.2 64 5/20/2010 1244 2010 F51122NS,CL 20100523
869 20101021 8.6 66 5/20/2010 1290 2010 F51137NS,CL 20100523
870 20101021 9.4 71 5/20/2010 1354 2010 F51139NS,DM,CL 20100522
871 20101021 11.0 67 5/20/2010 1236 2010 F51313NS,DM,CL 20100524
872 20101021 9.3 64 5/20/2010 664 2010 F89036NS,DM,CL 20100522
873 20101021 10.0 69 5/20/2010 230 2010 F89057NS,SU 20100521
874 20101021 8.4 58 5/20/2010 1153 2010 F91033NS,SU 20100522
875 20101021 9.9 65 5/20/2010 759 2010 306 DMR NS 20100522
876 20101021 8.5 62 5/20/2010 1592 2010 3080 DMR NS 20100524

It appears that some of the data have dates entered as M/D/YYYY. To determine what to do about this, we need to see all the date formats that can be encountered.

unique(sort(df[["planting_date"]], decreasing = TRUE))
planting_date
5/20/2010
20220606
20210526
20200601
20180605
20170525
20160527
20150603
20140603
20130606

In this dataset, sometimes the date is encoded as an 8 digit integer YYYYMMDD while others have M/D/YYYY. Based on this information, we can use the lubridate package to interpret the dates entered in either format, and then consolidate into a single, proper date column. You will get some warnings! This is to be expected as we know that the date formatting differs among rows.

df <- df %>%
  rename(planting_date_original = planting_date) %>%
  mutate(
    yyyymmdd = lubridate::ymd(planting_date_original),
    mdy = lubridate::mdy(planting_date_original),
    planting_date_corrected = coalesce(yyyymmdd, mdy)
  )
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `yyyymmdd = lubridate::ymd(planting_date_original)`.
Caused by warning:
!  146 failed to parse.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

Always check your work!!

# always check your work
View(df %>%
  filter(Year %in% c(2008:2012)) %>%
  select(planting_date_original, planting_date_corrected) %>%
  distinct())
planting_date_original planting_date_corrected
20090527 2009-05-27
5/20/2010 2010-05-20
20120606 2012-06-06
NA NA

Finally, we can remove the temporary columns we created, and set all our remaining columns to the correct column types.

# re define numeric columns - this time leaving off date
numeric_cols <- c("harvest_moisture_pct", "height_in", "yield_lb_acre")

df <- df %>%
  select(-yyyymmdd, -mdy, -planting_date_original, -X) %>%
  rename(planting_date = planting_date_corrected) %>%
  # while we are at it let's rename everything to lower case as
  # variable capitalization is annoying and hard to remember
  rename_all(tolower) %>%
  # set harvest_date to a date rather than an integer
  mutate(
    harvest_date = lubridate::ymd(harvest_date),
    emergence_date = lubridate::ymd(emergence_date)
  ) %>%
  mutate_at(numeric_cols, as.numeric)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `emergence_date = lubridate::ymd(emergence_date)`.
Caused by warning:
!  266 failed to parse.

Last, we might to view this corrected file and then save it for further downstream analyses. In this case, we’ll use this example for the next steps in the tutorial, so I would suggest saving it into the example_data, but with a new name. This is in order to not overwrite the raw data, in the event you find one of your processing steps was wrong and also so that the entire processing chain can be repeated start to finish, if necessary, which can’t be done if you save the outputs over the input files for a given step in your analysis.

glimpse(df)
Rows: 2,685
Columns: 8
$ harvest_date         <date> 1997-09-30, 1997-09-30, 1997-09-30, 1997-09-30, …
$ harvest_moisture_pct <dbl> 8.4, 9.9, 11.0, 10.0, 10.3, 9.7, 9.7, 8.1, 9.5, 9…
$ height_in            <dbl> 57, 62, 56, 60, 58, 58, 66, 60, 60, 54, 54, 62, 5…
$ yield_lb_acre        <dbl> 2542, 2179, 2510, 2225, 2208, 2326, 2319, 2224, 2…
$ year                 <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1…
$ hybrid               <chr> "AP2098", "AP3470", "AK7304", "AK7305", "AK7306",…
$ emergence_date       <date> NA, NA, NA, NA, NA, 1997-05-30, NA, NA, NA, 1997…
$ planting_date        <date> 1997-05-29, 1997-05-29, 1997-05-29, 1997-05-29, …
write.csv(df, file.path("example_data", "sunflower_data_1.csv"), row.names = FALSE)