1. Scrambled data types

This module addresses common errors where string and/or missing data encoded as strings are interwoven with numbers and dates, as well as non-standardized date formatting.

library(naniar) # to homogenize missing value codes
library(tidyverse) # general data wrangling
library(supportR) # for dealing with non-numeric or non-date values

df <- read.csv(file.path("example_data", "sunflower_data_broken.csv"))

To understand a little bit about the data, it can be helpful to get a brief summary of the contents

glimpse(df)

Rows: 2,685
Columns: 9
$ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ harvest_date         <int> 19970930, 19970930, 19970930, 19970930, 19970930,…
$ harvest_moisture_pct <dbl> 8.4, 9.9, 11.0, 10.0, 10.3, 9.7, 9.7, 8.1, 9.5, 9…
$ height_in            <chr> "57", "62", "56", "60", "58", "58", "66", "60", "…
$ planting_date        <chr> "19970529", "19970529", "19970529", "19970529", "…
$ yield_lb_acre        <dbl> 2542, 2179, 2510, 2225, 2208, 2326, 2319, 2224, 2…
$ Year                 <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1…
$ Hybrid               <chr> "AP2098", "AP3470", "AK7304", "AK7305", "AK7306",…
$ emergence_date       <int> 19970533, 19970534, 19970532, 19970532, 19970533,…

A glimpse at our dataset tells us something about its structure. Ideally, you are the data collector and already know all of this. The first column (X) appears to be an index variable. Sometimes this is intentional; more often it occurs when people write out files in R and forget to include the row.names = F argument.

Some things should strike you as potentially problematic about this dataset:

There are two date fields, but one is encoded as a character and one as an integer.
Height (with units inches) is encoded as a character.

It is very easy to convert everything that should be numeric into a numeric field, but doing this blindly risks wiping out actual data. For example - what if all the low values were entered as ‘below detection level’. Setting all those values to ‘NA’ would bias the results. It’s important to suss out what those mystery values are, and why they are there in the first place.

One way to interrogate these values is through the num_check function in the supportR package.

num_check(df, 'height_in')

[1] "na"  "N/A"

I often find it easier to inspect the entire data record when I come across issues, so I wrote a short function that iterates across columns and pulls out all rows of data that cannot successfully be converted to a numeric value. This function (check_non_numeric) is designed to check if a specified column in a dataframe contains non-numeric values. The function takes two arguments: df, which is the dataframe to be checked, and var, which is the name of the column in the dataframe to be checked.

Copy the function below to use in the next set of code.

# Function to find non-numeric values in a column and inspect

# @param df data frame
# @param var variable to check
# @return data frame with non-numeric values
check_non_numeric <- function(df, var) {
  row_probs <- df %>% filter(.data[[var]] %in% suppressMessages(supportR::num_check(., var)))
  if (nrow(row_probs > 0)) {
    warning(paste("Check file for", var, "\n"))
    return(row_probs)
  } else {
    cat(paste("all values are numeric for ", var, "\n"))
    return(NULL)
  }
}

To inspect our data for possible problems in numeric columns, we need to first define all columns that we think should be numeric. Since the majority of the data appear to be numeric, it’s often easier to just define which variables are NOT numeric, and test the remaining set.

# define numeric columns as all of those that aren't non-numeric
non_numeric_cols <- "Hybrid"
numeric_cols <- setdiff(names(df), non_numeric_cols)

We can now iterate over all the columns we think should contain numeric data and look for problems.

# set up a list to put numeric check results into
all_checks <- list()

for (i in numeric_cols) {
  all_checks[[i]] <- check_non_numeric(df, i)
}

all values are numeric for  X 
all values are numeric for  harvest_date 
all values are numeric for  harvest_moisture_pct

Warning in check_non_numeric(df, i): Check file for height_in

Warning in check_non_numeric(df, i): Check file for planting_date

all values are numeric for  yield_lb_acre 
all values are numeric for  Year 
all values are numeric for  emergence_date

# resulting list will just have the things you still need to check
names(all_checks)

[1] "height_in"     "planting_date"

The function has told use we have some non-numeric values in the variables “height_in” and “planting date”. Let’s look at what these problems are to determine what to do about them.

View(all_checks[["height_in"]])

X	harvest_date	harvest_moisture_pct	height_in	planting_date	yield_lb_acre	Year	Hybrid	emergence_date
40	19970930	9.5	na	19970529	2381	1997	ST2117	19970532
475	20021031	13.3	N/A	20020529	2008	2002	F00001	20020534

We can see from the above there are two different missing value codes included in the height column. Having confidence in the issues, we can now set those to a proper missing value code. Here I’m going to run on the assumption that these two missing value codes may be found elsewhere, and replace them everywhere using the replace_with_na_all function. Note we could have also addressed this by defining multiple missing value codes when we initially read in the data.

# replace non-standard missing value codes
df <- df %>% replace_with_na_all(
  condition =
    ~ .x %in% c("N/A", "na")
)

Next, let’s inspect the planting date problems.

View(all_checks[["planting_date"]])

X	harvest_date	harvest_moisture_pct	height_in	planting_date	yield_lb_acre	Year	Hybrid	emergence_date
867	20101021	11.7	63	5/20/2010	1032	2010	F30008NS,CL	20100522
868	20101021	8.2	64	5/20/2010	1244	2010	F51122NS,CL	20100523
869	20101021	8.6	66	5/20/2010	1290	2010	F51137NS,CL	20100523
870	20101021	9.4	71	5/20/2010	1354	2010	F51139NS,DM,CL	20100522
871	20101021	11.0	67	5/20/2010	1236	2010	F51313NS,DM,CL	20100524
872	20101021	9.3	64	5/20/2010	664	2010	F89036NS,DM,CL	20100522
873	20101021	10.0	69	5/20/2010	230	2010	F89057NS,SU	20100521
874	20101021	8.4	58	5/20/2010	1153	2010	F91033NS,SU	20100522
875	20101021	9.9	65	5/20/2010	759	2010	306 DMR NS	20100522
876	20101021	8.5	62	5/20/2010	1592	2010	3080 DMR NS	20100524

It appears that some of the data have dates entered as M/D/YYYY. To determine what to do about this, we need to see all the date formats that can be encountered.

unique(sort(df[["planting_date"]], decreasing = TRUE))

planting_date
5/20/2010
20220606
20210526
20200601
20180605
20170525
20160527
20150603
20140603
20130606

In this dataset, sometimes the date is encoded as an 8 digit integer YYYYMMDD while others have M/D/YYYY. Based on this information, we can use the lubridate package to interpret the dates entered in either format, and then consolidate into a single, proper date column. You will get some warnings! This is to be expected as we know that the date formatting differs among rows.

df <- df %>%
  rename(planting_date_original = planting_date) %>%
  mutate(
    yyyymmdd = lubridate::ymd(planting_date_original),
    mdy = lubridate::mdy(planting_date_original),
    planting_date_corrected = coalesce(yyyymmdd, mdy)
  )

Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `yyyymmdd = lubridate::ymd(planting_date_original)`.
Caused by warning:
!  146 failed to parse.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

Always check your work!!

# always check your work
View(df %>%
  filter(Year %in% c(2008:2012)) %>%
  select(planting_date_original, planting_date_corrected) %>%
  distinct())

planting_date_original	planting_date_corrected
20090527	2009-05-27
5/20/2010	2010-05-20
20120606	2012-06-06
NA	NA

Finally, we can remove the temporary columns we created, and set all our remaining columns to the correct column types.

# re define numeric columns - this time leaving off date
numeric_cols <- c("harvest_moisture_pct", "height_in", "yield_lb_acre")

df <- df %>%
  select(-yyyymmdd, -mdy, -planting_date_original, -X) %>%
  rename(planting_date = planting_date_corrected) %>%
  # while we are at it let's rename everything to lower case as
  # variable capitalization is annoying and hard to remember
  rename_all(tolower) %>%
  # set harvest_date to a date rather than an integer
  mutate(
    harvest_date = lubridate::ymd(harvest_date),
    emergence_date = lubridate::ymd(emergence_date)
  ) %>%
  mutate_at(numeric_cols, as.numeric)

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `emergence_date = lubridate::ymd(emergence_date)`.
Caused by warning:
!  266 failed to parse.

Last, we might to view this corrected file and then save it for further downstream analyses. In this case, we’ll use this example for the next steps in the tutorial, so I would suggest saving it into the example_data, but with a new name. This is in order to not overwrite the raw data, in the event you find one of your processing steps was wrong and also so that the entire processing chain can be repeated start to finish, if necessary, which can’t be done if you save the outputs over the input files for a given step in your analysis.

glimpse(df)

Rows: 2,685
Columns: 8
$ harvest_date         <date> 1997-09-30, 1997-09-30, 1997-09-30, 1997-09-30, …
$ harvest_moisture_pct <dbl> 8.4, 9.9, 11.0, 10.0, 10.3, 9.7, 9.7, 8.1, 9.5, 9…
$ height_in            <dbl> 57, 62, 56, 60, 58, 58, 66, 60, 60, 54, 54, 62, 5…
$ yield_lb_acre        <dbl> 2542, 2179, 2510, 2225, 2208, 2326, 2319, 2224, 2…
$ year                 <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1…
$ hybrid               <chr> "AP2098", "AP3470", "AK7304", "AK7305", "AK7306",…
$ emergence_date       <date> NA, NA, NA, NA, NA, 1997-05-30, NA, NA, NA, 1997…
$ planting_date        <date> 1997-05-29, 1997-05-29, 1997-05-29, 1997-05-29, …

write.csv(df, file.path("example_data", "sunflower_data_1.csv"), row.names = FALSE)