9. Scripted workflows

Today’s tutorial has focused on discovering and correcting issues in single dataset. As you work through your own data, you will likely want to make a single script that reads in the raw data, addresses any data quality issues, and writes out an ‘analysis-ready’ dataset. For this type of script, the interactive elements (View, summarytools, etc) are likely to be cumbersome.

For building up an end-to-end analysis workflow, it is best to Keep Raw Data Raw.

This means making your original data read-only, and writing a script (with good documentation!) that can be rerun as needed to rename columns, remove outliers etc.

With this workflow a simple one line of code such as source('clean_my_data.R') can be run and re-run as often as needed.

I usually include a function to generate a series of plots appropriate to the dataset in hand as part of this scripted workflow. Sometimes no amount of logic checks can illuminate the possible problems you might encounter as much as a visual aid can.

Some tips: plotting this year’s data on top of last year’s data, or coloring points by location, data collector, etc. can help spot errors due to changes in protocols, sensor swaps, or miskeyed data.

When I have multiple variables to look through, I usually prefer to set up a common type of plot, iterate through all the variables, and dump the results into a pdf that I can flip through later or share with collaborators rather than viewing things only in Rstudio’s plot window, or having a zillion .jpgs to flip through. An example script would correct all the typos/date formatting etc as we have done already, plot all the variables, and write out a final file for analysis.

library(tidyverse)
library(naniar) # to homogenize missing value codes

df <- read_csv(file.path("example_data", "sunflower_data_broken.csv"))

# insert all the corrections you have discovered you needed through
# the interactive work and carefully checked
df <- df %>%
  # replace non-standard missing value codes
  replace_with_na_all(
    condition =
      ~ .x %in% c("N/A", "na")
  ) %>%
  # fix planting dates
  rename(planting_date_original = planting_date) %>%
  mutate(
    yyyymmdd = lubridate::ymd(planting_date_original),
    mdy = lubridate::mdy(planting_date_original),
    planting_date_corrected = coalesce(yyyymmdd, mdy)
  ) %>%
  select(-yyyymmdd, -mdy, -planting_date_original) %>%
  rename(planting_date = planting_date_corrected) %>%
  # while we are at it let's rename everything to lower case as
  # variable capitalization is annoying and hard to remember
  rename_all(tolower) %>%
  # set harvest_date to a date rather than an integer
  mutate(
    harvest_date = lubridate::ymd(harvest_date),
    emergence_date = lubridate::ymd(emergence_date)
  )


# define numeric columns
non_numeric_cols <- "hybrid"
numeric_cols <- setdiff(names(df), non_numeric_cols)

df <- df %>% mutate_at(numeric_cols, as.numeric) %>%
  # remove duplicates
  distinct() %>%
  # fix spelling
  mutate(hybrid = gsub("\\s+", "", hybrid))
# add on whatever additional steps you discovered through interactive qc here


write.csv(df, "analysis_ready_data/sunflower_data.csv")


# remember to make a figures directory first
pdf("figures/qcplots.pdf", 7, 5)
for (var in numeric_cols) {
  # df$var <- df[[var]] # this should work but I think hates the spaces? # myplot = ggplot(mydata, aes(x= !!var))+
  # quick glance at the time series while you have the sheets in hand
  myplot <- ggplot(df, aes(y = .data[[var]], x = year, color = hybrid)) +
    geom_jitter(width = 0.2, height = 0, show.legend = FALSE) +
    ylab(var)
  print(myplot)
}
dev.off()
png 
  2