Today’s tutorial has focused on discovering and correcting issues in single dataset. As you work through your own data, you will likely want to make a single script that reads in the raw data, addresses any data quality issues, and writes out an ‘analysis-ready’ dataset. For this type of script, the interactive elements (View, summarytools, etc) are likely to be cumbersome.
For building up an end-to-end analysis workflow, it is best to Keep Raw Data Raw.
This means making your original data read-only, and writing a script (with good documentation!) that can be rerun as needed to rename columns, remove outliers etc.
With this workflow a simple one line of code such as source('clean_my_data.R') can be run and re-run as often as needed.
I usually include a function to generate a series of plots appropriate to the dataset in hand as part of this scripted workflow. Sometimes no amount of logic checks can illuminate the possible problems you might encounter as much as a visual aid can.
Some tips: plotting this year’s data on top of last year’s data, or coloring points by location, data collector, etc. can help spot errors due to changes in protocols, sensor swaps, or miskeyed data.
When I have multiple variables to look through, I usually prefer to set up a common type of plot, iterate through all the variables, and dump the results into a pdf that I can flip through later or share with collaborators rather than viewing things only in Rstudio’s plot window, or having a zillion .jpgs to flip through. An example script would correct all the typos/date formatting etc as we have done already, plot all the variables, and write out a final file for analysis.
library(tidyverse)library(naniar) # to homogenize missing value codesdf <-read_csv(file.path("example_data", "sunflower_data_broken.csv"))# insert all the corrections you have discovered you needed through# the interactive work and carefully checkeddf <- df %>%# replace non-standard missing value codesreplace_with_na_all(condition =~ .x %in%c("N/A", "na") ) %>%# fix planting datesrename(planting_date_original = planting_date) %>%mutate(yyyymmdd = lubridate::ymd(planting_date_original),mdy = lubridate::mdy(planting_date_original),planting_date_corrected =coalesce(yyyymmdd, mdy) ) %>%select(-yyyymmdd, -mdy, -planting_date_original) %>%rename(planting_date = planting_date_corrected) %>%# while we are at it let's rename everything to lower case as# variable capitalization is annoying and hard to rememberrename_all(tolower) %>%# set harvest_date to a date rather than an integermutate(harvest_date = lubridate::ymd(harvest_date),emergence_date = lubridate::ymd(emergence_date) )# define numeric columnsnon_numeric_cols <-"hybrid"numeric_cols <-setdiff(names(df), non_numeric_cols)df <- df %>%mutate_at(numeric_cols, as.numeric) %>%# remove duplicatesdistinct() %>%# fix spellingmutate(hybrid =gsub("\\s+", "", hybrid))# add on whatever additional steps you discovered through interactive qc herewrite.csv(df, "analysis_ready_data/sunflower_data.csv")# remember to make a figures directory firstpdf("figures/qcplots.pdf", 7, 5)for (var in numeric_cols) {# df$var <- df[[var]] # this should work but I think hates the spaces? # myplot = ggplot(mydata, aes(x= !!var))+# quick glance at the time series while you have the sheets in hand myplot <-ggplot(df, aes(y = .data[[var]], x = year, color = hybrid)) +geom_jitter(width =0.2, height =0, show.legend =FALSE) +ylab(var)print(myplot)}dev.off()