    library(tidyverse) # general data wrangling

8. Logic tests

Knowing particular things about your data will help you check for internal consistency and spot additional errors. For example, plants (usually) get larger over time, and it’s difficult to die before you are born.
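As a minimal sketch of the "plants usually get larger over time" rule, the check below flags any measurement that is smaller than the same plant's previous one. The `growth` table and its column names (`plant_id`, `visit`, `height_cm`) are made up for illustration and are not part of the sunflower data used in this section.

```r
library(dplyr)

# Hypothetical toy data: two plants, each measured on three visits.
# Column names are illustrative assumptions, not from the real dataset.
growth <- tibble(
  plant_id  = c("A", "A", "A", "B", "B", "B"),
  visit     = c(1, 2, 3, 1, 2, 3),
  height_cm = c(10, 25, 40, 12, 30, 21)
)

# Flag measurements that are smaller than the same plant's previous one
shrinking <- growth %>%
  arrange(plant_id, visit) %>%
  group_by(plant_id) %>%
  mutate(prev_height_cm = lag(height_cm)) %>%
  ungroup() %>%
  filter(!is.na(prev_height_cm), height_cm < prev_height_cm)

shrinking
```

Rows returned by a check like this are candidates for review rather than certain errors, since a plant can genuinely lose height.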

    # here we'll use the readr function read_csv to read our data
    # readr "knows" about dates, which can save us some typing
    df <- read_csv(file.path("example_data", "sunflower_data_1.csv"),
                   show_col_types = FALSE)

In our particular example there are three dates, and we know that for each individual they must occur in a particular order if the data were transcribed correctly. We will impose an order test to check whether emergence follows planting and harvest follows emergence.

    # keep rows where the dates are not in strict planting < emergence < harvest order
    View(df %>%
      filter(emergence_date <= planting_date | emergence_date >= harvest_date |
               harvest_date <= planting_date) %>%
      select(planting_date, emergence_date, harvest_date))

What is wrong with these data?

| planting_date | emergence_date | harvest_date |
|---|---|---|
| 1999-06-04 | 1999-06-03 | 1999-10-13 |
| 1999-06-07 | 1999-06-06 | 1999-10-13 |
| 1999-06-08 | 1999-06-03 | 1999-10-13 |
| 1999-06-09 | 1999-06-03 | 1999-10-13 |
| 1999-06-10 | 1999-06-06 | 1999-10-13 |
| 1999-06-11 | 1999-06-05 | 1999-10-13 |
| 1999-06-12 | 1999-06-03 | 1999-10-13 |
| 1999-06-13 | 1999-06-06 | 1999-10-13 |
| 1999-06-14 | 1999-06-07 | 1999-10-13 |
| 1999-06-15 | 1999-06-03 | 1999-10-13 |
| 1999-06-16 | 1999-06-05 | 1999-10-13 |
| 1999-06-17 | 1999-06-02 | 1999-10-13 |
| 1999-06-18 | 1999-06-04 | 1999-10-13 |
| 1999-06-19 | 1999-06-03 | 1999-10-13 |
| 1999-06-20 | 1999-06-04 | 1999-10-13 |
| 1999-06-21 | 1999-06-04 | 1999-10-13 |
| 1999-06-22 | 1999-06-03 | 1999-10-13 |
| 1999-06-23 | 1999-06-02 | 1999-10-13 |
| 1999-06-24 | 1999-06-03 | 1999-10-13 |
| 1999-06-25 | 1999-06-03 | 1999-10-13 |
| 1999-06-26 | 1999-06-04 | 1999-10-13 |
| 1999-06-27 | 1999-06-06 | 1999-10-13 |
| 1999-06-28 | 1999-06-03 | 1999-10-13 |
| 1999-06-29 | 1999-06-05 | 1999-10-13 |
| 1999-06-30 | 1999-06-04 | 1999-10-13 |
| 1999-07-01 | 1999-06-03 | 1999-10-13 |
| 1999-07-02 | 1999-06-03 | 1999-10-13 |
| 1999-07-03 | 1999-06-08 | 1999-10-13 |
| 1999-07-04 | 1999-06-06 | 1999-10-13 |
| 1999-07-05 | 1999-06-05 | 1999-10-13 |
| 1999-07-06 | 1999-06-04 | 1999-10-13 |
| 1999-07-07 | 1999-06-05 | 1999-10-13 |
| 1999-07-08 | 1999-06-03 | 1999-10-13 |
| 1999-07-09 | 1999-06-03 | 1999-10-13 |
| 1999-07-10 | 1999-06-07 | 1999-10-13 |
| 1999-07-11 | 1999-06-05 | 1999-10-13 |
| 1999-07-12 | 1999-06-03 | 1999-10-13 |
| 1999-07-13 | 1999-06-03 | 1999-10-13 |

In the rows above, planting_date increases by exactly one day from nearly every row to the next. The error reflects an amazingly common issue when keying data into Excel or other spreadsheets: the ‘drag down’ feature often autofills sequential numbers or dates rather than a constant. Microsoft Excel is trying to read your mind and doing a terrible job of it. Use caution!

What other logic checks might we write to detect this kind of error?
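One possible answer, offered as a sketch rather than the definitive check: look for long runs of rows in which a date goes up by exactly one day each time, which is the footprint a drag-down autofill tends to leave. The `toy` table below is made up for illustration (its last five rows mimic the pattern in the table above), and the `min_run` threshold of 3 is an arbitrary assumption you would tune for your own data.

```r
library(dplyr)

# Hypothetical toy data, in the order the rows were keyed in.
# The first three rows imitate correctly entered records;
# the last five imitate a drag-down autofill.
toy <- tibble(
  planting_date = as.Date(c("1999-05-20", "1999-05-20", "1999-05-20",
                            "1999-06-08", "1999-06-09", "1999-06-10",
                            "1999-06-11", "1999-06-12"))
)

min_run <- 3  # assumed threshold: this many +1-day steps in a row is suspicious

flagged <- toy %>%
  mutate(
    row       = row_number(),
    # days elapsed since the previous row's date (NA for the first row)
    step_days = as.numeric(planting_date - lag(planting_date)),
    plus_one  = !is.na(step_days) & step_days == 1,
    # a new run starts at every row that is NOT a +1-day step
    run_id    = cumsum(!plus_one)
  ) %>%
  group_by(run_id) %>%
  mutate(run_length = sum(plus_one)) %>%  # number of +1-day steps in this run
  ungroup() %>%
  filter(run_length >= min_run)

flagged
```

Because the check depends on row order, it should run before the data are sorted or otherwise rearranged. Genuinely sequential dates (for example, one planting per day) would also be flagged, so treat the output as a prompt to go back to the original data sheets.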