3. Duplicated data

Sometimes duplicate data are expected, but duplicates can also arise from transcription errors or mislabeled samples. Use your knowledge of the sampling design to decide which duplicates you expect!

Load packages.

library(tidyverse) # general data wrangling
library(summarytools) # counts complete duplicates
library(janitor) # finds user defined duplicates

If you have already completed the Scrambled data types module, you will have created the file loaded below. For those who prefer to jump in at the middle, the original dataset with mis-entered dates and missing value codes FIXED can be downloaded here and tucked into your example_data directory.

df <- read.csv(file.path("example_data", "sunflower_data_1.csv"))

The summarytools package, introduced in the Missing data module, reports the number of complete duplicates (rows with all values identical) in our dataset. We can also get this number with one line of base R code.

sum(duplicated(df))

[1] 0

But what if we understood that each hybrid was only planted once per year? We might then want the count of instances where combinations of year and hybrid are duplicated.

sum(duplicated(df %>%
  select(year, hybrid)))

[1] 4

Note that the duplicated function flags every row after the first instance as a duplicate, so some of these may be triplicates or quadruplicates! A count of 4 here might mean one combination that exists 5 times, or 4 pairs, or one triplicate and two pairs.
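To resolve that ambiguity, it can help to count the rows in each group rather than only the extras. The sketch below uses a small made-up data frame (toy, n_extra, group_sizes and n_rows are hypothetical names, not part of the sunflower dataset) to contrast the two counts.

```r
library(dplyr)

# Hypothetical toy data: one triplicate, one pair, one unique combination
toy <- data.frame(
  year   = c(2001, 2001, 2001, 2002, 2002, 2003),
  hybrid = c("A", "A", "A", "B", "B", "C")
)

# duplicated() flags every row after the first in each group: 2 + 1 = 3
n_extra <- sum(duplicated(toy))

# count() reports the size of each group, so triplicates and pairs
# can be told apart
group_sizes <- toy %>%
  count(year, hybrid, name = "n_rows") %>%
  filter(n_rows > 1)

print(n_extra)
print(group_sizes)
```

Swapping toy for df should give the same kind of summary for the real data, assuming the year and hybrid columns are named as above.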

For more advanced duplicate sleuthing, check out the janitor package. The get_dupes function returns the rows that are duplicated and inserts a count of the duplicates. This suite of functions can be very helpful in sussing out why duplicates are occurring and what to do with them.

these_dupes <- get_dupes(df, year, hybrid)

View(these_dupes)

year	hybrid	dupe_count	harvest_date	harvest_moisture_pct	height_in	yield_lb_acre	emergence_date	planting_date
1990	8803	2	NA	15.70	NA	2012.00	NA	NA
1990	8803	2	NA	17.40	NA	1865.00	NA	NA
2000	HySun 530	2	2000-10-17	15.60	59.00	1921.00	2000-06-07	2000-06-02
2000	HySun 530	2	2000-10-17	13.30	60.00	2112.00	2000-06-07	2000-06-02
2015	Falcon	2	2015-11-03	10.24	56.69	2008.90	2015-06-07	2015-06-03
2015	Falcon	2	2015-11-03	10.12	55.12	2125.95	2015-06-05	2015-06-03
2016	Falcon	2	2016-11-09	6.89	60.04	1522.60	2016-05-31	2016-05-27
2016	Falcon	2	2016-11-09	7.71	58.07	1683.70	2016-05-30	2016-05-27

Getting rid of exact duplicates (for example, if you find you entered the same data twice) is very easy using the distinct function. Here it won’t change anything, because sum(duplicated(df)) already showed that our data contain no exact duplicates.

df <- df %>%
  distinct()
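As a rough illustration of what distinct does and does not remove, the sketch below builds a made-up data frame (toy and its values are hypothetical) containing one exact duplicate and one partial duplicate.

```r
library(dplyr)

# Hypothetical toy data: rows 1-2 are exact duplicates,
# rows 3-4 share year and hybrid but differ in yield
toy <- data.frame(
  year          = c(2001, 2001, 2002, 2002),
  hybrid        = c("A", "A", "B", "B"),
  yield_lb_acre = c(1900, 1900, 2100, 2050)
)

# distinct() with no arguments drops only rows identical in every column
deduped <- toy %>%
  distinct()

# The partial duplicate (2002, B) is still there
n_partial <- sum(duplicated(deduped %>% select(year, hybrid)))

# Caution: naming columns with .keep_all = TRUE keeps the FIRST row per
# combination and silently discards the rest, whichever record is correct
first_only <- toy %>%
  distinct(year, hybrid, .keep_all = TRUE)

print(nrow(deduped))   # 3
print(n_partial)       # 1
print(first_only)
```

The last call is shown mainly as a warning: it makes the duplicates disappear without telling you which of the conflicting records was the right one.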

Records that are partial duplicates (such as the example above) often require going back to the original paper datasheets to check which record is correct. Perhaps one of the rows above was actually a different year, for example. In other cases, re-reading the protocol to understand the replication structure may illuminate why these ‘apparent’ duplicates exist and the statistical analysis can be adjusted accordingly.
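If checking the datasheets or the protocol suggests the repeated records may be genuine replicates, one possible approach (an assumption here, not a step from this module) is to label them rather than delete them. The sketch below uses made-up data, and the column names n_records, rep_id and needs_review are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: hybrid A appears twice in 2015, hybrid B once
toy <- data.frame(
  year          = c(2015, 2015, 2015),
  hybrid        = c("A", "A", "B"),
  yield_lb_acre = c(2000, 2100, 1500)
)

flagged <- toy %>%
  group_by(year, hybrid) %>%
  mutate(
    n_records    = n(),           # rows sharing this year-hybrid combination
    rep_id       = row_number(),  # position of each row within its group
    needs_review = n_records > 1  # TRUE for every row in a repeated group
  ) %>%
  ungroup()

print(flagged)
```

The needs_review column gives a list of rows to check against the paper records, and rep_id could serve as a replicate label if the repeats turn out to be part of the design.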