Code adapted from Ali!


1. How can I drop rows with no variety name? (missing data)

2. Why aren’t the boxes sorting as I expect?

3. How can I clean up the variety names?

4. How can I sort by median vs. mean?

Data preparation

Data loading and checking

library(tidyverse) # provides %>%, filter(), drop_na(), case_match(), str_to_title(), ggplot()
alldata <- read.csv("./data/combined_clean.csv")

Data transformation

# filter to kale
kale <- alldata %>% 
  filter(crop == "kale")

# attempt to drop missing data
kale_varieties <- kale %>%
  drop_na(variety) # this doesn't work b/c the empty cells aren't actually read as NA

# alternative way to drop missing data
kale_varieties <- kale %>% 
  filter(variety != "") 
# this works to remove rows with empty variety values
# this says: keep all rows where variety does NOT equal "" (empty)
# the ! in R means 'not' 
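
To see why drop_na() misses these rows, here is a minimal sketch on made-up data (the `toy` data frame and `csv_text` string are hypothetical, not from the real file). It assumes the blank cells come in as empty strings, which is what read.csv() does by default for text columns. Two other ways to get the same result are shown: converting "" to NA with na_if(), or telling read.csv() to treat "" as missing via na.strings.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one empty string and one true NA in `variety`
toy <- data.frame(
  variety = c("lacinato", "", NA, "winterbor"),
  antioxidants = c(1.2, 0.8, 1.1, 1.5)
)

# drop_na() removes only the true NA, so the empty string survives
after_drop_na <- drop_na(toy, variety)

# Option A: convert "" to NA first, then drop_na() behaves as expected
toy_a <- toy %>%
  mutate(variety = na_if(variety, "")) %>%
  drop_na(variety)

# Option B: filter() drops rows where the condition is FALSE *or* NA,
# so this removes both the empty string and the true NA
toy_b <- toy %>%
  filter(variety != "")

# Option C: treat empty cells as NA at load time (hypothetical CSV text)
csv_text <- "crop,variety\nkale,lacinato\nkale,\nkale,winterbor\n"
from_csv <- read.csv(text = csv_text, na.strings = c("", "NA"))
```

Which option is best probably depends on whether other columns also have blank cells; Option C handles them all at once.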

Clean duplicate variety names

# generate a table of all kale variety names
table(kale_varieties$variety)

              black_magic                  darkibor   dazzling_blue_laacinato 
                        2                        13                         4 
               dwarf_blue               dwarf_curly                  lacinato 
                        2                         3                        80 
               meadowlark  nero_di_tuscana_lacinato                       new 
                        2                         3                         2 
              red_russian   red_russian,red russian                  siberian 
                       17                         1                         1 
white_russian,red russian                 winterbor 
                        1                        16 
# use case_match() to replace duplicate names
# first argument is variable name (variety)
# next is a list of "original value" ~ "replacement value"
# .default = variety says that if a match is not present in the list, 
# default to the original value (for those values that don't need replacing)
kale_var_clean <- kale_varieties %>%
  mutate(var_clean = case_match(
                      variety,
                      "red_russian,red russian" ~ "red_russian", 
                      "white_russian,red russian" ~ "mixed",
                      "dazzling_blue_laacinato" ~ "lacinato",
                      .default = variety),
         var_title = str_to_title(var_clean)) # capitalize the first letter of each name

# check that it worked
table(kale_var_clean$var_title)

             Black_magic                 Darkibor               Dwarf_blue 
                       2                       13                        2 
             Dwarf_curly                 Lacinato               Meadowlark 
                       3                       84                        2 
                   Mixed Nero_di_tuscana_lacinato                      New 
                       1                        3                        2 
             Red_russian                 Siberian                Winterbor 
                      18                        1                       16 
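
As a small illustration of how case_match() treats its arguments, here is a sketch on a made-up vector (the names in `x` are hypothetical, not from the dataset). It also shows that several old values can share one replacement by putting them in c() on the left-hand side. This assumes dplyr 1.1.0 or later, where case_match() was introduced.

```r
library(dplyr)

# Hypothetical variety names, not taken from the real dataset
x <- c("red_russian", "Red Russian", "red russian", "lacinato")

cleaned <- case_match(
  x,                                                # first argument: the values to match on
  c("Red Russian", "red russian") ~ "red_russian",  # several old values, one replacement
  .default = x                                      # anything not listed keeps its original value
)

# How many distinct names remain after cleaning
n_names <- length(unique(cleaned))
```

Without `.default = x`, any value that is not listed would come back as NA, which is easy to miss.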

Relationship of nutrient density to crop variety

Example as originally developed in lab

  • The sorting looks odd because fct_reorder() ranks each variety by its median across the entire (combined) dataset, not within each facet.
ggplot(kale_var_clean,
       aes(x = fct_reorder(var_title, antioxidants, .fun = median), # .fun = median to sort by median
                           y = antioxidants, fill = var_title)) + 
  geom_boxplot(alpha = 0.5) + 
  facet_grid(group ~ ., scales = "free", space = "free") + 
  theme_bw() + 
  xlab("") + 
  ylab("Antioxidants (FRAP units per 100 g)") + 
  theme(legend.position = "none")

Version Author Date
17fbc45 maggiedouglas 2024-02-27
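
To make the facet problem concrete, here is a minimal sketch on made-up numbers (the `toy` data frame and its values are hypothetical). Variety A is low in group g1 but very high in g2, so its combined median puts it last overall, even though it should come first within g1.

```r
library(forcats)

# Hypothetical toy data: varieties A and B measured in two groups
toy <- data.frame(
  group        = c("g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"),
  var_title    = c("A", "A", "B", "B", "A", "A", "B", "B"),
  antioxidants = c(1, 2, 3, 4, 20, 22, 5, 6)
)

# One ordering for the whole dataset: this is what every facet inherits
global_order <- levels(fct_reorder(toy$var_title, toy$antioxidants, .fun = median))

# Medians computed separately within each group
per_group <- aggregate(antioxidants ~ group + var_title, data = toy, FUN = median)

# The ordering that would look "right" inside the g1 facet
g1 <- per_group[per_group$group == "g1", ]
g1_order <- as.character(g1$var_title[order(g1$antioxidants)])
```

Here `global_order` puts B before A, while `g1_order` puts A before B, so the g1 facet would look mis-sorted even though fct_reorder() is doing what it was asked to do.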

Example combined across datasets

  • This shows that the sorting works without the facets, which supports the idea that the strange result in the faceted version comes from calculating the median (or mean) across all groups rather than within each one.
ggplot(kale_var_clean,
       aes(x = fct_reorder(var_title, antioxidants, .fun = median), # .fun = median to sort by median (not mean)
                           y = antioxidants, fill = var_title)) + 
  geom_boxplot(alpha = 0.5) + 
  # facet_grid(group ~ ., scales = "free", space = "free") + 
  theme_bw() + 
  xlab("") + 
  ylab("Antioxidants (FRAP units per 100 g)") + 
  theme(legend.position = "none")
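
On the median vs. mean question, here is a small sketch with made-up values (the `toy` data frame is hypothetical) showing that the choice of .fun can flip the order when one variety has an outlier. Median is the default for fct_reorder(), so writing `.fun = median` just makes that choice explicit.

```r
library(forcats)

# Hypothetical toy data: variety A has one very high value
toy <- data.frame(
  var_title    = c("A", "A", "A", "B", "B", "B"),
  antioxidants = c(1, 2, 30, 4, 5, 6)
)

# A: median 2, mean 11; B: median 5, mean 5
by_median <- levels(fct_reorder(toy$var_title, toy$antioxidants, .fun = median))
by_mean   <- levels(fct_reorder(toy$var_title, toy$antioxidants, .fun = mean))
```

Because boxplots draw the median as the center line, sorting by median usually matches what the eye expects; sorting by mean can look out of order when a variety has a few extreme values.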


Let’s discuss our options in class on Thursday…

Use the imperfect sorting and keep the facets? OR

Drop the facets and graph the combined dataset?

