Code adapted from Ali!

Questions:

1. How can I drop rows with no variety name? (missing data)

2. Why aren’t the boxes sorting as I expect?

3. How can I clean up the variety names?

4. How can I sort by median vs. mean?

Data preparation

Data loading and checking

library(tidyverse)
library(RColorBrewer)
alldata <- read.csv("./data/combined_clean.csv")
str(alldata)
summary(alldata)

Data transformation

# filter to kale
kale <- alldata %>% 
  filter(crop == "kale")

# attempt to drop missing data
kale_varieties <- kale %>%
  drop_na(variety) # this doesn't work because read.csv() reads empty text cells as "" (empty strings), not NA

# alternative way to drop missing data
kale_varieties <- kale %>% 
  filter(variety != "") 
# this works to remove rows with empty variety values
# it says: keep all rows where variety does NOT equal "" (an empty string)
# != means 'not equal to'; the ! in R means 'not'
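
Another possible route, sketched here on a small made-up data frame (the `toy` object and its values are hypothetical, not taken from combined_clean.csv), is to convert the empty strings to NA first with dplyr's na_if(). After that, drop_na() behaves as originally expected. Passing na.strings = c("", "NA") to read.csv() should have a similar effect at load time.

```r
library(dplyr)
library(tidyr)

# hypothetical toy data: one row has an empty variety name
toy <- data.frame(
  variety = c("lacinato", "", "winterbor"),
  antioxidants = c(1.2, 0.8, 1.5)
)

# drop_na() alone keeps all three rows, because "" is not NA
n_before <- nrow(drop_na(toy, variety))

# convert "" to NA first, then drop_na() removes the empty row
toy_clean <- toy %>%
  mutate(variety = na_if(variety, "")) %>%
  drop_na(variety)

toy_clean$variety
```

The advantage of this approach is that the missing values become real NAs, so other NA-aware functions (is.na(), summary(), etc.) will also recognize them.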

Clean duplicate variety names

# generate a table of all kale variety names
table(kale_varieties$variety)

              black_magic                  darkibor   dazzling_blue_laacinato 
                        2                        13                         4 
               dwarf_blue               dwarf_curly                  lacinato 
                        2                         3                        80 
               meadowlark  nero_di_tuscana_lacinato                       new 
                        2                         3                         2 
              red_russian   red_russian,red russian                  siberian 
                       17                         1                         1 
white_russian,red russian                 winterbor 
                        1                        16 
# use case_match() to replace duplicate names
# first argument is variable name (variety)
# next is a list of "original value" ~ "replacement value"
# .default = variety says that if a match is not present in the list, 
# default to the original value (for those values that don't need replacing)
kale_var_clean <- kale_varieties %>%
  mutate(var_clean = case_match(
                      variety, 
                      "red_russian,red russian" ~ "red_russian", 
                      "white_russian,red russian" ~ "mixed",
                      "dazzling_blue_laacinato" ~ "lacinato",
                      .default = variety),
         var_title = str_to_title(var_clean) # capitalize first letter of each name
         )

# check that it worked
table(kale_var_clean$var_title)

             Black_magic                 Darkibor               Dwarf_blue 
                       2                       13                        2 
             Dwarf_curly                 Lacinato               Meadowlark 
                       3                       84                        2 
                   Mixed Nero_di_tuscana_lacinato                      New 
                       1                        3                        2 
             Red_russian                 Siberian                Winterbor 
                      18                        1                       16 
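
The names above still contain underscores, and only the first letter of each name is capitalized. As one possible further cleanup (a sketch on a few example names, not part of the original lab code), the underscores can be replaced with spaces before calling str_to_title(), so that each word is capitalized. The `vars` and `vars_pretty` objects are hypothetical.

```r
library(stringr)

# a few variety names in the same style as the table above
vars <- c("black_magic", "red_russian", "nero_di_tuscana_lacinato")

# replace underscores with spaces, then capitalize each word
vars_pretty <- str_to_title(str_replace_all(vars, "_", " "))
vars_pretty
```

In the pipeline above, the same idea would amount to wrapping var_clean in str_replace_all(var_clean, "_", " ") inside the existing str_to_title() call.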

Relationship of nutrient density to crop variety

Example as originally developed in lab

  • The sorting looks odd because fct_reorder() computes the median across the entire (combined) dataset, not within each facet.
ggplot(kale_var_clean, 
       aes(x = fct_reorder(var_title, antioxidants, .fun = median), # .fun = median to sort by median
           y = antioxidants, fill = var_title)) + 
  geom_boxplot(alpha = 0.5) + 
  facet_grid(group ~ ., scales = "free", space = "free") + 
  theme_bw() + 
  xlab("") + 
  ylab("Antioxidants (FRAP units per 100 g)") + 
  theme(legend.position = "none") +
  coord_flip()

Version Author Date
17fbc45 maggiedouglas 2024-02-27
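
To see why a pooled median can disagree with what a single facet shows, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical; only the column names mirror kale_var_clean. Variety A is lower than B in group g1 but higher in g2, so no single ordering can be right for both facets.

```r
library(dplyr)
library(forcats)

# hypothetical toy data: two varieties measured in two groups
toy <- data.frame(
  group        = rep(c("g1", "g2"), each = 4),
  var_title    = rep(c("A", "A", "B", "B"), times = 2),
  antioxidants = c(1, 2, 3, 4,      # g1: A low, B high
                   20, 22, 10, 12)  # g2: A high, B low
)

# ordering used by fct_reorder(): one median per variety, pooled over groups
pooled_levels <- levels(fct_reorder(toy$var_title, toy$antioxidants, .fun = median))
pooled_levels

# medians within each group, which is what each facet actually displays
within_medians <- toy %>%
  group_by(group, var_title) %>%
  summarize(med = median(antioxidants), .groups = "drop")
within_medians
```

Here the pooled ordering puts B before A, which matches group g2 but is reversed relative to the medians within group g1.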

Example combined across datasets

  • This shows that the sorting works without the facets, supporting the idea that the strange result in the faceted version comes from calculating the median (or mean) across all groups combined rather than within each group.
ggplot(kale_var_clean, 
       aes(x = fct_reorder(var_title, antioxidants, .fun = median), # .fun = median to sort by median (not mean)
           y = antioxidants, fill = var_title)) + 
  geom_boxplot(alpha = 0.5) + 
  # facet_grid(group ~ ., scales = "free", space = "free") + 
  theme_bw() + 
  xlab("") + 
  ylab("Antioxidants (FRAP units per 100 g)") + 
  theme(legend.position = "none") +
  coord_flip()
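
On the median vs. mean question: the .fun argument of fct_reorder() controls which summary is used for sorting. The sketch below uses made-up values (the `variety` and `value` vectors are hypothetical) in which one variety has a single large outlier, so the two summaries give opposite orderings.

```r
library(forcats)

# hypothetical values: variety "a" has one large outlier
variety <- c("a", "a", "a", "b", "b", "b")
value   <- c(1, 2, 30, 5, 6, 7)

# order varieties by median: a (2) comes before b (6)
by_median <- levels(fct_reorder(variety, value, .fun = median))

# order varieties by mean: b (6) comes before a (11)
by_mean <- levels(fct_reorder(variety, value, .fun = mean))

by_median
by_mean
```

Since a boxplot draws the median as its center line, sorting by median usually makes the plotted order match what the eye sees.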


Let’s discuss our options in class on Thursday…

Use the imperfect sorting and keep the facets? OR

Drop the facets and graph the combined dataset?
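
A possible third option to bring to the discussion, offered only as a rough sketch and not part of the original lab code, is to sort within each facet. The idea is to build one label per variety-group combination, order those labels by median, and then strip the group suffix from the axis labels. The `toy` data, the `var_in_group` column, and the "___" separator are all hypothetical choices; the separator is assumed not to occur in any variety name.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# hypothetical toy data with the same column names as kale_var_clean
toy <- data.frame(
  group        = rep(c("g1", "g2"), each = 4),
  var_title    = rep(c("A", "A", "B", "B"), times = 2),
  antioxidants = c(1, 2, 3, 4, 20, 22, 10, 12)
)

# one label per variety-group combination, ordered by that combination's median
toy_sorted <- toy %>%
  mutate(var_in_group = fct_reorder(paste(var_title, group, sep = "___"),
                                    antioxidants, .fun = median))

p <- ggplot(toy_sorted,
            aes(x = var_in_group, y = antioxidants, fill = var_title)) +
  geom_boxplot(alpha = 0.5) +
  facet_grid(group ~ ., scales = "free", space = "free") +
  # strip the "___group" suffix so the axis shows only the variety name
  scale_x_discrete(labels = function(x) sub("___.*$", "", x)) +
  theme_bw() +
  xlab("") +
  ylab("Antioxidants (FRAP units per 100 g)") +
  theme(legend.position = "none") +
  coord_flip()

p
```

The trade-off is that the same variety can then appear at different positions in different facets, which makes comparing a variety across groups harder.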


sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-3 lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1     
 [5] dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.0       
 [9] tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0    workflowr_1.7.1   

loaded via a namespace (and not attached):
 [1] sass_0.4.8        utf8_1.2.4        generics_0.1.3    stringi_1.8.3    
 [5] hms_1.1.3         digest_0.6.34     magrittr_2.0.3    timechange_0.3.0 
 [9] evaluate_0.23     grid_4.3.2        fastmap_1.1.1     rprojroot_2.0.4  
[13] jsonlite_1.8.8    processx_3.8.3    whisker_0.4.1     ps_1.7.5         
[17] promises_1.2.1    httr_1.4.7        fansi_1.0.6       scales_1.3.0     
[21] jquerylib_0.1.4   cli_3.6.2         rlang_1.1.3       munsell_0.5.0    
[25] withr_3.0.0       cachem_1.0.8      yaml_2.3.8        tools_4.3.2      
[29] tzdb_0.4.0        colorspace_2.1-0  httpuv_1.6.13     vctrs_0.6.5      
[33] R6_2.5.1          lifecycle_1.0.4   git2r_0.33.0      fs_1.6.3         
[37] pkgconfig_2.0.3   callr_3.7.3       pillar_1.9.0      bslib_0.6.1      
[41] later_1.3.2       gtable_0.3.4      glue_1.7.0        Rcpp_1.0.12      
[45] highr_0.10        xfun_0.41         tidyselect_1.2.0  rstudioapi_0.15.0
[49] knitr_1.45        farver_2.1.1      htmltools_0.5.7   labeling_0.4.3   
[53] rmarkdown_2.25    compiler_4.3.2    getPass_0.2-4