Variety graph trouble-shooting

Code adapted from Ali!

Questions:

1. How can I drop rows with no variety name? (missing data)

2. Why aren’t the boxes sorting as I expect?

3. How can I clean up the variety names?

4. How can I sort by median vs. mean?

Data preparation

Data loading and checking

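library(tidyverse)    # loads dplyr, tidyr, forcats, stringr, ggplot2
library(RColorBrewer) # color palettes
alldata <- read.csv("./data/combined_clean.csv")
str(alldata)     # check column names and types
summary(alldata) # quick summary of each column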

Data transformation

# filter to kale
kale <- alldata %>% 
  filter(crop == "kale")

# attempt to drop missing data
kale_varieties <- kale %>%
  drop_na(variety) # this doesn't work because the empty cells are read as "" (empty strings), not NA

# alternative way to drop missing data
# this says, keep all rows where variety does NOT equal "" (empty)
# the ! in R means 'not', so != means 'not equal to'
kale_varieties <- kale %>% 
  filter(variety != "") 
# this works to remove rows with empty variety values
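Another way to answer question 1 is to turn the empty strings into real NA values first, after which drop_na() behaves as expected. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the kale data). The same effect can also be had at load time with the `na.strings` argument of read.csv(), e.g. `na.strings = c("", "NA")`.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values) standing in for the kale data frame
toy <- tibble(
  variety = c("lacinato", "", "winterbor", ""),
  antioxidants = c(10, 12, 9, 11)
)

# Convert empty strings to NA, then drop rows where variety is NA
toy_clean <- toy %>%
  mutate(variety = na_if(variety, "")) %>%
  drop_na(variety)

nrow(toy_clean) # number of rows that still have a variety name
```

Either approach gives the same rows as filter(variety != ""); converting to NA is mainly useful if other functions should also treat those cells as missing.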

Clean duplicate variety names

# generate a table of all kale variety names
table(kale_varieties$variety)


              black_magic                  darkibor   dazzling_blue_laacinato 
                        2                        13                         4 
               dwarf_blue               dwarf_curly                  lacinato 
                        2                         3                        80 
               meadowlark  nero_di_tuscana_lacinato                       new 
                        2                         3                         2 
              red_russian   red_russian,red russian                  siberian 
                       17                         1                         1 
white_russian,red russian                 winterbor 
                        1                        16

# use case_match() to replace duplicate names
# first argument is variable name (variety)
# next is a list of "original value" ~ "replacement value"
# .default = variety says that if a match is not present in the list, 
# default to the original value (for those values that don't need replacing)
kale_var_clean <- kale_varieties %>%
  mutate(var_clean = case_match(
                      variety, 
                      "red_russian,red russian" ~ "red_russian", 
                      "white_russian,red russian" ~ "mixed",
                      "dazzling_blue_laacinato" ~ "lacinato",
                      .default = variety),
         var_title = str_to_title(var_clean) # capitalize first letter of each name
         )

# check that it worked
table(kale_var_clean$var_title)


             Black_magic                 Darkibor               Dwarf_blue 
                       2                       13                        2 
             Dwarf_curly                 Lacinato               Meadowlark 
                       3                       84                        2 
                   Mixed Nero_di_tuscana_lacinato                      New 
                       1                        3                        2 
             Red_russian                 Siberian                Winterbor 
                      18                        1                       16
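The table above shows that str_to_title() only capitalizes the first letter of names like black_magic, because the underscore is treated as part of the word. For graph labels, one option is to replace the underscores with spaces before capitalizing. This is a minimal sketch on a hypothetical vector of names, not the full dataset:

```r
library(stringr)

# A few variety names in the same style as var_clean (illustrative subset)
var_clean <- c("black_magic", "nero_di_tuscana_lacinato", "red_russian")

# Replace underscores with spaces, then capitalize each word
var_title <- str_to_title(str_replace_all(var_clean, "_", " "))
var_title
```

In the kale code, the same idea would go inside mutate(), wrapping var_clean in str_replace_all() before str_to_title().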

Relationship of nutrient density to crop variety

Example as originally developed in lab

The sorting looks odd because fct_reorder() calculates each variety's median across the entire (combined) dataset, not within each facet.

ggplot(kale_var_clean, 
       aes(x = fct_reorder(var_title, antioxidants, .fun = median), # .fun = median to sort by median
           y = antioxidants, fill = var_title)) + 
  geom_boxplot(alpha= 0.5) + 
  facet_grid(group ~., scales="free", space = "free") + 
  theme_bw() + 
  xlab("") + 
  ylab("Antioxidants (FRAP units per 100 g)") + 
  theme(legend.position = "none")+
  coord_flip()

Version	Author	Date
17fbc45	maggiedouglas	2024-02-27
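If the facets are kept, one possible way to sort within each facet is to reorder on a combined variety-plus-group label and then strip the group part from the axis labels. The sketch below is only an illustration on simulated data: `toy`, `var_in_group`, and the "___" separator are hypothetical names, and `group` is assumed to play the same role as the faceting column in the kale data.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Simulated data (hypothetical values): 3 varieties measured in 2 groups
set.seed(1)
toy <- tibble(
  group = rep(c("A", "B"), each = 30),
  var_title = rep(c("Lacinato", "Winterbor", "Darkibor"), times = 20),
  antioxidants = rnorm(60, mean = 50, sd = 10)
)

# Build one label per variety-group combination and sort those by median,
# so the order is based on each variety's median within its own group
toy_sorted <- toy %>%
  mutate(var_in_group = fct_reorder(paste(var_title, group, sep = "___"),
                                    antioxidants, .fun = median))

p <- ggplot(toy_sorted,
            aes(x = var_in_group, y = antioxidants, fill = var_title)) +
  geom_boxplot(alpha = 0.5) +
  facet_grid(group ~ ., scales = "free", space = "free") +
  # show only the variety name on the axis (drop the "___group" suffix)
  scale_x_discrete(labels = function(x) sub("___.*$", "", x)) +
  theme_bw() +
  xlab("") +
  theme(legend.position = "none") +
  coord_flip()
p
```

Because each facet only shows its own variety-group labels (scales = "free"), the boxes within a facet appear in order of their within-facet medians.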

Example combined across datasets

This shows that the sorting works without the facets, which supports the idea that the strange result in the faceted version comes from calculating the median (or mean) across groups.

ggplot(kale_var_clean, 
       aes(x = fct_reorder(var_title, antioxidants, .fun = median), # .fun = median to sort by median (not mean)
           y = antioxidants, fill = var_title)) + 
  geom_boxplot(alpha= 0.5) + 
 # facet_grid(group ~., scales="free", space = "free") + 
  theme_bw() + 
  xlab("") + 
  ylab("Antioxidants (FRAP units per 100 g)") + 
  theme(legend.position = "none")+
  coord_flip()

Version	Author	Date
17fbc45	maggiedouglas	2024-02-27
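On question 4: fct_reorder() sorts by the median by default, and the .fun argument switches it to another summary such as the mean. The two can give different orders when a group has an outlier, as in this small made-up example (the values are hypothetical):

```r
library(forcats)

# Hypothetical values: variety "a" has one large outlier
variety <- c("a", "a", "a", "b", "b", "b")
value   <- c(1, 2, 30, 5, 6, 7)

# Medians: a = 2, b = 6, so "a" sorts first
by_median <- levels(fct_reorder(variety, value, .fun = median))

# Means: a = 11, b = 6, so "b" sorts first
by_mean <- levels(fct_reorder(variety, value, .fun = mean))

by_median
by_mean
```

Sorting by median matches what a boxplot displays (the center line), so the boxes line up visually; sorting by mean can look out of order on a boxplot when the data are skewed.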

Let’s discuss our options in class on Thursday…

Use the imperfect sorting and keep the facets? OR

Drop the facets and graph the combined dataset?

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-3 lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1     
 [5] dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.0       
 [9] tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0    workflowr_1.7.1   

loaded via a namespace (and not attached):
 [1] sass_0.4.8        utf8_1.2.4        generics_0.1.3    stringi_1.8.3    
 [5] hms_1.1.3         digest_0.6.34     magrittr_2.0.3    timechange_0.3.0 
 [9] evaluate_0.23     grid_4.3.2        fastmap_1.1.1     rprojroot_2.0.4  
[13] jsonlite_1.8.8    processx_3.8.3    whisker_0.4.1     ps_1.7.5         
[17] promises_1.2.1    httr_1.4.7        fansi_1.0.6       scales_1.3.0     
[21] jquerylib_0.1.4   cli_3.6.2         rlang_1.1.3       munsell_0.5.0    
[25] withr_3.0.0       cachem_1.0.8      yaml_2.3.8        tools_4.3.2      
[29] tzdb_0.4.0        colorspace_2.1-0  httpuv_1.6.13     vctrs_0.6.5      
[33] R6_2.5.1          lifecycle_1.0.4   git2r_0.33.0      fs_1.6.3         
[37] pkgconfig_2.0.3   callr_3.7.3       pillar_1.9.0      bslib_0.6.1      
[41] later_1.3.2       gtable_0.3.4      glue_1.7.0        Rcpp_1.0.12      
[45] highr_0.10        xfun_0.41         tidyselect_1.2.0  rstudioapi_0.15.0
[49] knitr_1.45        farver_2.1.1      htmltools_0.5.7   labeling_0.4.3   
[53] rmarkdown_2.25    compiler_4.3.2    getPass_0.2-4

Variety graph trouble-shooting

Prof. D

2024-02-28


Let’s discuss our options in class on Thursday…

Use the imperfect sorting and keep the facets? OR

Drop the facets and graph the combined dataset?