PS 10: ANOVA/Varieties

Learning objectives

Fit an ANOVA model in R
Assess the fit of the data to model assumptions
Transform data to improve fit to assumptions (if necessary)
Correctly interpret the output from an ANOVA

Background

This week, we will continue exploring formal statistical models, moving on to analysis of variance (ANOVA). ANOVA is a statistical modeling technique that is used to test whether a continuous response variable differs by a categorical grouping factor. We will use this approach to test whether nutrient density differs among crop varieties in the combined Pasa and Bionutrient Institute dataset.

Resources

Data dictionaries
- Pasa and BI
Chunk options
- In addition to the options listed, you can also use results = "hide" to show the code but not the output in the final document
ANOVA
- See Cheat Sheets –> ANOVA for guidance on running running ANOVA in R
- See link to example from class/lab
Graphing resources
- See Cheat Sheets –> Graph
- See the ggplot2 Cheat Sheet
- See the R Graph Gallery

Part 1: Create an R Markdown file for your problem set

Make sure you are working in the R Project for this week in Posit Cloud (called Lab_10_ANOVA)
- This is important because if I have a question about what you did or how your code is working, I need to be able to find/access your code on Posit Cloud
Create a new R Markdown document using the green plus sign in upper left
- Title your document PS 10: ANOVA
Navigate to File -> Save As
- Save the file as follows: 10_ps_Crop_Nutrient_ANOVA.Rmd - replacing Crop and Nutrient with the name of your crop (e.g. Beet) and nutrient (e.g. Calcium)
- See the class spreadsheet for your nutrient assignment
Adjust the R Markdown header
- Write your student ID number (not your name) as the author of the script
Use subheadings to organize your document into the sections shown below
- Use ## for main headings
- Use ### for subheadings
Use code chunks to organize your code within each section (for those sections needing code)
- Make sure that every one of your code chunks is named in the chunk header
- Use R Markdown chunk options to make your output more readable. You are welcome to suppress output for initial data processing steps, but please show both code and output for all ‘deliverables’ outlined below.

Part 2: Set expectations

Take a moment to record your expectations before you begin.

You may find it helpful to review the exploratory graphs created by the variety group before spring break. (link here)

Your notes should include the following (2-4 sentences):

Based on what you have learned so far and your exploration of the data, do you expect to see differences among varieties in your crop and nutrient? Why or why not?

Assume that variety is the independent variables and nutrient density is the dependent variable. Your expectations should include any specific patterns you expect to see among groups (e.g. do you have any expectations about which varieties will be more nutritious?, etc.).

Part 3: ANOVA for differences among varieties

A: Prepare your dataset

Data loading + checking
- Load the tidyverse using library()
- Load the necessary dataset (varieties_clean.csv) using read.csv()
  - FYI - this dataset has been filtered to remove Pasa data from BI, so there is no duplication of values in the dataset (what does this have to do with independence?). This dataset has also been cleaned to only include those varieties with at least four observations across the Pasa and BI dataset combined.
Data transformation
- Filter the dataset using filter()
  - Include only your crop
- Use select to narrow the dataset to the variables of interest
  - See the data dictionary to decide what variables you need: Pasa and BI
  - I recommend using the variable var_title for variety, because it has already been reformatted to be cleaner for graphing.
- Filter the dataset to remove NA values
  - After selecting your variables, use drop_na() to remove NA values
- Store the filtered and narrowed dataset as a new dataframe
Check your data and make sure it loaded correctly
- Check the structure of the data using str()
- Check the sample size using table(df$variable) on your dataframe (df) and independent variable
  - If there are varieties with fewer than 4 observations, remove them using filter()
  - Remember that the != means not equal
  - The form of the code would be filter(var_title != c("variety_name1","variety_name2"))
- Generate an initial summary of the data using summary()
- Adjust chunk options so that the code shows, but the output is hidden in your knitted file

B: Fit the model

Now, use aov to fit the model that you are interested in.

Q: What null hypothesis is this model testing? What is the alternative hypothesis? (2-3 sentences)

C: Assess assumptions

Create the appropriate diagnostic plots. Based on what you know about the assumptions of ANOVA and your understanding of the experimental design, assess how well your data fits the assumptions of the model.

Record your conclusions about the fit of the data to the assumptions. Be sure to explain the thinking/evidence behind your conclusions. (1-2 sentences per assumption)

Independence:
Normality:
Equal variance:

Adjust the model if necessary

The next step will depend upon whether or not your model is meeting assumptions. If it is, you can skip ahead to the next section. If it is not, you should consider either transforming the data or using Welch’s ANOVA.

Note: If the Welch’s ANOVA is throwing error messages or strange results (e.g. NA or NaN values in the ANOVA table), it is likely that you have varieties with too-small sample size and/or varieties without any variation in the measurement variable. In that case, use filter() to remove the problematic varieties and re-fit the model. (see more code tips under Prepare your dataset above)

D: Summarize results

Now, you will need to examine the outcomes of your final model (using summary() for standard ANOVA or simply printing the model object for Welch’s ANOVA). Remember to examine the output from the model that best fits your assumptions and the data.

IF you found significant differences, you should perform a post-hoc test to determine which varieties are different from each other (standard ANOVA: HSD.test, Welch’s ANOVA games_howell_test().

Write a brief paragraph summarizing the results of your analysis in full sentences. If you did not find significant differences, this will likely be 1-2 sentences. If you did find significant differences, this will likely be 3-4 sentences. Your paragraph should include:

Whether or not you found significant differences among groups (including the name of your statistical test, F value, degrees of freedom, and P value)
- To make a subscript in R Markdown, you can use the tilde: F~4,30~ becomes F_4,30
If there were significant differences, where did they occur? (i.e. which groups were different from each other?)
What was the direction and magnitude of the observed differences?

E: Final graph

Create a final graph to summarize your data + model.

This should be a bar graph showing the means of each group, with error bars representing +/- 1 standard error. Order the bars from high to low mean using fct_reorder(). (to see how this function works, see the example code from the variety group linked below)

Other specifications of graphs should be similar to our prior graphs:

Label all axes, including units (see data dictionary)
Bars are colored by the grouping factor
Flip the axes for readability (using coord_flip)
Apply class theme (theme_light)
No legend (unnecessary)
No title (unnecessary)
Use chunk options to adjust the size as needed to make the graph readable

For your reference, here is a link to the scripts created by the variety group before spring break: link here

Part 4: Final interpretation

Interpret the results of your analysis (4-6 sentences):

If a farmer wants to increase nutrient density of your nutrient, could they do so through variety selection? Why or why not?
What do you recommend in terms of specific varieties?
Do you think it would be worthwhile to explore for your crop and nutrient whether there is variation based in color or type (hybrid vs. open pollinated)? Your response should be informed by your results in combination with the information we compiled on varieties.
What are your overall conclusions? Were your expectations supported or challenged?

Submit your problem set!

Knit your R Markdown file using the Knit button at the top of the code editor. This is a good check on whether your analysis is reproducible!

To access your file, navigate to the Files tab in the lower right window. Find the .html file for your problem set and click the box next to it. Navigate to More –> Export to download the file. It will likely go to your downloads folder.

Examine the file closely to make sure that it knitted correctly and contains all parts of your problem set. If you need to make revisions, you can simply revise your code and then knit it again. Submit the .html file in the appropriate Moodle dropbox.

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.2         knitr_1.45        rlang_1.1.3      
 [5] xfun_0.41         stringi_1.8.3     promises_1.2.1    jsonlite_1.8.8   
 [9] workflowr_1.7.1   glue_1.7.0        rprojroot_2.0.4   git2r_0.33.0     
[13] htmltools_0.5.7   httpuv_1.6.13     sass_0.4.8        fansi_1.0.6      
[17] rmarkdown_2.25    jquerylib_0.1.4   evaluate_0.23     tibble_3.2.1     
[21] fastmap_1.1.1     yaml_2.3.8        lifecycle_1.0.4   stringr_1.5.1    
[25] compiler_4.3.2    fs_1.6.3          Rcpp_1.0.12       pkgconfig_2.0.3  
[29] rstudioapi_0.15.0 later_1.3.2       digest_0.6.34     R6_2.5.1         
[33] utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3    bslib_0.6.1      
[37] tools_4.3.2       cachem_1.0.8