PS 3: Data description

Learning objectives

Use R Markdown to implement reproducible data analysis in R
Use dplyr functions to transform a dataset (e.g. filter, arrange)
Apply understanding of ggplot to describe the distribution of key variables
Continue building an understanding of the data for our project

Background

Our work with Pasa Sustainable Agriculture this semester is focused on completing an exploratory analysis of data from their Nutrient Density Study, and related datasets. We will use R Markdown and a package called workflowr to create a web report to document and share our work (the website you are reading right now was also generated using this package!). The report will have several sections, which we will create through our work together. The first section we will tackle is a basic description of the main datasets we are working with.

Part 1: Draft report text

Generate introductory text on Pasa’s Nutrient Density Study

Building on your responses to last week’s problem set and your reading of background sources related to nutrient decline, each group will generate one paragraph of text for the section of the report that introduces the datasets for our project.

Keep in mind the following as you write:

Pasa refers to study participants as ‘farmer collaborators’

The organization name is not capitalized (‘Pasa’ not PASA)

The audience for our report is likely to include Pasa staff, farmer members (collaborators and not), and Bionutrient Institute staff.

Here is a link to a Google doc where you should write your section.

Resources:

Part 2: Describe the dataset

Each group will be responsible for one crop in the dataset, as follows:

Group 1: Beet
Group 2: Carrot
Group 3: Kale
Group 4: Lettuce
Group 5: Potato
Group 6: Pepper

Create an R Markdown file for your problem set

Create a new R Markdown document using the green plus sign in the upper left
- Title your document PS 3: Data description.
Navigate to File -> Save As
- Save the file as 03_ps_Data-description.Rmd.
Adjust the R Markdown header
- Write your name as the author of the script
Use subheadings (###) to organize your document into the following sections (for a heading to render, there must be a space after the ###)
- Set expectations
- Data preparation
- Tables
- Graphs
- Compare expectations to data
Use code chunks to organize your code within each section (for those sections needing code)
- Make sure that every one of your code chunks is named in the chunk header
- Both the code and output should be included in your final problem set (this is the default)
- Use R Markdown formatting options to make your problem set readable
  - R Markdown cheat sheet

Set expectations

Last week we learned that Pasa’s guiding question for their Nutrient Density Study is:

What are impacts of crop management and soil status on nutrient density?

As we begin our study, the first question that we need to answer is:

Are the data appropriate to answer the guiding question?

Take a moment before you begin to record your expectations in answer to this question. What kind(s) of data do you think the dataset will contain? Do you expect that the data will be suitable to answer the question? Why or why not? (2-3 sentences)

Data preparation

Data loading + checking (code)
- Load the tidyverse library using library()
- Load the necessary dataset (pasa_data_clean.csv) using read.csv()
- Check the structure of the data using str()
- Generate an initial summary of the data using summary()
Data transformation (code)
- Filter the dataset using filter() to include only your crop and store it as a new dataframe
Your summary (text)
- Did the data load correctly? If not, what needs to be fixed?
- Describe what you notice in interpreting the output from summary()
  - How common is missing data? Do you see any patterns?
  - Does the range of values make sense for what each variable represents?
  - You may find it helpful to consult the data dictionary
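
The loading, checking, and filtering steps above can be sketched as follows. This is a minimal illustration, not the expected solution: it writes a tiny made-up CSV to a temporary file so that the code runs on its own, and every column name except `farmer_id` (including `crop`) is an assumption, so check the data dictionary for the real names. The location of pasa_data_clean.csv within your project is also not shown here.

```r
library(dplyr)  # also attached when you run library(tidyverse)

# Toy stand-in for pasa_data_clean.csv. The values are invented, and
# every column name except farmer_id is hypothetical.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "farmer_id,crop,state,calcium",
  "1,Carrot,PA,31.2",
  "1,Kale,PA,150.4",
  "2,Carrot,NY,NA",
  "3,Carrot,PA,28.9"
), toy_csv)

# In the problem set you would point read.csv() at pasa_data_clean.csv
pasa <- read.csv(toy_csv)

str(pasa)      # column types and the first few values of each column
summary(pasa)  # ranges for numeric columns, including counts of NAs

# Keep only one crop and store the result as a new data frame
carrot <- pasa %>%
  filter(crop == "Carrot")

# Count missing values per column for the filtered crop
colSums(is.na(carrot))
```

Comparing the output of `str()` against the data dictionary is a quick way to spot columns that loaded with an unexpected type, such as numbers read in as text.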

Tables

Generate tables to summarize the number of samples for your crop according to:

* State
* Farms (farms are indicated by the `farmer_id` column)
* Variety
* Crop management (there are multiple associated variables that can be split up)

Each member of the group should create at least three tables (i.e. you should divide up the variables among your group members).

You should store each table as a new dataframe, and create it using group_by(), summarize(), and n().

Your tables should be arranged (using arrange()) so that the rows are ordered by number of samples, high to low.

Use datatable() to display your table.
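
One way the table steps could fit together is sketched below, using a small invented data frame in place of your filtered crop data. The object names (`carrot`, `samples_by_farm`, `n_samples`) are placeholders, and the sketch assumes `datatable()` refers to the function from the DT package, which is only called if that package is installed.

```r
library(dplyr)

# Invented stand-in for a filtered crop data frame; only farmer_id
# is a column name taken from the problem set.
carrot <- data.frame(
  farmer_id = c(1, 1, 2, 3, 3, 3),
  state     = c("PA", "PA", "NY", "PA", "PA", "PA")
)

# Count samples per farm, then order the rows from most to fewest samples
samples_by_farm <- carrot %>%
  group_by(farmer_id) %>%
  summarize(n_samples = n()) %>%
  arrange(desc(n_samples))

samples_by_farm

# Display as an interactive table (assumes the DT package is available)
if (requireNamespace("DT", quietly = TRUE)) {
  DT::datatable(samples_by_farm)
}
```

The same pattern applies to the other variables: swap `farmer_id` for the grouping column you are responsible for and give the result its own name.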

Graphs

Generate graphs to show the distribution of major nutrient outcomes for your crop:

* Antioxidants
* Polyphenols
* Calcium
* Potassium
* Magnesium
* Phosphorus

Each member of the group should create two to three of these graphs (i.e. you should divide up the work among your group members).
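
As a starting point, a histogram is one common way to show the distribution of a single numeric variable. The sketch below uses randomly generated numbers rather than Pasa data, and the column name `calcium` is an assumption; substitute the actual nutrient column names from the data dictionary.

```r
library(ggplot2)

# Randomly generated stand-in values, not real measurements.
# The column name calcium is hypothetical.
set.seed(1)
carrot <- data.frame(calcium = c(rnorm(30, mean = 30, sd = 5), NA))

# Histogram of one nutrient; na.rm = TRUE drops missing values quietly
calcium_hist <- ggplot(carrot, aes(x = calcium)) +
  geom_histogram(bins = 15, na.rm = TRUE) +
  labs(
    title = "Distribution of calcium (toy data)",
    x = "Calcium",
    y = "Number of samples"
  )

calcium_hist
```

The number of bins changes how the distribution looks, so it is worth trying a few values before settling on one. `geom_density()` or `geom_boxplot()` are alternatives if a histogram does not suit your variable.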

Compare expectations to data

Revisit the expectations you recorded at the beginning. Examine the outcomes from your data summary and consider them in light of your expectations. Do you think the dataset is suitable to answer Pasa’s guiding question? Why or why not? What do you recommend that we do next in light of what you saw in the data? (3-5 sentences)

Part 3: Submit your problem set!

Knit your R Markdown file using the Knit button at the top of the code editor. This is a good check on whether your analysis is reproducible!

To access your file, navigate to the Files tab in the lower right window. Find the file called 03_ps_Data-description.html and click the box next to it. Navigate to More -> Export to download the file. It will likely go to your downloads folder.

Examine the file closely to make sure that it knitted correctly and contains all parts of your problem set. If you need to make revisions, you can simply revise your code and then knit it again. Submit the .html file in the appropriate Moodle dropbox.
