Appendix D Assignments
D.1 Evaluation overview
Item | Weight (%) | Due date | Notes |
---|---|---|---|
Set-up | 5 | Friday, Jan 22, 1400 | D.2 |
Online Module + Cheat sheet | 2 + 3 | Saturday, Jan 23, 1400 | D.2 |
Online Module + Cheat sheet | 2 + 3 | Saturday, Jan 30, 1400 | D.3 |
Online Module + Cheat sheet | 2 + 3 | Saturday, Feb 6, 1400 | D.4 |
Online Module + Cheat sheet | 2 + 3 | Friday, Feb 12, 1900 | D.5 |
Reading week, no class | - | - | - |
Data Challenge | 15 | Saturday, Feb 27, 1400 | D.6 |
Online Module + Cheat sheet | 2 + 3 | Saturday, Mar 6, 1400 | D.7 |
Project seed | 5 | Saturday, Mar 13, 1400 | D.8 |
Online Module + Cheat sheet | 2 + 3 | Saturday, Mar 20, 1400 | D.9 |
Project synopsis | 5 | Saturday, Mar 27, 1400 | D.10 |
Online Module + Cheat sheet | 2 + 3 | Saturday, Apr 3, 1400 | D.11 |
Descriptives | 5 | Saturday, Apr 10, 1400 | D.12 |
Final project deliverable | 30 | Monday, Apr 26, 1400 | D.13 |
D.2 Week 1
Reading
Housekeeping and Chapter 1.
There are two parts to this week’s assignment.
- Set-up (5%)
  - Set up R and RStudio after reading Chapter 1. Specifically, you will show me that you have properly customized the settings in RStudio regarding `.RData`.
  - Adopt the workflow as instructed in Chapter 1. Specifically, you will show me that you have successfully installed `renv` by running `.libPaths()` in the RStudio console, and that you have set up the project folder.
  - Book a session with me using https://chunyun.youcanbook.me and walk me through your set-up.
  - Sign up for an account on cocalc.
  - Sign up for an account on slack and join the workspace on slack for this course.
- Datacamp (5%)
  - Sign up for Datacamp following the invitation link and complete the module assigned to you (2%).
  - Write down R commands that you learned from the module (3%). See Appendix E for an example of the deliverable.
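The `.libPaths()` check mentioned above can be paired with a couple of other quick checks before the walkthrough session. The following is a minimal sketch, not part of the official instructions: the variable names are illustrative, and it assumes that the Chapter 1 workflow leaves `renv` installed and no `.RData` file in the project folder.

```r
# Library paths that R searches for installed packages.
lib_paths <- .libPaths()
print(lib_paths)

# TRUE if the renv package can be found in one of those libraries.
renv_installed <- requireNamespace("renv", quietly = TRUE)

# Assumption: the RStudio settings were changed so that the workspace
# is not saved, in which case no .RData file should sit in the project folder.
rdata_present <- file.exists(".RData")

cat("renv installed:", renv_installed, "\n")
cat(".RData in working directory:", rdata_present, "\n")
```

Run it from the RStudio console with the course project open, so that the working directory is the project folder.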
D.3 Week 2
Reading
Chapter 2
Datacamp (5%)
- complete the module assigned to you (2%).
- Write down R commands that you learned from the module (3%).
D.4 Week 3
Reading
This week, we change gears a bit and read the first two chapters from another book: Chapters 1 and 2 of “Introduction to Modern Statistics”
https://openintro-ims.netlify.app/getting-started-with-data.html
https://openintro-ims.netlify.app/summarizing-visualizing-data.html
Challenge
If you find these two chapters very easy to read and have extra time, try to re-create some plots from the assigned reading using what you have learned from this course so far. For example, Figure 2.1, Figure 2.4, Figure 2.5. As you attempt to reproduce those figures, you may find yourself looking up various topics online such as “how to change colour in ggplot.” Take notes on what you learn and update your cheat sheet.
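As one illustration of the kind of note that could go into the cheat sheet, here is a minimal sketch of changing point colour in ggplot2. It uses R's built-in `mtcars` data as a stand-in, not the data behind the figures in the assigned reading, and the colours and labels are arbitrary choices.

```r
library(ggplot2)

# Fixed colour for all points: set colour outside aes().
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Colour mapped to a variable: put it inside aes(),
# then choose the colour for each level manually.
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) +
  geom_point() +
  scale_colour_manual(
    values = c("0" = "grey40", "1" = "darkorange"),
    name = "Transmission (am)"
  )
```

Printing `p_fixed` or `p_mapped` at the console draws the corresponding plot.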
Datacamp (5%)
- complete the module assigned to you (2%).
- Write down R commands that you learned from the module (3%).
D.5 Week 4
D.6 Week 5
Reading
Chapter 5
Data challenge
You will work through a data wrangling problem using a real-life dataset. You will need to apply what you have learned in the previous four weeks to solve the problem. You will submit the work as an `.Rmd` file. Start with the template below.
data_challenge.Rmd
---
title: "Data challenge"
author: "First Last"
date: "2021-02-12"
output:
  html_document:
    number_sections: true
---
# Problem statement [^1]
At Shopify, there are 100 sneaker shops online,
and each of these shops sells only one model of shoe.
We want to do some analysis of the average order value (AOV) [^2]
across all the 100 sneaker shops.
When we look at orders data over a 30 day window,
we naively calculate an AOV of $3145.13.
Given that we know these shops are selling sneakers,
a relatively affordable item, something seems wrong with our analysis.
Use the following three questions to guide your investigation.
(a) Investigate what could have gone wrong with this data.
(b) What alternate statistics would you report for this dataset?
(c) What is its value (their values)?
# Import the data
```{r packages, message = FALSE, echo=F}
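# Install xfun so that I can use xfun::pkg_load2
if (!requireNamespace('xfun')) install.packages('xfun')
xf <- loadNamespace('xfun')
cran_packages <- c(
  "dplyr",
  "ggplot2",
  "import", # provides import::from(), used below
  "knitr",
  "readr",
  "tibble"
)
# Install any missing packages and load them without attaching
if (length(cran_packages) != 0) xf$pkg_load2(cran_packages)
# Make the pipe available by name
import::from(magrittr, '%>%')
# Import all of ggplot2 and dplyr into separate environments (gg, dp)
gg <- import::from(ggplot2, .all=TRUE, .into={new.env()})
dp <- import::from(dplyr, .all=TRUE, .into={new.env()})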
```
```{r message=F}
df_sales <- readr::read_csv(
  "https://query.data.world/s/e76we3vpgut5cmydqgm2nb7e5xqx7i",
  col_types = readr::cols(
    order_id = readr::col_integer(),
    shop_id = readr::col_integer(),
    user_id = readr::col_integer(),
    order_amount = readr::col_double(),
    total_items = readr::col_integer(),
    payment_method = readr::col_factor(),
    created_at = readr::col_datetime(format = "%Y-%m-%d %H:%M:%S")
  )
)
# df_sales # sanity check to make sure the data has been imported correctly
```
Most of the variables in the dataset are self-explanatory.
Nevertheless, here is a list of brief descriptions for each one.
+ `order_id`: A unique identifier for each order placed by a customer online
+ `shop_id`: A unique identifier for the sneaker shop involved in
  an online order
+ `user_id`: A unique identifier for the customer who placed an online order
+ `order_amount`: The amount of money each online order incurred
+ `total_items`: The number of sneakers purchased in each online order
+ `payment_method`: The type of payment method used in an online order,
  e.g., credit card, cash.
+ `created_at`: The time stamp when an online order was created
# Eyeball the data
```{r}
tibble::glimpse(df_sales)
```
From the print out, can you tell how many rows and columns are in the data?
(5000 rows/transactions, and 7 columns)
Print out the first three rows to get an idea of what the data actually look like.
```{r}
head(df_sales, 3) %>%
  knitr::kable(
    caption = "First three rows of the `df_sales` data frame"
  )
```
Confirm where the value $3145.13 in the problem statement came from.
```{r}
mean(df_sales$order_amount)
```
<!-- Complete the report with your response.
Keep in mind that there is more than one way to solve the problem.
The following two headings are just an example
of how to organize your final deliverable.
You may change them to your liking. -->
# Check for extreme values
The dubious result is likely due to some extreme values
in either or both of two variables.
Now we examine the minimum and maximum of
`order_amount` and `total_items`.
<!-- Your response below -->
# Summary
<!-- Your response below -->
[^1]: This problem was based on the Shopify Data Science Challenge in 2019.
[^2]: Average Order Value (AOV) means average dollar spent per order.
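Outside the template, the idea of checking the minimum and maximum can be tried on a made-up data frame first. The sketch below uses invented numbers, not the challenge data: `toy_orders` and its values are purely illustrative, and only the two column names are borrowed from `df_sales`.

```r
# Invented data: four ordinary orders and one extreme one.
toy_orders <- data.frame(
  order_amount = c(120, 135, 150, 160, 50000),
  total_items  = c(1L, 1L, 2L, 1L, 400L)
)

# range() returns c(min, max); sapply() applies it to every column
# and collects the results into a matrix with one column per variable.
extremes <- sapply(toy_orders, range)
rownames(extremes) <- c("min", "max")
extremes

# The mean is pulled far away from the typical order by a single row.
mean(toy_orders$order_amount)
```

The same calls work on `df_sales` once the column names of interest are selected.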
D.7 Week 6
Reading
Chapter 6
Datacamp (5%)
- complete the module assigned to you (2%).
- Write down R commands that you learned from the module (3%).
D.8 Week 7
Reading
Chapter 7
Project seed
This is the first of the four deliverables for your course project. For the first deliverable, you will decide on a dataset to use for your final project. This dataset could be from a published study, or one that you / your supervisors have collected in the past. There are several factors you need to consider when choosing the dataset:
Are you allowed to re-analyze (part of) the data?
- If the dataset exists as an appendix of a published article, you are most likely allowed to re-analyze the data.
- If you or your supervisor collected the data for a previous study, you might have to consult your supervisor and get permission.
- If the dataset is part of your ongoing research and you have passed the ethics, you should be able to use it for this course project as well. When in doubt, consult your supervisor.
Does the underlying research question interest you?
I assume that if you choose to use a dataset from your own research or your supervisor’s, the research question addressed by analyzing the data would be interesting to you. Below, I also provide a few sources to look for published studies with their datasets attached. You may choose one whose topic interests you.
Which statistical technique would be required to re-analyze the data?
If you are going to select a dataset that came with a published study, you will likely use the same statistical technique employed by the original authors. You may choose a dataset that would allow you to practice a technique interesting to you, even if you are not yet familiar with the said technique.
Where to look for published studies that have their datasets attached?
Here are a few examples of published articles along with their open access datasets from these three sources mentioned above:
- Musician productivity: article + data, animation
- Math anxiety: article + data
- Numerical cognition: article + data
- Self-deception: article + data
- Moral machine: article + data
- Neurodegeneration and identity: data, article
- Cognitive modelling: article + data
- Password selection: data, article
If you choose to use a published article — either open access or from your supervisor — and the data used in that article, you do not have to replicate all the analyses reported in the original article. In fact, in the second deliverable of this course project, you will define the scope of your analyses.
To summarize, for this assignment, your deliverable will consist of:
- A URL to a published study which includes its open access data, or
- An electronic copy of a published study (presumably from your supervisor) with its accompanying dataset
D.9 Week 8
D.10 Week 9
Reading
Chapter 9
Project synopsis
This is the second of the four deliverables for your course project. Consider this as the first draft of the introduction to a paper you are writing. In this synopsis, answer the following questions:
- What research question is your project (or the study you are replicating) trying to answer?
- In order to tackle the research question, what measurements were involved?
- What findings were reported in the original paper?
Have an imaginary audience in front of you: how would you describe the study to them? Your audience knows very little beyond common sense about the field, so be sure to provide some lay-person style explanations.
Some questions you may have regarding this deliverable:
Can I describe my project as a replication? Yes, you can refer to your project as “a replication of xx” if you are replicating an analysis.
If I am replicating analyses from a paper, do I have to replicate every analysis reported in the paper? No, you do not need to replicate everything from the original paper. Sometimes a paper includes more than a dozen analyses. Select the ones that interest you (or are pertinent to the research question you have chosen). If you are only replicating a part of the original analysis, you can also mention that in the synopsis (e.g., the current replication will not provide a comprehensive answer to the research question …).
Will I be graded for my writing? Not in this deliverable. But do pay attention to your writing style. Look out for non sequiturs, grammatical errors, etc. In the final deliverable, writing style will account for one third of the marks. Now is your opportunity to get feedback and improve if necessary.
How long should the write-up be? It depends on the complexity of your research question, but aim for roughly one page.
What if I have more questions? I’m sure there are questions I have not covered. As with any other writing task, you have to make some assumptions about your audience when you commit ideas to paper. Don’t be afraid to make them, especially because this is just a draft.
D.11 Week 10
Reading
Chapter 10
Datacamp (5%)
- complete the module assigned to you (2%).
- Write down R commands that you learned from the module (3%).
D.12 Week 11
Reading
Chapter 11
Descriptives
This is the third deliverable of your course project. Below is a template for you as a starting point. You only need to use it if you do not have a better option. Examples of components in this deliverable include:
- A description of each variable, in plain English. For example, “lifeExp – Average life expectancy in each country, measured in years.” In the description, you may include information such as: What does this variable measure? If this is a numerical variable, what is the unit of the measurement? Is this a key variable, i.e., do you plan to use it in the replication analyses?
- Missing data analysis. For each key variable, investigate whether there is any missing value (empty cell, or NA, or any other types of illegitimate values). Report missingness on each variable: how many values are missing in each column.
- For numerical variables, examine: mean, standard deviation, minimum, maximum, histogram, boxplot.
- For categorical variables: frequency tables.
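The components listed above map onto a handful of base R calls. The sketch below runs them on a made-up data frame (`toy`, with invented values) rather than on any course dataset; it is one possible starting point, not the required approach.

```r
# Invented data with one missing value in each column.
toy <- data.frame(
  age   = c(21, 34, NA, 45, 29),
  group = factor(c("a", "b", "a", NA, "a"))
)

# Missing data: number of NA values in each column.
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Numerical variable: mean, standard deviation, minimum, maximum,
# computed after dropping missing values.
age_summary <- c(
  mean = mean(toy$age, na.rm = TRUE),
  sd   = sd(toy$age, na.rm = TRUE),
  min  = min(toy$age, na.rm = TRUE),
  max  = max(toy$age, na.rm = TRUE)
)
age_summary

# Categorical variable: frequency table, showing NA as its own row if present.
group_freq <- table(toy$group, useNA = "ifany")
group_freq
```

For the plots, base R's `hist()` and `boxplot()` accept a numerical column such as `toy$age` directly.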
descriptives.Rmd
---
title: "" # swap with your own
author: "Your Name"
date: "year-month-date"
output: html_document
---
# Preamble (delete this section before submitting your work)
- If there are a large number of variables in your dataset,
  some of the analyses listed below may become tedious.
  Try to group them when describing the findings,
  especially if some variables are of the same type (e.g., 5-point Likert scale).
- This guide tries to provide a one-size-fits-all solution to descriptives,
  which is an impossible task to begin with.
  I have merely listed some typical analyses
  you may encounter during this stage of data analysis.
  *Not every prompt included in this template would apply to your case.*
  *Choose ones that you think are relevant.*
  Other than answering the questions mentioned in this template,
  you may want to take a look at the original paper,
  and consider what descriptives have been reported there.
  Not sure what counts as descriptives (vs. inferential statistics)?
  Look for these keywords:
  number of participants,
  their mean/median age,
  male vs. female ratio,
  percentages,
  demographics, etc.
- Please feel free to re-structure this document as you see fit. Own it!
- Submit this R Markdown file as your deliverable.
# Descriptives
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  # set any chunk options here
)
```
```{r packages, include=FALSE}
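if (!requireNamespace('xfun')) install.packages('xfun')
cran_packages <- c(
  # adjust this list based on your needs
  "dplyr",
  "ggplot2",
  "import", # provides import::from(), used below
  "skimr",
  "tibble"
)
# Install any missing packages and load them without attaching
if (length(cran_packages) != 0) xfun::pkg_load2(cran_packages)
# Import all of ggplot2 and dplyr into separate environments (gg, dp)
gg <- import::from(ggplot2, .all=TRUE, .into={new.env()})
dp <- import::from(dplyr, .all=TRUE, .into={new.env()})
# Make the pipe available by name
import::from(magrittr, '%>%')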
```
```{r import-data}
# import the dataset you will be working with and save it as a data frame
# this should be the dataset post cleaning/filtering
```
## Eye-ball the data
```{r}
# replace xx with the actual name of your dataframe
tibble::glimpse(xx)
```
<!-- - Include a description of each variable, in plain English. -->
<!-- For example, -->
<!-- *lifeExp – Average life expectancy in each country, measured in years.* -->
<!-- In the description, you may address: -->
<!-- -->
<!-- - What does this variable measure? -->
<!-- -->
<!-- - Is this a numerical/continuous variable or a categorical variable/factor? -->
<!-- -->
<!-- - If this is a numerical variable, what is the unit of the measurement? -->
<!-- -->
<!-- - Is this a variable you plan to use in the replication analyses? -->
<!-- This question is intended for those of you -->
<!-- who may be working with a dataset which has more variables than what you need. -->
<!-- But if you still have many variables in this data frame -->
<!-- that are irrelevant to your analysis, -->
<!-- then you might want to go back to the data cleaning stage. -->
## Missing data analysis
```{r}
# Insert code below
```
<!-- For each key variable, -->
<!-- investigate whether there is any missing value -->
<!-- (empty cell, or NA, or any other types of illegitimate values). -->
<!-- Report missingness on each variable: how many values are missing in each column. -->
## Numerical variables
<!-- For each numerical variable: -->
<!-- -->
<!-- - Calculate mean, standard deviation, minimum, maximum -->
<!-- -->
<!-- - Plot histogram and boxplot -->
```{r}
# Insert code below
```
<!-- Discuss any notable findings.
For example, is the distribution of any variable clearly not normal? -->
## Categorical variables
<!-- For each categorical variable/factor: -->
<!-- -->
<!-- - List levels of this variable -->
<!-- -->
<!-- - Provide a frequency table -->
```{r}
# Insert code below
```
<!-- Discuss any notable findings.
For example, is the frequency of observations
for a certain level on a variable particularly low, e.g., smaller than 5? -->
D.13 Week 12
Reading
Chapter 12
Final deliverable
This is the fourth and last deliverable of your course project.
You will hand in both an `.Rmd` file and its knitted output.
The `.Rmd` file will consist of components typical of a published article,
albeit with a less detailed introduction and discussion.
- Introduction: You will use what you have written for the synopsis, with edits based on my feedback.
- Methods: You will use what you have written for the descriptives, with edits based on my feedback.
- Results: Report the results of your main analyses. What was reported in the original study? Consider replicating (some of) those components, including numbers, tables, figures. The scope of your analyses should be tied to your synopsis. Which part of the original study did you decide to replicate when you were writing the synopsis? If you have changed your mind about the scope since then, update the introduction to reflect the new scope.
- Conclusion: Summarize your findings. Draw conclusions based on the findings.
- Discussion: Discuss any similarity and/or discrepancy between your results/conclusions and those reported in the original study. If there is any discrepancy, try to offer some reasonable explanations. If the scope of your analyses is so different from the original one that it is not feasible to compare the findings, try to compare the analytical process. For example,
  - During the analyses, did you make any decisions that differed from those reported in the original study? If so, what were they and why did you make them differently?
  - Did you have to make any decisions because the original study did not mention the relevant information? Would you recommend that other authors report such information to ease future replications?
  - What lessons have you learned while working on this exercise that you would share with your peers?