
```{r 'check_ps', include=FALSE}

user.name = 'Jane Doe' # set to your user name

# To check your problem set, run the
# RStudio Addin 'Check Problemset'

# Alternatively run the following lines
library(RTutor)
ps.dir = getwd() # directory of this file
ps.file = 'week_one.Rmd' # name of this file
check.problem.set('week_one', ps.dir, ps.file, user.name=user.name, reset=FALSE)
```


## Exercise Overview

<h1>Week 1 - Data Analysis & Visualization in R</h1>


<h4>PBHLTH W142: UC Berkeley School of Public Health</h4>
<div id="block-quotes" class="section level2">
<blockquote>
<p>Author: Xander Posner, Graduate Student Instructor</p>
<p><a href="mailto:xander@berkeley.edu?subject=PHW142">xander@berkeley.edu</a></p>
</blockquote>
</div>
<h4>Access this week's bCourses site <a href="https://bcourses.berkeley.edu/courses/1492482/pages/week-1" target="_blank">here</a></h4>
<h4>Check out the <code>source code</code> for this project on <a href="https://github.com/posnerab/RTutorial-Week1" target="_blank">GitHub</a></h4>


<h3>Tools Used in this Tutorial</h3>

<p>In this tutorial, we will: </p>

<ul>
<li>Get to know the Western Collaborative Group Study (wcgs) dataset from the <code>epitools</code> package</li>
<li>Change some variable types (e.g., numeric to factor)</li>
<li>Create a new BMI variable based on existing height and weight measurements</li>
<li>Generate summary statistics and figures</li>
<li>Interpret your summary statistics and figures</li>
</ul>


<h3>R Packages Used in this Tutorial</h3>

<ul>
<li><code>"tidyverse"</code></li>
<li><code>"epitools"</code></li>
</ul>

<p>When practicing on your own in RStudio, you will first need to install each of these packages using the <code>install.packages("")</code> function, putting the name(s) of the package(s) inside the quotation marks like this:</p>

<div id="block-quotes" class="section level2">
<blockquote>
<p><code>install.packages("tidyverse")</code> OR <code>install.packages(c("tidyverse", "epitools"))</code></p>
</blockquote>
</div>

<p>You <strong>won't</strong> need to <i>install</i> packages more than once on the same computer.</p>

<p>After installing, you must load each package library <strong>separately</strong>, using the <code>library()</code> function, putting the name of each package inside the parentheses without quotation marks like this:</p>

<div id="block-quotes" class="section level2">
<blockquote>
<ul>
<li><p><code>library(tidyverse)</code></p></li>
<li><p><code>library(epitools)</code></p></li>
</ul>
</blockquote>
</div>

<p>You <strong>will</strong> need to <i>load</i> packages each time you need to use them, using the <code>library()</code> function whenever you initiate a new R session.</p>

<p><strong>You don't need to install any packages for this tutorial</strong>. Just press <strong>check</strong> to run any chunk with a <code>library()</code> function in it.</p>
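
<p>For practice on your own computer, one way to avoid reinstalling packages you already have is to check first. The following is a minimal sketch, not part of the course materials: the helper name <code>missing_packages()</code> is made up for illustration, and the <code>install.packages()</code> call is left commented out so nothing is installed by accident.</p>

```r
# Hypothetical helper (not from the tutorial): returns the names of packages
# that are not yet installed on this computer
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "epitools"))
to_install # character(0) means both packages are already installed

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```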


<h3>R Functions Used in this Tutorial</h3>

<ul>
<li><code>library()</code> to load each package library</li>
<li><code>data()</code> to read built-in datasets into R</li>
<li><code>head()</code> to view the first few lines of your dataset</li>
<li><code>nrow()</code> to view the number of rows in your dataset</li>
<li><code>str()</code> to view the "structure" of your dataset, including variable names and types</li>
<li><code>summary()</code> to produce summary statistics</li>
<li><code>ggplot()</code>, <code>geom_histogram()</code>, <code>geom_boxplot()</code>, and <code>geom_col()</code> to create histograms, boxplots, and bar charts</li>
</ul>


Click the "Go to next exercise..." button to continue.


## Exercise 1 -- Get Started

<p>Some code chunks are finished for you; just press <strong>check</strong> to proceed. Other chunks require you to modify and/or add code before pressing <strong>check</strong> to move through the tutorial.</p>
<p>If you get stuck, use the <strong>hint</strong> in each code chunk first. If you're still having trouble, head to Piazza and post your question.</p>

<strong>Make sure you run each chunk in order!</strong>

<h3>Get to Know Your Data</h3>


a) The <code>wcgs</code> dataset is built into the <code>epitools</code> package. As long as you load the libraries first using the <code>library()</code> functions, you will be able to load the dataset by calling <code>data(wcgs)</code> as shown below. Click "check" to continue.
```{r "1_a"}
library(epitools)
library(tidyverse) # helps us make charts and graphics
library(rmarkdown)
library(DescTools)

data(wcgs)
```

b) Now that the <code>wcgs</code> dataset is loaded into your environment, use the <code>head()</code> function to view the first six rows.
```{r "1_b"}
head(wcgs) # view the first six rows of the dataset in table format
```


c) Use the <code>nrow()</code> function to calculate the number of rows in your dataset, which here amounts to our sample size, $n$.
```{r "1_c"}
n <- nrow(wcgs)
```

<p>Notice that when we assign the result of <code>nrow()</code> to an <code>object</code> named <code>n</code>, R doesn't immediately <i>print</i> the result. Instead, the resulting value is saved in your <code>Global Environment</code>.</p>

d) Here are the ways you can have the results of a function print to your Console. Press <strong>check</strong> to observe.
```{r "1_d"}
nrow(wcgs) # PRINTS to the console but DOES NOT save the value

n <- nrow(wcgs) # SAVES the value but DOES NOT print

n # PRINTS out the results that you previously saved above

n + 1 # adds 1 to the value saved in the n object: example of what you can do with stored objects in your environment

print(n <- nrow(wcgs)) # SAVES the value AND also PRINTS to the console in one step
```

<p><strong>NOTE: R is cAsE sEnSiTiVe.</strong> From R's perspective, <code>N</code> would be an entirely separate object from <code>n</code>.</p>
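
<p>A quick way to see case sensitivity in action, using made-up object names and values rather than the <code>wcgs</code> data:</p>

```r
sample_size <- 100   # made-up value for illustration
Sample_Size <- 250   # differs only in capitalization, yet it is a separate object

sample_size == Sample_Size  # FALSE: the two objects hold different values
exists("sample_size")       # TRUE
exists("SAMPLE_SIZE")       # FALSE, assuming no object with this exact name exists
```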


<h3>Summarizing your data</h3>

<p>R has many ways to view summary statistics. The simplest is the <code>summary()</code> function.</p>

e) Run the chunk below to see what summary statistics the <code>summary()</code> function produces.
```{r "1_e"}
summary(wcgs)
```

<p>You will notice that the <code>summary()</code> function prints out the following values for each variable in the <code>wcgs</code> dataset:</p>
<ul>
<li>The minimum value</li>
<li>The value at Q1, the first quartile (25th percentile)</li>
<li>The median value or Q2 (50th percentile)</li>
<li>The mean (average) value</li>
<li>The value at Q3, the third quartile (75th percentile)</li>
<li>The maximum value</li>
</ul>
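
<p>To see where these six numbers come from, here is a small sketch on a made-up vector (the values are illustrative, not taken from <code>wcgs</code>). It also shows that <code>summary()</code> reports counts, rather than quartiles, when it is given a factor.</p>

```r
# Made-up ages for six people
toy_age <- c(39, 41, 44, 46, 52, 59)

toy_summary <- summary(toy_age)
toy_summary

# The same quartiles, computed directly
quantile(toy_age, probs = c(0.25, 0.50, 0.75))

# For a factor, summary() counts the observations in each level
toy_group <- factor(c("A", "B", "A", "A", "B", "A"))
summary(toy_group)
```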

f) Use the dollar sign <code>$</code> to access different variables in your dataset. The first one is done for you. Fill in the code below to summarize the <code>height0</code> variable.
```{r "1_f"}
# Worked example: summarize the age0 variable
summary_age0 <- summary(wcgs$age0)
summary_age0

# Your turn: replace the placeholder below with a call to summary()
# summary_height0 <- "<<<<<<YOUR CODE HERE>>>>>>"

# Sample solution
summary_height0 <- summary(wcgs$height0)
summary_height0
```


## Exercise 2 -- Modify Your Dataset

<h3>Make Factor Variables</h3>

a) First, view the structure of your dataset using the <code>str()</code> function. Click <strong>Check</strong> to observe. Notice the different variable "types."
```{r "2_a"}
str(wcgs)
```

b) Notice that the dichotomous behavioral pattern variable, <code>dibpat0</code>, is currently coded as an integer (1 = Type A, 0 = Type B). We want R to know that <code>dibpat0</code> is a factor variable with 2 discrete levels: A and B. To do this, we will use the <code>factor()</code> function and create a new factor variable called <code>dibpat0_fact</code>.
```{r "2_b"}
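# factor() pairs labels with levels by position, so list the levels explicitly:
# 1 becomes "A" (Type A) and 0 becomes "B" (Type B)
wcgs$dibpat0_fact <- factor(wcgs$dibpat0, levels = c(1, 0), labels = c("A", "B"))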
```
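
<p>When recoding numbers into labels, it helps to confirm that each code landed on the intended label. A minimal sketch on made-up codes (not the <code>wcgs</code> data): by default <code>factor()</code> sorts the unique values and pairs them with <code>labels</code> in that order, so naming the <code>levels</code> explicitly removes any guesswork.</p>

```r
toy_code <- c(1, 0, 0, 1, 1)  # made-up codes: 1 = Type A, 0 = Type B

# levels and labels are paired by position: 1 -> "A", 0 -> "B"
toy_fact <- factor(toy_code, levels = c(1, 0), labels = c("A", "B"))

# Cross-tabulate the original codes against the new labels to verify the mapping
table(code = toy_code, label = toy_fact)
```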


c) Now create a new variable called <code>smoker0</code> that is equal to 1 if someone currently smokes any cigarettes and 0 if they don't.
```{r "2_c"}
# create a new variable called smoker0 and set it to 1 for everyone with ncigs0 > 0
wcgs$smoker0[wcgs$ncigs0 > 0] <- 1

# set smoker0 to 0 for everyone with ncigs0 equal to 0
wcgs$smoker0[wcgs$ncigs0 == 0] <- 0
```

<p>The square brackets indicate an R operation called "subsetting." The condition inside the brackets selects rows, so the second line sets the value of <code>smoker0</code> to 0 only for individuals whose <code>ncigs0</code> value is 0.</p>

<p><strong>Note that we had to use two equal signs (==) when we specified the second condition for subsetting.</strong> Two equal signs test whether two values are equal; a single equal sign assigns a value.</p>

<h4>Options for Subsetting Conditions</h4>
<ul>
<li><code>dataset$variable == X</code> (equal to some number X)</li>
<li><code>dataset$variable &lt; Y</code> (less than some number Y)</li>
<li><code>dataset$variable &gt; Z</code> (greater than some number Z)</li>
<li><code>dataset$variable &lt;= A</code> (less than or equal to some number A)</li>
<li><code>dataset$variable &gt;= B</code> (greater than or equal to some number B)</li>
</ul>
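
<p>Each of these conditions produces a vector of <code>TRUE</code>/<code>FALSE</code> values, one per row, and only the <code>TRUE</code> rows are changed. A small sketch on a made-up data frame (the name <code>toy</code> and its values are illustrative) shows the condition itself and an equivalent one-step recode using <code>ifelse()</code>:</p>

```r
toy <- data.frame(ncigs0 = c(0, 20, 5, 0))  # made-up cigarette counts

toy$ncigs0 > 0   # the logical vector used inside the square brackets

# Two-step recode with subsetting, as in the tutorial
toy$smoker0[toy$ncigs0 > 0] <- 1
toy$smoker0[toy$ncigs0 == 0] <- 0

# Equivalent one-step recode
toy$smoker0_alt <- ifelse(toy$ncigs0 > 0, 1, 0)

toy
```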


d) Now try making a "yes/no" variable for systolic blood pressure, <code>sbp0</code>, just like we did for smoking above. Call the new variable <code>highsbp0</code> and set it equal to 1 if <code>sbp0</code> is greater than or equal to 140, and set it equal to 0 if <code>sbp0</code> is less than 140.
```{r "2_d"}
# Your turn: replace each placeholder with a subsetting expression
# "<<<<<<YOUR CODE HERE>>>>>>" <- 1
# "<<<<<<YOUR CODE HERE>>>>>>" <- 0

# Sample solution
wcgs$highsbp0[wcgs$sbp0 >= 140] <- 1 # high: sbp0 of 140 or more

wcgs$highsbp0[wcgs$sbp0 < 140] <- 0 # normal: sbp0 below 140
```

e) Now make <code>highsbp0</code> a factor variable with 2 levels.
```{r "2_e"}
wcgs$highsbp0 <- factor(wcgs$highsbp0, labels = c("normal sbp", "high sbp"))
```


<h3>Create a new BMI variable</h3>

<p>The <code>wcgs</code> dataset has the height variable, <code>height0</code>, in inches (in) and the weight variable, <code>weight0</code>, in pounds (lbs). We want to analyze BMI, which is equal to weight (mass) in kilograms divided by height in meters squared, so we'll have to convert weight to kilograms (kg) and height to centimeters (cm), which we then divide by 100 to get meters.</p>

f) Create new variables for height, weight, and BMI in the <code>wcgs</code> dataset by using the assignment operator (<code>&lt;-</code>) like so:
```{r "2_f"}
# take the height0 value (inches) for each individual (row), multiply it by 2.54,
# round to 2 decimal places, and store the result in a new column named heightcm0
wcgs$heightcm0 <- round(wcgs$height0 * 2.54, digits = 2)

# convert pounds to kilograms (1 kg is approximately 2.2 lbs)
wcgs$weightkg0 <- round(wcgs$weight0 / 2.2, digits = 2)

# BMI = weight in kg / (height in m)^2; dividing heightcm0 by 100 gives meters
wcgs$BMI0 <- round(wcgs$weightkg0 / ((wcgs$heightcm0 / 100)^2), digits = 1)
```
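
<p>As a sanity check on the conversion, here is a worked example for one hypothetical person (70 in, 154 lbs; these numbers are made up). It follows the same steps as the chunk above and compares the result with the commonly used imperial shortcut, BMI = 703 times weight in pounds divided by height in inches squared.</p>

```r
height_in <- 70   # hypothetical height in inches
weight_lb <- 154  # hypothetical weight in pounds

height_cm <- round(height_in * 2.54, digits = 2)  # 177.8 cm
weight_kg <- round(weight_lb / 2.2, digits = 2)   # 70 kg

bmi_metric <- round(weight_kg / ((height_cm / 100)^2), digits = 1)
bmi_metric  # 22.1

# Imperial shortcut for comparison; agrees after rounding to 1 decimal place
bmi_imperial <- round(703 * weight_lb / height_in^2, digits = 1)
bmi_imperial  # 22.1
```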

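g) This chunk writes the <code>wcgs</code> data frame, including the new variables we created, to a comma-separated values (.csv) file named "wcgs_data_new.csv". Because <code>write_csv()</code> invisibly returns the data frame it was given, assigning its result to the <code>wcgs_new</code> object stores the updated data frame, with its factor variables intact, so it can be called later on. The file itself is not read back into R.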
```{r "2_g"}
wcgs_new <- write_csv(wcgs, "wcgs_data_new.csv")
```
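
<p>The behavior of <code>write_csv()</code> can be checked on a small made-up data frame. This sketch assumes the <code>readr</code> package (part of the tidyverse) is installed; the object names and the temporary file are illustrative. It also shows why reading a CSV back in is not the same as keeping the original object: factor columns come back as plain text.</p>

```r
library(readr)

toy <- data.frame(id = 1:3,
                  pattern = factor(c("A", "B", "A")))  # made-up data

tmp_file <- tempfile(fileext = ".csv")  # temporary file, so nothing is overwritten

# write_csv() writes the file and invisibly returns the data frame it was given
returned <- write_csv(toy, tmp_file)
class(returned$pattern)  # still a factor

# Reading the file back is a separate step; a CSV does not store the factor type
reloaded <- read_csv(tmp_file, show_col_types = FALSE)
class(reloaded$pattern)  # character
```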

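h) Now, use the <code>head()</code> function again to view the first six rows of the <code>wcgs_new</code> dataset. You will notice that there are now six new columns (<code>dibpat0_fact</code>, <code>smoker0</code>, <code>highsbp0</code>, <code>heightcm0</code>, <code>weightkg0</code>, and <code>BMI0</code>) with the variable names we assigned and the values we calculated.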
```{r "2_h"}
head(wcgs_new)
```

## Exercise 3 -- Create Histograms & Box Plots

a) Create a histogram for the <code>sbp0</code> variable.
```{r "3_a"}
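# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 3.3.0 or later)
ggplot(wcgs_new, aes(x = sbp0, y = after_stat(density))) +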
  geom_histogram(binwidth = 10,
                 color = "black",
                 fill = "blue") +
  labs(title = "WCGS Systolic Blood Pressure at Baseline",
       subtitle = "n = 3154 men ages 39 to 59",
       x = "SBP in mmHg", y = "Density" ) +
  scale_x_continuous(limits = c(0, 250),
                     breaks = c(50, 100, 150, 200, 250),
                     labels = c(50, 100, 150, 200, 250)) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
```
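
<p>On a density-scaled histogram, each bar's height is its count divided by the total count times the bin width, so the bar areas add up to 1. A minimal sketch with base R's <code>hist()</code> on made-up blood pressure values (not the <code>wcgs</code> data):</p>

```r
toy_sbp <- c(102, 110, 118, 120, 124, 126, 130, 138, 144, 160)  # made-up values

# Compute the bins without drawing the plot; the bin width is 10
h <- hist(toy_sbp, breaks = seq(100, 160, by = 10), plot = FALSE)

h$counts                            # observations per bin
h$counts / (length(toy_sbp) * 10)   # density computed by hand
h$density                           # matches the line above

sum(h$density * diff(h$breaks))     # total area of the bars: 1
```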

b) Create a box plot for the <code>sbp0</code> variable.
```{r "3_b"}
ggplot(wcgs_new, aes(y = sbp0)) +
  geom_boxplot(width = 0.5) +
  scale_x_continuous(labels = NULL) +
  labs(y = "SBP in mmHg",
       x = "",
       title = "WCGS Systolic Blood Pressure at Baseline",
       subtitle = "n = 3154 men ages 39 to 59") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
```
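
<p>By default, the points that <code>geom_boxplot()</code> draws beyond the whiskers follow the 1.5 times IQR rule. A sketch of that rule on made-up values (illustrative only), using <code>quantile()</code> for the quartiles:</p>

```r
toy_sbp <- c(102, 110, 118, 120, 124, 126, 130, 138, 144, 230)  # made-up values

q <- unname(quantile(toy_sbp, probs = c(0.25, 0.75)))  # Q1 and Q3
iqr <- q[2] - q[1]                                     # interquartile range

lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences would be plotted as individual points
outliers <- toy_sbp[toy_sbp < lower_fence | toy_sbp > upper_fence]
outliers  # 230
```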

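c) Create a bar chart of the mean of the <code>sbp0</code> variable for each level of the <code>dibpat0_fact</code> variable.
```{r "3_c"}
# geom_col() draws one bar per row, so 3154 overlapping bars per group would
# show only the largest sbp0 value; stat_summary() plots the group mean instead
ggplot(wcgs_new, aes(x = dibpat0_fact, y = sbp0)) +
  stat_summary(fun = mean,
               geom = "col",
               fill = "orange",
               width = 0.7) +
  labs(y = "Mean SBP in mmHg",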
       x = "Behavioral Pattern",
       title = "Systolic Blood Pressure According to Behavioral Pattern",
       subtitle = "WCGS: n = 3154 men ages 39 to 59") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
```
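
<p>A bar chart of a numeric variable by group is easiest to interpret when each bar stands for one summary value per group. A sketch of computing group means on made-up data (names and values are illustrative); a table like this, with one row per group, is the kind of input <code>geom_col()</code> expects:</p>

```r
toy <- data.frame(pattern = factor(c("A", "A", "B", "B", "B")),
                  sbp0 = c(130, 140, 120, 126, 114))  # made-up values

# One row per group: the mean of sbp0 within each behavioral pattern
group_means <- aggregate(sbp0 ~ pattern, data = toy, FUN = mean)
group_means
```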

d) Create a two-way table of the behavioral pattern factor variable <code>dibpat0_fact</code> and high blood pressure <code>highsbp0</code>.
```{r "3_d"}
highsbp_dibpat_table <- table(wcgs_new$dibpat0_fact, wcgs_new$highsbp0)


# You will model your tables in R Lab Workbook 1 on the following two table outputs

addmargins(highsbp_dibpat_table)
round(prop.table(highsbp_dibpat_table), digits = 3)
```
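
<p><code>prop.table()</code> divides by the grand total unless told otherwise. If the question is what share of each behavioral pattern has high blood pressure, row proportions are more direct. A sketch on a made-up table (the counts are illustrative, not results from <code>wcgs</code>):</p>

```r
# Made-up counts: rows are behavioral patterns, columns are blood pressure groups
toy_table <- matrix(c(40, 10,
                      45,  5),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(pattern = c("A", "B"),
                                    sbp = c("normal sbp", "high sbp")))

prop.table(toy_table)                           # each cell divided by the grand total (100)

row_props <- prop.table(toy_table, margin = 1)  # each cell divided by its row total
row_props
```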