---
title: "Introduction to Solving Biological Problems Using R - Week 4"
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    df_print: paged
    toc: yes
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
---

*Miriam Yeung*

Modified version of material from the University of Cambridge Bioinformatics Training Unit.

# 4. Automation in R

## Motivation

Typically, when we perform an analysis, we need to repeat the same steps **multiple times**, for example over multiple genes, samples or datasets.

With the basic analyses and graphs that we created in the previous sessions, it is possible to copy and paste the relevant section of code and adjust it to produce multiple graphs.

However, take the scenarios where you want to:

1. Count the number of genes expressed in the patients for ***multiple chromosomes***

2. Create boxplots of expression in ER -ve versus ER +ve patients for ***multiple genes***

3. Create boxplots for ***multiple genes*** *if* expression differs between ER +ve and ER -ve patients

First, let's load the libraries we'll need.

```{r, message=FALSE}

library(dplyr)

library(ggplot2)

library(RColorBrewer)

```

Let's also load the data.

```{r}

patients <- read.delim("updated.patient.txt", stringsAsFactors = FALSE)

exprsAnno <- read.delim("anno.gene.expression.txt", stringsAsFactors = FALSE)

```

You can check the dimensions of the objects, to check the numbers of rows and columns they contain.

```{r}

dim(patients)

dim(exprsAnno)

```

Take a look at the files to check what they contain and as a sanity check that all looks ok.

```{r}

View(patients)

View(exprsAnno)

```

You can check the column names.

```{r}

colnames(patients)

colnames(exprsAnno)

```

`patients` has 6 columns for samplename, age, er, grade, her2, pr, for 168 patients. We can use this object to get ER/PR/HER2 status of the patients.

`exprsAnno` has the expression information for the 168 patients (who all have an ID that starts with "NKI"), followed by columns for gene symbol, chromosome and start position (HUGO.gene.symbol, Chromosome, Start).

## Automation Example 1 - Counting genes for multiple chromosomes

We want to count the genes in the dataset that lie on chromosomes 1, 5, 6 and 8.

First, let's count the genes on chromosome 1. We can use `dplyr`'s `filter()` and `summarise()` functions; `summarise(n())` gives a count of rows.

```{r}

exprsAnno %>%

filter(Chromosome == "chr1") %>%

summarise(n())

```

Note that `tally()` is a convenient wrapper function you can use instead of `summarise(n())`

```{r}

exprsAnno %>%

filter(Chromosome == "chr1") %>%

tally()

```

That gives us the count for one chromosome (chr1). Now how do we identify the number of genes for multiple chromosomes, chromosomes 1, 5, 6 and 8?

We could copy and paste the code like below.

```{r}

exprsAnno %>%

filter(Chromosome == "chr1") %>%

summarise(n())

exprsAnno %>%

filter(Chromosome == "chr5") %>%

summarise(n())

exprsAnno %>%

filter(Chromosome == "chr6") %>%

summarise(n())

exprsAnno %>%

filter(Chromosome == "chr8") %>%

summarise(n())

```

But as you can imagine, the more lines of code that need to be copied, pasted and edited, the more likely it is that errors will creep in.

Copying, pasting and editing can be:

1. Tedious

2. Error-prone

## Automating Commands: Loops and flow control

- Many programming languages have ways of doing the same thing many times, perhaps changing some variable each time. This is called **looping**, and it is a way to automate tasks.

- As we are doing the same thing multiple times, but with a different chromosome each time, we can use a **loop** instead.

- R has two basic types of loop:

    + a **`while`** loop: run some code while some condition is true (*rarely needed in practice, so not covered here*)

    + a **`for`** loop: run some code on every value in a vector

`for`

The basic structure of a `for` loop:

```

for (element in vector){

... do this ...

}

```

- The code runs once for each element of the vector, so we can predict how many times it will run.

- Note: *element* and *vector* are just variable names, so you can call them whatever you like, as long as they satisfy the constraints on variable naming.

- Here's how we might use a `for` loop to find the number of genes on each chromosome. We store the count in a variable called `numRows` so we can then print it out.

```{r}
chrom <- c("chr1", "chr5", "chr6", "chr8")

for (chr in chrom) {
  numRows <- exprsAnno %>%
    filter(Chromosome == chr) %>%
    summarise(n())
  s <- paste("The number of genes on", chr, "is", numRows)
  print(s)
}
```

- The `for` loop above finds the same information that we obtained earlier by copying, pasting and editing.

- To see what a `for` loop is doing, note that the commands being run are equivalent to the following:

```{r}

chr <- "chr1"

numRows <- exprsAnno %>% filter(Chromosome == chr) %>% summarise(n())

chr <- "chr5"

numRows <- exprsAnno %>% filter(Chromosome == chr) %>% summarise(n())

chr <- "chr6"

numRows <- exprsAnno %>% filter(Chromosome == chr) %>% summarise(n())

chr <- "chr8"

numRows <- exprsAnno %>% filter(Chromosome == chr) %>% summarise(n())

```
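The same pattern can be tried without the course data. The sketch below is only an illustration: `toyGenes` is a small made-up data frame standing in for `exprsAnno`, and base R's `sum()` is used in place of the `dplyr` pipeline so that it runs in a fresh session.

```r
# Toy stand-in for exprsAnno: one row per gene (symbols and chromosomes are made up)
toyGenes <- data.frame(
  HUGO.gene.symbol = c("A", "B", "C", "D", "E"),
  Chromosome = c("chr1", "chr1", "chr5", "chr8", "chr1"),
  stringsAsFactors = FALSE
)

chrom <- c("chr1", "chr5", "chr6", "chr8")

for (chr in chrom) {
  # sum() over a logical vector counts the TRUE values
  numRows <- sum(toyGenes$Chromosome == chr)
  print(paste("The number of genes on", chr, "is", numRows))
}
```

After the loop has finished, `chr` and `numRows` still hold the values from the last iteration, which is why only the final count would be available if we did not print inside the loop.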

## Exercise

Using the `patients` object, find the number of patients who are aged over 20, over 30 and over 40, and print each answer to the console.

```{r}

ages <- c(20, 30, 40)

### Insert your code here ###

```

## Automation Example 2 - Creating boxplots for multiple genes

Whilst the first example contains only three lines of code, and is therefore less likely to lead to errors, imagine that you want to create boxplots of the expression values of each of the following genes in ER -ve and ER +ve patients.

- FOXE1

- TECTA

- BAX

- MAP3K8

- GEMIN8

First, let's create a boxplot for one gene, FOXE1, in ER -ve and ER +ve patients.

We need to select the rows where HUGO.gene.symbol == "FOXE1" and the columns with the expression values for the patients. We select the columns starting with the patient ids "NKI" as we don't want the columns HUGO.gene.symbol, Chromosome, Start.

```{r eval=FALSE}

exprsAnno %>%

filter(HUGO.gene.symbol == "FOXE1") %>%

select(starts_with("NKI")) %>%

View()

```

This gives us the expression values all in one row. We need to transpose this to plot it. With `dplyr` we can pipe (`%>%`) the result into R's `t()` function to achieve this.

```{r}

exprsAnno %>%

filter(HUGO.gene.symbol == "FOXE1") %>%

select(starts_with("NKI")) %>%

t() %>%

View()

```

This is a matrix (which we can see from the "V1" header in View or if we use class() around the code above). We need to convert this into a data frame, which we can do by adding another pipe.

```{r}

exprsAnno %>%

filter(HUGO.gene.symbol == "FOXE1") %>%

select(starts_with("NKI")) %>%

t() %>%

data.frame() %>%

View()

```
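To see why the extra `data.frame()` step is needed, here is a minimal sketch on a made-up one-row data frame. The `NKI_` column names and the values are invented for illustration; they are not taken from the course data.

```r
# One-row data frame, similar in shape to the filtered expression values
oneRow <- data.frame(NKI_1 = 0.1, NKI_2 = -0.3, NKI_3 = 0.2)

# t() converts the data frame to a matrix before transposing it
transposed <- t(oneRow)
is.matrix(transposed)
dim(transposed)

# data.frame() turns the matrix back into a data frame;
# the old column names are kept as row names
asFrame <- data.frame(transposed)
is.data.frame(asFrame)
rownames(asFrame)
```

The row names are worth a glance here, because they are how we can tell which patient each value belongs to after transposing.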

Next we'll save the data frame as an object called `filtered`.

```{r}

filtered <- exprsAnno %>%

filter(HUGO.gene.symbol == "FOXE1") %>%

select(starts_with("NKI")) %>%

t() %>%

data.frame()

```

And add the gene name as a header.

```{r}

colnames(filtered) <- "FOXE1"

```

Now we're ready to make a boxplot. We have a data frame with all the patients' values for FOXE1. We want to group these by ER status, to visualise the distribution of expression values for the ER +ve and ER -ve patients.

We can identify the ER status of the patients from the `patients` data frame.

```{r}

patients[, "er"]

```

A 1 means a patient is ER +ve, a 0 means they're ER -ve. We can then tell ggplot to group the patients by the ER status column.

```{r}

ggplot(filtered, aes(y = FOXE1, x = factor(patients[, "er"]))) +

geom_boxplot()

```

We can also colour by ER status.

```{r}

ggplot(filtered, aes(y = FOXE1, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +

geom_boxplot()

```

These are the default ggplot colours. We can also specify our own colours with `scale_fill_manual`. For example, we could colour the ER -ve goldenrod (yellow) and ER +ve dodgerblue and add labels.

```{r}

ggplot(filtered, aes(y = FOXE1, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +

geom_boxplot() +

scale_fill_manual(values = c("goldenrod", "dodgerblue"),

name = "ER status",

labels = c("ER -ve", "ER +ve"))

```

We can label the axes with `labs`, and `scale_x_discrete(labels = NULL)` can be used to remove the "1" and "0" from the x axis ticks.

```{r}

g <- ggplot(filtered, aes(y = FOXE1, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +

geom_boxplot() +

scale_fill_manual(values = c("goldenrod", "dodgerblue"),

name = "ER status",

labels = c("ER -ve", "ER +ve")) +

labs(title = "Expression of FOXE1 ~ Estrogen Receptor Status",

x = "Estrogen Receptor Status",

y = "Expression values of FOXE1") +

scale_x_discrete(labels = NULL)

print(g)

```

To centre the title `theme(plot.title = element_text(hjust = 0.5))` can be used.

```{r}

ggplot(filtered, aes(y = FOXE1, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +

geom_boxplot() +

scale_fill_manual(values = c("goldenrod", "dodgerblue"),

name = "ER status",

labels = c("ER -ve", "ER +ve")) +

labs(title = "Expression of FOXE1 ~ Estrogen Receptor Status",

x = "Estrogen Receptor Status",

y = "Expression values of FOXE1") +

scale_x_discrete(labels = NULL) +

theme(plot.title = element_text(hjust = 0.5))

```

We can save the plot in an object; let's call it `g`.

```{r}

g <- ggplot(filtered, aes(y = FOXE1, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +

geom_boxplot() +

scale_fill_manual(values = c("goldenrod", "dodgerblue"),

name = "ER status",

labels = c("ER -ve", "ER +ve")) +

labs(title = "Expression of FOXE1 ~ Estrogen Receptor Status",

x = "Estrogen Receptor Status",

y = "Expression values of FOXE1") +

scale_x_discrete(labels = NULL) +

theme(plot.title = element_text(hjust = 0.5))

```

If we want to draw the plot stored in `g` we can use `print()`.

```{r}

print(g)

```

Now we've got a nice plot showing the expression of FOXE1 in ER -ve and ER +ve patients.

Next, to make the boxplots for the other genes, TECTA, BAX, MAP3K8, GEMIN8, we could copy and paste the code for the FOXE1 plot over and over like below.

```{r, eval=FALSE}

## TECTA gene ##

filtered <- exprsAnno %>%

filter(HUGO.gene.symbol == "TECTA") %>%

select(starts_with("NKI")) %>%

t() %>%

data.frame()

colnames(filtered) <- "TECTA"

g <- ggplot(filtered, aes(y = TECTA, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +

geom_boxplot() +

scale_fill_manual(values = c("goldenrod", "dodgerblue"),

name = "ER status",

labels = c("ER -ve", "ER +ve")) +

labs(title = "Expression of TECTA ~ Estrogen Receptor Status",

x = "Estrogen Receptor Status",

y = "Expression values of TECTA") +

scale_x_discrete(labels = NULL) +

theme(plot.title = element_text(hjust = 0.5))

print(g)

## BAX gene ##

filtered <- exprsAnno %>%

filter(HUGO.gene.symbol == "BAX") %>%

select(starts_with("NKI")) %>%

t() %>%

data.frame()

colnames(filtered) <- "BAX"

g <- ggplot(filtered, aes(y = BAX, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +

geom_boxplot() +

scale_fill_manual(values = c("goldenrod", "dodgerblue"),

name = "ER status",

labels = c("ER -ve", "ER +ve")) +

labs(title = "Expression of BAX ~ Estrogen Receptor Status",

x = "Estrogen Receptor Status",

y = "Expression values of BAX") +

scale_x_discrete(labels = NULL) +

theme(plot.title = element_text(hjust = 0.5))

print(g)

## MAP3K8 gene ##

# ETC....

```

However, as already stated, this is tedious and error-prone.

So instead, let's automate it!

## Exercise

Convert the copy/paste/edit code used to generate the boxplots in the second example into a `for` loop.

Which variables are you looping over?

Hint: Consider using the function `paste()` to change the y-axis label and title to reflect the gene that is being plotted.
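As a small illustration of the hint, the sketch below shows how `paste()` and `paste0()` build text from a variable. The variable names `plotTitle`, `yLabel` and `fileName` are made up for this example, and the file name pattern is only an assumption about how you might want to name your output.

```r
gene <- "FOXE1"  # one gene symbol from the list above

# paste() joins its arguments with a space by default
plotTitle <- paste("Expression of", gene, "~ Estrogen Receptor Status")
yLabel <- paste("Expression values of", gene)

# paste0() joins with no separator, which is handy for file names
fileName <- paste0(gene, "_boxplot.pdf")

print(plotTitle)
print(yLabel)
print(fileName)
```

Inside a loop, `gene` would be the loop variable, so each iteration produces its own title and label.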

## Saving multiple plots

You can save all the plots produced by a `for` loop in a PDF by using the `pdf()` and `dev.off()` functions *outside* the `for` loop.

This will create one file with all the plots.

```

pdf("myplots.pdf")

for (element in vector) {

# plot code

}

dev.off()

```

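Alternatively, you can create a separate PDF (or PNG/JPEG) for each graph by opening and closing the device *inside* the `for` loop. Each file needs its own name, which can be built from the loop variable.

```
for (element in vector) {
  pdf(paste0(element, ".pdf"))
  # plot code
  dev.off()
}
```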
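Here is a runnable sketch of the one-file-per-plot approach. It is an illustration under assumptions: the gene names are hypothetical, the plotted values are random numbers, a base R `boxplot()` stands in for the real plot code, and the files go into R's temporary directory so nothing is left behind in your project folder.

```r
# Write into the temporary directory for this R session
outDir <- tempdir()
genes <- c("geneA", "geneB")  # hypothetical gene names

set.seed(1)
for (gen in genes) {
  # Build a file path such as "<tempdir>/geneA.pdf"
  outFile <- file.path(outDir, paste0(gen, ".pdf"))
  pdf(outFile)
  # Placeholder plot on random values; the real plot code would go here
  boxplot(rnorm(20), main = gen)
  dev.off()
}

# Check that one file was written per gene
file.exists(file.path(outDir, paste0(genes, ".pdf")))
```

If the plot is a `ggplot` object, remember to wrap it in `print()` inside the loop, as plots are not drawn automatically there.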

## Storing results

Note that the `for` loop in Example 1 reports the number of genes on each chromosome, but the result is not stored, so we cannot access it later.

- To store the results from a loop, we often create an empty variable before starting the `for` loop

- This is used to store the result at each iteration of the loop

```{r}
numGenes <- NULL
chrom <- c("chr1", "chr5", "chr6", "chr8")

for (chr in chrom) {
  numRows <- exprsAnno %>%
    filter(Chromosome == chr) %>%
    summarise(n()) %>%
    pull()  # extract the count as a plain number rather than a data frame
  numGenes[chr] <- numRows
}

numGenes
```
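A common variation is to create the results vector at its final length before the loop starts, rather than growing it from `NULL`. The sketch below shows this on a made-up data frame (`toyGenes` is a hypothetical stand-in for `exprsAnno`) using base R only.

```r
# Toy stand-in for exprsAnno (chromosome assignments are made up)
toyGenes <- data.frame(
  Chromosome = c("chr1", "chr1", "chr5", "chr8", "chr1"),
  stringsAsFactors = FALSE
)
chrom <- c("chr1", "chr5", "chr6", "chr8")

# A numeric vector of zeros, one element per chromosome, named by chromosome
numGenes <- setNames(numeric(length(chrom)), chrom)

for (chr in chrom) {
  # Fill in the element whose name matches the current chromosome
  numGenes[chr] <- sum(toyGenes$Chromosome == chr)
}

numGenes
```

Because the vector is named, a single result can be looked up afterwards with, for example, `numGenes["chr5"]`.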

## Exercise

Identify the number of patients who are positive for ER, PR and HER2, considering each status individually. Store the results in a vector.

```{r}

status <- c("er", "pr", "her2")

### Your answer here ###

```

## Conditional branching: Commands and flow control

What if we only wanted to create boxplots for genes that meet certain criteria, for example, genes whose expression differs between ER -ve and ER +ve patients? To do that we could use an `if` statement.

- Use an `if` statement for any kind of condition testing

- Different outcomes can be selected based on a condition within brackets

```

if (condition) {

... do this ...

} else {

... do something else ...

}

```

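- `condition` is any expression that evaluates to a logical value, and it can combine multiple tests.

    + e.g. `(a == 2 && b < 5)` is a compound condition; `&&` is the form of "and" intended for single values, whereas `&` works element by element on vectors

- The condition should return a *single* value of `TRUE` or `FALSE`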
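A minimal sketch of these points, using made-up values. It uses `&&`, which combines two single logical values, and shows `any()` and `all()` as two ways to reduce a whole vector of comparisons to the single `TRUE` or `FALSE` that `if` expects.

```r
a <- 2
b <- 3

# && combines two single logical values into one TRUE or FALSE
if (a == 2 && b < 5) {
  result <- "both conditions hold"
} else {
  result <- "at least one condition fails"
}
print(result)

# A comparison on a vector gives one value per element,
# so collapse it with any() or all() before using it in if ()
values <- c(1, 7, 3)
anyLarge <- any(values > 5)
allLarge <- all(values > 5)

if (anyLarge) {
  print("At least one value is greater than 5")
}
```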

## Other conditional tests

- There are various tests that can check the type of data stored in a variable

+ these tend to be called **`is...()`**.

+ try *tab-complete* on `is.`

```{r}

is.numeric(10)

is.numeric("TEN")

is.character(10)

```

- `is.na()` is useful for seeing if an `NA` value is found

+ cannot use `== NA`!

```{r}

x <- c(1, 2, NA)

mean(x)

is.na(x)

```
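The sketch below illustrates why `== NA` cannot be used, and how `is.na()` helps in practice. The variable names `isMissing`, `nMissing` and `meanObserved` are made up for this example.

```r
x <- c(1, 2, NA)

# Comparing anything with NA gives NA, never TRUE or FALSE
x == NA

# is.na() gives a usable logical vector instead
isMissing <- is.na(x)

# Count the missing values, then average the values that were observed
nMissing <- sum(isMissing)
meanObserved <- mean(x, na.rm = TRUE)

if (nMissing > 0) {
  print(paste("Ignoring", nMissing, "missing value(s); mean is", meanObserved))
}
```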

## Example

Checking if a gene of interest is in the dataset

- There are two methods that could be used to achieve this:

    1. Make use of the functions `all()` and `any()`

    2. Make use of the function `sum()`

```{r}
## Method 1: any() is TRUE if at least one gene symbol matches
if (any(exprsAnno$HUGO.gene.symbol == "PIK3CA")) {
  print("PIK3CA is in the dataset")
} else {
  print("PIK3CA is not in the dataset")
}

## Method 2: sum() counts the matches; use >= 1 in case a symbol appears on more than one row
if (sum(exprsAnno$HUGO.gene.symbol == "PIK3CA") >= 1) {
  print("PIK3CA is in the dataset")
} else {
  print("PIK3CA is not in the dataset")
}
```

## Exercise

Write an `if else` statement to check if *all* of the following genes are in the dataset:

BCL2, HOXA9, MAPK1, ARID1A, GATA3, ESR1

Hint: Make use of `%in%`

## Other useful conditionals

`file.exists()`, `dir.exists()` and `dir.create()` are also useful in `if else` statements, particularly when writing scripts that take inputs from the command line.
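As a hedged illustration, the sketch below creates an output folder only if it is missing, and checks for an input file before trying to use it. The folder and file names are hypothetical, and everything is placed in R's temporary directory.

```r
# Hypothetical output folder inside the temporary directory
outDir <- file.path(tempdir(), "plots_example")

# Only create the folder if it is not already there
if (!dir.exists(outDir)) {
  dir.create(outDir)
}

# Hypothetical input file; it is not expected to exist here
inFile <- file.path(outDir, "no_such_file.txt")

if (file.exists(inFile)) {
  message("Reading ", inFile)
} else {
  message("Cannot find ", inFile)
}
```

Checking first lets a script stop with a clear message instead of failing part-way through an analysis.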

## Combining Loops and Conditional branching

Using the **`for`** loop we wrote before, we could add an `if else` branch to test whether the mean expression of a gene differs between the ER positive and ER negative groups.

## Automation Example 3 - Creating boxplots if expression differs between ER -ve and ER +ve

We are interested in finding out whether, for each of the following genes:

AMPD3, TECTA, TRPV4, CD244, ABHD10, GEMIN8, MAP1A, SMAD7

the mean expression differs between ER -ve and ER +ve patients by more than 0.04.

- Here's how we can combine a `for` loop and an `if` statement to test for this

**`for`** each iteration of the loop:

1. Identify the expression values associated with the gene of interest

2. Group the samples based on their ER status, and find the mean for each group

3. **`if`** the absolute difference is greater than 0.04, print a statement that informs us of this

4. **`else`**, print a statement saying that it is not

```{r}
mygenes <- c("AMPD3", "TECTA", "TRPV4", "CD244", "ABHD10", "GEMIN8", "MAP1A", "SMAD7")

for (gen in mygenes) {
  # Expression values for this gene, one row per patient
  filtered <- exprsAnno %>%
    filter(HUGO.gene.symbol == gen) %>%
    select(starts_with("NKI")) %>%
    t() %>%
    data.frame()
  colnames(filtered) <- "gene"

  # Mean expression within each ER group (row 1 = ER -ve, row 2 = ER +ve)
  x <- filtered %>%
    group_by(factor(patients[, "er"])) %>%
    summarise("exprsAv" = mean(gene, na.rm = TRUE))
  diff <- x$exprsAv[1] - x$exprsAv[2]

  if (abs(diff) > 0.04) {
    print(paste("The difference between ER status for", gen, "is", abs(diff), "which is greater than 0.04"))
  } else {
    print(paste("The difference between ER status for", gen, "is not greater than 0.04"))
  }
}
```
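The group-and-compare step can also be tried in isolation. This is a minimal sketch, assuming made-up expression values and ER statuses for six patients, and using base R's `tapply()` in place of `group_by()` and `summarise()`.

```r
# Made-up expression values for one gene across six patients
exprs <- c(0.10, 0.20, 0.30, -0.10, -0.20, 0.00)
# Made-up ER status for the same patients (0 = ER -ve, 1 = ER +ve)
er <- c(0, 0, 0, 1, 1, 1)

# tapply() applies mean() to the values within each ER group
groupMeans <- tapply(exprs, er, mean, na.rm = TRUE)

# The result is named by group, so each mean can be looked up by its label
meanDiff <- groupMeans["0"] - groupMeans["1"]

if (abs(meanDiff) > 0.04) {
  verdict <- "greater than 0.04"
} else {
  verdict <- "not greater than 0.04"
}
print(paste("The absolute difference in means is", verdict))
```

Looking the means up by name (`"0"` and `"1"`) rather than by position makes it harder to subtract the groups in the wrong order.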

Whilst it is informative to know the difference between the mean expression in each ER group for each gene, it would be more interesting to plot the expression of the genes that have an absolute difference greater than 0.04.

Therefore we want to take the same steps as above, but plot a graph instead of printing a statement.

- Here's how we can use an `if` statement to do this

    + **`for`** each iteration of the loop:

        1. Identify the expression values associated with the gene of interest

        2. Group the samples based on their ER status, and find the mean for each group

        3. **`if`** the absolute difference is greater than 0.04, produce a boxplot to depict the difference

        4. **`else`**, do nothing

```{r}
mygenes <- c("AMPD3", "TECTA", "TRPV4", "CD244", "ABHD10", "GEMIN8", "MAP1A", "SMAD7")

for (gen in mygenes) {
  # Expression values for this gene, one row per patient
  filtered <- exprsAnno %>%
    filter(HUGO.gene.symbol == gen) %>%
    select(starts_with("NKI")) %>%
    t() %>%
    data.frame()
  colnames(filtered) <- "gene"

  # Mean expression within each ER group
  x <- filtered %>%
    group_by(factor(patients[, "er"])) %>%
    summarise("exprsAv" = mean(gene, na.rm = TRUE))
  diff <- x$exprsAv[1] - x$exprsAv[2]

  # Only plot when the group means differ by more than 0.04; otherwise do nothing
  if (abs(diff) > 0.04) {
    g <- ggplot(filtered, aes(y = gene, x = factor(patients[, "er"]), fill = factor(patients[, "er"]))) +
      geom_boxplot()
    print(g)
  }
}
```

## Exercise

Add additional layers to the above plot so that the labels are more informative.

See the plot below, replacing "x" in the labels with the gene being investigated.

Also, change the colours associated with each ER status.

![](Final_boxplot.png)

## Code formatting avoids bugs!

Compare:

```{r eval=FALSE}
for (a in ages){
numRows <- patients %>%
filter(age > a) %>%
summarise(n())
if(numRows < 10){
s <- paste("There are fewer than 10 patients older than", a)
print(s)
}else{
s <- paste("There are 10 or more patients older than", a)
print(s)
}
}
```

to:

```{r eval=FALSE}
for (a in ages) {
  numRows <- patients %>%
    filter(age > a) %>%
    summarise(n())

  if (numRows < 10) {
    s <- paste("There are fewer than 10 patients older than", a)
    print(s)
  } else {
    s <- paste("There are 10 or more patients older than", a)
    print(s)
  }
}
```

- The code between braces `{}` is *always indented*; this clearly separates what is executed once from what is run multiple times

- The closing brace `}` goes on its own line (or starts an `else` line), at the same indentation level as the line containing the opening brace `{`

- Use white space to separate the units of your code, e.g. around assignments and comparisons
