Tutorial

Mission Impossible or Possible?

老師、同學您好!

106學期第二學期開始實施學術演講加分鼓勵方案,經系務會議通過實施。 適用年級大一到大四,相關訊息將公告FB及系網,並請導師協助於班級宣導。 為鼓勵學生踴躍參加本系舉辦的學術演講,建議採加分鼓勵方式,加分方式為每參加一次0.5分,依據參加的次數計算總分,全勤總分再加1分。 學生可將參加學術演講得到的分數,自行選擇課程另行加分(限經系務會議通過的加分科目清單)。適用大一到大四全系學生。 演講統計至第17週,第18週由同學自行列印演講參加分數表,與欲加分科目老師簽名確認加分,送繳系辦存查。 (大四統計的截止週次會另行公告) 請同學學術演講必須完成簽到及簽退,始得算一次,為對於演講者之尊重,請勿代同學簽名或中場離開。 謝謝配合!

系辦

簽到統計表下載 密碼:xxxxxxxx

Prerequisite Skills

  • Data Import (e.g., Excel, txt, tab-delimited file)
  • 統計每位同學出席次數(列如何加總?)
  • 計算總分 (參加一次0.5分,全勤再+1分,運算式?)
  • 產生本學期演講日期 (如何自動化?)
  • Data Export (存檔、檔案格式?)

1.Data Import

Solution 1

直接從RStudio圖形化介面匯入資料.

  • Sheet: 選取Excel資料表
  • Skip: 是否需要跳脫不必要的Row, 跳掉?個Rows不匯入

Solution 2

以指令方式輸入

library(readxl)   # 讀取Excel *.xlsx 所需要的模組

# 使用中文當變數(可)
簽到統計表 <- read_excel("/home/pjhuang/Documents/BioDatabase/03_Data_Import_Export/簽到統計表.xlsx", sheet = "一年級", skip = 1)
簽到統計表  # 印出table

# 但用英文當變數更方便
Sheet1 <- read_excel("/home/pjhuang/Documents/BioDatabase/03_Data_Import_Export/簽到統計表.xlsx", sheet = "一年級", skip = 1)
Sheet1   # print table

Exercise 1

Commands:

  1. head(Sheet1)
  2. Sheet1$姓名 , Sheet1$學號, Sheet1$43161 (What’s Wrong?)
  3. Sheet1$姓名[1] , Sheet1$姓名[1:5] , Sheet1$姓名[c(1,4,8)]
  4. Sheet1[1:2,1:3], Sheet1[1:2,c(1:5)], Sheet1[c(1,3),c(1,3,4)]
  5. Sheet1[,-c(1:2)]
  6. rownames(Sheet1), colnames(Sheet1),
library(readxl)
Sheet1 <- read_excel("/home/pjhuang/Documents/BioDatabase/03_Data_Import_Export/簽到統計表.xlsx", sheet = "一年級", skip = 1)

Exercise 2

欄位名稱操作、重新命名、自動產生日期

Commands:

  1. colnames(Sheet1)[3:19]
  2. seq(as.Date("2018-3-2"),by = 7,length.out = 17) # 間隔7天、17週
  3. colnames(Sheet1)[3:19]=seq(as.Date("2018-3-2"),by = 7,length.out = 17) # What’s wrong?
  4. colnames(Sheet1)[3:19]=as.character(seq(as.Date("2018-3-2"),by = 7,length.out = 17))
  5. colnames(Sheet1)
library(readxl)
Sheet1 <- read_excel("/home/pjhuang/Documents/BioDatabase/03_Data_Import_Export/簽到統計表.xlsx", sheet = "一年級", skip = 1)

Exercise 3

四則運算

Commands:

  1. x=1, x=x+1, y=2, x=x+1+y
  2. Sheet$43161, Sheet$43161 +1, Sheet$43161 *10, Sheet$43161 + Sheet$43168
  3. Sheet$43161 = Sheet$43161 +1, Sheet$43161 = Sheet$43161 *10
  4. rowSums(Sheet1) # What’s Wrong?
  5. Sheet1[order(Sheet1$43161)]
  6. Sheet1[-order(Sheet1$43161)]
library(readxl)
Sheet1 <- read_excel("/home/pjhuang/Documents/BioDatabase/03_Data_Import_Export/簽到統計表.xlsx", sheet = "一年級", skip = 1)
  • Hint1: You may want to use the rowSums function. ex. rowSums(Sheet1)
  • Hint2: ‘x’ must be numeric. ex. Sheet[c(1:2)]
  • Hint3: NA must be removed. ex. rowSums(x, na.rm=TRUE)
  • Hint4: You may want to create a new column to store the values of rowSums. ex. Sheet1$Sum = rowSums(x, na.rm=TRUE)
  • Hint4 Don’t forget to multiply Sum by 0.5

2.Data Export

Commands:

  1. write.table(Sheet1,file="result.txt",quote=F, sep="\t")
  2. write.table(Sheet1,file="result.txt",quote=F, sep="\t",row.names = F, col.names = F)

Database_Homework_20180315

繳交格式

  • Email to: pjhuang@gap.cgu.edu.tw
  • 信件主題: 生物資料庫作業20180315_姓名
  • 附檔:   biodb_hw_20180315_學號.Rmd (修課同學) or biodb_hw_20180315_英文名稱.Rmd (旁聽同學)
  • Rmarkdown: author: 請填入 “系所_姓名”
  • 期限:   下次上課前一天(星期三) Deadline

參考答案

##  [1] "學號"   "姓名"   "43161"  "43168"  "43175"  "43182"  "43189" 
##  [8] "43196"  "43203"  "43210"  "43217"  "43224"  "43231"  "43238" 
## [15] "43245"  "43252"  "43259"  "43266"  "43273"  "全勤"   "總計"  
## [22] "積分"   "總積分"
##  [1] "43161" "43168" "43175" "43182" "43189" "43196" "43203" "43210"
##  [9] "43217" "43224" "43231" "43238" "43245" "43252" "43259" "43266"
## [17] "43273"
##  [1] "2018-03-02" "2018-03-09" "2018-03-16" "2018-03-23" "2018-03-30"
##  [6] "2018-04-06" "2018-04-13" "2018-04-20" "2018-04-27" "2018-05-04"
## [11] "2018-05-11" "2018-05-18" "2018-05-25" "2018-06-01" "2018-06-08"
## [16] "2018-06-15" "2018-06-22"
##  [1] "學號"       "姓名"       "2018-03-02" "2018-03-09" "2018-03-16"
##  [6] "2018-03-23" "2018-03-30" "2018-04-06" "2018-04-13" "2018-04-20"
## [11] "2018-04-27" "2018-05-04" "2018-05-11" "2018-05-18" "2018-05-25"
## [16] "2018-06-01" "2018-06-08" "2018-06-15" "2018-06-22" "全勤"      
## [21] "總計"       "積分"       "總積分"

生物資料庫管理與應用_Week3

Tutorial

Introduction to SQL

Course Description

The role of a data scientist is to turn raw data into actionable insights. Much of the world’s raw data—from electronic medical records to customer transaction histories—lives in organized collections of tables called relational databases. Therefore, to be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL (pronounced ess-que-ell, or sequel). This course teaches you everything you need to know to begin working with databases today!


Course material download and preparation

International Cancer Genome Consortium (ICGC)
ICGC Data Portal
Acute Lymphoblastic Leukemia


Commands:

  1. 從ICGC下載 Acute Lymphoblastic Leukemia的simple somatic mutation profile
    download.file("https://dcc.icgc.org/api/v1/download?fn=/release_26/Projects/ALL-US/simple_somatic_mutation.open.ALL-US.tsv.gz",destfile = "ALL-US.tsv.gz")

  2. 利用read.table把gz檔直接讀入mutations變數中

mutations=read.table("ALL-US.tsv.gz",header=T,sep="\t") 

or

mutations=read.delim("ALL-US.tsv.gz",header=T,sep="\t")

  1. 檢視資料前6列
    head(mutations)

  2. 顯示欄位名稱 colnames(mutations)

download.file("https://dcc.icgc.org/api/v1/download?fn=/release_26/Projects/ALL-US/simple_somatic_mutation.open.ALL-US.tsv.gz",destfile = "ALL-US.tsv.gz")
mutations=read.delim("ALL-US.tsv.gz",header=T,sep="\t")
head(mutations)
colnames(mutations)

1.Selecting columns

This chapter provides a brief introduction to working with relational databases. You’ll learn about their structure, how to talk about them using database lingo, and how to begin an analysis by using simple SQL commands to select and summarize columns from database tables.

1-1.Beginning your SQL journey

SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.

You can think of a relational database as a collection of tables. A table is just a set of rows and columns, like a spreadsheet, which represents exactly one type of entity.

Each row, or record, of a table contains information about a single entity. For example, in a table representing somatic mutations, each row represents a single mutation. Each column, or field, of a table contains a single attribute for all rows in the table.

The table of mutations might look something like this:

Quiz

Commands:

  1. 檢視欄位名稱
    colnames(mutations)

  2. 計算欄位數目
    length(colnames(mutations))

mutations=read.table("/home/pjhuang/Documents/BioDatabase/04_Intro_to_SQL/ALL-US.tsv.gz",header=T,sep="\t")

1-2.SELECTing single columns

While SQL can be used to create and modify databases, the focus of this course will be querying databases. A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.

In SQL, you can select data from a table using a SELECT statement. For example, the following query selects the chromosome column from the mutations table:

Commands:

1. loading required R package (sqldf)

library(sqldf)

2. In SQL, you can select data from a table using a SELECT statement.

For example, the following query selects the chromosome column from the mutations table:

sqldf("SELECT chromosome FROM mutations")

In this query, SELECT and FROM are called keywords. In SQL, keywords are not case-sensitive, which means you can write the same query as:

sqldf("select chromosome from mutations")

That said, it’s good practice to make SQL keywords uppercase to distinguish them from other parts of your query, like column and table names.

Exercise 1

Select the gene_affected column from the mutations table.

library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Could not load tcltk.  Will use slower R code instead.
Loading required package: RSQLite

1-3.SELECTing multiple columns

Well done! Now you know how to select single columns.

In the real world, you will often want to select multiple columns. Luckily, SQL makes this really easy. To select multiple columns from a table, simply separate the column names with commas!

For example, this query selects two columns, chromosome and gene_affected, from the mutations table:

sqldf("SELECT chromosome, gene_affected FROM mutations")

Sometimes, you may want to select all columns from a table. Typing out every column name would be a pain, so there’s a handy shortcut:

sqldf("SELECT * FROM mutations")

If you only want to return a certain number of results, you can use the LIMIT keyword to limit the number of rows returned:

sqldf("SELECT * FROM mutations LIMIT 10")

1-4.SELECT DISTINCT

Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword.

This might be useful if, for example, you’re interested in knowing which chromosomes are represented in the mutations table:

sqldf("SELECT DISTINCT chromosome FROM mutations")

Exercise 2

Get all the unique genes represented in the mutations table.

mutations=read.table("/home/pjhuang/Documents/BioDatabase/04_Intro_to_SQL/ALL-US.tsv.gz",header=T,sep="\t")
library(sqldf)

1-5.Learning to COUNT

What if you want to count the number of mutation events in your mutations table? The COUNT statement lets you do this by returning the number of rows in one or more columns.

For example, this code gives the number of rows in the mutations table:

sqldf("SELECT COUNT(*) FROM mutations")

Exercise 3

Quiz
mutations=read.table("/home/pjhuang/Documents/BioDatabase/04_Intro_to_SQL/ALL-US.tsv.gz",header=T,sep="\t")
library(sqldf)

1-6.Practice with COUNT

As you’ve seen, COUNT(*) tells you how many rows are in a table. However, if you want to count the number of non-missing values in a particular column, you can call COUNT on just that column.

For example, to count the number of gene_affected present in the mutations table:

sqldf("SELECT COUNT(gene_affected) from mutations")

It’s also common to combine COUNT with DISTINCT to count the number of distinct values in a column.

For example, this query counts the number of distinct gene_affected contained in the mutations table:

sqldf("SELECT COUNT(DISTINCT gene_affected) from mutations")

mutations=read.table("/home/pjhuang/Documents/BioDatabase/04_Intro_to_SQL/ALL-US.tsv.gz",header=T,sep="\t")
library(sqldf)

2.Filtering rows

This chapter builds on the first by teaching you how to filter tables for rows satisfying some criteria of interest. You’ll learn how to use basic comparison operators, combine multiple criteria, match patterns in text, and much more.

2-1.Filtering results

Congrats on finishing the first chapter! You now know how to select columns and perform basic counts. This chapter will focus on filtering your results.

In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:

  • = equal
  • <> not equal
  • < less than
  • > greater than
  • <= less than or equal to
  • >= greater than or equal to

For example, you can filter text records such as chromosome. The following code returns all mutations with the chromosome ‘1’:

mutations=read.table("/home/pjhuang/Documents/BioDatabase/04_Intro_to_SQL/ALL-US.tsv.gz",header=T,sep="\t")
library(sqldf)

Notice that the WHERE clause always comes after the FROM statement!

Note that in this course we will use <> and not != for the not equal operator, as per the SQL standard.

2-2.Simple filtering of numeric values

As you learned in the previous exercise, the WHERE clause can also be used to filter numeric records, such as years, ages or positions.

For example, the following query selects all details for mutations with positions over 110000000:

sqldf("select * from mutations where chromosome_start > 110000000 ")

Now it’s your turn to use the WHERE clause to filter numeric values!

2-3.Simple filtering of text

Remember, the WHERE clause can also be used to filter text results, such as names or countries.

For example, this query gets the gene_affected of all mutations which were consequence_type is stop_gained:

sqldf("SELECT gene_affected,consequence_type FROM mutations WHERE consequence_type = 'stop_gained'")

Now it’s your turn to practice using WHERE with text values!

2-4.WHERE AND

Often, you’ll want to select data based on multiple conditions. You can build up your WHERE queries by combining multiple conditions with the AND keyword.

For example,

sqldf("SELECT chromosome,gene_affected,consequence_type FROM mutations WHERE consequence_type = 'stop_gained' AND chromosome='1' ")

You can add as many AND conditions as you need!

2-5.WHERE AND OR

What if you want to select rows based on multiple conditions where some but not all of the conditions need to be met? For this, SQL has the OR operator.

For example, the following returns all mutations in either chromosome X or 2:

sqldf("SELECT chromosome,icgc_mutation_id FROM mutations WHERE chromosome = 'X' OR chromosome='2' ")

Note that you need to specify the column for every OR condition, so the following is invalid:

sqldf("SELECT chromosome,icgc_mutation_id FROM mutations WHERE chromosome = 'X' OR '2' ")

When combining AND and OR, be sure to enclose the individual clauses in parentheses, like so:

sqldf("SELECT chromosome,icgc_mutation_id,consequence_type FROM mutations WHERE (chromosome = 'X' OR chromosome='2') AND (consequence_type = 'stop_gained' or consequence_type = 'missense_variant') ")

2-5.BETWEEN

As you’ve learned, you can use the following query to get icgc_mutation_id of all mutations identified in and between position 65000000 and 66000000:

sqldf("SELECT chromosome,chromosome_start,icgc_mutation_id FROM mutations WHERE chromosome_start BETWEEN 65000000 AND 66000000")

Similar to the WHERE clause, the BETWEEN clause can be used with multiple AND and OR operators, so you can build up your queries and make them even more powerful!

For example, suppose we want to limit the result to chromosome 1.

sqldf("SELECT chromosome,chromosome_start,icgc_mutation_id FROM mutations WHERE chromosome_start BETWEEN 65000000 AND 66000000 AND chromosome = '1' ")

2-6.WHERE IN

As you’ve seen, WHERE is very useful for filtering results. However, if you want to filter based on many conditions, WHERE can get unwieldy. For example:

sqldf("SELECT chromosome,chromosome_start,icgc_mutation_id FROM mutations WHERE chromosome = 1 OR chromosome = 2 OR chromosome = 11")

Enter the IN operator! The IN operator allows you to specify multiple values in a WHERE clause, making it easier and quicker to specify multiple OR conditions! Neat, right?

So, the above example would become simply:

sqldf("SELECT chromosome,chromosome_start,icgc_mutation_id FROM mutations WHERE chromosome IN (1,2, 11) ")

2-7. Introduction to NULL and IS NULL

In SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL. For example, to count the number of missing biological_validation_platform in the mutations table:

sqldf("SELECT COUNT(*) FROM mutations WHERE biological_validation_platform IS NULL")

As you can see, IS NULL is useful when combined with WHERE to figure out what data you’re missing.

Sometimes, you’ll want to filter out missing values so you only get results which are not NULL. To do this, you can use the IS NOT NULL operator.

For example, this query gives the names of all people whose birth dates are not missing in the people table.

sqldf("SELECT COUNT(*) FROM mutations WHERE biological_validation_platform IS NOT NULL")

2-8. LIKE and NOT LIKE

As you’ve seen, the WHERE clause can be used to filter text data. However, so far you’ve only been able to filter by specifying the exact text you’re interested in. In the real world, often you’ll want to search for a pattern rather than a specific text string.

In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text. For example, the following query matches gene_affected like ‘ENSG00000158286’, ‘ENSG00000158545’ ‘ENSG00000158773’, ‘ENSG00000158553’, and so on:

sqldf("SELECT gene_affected FROM mutations WHERE gene_affected LIKE 'ENSG00000158%' ")

The _ wildcard will match a single character. For example, the following query matches companies like ‘ENSG00000198910’, ‘ENSG00000108417’, and so on:

sqldf("SELECT gene_affected FROM mutations WHERE gene_affected LIKE 'ENSG000001_8%' ")

You can also use the NOT LIKE operator to find records that don’t match the pattern you specify.

3.Sorting, grouping and joins

This chapter provides a brief introduction to sorting and grouping your results, and briefly touches on the concept of joins.

3-1.ORDER BY

Congratulations on making it this far! You now know how to select and filter your results.

In this chapter you’ll learn how to sort and group your results to gain further insight. Let’s go!

In SQL, the ORDER BY keyword is used to sort results in ascending or descending order according to the values of one or more columns.

By default ORDER BY will sort in ascending order. If you want to sort the results in descending order, you can use the DESC keyword. For example,

sqldf("SELECT icgc_mutation_id,chromosome, chromosome_start FROM mutations ORDER BY chromosome, chromosome_start LIMIT 50")

sqldf("SELECT icgc_mutation_id,chromosome, chromosome_start FROM mutations ORDER BY chromosome DESC, chromosome_start ASC LIMIT 50")

3-2.GROUP BY

Now you know how to sort results! Often you’ll need to aggregate results. For example, you might want to get count the number of male and female employees in your company. Here, what you want is to group all the males together and count them, and group all the females together and count them. In SQL, GROUP BY allows you to group a result by one or more columns, like so:

sqldf("SELECT chromosome,count(*) FROM mutations GROUP BY chromosome")

3-3.GROUP BY practice

As you’ve just seen, combining aggregate functions with GROUP BY can yield some powerful results!

A word of warning: SQL will return an error if you try to SELECT a field that is not in your GROUP BY clause without using it to calculate some kind of value about the entire group.

Note that you can combine GROUP BY with ORDER BY to group your results, calculate something about them, and then order your results. For example,

sqldf("SELECT chromosome,count(*) AS count FROM mutations GROUP BY chromosome ORDER BY count DESC")

3-4.HAVING a great time

In SQL, aggregate functions can’t be used in WHERE clauses. For example, the following query is invalid:

sqldf("SELECT chromosome,count(*) AS count FROM mutations GROUP BY chromosome HAVING count >= 500")

4.Aggreate Functions

This chapter builds on the first two by teaching you how to use aggregate functions to summarize your data and gain useful insights. Additionally, you’ll learn about arithmetic in SQL, and how to use aliases to make your results more readable!

4-1.Aggreate Functions

Often, you will want to perform some calculation on the data in a database. SQL provides a few functions, called aggregate functions, to help you out with this.

mutation_count=sqldf("SELECT chromosome,icgc_donor_id,count(*) AS count FROM mutations GROUP BY chromosome,icgc_donor_id")

sqldf("select chromosome,AVG(count) from mutation_count group by chromosome")

sqldf("select chromosome,MAX(count) from mutation_count group by chromosome")

sqldf("select chromosome,SUM(count) from mutation_count group by chromosome")

sqldf("select chromosome,count(*) as count from mutations group by chromosome")

4-2 GROUP CONCAT

sqldf("select chromosome,SUM(count),GROUP_CONCAT(icgc_donor_id) from result group by chromosome")

4-3 It’s AS simple AS aliasing

You may have noticed in the first exercise of this chapter that the column name of your result was just the name of the function you used. For example,

sqldf("select chromosome AS Chromosome,SUM(count) AS Total,GROUP_CONCAT(icgc_donor_id) Sample_IDs from result group by chromosome")

5. Homework

  • email to: pjhuang@gap.cgu.edu.tw  
  • title: 生物資料庫作業_20180322_姓名  
  • deadline: 4/5 (民族掃墓節)   
  • 格式:Rmd檔  

  • Homework

The Cancer Genome Atlas (TCGA)

TCGA-LAML (Acute Myeloid Leukemia)
LAML_MAF

Hint:

  • library(sqldf)
  • mutations=read.delim("http://163.25.93.96:3838/laml.maf",header=T,sep="\t")
  1. TCGA-LAML (Acute Myeloid Leukemia) 癌症基因體計畫包含幾個樣本?

Hints: Tumor_Sample_Barcode為樣本代號

  1. TCGA-LAML計畫中,突變次數最高的前10個基因名稱為何?

Hints: Hugo_Symbol為基因名稱。

  1. 承2,請依Chromosome 1~22,X,Y 分別列出每條染色體突變次數最高的基因。

  2. 分別列出每條染色體上發生包含Nonsense_Mutation突變的基因。

Hints: 擷取出Variant_Classification欄位中的Nonsense_Mutation,

  1. 承4, 哪些樣本發生Nonsense_Mutation事件? 依序列出 Chromosome, Hugo_Symbol, Variant_Classification, SampleIDs.
Hints: GROUP_CONCAT
Tutorial

Course Description

In this interactive tutorial, you will learn how to perform sophisticated dplyr techniques to carry out your data manipulation with R. First you will master the five verbs of R data manipulation with dplyr: select, mutate, filter, arrange and summarise. Next, you will learn how you can chain your dplyr operations using the pipe operator of the magrittr package. In the final section, the focus is on practicing how to subset your data using the group_by function, and how you can access data stored outside of R in a database. All said and done, you will be familiar with data manipulation tools and techniques that will allow you to efficiently manipulate data.

1.Introduction to dplyr and tbls

Introduction to the dplyr package and the tbl class. Learn the philosophy that guides dplyr, discover some useful applications of the dplyr package, and meet the data structures that dplyr uses behind the scenes.

1-1 Load the dplyr and hflights package

Welcome to the interactive exercises part of your dplyr course. Here you will learn the ins and outs of working with dplyr. dplyr is an R package, a collection of functions and data sets that enhance the R language.

Throughout this course you will use dplyr to analyze a data set of airline flight data containing flights that departed from Houston. This data is stored in a package called hflights.

Both dplyr and hflights are already installed on server, so loading them with library() will get you up and running.

  • Load the dplyr package.
  • Load the hflights package. A variable called hflights will become available, a data.frame representing the data set.
  • Use both head() and summary() on the hflights data frame to get to know the data. Can you guess the meaning of all variables?
library(dplyr)
library(hflights)

1-2 Explore the data set

A data scientist must be familiar with his or her data. Experiment with the data set in the console and maybe try to generate some insightful plots. For your convenience, hflights is already loaded into the workspace.

How many observations and how many variables are contained in the hflights data set?

  • head()
  • tail()
  • dim()

1-3 Convert data.frame to tibble

As Garrett explained, a tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using tbl_df().

The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.

  • Convert hflights (which is a data.frame) into a tbl, also named hflights.

hflights = tbl_df(hflights)

  • Display the new hflights in your console window. Notice the easy-to-read layout.

hflisghts

  • To see how tbls behave like data.frames, save the UniqueCarrier column of hflights as an object named carriers, using base R syntax only.

carriers <- hflights$UniqueCarrier

1-4 Changing labels of hflights

A bit of cleaning would be a good idea since the UniqueCarrier variable of hflights uses a confusing code system.

To do this, let’s work with a lookup table, that comes in the form of a named vector. When you subset the lookup table with a character string (like the character strings in UniqueCarrier), R will return the values of the lookup table that correspond to the names in the character string. To see how this works, run following code in the console:

  • lookup table

lut <- c(“AA” = “American”, “AS” = “Alaska”, “B6” = “JetBlue”, “CO” = “Continental”, “DL” = “Delta”, “OO” = “SkyWest”, “UA” = “United”, “US” = “US_Airways”, “WN” = “Southwest”, “EV” = “Atlantic_Southeast”, “F9” = “Frontier”, “FL” = “AirTran”, “MQ” = “American_Eagle”, “XE” = “ExpressJet”, “YV” = “Mesa”)

  • Use lut to translate the UniqueCarrier column of hflights

hflights\(UniqueCarrier=lut[hflights\)UniqueCarrier]

  • Inspect the resulting raw values of your variables

glimpse(hflights)

Let’s try a similar thing, but this time to change the labels in the CancellationCode column. This column lists reasons why a flight was cancelled using a non-informative alphabetical code. Execute

unique(hflights$CancellationCode)

  • Build the lookup table

lut <- c(“A”=“carrier”,“B”=“weather”,“C”=“FFA”,“D”=“security”,“E”=“not cancelled”)

  • Use the lookup table to create a vector of code labels. Assign the vector to the CancellationCode column of hflights

hflights\(CancellationCode=lut[hflights\)CancellationCode]

  • Inspect the resulting raw values of your variables

glimpse(hflights)

2. Select and mutate

Get familiar with dplyr’s manipulation verbs. Meet the five verbs and then practice using the mutate and select verbs.

2-1. The five versbs and their meaning

The dplyr package contains five key data manipulation functions, also called verbs:

  • select(), which returns a subset of the columns,
  • filter(), that is able to return a subset of the rows,
  • arrange(), that reorders the rows according to single or multiple variables,
  • mutate(), used to add columns from existing data,
  • summarise(), which reduces each group to a single row by calculating aggregate measures.

If you want to find out more about these functions, consult the documentation by clicking on the functions above. What order of operations should we use to find the average value of the ArrDelay (arrival delay) variable for all American Airline flights in the hflights tbl?

Feel free to play around in the console; hflights is preloaded. From now on, the UniqueCarrier column and CancellationCode column contain the recoded versions, similar to the cleaning up you did in the previous chapter.

2-1 Choosing is not losing! The select verb

To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().

As an example, take the following call, that selects the variables var1 and var2 from the data frame df.

select(df, var1, var2)

You can also use : to select a range of variables and - to exclude some variables, similar to indexing a data.frame with square brackets. You can use both variable’s names as well as integer indexes. This call selects the four first variables except for the second one of a data frame df:

select(df, 1:4, -2)

select() does not change the data frame it is called on; you have to explicitly assign the result of select() to a variable to store the result.

Exercise:

  • Use select() to print out a tbl that contains only the columns ActualElapsedTime, AirTime, ArrDelay and DepDelay of hflights.

select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)

  • Print out a tbl with the columns Origin up to and including Cancelled of hflights.

select(hflights, Origin:Cancelled)

  • Find the most concise way to select: columns Year up to and including DayOfWeek, columns ArrDelay up to and including Diverted. You can examine the order of the variables in hflights with names(hflights) in the console.

select(hflights, -(DepTime:AirTime))

2-2 Helper functions for variable selection

dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

  • starts_with("X"): every name that starts with "X",
  • ends_with("X"): every name that ends with "X",
  • contains("X"): every name that contains "X",
  • matches("X"): every name that matches "X", where "X" can be a regular expression,
  • num_range("x", 1:5): the variables named x01, x02, x03, x04 and x05,
  • one_of(x): every name that appears in x, which should be a character vector.

Pay attention here: When you refer to columns directly inside select(), you don’t use quotes. If you use the helper functions, you do use quotes.

Exercise:

  • Use select() and a helper function to print out a tbl that contains just ArrDelay and DepDelay of hflights.

select(hflights, ends_with(“Delay”))

  • Use a combination of helper functions and variable names to print out only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode columns of hflights.

select(hflights, UniqueCarrier, ends_with(“Num”), starts_with(“Cancel”))

  • Find the most concise way to return the following columns with select and its helper functions: DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay, DepDelay. Use only helper functions!

select(hflights, contains(“Tim”), contains(“Del”))

2-3 Comparison to base R

To see the added value of the dplyr package, it is useful to compare its syntax with base R. Up to now, you have only considered functionality that is also available without the use of dplyr. The elegance and ease-of-use of dplyr is a great plus though.

Exercise 1

ex1r <- hflights[c(“TaxiIn”,“TaxiOut”,“Distance”)] ex1d <- select(hflights, contains(“Taxi”), Distance)

Exercise 2

ex2r <- hflights[c(“Year”,“Month”,“DayOfWeek”,“DepTime”,“ArrTime”)] ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

Exercise 3

ex3r <- hflights[c(“TailNum”,“TaxiIn”,“TaxiOut”)] ex3d <- select(hflights,starts_with(“T”))

2-4 Mutating is creating

mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.

Take this example that adds a new column, z, which is the element-wise sum of the columns x and y, to the data frame df:

mutate(df, z = x + y)

  • Create a new data frame, g1, which is the data frame hflights with an additional column: ActualGroundTime, the difference between ActualElapsedTime and AirTime.

g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

  • Extend g1 further, by adding an additional column GroundTime. This column is the sum of the TaxiIn and TaxiOut columns. Store the resulting data frame in g2. Check in the console that the GroundTime and ActualGroundTime columns are equal.

g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

  • Add a new variable to g2 named AverageSpeed that denotes the average speed that each plane flew in miles per hour. Save the resulting dataset as g3. Use the following formula: Distance / AirTime * 60.

g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)

  • Print out g3.

g3

2-5 Add multiple variables using mutate

So far you’ve added variables to hflights one at a time, but you can also use mutate() to add multiple variables at once. To create more than one variable, place a comma between each variable that you define inside mutate().

mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:

mutate(my_df, x = a + b, y = x + c)

  • Adapt the code that builds m1: add a variable loss_ratio, which is the ratio of loss to DepDelay.

m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = (ArrDelay - DepDelay) / DepDelay * 100)

  • Create a tbl m2 from hflights by adding three variables:

m2 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = loss / DepDelay * 100)

  • TotalTaxi, which is the sum of TaxiIn and TaxiOut;

  • ActualGroundTime, which is the difference of ActualElapsedTime and AirTime;

  • Diff, the difference between the two newly created variables. This column should be zero for all observations!

m3 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, ActualGroundTime = ActualElapsedTime - AirTime, Diff = TotalTaxi - ActualGroundTime)

3. Filter and arrange

Learn how to search through the observations in your data set (and extract useful observations) with the filter function. Rearrange the observations in your data set with the arrange verb.

3-1 Logical operators

R comes with a set of logical operators that you can use inside filter():

x < y, TRUE if x is less than y x <= y, TRUE if x is less than or equal to y x == y, TRUE if x equals y x != y, TRUE if x does not equal y x >= y, TRUE if x is greater than or equal to y x > y, TRUE if x is greater than y x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

The following example filters df such that only the observations for which a is positive, are kept:

filter(df, a > 0)

  • Print out all flights in hflights that traveled 3000 or more miles.

filter(hflights, Distance >= 3000)

  • Print out all flights in hflights flown by JetBlue, Southwest, or Delta.

filter(hflights, UniqueCarrier %in% c(“JetBlue”, “Southwest”, “Delta”))

  • Extract from hflights all flights where taxiing took longer than the actual flight. Avoid the use of mutate() and do the math directly in the logical expression of filter().

filter(hflights, TaxiIn + TaxiOut > AirTime)

3-2 Combining tests using boolean operators

R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:

filter(df, a > 0 & b > 0) filter(df, a > 0, b > 0)

Next, is.na() will also come in handy. This example keeps the observations in df for which the variable x is not NA:

filter(df, !is.na(x))

  • Use R’s logical and boolean operators to select just the rows where a flight left before 5:00 am (500) or arrived after 10:00 pm (2200).

filter(hflights, DepTime < 500 | ArrTime > 2200)

  • Print out all of the flights that departed late but arrived ahead of schedule. Use DepDelay and ArrDelay for this.

filter(hflights, DepDelay > 0, ArrDelay < 0)

  • Find all of the flights that were cancelled after being delayed. These are flights that were cancelled, while having a DepDelay greater than zero.

filter(hflights, Cancelled == 1, DepDelay > 0)

3-3 Arranging your data

arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, for example, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

dtc has already been defined on the right. It’s up to you to write some arrange() expressions to display its contents appropriately!

  • Arrange dtc, by departure delays so that the shortest departure delay is at the top of the data set.

dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

arrange(dtc, DepDelay)

  • Arrange dtc so that flights that were cancelled for the same reason appear next to each other.

arrange(dtc, CancellationCode)

  • Arrange dtc so that flights by the same carrier appear next to each other. Within each carrier, flights that have smaller departure delays appear before flights that have higher departure delays. Do this in a one-liner.

arrange(dtc, UniqueCarrier, DepDelay)

3-4 Reverse the order of arranging

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

  • Arrange hflights so that flights by the same carrier appear next to each other and within each carrier, flights that have larger departure delays appear before flights that have smaller departure delays.

arrange(hflights, UniqueCarrier, desc(DepDelay))

  • Arrange the flights in hflights by their total delay (the sum of DepDelay and ArrDelay). Try to do this directly inside arrange().

arrange(hflights, DepDelay + ArrDelay)

Gviz

TP53 gene Location:
chr17 from 7668402 to 7687538

IdeogramTrack & GenomeAxisTrack

library(Gviz)
from <- 7668402
to <- 7687538
axTrack <- GenomeAxisTrack()
idxTrack <- IdeogramTrack(genome = "hg38", chromosome = "chr17")
plotTracks(list(idxTrack,axTrack),from = from, to = to, showTitle = FALSE)

GENCODE GENE Track

knownGenes <- UcscTrack(genome = "hg38", chromosome = "chr17", 
 track = "knownGene", from = from, to = to, trackType = "GeneRegionTrack",
 rstarts = "exonStarts", rends = "exonEnds", gene = "name",
 symbol = "name", transcript = "name", strand = "strand",
 fill = "#8282d2", name = "GENCODE Genes")

plotTracks(list(idxTrack,axTrack,knownGenes),from = from, to = to, showTitle = FALSE)