This is an easy-to-use R package for automated basic RNA-Seq analysis with minimal coding requirements. It is designed to be used by biologists with little to no coding experience.
For an in-depth tutorial, check out the following blog post.
Currently supported Model Organism: Human.
Overall data structure analysis:
- Euclidean distance between samples
- Poisson distance between samples
- PCA analysis
- PCA Eigenvectors
- Multidimensional scaling (MDS) analysis
- Most variable genes
Analysis between groups of interest:
- Differential gene expression analysis using DESeq2
- Volcano Plot of differentially expressed genes
- Euclidean distance between samples
- Poisson distance between samples
- PCA analysis
- PCA Eigenvectors
- Multidimensional scaling (MDS) analysis
- GO enrichment of the differentially expressed genes
- KEGG pathway enrichment of the differentially expressed genes
- KEGG pathway diagrams of the top 5 enriched pathways
- GSEA analysis (H, C1, C2, C3, C4, C5, C6, C7 genesets)
- Custom geneset GSEA analysis
A CSV file of un-normalized counts with unique genes as rows and samples as columns. The counts table is generally generated after your FASTQ files have been aligned against the reference genome and quantified (not included in this pipeline). Please note that you must provide un-normalized data as input; normalized data will not work with this package. Instead of gene names, you can also supply Ensembl IDs. No other type of ID is supported at the moment.
Example counts table:
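Before passing a table to arseq, it can help to confirm that it really holds raw counts. The following is a minimal sketch in base R, using a small made-up table (not the package's example data) and a hypothetical helper named looks_like_raw_counts that is not part of arseq:

```r
# Hypothetical toy counts table: genes as rows, samples as columns.
toy_counts <- data.frame(
  sample1 = c(10, 0, 250),
  sample2 = c(12, 3, 240),
  row.names = c("GENE_A", "GENE_B", "GENE_C")
)

# Hypothetical helper (not part of arseq): returns TRUE when every value
# is a non-missing, non-negative whole number, as expected for raw counts.
looks_like_raw_counts <- function(counts) {
  values <- as.matrix(counts)
  is.numeric(values) &&
    !anyNA(values) &&
    all(values >= 0) &&
    all(values == round(values))
}

looks_like_raw_counts(toy_counts)        # the toy table holds whole numbers
looks_like_raw_counts(toy_counts / 3.7)  # rescaled values are no longer whole numbers
```

A check like this cannot catch every kind of normalization, so treat it only as a first sanity test. Duplicated gene names are caught earlier: read.csv with row.names = 1 stops with an error when the first column contains duplicates.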
A CSV file with information about the samples. The columns of the count matrix and the rows of the metadata (information about samples) must be in the same order. arseq will not guess which column of the count matrix belongs to which row of the metadata; these must be provided to arseq in a consistent order.
Example meta data file:
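Because the sample order is not checked for you, one possible way to verify and fix it yourself is sketched below in base R. The tables and their names (toy_counts, toy_meta) are made up for illustration and are not the package's example data:

```r
# Hypothetical toy inputs: the metadata rows are deliberately out of order.
toy_counts <- data.frame(
  sample1 = c(10, 0),
  sample2 = c(12, 3),
  sample3 = c(7, 1),
  row.names = c("GENE_A", "GENE_B")
)
toy_meta <- data.frame(
  treatment = c("drug_A", "control", "control"),
  row.names = c("sample3", "sample1", "sample2")
)

# Do both tables describe the same samples, and in the same order?
same_samples <- setequal(colnames(toy_counts), rownames(toy_meta))
same_order <- identical(colnames(toy_counts), rownames(toy_meta))

# Reorder the metadata rows so they follow the count table columns.
toy_meta <- toy_meta[colnames(toy_counts), , drop = FALSE]
aligned <- identical(colnames(toy_counts), rownames(toy_meta))
```

Reordering by name is only safe when same_samples is TRUE; if the sample names differ between the two files, fix the files first.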
Install and load the package.
# Install the development version
if (!require(devtools)) install.packages("devtools")
devtools::install_github("ajitjohnson/arseq", INSTALL_opts = "--no-multiarch")
# Load the package
library("arseq")
Import your counts matrix and meta data file into the R environment.
# Set the working directory (path to the folder where your data is located).
# Use forward slashes; single backslashes are escape characters in R strings.
setwd("path/to/the/folder/where/your/data/is/located")
# Load your counts table into R
my_data <- read.csv("counts_table.csv", row.names = 1, header = TRUE) # replace counts_table.csv with your file name
# Load your meta data into R
my_meta <- read.csv("meta_data.csv", row.names = 1, header = TRUE) # replace meta_data.csv with your file name
Run the analysis
# Run the analysis. The results will be saved in the same folder as your input data.
arseq(data = my_data, meta = my_meta, design = "treatment", contrast = list(A = c("control"), B = c("drug_A")))
In the above command, design takes the name of the metadata column that contains the groups you would like to compare in the differential expression analysis. More complex designs can also be passed; see the DESeq2 documentation. As an example, the metadata file shown above has a column named treatment that records which samples are control samples and which samples were treated with different drugs. So, to identify the differentially expressed genes between the control samples and the treated samples, pass design = "treatment".
contrast is another argument that you will need to specify. It defines the groups of samples between which you would like to perform differential expression analysis, in the following format: contrast = list(A = c(" "), B = c(" ")).
If you have three groups in your dataset (Control, drug_A and drug_B):
Comparison 1: To identify the differentially expressed genes between Control and drug_A, pass contrast = list(A = c("Control"), B = c("drug_A"))
Comparison 2: To identify the differentially expressed genes between Control and drug_A and drug_B together, pass contrast = list(A = c("Control"), B = c("drug_A", "drug_B"))
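Group labels in a contrast are plain strings, and R compares strings case-sensitively ("control" and "Control" are different labels). The sketch below, in base R, shows one way to confirm that every label in a contrast exists in the design column and to see which samples land on each side. The metadata table is made up for illustration:

```r
# Hypothetical metadata with the three groups from the example above.
toy_meta <- data.frame(
  treatment = c("Control", "Control", "drug_A", "drug_A", "drug_B", "drug_B"),
  row.names = paste0("sample", 1:6)
)
contrast <- list(A = c("Control"), B = c("drug_A", "drug_B"))

# Labels used in the contrast that do not occur in the design column.
unknown_labels <- setdiff(
  unlist(contrast, use.names = FALSE),
  unique(as.character(toy_meta$treatment))
)

# Sample names that fall on each side of the comparison.
samples_per_side <- lapply(contrast, function(groups) {
  rownames(toy_meta)[toy_meta$treatment %in% groups]
})
```

An empty unknown_labels means every label was found; any entry in it usually points to a typo or a capitalization mismatch.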
The package comes with an example dataset. In order to familiarise yourself with the package and its requirements you could play around with the example dataset.
# view the example counts table
head(example_data)
# view the example meta data
head(example_meta)
# Set the working directory. Folder to which you would like to save your results.
setwd("path/to/the/folder/where/you/would/like/to/save/the/results")
# Run the analysis. Here we are identifying the differences between drug_A samples and drug_B samples.
arseq(data = example_data, meta = example_meta, design = "treatment", contrast = list(A = c("drug_A"), B = c("drug_B")))
The arseq function can take in a few additional arguments.
qc - Default is TRUE. This will run the general stat module (e.g. PCA, MDS, etc.) for your entire dataset. If you are making multiple comparisons using the contrast argument, run qc = TRUE for the first time and change it to qc = FALSE for the subsequent comparisons to speed up the analysis.
variable.genes - Number of variable genes to be identified. By default the program identifies the top 1000 most variable genes. You could set it to variable.genes = 3000 to calculate the top 3000 most variable genes.
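When several comparisons are run on the same dataset, the qc advice above can be scripted. The following is a rough sketch, not a tested workflow: it assumes my_data and my_meta were loaded as shown earlier, that the design column is named treatment, and that the group labels Control, drug_A and drug_B match your metadata. Only arguments described on this page are used.

```r
# Comparisons to run (labels are assumptions; they must match your metadata).
comparisons <- list(
  list(A = c("Control"), B = c("drug_A")),
  list(A = c("Control"), B = c("drug_B"))
)

# qc is switched on for the first comparison only.
run_qc <- seq_along(comparisons) == 1

# The calls are skipped unless arseq is installed and the inputs exist.
if (requireNamespace("arseq", quietly = TRUE) &&
    exists("my_data") && exists("my_meta")) {
  for (i in seq_along(comparisons)) {
    arseq::arseq(
      data = my_data, meta = my_meta, design = "treatment",
      contrast = comparisons[[i]], qc = run_qc[i]
    )
  }
}
```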
If you found this package useful, please do cite this page in your publication. Thank you.
If you run into any issues, please report them at https://github.com/ajitjohnson/arseq/issues
You can also tweet me directly for inclusion of new methods into this package.