RNA sequencing (RNA-Seq) is revolutionizing the study of the transcriptome. A highly sensitive and accurate tool for measuring expression across the transcriptome, it is providing visibility to previously undetected changes occurring in disease states, in response to therapeutics, under different environmental conditions and across a broad range of other study designs.

RNA-Seq allows researchers to detect both known and novel features in a single assay, enabling the detection of transcript isoforms, gene fusions, single nucleotide variants, allele-specific gene expression and other features without the limitation of prior knowledge.

fig-rna-seqIn all cases an RNA-seq experiment involves making a collection of cDNA fragments which are flanked by specific constant sequences (known as adapters) that are necessary for sequencing . This collection (referred to as a library) is then sequenced using short-read sequencing which produces millions of short sequence reads that correspond to individual cDNA fragments.



A typical RNA-seq experiment consists of the following steps:

  • Design Experiment. Set up the experiment to address your questions.
  • RNA Preparation. Isolate and purify input RNA.
  • Prepare Libraries. Convert the RNA to cDNA; add sequencing adapters.
  • Sequence. Sequence cDNAs using a sequencing platform.
  • Analysis

RNA Analysis Steps

Data set on GEO containing RNA-seq and bisulfite sequencing data from AML3 cells treated with the drug Azacitidine (GSE55125). This drug is known to block DNA methylation, so it will be interesting to see how this effects gene expression and whether we can learn anything extra about the mechanisms of this potential anticancer drug.

Data analysis step 1: download from GEO and convert to fastq

We will be downloading human RNA-seq data from GEO accession GSE55123. Now you would have thought that this would be easy, but you have to understand that the data we download from GEO is in NCBI’s short read archive format (SRA). To unpack the original sequence files can be a bit tricky at first, even the size of the SRA toolkit manual is enough to make you cringe.

So start (in linux) by making a text file containing all the SRA file links fron the NCBI ftp site. Let’s call it “url.txt”.


OK now we’ll make a little script that goes and downloads each of these URLs. Notice that we use axel, which is a really cool downloading utility that can make multiple server connections and get the most out of your bandwidth. axel has a few bugs. For instance it doesn’t handle long URLs well and in those cases you might want to try aria2c as an alternative.

for URL in `cat $URLFILE`
do axel --num-connections=$NUM $URL

OK so now you’ll have a bunch of SRA files and you need to get them into fastq format. In most cases, I would recommend downloading the CentOS version, even if you’re running Ubuntu.

wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.3.5-2/sratoolkit.2.3.5-2-centos_linux64.tar.gz
tar xvfz sratoolkit.2.3.5-2-centos_linux64.tar.gzcd sratoolkit.2.3.5-2-centos_linux64/bin/
sudo ln * /usr/local/binNow go back to the directory with the SRA files and create a script that converts them (dump) to compressed fastq.

for SRA in *sra
fastq-dump --gzip --split-3 $SRA

Data analysis step 2: quality control of RNA-seq data

In a previous post, we downloaded RNA-seq data from GEO (GSE55123) . Lets continue with the processing of this data by performing QC analysis. In a previous post, I went into a bit more detail, but here we will simply use fastx_quality_stats from the fastx toolkit to have a look at quality scores among the data sets.The general strategy was to unzip the data on the fly, convert to tabular format and then select a random 1 million sequences and then submit these to fastx_quality_stats. So having a look through the output file, shows very high quality scores with median scores >36 which suggests this dataset is very high quality. Below see the code used and a graph of median quality scores throughout the run.

for FQZ in *bz2
do echo $FQZ
pbzip2 -dc $FQZ | paste - - - - | shuf | head -1000000 \
| tr '\t' '\n' | fastx_quality_stats | cut -f1,6-9
done | tee quality_analysis.txt

Data analysis step 3: Align paired end RNA-seq with Tophat

we will align the paired end data to the human genome with Tophat.

Part 1 is to get a suitable reference genome sequence. I chose download the most recent human genome sequence from the Ensembl ftp site (Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa).

axel –num-connections=6 ftp://ftp.ensembl.org/pub/release-76/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

Once downloaded, the genome needs to be uncompressed and indexed with bowtie2. This will take a few hours.

gunzip ${REF}.gz
bowtie2-build $REF $REF

Next step is to use Tophat to align the reads in paired-end mode. Tophat is smart enough to recognise the bz2 compression. The script also indexes the bam files and generates some familiar samtools statistics.


for FQZ1 in SR*_1.fastq.bz2
FQZ2=`echo $FQZ1 | sed 's/_1/_2/'`
BASE=`echo $FQZ1 | cut -d '_' -f1`
tophat -p 4 -o ${BASE}_tophat $REF $FQZ1 $FQZ2
samtools index ${BASE}_tophat/accepted_hits.bam
samtools flagstat ${BASE}_tophat/accepted_hits.bam
> ${BASE}_tophat/accepted_hits.bam.stats

A scan through the align_summary.txt shows that the alignment stats looks reasonable.

Tophat alignment statistics for paired end RNA-seq data (GSE55123)

If you are working on a big multicore server with sufficient memory, you can speed up the alignment by running multiple alignment jobs in the background.



for FQZ1 in SR*_1.fastq.bz2
FQZ2=`echo $FQZ1 | sed 's/_1/_2/'`
BASE=`echo $FQZ1 | cut -d '_' -f1`
tophat -p 4 -o ${BASE}_tophat $REF $FQZ1 $FQZ2
samtools index ${BASE}_tophat/accepted_hits.bam
samtools flagstat ${BASE}_tophat/accepted_hits.bam \> ${BASE}_tophat/accepted_hits.bam.stats ) &
done ; wait

Data analysis step 4: Count aligned reads and create count matrix

we will count the reads aligning to exons and generate a count matrix for statistical analysis.My overall strategy is to use bedtools to count reads over exons. To do that, I’ll need a bed file containing all exons. I could use UCSC genes, but I’d rather use the latest Ensembl version.

Create Exon bed file:To begin, we need a list of known exons. I will use the latest Ensembl human gene annotation gtf file from their ftp server. Then extract the exon entries with grep and then rearrange the gene accession and symbol information into a temporary bed file. The temporary bed file is exploded on gene name and then overlapping exons are merged with bedtools.

wget ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz
zcat Homo_sapiens.GRCh38.76.gtf.gz \
| grep -w exon | tr '"' '\t' \
| cut -f1,4,5,10,16 | sed 's/\t/_/4' \
| sort -k4 > tmp.bedmkdir Homo_sapiens.GRCh38_explode
cd Homo_sapiens.GRCh38_explode
awk '{print > $4}' ../tmp.bedfor i in `ls ENS*`
do cat $i | sortBed | mergeBed -nms \
| cut -d ';' -f1
<emdone | sortBed > ../Homo_sapiens.GRCh38_exon.bed
Count reads with bed file:

In this part, we ask bedtools to count the reads aligned to the exons with multicov tool. We use the options “-D” to allow duplicate reads  and “-q 10” to only count reads that were uniquely mapped. The script then parses the data to a perl script called “agg.pl” that aggregates the exons data to generate a final gene-wise count. The temporary count files are then then merged from all the samples using awk to make a data matrix.

dMX=`echo $EXPT | sed 's/$/.xls/'`

dfor BAM in *bam
dbedtools multicov -D -q 10 -bams $BAM -bed $BED \
d| cut -f4- | tr '\t' ',' \
d| perl agg.pl > ${BAM}.cnt &

dls *bam | cat | tr '\n' '\t' \
d| sed 's/^/\t/;s/$/\n/;s/.fq.sort.bam//g' > header.tmp

dawk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' *cnt \
d| tr -s ' ' | tr ' ' '\t' | cat header.tmp - > $MX

drm header.tmp

Here is the agg.pl script that you will need:
use warnings;
<emuse strict;
use integer;
my %a;
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$animal} += $n if defined $n;
print "$_\t${a{$_}}\n" for sort keys %a;
Here is what the matrix file will look like:
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000260835_RP11-468I15.1 0 0 0 0 0 0
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000185339_TCN2 1 2 1 0 1 1
ENSG00000255506_RP11-203F8.1 0 0 0 0 0 0
ENSG00000229899_AC084290.2 0 0 0 0 0 0
ENSG00000269933_RP3-333A15.2 0 0 0 0 0 0
ENSG00000216069_MIR875 0 0 0 0 0 0
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
You will notice that there are a lot of genes with few or no reads. I generally like to remove genes with fewer than 10 reads per sample on average. I use a small awk script called filt.sh
sed -n 1p $FILE > header.txt
awk 'BEGIN {FS=OFS="\t"}
sum=0; n=0
     {sum+=$i; ++n}
     print sum/n,$0
}' $FILE \
| awk '$1>10' | cut -f2- | cat header.txt - > ${FILE}_filt.xls
Here is how you use the filt.sh script:
$ ./filt.sh CountMatrix.xls
Then have a look at the filtered matrix file:
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
ENSG00000138069_RAB1A 4141 3568 3571 3250 3259 3189
ENSG00000127666_TICAM1 719 650 677 615 648 566
ENSG00000155393_HEATR3 1672 1291 1570 1511 1634 1641
ENSG00000146731_CCT6A 9400 7432 7743 8923 9322 9471
ENSG00000153012_LGI2 273 275 214 299 264 272
ENSG00000253716_RP13-582O9.5 78 53 62 75 77 60
At this stage, we can compare the number of genes that were discarded from the analysis with “wc”:
$ wc -l Aza_AML3_tophat*
  19919 Aza_AML3_tophat_filt.xls
  62758 Aza_AML3_tophat.xls
So this shows that 42,839 Ensembl genes did not pass our detection threshold. By omitting these silent and lowly expressed genes, we will remove some noise and also get the added bonus of making the false discovery rate correction of p-values less severe when we get to the statistical analysis.

Data analysis step 5: Differential analysis of RNA-seq

We are at the stage where we can now perform a statistical analysis of the count matrix we generated in the last post and look at the genes expression differences caused by Azacitidine.

For this type of analysis we could load our data into R and perform the analysis ourselves, but for a simple experiment design with 2 sample groups in triplicate without batch effects or sample pairing I want to share with you an easy solution. DEB is a online service provided by the  Interdisciplinary Center for Biotechnology Research (ICBR) University of Florida that will analyse the count matrix for you with either DESeq, edgeR or baySeq. Their Bioinformation paper is also worth a look.

As with all aspects of bioinformatics, format is critical. You need to follow the specified format exactly. Here is what the head of my count matrix looks like:

ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
ENSG00000138069_RAB1A 4141 3568 3571 3250 3259 3189
ENSG00000127666_TICAM1 719 650 677 615 648 566
ENSG00000155393_HEATR3 1672 1291 1570 1511 1634 1641
ENSG00000146731_CCT6A 9400 7432 7743 8923 9322 9471
ENSG00000153012_LGI2 273 275 214 299 264 272
ENSG00000253716_RP13-582O9.5 78 53 62 75 77 60

 I then uploaded the count matrix to DEB to perform differential gene expression analysis using all three statistical algorithms with a significance threshold of 5% after false discovery rate (FDR) correction.
Number of differentially expressed genes (FDR≤0.05) using three popular algorithms for RNA-seq analysis.

Then I wanted to know what the overlap was between the three algorithms, so I drew Venn diagrams with Venny.


This shows that overall, the three algorithms were very similar, although edgeR was the least conservative and baySeq was the most conservative.

Here is the smear plot of the effect of azacitidine using DESeq. The version generated by edgeR was virtually identical.

Smear plot from DESeq for the comparison of azacitidine treated cells versus the control cells (GEO link). The x-axis shows the average expression level and the y-axis shows the fold change. Both axes are on the log scale. DEB server was used for statistical analysis.

As you can see, the range of fold changes is quite striking, with several genes with log2 fold changes less than -6, and only 2 genes with fold change greater than +2. Below is the top 20 up- and down-expressed genes by azacitidine.

Top 20 up- and down-expressed genes by azacitidine determined by DESeq.

Data analysis step 6: Draw a heatmap from RNA-seq data using R

Here  we use DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.

It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:

  1. Read the count matrix and DESeq table into R and merge into one table
  2. Sort based on p-value with most significant genes on top
  3. Select the columns containing gene name and raw counts
  4. Scale the data per row
  5. Select the top 100 genes by significance
  6. Generate the heatmap with mostly default values
Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use built-in R features.

#read in the count matrix
mx<-read.table("Aza_AML3_countmatrix.xls", row.names = 1 , header = TRUE)

#read in the DESeq DGE spreadsheet
dge<-read.table(“DEB_DESeq.xls”, row.names = 1 , header = TRUE)

#merge the counts onto the DGE spreadsheet

#sort the merged table by p-value
smg<-mg[order(mg$pval), ]

#select only the columns containing the gene names and count data
x<-subset(smg, select = c(“Row.names”, “UNTR1”, “UNTR2”, “UNTR3”, “AZA1”, “AZA2”, “AZA3”))

#make the table a data frame with gene names then remove duplicate gene name column
y<-(as.data.frame(x, row.names = x$Row.names))
x<-subset(y, ,-c(Row.names))

#scale rows

#only grab top 100 by p-value
h<-head(xtst, n = 100L)

#set layout options – adjust if labels get cut off
pdf(“heatmap.pdf”,width=7, height=8)

#draw heatmap allowing larger margins and adjusting row label font size
heatmap(h, margins = c(4,10), cexRow=.4)

#output plot to file

Heatmap representing gene expression of AML3 cells treated with azacitidine. Only top 100 most significant genes are shown. Statistical analysis was performed using DESeq. Scale: Yellow indicates high  expression and red is low expression.

As you can see, the heatmap shows stark expression changes for these top 100 most significantly differential genes. Also note that the majority of genes in the top 100 are down-regulated.