Making heatmaps with R for microbiome analysis

Posted on 20 August, 2013 by Jeremy Yoder

Arianne Albert is the Biostatistician for the Women’s Health Research Institute at the British Columbia Women’s Hospital and Health Centre. She earned a PhD from the University of British Columbia under the tutelage of Dolph Schluter before branching off into health research. Arianne currently has her fingers in all kinds of research pies, including work on the vaginal microbiome. She still has a soft spot for sticklebacks though. You can find her on LinkedIn and Google Scholar.
Heatmaps are incredibly useful for the visual display of microarray data or data from high-trhoughput sequencing studies such as microbiome analysis. Basically, they are false colour images where cells in the matrix with high relative values are coloured differently from those with low relative values. Heatmaps can range from very simple blocks of colour with lists along 2 sides, or they can include information about hierarchical clustering, and/or values of other covariates of interest. Fortunately, R provides lots of options for constructing and annotating heatmaps.

R preliminaries

Load libraries

The first step is to make sure you’ve got the right libraries loaded. I find that the heatmap function in the basic stats package (loaded by default) is quite useful for many applications. However, more complicated annotations require either gplots or Heatplus (from Bioconductor).

`1`	`library(gplots)` `# for heatmap.2`

1# to install packages from Bioconductor:

2source("http://bioconductor.org/biocLite.R")

3biocLite("Heatplus")  # annHeatmap or annHeatmap2

4library(Heatplus)

1# load the vegan package for hierachical clustering if you want to use distance functions not specified in dist.

2library(vegan)

1# load the RColorBrewer package for better colour options

2library(RColorBrewer)

Get and Prepare Dataset

Next, we need some data! For this example, I’m going to use some microbiome data available for download from Dryad, originally from the paper:
Boucias DG, Cai Y, Sun Y, Lietze V, Sen R, Raychoudhury R, Scharf ME. 2013. The hindgut-lumen prokaryotic microbiota of the lignocellulose-degrading termite Reticulitermes flavipes and its responses to dietary lignocellulose composition. Molecular Ecology 22(7): 1836–1853. doi:10.1111/mec.12230; Dryad data package doi:10.5061/dryad.b922k.
This file is not formatted in a way that R will understand, so it requires some alteration before importing. The file has samples across the top and genera along the rows as well as some lovely, but superfluous, summary annotation. I got rid of extra rows and columns and transposed the rest in Excel. Now the 238 genera are along the top with the 12 samples as the rows.
Load your data! I’m assuuming you know how to do this in R …

1all.data <- read.csv("genera example.csv")  # load the data

2dim(all.data)

3[1]  12 239

Here are the first 3 rows and 4 columns to give you an idea of what the data should look like:

1all.data[1:3, 1:4]

2    sample Acetivibrio Acetobacter Achromobacter 

31     S7           0           5             0

42     S8           1           0             1

53     S9           0           3             1

We’ll have to strip off the sample ids and convert them to row names so that the data matrix contains only sequence count data.

1row.names(all.data) <- all.data$sample

2all.data <- all.data[, -1]

Now, to make a heatmap with microbiome sequencing data, we ought to first transform the raw counts of reads to proportions within a sample:

1data.prop <- all.data/rowSums(all.data)

2data.prop[1:3, 1:3]

3    Acetivibrio Acetobacter Achromobacter

4S7   0.000e+00   0.0004289     0.000e+00

5S8   9.625e-05   0.0000000     9.625e-05

6S9   0.000e+00   0.0002759     9.195e-05

Make heatmaps

Generally, the default colours for heatmaps are pretty appalling. I personally like this palette, but play around to see what suits you.

1#
 colorRampPalette is in the RColorBrewer package.  This creates a colour
 palette that shades from light yellow to red in RGB space with 100 
unique colours


2scaleyellowred <- colorRampPalette(c("lightyellow", "red"), space = "rgb")(100)

The most basic of heatmaps:

`1`	`heatmap(as.matrix(data.prop), Rowv =` `NA, Colv =` `NA, col = scaleyellowred)`

It’s pretty clear that this plot is inadequate in many ways. For one, the genus labels are all squished along the bottom and impossible to read. One solution to this problem is to remove genera that are exceedingly rare from this figure. Let’s try removing genera whose relative read abundance is less than 1% of at least 1 sample.

1# determine the maximum relative abundance for each column

2maxab <- apply(data.prop, 2, max)

3head(maxab)

4Acetivibrio     Acetobacter   Achromobacter Acidaminococcus 

5 1.088e-04       5.972e-04       5.441e-04       8.578e-05

1# remove the genera with less than 1% as their maximum relative abundance

2n1 <- names(which(maxab < 0.01))

3data.prop.1 <- data.prop[, -which(names(data.prop) %in% n1)]

This leaves us with 23 genera suggesting that most of the taxa sampled occur at very low relative abundances. The heatmap with the new data looks like this, with a bit of extra fiddling to get all of the labels displayed.

1#
 the margins command sets the width of the white space around the plot. 
The first element is the bottom margin and the second is the right 
margin


2heatmap(as.matrix(data.prop.1), Rowv = NA, Colv = NA, col = scaleyellowred, margins = c(10, 2))

Much better! Now let’s add a dendrogram for the samples. The heatmap function will do this for you, but I prefer to make my own using the vegan package as it has more options for distance metrics. Also, this means that you can do hierarchical clustering using the full dataset, but only display the more abundant taxa in the heatmap.

1# calculate the Bray-Curtis dissimilarity matrix on the full dataset:

2data.dist <- vegdist(data.prop, method = "bray")

1#
 Do average linkage hierarchical clustering. Other options are 
'complete' or 'single'. You'll need to choose the one that best fits the
 needs of your situation and your data.


2row.clus <- hclust(data.dist, "aver")

1# make the heatmap with Rowv = as.dendrogram(row.clus)

2heatmap(as.matrix(data.prop.1), Rowv = as.dendrogram(row.clus), Colv = NA, col = scaleyellowred, margins = c(10, 3))

plot of chunk make distance matrix and clusters

You can also add a column dendrogram to cluster the genera that occur more often together. Note that this one must be done on the same dataset that is used in the Heatmap (i.e. reduced number of genera).

1# you have to transpose the dataset to get the genera as rows

2data.dist.g <- vegdist(t(data.prop.1), method = "bray")

3col.clus <- hclust(data.dist.g, "aver")

1# make the heatmap with Rowv = as.dendrogram(row.clus)

2heatmap(as.matrix(data.prop.1), Rowv = as.dendrogram(row.clus), Colv = as.dendrogram(col.clus), col = scaleyellowred, margins = c(10, 3))

plot of chunk make distance matrix and clusters for genera

For some applications, the above heatmaps would be sufficient, however, there are many times where it would be nice to indicate on the heatmap values for other important variables, such as experimental groups, or values of covariates. Also, it would be nice to have a legend for the colour scale. heatmap.2 in the gplots package can do both of these things.
First, let’s make up some annotation for our dataset (I would like to be 100% clear that this is totally fictional data and not in the original publication in any way):

1# There are 12 samples in the dataset, so let's assign them randomly to one of two fictional categories

2var1 <- round(runif(n = 12, min = 1, max = 2))  # this randomly samples from a uniform distribution and rounds the result to an integer value

1# change the 1s and 2s to the names of colours:

2var1 <- replace(var1, which(var1 == 1), "deepskyblue")

3var1 <- replace(var1, which(var1 == 2), "magenta")

1# if this was real data, you would need the order of the values to be in the same order as the sequencing data e.g.

2cbind(row.names(data.prop), var1)

01var1

02 [1,] S7  magenta

03 [2,] S8  magenta

04 [3,] S9  deepskyblue

05 [4,] S10 deepskyblue

06 [5,] S11 deepskyblue

07 [6,] S12 magenta

08 [7,] S13 deepskyblue

09 [8,] S14 deepskyblue

10 [9,] S15 deepskyblue

11[10,] S16 deepskyblue

12[11,] S17 deepskyblue

13[12,] S18 deepskyblue

Next, we can make a more complicated heatmap.

1heatmap.2(as.matrix(data.prop.1),Rowv = as.dendrogram(row.clus), Colv = as.dendrogram(col.clus), col = scaleyellowred, RowSideColors = var1, # this puts in the annotation for the samples margins = c(10, 3)) 

You’ll note that some of the defaults for heatmap.2 are kind of funny looking. The green tracing in each column can be turned off with trace = "none", and the histogram in the scale legend can be removed with density.info = "none". You can add plot labels (e.g. main, xlab and ylab), as well as change the relative sizing of the plot elements (lhei and lwid). There are lots of other options which are surprisingly well described in the help for heatmap.2.

1heatmap.2(as.matrix(data.prop.1), Rowv = as.dendrogram(row.clus), Colv = as.dendrogram(col.clus), col = scaleyellowred, RowSideColors = var1, margins = c(11, 5), trace = "none", density.info = "none", xlab = "genera", ylab = "Samples", main = "Heatmap example", lhei = c(2, 8)) # this makes the colour-key legend a little thinner

Finally, the least intuitive function is annHeatmap2 from the Heatplus package. However, it allows for annotation with multiple variables at once, as well as an option for colouring the dendrogram into clusters. All of the options within the function are written as lists, which can take some getting used to if you don’t use them regularly.

1# the annHeatmap2 function needs to be wrapped in the plot function in order to display the results

2plot(annHeatmap2(as.matrix(data.prop.1),

1#
 the colours have to be fiddled with a little more than with the other 
functions. With 50 breaks in the heatmap densities, there needs to be 51
 colours in the scale palette. Error messages from the plotting should 
help to determine how many colours are needed for a given number of 
breaks


2col = colorRampPalette(c("lightyellow", "red"), space = "rgb")(51), breaks = 50,

1#
 dendrogram controls how dendrograms are made and can be calculatated 
witin the function. The internal calculation can be overridden and 
externally calculated dendrograms can be displayed.


2dendrogram = list(Row = list(dendro = as.dendrogram(row.clus)), Col = list(dendro = as.dendrogram(col.clus))), legend = 3, #
 this puts the colour-scale legend on the plot. The number indicates the
 side on which to plot it (1 = bottom, 2 = left, 3 = top, 4 = right)


3labels = list(Col = list(nrow = 12)) # gives more space for the Genus names


4))

Annotations are added with the ann sublist which takes a data frame as its input. Unlike the other functions, it can take multiple variables for annotation including continuous variables.

1# make a data frame of variables for annotation

2ann.dat <- data.frame(var1 = c(rep("cat1", 4), rep("cat2", 8)), var2 = rnorm(12,  mean = 50, sd = 20))

1plot(annHeatmap2(as.matrix(data.prop.1), col = colorRampPalette(c("lightyellow", "red"), space = "rgb")(51), breaks = 50, dendrogram = list(Row = list(dendro = as.dendrogram(row.clus)), Col = list(dendro = as.dendrogram(col.clus))), legend = 3, labels = list(Col = list(nrow = 12)), ann = list(Row = list(data = ann.dat))))

Finally, clusters in the dedrogram(s) can be highlighted using different colours with the cluster sublist:

1# make a data frame of variables for annotation

2ann.dat <- data.frame(var1 = c(rep("cat1", 4), rep("cat2", 8)), var2 = rnorm(12, mean = 50, sd = 20))

1plot(annHeatmap2(as.matrix(data.prop.1),

2col = colorRampPalette(c("lightyellow", "red"), space = "rgb")(51),

3breaks = 50,

4dendrogram = list(Row = list(dendro = as.dendrogram(row.clus)), Col = list(dendro = as.dendrogram(col.clus))),

5legend = 3, 

6labels = list(Col = list(nrow = 12)), 

7ann = list(Row = list(data = ann.dat)),

8cluster = list(Row = list(cuth = 0.25, col = brewer.pal(3, "Set2"))) # cuth gives the height at which the dedrogram should be cut to form clusters, and col specifies the colours for the clusters

9))

This is just a taste of what these functions can do. Experimenting with changing the default values and seeing what happens is half the fun with R.

DeepSeq Data Analysis Tools

NGS Data Analysis
Site Name	Description	Clicks
FASTX Tool Kit	The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.	144
The Genome Analysis Toolkit (GATK)	“The GATK is a structured software library that makes writing efficient analysis tools using next-generation sequencing data very easy, and second it’s a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include things like a depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner”.	136
Genome-wide Complex Trait Analysis (GCTA)	"GCTA (Genome-wide Complex Trait Analysis) is designed to estimate the proportion of phenotypic variance explained by genome- or chromosome-wide SNPs for complex traits".	133
Burrows Wheeler Algorithm Download	"BWA is a program for aligning sequencing reads against a large reference genome (e.g. human genome). It has two major components, one for read shorter than 150bp and the other for longer reads".	175
SAM Tools	"SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format".	125

Top of Page

SNP Detection and Genomics Work
Site Name	Description	Clicks
GenSel Software	BIGS GenSel Login Page. GelSel is used primarily for statistical analyses associated with Genomic Selection. This software was developed by Dr. Dorian J. Garrick and Dr. Rohan Fernando at Iowa State University, USA.	151
AgBase v. 2.00	"AgBase is a curated resource for functional analysis of agricultural plant and animal gene products including gene ontology annotations". It provides several functional analysis tools, including, GOProfiler, Genome2Seq, GORetriver etc.	124
Genome2Seq	"Genome2Seq is used to rapidly retrieve a fasta file of sequences based on genome co-ordinates generated from RNA-Seq data".	124
gnuplot version 4.6	"gnuplot is a command-driven interactive function plotting program. It can be used to plot functions and data points in both two- and three-dimensional plots in many different formats. It is designed primarily for the visual display of scientific data. gnuplot is copyrighted, but freely distributable".	114
SNPlotz: a genome plot tool for SNP association studies	"SNPlotz is a web tool for users to plot phenotypic effects data from GWAS analysis, along a genome. The plot tool is coupled with the GBrowse to bring up all related genome features (transcripts, genes, QTL, etc) for further data mining on any interesting SNP shown on the plot".	121
Bioinformatics Services and Tools- CCMB, MI, USA	Use this website to get guidelines on workflows of RNAseq, ChipSeq, Genome Assembly, Genome RNAseq and microarray analysis as well as list of several useful bioinformatics tools and resources.	126
Gene Ontology Tools	"Gene Ontology (GO), the de facto standard in gene functionality description, is used widely in functional annotation and enrichment analysis".	153
MISO-The Sequence Ontology Browser	Its the "home page of the Sequence Ontology Project (SO), a joint effort by genome annotation centres, and other groups using sequence annotation data, including: WormBase, FlyBase, the Mouse Genome Informatics group, and the Sanger Institute". Aims at developing an ontology suitable for describing biological.	115
SNP Nexus	An SNP annotation tool. "SNPnexus was designed to simplify and assist in the selection of functionally relevant Single Nucleotide Polymorphisms (SNP) for large-scale genotyping studies of multifactorial disorders".	154
Helix System	This system "evaluates new high performance architectures and introduces them into production work at NIH". Some of the function of this system are installing scientific application programs, evaluateing new high performance architectures, installing and supporting scientific application programs etc.	180

Top of Page

NGS Forums
Site Name	Description	Clicks
SEQanswers: Discussion Forum	A discussion forum "intended to be oriented toward life sciences and next generation sequencing". However, "any scientific topic related to genomics is open for discussion".	578
Generic Model Organism Database	The Generic Model Organism Database project is “a collection of open source software tools for creating and managing genome-scale biological databases”. It can be used “to create a small laboratory database of genome annotations, or a large web-accessible community database”.	118
Database management for SNP data (snpdb)	"This project is used to facilitate the management and analysis of Single-Nucleotide Polymorphism (SNP) data from Genome-Wide Association Studies (GWAS) using a database system".	117

Metagenomics

Top of Page

Metagenomics
Site Name	Description	Clicks
MEGAN4-MEtaGenome ANalyzer	It is a "stand-alone analysis tool for metagenomic of short-read data".	144
MG-RAST	"MG-RAST (the Metagenomics RAST) server is an automated analysis platform for metagenomes providing quantitative insights into microbial populations based on sequence data". Its works the best in Firefox browser.	131
IMG/M	“The Integrated Microbial Genomes (IMG) system serves as a community resource for analysis and annotation of genome and metagenome datasets in a comprehensive comparative context. The IMG data warehouse integrates genome and metagenome datasets provided by IMG users with a comprehensive set of publicly available isolate and single cell genomes, and a rich set of publicly available metagenome samples”.	152
CAMERA	CAMERA (Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis) aims at serving the" needs of the microbial ecology research community, and other scientists using metagenomics data, by creating a rich, distinctive data repository and a bioinformatics tools resource that will address many of the unique challenges of metagenomic analysis".	129
CARMA	"CARMA is a software pipeline for characterizing the taxonomic composition and genetic diversity of short-read metagenomes. The software was originally designed for the analysis of environmental metagenomes obtained by the ultra-fast 454 pyrosequencing system".	146
Terragenome	Its an international soil metagenome sequencing consortium.	135

Top of Page

Quotes

“Biostatistics is a radar for biomedicine “that” allows biologists to navigate in an immense ocean of information with only a small amount of light, saving them from inappropriate experiments and helping them to design consistent ones”

(EurekAlert!, the online, global news service operated by AAAS, the science society).

“Bioinformatics is like an amoeba; it comes in various shapes and sizes.”

(Nancy Lorenzi)

Next generation sequencing, metagenomics analysis, metagenome, The Genome Analysis Toolkit (GATK), Genome-wide Complex Trait Analysis (GCTA), Burrows Wheeler Algorithm Download, SAM Tools, GenSel Software, genomic selection, AgBase v. 2.00, Genome2Seq, Discussion Forum, Generic Model Organism Database, Database management for SNP data (snpdb), HiSeq, Illumina, Roche, Reverese transcription PCR

Working is My Life

Search This Blog

Thursday, May 1, 2014

R (18) : Making heatmaps with R for microbiome analysis

Making heatmaps with R for microbiome analysis

R preliminaries

Load libraries

Get and Prepare Dataset

Make heatmaps

Bioinformatics Software and Tools

NGS

DeepSeq Data Analysis Tools

Metagenomics

Quotes