Box-plot with R – Tutorial
(This article was first published on R Video tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
Yesterday
I wanted to create a box-plot for a small dataset to see the evolution
of 3 stations through a 3 days period. I like box-plots very much
because I think they are one of the clearest ways of showing trend in
your data. R is extremely good for this type of plot and, for this
reason, I decided to add a post on my blog to show how to create a
box-plot, but also because I want to use my own blog to help me remember
pieces of code that I might want to use in the future but that I tend
to forget.
For this example I first created a dummy dataset using the function rnorm()
which generates random normal-distributed sequences. This function
requires 3 arguments, the number of samples to create, the mean and the
standard deviation of the distribution, for example:
rnorm(n=100,mean=3,sd=1)
This generates 100 numbers (floats to be exact), which have mean equal to 3 and standard deviation equal to 1.
To generate my dataset I used the following line of code:
data<-data.frame(Stat11=rnorm(100,mean=3,sd=2),
Stat21=rnorm(100,mean=4,sd=1),
Stat31=rnorm(100,mean=6,sd=0.5),
Stat41=rnorm(100,mean=10,sd=0.5),
Stat12=rnorm(100,mean=4,sd=2),
Stat22=rnorm(100,mean=4.5,sd=2),
Stat32=rnorm(100,mean=7,sd=0.5),
Stat42=rnorm(100,mean=8,sd=3),
Stat13=rnorm(100,mean=6,sd=0.5),
Stat23=rnorm(100,mean=5,sd=3),
Stat33=rnorm(100,mean=8,sd=0.2),
Stat43=rnorm(100,mean=4,sd=4))
This line creates a data.frame with 12 columns that looks like this:
Stat11
|
Stat21
|
Stat31
|
Stat41
|
Stat12
|
Stat22
|
Stat32
|
Stat42
|
Stat13
|
Stat23
|
Stat33
|
Stat43
|
5
|
2
|
9
|
-3
|
10
|
4
|
1
|
1
|
4
|
1
|
5
|
9
|
6
|
13
|
8
|
3
|
7
|
3
|
10
|
10
|
10
|
5
|
9
|
8
|
4
|
4
|
6
|
0
|
10
|
6
|
7
|
6
|
6
|
8
|
2
|
7
|
6
|
7
|
6
|
3
|
9
|
1
|
7
|
0
|
1
|
0
|
6
|
0
|
0
|
2
|
8
|
1
|
6
|
8
|
0
|
8
|
3
|
10
|
9
|
8
|
0
|
19
|
10
|
0
|
11
|
10
|
5
|
6
|
5
|
8
|
10
|
1
|
7
|
4
|
5
|
-5
|
7
|
0
|
3
|
5
|
2
|
5
|
5
|
3
|
4
|
12
|
9
|
-4
|
7
|
1
|
9
|
0
|
7
|
2
|
1
|
7
|
7
|
3
|
9
|
0
|
11
|
0
|
8
|
1
|
7
|
0
|
7
|
7
|
6
|
19
|
8
|
3
|
10
|
10
|
9
|
6
|
0
|
2
|
8
|
2
|
6
|
13
|
6
|
-5
|
12
|
8
|
1
|
4
|
0
|
4
|
5
|
10
|
8
|
11
|
6
|
-1
|
11
|
4
|
4
|
1
|
4
|
6
|
6
|
10
|
8
|
13
|
5
|
-5
|
7
|
10
|
0
|
4
|
2
|
7
|
3
|
1
|
2
|
8
|
5
|
-2
|
5
|
7
|
4
|
2
|
7
|
0
|
3
|
1
|
8
|
11
|
7
|
3
|
11
|
1
|
0
|
9
|
2
|
3
|
5
|
8
|
4
|
19
|
5
|
-1
|
11
|
6
|
3
|
4
|
9
|
5
|
9
|
0
|
2
|
9
|
5
|
-3
|
12
|
7
|
6
|
4
|
8
|
2
|
6
|
8
|
7
|
10
|
5
|
-4
|
8
|
9
|
6
|
9
|
1
|
4
|
3
|
4
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
As I mentioned before, this should represent 4 stations for which the measure were replicated in 3 successive days.
Now, for the creation of the box-plot the simplest function is boxplot() and can be simply called by adding the name of the dataset as only argument:
boxplot(data)
This creates the following plot:
It
is already a good plot, but it needs some adjustments. It is in black
and white, the box-plots are evenly spaced, even though they are from 3
different replicates, there are no labels on the axis and the names of
the stations are not all reported.
So now we need to start doing some tweaking.
First, I want to draw the names of the stations vertically, instead of horizontally. This can be easily done with the argument las. So now the call to the function boxplot()becomes:
boxplot(data, las = 2)
This generates the following plot:
Next, I want to change the name of the stations so that they look less confusing. For doing that I can use the option names:
boxplot(data, las = 2, names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
which generates this plot:
If the names are too long and they do not fit into the plot’s window you can increase it by using the option par:
boxplot(data, las = 2, par(mar = c(12, 5, 4, 2)+ 0.1), names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
Now
I want to group the 4 stations so that the division in 3 successive
days is clearer. To do that I can use the option at, which let me
specify the position, along the X axis, of each box-plot:
boxplot(data, las = 2, at =c(1,2,3,4, 6,7,8,9, 11,12,13,14), par(mar = c(12, 5, 4, 2) + 0.1), names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
Here
I am specifying that I want the first 4 box-plots at position x=1, x=2,
x=3 and x=4, then I want to leave a space between the fourth and the
fifth and place this last at x=6, and so on.
If
you want to add colours to your box plot, you can use the option col
and specify a vector with the colour numbers or the colour names. You
can find the colour numbers here, and the colour names here.
Here is an example:
boxplot(data, las = 2, col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1",
"royalblue2","red","sienna","palevioletred1","royalblue2"),
at = c(1,2,3,4, 6,7,8,9, 11,12,13,14), par(mar = c(12, 5, 4, 2) + 0.1),
names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
Now, for the finishing touches, we can put some labels to plot.
The common way to put labels on the axes of a plot is by using the arguments xlab and ylab.
Let’s try it:
boxplot(data, ylab ="Oxigen (%)", xlab ="Time", las = 2, col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"),at = c(1,2,3,4, 6,7,8,9, 11,12,13,14), par(mar = c(12, 5, 4, 2) + 0.1), names = c("Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4","Station 1","Station 2","Station 3","Station 4"))
As
you can see from the image above, the label on the Y axis is place very
well and we can keep it. On the other hand, the label on the X axis is
drawn right below the stations names and it does not look good.
To solve this is better to delete the option xlab from the boxplot call and instead use an additional function called mtext(),
that places a text outside the plot area, but within the plot window.
To place text within the plot area (where the box-plots are actually
depicted) you need to use the function text().
The function mtext() requires 3 arguments: the label, the position and the line number.
An example of a call to the function mtext is the following:
mtext(“Label”, side = 1, line = 7)
the option side takes an integer between 1 and 4, with these meaning: 1=bottom, 2=left, 3=top, 4=right
The
option line takes an integer with the line number, starting from 0
(which is the line closer to the plot axis). In this case I put the
label onto the 7th line from the X axis.
With these option you can produce box plot for every situation.
The following is just one example:
This is the script:
data<-data.frame(Stat11=rnorm(100,mean=3,sd=2), Stat21=rnorm(100,mean=4,sd=1), Stat31=rnorm(100,mean=6,sd=0.5), Stat41=rnorm(100,mean=10,sd=0.5), Stat12=rnorm(100,mean=4,sd=2), Stat22=rnorm(100,mean=4.5,sd=2), Stat32=rnorm(100,mean=7,sd=0.5), Stat42=rnorm(100,mean=8,sd=3), Stat13=rnorm(100,mean=6,sd=0.5), Stat23=rnorm(100,mean=5,sd=3), Stat33=rnorm(100,mean=8,sd=0.2), Stat43=rnorm(100,mean=4,sd=4)) boxplot(data, las = 2, col = c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1","royalblue2"), at = c(1,2,3,4, 6,7,8,9, 11,12,13,14), par(mar = c(12, 5, 4, 2) + 0.1), names = c("","","","","","","","","","","",""), ylim=c(-6,18)) #Station labelsmtext("Station1", side=1, line=1, at=1, las=2, font=1, col="red")mtext("Station2", side=1, line=1, at=2, las=2, font=2, col="sienna")mtext("Station3", side=1, line=1, at=3, las=2, font=3, col="palevioletred1")mtext("Station4", side=1, line=1, at=4, las=2, font=4, col="royalblue2")mtext("Station1", side=1, line=1, at=6, las=2, font=1, col="red")mtext("Station2", side=1, line=1, at=7, las=2, font=2, col="sienna")mtext("Station3", side=1, line=1, at=8, las=2, font=3, col="palevioletred1")mtext("Station4", side=1, line=1, at=9, las=2, font=4, col="royalblue2")mtext("Station1", side=1, line=1, at=11, las=2, font=1, col="red")mtext("Station2", side=1, line=1, at=12, las=2, font=2, col="sienna")mtext("Station3", side=1, line=1, at=13, las=2, font=3, col="palevioletred1")mtext("Station4", side=1, line=1, at=14, las=2, font=4, col="royalblue2") #Axis labelsmtext("Time", side = 1, line = 6, cex = 2, font = 3)mtext("Oxigen (%)", side = 2, line = 3, cex = 2, font = 3) #In-plot labelstext(1,-4,"*")text(6,-4,"*")text(11,-4,"*") text(2,9,"A",cex=0.8,font=3)text(7,11,"A",cex=0.8,font=3)text(12,15,"A",cex=0.8,font=3)