ggplot2: Layers

10 min read

1 Introduction

In general, there are three purposes for a layer:

  • To display the data.
  • To display a statistical summary of the data.
  • To add additional metadata: context, annotations, and references.

2 Geoms

Geoms can be roughly divided into individual and collective geoms.

  • Individual geoms draw a distinct graphical object for each observation (row). For example, the point geom draws one point per row.
    • geom_bar(stat = "identity") makes a bar chart. We need stat = “identity” because the default stat automatically counts values
    • geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines).
    • geom_rect() draw rectangles (parameterised by the four corners of the rectangle).
    • geom_tile() also draw rectangles (parameterised by the center of the rect and its size).
    • geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size.
  • Collective geoms display multiple observations with one geometric object.
    • The assignment of observations to graphical elements can be controlled by group aesthetic (shown below).
    • If a group is defined by a combination of multiple variables, use interaction() to combine them, e.g. aes(group = interaction(school_id, student_id)).
    • Setting the grouping aesthetic in ggplot() will affect all geoms below.
    • Lines and paths operate on a “first value” principle: each segment is defined by two observations, and ggplot2 applies the aesthetic value (e.g., color) associated with the first observation when drawing the segment.
    • Mapping aesthetics to discrete variables works well for bar and area plots, resulting in stacking individual pieces.
library(babynames)
hadley <- dplyr::filter(babynames, name == "Hadley")
ggplot(hadley, aes(year, n, group = sex, color = sex)) + 
  geom_line()

3 Statistical summaries

3.1 Revealing uncertainty

There are four basic families of geoms using the aesthetics ymin and ymax to show uncertainty:

  • Discrete x, range: geom_errorbar(), geom_linerange()
  • Discrete x, range & center: geom_crossbar(), geom_pointrange()
  • Continuous x, range:geom_ribbon()
  • Continuous x, range & center: geom_smooth(stat = "identity")
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, 
                       ymin = y - se, ymax = y + se))
base + geom_crossbar()|
base + geom_pointrange()|
base + geom_smooth(stat = "identity")

base + geom_errorbar()|
# if set width of errorbar to 0, it will be same as linerange
base + geom_errorbar(width = 0)|
base + geom_linerange()|
base + geom_ribbon()

3.2 Weighted data

Data can be weighted by the weight aesthetic.

# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point() + 
  geom_smooth(method = lm, size = 1) |

# Weighted by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point(aes(size = poptotal / 1e6)) + 
  geom_smooth(aes(weight = poptotal), 
              method = lm, size = 1) +
  scale_size_area(guide = "none")

3.3 Displaying distributions

There are a number of geoms that can be used to display distributions. To compare the distribution between groups, you have a few options:

  • Show small multiples of the histogram, facet_wrap(~ var).
  • Use color and a frequency polygon, geom_freqpoly().
  • Use a “conditional density plot”, geom_histogram(position = "fill").

An alternative to a bin-based visualisation is a density estimate. geom_density() places a little normal distribution at each data point and sums up all the curves.

# define layout
layout <- "
AB
CD
EF
"
# Use color and a frequency polygon
ggplot(diamonds, aes(depth)) + 
  geom_freqpoly(aes(color = cut), 
                binwidth = 0.1, na.rm = TRUE) +
  theme(legend.position = "none") +
# Use a "conditional density plot"
ggplot(diamonds, aes(depth)) + 
  geom_histogram(aes(fill = cut), 
                 binwidth = 0.1, position = "fill",
                 na.rm = TRUE) +
  theme(legend.position = "none") +
# density plot
ggplot(diamonds, aes(depth)) +
  geom_density(na.rm = TRUE) + 
  theme(legend.position = "none") +
# adjust binwidth
ggplot(diamonds, aes(depth)) + 
  geom_histogram(aes(fill = cut), 
                 binwidth = 0.2, position = "fill",
                 na.rm = TRUE) +
  theme(legend.position = "none") +
# adjust = 1/2 means use half of the default bandwidth (more rough)
ggplot(diamonds, aes(depth)) +
  geom_density(na.rm = TRUE, adjust = 0.5) + 
  theme(legend.position = "none") +
# both fill and color can separate groups
ggplot(diamonds, aes(depth, fill = cut)) +
  geom_density(alpha = 0.2, na.rm = TRUE) + 
  theme(legend.position = "none") +
# apply layout and set common x axis
plot_layout(design = layout) & 
  scale_x_continuous(limits = c(58, 68))

3.4 Dealing with overplotting

When the data is large, points on a scatterplot will overlap. This problem is called overplotting, which can be solved by a number of ways:

  • Making points smaller (geom_point(shape = ".")) or using hollow glyphs (geom_point(shape = 1)).
  • Making points transparent (geom_point(alpha = 1 / 10)).
  • Alleviate some overlaps with geom_jitter(), and adjust points by width and height arguments.
  • Breaking the plot into many small squares or hexagons.
df <- data.frame(x = rnorm(2000), 
                 y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + 
  xlab(NULL) + ylab(NULL)
norm + geom_bin2d() |
# adjust bins
norm + geom_hex(bins = 10)

  • Adding data summaries to help guide the eye to the true shape of the pattern within the data

3.5 Statistical summaries

Some summary functions like stat_summary() can produce y, ymin and ymax aesthetics.

# show number of observations in each bin
ggplot(diamonds, aes(color)) + 
  geom_bar() |
# show average price
ggplot(diamonds, aes(color, price)) + 
  geom_bar(stat = "summary_bin", fun = mean) +
  coord_cartesian(ylim=c(2000,6000)) |
# show average price with se
ggplot(diamonds, aes(color, price)) + 
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_se, 
               geom = "errorbar",width = 0.1) +
  coord_cartesian(ylim=c(2000,6000)) 

3.6 Surfaces

The ggplot2 package does not support true 3d surfaces, but it does support many common tools for summarizing 3d surfaces in 2d.

small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
# contour
# the .. notation (..level..)
# refers to a variable computed internally 
ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_contour(aes(z = density, color = ..level..)) |
# raster
ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_raster(aes(fill = density)) |
# Bubble plots work better with fewer observations
ggplot(small, aes(eruptions, waiting)) + 
  geom_point(aes(size = density), alpha = 1/3) + 
  scale_size_area()

There is another good example for geom_tile() in this website: Top 50 ggplot2 Visualizations.

var <- mpg$class  # the categorical data 
## Prep data (nothing to change here)
nrows <- 10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(var) * ((nrows*nrows)/(length(var))))

df$category <- factor(rep(names(categ_table), categ_table))  
# NOTE: if sum(categ_table) is not 100 (i.e. nrows^2), 
# it will need adjustment to make the sum to 100.

## Plot
ggplot(df, aes(x = x, y = y, fill = category)) + 
  # set border color and size
  geom_tile(color = "black", size = 0.5) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
  scale_fill_brewer(palette = "Set3") +
  labs(title="Waffle Chart", subtitle="'Class' of vehicles",
       caption="Source: mpg") + 
  theme(panel.border = element_rect(size = 1, fill = NA),
        plot.title = element_text(size = rel(1.2)),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.line=element_blank(),
        legend.title = element_blank(),
        legend.position = "right")

4 Annotations

Conceptually, an annotation supplies metadata for the plot: that is, it provides additional information about the data being displayed.

4.1 Plot and axis titles

The labs() helper function lets you to set the various titles using name-value pairs like title = "My plot title". Mathematical expressions wrapped in quote(). The rules by which these expressions are interpreted can be found by typing ?plotmath. To enable markdown you need to set the relevant theme element to ggtext::element_markdown().

base <- ggplot() + xlim(-2*pi, 2*pi)
base + 
  geom_function(fun = rlang::as_function(~ sin(.x^3)))+
  labs(x = "Axis title with *italics* and **boldface**",
       y = quote( f(x) == sin(x^3) ))+
  theme(axis.title.x = ggtext::element_markdown())

There are two ways to remove the axis label. Setting labs(x = "") omits the label but still allocates space; setting labs(x = NULL) removes the label and its space.

4.2 Text labels

The function geom_text() adds label text at the specified x and y positions. It has the most aesthetics of any geom, because there are so many ways to control the appearance of a text:

  • family aesthetic provides the name of a font.
  • fontface aesthetic specifies the face, which can be “plain” (the default), “bold” or “italic”.
  • hjust (“left”, “center”, “right”, “inward”, “outward”) and vjust (“bottom”, “middle”, “top”, “inward”, “outward”) aesthetics controls alignment of the text.
  • size aesthetic specifies the font size in millimeters (There are 72.27 pts in a inch, so to convert from points to mm, just multiply by 72.27 / 25.4).
  • angle controls the rotation of the text in degrees.

In addition to the various aesthetics, geom_text() has three parameters:

  • nudge_x and nudge_y are useful to offset the text label.
  • If check_overlap = TRUE, overlapping labels will be automatically removed from the plot.

Another function geom_label() draws a rounded rectangle behind the text.

4.3 Building custom annotations

One useful way to annotate this plot is to use shading to indicate which president was in power at the time. The annotate() helper function which creates the data frame.

presidential <- subset(presidential, 
                       start > economics$date[1])
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(
  strwrap("Unemployment rates in the US have 
  varied a lot over the years", 40), 
  collapse = "\n")

ggplot(economics) + 
  # use geom_rect() to introduce shading
  geom_rect(
    aes(xmin = start, xmax = end, fill = party), 
    ymin = -Inf, ymax = Inf, alpha = 0.2, 
    data = presidential
  ) + 
  # use geom_vline() to introduce separators
  geom_vline(
    aes(xintercept = as.numeric(start)), 
    data = presidential,
    color = "grey50", alpha = 0.5
  ) + 
  # use geom_text to add labels
  geom_text(
    aes(x = start, y = 2500, label = name), 
    data = presidential, 
    size = 3, vjust = 0, hjust = 0, nudge_x = 50
  ) + 
  # use geom_line() to overlay the data on top of 
  # these background elements
  geom_line(aes(date, unemploy)) + 
  scale_fill_manual(values = c("blue", "red")) +
  annotate(geom = "text", 
           x = xrng[1], y = yrng[2], 
           label = caption,
           hjust = 0, vjust = 1, size = 4)+
  # change legend position
  theme(legend.position = "bottom")+
  xlab("date") + 
  ylab("unemployment")

Highlight a subset of points can be achieved by drawing larger points in a different color underneath the main data set. geom_curve() and geom_segment() can be used to draw curves and lines connecting points with labels.

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(
    data = filter(mpg, manufacturer == "subaru"), 
    color = "orange",
    size = 3) +
  geom_point() 

p + annotate(
  geom = "curve", x = 4, y = 35, xend = 2.65, yend = 27, 
  curvature = .3, arrow = arrow(length = unit(2, "mm"))) +
  annotate(geom = "text", x = 4.1, y = 35, label = "subaru", 
           hjust = "left")

5 Arranging plots

As shown before, the patchwork package can arrange plots flexibly.

  • + does not specify any specific layout, only that the plots should be displayed together.
  • layout can be controlled by plot_layout(), like plot_layout(ncol = 2).
  • operators / and | can facilitate generating columns and rows.
  • operators can be grouped by parentheses (e.g. p3 | (p2 / (p1 | p4))).
  • complex layouts can be specified in the design argument in plot_layout().
  • guides can be combined and shown in a particular region (e.g. p1 + p2 + p3 + guide_area() + plot_layout(ncol = 2, guides = "collect")).
  • & can change theme and axis for all the graphics.
  • tags can be nested by setting tag_level in plot_layout().
p123[[2]] <- p123[[2]] + plot_layout(tag_level = "new")
p123 + plot_annotation(tag_levels = c("I", "a"))
  • plots can be shown on top of each other using inset_element().
p24 <- p2 / p4 + plot_layout(guides = "collect")
p1 + 
  inset_element(
    p24, 
    left = 0.4, 
    bottom = 0.4, 
    right = unit(1, "npc") - unit(15, "mm"), 
    top = unit(1, "npc") - unit(15, "mm"),
    align_to = "full"
  )