Advanced Data Visualization in R (2024)

UC Berkeley Biostatistics seminar

Sara E. Moore
24 February 2015



You have:

  • a working knowledge of R,
  • some familiarity with the usage of ggplot2 (such as what was presented during the 2013 or 2014 UC Berkeley SCF/D-Lab R Bootcamp),
  • an interest in creating data visualizations in R, both static (mostly using ggplot2) and interactive (using a variety of packages).

Why ggplot?

  • It's pretty.
  • Its commands are intuitive and "human-readable."
  • Nearly any graphic can be created, so you can use it for everything and maintain a consistent style.
  • It has (sort of) built-in support for maps.

Why not ggplot?

  • It's slow.
  • It won't do some things.
  • There's a steep learning curve.
  • Is lattice better at trellis graphs? Arguably not: ggplot2's faceting is just as powerful.

Tidy data [1]

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
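
As a small illustration of these three rules, the sketch below reshapes a wide table into tidy (long) form with tidyr::gather, the same function used later in these slides. The data frame shots.wide and its values are made up for this example; they are not part of the NCAA data.

```r
library(tidyr)

# hypothetical wide data: one row per game, one column per shot type
shots.wide = data.frame(
  game   = c(1, 2),
  fg.pct = c(0.48, 0.52),
  ft.pct = c(0.71, 0.80)
)

# tidy (long) form: each row is one (game, shot type) observation,
# and each variable (game, shot.type, pct) has its own column
shots.tidy = gather(shots.wide, shot.type, pct, -game)
shots.tidy
```

In the tidy version, shot.type can be mapped directly to an aesthetic such as color, which is not possible while the shot types are spread across column names.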

The (Layered) Grammar of Graphics [2; 3; 4]

  • Move away from using "names" and "chart typologies."
  • Instead, use "statements" constructed via a grammar.
  • Why?
    • An infinite number of unique graphics can be created.
    • The implementation is DRY ("don't repeat yourself") not WET ("write everything twice" or "we enjoy typing").

"Good grammar is just the first step in creating a good sentence."

Wickham, 2010

Components of the Grammar

Specify a statistical graphic using components of statements:

  • Data (data),
  • Statistical transformations (stat: identity, count, mean, etc.),
  • Geometric elements/objects (geom: points, lines, etc.),
  • Aesthetic mappings (aes: color, shape, size, transparency, etc.),
  • Coordinate systems (coord: cartesian, polar, map, etc.),
  • Guides/Scales and transformations thereof (scale, guide, and others: axes, log-transformed scales, legends, etc.),
  • Faceting/conditioning/latticing/trellising (facet),
  • Tweaking graphical positioning and visual elements (position, theme, etc.), and
  • Layering.
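
To make the list above concrete, here is a minimal sketch that labels each component in a single plot. It uses mtcars (which ships with R) as a stand-in data set, and the object name p.grammar is only illustrative.

```r
library(ggplot2)

p.grammar = ggplot(data = mtcars,                         # data
                   aes(x = wt, y = mpg,                   # aesthetic mappings
                       color = factor(cyl))) +
  geom_point() +                                          # geom (default stat: identity)
  geom_smooth(aes(group = 1), method = "lm",              # second layer: one fitted
              formula = y ~ x, se = FALSE) +              # line per panel (stat: smooth)
  scale_x_log10("Weight (1000 lbs, log scale)") +         # scale transformation
  scale_color_discrete("Cylinders") +                     # legend (guide) title
  coord_cartesian() +                                     # coordinate system
  facet_wrap(~ am) +                                      # faceting
  theme_classic(base_size = 16)                           # visual tweaks
p.grammar
```

Each line can be changed or removed independently of the others, which is the practical benefit of building a graphic from grammatical components rather than picking a named chart type.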

The anatomy of a ggplot command

  • All arguments to the main, initial function call, ggplot, set graph defaults.
  • These defaults can be changed for an individual element (even data).
ggplot(data=, aes(x=,y=,...)) + geom_????(...) + ...
ggplot() + geom_????(data=, aes(x=,y=,...),...) + ...
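
A minimal sketch of the two templates, again using mtcars as a stand-in data set (the object names p.a, p.b, and p.c are only illustrative):

```r
library(ggplot2)

# form 1: data and aesthetics set once, as plot-wide defaults
p.a = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# form 2: nothing set globally; the layer supplies its own data and aes
p.b = ggplot() +
  geom_point(data = mtcars, aes(x = wt, y = mpg))

# overriding a default for a single layer: the second layer inherits
# the x and y mappings but draws only the 4-cylinder cars, in red
p.c = p.a +
  geom_point(data = subset(mtcars, cyl == 4), color = "red", size = 3)
p.c
```

p.a and p.b draw the same points. The first form is usually more convenient when several layers share the same data and mappings.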

The data

source(paste("assets", "load_marchmania2015.R", sep=.Platform[["file.sep"]]))
# change this path to the location of your CSVs downloaded from kaggle
# note that required packages will be installed automatically
ncaa.bball = load.clean.summ.ncaa("~/Dropbox/kaggle/marchmania2015")

A simple scatterplot

library(ggplot2)
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) +
  geom_point() +
  xlab("Points scored by losing team") +
  ylab("Points scored by winning team") +
  ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
  theme(text = element_text(size = 16))

Improving the scatterplot

library(munsell) # color system used by ggplot2
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) +
  # default is bins=c(30,30)
  # can use scalar when no. of bins for x and y are same.
  # here alternatively specifying binwidth:
  stat_binhex(binwidth=c(4, 4)) +
  # mnsl converts [hue lightness/color purity] to hex color codes.
  # scale_fill_gradient == scale_fill_continuous.
  scale_fill_gradient("Number of games", trans = "sqrt",
                      low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
  xlab("Points scored by losing team") +
  ylab("Points scored by winning team") +
  ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
  theme_classic(base_size = 16)

Improving the scatterplot

ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) +
  # default is bins=c(30,30)
  # can use scalar when no. of bins for x and y are same.
  # here alternatively specifying binwidth:
  geom_hex(stat = "binhex", binwidth=c(4,4)) +
  # scale_fill_gradient == scale_fill_continuous.
  # the mnsl fxn converts [hue lightness/color purity] to hex color codes.
  scale_fill_gradient("Number of games", trans = "sqrt",
                      low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
  xlab("Points scored by losing team") +
  ylab("Points scored by winning team") +
  ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
  theme_classic(base_size = 16)

When geoms transform

geom             | stat             | modifiable defaults
-----------------+------------------+--------------------
geom_boxplot()   | stat_boxplot()   | max length of whiskers (beyond hinges) = 1.5*IQR
geom_bar()       | stat_bin()       | 30 bins: binwidth = [range of x]/30
geom_histogram() | stat_bin()       | 30 bins: binwidth = [range of x]/30
geom_freqpoly()  | stat_bin()       | 30 bins: binwidth = [range of x]/30
geom_dotplot()   | stat_bindot()    | 30 bins: binwidth = [range of x]/30; "dotdensity" method
geom_bin2d()     | stat_bin2d()     | 30 bins for each of x and y
geom_hex()       | stat_binhex()    | 30 bins for each of x and y (calls hexbin::hexbin())
geom_density2d() | stat_density2d() | Gaussian kernel; bandwidths (x and y) set by Silverman's "rule of thumb"; 100 grid points for x and y (calls MASS::kde2d())
geom_density()   | stat_density()   | Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls stats::density())
geom_violin()    | stat_ydensity()  | Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls stats::density()); all violins have same area before trimming tails, tails are trimmed to [range of y]
geom_smooth()    | stat_smooth()    | if n<1000, stats::loess() with polynomial degree 2, \(\alpha=0.75\), etc.; else, mgcv::gam() with penalized cubic regression splines, etc.; 80 evaluation points
geom_quantile()  | stat_quantile()  | 3 quartiles; "br" method (modified Barrodale & Roberts method; calls quantreg::rq())
geom_contour()   | stat_contour()   | 10 pretty breakpoints covering [range of z]
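
The defaults in the table are "modifiable" in the sense that the corresponding arguments can be passed straight to the geom. A minimal sketch, using the built-in faithful and mtcars data sets as stand-ins (object names are illustrative):

```r
library(ggplot2)

# default: stat_bin chooses 30 bins and prints a message suggesting
# that you pick a binwidth yourself
p.default = ggplot(faithful, aes(x = waiting)) +
  geom_histogram()

# override: bins that are 5 minutes wide
p.bw5 = ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5)

# override: whiskers reach up to 3*IQR beyond the hinges instead of 1.5*IQR
p.box = ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(coef = 3)
```

Arguments such as binwidth and coef are not used by the geom itself; they are handed on to the stat that does the transformation.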

Other ggplot transformations

  • stat_ecdf: Empirical Cumulative Distribution Function.
  • stat_function: Superimpose a function.
  • stat_qq: Calculation for quantile-quantile plot.
  • stat_spoke: Convert angle and radius to xend and yend.
  • stat_sum: Sum unique values. Useful for overplotting on scatterplots.
  • stat_summary: Summarise y values at every unique x.
  • stat_summary_hex: Apply function for 2D hexagonal bins.
  • stat_summary2d: Apply function for 2D rectangular bins.
  • stat_unique: Remove duplicates.
  • stat_identity
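
A short sketch of three of these stats, using built-in data sets as stand-ins. It assumes ggplot2 3.3.0 or later, where stat_summary takes fun; older versions call this argument fun.y. Object names are illustrative.

```r
library(ggplot2)

# stat_summary: mean of y at every unique x, drawn as a large red point
p.summ = ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(alpha = 0.4) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 4)

# stat_ecdf: empirical cumulative distribution of a single variable
p.ecdf = ggplot(faithful, aes(x = waiting)) +
  stat_ecdf()

# stat_function: superimpose a function (here the standard normal density)
# over the x range given by the data
p.fun = ggplot(data.frame(x = c(-3, 3)), aes(x = x)) +
  stat_function(fun = dnorm)
```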

Dates, tidyr, and summaries with ggplot2

library(tidyr) # gather
duke.2014 = subset(ncaa.bball[["teams"]],
                   (season==2014)&("Duke"))[, c("date","fg.pct","fg3.pct","ft.pct","result")] %>%
  tidyr::gather(shot.type, pct, -c(date, result))
library(scales) # date_format
ggplot(duke.2014, aes(x=date, y=pct, color=shot.type)) +
  geom_line() +
  geom_line(stat = "hline", yintercept = "mean", linetype="dashed", alpha=0.65) +
  geom_rug(data=subset(duke.2014,(result=="Loss")&(shot.type=="fg.pct")),
           sides="b", color="grey20") +
  scale_x_datetime("Game date", labels = date_format("%b %Y")) +
  ylab("Proportion of shots made") +
  scale_color_discrete("Type of shot",
                       labels=c("two point field goal","three point field goal","free throw")) +
  ggtitle("Duke's per-game shot percentages, 2013-14 season:\nbottom ticks indicate losses; dashed lines are season averages") +
  theme_classic(base_size = 16) +
  theme(legend.position = "bottom")

A heatmap with ggplot2::geom_tile

library(RColorBrewer) # brewer.pal
library(grid) # unit
tourney.teams.2014 = as.character(
  unique(subset(ncaa.bball[["team.season.summ"]],
                (season==2014)&(max.tourney.round>="Sweet Sixteen"))$ = subset(ncaa.bball[["team.season.summ"]], = ggplot(hist.perf.teams.2014, aes(x=season,, fill=win.pct)) +
  geom_tile() +
  scale_fill_gradientn(
    "Proportion of regular season games won",
    colours = brewer.pal(9,"GnBu"),
    na.value="grey80",
    breaks=seq(0,1,0.25),
    guide = guide_colorbar(barwidth = 15, barheight = 1)) +
  scale_x_continuous("Season", expand = c(0, 0)) +
  scale_y_discrete("Team", expand = c(0, 0)) +
  ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
  theme_classic(base_size=16) +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1

Creating a dendrogram with ggdendro

library(ggdendro)
team.season.df = subset(ncaa.bball[["team.season.summ"]],[, c("season", "","win.pct")]
# use tidyr, but this time go long --> wide (spread)
team.season.mat = as.matrix(team.season.df %>% spread(, win.pct))
rownames(team.season.mat) = team.season.mat[,"season"]
team.season.mat = team.season.mat[ ,-which(colnames(team.season.mat)=="season")]
teams.hc = hclust(dist(t(team.season.mat)), "ave")
ggdendrogram(teams.hc, rotate = TRUE)

Simplifying the dendrogram

teams.dendro = as.dendrogram(teams.hc)
teams.ddata = dendro_data(teams.dendro)
p2 = ggplot(segment(teams.ddata)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  coord_flip() +
  theme_dendro() +
  # tweak these if the dendrogram doesn't line up:
  theme(plot.margin = unit(c(-7,0,-15,-20), "points"))
p2

Heatmap, reordered

# need to remove the extra factor levels
# and reorder according to the clustering
hist.perf.teams.2014$ = as.character( hist.perf.teams.2014$
# can do sort(unique(x)) here because they were originally
# in alphabetical order. just be sure the order of the
# dendrogram matches up with the new order of the heatmap.
hist.perf.teams.2014$ = factor( hist.perf.teams.2014$, sort(unique(hist.perf.teams.2014$[ order.dendrogram(teams.dendro)], ordered=TRUE)
p1 = ggplot(hist.perf.teams.2014, aes(x=season,, fill=win.pct)) +
  geom_tile() +
  scale_fill_gradientn(
    "Proportion of regular season games won",
    colours = brewer.pal(9,"GnBu")[3:9],
    na.value="grey80",
    breaks=seq(0,1,0.25),
    guide = guide_colorbar(barwidth = 15, barheight = 1)) +
  scale_x_continuous("Season", expand = c(0, 0)) +
  scale_y_discrete("Team", expand = c(0, 0)) +
  ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
  theme_classic(base_size=16) +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1
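
The reordering step can be tried in isolation on a toy example that needs only base R. The matrix win.mat and the team labels A to D below are made up for illustration; the idea is the same as above: take the leaf order from the dendrogram and use it as the factor level order, so that heatmap rows line up with dendrogram leaves.

```r
# hypothetical win proportions: rows are seasons, columns are teams
win.mat = cbind(A = c(0.90, 0.80, 0.85),
                B = c(0.20, 0.30, 0.25),
                C = c(0.88, 0.82, 0.80),
                D = c(0.25, 0.20, 0.30))

# cluster the teams (columns), as in the slides: average linkage
toy.hc = hclust(dist(t(win.mat)), "ave")
toy.dendro = as.dendrogram(toy.hc)

# leaf order of the dendrogram, as indices into the clustered items
leaf.order = order.dendrogram(toy.dendro)

# factor whose levels follow the dendrogram's leaf order
team = factor(colnames(win.mat), levels = colnames(win.mat)[leaf.order])
levels(team)
```

Teams with similar records (A and C, B and D) end up next to each other in the level order, which is what makes the reordered heatmap easier to read.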

Putting it all together

library(gtable)
g1 = gtable_add_cols(ggplotGrob(p1), unit(4,"cm"))
# may need to adjust "t" and "b" if you don't add a ggtitle:
g = gtable_add_grob(g1, ggplotGrob(p2), t=3, l=ncol(g1), b=4, r=ncol(g1))
grid.newpage()
grid.draw(g)

Packages that pair well with ggplot2

  • grid
  • gridExtra: additional functions to tweak/manipulate grid graphics
  • scales: additional functions to deal with the scale portions of the grammar of graphics
  • gtable: use to dismantle/hack underlying table of Grid Graphical Objects (grobs) that make up a ggplot
  • ggsubplot: embed smaller subplots within larger plots, all using ggplot2 graphics
  • dplyr (or plyr): manipulate data
  • tidyr (or reshape2): restructure data (esp. wide \(\leftrightarrow\) long)
  • lubridate: "makes working with dates fun instead of frustrating"
  • devtools: R package development tools (esp. ability to install packages from github rather than CRAN)

Where to go for help with ggplot2

Why interactive?

  • They're pretty, fun, and people love them (see Hans Rosling's TED talks).
  • They allow you to engage with, explore, and discover more about your data -- visually.
  • Static graphics are "dead" (according to The Economist).

SVG ggplot with plotly

# start with a static ggplot
p = ggplot(subset(ncaa.bball[["team.season.summ"]], max.tourney.round>="Sweet Sixteen"),
           aes(x=season, y=win.pct, color=max.tourney.round)) +
  # ideally we would jitter here, but plotly has trouble with this
  # geom_point(position = position_jitter(w = 0.4, h = 0.002)) +
  geom_point(size=3, alpha=0.7) + # instead use transparency
  xlab("Year") +
  ylab("Proportion of regular season games won") +
  scale_color_discrete("Highest tournament round achieved") +
  ggtitle("Regular season performance of 'Sweet Sixteen' teams 1985-2014, by season") +
  theme_classic(base_size = 16) +
  theme(legend.position = "bottom")
p

SVG ggplot with plotly

# library(devtools)
# install_github("ropensci/plotly")
library(plotly)
# get a account and get your api key here:
# plug it in with your username in the statement below.
# set_credentials_file("username", "xxxxxxxxxx")
py <- plotly()
# recall p is the object returned by ggplot
# = py$ggplotly(p) # in an R session, opens in = py$ggplotly(p, session="knitr") # embed in knitr document
# if you're embedding in a knitr document,
# be sure to also set the code chunk
# plotly=TRUE

SVG graphic with clickme

# library(devtools)
# install_github("nachocab/clickme")
library(clickme)
cmplot = with(subset(ncaa.bball[["team.season.summ"]], max.tourney.round>="Sweet Sixteen"),
              clickme("points", x = season, y = win.pct, names =,
                      color_groups = as.character(max.tourney.round),
                      x_title = "Year", x_format = "",
                      y_title = "Proportion of regular season games won",
                      color_title = "Highest tournament round achieved",
                      color_group_order = levels(max.tourney.round)[4:8],
                      title = "Regular season performance of 'Sweet 16' teams",
                      subtitle = "1985-2014, by season",
                      file_path = paste(getwd(),"clickme0.html", sep=.Platform$file.sep),
                      height = 600, width = 700))
# cmplot # in an R session, open in browser
# embed in knitr document:
cmplot$iframe()$hide()

MotionChart with googleVis (Flash)

library(googleVis)
# data.frame with >=4 cols: x, y, id, time. color and size optional,
# but if you don't provide them,
# it will choose them for you (if there are columns left to use)
mc = gvisMotionChart(
  subset(ncaa.bball[["team.season.summ"]], ![,c("", "season", "win.pct", "points.avg", "mov.avg", "tourney.seed")],
  idvar="", timevar="season",
  xvar="win.pct", yvar="tourney.seed",
  sizevar="mov.avg", colorvar="points.avg",
  options=list(width=750, height=650))
# plot(mc) # in an R session, opens in browser
print(mc, 'chart') # embed in knitr document

Many interactive options in R

  • plotly: ggplot2 graphics \(\rightarrow\) SVG via's R API
  • googleVis: R interface to Google Charts API; SVG and Flash
  • rCharts: SVG graphics with popular JS libs, directly from R
  • ggvis: SVG and HTML5 Canvas graphics, rendered using vega, declared in a "grammar of graphics" style similar to ggplot2
  • gridSVG: ggplot2 and lattice graphics \(\rightarrow\) SVG image
  • clickme: interactive SVG graphics from R
  • rMaps: interactive maps with popular JS libs, directly from R
  • networkD3: d3.js network graphs from R (SVG)
  • rgl: interactive 3D visualizations using OpenGL and other frameworks/formats
  • rggobi: R interface to GGobi, a "data visualization system" separate from R
  • SVGAnnotation: used for "post-processing SVG plots created in R"


