UC Berkeley Biostatistics seminar
Sara E. Moore
24 February 2015
Intro
You have:
- a working knowledge of R,
- some familiarity with the usage of ggplot2 (such as what was presented during the 2013 or 2014 UC Berkeley SCF/D-Lab R Bootcamp),
- an interest in creating data visualizations in R, both static (mostly using ggplot2) and interactive (using a variety of packages).
Why ggplot?
- It's pretty.
- Its commands are intuitive and "human-readable."
- Nearly any graphic can be created, so you can use it for everything and maintain a consistent style.
- It has (sort of) built-in support for maps.
Why not ggplot?
- It's slow.
- It won't do some things.
- There's a steep learning curve.
- lattice is better at trellis graphs? Faceting is just as powerful.
Tidy data [4]
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
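As a minimal sketch of these rules (the table and its values below are hypothetical, not taken from the seminar data), a "wide" table with one column per season can be made tidy with tidyr::gather, the same function used later in these slides:

```r
library(tidyr)

# Hypothetical "wide" table: one row per team, one column per season.
wide <- data.frame(team = c("A", "B"),
                   `2013` = c(0.70, 0.55),
                   `2014` = c(0.80, 0.60),
                   check.names = FALSE)

# Tidy form: each variable (team, season, win.pct) is a column,
# and each team-season observation is a row.
tidy <- gather(wide, season, win.pct, -team)
tidy
```

Each row of `tidy` is now a single observation, which is the shape that aes() mappings expect.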
The (Layered) Grammar of Graphics [1; 2; 3]
- Move away from using "names" and "chart typologies."
- Instead, use "statements" constructed via grammar.
- Why?
  - An infinite number of unique graphics can be created.
  - The implementation is DRY ("don't repeat yourself"), not WET ("write everything twice" or "we enjoy typing").
"Good grammar is just the first step in creating a good sentence."
Wickham, 2010
Components of the Grammar
Specify a statistical graphic using components of statements:
- Data (data),
- Statistical transformations (stat: identity, count, mean, etc.),
- Geometric elements/objects (geom: points, lines, etc.),
- Aesthetic mappings (aes: color, shape, size, transparency, etc.),
- Coordinate systems (coord: cartesian, polar, map, etc.),
- Guides/Scales and transformations thereof (scale, guide, and others: axes, log-transformed scales, legends, etc.),
- Faceting/conditioning/latticing/trellising (facet),
- Tweaking graphical positioning and visual elements (position, theme, etc.), and
- Layering.
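To make the components concrete, here is one possible statement that touches most of them. It is only a sketch: it uses the built-in mtcars data rather than the seminar's basketball data, and the object name demo.components is made up for illustration.

```r
library(ggplot2)

demo.components <- ggplot(data = mtcars,                       # data
                          aes(x = wt, y = mpg,                 # aesthetic mappings
                              color = factor(cyl))) +
  geom_point(position = "jitter") +                            # geom + position
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # stat (added as a second layer)
  scale_y_log10("Miles per gallon (log scale)") +              # scale
  coord_cartesian() +                                          # coord
  facet_wrap(~ am) +                                           # facet
  theme_classic()                                              # theme
demo.components
```

Dropping or swapping any one line changes a single aspect of the graphic while leaving the rest of the statement intact.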
The anatomy of a ggplot command
- All arguments to the main, initial function call, ggplot, set graph defaults.
- These defaults can be changed for an individual element (even data).
ggplot(data=, aes(x=,y=,...)) + geom_????(...) + ...
ggplot() + geom_????(data=, aes(x=,y=,...),...) + ...
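A small sketch of the two templates above, using the built-in mtcars data (an assumption for illustration; the names demo.a, demo.b, demo.c, and heavy are hypothetical). The first two plots draw the same points; the third overrides the default data for a single layer.

```r
library(ggplot2)

# Form 1: data and aesthetics set as defaults in ggplot()
demo.a <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()

# Form 2: no defaults; everything is supplied to the layer itself
demo.b <- ggplot() + geom_point(data = mtcars, aes(x = wt, y = mpg))

# Override the default data for one layer only:
# highlight the heaviest cars on top of the full scatterplot.
heavy <- subset(mtcars, wt > 5)
demo.c <- demo.a + geom_point(data = heavy, color = "red", size = 3)
demo.c
```

Form 1 is usually less typing when several layers share the same data and mappings; form 2 is convenient when each layer draws from a different data frame.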
The data
source(paste("assets", "load_marchmania2015.R", sep=.Platform[["file.sep"]]))
# change this path to the location of your CSVs downloaded from kaggle
# note that required packages will be installed automatically
ncaa.bball = load.clean.summ.ncaa("~/Dropbox/kaggle/marchmania2015")
A simple scatterplot
library(ggplot2)
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) +
  geom_point() +
  xlab("Points scored by losing team") +
  ylab("Points scored by winning team") +
  ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
  theme(text = element_text(size = 16))
Improving the scatterplot
library(munsell) # color system used by ggplot2
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) +
  # default is bins=c(30,30)
  # can use scalar when no. of bins for x and y are same.
  # here alternatively specifying binwidth:
  stat_binhex(binwidth=c(4, 4)) +
  # mnsl converts [hue lightness/color purity] to hex color codes.
  # scale_fill_gradient == scale_fill_continuous.
  scale_fill_gradient("Number of games", trans = "sqrt",
                      low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
  xlab("Points scored by losing team") +
  ylab("Points scored by winning team") +
  ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
  theme_classic(base_size = 16)
Improving the scatterplot
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) +
  # default is bins=c(30,30)
  # can use scalar when no. of bins for x and y are same.
  # here alternatively specifying binwidth:
  geom_hex(stat = "binhex", binwidth=c(4,4)) +
  # scale_fill_gradient == scale_fill_continuous.
  # the mnsl fxn converts [hue lightness/color purity] to hex color codes.
  scale_fill_gradient("Number of games", trans = "sqrt",
                      low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
  xlab("Points scored by losing team") +
  ylab("Points scored by winning team") +
  ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
  theme_classic(base_size = 16)
When geoms transform
geom | stat | modifiable defaults |
---|---|---|
geom_boxplot() | stat_boxplot() | max length of whiskers (beyond hinges) = 1.5*IQR |
geom_bar() | stat_bin() | 30 bins: binwidth = [range of x]/30 |
geom_histogram() | stat_bin() | 30 bins: binwidth = [range of x]/30 |
geom_freqpoly() | stat_bin() | 30 bins: binwidth = [range of x]/30 |
geom_dotplot() | stat_bindot() | 30 bins: binwidth = [range of x]/30; "dotdensity" method |
geom_bin2d() | stat_bin2d() | 30 bins for each of x and y |
geom_hex() | stat_binhex() | 30 bins for each of x and y (calls hexbin::hexbin() ) |
geom_density2d() | stat_density2d() | Gaussian kernel; bandwidths (x and y) set by Silverman's "rule of thumb"; 100 grid points for x and y (calls MASS::kde2d() ) |
geom_density() | stat_density() | Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls stats::density() ) |
geom_violin() | stat_ydensity() | Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls stats::density() ); all violins have same area before trimming tails, tails are trimmed to [range of y] |
geom_smooth() | stat_smooth() | if n<1000, stats::loess() with polynomial degree 2, \(\alpha=0.75\), etc.; else, mgcv::gam() with penalized cubic regression splines, etc.; 80 evaluation points |
geom_quantile() | stat_quantile() | 3 quartiles; "br" method (modified Barrodale & Roberts method; calls quantreg::rq() ) |
geom_contour() | stat_contour() | 10 pretty breakpoints covering [range of z] |
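The geom/stat pairing in the table can be checked directly. The sketch below (built-in mtcars data; the names demo.geom and demo.stat are hypothetical) draws the same histogram once through the geom and once through its stat, overriding the default of 30 bins with an explicit binwidth in both cases.

```r
library(ggplot2)

# Via the geom: geom_histogram() uses stat_bin() behind the scenes.
demo.geom <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)

# Via the stat: stat_bin() drawn with a bar geom.
demo.stat <- ggplot(mtcars, aes(x = mpg)) +
  stat_bin(binwidth = 2, geom = "bar")

# The computed bin counts agree between the two forms.
all.equal(layer_data(demo.geom)$count, layer_data(demo.stat)$count)
```

layer_data() returns the data frame a layer actually draws, which is a convenient way to inspect what a stat computed.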
Other ggplot transformations
- stat_ecdf: Empirical cumulative distribution function.
- stat_function: Superimpose a function.
- stat_qq: Calculation for quantile-quantile plot.
- stat_spoke: Convert angle and radius to xend and yend.
- stat_sum: Sum unique values. Useful for overplotting on scatterplots.
- stat_summary: Summarise y values at every unique x.
- stat_summary_hex: Apply function for 2D hexagonal bins.
- stat_summary2d: Apply function for 2D rectangular bins.
- stat_unique: Remove duplicates.
- stat_identity: Leave the data unchanged.
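Two of these stats in action, as a sketch on the built-in mtcars data (the object names are hypothetical). Note one assumption: the `fun` argument of stat_summary is the spelling used by recent ggplot2 releases; older releases, including the one current when these slides were written, used `fun.y` instead.

```r
library(ggplot2)

# stat_ecdf: empirical cumulative distribution of mpg
demo.ecdf <- ggplot(mtcars, aes(x = mpg)) +
  stat_ecdf()

# stat_summary: overlay the mean of y at every unique x
demo.summ <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 0.4) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 4)
demo.summ
```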
Dates, tidyr, and summaries with ggplot2
library(tidyr) # gather
duke.2014 = subset(ncaa.bball[["teams"]],
                   (season==2014)&(team.name=="Duke"))[,
                   c("date","fg.pct","fg3.pct","ft.pct","result")] %>%
  tidyr::gather(shot.type, pct, -c(date, result))
library(scales) # date_format
ggplot(duke.2014, aes(x=date, y=pct, color=shot.type)) +
  geom_line() +
  geom_line(stat = "hline", yintercept = "mean", linetype="dashed", alpha=0.65) +
  geom_rug(data=subset(duke.2014,(result=="Loss")&(shot.type=="fg.pct")),
           sides="b", color="grey20") +
  scale_x_datetime("Game date", labels = date_format("%b %Y")) +
  ylab("Proportion of shots made") +
  scale_color_discrete("Type of shot",
                       labels=c("two point field goal","three point field goal","free throw")) +
  ggtitle("Duke's per-game shot percentages, 2013-14 season:\nbottom ticks indicate losses; dashed lines are season averages") +
  theme_classic(base_size = 16) +
  theme(legend.position = "bottom")
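The dashed season-average lines above rely on stat = "hline", which recent ggplot2 releases may no longer accept. One alternative, sketched here under that assumption, is to compute the per-group means first and draw them with geom_hline. The data frame `shots` is a hypothetical stand-in for duke.2014, with made-up values.

```r
library(ggplot2)

# Hypothetical long-format data: 10 weekly games, two shot types.
set.seed(1)
shots <- data.frame(
  date = rep(as.Date("2013-11-08") + seq(0, 63, by = 7), times = 2),
  shot.type = rep(c("fg.pct", "ft.pct"), each = 10),
  pct = c(runif(10, 0.40, 0.55), runif(10, 0.65, 0.85))
)

# One season average per shot type.
shot.means <- aggregate(pct ~ shot.type, data = shots, FUN = mean)

demo.means <- ggplot(shots, aes(x = date, y = pct, color = shot.type)) +
  geom_line() +
  # geom_hline does not inherit the plot's aesthetics,
  # so color is mapped again inside this layer.
  geom_hline(data = shot.means,
             aes(yintercept = pct, color = shot.type),
             linetype = "dashed", alpha = 0.65)
demo.means
```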
A heatmap with ggplot2::geom_tile
library(RColorBrewer) # brewer.pal
library(grid) # unit
tourney.teams.2014 = as.character(
  unique(subset(ncaa.bball[["team.season.summ"]],
                (season==2014)&(max.tourney.round>="Sweet Sixteen"))$team.name))
hist.perf.teams.2014 = subset(ncaa.bball[["team.season.summ"]],
                              team.name%in%tourney.teams.2014)
p1 = ggplot(hist.perf.teams.2014, aes(x=season, y=team.name, fill=win.pct)) +
  geom_tile() +
  scale_fill_gradientn(
    "Proportion of regular season games won",
    colours = brewer.pal(9,"GnBu"), na.value="grey80",
    breaks=seq(0,1,0.25),
    guide = guide_colorbar(barwidth = 15, barheight = 1)) +
  scale_x_continuous("Season", expand = c(0, 0)) +
  scale_y_discrete("Team", expand = c(0, 0)) +
  ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
  theme_classic(base_size=16) +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1
Creating a dendrogram with ggdendro
library(ggdendro)
team.season.df = subset(ncaa.bball[["team.season.summ"]],
                        team.name%in%tourney.teams.2014)[,
                        c("season", "team.name","win.pct")]
# use tidyr, but this time go long --> wide (spread)
team.season.mat = as.matrix(team.season.df %>% spread(team.name, win.pct))
rownames(team.season.mat) = team.season.mat[,"season"]
team.season.mat = team.season.mat[ ,-which(colnames(team.season.mat)=="season")]
teams.hc = hclust(dist(t(team.season.mat)), "ave")
ggdendrogram(teams.hc, rotate = TRUE)
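The long-to-wide-to-clustering pipeline is easier to follow on a toy example. The sketch below uses a hypothetical three-team table with made-up winning percentages (not the seminar data); teams A and B have similar histories, so they should be merged first.

```r
library(tidyr)

# Hypothetical long table: one row per team-season.
long <- data.frame(
  season = rep(2012:2014, times = 3),
  team.name = rep(c("A", "B", "C"), each = 3),
  win.pct = c(0.80, 0.82, 0.79,   # team A
              0.78, 0.81, 0.80,   # team B
              0.40, 0.35, 0.45)   # team C
)

# long --> wide: one row per season, one column per team
wide <- spread(long, team.name, win.pct)
mat <- as.matrix(wide[, -1])
rownames(mat) <- wide$season

# Transpose so that teams (not seasons) are the items being clustered.
hc <- hclust(dist(t(mat)), "average")
hc$labels[hc$order]
```

The transpose matters: dist() computes distances between rows, so without t() the seasons would be clustered instead of the teams.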
Simplifying the dendrogram
teams.dendro = as.dendrogram(teams.hc)
teams.ddata = dendro_data(teams.dendro)
p2 = ggplot(segment(teams.ddata)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  coord_flip() +
  theme_dendro() +
  # tweak these if the dendrogram doesn't line up:
  theme(plot.margin = unit(c(-7,0,-15,-20), "points"))
p2
Heatmap, reordered
# need to remove the extra factor levels
# and reorder according to the clustering
hist.perf.teams.2014$team.name = as.character(
  hist.perf.teams.2014$team.name)
# can do sort(unique(x)) here because they were originally
# in alphabetical order. just be sure the order of the
# dendrogram matches up with the new order of the heatmap.
hist.perf.teams.2014$team.name = factor(
  hist.perf.teams.2014$team.name,
  sort(unique(hist.perf.teams.2014$team.name))[
    order.dendrogram(teams.dendro)],
  ordered=TRUE)
p1 = ggplot(hist.perf.teams.2014, aes(x=season, y=team.name, fill=win.pct)) +
  geom_tile() +
  scale_fill_gradientn(
    "Proportion of regular season games won",
    colours = brewer.pal(9,"GnBu")[3:9], na.value="grey80",
    breaks=seq(0,1,0.25),
    guide = guide_colorbar(barwidth = 15, barheight = 1)) +
  scale_x_continuous("Season", expand = c(0, 0)) +
  scale_y_discrete("Team", expand = c(0, 0)) +
  ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
  theme_classic(base_size=16) +
  theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1
Putting it all together
library(gtable)
g1 = gtable_add_cols(ggplotGrob(p1), unit(4,"cm"))
# may need to adjust "t" and "b" if you don't add a ggtitle:
g = gtable_add_grob(g1, ggplotGrob(p2), t=3, l=ncol(g1), b=4, r=ncol(g1))
grid.newpage()
grid.draw(g)
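The same gtable technique can be tried without the basketball data. This is a rough sketch on mtcars with hypothetical object names; unlike the slide above, the side plot here spans every row of the layout rather than only the panel rows, so no "t"/"b" tuning is needed, at the cost of the two panels not being aligned.

```r
library(ggplot2)
library(grid)
library(gtable)

# Main plot and a side plot, each converted to a table of grobs.
demo.main <- ggplotGrob(ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point())
demo.side <- ggplotGrob(ggplot(mtcars, aes(x = mpg)) +
                          geom_histogram(binwidth = 2) +
                          coord_flip())

# Append an empty 4 cm column on the right of the main plot's layout...
demo.g <- gtable_add_cols(demo.main, unit(4, "cm"))
# ...and place the side plot in that new column, spanning all rows.
demo.g <- gtable_add_grob(demo.g, demo.side,
                          t = 1, l = ncol(demo.g),
                          b = nrow(demo.g), r = ncol(demo.g))
grid.newpage()
grid.draw(demo.g)
```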
Packages that pair well with ggplot2
- grid
- gridExtra: additional functions to tweak/manipulate grid graphics
- scales: additional functions to deal with the scale portions of the grammar of graphics
- gtable: use to dismantle/hack underlying table of Grid Graphical Objects (grobs) that make up a ggplot
- ggsubplot: embed smaller subplots within larger plots, all using ggplot2 graphics
- dplyr (or plyr): manipulate data
- tidyr (or reshape2): restructure data (esp. wide \(\leftrightarrow\) long)
- lubridate: "makes working with dates fun instead of frustrating"
- devtools: R package development tools (esp. ability to install packages from github rather than CRAN)
- ggthemes, xkcd: for when you don't want your ggplot to look so much like a ggplot
- extrafont, RColorBrewer, munsell, https://github.com/karthik/wesanderson : custom fonts and color schemes
- ggdendro: dendrograms and trees with ggplot2
- GGally: plot matrices (as in graphics::pairs), parallel coordinate plots, and network plots with ggplot2
- ggtern: ternary plots (as in vcd::ternaryplot) and other additional geoms for ggplot2
- ggmap, maps, maptools, sp, rgdal, RgoogleMaps: maps with ggplot2
- ggHorizon: horizon graphs with ggplot2 (example 1, example 2)
- \(\ldots\)
Where to go for help with ggplot2
Why interactive?
- They're pretty, fun, and people love them (see Hans Rosling's TED talks).
- They allow you to engage with, explore, and discover more about your data -- visually.
- Static graphics are "dead" (according to The Economist).
SVG ggplot with plotly
# start with a static ggplot
p = ggplot(subset(ncaa.bball[["team.season.summ"]],
                  max.tourney.round>="Sweet Sixteen"),
           aes(x=season, y=win.pct, color=max.tourney.round)) +
  # ideally we would jitter here, but plotly has trouble with this
  # geom_point(position = position_jitter(w = 0.4, h = 0.002)) +
  geom_point(size=3, alpha=0.7) + # instead use transparency
  xlab("Year") +
  ylab("Proportion of regular season games won") +
  scale_color_discrete("Highest tournament round achieved") +
  ggtitle("Regular season performance of 'Sweet Sixteen' teams 1985-2014, by season") +
  theme_classic(base_size = 16) +
  theme(legend.position = "bottom")
p
SVG ggplot with plotly
# library(devtools)
# install_github("ropensci/plotly")
library(plotly)
# get a plot.ly account and get your api key here:
# https://plot.ly/settings/api
# plug it in with your username in the statement below.
# set_credentials_file("username", "xxxxxxxxxx")
py <- plotly()
# recall p is the object returned by ggplot
# plot.info = py$ggplotly(p) # in an R session, opens in browser
plot.info = py$ggplotly(p, session="knitr") # embed in knitr document
# if you're embedding in a knitr document,
# be sure to also set the code chunk
# plotly=TRUE
SVG graphic with clickme
# library(devtools)
# install_github("nachocab/clickme")
library(clickme)
cmplot = with(subset(ncaa.bball[["team.season.summ"]],
                     max.tourney.round>="Sweet Sixteen"),
              clickme("points", x = season, y = win.pct,
                      names = team.name,
                      color_groups = as.character(max.tourney.round),
                      x_title = "Year", x_format = "",
                      y_title = "Proportion of regular season games won",
                      color_title = "Highest tournament round achieved",
                      color_group_order = levels(max.tourney.round)[4:8],
                      title = "Regular season performance of 'Sweet 16' teams",
                      subtitle = "1985-2014, by season",
                      file_path = paste(getwd(),"clickme0.html", sep=.Platform$file.sep),
                      height = 600, width = 700))
# cmplot # in an R session, open in browser
# embed in knitr document:
cmplot$iframe()$hide()
MotionChart with googleVis (Flash)
library(googleVis)
# data.frame with >=4 cols: x, y, id, time. color and size optional,
# but if you don't provide them,
# it will choose them for you (if there are columns left to use)
mc = gvisMotionChart(
  subset(ncaa.bball[["team.season.summ"]], !is.na(tourney.seed))[,
    c("team.name", "season", "win.pct", "points.avg", "mov.avg", "tourney.seed")],
  idvar="team.name", timevar="season",
  xvar="win.pct", yvar="tourney.seed",
  sizevar="mov.avg", colorvar="points.avg",
  options=list(width=750, height=650))
# plot(mc) # in an R session, opens in browser
print(mc, 'chart') # embed in knitr document
Many interactive options in R
- plotly: ggplot2 graphics \(\rightarrow\) SVG via plot.ly's R API
- googleVis: R interface to Google Charts API; SVG and Flash
- rCharts: SVG graphics with popular JS libs, directly from R
- ggvis: SVG and HTML5 Canvas graphics, rendered using vega, declared in a "grammar of graphics" style similar to ggplot2
- gridSVG: ggplot2 and lattice graphics \(\rightarrow\) SVG image
- clickme: interactive SVG graphics from R
- rMaps: interactive maps with popular JS libs, directly from R
- networkD3: d3.js network graphs from R (SVG)
- rgl: interactive 3D visualizations using OpenGL and other frameworks/formats (example)
- rggobi: R interface to GGobi, a "data visualization system" separate from R
- SVGAnnotation: used for "post-processing SVG plots created in R"
References
[1] L. Wilkinson. The Grammar of Graphics. Springer-Verlag New York, 2005. DOI: 10.1007/0-387-28695-0.
[2] H. Wickham. ggplot2. Springer Science + Business Media, 2009. DOI: 10.1007/978-0-387-98141-3.
[3] H. Wickham. "A Layered Grammar of Graphics". In: Journal of Computational and Graphical Statistics 19.1 (2010), pp. 3-28. DOI: 10.1198/jcgs.2009.07098.
[4] H. Wickham. "Tidy Data". In: Journal of Statistical Software 59.10 (2014).
Credits
Built using slidify with
- knitr,
- the deck.js framework,
- the RefManageR bibliography manager, and
- the prettify syntax highlighter.