The Rmarkdown for this document is https://github.com/rnabioco/bmsc-7810-pbda/blob/main/_posts/2023-12-05-class-5-intro-to-ggplot2/class-5-intro-to-ggplot2.Rmd
ggplot2 package homepage :: https://ggplot2.tidyverse.org/
ggplot2 reference :: https://ggplot2.tidyverse.org/reference
R for Data Science 2e :: https://r4ds.hadley.nz/
ggplot2 Book :: https://ggplot2-book.org/
Gallery of Plots and Examples :: https://r-graph-gallery.com/
Data Visualization with ggplot2 :: Cheat sheet :: https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
This package allows you to declaratively create graphics by mapping a set
of variables to aesthetics and then layering graphical directives to
produce a plot. It’s part of the tidyverse of R packages for data
science and analysis, sharing in their design philosophy. It’s an
alternative to the built-in R graphics and plotting functions.
ggplot2 was written by Hadley Wickham and is based on the Grammar of Graphics described by Leland Wilkinson (1945-2021).
A plot is built up in layers, which gives the code a logical command flow and makes it readable.
Plot = data + aesthetics + geometry
data = the dataset, typically a dataframe
aesthetics = map variables to the x and y axes
geometry = type of graphic or plot to be rendered
facets = multiple plots
statistics = add calculations
theme = make the plot pretty or follow a particular style
# ggplot(<DATA>, aes(<MAPPINGS>)) + <GEOM_function>()
?ggplot # bring up the ggplot function help
To begin plotting we need some data to visualize. Here we can use a
built-in dataset from the Motor Trend Car Road Tests called mtcars. This
dataset is a dataframe, which is the key data format for ggplot. We can
preview the data structure using the head() function.
#some built in data.
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg = Miles/(US) gallon
[, 2] cyl = Number of cylinders
[, 3] disp = Displacement (cu.in.)
[, 4] hp = Gross horsepower
[, 5] drat = Rear axle ratio
[, 6] wt = Weight (1000 lbs)
[, 7] qsec = 1/4 mile time
[, 8] vs = Engine (0 = V-shaped, 1 = straight)
[, 9] am = Transmission (0 = automatic, 1 = manual)
[,10] gear = Number of forward gears
[,11] carb = Number of carburetors
-R Documentation
Using the basic ggplot grammar of graphics template we can produce a scatterplot from the dataframe.
# ggplot(<DATA>, aes(<MAPPINGS>)) + <GEOM_function>()
The first part of the expression calls the ggplot function, which takes
the dataframe and the aes function, which defines the aesthetic
mappings. In this case we are mapping the x-axis to the wt variable and
the y-axis to the mpg variable. If you only evaluate the first part this
is what you get:
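A minimal sketch of that first part on its own, assuming ggplot2 is installed and loaded (the name p_empty is only illustrative). It should draw the axes for wt and mpg but no points, because no geometry layer has been added yet.

```r
library(ggplot2)

# Only the data and the aesthetic mappings: there is no geom layer yet,
# so the plot shows axes for wt and mpg but no data points.
p_empty <- ggplot(mtcars, aes(x = wt, y = mpg))
p_empty
```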
Next we have to add the geometry layer to be able to actually see the
data. Here we are adding the geom_point geometry which allows you to
visualize the data as points. You use a plus sign to add these
additional layers.
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()
We can change the data being plotted by picking a different column from the dataframe. For instance, here we are plotting horsepower (hp) versus miles per gallon (mpg). Also note that we can make the code more readable by placing subsequent layers on separate lines after the plus sign. A common error is misplacing the plus sign: it must trail at the end of the line, before the next layer.
ggplot(mtcars, aes(x=hp, y=mpg)) +
  geom_point()
Exercise: Try building a scatterplot on your own. This time plot the variables corresponding to the number of cylinders and the type of transmission.
Exercise: Modify the scatterplot to plot horsepower instead of the type of transmission. Can you start to see a relationship with the data?
We can add a title to the plot simply by adding another layer with the
ggtitle() function.
ggplot(mtcars, aes(x=hp, y=mpg)) +
  geom_point() +
  ggtitle("1974 Cars: Horsepower vs Miles Per Gallon")
We can overwrite the default labels and add our own to the x and y axes
by using the xlab() and ylab() functions respectively, or by setting them
together with the title, subtitle, and caption in a single labs() call.
ggplot(mtcars, aes(x=hp, y=mpg, alpha = 0.5)) +
  geom_point() +
  labs(x = "Horsepower",
       y = "Miles Per Gallon",
       title = "Horsepower vs Miles Per Gallon Scatterplot",
       subtitle = "Motor Trend Car Road Tests - 1974",
       caption = "Smith et al. 1974")
Notice that we also added an alpha aesthetic which helps us visualize
overlapping points. We can add a show.legend = FALSE argument to the
geom_point function to remove the alpha legend and clean up the plot
figure. Let’s try it. You can also specify a vector of aesthetics to
display. Check the documentation ?geom_point.
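One way this could look, as a sketch that reuses the hp versus mpg plot from above (the name p_no_legend is only illustrative):

```r
library(ggplot2)

# alpha is mapped inside aes(), so ggplot2 would normally draw a legend for it.
# show.legend = FALSE on the point layer suppresses that legend.
p_no_legend <- ggplot(mtcars, aes(x = hp, y = mpg, alpha = 0.5)) +
  geom_point(show.legend = FALSE) +
  labs(x = "Horsepower", y = "Miles Per Gallon")
p_no_legend
```

According to the geom_point help page, show.legend also accepts a named logical vector, for example show.legend = c(alpha = FALSE), to select individual aesthetics.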
We can easily add a third bit of information to the plot by using the color aesthetic. Each geometry has its own list of aesthetics that you can add and modify. Consult the help page for each one.
?geom_point() # bring up the help page for geom_point()
Here we are adding the color aesthetic.
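The code behind the original figure is not reproduced here. As a sketch, assuming the number of cylinders (cyl) is the variable mapped to color, it might look like this (p_color is an illustrative name):

```r
library(ggplot2)

# Map the number of cylinders to point color in addition to x and y.
# cyl is numeric, so ggplot2 uses a continuous color gradient by default.
p_color <- ggplot(mtcars, aes(x = hp, y = mpg, color = cyl)) +
  geom_point()
p_color
```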
And we can relabel the legend title for the new color aesthetic to make it more readable.
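A sketch of the relabeling, assuming the same cyl color mapping and using labs() to set the legend title (the label text "#cylinders" follows the one used later in this document; p_legend is an illustrative name):

```r
library(ggplot2)

# labs(color = ...) replaces the default legend title ("cyl")
# with a more readable one.
p_legend <- ggplot(mtcars, aes(x = hp, y = mpg, color = cyl)) +
  geom_point() +
  labs(color = "#cylinders")
p_legend
```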
You can continue to add even more information to the plot through additional aesthetics, though this might be a bit much.
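For illustration, a sketch that maps a fourth variable, the weight (wt), to point size. The choice of wt is an assumption based on the sentence that follows, and p_many is an illustrative name.

```r
library(ggplot2)

# Four variables in one plot: x, y, color and size.
# Point size now scales with the weight of each car.
p_many <- ggplot(mtcars, aes(x = hp, y = mpg, color = cyl, size = wt)) +
  geom_point() +
  labs(color = "#cylinders", size = "weight (1000 lbs)")
p_many
```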
Alternatively, we can use a specific value instead of the wt variable to set the size of the dots.
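A sketch of setting a constant size. The value 3 is an arbitrary choice for illustration, and p_fixed is an illustrative name. Note that the fixed value goes outside aes(), as an argument to the geom.

```r
library(ggplot2)

# size = 3 is set outside aes(), so every point gets the same size
# and no size legend is drawn.
p_fixed <- ggplot(mtcars, aes(x = hp, y = mpg, color = cyl)) +
  geom_point(size = 3)
p_fixed
```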
There are many other geometries that you can use in your plots.
https://ggplot2.tidyverse.org/reference
Here is a short list:
geom_point(): scatterplot
geom_line(): lines connecting points by increasing value of x
geom_path(): lines connecting points in sequence of appearance
geom_boxplot(): box and whiskers plot for categorical variables
geom_bar(): bar charts for categorical x axis
geom_col(): bar chart where heights of the bars represent values in the data
geom_histogram(): histogram for continuous x axis
geom_violin(): violin plot, a mirrored kernel density estimate of the data distribution
geom_smooth(): smoothed line fitted to the data
geom_bin2d(): heatmap of 2d bin counts
geom_contour(): 2d contours of a 3d surface
geom_count(): count overlapping points
geom_density(): smoothed density estimates
geom_dotplot(): dot plot
geom_hex(): hexagonal heatmap of 2d bin counts
geom_freqpoly(): histogram and frequency polygons
geom_jitter(): jittered point plot
geom_polygon(): polygons
Choosing the right plot to show your data efficiently is key. Here we swapped geom_point for geom_line to see what would happen. You could also try something like geom_bin2d().
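A sketch of the swap, assuming the same hp versus mpg mapping used earlier (p_line and p_bin are illustrative names). The line connects the cars in order of increasing hp, which does not describe this data very well.

```r
library(ggplot2)

# Same data and mappings as the scatterplot, but drawn as a line
# connecting the observations in order of increasing hp.
p_line <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_line()
p_line

# Alternative: a heatmap of 2d bin counts over the same variables.
p_bin <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_bin2d()
p_bin
```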
The geom_col() geometry is a type of bar plot that uses the heights of the bars to represent values in the data. Let’s look at plotting this type of data for the cars in this dataset.
?geom_col()
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Looking back at the data structure of mtcars, we see that the names of
the cars are stored as the row names of the data frame. We can access
this using the rownames() function and use it in subsequent plots.
Q: What was another way to address this issue, discussed in the first block?
rownames(mtcars)
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
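A minimal sketch of such a plot, assuming we pass rownames(mtcars) directly as the x mapping and plot mpg as the bar height (p_col is an illustrative name). With 32 long car names along the x axis, the labels overlap.

```r
library(ggplot2)

# One bar per car: the car name on the x axis, mpg as the bar height.
p_col <- ggplot(mtcars, aes(x = rownames(mtcars), y = mpg)) +
  geom_col()
p_col
```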
You will learn other ways to make this more legible later. For a quick fix we can swap the x and y mappings.
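A sketch of the swap, again assuming mpg is the value being plotted (p_col_flipped is an illustrative name). With the names on the y axis the bars run horizontally and each label gets its own line.

```r
library(ggplot2)

# Swap the mappings: car names on the y axis, mpg on the x axis.
# geom_col() then draws horizontal bars.
p_col_flipped <- ggplot(mtcars, aes(x = mpg, y = rownames(mtcars))) +
  geom_col()
p_col_flipped
```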
We can reorder the data to make it easier to visualize important information.
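One possible approach, sketched here with the base R reorder() function, is to sort the car names by the plotted value. Whether the original figure used reorder() is an assumption, and p_col_ordered is an illustrative name.

```r
library(ggplot2)

# reorder() turns the car names into a factor whose levels are
# sorted by mpg, so the bars appear ranked from lowest to highest.
p_col_ordered <- ggplot(mtcars,
                        aes(x = mpg, y = reorder(rownames(mtcars), mpg))) +
  geom_col() +
  labs(y = "car")
p_col_ordered
```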
Exercise: Plot a bar chart using geom_col() with the mtcars dataset. Plot
the names of the cars ranked by the weight of each car. Try adding a
third aesthetic, color, for horsepower.
You can also add another layer of geometry to the same ggplot. Notice
you can have two separate aesthetic declarations and they have moved
from the ggplot function to their respective geom_ functions.
# ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
# <GEOM_FUNCTION1>() +
# <GEOM_FUNCTION2>()
# OR
# ggplot(data = <DATA>) +
# <GEOM_FUNCTION1>(mapping = aes(<MAPPINGS>)) +
# <GEOM_FUNCTION2>(mapping = aes(<MAPPINGS>))
ggplot(mtcars) +
  geom_point(aes(x=hp, y=mpg)) +
  geom_line(aes(x=hp, y=mpg, color=cyl)) +
  ggtitle("Modern Cars: Horsepower vs Miles Per Gallon") +
  ylab("miles per gallon") +
  xlab("horsepower") +
  labs(color="#cylinders")
This particular geometry addition isn’t very useful.
Exercise: Try adding geom_smooth() instead of geom_line().
Saving these plots is easy! Simply call the ggsave() function to save
the last plot that you created. You can specify the file format by
changing the extension after the filename.
ggsave("plot.png") # saves the last plot to a PNG file in the current working directory
You can also specify the dots per inch and the width and height of the image to ensure publication-quality figures upon saving.
ggsave("plot-highres.png", dpi = 300, width = 8, height = 4) # you can specify the dots per inch (dpi) and the width and height parameters
Exercise: Try saving the last plot that we produced as a jpg. Can you navigate to where it saved and open it on your computer?
Data Visualization with ggplot2 :: Cheat sheet :: https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
Let’s take a look at a gallery resource to preview different plot types and get ideas for our own plots. https://r-graph-gallery.com/
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
[5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[9] ggplot2_3.4.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] highr_0.10 bslib_0.4.2 compiler_4.2.2
[4] pillar_1.9.0 jquerylib_0.1.4 tools_4.2.2
[7] digest_0.6.31 downlit_0.4.3 timechange_0.2.0
[10] jsonlite_1.8.4 evaluate_0.21 memoise_2.0.1
[13] lifecycle_1.0.3 gtable_0.3.3 pkgconfig_2.0.3
[16] rlang_1.1.1 cli_3.6.1 rstudioapi_0.14
[19] distill_1.6 yaml_2.3.7 xfun_0.39
[22] fastmap_1.1.1 withr_2.5.0 knitr_1.43
[25] systemfonts_1.0.4 hms_1.1.3 generics_0.1.3
[28] sass_0.4.6 vctrs_0.6.2 grid_4.2.2
[31] tidyselect_1.2.0 glue_1.6.2 R6_2.5.1
[34] textshaping_0.3.6 fansi_1.0.4 rmarkdown_2.22
[37] farver_2.1.1 tzdb_0.4.0 magrittr_2.0.3
[40] scales_1.2.1 htmltools_0.5.5 colorspace_2.1-0
[43] ragg_1.2.5 labeling_0.4.2 utf8_1.2.3
[46] stringi_1.7.12 munsell_0.5.0 cachem_1.0.8