Plot histogram with overlaid normal curve

Given a vector of values, create a ggplot histogram with overlaid best-fitting normal curve, with prettified caption of numerics
Published

July 29, 2024

This is another basic task that I’m tired of looking up, so I’m posting it here for personal reference.

Task: given a vector of values, create a ggplot histogram with an overlaid best-fitting normal curve and an optional caption showing the mean and standard deviation (formatted to three significant digits) along with the sample size.

library(tidyverse)

format_num <- function(n, digits = 3) {
  # Prettify numeric results -- no scientific notation, use significant digits
  formatC(signif(n, digits=digits), digits=digits, format="fg", flag="#")
}

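hist_normal <- function(values, binwidth = NA, caption = TRUE, num_sd = NA) {
  # values:   numeric vector to plot
  # binwidth: histogram bin width; NA means range / 30 (about 30 bins)
  # caption:  if TRUE, add a caption with mean, sd and n
  # num_sd:   if not NA, limit the x-axis to mean +/- num_sd standard deviations
  df <- data.frame(value = values)
  values_mean <- mean(df$value)
  values_sd   <- sd(df$value)
  if (is.na(binwidth)) {
    binwidth <- (max(df$value) - min(df$value)) / 30
  }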
  
  g <- df %>%
    ggplot(aes(x = value)) +
    geom_histogram(
      aes(y = after_stat(density)),
      binwidth = binwidth,
      colour = "black", fill = "white"
    ) +
    stat_function(fun = dnorm, args = list(mean = values_mean, sd = values_sd))

  if (caption) {
    g <- g +
      labs(caption = paste0(
        "mean = ", format_num(values_mean),
        "; sd = ", format_num(values_sd),
        "; n = ", length(values)
      ))
  }
  
  if (!is.na(num_sd)) {
    g <- g + coord_cartesian(xlim = values_mean + values_sd * c(-num_sd, num_sd))
  }
  
  return(g)
}
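The reason for `aes(y = after_stat(density))` is that `dnorm` is a density with total area 1, so the histogram has to be on the same scale for the curve to line up with the bars. Here is a minimal base-R sketch of that idea, using `hist(plot = FALSE)` instead of ggplot. The seed, sample size, and variable names are just assumptions for illustration.

```r
# Base-R check of the density scaling that after_stat(density) relies on.
set.seed(1)                                # arbitrary seed, for reproducibility only
x <- rnorm(1000)                           # illustrative sample

h <- hist(x, breaks = 30, plot = FALSE)    # compute the bins without drawing
bin_widths <- diff(h$breaks)

# On the density scale, bar height times bar width sums to 1 ...
total_area <- sum(h$density * bin_widths)

# ... which matches the total area under the fitted normal curve.
curve_area <- integrate(dnorm, -Inf, Inf, mean = mean(x), sd = sd(x))$value

print(c(histogram = total_area, normal_curve = curve_area))
```

With raw counts on the y-axis, the bars would be taller than the curve by a factor of n times the bin width.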

Simple example: the default bin width gives about 30 bins, which is too many for n = 200 points.

hist_normal(rnorm(200))
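One way to pick a wider bin for a small sample is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). That rule is not part of the function above; I'm swapping it in here only as a rough sketch of how a `binwidth` value could be chosen. The seed and the helper `n_bins` are hypothetical.

```r
# Compare the default bin width (range / 30) with a Freedman-Diaconis bin width.
set.seed(42)                               # arbitrary seed, for reproducibility only
x <- rnorm(200)

# Default used inside hist_normal() when binwidth is NA
default_bw <- (max(x) - min(x)) / 30

# Freedman-Diaconis rule (an assumption for this sketch, not from the function above)
fd_bw <- 2 * IQR(x) / length(x)^(1/3)

# Hypothetical helper: approximate number of bins needed to cover the data range
n_bins <- function(bw) ceiling((max(x) - min(x)) / bw)
print(c(default = n_bins(default_bw), freedman_diaconis = n_bins(fd_bw)))

# If hist_normal() is already defined in the session, draw with the wider bins
if (exists("hist_normal")) print(hist_normal(x, binwidth = fd_bw))
```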

Smoother example, with more points, an explicit bin width, and the x-axis limited to 4 standard deviations on either side of the mean:

hist_normal(rnorm(5000, mean = 25, sd = 2.5), binwidth = 0.5, num_sd = 4)
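For reference, this is the arithmetic behind `num_sd`, as a small sketch. I'm plugging in the population values 25 and 2.5 (the names `pop_mean` and `pop_sd` are just for illustration); the function itself uses the sample mean and sd, so the real limits will be close to these but not identical.

```r
# Worked example of the x-axis limits produced by num_sd.
pop_mean <- 25     # mean passed to rnorm() in the example above
pop_sd   <- 2.5    # sd passed to rnorm() in the example above
num_sd   <- 4

# Same expression as inside hist_normal(), with population values substituted
xlim <- pop_mean + pop_sd * c(-num_sd, num_sd)
print(xlim)        # 15 35
```

Since the limits are applied with `coord_cartesian()`, the plot is zoomed without dropping any observations, so the histogram bars are not affected.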