Hands-on Exercise 4C: Visualising Uncertainty

Author

Cai Jingheng

Published

January 30, 2024

Modified

January 31, 2024

1 Getting Started

1.1 Installing and loading the packages

This exercise uses the following R packages:

tidyverse, a family of R packages for the data science process,
plotly for creating interactive plots,
gganimate for creating animated plots,
DT for displaying interactive HTML tables,
crosstalk for implementing cross-widget interactions (currently, linked brushing and filtering), and
ggdist for visualising distributions and uncertainty.

ungeviz, which is used for the hypothetical outcome plots in Section 4, is installed from GitHub first.

Code

devtools::install_github("wilkelab/ungeviz")

Code

pacman::p_load(ungeviz, plotly, crosstalk,
               DT, ggdist, ggridges,
               colorspace, gganimate, tidyverse)

1.2 Data import

For the purpose of this exercise, Exam_data.csv will be used.

Code

exam <- read_csv("data/Exam_data.csv")

2 Visualizing the uncertainty of point estimates: ggplot2 methods

A point estimate is a single number, such as a mean. Uncertainty, on the other hand, is expressed as standard error, confidence interval, or credible interval.

Important

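Don’t confuse the uncertainty of a point estimate with the variation in the sample. The standard deviation describes how much individual scores vary, whereas the standard error describes how precisely the mean is estimated.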
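
To make the distinction concrete, here is a minimal sketch on simulated scores (the data and object names are made up for illustration and are not taken from Exam_data.csv). The standard deviation stays roughly the same as the sample grows, while the standard error of the mean shrinks.

```r
# Simulated scores; not the exam data
set.seed(1234)
small <- rnorm(25, mean = 70, sd = 15)
large <- rnorm(2500, mean = 70, sd = 15)

# Variation in the sample: standard deviation
sd_small <- sd(small)
sd_large <- sd(large)

# Uncertainty of the point estimate: standard error of the mean
se_small <- sd_small / sqrt(length(small))
se_large <- sd_large / sqrt(length(large))

round(c(sd_small = sd_small, sd_large = sd_large), 2)
round(c(se_small = se_small, se_large = se_large), 2)
```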

In this section, you will learn how to plot error bars of maths scores by race using the data in the exam tibble.

First, the code chunk below derives the necessary summary statistics.

Code

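# Summarise maths scores by race: group size, mean and standard deviation
my_sum <- exam %>%
  group_by(RACE) %>%
  summarise(
    n = n(),
    mean = mean(MATHS),
    sd = sd(MATHS)
    ) %>%
  # Standard error of the mean (this exercise divides by sqrt(n - 1))
  mutate(se = sd / sqrt(n - 1))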

Things to learn from the code chunk above

group_by() of the dplyr package is used to group the observations by RACE,
summarise() is used to compute the count of observations, the mean and the standard deviation of MATHS,
mutate() is used to derive the standard error of the mean maths score by RACE, and
the output is saved as a tibble called my_sum.

Note

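For the mathematical explanation, please refer to Slide 20 of Lesson 4. Note that the code divides by sqrt(n - 1). Because sd() in R already uses the n - 1 denominator, the more common definition of the standard error is sd / sqrt(n); the two differ noticeably only for small groups such as Indian (n = 12) and Others (n = 9).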

Next, the code chunk below displays the my_sum tibble as an HTML table.

Code

knitr::kable(head(my_sum), format = 'html')

RACE	n	mean	sd	se
Chinese	193	76.50777	15.69040	1.132357
Indian	12	60.66667	23.35237	7.041005
Malay	108	57.44444	21.13478	2.043177
Others	9	69.66667	10.72381	3.791438
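
The se column can be checked by hand. As a worked example for the Indian group, using the rounded values printed in the table (the second line uses the more common sd / sqrt(n) definition; it is shown for comparison only and is not what the code computes):

```latex
\begin{align*}
\mathrm{se}_{\text{Indian}} &= \frac{\mathrm{sd}}{\sqrt{n-1}} = \frac{23.35237}{\sqrt{11}} \approx 7.041 \\
\frac{\mathrm{sd}}{\sqrt{n}} &= \frac{23.35237}{\sqrt{12}} \approx 6.741
\end{align*}
```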

2.1 Plotting standard error bars of point estimates

Now we are ready to plot the standard error bars of mean maths score by race as shown below.

Code

ggplot(my_sum) +
  # Error bars spanning one standard error either side of the mean
  geom_errorbar(
    aes(x = RACE, 
        ymin = mean - se, 
        ymax = mean + se), 
    width = 0.2, 
    colour = "#3459e6", 
    alpha = 0.9, 
    linewidth = 0.5) +   # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
  geom_point(
    aes(x = RACE, 
        y = mean), 
    colour = "red",
    size = 1.5) +
  ggtitle("Standard error of mean maths score by race") +
  theme_minimal()

2.2 Plotting confidence interval of point estimates

Instead of plotting the standard error bar of point estimates, we can also plot the confidence intervals of mean maths score by race.

Code

ggplot(my_sum) +
  # 95% confidence interval: mean +/- 1.96 standard errors
  geom_errorbar(
    aes(x = reorder(RACE, -mean), 
        ymin = mean - 1.96 * se, 
        ymax = mean + 1.96 * se), 
    width = 0.2, 
    colour = "#3459e6", 
    alpha = 0.9, 
    linewidth = 0.5) +
  geom_point(
    aes(x = reorder(RACE, -mean), 
        y = mean), 
    colour = "red",
    size = 1.5) +
  # The x-axis shows race; the maths score is on the y-axis
  labs(x = "Race",
       y = "Mean maths score",
       title = "95% confidence interval of mean maths score by race") +
  theme_minimal()

Things to learn from the code chunk above

The confidence intervals are computed using the formula mean ± 1.96 × se.
The error bars are sorted in descending order of mean maths score by using reorder(RACE, -mean).
labs() of ggplot2 is used to change the axis labels and the plot title.
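
The multipliers 1.96 and 2.58 used in this exercise are quantiles of the standard normal distribution. As a minimal sketch in base R (the helper name `z_multiplier` is made up for illustration), they can be derived from the confidence level rather than typed in by hand. For small groups a t quantile is a common alternative; it is shown here for comparison only and is not what this exercise uses.

```r
# Hypothetical helper: normal-quantile multiplier for a two-sided interval
z_multiplier <- function(level) {
  qnorm(1 - (1 - level) / 2)
}

z95 <- z_multiplier(0.95)
z99 <- z_multiplier(0.99)
round(c(z95 = z95, z99 = z99), 2)   # 1.96 and 2.58

# t-based multiplier for a group of 9 observations (8 degrees of freedom)
t95_n9 <- qt(0.975, df = 9 - 1)
round(t95_n9, 2)                    # 2.31
```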

2.3 Visualizing the uncertainty of point estimates with interactive error bars

In this section, you will learn how to plot interactive error bars for the 99% confidence interval of mean maths score by race as shown in the figure below.

Code

shared_df <- SharedData$new(my_sum)

bscols(widths = c(4,8),
       ggplotly((ggplot(shared_df) +
                   geom_errorbar(aes(
                     x=reorder(RACE, -mean),
                     ymin=mean-2.58*se, 
                     ymax=mean+2.58*se), 
                     width=0.2, 
                     colour="#3459e6", 
                     alpha=0.9, 
                     size=0.5) +
                   geom_point(aes(
                     x=RACE, 
                     y=mean, 
                     text = paste("Race:", `RACE`, 
                                  "<br>N:", `n`,
                                  "<br>Avg. Scores:", round(mean, digits = 2),
                                  "<br>99% CI:[", 
                                  round((mean-2.58*se), digits = 2), ",",
                                  round((mean+2.58*se), digits = 2),"]")),
                     stat="identity", 
                     color="red", 
                     size = 1.5, 
                     alpha=1) + 
                   xlab("Race") + 
                   ylab("Average Scores") + 
                   theme_minimal() + 
                   theme(axis.text.x = element_text(
                     angle = 45, vjust = 0.5, hjust=1)) +
                   ggtitle("99% confidence interval of average<br>maths scores by race")), 
                tooltip = "text"), 
       DT::datatable(shared_df, 
                     rownames = FALSE, 
                     class="compact", 
                     width="100%", 
                     options = list(pageLength = 10,
                                    scrollX = TRUE), 
                     colnames = c("Race",
                                  "No. of pupils", 
                                  "Avg Scores",
                                  "Std Dev",
                                  "Std Error")) %>%
         formatRound(columns=c('mean', 'sd', 'se'),
                     digits=2))
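
One way to keep the tooltip label and the interval multiplier in step is to derive both from a single confidence level. The sketch below is a suggestion in base R, not part of the original exercise; `ci_tooltip` and its arguments are hypothetical names.

```r
# Hypothetical helper: build the CI part of a tooltip from one `level` value,
# so the label (e.g. "99% CI") always matches the multiplier that is used
ci_tooltip <- function(mean, se, level = 0.99, digits = 2) {
  z <- qnorm(1 - (1 - level) / 2)
  paste0(level * 100, "% CI: [",
         round(mean - z * se, digits), ", ",
         round(mean + z * se, digits), "]")
}

# Example with made-up numbers
ci_tooltip(mean = 70, se = 2, level = 0.99)
```

Because paste0() and round() are vectorised, the helper works row by row on my_sum and could stand in for the hand-written CI pieces of the tooltip inside aes().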

3 Visualising Uncertainty: ggdist package

ggdist is an R package that provides a flexible set of ggplot2 geoms and stats designed especially for visualising distributions and uncertainty.
It is designed for both frequentist and Bayesian uncertainty visualization, taking the view that uncertainty visualization can be unified through the perspective of distribution visualization:
- for frequentist models, one visualises confidence distributions or bootstrap distributions (see vignette("freq-uncertainty-vis"));
- for Bayesian models, one visualises probability distributions (see the tidybayes package, which builds on top of ggdist).
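
As a rough illustration of what a bootstrap distribution is, the base R sketch below resamples a simulated set of scores with replacement and recomputes the mean each time. The data, seed and object names are assumptions for illustration; a vector of estimates like `boot_means` is the kind of distribution that ggdist's stats are designed to display.

```r
set.seed(2024)
scores <- rnorm(50, mean = 65, sd = 18)   # simulated scores, not the exam data

# Resample with replacement 2000 times and keep the mean of each resample
boot_means <- replicate(2000, mean(sample(scores, replace = TRUE)))

# Percentile bootstrap 95% interval for the mean
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```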

3.1 Visualizing the uncertainty of point estimates: ggdist methods

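In the code chunk below, stat_pointinterval() of ggdist is used to build a visual for displaying the distribution of maths scores by race. By default it draws the median as the point, together with the 66% and 95% quantile intervals of the scores.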

Code

exam %>%
  ggplot(aes(x = RACE, 
             y = MATHS)) +
  # Defaults: median point with 66% and 95% quantile intervals
  stat_pointinterval(color = "#3459e6", point_color = "red") +
  labs(
    title = "Visualising the distribution of maths scores by race",
    subtitle = "Median point + multiple-interval plot") +
  theme_minimal()

Note

This function comes with many arguments; students are advised to read the syntax reference for more detail.

For example, in the code chunk below the following arguments are used:

.width = 0.95
point_interval = median_qi

Code

exam %>%
  ggplot(aes(x = RACE, y = MATHS)) +
  # `.point` and `.interval` belong to point_interval(), not to this stat,
  # so the combined helper median_qi is passed via `point_interval` instead
  stat_pointinterval(.width = 0.95,
                     point_interval = median_qi,
                     color = "#3459e6",
                     point_color = "red") +
  labs(
    title = "Visualising the median and 95% interval of maths scores",
    subtitle = "Median point + interval plot") +
  theme_minimal()

What I learnt…

.width: It sets the probability mass covered by each interval, as a decimal between 0 and 1. For example, .width = 0.95 draws a 95% interval, a common choice; a vector such as c(0.66, 0.95) draws several nested intervals.
.point: An argument of ggdist's point_interval() function that gives the function used to compute the central point, such as median, mean or Mode. stat_pointinterval() does not take .point directly; the choice is made through point_interval, where median_qi means that the median is used as the central point.
.interval: An argument of point_interval() that gives the function used to compute the interval, such as qi (quantile interval) or hdi (highest-density interval). In median_qi the interval is a quantile interval: for .width = 0.95 it runs from the 2.5th to the 97.5th percentile of the scores in each group, so it describes the spread of the scores rather than a confidence interval for the median.
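
To see what a median point with a quantile interval amounts to, the sketch below reproduces it with base R on simulated scores (made-up data and names). If ggdist is installed, `median_qi()` applied to the same vector should give matching values; that call is guarded so the sketch still runs without the package.

```r
set.seed(42)
scores <- round(runif(200, min = 20, max = 100))  # simulated scores

# Median as the point, 2.5th and 97.5th percentiles as the 95% interval
point <- median(scores)
interval <- quantile(scores, probs = c(0.025, 0.975), names = FALSE)
c(point = point, lower = interval[1], upper = interval[2])

# Optional cross-check against ggdist
if (requireNamespace("ggdist", quietly = TRUE)) {
  print(ggdist::median_qi(scores, .width = 0.95))
}
```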

3.2 Visualizing the uncertainty of point estimates: ggdist methods

Code

exam %>%
  ggplot(aes(x = RACE, y = MATHS)) +
  # point_interval = median_qi: median point with a quantile interval
  stat_pointinterval(.width = 0.99,
                     point_interval = median_qi,
                     color = "#3459e6",
                     point_color = "red") +
  labs(
    title = "Visualising the median and 99% interval of maths scores",
    subtitle = "Median point + interval plot") +
  theme_minimal()

3.3 Visualizing the uncertainty of point estimates: ggdist methods

In the code chunk below, stat_gradientinterval() of ggdist is used to build a visual for displaying the distribution of maths scores by race.

Code

exam %>%
  ggplot(aes(x = RACE, 
             y = MATHS)) +
  stat_gradientinterval(
    fill = "skyblue", 
    color = "#3459e6",
    point_color = "red", 
    show.legend = TRUE
  ) +
  labs(
    title = "Visualising the distribution of maths scores by race",
    subtitle = "Gradient + interval plot") +
  theme_minimal()

4 Visualising Uncertainty with Hypothetical Outcome Plots (HOPs)

Step 1: Install the ungeviz package

Code

devtools::install_github("wilkelab/ungeviz")

Step 2: Load the package in R

Code

library(ungeviz)

Code

ggplot(data = exam, 
       (aes(x = factor(RACE), y = MATHS))) +
  geom_point(position = position_jitter(
    height = 0.3, width = 0.05), 
    size = 0.4, color = "#3459e6", alpha = 1/2) +
  geom_hpline(data = sampler(25, group = RACE), height = 0.6, color = "coral") +
  theme_bw() + 
  # `.draw` is a generated column indicating the sample draw
  transition_states(.draw, 1, 3)
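
A hypothetical outcome plot shows one random draw per animation frame instead of a single summary. As a rough base R sketch of the idea behind `sampler(25, group = RACE)` (an approximation written for illustration, not ungeviz's actual implementation; the data and the `draw_one_per_group` helper are made up), each of the 25 draws picks one observation from every group and tags it with a draw number:

```r
set.seed(99)
# Simulated data standing in for the exam tibble
toy <- data.frame(
  RACE  = rep(c("A", "B", "C"), times = c(30, 10, 20)),
  MATHS = round(runif(60, min = 20, max = 100))
)

# Hypothetical helper: one random row per group for a single draw
draw_one_per_group <- function(data, group, draw_id) {
  rows <- vapply(split(seq_len(nrow(data)), data[[group]]),
                 function(idx) idx[sample.int(length(idx), 1)],
                 integer(1))
  cbind(.draw = draw_id, data[rows, , drop = FALSE])
}

# 25 draws stacked into one data frame; each `.draw` value is one frame
draws <- do.call(rbind, lapply(1:25, draw_one_per_group,
                               data = toy, group = "RACE"))
head(draws)
```

In the animated plot, transition_states() steps through the `.draw` values, so the horizontal line for each race jumps between sampled scores from frame to frame.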

