Diversity Analysis

Diversity Analysis

Alpha diversity summaries are common in microbiome studies.

They are useful — but only when interpreted carefully.

This chapter focuses on:

  • what alpha diversity measures
  • how sequencing depth affects diversity
  • how to interpret group differences responsibly

Alpha diversity is not a single concept.

Different metrics emphasize richness, evenness, and dominance differently. Interpretation depends on the metric.


Load Data

library(phyloseq)
library(vegan)

ps <- readRDS("data/moving-pictures-ps.rds")

Check Sequencing Depth (R → Python)

Alpha diversity is sensitive to sequencing depth.

lib_size <- phyloseq::sample_sums(ps)

df_lib <- data.frame(
  sample_id = names(lib_size),
  library_size = as.numeric(lib_size),
  stringsAsFactors = FALSE
)

dir.create("outputs/tables", recursive = TRUE, showWarnings = FALSE)
readr::write_csv(df_lib, "outputs/tables/library-size.csv")

summary(df_lib$library_size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    897    1838    4010    4524    7013    9820 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("outputs/tables/library-size.csv")

sns.set_theme(style="whitegrid")

plt.figure(figsize=(9,5))
sns.histplot(df["library_size"], bins=30, kde=True)

plt.title("Sequencing depth distribution")
plt.xlabel("Library size (total reads per sample)")
plt.ylabel("Number of samples")

sns.despine()
plt.tight_layout()
plt.show()

If sequencing depth varies widely, observed richness may vary even if communities are identical.

Depth inspection is part of interpretation, not an optional diagnostic.


Compute Alpha Diversity (R)

otu <- methods::as(phyloseq::otu_table(ps), "matrix")
if (!phyloseq::taxa_are_rows(ps)) otu <- t(otu)

observed <- colSums(otu > 0)
shannon  <- vegan::diversity(t(otu), index = "shannon")
simpson  <- vegan::diversity(t(otu), index = "simpson")
pielou   <- shannon / log(pmax(observed, 1))

alpha_df <- data.frame(
  sample_id = colnames(otu),
  observed  = observed,
  shannon   = shannon,
  simpson   = simpson,
  pielou    = pielou,
  stringsAsFactors = FALSE
)

meta <- data.frame(phyloseq::sample_data(ps))
meta$sample_id <- rownames(meta)

alpha_df <- merge(alpha_df, meta, by = "sample_id", all.x = TRUE)

cols <- names(alpha_df)
body_col <- intersect(c("body-site", "body.site", "body_site"), cols)

if (length(body_col) == 0) {
  stop(
    "Body site column not found. Available columns: ",
    paste(cols, collapse = ", ")
  )
}

alpha_df$body_site <- alpha_df[[body_col[1]]]
alpha_df$library_size <- phyloseq::sample_sums(ps)[alpha_df$sample_id]

readr::write_csv(alpha_df, "outputs/tables/alpha-diversity.csv")

head(alpha_df)
  sample_id observed  shannon   simpson    pielou barcode.sequence body.site
1    L1S105       63 2.682108 0.8707597 0.6473620     AGTGCGATGCGT       gut
2    L1S140       65 2.660947 0.8518507 0.6374460     ATGGCAGCTCTA       gut
3    L1S208       85 3.121034 0.8999369 0.7025161     CTGAGATACGCG       gut
4    L1S257       81 3.262504 0.9261295 0.7424147     CCGACTGAGATG       gut
5    L1S281       72 3.189387 0.9082814 0.7457648     CCTCTCGTGATC       gut
6     L1S57       70 2.905920 0.8661380 0.6839881     ACACACTATGGC       gut
  year month day   subject reported.antibiotic.usage
1 2009     3  17 subject-1                        No
2 2008    10  28 subject-2                       Yes
3 2009     1  20 subject-2                        No
4 2009     3  17 subject-2                        No
5 2009     4  14 subject-2                        No
6 2009     1  20 subject-1                        No
  days.since.experiment.start body_site library_size
1                         140       gut         7780
2                           0       gut         7163
3                          84       gut         8152
4                         140       gut         6388
5                         168       gut         6615
6                          84       gut         8702

Visualize Alpha Diversity (Python, Seaborn)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("outputs/tables/alpha-diversity.csv")
df = df.dropna(subset=["body_site"])

sns.set_theme(style="whitegrid")

metrics = [
    ("observed", "Observed richness"),
    ("shannon", "Shannon diversity"),
    ("simpson", "Simpson diversity"),
    ("pielou", "Pielou evenness"),
]

order = (
    df.groupby("body_site")["observed"]
    .median()
    .sort_values(ascending=False)
    .index
)

for col, title in metrics:
    fig, ax = plt.subplots(figsize=(9,5))

    sns.boxplot(
        data=df,
        x="body_site",
        y=col,
        order=order,
        showfliers=False,
        width=0.6,
        ax=ax
    )

    sns.stripplot(
        data=df,
        x="body_site",
        y=col,
        order=order,
        size=3,
        alpha=0.4,
        jitter=0.25,
        ax=ax
    )

    ax.set_title(f"{title} by body site")
    ax.set_xlabel("Body site")
    ax.set_ylabel(title)

    sns.despine()
    plt.setp(ax.get_xticklabels(), rotation=20, ha="right")

    fig.tight_layout()
    plt.show()
<Axes: xlabel='body_site', ylabel='observed'>
<Axes: xlabel='body_site', ylabel='observed'>
Text(0.5, 1.0, 'Observed richness by body site')
Text(0.5, 0, 'Body site')
Text(0, 0.5, 'Observed richness')
[None, None, None, None, None, None, None, None]
<Axes: xlabel='body_site', ylabel='shannon'>
<Axes: xlabel='body_site', ylabel='shannon'>
Text(0.5, 1.0, 'Shannon diversity by body site')
Text(0.5, 0, 'Body site')
Text(0, 0.5, 'Shannon diversity')
[None, None, None, None, None, None, None, None]
<Axes: xlabel='body_site', ylabel='simpson'>
<Axes: xlabel='body_site', ylabel='simpson'>
Text(0.5, 1.0, 'Simpson diversity by body site')
Text(0.5, 0, 'Body site')
Text(0, 0.5, 'Simpson diversity')
[None, None, None, None, None, None, None, None]
<Axes: xlabel='body_site', ylabel='pielou'>
<Axes: xlabel='body_site', ylabel='pielou'>
Text(0.5, 1.0, 'Pielou evenness by body site')
Text(0.5, 0, 'Body site')
Text(0, 0.5, 'Pielou evenness')
[None, None, None, None, None, None, None, None]


Depth Sensitivity Check (Python)

import numpy as np

df = pd.read_csv("outputs/tables/alpha-diversity.csv")

plt.figure(figsize=(8.5,5))
sns.scatterplot(data=df, x="library_size", y="observed", alpha=0.7)

coef = np.polyfit(df["library_size"], df["observed"], 1)
x = np.linspace(df["library_size"].min(), df["library_size"].max(), 200)
plt.plot(x, coef[0]*x + coef[1])

plt.title("Observed richness vs sequencing depth")
plt.xlabel("Library size")
plt.ylabel("Observed richness")

sns.despine()
plt.tight_layout()
plt.show()


Interpretation

Observed richness is highly depth-sensitive.

Shannon and Simpson reduce some sensitivity but do not eliminate it.

If richness increases strongly with depth, group comparisons may reflect sequencing effort rather than ecology.

What these plots show

  • Distribution of diversity within groups
  • Variability across samples
  • Dominance vs evenness patterns

What they do not show

  • Absolute abundance
  • Statistical significance
  • Causation
  • Confounder adjustment

Alpha diversity is descriptive unless formally tested.