Summary and Next Steps

This guide introduced a practical workflow for visualizing microbiome data.

The goal was not to memorize plotting code. The goal was to build a reliable interpretation habit:

A figure is not the conclusion.

A figure is a structured view of the data under explicit assumptions. The quality of interpretation depends on whether those assumptions are stated and tested.


What you can now do

By the end of this guide you can:

  • load and reuse a canonical phyloseq object
  • recognize sparsity as an expected feature of microbiome data, not an error
  • explain why sequencing depth matters when interpreting richness
  • build composition plots that are honest about limitations
  • read diversity plots without mistaking depth effects for biology
  • interpret ordination as distance-based geometry
  • use heatmaps to discover co-occurrence patterns

A compact checklist for new datasets

Use this checklist whenever you start a new microbiome dataset.

Structure

  • Do samples align across abundance and metadata?
  • Are taxa identifiers consistent across tables?
  • Are there missing or duplicated sample IDs?
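
The structural questions above can be checked mechanically before any plotting. The following is a minimal sketch, not part of this guide's pipeline: `check_sample_ids` and its arguments are hypothetical names, and the IDs are toy values. It assumes the sample IDs of the abundance table and of the metadata have already been extracted as plain lists.

```python
from collections import Counter


def check_sample_ids(abundance_ids, metadata_ids):
    """Compare sample IDs from an abundance table and a metadata table.

    Both arguments are plain lists of IDs (hypothetical inputs; in practice
    they would come from the abundance matrix and the metadata table).
    """
    def duplicated(ids):
        # IDs that occur more than once in the same table
        return sorted(i for i, n in Counter(ids).items() if n > 1)

    abundance_set, metadata_set = set(abundance_ids), set(metadata_ids)
    return {
        "duplicated_in_abundance": duplicated(abundance_ids),
        "duplicated_in_metadata": duplicated(metadata_ids),
        "missing_from_metadata": sorted(abundance_set - metadata_set),
        "missing_from_abundance": sorted(metadata_set - abundance_set),
    }


# Toy IDs, invented for illustration only.
report = check_sample_ids(
    ["S1", "S2", "S3", "S3"],
    ["S1", "S2", "S4"],
)
for key, value in report.items():
    print(key, value)
```

An empty list for every key means the two tables line up; anything else is worth resolving before building the combined object.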

Sparsity

  • What fraction of the matrix is zero?
  • Are there taxa present in only 1–2 samples?
  • Will a prevalence threshold improve clarity?
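
The sparsity questions reduce to a few counts over the abundance matrix. The sketch below assumes a taxa-by-samples matrix held as a list of rows; `sparsity_summary` is a hypothetical helper, the matrix is invented, and the default `min_prevalence` of 3 is an illustrative value rather than a recommendation.

```python
def sparsity_summary(counts, min_prevalence=3):
    """Summarise zeros in a taxa-by-samples count matrix (list of rows).

    min_prevalence is an illustrative threshold, not a recommendation.
    """
    n_cells = sum(len(row) for row in counts)
    n_zero = sum(value == 0 for row in counts for value in row)
    # Prevalence = number of samples in which a taxon is detected.
    prevalence = [sum(value > 0 for value in row) for row in counts]
    return {
        "zero_fraction": n_zero / n_cells,
        "prevalence": prevalence,
        # Taxa seen in only 1-2 samples (row indices).
        "rare_taxa": [i for i, p in enumerate(prevalence) if 0 < p <= 2],
        # Taxa that would survive the prevalence threshold (row indices).
        "kept_taxa": [i for i, p in enumerate(prevalence) if p >= min_prevalence],
    }


# Toy matrix: 4 taxa (rows) x 5 samples (columns), invented for illustration.
toy = [
    [10, 0, 3, 7, 1],
    [0, 0, 5, 0, 0],
    [2, 4, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
summary = sparsity_summary(toy)
print(summary)
```

Comparing `kept_taxa` under a few different thresholds is a quick way to see how much a filtering decision changes the table before it changes a figure.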

Depth

  • Are library sizes comparable across groups?
  • Does observed richness strongly track depth?
  • Do conclusions change under a depth sensitivity check?
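
One way to approach the depth questions is to compare library size with observed richness, sample by sample. The sketch below assumes a taxa-by-samples count matrix stored as a list of rows. The function names are hypothetical, the data are toy values, and Spearman rank correlation is used here only as one reasonable choice for the illustration.

```python
import math


def library_sizes(counts):
    """Total reads per sample for a taxa-by-samples matrix (list of rows)."""
    return [sum(column) for column in zip(*counts)]


def observed_richness(counts):
    """Number of taxa with at least one read in each sample."""
    return [sum(value > 0 for value in column) for column in zip(*counts)]


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined when one variable is constant.")
    return cov / math.sqrt(var_x * var_y)


# Toy matrix: 4 taxa (rows) x 4 samples (columns), invented for illustration.
toy = [
    [50, 5, 20, 0],
    [30, 0, 10, 2],
    [15, 0, 0, 0],
    [5, 1, 0, 0],
]
depth = library_sizes(toy)
richness = observed_richness(toy)
rho = spearman(depth, richness)
print(depth, richness, round(rho, 3))
```

A correlation close to 1 suggests that richness differences may largely reflect sequencing effort, which is the situation a depth sensitivity check is meant to expose.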

Composition

  • Are you comparing proportions or absolute abundance?
  • Did you state the taxonomic rank and top-N choice?
  • Would the story change if you aggregated differently?
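
The top-N choice is easy to make explicit in code. The following sketch converts counts to proportions, keeps the N taxa with the highest mean proportion across samples, and pools everything else as "Other". The function names, taxon labels, and counts are all hypothetical; ranking by mean proportion is an assumption of this sketch, and other ranking rules are equally defensible.

```python
def relative_abundance(counts_by_sample):
    """Convert {sample: {taxon: count}} to within-sample proportions."""
    rel = {}
    for sample, counts in counts_by_sample.items():
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"Sample {sample!r} has zero reads.")
        rel[sample] = {taxon: c / total for taxon, c in counts.items()}
    return rel


def collapse_to_top_n(rel, n):
    """Keep the n taxa with the highest mean proportion; pool the rest."""
    taxa = sorted({taxon for props in rel.values() for taxon in props})
    mean_prop = {
        taxon: sum(props.get(taxon, 0.0) for props in rel.values()) / len(rel)
        for taxon in taxa
    }
    top = sorted(taxa, key=lambda taxon: mean_prop[taxon], reverse=True)[:n]
    collapsed = {}
    for sample, props in rel.items():
        row = {taxon: props.get(taxon, 0.0) for taxon in top}
        # Everything outside the top n is reported as a single category.
        row["Other"] = sum(p for taxon, p in props.items() if taxon not in top)
        collapsed[sample] = row
    return top, collapsed


# Toy counts, invented for illustration only.
toy = {
    "S1": {"Taxon_A": 60, "Taxon_B": 30, "Taxon_C": 10},
    "S2": {"Taxon_A": 20, "Taxon_B": 5, "Taxon_C": 75},
}
top, collapsed = collapse_to_top_n(relative_abundance(toy), n=2)
print(top)
print(collapsed)
```

Rerunning the same function with a different `n`, or after aggregating to a different rank, shows directly whether the visual story depends on that choice.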

Diversity and ordination

  • Which alpha metric matches the question?
  • Which beta distance matches the data and interpretation?
  • If you run PERMANOVA, did you also check that within-group dispersion is comparable across groups?
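
Metric choice is easier to reason about with a small worked case. The sketch below uses toy vectors and hypothetical helper names. It shows two samples with identical observed richness but different Shannon diversity, and two samples whose Bray–Curtis distance is driven entirely by depth when computed on raw counts.

```python
import math


def shannon(counts):
    """Shannon index with natural logarithms; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator


def proportions(counts):
    total = sum(counts)
    return [c / total for c in counts]


# Toy data, invented for illustration only.
even = [25, 25, 25, 25]   # four taxa, evenly distributed
skewed = [97, 1, 1, 1]    # four taxa, one dominant
print(shannon(even), shannon(skewed))

shallow = [10, 20, 30]
deep = [20, 40, 60]       # same composition, twice the depth
bc_counts = bray_curtis(shallow, deep)
bc_props = bray_curtis(proportions(shallow), proportions(deep))
print(bc_counts, bc_props)
```

Both toy samples in the first pair have richness 4, so a richness plot would show no difference while a Shannon plot would. In the second pair, the raw-count distance is nonzero even though the compositions are identical.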

Most interpretation problems come from skipping one of the checklist blocks.

The fix is almost always upstream, not in the plotting code.


Save a small “results snapshot” (R → Python)

A practical habit is to export a small results bundle that can be reloaded later.

We will export:

  • library size summary
  • alpha diversity table
  • ordination coordinates

In R, write the three tables:

dir.create("outputs/snapshots", recursive = TRUE, showWarnings = FALSE)

ps <- readRDS("data/moving-pictures-ps.rds")

# Library size
lib_size <- phyloseq::sample_sums(ps)
df_lib <- data.frame(sample_id = names(lib_size), library_size = as.numeric(lib_size))
readr::write_csv(df_lib, "outputs/snapshots/library-size.csv")

# Alpha diversity (simple, reproducible)
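otu <- methods::as(phyloseq::otu_table(ps), "matrix")
# Make sure taxa are rows, so that each column is one sample
if (!phyloseq::taxa_are_rows(ps)) otu <- t(otu)

# Observed richness: number of taxa with at least one read per sample
observed <- colSums(otu > 0)
# vegan::diversity() expects samples in rows, hence t(otu)
shannon  <- vegan::diversity(t(otu), index = "shannon")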

alpha_df <- data.frame(
  sample_id = colnames(otu),
  observed = observed,
  shannon = shannon,
  stringsAsFactors = FALSE
)

meta <- data.frame(phyloseq::sample_data(ps))
meta$sample_id <- rownames(meta)
alpha_df <- merge(alpha_df, meta, by = "sample_id", all.x = TRUE)

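# The body-site column may be spelled "body-site", "body.site", or "body_site"
# depending on how the metadata were imported; copy whichever exists into
# a single, predictable column called body_site.
cols <- names(alpha_df)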
body_col <- intersect(c("body-site", "body.site", "body_site"), cols)
if (length(body_col) == 0) stop("Body site column not found in metadata.")
alpha_df$body_site <- alpha_df[[body_col[1]]]

readr::write_csv(alpha_df, "outputs/snapshots/alpha-diversity-mini.csv")

# Ordination coordinates (PCoA, Bray–Curtis, rel abundance)
ps_rel <- phyloseq::transform_sample_counts(ps, function(x) x / sum(x))
dist_bc <- phyloseq::distance(ps_rel, method = "bray")
ord <- phyloseq::ordinate(ps_rel, method = "PCoA", distance = dist_bc)

coords <- as.data.frame(ord$vectors[, 1:2])
colnames(coords) <- c("PC1", "PC2")
coords$sample_id <- rownames(coords)

ord_df <- merge(coords, meta, by = "sample_id", all.x = TRUE)
# Resolve the body-site column name for the ordination table as well
cols <- names(ord_df)
body_col <- intersect(c("body-site", "body.site", "body_site"), cols)
if (length(body_col) == 0) stop("Body site column not found in ordination metadata.")
ord_df$body_site <- ord_df[[body_col[1]]]

readr::write_csv(ord_df, "outputs/snapshots/ordination-pcoa.csv")

# Files written to outputs/snapshots/
c("library-size.csv", "alpha-diversity-mini.csv", "ordination-pcoa.csv")
# [1] "library-size.csv"         "alpha-diversity-mini.csv"
# [3] "ordination-pcoa.csv"

In Python, reload the snapshot and confirm its shape:

import pandas as pd

lib = pd.read_csv("outputs/snapshots/library-size.csv")
alpha = pd.read_csv("outputs/snapshots/alpha-diversity-mini.csv")
ordn = pd.read_csv("outputs/snapshots/ordination-pcoa.csv")

print("Library size rows:", lib.shape[0])
# Library size rows: 34
print("Alpha diversity rows:", alpha.shape[0])
# Alpha diversity rows: 34
print("Ordination rows:", ordn.shape[0])
# Ordination rows: 34

alpha[["sample_id", "body_site", "observed", "shannon"]].head()
#   sample_id body_site  observed   shannon
# 0    L1S105       gut        63  2.682108
# 1    L1S140       gut        65  2.660947
# 2    L1S208       gut        85  3.121034
# 3    L1S257       gut        81  3.262504
# 4    L1S281       gut        72  3.189387

A snapshot makes your workflow restartable.

You can reproduce plots without rerunning every upstream step. This is a small habit that prevents large confusion later.
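
A snapshot is only restartable if its tables still describe the same samples. The sketch below is one way to check that, using only the Python standard library. `check_snapshot` is a hypothetical helper, and the demo writes toy files to a temporary directory so that it runs without the real outputs; the file names and the `sample_id` column are the ones used in the export step.

```python
import csv
import tempfile
from pathlib import Path

SNAPSHOT_FILES = [
    "library-size.csv",
    "alpha-diversity-mini.csv",
    "ordination-pcoa.csv",
]


def read_sample_ids(path):
    """Return the sample_id column of a CSV file as a list."""
    with open(path, newline="") as handle:
        return [row["sample_id"] for row in csv.DictReader(handle)]


def check_snapshot(directory):
    """Report files whose sample IDs are duplicated or differ from the first file."""
    ids = {name: read_sample_ids(Path(directory) / name) for name in SNAPSHOT_FILES}
    reference = set(ids[SNAPSHOT_FILES[0]])
    problems = {}
    for name, values in ids.items():
        if len(values) != len(set(values)):
            problems[name] = "duplicated sample_id values"
        elif set(values) != reference:
            problems[name] = "sample_id set differs from " + SNAPSHOT_FILES[0]
    return problems


# Demo on toy files (invented data): the ordination table is missing one sample.
with tempfile.TemporaryDirectory() as tmp:
    for name in SNAPSHOT_FILES:
        toy_ids = ["S1", "S2"] if name == "ordination-pcoa.csv" else ["S1", "S2", "S3"]
        with open(Path(tmp) / name, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["sample_id", "value"])
            writer.writerows([sample, 0] for sample in toy_ids)
    problems = check_snapshot(tmp)

print(problems)
```

For the real snapshot, calling `check_snapshot("outputs/snapshots")` should return an empty dictionary if the three tables were written from the same object.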


Where to Go Next

If you are continuing with this dataset:

  • Compare results across taxonomic resolutions (Family, Genus, ASV) and assess whether the patterns are stable.
  • Explore alternative distances (Jaccard for presence/absence; Aitchison for compositional structure).
  • Identify taxa that most strongly contribute to observed separation.
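
To make the alternative distances concrete, here is a minimal sketch of both on toy vectors. The helper names are hypothetical. Jaccard is computed on presence/absence only. The Aitchison distance is computed as the Euclidean distance between centred log-ratio (CLR) vectors. The pseudocount of 0.5 used to handle zeros is an assumption of this sketch, and that choice itself affects results.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance on presence/absence: 1 - |shared taxa| / |taxa in either|."""
    present_a = {i for i, value in enumerate(a) if value > 0}
    present_b = {i for i, value in enumerate(b) if value > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform; the pseudocount is an illustrative choice."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(a, b, pseudocount=0.5):
    """Euclidean distance between CLR-transformed vectors."""
    clr_a, clr_b = clr(a, pseudocount), clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(clr_a, clr_b)))


# Toy data, invented for illustration only.
jac = jaccard_distance([10, 0, 5, 0], [1, 3, 0, 0])

# Same composition at ten times the depth; no zeros, so no pseudocount is needed.
same_shape = aitchison_distance([10, 20, 5], [100, 200, 50], pseudocount=0)
different = aitchison_distance([10, 20, 5], [5, 20, 10], pseudocount=0)
print(jac, same_shape, different)
```

Without zeros, the Aitchison distance ignores overall depth and responds only to changes in the ratios between taxa, which is why it is described as respecting compositional structure.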

If you are starting a new dataset:

  • Re-run the structural checklist.
  • Rebuild a canonical object.
  • Keep transformation and filtering decisions explicit and version controlled.

The fastest way to improve interpretation is not adding more plots.

It is strengthening the logic that connects each plot to a defensible biological question.


Beyond Descriptive Visualization

This guide focused on structural clarity and visualization:

  • composition
  • diversity
  • ordination
  • clustering patterns

These approaches describe what the data look like.

They do not yet address:

  • Whether differences are statistically robust
  • How much variance is explained by specific variables
  • How covariates influence interpretation
  • How to reconcile conflicting signals across metrics

These questions require inferential modeling and more advanced analytical design.

The extended continuation of this guide explores:

  • PERMANOVA and multivariate hypothesis testing
  • constrained ordination
  • variance partitioning
  • model-based approaches
  • structured interpretation workflows

The objective is not to add more figures.

It is to reason more rigorously about biological signal.


Continue the Learning Path

The premium continuation of this guide expands these topics in depth:

→ https://complexdatainsights.com/microbiome-premium

It builds on the same dataset and structure but moves from descriptive visualization toward formal inference and analytical decision-making.