Lesson 4 Visualizing Microbiome Composition

This lesson focuses on composition-focused visualization—turning a microbiome feature table into interpretable summaries.

We use intermediate CSVs exported in Lesson 03 to generate clean, publication-ready visuals.

4.1 Learning objectives

By the end of this lesson, you will be able to:

Create a sequencing depth plot
Create a stacked bar plot of top genera across samples and groups
Create a heatmap of taxa abundance
Save all figures to a single figures/ folder using the CDI plotting workflow

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/04_001.png)
_ = cdi_notebook_init(chapter="04", title_x=0)

4.2 Load intermediate data (from Lesson 03)

These files are created in Lesson 03 during rendering:

data/intermediate/dietswap-relative-tidy.csv
data/intermediate/dietswap-sample-depth.csv
data/intermediate/dietswap-taxa-totals.csv

from pathlib import Path
import pandas as pd

p_tidy  = Path('data/intermediate/dietswap-relative-tidy.csv')
p_depth = Path('data/intermediate/dietswap-sample-depth.csv')
p_taxa  = Path('data/intermediate/dietswap-taxa-totals.csv')

missing = [str(p) for p in [p_tidy, p_depth, p_taxa] if not p.exists()]
if missing:
    raise FileNotFoundError(
        'Missing intermediate CSVs. Render Lesson 03 first (or run the Rmd build) to generate them:\n- ' + '\n- '.join(missing)
    )

df = pd.read_csv(p_tidy)
depth = pd.read_csv(p_depth)
taxa = pd.read_csv(p_taxa)

df.head()

	OTU	Sample	Abundance	subject	sex	nationality	group	sample	timepoint	timepoint.within.group	bmi_group	Phylum	Family	Genus
0	Prevotella melaninogenica et rel.	Sample-187	0.769942	kpb	male	AAM	DI	Sample-187	4	1	obese	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
1	Prevotella melaninogenica et rel.	Sample-182	0.760767	kpb	male	AAM	ED	Sample-182	1	1	obese	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
2	Prevotella melaninogenica et rel.	Sample-210	0.750560	qjy	female	AFR	ED	Sample-210	1	1	overweight	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
3	Prevotella melaninogenica et rel.	Sample-104	0.748627	vem	male	AFR	HE	Sample-104	3	2	lean	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
4	Prevotella melaninogenica et rel.	Sample-168	0.747613	mnk	female	AAM	HE	Sample-168	3	2	obese	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.

4.3 Sequencing depth per sample

import matplotlib.pyplot as plt

depth_sorted = depth.sort_values('depth')

plt.figure(figsize=(12, 5))
plt.bar(range(len(depth_sorted)), depth_sorted['depth'])
plt.title('Sequencing depth per sample')
plt.xlabel('Samples (sorted)')
plt.ylabel('Total reads')
plt.xticks([])
plt.tight_layout()

show_and_save_mpl()

Saved PNG → figures/04_001.png

4.4 Stacked bar plot of top genera

We keep the top N genera and collapse the rest into Other for readability.

import numpy as np

rank_col = 'Genus' if 'Genus' in df.columns else ('OTU' if 'OTU' in df.columns else None)
if rank_col is None:
    raise ValueError('No taxonomic column found (expected Genus or OTU).')

top_n = 12
top_taxa = (
    df.groupby(rank_col)['Abundance']
      .sum()
      .sort_values(ascending=False)
      .head(top_n)
      .index
)

df_plot = df.copy()
df_plot[rank_col] = df_plot[rank_col].where(df_plot[rank_col].isin(top_taxa), other='Other')

wide = (df_plot.groupby(['Sample', rank_col])['Abundance'].sum().unstack(fill_value=0))

if 'group' in df_plot.columns:
    meta = df_plot[['Sample','group']].drop_duplicates().set_index('Sample')
    wide = wide.loc[meta.sort_values('group').index]

ax = wide.plot(kind='bar', stacked=True, figsize=(14, 5))
ax.set_title(f'Top {top_n} {rank_col} (relative abundance)')
ax.set_xlabel('')
ax.set_ylabel('Relative abundance')
ax.set_xticks([])
ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', title=rank_col)
plt.tight_layout()

show_and_save_mpl()

Saved PNG → figures/04_002.png

4.5 Heatmap of taxa abundance

import seaborn as sns

top_hm = (
    df.groupby(rank_col)['Abundance']
      .sum()
      .sort_values(ascending=False)
      .head(25)
      .index
)

hm = (df[df[rank_col].isin(top_hm)]
      .groupby([rank_col, 'Sample'])['Abundance']
      .sum()
      .unstack(fill_value=0))

hm_log = np.log10(hm + 1e-6)

plt.figure(figsize=(12, 7))
sns.heatmap(hm_log, cbar_kws={'label': 'log10(relative abundance)'})
plt.title(f'Top taxa heatmap ({rank_col}, log10 relative abundance)')
plt.xlabel('Sample')
plt.ylabel(rank_col)
plt.tight_layout()

show_and_save_mpl()

Saved PNG → figures/04_003.png

4.6 Key takeaways

Sequencing depth provides essential context for composition plots
Stacked bar plots are clearest with top taxa + simplified legends
Heatmaps should be limited to top features and use log scaling for contrast
show_and_save_mpl() ensures figures are saved consistently to figures/

Continue to Completing the Free Track