Lesson 3 Exploring and Summarizing Microbiome Feature Tables
Before applying statistical tests or machine learning models, it is essential to understand the structure and basic properties of microbiome feature tables.
In this lesson, we explore microbiome count data using a real, published dataset and generate downstream-ready intermediate files that will be reused in upcoming lessons.
CDI workflow note: This notebook uses a Python kernel for consistency. All R code is provided inside fenced blocks (
{r} ...), ready for conversion to.Rmdand execution during bookdown rendering.
3.1 Learning objectives
By the end of this lesson, you will be able to:
- Load and inspect a real-world microbiome dataset
- Understand the structure of a feature (OTU/ASV) table
- Summarize sequencing depth across samples
- Quantify sparsity and identify dominant taxa
- Export intermediate CSV files for downstream lessons and Python visualization
3.2 Real-world dataset: dietswap
We use the dietswap dataset provided by the microbiome R package.
library(phyloseq)
library(microbiome)
library(ggplot2)
library(dplyr)
library(tidyr)
data("dietswap", package = "microbiome")
ps <- dietswap
psphyloseq-class experiment-level object
otu_table() OTU Table: [ 130 taxa and 222 samples ]
sample_data() Sample Data: [ 222 samples by 8 sample variables ]
tax_table() Taxonomy Table: [ 130 taxa by 3 taxonomic ranks ]
3.3 Inspecting the data object
[1] 130
[1] 222
[1] "subject" "sex" "nationality"
[4] "group" "sample" "timepoint"
[7] "timepoint.within.group" "bmi_group"
3.4 Inspect taxonomic labels
At this stage, you may notice taxon names such as Lactobacillus gasseri et rel. or Prevotella melaninogenica et rel.
[1] "Actinomycetaceae" "Aerococcus"
[3] "Aeromonas" "Akkermansia"
[5] "Alcaligenes faecalis et rel." "Allistipes et rel."
[7] "Anaerobiospirillum" "Anaerofustis"
[9] "Anaerostipes caccae et rel." "Anaerotruncus colihominis et rel."
3.5 Sequencing depth per sample
Sequencing depth variation matters for interpretation and for later normalization decisions.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1776 10033 13255 13284 16092 28883
3.6 Sparsity of microbiome data
Microbiome feature tables are typically sparse: many taxa are absent in many samples.
[1] 0.205648
3.7 Dominant taxa
Identify taxa that dominate total counts across the dataset.
Prevotella melaninogenica et rel. Oscillospira guillermondii et rel.
873314 351976
Bacteroides vulgatus et rel. Clostridium cellulosi et rel.
279145 153230
Prevotella oralis et rel. Faecalibacterium prausnitzii et rel.
133207 108484
Sporobacter termitidis et rel. Clostridium symbiosum et rel.
65092 57279
Allistipes et rel. Clostridium orbiscindens et rel.
55677 54064
3.8 Raw vs relative abundance
Raw counts reflect sequencing output; relative abundance reflects composition.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 130 taxa and 222 samples ]
sample_data() Sample Data: [ 222 samples by 8 sample variables ]
tax_table() Taxonomy Table: [ 130 taxa by 3 taxonomic ranks ]
3.9 Saving intermediate data for downstream lessons
To support a Python-kernel notebook workflow, we export intermediate objects as CSV files.
These CSVs will be reused in later lessons (stacked bar plots, heatmaps, ordination, etc.).
# Create output folder
dir.create("data/intermediate", recursive = TRUE, showWarnings = FALSE)
# Tidy relative-abundance table (best for Python plotting)
df_rel <- psmelt(ps_rel)
write.csv(df_rel, "data/intermediate/dietswap-relative-tidy.csv", row.names = FALSE)
# Sample sequencing depth with metadata
depth_df <- data.frame(sample_id = names(sample_sums(ps)), depth = as.numeric(sample_sums(ps)))
meta_df <- data.frame(sample_data(ps))
meta_df$sample_id <- rownames(meta_df)
depth_df <- merge(depth_df, meta_df, by = "sample_id", all.x = TRUE)
write.csv(depth_df, "data/intermediate/dietswap-sample-depth.csv", row.names = FALSE)
# Taxa totals with taxonomy (useful for 'top taxa' selection)
taxa_df <- data.frame(taxon_id = names(taxa_sums(ps)), total_count = as.numeric(taxa_sums(ps)))
tax <- data.frame(tax_table(ps))
tax$taxon_id <- rownames(tax)
taxa_df <- merge(taxa_df, tax, by = "taxon_id", all.x = TRUE)
write.csv(taxa_df, "data/intermediate/dietswap-taxa-totals.csv", row.names = FALSE)
# Quick sanity check
file.exists("data/intermediate/dietswap-relative-tidy.csv")[1] TRUE
3.10 Key takeaways
- Feature tables are high-dimensional and sparse
- Sequencing depth varies across samples
- Early summaries reveal structure and potential issues
- Exporting CSV intermediates enables clean hybrid R + Python workflows
Continue to Lesson 04 — Visualizing Microbiome Composition