Lesson 3 Exploring and Summarizing Microbiome Feature Tables

Before applying statistical tests or machine learning models, it is essential to understand the structure and basic properties of microbiome feature tables.

In this lesson, we explore microbiome count data using a real, published dataset and generate downstream-ready intermediate files that will be reused in upcoming lessons.

CDI workflow note: This notebook uses a Python kernel for consistency. All R code is provided inside fenced blocks ({r} ...), ready for conversion to .Rmd and execution during bookdown rendering.

3.1 Learning objectives

By the end of this lesson, you will be able to:

Load and inspect a real-world microbiome dataset
Understand the structure of a feature (OTU/ASV) table
Summarize sequencing depth across samples
Quantify sparsity and identify dominant taxa
Export intermediate CSV files for downstream lessons and Python visualization

3.2 Real-world dataset: `dietswap`

We use the dietswap dataset provided by the microbiome R package.

library(phyloseq)
library(microbiome)
library(ggplot2)
library(dplyr)
library(tidyr)

data("dietswap", package = "microbiome")
ps <- dietswap
ps

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 130 taxa and 222 samples ]
sample_data() Sample Data:       [ 222 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 130 taxa by 3 taxonomic ranks ]

# Ensure data directory exists
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)

# Save canonical phyloseq object
saveRDS(ps, file = "data/raw/dietswap-phyloseq.rds")

3.3 Inspecting the data object

# Dimensions
ntaxa(ps)

[1] 130

nsamples(ps)

[1] 222

# Metadata variables
sample_variables(ps)

[1] "subject"                "sex"                    "nationality"           
[4] "group"                  "sample"                 "timepoint"             
[7] "timepoint.within.group" "bmi_group"

3.4 Inspect taxonomic labels

At this stage, you may notice taxon names such as Lactobacillus gasseri et rel. or Prevotella melaninogenica et rel.

## Show unique genus labels (first few)
unique(as.vector(tax_table(ps)[, "Genus"]))[1:10]

 [1] "Actinomycetaceae"                  "Aerococcus"                       
 [3] "Aeromonas"                         "Akkermansia"                      
 [5] "Alcaligenes faecalis et rel."      "Allistipes et rel."               
 [7] "Anaerobiospirillum"                "Anaerofustis"                     
 [9] "Anaerostipes caccae et rel."       "Anaerotruncus colihominis et rel."

What does “et rel.” mean in taxon names?

The suffix “et rel.” stands for “et relatives”.
It indicates that the taxon includes the named organism and closely related taxa that cannot be reliably distinguished using marker-gene sequencing (e.g. 16S rRNA).

This reflects taxonomic resolution limits, not an error in the data.

When interpreting results: - Treat “et rel.” taxa as groups of related organisms - Avoid species-level causal claims - Use them confidently for diversity, differential abundance, and ML analyses

3.5 Sequencing depth per sample

Sequencing depth variation matters for interpretation and for later normalization decisions.

depth <- sample_sums(ps)
summary(depth)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1776   10033   13255   13284   16092   28883

3.6 Sparsity of microbiome data

Microbiome feature tables are typically sparse: many taxa are absent in many samples.

otu_mat <- as(otu_table(ps), "matrix")
mean(otu_mat == 0)

[1] 0.205648

3.7 Dominant taxa

Identify taxa that dominate total counts across the dataset.

taxa_sums(ps) |>
  sort(decreasing = TRUE) |>
  head(10)

   Prevotella melaninogenica et rel.   Oscillospira guillermondii et rel. 
                              873314                               351976 
        Bacteroides vulgatus et rel.        Clostridium cellulosi et rel. 
                              279145                               153230 
           Prevotella oralis et rel. Faecalibacterium prausnitzii et rel. 
                              133207                               108484 
      Sporobacter termitidis et rel.        Clostridium symbiosum et rel. 
                               65092                                57279 
                  Allistipes et rel.     Clostridium orbiscindens et rel. 
                               55677                                54064

3.8 Raw vs relative abundance

Raw counts reflect sequencing output; relative abundance reflects composition.

ps_rel <- phyloseq::transform_sample_counts(ps, function(x) x / sum(x))
ps_rel

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 130 taxa and 222 samples ]
sample_data() Sample Data:       [ 222 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 130 taxa by 3 taxonomic ranks ]

3.9 Saving intermediate data for downstream lessons

To support a Python-kernel notebook workflow, we export intermediate objects as CSV files.

These CSVs will be reused in later lessons (stacked bar plots, heatmaps, ordination, etc.).

# Create output folder
dir.create("data/intermediate", recursive = TRUE, showWarnings = FALSE)

# Tidy relative-abundance table (best for Python plotting)
df_rel <- psmelt(ps_rel)
write.csv(df_rel, "data/intermediate/dietswap-relative-tidy.csv", row.names = FALSE)

# Sample sequencing depth with metadata
depth_df <- data.frame(sample_id = names(sample_sums(ps)), depth = as.numeric(sample_sums(ps)))
meta_df <- data.frame(sample_data(ps))
meta_df$sample_id <- rownames(meta_df)
depth_df <- merge(depth_df, meta_df, by = "sample_id", all.x = TRUE)
write.csv(depth_df, "data/intermediate/dietswap-sample-depth.csv", row.names = FALSE)

# Taxa totals with taxonomy (useful for 'top taxa' selection)
taxa_df <- data.frame(taxon_id = names(taxa_sums(ps)), total_count = as.numeric(taxa_sums(ps)))
tax <- data.frame(tax_table(ps))
tax$taxon_id <- rownames(tax)
taxa_df <- merge(taxa_df, tax, by = "taxon_id", all.x = TRUE)
write.csv(taxa_df, "data/intermediate/dietswap-taxa-totals.csv", row.names = FALSE)

# Quick sanity check
file.exists("data/intermediate/dietswap-relative-tidy.csv")

[1] TRUE

3.10 Key takeaways

Feature tables are high-dimensional and sparse
Sequencing depth varies across samples
Early summaries reveal structure and potential issues
Exporting CSV intermediates enables clean hybrid R + Python workflows

Continue to Lesson 04 — Visualizing Microbiome Composition