Lesson 2 Microbiome Data and Metadata Fundamentals

In microbiome analysis, many downstream errors originate not in statistical testing or visualization but much earlier, at the level of data structure, identifiers, and metadata alignment.

This lesson establishes the conceptual and practical foundations required to work safely with microbiome data throughout the rest of the guide.

2.1 Learning objectives

By the end of this lesson, you will be able to:

Identify the core components of microbiome data analysis
Understand how feature tables, taxonomy, and metadata relate
Import QIIME 2 outputs into R using reproducible workflows
Validate sample identifiers and metadata alignment
Construct a clean phyloseq object for downstream analysis

2.2 What constitutes microbiome “data”?

Modern microbiome analyses are built around three primary data objects:

Feature table — a matrix of counts or abundances
- rows: features (ASVs / OTUs)
- columns: samples
Taxonomy table — mapping of features to taxonomic ranks
- Kingdom → Phylum → Class → Order → Family → Genus → Species
Sample metadata — experimental and biological context
- subject ID, group/condition, covariates (age, sex, site, batch, etc.)

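These objects are stored separately but are interdependent: the feature table and taxonomy table are linked by feature IDs, and the feature table and metadata are linked by sample IDs. A valid analysis requires that they be synchronized and consistent.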
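As a minimal sketch of how the three objects relate, the toy example below builds each one in base R and checks the two ID links. All IDs, counts, and group labels are invented for illustration; real tables are far larger and come from the files described in the next section.

```r
# Toy data for illustration only; every ID and value here is invented.

# Feature table: rows are features (ASVs), columns are samples
feature_table <- matrix(
  c(10, 0, 3,
     5, 8, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ASV1", "ASV2"), c("S1", "S2", "S3"))
)

# Taxonomy table: one row per feature, one column per rank (two ranks shown)
taxonomy <- data.frame(
  Phylum = c("Firmicutes", "Bacteroidota"),
  Genus  = c("Lactobacillus", "Bacteroides"),
  row.names = c("ASV1", "ASV2")
)

# Sample metadata: one row per sample
metadata <- data.frame(
  group = c("control", "treated", "control"),
  row.names = c("S1", "S2", "S3")
)

# Feature IDs link the feature table to the taxonomy table
features_match <- setequal(rownames(feature_table), rownames(taxonomy))

# Sample IDs link the feature table to the metadata
samples_match <- setequal(colnames(feature_table), rownames(metadata))

features_match
samples_match
```

Both checks return TRUE here because the toy IDs were chosen to agree. With real data, either link can break silently, which is why the identifiers are validated later in this lesson.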

2.3 QIIME 2 as the upstream reference

This guide assumes that raw sequencing data have already been processed using a modern pipeline such as QIIME 2. The analysis starts from two QIIME 2 artifacts and the sample metadata file that accompanies them:

feature-table.qza
taxonomy.qza
metadata.tsv

We omit HPC/cluster-level processing steps (raw reads → feature table/taxonomy) in the main lessons and focus on analysis-ready inputs.

2.3.1 File paths used in this lesson

This notebook expects the following files to exist:

data/qiime2/feature-table.qza
data/qiime2/taxonomy.qza
data/qiime2/metadata.tsv

If you haven’t obtained example data yet, place these files into data/qiime2/ before running the code.
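A quick way to confirm the inputs are in place before importing is sketched below. The helper name find_missing_inputs() is hypothetical (it is not part of qiime2R or phyloseq); it only wraps base R's file.exists().

```r
# Paths listed in section 2.3.1
required_files <- c(
  "data/qiime2/feature-table.qza",
  "data/qiime2/taxonomy.qza",
  "data/qiime2/metadata.tsv"
)

# Hypothetical helper: return the subset of paths that do not exist
find_missing_inputs <- function(paths) {
  paths[!file.exists(paths)]
}

missing_inputs <- find_missing_inputs(required_files)

# Report the result without stopping, so the sketch runs from any directory
if (length(missing_inputs) > 0) {
  message("Missing input file(s): ", paste(missing_inputs, collapse = ", "))
} else {
  message("All expected input files were found.")
}
```

In a real notebook you may prefer to replace message() with stop() so that the import code never runs against an incomplete data directory.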

Reference (qiime2R): The official qiime2R vignette explains read_qza() and qza_to_phyloseq() in depth. Use it as a reference when importing QIIME 2 artifacts into R.

# Core packages
library(qiime2R)
library(phyloseq)

2.4 Import QIIME 2 artifacts into a phyloseq object

qiime2R::qza_to_phyloseq() provides a clean, reliable way to combine feature table, taxonomy, and metadata into a single object.

# Build phyloseq object directly from QIIME 2 outputs ----------------------
ps <- qza_to_phyloseq(
  features = "data/qiime2/feature-table.qza",
  taxonomy = "data/qiime2/taxonomy.qza",
  metadata = "data/qiime2/metadata.tsv"
)

ps

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 249 taxa and 128 samples ]
sample_data() Sample Data:       [ 128 samples by 12 sample variables ]
tax_table()   Taxonomy Table:    [ 249 taxa by 7 taxonomic ranks ]

2.5 Sanity checks

Confirm that the object contains the expected number of samples, taxa, and metadata variables.

nsamples(ps)

[1] 128

ntaxa(ps)

[1] 249

sample_variables(ps)

 [1] "forward.absolute.filepath" "reverse.absolute.filepath"
 [3] "unit_name"                 "sra"                      
 [5] "adapters"                  "bioproject"               
 [7] "type"                      "librarylayout"            
 [9] "organism"                  "latitude"                 
[11] "longitude"                 "bases"
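Beyond counting variables, it can help to check that the variables a later analysis needs are actually present. The sketch below uses the variable names printed above; the list of required variables is an assumption made up for illustration (in particular, "subject_id" is hypothetical and does not appear in this dataset).

```r
# Variable names as printed by sample_variables(ps) above
observed_vars <- c(
  "forward.absolute.filepath", "reverse.absolute.filepath",
  "unit_name", "sra", "adapters", "bioproject",
  "type", "librarylayout", "organism",
  "latitude", "longitude", "bases"
)

# Hypothetical set of variables a downstream analysis might rely on
required_vars <- c("type", "latitude", "longitude", "subject_id")

# Any required variable that is absent from the metadata
missing_vars <- setdiff(required_vars, observed_vars)
missing_vars
```

Here the only missing variable is the invented "subject_id". In a live session, observed_vars could be replaced by sample_variables(ps) directly.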

2.6 Validate identifiers

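Even when using qza_to_phyloseq(), it’s good practice to confirm that sample identifiers align. Keep in mind that phyloseq retains only the samples shared by all components when the object is built, so the check below confirms internal consistency; it cannot reveal samples that were dropped during import because their IDs did not match.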

sample_ids_feature_table <- sample_names(ps)
sample_ids_metadata <- rownames(as(sample_data(ps), "data.frame"))

length(intersect(sample_ids_feature_table, sample_ids_metadata))

[1] 128

setdiff(sample_ids_feature_table, sample_ids_metadata)

character(0)

setdiff(sample_ids_metadata, sample_ids_feature_table)

character(0)
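To catch mismatches that exist before the object is built, the same set comparisons can be applied to the identifiers taken from the raw inputs. The sketch below is a minimal base R version; compare_sample_ids() is a hypothetical helper and the IDs are invented.

```r
# Hypothetical helper: summarize how two sets of sample IDs overlap
compare_sample_ids <- function(feature_ids, metadata_ids) {
  list(
    shared              = intersect(feature_ids, metadata_ids),
    only_in_features    = setdiff(feature_ids, metadata_ids),
    only_in_metadata    = setdiff(metadata_ids, feature_ids),
    duplicated_metadata = unique(metadata_ids[duplicated(metadata_ids)])
  )
}

# Invented IDs: "s3" differs from "S3" only by case, and "S4" has no counts
feature_ids  <- c("S1", "S2", "S3")
metadata_ids <- c("S1", "S2", "s3", "S4")

id_report <- compare_sample_ids(feature_ids, metadata_ids)

id_report$shared            # IDs present in both inputs
id_report$only_in_features  # samples with counts but no metadata row
id_report$only_in_metadata  # metadata rows with no matching counts
```

With real files, the two ID vectors might be obtained with qiime2R, for example colnames(read_qza("data/qiime2/feature-table.qza")$data) for the feature table and the SampleID column returned by read_q2metadata("data/qiime2/metadata.tsv") for the metadata. Treat those accessors as assumptions to verify against the qiime2R documentation for your installed version.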

2.7 Save an analysis-ready object

Save the validated object so downstream lessons can load it directly.

dir.create("data/intermediate", recursive = TRUE, showWarnings = FALSE)
saveRDS(ps, file = "data/intermediate/phyloseq-clean.rds")
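A round-trip check is a cheap way to confirm that the saved file can be read back unchanged. The sketch below uses a small stand-in list rather than the phyloseq object, and writes to a temporary directory so it leaves nothing behind; the file name is arbitrary.

```r
# Stand-in object; in this lesson the object of interest would be `ps`
obj <- list(counts = matrix(1:6, nrow = 2), note = "toy example")

# Write to a temporary location so the sketch does not touch project files
out_file <- file.path(tempdir(), "roundtrip-check.rds")
saveRDS(obj, file = out_file)

# Reload and confirm that the object survived the trip through disk
reloaded <- readRDS(out_file)
roundtrip_ok <- identical(obj, reloaded)
roundtrip_ok
```

The same pattern should carry over to the saved phyloseq object by reading data/intermediate/phyloseq-clean.rds with readRDS() and comparing the result with ps.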

2.8 Key takeaways

Microbiome analysis depends critically on data integrity
A QIIME 2 feature table and taxonomy, together with the sample metadata, are sufficient to begin downstream analysis
qza_to_phyloseq() creates a consistent, analysis-ready phyloseq object
Save a clean object once, then reuse it across the guide

Continue to Lesson 03 — Exploring and Summarizing Microbiome Feature Tables