Lesson 2 Microbiome Data and Metadata Fundamentals

In microbiome analysis, the most common source of downstream errors is not statistical testing or visualization, but problems introduced much earlier, at the level of data structure, identifiers, and metadata alignment.

This lesson establishes the conceptual and practical foundations required to work safely with microbiome data throughout the rest of the guide.

2.1 Learning objectives

By the end of this lesson, you will be able to:

  • Identify the core components of microbiome data analysis
  • Understand how feature tables, taxonomy, and metadata relate
  • Import QIIME 2 outputs into R using reproducible workflows
  • Validate sample identifiers and metadata alignment
  • Construct a clean phyloseq object for downstream analysis

2.2 What constitutes microbiome “data”?

Modern microbiome analyses are built around three primary data objects:

  1. Feature table — a matrix of counts or abundances
    • rows: features (ASVs / OTUs)
    • columns: samples
  2. Taxonomy table — mapping of features to taxonomic ranks
    • Kingdom → Phylum → Class → Order → Family → Genus → Species
  3. Sample metadata — experimental and biological context
    • subject ID, group/condition, covariates (age, sex, site, batch, etc.)

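These objects are stored separately but are interdependent: feature IDs link the feature table to the taxonomy table, and sample IDs link the feature table to the sample metadata. A valid analysis requires that these identifiers be synchronized and consistent across all three objects.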
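To make the synchronization requirement concrete, here is a minimal base R sketch that builds three toy tables and checks that their identifiers line up. All object names and values (feature_table, ASV1, S1, and so on) are made up for illustration and are not part of the example dataset.

```r
# Toy feature table: rows are features (ASVs), columns are samples
feature_table <- matrix(
  c(10, 0, 3,
     5, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ASV1", "ASV2"), c("S1", "S2", "S3"))
)

# Toy taxonomy table: one row per feature, one column per rank
taxonomy <- data.frame(
  Kingdom = c("Bacteria", "Bacteria"),
  Phylum  = c("Firmicutes", "Bacteroidetes"),
  row.names = c("ASV1", "ASV2")
)

# Toy sample metadata: one row per sample
metadata <- data.frame(
  group = c("control", "treated", "treated"),
  row.names = c("S1", "S2", "S3")
)

# Feature IDs link the feature table to the taxonomy table
features_match <- setequal(rownames(feature_table), rownames(taxonomy))

# Sample IDs link the feature table to the metadata
samples_match <- setequal(colnames(feature_table), rownames(metadata))

features_match
samples_match
```

If either check is FALSE, the tables are out of sync and should be reconciled before any analysis.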

2.3 QIIME 2 as the upstream reference

This guide assumes that raw sequencing data have already been processed using a modern pipeline such as QIIME 2, producing:

  • feature-table.qza
  • taxonomy.qza
  • metadata.tsv

We omit HPC/cluster-level processing steps (raw reads → feature table/taxonomy) in the main lessons and focus on analysis-ready inputs.

2.3.1 File paths used in this lesson

This notebook expects the following files to exist:

  • data/qiime2/feature-table.qza
  • data/qiime2/taxonomy.qza
  • data/qiime2/metadata.tsv

If you haven’t obtained the example data yet, obtain it and place these files in data/qiime2/ before running the code.
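A quick way to confirm the files are in place before loading any packages is to test each path with base R. This is only a sketch; the variable names required_files and missing_files are illustrative.

```r
# Paths expected by this lesson
required_files <- c(
  "data/qiime2/feature-table.qza",
  "data/qiime2/taxonomy.qza",
  "data/qiime2/metadata.tsv"
)

# file.exists() returns one TRUE/FALSE per path
missing_files <- required_files[!file.exists(required_files)]

if (length(missing_files) > 0) {
  message("Missing input files:\n", paste(" -", missing_files, collapse = "\n"))
} else {
  message("All input files found.")
}
```

In a notebook that should halt on missing inputs, message() in the first branch could be replaced with stop().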

Reference (qiime2R): The official qiime2R vignette explains read_qza() and qza_to_phyloseq() in depth. Use it as a reference when importing QIIME 2 artifacts into R.

# Core packages
library(qiime2R)
library(phyloseq)

2.4 Import QIIME 2 artifacts into a phyloseq object

qiime2R::qza_to_phyloseq() reads the feature table, taxonomy, and metadata files and combines them into a single phyloseq object in one call.

# Build phyloseq object directly from QIIME 2 outputs ----------------------
ps <- qza_to_phyloseq(
  features = "data/qiime2/feature-table.qza",
  taxonomy = "data/qiime2/taxonomy.qza",
  metadata = "data/qiime2/metadata.tsv"
)

ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 249 taxa and 128 samples ]
sample_data() Sample Data:       [ 128 samples by 12 sample variables ]
tax_table()   Taxonomy Table:    [ 249 taxa by 7 taxonomic ranks ]

2.5 Sanity checks

Confirm that the object contains the expected number of samples, taxa, and metadata variables.

nsamples(ps)
[1] 128
ntaxa(ps)
[1] 249
sample_variables(ps)
 [1] "forward.absolute.filepath" "reverse.absolute.filepath"
 [3] "unit_name"                 "sra"                      
 [5] "adapters"                  "bioproject"               
 [7] "type"                      "librarylayout"            
 [9] "organism"                  "latitude"                 
[11] "longitude"                 "bases"                    
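The same checks can be written as assertions so that a notebook fails loudly when the object does not match expectations. The sketch below uses a small toy phyloseq object (toy_ps) in place of ps so that it runs without the lesson files; the expected values are hypothetical and would be replaced with the numbers known for your own study.

```r
library(phyloseq)

# Toy object standing in for `ps` (values are illustrative only)
counts <- matrix(
  c(10, 0, 3,
     5, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ASV1", "ASV2"), c("S1", "S2", "S3"))
)
meta <- data.frame(group = c("a", "b", "b"), row.names = c("S1", "S2", "S3"))
toy_ps <- phyloseq(
  otu_table(counts, taxa_are_rows = TRUE),
  sample_data(meta)
)

# Expected dimensions, written down once (hypothetical values for the toy)
expected_samples <- 3
expected_taxa <- 2

# Stop with an error if the object does not match expectations
stopifnot(
  nsamples(toy_ps) == expected_samples,
  ntaxa(toy_ps) == expected_taxa,
  "group" %in% sample_variables(toy_ps)
)

# Per-sample read totals help spot empty or very shallow samples
sample_sums(toy_ps)
```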

2.6 Validate identifiers

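Even when using qza_to_phyloseq(), it’s good practice to confirm that sample identifiers align. Note that phyloseq keeps only the samples shared by all components when the object is built, so the checks below confirm internal consistency; they cannot reveal samples that were dropped because their IDs did not match between the original files. Comparing nsamples(ps) with the number of samples you expect is a simple guard against such silent losses.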

sample_ids_feature_table <- sample_names(ps)
sample_ids_metadata <- rownames(as(sample_data(ps), "data.frame"))

length(intersect(sample_ids_feature_table, sample_ids_metadata))
[1] 128
setdiff(sample_ids_feature_table, sample_ids_metadata)
character(0)
setdiff(sample_ids_metadata, sample_ids_feature_table)
character(0)
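Because a phyloseq object retains only the samples common to its components, mismatches are easiest to catch by comparing identifiers taken from the raw inputs. The sketch below defines a hypothetical helper, compare_ids(), and runs it on toy ID vectors; the comments show one way the real IDs might be obtained with qiime2R, assuming the metadata file carries a #q2:types line (otherwise a plain read.delim() would be needed).

```r
# Hypothetical helper: report how two ID vectors overlap
compare_ids <- function(ids_a, ids_b) {
  list(
    shared     = intersect(ids_a, ids_b),
    only_in_a  = setdiff(ids_a, ids_b),
    only_in_b  = setdiff(ids_b, ids_a),
    duplicated = unique(c(ids_a[duplicated(ids_a)], ids_b[duplicated(ids_b)]))
  )
}

# Toy IDs standing in for the raw inputs. With the lesson files, one could use
#   colnames(qiime2R::read_qza("data/qiime2/feature-table.qza")$data)
#   qiime2R::read_q2metadata("data/qiime2/metadata.tsv")[[1]]  # first column holds sample IDs
table_ids    <- c("S1", "S2", "S3")
metadata_ids <- c("S1", "S2", "s3", "S4")  # note the lowercase "s3"

report <- compare_ids(table_ids, metadata_ids)
report$only_in_a  # IDs in the feature table but missing from the metadata
report$only_in_b  # IDs in the metadata but missing from the feature table
```

Case differences, stray whitespace, and duplicated IDs are typical causes of non-empty only_in_a or only_in_b results.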

2.7 Save an analysis-ready object

Save the validated object so downstream lessons can load it directly.

dir.create("data/intermediate", recursive = TRUE, showWarnings = FALSE)
saveRDS(ps, file = "data/intermediate/phyloseq-clean.rds")
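As a sketch of the save-and-reload pattern, the following base R example writes a toy object to a temporary file and confirms that it comes back unchanged. The object toy_object is a stand-in for ps; saveRDS() and readRDS() work the same way on any R object.

```r
# Toy stand-in for the phyloseq object (illustrative values only)
toy_object <- list(
  counts = matrix(1:4, nrow = 2,
                  dimnames = list(c("ASV1", "ASV2"), c("S1", "S2"))),
  groups = c(S1 = "control", S2 = "treated")
)

# Write to a temporary file so the sketch leaves no files behind
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_object, file = rds_path)

# Reload and confirm nothing changed in the round trip
reloaded <- readRDS(rds_path)
round_trip_ok <- identical(toy_object, reloaded)
round_trip_ok

# Clean up the temporary file
unlink(rds_path)
```

In later lessons the saved object would be loaded with readRDS("data/intermediate/phyloseq-clean.rds").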

2.8 Key takeaways

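  • Microbiome analysis depends critically on data integrity, in particular on consistent feature and sample identifiers across tables
  • QIIME 2 outputs (feature table + taxonomy + metadata) provide the inputs needed to build a phyloseq object for downstream analysis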
  • qza_to_phyloseq() creates a consistent, analysis-ready phyloseq object
  • Save a clean object once, then reuse it across the guide