Lesson 2 Microbiome Data and Metadata Fundamentals
In microbiome analysis, the most common source of downstream errors is not statistical testing or visualization, but problems introduced much earlier, at the level of data structure, identifiers, and metadata alignment.
This lesson establishes the conceptual and practical foundations required to work safely with microbiome data throughout the rest of the guide.
2.1 Learning objectives
By the end of this lesson, you will be able to:
- Identify the core components of microbiome data analysis
- Understand how feature tables, taxonomy, and metadata relate
- Import QIIME 2 outputs into R using reproducible workflows
- Validate sample identifiers and metadata alignment
- Construct a clean phyloseq object for downstream analysis
2.2 What constitutes microbiome “data”?
Modern microbiome analyses are built around three primary data objects:
- Feature table: a matrix of counts or abundances
  - rows: features (ASVs / OTUs)
  - columns: samples
- Taxonomy table: mapping of features to taxonomic ranks
  - Kingdom → Phylum → Class → Order → Family → Genus → Species
- Sample metadata: experimental and biological context
  - subject ID, group/condition, covariates (age, sex, site, batch, etc.)
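These objects are stored separately but depend on one another: feature IDs link the feature table to the taxonomy table, and sample IDs link the feature table to the metadata. A valid analysis requires that these identifiers stay synchronized and consistent across all three.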
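To make the relationship concrete, here is a minimal sketch in base R using made-up values. The object names (feature_table, taxonomy, metadata) and every ID and value are hypothetical, chosen only for illustration. The point is to show which identifiers tie the three objects together.

```r
# Toy data only: all IDs and values below are hypothetical.

# Feature table: rows are features (ASVs), columns are samples
feature_table <- matrix(
  c(12, 0, 7,
     3, 9, 0),
  nrow = 3,
  dimnames = list(c("ASV1", "ASV2", "ASV3"), c("S1", "S2"))
)

# Taxonomy table: one row per feature, one column per rank (two ranks shown)
taxonomy <- data.frame(
  Phylum = c("Firmicutes", "Bacteroidota", "Firmicutes"),
  Genus  = c("Lactobacillus", "Bacteroides", "Faecalibacterium"),
  row.names = c("ASV1", "ASV2", "ASV3")
)

# Sample metadata: one row per sample
metadata <- data.frame(
  subject_id = c("P01", "P02"),
  group      = c("case", "control"),
  row.names  = c("S1", "S2")
)

# Feature IDs link the feature table to the taxonomy table
features_match <- setequal(rownames(feature_table), rownames(taxonomy))

# Sample IDs link the feature table to the metadata
samples_match <- setequal(colnames(feature_table), rownames(metadata))

features_match
samples_match
```

If either check returned FALSE, at least one feature or sample would lack its taxonomy or its metadata, which is the kind of misalignment this lesson is meant to catch early.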
2.3 QIIME 2 as the upstream reference
This guide assumes that raw sequencing data have already been processed using a modern pipeline such as QIIME 2, producing:
- feature-table.qza
- taxonomy.qza
- metadata.tsv
We omit HPC/cluster-level processing steps (raw reads → feature table/taxonomy) in the main lessons and focus on analysis-ready inputs.
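Before importing, it can help to confirm that the three inputs are where you expect them. The following is a minimal sketch in base R, assuming the data/qiime2/ paths used in section 2.4. The names input_files, found, and missing_files are illustrative only.

```r
# Paths as used in section 2.4; adjust them to your own project layout.
input_files <- c(
  features = "data/qiime2/feature-table.qza",
  taxonomy = "data/qiime2/taxonomy.qza",
  metadata = "data/qiime2/metadata.tsv"
)

# file.exists() returns one TRUE/FALSE per path
found <- file.exists(input_files)
names(found) <- names(input_files)

# Report any path that could not be found
missing_files <- input_files[!found]
if (length(missing_files) > 0) {
  message("Missing input files: ", paste(missing_files, collapse = ", "))
}
```

A check like this only confirms that the files are present; it says nothing about whether their contents are valid.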
2.4 Import QIIME 2 artifacts into a phyloseq object
qiime2R::qza_to_phyloseq() provides a clean, reliable way to combine feature table, taxonomy, and metadata into a single object.
# Load the packages used in this lesson
library(qiime2R)
library(phyloseq)

# Build phyloseq object directly from QIIME 2 outputs ----------------------
ps <- qza_to_phyloseq(
  features = "data/qiime2/feature-table.qza",
  taxonomy = "data/qiime2/taxonomy.qza",
  metadata = "data/qiime2/metadata.tsv"
)
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 249 taxa and 128 samples ]
sample_data() Sample Data: [ 128 samples by 12 sample variables ]
tax_table() Taxonomy Table: [ 249 taxa by 7 taxonomic ranks ]
2.5 Sanity checks
Confirm that the object contains the expected number of samples, taxa, and metadata variables.
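The calls that produced the output below are not shown. One plausible way to get these three summaries is with the phyloseq accessors nsamples(), ntaxa(), and sample_variables(). The sketch below runs them on a small stand-in object, ps_toy, so that it works on its own. The toy IDs and values are hypothetical, and this is an assumption about the approach rather than the original code.

```r
library(phyloseq)

# Toy inputs (hypothetical IDs and values) so the sketch runs on its own
counts <- matrix(
  c(10, 0, 5,
     3, 7, 1),
  nrow = 3,
  dimnames = list(c("ASV1", "ASV2", "ASV3"), c("S1", "S2"))
)
meta <- data.frame(
  group = c("case", "control"),
  batch = c("A", "B"),
  row.names = c("S1", "S2")
)

# Stand-in for the lesson's `ps` object
ps_toy <- phyloseq(
  otu_table(counts, taxa_are_rows = TRUE),
  sample_data(meta)
)

nsamples(ps_toy)          # number of samples
ntaxa(ps_toy)             # number of features (ASVs)
sample_variables(ps_toy)  # names of the metadata columns
```

Running the same three calls on ps should give the sample count, the taxa count, and the metadata variable names, in that order.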
[1] 128
[1] 249
[1] "forward.absolute.filepath" "reverse.absolute.filepath"
[3] "unit_name" "sra"
[5] "adapters" "bioproject"
[7] "type" "librarylayout"
[9] "organism" "latitude"
[11] "longitude" "bases"
2.6 Validate identifiers
Even when using qza_to_phyloseq(), it’s good practice to confirm that sample identifiers align.
sample_ids_feature_table <- sample_names(ps)
sample_ids_metadata <- rownames(as(sample_data(ps), "data.frame"))
length(intersect(sample_ids_feature_table, sample_ids_metadata))
[1] 128
character(0)
character(0)
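The two character(0) results above are consistent with set differences taken in both directions, each coming back empty. A minimal sketch of that kind of check follows, in base R with toy IDs that contain a deliberate mismatch. The helper compare_ids() is hypothetical and is not part of phyloseq or qiime2R.

```r
# Compare two vectors of sample IDs and report what is shared or unique.
# `compare_ids` is a hypothetical helper written for this illustration.
compare_ids <- function(ids_a, ids_b) {
  list(
    shared = intersect(ids_a, ids_b),
    only_a = setdiff(ids_a, ids_b),
    only_b = setdiff(ids_b, ids_a)
  )
}

# Toy IDs with a deliberate mismatch: "S3" versus "s3" (letter case differs)
table_ids    <- c("S1", "S2", "S3")
metadata_ids <- c("S1", "S2", "s3")

result <- compare_ids(table_ids, metadata_ids)
result$shared   # IDs found in both vectors
result$only_a   # IDs present only in the feature table
result$only_b   # IDs present only in the metadata
```

With the lesson's own vectors, compare_ids(sample_ids_feature_table, sample_ids_metadata) should return empty only_a and only_b elements. Both vectors there come from the same phyloseq object, so that check is expected to pass. Comparing sample_names(ps) against the IDs in the original metadata.tsv is a stricter test, because samples missing from one input may be dropped when the object is built.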
2.7 Save an analysis-ready object
Save the validated object so downstream lessons can load it directly.
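The save step itself is not shown here. One common approach in R is saveRDS() paired with readRDS(), sketched below under stated assumptions: a small stand-in list takes the place of ps so the example runs on its own, and the file name ps_lesson2.rds and the temporary directory are hypothetical choices, not paths from this guide.

```r
# Stand-in object so the sketch runs on its own; in the lesson, use `ps`.
obj <- list(n_samples = 128, n_taxa = 249)

# Hypothetical output location. A temporary directory is used here so that
# nothing is written into your project by accident.
out_path <- file.path(tempdir(), "ps_lesson2.rds")

saveRDS(obj, out_path)          # write a single R object to disk
restored <- readRDS(out_path)   # read it back into a new variable

# Confirm the round trip preserved the object
identical(obj, restored)
```

For the real object, the equivalent call would be saveRDS(ps, out_path), with out_path pointing at a location of your choice inside the project. Later lessons could then load it with readRDS() from the same path.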