Q&A 5 How do you explore and summarize a microbiome OTU table?

5.1 Explanation

After generating an OTU (or feature) table from raw sequencing data, it’s essential to inspect and summarize it before moving into alpha or beta diversity analysis.

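The OTU table is typically a samples × features (ASVs/OTUs) matrix, where each cell holds the abundance count of one feature in one sample. Orientation varies between tools: the code in this section loads the table transposed, with OTUs as rows and samples as columns.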
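To make the layout concrete, here is a minimal sketch using a toy table. The OTU names, sample names, and counts are all made up for illustration. It stores OTUs as rows and samples as columns, the orientation the code in this section loads, and shows that a transpose gives the other layout.

```python
import pandas as pd

# Hypothetical toy counts: 3 OTUs (rows) x 4 samples (columns).
toy_otu = pd.DataFrame(
    {
        "Sample_1": [12, 0, 3],
        "Sample_2": [7, 5, 0],
        "Sample_3": [0, 0, 9],
        "Sample_4": [4, 1, 2],
    },
    index=["OTU_1", "OTU_2", "OTU_3"],
)

# Each cell is the count of one OTU in one sample.
count = toy_otu.loc["OTU_2", "Sample_2"]

# Some tools expect samples as rows; transposing switches the orientation.
samples_by_otus = toy_otu.T

print(toy_otu.shape)          # (n_otus, n_samples)
print(samples_by_otus.shape)  # (n_samples, n_otus)
```

Checking the shape right after loading is a cheap way to confirm which axis holds the samples before computing any per-sample or per-OTU summary.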

Key summary steps include:

- Calculating sample richness (how many OTUs each sample contains)
- Measuring OTU prevalence (in how many samples each OTU occurs)
- Assessing abundance distribution (e.g., sparse vs dominant OTUs)
- Identifying sparse or noisy features that may need filtering

5.2 Python Code

import pandas as pd

# Load OTU table (OTUs as rows, samples as columns)
otu_df = pd.read_csv("data/otu_table.tsv", sep="\t", index_col=0)

# Number of OTUs per sample (richness)
sample_richness = (otu_df > 0).sum(axis=0)

# Number of samples per OTU (prevalence)
otu_prevalence = (otu_df > 0).sum(axis=1)

# Distribution of total counts per OTU
otu_abundance_summary = otu_df.sum(axis=1).describe()

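# Print each label on its own line so the Series output stays aligned
print("Sample Richness:", sample_richness.head(), sep="\n")
print("OTU Prevalence:", otu_prevalence.head(), sep="\n")
print("Abundance Summary:", otu_abundance_summary, sep="\n")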
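
The code above covers richness, prevalence, and abundance, but not the last step from the explanation: flagging sparse or noisy features. The following is a minimal sketch of one common approach, a combined prevalence and total-count filter. The toy table, the variable names, and both thresholds are assumptions chosen for illustration, not recommended defaults.

```python
import pandas as pd

# Hypothetical toy table with the same layout as otu_df
# (OTUs as rows, samples as columns).
toy_df = pd.DataFrame(
    {
        "Sample_1": [120, 0, 3, 1],
        "Sample_2": [80, 0, 0, 0],
        "Sample_3": [95, 2, 7, 0],
        "Sample_4": [110, 0, 5, 0],
    },
    index=["OTU_1", "OTU_2", "OTU_3", "OTU_4"],
)

# Illustrative thresholds only; tune them for the dataset at hand.
min_prevalence = 2    # OTU must be present in at least this many samples
min_total_count = 10  # OTU must have at least this many reads overall

prevalence = (toy_df > 0).sum(axis=1)
total_count = toy_df.sum(axis=1)

# Keep OTUs that pass both criteria.
keep = (prevalence >= min_prevalence) & (total_count >= min_total_count)
otu_filtered = toy_df.loc[keep]

# Overall sparsity: fraction of cells in the table that are zero.
sparsity = (toy_df == 0).to_numpy().mean()

print(f"Kept {keep.sum()} of {len(keep)} OTUs")
print(f"Zero cells: {sparsity:.1%}")
```

To try this on the real table, replace `toy_df` with `otu_df`. Comparing the number of OTUs kept at a few different thresholds shows how sensitive later diversity estimates may be to the filtering choice.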

5.3 R Code

otu_df <- read.delim("data/otu_table.tsv", row.names = 1)

# Sample richness: number of OTUs per sample
colSums(otu_df > 0)
 Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7  Sample_8 
       44        44        43        40        40        42        45        43 
 Sample_9 Sample_10 
       45        41 
# OTU prevalence: number of samples each OTU appears in
rowSums(otu_df > 0)
 OTU_1  OTU_2  OTU_3  OTU_4  OTU_5  OTU_6  OTU_7  OTU_8  OTU_9 OTU_10 OTU_11 
    10      7      9      9      8      8      7      8      9      8     10 
OTU_12 OTU_13 OTU_14 OTU_15 OTU_16 OTU_17 OTU_18 OTU_19 OTU_20 OTU_21 OTU_22 
     9     10      9      9      7      9     10      9     10      7      8 
OTU_23 OTU_24 OTU_25 OTU_26 OTU_27 OTU_28 OTU_29 OTU_30 OTU_31 OTU_32 OTU_33 
     8      9      8      9      9     10      8      8      7      9      9 
OTU_34 OTU_35 OTU_36 OTU_37 OTU_38 OTU_39 OTU_40 OTU_41 OTU_42 OTU_43 OTU_44 
     8     10     10      9     10      7      8      8      9      9      8 
OTU_45 OTU_46 OTU_47 OTU_48 OTU_49 OTU_50 
     8      8      8      7      7      9 
# Distribution of total counts per OTU
summary(rowSums(otu_df))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   36.0    41.0    48.0    47.4    52.0    64.0