Q&A 17 How do you prepare microbiome data for machine learning?
17.1 Explanation
Before applying machine learning, you must structure your OTU table and metadata into a form suitable for modeling.
Typical steps include:
- Filtering: Keep relevant OTUs/features
- Merging: Align OTU table with sample metadata
- Encoding: Set up group labels (e.g., Control = 0, Treatment = 1)
- Splitting: Train-test split to evaluate generalizability
This Q&A prepares a feature matrix (X) and binary group labels (y) for a classification task.
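The filtering step is not shown in the code for this Q&A, which loads an already filtered table. As a minimal sketch of one common approach, the snippet below keeps OTUs detected in at least a given fraction of samples. The toy counts, the helper name `filter_by_prevalence`, and the 50% threshold are illustrative assumptions, not the method used to produce `otu_table_filtered.tsv`.

```python
import pandas as pd

# Toy OTU table (made-up counts): rows are samples, columns are OTUs
otu_counts = pd.DataFrame(
    {
        "OTU_1": [10, 0, 5, 8],
        "OTU_2": [0, 0, 0, 1],
        "OTU_3": [3, 7, 0, 2],
    },
    index=["S1", "S2", "S3", "S4"],
)


def filter_by_prevalence(counts: pd.DataFrame, min_prevalence: float = 0.5) -> pd.DataFrame:
    """Keep OTUs with a nonzero count in at least `min_prevalence` of samples.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    # Fraction of samples in which each OTU is detected
    prevalence = (counts > 0).mean(axis=0)
    return counts.loc[:, prevalence >= min_prevalence]


filtered = filter_by_prevalence(otu_counts, min_prevalence=0.5)
print(filtered.columns.tolist())
```

Here `OTU_2` appears in only one of four samples, so it falls below the threshold and is dropped. A suitable cutoff depends on the dataset and is worth tuning rather than copying.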
17.2 Python Code
import pandas as pd
from sklearn.model_selection import train_test_split
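# Load data
# The OTU table is transposed (.T) so that rows are samples and columns are OTUs
otu_df = pd.read_csv("data/otu_table_filtered.tsv", sep="\t", index_col=0).T
# Metadata is expected to contain "sample_id" and "group" columns
meta_df = pd.read_csv("data/sample_metadata.tsv", sep="\t")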
# Merge OTU table with metadata by sample (inner join: samples missing from either table are dropped)
data = pd.merge(otu_df, meta_df, left_index=True, right_on="sample_id")
# Define features (X) and labels (y)
X = data[otu_df.columns] # OTU features
y = data["group"].map({"Control": 0, "Treatment": 1}) # binary encoding
# Split into training/testing sets (stratify=y keeps the class ratio similar in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Check shape
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
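Two silent failure modes are worth checking after these steps: the merge drops samples whose IDs do not match, and `.map` returns NaN for any group label outside the mapping. The sketch below reproduces the pipeline on made-up toy data and adds those checks, plus a stratified split. The toy tables and variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy inputs (made-up values) with the same layout as the real tables
sample_ids = [f"S{i}" for i in range(1, 9)]
toy_otu = pd.DataFrame(
    {"OTU_1": [5, 0, 3, 9, 1, 4, 7, 2], "OTU_2": [1, 2, 0, 0, 6, 3, 0, 5]},
    index=sample_ids,
)
toy_meta = pd.DataFrame(
    {"sample_id": sample_ids, "group": ["Control", "Treatment"] * 4}
)

merged = pd.merge(toy_otu, toy_meta, left_index=True, right_on="sample_id")

# Samples in the OTU table that did not survive the inner join
dropped = set(toy_otu.index) - set(merged["sample_id"])

# Labels outside the mapping become NaN, so count them explicitly
labels = merged["group"].map({"Control": 0, "Treatment": 1})
n_unmapped = int(labels.isna().sum())

# stratify=labels preserves the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    merged[toy_otu.columns],
    labels,
    test_size=0.25,
    random_state=42,
    stratify=labels,
)

print("Dropped samples:", dropped)
print("Unmapped labels:", n_unmapped)
print("Train class counts:", y_tr.value_counts().to_dict())
print("Test class counts:", y_te.value_counts().to_dict())
```

On real data, a non-empty `dropped` set or a nonzero `n_unmapped` usually points to inconsistent sample IDs or group names and is worth resolving before training.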