Q&A 18 How do you train and evaluate a Random Forest classifier on microbiome data?
18.1 Explanation
The Random Forest algorithm is a popular and robust model for microbiome classification tasks due to its:
- Built-in feature importance
- Resistance to overfitting
- Non-linear modeling capability

This Q&A demonstrates how to:
- Train a Random Forest classifier
- Evaluate it using accuracy and a confusion matrix
- Inspect important OTUs
18.2 Python Code
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
# Load and prepare data
otu_df = pd.read_csv("data/otu_table_filtered.tsv", sep="\t", index_col=0).T
meta_df = pd.read_csv("data/sample_metadata.tsv", sep="\t")
data = pd.merge(otu_df, meta_df, left_index=True, right_on="sample_id")
X = data[otu_df.columns]
y = data["group"].map({"Control": 0, "Treatment": 1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train Random Forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Predict and evaluate
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)
# Top 5 important OTUs
feat_imp = pd.Series(clf.feature_importances_, index=X.columns)
print("Top OTUs:\n", feat_imp.sort_values(ascending=False).head())
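The Python code above scores the model on a single 70/30 split, while the R version in the next section also uses 5-fold cross-validation during training. A rough Python counterpart is sketched below. It runs on a synthetic count table rather than the real OTU data, so `X_demo`, `y_demo`, the `OTU_*` column names, and the artificial signal added to `OTU_0` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the OTU table: 40 samples x 20 OTUs (hypothetical data)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.poisson(lam=5, size=(40, 20)),
    columns=[f"OTU_{i}" for i in range(20)],
)
y_demo = pd.Series([0] * 20 + [1] * 20)  # 0 = Control, 1 = Treatment

# Inject an artificial signal: OTU_0 is more abundant in the Treatment group
X_demo.loc[y_demo == 1, "OTU_0"] += 10

# Stratified 5-fold CV keeps the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```

With the real data, `X` and `y` from the code above would take the place of `X_demo` and `y_demo`. Each class needs at least as many samples as there are folds, so `n_splits` may have to be reduced for very small studies.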
18.3 R Code (caret)
library(tidyverse)
library(caret)
otu_df <- read.delim("data/otu_table_filtered.tsv", row.names = 1)
meta_df <- read.delim("data/sample_metadata.tsv")
otu_df <- otu_df[, meta_df$sample_id]
otu_df <- t(otu_df)
data <- cbind(as.data.frame(otu_df), group = meta_df$group)
# Encode group and split
data$group <- as.factor(data$group)
set.seed(42)
trainIndex <- createDataPartition(data$group, p = .7, list = FALSE)
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
# Train Random Forest
rf_model <- train(group ~ ., data = train, method = "rf", trControl = trainControl(method = "cv", number = 5))
# Predict and evaluate
pred <- predict(rf_model, newdata = test)
confusionMatrix(pred, test$group)
Confusion Matrix and Statistics

           Reference
Prediction  Control Treatment
  Control         0         1
  Treatment       1         0

               Accuracy : 0
                 95% CI : (0, 0.8419)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : 1

                  Kappa : -1

 Mcnemar's Test P-Value : 1

            Sensitivity : 0.0
            Specificity : 0.0
         Pos Pred Value : 0.0
         Neg Pred Value : 0.0
             Prevalence : 0.5
         Detection Rate : 0.0
   Detection Prevalence : 0.5
      Balanced Accuracy : 0.0

       'Positive' Class : Control
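The counts in the confusion matrix above sum to two, so the test set held only two samples and every statistic derived from it is very unstable. As a worked check, the sketch below recomputes several of the reported values directly from the four cells. The variable names are illustrative, and the formulas are the standard definitions of accuracy, sensitivity, specificity, and Cohen's kappa rather than code taken from caret.

```python
import numpy as np

# Confusion matrix as printed by caret: rows are predictions,
# columns are reference (true) labels, in the order Control, Treatment.
# "Control" is the positive class.
cm = np.array([[0, 1],
               [1, 0]])

tp, fp = cm[0, 0], cm[0, 1]  # predicted Control: true positives, false positives
fn, tn = cm[1, 0], cm[1, 1]  # predicted Treatment: false negatives, true negatives
n = cm.sum()

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)   # share of true Control samples recovered
specificity = tn / (tn + fp)   # share of true Treatment samples recovered
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed = accuracy
p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"Accuracy: {accuracy:.1f}")
print(f"Sensitivity: {sensitivity:.1f}, Specificity: {specificity:.1f}")
print(f"Balanced accuracy: {balanced_accuracy:.1f}")
print(f"Kappa: {kappa:.1f}")
```

Both test samples were misclassified, which is why accuracy is 0 and kappa reaches its minimum of -1. Note that scikit-learn's `confusion_matrix` puts true labels in rows and predictions in columns, the transpose of caret's layout.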