Q&A 19 How do you build a Logistic Regression model for microbiome classification?

19.1 Explanation

Logistic Regression is a foundational classification model useful for:

- Binary prediction (e.g., Control vs Treatment)
- Interpreting OTU effects via coefficients
- Establishing baselines before more complex models
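
Concretely, the model estimates the probability of the positive class (Treatment, coded as 1 in the Python example below) as a logistic function of a weighted sum of OTU abundances:

P(y = Treatment | x) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bp*xp)))

Each coefficient bi is therefore the change in the log-odds of Treatment per unit increase in OTU i, which is what makes the coefficients directly interpretable.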

This Q&A shows how to train and evaluate a Logistic Regression model.

19.2 Python Code

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Load and prepare data
otu_df = pd.read_csv("data/otu_table_filtered.tsv", sep="\t", index_col=0).T
meta_df = pd.read_csv("data/sample_metadata.tsv", sep="\t")
data = pd.merge(otu_df, meta_df, left_index=True, right_on="sample_id")
X = data[otu_df.columns]
y = data["group"].map({"Control": 0, "Treatment": 1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train logistic regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)

# Inspect top OTUs by absolute coefficient magnitude
coef = pd.Series(model.coef_[0], index=X.columns)
print("Top OTUs:
", coef.abs().sort_values(ascending=False).head())
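
To put the coefficients on a more interpretable scale, you can exponentiate them to obtain odds ratios. The following is a minimal sketch, reusing the model, X, and coef objects from the code above; note that coefficient magnitudes are only comparable across OTUs when the features are on similar scales (e.g., after a relative-abundance transform or standardization).

import numpy as np

# exp(coefficient) is an odds ratio: the factor by which the odds of
# "Treatment" are multiplied for a one-unit increase in that OTU's abundance
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
top_otus = coef.abs().sort_values(ascending=False).head().index
print("Odds ratios for the top OTUs:")
print(odds_ratios[top_otus])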

19.3 R Code (caret)

library(tidyverse)
library(caret)

otu_df <- read.delim("data/otu_table_filtered.tsv", row.names = 1)
meta_df <- read.delim("data/sample_metadata.tsv")
otu_df <- otu_df[, meta_df$sample_id]
otu_df <- t(otu_df)
data <- cbind(as.data.frame(otu_df), group = meta_df$group)

# Encode group and split
data$group <- as.factor(data$group)
set.seed(42)
trainIndex <- createDataPartition(data$group, p = .7, list = FALSE)
train <- data[trainIndex, ]
test  <- data[-trainIndex, ]

# Train logistic regression
lr_model <- train(group ~ ., data = train,
                  method = "glm", family = "binomial",
                  trControl = trainControl(method = "cv", number = 5))

# Predict and evaluate
pred <- predict(lr_model, newdata = test)
confusionMatrix(pred, test$group)
Confusion Matrix and Statistics

           Reference
Prediction  Control Treatment
  Control         1         0
  Treatment       0         1
                                     
               Accuracy : 1          
                 95% CI : (0.1581, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : 0.25       
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : Control
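
With only a couple of samples in the held-out partition, the perfect accuracy above comes with a very wide confidence interval, so a single split is a noisy estimate. For small microbiome datasets, cross-validation over the whole dataset gives a more stable number. The following is a minimal Python sketch, reusing X and y from section 19.2:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Leave-one-out CV: each sample is held out once for testing,
# so every sample contributes to the accuracy estimate
loo_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=LeaveOneOut(), scoring="accuracy")
print(f"Leave-one-out accuracy: {loo_scores.mean():.2f}")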