Quality Control
ID: MICROB-005
Type: System Component
Audience: Students, researchers, analysts, and practitioners
Theme: Assessing sequencing data before feature generation
Introduction
Quality control is the stage where acquired sequencing files are evaluated before they are used to generate microbiome features.
In the Microbiome Analysis System, quality control is not only a technical filtering step. It is a decision point. The analyst must decide whether the sequencing data are complete, readable, structurally valid, and suitable for downstream processing.
A workflow can only produce defensible microbiome results if the input data are reliable enough to support analysis.
Why Quality Control Matters
Raw sequencing data can contain technical problems that affect downstream results.
Common issues include:
incomplete downloads
corrupted compressed files
empty FASTQ files
inconsistent file naming
missing paired-end mates
low read counts
short reads
variable read lengths
poor sequencing quality
adapter contamination
primer contamination
sample-level outliers
sequencing batch effects
If these issues are not detected early, they can affect feature generation, taxonomic assignment, diversity analysis, differential analysis, and biological interpretation.
Quality control helps determine whether the data are ready to move forward.
Position in the Microbiome Analysis System
Quality control occurs after data acquisition and before feature generation.
flowchart LR
A[Data Acquisition] --> B[Quality Control]
B --> C[Feature Generation]
C --> D[Taxonomic Profiling]
C --> E[Functional Profiling]
At this stage, the analyst should confirm that the acquired sequencing files are present, readable, organized, and consistent with the expected study design.
Quality Control Is a Decision Stage
Quality control should produce a decision, not only a collection of plots.
The key question is:
Are these sequencing files suitable for downstream microbiome feature generation?
Possible decisions include:
proceed to feature generation
proceed with documented limitations
exclude specific files or samples
re-download missing or corrupted files
return to metadata or acquisition checks
request clarification from the sequencing provider
use a different workflow for single-end or paired-end data
The decision should be documented because it affects all downstream conclusions.
QC Layers
MAS separates quality control into several layers.
flowchart TB
A[FASTQ Presence] --> B[Compression Integrity]
B --> C[FASTQ Structure]
C --> D[Read Count Summary]
D --> E[Read Length Summary]
E --> F[Pairing Check]
F --> G[QC Readiness Decision]
Each layer answers a different question.
FASTQ Presence
The first check is whether FASTQ files are present in the expected directories.
For the MAS example package, FASTQ files are expected under:
data/raw/ena/
or:
data/raw/ncbi/
A project may use one repository or both. Finding zero files in one directory is not automatically a problem if the dataset was intentionally acquired from the other source.
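As a minimal sketch of the presence check, the helper below counts FASTQ files directly inside one directory. The function name `count_fastq_files` is hypothetical and is not part of MAS; the file extensions are the ones used in this chapter.

```bash
#!/bin/bash
# Hypothetical helper: count FASTQ files directly inside one directory.
# Prints 0 when the directory is missing or holds no matching files.
count_fastq_files() {
  local dir="$1"
  local count=0
  local file
  for file in "${dir}"/*.fastq.gz "${dir}"/*.fq.gz "${dir}"/*.fastq "${dir}"/*.fq; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "${file}" ] || continue
    count=$((count + 1))
  done
  echo "${count}"
}
```

Calling `count_fastq_files data/raw/ena` and `count_fastq_files data/raw/ncbi` gives one count per repository, so a zero in one directory can be read alongside the other instead of being treated as a failure on its own.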
Compression Integrity
Most sequencing files are stored as compressed FASTQ files ending in:
.fastq.gz
or:
.fq.gz
Before analysis, compressed files should be tested to confirm that they can be decompressed successfully.
A corrupted gzip file may indicate an interrupted download or storage problem.
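One way to run this test is `gzip -t`, which reads the whole compressed stream and reports an error without writing any decompressed output. The sketch below wraps it in a hypothetical `gzip_status` helper that prints OK or FAIL; the helper name is an assumption made for illustration.

```bash
#!/bin/bash
# Hypothetical helper: report whether one gzip-compressed file is intact.
# `gzip -t` tests the compressed stream and exits non-zero on any error.
gzip_status() {
  local file="$1"
  if gzip -t "${file}" 2>/dev/null; then
    echo "OK"
  else
    echo "FAIL"
  fi
}
```

A FAIL here says only that the file cannot be decompressed cleanly. It does not say why, so the usual next step is to compare file size or checksum against the repository record and re-download if needed.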
FASTQ Structure
A valid FASTQ record contains four lines:
@read_identifier
sequence
+
quality_scores
Therefore, a FASTQ file in this four-line layout should have a total line count that is greater than zero and divisible by four.
This structural check does not confirm biological quality, but it helps detect incomplete or malformed files.
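The divisibility rule can be sketched as a small filter that reads decompressed FASTQ text on standard input. The function name `fastq_structure` and its three-field output are assumptions made for this illustration.

```bash
#!/bin/bash
# Hypothetical helper: read FASTQ text on standard input and print
# "<lines> <reads> <status>". Status is OK only when the line count is
# greater than zero and divisible by four.
fastq_structure() {
  local lines
  lines=$(wc -l | tr -d ' ')
  if [ "${lines}" -gt 0 ] && [ $((lines % 4)) -eq 0 ]; then
    echo "${lines} $((lines / 4)) OK"
  else
    echo "${lines} $((lines / 4)) FAIL"
  fi
}
```

For a compressed file, the input can be supplied with `gzip -cd file.fastq.gz | fastq_structure`. Note that `wc -l` counts newline characters, so a file whose last record lacks a trailing newline is counted one line short and reported as FAIL by this sketch.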
Read Count Summary
Read counts provide a basic summary of sequencing depth.
For each FASTQ file, MAS records:
file name
repository directory
number of lines
number of reads
minimum read length
maximum read length
average read length
gzip status
FASTQ structure status
Very low read counts may indicate incomplete data, failed sequencing, aggressive filtering, or toy example data.
Read Length Summary
Read length affects downstream processing.
For example, 16S amplicon workflows often depend on expected read length, target region, primer design, overlap, and quality profile.
Shotgun metagenomic workflows may use different filtering, host-removal, assembly, or profiling approaches.
Read length summaries help the analyst confirm whether the data are consistent with the expected sequencing strategy.
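A read length summary can be computed with `awk` by looking only at the second line of each four-line record. The sketch below assumes the four-line layout described earlier; the function name `read_length_summary` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: read FASTQ text on standard input and print
# "<min> <max> <mean>" for the sequence lines (line 2 of each record).
read_length_summary() {
  awk 'NR % 4 == 2 {
         len = length($0)
         count++
         total += len
         if (count == 1 || len < min) min = len
         if (len > max) max = len
       }
       END {
         if (count > 0) printf "%d %d %.2f\n", min, max, total / count
         else print "0 0 0.00"
       }'
}
```

A minimum far below the maximum, or a mean that does not match the expected sequencing strategy, is a prompt to look more closely before choosing trimming or merging parameters. It is not a verdict on its own.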
Paired-End Consistency
For paired-end data, each run should usually have two files:
SRR17868090_1.fastq.gz
SRR17868090_2.fastq.gz
A missing mate file can prevent paired-end processing.
The MAS example package uses paired-end toy FASTQ files, so each run accession has both _1 and _2 files.
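A pairing check can be sketched by looking for runs that have only one of the two mate files. The helper below assumes the `_1.fastq.gz` and `_2.fastq.gz` naming shown above; other conventions, such as `R1` and `R2`, would need a different pattern. The function name `find_unpaired_runs` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list run accessions in a directory that have only one
# of the two expected mate files (<run>_1.fastq.gz and <run>_2.fastq.gz).
# Prints one accession per line and prints nothing when every run is paired.
find_unpaired_runs() {
  local dir="$1"
  local file run
  for file in "${dir}"/*_1.fastq.gz "${dir}"/*_2.fastq.gz; do
    [ -e "${file}" ] || continue
    run=$(basename "${file}")
    # Strip the mate suffix to recover the run accession.
    run="${run%_[12].fastq.gz}"
    if [ ! -e "${dir}/${run}_1.fastq.gz" ] || [ ! -e "${dir}/${run}_2.fastq.gz" ]; then
      echo "${run}"
    fi
  done | sort -u
}
```

An empty result from `find_unpaired_runs data/raw/ena` means every run in that directory has both mates. It does not confirm that the two files contain the same number of reads, which is a separate check.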
Example QC Scripts
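The example scripts 05a-check-fastq-files.sh and 05b-build-qc-readiness-report.sh, listed in full at the end of this chapter, provide a lightweight MAS-side quality control check for the example acquisition package created in Chapter 04.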
These scripts are not replacements for tools such as FastQC, MultiQC, Cutadapt, Trimmomatic, fastp, DADA2, QIIME 2, mothur, KneadData, or other full microbiome processing workflows. They provide early file-level checks before feature generation.
The expected example dataset contains six tiny paired-end FASTQ files. Because the reads are toy sequences, the read counts are intentionally very small. This is acceptable for workflow testing, but not for biological interpretation.
Example QC Summary
A small example fastq-qc-summary.tsv may look like this:
file directory lines reads min_read_length max_read_length mean_read_length gzip_status fastq_structure_status
SRR17868090_1.fastq.gz data/raw/ena 4 1 12 12 12.00 OK OK
SRR17868090_2.fastq.gz data/raw/ena 4 1 12 12 12.00 OK OK
SRR17868091_1.fastq.gz data/raw/ena 4 1 12 12 12.00 OK OK
SRR17868091_2.fastq.gz data/raw/ena 4 1 12 12 12.00 OK OK
SRR17868092_1.fastq.gz data/raw/ena 4 1 12 12 12.00 OK OK
SRR17868092_2.fastq.gz data/raw/ena 4 1 12 12 12.00 OK OK
A real dataset should usually contain many more reads per file.
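To make "many more reads" checkable, the summary table can be filtered for files below a chosen read count. The sketch below assumes the tab-separated layout shown above, with the file name in column 1 and reads in column 4. The function name `flag_low_read_files` is hypothetical, and the threshold is an analyst choice, not a MAS default.

```bash
#!/bin/bash
# Hypothetical helper: print the names of files in a fastq-qc-summary.tsv
# whose read count (column 4) is below a caller-supplied threshold.
flag_low_read_files() {
  local summary="$1"
  local min_reads="$2"
  # Skip the header row, then compare column 4 numerically.
  awk -F'\t' -v min="${min_reads}" \
    'NR > 1 && ($4 + 0) < (min + 0) { print $1 }' "${summary}"
}
```

For example, `flag_low_read_files data/qc/fastq-qc-summary.tsv 1000` would list every toy file in the example package, which is expected for workflow testing.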
Interpreting QC Results
The lightweight MAS QC checks answer basic file-level questions:
Are FASTQ files present?
Can compressed files be decompressed?
Does each file have valid FASTQ structure?
How many reads are present?
What are the basic read length summaries?
These checks do not replace full sequencing quality assessment. They should be followed by appropriate microbiome-specific processing and quality evaluation.
For real analysis, additional checks may include:
per-base sequence quality
adapter content
primer content
sequence duplication
ambiguous bases
expected amplicon length
host contamination
negative control evaluation
sample-level outlier detection
batch effect assessment
When to Stop
The analyst should stop and review the data before feature generation if:
FASTQ files are missing
gzip integrity checks fail
FASTQ structure checks fail
paired-end mates are missing
metadata cannot be linked to FASTQ files
read counts are unexpectedly low
sequencing strategy is unclear
batch or sample identity problems are detected
Stopping early is better than generating downstream outputs from unreliable inputs.
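One way to enforce this stop in a pipeline is a small gate that reads the QC readiness report and succeeds only when the recorded decision allows the next step. The sketch assumes a two-column, tab-separated report with a `qc_decision` row and the value `READY_FOR_NEXT_STEP`; the function name `qc_gate` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: return success only when the readiness report's
# qc_decision row equals READY_FOR_NEXT_STEP.
qc_gate() {
  local report="$1"
  local decision
  # A missing or unreadable report leaves the decision empty.
  decision=$(awk -F'\t' '$1 == "qc_decision" { print $2 }' "${report}" 2>/dev/null || true)
  if [ "${decision}" = "READY_FOR_NEXT_STEP" ]; then
    return 0
  fi
  echo "QC gate: stopping (qc_decision=${decision:-missing})" >&2
  return 1
}
```

A wrapper script could then run `qc_gate data/reports/qc-readiness-report.tsv || exit 1` before starting feature generation. The gate covers only the file-level checks in the report; issues such as unlinked metadata or unclear sequencing strategy still need analyst review.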
MAS QC Outputs
At the end of this stage, MAS should have:
FASTQ QC summary
FASTQ QC status table
QC readiness report
notes on limitations or issues
decision about whether to proceed
flowchart LR
A[FASTQ Files] --> B[QC Summary]
B --> C[QC Readiness Report]
C --> D[Feature Generation]
Key Takeaways
Quality control is the bridge between acquired data and microbiome feature generation.
A strong QC stage helps ensure that:
sequencing files are present
compressed files are readable
FASTQ structure is valid
read counts are summarized
read lengths are summarized
problems are detected before feature generation
the decision to proceed is documented
Quality control does not make a weak dataset strong, but it helps prevent weak or corrupted inputs from silently entering the analysis workflow.
What Comes Next
The next chapter examines Feature Generation, where quality-checked sequencing reads are transformed into microbiome features such as ASVs, OTUs, taxonomic profiles, or functional profiles.
Example QC Scripts: Full Listings
The workflow uses two scripts:

```text
scripts/bash/05a-check-fastq-files.sh
scripts/bash/05b-build-qc-readiness-report.sh
```

The first script checks FASTQ file integrity and summarizes read counts and read lengths.
The second script builds a simple QC readiness report.
05a: Check FASTQ Files
Save this script as:

```text
scripts/bash/05a-check-fastq-files.sh
```

```bash
#!/bin/bash
###############################################################################
# Microbiome Analysis System
# 05a-check-fastq-files.sh
#
# Purpose:
#   Perform lightweight FASTQ quality-control checks before feature generation.
#
# Checks:
#   - FASTQ files exist
#   - gzip integrity for compressed files
#   - FASTQ line count is divisible by 4
#   - read count
#   - minimum, maximum, and average read length
#
# Usage:
#   bash scripts/bash/05a-check-fastq-files.sh
###############################################################################

set -e

RAW_DIRS="data/raw/ena data/raw/ncbi"
QC_DIR="data/qc"
REPORT_DIR="data/reports"

mkdir -p "${QC_DIR}"
mkdir -p "${REPORT_DIR}"

SUMMARY_FILE="${QC_DIR}/fastq-qc-summary.tsv"
STATUS_FILE="${QC_DIR}/fastq-qc-status.tsv"

# awk program: min, max, and mean length of the sequence line in each record.
LENGTH_STATS='NR % 4 == 2 { len=length($0); count++; total+=len; if (min=="" || len < min) min=len; if (len > max) max=len; } END { if (count > 0) { printf "%d\t%d\t%.2f", min, max, total/count; } else { printf "0\t0\t0"; } }'

printf "file\tdirectory\tlines\treads\tmin_read_length\tmax_read_length\tmean_read_length\tgzip_status\tfastq_structure_status\n" > "${SUMMARY_FILE}"
printf "check\tstatus\tnotes\n" > "${STATUS_FILE}"

echo "Microbiome Analysis System: FASTQ QC Check"
echo

fastq_count=0
failed_count=0

for dir in ${RAW_DIRS}; do
  if [ ! -d "${dir}" ]; then
    echo "Directory not found: ${dir}"
    continue
  fi

  for file in "${dir}"/*.fastq.gz "${dir}"/*.fq.gz "${dir}"/*.fastq "${dir}"/*.fq; do
    # Skip glob patterns that matched nothing.
    [ -e "${file}" ] || continue

    fastq_count=$((fastq_count + 1))
    filename=$(basename "${file}")
    gzip_status="NOT_COMPRESSED"
    fastq_structure_status="UNKNOWN"

    if echo "${file}" | grep -Eq '\.(fastq|fq)\.gz$'; then
      # Compressed file: test gzip integrity, then read the decompressed stream.
      if gzip -t "${file}" 2>/dev/null; then
        gzip_status="OK"
      else
        gzip_status="FAIL"
        failed_count=$((failed_count + 1))
      fi
      line_count=$(gzip -cd "${file}" | wc -l | tr -d ' ')
      read_stats=$(gzip -cd "${file}" | awk "${LENGTH_STATS}")
    else
      # Uncompressed file: read it directly.
      line_count=$(wc -l < "${file}" | tr -d ' ')
      read_stats=$(awk "${LENGTH_STATS}" "${file}")
    fi

    reads=$((line_count / 4))

    # Structure passes only for a non-empty file with a multiple of 4 lines.
    if [ $((line_count % 4)) -eq 0 ] && [ "${line_count}" -gt 0 ]; then
      fastq_structure_status="OK"
    else
      fastq_structure_status="FAIL"
      failed_count=$((failed_count + 1))
    fi

    min_len=$(echo "${read_stats}" | awk -F'\t' '{print $1}')
    max_len=$(echo "${read_stats}" | awk -F'\t' '{print $2}')
    mean_len=$(echo "${read_stats}" | awk -F'\t' '{print $3}')

    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" \
      "${filename}" "${dir}" "${line_count}" "${reads}" \
      "${min_len}" "${max_len}" "${mean_len}" \
      "${gzip_status}" "${fastq_structure_status}" >> "${SUMMARY_FILE}"

    echo "Checked: ${file}"
  done
done

if [ "${fastq_count}" -eq 0 ]; then
  printf "FASTQ presence\tFAIL\tNo FASTQ files found in expected raw data directories\n" >> "${STATUS_FILE}"
  echo
  echo "No FASTQ files found."
  exit 1
else
  printf "FASTQ presence\tOK\t%s FASTQ files found\n" "${fastq_count}" >> "${STATUS_FILE}"
fi

if [ "${failed_count}" -eq 0 ]; then
  printf "FASTQ file checks\tOK\tNo failed gzip or structure checks\n" >> "${STATUS_FILE}"
else
  printf "FASTQ file checks\tWARN\t%s failed checks detected\n" "${failed_count}" >> "${STATUS_FILE}"
fi

echo
echo "FASTQ files checked: ${fastq_count}"
echo "Failed checks: ${failed_count}"
echo
echo "Summary written to: ${SUMMARY_FILE}"
echo "Status written to: ${STATUS_FILE}"
```

Run it from the MAS project root:

```bash
bash scripts/bash/05a-check-fastq-files.sh
```

This creates:

```text
data/qc/fastq-qc-summary.tsv
data/qc/fastq-qc-status.tsv
```

05b: Build the QC Readiness Report
Save this script as:

```text
scripts/bash/05b-build-qc-readiness-report.sh
```

```bash
#!/bin/bash
###############################################################################
# Microbiome Analysis System
# 05b-build-qc-readiness-report.sh
#
# Purpose:
#   Build a simple QC readiness report from FASTQ QC outputs.
#
# Usage:
#   bash scripts/bash/05b-build-qc-readiness-report.sh
###############################################################################

set -e

QC_DIR="data/qc"
REPORT_DIR="data/reports"

SUMMARY_FILE="${QC_DIR}/fastq-qc-summary.tsv"
STATUS_FILE="${QC_DIR}/fastq-qc-status.tsv"
READINESS_REPORT="${REPORT_DIR}/qc-readiness-report.tsv"

mkdir -p "${REPORT_DIR}"

# Both 05a outputs must exist and be non-empty.
if [ ! -s "${SUMMARY_FILE}" ]; then
  echo "Missing FASTQ QC summary: ${SUMMARY_FILE}"
  echo "Run: bash scripts/bash/05a-check-fastq-files.sh"
  exit 1
fi

if [ ! -s "${STATUS_FILE}" ]; then
  echo "Missing FASTQ QC status: ${STATUS_FILE}"
  echo "Run: bash scripts/bash/05a-check-fastq-files.sh"
  exit 1
fi

# Summarize the per-file table, skipping the header row.
total_files=$(tail -n +2 "${SUMMARY_FILE}" | wc -l | tr -d ' ')
failed_structure=$(awk -F'\t' 'NR > 1 && $9 != "OK" {count++} END {print count+0}' "${SUMMARY_FILE}")
failed_gzip=$(awk -F'\t' 'NR > 1 && $8 == "FAIL" {count++} END {print count+0}' "${SUMMARY_FILE}")
total_reads=$(awk -F'\t' 'NR > 1 {sum += $4} END {print sum+0}' "${SUMMARY_FILE}")
min_reads=$(awk -F'\t' 'NR == 2 {min=$4} NR > 2 && $4 < min {min=$4} END {if (min=="") print 0; else print min}' "${SUMMARY_FILE}")
max_reads=$(awk -F'\t' 'NR == 2 {max=$4} NR > 2 && $4 > max {max=$4} END {if (max=="") print 0; else print max}' "${SUMMARY_FILE}")

decision="READY_FOR_NEXT_STEP"
notes="FASTQ files passed lightweight file-level checks"

if [ "${total_files}" -eq 0 ]; then
  decision="NOT_READY"
  notes="No FASTQ files were found"
elif [ "${failed_structure}" -gt 0 ] || [ "${failed_gzip}" -gt 0 ]; then
  decision="REVIEW_REQUIRED"
  notes="One or more FASTQ files failed gzip or structure checks"
fi

printf "metric\tvalue\n" > "${READINESS_REPORT}"
printf "total_fastq_files\t%s\n" "${total_files}" >> "${READINESS_REPORT}"
printf "total_reads\t%s\n" "${total_reads}" >> "${READINESS_REPORT}"
printf "minimum_reads_per_file\t%s\n" "${min_reads}" >> "${READINESS_REPORT}"
printf "maximum_reads_per_file\t%s\n" "${max_reads}" >> "${READINESS_REPORT}"
printf "failed_gzip_checks\t%s\n" "${failed_gzip}" >> "${READINESS_REPORT}"
printf "failed_fastq_structure_checks\t%s\n" "${failed_structure}" >> "${READINESS_REPORT}"
printf "qc_decision\t%s\n" "${decision}" >> "${READINESS_REPORT}"
printf "notes\t%s\n" "${notes}" >> "${READINESS_REPORT}"

echo "QC readiness report written to: ${READINESS_REPORT}"
echo
cat "${READINESS_REPORT}"
```

Run it from the MAS project root:

```bash
bash scripts/bash/05b-build-qc-readiness-report.sh
```

This creates:

```text
data/reports/qc-readiness-report.tsv
```

Running the Complete QC Example
If you are continuing from Chapter 04, first make sure the example acquisition package exists:

```bash
bash scripts/bash/04a-create-example-acquisition-data.sh
bash scripts/bash/04b-check-data-acquisition.sh
```

Then run the QC scripts:

```bash
bash scripts/bash/05a-check-fastq-files.sh
bash scripts/bash/05b-build-qc-readiness-report.sh
cat data/qc/fastq-qc-summary.tsv
cat data/reports/qc-readiness-report.tsv
```