Data Acquisition
ID: MICROB-004
Type: System Component
Audience: Students, researchers, analysts, and practitioners
Theme: Reproducible acquisition of microbiome sequencing data
Introduction
Data acquisition is the stage where microbiome analysis moves from study planning to usable sequencing data.
In the Microbiome Analysis System, data acquisition means more than downloading FASTQ files. It includes identifying the correct study accessions, retrieving metadata, confirming sample availability, organizing files, recording data sources, validating downloads, and preparing the dataset for quality control.
This chapter connects MAS to the CDI Data Acquisition System. Rather than duplicating the full CDI-DAS workflow, this chapter explains how MAS receives sequencing data and metadata in a reproducible, analysis-ready form.
Why Data Acquisition Matters
Microbiome analysis depends on the integrity of the data entering the workflow.
If the wrong samples are downloaded, metadata are incomplete, files are missing, or read pairs are mismatched, downstream results can become unreliable even if the analysis code is correct.
A strong data acquisition stage helps ensure that:
sequencing files correspond to the intended study
sample identifiers are traceable
metadata are available for interpretation
FASTQ files are organized consistently
paired-end files are matched correctly
public accession numbers are recorded
download steps are reproducible
file completeness can be checked before quality control
Data acquisition is therefore both a technical and interpretive step.
Position in the Microbiome Analysis System
Data acquisition occurs after study design, metadata planning, and sequencing strategy are understood.
flowchart LR
A[Study Design and Metadata] --> B[Sample Collection and Sequencing]
B --> C[Data Acquisition]
C --> D[Quality Control]
D --> E[Feature Generation]
For newly generated data, data acquisition may involve receiving FASTQ files from a sequencing provider or institutional storage system.
For public data, data acquisition may involve retrieving metadata and sequencing files from repositories such as NCBI SRA, ENA, DDBJ, MGnify, Qiita, or other study-specific repositories.
Microbiome Analysis System and CDI-DAS
The recommended CDI workflow separates public data acquisition from downstream analysis.
flowchart TB
A[CDI Systematic Dataset Discovery] --> B[Selected BioProject or Study Accessions]
B --> C[CDI Data Acquisition System]
C --> D[Metadata Tables]
C --> E[FASTQ Files]
C --> F[Download Manifests]
C --> G[Validation Reports]
D --> H[Microbiome Analysis System]
E --> H
F --> H
G --> H
The CDI Data Acquisition System (DAS) is responsible for reproducible retrieval and validation of public sequencing data.
The Microbiome Analysis System (MAS) begins downstream analysis after data and metadata have been acquired, organized, and checked.
This separation keeps MAS focused on microbiome analysis while still preserving reproducibility from study discovery to final interpretation.
Expected Inputs
A microbiome data acquisition package should contain the files needed to begin quality control and downstream analysis.
At minimum, MAS expects:
sequencing files, usually FASTQ or FASTQ.gz
metadata table linking samples to biological variables
sample identifiers that match sequencing files or run accessions
study accession information
download manifest or file inventory
notes about sequencing strategy
validation or checksum report, when available
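For public data acquired through CDI-DAS, the expected structure may look like this:
data/
├── metadata/
│   ├── runinfo-PRJNA802976.csv
│   ├── ena-PRJNA802976.tsv
│   └── srr-accessions.txt
├── manifests/
│   ├── download-manifest.tsv
│   └── test-manifest.tsv
├── raw/
│   ├── ena/
│   │   ├── SRR17868090_1.fastq.gz
│   │   ├── SRR17868090_2.fastq.gz
│   │   └── ...
│   └── ncbi/
├── inventory/
│   └── fastq-inventory-ena.tsv
└── validation/
    ├── file-summary.tsv
    ├── validation-report.tsv
    └── validation-log.txt
The exact file names may vary by project, but the principle is the same: raw data, metadata, manifests, inventories, and validation outputs should be separated and traceable.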
Public Data Acquisition
Public microbiome datasets are often accessed through study or run accessions.
Common accession types include:
BioProject accessions
BioSample accessions
SRA run accessions
ENA run accessions
DDBJ accessions
study-specific repository identifiers
A BioProject accession can often be used to retrieve run-level metadata and FASTQ links.
For example, a human gut microbiome study may be represented by a BioProject accession such as:
PRJNA802976
The accession itself is not enough for analysis. It must be converted into a structured dataset package containing metadata, run accessions, FASTQ files, and validation outputs.
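As a first step toward that package, run-level metadata and FASTQ links can be requested from the ENA Portal API. The sketch below only builds the request URL. The function name `ena_filereport_url` and the selected field list are illustrative assumptions rather than part of CDI-DAS, and the endpoint should be checked against the current ENA documentation before use.

```bash
#!/bin/bash
# Sketch: build an ENA Portal API "filereport" URL for a study accession.
# ena_filereport_url is a hypothetical helper; the field list is an assumption
# chosen to match the columns used in this chapter's example metadata.

ena_filereport_url() {
  local accession="$1"
  local fields="run_accession,sample_accession,study_accession,library_layout,fastq_ftp,fastq_md5"
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=%s&format=tsv\n' \
    "${accession}" "${fields}"
}

# Example use (requires network access, so it is left commented out):
#   curl -fsSL "$(ena_filereport_url PRJNA802976)" > data/metadata/ena-PRJNA802976.tsv
```

Keeping the URL construction in one small function makes the exact query easy to record in the acquisition notes alongside the downloaded table.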
Recommended Acquisition Strategy
For MAS, the recommended public-data workflow is:
flowchart TB
A[BioProject or Study Accession] --> B[Retrieve NCBI RunInfo]
A --> C[Retrieve ENA Metadata]
B --> D[Build SRR Accession List]
C --> E[Build Download Manifest]
E --> F[Download FASTQ Files]
F --> G[Build File Inventory]
G --> H[Validate Downloads]
H --> I[MAS-Ready Data Package]
This mirrors the CDI-DAS approach:
1. retrieve run-level metadata
2. retrieve ENA metadata and FASTQ URLs
3. build a manifest
4. download a test subset first
5. download the full dataset when ready
6. inventory files
7. validate downloads
8. hand off data to MAS
The test-first approach is important because public datasets can be large, file paths can change, and download problems are easier to diagnose on a small subset.
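One simple way to get a test subset is to keep the manifest header plus the first few rows. The helper below is a minimal sketch under that assumption. `make_test_manifest` is a hypothetical name, and CDI-DAS may select its test subset differently.

```bash
#!/bin/bash
# Sketch: derive a small test manifest from a full download manifest.
# Assumes the manifest has exactly one header line followed by one row per run.

make_test_manifest() {
  local manifest="$1"   # full manifest, e.g. data/manifests/download-manifest.tsv
  local n="$2"          # number of runs to keep for the test download
  local out="$3"        # output path, e.g. data/manifests/test-manifest.tsv

  # Keep the header (line 1) plus the first n data rows.
  head -n "$((n + 1))" "${manifest}" > "${out}"
}

# Example use:
#   make_test_manifest data/manifests/download-manifest.tsv 2 data/manifests/test-manifest.tsv
```

Taking the first rows is deterministic, so the same test subset is produced every time the manifest is unchanged.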
MAS Example and Handoff Scripts
The following scripts do not replace CDI-DAS. They provide a lightweight MAS-side example and a handoff check that confirms whether a CDI-DAS-style data package is present and ready for the next chapter.
The workflow uses two scripts:
scripts/bash/04a-create-example-acquisition-data.sh
scripts/bash/04b-check-data-acquisition.sh
The first script creates a small example acquisition package for testing the MAS workflow structure. The FASTQ reads are toy sequences and are not intended for biological interpretation.
The second script checks whether the expected metadata files, manifests, FASTQ files, inventory outputs, and validation reports are present.
04a: Create the Example Acquisition Package
Save this script as:
scripts/bash/04a-create-example-acquisition-data.sh
#!/bin/bash
###############################################################################
# Microbiome Analysis System
# 04a-create-example-acquisition-data.sh
#
# Purpose:
#   Create a small example data acquisition package for testing the MAS
#   data acquisition handoff workflow.
#
# Important:
#   This script creates toy FASTQ reads and mock metadata for workflow testing.
#   These files are not real biological data and should not be used for
#   biological interpretation.
#
# Usage:
#   bash scripts/bash/04a-create-example-acquisition-data.sh
###############################################################################

set -e

BIOPROJECT="${BIOPROJECT:-PRJNA802976}"

echo "Creating MAS example acquisition package..."
echo "BioProject: ${BIOPROJECT}"
echo

mkdir -p data/metadata
mkdir -p data/manifests
mkdir -p data/raw/ena
mkdir -p data/raw/ncbi
mkdir -p data/inventory
mkdir -p data/validation
mkdir -p data/reports

###############################################################################
# Example BioProject metadata
###############################################################################

cat > "data/metadata/runinfo-${BIOPROJECT}.csv" <<'EOF'
Run,BioProject,BioSample,LibraryStrategy,LibraryLayout,Platform,Model
SRR17868090,PRJNA802976,SAMN00000001,AMPLICON,PAIRED,ILLUMINA,Illumina MiSeq
SRR17868091,PRJNA802976,SAMN00000002,AMPLICON,PAIRED,ILLUMINA,Illumina MiSeq
SRR17868092,PRJNA802976,SAMN00000003,AMPLICON,PAIRED,ILLUMINA,Illumina MiSeq
EOF

# Columns in the .tsv files below are tab-separated.
cat > "data/metadata/ena-${BIOPROJECT}.tsv" <<'EOF'
run_accession	sample_accession	study_accession	fastq_ftp
SRR17868090	SAMN00000001	PRJNA802976	ftp://example/SRR17868090_1.fastq.gz;ftp://example/SRR17868090_2.fastq.gz
SRR17868091	SAMN00000002	PRJNA802976	ftp://example/SRR17868091_1.fastq.gz;ftp://example/SRR17868091_2.fastq.gz
SRR17868092	SAMN00000003	PRJNA802976	ftp://example/SRR17868092_1.fastq.gz;ftp://example/SRR17868092_2.fastq.gz
EOF

cat > data/metadata/srr-accessions.txt <<'EOF'
SRR17868090
SRR17868091
SRR17868092
EOF

###############################################################################
# Example manifests
###############################################################################

cat > data/manifests/download-manifest.tsv <<'EOF'
run_accession	layout	repository	expected_fastq_files
SRR17868090	PAIRED	ENA	2
SRR17868091	PAIRED	ENA	2
SRR17868092	PAIRED	ENA	2
EOF

cat > data/manifests/test-manifest.tsv <<'EOF'
run_accession	layout	repository	expected_fastq_files
SRR17868090	PAIRED	ENA	2
SRR17868091	PAIRED	ENA	2
SRR17868092	PAIRED	ENA	2
EOF

###############################################################################
# Tiny toy FASTQ files
###############################################################################

tmpdir=$(mktemp -d)

cat > "${tmpdir}/SRR17868090_1.fastq" <<'EOF'
@SRR17868090.1/1
ACGTACGTACGT
+
FFFFFFFFFFFF
EOF

cat > "${tmpdir}/SRR17868090_2.fastq" <<'EOF'
@SRR17868090.1/2
TGCATGCATGCA
+
FFFFFFFFFFFF
EOF

cat > "${tmpdir}/SRR17868091_1.fastq" <<'EOF'
@SRR17868091.1/1
ACGTACGTACGT
+
FFFFFFFFFFFF
EOF

cat > "${tmpdir}/SRR17868091_2.fastq" <<'EOF'
@SRR17868091.1/2
TGCATGCATGCA
+
FFFFFFFFFFFF
EOF

cat > "${tmpdir}/SRR17868092_1.fastq" <<'EOF'
@SRR17868092.1/1
ACGTACGTACGT
+
FFFFFFFFFFFF
EOF

cat > "${tmpdir}/SRR17868092_2.fastq" <<'EOF'
@SRR17868092.1/2
TGCATGCATGCA
+
FFFFFFFFFFFF
EOF

gzip -c "${tmpdir}/SRR17868090_1.fastq" > data/raw/ena/SRR17868090_1.fastq.gz
gzip -c "${tmpdir}/SRR17868090_2.fastq" > data/raw/ena/SRR17868090_2.fastq.gz
gzip -c "${tmpdir}/SRR17868091_1.fastq" > data/raw/ena/SRR17868091_1.fastq.gz
gzip -c "${tmpdir}/SRR17868091_2.fastq" > data/raw/ena/SRR17868091_2.fastq.gz
gzip -c "${tmpdir}/SRR17868092_1.fastq" > data/raw/ena/SRR17868092_1.fastq.gz
gzip -c "${tmpdir}/SRR17868092_2.fastq" > data/raw/ena/SRR17868092_2.fastq.gz

rm -rf "${tmpdir}"

###############################################################################
# Example inventory and validation outputs
###############################################################################

cat > data/inventory/fastq-inventory-ena.tsv <<'EOF'
file	run_accession	read	directory
SRR17868090_1.fastq.gz	SRR17868090	1	data/raw/ena
SRR17868090_2.fastq.gz	SRR17868090	2	data/raw/ena
SRR17868091_1.fastq.gz	SRR17868091	1	data/raw/ena
SRR17868091_2.fastq.gz	SRR17868091	2	data/raw/ena
SRR17868092_1.fastq.gz	SRR17868092	1	data/raw/ena
SRR17868092_2.fastq.gz	SRR17868092	2	data/raw/ena
EOF

cat > data/validation/validation-report.tsv <<'EOF'
item	status	notes
metadata	OK	example metadata present
fastq_files	OK	six paired-end example FASTQ files present
manifest	OK	example manifest present
EOF

echo "Example acquisition package created."
echo
echo "Created:"
echo "  data/metadata/runinfo-${BIOPROJECT}.csv"
echo "  data/metadata/ena-${BIOPROJECT}.tsv"
echo "  data/metadata/srr-accessions.txt"
echo "  data/manifests/download-manifest.tsv"
echo "  data/manifests/test-manifest.tsv"
echo "  data/raw/ena/*.fastq.gz"
echo "  data/inventory/fastq-inventory-ena.tsv"
echo "  data/validation/validation-report.tsv"
echo
echo "Next:"
echo "  bash scripts/bash/04b-check-data-acquisition.sh"
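Run it from the MAS project root:
bash scripts/bash/04a-create-example-acquisition-data.sh
This creates a small example dataset structure containing mock metadata, manifests, tiny FASTQ files, an inventory table, and a validation report.
04b: Check the Data Acquisition Package
Save this script as:
scripts/bash/04b-check-data-acquisition.sh
#!/bin/bash
###############################################################################
# Microbiome Analysis System
# 04b-check-data-acquisition.sh
#
# Purpose:
#   Check whether a CDI-DAS-style microbiome data acquisition package is present
#   and summarize files needed before quality control.
#
# Usage:
#   bash scripts/bash/04b-check-data-acquisition.sh
#
# Optional:
#   BIOPROJECT=PRJNA802976 bash scripts/bash/04b-check-data-acquisition.sh
###############################################################################

set -e

BIOPROJECT="${BIOPROJECT:-PRJNA802976}"

METADATA_DIR="data/metadata"
MANIFEST_DIR="data/manifests"
RAW_ENA_DIR="data/raw/ena"
RAW_NCBI_DIR="data/raw/ncbi"
INVENTORY_DIR="data/inventory"
VALIDATION_DIR="data/validation"
REPORT_DIR="data/reports"

RUNINFO_FILE="${METADATA_DIR}/runinfo-${BIOPROJECT}.csv"
ENA_FILE="${METADATA_DIR}/ena-${BIOPROJECT}.tsv"
SRR_FILE="${METADATA_DIR}/srr-accessions.txt"
MANIFEST_FILE="${MANIFEST_DIR}/download-manifest.tsv"
TEST_MANIFEST_FILE="${MANIFEST_DIR}/test-manifest.tsv"
ENA_INVENTORY_FILE="${INVENTORY_DIR}/fastq-inventory-ena.tsv"
VALIDATION_REPORT="${VALIDATION_DIR}/validation-report.tsv"

mkdir -p "${REPORT_DIR}"

SUMMARY_FILE="${REPORT_DIR}/data-acquisition-summary.tsv"

echo "Microbiome Analysis System: Data Acquisition Check"
echo "BioProject: ${BIOPROJECT}"
echo

printf "item\tstatus\tpath_or_count\n" > "${SUMMARY_FILE}"

# Report whether a non-empty file exists and record its line count.
check_file() {
  label="$1"
  file="$2"

  if [ -s "${file}" ]; then
    lines=$(wc -l < "${file}" | tr -d ' ')
    echo "FOUND: ${label} (${lines} lines) -> ${file}"
    printf "%s\tFOUND\t%s lines; %s\n" "${label}" "${lines}" "${file}" >> "${SUMMARY_FILE}"
  else
    echo "MISSING: ${label} -> ${file}"
    printf "%s\tMISSING\t%s\n" "${label}" "${file}" >> "${SUMMARY_FILE}"
  fi
}

# Count FASTQ files (compressed or uncompressed) under a directory.
check_dir_fastq() {
  label="$1"
  dir="$2"

  if [ -d "${dir}" ]; then
    count=$(find "${dir}" -type f \( -name "*.fastq.gz" -o -name "*.fq.gz" -o -name "*.fastq" -o -name "*.fq" \) | wc -l | tr -d ' ')
    echo "FASTQ files in ${label}: ${count}"
    printf "%s\tCOUNT\t%s FASTQ files; %s\n" "${label}" "${count}" "${dir}" >> "${SUMMARY_FILE}"
  else
    echo "MISSING DIRECTORY: ${label} -> ${dir}"
    printf "%s\tMISSING\t%s\n" "${label}" "${dir}" >> "${SUMMARY_FILE}"
  fi
}

check_file "NCBI RunInfo metadata" "${RUNINFO_FILE}"
check_file "ENA metadata" "${ENA_FILE}"
check_file "SRR accession list" "${SRR_FILE}"
check_file "Download manifest" "${MANIFEST_FILE}"
check_file "Test manifest" "${TEST_MANIFEST_FILE}"
check_dir_fastq "ENA raw data" "${RAW_ENA_DIR}"
check_dir_fastq "NCBI raw data" "${RAW_NCBI_DIR}"
check_file "ENA FASTQ inventory" "${ENA_INVENTORY_FILE}"
check_file "Validation report" "${VALIDATION_REPORT}"

echo
echo "Summary written to: ${SUMMARY_FILE}"
echo
echo "Next MAS step:"
echo "  Review the summary, confirm metadata and FASTQ files are present,"
echo "  then continue to 05-quality-control.qmd."
Run it from the MAS project root:
bash scripts/bash/04b-check-data-acquisition.sh
To set the BioProject explicitly, pass it as an environment variable, replacing the accession as needed:
BIOPROJECT=PRJNA802976 bash scripts/bash/04b-check-data-acquisition.sh
The script creates:
data/reports/data-acquisition-summary.tsv
This file records which acquisition components were found or missing, so the analyst can judge whether the dataset is ready to move into quality control.
Running the Complete Example
To test the full MAS data acquisition handoff workflow, run:
bash scripts/bash/04a-create-example-acquisition-data.sh
bash scripts/bash/04b-check-data-acquisition.sh
cat data/reports/data-acquisition-summary.tsv
The first command creates the example acquisition package. The second command checks whether the expected files are present. The third command displays the generated summary table.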
Example output from 04a-create-example-acquisition-data.sh:
Creating MAS example acquisition package...
BioProject: PRJNA802976
Example acquisition package created.
Created:
data/metadata/runinfo-PRJNA802976.csv
data/metadata/ena-PRJNA802976.tsv
data/metadata/srr-accessions.txt
data/manifests/download-manifest.tsv
data/manifests/test-manifest.tsv
data/raw/ena/*.fastq.gz
data/inventory/fastq-inventory-ena.tsv
data/validation/validation-report.tsv
Next:
bash scripts/bash/04b-check-data-acquisition.sh
Example output from 04b-check-data-acquisition.sh:
Microbiome Analysis System: Data Acquisition Check
BioProject: PRJNA802976
FOUND: NCBI RunInfo metadata (4 lines) -> data/metadata/runinfo-PRJNA802976.csv
FOUND: ENA metadata (4 lines) -> data/metadata/ena-PRJNA802976.tsv
FOUND: SRR accession list (3 lines) -> data/metadata/srr-accessions.txt
FOUND: Download manifest (4 lines) -> data/manifests/download-manifest.tsv
FOUND: Test manifest (4 lines) -> data/manifests/test-manifest.tsv
FASTQ files in ENA raw data: 6
FASTQ files in NCBI raw data: 0
FOUND: ENA FASTQ inventory (7 lines) -> data/inventory/fastq-inventory-ena.tsv
FOUND: Validation report (4 lines) -> data/validation/validation-report.tsv
Summary written to: data/reports/data-acquisition-summary.tsv
Next MAS step:
Review the summary, confirm metadata and FASTQ files are present,
then continue to 05-quality-control.qmd.
In this example, NCBI raw data: 0 is acceptable because the toy example package uses the ENA raw data directory.
Inspecting the Summary Table
After running the handoff check, inspect the generated summary table:
cat data/reports/data-acquisition-summary.tsv
Example summary:
item status path_or_count
NCBI RunInfo metadata FOUND 4 lines; data/metadata/runinfo-PRJNA802976.csv
ENA metadata FOUND 4 lines; data/metadata/ena-PRJNA802976.tsv
SRR accession list FOUND 3 lines; data/metadata/srr-accessions.txt
Download manifest FOUND 4 lines; data/manifests/download-manifest.tsv
Test manifest FOUND 4 lines; data/manifests/test-manifest.tsv
ENA raw data COUNT 6 FASTQ files; data/raw/ena
NCBI raw data COUNT 0 FASTQ files; data/raw/ncbi
ENA FASTQ inventory FOUND 7 lines; data/inventory/fastq-inventory-ena.tsv
Validation report FOUND 4 lines; data/validation/validation-report.tsv
This summary becomes the MAS handoff record for the next chapter.
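Because the summary is a tab-separated table with the status in its second column, a short check can flag any component marked MISSING before quality control begins. The helper name `summary_has_missing` is hypothetical, and the sketch assumes the three-column layout shown above.

```bash
#!/bin/bash
# Sketch: list items marked MISSING in the handoff summary.
# Returns 0 (and prints the item labels) if anything is missing, 1 otherwise.

summary_has_missing() {
  local summary="$1"
  # Skip the header row; column 2 holds FOUND, COUNT, or MISSING.
  awk -F '\t' '
    NR > 1 && $2 == "MISSING" { found = 1; print $1 }
    END { exit found ? 0 : 1 }
  ' "${summary}"
}

# Example use:
#   if summary_has_missing data/reports/data-acquisition-summary.tsv; then
#     echo "Acquisition package incomplete; resolve the items above before QC."
#   fi
```

A COUNT of zero is deliberately not treated as missing here, since an empty NCBI directory is acceptable when the files were retrieved from ENA.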
Minimal Data Acquisition Checklist
Before continuing to quality control, confirm that:
raw FASTQ files are present
paired-end files are matched when applicable
metadata are available
sample identifiers can be connected to FASTQ files
study accession numbers are recorded
download manifests are preserved
file inventories are available
validation reports are available or planned
sequencing strategy is known
the dataset has a clear analysis objective
This checklist prevents downstream analysis from starting with unclear or incomplete data.
FASTQ File Organization
A consistent file organization makes microbiome workflows easier to automate.
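For paired-end sequencing, file names commonly follow patterns such as:
SRR17868090_1.fastq.gz
SRR17868090_2.fastq.gz
SRR17868091_1.fastq.gz
SRR17868091_2.fastq.gz
For single-end sequencing, file names may look like:
SRR12345678.fastq.gz
SRR12345679.fastq.gz
Before quality control, the analyst should confirm whether the dataset is single-end or paired-end. This matters because many downstream workflows require different parameters for single-end and paired-end data.
Metadata and FASTQ Matching
Metadata must connect to sequencing files.
A common relationship is:
sample_id ↔ run_accession ↔ FASTQ file
For public data, Run or run_accession often provides the connection between metadata and FASTQ files.
A simplified metadata table might contain:
sample_id   run_accession   group     sample_type
S1          SRR17868090     healthy   stool
S2          SRR17868091     healthy   stool
S3          SRR17868092     healthy   stool
The corresponding FASTQ files might be:
SRR17868090_1.fastq.gz
SRR17868090_2.fastq.gz
SRR17868091_1.fastq.gz
SRR17868091_2.fastq.gz
SRR17868092_1.fastq.gz
SRR17868092_2.fastq.gz
If metadata and FASTQ files cannot be linked, downstream analysis becomes difficult or impossible to interpret.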
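For paired-end data, one quick link check is to confirm that every forward file has a matching reverse file. The sketch below assumes the `_1.fastq.gz` / `_2.fastq.gz` naming used in this chapter. `find_unpaired_runs` is a hypothetical helper, and other naming schemes (for example `_R1`/`_R2`) would need a different pattern.

```bash
#!/bin/bash
# Sketch: print run accessions that have only one mate of a read pair.
# Assumes files are named <run>_1.fastq.gz and <run>_2.fastq.gz in one directory.

find_unpaired_runs() {
  local dir="$1" f base

  # Forward reads without a reverse mate.
  for f in "${dir}"/*_1.fastq.gz; do
    [ -e "${f}" ] || continue            # no matches: the glob stays literal
    base="${f%_1.fastq.gz}"
    [ -e "${base}_2.fastq.gz" ] || basename "${base}"
  done

  # Reverse reads without a forward mate.
  for f in "${dir}"/*_2.fastq.gz; do
    [ -e "${f}" ] || continue
    base="${f%_2.fastq.gz}"
    [ -e "${base}_1.fastq.gz" ] || basename "${base}"
  done

  return 0
}

# Example use:
#   find_unpaired_runs data/raw/ena
```

Empty output means every run in the directory has both mates. The printed accessions can be compared against the run accession list to decide whether to re-download or exclude those runs.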
What to Record
A reproducible data acquisition record should include:
study accession
repository source
date of acquisition
metadata files retrieved
number of runs expected
number of FASTQ files expected
number of FASTQ files downloaded
download method
test or production mode
validation status
known limitations
These details can be summarized in a simple report table and referenced during reporting.
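A record of this kind could be written as a two-column table. The following is a minimal sketch covering only a few of the fields listed above. The function name, field names, and output layout are assumptions for illustration and are not a CDI-DAS format.

```bash
#!/bin/bash
# Sketch: write a small acquisition record as a tab-separated field/value table.
# write_acquisition_record and the field names are hypothetical.

write_acquisition_record() {
  local out="$1" accession="$2" repository="$3" raw_dir="$4" expected="$5"
  local downloaded

  # Count compressed FASTQ files actually present in the raw data directory.
  downloaded=$(find "${raw_dir}" -type f \( -name '*.fastq.gz' -o -name '*.fq.gz' \) | wc -l | tr -d ' ')

  {
    printf 'field\tvalue\n'
    printf 'study_accession\t%s\n' "${accession}"
    printf 'repository_source\t%s\n' "${repository}"
    printf 'date_of_acquisition\t%s\n' "$(date +%Y-%m-%d)"
    printf 'fastq_files_expected\t%s\n' "${expected}"
    printf 'fastq_files_downloaded\t%s\n' "${downloaded}"
  } > "${out}"
}

# Example use:
#   write_acquisition_record data/reports/acquisition-record.tsv PRJNA802976 ENA data/raw/ena 6
```

Recording expected and downloaded counts side by side makes an incomplete download visible at a glance.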
Common Problems
Common data acquisition problems include:
missing FASTQ files
incomplete metadata
mismatched run accessions
paired-end files missing one mate
duplicate sample identifiers
failed downloads
changed repository links
compressed and uncompressed files mixed together
public metadata that lack key biological variables
study accessions that include multiple sample types or sub-studies
These issues should be resolved or documented before moving into quality control.
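Failed or truncated downloads can often be caught with an integrity test on the compressed files. The sketch below uses `gzip -t`, which checks that each archive decompresses cleanly. It does not confirm that the content is valid FASTQ or that it matches a repository checksum. The helper name `find_corrupt_gzip` is hypothetical.

```bash
#!/bin/bash
# Sketch: list .gz files under a directory that fail the gzip integrity test.
# Prints one path per failing file; prints nothing when all files pass.

find_corrupt_gzip() {
  local dir="$1" f
  find "${dir}" -type f -name '*.gz' | sort | while IFS= read -r f; do
    # gzip -t exits non-zero for truncated or non-gzip input.
    gzip -t "${f}" 2>/dev/null || echo "${f}"
  done
}

# Example use:
#   find_corrupt_gzip data/raw/ena
```

Where the repository publishes checksums, comparing them against the downloaded files is a stronger test and can complement this check.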
MAS Data Acquisition Outputs
At the end of this stage, MAS should have:
organized raw sequencing files
metadata tables
run accession list
download manifest
FASTQ inventory
validation report
data acquisition summary
These outputs support the next stage of the system.
flowchart LR
A[Raw FASTQ Files] --> D[Quality Control]
B[Metadata Tables] --> D
C[Acquisition Summary] --> D
Key Takeaways
Data acquisition is not just downloading files.
It is the process of turning study accessions, repository records, and sequencing links into a structured dataset package that can support reproducible microbiome analysis.
A strong data acquisition stage ensures that:
data sources are traceable
metadata are preserved
files are organized
downloads are validated
sample identifiers are linkable
downstream quality control can begin confidently
What Comes Next
The next chapter examines Quality Control, where acquired sequencing data are assessed before feature generation and downstream microbiome analysis.
are traceable- metadata are preserved- files are organized- downloads are validated- sample identifiers are linkable- downstream quality control can begin confidently## What Comes NextThe next chapter examines **Quality Control**, where acquired sequencing data are assessed before feature generation and downstream microbiome analysis.
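
As an optional supplement to the checklist and the Metadata and FASTQ Matching section of this chapter, the following is a minimal sketch of how two checklist items could be checked before that handoff: that each run accession has FASTQ files, and that paired-end files have both mates. It assumes the `_1`/`_2` naming pattern shown in this chapter and an accession list with one run per line, such as `data/metadata/srr-accessions.txt`. The function name `check_run_fastq` and its status labels are hypothetical and are not part of the MAS or CDI-DAS scripts.

```bash
#!/usr/bin/env bash
# Sketch only: check_run_fastq and its status labels are hypothetical,
# not part of the MAS or CDI-DAS scripts.

# Print one tab-separated line per run accession: accession, status.
# Arguments: accession list (one run per line), FASTQ directory.
# Only file presence is tested, not file content or integrity.
check_run_fastq() {
  local accession_file="$1"
  local fastq_dir="$2"
  local run status

  while IFS= read -r run || [ -n "${run}" ]; do
    # Skip blank lines in the accession list.
    [ -z "${run}" ] && continue

    if [ -f "${fastq_dir}/${run}_1.fastq.gz" ] && [ -f "${fastq_dir}/${run}_2.fastq.gz" ]; then
      status="PAIRED"          # both mates present
    elif [ -f "${fastq_dir}/${run}_1.fastq.gz" ] || [ -f "${fastq_dir}/${run}_2.fastq.gz" ]; then
      status="MISSING_MATE"    # exactly one mate present
    elif [ -f "${fastq_dir}/${run}.fastq.gz" ]; then
      status="SINGLE"          # single-end naming pattern
    else
      status="NO_FASTQ"        # accession has no matching file
    fi

    printf "%s\t%s\n" "${run}" "${status}"
  done < "${accession_file}"
}

# Example call, using paths from this chapter's example package:
# check_run_fastq data/metadata/srr-accessions.txt data/raw/ena
```

Any `MISSING_MATE` or `NO_FASTQ` line points to one of the problems listed under Common Problems and should be resolved or documented before quality control. Uncompressed files (`.fastq`, `.fq`) are deliberately not matched in this sketch, so a dataset that mixes compressed and uncompressed files would need an extended pattern.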