Obtaining Metagenomic Data for the Tutorial
Using the SRA Toolkit
For this tutorial, we’ll use the “Kickstart” metagenome dataset, which corresponds to BioSample SAMN05024035 and SRA run SRR5058924.
We’ll download the dataset directly using the SRA Toolkit. The SRA tools (the sra-tools package) are included in the Conda environment created in the previous section.
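Before downloading, it can help to confirm that the two SRA Toolkit commands used below are actually on your PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper written for this illustration and is not part of sra-tools.

```shell
#!/bin/bash
# Print the name of every requested tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of sra-tools.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# The download step in this tutorial uses prefetch and fastq-dump.
missing=$(missing_tools prefetch fastq-dump)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH (is the Conda environment activated?):" $missing
else
    echo "prefetch and fastq-dump are available"
fi
```

If a tool is reported as missing, activating the Conda environment from the previous section is the first thing to try.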
Download the Kickstart Dataset
You can download the “Kickstart” dataset with the following commands:
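#!/bin/bash
# Stop at the first error so the cleanup step never runs after a failed download or conversion
set -euo pipefail
# Download the Kickstart dataset using SRA toolkit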
prefetch SRR5058924
# Convert SRA to paired FASTQ files with gzip compression
fastq-dump --defline-seq '@$ac_$sn/$ri' --defline-qual '+' --split-3 -O . --gzip SRR5058924/SRR5058924.sra
# Optional cleanup: remove the SRA file as it's no longer needed
rm -f SRR5058924/SRR5058924.sra
SRA Toolkit Information
The SRA Toolkit provides direct access to sequencing data from NCBI’s Sequence Read Archive. The prefetch command downloads the run as an SRA file, and fastq-dump converts it to standard FASTQ format, splitting paired-end reads into separate files (--split-3) and compressing the output with gzip (--gzip).
The SRA file SRR5058924/SRR5058924.sra is no longer needed once the FASTQ files have been written. The last command in the script above removes it; if you skipped that step, remove it with:
rm SRR5058924/SRR5058924.sra
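The --defline-seq template passed to fastq-dump above, '@$ac_$sn/$ri', combines the accession ($ac), the spot name ($sn), and the read number ($ri), so each forward-read header should start with @SRR5058924_ and end in /1, and each reverse-read header should end in /2. The sketch below checks only the first header of each file; `header_matches` is a hypothetical helper introduced for this illustration.

```shell
#!/bin/bash
# Return 0 if the first header line of a gzip-compressed FASTQ file has the
# form @<accession>_<anything>/<read number>, and 1 otherwise.
# header_matches is a hypothetical helper, not part of sra-tools.
header_matches() {
    local file="$1" acc="$2" read_index="$3" header
    # Only the first line is needed; gzip's "broken pipe" message is hidden.
    header=$(gzip -dc "$file" 2>/dev/null | head -n 1)
    case "$header" in
        "@${acc}_"*"/${read_index}") return 0 ;;
        *) return 1 ;;
    esac
}

# Check both output files, skipping any that are not present yet.
for i in 1 2; do
    f="SRR5058924_${i}.fastq.gz"
    if [ -f "$f" ]; then
        if header_matches "$f" SRR5058924 "$i"; then
            echo "$f: header format as expected"
        else
            echo "$f: unexpected header format"
        fi
    fi
done
```

This inspects a single record per file, so it is a quick sanity check rather than a full validation.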
⌛ Expected Time
This process takes approximately 5-10 minutes to complete.
Directory Structure
After downloading, your directory structure should look like this:
├── SRR5058924/
├── SRR5058924_1.fastq.gz
└── SRR5058924_2.fastq.gz
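The prefetch command downloads the SRA file into the SRR5058924/ directory, and fastq-dump writes the paired, gzip-compressed FASTQ files to the current directory (-O .). The SRA file is not deleted automatically; it is gone only if you ran the rm cleanup step, in which case the SRR5058924/ directory no longer contains it.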
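Because --split-3 writes only reads that have a mate to the _1 and _2 files, the two files should contain the same number of records. The sketch below counts records in each file, assuming the usual four lines per FASTQ record; `count_reads` and `check_pair` are hypothetical helpers written for this illustration.

```shell
#!/bin/bash
# Count the records in a gzip-compressed FASTQ file,
# assuming four lines per record.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Compare record counts of a forward and a reverse file.
# Returns 0 when they match and 1 when they differ.
check_pair() {
    local fwd_count rev_count
    fwd_count=$(count_reads "$1")
    rev_count=$(count_reads "$2")
    if [ "$fwd_count" -eq "$rev_count" ]; then
        echo "OK: $fwd_count read pairs"
        return 0
    fi
    echo "Mismatch: $fwd_count forward vs $rev_count reverse reads" >&2
    return 1
}

# Run the check only if both expected files are present.
if [ -f SRR5058924_1.fastq.gz ] && [ -f SRR5058924_2.fastq.gz ]; then
    check_pair SRR5058924_1.fastq.gz SRR5058924_2.fastq.gz
fi
```

Counting decompresses both files in full, so the check can take a while on a dataset of this size.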
In the next section, we will assemble these two read files:
SRR5058924_1.fastq.gz (forward reads)
SRR5058924_2.fastq.gz (reverse reads)