How to use the command 'samtools' (with examples)

How to use the command 'samtools' (with examples)

Samtools is a powerful suite of tools specifically designed for the processing and analysis of high-throughput sequencing data. It is widely used in genomics for reading, writing, editing, indexing, and viewing data in various formats, including SAM, BAM, and CRAM. These formats represent sequences and alignments in a way that is optimized for storage and computation. Researchers and bioinformaticians rely on Samtools to perform a variety of tasks that facilitate the interpretation of sequencing data.

Use Case 1: Convert a SAM input file to BAM stream and save to file

Code:

samtools view -S -b input.sam > output.bam

Motivation:

Converting SAM files to BAM format is an essential step in sequencing data analysis. BAM files are binary versions of SAM files and are more space-efficient, making them faster to process and easier to handle. Converting to BAM is often one of the first steps in a data processing pipeline and is crucial for further analysis.

Explanation:

  • view: This subcommand is used for viewing and converting sequence alignment data.
  • -S: Specifies that the input is in SAM format.
  • -b: Indicates that the output should be in BAM format.
  • input.sam: The input SAM file to be converted.
  • >: Redirects the output to a file.
  • output.bam: The file where the BAM output is stored.

Example Output:

The resulting file will not change visibly on the terminal, but the command will create a compressed binary BAM file similar to the original SAM file, reducing its size and making it more efficient for downstream processing.

Use Case 2: Take input from stdin and print the SAM header and any reads overlapping a specific region to stdout

Code:

other_command | samtools view -h - chromosome:start-end

Motivation:

Selecting reads from specific genomic regions is a common need in genomics, particularly when focusing on regions of interest for deeper analysis. This command facilitates the rapid extraction of data without requiring intermediate files, thus improving workflow efficiency.

Explanation:

  • |: A pipe character used to pass the output of other_command directly to samtools view.
  • view: A subcommand for manipulating and viewing sequence data.
  • -h: Ensures that the SAM header is included in the output.
  • -: Specifies that input is coming from stdin, the standard input stream.
  • chromosome:start-end: Defines the genomic region of interest through specifying the chromosome and the range (start and end) of base positions.

Example Output:

After executing this command, the output on the terminal will display the SAM header and reads that overlap the specified genomic region, allowing users to view the relevant sequence alignment data efficiently.

Use Case 3: Sort file and save to BAM

Code:

samtools sort input -o output.bam

Motivation:

Sorting alignments is often a prerequisite for many downstream applications like variant calling. Sorted BAM files ensure that reads are arranged in order of their position in the genome, making the data easier to index and query later.

Explanation:

  • sort: This subcommand arranges the input alignments in a specified order.
  • input: Refers to the unsorted BAM file.
  • -o: Specifies the name of the output file.
  • output.bam: The name of the BAM file where the sorted alignments will be stored.

Example Output:

The command will generate a sorted version of the input BAM file as specified by output.bam. The terminal will not display the contents, but a new file will be created with the aligns sorted by position.

Use Case 4: Index a sorted BAM file

Code:

samtools index sorted_input.bam

Motivation:

Indexing a BAM file is crucial for rapid access to specific parts of the data without loading the entire file into memory. Indexing is especially useful when working with large sequencing datasets to expedite data retrieval.

Explanation:

  • index: This subcommand creates an index for a BAM file.
  • sorted_input.bam: The name of the sorted BAM file to be indexed, resulting in the creation of a .bai index file.

Example Output:

Executing this command will result in the creation of a new file named sorted_input.bam.bai, which accompanies the sorted BAM file. This index file allows for optimized data querying.

Use Case 5: Print alignment statistics about a file

Code:

samtools flagstat sorted_input

Motivation:

Comprehension of data quality and mapping accuracy is critical in genomics. Flagstat quickly provides basic statistics on sequence data, such as the number of mapped reads, duplicates, and secondary alignments, which guide quality control decisions.

Explanation:

  • flagstat: This subcommand computes and displays statistics about the alignments in a BAM file.
  • sorted_input: Refers to the BAM file from which statistics will be derived.

Example Output:

The command will print detailed alignment statistics directly to the terminal. It will include information such as total reads, duplicates, mapped reads, and more, aiding in assessing the dataset’s quality.

Use Case 6: Count alignments to each index

Code:

samtools idxstats sorted_indexed_input

Motivation:

Understanding the distribution of reads across different chromosomes/contigs is essential in many research contexts. Index statistics provide a summary of alignment counts for each indexed segment, facilitating assessment of coverage and sequencing depth.

Explanation:

  • idxstats: This subcommand provides integration of index statistics for BAM files.
  • sorted_indexed_input: Specifies the BAM file that has been both sorted and indexed.

Example Output:

The terminal will display a tab-delimited text table, summarizing the number of alignments for each reference sequence in the BAM file and providing additional insights into the distribution of reads.

Use Case 7: Merge multiple files

Code:

samtools merge output input1 input2 ...

Motivation:

Merging multiple BAM files is often required when dealing with samples that were sequenced in parallel or split over different sequencing runs. This process consolidates data into a single file for comprehensive analysis.

Explanation:

  • merge: This subcommand is used to consolidate multiple BAM files.
  • output: The name of the merged BAM file to be created.
  • input1 input2 ...: Represents the list or space-separated names of the BAM files that will be merged.

Example Output:

The execution results in the creation of a single merged BAM file, named output, derived from the consolidation of specified input files. The merged file will enable unified data analysis across samples.

Use Case 8: Split input file according to read groups

Code:

samtools split merged_input

Motivation:

Splitting a BAM file by read groups can be crucial when different groups are processed separately. This is beneficial during quality control or when specific analyses apply uniquely to different subsets of the data.

Explanation:

  • split: This subcommand divides a BAM file into multiple parts based on read groups.
  • merged_input: The BAM file to be split, typically containing multiple read groups.

Example Output:

The command results in multiple output files, each corresponding to different read groups within the original BAM file. This separation assists in group-specific analyses or processing workflows.

Conclusion:

Samtools provides essential functionalities for managing and analyzing sequencing data efficiently and effectively. These use cases demonstrate a fraction of Samtools’ capabilities, highlighting its role as an indispensable tool in genomics for transforming raw sequencing reads into actionable insights. The straightforward syntax and powerful features make it a go-to resource for bioinformaticians worldwide.

Related Posts

How to Use the Command 'pgmtofs' (with Examples)

How to Use the Command 'pgmtofs' (with Examples)

The pgmtofs command is a utility that forms part of the Netpbm library, which is a toolkit for manipulation of graphic images.

Read More
How to use the command `flac` (with examples)

How to use the command `flac` (with examples)

The flac command is a powerful utility used for handling FLAC (Free Lossless Audio Codec) files, which are an audio format that allows compression without any loss in quality.

Read More
How to Use the Command 'gnucash' (with examples)

How to Use the Command 'gnucash' (with examples)

GnuCash is a robust personal and small-business financial-accounting software that helps users manage their finances efficiently.

Read More