How to use the command 'samtools' (with examples)

Samtools is a command-line toolkit for handling high-throughput sequencing data in genomics. It provides functions for reading, writing, editing, indexing, and viewing data in the SAM, BAM, and CRAM formats. In this article, we illustrate several use cases of the samtools command.

Use case 1: Convert a SAM input file to BAM stream and save to file

Code:

samtools view -S -b input.sam > output.bam

Motivation:

Converting a SAM file to BAM format is a common task when working with sequencing data. The BAM format is a binary representation of the SAM format, which is more efficient for storing and processing large datasets. By using the samtools view command with the -b option, we can convert a SAM file to BAM format and save it as a separate file.

Explanation:

  • view: The command to view data from a SAM/BAM/CRAM format.
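  • -S: Historically declared that the input is in SAM format. Samtools 1.x detects the input format automatically and accepts -S only for backward compatibility, so it can be omitted.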
  • -b: Specifies that the output should be in BAM format.
  • input.sam: The input SAM file.
  • > output.bam: Redirects the output to a file named output.bam.

Example output:

The SAM file input.sam will be converted to BAM format and saved as output.bam.

Use case 2: Take input from stdin and print the SAM header and reads overlapping a specific region to stdout

Code:

other_command | samtools view -h - chromosome:start-end

Motivation:

Sometimes, we need to extract specific data from a SAM/BAM file based on a genomic region of interest. By using samtools view with the -h option, we can print the SAM header along with the reads overlapping the specified genomic region. This is useful for filtering and analyzing data for a specific region of interest.

Explanation:

  • view: The command to view data from a SAM/BAM/CRAM format.
  • -h: Prints the header along with the alignments.
  • -: Specifies that the input will be read from stdin.
  • chromosome:start-end: Specifies the genomic region of interest in the format “chromosome:start-end”. The reads overlapping this region will be printed.

Example output:

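The output of “other_command” will be piped as input to samtools view, which will print the SAM header and any reads overlapping the specified genomic region to stdout. Note that region queries rely on an index, which requires a coordinate-sorted BAM or CRAM file; a plain stream cannot be indexed, so samtools will typically report an error for this form. In that case, write the stream to a file, sort and index it, and query the file instead.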
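When the data only exists as a SAM text stream and no index is available, a rough region filter can be sketched with awk. The function name filter_region and the sample reads below are hypothetical, and the filter is an approximation: it keeps reads whose leftmost position (column 4) lies inside the region, so it misses reads that start before the region and extend into it, which samtools would report.

```shell
# Hypothetical helper, not part of samtools: keep header lines and reads
# whose POS (column 4) falls inside chromosome:start-end.
filter_region() {
  region=$1
  chrom=${region%%:*}     # text before the first colon
  range=${region#*:}      # text after the first colon, e.g. 50-200
  start=${range%-*}
  end=${range#*-}
  awk -F '\t' -v c="$chrom" -v s="$start" -v e="$end" '
    /^@/ { print; next }                      # pass header lines through
    $3 == c && $4 >= s && $4 <= e { print }   # read starts inside the region
  '
}

# Made-up SAM stream standing in for other_command.
sam_stream() {
  printf '@HD\tVN:1.6\tSO:coordinate\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r2\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
}

sam_stream | filter_region chr1:50-200
```

With the sample stream, the two header lines and read r1 are printed; r2 starts at position 500 and is dropped.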

Use case 3: Sort file and save to BAM format

Code:

samtools sort input -o output.bam

Motivation:

Many downstream steps, such as indexing, require sorted input. By default, samtools sort orders alignments by leftmost coordinate and writes the result to the file given with -o. The output format is automatically determined from the output file’s extension.

Explanation:

  • sort: The command to sort a SAM/BAM file.
  • input: The input file to be sorted.
  • -o output.bam: Specifies the output file name as “output.bam” in BAM format.

Example output:

The input file will be sorted, and the sorted data will be saved as output.bam in BAM format.

Use case 4: Index a sorted BAM file

Code:

samtools index sorted_input.bam

Motivation:

Indexing a coordinate-sorted BAM file allows quick retrieval of alignments from specific genomic locations. The index is a companion file that maps genomic regions to offsets within the BAM file, so tools can jump straight to a region instead of scanning the whole file.

Explanation:

  • index: The command to index a sorted BAM file.
  • sorted_input.bam: The path to the sorted BAM file to be indexed.

Example output:

The sorted_input.bam file will be indexed, creating a file named sorted_input.bam.bai that contains the index information.
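The conversion, sorting, and indexing steps from use cases 1, 3, and 4 are often chained. The following is a minimal sketch under stated assumptions: the reads and the 1000 bp contig "chr1" are made up, all files live in a scratch directory, and the samtools calls are skipped when samtools is not installed.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Tiny hand-written SAM file (made-up reads, deliberately out of order).
{
  printf '@HD\tVN:1.6\tSO:unsorted\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r2\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} > "$workdir/input.sam"

if command -v samtools >/dev/null 2>&1; then
  samtools view -b -o "$workdir/input.bam" "$workdir/input.sam"   # SAM to BAM
  samtools sort -o "$workdir/sorted.bam" "$workdir/input.bam"     # coordinate sort
  samtools index "$workdir/sorted.bam"                            # writes sorted.bam.bai
  samtools view "$workdir/sorted.bam" chr1:50-200                 # indexed region query
else
  echo "samtools not found; skipping the pipeline" >&2
fi
```

Because the sorted file is indexed, the final region query works here, whereas the same query on an unindexed stream would not.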

Use case 5: Print alignment statistics about a file

Code:

samtools flagstat sorted_input

Motivation:

Alignment statistics are a basic quality-control check for a sequencing experiment. samtools flagstat makes a single pass over the file and counts alignments by FLAG category (for example mapped, paired, or duplicate). The input does not need to be sorted or indexed.

Explanation:

  • flagstat: The command to generate alignment statistics.
  • sorted_input: The input SAM/BAM/CRAM file to calculate the statistics for (sorting is not required).

Example output:

Alignment statistics, such as the number of total reads, the number of mapped reads, and the mapping rate, will be printed to the console.
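To illustrate where such counts come from, the sketch below derives one of them, mapped versus unmapped, from the FLAG column (column 2) of SAM text: a record is unmapped when bit 0x4 is set. This is only an illustration, not how samtools is implemented; the function name count_mapped and the three sample records are hypothetical.

```shell
# Hypothetical helper: count records and those without the 0x4 (unmapped) bit.
count_mapped() {
  awk -F '\t' '
    /^@/ { next }                              # skip header lines
    {
      total++
      if (int($2 / 4) % 2 == 0) mapped++       # bit 0x4 clear means mapped
    }
    END { printf "%d total, %d mapped\n", total, mapped }
  '
}

# Made-up records: FLAG 0 (mapped), 4 (unmapped), 16 (mapped, reverse strand).
{
  printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
  printf 'r3\t16\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} | count_mapped
# prints: 3 total, 2 mapped
```

On a real file the same function could be fed with "samtools view -h input.bam | count_mapped", although flagstat itself is faster and reports many more categories.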

Use case 6: Count alignments to each index (chromosome/contig)

Code:

samtools idxstats sorted_indexed_input

Motivation:

Counting alignments per reference sequence (chromosome/contig) shows how reads are distributed across the genome. samtools idxstats reads these counts from the index of a sorted and indexed BAM file, so it is fast even for large files.

Explanation:

  • idxstats: The command to count alignments to each index.
  • sorted_indexed_input: The input sorted and indexed BAM file.

Example output:

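One tab-separated line per reference sequence (chromosome/contig) will be printed, giving the sequence name, its length, the number of mapped read segments, and the number of unmapped read segments. These are read counts rather than per-base coverage.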
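Because the output is plain tab-separated text, it is easy to post-process. The sketch below turns the mapped counts (column 3) into each sequence's share of all mapped reads. The function name idxstats_fractions and the numbers in the sample input are made up for illustration.

```shell
# Hypothetical helper: print name, mapped count, and fraction of all mapped reads.
idxstats_fractions() {
  awk -F '\t' '
    { name[NR] = $1; mapped[NR] = $3; total += $3 }
    END {
      for (i = 1; i <= NR; i++)
        if (total > 0)
          printf "%s\t%d\t%.2f\n", name[i], mapped[i], mapped[i] / total
    }
  '
}

# Made-up idxstats-style input; the "*" line holds unmapped reads with no reference.
printf 'chr1\t1000\t75\t0\nchr2\t500\t25\t0\n*\t0\t0\t3\n' | idxstats_fractions
```

With real data the function would be used as "samtools idxstats sorted_indexed_input | idxstats_fractions".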

Use case 7: Merge multiple files

Code:

samtools merge output input1 input2 ...

Motivation:

Merging multiple SAM/BAM files into a single file is often necessary when combining data from multiple sequencing runs or samples. samtools merge expects the input files to be sorted in the same order and produces a single sorted output file.

Explanation:

  • merge: The command to merge multiple SAM/BAM files.
  • output: The output file to store the merged data.
  • input1 input2 ...: Multiple input SAM/BAM files to be merged.

Example output:

The multiple input files (input1, input2, etc.) will be merged into a single output file, which is specified as the output parameter.

Use case 8: Split input file according to read groups

Code:

samtools split merged_input

Motivation:

Read groups (the @RG header lines and the RG tag on each read) typically identify the sequencing run, lane, library, or sample a read came from. Splitting a SAM/BAM file by read group is useful when you want to analyze these subsets independently, for example after several runs have been merged. By using samtools split, we can split a merged BAM file into separate files based on the read groups.

Explanation:

  • split: The command to split a SAM/BAM file based on read groups.
  • merged_input: The input BAM file to be split.

Example output:

The merged_input BAM file will be split into separate output files, one per read group (@RG entry) present in the input file's header.
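Before splitting, it can help to check which read groups a file declares. The sketch below extracts the ID field of each @RG line from SAM header text. The function name list_read_groups and the sample header are hypothetical.

```shell
# Hypothetical helper: print the ID of every @RG header line.
list_read_groups() {
  awk -F '\t' '
    $1 == "@RG" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^ID:/) print substr($i, 4)   # drop the "ID:" prefix
    }
  '
}

# Made-up header with two read groups.
printf '@HD\tVN:1.6\n@RG\tID:runA\tSM:s1\n@RG\tID:runB\tSM:s1\n' | list_read_groups
```

On a real file the header can be supplied with "samtools view -H merged_input | list_read_groups"; the number of IDs printed indicates how many per-read-group files to expect from the split.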

Conclusion

Samtools is a versatile command-line tool for manipulating high-throughput sequencing data in the SAM, BAM, and CRAM formats. The examples in this article demonstrate some of its commonly used functions: converting between formats, sorting, indexing, reporting statistics, merging, splitting, and extracting specific data from SAM/BAM files. They should serve as a starting point for using samtools to analyze and process sequencing data efficiently.
