How to use the command 'blastn' (with examples)
- December 17, 2024
The ‘blastn’ command stands for Basic Local Alignment Search Tool for nucleotides. It is a robust bioinformatics tool used for comparing nucleotide sequences to sequence databases and calculating the statistical significance of matches. It is commonly used in genomics and molecular biology to find homologous nucleotide sequences, allowing for the identification and annotation of genes, understanding evolutionary relationships, and providing insights into genetic functions.
Use case 1: Align two or more sequences using megablast (the default task), with an e-value threshold of 1e-9 and the default pairwise output format
Code:
blastn -query query.fa -subject subject.fa -evalue 1e-9
Motivation:
This use case is ideal for quickly aligning sequences that are highly similar, such as those from closely related organisms or strains. By using megablast, which is optimized for speed and handling near-identical sequences, users can efficiently compare sequences with great precision. The e-value threshold of 1e-9 further refines the search, ensuring that only highly statistically significant alignments are reported, minimizing noise in the results.
Explanation:
-query query.fa
: Specifies the file containing the nucleotide sequence(s) to be searched against the subject.
-subject subject.fa
: The file containing the sequence(s) to which the query sequences are compared.
-evalue 1e-9
: Sets the expectation value (e-value) threshold. Only alignments with an e-value no greater than 1e-9 are reported.
Example Output:
Sequences producing significant alignments:
(>subject sequence)
Score E
(1000+) 1e-10
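To try this command without real data, the inputs can be mocked up. The sketch below is only illustrative: the sequence is invented for this example, the subject is built by padding the query with three extra bases on each side, the files go into a temporary directory, and the alignment runs only if blastn is found on the PATH.

```shell
# Hypothetical toy data; replace with real FASTA files in practice.
workdir=$(mktemp -d)
query_seq=ACGTTGCATGCCGATAGCTAGGCTTACGATCGGATCCGATTAGCGCATATCGGCTAGCTAA

# One FASTA record per file: a header line starting with '>' and the sequence.
printf '>query1\n%s\n' "$query_seq" > "$workdir/query.fa"
printf '>subject1\nTTG%sGGC\n' "$query_seq" > "$workdir/subject.fa"

# Run the alignment from use case 1 only when BLAST+ is installed.
if command -v blastn >/dev/null 2>&1; then
    blastn -query "$workdir/query.fa" -subject "$workdir/subject.fa" -evalue 1e-9
else
    echo "blastn not found on PATH; skipping the alignment" >&2
fi
```

Because the query is embedded unchanged in the subject, the exact match is long enough for megablast's default word size of 28.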
Use case 2: Align two or more sequences using blastn
Code:
blastn -task blastn -query query.fa -subject subject.fa
Motivation:
This approach suits general alignment tasks where sensitivity matters more than speed. The blastn task uses a shorter default word size than megablast (11 instead of 28), so it can align more divergent sequences, such as those from distantly related organisms, at the cost of longer run times.
Explanation:
-task blastn
: Specifies the traditional ‘blastn’ algorithm, which is more sensitive and can compare more distantly related sequences than megablast.
-query query.fa
: The input file with query sequences.
-subject subject.fa
: The target file containing subject sequences.
Example Output:
>subject sequence
Length = 708
Score = 200.1 bits (101), Expect = 0.01
Use case 3: Align two or more sequences, custom tabular output format, output to file
Code:
blastn -query query.fa -subject subject.fa -outfmt '6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident' -out output.tsv
Motivation:
This command is utilized for precise reporting of alignment metrics in a structured format, suitable for further data analysis, documentation, or reporting purposes. Storing the results in a tab-separated format is particularly useful for importing the data into spreadsheets or programs that analyze tabular data, enabling automated analysis pipelines.
Explanation:
-query query.fa
: File with the nucleotide sequences for the query.
-subject subject.fa
: Target sequences for alignment.
-outfmt '6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident'
: Specifies a custom tabular output format with columns for query sequence ID, query length, start and end of query, subject sequence ID, subject length, start and end of the subject, alignment bit score, e-value, and percentage identity.
-out output.tsv
: Designates an output file where results are saved.
Example Output:
query1	800	1	800	subject1	810	5	805	1000	1e-10	99.8
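Once the results are in this tabular form, standard text tools can filter them. The following is a minimal sketch, assuming the exact column order from the -outfmt string above (so e-value is column 10 and percentage identity is column 11). The sample rows are invented for illustration, and the cut-offs of 95% identity and 90% query coverage are arbitrary choices, not recommendations.

```shell
# Made-up sample rows standing in for output.tsv.
# Columns: qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident
tsv=$(mktemp)
printf 'q1\t800\t1\t800\ts1\t900\t5\t804\t1400\t0.0\t99.50\n'    >  "$tsv"
printf 'q1\t800\t101\t300\ts2\t500\t1\t200\t250\t1e-65\t90.00\n' >> "$tsv"
printf 'q2\t400\t1\t400\ts3\t400\t1\t400\t739\t0.0\t100.00\n'    >> "$tsv"

# Keep hits with >= 95% identity that cover >= 90% of the query.
# Query coverage = aligned query span / query length * 100.
filtered=$(awk -F'\t' -v OFS='\t' '
    { qcov = ($4 - $3 + 1) / $2 * 100 }
    $11 >= 95 && qcov >= 90 { print $1, $5, $11, sprintf("%.1f", qcov) }
' "$tsv")

# Print query ID, subject ID, percent identity, and query coverage.
printf '%s\n' "$filtered"
```

With these three sample rows, only the q1/s1 and q2/s3 hits pass both cut-offs; the q1/s2 hit is dropped because its identity is 90%.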
Use case 4: Search a nucleotide database with a nucleotide query, using 16 threads (CPUs) and keeping at most 10 aligned sequences
Code:
blastn -query query.fa -db path/to/blast_db -num_threads 16 -max_target_seqs 10
Motivation:
This approach is optimal for high-throughput environments where computational efficiency and speed are crucial. By leveraging multiple processor threads, the search can be significantly accelerated, which is advantageous in large-scale comparisons or when working with extensive databases. Limiting the results to the top 10 aligned sequences also helps focus on the most promising leads.
Explanation:
-query query.fa
: The input file containing the query sequences.
-db path/to/blast_db
: Specifies the path to the database against which the sequences are to be searched.
-num_threads 16
: Utilizes 16 CPU threads to boost processing speed.
-max_target_seqs 10
: Keeps at most 10 aligned database sequences per query. This limits the number of subject sequences, not the number of individual alignments.
Example Output:
>database_entry_01
Length = 1000
Score = 250.0 bits (120), Expect = 0.001
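The thread count does not have to be hard-coded. As a rough sketch, the snippet below reads the number of online CPUs with getconf, caps it at 16 to match the example, and prints the resulting command rather than running it. The cap, the variable names, and the placeholder paths query.fa and path/to/blast_db are assumptions carried over from the example above.

```shell
# Number of online CPUs; fall back to 1 if getconf cannot report it.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Cap at 16 so the command never asks for more threads than the example uses.
if [ "$threads" -gt 16 ]; then
    threads=16
fi

# Assemble the search command; the paths are placeholders.
cmd="blastn -query query.fa -db path/to/blast_db -num_threads $threads -max_target_seqs 10"
echo "Would run: $cmd"
```

Printing the command first makes it easy to check the chosen thread count before launching a long search.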
Use case 5: Search the remote non-redundant nucleotide database using a nucleotide query
Code:
blastn -query query.fa -db nt -remote
Motivation:
This scenario is frequently employed when on-site database resources are limited or unavailable. By searching against the extensive, regularly updated, and non-redundant nucleotide databases hosted remotely, researchers ensure they are working with the most current data globally available, which is critical for accurate annotations and evolutionary studies.
Explanation:
-query query.fa
: Contains the nucleotide sequences for searching.
-db nt
: Targets NCBI's non-redundant nucleotide sequence database.
-remote
: Sends the search to NCBI's servers instead of running it locally, so no local copy of the database is needed. It cannot be combined with -num_threads.
Example Output:
Results are printed to standard output in the same pairwise format as a local search, or written to a file if -out is given.
Use case 6: Display help (use -help for detailed help)
Code:
blastn -h
Motivation:
Accessing help documentation is crucial for users to understand the available options and configurations within ‘blastn’. This ensures proper usage and allows users to leverage its full potential by exploring advanced options when needed, facilitating efficient troubleshooting and learning.
Explanation:
-h
: Prints a short usage summary and description, serving as a quick reference. The -help flag prints the full list of command-line options with their descriptions.
Example Output:
USAGE
blastn [options]...
Conclusion:
The ‘blastn’ command-line tool is an indispensable utility for biologists and bioinformaticians interested in nucleotide sequence analysis. With a variety of configurations and options, ‘blastn’ caters to diverse research needs—from rapid alignment of similar sequences to broader database searches—and provides functionalities to fine-tune output formats and boost computational performance. Through these examples, users can appreciate and exploit the versatility and power of ‘blastn’ in their research endeavors.