How to Use the Command 'blastp' (with Examples)
- December 17, 2024
BLASTP (protein-protein BLAST, where BLAST stands for Basic Local Alignment Search Tool) is widely used in bioinformatics to compare an amino acid query sequence against a protein sequence database or against other protein sequences. It helps researchers identify homologous proteins and infer functional and evolutionary relationships, which supports research applications such as comparative genomics and proteomics.
Use case 1: Align two or more sequences using blastp, with the e-value threshold of 1e-9, pairwise output format, output to screen
Code:
blastp -query query.fa -subject subject.fa -evalue 1e-9
Motivation:
This use case is ideal for researchers looking to determine the degree of similarity between two protein sequences. Using a stringent e-value threshold ensures only highly significant matches are considered, which is useful in filtering out random matches and focusing on biologically relevant alignments.
Explanation:
-query query.fa: Specifies the FASTA file containing the query protein sequence(s).
-subject subject.fa: Specifies the FASTA file containing the subject protein sequence(s) to align against.
-evalue 1e-9: Sets the expectation value (e-value) threshold for reporting matches. The e-value is the number of hits with an equal or better score that would be expected by chance, so a cutoff of 1e-9 reports only highly significant alignments and filters out less significant ones.
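To give a feel for what a 1e-9 cutoff means, here is a rough worked example based on the standard Karlin-Altschul relation between bit score S' and e-value. The lengths m = n = 300 are assumptions borrowed from the illustrative alignment in this article, and BLAST itself uses effective (edge-corrected) lengths, so real values will differ somewhat.

```latex
E = m \, n \, 2^{-S'}
\qquad\text{so}\qquad
E \le 10^{-9}
\iff
S' \ge \log_2\!\frac{m \, n}{10^{-9}}
= \log_2\!\frac{300 \times 300}{10^{-9}}
= \log_2\!\left(9 \times 10^{13}\right)
\approx 46.4 \ \text{bits}
```

Under these assumptions, an alignment between two 300-residue proteins would need a bit score of roughly 46 or more to pass the threshold.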
Example Output:
Identities = 250/300 (83%), Positives = 270/300 (90%), Gaps = 2/300 (0.67%)
Query 1 MEIVA... 300
Subject 1 MEIVA... 300
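Because both -query and -subject accept FASTA files with more than one record, it can help to check how many sequences a file holds before aligning. The following is a minimal sketch; the file name sample_query.fa and its two short sequences are made up purely so the snippet runs on its own.

```shell
#!/bin/sh
# Hypothetical two-record FASTA file, written only so the sketch is runnable.
cat > sample_query.fa <<'EOF'
>query1 example protein
MEIVAKLT
>query2 example protein
MFVLKQRS
EOF

# Each FASTA record starts with a '>' header line, so counting those lines
# counts the sequences that would be read from the file.
n_seqs=$(grep -c '^>' sample_query.fa)
echo "sample_query.fa contains $n_seqs sequence(s)"
```

With several query and subject sequences, every query is compared against every subject, so the number of reported alignments can grow quickly.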
Use case 2: Align two or more sequences using blastp-fast
Code:
blastp -task blastp-fast -query query.fa -subject subject.fa
Motivation:
BLASTP-FAST offers a quicker alignment option for users who need results promptly and can accommodate slightly less sensitivity in their analysis. This is particularly beneficial in large-scale studies or time-sensitive research projects.
Explanation:
-task blastp-fast: Selects the blastp-fast task, which runs faster at the cost of some sensitivity.
-query query.fa: Input query protein sequence file.
-subject subject.fa: Input subject protein sequence file for alignment.
Example Output:
Identities = 240/300 (80%), Positives = 260/300 (87%)
Query 1 MFVLK... 300
Subject 1 MFVLK... 300
Use case 3: Align two or more sequences, custom tabular output format, output to file
Code:
blastp -query query.fa -subject subject.fa -outfmt '6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident' -out output.tsv
Motivation:
Researchers may require customized output formats to meet specific data analysis needs. By tailoring the results, they can integrate output into other bioinformatics tools or pipelines for further investigation or visualization.
Explanation:
-query query.fa: Input query file containing the protein sequence.
-subject subject.fa: Input subject file containing the protein sequence for comparison.
-outfmt '6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident': Specifies a custom tabular format for output, which includes fields like query and subject sequence IDs, lengths, start and end positions, bitscore, e-value, and percentage identity.
-out output.tsv: Directs the output to the specified file, "output.tsv".
Example Output (output.tsv):
query1 300 1 300 subject1 300 1 300 500 1e-20 95.0
query2 285 1 285 subject2 280 1 280 450 3e-15 90.0
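One reason to choose tabular output is that it is easy to post-process with standard tools. The sketch below filters the example rows by percent identity and derives query coverage from the qstart, qend, and qlen columns. The file name sample_output.tsv and the 95% cutoff are illustrative assumptions, not BLAST defaults, and the rows are the illustrative ones from this article rather than real results.

```shell
#!/bin/sh
# Sample rows copied from the example output above (tab-separated).
# Columns: qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident
printf 'query1\t300\t1\t300\tsubject1\t300\t1\t300\t500\t1e-20\t95.0\n' >  sample_output.tsv
printf 'query2\t285\t1\t285\tsubject2\t280\t1\t280\t450\t3e-15\t90.0\n' >> sample_output.tsv

# Keep hits with at least min_pident percent identity and print:
# query ID, subject ID, percent identity, query coverage of the aligned region.
filtered=$(awk -F '\t' -v min_pident=95 '
    $11 + 0 >= min_pident {
        qcov = 100 * ($4 - $3 + 1) / $2
        printf "%s\t%s\t%.1f\t%.1f\n", $1, $5, $11, qcov
    }' sample_output.tsv)

printf '%s\n' "$filtered"
```

The column numbers ($2, $3, $4, $11) depend on the field order given to -outfmt, so they need adjusting if the field list changes.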
Use case 4: Search a protein database with a protein query, using 16 threads and keeping at most 10 aligned sequences
Code:
blastp -query query.fa -db blast_database_name -num_threads 16 -max_target_seqs 10
Motivation:
Incorporating parallel computing by utilizing 16 threads accelerates the BLAST search, making it feasible to process large datasets efficiently. Limiting output to the top 10 sequences helps focus the analysis on the most relevant alignments.
Explanation:
-query query.fa: Specifies the input query sequence.
-db blast_database_name: Designates the protein database to search.
-num_threads 16: Uses 16 threads for parallel processing, improving performance on multicore systems.
-max_target_seqs 10: Keeps at most 10 aligned database sequences per query, concentrating on top hits.
Example Output:
Sequence E-value Identity
Seq1 2e-50 95%
Seq2 4e-45 93%
...
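Asking for 16 threads only helps if the machine has that many cores. As a rough sketch, a wrapper script could cap the thread count at the number of CPUs the system reports. The snippet prints the resulting command instead of running it, so it works even where BLAST+ is not installed; the variable names are illustrative.

```shell
#!/bin/sh
# Cap the thread count at 16 but never exceed the CPUs the system reports.
cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
max_threads=16
if [ "$cpus" -lt "$max_threads" ]; then
    threads=$cpus
else
    threads=$max_threads
fi

# Print the command instead of running it, so the sketch works without BLAST+.
echo "blastp -query query.fa -db blast_database_name -num_threads $threads -max_target_seqs 10"
```

The database named by -db is typically built beforehand from a protein FASTA file, for example with makeblastdb -in proteins.fa -dbtype prot -out blast_database_name, where proteins.fa is a hypothetical file name.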
Use case 5: Search the remote non-redundant protein database using a protein query
Code:
blastp -query query.fa -db nr -remote
Motivation:
Accessing remote databases like the non-redundant (nr) protein database allows researchers to utilize comprehensive datasets offered by NCBI. This is particularly useful for comparative studies against a wide array of known proteins.
Explanation:
-query query.fa: Denotes the protein sequence to query.
-db nr: Specifies the non-redundant (nr) protein database.
-remote: Runs the search on NCBI's servers instead of locally, giving access to up-to-date and extensive datasets without a local database installation.
Example Output:
Protein_ID Description E-value
XP_001234 Hypothetical Protein 1 1e-35
XP_002345 Conserved protein precursor 2e-40
...
Use case 6: Display help (use -help for detailed help)
Code:
blastp -h
Motivation:
Researchers and bioinformaticians often need to quickly check command options and syntax, especially when dealing with intricate parameters or when scripting automated workflows.
Explanation:
-h: Displays a brief help message, listing available command-line options and their short descriptions.
Example Output:
Usage: blastp [options]
Options:
-help Print full usage, including all advanced options.
-query <File_In> File name of input file containing query sequence(s).
...
Conclusion:
BLASTP remains an invaluable tool in bioinformatics for protein sequence comparison and functional annotation. From aligning sequences to analyzing vast protein databases, the command offers a robust set of features to support diverse research endeavors. By understanding its different usage scenarios, researchers can leverage BLASTP to unveil novel insights into protein function and evolution.