getprotent - Linux
Overview
getprotent extracts protein sequences from EMBL- or GenBank-formatted nucleotide sequence databases. It is commonly used in bioinformatics workflows to retrieve specific protein sequences or to build protein databases.
Syntax
getprotent [options] <database_file> <query_file> <output_file>
Options/Flags
- -s, --sequence-type: Specify the type of protein sequence to retrieve. Options include CDS, PRE, and mRNA. (Default: CDS)
- -t, --translation-table: Define the translation table to use when extracting protein sequences. (Default: standard)
- -a, --ambiguous: Allow ambiguous protein sequences in the output.
- -f, --force-cleanup: Override errors and clean up the output file even if retrieval is unsuccessful.
- --version: Display the version information.
- --help: Show the help screen.
Examples
Extract CDS protein sequences from a GenBank file:
getprotent -s CDS genbank.gbk query.txt output.fasta
Retrieve protein sequences using a custom translation table:
getprotent -t custom_table.tbl genbank.gbk query.txt output.fasta
Extract mRNA sequences, allowing ambiguous sequences in the output:
getprotent -s mRNA -a genbank.gbk query.txt output.fasta
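The first example can be repeated over several database files with a small wrapper. The following is a minimal sketch, not part of getprotent: the function name extract_batch and the GETPROTENT override variable are hypothetical, and the command line simply follows the syntax documented on this page. Setting GETPROTENT=echo previews the commands without running them.

```shell
#!/bin/sh
# Sketch: run the documented "-s CDS" extraction over several database files.
# extract_batch and GETPROTENT are illustrative names, not getprotent features.

extract_batch() {
    query=$1
    shift
    for db in "$@"; do
        # Derive the output name by swapping the extension: genbank.gbk -> genbank.fasta
        out="${db%.*}.fasta"
        # Stop at the first failure so a bad input is not silently skipped.
        "${GETPROTENT:-getprotent}" -s CDS "$db" "$query" "$out" || return 1
    done
}
```

For example, extract_batch query.txt a.gbk b.gbk would write a.fasta and b.fasta, assuming each input has a file extension to strip.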
Common Issues
Error: Invalid database format:
Ensure that the input database file is in either EMBL or GenBank format.
Error: Query sequence not found:
Verify that the query sequence in the query file matches a sequence in the database.
Error: Output file not writable:
Check that the output file has the correct permissions and is not open in another process.
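Two of the issues above can be checked before running the tool. The sketch below is a heuristic, not something getprotent provides: it relies only on the fact that GenBank records begin with a LOCUS line and EMBL records begin with an ID line, so files that carry a release header before the first record will be reported as unknown. The function names detect_db_format and can_write_output are hypothetical.

```shell
#!/bin/sh
# Heuristic pre-flight checks; these helpers are illustrative, not part of getprotent.

# Print "genbank", "embl", or "unknown" based on the first non-blank line.
detect_db_format() {
    first_line=$(awk 'NF { print; exit }' "$1" 2>/dev/null)
    case "$first_line" in
        "LOCUS "*) echo genbank ;;   # GenBank records start with LOCUS
        "ID   "*)  echo embl ;;      # EMBL records start with "ID" plus three spaces
        *)         echo unknown ;;
    esac
}

# Succeed if the output path can be written: either the file exists and is
# writable, or it does not exist yet and its directory is writable.
can_write_output() {
    if [ -e "$1" ]; then
        [ -w "$1" ]
    else
        [ -w "$(dirname "$1")" ]
    fi
}
```

A typical use is to call detect_db_format genbank.gbk and can_write_output output.fasta first, and only run getprotent when the format is not unknown and the output check succeeds.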
Integration
Combine with grep:
getprotent genbank.gbk query.txt tmp.fasta && grep -A 1 "Protein" tmp.fasta > output.fasta
Use with BLAST:
getprotent genbank.gbk query.txt blast.fasta
blastp -query blast.fasta -db protein.db -out result.txt
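Before handing the extracted file to blastp, it can help to confirm that it actually contains sequences. The sketch below assumes the output is FASTA (suggested by the .fasta names used on this page, but not confirmed) and counts header lines, which begin with ">". The function name count_fasta_records is hypothetical.

```shell
#!/bin/sh
# Sketch: count FASTA records so an empty extraction is caught before BLAST.
# Assumes the output file is FASTA; count_fasta_records is an illustrative helper.

count_fasta_records() {
    # Each FASTA record starts with a ">" header line; "n + 0" prints 0 for empty files.
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}
```

If count_fasta_records blast.fasta prints 0, the blastp step can be skipped. Note also that the -db argument of blastp expects a BLAST database, which is normally built from a protein FASTA file with makeblastdb -dbtype prot.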
Related Commands
- grep: Search and filter text.
- blastp: Perform protein sequence alignment and similarity search.
- seqret: Read and write sequence data, converting between formats such as EMBL, GenBank, and FASTA (part of EMBOSS).