getprotent - Linux


Overview

getprotent extracts protein sequences from EMBL- or GenBank-formatted nucleotide sequence databases. It can be used in bioinformatics workflows to retrieve specific protein sequences or to build protein databases.

Syntax

getprotent [options] <database_file> <query_file> <output_file>

Options/Flags

  • -s, --sequence-type: Specify the type of protein sequence to retrieve. Options include CDS, PRE, and mRNA. (Default: CDS)
  • -t, --translation-table: Define the translation table to use when extracting protein sequences. (Default: standard)
  • -a, --ambiguous: Allow ambiguous protein sequences in the output.
  • -f, --force-cleanup: Override errors and clean up the output file even if retrieval is unsuccessful.
  • --version: Display the version information.
  • --help: Show the help screen.

Examples

Extract protein sequences of type CDS from a GenBank file:

getprotent -s CDS genbank.gbk query.txt output.fasta

Retrieve protein sequences using a custom translation table:

getprotent -t custom_table.tbl genbank.gbk query.txt output.fasta

Extract sequences of type mRNA and allow ambiguous sequences in the output:

getprotent -s mRNA -a genbank.gbk query.txt output.fasta

Common Issues

Error: Invalid database format:

Ensure that the input database file is in either EMBL or GenBank format.

Error: Query sequence not found:

Verify that each query in the query file matches a sequence in the database.

Error: Output file not writable:

Check that the output file (or its directory, if the file does not exist yet) is writable and that no other process has the file open.
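
The first and third checks above can be automated before the tool is invoked. The following is a minimal sketch, not part of getprotent itself: the helper names detect_format and output_writable are hypothetical, and the format test relies only on the fact that GenBank flat-file records begin with a LOCUS line and EMBL entries begin with an ID line.

```shell
#!/bin/sh
# Pre-flight helpers (hypothetical names, not provided by getprotent).

# Print "genbank", "embl", or "unknown" based on the first non-empty line.
detect_format() {
    first_line=$(awk 'NF { print; exit }' "$1" 2>/dev/null)
    case "$first_line" in
        LOCUS*) echo "genbank" ;;   # GenBank records start with a LOCUS line
        "ID "*) echo "embl" ;;      # EMBL entries start with an ID line
        *)      echo "unknown" ;;
    esac
}

# Succeed if the output path can be overwritten, or created if it is missing.
output_writable() {
    if [ -e "$1" ]; then
        [ -w "$1" ]
    else
        [ -w "$(dirname "$1")" ]
    fi
}

# Example use (commented out so the helpers can be sourced on their own):
# [ "$(detect_format genbank.gbk)" != "unknown" ] \
#     && output_writable output.fasta \
#     && getprotent genbank.gbk query.txt output.fasta
```

These checks only look at the first record and at file permissions, so they will not catch a file that is malformed further down or an output file that another process has locked.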

Integration

Combine with grep:

getprotent genbank.gbk query.txt tmp.fasta && grep -A 1 "Protein" tmp.fasta > output.fasta
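
Note that grep -A 1 keeps only one line after each matching header, and GNU grep prints a "--" separator between non-adjacent matches, so it is only safe when every sequence sits on a single line. As a hedged alternative, the sketch below uses awk to keep whole records. It assumes the output is ordinary FASTA with headers starting with ">"; the function name filter_fasta is hypothetical.

```shell
#!/bin/sh
# Print every FASTA record whose header contains the keyword,
# including all of its (possibly wrapped) sequence lines.
filter_fasta() {
    awk -v kw="$1" '
        /^>/ { keep = (index($0, kw) > 0) }  # header line: decide whether to keep this record
        keep                                  # print lines belonging to a kept record
    ' "$2"
}

# Example use:
# filter_fasta "Protein" tmp.fasta > output.fasta
```

The keyword is matched as a fixed string rather than a regular expression, which avoids surprises when headers contain characters such as "|" or ".".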

Use with BLAST:

getprotent genbank.gbk query.txt blast.fasta
blastp -query blast.fasta -db protein.db -out result.txt
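
In a script, it may be worth stopping before the search if extraction failed or produced no sequences. The sketch below wraps the two commands above. It assumes that getprotent returns a non-zero exit status on failure, which this page does not state, and the function names count_fasta_records and run_pipeline are hypothetical.

```shell
#!/bin/sh
# Count FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Extract sequences, then run blastp only if at least one record was written.
run_pipeline() {
    # Assumption: getprotent exits non-zero when retrieval fails.
    getprotent genbank.gbk query.txt blast.fasta || return 1

    # grep -c exits non-zero when the count is 0 or the file is missing.
    n=$(count_fasta_records blast.fasta 2>/dev/null) || n=0

    if [ "$n" -gt 0 ]; then
        blastp -query blast.fasta -db protein.db -out result.txt
    else
        echo "no sequences extracted; skipping blastp" >&2
        return 1
    fi
}

# Example use:
# run_pipeline && echo "results written to result.txt"
```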

Related Commands

  • grep: Search and filter text.
  • blastp: Perform protein sequence alignment and similarity search.
  • seqret: EMBOSS tool that reads sequences and writes them back out, for example to convert between formats such as EMBL and GenBank.