This script uses Bio::SeqFeature::Tools::Unflattener and
Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene
containment hierarchies mapped for optimal display in gbrowse.
The input files are assumed to be gzipped GenBank flatfiles for RefSeq
contigs. The files may contain multiple GenBank records. Either a
single file or an entire directory can be processed. By default, the
DNA sequence is embedded in the GFF, but it can be saved into a separate
FASTA file with the --split (-y) option.
If an input file contains multiple records, the default behaviour is
to dump all GFF and sequence to a file of the same name (with .gff
appended). Using the nolump option will create a separate file for
each GenBank record. Using the split option will create separate
GFF and FASTA files for each GenBank record.
split and nolump produce many files
In cases where the input files contain many GenBank records (for
example, the chromosome files for the mouse genome build), a very
large number of output files will be produced if the split or
nolump options are selected. If you end up with more than 6000 files,
use the --long_list option in bp_bulk_load_gff.pl or
bp_fast_load_gff.pl to load the GFF and/or FASTA files.
Designed for RefSeq
This script is designed for RefSeq genomic sequence entries. It may
work for third party annotations but this has not been tested.
But see below: Uniprot/Swissprot works, as do EMBL and possibly
EMBL/Ensembl, if you don't mind some gene model unflattener errors (dgg).
G-R-P-E Gene Model
Don Gilbert reworked this script to produce GFF3 suited to
loading into GMOD Chado databases. Most of the changes, I believe, are
suited for general use. One main Chado-specific addition is the
--cds2prot flag.
My preference is to set the above as ON by default (disable with --nocds2prot).
For general use it probably should be OFF, enabled with --cds2prot.
This writes GFF with an alternate, but useful, gene model
instead of the consensus model for GFF3
[ gene > mRNA > (exon,CDS,UTR) ]
This alternate is
gene > mRNA > polypeptide > exon
This means the only feature with DNA bases is the exon. The others
specify only location ranges on a genome. Exon of course is a child
of mRNA and protein/peptide.
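
As a rough illustration of the alternate hierarchy, the Python sketch below (not the script's own Perl) formats a few GFF3 lines in which each exon lists both the mRNA and the polypeptide as parents. The feature IDs, coordinates, seqid, source column, and the `gff3_line` helper are all hypothetical, and the exact attributes the script writes may differ; the only GFF3 rule relied on is that multiple Parent values are comma-separated.

```python
# Hypothetical features: IDs, coordinates, and the seqid are made up
# for illustration and do not come from the script or any real record.
features = [
    # (type, start, end, ID, parent IDs)
    ("gene", 100, 600, "gene1", []),
    ("mRNA", 100, 600, "mrna1", ["gene1"]),
    ("polypeptide", 150, 550, "prot1", ["mrna1"]),
    ("exon", 100, 200, "exon1", ["mrna1", "prot1"]),
    ("exon", 300, 400, "exon2", ["mrna1", "prot1"]),
]


def gff3_line(seqid, ftype, start, end, fid, parents,
              source="GenBank", strand="+"):
    """Format one nine-column GFF3 line (assumed layout, forward strand)."""
    attrs = [f"ID={fid}"]
    if parents:
        # GFF3 allows several parents, separated by commas.
        attrs.append("Parent=" + ",".join(parents))
    columns = [seqid, source, ftype, str(start), str(end),
               ".", strand, ".", ";".join(attrs)]
    return "\t".join(columns)


lines = [gff3_line("chr1", *f) for f in features]
print("\n".join(lines))
```

Only the exon rows would carry sequence in this model; the gene, mRNA, and polypeptide rows are location ranges.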
The protein/polypeptide feature is an important one, having all the
annotations of the GenBank CDS feature, protein ID, translation, GO
terms, Dbxrefs to other proteins.
UTRs, introns, CDS-exons are all inferred from the primary exon bases
inside/outside appropriate higher feature ranges. Other special gene
model features remain the same.
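
The inference described above can be sketched in Python (again, not the script's own Perl). This is a minimal illustration assuming forward-strand features and 1-based inclusive coordinates; the function names `split_exons` and `introns` and the sample coordinates are hypothetical and are not taken from the script.

```python
def split_exons(exons, protein_start, protein_end):
    """Split exon ranges into UTR and CDS-exon parts.

    Assumes forward strand and 1-based inclusive coordinates. This
    illustrates the idea only; it is not the script's implementation.
    """
    parts = []
    for start, end in sorted(exons):
        # Exon bases before the protein range are 5' UTR.
        if start < protein_start:
            parts.append(("five_prime_UTR", start, min(end, protein_start - 1)))
        # Exon bases overlapping the protein range are coding.
        cds_start = max(start, protein_start)
        cds_end = min(end, protein_end)
        if cds_start <= cds_end:
            parts.append(("CDS", cds_start, cds_end))
        # Exon bases after the protein range are 3' UTR.
        if end > protein_end:
            parts.append(("three_prime_UTR", max(start, protein_end + 1), end))
    return parts


def introns(exons):
    """Return the gaps between consecutive exons as intron ranges."""
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
            if b_start - a_end > 1]


# Hypothetical coordinates, for illustration only.
exons = [(100, 200), (300, 400), (500, 600)]
parts = split_exons(exons, 150, 550)
gaps = introns(exons)
```

Reverse-strand genes would need the 5' and 3' labels swapped, which this sketch does not attempt.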
Several other improvements and bugfixes, minor but useful, are included:
* IO pipes now work:
    curl ftp://ncbigenomes/... | bp_genbank2gff3 --in stdin --out stdout | gff2chado ...
* GenBank main record fields are added to the source feature, e.g. organism, date,
  and the sourcetype, commonly chromosome for genomes, is used.
* Gene model handling for ncRNA and pseudogenes is added.
* The GFF header is cleaner and more informative.
  The --GFF_VERSION flag allows choice of v2 as well as the default v3.
* GFF ##FASTA inclusion is improved, and
  CDS translation sequence is moved to FASTA records.
* FT -> GFF attribute mapping is improved.
* --format allows a choice of SeqIO input formats (GenBank default).
  Uniprot/Swissprot and EMBL work and produce useful GFF.
* SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions
  and more flexible usage.
These items from the Bioperl mailing list were tested (with sample data
that generated errors) and found to be corrected:
From: Ed Green <green <at> eva.mpg.de>
Subject: genbank2gff3.pl on new human RefSeq
Date: 2006-03-13 21:22:26 GMT
-- unspecified errors (sample data works now).
From: Eric Just <e-just <at> northwestern.edu>
Date: 2007-01-26 17:08:49 GMT
-- bug fixed in genbank2gff3 for multi-record handling
This error is for a /trans_splice gene that is hard to handle, and
From: Chad Matsalla <chad <at> dieselwurks.com>
Subject: genbank2gff3.PLS and the unflatenner - Inconsistent order?
Date: 2005-07-15 19:51:48 GMT