BioPerl
BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It has played an integral role in the Human Genome Project.
Background
BioPerl is an active open source software project supported by the Open Bioinformatics Foundation. The first BioPerl code was written by Tim Hubbard and Jong Bhak at MRC Centre Cambridge, where the first genome sequencing was carried out by Fred Sanger. The MRC Centre was one of the hubs and birthplaces of modern bioinformatics, as it held a large quantity of DNA sequences and 3D protein structures. Hubbard was using the th_lib.pl Perl library, which contained many useful Perl subroutines for bioinformatics. Bhak, Hubbard's first PhD student, created jong_lib.pl and then merged the two subroutine libraries into Bio.pl. The name BioPerl was coined jointly by Bhak and Steven Brenner at the Centre for Protein Engineering. In 1995, Brenner organized a BioPerl session at the Intelligent Systems for Molecular Biology conference, held in Cambridge. BioPerl gained some users in the following months, including Georg Fuellen, who organized a training course in Germany. Fuellen's colleagues and students greatly extended BioPerl, and it was further expanded by others, including Steve Chervitz, who was actively developing Perl code for his yeast genome database. The major expansion came when Cambridge student Ewan Birney joined the development team.
The first stable release was on 11 June 2002; the most recent stable release is 1.7.2, from 7 September 2017. Developer releases are also produced periodically. The 1.7.x version series is considered the most stable version of BioPerl and is recommended for everyday use.
To take advantage of BioPerl, the user needs a basic understanding of the Perl programming language, including how to use Perl references, modules, objects and methods.
Influence on the Human Genome Project
The Human Genome Project faced several challenges during its lifetime, and a few of them were solved when many of the genomics labs started to use Perl. One such problem was the process of analyzing all of the DNA sequences. Some labs built large monolithic systems with complex relational databases that took a long time to implement and debug, and were overtaken by new technologies. Other labs learned to build modular, loosely coupled systems whose parts could be swapped in and out when new technologies arose. The initial results across the labs were mixed. It was eventually discovered that many of the steps could be implemented as loosely coupled programs run from a Perl shell script. Another problem that was fixed was the interchange of data. Each lab usually ran different programs from its scripts, so comparing results required several conversions. To fix this, the labs collectively started using a superset of the data. For each lab, one script converted from the superset to that lab's set and another converted back. This minimized the number of scripts needed, and data exchange became simpler with Perl.
Features and examples
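BioPerl provides software modules for many of the typical tasks of bioinformatics programming. These include:
- Accessing nucleotide and peptide sequence data from local and remote databases
use Bio::DB::GenBank;
# The accession number is supplied by the caller on the command line.
my $accession = shift or die "usage: $0 accession\n";
my $db_obj = Bio::DB::GenBank->new;
# Fetch the record over the network; returns a sequence object.
my $seq_obj = $db_obj->get_Seq_by_acc($accession);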
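- Transforming formats of database/file records
use Bio::SeqIO;
my $usage = "all2y.pl informat outfile outfileformat";
my $informat = shift or die $usage;
my $outfile = shift or die $usage;
my $outformat = shift or die $usage;
# Read records from standard input and write them to the named output file.
my $seqin = Bio::SeqIO->new( -fh => \*STDIN, -format => $informat );
my $seqout = Bio::SeqIO->new( -file => ">$outfile", -format => $outformat );
# Copy each sequence across, converting the format on the way.
while ( my $inseq = $seqin->next_seq ) {
    $seqout->write_seq($inseq);
}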
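- Manipulating individual sequences
use Bio::Tools::SeqStats;
# $seqobj is assumed to be an existing sequence object, e.g. one returned by Bio::SeqIO.
my $seq_stats = Bio::Tools::SeqStats->new( -seq => $seqobj );
my $weight = $seq_stats->get_mol_wt;          # reference to the lower and upper weight bounds
my $monomer_ref = $seq_stats->count_monomers; # hash reference: residue => count
my $codon_ref = $seq_stats->count_codons;     # hash reference: codon => count (nucleotide sequences)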
- Searching for similar sequences
- Creating and manipulating sequence alignments
- Searching for genes and other structures on genomic DNA
- Developing machine readable sequence annotations
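The last item in the list above, machine readable annotation, has no accompanying snippet. The following is a minimal sketch of one way it might look, assuming the Bio::Seq and Bio::SeqFeature::Generic modules from a BioPerl installation. The sequence ID, the bases, and the feature note are made up purely for illustration.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Hypothetical toy sequence; the ID and bases are invented for this sketch.
my $seq = Bio::Seq->new(
    -id       => 'example_seq',
    -seq      => 'ATGGCCAAGTAA',
    -alphabet => 'dna',
);

# Describe bases 1-12 as a coding region on the forward strand.
my $cds = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 12,
    -strand      => 1,
    -primary_tag => 'CDS',
    -tag         => { note => 'illustrative feature' },
);

# Attach the annotation to the sequence object.
$seq->add_SeqFeature($cds);

# Walk the attached features and report each one.
for my $feat ( $seq->get_SeqFeatures ) {
    printf "%s %d..%d strand %d\n",
        $feat->primary_tag, $feat->start, $feat->end, $feat->strand;
}

# Sequence-level operations return new objects rather than modifying $seq.
my $revcom  = $seq->revcom->seq;
my $protein = $seq->translate->seq;
print "reverse complement: $revcom\n";
print "translation: $protein\n";
```

An annotated sequence object like this can then be passed to a Bio::SeqIO writer, as in the format conversion example, to store the features in a format such as GenBank.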
Usage
- SynBrowse
- GeneComber
- TFBS
- MIMOX
- BioParser
- Degenerate primer design
- Querying the public databases
- Current Comparative Table
- Dealing with phylogenetic trees and nested taxa
- FPC Web tools
Advantages