BioPerf: A Benchmark Suite to Evaluate High-Performance Computer Architecture on Bioinformatics Applications

Motivation:

The exponential growth in the amount of genomic data has spurred growing interest in large-scale analysis of genetic information. Bioinformatics applications, which apply computational methods so that researchers can sift through massive biological data sets and extract useful information, are becoming increasingly important computer workloads. This paper presents BioPerf, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of high-performance computer architectures for these emerging workloads. Currently, the BioPerf suite contains codes from 10 highly popular bioinformatics packages and covers the major fields of study in computational biology, such as sequence comparison, phylogenetic reconstruction, protein structure prediction, and sequence homology and gene finding. We demonstrate the use of BioPerf by providing simulation points for pre-compiled Alpha binaries and by presenting a performance study on IBM Power that uses IBM Mambo simulations cross-compared with Apple G5 executions.

The BioPerf suite (available from www.bioperf.org) includes benchmark source code, input datasets of various sizes, and information for compiling and using the benchmarks. Our benchmark suite includes parallel codes where available.

Benchmark Developers:

BioPerf Papers and Presentations:

How to Get the BioPerf Benchmark Suite:

  1. Download the BioPerf benchmark suite here [85MB]
  2. To keep the BioPerf download small, we provide some of the input databases (those of about 100 MB or more) used by some of the codes as separate downloads. These databases can be downloaded from the links below. Once you have downloaded the databases, set the environment variable DATABASES to the directories containing them; the BioPerf scripts that take these databases as input rely on it. Those scripts will exit with an error message if the databases are not present or the environment variable is not set properly.
    Download the required biological databases below:
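
The DATABASES requirement described in step 2 can be checked before any benchmark script is run. The following is a minimal sketch, not part of BioPerf: the function name check_databases_env and the file names passed to it are hypothetical, and the sketch assumes that DATABASES holds a single directory path, which the text above does not confirm.

```python
import os
from pathlib import Path


def check_databases_env(required=(), env=None):
    """Return a list of problems with the DATABASES setting (empty list means OK).

    Hypothetical helper, not shipped with BioPerf. Assumes DATABASES names
    one directory; `required` holds illustrative file names to look for.
    """
    env = os.environ if env is None else env
    value = env.get("DATABASES")
    if not value:
        return ["DATABASES is not set"]
    root = Path(value)
    if not root.is_dir():
        return [f"DATABASES points to a missing directory: {root}"]
    # Report each expected database file that is absent from the directory.
    return [
        f"missing database: {name}"
        for name in required
        if not (root / name).exists()
    ]
```

Calling check_databases_env(required=("some-db-file",)) returns an empty list when the variable is set, the directory exists, and the named file is present; otherwise it returns one message per problem, which mirrors the two failure cases the scripts are said to report.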

Packages included in BioPerf:

BioPerf Input Datasets

| Package Name | Executables | class-A | class-B | class-C |
| --- | --- | --- | --- | --- |
| BLAST | blastn | Sequence of Homo sapiens hereditary haemochromatosis | --- | --- |
| | blastp | E. coli sequence against Drosoph database | 5 sequences of average length 7500 | Input dataset of 20 sequences of about 7000 residues each against the Swissprot database |
| CE | ce | 1hba.pdb and 4hhb.pdb are different types of hemoglobin, which is used to transport oxygen | --- | --- |
| CLUSTALW | clustalw | 1a02J_1hjbA: 50 sequences of length almost 60 | 1290.seq: 66 sequences of length almost 1100 | 6000.seq: 320 sequences of length almost 1000 |
| FASTA | fasta34_t | qrhuld.aa is a query file that contains the human LDL receptor precursor | --- | --- |
| | ssearch34_t | 1abseq.pep of 40 residues aligned against 5 sequences of length almost 60 | Bacteria genomes of length almost 360,000 base pairs | Bacteria genome of 700,000 aligned with 940,000 base pairs |
| GLIMMER | glimmer | Haemophilus_influenzae genome of about 4.7 million with glimmer.icm, which is a collection of Markov models | --- | --- |
| | glimmer-package | NC_004061.fna of about 650,000 residues | Bacteria genomes of length almost 2.8 million base pairs | Bacteria genome of about 9 million base pairs |
| GRAPPA | grappa | 10 bluebell flower species | 12 bluebell flower species | 13 bluebell flower species |
| HMMER | hmmsearch | Brine shrimp globin compared against 50 HMMs | --- | --- |
| | hmmpfam | Sequence of 450 base pairs against database of 500 HMMs | Sequence of 9000 residues compared against 2000 HMMs | Sequence of 9000 residues compared against Pfam database (600M - 7500 HMMs) |
| PHYLIP | dnapenny | 16 aligned sequences of 2581 characters | --- | --- |
| | promlk | 6 aligned sequences of 39 characters | 14 aligned sequences of 230 characters each | 92 aligned sequences of almost 220 characters |
| PREDATOR | predator | Single aligned sequence of 9100 residues | 5 sequences each of length 7500 | 19 sequences each of length about 7000 residues |
| TCOFFEE | tcoffee | 5 sequences each of length 250 | 50 sequences each of length almost 280 | 50 sequences of average length almost 850 |

Installation of BioPerf

BioPerf directory structure

The installation directory of BioPerf contains the following directories:

How to Use BioPerf

After a successful installation, run the script use-bioperf.sh, located in the main directory of the BioPerf suite, to run all the supported tasks in BioPerf.

The following choices are available: