Quantitative Assessment for Relationship between Sequence Similarity and Function Similarity


Prediction of function using comparative sequence analysis is widely used in genome annotation. However, if not performed appropriately, it may lead to the creation and propagation of assignment errors. In this study, we quantified the relationship between sequence similarity and function similarity for each of the three aspects of Gene Ontology annotation (biological process, molecular function, and subcellular localization, i.e. cellular component). Our study provides a benchmark for estimating the confidence of function assignments based purely on sequence similarity. We present an analysis of this relationship for the well-characterized proteins from four different genomes. Using a simple measure of functional similarity based on the Gene Ontology classification, we show that functional similarity correlates well with sequence similarity, measured either by sequence identity or by the statistical significance of the alignment score. We also highlight how this relationship differs between annotations based on experimental evidence and annotations based on sequence similarity. The data and results are available for download at this site.

DATA

1. Ontologies

GO Biological Process Ontology

GO Molecular Function Ontology

GO Cellular Component Ontology


2. Annotations

Annotations based on experimental evidence are shown with the species name in green (file names containing "Expt"); annotations based on sequence similarity evidence are shown with the species name in purple (file names containing "SqSim").

GO Annotations

Species Name                GO Biological Process               GO Molecular Function               GO Cellular Component
Arabidopsis thaliana        Arab_Evid_Expt_GO_Bio_1_04.dat      Arab_Evid_Expt_GO_Mol_1_04.dat      Arab_Evid_Expt_GO_Cel_1_04.dat
Saccharomyces cerevisiae    Yeast_Evid_Expt_GO_Bio_1_04.dat     Yeast_Evid_Expt_GO_Mol_1_04.dat     Yeast_Evid_Expt_GO_Cel_1_04.dat
Caenorhabditis elegans      Worm_Evid_Expt_GO_Bio_1_04.dat      Worm_Evid_Expt_GO_Mol_1_04.dat      Worm_Evid_Expt_GO_Cel_1_04.dat
Drosophila melanogaster     Droso_Evid_Expt_GO_Bio_1_04.dat     Droso_Evid_Expt_GO_Mol_1_04.dat     Droso_Evid_Expt_GO_Cel_1_04.dat
Arabidopsis thaliana        Arab_Evid_SqSim_GO_Bio_1_04.dat     Arab_Evid_SqSim_GO_Mol_1_04.dat     Arab_Evid_SqSim_GO_Cel_1_04.dat
Saccharomyces cerevisiae    Yeast_Evid_SqSim_GO_Bio_1_04.dat    Yeast_Evid_SqSim_GO_Mol_1_04.dat    Yeast_Evid_SqSim_GO_Cel_1_04.dat
Caenorhabditis elegans      Worm_Evid_SqSim_GO_Bio_1_04.dat     Worm_Evid_SqSim_GO_Mol_1_04.dat     Worm_Evid_SqSim_GO_Cel_1_04.dat
Drosophila melanogaster     Droso_Evid_SqSim_GO_Bio_1_04.dat    Droso_Evid_SqSim_GO_Mol_1_04.dat    Droso_Evid_SqSim_GO_Cel_1_04.dat


3. Protein Sequences

Arabidopsis thaliana Protein Sequences (FASTA)

Saccharomyces cerevisiae Protein Sequences (FASTA)

Caenorhabditis elegans Protein Sequences (FASTA)

Drosophila melanogaster Protein Sequences (FASTA)


RESULTS

1. Intragenome Homologous Pairs

FASTA (E-value): ORF1     ORF2     E-value

Arabidopsis thaliana Pairs (FASTA)

Saccharomyces cerevisiae Pairs (FASTA)

Caenorhabditis elegans Pairs (FASTA)

Drosophila melanogaster Pairs (FASTA)
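
The three-column layout given above for the FASTA pair files (ORF1, ORF2, E-value) can be read with a few lines of Python. This is a minimal sketch, assuming the columns are whitespace-separated and the E-value is a plain or scientific-notation number; the names `FastaPair` and `parse_fasta_pairs`, the 1e-5 cutoff, and the sample identifiers are hypothetical and are not taken from the downloadable files.

```python
from typing import Iterable, Iterator, NamedTuple


class FastaPair(NamedTuple):
    """One homologous pair: two ORF identifiers and the FASTA E-value."""
    orf1: str
    orf2: str
    evalue: float


def parse_fasta_pairs(lines: Iterable[str]) -> Iterator[FastaPair]:
    """Yield pairs from lines laid out as: ORF1  ORF2  E-value.

    Assumption: fields are separated by whitespace. Blank lines,
    lines with a different field count, and header rows whose third
    field is not numeric are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        try:
            evalue = float(fields[2])
        except ValueError:
            continue  # e.g. the header row "ORF1 ORF2 E-value"
        yield FastaPair(fields[0], fields[1], evalue)


# Made-up sample lines (hypothetical identifiers, not real data).
sample = [
    "ORF1     ORF2     E-value",
    "orfA     orfB     1e-20",
    "orfC     orfD     0.5",
]
pairs = list(parse_fasta_pairs(sample))

# Keep pairs at or below an illustrative E-value cutoff.
significant = [p for p in pairs if p.evalue <= 1e-5]
```

To read one of the downloaded files, pass an open file handle to `parse_fasta_pairs` instead of the sample list.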

SSEARCH (% sequence identity): ORF1     ORF2     OPT Score     Z-score     Bits Score     E-value     %Identity

Arabidopsis thaliana Pairs (SSEARCH)

Saccharomyces cerevisiae Pairs (SSEARCH)

Caenorhabditis elegans Pairs (SSEARCH)

Drosophila melanogaster Pairs (SSEARCH)
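
The seven-column SSEARCH layout listed above lends itself to grouping pairs by sequence identity, which is the axis along which functional similarity is compared in the study. The sketch below is illustrative only: it assumes whitespace-separated fields, a %Identity column expressed on a 0 to 100 scale (with or without a trailing "%"), and an arbitrary bin width of 10. The names `SsearchPair`, `parse_ssearch_pairs`, and `identity_bin` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class SsearchPair(NamedTuple):
    """One SSEARCH pair, in the column order listed on this page."""
    orf1: str
    orf2: str
    opt_score: float
    z_score: float
    bits: float
    evalue: float
    identity: float  # assumed to be a percentage in [0, 100]


def parse_ssearch_pairs(lines: Iterable[str]) -> Iterator[SsearchPair]:
    """Yield pairs from whitespace-separated seven-field lines.

    Lines with a different field count or with non-numeric score
    fields (such as a header row) are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            numbers = [float(f.rstrip("%")) for f in fields[2:]]
        except ValueError:
            continue
        yield SsearchPair(fields[0], fields[1], *numbers)


def identity_bin(identity: float, width: int = 10) -> int:
    """Return the lower edge of the identity bin; 100% joins the top bin."""
    return min(int(identity // width) * width, 100 - width)


# Made-up sample rows (hypothetical identifiers and scores).
sample = [
    "orfA  orfB  812   950.3   180.2  1e-45  45.2",
    "orfC  orfD  300   310.0   70.1   2e-10  38.0",
    "orfE  orfF  1500  1700.5  330.0  0      100.0",
]
records = list(parse_ssearch_pairs(sample))
bin_counts = Counter(identity_bin(r.identity) for r in records)
```

Here `bin_counts` maps each bin's lower edge to the number of pairs in it; a per-bin average of any functional similarity score could be accumulated in the same loop.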


2. Intergenome Homologous Pairs

SSEARCH (% sequence identity): ORF1     ORF2     OPT Score     Z-score     Bits Score     E-value     %Identity

Saccharomyces cerevisiae - Arabidopsis thaliana Pairs (SSEARCH)

Saccharomyces cerevisiae - Caenorhabditis elegans Pairs (SSEARCH)

Saccharomyces cerevisiae - Drosophila melanogaster Pairs (SSEARCH)

Arabidopsis thaliana - Caenorhabditis elegans Pairs (SSEARCH)

Arabidopsis thaliana - Drosophila melanogaster Pairs (SSEARCH)

Drosophila melanogaster - Caenorhabditis elegans Pairs (SSEARCH)


3. Figures

Figure1.bmp

Figure2.bmp

Figure3.bmp

Figure4.bmp

Figure5.bmp

Figure6.bmp

Figure7.bmp

Figure8.bmp

Figure9.bmp