Addressing the problems with life-science databases for traditional uses and systems biology

ABSTRACT

A prerequisite to systems biology is the integration of heterogeneous experimental data, which are stored in numerous life-science databases. However, a wide range of obstacles that relate to access, handling and integration impede the efficient use of the contents of these databases. Addressing these issues will not only be essential for progress in systems biology, it will also be crucial for sustaining the more traditional uses of life-science databases.


ACKNOWLEDGEMENTS

The authors would like to thank C. Rawlings and P. Verrier for commenting on an earlier version of this article. Furthermore, we would like to thank the following individuals for exploring with us the pitfalls of life-science databases over the past years: J. Baumbach, J. Butz, E. Kirchem, F. Klingert, S. Knop, B. Kormeier, I. Kupp, A. Neu, A. Rüegg, A. Skusa, B. Steuernagel, J. Taubert, P. Verrier and R. Winnenburg. S.P. gratefully acknowledges funding by the European Science Foundation. Rothamsted Research receives grant-aided support from the UK Biotechnology and Biological Sciences Research Council.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Stephan Philippi is at the Department of Computer Science, University of Koblenz, PO Box 201602, Koblenz, 56016, Germany
* Jacob Köhler is at the Biomathematics and Bioinformatics Division, Rothamsted Research, Harpenden, AL5 2JQ, Hertfordshire, UK

CORRESPONDING AUTHOR

Correspondence to Stephan Philippi.

ETHICS DECLARATIONS

COMPETING INTERESTS

The authors declare no competing financial interests.

RELATED LINKS

FURTHER INFORMATION

* BioJava
* BioPAX: Biological Pathways Exchange
* BioPerl
* BioRuby
* CellML
* DiscoveryLink
* DNA Data Bank of Japan
* EC (enzyme class) numbers of the enzyme nomenclature
* Ensembl Trace Server
* European Bioinformatics Institute SRS server
* European Bioinformatics Institute
* Extensible Markup Language (XML)
* Gene Ontology homepage
* Kyoto Encyclopedia of Genes and Genomes
* Microarray Gene Expression Data Society
* mySQL
* NCBI taxonomy
* Nucleic Acids Research Database Categories List
* ONDEX
* Open Biomedical Ontologies
* Open Source Initiative License Index
* PostgreSQL
* Proteomics Standards Initiative: molecular interaction
* Systems biology markup language
* Universal Protein Resource
* Web Services Activity

GLOSSARY

* Controlled vocabulary: A standardized set of terms that can be used in a given application domain. A prominent example is the enzyme class nomenclature, which describes classes of biochemical reaction.
* Database management system: A system that provides a means of storing, modifying and extracting data from a database.
* Evidence code: A controlled vocabulary that is used to track the types of evidence that support a gene annotation.
* Flat file: Human readable, non-standardized files that can be used to exchange the contents of life-science databases.
* Ontology: A commonly agreed definition of real-world concepts, such as 'protein' and 'enzyme', and their particular relationships, for example, an enzyme 'is a' protein.
* Parser: Software that reads a given input, such as a flat file, for further processing.
* Web service: A standardized way to allow for interoperable machine-to-machine interaction over a network.
* XML: The extensible markup language (XML) is a standard for the creation of application-specific, self-descriptive markup languages, which, for example, can be used for the definition of data-exchange formats.
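
To make the glossary terms 'flat file' and 'parser' concrete, the following is a minimal sketch in Python and is not taken from the article. It assumes a simplified, hypothetical flat-file layout in which each line starts with a two-letter code and a line containing `//` ends a record; the sample entries, the field codes and the function name `parse_flat_file` are illustrative assumptions only, and real database formats differ in many details.

```python
from typing import Dict, Iterator, List

# Hypothetical records in a simplified flat-file layout (an assumption for
# illustration): a two-letter line code, some spaces, then the value.
# A line starting with "//" terminates the current record.
SAMPLE = """\
ID   ENZ1
DE   Example enzyme
EC   1.1.1.1
//
ID   ENZ2
DE   Another example entry
DE   with a two-line description
//
"""


def parse_flat_file(text: str) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping each line code to its values."""
    record: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            if record:
                yield record
            record = {}
            continue
        # Split the line code from the rest of the line at the first space.
        code, _, value = line.partition(" ")
        record.setdefault(code, []).append(value.strip())
    if record:  # tolerate a missing final terminator
        yield record


records = list(parse_flat_file(SAMPLE))
```

Because such layouts are not standardized, a parser of this kind usually has to be written, and maintained, separately for each database whose flat files are to be read.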


ABOUT THIS ARTICLE

CITE THIS ARTICLE

Philippi, S., Köhler, J. Addressing the problems with life-science databases for traditional uses and systems biology. _Nat Rev Genet_ 7, 482–488 (2006). https://doi.org/10.1038/nrg1872

* Published: 09 May 2006
* Issue Date: 01 June 2006
* DOI: https://doi.org/10.1038/nrg1872