EMBL Nucleotide Sequence Database in 2006

Tamara Kulikova^*, Ruth Akhtar, Philippe Aldebert, Nicola Althorpe, Mikael Andersson, Alastair Baldwin, Kirsty Bates, Sumit Bhattacharyya, Lawrence Bower, Paul Browne, Matias Castro, Guy Cochrane, Karyn Duggan, Ruth Eberhardt, Nadeem Faruque, Gemma Hoad, Carola Kanz, Charles Lee, Rasko Leinonen, Quan Lin, Vincent Lombard, Rodrigo Lopez, Dariusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Francesco Nardone, Maria Pilar Garcia Pastor, Sheila Plaister, Siamak Sobhany, Peter Stoehr, Robert Vaughan, Dan Wu, Weimin Zhu and Rolf Apweiler

EMBL Outstation—European Bioinformatics Institute, Wellcome Trust Genome Campus Hinxton, Cambridge, CB10 1SD, UK

^*To whom correspondence should be addressed. Tel: +44 01223 494463; Fax: +44 1223 494468; Email: kulikova{at}ebi.ac.uk

Received September 15, 2006. Revised October 16, 2006. Accepted October 16, 2006.

ABSTRACT


The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl) at the EMBL European Bioinformatics Institute, UK, offers a large and freely accessible collection of nucleotide sequences and accompanying annotation. The database is maintained in collaboration with DDBJ and GenBank. Data are exchanged between the collaborating databases on a daily basis to achieve optimal synchrony. Webin is the preferred tool for individual submissions of nucleotide sequences, including Third Party Annotation, alignments and bulk data. Automated procedures are provided for submissions from large-scale sequencing projects and data from the European Patent Office. In 2006, the volume of data has continued to grow exponentially. Access to the data is provided via SRS, ftp and a variety of other methods. Extensive external and internal cross-references enable users to search for related information across other databases and within the database. All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk/. Changes over the past year include changes to the file format, further development of the EMBLCDS dataset and developments to the XML format.

INTRODUCTION


The EMBL Nucleotide Sequence Database is the European node of the International Nucleotide Sequence Database Collaboration (INSDC, http://www.insdc.org/) between DDBJ (1), EMBL and GenBank (2). The collaborative aim is to collect and present nucleotide sequence and annotation as comprehensively as possible.

The EMBL Nucleotide Sequence Database (EMBL) is maintained at the European Bioinformatics Institute, which hosts several other core biological databases (3).

The main goal of the EMBL Nucleotide Sequence Database is to accept, process and make freely available sequence data from individual researchers, research groups and the European Patent Office (EPO). Collected nucleotide sequences and accompanying annotation are made available via the EBI Sequence Retrieval System (SRS), ftp, web services and similarity search tools.

EMBL database releases, with accompanying release notes, are produced quarterly.

The database is presented as individual entries, each carrying sequence or information on sequence construction, submission information (submission and update dates, version numbers and submitter details), literature citations and annotation in the form of a feature table. Full details of the database flatfile format are available in the user manual. Details of the feature table format are available in the INSDC Feature Table Definition. Data are also presented in XML formats via the web tools, dbfetch and ftp.

Each entry in the database belongs to one of several entry types, which differ in either data format or handling of data by the database. Entry types include standard (STD), constructed (CON), third party annotation (TPA), whole genome shotgun (WGS), annotated constructed (ANN) and mass genome annotation library (MGA). New entry types are created as new types of data arrive at the database.

Over the past year, the size of the EMBL Nucleotide Sequence Database has increased from 58.7 million entries in Release 84, September 2005 to 80.5 million entries in Release 88, September 2006, of which 18 million entries are WGS data. The WGS entries now account for >50% of the nucleotide content of the database: 80.3 Gbp out of 146.5 Gbp in September 2006. There are now over 260 000 organisms represented in the database.
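The proportions implied by these figures can be worked out directly. The short sketch below uses only the numbers quoted above; the variable names are illustrative and not part of any EMBL tool.

```python
# Figures quoted in the text for Release 84 (September 2005)
# and Release 88 (September 2006).
entries_2005_m = 58.7   # million entries
entries_2006_m = 80.5   # million entries
wgs_entries_m = 18.0    # million WGS entries in Release 88
total_gbp = 146.5       # total nucleotides, Gbp
wgs_gbp = 80.3          # WGS nucleotides, Gbp


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


entry_growth = percent(entries_2006_m - entries_2005_m, entries_2005_m)
wgs_entry_share = percent(wgs_entries_m, entries_2006_m)
wgs_base_share = percent(wgs_gbp, total_gbp)

print(f"Growth in entries over the year: {entry_growth:.1f}%")
print(f"WGS share of entries:            {wgs_entry_share:.1f}%")
print(f"WGS share of nucleotides:        {wgs_base_share:.1f}%")
```

On these figures, WGS data make up roughly a fifth of the entries but more than half of the nucleotides, which reflects how much longer the average WGS record is.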

During the last year, an important EMBL flatfile format change was completed and there were further developments to XML formats, XML distribution and tools and the TPA dataset.

A detailed and up-to-date description of EMBL Nucleotide Sequence Database activities can be found at http://www.ebi.ac.uk/embl/; a list of relevant URLs is presented in Table 1.


Table 1 Relevant URLs and emails for EMBL nucleotide sequence database

DATA COLLECTION


Sequence submission
EMBL database submission procedures are briefly described below. Full details of procedures are available at http://www.ebi.ac.uk/embl/Submission/

Webin
Webin is the preferred submission system for nucleotide sequence and biological annotation. Webin has been designed to allow rapid submission of single, multiple or very large numbers of sequences (bulk data) and is available at http://www.ebi.ac.uk/embl/Submission/webin.html. Bulk data submission in the fasta format is possible via Webin, where the fasta format is sufficient to describe all differences between submitted entries in terms of sequence and annotation fields.

TPA submissions are accepted via Webin; a modification of Webin is also available that is able to accept alignment submissions for inclusion into the EMBL-Align dataset (4). This service is available at http://www.ebi.ac.uk/embl/Submission/align_top.html.

Genome project submissions
Database entries produced at sequencing sites can be deposited and updated directly by the submitters using FTP or email. Groups producing and updating large volumes of genome sequence data, including WGS, over an extended period of time are advised to contact the database at datasubs{at}ebi.ac.uk.

EPO data processing
Sequence data extracted from biotechnology patent application submissions to the EPO are received, processed and made available weekly in the EMBL Nucleotide Sequence Database. A stable link between the patent document number, the sequence number within the document and the accession number is maintained. The EMBL Nucleotide Sequence Database processes both nucleotide and protein sequences from the EPO, but the distribution methods, collaborative data exchange mechanisms and exchange frequency for protein sequences differ from those of nucleotide sequences.

Data acquisition via data exchange
All new and updated database records are exchanged on a daily basis between EMBL, DDBJ and GenBank. WGS datasets are exchanged when they become available or have been updated and the rest of the data are exchanged daily. In addition to data exchange, lists of accession numbers are exchanged weekly to achieve maximum synchrony in data availability at all three sites.
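The paper does not describe how the exchanged accession lists are compared. As a purely illustrative sketch, one straightforward way to use such lists is a set difference that reports accessions a partner holds but the local site does not. The function name and the accession numbers below are hypothetical.

```python
def missing_accessions(local, remote):
    """Return accessions present in the remote list but absent locally, sorted."""
    return sorted(set(remote) - set(local))


# Hypothetical accession lists, for illustration only.
embl = ["AB000001", "AB000002", "BN000249"]
genbank = ["AB000001", "AB000002", "AB000003", "BN000249"]

print(missing_accessions(embl, genbank))
```

Running the comparison in both directions would show which records each site still needs to receive.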

Data access
The main access method for EMBL Nucleotide Sequence Database data is SRS (5,6); the FTP server, homology search tools, the Genomes web server (for completely sequenced genomes) and sequence retrieval by accession number (Dbfetch, Wsdbfetch and netserv) are also available (7). Access to all versions, current and historical, of EMBL Nucleotide Sequence Database entries, including CON, TPA and WGS data, is available via the Sequence Version Archive, SVA (8).

In addition to these facilities that offer a range of ways to search and download data, there are several sites that mirror EMBL Nucleotide Sequence Database data, which provide distributed ftp access.

NEW DEVELOPMENTS


Important changes to the flatfile format
Since release 87 (JUN-2006) the format of the EMBL flat file has undergone a change: the ID line now has a different structure (see below) and the SV line has been removed.

The changes to the ID line structure were as follows:

All tokens are separated by a semicolon, the entry name is not displayed (in its place there will be the primary accession number), the sequence version is indicated in the ID line, the topology is a distinct token and is indicated for both circular and linear molecules and both the data class and the taxonomic divisions are displayed.

Below is an example of the new ID line:

The tokens represent:

[1] Primary accession number
[2] ‘SV’ + sequence version number
[3] Topology: ‘circular’ or ‘linear’
[4] Molecule type
[5] Data class: ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS or STD (‘normal’ entries have ‘STD’ for ‘standard’)
[6] Taxonomic division: HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV, SYN, UNC, VRL or PHG
[7] Sequence length + ‘BP.’

An explanation of data class and taxonomic division, represented in the ID line by three-letter abbreviations, is available in the release notes.
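The seven-token layout listed above lends itself to simple parsing. The following is a minimal sketch, not an official EMBL parser: it assumes the line starts with the two-letter code ‘ID’, that tokens are separated by semicolons with optional surrounding whitespace, and it uses a made-up example line (accession, molecule type and length are hypothetical).

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    accession: str
    version: int
    topology: str
    molecule_type: str
    data_class: str
    division: str
    length: int


def parse_id_line(line: str) -> IdLine:
    """Split a new-style ID line into its seven semicolon-separated tokens."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    tokens = [token.strip() for token in line[2:].split(";")]
    if len(tokens) != 7:
        raise ValueError(f"expected 7 tokens, got {len(tokens)}")
    accession, sv, topology, molecule, data_class, division, length = tokens
    if not sv.startswith("SV "):
        raise ValueError("second token must be 'SV' + version number")
    if topology not in ("circular", "linear"):
        raise ValueError("topology must be 'circular' or 'linear'")
    if not length.endswith(" BP."):
        raise ValueError("last token must be length + 'BP.'")
    return IdLine(accession, int(sv[3:]), topology, molecule,
                  data_class, division, int(length[:-len(" BP.")]))


# Hypothetical ID line built from the token definitions above.
EXAMPLE = "ID   AB000001; SV 1; linear; mRNA; STD; PLN; 1859 BP."
record = parse_id_line(EXAMPLE)
print(record)
```

Because the primary accession number and the sequence version now sit side by side in the ID line, a parser of this kind no longer needs to read a separate SV line.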

The entry name is no longer displayed in the ID line. Since EMBL release 3 (December 1983), the stable identifier for an entry has been the primary accession number.

A mapping file (deprecated entry name to accession number) was provided via the ftp server for those entries where the entry name did not coincide with the accession number at the point of change.

Two other changes are linked to the ID line change, both related to the way the data are represented on the ftp server: release data and the cumulative file (the file containing all data created or updated since the last release) are now split into smaller files according to data class and taxonomic division. Full details on the way in which data are split on the ftp server are available in the ftp directories and in the release notes.

XML development
In the past year, INSDC-specific XML was developed further; in spring 2006, the decision was taken to stabilize the production version of the DTD in order to facilitate external developments based on it. The current production version of the XML is INSDSeq v1.4 and can be obtained from http://www.insdc.org/documents.html.

Development of the EMBL-specific EMBLXML has continued and has been extended to the EMBLCDS dataset. CDS are now distributed via the ftp server in the XML format in addition to the flatfile distribution. To further support the external use of the INSDC and EMBL XML formats, a web-based tool for instantaneous conversion between each XML format and the flatfile format has been created.

EMBLCDS development
The EMBLCDS dataset was created in response to user requests for whole database dumps of coding sequence. EMBLCDS is now offered as a dataset updated daily, available by anonymous FTP, via SRS and via sequence similarity searches. There are currently 5.4 million EMBLCDS entries and 4.8 million items in the non-redundant EMBLCDSnr. To produce the non-redundant dataset, sequence checksums are used to collapse sequences with the same checksum into a single record.
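The collapsing step can be sketched as a grouping of records by sequence checksum. The paper does not name the checksum algorithm, so the example below uses CRC-32 from the Python standard library purely as a stand-in; the upper-casing of sequences, the function name and the record identifiers are likewise assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def collapse_by_checksum(records):
    """Group (identifier, sequence) pairs by the checksum of the sequence.

    Returns a dict mapping each checksum to the identifiers that share it.
    CRC-32 is a stand-in here; the checksum used for EMBLCDSnr is not
    specified in the text.
    """
    groups = defaultdict(list)
    for identifier, sequence in records:
        checksum = zlib.crc32(sequence.upper().encode("ascii"))
        groups[checksum].append(identifier)
    return dict(groups)


# Hypothetical coding sequences; the first two are identical.
cds_records = [
    ("CDS_A", "ATGGCCAAGTAA"),
    ("CDS_B", "ATGGCCAAGTAA"),
    ("CDS_C", "ATGTTTCCCTGA"),
]

non_redundant = collapse_by_checksum(cds_records)
print(len(cds_records), "entries collapse to", len(non_redundant), "records")
```

Since distinct sequences can in principle share a checksum, a cautious implementation would also compare the sequences themselves before merging records.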

Over the past year, several ways of grouping entries within the EMBLCDS dataset, apart from the grouping by checksum, were introduced: groups by gene name, by species and by shared exons. Grouping indices are available from the ftp server and are used in SRS views to link related records together.

As mentioned earlier in the ‘XML development’ section, EMBLXML has been extended to cover data from the EMBLCDS dataset.

Access to the data by map
In 2005, the International Nucleotide Sequence Database Collaboration introduced the lat_lon (latitude-longitude) qualifier. The qualifier allows submitters to specify precisely where the sequenced specimen was collected. The data collected so far can now be seen plotted on the world map at http://www3.ebi.ac.uk/Services/EMBLWorld/EMBLWorld.pl (Figure 1).
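To plot a lat_lon value on a map, it first has to be converted into signed decimal degrees. The text above does not give the syntax of the qualifier value, so the sketch below assumes a form such as ‘47.94 N 28.12 W’ (decimal degrees followed by a hemisphere letter); the function name and the coordinates are hypothetical.

```python
import re

# Assumed value syntax: "<latitude> N|S <longitude> E|W" in decimal degrees.
_LAT_LON = re.compile(r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([EW])$")


def parse_lat_lon(value):
    """Convert a lat_lon value into (latitude, longitude) in signed degrees."""
    match = _LAT_LON.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised lat_lon value: {value!r}")
    latitude = float(match.group(1))
    longitude = float(match.group(3))
    if latitude > 90 or longitude > 180:
        raise ValueError(f"coordinates out of range: {value!r}")
    if match.group(2) == "S":
        latitude = -latitude    # southern hemisphere is negative
    if match.group(4) == "W":
        longitude = -longitude  # western hemisphere is negative
    return latitude, longitude


print(parse_lat_lon("47.94 N 28.12 W"))
```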


Figure 1 The map offers three levels of zoom to allow viewing at greater magnification. Using the same geographical information, SRS views of EMBL entries link data to Google Maps.

Cross-references
The EMBL Nucleotide Sequence Database continued to extend the number and diversity of its cross-references to other databases. The number of cross-referenced databases was 27 in the September 2006 release and the number of individual cross-references was over 62 million.

Cross-referenced databases include UniProt (9), InterPro (10), GOA (11) and a few other major databases, along with more specific databases. The cross-referenced database GeneDB (http://www.genedb.org/), for example, holds the latest sequence data and annotation for organisms sequenced by the PSU (Pathogen Sequencing Unit) at The Wellcome Trust Sanger Institute.

‘Intradatabase’ cross-references were introduced in December 2005 and are internal to the EMBL database. They include EMBL-TPA, EMBL-ANN, EMBL-CON, EMBL-ALIGN and EMBL-JOIN and show some of the relationships between the entries in the database that are otherwise difficult for users to infer; for example, the EMBL-TPA cross-reference:

DR   EMBL-TPA; BN000249.

will appear in a standard entry that serves as a primary source for the TPA entry BN000249. An explanation of each type of intradatabase cross-reference is given in the EMBL database release notes.
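A cross-reference line of this kind can be read with a few lines of code. The sketch below is illustrative only: it assumes that a DR line consists of the line code ‘DR’, a database name and a primary identifier separated by a semicolon, and a closing full stop. The function names are hypothetical; the set of intradatabase names is taken from the list above.

```python
INTRADATABASE = {"EMBL-TPA", "EMBL-ANN", "EMBL-CON", "EMBL-ALIGN", "EMBL-JOIN"}


def parse_dr_line(line):
    """Return (database, primary identifier) from a DR cross-reference line."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError("DR line needs a database name and an identifier")
    return fields[0], fields[1]


def is_intradatabase(line):
    """True if the DR line points to another entry within the EMBL database."""
    database, _ = parse_dr_line(line)
    return database in INTRADATABASE


print(parse_dr_line("DR   EMBL-TPA; BN000249."))
```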

Further development of the TPA dataset
TPA records are submitted to the International Nucleotide Sequence Databases as part of the process of publishing biological studies that include the annotation of existing nucleotide sequences in the primary sequence database. Over the past year, the TPA dataset was divided into two tiers, TPA:experimental and TPA:inferential, to distinguish between annotation supported by wet laboratory experimental evidence and inferred annotation, where the source molecule or its products have not been the subject of direct experimentation (12).

Enhanced evidence system
In order to enable users to see evidence for a particular annotation and make an informed judgment about its validity, the evidence tagging system was improved over the year. In place of the old qualifier ‘evidence’, two new qualifiers, ‘experiment’ and ‘inference’, were introduced in the course of the year. The ‘experiment’ value is free text naming the experimental techniques used; ‘inference’ is a highly structured qualifier that details how the annotation was inferred. The structure of the qualifier is

TYPE[ (same species)][:EVIDENCE_BASIS]

where TYPE is one of the following:

‘non-experimental evidence, no additional details recorded’
‘similar to sequence’
‘similar to AA sequence’
‘similar to DNA sequence’
‘similar to RNA sequence’
‘similar to RNA sequence, mRNA’
‘similar to RNA sequence, EST’
‘similar to RNA sequence, other RNA’
‘profile’
‘nucleotide motif’
‘protein motif’
‘ab initio prediction’

The optional text ‘(same species)’ can be included when the inference comes from the same species as the entry.

The optional ‘EVIDENCE_BASIS’ is either a reference to a database entry (including accession and version) or an algorithm (including version), e.g. ‘INSD:AACN010222672.1’, ‘InterPro:IPR001900’, ‘ProDom:PD000600’, ‘Genscan:2.0’, etc.
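Because the ‘inference’ value is highly structured, it can be split mechanically into its three parts. The following is a minimal sketch of such a parser based only on the structure and TYPE list given above; the function name is hypothetical, and the pairings of TYPE and EVIDENCE_BASIS in the usage lines are invented for illustration. Longer TYPE strings are tried first so that ‘similar to RNA sequence, mRNA’ is not mistaken for ‘similar to RNA sequence’.

```python
INFERENCE_TYPES = [
    "non-experimental evidence, no additional details recorded",
    "similar to sequence",
    "similar to AA sequence",
    "similar to DNA sequence",
    "similar to RNA sequence",
    "similar to RNA sequence, mRNA",
    "similar to RNA sequence, EST",
    "similar to RNA sequence, other RNA",
    "profile",
    "nucleotide motif",
    "protein motif",
    "ab initio prediction",
]

SAME_SPECIES = " (same species)"


def parse_inference(value):
    """Split TYPE[ (same species)][:EVIDENCE_BASIS] into its components.

    Returns (type, same_species flag, evidence basis or None).
    """
    # Try the longest TYPE first so that prefixes do not match too early.
    for inference_type in sorted(INFERENCE_TYPES, key=len, reverse=True):
        if value.startswith(inference_type):
            rest = value[len(inference_type):]
            break
    else:
        raise ValueError(f"unknown inference TYPE in {value!r}")

    same_species = rest.startswith(SAME_SPECIES)
    if same_species:
        rest = rest[len(SAME_SPECIES):]

    if rest == "":
        basis = None
    elif rest.startswith(":") and len(rest) > 1:
        basis = rest[1:]  # the basis may itself contain colons
    else:
        raise ValueError(f"malformed inference value: {value!r}")
    return inference_type, same_species, basis


print(parse_inference("ab initio prediction:Genscan:2.0"))
print(parse_inference(
    "similar to RNA sequence, mRNA (same species):INSD:AACN010222672.1"))
```

Only the first colon after the TYPE (and the optional species note) is treated as a separator, since an EVIDENCE_BASIS such as ‘InterPro:IPR001900’ contains a colon of its own.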

A complete list of all features and qualifiers is available at http://www.ebi.ac.uk/embl/WebFeat/index.html.

The new evidence tagging system described above has been available since December 2005 and has at the time of writing been applied in 1662 entries, with over 145 000 instances of the new qualifiers containing meaningful values (i.e. containing values different from ‘[non-] experimental evidence, no additional details recorded’).

ACKNOWLEDGEMENTS

Funding to pay the Open Access publication charges for this article was provided by EMBL.

Conflict of interest statement. None declared.

REFERENCES


1. Okubo, K., Sugawara, H., Gojobori, T., Tateno, Y. (2006) DDBJ in preparation for overview of research activities behind data submissions. Nucleic Acids Res., 34, D6–D9.
2. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L. (2006) GenBank. Nucleic Acids Res., 34, D16–D20.
3. Brooksbank, C., Camon, E., Harris, M.A., Magrane, M., Martin, M.J., Mulder, N., O'Donovan, C., Parkinson, H., Tuli, M.A., Apweiler, R., et al. (2003) The European Bioinformatics Institute's data resources. Nucleic Acids Res., 31, 43–50.
4. Lombard, V., Camon, E.B., Parkinson, H.E., Hingamp, P., Stoesser, G., Redaschi, N. (2002) EMBL-Align: a new public nucleotide and amino acid multiple sequence alignment database. Bioinformatics, 18, 763–764.
5. Zdobnov, E.M., Lopez, R., Apweiler, R., Etzold, T. (2002) The EBI SRS server-new features. Bioinformatics, 18, 1149–1150.
6. Etzold, T., Ulyanov, A., Argos, P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114–128.
7. Harte, N., Silventoinen, V., Quevillon, E., Robinson, S., Kallio, K., Fustero, X., Patel, P., Jokinen, P., Lopez, R. (2004) Public web-based services from the European Bioinformatics Institute. Nucleic Acids Res., 32, W3–W9.
8. Leinonen, R., Nardone, F., Oyewole, O., Redaschi, N., Stoehr, P. (2003) The EMBL SVA. Bioinformatics, 19, 1861–1862.
9. Wu, C.H., Apweiler, R., Bairoch, A., Natale, D.A., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res., 34, D187–D191.
10. Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res., 33, D201–D205.
11. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., Apweiler, R. (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res., 32, D262–D266.
12. Cochrane, G., Bates, K., Apweiler, R., Tateno, Y., Mashima, J., Kosuge, T., Mizrachi, I.K., Schafer, S., Fetchko, M. (2006) Evidence standards in experimental and inferential INSDC third party annotation data. OMICS, 10, 105–113.

Nucleic Acids Research 35/suppl_1/D16