Sequence database

In bioinformatics, a sequence database is a type of biological database composed of a large collection of digital nucleic acid sequences, protein sequences, or other polymer sequences. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and was growing at an exponential rate.^[1] Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

Search

Searching a sequence database involves looking for similarities between a query sequence and the sequences stored in the database, and finding the database sequence that "best" matches the query (by criteria that vary with the search method). The number of matches (hits) is used to compute a score that measures the similarity between the query and each database sequence.^[2]

Scoring methods

The method for scoring the similarity will determine the rules by which a set of sequences can be considered similar or not. These are the two main methods to find the similarity between sequences:

Local alignment: This is the alignment between two sub-sequences, one taken from each sequence. This method is used when only certain sections of the sequences are suspected to be similar.

Semi-global alignment: This is the alignment of two whole sequences. It is a variation of global alignment that allows gaps at the beginning or end of one of the sequences, so that sequences of different lengths can be compared over their full extent.
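The difference between the two methods can be sketched with a small dynamic-programming scorer. This is a minimal sketch under stated assumptions: the scoring values (match +1, mismatch -1, gap -1) are arbitrary, and the semi-global mode uses one common formulation in which gaps at either end of either sequence cost nothing. The function name and parameters are illustrative only.

```python
def alignment_score(a, b, mode="local", match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b by dynamic programming.

    mode="local":      best-scoring pair of sub-sequences (Smith-Waterman style).
    mode="semiglobal": whole sequences, but gaps at the ends are not penalised.
    """
    if mode not in ("local", "semiglobal"):
        raise ValueError("mode must be 'local' or 'semiglobal'")
    rows, cols = len(a) + 1, len(b) + 1
    # The first row and column stay 0 in both modes: leading gaps are free.
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            best = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            # A local alignment may restart anywhere, so scores never drop below 0.
            H[i][j] = max(best, 0) if mode == "local" else best
    if mode == "local":
        return max(max(row) for row in H)
    # Semi-global: trailing gaps are free, so the alignment may end anywhere
    # on the last row or in the last column.
    return max(max(H[-1]), max(row[-1] for row in H))


print(alignment_score("TTACGAA", "CCACGGG", mode="local"))      # 3: the shared "ACG"
print(alignment_score("AAACGT", "ACGTTT", mode="semiglobal"))   # 4: suffix "ACGT" meets prefix "ACGT"
```

In the first call the similar region sits in the middle of both sequences, which a local alignment can isolate; the semi-global score for the same pair is lower because the alignment must run out to the ends of the sequences.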

Algorithms

Algorithms perform the searches. They aim to be effective by improving both the efficiency and the sensitivity of their results. Efficiency depends on the run time of the algorithm, while sensitivity depends on the algorithm's ability to find all true positive matches when comparing sequences. Different types of algorithms are used depending on the focus of the search:
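Sensitivity in this sense is usually expressed as the fraction of true matches that a search actually reports, TP / (TP + FN). The helper below is a small sketch of that calculation; the function name and the example identifiers are made up for illustration.

```python
def sensitivity(reported_hits, true_matches):
    """Fraction of the true matches that were reported: TP / (TP + FN)."""
    true_matches = set(true_matches)
    if not true_matches:
        raise ValueError("sensitivity is undefined without any true matches")
    true_positives = len(set(reported_hits) & true_matches)
    return true_positives / len(true_matches)


# The search reported three records; two of the four true matches are among them.
print(sensitivity(["a", "b", "x"], ["a", "b", "c", "d"]))  # 0.5
```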

Exhaustive search algorithms

These algorithms consider all possible solutions. They therefore favor sensitivity and produce very accurate results, at the cost of longer run times. The Smith-Waterman algorithm and methods based on the Burrows-Wheeler transform are examples of these algorithms.
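To give a feel for the Burrows-Wheeler approach, the following sketch builds a naive index of a text and counts every exact occurrence of a pattern by backward search. It is an illustration only: the construction sorts all suffixes directly (far too slow for real databases), the `$` sentinel is assumed not to occur in the text, and the function names are invented for this example.

```python
def bwt_index(text):
    """Build a naive Burrows-Wheeler index of text (for illustration only)."""
    text += "$"  # sentinel: assumed absent from text, sorts before the letters
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each suffix in sorted order.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c] = number of characters in text that sort before c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    return bwt, first


def count_occurrences(pattern, bwt, first):
    """Count exact occurrences of pattern using backward search over the BWT."""
    lo, hi = 0, len(bwt)  # half-open range of sorted suffixes still matching
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


bwt, first = bwt_index("ACGTACGACG")
print(count_occurrences("ACG", bwt, first))  # 3
print(count_occurrences("TT", bwt, first))   # 0
```

Every occurrence is counted, which reflects the exhaustive character of the search; practical tools replace the repeated `count` calls with precomputed rank tables.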

Heuristic search algorithms

These algorithms favor faster run times over the quality of the results. They are used when the user needs a quick solution with an acceptable result, although that solution might not be the most accurate. FASTA and BLAST are examples of these algorithms.
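The trade-off can be sketched with a simplified "seed and extend" search: only records that share an exact k-mer with the query are examined, and each seed is extended while the characters keep matching. This is a loose illustration of the general idea, not the actual FASTA or BLAST procedure; the function names, the default seed length, and the rule of stopping at the first mismatch are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the (record, offset) positions where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def seed_and_extend(query, database, k=4):
    """Return, per record, the longest exact match grown from a shared k-mer seed.

    Records sharing no k-mer with the query are never examined. That saves
    time, and it is also why some genuine matches can be missed.
    """
    index = build_kmer_index(database, k)
    best = {}
    for q in range(len(query) - k + 1):
        for name, s in index.get(query[q:q + k], ()):
            seq = database[name]
            left = 0  # characters matched to the left of the seed
            while q - left > 0 and s - left > 0 and query[q - left - 1] == seq[s - left - 1]:
                left += 1
            right = k  # characters matched from the seed start, seed included
            while q + right < len(query) and s + right < len(seq) and query[q + right] == seq[s + right]:
                right += 1
            best[name] = max(best.get(name, 0), left + right)
    return best


toy_database = {"rec1": "TTTACGTACGTTT", "rec2": "GGGGGGGG", "rec3": "ACGAACGT"}
print(seed_and_extend("ACGTACG", toy_database))  # {'rec1': 7, 'rec3': 4}
```

A record that is similar to the query but shares no exact 4-mer with it would not appear in the output at all, which is the kind of loss in sensitivity that heuristic methods accept in exchange for speed.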

Current issues

Records in sequence databases are deposited from a wide range of sources, from individual researchers to large genome sequencing centers. As a result, the sequences themselves, and especially the biological annotations attached to these sequences, may vary in quality. There is much redundancy, as multiple labs may submit numerous sequences that are identical, or nearly identical, to others in the databases.^[3]
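The simplest form of redundancy, exact duplicates, can be detected by grouping records on the sequence itself, as in the sketch below. It makes no attempt to handle nearly identical sequences, which require similarity-based clustering; the function name and record identifiers are hypothetical.

```python
def group_identical(records):
    """Group record identifiers by identical sequence, ignoring letter case.

    `records` is an iterable of (identifier, sequence) pairs.
    """
    groups = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return groups


# Invented submissions: labA and labB deposited the same sequence.
submissions = [("labA_1", "MKTAYIAK"), ("labB_7", "mktayiak"), ("labC_2", "MKTAYIAR")]
groups = group_identical(submissions)
redundant = {seq: names for seq, names in groups.items() if len(names) > 1}
print(redundant)  # {'MKTAYIAK': ['labA_1', 'labB_7']}
```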

Many annotations of the sequences are based not on laboratory experiments, but on the results of sequence similarity searches against previously annotated sequences. Once a sequence has been annotated based on similarity to others, and has itself been deposited in the database, it can also become the basis for future annotations. This can lead to a transitive annotation problem, because several such annotation transfers by sequence similarity may lie between a particular database record and actual wet lab experimental information.^[4] Therefore, care must be taken when interpreting the annotation data from sequence databases.
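One way to picture the transitive annotation problem is to count how many similarity-based transfers separate a record from experimental evidence. The sketch below assumes a provenance mapping that records, for each entry, which other entry its annotation was copied from; real databases do not necessarily expose provenance in this form, and the names here are hypothetical.

```python
def transfers_to_experiment(record, annotated_from):
    """Count annotation transfers between a record and experimental evidence.

    `annotated_from` maps a record to the record its annotation was copied
    from, or to None when the annotation rests on a wet lab experiment.
    Returns None if the chain loops or ends at a record with no provenance.
    """
    steps, seen = 0, set()
    while record in annotated_from and record not in seen:
        seen.add(record)
        source = annotated_from[record]
        if source is None:
            return steps
        record, steps = source, steps + 1
    return None


# Invented provenance: P1 is experimentally characterised, P3 was annotated
# from P2, which was annotated from P1. P4 and P5 only cite each other.
provenance = {"P1": None, "P2": "P1", "P3": "P2", "P4": "P5", "P5": "P4"}
print(transfers_to_experiment("P3", provenance))  # 2
print(transfers_to_experiment("P4", provenance))  # None
```

The larger the count, the more chances there have been for an error in one transfer to propagate to the record being read.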

References

[1] Cochrane, G.; Karsch-Mizrachi, I.; Nakamura, Y. (23 November 2010). "The International Nucleotide Sequence Database Collaboration". Nucleic Acids Research. 39 (Database): D15–D18. doi:10.1093/nar/gkq1150. PMC 3013722. PMID 21106499.
[2] Sung, Wing-Kin (2010). Algorithms in bioinformatics: a practical introduction. Boca Raton: Chapman & Hall/CRC Press. p. 109. ISBN 9781420070330.
[3] Sikic, K.; Carugo, O. (2010). "Protein sequence redundancy reduction: comparison of various method". Bioinformation. 5 (6): 234–9. doi:10.6026/97320630005234. PMC 3055704. PMID 21364823.
[4] Iliopoulos, I.; Tsoka, S.; Andrade, MA.; Enright, AJ.; Carroll, M.; Poullet, P.; Promponas, V.; Liakopoulos, T.; et al. (April 2003). "Evaluation of annotation strategies using an entire genome sequence". Bioinformatics. 19 (6): 717–26. doi:10.1093/bioinformatics/btg077. PMID 12691983.

External links

European Bioinformatics Institute databases
NCBI completely sequenced genomes
Stanford Saccharomyces Genome Database
Protein, the NIH protein database, a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and , as well as records from SwissProt, PIR, PRF, and PDB
