- •Current size of GenBank (June 2011): 129,178,292,958 bp (1.3 x 10^11, or 129
- •Biological databases: Why?
- •Database Background
- •Some Common Databases
- •Codd’s Normal Forms
- •Basic Structure
- •Indexing
- •Search Trees
- •Computational Complexity
- •The different types of Databases in Bioinformatics
- •Identifiers and Accession numbers
- •An accession number
- •Nucleotide Sequence Databases
- •EMBL and DDBJ
- •EMBL
- •DDBJ
- •www.ncbi.nlm.nih.gov
- •PubMed is…
- •Entrez integrates…
- •How can I use NCBI (or other sites)?
- •FASTA format
- •Graphics format
- •Thank you!!!
The different types of Databases in Bioinformatics
2) Database:
Organisation:
•Flat files
•Relational databases
•Object-oriented databases
•…
Availability:
•Publicly available, no restriction
•Available, but with copyright
•Accessible, but not downloadable
•Academic, but not freely available
•Commercial
Curators:
•Large, public institution (EMBL, NCBI)
•Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR, …)
•Academic group or scientist
•Commercial company
Identifiers and Accession numbers
Identifier: a string of letters and digits that is generally “understandable” (human-readable)
Example: TPIS_CHICK (triose phosphate isomerase from chicken, Gallus gallus) in SwissProt
The identifier can change (at the curator's discretion)
Accession code: a string of letters and digits that uniquely identifies an entry in its database.
The accession number for TPIS_CHICK in SwissProt is P00940
An accession number should not change!
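The difference between the two can be sketched in Python. This is a minimal, hypothetical in-memory table, not how SwissProt is implemented: entries are keyed by accession number, and the identifier is an ordinary field that a curator may rename. The `Entry` class, the helper functions, and the new name `TPI_CHICK` are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str   # stable key, should never change
    identifier: str  # human-readable name, may be renamed by the curator


# Hypothetical in-memory "database" keyed by accession number.
entries = {"P00940": Entry(accession="P00940", identifier="TPIS_CHICK")}


def rename(db, accession, new_identifier):
    """Change the identifier of an entry; the accession key is untouched."""
    db[accession].identifier = new_identifier


def find_by_identifier(db, identifier):
    """Return the accession of the entry with this identifier, or None."""
    for entry in db.values():
        if entry.identifier == identifier:
            return entry.accession
    return None


# Hypothetical rename, for illustration only.
rename(entries, "P00940", "TPI_CHICK")
```

After the rename, a lookup by the old identifier finds nothing, while a lookup by accession number still reaches the same entry. This is why references to an entry are best stored as accession numbers.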
An accession number
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
•X02775: GenBank genomic DNA sequence
•NT_030059: Genomic contig
•rs7079946: dbSNP (single nucleotide polymorphism)
RNA
•N91759.1: An expressed sequence tag (1 of 170)
•NM_006744: RefSeq DNA sequence (from a transcript)
Protein
•NP_007635: RefSeq protein
•AAC02945: GenBank protein
•Q28369: SwissProt protein
•1KT7: Protein Data Bank structure record
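The examples above follow recognisable formats, so the source database can often be guessed from the shape of the accession number alone. The sketch below is a simplified heuristic, not an official validator: the patterns and the `classify` function are assumptions for illustration, and they cover only the formats shown on this slide. Some strings fit more than one pattern (Q28369 also looks like a one-letter, five-digit GenBank accession), so the order of the checks matters.

```python
import re

# Simplified patterns (assumptions for illustration); checked in order, first match wins.
PATTERNS = [
    # Two letters, underscore, digits, optional version: NM_006744, NP_007635, NT_030059
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    # "rs" followed by digits: rs7079946
    ("dbSNP", re.compile(r"^rs\d+$")),
    # One digit followed by three alphanumeric characters: 1KT7
    ("PDB", re.compile(r"^\d[A-Za-z0-9]{3}$")),
    # UniProt/SwissProt accession format: Q28369, P00940
    ("UniProt/SwissProt", re.compile(
        r"^([OPQ]\d[A-Z0-9]{3}\d|[A-NR-Z]\d([A-Z][A-Z0-9]{2}\d){1,2})$")),
    # Three letters and five digits, optional version: AAC02945
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}(\.\d+)?$")),
    # One letter + five digits or two letters + six digits, optional version: X02775, N91759.1
    ("GenBank nucleotide", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(\.\d+)?$")),
]


def classify(accession):
    """Return the name of the first pattern that matches, or "unknown"."""
    for name, pattern in PATTERNS:
        if pattern.match(accession):
            return name
    return "unknown"


for acc in ["X02775", "NT_030059", "rs7079946", "N91759.1", "NM_006744",
            "NP_007635", "AAC02945", "Q28369", "1KT7"]:
    print(acc, "->", classify(acc))
```

A format check like this only suggests where to look. Whether an entry actually exists can only be confirmed by querying the database itself.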
Nucleotide Sequence Databases
3 main databases
EMBL: www.ebi.ac.uk/embl
GenBank: www.ncbi.nlm.nih.gov/GenBank
DDBJ: www.ddbj.nig.ac.jp
The 3 databases are synchronized on a daily basis, and the accession numbers are consistent.
There are no legal restrictions on the use of these databases. However, some of the sequences in the databases are patented.
EMBL and DDBJ
Collaborative effort with NCBI GenBank
Searchable databases of gene information
EMBL
Gene Expression and Protein Sequences
•UniProt Knowledgebase: a complete annotated protein sequence database
•Macromolecular Structure Database: European project for the management and distribution of data on macromolecular structures
•ArrayExpress: gene expression data
•IntAct: a freely available, open source database system and analysis tools for protein interaction data
DDBJ
Tools to compare nucleic acid sequences and amino acid sequences (query vs. database)
•Nucleotide vs. nucleotide: fasta, blastn, tblastx
•Nucleotide vs. amino acid: fastx, blastx
•Amino acid vs. nucleotide: tfasta, tfastx, tblastn
•Amino acid vs. amino acid: blastp
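Choosing a program comes down to two questions: what type of sequence is the query, and what type of sequence does the database hold? A minimal sketch of that lookup for the BLAST family follows, using the standard BLAST program definitions (under which blastp compares an amino acid query against an amino acid database). The table name, the function name, and the "nt"/"aa" labels are assumptions for illustration.

```python
# (query type, database type) -> BLAST programs; "nt" = nucleotide, "aa" = amino acid.
BLAST_PROGRAMS = {
    ("nt", "nt"): ["blastn", "tblastx"],  # tblastx translates both query and database
    ("nt", "aa"): ["blastx"],             # query is translated before comparison
    ("aa", "nt"): ["tblastn"],            # database is translated before comparison
    ("aa", "aa"): ["blastp"],
}


def blast_programs(query_type, database_type):
    """Return the BLAST programs suited to this query/database combination."""
    return BLAST_PROGRAMS.get((query_type, database_type), [])


print(blast_programs("aa", "nt"))
```

For example, a protein sequence searched against a nucleotide database calls for tblastn, because the nucleotide entries must be translated before they can be compared with amino acids.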