1 Protein family and domain databases
1.1 Protein motif and domain databases
Motifs and domains
PROSITE database
PRINTS database
BLOCKS database
ProDom database
Pfam database
SMART database
InterPro database
Conserved Domain Database
CDART
Motifs and domains:
Biologists can gain insight into protein function by identifying short consensus sequences related to known functions. These consensus sequence patterns are termed motifs and domains.
A motif is a short conserved sequence pattern associated with a distinct function of a protein or DNA. It is often associated with a distinct structural site performing a particular function. A typical motif, such as a Zn-finger motif, is ten to twenty amino acids long.
A domain is also a conserved sequence pattern, defined as an independent functional and structural unit. Domains are normally longer than motifs: a domain consists of more than 40 and up to 700 residues, with an average length of 100 residues. A domain may or may not include motifs within its boundaries. Examples include transmembrane domains and ligand-binding domains.
Identification of motifs and domains relies heavily on multiple sequence alignment as well as on profile and hidden Markov model (HMM) construction.
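PROSITE (database of protein families and domains):
The first established sequence pattern database.
/prosite/
PROSITE is a database of protein families and domains. It contains biologically significant sites, patterns, and statistical profiles that help identify protein families.
The sequence patterns in PROSITE include enzyme catalytic sites, ligand-binding sites, residues that bind metal ions, cysteines involved in disulfide bonds, and regions that bind small molecules or other proteins.
PROSITE also includes statistical profiles built from multiple sequence alignments, which can detect more sensitively whether an (unknown) sequence carries the corresponding feature.
The functional information for these patterns is primarily based on published literature.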
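As a rough illustration of how a sequence pattern of this kind can be applied, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names prosite_to_regex and scan_sequence are hypothetical, the example sequence is made up, and only the basic syntax elements (x, [..], {..}, repeat counts, and the < and > anchors at the pattern ends) are handled. It is not the scanning tool PROSITE itself provides.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string.

    Handles only basic syntax: 'x' (any residue), '[ABC]' (allowed residues),
    '{ABC}' (excluded residues), '(n)' or '(n,m)' repeats, and '<' / '>'
    anchors at the very start / end of the pattern.
    """
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional repeat count.
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return prefix + "".join(parts) + suffix


def scan_sequence(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # A C2H2 zinc-finger style pattern, used here purely as an illustration.
    zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
    toy_sequence = "MK" + "CAAC" + "AAAL" + "A" * 8 + "HAAAH"  # made-up sequence
    print(prosite_to_regex(zinc_finger))
    print(scan_sequence(toy_sequence, zinc_finger))
```

Because re.finditer reports non-overlapping matches only, overlapping occurrences of a pattern would need a lookahead-based scan instead.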
PRINTS (protein motif fingerprint database):
A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap but are separated along a sequence, though they may be contiguous in 3D space.
/dbbrowser/PRINTS/
The site provides protein homology analysis, protein motif fingerprint analysis, phylogenetic and sequence evolution analysis, and microarray analysis, and offers bioinformatics resources and PRINTS database downloads.
BLOCKS: A database of blocks.
Blocks are ungapped multiple alignments derived from the most conserved regions of homologous protein sequences. The blocks, which are usually longer than motifs, are subsequently converted to position-specific scoring matrices (PSSMs). Because blocks often encompass motifs, the functional annotation of blocks is consistent with that of the motifs.
/blocks.
Tools for detecting and identifying protein motifs include BLOCK search, Get Blocks, and Block Maker.
A query sequence can be aligned against the precomputed profiles in the database to select the highest-scoring matches.
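To make the block-to-PSSM step concrete, here is a minimal sketch that turns an ungapped block into a log-odds matrix and slides it along a query to find the best-scoring window. The uniform background frequency, the pseudocount of 1, the toy block, and the function names build_pssm and best_window are all assumptions made for illustration. The BLOCKS server uses its own weighting and scoring scheme.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from an ungapped block.

    Assumes a uniform background frequency of 1/20 for every residue.
    """
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("all rows of an ungapped block must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        column = [row[col] for row in block]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def best_window(sequence, pssm):
    """Return (score, 1-based start) of the best ungapped window, or None.

    Residues outside the 20 standard amino acids contribute a score of 0.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i].get(aa, 0.0) for i, aa in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best


if __name__ == "__main__":
    toy_block = ["GKSTL", "GKTTL", "GKSSL", "GRSTL"]  # made-up block
    pssm = build_pssm(toy_block)
    print(best_window("MAAGKSTLVV", pssm))
```

Scoring a query against many such matrices and keeping the top hits is the same idea applied across the whole database.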
ProDom: Domain database.
ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases. The domains are built using recursive iterations of PSI-BLAST.
/prodom/current/html/
It provides similarity searches and multiple sequence alignments of related domains from SWISSPROT.
Pfam (Protein families database of alignments and HMMs):
A database of protein domains derived from sequences in SWISSPROT and TrEMBL. Each motif or domain is represented by an HMM profile generated from the seed alignment of a number of conserved homologous proteins.
/
The Pfam database is composed of two parts:
Pfam-A, which involves manual alignments;
Pfam-B, which uses automatic alignment in a way similar to ProDom (PSI-BLAST).
The functional annotation of motifs in Pfam-A is often related to that in PROSITE. Pfam-B contains only sequence families not covered in Pfam-A. Because it is generated automatically, Pfam-B has much larger coverage but is also more error-prone, because some HMMs are generated from unrelated sequences.
SMART (Simple Modular Architecture Research Tool):
Contains HMM profiles constructed from manually refined protein domain alignments.
/
Alignments in the database are built based on tertiary structures whenever available, or otherwise on PSI-BLAST profiles. Alignments are further checked and refined by human annotators before HMM profile construction. Protein functions are also manually curated. The database may be of better quality than Pfam, with more extensive functional annotations.
Compared to Pfam, the SMART database contains an independent collection of HMMs, with emphasis on signaling, extracellular, and chromatin-associated motifs and domains.
Sequence searching in this database produces a graphical output of domains with well-annotated information on cellular localization, functional sites, superfamily, and tertiary structure.
InterPro: An integrated pattern database.
/interpro/
The database integrates information from the PROSITE, Pfam, PRINTS, ProDom, and SMART databases. The sequence patterns from the five databases are further processed: only overlapping motifs and domains in a protein sequence that are derived by all five databases are included.
A popular feature of this database is a graphical output that summarizes motif matches and links to more detailed information.
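The overlap rule stated above can be pictured as an interval intersection. The sketch below takes per-database hits on one protein as (start, end) residue ranges and returns the regions covered by every database. The coordinates are invented and common_regions is a hypothetical helper. This illustrates the stated rule only, not InterPro's actual integration procedure.

```python
def covered_positions(hits):
    """Return the set of residue positions covered by (start, end) hits.

    Coordinates are 1-based and inclusive.
    """
    positions = set()
    for start, end in hits:
        positions.update(range(start, end + 1))
    return positions


def common_regions(hits_by_db):
    """Return sorted (start, end) regions covered by every database."""
    if not hits_by_db:
        return []
    shared = set.intersection(
        *(covered_positions(hits) for hits in hits_by_db.values())
    )
    regions = []
    for pos in sorted(shared):
        if regions and pos == regions[-1][1] + 1:
            regions[-1][1] = pos  # extend the current contiguous region
        else:
            regions.append([pos, pos])  # start a new region
    return [tuple(region) for region in regions]


if __name__ == "__main__":
    # Hypothetical hit coordinates on a single query protein.
    hits = {
        "PROSITE": [(10, 30)],
        "Pfam": [(5, 40), (100, 150)],
        "PRINTS": [(12, 28), (110, 120)],
        "ProDom": [(8, 35)],
        "SMART": [(15, 45)],
    }
    print(common_regions(hits))
```

Expanding hits to per-residue sets keeps the sketch short; for long sequences or many hits a sweep over sorted interval endpoints would be more economical.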
CDD (Conserved Domain Database):
A collection of multiple sequence alignments for ancient domains and full-length proteins.
/Structure/cdd/
The CD-Search service may be used to identify the conserved domains present in a protein query sequence:
/Structure/cdd/
RPS-BLAST (Reverse PSI-BLAST) is the search tool used in the CD-Search service. It uses a query sequence to search against a pre-computed profile database generated by PSI-BLAST. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST. It performs only one iteration of regular BLAST searching against a database of PSI-BLAST profiles to find the high-scoring gapped matches.
CDART (Conserved Domain Architecture Retrieval Tool): A domain search program.
/BLAST/
Combines the results from RPS-BLAST, SMART, and Pfam. The resulting domain architecture of a query sequence can be presented graphically along with related sequences. CDART is not a substitute for individual database searches because it often misses certain features that can be found in SMART and Pfam.
1.2 Protein family databases
COG (Clusters of Orthologous Groups):
A protein family database based on phylogenetic classification.
/COG/
It is constructed by comparing protein sequences encoded in completely sequenced genomes.
Unicellular clusters: searched with the COGnitor program.
Eukaryotic clusters: searched with KOGnitor.
A query sequence can be assigned a function if it has significant similarity matches with any member of a cluster.
ProtoNet: A database of clusters of homologous proteins, similar to COG.
/
Orthologous protein sequences in the SWISSPROT database are clustered based on pairwise sequence comparisons between all possible protein pairs using BLAST. Protein relatedness is defined by the E-values from the BLAST alignments. A query protein sequence can be submitted to the server for cluster identification and functional annotation.
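One simple way to picture clustering from pairwise E-values is single-linkage grouping: any two proteins whose E-value falls at or below a cutoff end up in the same cluster. The sketch below does this with a union-find structure. The cutoff of 1e-5, the protein identifiers, the E-values, and the function name cluster_by_evalue are all assumptions, and single linkage is a deliberate simplification that should not be read as ProtoNet's actual clustering algorithm.

```python
def cluster_by_evalue(proteins, pairwise_evalues, threshold=1e-5):
    """Group proteins by single linkage on pairwise E-values.

    pairwise_evalues maps (protein_a, protein_b) to an E-value. Pairs with
    an E-value at or below the threshold are merged into one cluster.
    Returns a sorted list of sorted clusters.
    """
    parent = {protein: protein for protein in proteins}

    def find(protein):
        # Follow parent links to the cluster representative, halving the
        # path along the way to keep later lookups short.
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]
            protein = parent[protein]
        return protein

    for (a, b), evalue in pairwise_evalues.items():
        if evalue <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for protein in proteins:
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical identifiers and E-values, for illustration only.
    proteins = ["P1", "P2", "P3", "P4", "P5"]
    evalues = {
        ("P1", "P2"): 1e-30,
        ("P2", "P3"): 1e-8,
        ("P3", "P4"): 0.5,
        ("P4", "P5"): 1e-12,
        ("P1", "P5"): 3.0,
    }
    print(cluster_by_evalue(proteins, evalues))
```

Single linkage tends to chain distantly related proteins together through intermediates, which is one reason a stricter cutoff or a hierarchical scheme may be preferred in practice.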
1.3 Protein structure databases
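PDB (Protein Data Bank):
PDB contains three-dimensional structures of biological macromolecules determined experimentally (X-ray crystallography, nuclear magnetic resonance (NMR)):
proteins
nucleic acids
carbohydrates
other complexes
/pdb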
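SCOP (Structural Classification of Proteins): protein structure classification database.
Provides a detailed description of the structural and evolutionary relationships between proteins of known structure, covering all entries in the Protein Data Bank (PDB).
/scop/
In addition to structural and evolutionary relationships, SCOP includes the following for each protein: links to PDB, sequences, references, images of the structure, and so on.
Proteins can be classified by structural and evolutionary relationship. The result is a hierarchical tree whose main levels are family, superfamily, and fold:
Family: a clear evolutionary relationship
Superfamily: a distant evolutionary relationship, sharing a common evolutionary origin
Fold: major structural similarity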
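DSSP (database of protein secondary structure):
For any protein in the PDB macromolecular database, DSSP derives the corresponding secondary structure from its three-dimensional structure.
/dssp/
It is very useful for studying the relationship between protein sequence, secondary structure, and spatial structure.
Besides secondary structure, DSSP also includes the geometric features and solvent accessibility of the protein.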
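HSSP (database of homology-derived protein sequence alignments):
A secondary database.
/hssp/
Data come from PDB or from SWISS-PROT.
For each protein in PDB, HSSP aligns all sequences homologous to it, thereby grouping proteins of similar sequence into structurally homologous families.
HSSP helps in analyzing conserved regions of proteins, studying evolutionary relationships among proteins, and in the molecular design of proteins.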
1.4 Other biological macromolecule databases
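MMDB (Molecular Modeling Database):
MMDB is part of (NCBI) Entrez; it contains experimentally derived structural data for biological macromolecules.
/entrez/?db=Structure
Compared with PDB, MMDB carries much additional information for each macromolecular structure in the database, such as the molecule's biological function, the mechanism that produces that function, and the molecule's evolutionary history.
It also provides tools for displaying three-dimensional structure models of macromolecules, for structure analysis, and for structure comparison.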
dbSNP (Single Nucleotide Polymorphism database)
/entrez/?db=snp
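OMIM (Online Mendelian Inheritance in Man):
A catalog database of human genes and genetic disorders. It collects known human genes and the genetic diseases caused by mutation or deletion of these genes.
/entrez/?db=OMIM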
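EPD (Eukaryotic Promoter Database):
/
Provides promoter sequences of eukaryotic genes obtained from EMBL; its aim is to help experimental researchers and bioinformaticians analyze the transcription signals of eukaryotic genes.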
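TRRD (Transcription Regulatory Regions Database):
An integrated database of gene regulation information. It collects information on the structure and function of transcription regulatory regions of eukaryotic genes. Each TRRD entry corresponds to one gene and contains the various structure-function features of that gene.
/mgs/gnw/trrd/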
2 Protein function prediction
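Protein structure and function have been studied for a long time. Because of their complexity, predicting structure and function remains difficult in terms of both methodology and underlying theory.
The general procedure for protein function prediction:
Database homology search: predict function from homology information. Is the unknown protein sequence (structure) similar to the sequence (structure) of a protein of known function?
Prediction from sequence features: many properties of a protein can be derived directly from its sequence. Hydrophobicity, for example, can be used to predict whether a stretch of sequence is a transmembrane helix or a leader sequence.
Motif or domain search: determine function by comparison against motif or domain databases. If the unknown protein contains a conserved motif or domain, it is inferred to have the function associated with that motif or domain.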
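The hydrophobicity step in the procedure above can be sketched as a sliding-window hydropathy profile. The example below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and flags windows whose average is at least 1.6, a commonly quoted heuristic for candidate transmembrane segments. The window size, the cutoff, the test sequence, and the function names are assumptions for illustration; dedicated transmembrane predictors are considerably more sophisticated.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return the average hydropathy of every full window along the sequence.

    Raises KeyError if the sequence contains a non-standard residue code.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Return 1-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, average in enumerate(profile) if average >= threshold]


if __name__ == "__main__":
    # Made-up sequence: a charged flank, a leucine-rich stretch, a charged flank.
    toy_sequence = "D" * 5 + "L" * 19 + "K" * 5
    profile = hydropathy_profile(toy_sequence)
    peak = max(range(len(profile)), key=profile.__getitem__)
    print("peak window starts at residue", peak + 1)
    print("candidate windows:", candidate_tm_windows(toy_sequence))
```

A hydrophobic stretch near the N-terminus may equally indicate a leader (signal) sequence, so the position of the peak matters as much as its height.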