Basic Bioinformatics: BLAST
Rafael Dias Mesquita [email protected]
Laboratório de Bioinformática
Departamento de Bioquímica, Instituto de Química - UFRJ
Classification by similarity: which criterion?
• How would you group these cans?
  • Sugar content?
  • Color? Manufacturer?
  • Flavor?
• Mathematical algorithms are used to search for similarity, but the criteria must be defined first.
A central question in bioinformatics is how to establish a paradigm for similarity.
Sequence search based on local alignment
• Searches for short regions of similarity between pairs of sequences.
• Identifies sequences that are highly similar in certain regions but may show little similarity over their full length.
• Typically used to find similar sequences, domain regions in proteins, and coding regions in DNA.
• The most common tool is called BLAST.
Why do similarity search?
• Similarity indicates conserved function
  • Human and mouse genes are more than 80% similar at the sequence level
  • Comparing sequences helps us understand function
  • Locate a similar gene in another species to understand your new gene
• Why are biological sequences similar to one another?
  • In evolution they started out identical and then followed different paths, diverging to mere similarity or beyond.
  • Knowledge of how and why sequences change over time can help you interpret evolution.
Warning: similarity is not transitive!
• If 1 is “similar” to 2, and 3 is “similar” to 2, is 1 similar to 3?
• Not necessarily
  • AAAAAABBBBBB is similar to AAAAAA and BBBBBB
  • But AAAAAA is not similar to BBBBBB
• Similarity is “not transitive unless alignments are overlapping”
BLAST
• Basic Local Alignment Search Tool
• Developed in 1990 and 1997 (S. Altschul)
• A heuristic method for performing local alignments through searches of high-scoring segment pairs (HSPs)
• First to use statistics to predict the significance of initial matches, which saves on false leads
• Offers both sensitivity and speed
How does BLAST work? Basic Local Alignment Search Tool.
The BLAST algorithm is a heuristic search method that seeks words of length W (default = 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high-scoring segment pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria are reported by BLAST, provided they do not exceed the cutoff value specified for the number of descriptions and/or alignments to report.
Query sequence: GTQITVEDLFYNIATRRKALKN
Query words (W = 3): GTQ TQI QIT ITV TVE VED EDL DLF LFY FYN ...
Neighborhood words (for ITV): LTV, MTV, ISV, LSV, MSV, IAV, LAV, MAV, ITL, etc.
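The word list and neighborhood step can be sketched in a few lines of Python. This is only an illustration: the scoring function `toy_score`, its residue groups, and the threshold of 9 are assumptions chosen for the example, not the substitution matrix or T value that BLAST actually uses.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy similarity groups: an assumption for illustration, NOT a real matrix.
TOY_GROUPS = ["ILMV", "AST", "DE", "KR", "NQ", "FYW"]


def toy_score(a, b):
    """Score a residue pair: identity 5, same toy group 2, otherwise -2."""
    if a == b:
        return 5
    if any(a in group and b in group for group in TOY_GROUPS):
        return 2
    return -2


def query_words(seq, w=3):
    """Return every overlapping word of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def neighborhood(word, threshold):
    """Return all words scoring at least threshold when aligned to word."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


words = query_words("GTQITVEDLFYNIATRRKALKN")
neighbors = neighborhood("ITV", threshold=9)
print(words[:10])
print(len(neighbors), neighbors[:5])
```

Raising the threshold shrinks the neighborhood (faster, less sensitive); lowering it has the opposite effect, which is the trade-off the T parameter controls.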
Lookup table (hash table)
The program builds a table for the query sequence as well as for each sequence in the database.
Word size: default = 3 for proteins (2 or 3 allowed); > 7 for blastn (default 11)
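A minimal sketch of such a lookup table, assuming a plain Python dictionary that maps each word to its start positions. The subject sequence is hypothetical, and only exact word matches are looked up here, whereas BLAST also looks up the neighborhood words of each query word.

```python
from collections import defaultdict


def build_word_table(seq, w=3):
    """Map each word of length w to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i)
    return table


def seed_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where identical words occur."""
    subject_table = build_word_table(subject, w)
    hits = []
    for q_pos in range(len(query) - w + 1):
        for s_pos in subject_table.get(query[q_pos:q_pos + w], []):
            hits.append((q_pos, s_pos))
    return hits


query = "GTQITVEDLFYN"
subject = "AAITVEDKK"  # hypothetical database sequence
hits = seed_hits(query, subject)
print(hits)
```

Each hit is a candidate seed; consecutive hits on the same diagonal (equal difference between the two positions) are the ones worth extending.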
Scoring Matrices
• BLOSUM Matrices
  • Developed by Henikoff & Henikoff (1992)
  • BLOcks SUbstitution Matrix
  • Derived from the BLOCKS database
• PAM Matrices
  • Developed by Schwartz and Dayhoff (1978)
  • Point Accepted Mutation
  • Derived from manual alignments of closely related proteins
PAM versus BLOSUM
PAM
• First useful scoring matrix for proteins
• Assumes a Markov model of evolution (i.e., all sites equally mutable and independent)
• Derived from small, closely related proteins with ~15% divergence
BLOSUM
• Newer
• No evolutionary model is assumed
• Built from PROSITE-derived sequence blocks
• Uses a much larger, more diverse set of protein sequences (30% - 90% ID)
PAM versus BLOSUM
PAM
• Higher PAM numbers to detect more remote sequence similarities
• Lower PAM numbers to detect high similarities
• 1 PAM ~ 1 million years of divergence
• Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM
• Lower BLOSUM numbers to detect more remote sequence similarities
• Higher BLOSUM numbers to detect high similarities
• Sensitive to structural and functional substitution
• Errors in BLOSUM arise from errors in alignment
PAM Matrices
• PAM 40: prepared by multiplying PAM 1 by itself a total of 40 times; best for short alignments with high similarity
• PAM 120: prepared by multiplying PAM 1 by itself a total of 120 times; best for general alignment
• PAM 250: prepared by multiplying PAM 1 by itself a total of 250 times; best for detecting distant sequence similarity
BLOSUM Matrices
• BLOSUM 90: prepared from BLOCKS alignments in which sequences sharing more than 90% identity are clustered; best for short alignments with high similarity
• BLOSUM 62: prepared from BLOCKS alignments in which sequences sharing more than 62% identity are clustered; best for general alignment (default)
• BLOSUM 30: prepared from BLOCKS alignments in which sequences sharing more than 30% identity are clustered; best for detecting weak local alignments
Scoring matrices
Used to assign a score to each pair of aligned amino acid residues in BLAST.
What is the score of the following alignment?
Query:    A   C   Q   N   E
Subject:  A   C   B   P   D
Score:    4 + 9 + 0 - 2 + 2 = 13
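The same arithmetic can be written as a short Python sketch. Only the five residue pairs needed for this alignment are included, with the values used on the slide; the names `PAIR_SCORES`, `pair_score` and `alignment_score` are made up for the example, and a real program would load a full substitution matrix instead.

```python
# Only the five residue pairs needed for this example, with the values from
# the slide. A real run would load the full substitution matrix.
PAIR_SCORES = {
    ("A", "A"): 4,
    ("C", "C"): 9,
    ("Q", "B"): 0,
    ("N", "P"): -2,
    ("E", "D"): 2,
}


def pair_score(a, b):
    """Look up a substitution score, treating the matrix as symmetric."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the substitution scores of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


total = alignment_score("ACQNE", "ACBPD")
print(total)  # 13
```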
BLAST Access
• NCBI BLAST
  • http://www.ncbi.nlm.nih.gov/BLAST/
• European Bioinformatics Institute BLAST
  • http://www.ebi.ac.uk/blastall/
  • http://www.ebi.ac.uk/blast2/
Different Flavours of BLAST
• BLASTP: protein query against protein DB
• BLASTN: DNA/RNA query against GenBank (DNA)
• BLASTX: six-frame translated DNA query against protein DB
• TBLASTN: protein query against six-frame translated GenBank
• TBLASTX: six-frame translated DNA query against six-frame translated GenBank
• PSI-BLAST: protein ‘profile’ query against protein DB
• PHI-BLAST: protein pattern against protein DB
Other BLAST Services
• MEGABLAST: BLASTN now runs MEGABLAST by default; in previous versions they were separate
• RPS-BLAST: conserved domain detection
• BLAST 2 Sequences: pairwise alignment of 2 chosen sequences
• Genomic BLAST: alignments against select human, microbial or malarial genomes
• VecScreen: detection of cloning vector contamination in sequenced data
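BLASTX, TBLASTN and TBLASTX all depend on six-frame translation: three reading frames on the given strand and three on its reverse complement. The sketch below shows the idea under simple assumptions: the standard genetic code, upper- or lower-case A/C/G/T only (ambiguity codes such as N are not handled), and hypothetical function names.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations of frames +1, +2, +3, -1, -2, -3 (in that order)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


frames = six_frames("ATGGCC")
print(frames)
```

Stop codons appear as `*` in the output; a translated search compares each of the six protein strings separately.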
Running NCBI BLAST
• Choose a range of interest in the sequence with “set subsequences” (not usually used)
• Select the database from the pull-down menu (usually nr = non-redundant)
• Keep the CD Search check box on
• Leave “Options” unchanged (use defaults)
• Go to the “Format” menu and adjust the number of descriptions and alignments as desired
Select Database
Conserved Domain Database
• Contains a collection of pre-identified functional or structural domains
• Derived from Pfam and Smart databases as well as other sources
• Uses Reverse Position Specific BLAST (RPS-BLAST) to perform the search
• Query sequence is compared to a PSSM derived from each of the aligned domains
Click BLAST!
Formatting Results
BLAST Format Options
BLAST Output
BLAST Parameters
• Identities: number and % of exact residue matches
• Positives: number and % of similar and identical matches
• Gaps: number and % of gaps introduced
• Score: summed HSP score (S)
• Bit Score: a normalized score (S’)
• Expect (E): expected number of chance HSP alignments
• P: probability of getting a score > X by chance
• T: minimum word score (threshold)
BLAST - Rules of Thumb
• Expect (E-value) is the number of BLAST alignments with a given Score that are expected to be seen simply due to chance
• Don’t trust a BLAST alignment with an Expect value > 10^-10 (the grey zone lies between 10^-10 and 1)
• Expect and Score are related, but Expect contains more information. Note that % Identities is more useful than the bit Score
• Recall Doolittle’s Curve (%ID vs. length, next slide): %ID > 30 - numres/50
• If uncertain about a hit, perform a PSI-BLAST search
Doolittle’s Curve: Evolutionary Distance vs. Percent Sequence Identity
[Figure: sequence identity (%) plotted against number of residues (0 to 400), with regions labelled Reliable, Doubt, and Twilight Zone]
Pairwise sequence similarity versus alignment length. Rost B. Protein Eng. 1999;12:85-94. © Oxford University Press
[Figure: homologues confirmed by 3D structure versus false positives]
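The relationship between Score, Bit Score, Expect and P, and the Doolittle rule of thumb, can be sketched as follows. The formulas are the standard Karlin-Altschul ones; the values of `LAMBDA` and `K` are assumptions (values commonly quoted for gapped BLOSUM62 searches with gap penalties 11/1), the lengths in the usage line are hypothetical, and real BLAST additionally applies length corrections that are left out here.

```python
import math

# Assumed Karlin-Altschul parameters; BLAST prints the actual values it used
# at the end of each report.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance HSPs: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2 ** (-bit_score(raw_score, lam, k))


def p_value(e):
    """Probability of at least one chance HSP with this score: 1 - exp(-E)."""
    return 1 - math.exp(-e)


def passes_doolittle(percent_id, num_residues):
    """Rule of thumb from the slide: %ID > 30 - numres/50."""
    return percent_id > 30 - num_residues / 50


e = e_value(raw_score=100, query_len=250, db_len=1_000_000)
print(bit_score(100), e, p_value(e))
print(passes_doolittle(35, 100), passes_doolittle(25, 100))
```

Because E grows with the database size, the same raw score becomes less significant as the database gets larger, which is one reason E carries more information than the Score alone.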
Getting the Most from BLAST
BLAST Options
• Composition-based statistics (Yes)
• Sequence Complexity Filter (Yes / No)
• Expect (E) value (10)
• Word Size (3)
• Substitution or Scoring Matrix
• Gap Insertion Penalty (11)
• Gap Extension Penalty (1)
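The slides use the NCBI web form, but the same options exist in the standalone BLAST+ `blastp` program. As a hedged sketch, assuming BLAST+ is installed locally, the command could be assembled in Python like this; `query.fasta` and the database name `nr` are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex

# Hypothetical file and database names; replace with your own.
command = [
    "blastp",
    "-query", "query.fasta",
    "-db", "nr",
    "-evalue", "10",            # Expect (E) value
    "-word_size", "3",          # Word size
    "-matrix", "BLOSUM62",      # Substitution matrix
    "-gapopen", "11",           # Gap insertion penalty
    "-gapextend", "1",          # Gap extension penalty
    "-comp_based_stats", "2",   # Composition-based score adjustment
    "-seg", "yes",              # Low-complexity (SEG) filter on
]

command_line = shlex.join(command)
print(command_line)
# To execute it (requires a local BLAST+ installation and database):
# import subprocess; subprocess.run(command, check=True)
```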
Composition Statistics
• Permits calculated E (Expect) values to account for the amino acid composition of queries and database hits
• Improves accuracy and reduces false positives
• Effectively conducts a different scoring procedure for each sequence in the database
LCRs (low-complexity regions)
• Watch out for…
  • transmembrane or signal peptide regions
  • coiled-coil regions
  • short amino acid repeats (collagen, elastin)
  • homopolymeric repeats
• BLAST uses SEG to mask amino acids
• BLAST uses DUST to mask bases
Still Confused? http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
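To give a feel for what low-complexity masking does, here is a simplified sketch that masks windows with low Shannon entropy. It is not the SEG or DUST algorithm; the window length of 12 and the entropy cutoff of 2.2 bits are assumptions picked for the example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue inside a low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) < min_entropy:
            for j in range(i, i + window):
                masked[j] = "X"
    return "".join(masked)


print(mask_low_complexity("AAAAAAAAAAAA"))
print(mask_low_complexity("ACDEFGHIKLMN"))
```

Masked residues cannot seed word hits, so a homopolymeric stretch no longer produces a flood of meaningless matches.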
Conclusions
• BLAST is the most important program in bioinformatics (maybe all of biology)
• BLAST is based on statistical principles (key to its speed and sensitivity)
• A basic understanding of its principles is key for using/interpreting BLAST output
• Use BLASTN or MEGABLAST for DNA
• Use BLASTP or PSI-BLAST for protein searches
Scraping the Bottom of the Barrel with Psi-BLAST
PSI-BLAST Algorithm
1. Perform an initial alignment with BLAST using the BLOSUM 62 substitution matrix
2. Construct a multiple alignment from the matches
3. Prepare a position-specific scoring matrix (PSSM)
4. Use the PSSM profile as the scoring matrix for a second BLAST run against the database
5. Repeat steps 2-4 until convergence
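The iterate-until-convergence loop can be illustrated with a toy model. This is not PSI-BLAST itself: a simple per-column frequency profile stands in for the PSSM, sequences are ungapped and of equal length, a fixed score threshold replaces the E-value cutoff, and the database is hypothetical.

```python
from collections import Counter


def build_profile(seqs):
    """Per-column residue frequencies of equal-length, ungapped sequences."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def profile_score(profile, seq):
    """Mean, over all columns, of the frequency of seq's residue there."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq)) / len(seq)


def iterate_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from the current hits until the hit set is stable."""
    hits = {query}
    for _ in range(max_rounds):
        profile = build_profile(sorted(hits))
        new_hits = {s for s in database if profile_score(profile, s) >= threshold}
        new_hits.add(query)
        if new_hits == hits:
            break
        hits = new_hits
    return sorted(hits)


database = ["AAAA", "AAAB", "AABB", "ABBB", "CCCC"]  # hypothetical sequences
one_round = iterate_search("AAAA", database, threshold=0.6, max_rounds=1)
converged = iterate_search("AAAA", database, threshold=0.6)
print(one_round)
print(converged)
```

In this toy run the second round picks up a sequence that the first round missed, because the profile built from the first hits tolerates variation at the last position. That is the source of PSI-BLAST's extra sensitivity, and also why one false hit can contaminate later rounds.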
Profiles for families of sequences can be built from MSAs
Multiple sequence alignment (columns 1 2 3):
     1  2  3
     C  G  -
     A  A  T
     A  A  A
     -  A  -

Profile (frequency of each symbol per column):
      1     2     3
A    50%   75%   25%
C    25%    0%    0%
T     0%    0%   25%
G     0%   25%    0%
-    25%    0%   50%
Note: While profiles can be used for any kind of sequence data, we’ll focus on protein sequences
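Computing such a frequency profile takes only a few lines. The sketch below uses a four-sequence, three-column nucleotide alignment consistent with the percentages on this slide; the function name `column_frequencies` and the row order in `ROWS` are choices made for the example.

```python
from collections import Counter

ROWS = "ACTG-"  # symbol order used for the frequency table


def column_frequencies(alignment):
    """Return, for each alignment column, the frequency of every symbol."""
    n_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({sym: counts[sym] / n_seqs for sym in ROWS})
    return profile


alignment = ["CG-", "AAT", "AAA", "-A-"]
profile = column_frequencies(alignment)
for sym in ROWS:
    print(sym, [f"{col[sym]:.0%}" for col in profile])
```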
Profiles
• Profile: a table that lists the frequencies of each amino acid in each position of a protein sequence.
• Frequencies are calculated from an MSA containing a domain of interest
  • Allows us to identify a consensus sequence
  • A derived scoring scheme allows us to align a new sequence to the profile
  • The profile can be used in database searches
  • Find new sequences that match the profile
• Profiles are also used to compute multiple alignments heuristically
  • Progressive alignment
Profiles: Position-Specific Scoring Matrix (PSSM)
• To compare a sequence to a profile, we need to assign a score for each amino acid
• The score of the profile for amino acid a at position p is
  $$M(p,a) = \sum_{b=1}^{20} f(p,b) \cdot s(a,b)$$
  where
  • f(p,b) = frequency of amino acid b in position p
  • s(a,b) is the score of (a,b) (from, e.g., BLOSUM or PAM)
Profiles: PSSM
Gribskov et al. PNAS. 84 (13): 4355 (1987)
Insertion/deletion penalty
Profiles: Consensus Sequence
• A consensus residue C(p) is generated at each position of the profile to aid the display of alignments of target sequences with the profile.
• The consensus residue c is the amino acid at p that has the highest score M(p,c).
  • c is the amino acid most mutationally similar to all the aligned residues of the probe sequences at p, rather than the most common one
Aligning a sequence to a profile

Multiple alignment:
K L M - K
K L K L K
K M M L -
M L - L M

Profile (frequency of each symbol per column):
     1    2    3    4    5
K   .75       .25       .50
L        .75       .75
M   .25  .25  .50       .25
-             .25  .25  .25

New sequence: K K L L M

Align with profile:
New sequence:     K K L - L M
Profile position: 1 - 2 3 4 5

Resulting alignment:
K K L - L M
K - L M - K
K - L K L K
K - M M L -
M - L - L M
Scoring a sequence-to-profile alignment
• Score each column separately according to the PSSM
• Each character contributes to the score, weighted by its frequency

Profile:
     1    2    3    4    5
K   .75       .25       .50
L        .75       .75
M   .25  .25  .50       .25
-             .25  .25  .25

New sequence:     K K L - L M
Profile position: 1 - 2 3 4 5

Column 1 score: 0.75 x s(K,K) + 0.25 x s(K,M)
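The column score M(p,a), the consensus residue, and the score of the whole sequence-to-profile alignment can be sketched together. Several things here are assumptions made for the example: the alphabet is restricted to K, L and M, the substitution scores are the BLOSUM62 values for those residues, gap fractions in the profile are ignored, and gaps in the alignment cost a flat hypothetical penalty of -4.

```python
# BLOSUM62 scores for the three residues in this example.
S = {
    ("K", "K"): 5, ("L", "L"): 4, ("M", "M"): 5,
    ("K", "L"): -2, ("K", "M"): -1, ("L", "M"): 2,
}
ALPHABET = "KLM"
GAP_PENALTY = -4  # hypothetical flat penalty per gapped position


def s(a, b):
    """Symmetric substitution score lookup."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def column_score(column_freqs, a):
    """M(p, a) = sum over b of f(p, b) * s(a, b) for one profile column."""
    return sum(f * s(a, b) for b, f in column_freqs.items())


def consensus(profile):
    """Pick, for each column, the residue c with the highest M(p, c)."""
    return "".join(
        max(ALPHABET, key=lambda a: column_score(col, a)) for col in profile
    )


def alignment_score(profile, aligned_seq, aligned_cols):
    """Sum column scores; '-' in the sequence or None as column is a gap."""
    total = 0.0
    for residue, col_index in zip(aligned_seq, aligned_cols):
        if residue == "-" or col_index is None:
            total += GAP_PENALTY
        else:
            total += column_score(profile[col_index], residue)
    return total


# Residue frequencies per column of the K/L/M alignment (gap rows left out).
profile = [
    {"K": 0.75, "M": 0.25},
    {"L": 0.75, "M": 0.25},
    {"K": 0.25, "M": 0.50},
    {"L": 0.75},
    {"K": 0.50, "M": 0.25},
]

col1 = column_score(profile[0], "K")  # 0.75 * s(K,K) + 0.25 * s(K,M)
cons = consensus(profile)
total = alignment_score(profile, "KKL-LM", [0, None, 1, 2, 3, 4])
print(col1, cons, total)
```

Real profile methods use position-specific gap penalties rather than one flat value, so the total here only illustrates how the column terms add up.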
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
Serine is scored differently at these two positions
Active-site nucleophile
Position Specific Score Matrix (PSSM)
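The position dependence is easy to check by looking up the serine column for each row whose query residue is S. The sketch below copies three rows (210, 211 and 216) from the matrix on this slide; which two of them the original figure highlighted is not recoverable from the text, so all three are shown.

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows copied from the PSSM on the slide: (position, query residue, scores).
ROWS = [
    (210, "S", [-2, -5, 0, 8, -5, -3, -2, -1, -4, -7,
                -6, -4, -6, -7, -5, 1, -3, -7, -5, -6]),
    (211, "S", [4, -4, -4, -4, -4, -1, -4, -2, -3, -3,
                -5, -4, -4, -5, -1, 4, 3, -6, -5, -3]),
    (216, "S", [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6,
                -6, -3, -5, -6, -4, 7, -2, -6, -5, -5]),
]

pssm = {pos: dict(zip(COLUMNS, scores)) for pos, _residue, scores in ROWS}

for pos, row in pssm.items():
    best = max(row, key=row.get)
    print(pos, "score for S:", row["S"], "| highest-scoring residue:", best)
```

A fixed substitution matrix would give serine the same score at every position; the PSSM rewards it most where the family conserves it.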
PSI-BLAST: press Iterate!
PSI-BLAST
• For protein sequences ONLY
• Much more sensitive than BLAST
• Slower (iterative process)
• SHOULD BE YOUR FIRST CHOICE when analyzing a "hypothetical" or very divergent sequence
Conclusions
• BLAST is the most important program in bioinformatics (maybe all of biology)
• BLAST is based on sound statistical principles (key to its speed and sensitivity)
• A basic understanding of its principles is key for using/interpreting BLAST output
• Use BLASTN or MEGABLAST for DNA
• Use PSI-BLAST for protein searches
Acknowledgements
Students: Rafael and Thayani
Reference: Glória Braz; Lab Comp Científica; Univ. Beirut http://staff.aub.edu.lb/~webbic/
Funding: INCT Entomologia Molecular - CNPq+FAPERJ; Grupos Emergentes - FAPERJ; PRONEX Dengue - CNPq; APQ1 - FAPERJ; IC - FAPERJ; PIBIC - UFRJ