
Basic Bioinformatics: BLAST

Rafael Dias Mesquita [email protected]

Laboratório de Bioinformática

Departamento de Bioquímica Instituto de Química - UFRJ

Classification by similarity: which criterion?

•  How would you group these cans?

Sugar content? Color? Manufacturer? Flavor?

•  Mathematical algorithms are used to search for similarity, but the criteria must be defined first

A central question in bioinformatics is how to establish a paradigm for similarity.

Sequence search based on local alignment

•  Searches for short regions where pairs of sequences are similar.

•  Identifies sequences that are highly similar in certain regions but may show little similarity over their full length

•  Typically used to find similar sequences, domain regions in proteins, and coding regions in DNA.

•  The most common tool is called BLAST

Why do similarity search?

•  Similarity indicates conserved function

•  Human and mouse genes are more than 80% similar at the sequence level

•  Comparing sequences helps us understand function
    -  Locate a similar gene in another species to understand your new gene

•  Why are biological sequences similar to one another?
    -  In evolution they started out identical and then followed different paths, diverging until they were merely similar, or beyond.

•  Knowledge of how and why sequences change over time can help you interpret evolution.

Warning: similarity is not transitive!

•  If 1 is “similar” to 2, and 3 is “similar” to 2, is 1 similar to 3?

•  Not necessarily
    -  AAAAAABBBBBB is similar to AAAAAA and BBBBBB
    -  But AAAAAA is not similar to BBBBBB

•  “Not transitive unless alignments are overlapping”

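The AAAAAA / BBBBBB warning can be checked with a toy similarity measure. This is a minimal sketch, not how BLAST scores anything: it assumes "similar" means "shares an identical run of at least `min_run` characters", and the names `longest_shared_run` and `is_similar` are hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest identical substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


def is_similar(a: str, b: str, min_run: int = 6) -> bool:
    """Toy criterion (an assumption): a shared run of min_run or more characters."""
    return longest_shared_run(a, b) >= min_run


s1, s2, s3 = "AAAAAA", "AAAAAABBBBBB", "BBBBBB"
print(is_similar(s1, s2))  # 1 vs 2
print(is_similar(s3, s2))  # 3 vs 2
print(is_similar(s1, s3))  # 1 vs 3
```

Sequences 1 and 3 each match a different half of sequence 2, so the two local alignments do not overlap and nothing links 1 to 3.
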
BLAST

•  Basic Local Alignment Search Tool

•  Developed in 1990 and 1997 (S. Altschul)

•  A heuristic method for performing local alignments through searches of high-scoring segment pairs (HSPs)

•  First to use statistics to predict the significance of initial matches, which saves on false leads

•  Offers both sensitivity and speed

How does BLAST work? Basic Local Alignment Search Tool.

The algorithm: BLAST is a heuristic search method that seeks words of length W (default = 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high-scoring segment pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for the number of descriptions and/or alignments to report.
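The extension step described above can be sketched as follows. This is an illustration under stated assumptions, not BLAST's actual implementation: the scoring function is a simple match/mismatch scheme instead of a substitution matrix, and the stopping rule (`x_drop`, give up once the running score falls more than that far below the best seen) and all function names are hypothetical choices for this example.

```python
def extend_one_way(query, subject, qi, si, step, score_fn, x_drop):
    """Extend without gaps from (qi, si) in direction step (+1 or -1).

    Returns (best score gain, number of positions kept to reach it).
    """
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score_fn(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score dropped too far below the best: stop extending
        qi += step
        si += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, w, score_fn, x_drop=5):
    """Grow a word hit of length w into an ungapped high-scoring segment pair."""
    seed = sum(score_fn(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, q_start + w, s_start + w, 1, score_fn, x_drop)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1, -1, score_fn, x_drop)
    q_seg = query[q_start - l_len:q_start + w + r_len]
    s_seg = subject[s_start - l_len:s_start + w + r_len]
    return seed + left + right, q_seg, s_seg


def simple_score(a, b):
    """Assumed toy scoring: +2 for identity, -3 otherwise."""
    return 2 if a == b else -3


# The word ITV occurs at index 3 of both the query and a made-up subject.
hsp = extend_seed("GTQITVEDLFYN", "AAQITVEDLWWW", 3, 3, 3, simple_score)
print(hsp)  # (14, 'QITVEDL', 'QITVEDL')
```

An HSP found this way would then be kept only if its score reaches S (or its E value is below the threshold), as the paragraph above states.
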

Query sequence: GTQITVEDLFYNIATRRKALKN

Query words (first ten, W = 3): GTQ TQI QIT ITV TVE VED EDL DLF LFY FYN

Neighborhood words (for ITV): LTV, MTV, ISV, LSV, MSV, IAV, LAV, MAV, ITL, etc.

Lookup table (hash table)

The program builds a table both for the query sequence and for each sequence in the database.

Word size: default = 3 for proteins (2 or 3 allowed); > 7 for blastn (default 11)
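The word list, the neighborhood words and the lookup table can be sketched in a few lines. Assumptions to note: the scores are a hand-copied subset of BLOSUM62 restricted to seven residues (A, I, L, M, S, T, V), so the neighborhood is searched over that reduced alphabet only; the threshold value 10 is an arbitrary choice for illustration; and the function names are hypothetical.

```python
from itertools import product

# Assumed subset of BLOSUM62 (rows and columns in the order of RESIDUES).
RESIDUES = "AILMSTV"
_ROWS = {
    "A": [4, -1, -1, -1, 1, 0, 0],
    "I": [-1, 4, 2, 1, -2, -1, 3],
    "L": [-1, 2, 4, 2, -2, -1, 1],
    "M": [-1, 1, 2, 5, -1, -1, 1],
    "S": [1, -2, -2, -1, 4, 1, -2],
    "T": [0, -1, -1, -1, 1, 5, 0],
    "V": [0, 3, 1, 1, -2, 0, 4],
}
SCORE = {(a, b): _ROWS[a][j] for a in RESIDUES for j, b in enumerate(RESIDUES)}


def query_words(seq, w=3):
    """All overlapping words of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def neighborhood(word, threshold):
    """Words over RESIDUES scoring at least threshold against word.

    The word itself must only contain residues from RESIDUES.
    """
    hits = []
    for cand in product(RESIDUES, repeat=len(word)):
        score = sum(SCORE[(a, b)] for a, b in zip(word, cand))
        if score >= threshold:
            hits.append(("".join(cand), score))
    return sorted(hits, key=lambda h: (-h[1], h[0]))


def build_index(seq, w=3):
    """Hash table mapping each word to the positions where it starts."""
    index = {}
    for pos, word in enumerate(query_words(seq, w)):
        index.setdefault(word, []).append(pos)
    return index


words = query_words("GTQITVEDLFYNIATRRKALKN")
neighbors = dict(neighborhood("ITV", threshold=10))
index = build_index("GTQITVEDLFYNIATRRKALKN")
print(words[:10])
print(neighbors)
```

With a lower threshold more of the slide's examples (ISV, IAV and so on) enter the neighborhood, which is the sensitivity versus speed trade-off controlled by T.
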

Scoring Matrices

•  BLOSUM Matrices
    -  Developed by Henikoff & Henikoff (1992)
    -  BLOcks SUbstitution Matrix
    -  Derived from the BLOCKS database

•  PAM Matrices
    -  Developed by Schwartz and Dayhoff (1978)
    -  Point Accepted Mutation
    -  Derived from manual alignments of closely related proteins

PAM versus BLOSUM

PAM:

•  First useful scoring matrix for proteins

•  Assumed a Markov model of evolution (i.e. all sites equally mutable and independent)

•  Derived from small, closely related proteins with ~15% divergence

BLOSUM:

•  Newer

•  No evolutionary model is assumed

•  Built from PROSITE-derived sequence blocks

•  Uses a much larger, more diverse set of protein sequences (30% - 90% ID)

PAM versus BLOSUM

PAM:

•  Higher PAM numbers to detect more remote sequence similarities

•  Lower PAM numbers to detect high similarities

•  1 PAM ~ 1 million years of divergence

•  Errors in PAM 1 are scaled 250X in PAM 250

BLOSUM:

•  Lower BLOSUM numbers to detect more remote sequence similarities

•  Higher BLOSUM numbers to detect high similarities

•  Sensitive to structural and functional substitution

•  Errors in BLOSUM arise from errors in alignment

PAM Matrices

•  PAM 40 - prepared by multiplying PAM 1 by itself a total of 40 times; best for short alignments with high similarity

•  PAM 120 - prepared by multiplying PAM 1 by itself a total of 120 times; best for general alignment

•  PAM 250 - prepared by multiplying PAM 1 by itself a total of 250 times; best for detecting distant sequence similarity

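"Multiplying PAM 1 by itself" is repeated matrix multiplication of a mutation probability matrix. The sketch below shows the idea on a made-up three-letter alphabet; `TOY_PAM1` and its values are invented for illustration and are not the real 20 x 20 PAM 1 matrix.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """m multiplied by itself n times (n = 0 gives the identity matrix)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical mutation probabilities: each row sums to 1 and a residue is
# conserved 99% of the time per step, mimicking "1 accepted change per 100".
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam40 = mat_pow(TOY_PAM1, 40)
toy_pam250 = mat_pow(TOY_PAM1, 250)
print([round(row[i], 3) for i, row in enumerate(toy_pam40)])   # diagonal after 40 steps
print([round(row[i], 3) for i, row in enumerate(toy_pam250)])  # diagonal after 250 steps
```

The diagonal (the probability that a residue is unchanged) shrinks as the power grows, which is why higher PAM numbers suit more distant comparisons. The published scoring matrices are log-odds values derived from such probabilities, a step this sketch leaves out.
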
BLOSUM Matrices

•  BLOSUM 90 - prepared from BLOCKS sequences with >90% sequence ID; best for short alignments with high similarity

•  BLOSUM 62 - prepared from BLOCKS sequences with >62% sequence ID; best for general alignment (default)

•  BLOSUM 30 - prepared from BLOCKS sequences with >30% sequence ID; best for detecting weak local alignments

Scoring matrices: used to assign a score to each pair of aligned amino acid residues in BLAST.

What is the score of the following alignment?

ACQNE
| | + +
ACBPD

Score = 4 + 9 + 0 - 2 + 2 = 13
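The sum above is a column-by-column table lookup. A minimal sketch, assuming the five pair scores are exactly the terms of the slide's sum (they agree with BLOSUM62 as far as I can tell, but treat that as an assumption); `PAIR_SCORES` and `score_alignment` are hypothetical names.

```python
# Pair scores copied from the terms of the worked sum: 4 + 9 + 0 - 2 + 2.
PAIR_SCORES = {
    ("A", "A"): 4,
    ("C", "C"): 9,
    ("Q", "B"): 0,
    ("N", "P"): -2,
    ("E", "D"): 2,
}


def score_alignment(top, bottom, pair_scores):
    """Sum the substitution score of each aligned column (no gaps handled)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        # Substitution matrices are symmetric, so try both orientations.
        if (a, b) in pair_scores:
            total += pair_scores[(a, b)]
        else:
            total += pair_scores[(b, a)]
    return total


print(score_alignment("ACQNE", "ACBPD", PAIR_SCORES))  # 13
```
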

BLAST Access

•  NCBI BLAST
    -  http://www.ncbi.nlm.nih.gov/BLAST/

•  European Bioinformatics Institute BLAST
    -  http://www.ebi.ac.uk/blastall/
    -  http://www.ebi.ac.uk/blast2/

Different Flavours of BLAST

•  BLASTP - protein query against protein DB

•  BLASTN - DNA/RNA query against GenBank (DNA)

•  BLASTX - DNA query translated in 6 frames against protein DB

•  TBLASTN - protein query against GenBank translated in 6 frames

•  TBLASTX - DNA query translated in 6 frames against GenBank translated in 6 frames

•  PSI-BLAST - protein ‘profile’ query against protein DB

•  PHI-BLAST - protein pattern against protein DB

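The first five flavours differ only in the type of the query, the type of the database, and which side gets translated. A small lookup sketch, with the hypothetical helper `choose_program`, makes that explicit:

```python
# (query type, database type) -> program, for the non-iterative flavours.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query translated in 6 frames
    ("protein", "nucleotide"): "tblastn",   # database translated in 6 frames
}


def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST flavour from the sequence types involved."""
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"  # both sides translated in 6 frames
    return PROGRAMS[(query_type, db_type)]


print(choose_program("nucleotide", "protein"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```
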
Other BLAST Services

•  MEGABLAST - BLASTN runs MEGABLAST by default; in previous versions they were separate

•  RPS-BLAST - conserved domain detection

•  BLAST 2 Sequences - for performing pairwise alignments of 2 chosen sequences

•  Genomic BLAST - for alignments against select human, microbial or malarial genomes

•  VecScreen - for detecting cloning vector contamination in sequenced data

Running NCBI BLAST

•  Choose a range of interest in the sequence with “set subsequences” (not usually used)

•  Select the database from the pull-down menu (usually choose nr = non-redundant)

•  Keep the CD Search check box on

•  Leave “Options” unchanged (use defaults)

•  Go to the “Format” menu and adjust the number of descriptions and alignments as desired

Running NCBI BLAST: Select Database

Conserved Domain Database

•  Contains a collection of pre-identified functional or structural domains

•  Derived from the Pfam and Smart databases as well as other sources

•  Uses Reverse Position Specific BLAST (RPS-BLAST) to perform the search

•  Query sequence is compared to a PSSM derived from each of the aligned domains

Running NCBI BLAST: Click BLAST!

Formatting Results

BLAST Format Options

BLAST Output

BLAST Parameters

•  Identities - number and % of exact residue matches

•  Positives - number and % of similar plus identical matches

•  Gaps - number and % of gaps introduced

•  Score - summed HSP score (S)

•  Bit Score - a normalized score (S’)

•  Expect (E) - expected number of HSP alignments arising by chance

•  P - probability of getting a score > X

•  T - minimum word score (threshold)

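The relationship between Score, Bit Score, Expect and P follows the Karlin-Altschul statistics: S' = (lambda * S - ln K) / ln 2, E = m * n * 2^(-S'), and P = 1 - e^(-E), where m and n are the query and database lengths. The sketch below applies these formulas; the values of `lam` and `k` in the example are illustrative assumptions, since the real ones depend on the scoring matrix and gap penalties and are printed in the BLAST report.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """E = m * n * 2^(-S'): chance alignments expected at this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e_value):
    """P = 1 - exp(-E): probability of at least one such chance alignment."""
    return -math.expm1(-e_value)


# Assumed, illustrative statistical parameters and search-space sizes.
bits = bit_score(raw_score=100, lam=0.267, k=0.041)
e = expect_value(bits, query_len=250, db_len=1_000_000)
print(round(bits, 1), e, p_value(e))
```

Each additional bit halves E, and for small E the P value is practically equal to E.
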
BLAST - Rules of Thumb

•  Expect (E-value) is the number of BLAST alignments with a given Score that are expected to be seen simply due to chance

•  Don’t trust a BLAST alignment with an Expect value > 10^-10 (the grey zone lies between 10^-10 and 1)

•  Expect and Score are related, but Expect contains more information. Note that % Identities is more useful than the bit Score

•  Recall Doolittle’s Curve (%ID vs. length, shown next): %ID > 30 - numres/50

•  If uncertain about a hit, perform a PSI-BLAST search

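The rule "%ID > 30 - numres/50" is simple arithmetic. A minimal sketch, with hypothetical helper names, that applies it exactly as written on the slide:

```python
def min_trusted_identity(num_residues):
    """Identity threshold from the slide's rule: 30 - numres/50 (in percent)."""
    return 30.0 - num_residues / 50.0


def passes_rule(percent_identity, num_residues):
    """True when the alignment's %ID lies above the threshold for its length."""
    return percent_identity > min_trusted_identity(num_residues)


print(min_trusted_identity(100))   # 28.0
print(passes_rule(35.0, 100))
print(passes_rule(25.0, 100))
```

Longer alignments tolerate a lower percent identity, which matches the downward slope of the curve.
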
Doolittle’s Curve: Evolutionary Distance vs. Percent Sequence Identity

[Figure: Sequence Identity (%) on the vertical axis against Number of Residues (0 to 400) on the horizontal axis, with regions labeled “Reliable”, “Doubt” and “Twilight Zone”.]

Pairwise sequence similarity versus alignment length. Rost B, Protein Eng. 1999;12:85-94. © Oxford University Press

[Figure: homologues confirmed by 3D structure versus false positives.]

Getting the Most from BLAST

BLAST Options

•  Composition-based statistics (Yes)

•  Sequence Complexity Filter (Yes / No)

•  Expect (E) value (10)

•  Word Size (3)

•  Substitution or Scoring Matrix

•  Gap Insertion Penalty (11)

•  Gap Extension Penalty (1)

Composition Statistics

•  Permits calculated E (Expect) values to account for the amino acid composition of queries and database hits

•  Improves accuracy and reduces false positives

•  Effectively conducts a different scoring procedure for each sequence in the database

LCRs (low-complexity regions)

•  Watch out for…
    -  transmembrane or signal peptide regions
    -  coiled-coil regions
    -  short amino acid repeats (collagen, elastin)
    -  homopolymeric repeats

•  BLAST uses SEG to mask amino acids

•  BLAST uses DUST to mask bases

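To give a feel for what masking does, here is a toy filter based on Shannon entropy in a sliding window. It is not the SEG or DUST algorithm; the window length, the entropy cutoff and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue inside a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Made-up sequence with a collagen-like PPG repeat in the middle.
sample = "MKTAYIAKQRQISFVK" + "PPGPPGPPGPPG" + "SHFSRQLEERLGLIEV"
print(mask_low_complexity(sample))
```

Masked residues cannot seed word hits, so a repeat like this no longer attracts unrelated database sequences that merely share the same biased composition.
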
Still Confused? http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Conclusions

•  BLAST is the most important program in bioinformatics (maybe all of biology)

•  BLAST is based on statistical principles (key to its speed and sensitivity)

•  A basic understanding of its principles is key for using/interpreting BLAST output

•  Use BLASTN or MEGABLAST for DNA

•  Use BLASTP or PSI-BLAST for protein searches

Scraping the Bottom of the Barrel with PSI-BLAST

PSI-BLAST Algorithm

•  Perform an initial alignment with BLAST using the BLOSUM 62 substitution matrix

•  Construct a multiple alignment from the matches

•  Prepare a position-specific scoring matrix (PSSM)

•  Use the PSSM profile as the scoring matrix for a second BLAST run against the database

•  Repeat the alignment, PSSM and search steps until convergence

Profiles for families of sequences can be built from MSAs

MSA (four sequences, positions 1 to 3):

          1  2  3
Seq 1     C  G  -
Seq 2     A  A  T
Seq 3     A  A  A
Seq 4     -  A  -

Profile (frequency of each character at each position):

          1     2     3
A       50%   75%   25%
C       25%    0%    0%
T        0%    0%   25%
G        0%   25%    0%
-       25%    0%   50%

Note: While profiles can be used for any kind of sequence data, we’ll focus on protein sequences
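Building a profile from an MSA is a per-column count. The sketch below uses four three-column sequences whose column frequencies reproduce the percentages on the slide (50%, 25%, 0%, ...); the order of the rows is an assumption, and `column_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def column_frequencies(msa):
    """Return one {character: frequency} dict per alignment column."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(msa)
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append({char: count / n for char, count in counts.items()})
    return profile


# Assumed rows; '-' marks a gap.
msa = ["CG-", "AAT", "AAA", "-A-"]
profile = column_frequencies(msa)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Characters that never occur in a column are simply absent from its dict, which corresponds to the 0% entries.
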

Profiles

•  Profile: a table that lists the frequencies of each amino acid at each position of a protein sequence.

•  Frequencies are calculated from an MSA containing a domain of interest

•  Allows us to identify a consensus sequence

•  The derived scoring scheme allows us to align a new sequence to the profile
    -  Profile can be used in database searches
    -  Find new sequences that match the profile

•  Profiles are also used to compute multiple alignments heuristically
    -  Progressive alignment

Profiles: Position-Specific Scoring Matrix (PSSM)

•  To compare a sequence to a profile, we need to assign a score for each amino acid

•  The score of the profile for amino acid a at position p is

$$M(p,a) = \sum_{b=1}^{20} f(p,b) \cdot s(a,b)$$

where
    -  f(p,b) = frequency of amino acid b in position p
    -  s(a,b) is the score of (a,b) (from, e.g., BLOSUM or PAM)

Profiles: PSSM

Gribskov et al. PNAS. 84 (13): 4355 (1987)

Insertion/deletion penalty

Profiles: Consensus Sequence

•  A consensus residue C(p) is generated at each position of the profile to aid the display of alignments of target sequences with the profile.

•  The consensus residue c is the amino acid at p that has the highest score M(p,c).
    -  c is the amino acid most mutationally similar to all the aligned residues of the probe sequences at p, rather than the most common one
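The difference between "most common" and "most mutationally similar" shows up in a small example. This sketch assumes a made-up column with frequencies L 0.4, V 0.3, I 0.3 and uses what I believe are the BLOSUM62 scores for I, L and V; candidates are restricted to those three residues, and the function names are hypothetical.

```python
# Assumed BLOSUM62 entries for the three residues used here.
SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
}


def s(a, b):
    """Symmetric lookup of the substitution score."""
    return SUBSET[(a, b)] if (a, b) in SUBSET else SUBSET[(b, a)]


def profile_score(freqs, a):
    """M(p, a): frequency-weighted score of residue a against one column."""
    return sum(f * s(a, b) for b, f in freqs.items())


def consensus(freqs, alphabet):
    """Residue with the highest M(p, a) among the candidates in alphabet."""
    return max(alphabet, key=lambda a: profile_score(freqs, a))


column = {"L": 0.4, "V": 0.3, "I": 0.3}   # hypothetical column frequencies
most_common = max(column, key=column.get)
consensus_residue = consensus(column, "ILV")
print(most_common, consensus_residue)
```

Here L is the most frequent residue, yet I scores highest because it substitutes well for both L and V.
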

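Aligning a sequence to a profile

MSA:

        1  2  3  4  5
        K  L  M  -  K
        K  L  K  L  K
        K  M  M  L  -
        M  L  -  L  M

Profile (frequency of each character per column, computed from the MSA):

        1    2    3    4    5
K     .75       .25       .50
L          .75       .75
M     .25  .25  .50       .25
-               .25  .25  .25

New sequence: K K L L M

Align with profile:

profile position:  1  -  2  3  4  5
new sequence:      K  K  L  -  L  M

Resulting alignment:

K  K  L  -  L  M
K  -  L  M  -  K
K  -  L  K  L  K
K  -  M  M  L  -
M  -  L  -  L  M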

Scoring a sequence-to-profile alignment

•  Score each column separately according to the PSSM

•  Each character contributes to the score, weighted by its frequency

        1    2    3    4    5
K     .75       .25       .50
L          .75       .75
M     .25  .25  .50       .25
-               .25  .25  .25

profile position:  1  -  2  3  4  5
new sequence:      K  K  L  -  L  M

Column 1 score: 0.75 x s(K,K) + 0.25 x s(K,M)
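Column-by-column scoring can be sketched as below. The profile is recomputed from the MSA shown on the slide; the K/L/M scores are what I believe to be the BLOSUM62 values, and the gap handling (a flat `GAP_SCORE` of -4, zero for gap against gap) is a hypothetical choice, since the slide does not specify one. Under the formula M(p,a), column 1 contributes 0.75 x s(K,K) plus 0.25 x s(K,M).

```python
from collections import Counter

GAP = "-"
GAP_SCORE = -4  # hypothetical penalty for a residue aligned to a gap
PAIRS = {
    ("K", "K"): 5, ("L", "L"): 4, ("M", "M"): 5,
    ("K", "L"): -2, ("K", "M"): -1, ("L", "M"): 2,
}


def s(a, b):
    """Substitution score, with assumed gap handling."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return GAP_SCORE
    return PAIRS[(a, b)] if (a, b) in PAIRS else PAIRS[(b, a)]


def build_profile(msa):
    """Per-column character frequencies of the MSA."""
    n = len(msa)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*msa)]


def column_score(freqs, a):
    """Frequency-weighted score of character a against one profile column."""
    return sum(f * s(a, b) for b, f in freqs.items())


def score_alignment(profile, aligned_seq, positions):
    """Sum column scores; position None means an insertion relative to the profile."""
    total = 0.0
    for a, pos in zip(aligned_seq, positions):
        freqs = {GAP: 1.0} if pos is None else profile[pos - 1]
        total += column_score(freqs, a)
    return total


profile = build_profile(["KLM-K", "KLKLK", "KMML-", "ML-LM"])
column_1 = column_score(profile[0], "K")
total = score_alignment(profile, "KKL-LM", [1, None, 2, 3, 4, 5])
print(column_1, total)
```
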

Position Specific Score Matrix (PSSM)

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4

Figure annotations: "Serine is scored differently at these two positions"; "Active-site nucleophile".

PSI-BLAST: Press Iterate!

PSI-BLAST

•  For protein sequences ONLY

•  Much more sensitive than BLAST

•  Slower (iterative process)

•  SHOULD BE YOUR FIRST CHOICE when analyzing a "hypothetical" or very different sequence


Conclusions

•  BLAST is the most important program in bioinformatics (maybe all of biology)

•  BLAST is based on sound statistical principles (key to its speed and sensitivity)

•  A basic understanding of its principles is key for using/interpreting BLAST output

•  Use BLASTN or MEGABLAST for DNA

•  Use PSI-BLAST for protein searches

Acknowledgments

Students: Rafael and Thayani

Reference: Glória Braz; Scientific Computing Lab, Univ. of Beirut, http://staff.aub.edu.lb/~webbic/

Funding: INCT Entomologia Molecular (CNPq + FAPERJ); Grupos Emergentes (FAPERJ); PRONEX Dengue (CNPq); APQ1 (FAPERJ); IC (FAPERJ); PIBIC (UFRJ)
