the em algorithm · 2021. 1. 13. · anova at marker loci also known as marker regression. split...
Post on 27-Jan-2021
5 Views
Preview:
TRANSCRIPT
-
The EM algorithmQTL mapping with a cure model
Karl Broman
Biostatistics & Medical Informatics, UW–Madison
kbroman.orggithub.com/kbroman
@kwbromanCourse web: kbroman.org/AdvData
https://kbroman.orghttps://kbroman.orghttps://github.com/kbromanhttps://twitter.com/kwbromanhttps://kbroman.org/AdvData
-
Intercross
P1 P2
F1 F1
F2
2
-
QTL mapping
0
2
4
6
8
Chromosome
LOD
sco
re
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X
BB BR RR
0.80.91.01.1
3
-
Phenotype data
Body weight
30 35 40 45
Heart weight
80 100 120 140 160 180 200
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
● ●
●
●
●●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
30 35 40 45
80
100
120
140
160
180
200
Body weight
Hea
rt w
eigh
t
Sugiyama et al. (2002) Physiol Genomics 10:5–12 4
-
Genotype data
20 40 60 80
50
100
150
Markers
Indi
vidu
als
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
5
-
Genetic map
80
60
40
20
0
Chromosome
Loca
tion
(cM
)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
6
-
ANOVA at marker loci
▶ Also known as markerregression.
▶ Split mice into groups accordingto genotype at a marker.
▶ Do a t-test / ANOVA.▶ Repeat for each marker. ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
30
35
40
45
Genotype at D15Mit184B
ody
wei
ght
AA AB BB
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Genotype at D2Mit92
AA AB BB
7
-
ANOVA at marker loci
Advantages▶ Simple.▶ Easily incorporates covariates.▶ Easily extended to more complex
models.▶ Doesn’t require a genetic map.
Disadvantages▶ Must exclude individuals with
missing genotype data.▶ Imperfect information about QTL
location.▶ Suffers in low density scans.▶ Only considers one QTL at a
time.
8
-
Interval mappingLander & Botstein (1989)
▶ Assume a single QTL model.▶ Each position in the genome, one at a time, is posited as the putative
QTL.▶ Let g = 0/1/2 if the (unobserved) QTL genotype is AA/AB/BB.
Assume y|g ∼ N(µg, σ)▶ Given genotypes at linked markers, y ∼ mixture of normal dist’ns with
mixing proportions Pr(g | marker data)
9
-
Backcross
P1 P2
F1P1
BC
10
-
Genotype probabilities
A A A A A B
?
Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors
Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)
11
-
Genotype probabilities
A A A A A B
?
Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors
Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)
11
-
Genotype probabilities
B A A − B A
?
Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors
Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)
11
-
Genotype probabilities
B A A − B A
?
Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors
Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)
11
-
Genotype probabilities
A B − B A A
?
Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors
Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)
11
-
Genotype probabilities
A B − B A A
?
Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors
Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)
11
-
The normal mixtures
M1 M2Q
7 cM 13 cM
▶ Two markers separated by 20 cM, with theQTL closer to the left marker.
▶ The figure at right shows the distributions ofthe phenotype conditional on the genotypesat the two markers.
▶ The dashed curves correspond to thecomponents of the mixtures.
20 40 60 80 100
Phenotype
BB/BB
BB/AB
AB/BB
AB/AB
µ0
µ0
µ0
µ0
µ1
µ1
µ1
µ1
12
-
Interval mapping
Let pij = Pr(gi = j|marker data)
yi|gi ∼ N(µgi , σ2)
Pr(yi|marker data, µ0, µ1, σ) =∑
j pij f(yi;µj, σ)where f(y;µ, σ) = exp[−(y − µ)2/(2σ2)]/
√2πσ2
Log likelihood: l(µ0, µ1, σ) =∑
i log Pr(yi|marker data, µ0, µ1, σ)
Maximum likelihood estimates (MLEs) of µ0, µ1, σ:values for which l(µ0, µ1, σ) is maximized.
13
-
EM algorithmE step:
Let w(k)ij = Pr(gi = j|yi,marker data, µ̂(k−1)0 , µ̂
(k−1)1 , σ̂
(k−1))
=pij f(yi;µ̂
(k−1)j ,σ̂
(k−1))∑j pij f(yi;µ̂
(k−1)j ,σ̂
(k−1))
M step:Let µ̂(k)j =
∑i yiw
(k)ij /
∑i w
(k)ij
σ̂(k) =√∑
i∑
j w(k)ij (yi − µ̂
(k)j )
2/n
The algorithm:
Start with w(1)ij = pij; iterate the E & M steps until convergence.
14
-
Interactive illustration
bit.ly/em_alg15
https://bit.ly/em_alghttps://bit.ly/em_alg
-
LOD scoresThe LOD score is a measure of the strength of evidence for the presence ofa QTL at a particular location.
LOD(λ) = log10 likelihood ratio comparing the hypothesis of aQTL at position λ versus that of no QTL
= log10{
Pr(y|QTL at λ,µ̂0λ,µ̂1λ,σ̂λ)Pr(y|no QTL,µ̂,σ̂)
}µ̂0λ, µ̂1λ, σ̂λ are the MLEs, assuming a single QTL at position λ.
No QTL model:The phenotypes are independent and identically distributed (iid) N(µ, σ2).
16
-
Interactive plot
bit.ly/D3lod17
https://bit.ly/D3lodhttps://bit.ly/D3lod
-
Interval mapping
Advantages▶ Takes proper account of missing
data.▶ Allows examination of positions
between markers.▶ Gives improved estimates of
QTL effects.▶ Provides pretty graphs.
Disadvantages▶ Increased computation time.▶ Requires specialized software.▶ Difficult to generalize.▶ Only considers one QTL at a
time.
18
-
Survival after Listeria infection
Survival time (hours)
0 50 100 150 200 250
~1/3 of 116 mice survived to 264 hrs
19
-
Normal assumption in ANOVA
▶ ANOVA is remarkably robust▶ Transformation▶ Rank-based methods▶ Specially-tailored models (e.g. GLM)
20
-
Censoring?
21
-
Measurements with a spike at 0
▶ Mass of gallstones▶ Gene expression, when a gene might be turned off▶ Microbiome data, when a microbe might be absent▶ Area of garage
22
-
Two-part (“cure”) model
▶ Let zi = 1 if mouse i survived the infection
Let yi = survival time
▶ Assume Pr(zi|g) = πg
Assume yi|zi = 0, g ∼ Normal(µg, σ)
Assume {(yi, zi,g)} mutually independent
23
-
EM algorithm
E step M step
24
-
Tests
▶ πAA = πAB = πBB▶ µAA = µAB = µBB▶ πAA = πAB = πBB and µAA = µAB = µBB
25
-
LOD curves
0
2
4
6
Chromosome
LOD
sco
re
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X
π, µπµ
26
-
QTL effects
6.4
6.6
6.8
7.0
AA AB BB
●
●
●
log2
sur
viva
l tim
e
Genotype
Chr 1
6.4
6.6
6.8
7.0
AA AB BB
●
●
●
log2
sur
viva
l tim
e
Genotype
Chr 5
6.4
6.6
6.8
7.0
AA AB BB
●
●
●
log2
sur
viva
l tim
e
Genotype
Chr 13
0
20
40
60
80
AA AB BB
●
●
●
Per
cent
sur
vive
d
Genotype
Chr 1
0
20
40
60
80
AA AB BB
●
●
●
Per
cent
sur
vive
d
Genotype
Chr 5
0
20
40
60
80
AA AB BB
●
● ●
Per
cent
sur
vive
d
Genotype
Chr 13
27
-
Lesson
▶ Don’t just cram your data into the standard approach.
28
-
Lessons
▶ Don’t just cram your data into the standard approach.▶ Cramming your data into the standard approach might work fine.
28
-
Standard approach
0
2
4
6
Chromosome
LOD
sco
re
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X
29
-
Standard approach
0
2
4
6
Chromosome
LOD
sco
re
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X
29
-
References▶ Lander ES, Botstein D (1989) Mapping Mendelian factors underlying
quantitative traits using RFLP linkage maps. Genetics 121:185-199PMCID: PMC1203601
▶ Broman KW (2001) Review of statistical methods for QTL mapping inexperimental crosses. Lab Animal 30(7):44-52PMID: 11469113
▶ Boyartchuk VL, et al. (2001) Multigenic control of Listeriamonocytogenes susceptibility in mice. Nat Genet 27:259-260doi:10.1038/85812
▶ Broman KW (2003) Mapping quantitative trait loci in the case of aspike in the phenotype distribution. Genetics 163:1169-1175PMCID: PMC1462498
30
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1203601https://www.ncbi.nlm.nih.gov/pubmed/11469113https://doi.org/10.1038/85812https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1462498
top related