the em algorithm · 2021. 1. 13. · anova at marker loci also known as marker regression. split...

The EM algorithmQTL mapping with a cure model

Karl Broman

Biostatistics & Medical Informatics, UW–Madison

kbroman.orggithub.com/kbroman

@kwbromanCourse web: kbroman.org/AdvData

https://kbroman.orghttps://kbroman.orghttps://github.com/kbromanhttps://twitter.com/kwbromanhttps://kbroman.org/AdvData

Intercross

P1 P2

F1 F1

F2

2

QTL mapping

0

2

4

6

8

Chromosome

LOD

sco

re

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

BB BR RR

0.80.91.01.1

3

Phenotype data

Body weight

30 35 40 45

Heart weight

80 100 120 140 160 180 200

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●●

● ●

●

●

●●●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

30 35 40 45

80

100

120

140

160

180

200

Body weight

Hea

rt w

eigh

t

Sugiyama et al. (2002) Physiol Genomics 10:5–12 4

Genotype data

20 40 60 80

50

100

150

Markers

Indi

vidu

als

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

5

Genetic map

80

60

40

20

0

Chromosome

Loca

tion

(cM

)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

6

ANOVA at marker loci

▶ Also known as markerregression.

▶ Split mice into groups accordingto genotype at a marker.

▶ Do a t-test / ANOVA.▶ Repeat for each marker. ●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

30

35

40

45

Genotype at D15Mit184B

ody

wei

ght

AA AB BB

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

Genotype at D2Mit92

AA AB BB

7

ANOVA at marker loci

Advantages▶ Simple.▶ Easily incorporates covariates.▶ Easily extended to more complex

models.▶ Doesn’t require a genetic map.

Disadvantages▶ Must exclude individuals with

missing genotype data.▶ Imperfect information about QTL

location.▶ Suffers in low density scans.▶ Only considers one QTL at a

time.

8

Interval mappingLander & Botstein (1989)

▶ Assume a single QTL model.▶ Each position in the genome, one at a time, is posited as the putative

QTL.▶ Let g = 0/1/2 if the (unobserved) QTL genotype is AA/AB/BB.

Assume y|g ∼ N(µg, σ)▶ Given genotypes at linked markers, y ∼ mixture of normal dist’ns with

mixing proportions Pr(g | marker data)

9

Backcross

P1 P2

F1P1

BC

10

Genotype probabilities

A A A A A B

?

Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors

Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)

11


B A A − B A

?



11


A B − B A A

?



11

The normal mixtures

M1 M2Q

7 cM 13 cM

▶ Two markers separated by 20 cM, with theQTL closer to the left marker.

▶ The figure at right shows the distributions ofthe phenotype conditional on the genotypesat the two markers.

▶ The dashed curves correspond to thecomponents of the mixtures.

20 40 60 80 100

Phenotype

BB/BB

BB/AB

AB/BB

AB/AB

µ0

µ0

µ0

µ0

µ1

µ1

µ1

µ1

12

Interval mapping

Let pij = Pr(gi = j|marker data)

yi|gi ∼ N(µgi , σ2)

Pr(yi|marker data, µ0, µ1, σ) =∑

j pij f(yi;µj, σ)where f(y;µ, σ) = exp[−(y − µ)2/(2σ2)]/

√2πσ2

Log likelihood: l(µ0, µ1, σ) =∑

i log Pr(yi|marker data, µ0, µ1, σ)

Maximum likelihood estimates (MLEs) of µ0, µ1, σ:values for which l(µ0, µ1, σ) is maximized.

13

EM algorithmE step:

Let w(k)ij = Pr(gi = j|yi,marker data, µ̂(k−1)0 , µ̂

(k−1)1 , σ̂

(k−1))

=pij f(yi;µ̂

(k−1)j ,σ̂

(k−1))∑j pij f(yi;µ̂

(k−1)j ,σ̂

(k−1))

M step:Let µ̂(k)j =

∑i yiw

(k)ij /

∑i w

(k)ij

σ̂(k) =√∑

i∑

j w(k)ij (yi − µ̂

(k)j )

2/n

The algorithm:

Start with w(1)ij = pij; iterate the E & M steps until convergence.

14

Interactive illustration

bit.ly/em_alg15

https://bit.ly/em_alghttps://bit.ly/em_alg

LOD scoresThe LOD score is a measure of the strength of evidence for the presence ofa QTL at a particular location.

LOD(λ) = log10 likelihood ratio comparing the hypothesis of aQTL at position λ versus that of no QTL

= log10{

Pr(y|QTL at λ,µ̂0λ,µ̂1λ,σ̂λ)Pr(y|no QTL,µ̂,σ̂)

}µ̂0λ, µ̂1λ, σ̂λ are the MLEs, assuming a single QTL at position λ.

No QTL model:The phenotypes are independent and identically distributed (iid) N(µ, σ2).

16

Interactive plot

bit.ly/D3lod17

https://bit.ly/D3lodhttps://bit.ly/D3lod

Interval mapping

Advantages▶ Takes proper account of missing

data.▶ Allows examination of positions

between markers.▶ Gives improved estimates of

QTL effects.▶ Provides pretty graphs.

Disadvantages▶ Increased computation time.▶ Requires specialized software.▶ Difficult to generalize.▶ Only considers one QTL at a

time.

18

Survival after Listeria infection

Survival time (hours)

0 50 100 150 200 250

~1/3 of 116 mice survived to 264 hrs

19

Normal assumption in ANOVA

▶ ANOVA is remarkably robust▶ Transformation▶ Rank-based methods▶ Specially-tailored models (e.g. GLM)

20

Censoring?

21

Measurements with a spike at 0

▶ Mass of gallstones▶ Gene expression, when a gene might be turned off▶ Microbiome data, when a microbe might be absent▶ Area of garage

22

Two-part (“cure”) model

▶ Let zi = 1 if mouse i survived the infection

Let yi = survival time

▶ Assume Pr(zi|g) = πg

Assume yi|zi = 0, g ∼ Normal(µg, σ)

Assume {(yi, zi,g)} mutually independent

23

EM algorithm

E step M step

24

Tests

▶ πAA = πAB = πBB▶ µAA = µAB = µBB▶ πAA = πAB = πBB and µAA = µAB = µBB

25

LOD curves

0

2

4

6

Chromosome

LOD

sco

re

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X

π, µπµ

26

QTL effects

6.4

6.6

6.8

7.0

AA AB BB

●

●

●

log2

sur

viva

l tim

e

Genotype

Chr 1

6.4

6.6

6.8

7.0

AA AB BB

●

●

●

log2

sur

viva

l tim

e

Genotype

Chr 5

6.4

6.6

6.8

7.0

AA AB BB

●

●

●

log2

sur

viva

l tim

e

Genotype

Chr 13

0

20

40

60

80

AA AB BB

●

●

●

Per

cent

sur

vive

d

Genotype

Chr 1

0

20

40

60

80

AA AB BB

●

●

●

Per

cent

sur

vive

d

Genotype

Chr 5

0

20

40

60

80

AA AB BB

●

● ●

Per

cent

sur

vive

d

Genotype

Chr 13

27

Lesson

▶ Don’t just cram your data into the standard approach.

28

Lessons

▶ Don’t just cram your data into the standard approach.▶ Cramming your data into the standard approach might work fine.

28

Standard approach

0

2

4

6

Chromosome

LOD

sco

re

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X

29

References▶ Lander ES, Botstein D (1989) Mapping Mendelian factors underlying

quantitative traits using RFLP linkage maps. Genetics 121:185-199PMCID: PMC1203601

▶ Broman KW (2001) Review of statistical methods for QTL mapping inexperimental crosses. Lab Animal 30(7):44-52PMID: 11469113

▶ Boyartchuk VL, et al. (2001) Multigenic control of Listeriamonocytogenes susceptibility in mice. Nat Genet 27:259-260doi:10.1038/85812

▶ Broman KW (2003) Mapping quantitative trait loci in the case of aspike in the phenotype distribution. Genetics 163:1169-1175PMCID: PMC1462498

30
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1203601https://www.ncbi.nlm.nih.gov/pubmed/11469113https://doi.org/10.1038/85812https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1462498

the em algorithm · 2021. 1. 13. · anova at marker loci also known as marker regression. split...

Documents

artigo de teoria das organizações - administraÇÃo |...

universidade de sÃo paulo - teses.usp.br · abnt nbr 7190...

desempenho do modelo anova comparado a testes …

11 - exercício anova

anova spss

anova ii. anova ii – dados não relacionados utiliza-se a...

anova 2 fatores - prof. ivan balducci

a memória recriada: história e imagem em la jetée (1962)...

mÉtodo anova utilizado para realizar el estudio de

a memória recriada: história e imagem em la jetée (1962)...

análise de variância (anova)

br - anova

anÁlise de variÂncia - anova -...

aula 10. anova análise de variância em spss

one way anova an´alise de variˆancia simples -...

marker as br

anova 2 fatores prof. ivan

anova fator único

aula10 anova 000

programa que calcula anova e tukey