the em algorithm · 2021. 1. 13. · anova at marker loci also known as marker regression. split...

37
The EM algorithm QTL mapping with a cure model Karl Broman Biostatistics & Medical Informatics, UW–Madison kbroman.org github.com/kbroman @kwbroman Course web: kbroman.org/AdvData

Upload: others

Post on 27-Jan-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

  • The EM algorithmQTL mapping with a cure model

    Karl Broman

    Biostatistics & Medical Informatics, UW–Madison

    kbroman.orggithub.com/kbroman

    @kwbromanCourse web: kbroman.org/AdvData

    https://kbroman.orghttps://kbroman.orghttps://github.com/kbromanhttps://twitter.com/kwbromanhttps://kbroman.org/AdvData

  • Intercross

    P1 P2

    F1 F1

    F2

    2

  • QTL mapping

    0

    2

    4

    6

    8

    Chromosome

    LOD

    sco

    re

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

    BB BR RR

    0.80.91.01.1

    3

  • Phenotype data

    Body weight

    30 35 40 45

    Heart weight

    80 100 120 140 160 180 200

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ● ●

    ●●●

    ●●●

    ● ●

    ●●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    30 35 40 45

    80

    100

    120

    140

    160

    180

    200

    Body weight

    Hea

    rt w

    eigh

    t

    Sugiyama et al. (2002) Physiol Genomics 10:5–12 4

  • Genotype data

    20 40 60 80

    50

    100

    150

    Markers

    Indi

    vidu

    als

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

    5

  • Genetic map

    80

    60

    40

    20

    0

    Chromosome

    Loca

    tion

    (cM

    )

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

    6

  • ANOVA at marker loci

    ▶ Also known as markerregression.

    ▶ Split mice into groups accordingto genotype at a marker.

    ▶ Do a t-test / ANOVA.▶ Repeat for each marker. ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ● ●

    ●●

    30

    35

    40

    45

    Genotype at D15Mit184B

    ody

    wei

    ght

    AA AB BB

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ● ●

    ●●

    Genotype at D2Mit92

    AA AB BB

    7

  • ANOVA at marker loci

    Advantages▶ Simple.▶ Easily incorporates covariates.▶ Easily extended to more complex

    models.▶ Doesn’t require a genetic map.

    Disadvantages▶ Must exclude individuals with

    missing genotype data.▶ Imperfect information about QTL

    location.▶ Suffers in low density scans.▶ Only considers one QTL at a

    time.

    8

  • Interval mappingLander & Botstein (1989)

    ▶ Assume a single QTL model.▶ Each position in the genome, one at a time, is posited as the putative

    QTL.▶ Let g = 0/1/2 if the (unobserved) QTL genotype is AA/AB/BB.

    Assume y|g ∼ N(µg, σ)▶ Given genotypes at linked markers, y ∼ mixture of normal dist’ns with

    mixing proportions Pr(g | marker data)

    9

  • Backcross

    P1 P2

    F1P1

    BC

    10

  • Genotype probabilities

    A A A A A B

    ?

    Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors

    Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)

    11

  • Genotype probabilities

    A A A A A B

    ?

    Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors

    Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)

    11

  • Genotype probabilities

    B A A − B A

    ?

    Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors

    Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)

    11

  • Genotype probabilities

    B A A − B A

    ?

    Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors

    Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)

    11

  • Genotype probabilities

    A B − B A A

    ?

    Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors

    Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)

    11

  • Genotype probabilities

    A B − B A A

    ?

    Calculate Pr(g | marker data), assuming▶ No crossover interference▶ No genotyping errors

    Or use the hidden Markov model (HMM) technology▶ To allow for genotyping errors▶ To incorporate dominant markers▶ (Still assume no crossover interference.)

    11

  • The normal mixtures

    M1 M2Q

    7 cM 13 cM

    ▶ Two markers separated by 20 cM, with theQTL closer to the left marker.

    ▶ The figure at right shows the distributions ofthe phenotype conditional on the genotypesat the two markers.

    ▶ The dashed curves correspond to thecomponents of the mixtures.

    20 40 60 80 100

    Phenotype

    BB/BB

    BB/AB

    AB/BB

    AB/AB

    µ0

    µ0

    µ0

    µ0

    µ1

    µ1

    µ1

    µ1

    12

  • Interval mapping

    Let pij = Pr(gi = j|marker data)

    yi|gi ∼ N(µgi , σ2)

    Pr(yi|marker data, µ0, µ1, σ) =∑

    j pij f(yi;µj, σ)where f(y;µ, σ) = exp[−(y − µ)2/(2σ2)]/

    √2πσ2

    Log likelihood: l(µ0, µ1, σ) =∑

    i log Pr(yi|marker data, µ0, µ1, σ)

    Maximum likelihood estimates (MLEs) of µ0, µ1, σ:values for which l(µ0, µ1, σ) is maximized.

    13

  • EM algorithmE step:

    Let w(k)ij = Pr(gi = j|yi,marker data, µ̂(k−1)0 , µ̂

    (k−1)1 , σ̂

    (k−1))

    =pij f(yi;µ̂

    (k−1)j ,σ̂

    (k−1))∑j pij f(yi;µ̂

    (k−1)j ,σ̂

    (k−1))

    M step:Let µ̂(k)j =

    ∑i yiw

    (k)ij /

    ∑i w

    (k)ij

    σ̂(k) =√∑

    i∑

    j w(k)ij (yi − µ̂

    (k)j )

    2/n

    The algorithm:

    Start with w(1)ij = pij; iterate the E & M steps until convergence.

    14

  • Interactive illustration

    bit.ly/em_alg15

    https://bit.ly/em_alghttps://bit.ly/em_alg

  • LOD scoresThe LOD score is a measure of the strength of evidence for the presence ofa QTL at a particular location.

    LOD(λ) = log10 likelihood ratio comparing the hypothesis of aQTL at position λ versus that of no QTL

    = log10{

    Pr(y|QTL at λ,µ̂0λ,µ̂1λ,σ̂λ)Pr(y|no QTL,µ̂,σ̂)

    }µ̂0λ, µ̂1λ, σ̂λ are the MLEs, assuming a single QTL at position λ.

    No QTL model:The phenotypes are independent and identically distributed (iid) N(µ, σ2).

    16

  • Interactive plot

    bit.ly/D3lod17

    https://bit.ly/D3lodhttps://bit.ly/D3lod

  • Interval mapping

    Advantages▶ Takes proper account of missing

    data.▶ Allows examination of positions

    between markers.▶ Gives improved estimates of

    QTL effects.▶ Provides pretty graphs.

    Disadvantages▶ Increased computation time.▶ Requires specialized software.▶ Difficult to generalize.▶ Only considers one QTL at a

    time.

    18

  • Survival after Listeria infection

    Survival time (hours)

    0 50 100 150 200 250

    ~1/3 of 116 mice survived to 264 hrs

    19

  • Normal assumption in ANOVA

    ▶ ANOVA is remarkably robust▶ Transformation▶ Rank-based methods▶ Specially-tailored models (e.g. GLM)

    20

  • Censoring?

    21

  • Measurements with a spike at 0

    ▶ Mass of gallstones▶ Gene expression, when a gene might be turned off▶ Microbiome data, when a microbe might be absent▶ Area of garage

    22

  • Two-part (“cure”) model

    ▶ Let zi = 1 if mouse i survived the infection

    Let yi = survival time

    ▶ Assume Pr(zi|g) = πg

    Assume yi|zi = 0, g ∼ Normal(µg, σ)

    Assume {(yi, zi,g)} mutually independent

    23

  • EM algorithm

    E step M step

    24

  • Tests

    ▶ πAA = πAB = πBB▶ µAA = µAB = µBB▶ πAA = πAB = πBB and µAA = µAB = µBB

    25

  • LOD curves

    0

    2

    4

    6

    Chromosome

    LOD

    sco

    re

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X

    π, µπµ

    26

  • QTL effects

    6.4

    6.6

    6.8

    7.0

    AA AB BB

    log2

    sur

    viva

    l tim

    e

    Genotype

    Chr 1

    6.4

    6.6

    6.8

    7.0

    AA AB BB

    log2

    sur

    viva

    l tim

    e

    Genotype

    Chr 5

    6.4

    6.6

    6.8

    7.0

    AA AB BB

    log2

    sur

    viva

    l tim

    e

    Genotype

    Chr 13

    0

    20

    40

    60

    80

    AA AB BB

    Per

    cent

    sur

    vive

    d

    Genotype

    Chr 1

    0

    20

    40

    60

    80

    AA AB BB

    Per

    cent

    sur

    vive

    d

    Genotype

    Chr 5

    0

    20

    40

    60

    80

    AA AB BB

    ● ●

    Per

    cent

    sur

    vive

    d

    Genotype

    Chr 13

    27

  • Lesson

    ▶ Don’t just cram your data into the standard approach.

    28

  • Lessons

    ▶ Don’t just cram your data into the standard approach.▶ Cramming your data into the standard approach might work fine.

    28

  • Standard approach

    0

    2

    4

    6

    Chromosome

    LOD

    sco

    re

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X

    29

  • Standard approach

    0

    2

    4

    6

    Chromosome

    LOD

    sco

    re

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 171819 X

    29

  • References▶ Lander ES, Botstein D (1989) Mapping Mendelian factors underlying

    quantitative traits using RFLP linkage maps. Genetics 121:185-199PMCID: PMC1203601

    ▶ Broman KW (2001) Review of statistical methods for QTL mapping inexperimental crosses. Lab Animal 30(7):44-52PMID: 11469113

    ▶ Boyartchuk VL, et al. (2001) Multigenic control of Listeriamonocytogenes susceptibility in mice. Nat Genet 27:259-260doi:10.1038/85812

    ▶ Broman KW (2003) Mapping quantitative trait loci in the case of aspike in the phenotype distribution. Genetics 163:1169-1175PMCID: PMC1462498

    30

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1203601https://www.ncbi.nlm.nih.gov/pubmed/11469113https://doi.org/10.1038/85812https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1462498