Loop Parallelization Techniques
(Seminar, MO615, 2011/2s)
Agenda
• Motivation
• DOALL and Spec-DOALL
• DOACROSS and Spec-DOACROSS
• DSWP and Spec-DSWP
• PS-DSWP and Spec-PS-DSWP
• Conclusions
• References
Motivation
• Moore's Law and current architectures
• Transistor trends
• Clock frequency
• Power consumption
• Heat dissipation
• Design complexity
Motivation
• Processes
• Heavy single-threaded application
• Rewrite it?
• Automatic parallelization
• What to parallelize?
• Communication and synchronization costs, etc.
DOALL
for (i = 0; i < 900000; i++) A[i] = 100;
• All iterations are independent
• They can be processed in parallel
DOALL

int *p = &y[7];
for (i = 0; i < N; i++) x[i] += f(p);

• Assuming f introduces no loop-carried dependence
• Simplicity (sync, communication, etc.) and performance
• How to map iterations, cache performance, false sharing
Spec-DOALL

int *p = &y[7];
for (i = 0; i < N; i++) x[i] += f(p);

• Static compilation
• When in doubt... correct code!
• Profiling
• Speculation with HW/SW support
• Scientific and graphics applications
DOACROSS
• General-purpose applications
• Irregular loops
• Irregular control flow
• DOALL is inapplicable
• Spec-DOALL is infeasible (misspeculation)
• Schedules each iteration onto a thread
A: while (condition) {
     statement B;
     statement C;
     statement D;
   }

[Figure: control-flow graph of the loop, with nodes A, B, C, D.]
DOACROSS - Example

[Figure: schedules over cycles 0-13 on two cores. DOACROSS: whole iterations alternate between the cores (A1-D1 on core 1, A2-D2 on core 2, ...), with the loop-carried value passed core-to-core every iteration. DSWP: core 1 runs stages A and C of every iteration while core 2 runs stages B and D, forming a pipeline.]
DOACROSS
• Inter-thread communication on the critical path
• Performance depends on the efficiency of the communication mechanism
• Shared memory and pressure on the memory system
• Speculation and Spec-DOACROSS
DSWP
• General-purpose applications
• Communication and the critical path
[Figure repeated: DOACROSS vs. DSWP schedules over cycles 0-13. In DSWP, core 1 executes stages A and C of every iteration and core 2 executes stages B and D, so inter-core communication is unidirectional and off the critical path.]
DSWP - Algorithm
• Build the dependence graph
• Find the SCCs (if there is only one, stop)
• Coalesce SCCs
• TPP (if no worthwhile partition exists, stop)
• Code splitting
• Flows
DSWP - Example

[Figure, built up over several slides: the loop's dependence graph is constructed, its SCCs are identified, and the SCCs are partitioned between threads T1 and T2.]
DSWP - TPP
• P1, P2, ..., Pn: sets of SCCs
• 1 <= n <= t (t = number of threads)
• Each SCC belongs to exactly one partition
• For every dependence arc (u, v) with u ∈ Pi and v ∈ Pj: i <= j
• Choose the valid partition with the lowest execution time
• Machine-dependent and NP-complete
DSWP - TPP
• Heuristic to maximize load balance
• Cost in cycles
• Latency of each instruction
• Profile
DSWP - TPP
• Estimates the load balance
• Maintains the set of SCCs whose predecessors have all been assigned to some partition
• Takes the largest available SCC
• On a tie, takes the one that reduces the number of outgoing dependences of the current partition
• When closing a partition, accounts for communication and checks whether it pays off
• May terminate the algorithm early
DSWP - TPP

[Figure: candidate partitionings and their resulting speedups: +45%, +48%, +43%, -2%.]
DSWP - Code splitting
• BBs that contain an instruction of Pi, or an instruction on which something in Pi depends
• produce and consume
• Create BBs
• Place the instructions assigned to Pi into the corresponding BBs
• Branch targets and the nearest relevant post-dominator
DSWP - Flows
• Data dependence
• Control dependence
• Forwards the direction of a branch to a duplicated branch
• Memory/sync dependence
• Only a token, to preserve ordering
[Figure: produce/consume flows inserted between the two threads' copies of the loop (L1 and L2).]
DSWP - Results
• IMPACT compiler
• Accurate dual-core Itanium 2 simulator
DSWP - Results

[Figure 6: Performance summary, full-width Itanium 2 baseline. (a) Speedup of DSWP over single-threaded (Fully Automatic vs. Best Manually Directed bars, 1.0-1.6 scale) for 129.compress, 179.art, 181.mcf, 183.equake, 188.ammp, 256.bzip2, adpcmdec, epicdec, jpegenc, wc, and GeoMean. (b) Baseline IPC and per-core DSWP IPC for the same benchmarks.]
ing sections of the program while keeping the caches and branch predictors warm.

The first comparison made was between the dual-threaded, DSWP version of the selected loops against their single-threaded, base versions. In this experiment, the latency to produce/consume a value to/from the synchronization array was set to 1 cycle in our simulator (minimum 2 cycles core-to-core). In order to evaluate both the effectiveness of our partitioning heuristic (from Section 2.2.2) and the potential of better heuristics, Figure 6(a) presents two speedup bars per benchmark loop. The first bar is the fully automated DSWP, using the heuristic. The second one corresponds to the best performing partitioning found by iteratively specifying the desired partitioning to the compiler and measuring its resulting performance. Figure 6(a) shows that in many cases the heuristic found the best partitioning we were able to find in our iterative search. The geometric mean across these benchmark loops is 14.4% and 19.4%, for the automatically created and manually specified partitions respectively. In terms of whole-program speedup, these geometric means translate into 6.6% and 9.2% respectively. The average baseline IPC is 1.65 and the IPC averages for the producer and consumer cores are 0.88 and 1.24 respectively as shown in Figure 6(b). Notice that these IPC numbers do not include the produce and consume instructions inserted by DSWP.

For simplicity, the simulator used did not model the cost of coherence protocol. To gauge the effect of this coherence cost on our results, we analyzed, for all benchmarks, the memory traces of both cores for false-sharings. Notice that true-sharings cannot occur inside the loop because of the memory dependence arcs in the dependence graph. If a load and a store may access the same memory address, there will be arcs in both directions between these instructions, because of the RAW and WAR dependences (one loop-carried, and the other intra-iteration). This makes both instructions part of the same SCC, assigning them to the same core. Future work may relax this condition.

In order to analyze false-sharings, we replayed the memory accesses from the traces in an invalidation-based coherence model offline. Out of the nine applications, only three (181.mcf, 256.bzip2, and jpegenc) exhibited false-sharing. In 181.mcf and jpegenc, the false-sharing was always caused by writes in the consumer core to locations already present in the producer core's L1D cache. While in 181.mcf, the miss-rate of the producer core's L1D went up by 0.01% to 98.62%, in jpegenc, there was no change in the miss-rate of both cores. The reason why the miss-rate is unaffected is because although there is false-sharing, the producer core always runs ahead and accesses any locations it needs before those locations are invalidated by writes to memory from the consumer's core. 256.bzip2 has a slightly more interesting behavior. We find that a particular store from the producer core causes a lot of false-sharing and hence invalidations in the consumer's L1 data cache. Since the consumer trails the producer, any extra latency arising out of such events could adversely affect the consumer thread and ought to have been modeled. However, on examining the code in detail, we found that all these coherence conflicts are caused by false-sharing due to a write to a global variable (bsLive) in the producer core. We promoted this global variable to a register and used the modified version of 256.bzip2 for all experiments. Thus, even without coherence modeling, our results are valid and not overstated.

Figure 7 illustrates the importance of balancing work across threads when partitioning loops. The figure shows the DAG of SCCs for the target loop in 181.mcf. Each SCC is labeled with the number of instructions it contains. Each left-to-right line crossing the DAG illustrates one possible way of partitioning it into two threads. For each possible partitioning, the figure also illustrates the resulting speedup, the corresponding synchronization array occupancy for a sample period and the cumulative cycle distribution at each possible occupancy level. An occupancy of negative one means the corresponding queue is empty and
DSWP - Results

[Figure 9: Performance compatibility and sensitivity analyses, full-width Itanium 2 baseline, same benchmarks as Figure 6. (a) Varying issue widths: speedup (0.50-1.50) of Half-width Base, Half-width DSWP, and Full-width DSWP. (b) Varying communication latencies: speedup (1.0-1.6) of DSWP with 1-, 10-, and 100-cycle inter-core latency.]
DSWP over half-width ST is greater than the speedup of full-width DSWP over full-width ST. This occurs because DSWP sometimes trades ILP for TLP. Thus, the simpler and less powerful a core is, the more pronounced the benefits of DSWP are.
4.4 Sensitivity Analyses: Communication Latency and Queue Size

In order to quantify the importance of communication latency for DSWP, we ran our experiments again with the full-width CMP model, modified to have communication latencies of 10 and 100 cycles between the two cores. We modeled this pipelined delay in the produce instructions, while consume instructions continued to take one cycle (representing queue locality at the receiving side). The results are presented in Figure 9(b) and they show that DSWP is not very sensitive to the communication latency. In fact, this was expected due to the design of the DSWP transformation, as discussed in Section 2.

We also evaluated the impact of queue size on performance by varying the queue size to 8 and 128. We found that DSWP executions are fairly insensitive to queue size, with the mean slowdown with size 8 being 2% and the average speedup with size 128 being 1% compared to 32-element queues. The highest slowdown was 6% and the peak speedup was 7% respectively.
5 Detailed Analysis

As already mentioned, DSWP is not applicable in cases where the loop dependence graph has a single SCC. Additionally, even with multiple SCCs, DSWP does not have good partitioning opportunities if there exist large SCCs that create a thread imbalance. In our experiments, we found that DSWP is generally applicable in spite of this requirement. Nevertheless, a few cases were encountered where this restriction was limiting. This section describes these cases and highlights opportunities to mitigate or eliminate the problem.
(1) for (i=0; i<x_size*y_size; i++) {
(2)   dtemp = result[i] / scale_factor;
(3)   if (dtemp < 0) result[i] = 0;
(4)   else if (dtemp > 255) result[i] = 255;
(5)   else result[i] = (int) (dtemp + 0.5);
    }
Figure 10. Important loop for epicdec.
5.1 Case Study: epicdec loop

Despite epicdec's simplicity, it illustrates the potential of having a more accurate memory analysis for DSWP. The source code of the loop in question is shown on Figure 10. In the IMPACT compiler, this loop is unrolled once. During this process, we notice that both memory loads of the result array (line (2)) become memory dependent on all the stores. Because of this, the resulting dependence graph has only 4 SCCs, with all the loads and stores as part of a single SCC. Despite this, DSWP achieves a speedup. Closer inspection revealed that these were false memory dependences, conservatively inserted by earlier optimizations. A solution proposed in [10] addresses this problem by performing accurate memory analysis at the assembly level.

Better scheduling and partitioning resulting from removal of these false memory dependences gave the DSWP and the base codes a 98% and an 87% speedup over the original base code. We then measured the average instructions per cycle (IPC), for both new versions, and verified that the base had an average of 2.33, while the averages for the DSWP threads were 1.26 and 1.37 respectively. This showed us that DSWP was trading ILP for TLP, and that additional ILP optimizations could potentially improve the results. We then applied more aggressive unrolling (8x), and recompiled both versions once again. The new DSWP'ed version resulted in a 45% speedup relative to the new base version by better utilizing the dual-core resources.
5.2 Case Study: adpcmdec loop

Lack of predicate-aware dependence analysis for DSWP caused spurious dependences to be inserted in our depen-
DSWP - Results
• DSWP wins regardless of issue width
• Half-width DSWP beats the full-width baseline
• Half-width DSWP over half-width base beats full-width DSWP over full-width base
• Trading ILP for TLP (biggest gap on simpler cores)
DSWP - Varying width
• X: only one SCC
Spec-DSWP
• 164.gzip: 38% speedup with 3 threads
• 181.mcf: 2.9x speedup with 4 threads
• Misspeculation detected
PS-DSWP
• Scalability and DOALL
• More cores
• Stage replication
• Pipelines: s-p; s-p-s; p-s (s = sequential stage, p = parallel stage)
PS-DSWP
[Figure: PS-DSWP schedule over cycles 0-9 on five cores — sequential stage A (A1-A9) on core 1, replicated parallel stage B (B1-B7) spread over cores 2-4, sequential stage C (C1-C6) on core 5.]
PS-DSWP - Results

Table 2: Benchmark Details

Benchmark  | Description                          | Loop                     | Hotness | Pipeline stages
ks         | Kernighan-Lin graph partitioning     | FindMaxGpAndSwap (outer) | 98%     | s -> p
otter      | theorem prover for first-order logic | find_lightest_geo_child  | 15%     | s -> p -> s
300.twolf  | placement and routing                | new_dbox_a (outer)       | 30%     | s -> p
456.hmmer  | hidden markov model                  | P7_Viterbi (inner)       | 85%     | p -> s
458.sjeng  | chess program                        | std_eval                 | 26%     | s -> p
[Figure 7: Speedups over single-threaded execution — loop speedup (1.0-3.0) of DSWP vs. PS-DSWP with 3-6 threads for (a) ks, (b) otter, (c) 300.twolf, (d) 458.sjeng, (e) 456.hmmer; (f) effects of chunking on 456.hmmer with 4 threads (CS1, CS1-Perf, CS32, CS32-Perf).]
6. RELATED WORK

Several automatic techniques have been proposed to extract thread-level parallelism. The techniques closest to PS-DSWP are DSWP [13, 16] and loop distribution [11]. As discussed earlier, PS-DSWP is an extension of DSWP to also exploit iteration-level parallelism. PS-DSWP is similar to loop distribution in the sense that both techniques isolate the parts of loop with loop-carried dependences from the parts without these dependences. The main difference between PS-DSWP and loop distribution combined with DOALL is that PS-DSWP allows the execution of the sequential part of the loop to overlap with the DOALL part, using pipeline parallelism. Moreover, loop distribution, as proposed in literature, does not permit arbitrary control flow within the region. Finally, the use of synchronization array in PS-DSWP allows it to operate on uncounted loops.

DOPIPE [7, 14] is another technique that exploits pipeline parallelism. Unlike DSWP, DOPIPE does not handle loops with control flow inside. Davies [7] also proposes executing multiple copies of a DOPIPE stage if that stage contains an inner DOALL loop. The PS-DSWP technique is more general because it can create parallel stages without requiring the operations to be inside an inner DOALL loop.

DOACROSS [5] is another non-speculative technique that can be used to extract iteration-level parallelism on loops with loop-carried dependences. DOACROSS does not permit control flow within the loop. Moreover, DOACROSS does not result in a pipeline of stages. The inter-thread dependences in DOACROSS form a cycle and would be the bottleneck as the latency of inter-core communication increases. Similar to DSWP, inter-core communication latency has negligible impact on the performance of PS-DSWP.

While PS-DSWP is a non-speculative technique, several speculative techniques have been proposed to extract iteration-level parallelism. Thread-level speculation (TLS) [10, 12, 19, 20] techniques speculatively execute consecutive iterations of a loop concurrently, assuming that the dependences between those iterations are infrequent. Inter-iteration dependences that manifest frequently are synchronized. Other speculative parallelization techniques such as LRPD test [17], R-LRPD test [6] and master/slave speculative parallelization [24] have also been proposed to speculatively extract iteration-level parallelism.

Speculative DSWP [23] adds speculation support to DSWP. The addition of speculation can increase the number of pipeline stages
• Perfect cache, chunking of 32, chunking of 32 + perfect cache
• Chunking and loss of parallelism
Spec-PS-DSWP
• Current architectures
• Speculation on memory access locations
• Data speculation
• Versioning
• Control speculation
• Profile
Spec-PS-DSWP
[Figure 9: Scaling of application performance on machine M2 — speedup (up to 22x) vs. number of parallel-stage threads (1-23). Parallel-stage threads are executed on up to 23 cores, after leaving cores free for execution of sequential stages. Benchmarks: 130.li, 164.gzip, 197.parser, 256.bzip2, 456.hmmer, crc32, blackscholes, gimp-oilify-range-check, gimp-nova-range-check.]
are several global data structures that are modified inside an iteration, but they are reset at the end of the iteration to the same values that they had at the beginning of that iteration. The value speculation described in Section 4.2 is used to break the dependences arising out of the stores to these data structures. Also, some branches that are taken under error conditions are speculated to not be taken. The loop speedup is affected by the variability in sentence length and is mainly limited by the number of sentences to parse.

256.bzip2 performs data compression using the Burrows-Wheeler transform. Again, only the compression part is parallelized. The parallelization is similar to that of 164.gzip. In both cases, a variable-sized block array (produced by applying Run-Length Encoding to the input data) is the data structure used to store the blocks on which compression happens. Since the size of this array is unknown statically, very complicated whole-program analyses would be required for static privatization of the data structure. Memory Versioning allows the dynamic privatization of these data structures thus enabling parallelization.

456.hmmer is a computational biology application that searches for patterns in DNA sequences using profile Hidden Markov Models (HMMs). The loop in main_loop_serial is parallelized. Scores are calculated in parallel on sequences which are randomly selected. The Commutative annotation is used to break the dependence inside the random number generator. Following this parallel stage is a sequential stage in which a histogram is computed and the maximum score selected by using max-reduction. The speedup is limited by this sequential phase.

crc32 computes the 32-bit CRC of files specified on the command line. Very little speculation is needed: error conditions that occur during the computation are speculated to not occur. The speedup is limited by the number of files and the variability in the file sizes.

blackscholes is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation. As in crc32, the only speculation needed is that an error will not occur in the pricing. The speedup is limited only by the number of options to price.
5.3.3 An in-depth look at GIMP

gimp-oilify and gimp-nova are artistic transformation filters that are part of GIMP, the GNU Image Manipulation Program. The oilify filter makes an image look like an oil painting. The nova filter inserts a big star and associated light effects in the image. Rows are processed in parallel.

Figure 10 shows the speedup of gimp using Spec-DOALL on machine M1. The try-commit unit based checking algorithm scales up to 5 worker threads. Beyond that, the try-commit unit becomes the bottleneck. There is not enough computation happening in the worker threads in parallel with the replay of loads and stores in the try-commit unit. Micro-architectural profiling reveals that the try-commit unit spends a significant fraction of its execution time stalling due to the number of load-store instructions in the processor pipeline reaching the limit the processor can handle. This overhead is so much on machine M2 that there is no speedup to be gained. Parallelizing the execution of the
[Figure 10: Overhead of the checking algorithm on M1 — speedup (0-8) vs. number of parallel-stage threads (1-8) for gimp-oilify-mt-nocheck, gimp-oilify-trycommit-check, and gimp-oilify-range-check.]
Spec-PS-DSWP

When misspeculation (inter-transaction conflict) is detected, the commit process is notified of this condition. It then takes the following actions:

1. The relevant pages are marked as COW.
2. For each worker, the page table entries of the commit process are copied into the worker's page tables.
3. The TLB of the core on which the worker is executing is flushed so that stale virtual-to-physical address mappings are not reused.
Loads in subsequently executed MTXs will transparently read committed, non-speculative values. Figure 5(b) illustrates this memory rollback operation. Observe that rollback does not involve any costly application-data copying operations. The speculative data is simply discarded. Comparing with Figure 5(a), the page tables have been remapped and the private pages have been discarded.
4. Loop Parallelization With SMTX
The SMTX system developed in the preceding section is implemented as a library. Table 1 presents the interface in two sections. The first section consists of one-time setup and finalization primitives. The second section consists of primitives that are invoked during the execution of SMTXs. The library may be used in conjunction with many parallelization schemes; its use with one scheme is described below.
4.1 Parallelization Scheme
Recent work has shown that there is significant loop-level parallelism in the outermost loops of applications [7, 29, 30, 33]. In most cases, the loop body is decomposed into three phases; dynamic instances of each phase are called tasks. Each phase exhibits a different dependence pattern (ignoring the speculated dependences). The first phase depends only on prior tasks of the same phase. The second phase depends only on the corresponding task of the first phase. The third phase depends on the corresponding task of the second phase and prior tasks of itself. Such a partitioning of the loop body is shown in Figure 3(a); statements B, C, and D constitute the three phases.

Bridges et al. and Thies et al. present a parallelization scheme which incorporates pipelining into the basic loop partitioning discussed above [7, 29]. By inserting decoupling buffers in between the stages of the pipeline, the different stages could be executing simultaneously on different iterations. This technique is called Decoupled Software Pipelining (DSWP) [21]. This base technique is extended with speculation to break dependence-recurrences to create a long pipeline in Spec-DSWP [32]. Speculation is further used to break inter-iteration dependences to expose data-level parallelism, allowing the replication of pipeline stages with no loop-carried dependences. This is called Speculative Parallel-Stage DSWP (Spec-PS-DSWP) [8]. Figure 6 shows this execution model. As described in Section 2, this pipelined parallelization has benefits over TLS techniques. Thus, Spec-PS-DSWP is used as the general parallelization strategy. Note that if the pipeline consists of just one stage and it is a speculatively parallel (DOALL) stage, the execution model degenerates to Speculative DOALL [18, 33].

Loop parallelization is done as follows: The speculative part of the loop is wrapped in an MTX, with each iteration split into multiple subTXs (shown in Figure 6 as similarly shaded ovals). These subTXs are executed by the stages of the Spec-PS-DSWP pipeline. To orchestrate speculative execution, Spec-PS-DSWP relies on a commit stage (Figure 6, Commit Stage). The functionality needed in the commit stage is already encapsulated by the commit unit's functionality in the SMTX system described in the previous section.
[Figure 6: Pipelined execution with stage replication over cycles 0-6 on five cores — Stage 1 (B.1-B.6) on core 1, replicated Stage 2 (C.1-C.5) on cores 2-3, Stage 3 (D.1-D.3) on core 4, and the commit stage (commit.1, commit.2) on core 5. A dashed horizontal line indicates the time at which all subTXs in an MTX have finished and the MTX is ready to commit.]
 1 status CONFLICT_DETECTOR(version) {
 2   for (each worker that entered this version) {
 3     do {
 4       status = ver_checkForConflict(worker);
 5     } while (status == CONTINUE);
 6     if (status == MISSPEC)
 7       return CONFLICT;
 8   }
 9   return NO_CONFLICT;
10 }
11
12 status ver_checkForConflict(worker) {
13   token = sq_consume(q[worker]);
14   value = sq_consume(q[worker]);
15   if (token) {
16     addr = mask1(token);
17     isRead = mask2(token);
18     if (isRead) {
19       if (*addr != value) return MISSPEC;
20     } else {
21       *addr = value;
22     }
23     return CONTINUE;
24   }
25   return BREAK;
26 }

Figure 7: A value-based conflict detector is used to detect memory conflicts between transactions.
4.2 Conflict Detection
In the context of loops, there are flow, anti, or output dependences between the loop iterations. The use of memory versioning ensures that MTXs write to different physical versions of the same logical memory address. This removes all false (anti and output) dependences without code transformations.

Loads and stores that may alias forward the ⟨addr, value⟩ tuple by calling ver_read and ver_write (Table 1). The conflict detector does a sequential replay of this memory trace in its private memory to detect whether flow-dependence violations have occurred (Figure 7: lines 2-8). For example, let a store operation in MTX1 store value v1 in location l, and a load operation in MTX2
Spec-PS-DSWP
[Figure 4: Performance scalability of Spec-DSWP and TLS using DSMTX — speedup (0x-120x) vs. number of cores (8-128). Panels: (a) 052.alvinn and (k) swaptions: Spec-DOALL vs. TLS; (b) 130.li, (i) blackscholes, and (j) crc32: DSWP+[Spec-DOALL,S] vs. TLS; (c) 164.gzip, (d) 179.art, (e) 197.parser, and (f) 256.bzip2: Spec-DSWP+[S,DOALL,S] vs. TLS; (g) 456.hmmer and (h) 464.h264ref: Spec-DSWP+[DOALL,S] vs. TLS; (l) Geomean: Spec-DSWP, TLS, and DSMTX Best.]
Conclusions
• DOALL
• DSWP and PS-DSWP for a small number of cores
• Many cores
• Rewriting
• Programmer assistance
• Speculation
References
• Ottoni, G.; Rangan, R.; Stoler, A.; August, D. Automatic Thread Extraction with Decoupled Software Pipelining. Proc. of the 38th MICRO, 2005.
• Raman, E.; Ottoni, G.; Raman, A.; Bridges, M.; August, D. Parallel-Stage Decoupled Software Pipelining. Proc. of CGO, 2008.
• Raman, A.; Kim, H.; Mason, T.; Jablin, T.; August, D. Speculative Parallelization Using Software Multi-threaded Transactions. Proc. of ASPLOS, 2010.
• Kim, H.; Raman, A.; Liu, F.; Lee, J.; August, D. Scalable Speculative Parallelization on Commodity Clusters. Proc. of the 43rd MICRO, 2010.