    Region-based Techniques for Modeling and Enhancing Cluster

    OpenMP Performance

    Jie Cai

    August 2011

    A thesis submitted for the degree of Doctor of Philosophy

    of the Australian National University


    © Jie Cai 2011

    This document was produced using TeX, LaTeX and BibTeX.


    For my wife, Ruru, who greatly supported my PhD research...

    ...and my loving parents.


    Declaration

    I declare that the work in this thesis is entirely my own and that to the best of

    my knowledge it does not contain any materials previously published or written by

    another person except where otherwise indicated.

    Jie Cai


    Acknowledgements

    During my PhD, many people have offered kind help and generous support. I would

    like to thank them and appreciate their help.

    Supervisors

    Dr. Peter Strazdins for the guidance and advice throughout my doctoral

    research; Dr. Alistair Rendell for the time we spent together before paper

    deadlines; Dr. Eric McCreath for the useful comments.

    Readers

    For reading my thesis and providing valuable feedback, thank you.

    Warren Armstrong, Muhammad Atif, Michael Chapman, Pete Janes, Josh

    Milthorpe, Peter Strazdins, and Jin Wong.

    Computer System Group Members

    For the cheerful four years of my PhD, thank you geeks.

    Joseph Anthony, Ting Cao, Elton Tian, Xi Yang, Fangzhou Xiao and more ...

    Industry Partners

    For their generous financial contribution to support my research.

    Australian Research Council, Intel, Sun Microsystems (Oracle)

    Last but definitely not least

    For being so supportive, NCI NF colleagues.

    Ben Evans, Robin Humble, Judy Jenkinson and David Singleton.


    Abstract

    Cluster OpenMP enables the use of the OpenMP shared memory programming

    model on distributed memory cluster environments. Intel has released a cluster

    OpenMP implementation called Intel Cluster OpenMP (CLOMP). While this offers

    better programmability than message passing alternatives such as the Message

    Passing Interface (MPI), such convenience comes with overheads resulting from

    having to maintain the consistency of underlying shared memory abstractions.

    CLOMP is no exception. This thesis introduces models for understanding these

    overheads of cluster OpenMP implementations like CLOMP and proposes tech-

    niques for enhancing their performance.

    Cluster OpenMP systems are usually implemented using page-based software

    distributed shared memory (sDSM) systems, which create and maintain virtual

    global shared memory spaces in pages. A key issue for such systems is maintaining

    the consistency of the shared memory space. This forms a major source of overhead,

    and it is driven by detecting and servicing page faults.
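
    As background for how such faults arise, the following is a minimal sketch (assumptions only, not the CLOMP runtime) of the classic mechanism behind page-based sDSMs: shared pages are access-protected, and a SIGSEGV handler fetches a valid copy of the faulting page before re-enabling access. Here fetch_page_from_home() is a hypothetical placeholder for the network transfer, and real runtimes handle re-protection and write detection with considerably more care.

    /*
     * Illustrative page-fault-driven consistency mechanism for a page-based
     * sDSM (not the CLOMP implementation). Shared pages are protected with
     * mprotect(); the SIGSEGV handler fetches a valid copy of the faulting
     * page, then re-enables access so the interrupted instruction can retry.
     */
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* Hypothetical placeholder for fetching a page (or diffs) over the network. */
    extern void fetch_page_from_home(void *page);

    static void segv_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page boundary. */
        void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));
        fetch_page_from_home(page);                         /* service the fault remotely */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* then allow the access      */
    }

    void install_sdsm_handler(void *shared_base, size_t shared_len)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = segv_handler;
        sa.sa_flags     = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        /* Invalidate the shared region so the first touch of each page traps. */
        mprotect(shared_base, shared_len, PROT_NONE);
    }
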

    To investigate and understand these systems, we evaluate their performance

    with different OpenMP applications, and we also develop a benchmark, called

    MCBENCH, to characterize the memory consistency costs. Using MCBENCH, we

    discover that this overhead is proportional to the number of writers to the same

    shared page and the number of shared pages.
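
    To make this relationship concrete, the following is a minimal illustrative sketch (not the MCBENCH source) of the access pattern such a benchmark exercises, mirroring the Change/Read pattern described for Figure 3.7: threads write their own chunks of a shared array and, after a barrier, read the chunks a neighbouring thread has just written, so every touched page has a remote writer. The array and chunk sizes are assumptions for illustration only.

    /*
     * Illustrative memory-consistency stress pattern (not the actual MCBENCH):
     * Change phase writes round-robin chunks of a shared array; after a barrier,
     * the Read phase reads the chunks the neighbouring thread wrote, forcing
     * pages or diffs to move between nodes on an sDSM.
     */
    #include <omp.h>
    #include <string.h>

    #define ARRAY_BYTES (4 * 1024 * 1024)   /* assumed 4MB shared array      */
    #define CHUNK_BYTES 4096                /* assumed one chunk per page    */
    #define NCHUNKS     (ARRAY_BYTES / CHUNK_BYTES)

    static char shared_array[ARRAY_BYTES];  /* lives in the sDSM shared space */

    void consistency_stress(int iterations)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();

            for (int it = 0; it < iterations; it++) {
                /* Change phase: write the chunks assigned round-robin to this thread. */
                for (int c = tid; c < NCHUNKS; c += nthreads)
                    memset(shared_array + (size_t)c * CHUNK_BYTES, it & 0xff, CHUNK_BYTES);

                #pragma omp barrier

                /* Read phase: read the chunks the neighbouring thread just wrote. */
                volatile char sink = 0;
                int neighbour = (tid + 1) % nthreads;
                for (int c = neighbour; c < NCHUNKS; c += nthreads)
                    sink += shared_array[(size_t)c * CHUNK_BYTES];
                (void)sink;

                #pragma omp barrier
            }
        }
    }

    Increasing either the number of writers per shared page or the number of shared pages touched per iteration increases the consistency traffic, which is the relationship the benchmark quantifies.
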

    Furthermore, we divide an OpenMP program into separate parallel and serial

    regions. Based on these regions, we develop two region-based models to rationalize

    the numbers and types of page faults and their associated performance costs.

    The models highlight the fact that the major overhead is servicing the type of

    page fault that requires data (a page or its modifications, known as diffs) to

    be transferred across the network.
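
    As a purely illustrative decomposition (the symbols below are assumptions, not the SDP models defined in Chapter 4), the elapsed time of one execution of a region can be sketched as

    \[
      T_{\mathrm{region}} \;\approx\; T_{\mathrm{comp}}
        + N_{\mathrm{local}}\, t_{\mathrm{local}}
        + N_{\mathrm{remote}}\, \bigl( t_{\mathrm{fault}} + t_{\mathrm{transfer}} \bigr)
    \]

    where N_remote counts the page faults whose service requires a page or diff to cross the network. Since t_transfer dominates t_local, reducing N_remote, or aggregating its transfers into fewer, larger messages, has the largest effect, which is what the prefetch techniques below target.
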

    With this understanding, we have developed three region-based prefetch

    (ReP) techniques based on the execution history of each parallel and sequential

    region. The first ReP technique (TReP) considers temporal page faulting behaviour

    between consecutive executions of the same region. The second technique (HReP)

    considers both the temporal page faulting behaviour between consecutive execu-

    tions of the same region and the spatial paging behaviour within an execution of

    a region. The last technique (DReP) utilizes our proposed novel stride-augmented

    run-length encoding (sRLE) method to address both the temporal and spatial

    page faulting behaviour between consecutive executions of the same region. These

    techniques effectively reduce the number of page faults and aggregate data (pages

    and diffs) into larger transfers, which leverages the network bandwidth provided by interconnects.
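
    The first-level encoding can be pictured with a small sketch. The following C fragment is illustrative only; the struct and function names are assumptions, not the DReP implementation. It compresses a sorted list of faulting page IDs into (StartPageID, CommonStride, RunLength) records, the first-level format described in Section 5.5.1.

    /*
     * Illustrative first-level stride-augmented run-length encoding of a
     * sorted list of faulting page IDs into (StartPageID, CommonStride,
     * RunLength) records. Names and types are assumptions.
     */
    #include <stddef.h>

    typedef struct {
        unsigned long start_page;   /* StartPageID                     */
        long          stride;       /* CommonStride between pages      */
        size_t        run_length;   /* RunLength: pages in this run    */
    } srle_record;

    /* Compress n sorted page IDs into records written to out (caller provides
     * space for up to n records); returns the number of records produced. */
    size_t srle_encode(const unsigned long *pages, size_t n, srle_record *out)
    {
        size_t nrec = 0;
        size_t i = 0;
        while (i < n) {
            srle_record r = { pages[i], 0, 1 };
            if (i + 1 < n) {
                r.stride = (long)(pages[i + 1] - pages[i]);
                /* Extend the run while consecutive pages keep the same stride. */
                while (i + r.run_length < n &&
                       (long)(pages[i + r.run_length] - pages[i + r.run_length - 1]) == r.stride)
                    r.run_length++;
            }
            out[nrec++] = r;
            i += r.run_length;
        }
        return nrec;
    }

    Applying the same pass to the start pages of these records yields the second-level format; the point is that a long, regularly strided fault pattern collapses into a handful of records that are cheap to store, exchange and match against.
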

    All three ReP techniques are implemented in the runtime libraries of CLOMP

    to enhance its performance. Both the original and the enhanced CLOMP are

    evaluated using the NAS Parallel Benchmark OpenMP (NPB-OMP) suite and two

    LINPACK OpenMP benchmarks on different hardware platforms, including two

    clusters connected with Ethernet and InfiniBand interconnects. The performance

    data is quantitatively analyzed and modeled. Also, MCBENCH is used to evaluate

    the impact of ReP techniques on memory consistency cost.

    The evaluation results demonstrate that, on average, CLOMP overheads account for 75% and 55% of the overall elapsed time of the NPB-OMP benchmarks over Gigabit Ethernet and double data rate InfiniBand networks respectively.

    These ratios are effectively reduced by 60% and 40% after implementing the ReP techniques in the CLOMP runtime.

    For the LINPACK benchmarks, with the assistance of sRLE, DReP significantly outperforms the other ReP techniques, effectively reducing page fault handling costs by 50% and 58% over Gigabit

    Ethernet and InfiniBand networks respectively.


    Contents

    Declaration v

    Acknowledgements vii

    Abstract ix

    I Introduction and Background 1

    1 Introduction 3

    1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.1.1 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    1.2.1 Performance Evaluation of CLOMP . . . . . . . . . . . . . . . 6

    1.2.2 Region-based Performance Models . . . . . . . . . . . . . . . . 7

    1.2.3 Region-based Prefetch Techniques . . . . . . . . . . . . . . . . 8

    1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2 Background 11

    2.1 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.1.1 OpenMP Directives . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.1.2 Synchronization Operations . . . . . . . . . . . . . . . . . . . . 16

    2.2 Cluster OpenMP Systems . . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.2.1 Relaxed Memory Consistency . . . . . . . . . . . . . . . . . . . 18

    2.2.2 Software Distributed Shared Memory Systems . . . . . . . . . 19

    2.2.3 Intel Cluster OpenMP . . . . . . . . . . . . . . . . . . . . . . . 23

    2.2.4 Alternative Approaches to sDSMs . . . . . . . . . . . . . . . . 26

    2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    2.3.1 Performance Models . . . . . . . . . . . . . . . . . . . . . . . . 29

    2.3.2 Prefetch Techniques for sDSM Systems . . . . . . . . . . . . . 31

    2.3.3 Run-Length Encoding Methods . . . . . . . . . . . . . . . . . . 35

    2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    II Performance Issues of Intel Cluster OpenMP 39

    3 Performance of Original Intel Cluster OpenMP System 41

    3.1 Hardware and Software Setup . . . . . . . . . . . . . . . . . . . . . . . 42


    3.2 Performance of CLOMP . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    3.2.1 NPB OpenMP Benchmarks Sequential Performance . . . . . . 44

    3.2.2 Comparison of CLOMP and Intel Native OpenMP on a Single

    Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    3.2.3 CLOMP with Single Thread per Compute Node . . . . . . . . 48

    3.2.4 CLOMP with Multiple Threads per Compute Node . . . . . . 48

    3.2.5 Elapsed Time Breakdown for NPB-OMP Benchmarks . . . . . 53

    3.3 Memory Consistency Cost of CLOMP . . . . . . . . . . . . . . . . . . . 55

    3.3.1 Memory Consistency Cost Micro-Benchmark MCBENCH . . 56

    3.3.2 MCBENCH Evaluation of CLOMP . . . . . . . . . . . . . . . . 57

    3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    4 Region-Based Performance Models 65

    4.1 Regions of OpenMP Programs . . . . . . . . . . . . . . . . . . . . . . . 66

    4.2 SIGSEGV Driven Performance (SDP) Models . . . . . . . . . . . . . . 67

    4.2.1 Critical Path Model . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.2.2 Aggregated Model . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    4.2.3 Coefficient Measurement . . . . . . . . . . . . . . . . . . . . . . 71

    4.3 SDP Model Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

    4.3.1 Critical Path Model Estimates . . . . . . . . . . . . . . . . . . 73

    4.3.2 Aggregate Model Estimates . . . . . . . . . . . . . . . . . . . . 74

    4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

    III Optimizations: Design, Implementation and Evaluation 79

    5 Region-Based Prefetch Techniques 81

    5.1 Limitations of Current Prefetch Techniques for sDSM Systems . . . 82

    5.1.1 Parallel Application Examples . . . . . . . . . . . . . . . . . . 82

    5.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

    5.1.3 Prefetch Technique Design Assumptions . . . . . . . . . . . . . 88

    5.2 Evaluation Metrics of Prefetch Techniques . . . . . . . . . . . . . . . 89

    5.3 Temporal ReP (TReP) Technique . . . . . . . . . . . . . . . . . . . . . 90

    5.4 Hybrid ReP (HReP) Technique . . . . . . . . . . . . . . . . . . . . . . 90

    5.5 ReP Technique for Dynamic Memory Accessing Applications (DReP) 93

    5.5.1 Stride-augmented Run-length Encoded Page Fault Records . 93

    5.5.2 Page Miss Prediction . . . . . . . . . . . . . . . . . . . . . . . . 95

    5.6 Offline Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.6.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.6.2 Simulation Results and Discussions . . . . . . . . . . . . . . . 98

    5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106


    6 Implementation and Evaluation 111

    6.1 ReP Prefetch Techniques Implementation Issues . . . . . . . . . . . . 112

    6.1.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

    6.1.2 New Region Notification . . . . . . . . . . . . . . . . . . . . . . 114

    6.1.3 Record Encoding and Flush Filter enabled Decoding . . . . . . 116

    6.1.4 Prefetch Page Prediction . . . . . . . . . . . . . . . . . . . . . . 116

    6.1.5 Prefetch Request and Event Handling . . . . . . . . . . . . . . 117

    6.1.6 Page State Transition . . . . . . . . . . . . . . . . . . . . . . . . 118

    6.1.7 Garbage Collection Mechanism . . . . . . . . . . . . . . . . . . 119

    6.2 Theoretical Performance of the ReP Enhanced CLOMP . . . . . . . . 120

    6.3 Performance Evaluation of the ReP Enhanced CLOMP . . . . . . . . 123

    6.3.1 MCBENCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    6.3.2 NPB OpenMP Benchmarks . . . . . . . . . . . . . . . . . . . . 130

    6.3.3 LINPACK Benchmarks . . . . . . . . . . . . . . . . . . . . . . . 138

    6.3.4 ReP Techniques with Multiple Threads per Process . . . . . . 142

    6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

    IV Conclusions and Future Work 147

    7 Conclusions and Future Work 149

    7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

    7.1.1 Performance Evaluation of CLOMP . . . . . . . . . . . . . . . 150

    7.1.2 SIGSEGV Driven Performance Models . . . . . . . . . . . . . . 152

    7.1.3 Performance Enhancement by RePs . . . . . . . . . . . . . . . 152

    7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

    7.2.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . 156

    7.2.2 Performance Optimizations . . . . . . . . . . . . . . . . . . . . 156

    7.2.3 Adapting ReP Techniques to the Latest Technologies . . . . . 156

    7.2.4 Potential Use of sRLE . . . . . . . . . . . . . . . . . . . . . . . 157

    V Appendices 159

    A Algorithms Used in DReP 161

    A.1 Stride-augmented Run-length Encoding Algorithms . . . . . . . . . . 162

    A.1.1 Algorithm 1: Page Fault Record Reconstruction Step (a) . . . 162

    A.1.2 Algorithm 2: Page Fault Record Reconstruction Step (b) . . . 162

    A.1.3 Algorithm 3: Page Fault Record Reconstruction Step (c) . . . . 163

    A.2 Algorithm 4: DReP Predictor . . . . . . . . . . . . . . . . . . . . . . . 163


    B Tsegv,local and Nftotal for Theoretical ReP Speedup Calculation 165

    B.1 NPB-OMP Benchmarks Datasheet . . . . . . . . . . . . . . . . . . . . 166

    B.2 LINPACK Benchmarks Datasheet . . . . . . . . . . . . . . . . . . . . 166

    C TReP and DReP Performance Results of the NPB-OMP benchmarks

    on a 4-node Intel Cluster 169

    C.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

    C.2 Sequential Elapsed Time . . . . . . . . . . . . . . . . . . . . . . . . . . 170

    C.3 TReP and DReP Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 170

    C.3.1 Elapsed Time over Gigabit Ethernet . . . . . . . . . . . . . . . 170

    C.3.2 Elapsed Time over DDR InfiniBand . . . . . . . . . . . . . . . 173

    D MultiRail Networks Optimization for the Communication Layer 177

    D.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

    D.2 Micro-Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

    D.2.1 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

    D.2.2 Single-Rail Benchmark . . . . . . . . . . . . . . . . . . . . . . . 179

    D.2.3 Multirail Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 180

    D.3 Bandwidth and Latency Experiments . . . . . . . . . . . . . . . . . . 181

    D.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 182

    D.3.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

    D.3.3 Uni-directional Bandwidth . . . . . . . . . . . . . . . . . . . . . 185

    D.3.4 Bi-directional Bandwidth . . . . . . . . . . . . . . . . . . . . . 186

    D.3.5 Elapsed Time Breakdown . . . . . . . . . . . . . . . . . . . . . 188

    D.4 Related Work on Multirail InfiniBand Network . . . . . . . . . . . . . 190

    D.5 Challenge and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 191

    E Performance of CAL 193

    E.1 Bandwidth and Latency of CAL . . . . . . . . . . . . . . . . . . . . . . 194

    E.2 Comparison Between OpenMPI and CAL . . . . . . . . . . . . . . . . 194

    Bibliography 197


    List of Figures

    2.1 OpenMP fork-join multi-threading parallelism mechanism [93] . . . . 13

    2.2 OpenMP parallel directives and associated clauses in C and C++. . . . 13

    2.3 OpenMP for directives and associated clauses in C and C++. . . . . . 14

    2.4 An example OpenMP program in C using parallel for directives. . . . . 15

    2.5 OpenMP synchronization directives in C and C++ languages: (a)

    barrier, and (b) flush. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.6 OpenMP threadprivate directive in C and C++ languages. . . . . . . . 16

    2.7 Processes and threads in CLOMP . . . . . . . . . . . . . . . . . . . . . 23

    2.8 State machine of CLOMP (derived from [47], [38], and experimental

    observation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.9 Illustration of two prefetch modes for Adaptive++ techniques. . . . . 34

    3.1 Comparison of performance between native Intel OpenMP and

    CLOMP on a XE compute node. . . . . . . . . . . . . . . . . . . . . . . 45

    3.2 Comparison of performance between native Intel OpenMP and

    CLOMP on a VAYU compute node. . . . . . . . . . . . . . . . . . . . . 46

    3.3 Performance of CLOMP on XE with a single thread per compute node. 49

    3.4 Performance of CLOMP on VAYU with a single thread per compute

    node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    3.5 Performance of CLOMP on XE with multi-threads per compute node. 52

    3.6 Performance of CLOMP on VAYU with multi-threads per compute

    node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.7 MCBENCH: An array of size a-bytes is divided into chunks of c-

    bytes. The benchmark consists of Change and Read phases that can

    be repeated for multiple iterations. Entering the Change phase of

    the first iteration, the chunks are distributed to the available threads

    (four in this case) in a round-robin fashion. In the Read phase after

    the barrier, each thread reads from the chunk that its neighbour had

    written to. This is followed by a barrier which ends the first iteration.

    For the subsequent iteration, the chunks to Change are the same

    as in the previous Read phase. That is, the shifting of the chunk

    distribution only takes place when moving from the Change to Read

    phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


    3.8 MCBENCH evaluation results of CLOMP on XE with both Ethernet

    and InfiniBand interconnects: 64KB, 4MB and 8MB array sizes

    are used in these three figures respectively; comparison among

    different chunk sizes 4B, 2KB and 4KB is illustrated in each figure

    for both Ethernet and InfiniBand. . . . . . . . . . . . . . . . . . . . . . 59

    4.1 Illustration of regions in an OpenMP parallel program. . . . . . . . . 67

    4.2 Schematic illustration of timing breakdown for parallel region using

    the SDP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    4.3 The algorithm used to determine the SDP coefficients. The code

    shown is in a parallel region. R is a private array while S is a shared

    one. Variables Dw and Dr represent reference times for accessing

    private array R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

    5.1 Pseudo code to demonstrate the memory access patterns of the naive

    LINPACK OpenMP benchmark implementation for an n × n column-

    major matrix A with blocking factor nb. . . . . . . . . . . . . . . . . . 83

    5.2 Naive OpenMP LINPACK program with an n × n matrix: (a) memory

    access areas for different iterations. (b) page fault areas for different

    iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    5.3 Pseudo code to demonstrate the memory access patterns of the

    optimized LINPACK OpenMP benchmark implementation for an n × n column-major matrix A with blocking factor nb. . . . . . . . . . 86

    5.4 Optimized OpenMP LINPACK program: (a) memory access areas for

    different iterations illustrated on an n × n matrix panel. (b) page fault

    areas for different iterations illustrated on the n × n matrix panel. . 87

    5.5 The page fault record entry for TReP and HReP prefetch techniques. 90

    5.6 A flowchart of the HReP predictor. . . . . . . . . . . . . . . . . . . . . 92

    5.7 Two levels of stride-augmented run-length encoding (sRLE) method:

    (a) Based on strides between consecutive pages, sorted missed

    pages are broken into small sub-arrays, and those consecutive pages with the same stride are stored in the same array. (b)

    The sub-arrays are compressed into the first level sRLE records

    in a (StartPageID, CommonStride, RunLength) format. (c) Based

    on the stride between the start pages of consecutive first level

    sRLE records, they are further compressed into the second level

    sRLE format, (FirstLevelRecord, CommonStride, RunLength) (more

    details in Section 5.5.1). . . . . . . . . . . . . . . . . . . . . . . . . . . 94

    5.8 Page fault record of region execution reconstructed via run-length

    encoding method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


    5.9 The effective page miss rate reduction for different prefetch tech-

    niques on 2 threads (a), 4 threads (b) and 8 threads (c). . . . . . . . . 103

    6.1 Intel Cluster OpenMP runtime structure. . . . . . . . . . . . . . . . . 112

    6.2 Data structure for stride-augmented run-length encoded page fault

    records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

    6.3 ReP prefetch record data structure. . . . . . . . . . . . . . . . . . . . . 115

    6.4 User interactive interface of new region notification. . . . . . . . . . . 115

    6.5 The round-robin prefetch request communication pattern. . . . . . . 118

    6.6 New page state machine after introducing the Prefetched diff and

    Prefetched page states. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

    6.7 RePs VS. Original CLOMP: MCBENCH with 4B chunk size over both

    the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

    6.8 RePs VS. Original CLOMP: MCBENCH with 2048 bytes chunk size

    over both the GigE and IB networks. (a) 64KB array size, (b) 4MB

    array size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . 127

    6.9 RePs VS. Original CLOMP: MCBENCH with 4KB chunk size over

    both the GigE and IB networks. (a) 64KB array size, (b) 4MB array

    size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . . . . 130

    6.10 RePs VS. Original CLOMP: BT speedup comparison on both GigE

    and IB networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

    6.11 RePs VS. Original CLOMP: the naive LINPACK evaluation results

    comparison using an N × N matrix (N = 4096) with blocking factor NB =

    64 via both GigE and IB. . . . . . . . . . . . . . . . . . . . . . . . . . . 140

    6.12 RePs VS. Original CLOMP: the optimized LINPACK evaluation

    results comparison using an N × N matrix (N = 8192) with blocking

    factor NB = 64 via both GigE and IB. . . . . . . . . . . . . . . . . . . 141

    6.13 DReP vs Original CLOMP: the optimized LINPACK benchmark (N =

    8192 and NB = 64) results comparison with multiple threads per

    process via both GigE and IB. (a) 2 threads per process, (b) 4 threads

    per process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

    C.1 Speedup of the BT and CG benchmarks over Gigabit Ethernet. . . . 171

    C.2 Speedup of IS and LU benchmarks over Gigabit Ethernet. . . . . . . 172

    C.3 Speedup of BT and CG benchmarks over DDR InfiniBand. . . . . . . 174

    C.4 Speedup of IS and LU benchmarks over DDR InfiniBand. . . . . . . . 175

    D.1 Single-rail bandwidth benchmark . . . . . . . . . . . . . . . . . . . . . 179

    D.2 Multirail communication memory access pattern. . . . . . . . . . . . 180

    D.3 Non-threaded multirail bandwidth benchmark . . . . . . . . . . . . . 182


    D.4 Threaded multirail benchmark design. . . . . . . . . . . . . . . . . . . 183

    D.5 RDMA write latency comparison. . . . . . . . . . . . . . . . . . . . . . 184

    D.6 Uni-directional multi-port bandwidth. . . . . . . . . . . . . . . . . . . 185

    D.7 Uni-directional multi-HCA bandwidth. . . . . . . . . . . . . . . . . . . 186

    D.8 Bi-directional multi-port bandwidth. . . . . . . . . . . . . . . . . . . . 187

    D.9 Bi-directional multi-HCA bandwidth. . . . . . . . . . . . . . . . . . . 188

    D.10 Benchmarks elapsed time breakdown for 512-byte messages. . . . . . . 188

    D.11 Benchmarks elapsed time breakdown for 4KB messages. . . . . . . . . 189

    D.12 Different ways to configure an InfiniBand multirail network [62]. . . . 190


    List of Tables

    2.1 OpenMP synchronization operations. . . . . . . . . . . . . . . . . . . . 17

    3.1 Evaluation experimental hardware platforms. . . . . . . . . . . . . . 43

    3.2 Sequential elapsed time (sec) of NPB with CLOMP. . . . . . . . . . . 44

    3.3 Page faults handling cost (SEGV Cost) of CLOMP for NPB bench-

    marks as a ratio to corresponding elapsed time with single thread

    per process on XE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    3.4 Page faults handling cost breakdown for CLOMP for class A NPB

    benchmarks with multiple threads per process on XE. SEGV

    represents the ratio of page faults handling cost to the corresponding

    elapsed time; SEGV Lock in turn represents a ratio of pthread mutex

    cost within SEGV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.1 Critical path page faults counts for the NPB-OMP benchmarks run

    using CLOMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    4.2 Comparison between observed and estimated speedup for running

    NPB class A and C on the AMD cluster with CLOMP . . . . . . . . . 77

    4.3 Average relative errors for the predicted NPB speedups evaluated

    using the critical path and aggregate (f = 0) SDP models and data from Table 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    5.1 Threshold effects of ReP techniques for naive LINPACK benchmark. 98

    5.2 Simulation prefetch efficiency (E) and coverage (Nu/Nf) for Adapt-

    ive++, TODFCM (1 page), TReP, HReP and DReP techniques. . . . . 108

    5.3 Breakdown of prefetches issued by different prefetch modes and

    chosen list deployed in HReP. . . . . . . . . . . . . . . . . . . . . . . . 109

    5.4 Comparison of F-HReP and HReP with the LU benchmark. . . . . . 109

    6.1 Bandwidth and latency measured by the communication layer (CAL) of CLOMP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

    6.2 ReP techniques prefetch efficiency and coverage for MCBENCH with

    4MB array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    6.3 Message transfer counts (×1000) comparison between RePs enhanced

    CLOMP and the original CLOMP for MCBENCH with 4B chunk . . 126

    6.4 Message transfer counts (×1000) comparison between RePs enhanced

    CLOMP and the original CLOMP for MCBENCH with 2KB chunk . 128

    6.5 Message transfer counts (×1000) comparison between RePs enhanced

    CLOMP and the original CLOMP for MCBENCH with 4KB chunk . 129


    6.6 Page fault handling costs comparison for the BT benchmark among the

    original CLOMP, the theoretical, and the ReP techniques enhanced

    CLOMP. The computation part of elapsed time is common to all

    compared items. The page fault handling costs of the original

    CLOMP are presented in seconds, and those of the others are presented as a

    reduction ratio (e.g. (Orig - TReP)/Orig). . . . . . . . . . . . . . . . . 132

    6.7 Page fault handling costs reduction ratio ((Tsegv,orig - Tsegv) / Tsegv,orig) comparison for

    other NPB benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

    6.8 Detailed Tsegv breakdown analysis of the IS Class A Benchmark

    for the ReP techniques. Overall Tsegv stands for overall CLOMP

    overhead. TMK Comm stands for the communication time spent

    by TMK for data transfer. TMK local stands for the local software

    overhead of TMK layer. ReP Comm stands for the communication

    time spent on prefetching data. ReP local stands for the local

    software overhead introduced by using the ReP prefetch techniques.

    Tsegv is presented in seconds and its components are presented as a

    ratio to the overall Tsegv . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

    6.9 Detailed Tsegv breakdown analysis of the IS Class C Benchmark

    for the ReP techniques. Overall Tsegv stands for overall CLOMP

    overhead. TMK Comm stands for the communication time spent

    by TMK for data transfer. TMK local stands for the local software

    overhead of TMK layer. ReP Comm stands for the communication

    time spent on prefetching data. ReP local stands for the local

    software overhead introduced by using the ReP prefetch techniques.

    Tsegv is presented in seconds and its components are presented as a

    ratio to the overall Tsegv . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

    6.10 Sequential elapsed time for LINPACK benchmarks . . . . . . . . . . 138

    6.11 Page fault handling costs comparison for LINPACK benchmarks

    among the original CLOMP, the theoretical, and the ReP techniques enhanced CLOMP. The computation part of elapsed time is common

    to all compared items. The page fault handling costs of the original

    CLOMP are presented in seconds, and those of the others are presented as a

    reduction ratio (e.g. (Orig - TReP)/Orig). . . . . . . . . . . . . . . . . 139

    6.12 Page faults handling cost comparison between DReP and the ori-

    ginal CLOMP for the optimized LINPACK benchmark with multiple

    threads per process. SEGV represents the ratio of page faults

    handling cost to the corresponding elapsed time; SEGV Lock in

    turn represents a ratio of pthread mutex cost within SEGV. . . . . . 142


    B.1 Tsegv,local (sec) for some NPB-OMP benchmarks with different num-

    ber of processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

    B.2 Nftotal for some NPB-OMP benchmarks with different number of

    processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

    B.3 Tsegvtotal (sec) for LINPACK benchmarks with different number of

    processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

    B.4 Nftotal for LINPACK benchmarks with different number of processes. 167

    C.1 Elapsed Time (sec) of some NPB-OMP Benchmarks on one thread. . 170

    E.1 Complete bandwidth and latency measured by the communication

    layer (CAL) of CLOMP on XE. . . . . . . . . . . . . . . . . . . . . . . . 194

    E.2 Comparison of CAL and OpenMPI: bandwidth and latency measured

    on XE via GigE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

    E.3 Comparison of CAL and OpenMPI: bandwidth and latency measured

    on XE via DDR IB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
