    Region-based Techniques for Modeling and Enhancing Cluster

    OpenMP Performance

    Jie Cai

    August 2011

    A thesis submitted for the degree of Doctor of Philosophy

    of the Australian National University


    © Jie Cai 2011

    This document was produced using TeX, LaTeX and BibTeX.


    For my wife, Ruru, who greatly supported my PhD research...

    ...and my loving parents.


    Declaration

    I declare that the work in this thesis is entirely my own and that to the best of

    my knowledge it does not contain any materials previously published or written by

    another person except where otherwise indicated.

    Jie Cai


    Acknowledgements

    During my PhD, many people have offered kind help and generous support. I would

    like to thank them and appreciate their help.

    Supervisors

    Dr. Peter Strazdins for the guidance and advice throughout my doctoral

    research; Dr. Alistair Rendell for the time we spent together before paper

    deadlines; Dr. Eric McCreath for the useful comments.

    Readers

    For reading my thesis and providing valuable feedback, thank you.

    Warren Armstrong, Muhammad Atif, Michael Chapman, Pete Janes, Josh

    Milthorpe, Peter Strazdins, and Jin Wong.

    Computer System Group Members

    For the cheerful four years of my PhD, thank you geeks.

    Joseph Anthony, Ting Cao, Elton Tian, Xi Yang, Fangzhou Xiao and more ...

    Industry Partners

    For their generous financial contribution to support my research.

    Australian Research Council, Intel, Sun Microsystems (Oracle)

    Last but definitely not least

    For being so supportive, NCI NF colleagues.

    Ben Evans, Robin Humble, Judy Jenkinson and David Singleton.


    Abstract

    Cluster OpenMP enables the use of the OpenMP shared memory programming

    model on distributed memory cluster environments. Intel has released a cluster

    OpenMP implementation called Intel Cluster OpenMP (CLOMP). While this offers

    better programmability than message passing alternatives such as the Message

    Passing Interface (MPI), such convenience comes with overheads resulting from

    having to maintain the consistency of underlying shared memory abstractions.

    CLOMP is no exception. This thesis introduces models for understanding these

    overheads of cluster OpenMP implementations like CLOMP and proposes tech-

    niques for enhancing their performance.

    Cluster OpenMP systems are usually implemented using page-based software

    distributed shared memory (sDSM) systems, which create and maintain virtual

    global shared memory spaces in pages. A key issue for such systems is maintaining

    the consistency of the shared memory space. This forms a major source of overhead,

    and it is driven by detecting and servicing page faults.
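
    As background for how such faults arise, the following is a minimal sketch (assumptions only, not the CLOMP runtime) of the classic mechanism behind page-based sDSMs: shared pages are access-protected, and a SIGSEGV handler fetches a valid copy of the faulting page before re-enabling access. Here fetch_page_from_home() is a hypothetical placeholder for the network transfer, and real runtimes handle re-protection and write detection with considerably more care.

    /*
     * Illustrative page-fault-driven consistency mechanism for a page-based
     * sDSM (not the CLOMP implementation). Shared pages are protected with
     * mprotect(); the SIGSEGV handler fetches a valid copy of the faulting
     * page, then re-enables access so the interrupted instruction can retry.
     */
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* Hypothetical placeholder for fetching a page (or diffs) over the network. */
    extern void fetch_page_from_home(void *page);

    static void segv_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page boundary. */
        void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));
        fetch_page_from_home(page);                         /* service the fault remotely */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* then allow the access      */
    }

    void install_sdsm_handler(void *shared_base, size_t shared_len)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = segv_handler;
        sa.sa_flags     = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        /* Invalidate the shared region so the first touch of each page traps. */
        mprotect(shared_base, shared_len, PROT_NONE);
    }
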

    To investigate and understand these systems, we evaluate their performance

    with different OpenMP applications, and we also develop a benchmark, called

    MCBENCH, to characterize the memory consistency costs. Using MCBENCH, we

    discover that this overhead is proportional to the number of writers to the same

    shared page and the number of shared pages.
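
    To make this relationship concrete, the following is a minimal illustrative sketch (not the MCBENCH source) of the access pattern such a benchmark exercises, mirroring the Change/Read pattern described for Figure 3.7: threads write their own chunks of a shared array and, after a barrier, read the chunks a neighbouring thread has just written, so every touched page has a remote writer. The array and chunk sizes are assumptions for illustration only.

    /*
     * Illustrative memory-consistency stress pattern (not the actual MCBENCH):
     * Change phase writes round-robin chunks of a shared array; after a barrier,
     * the Read phase reads the chunks the neighbouring thread wrote, forcing
     * pages or diffs to move between nodes on an sDSM.
     */
    #include <omp.h>
    #include <string.h>

    #define ARRAY_BYTES (4 * 1024 * 1024)   /* assumed 4MB shared array      */
    #define CHUNK_BYTES 4096                /* assumed one chunk per page    */
    #define NCHUNKS     (ARRAY_BYTES / CHUNK_BYTES)

    static char shared_array[ARRAY_BYTES];  /* lives in the sDSM shared space */

    void consistency_stress(int iterations)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();

            for (int it = 0; it < iterations; it++) {
                /* Change phase: write the chunks assigned round-robin to this thread. */
                for (int c = tid; c < NCHUNKS; c += nthreads)
                    memset(shared_array + (size_t)c * CHUNK_BYTES, it & 0xff, CHUNK_BYTES);

                #pragma omp barrier

                /* Read phase: read the chunks the neighbouring thread just wrote. */
                volatile char sink = 0;
                int neighbour = (tid + 1) % nthreads;
                for (int c = neighbour; c < NCHUNKS; c += nthreads)
                    sink += shared_array[(size_t)c * CHUNK_BYTES];
                (void)sink;

                #pragma omp barrier
            }
        }
    }

    Increasing either the number of writers per shared page or the number of shared pages touched per iteration increases the consistency traffic, which is the relationship the benchmark quantifies.
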

    Furthermore, we divide an OpenMP program into separate parallel and serial

    regions. Based on these regions, we develop two region-based models to rationalize

    the numbers and types of page faults and their associated performance costs.

    The models highlight the fact that the major overhead is servicing the type of

    page fault that requires data (a page or its modifications, known as diffs) to

    be transferred across the network.
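
    As a purely illustrative decomposition (the symbols below are assumptions, not the SDP models defined in Chapter 4), the elapsed time of one execution of a region can be sketched as

    \[
      T_{\mathrm{region}} \;\approx\; T_{\mathrm{comp}}
        + N_{\mathrm{local}}\, t_{\mathrm{local}}
        + N_{\mathrm{remote}}\, \bigl( t_{\mathrm{fault}} + t_{\mathrm{transfer}} \bigr)
    \]

    where N_remote counts the page faults whose service requires a page or diff to cross the network. Since t_transfer dominates t_local, reducing N_remote, or aggregating its transfers into fewer, larger messages, has the largest effect, which is what the prefetch techniques below target.
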

    With this understanding, we have developed three region-based prefetch

    (ReP) techniques based on the execution history of each parallel and sequential

    region. The first ReP technique (TReP) considers temporal page faulting behaviour

    between consecutive executions of the same region. The second technique (HReP)

    considers both the temporal page faulting behaviour between consecutive execu-

    tions of the same region and the spatial paging behaviour within an execution of

    a region. The last technique (DReP) utilizes our proposed novel stride-augmented

    run-length encoding (sRLE) method to address both the temporal and spatial

    page faulting behaviour between consecutive executions of the same region. These

    techniques effectively reduce the number of page faults and aggregate data (pages

    and diffs) into larger transfers, which leverages the network bandwidth provided by interconnects.
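
    The first-level encoding can be pictured with a small sketch. The following C fragment is illustrative only; the struct and function names are assumptions, not the DReP implementation. It compresses a sorted list of faulting page IDs into (StartPageID, CommonStride, RunLength) records, the first-level format described in Section 5.5.1.

    /*
     * Illustrative first-level stride-augmented run-length encoding of a
     * sorted list of faulting page IDs into (StartPageID, CommonStride,
     * RunLength) records. Names and types are assumptions.
     */
    #include <stddef.h>

    typedef struct {
        unsigned long start_page;   /* StartPageID                     */
        long          stride;       /* CommonStride between pages      */
        size_t        run_length;   /* RunLength: pages in this run    */
    } srle_record;

    /* Compress n sorted page IDs into records written to out (caller provides
     * space for up to n records); returns the number of records produced. */
    size_t srle_encode(const unsigned long *pages, size_t n, srle_record *out)
    {
        size_t nrec = 0;
        size_t i = 0;
        while (i < n) {
            srle_record r = { pages[i], 0, 1 };
            if (i + 1 < n) {
                r.stride = (long)(pages[i + 1] - pages[i]);
                /* Extend the run while consecutive pages keep the same stride. */
                while (i + r.run_length < n &&
                       (long)(pages[i + r.run_length] - pages[i + r.run_length - 1]) == r.stride)
                    r.run_length++;
            }
            out[nrec++] = r;
            i += r.run_length;
        }
        return nrec;
    }

    Applying the same pass to the start pages of these records yields the second-level format; the point is that a long, regularly strided fault pattern collapses into a handful of records that are cheap to store, exchange and match against.
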

    All three ReP techniques are implemented in the runtime libraries of CLOMP

    to enhance its performance. Both the original and the enhanced CLOMP are

    evaluated using the NAS Parallel Benchmark OpenMP (NPB-OMP) suite and two

    LINPACK OpenMP benchmarks on different hardware platforms, including two

    clusters connected with Ethernet and InfiniBand interconnects. The performance

    data is quantitatively analyzed and modeled. Also, MCBENCH is used to evaluate

    the impact of ReP techniques on memory consistency cost.

    The evaluation results demonstrate that, on average, CLOMP overheads account for 75% and 55% of the overall elapsed time of the NPB-OMP benchmarks over Gigabit Ethernet and double data rate InfiniBand networks respectively.

    These ratios are effectively reduced by 60% and 40% after implementing the ReP techniques in the CLOMP runtime.

    For the LINPACK benchmarks, with the assistance of sRLE, DReP significantly outperforms the other ReP techniques, effectively reducing page fault handling costs by 50% and 58% over Gigabit

    Ethernet and InfiniBand networks respectively.


    Contents

    Declaration v

    Acknowledgements vii

    Abstract ix

    I Introduction and Background 1

    1 Introduction 3

    1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.1.1 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    1.2.1 Performance Evaluation of CLOMP . . . . . . . . . . . . . . . 6

    1.2.2 Region-based Performance Models . . . . . . . . . . . . . . . . 7

    1.2.3 Region-based Prefetch Techniques . . . . . . . . . . . . . . . . 8

    1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2 Background 11

    2.1 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.1.1 OpenMP Directives . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.1.2 Synchronization Operations . . . . . . . . . . . . . . . . . . . . 16

    2.2 Cluster OpenMP Systems . . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.2.1 Relaxed Memory Consistency . . . . . . . . . . . . . . . . . . . 18

    2.2.2 Software Distributed Shared Memory Systems . . . . . . . . . 19

    2.2.3 Intel Cluster OpenMP . . . . . . . . . . . . . . . . . . . . . . . 23

    2.2.4 Alternative Approaches to sDSMs . . . . . . . . . . . . . . . . 26

    2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    2.3.1 Performance Models . . . . . . . . . . . . . . . . . . . . . . . . 29

    2.3.2 Prefetch Techniques for sDSM Systems . . . . . . . . . . . . . 31

    2.3.3 Run-Length Encoding Methods . . . . . . . . . . . . . . . . . . 35

    2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    II Performance Issues of Intel Cluster OpenMP 39

    3 Performance of Original Intel Cluster OpenMP System 41

    3.1 Hardware and Software Setup . . . . . . . . . . . . . . . . . . . . . . . 42


    3.2 Performance of CLOMP . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    3.2.1 NPB OpenMP Benchmarks Sequential Performance . . . . . . 44

    3.2.2 Comparison of CLOMP and Intel Native OpenMP on a Single

    Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    3.2.3 CLOMP with Single Thread per Compute Node . . . . . . . . 48

    3.2.4 CLOMP with Multiple Threads per Compute Node . . . . . . 48

    3.2.5 Elapsed Time Breakdown for NPB-OMP Benchmarks . . . . . 53

    3.3 Memory Consistency Cost of CLOMP . . . . . . . . . . . . . . . . . . . 55

    3.3.1 Memory Consistency Cost Micro-Benchmark MCBENCH . . 56

    3.3.2 MCBENCH Evaluation of CLOMP . . . . . . . . . . . . . . . . 57

    3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    4 Region-Based Performance Models 65

    4.1 Regions of OpenMP Programs . . . . . . . . . . . . . . . . . . . . . . . 66

    4.2 SIGSEGV Driven Performance (SDP) Models . . . . . . . . . . . . . . 67

    4.2.1 Critical Path Model . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.2.2 Aggregated Model . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    4.2.3 Coefficient Measurement . . . . . . . . . . . . . . . . . . . . . . 71

    4.3 SDP Model Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

    4.3.1 Critical Path Model Estimates . . . . . . . . . . . . . . . . . . 73

    4.3.2 Aggregate Model Estimates . . . . . . . . . . . . . . . . . . . . 74

    4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

    III Optimizations: Design, Implementation and Evaluation 79

    5 Region-Based Prefetch Techniques 81

    5.1 Limitations of Current Prefetch Techniques for sDSM Systems . . . 82

    5.1.1 Parallel Application Examples . . . . . . . . . . . . . . . . . . 82

    5.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

    5.1.3 Prefetch Technique Design Assumptions . . . . . . . . . . . . . 88

    5.2 Evaluation Metrics of Prefetch Techniques . . . . . . . . . . . . . . . 89

    5.3 Temporal ReP (TReP) Technique . . . . . . . . . . . . . . . . . . . . . 90

    5.4 Hybrid ReP (HReP) Technique . . . . . . . . . . . . . . . . . . . . . . 90

    5.5 ReP Technique for Dynamic Memory Accessing Applications (DReP) 93

    5.5.1 Stride-augmented Run-length Encoded Page Fault Records . 93

    5.5.2 Page Miss Prediction . . . . . . . . . . . . . . . . . . . . . . . . 95

    5.6 Offline Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.6.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.6.2 Simulation Results and Discussions . . . . . . . . . . . . . . . 98

    5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106


    6 Implementation and Evaluation 111

    6.1 ReP Prefetch Techniques Implementation Issues . . . . . . . . . . . . 112

    6.1.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

    6.1.2 New Region Notification . . . . . . . . . . . . . . . . . . . . . . 114

    6.1.3 Record Encoding and Flush Filter enabled Decoding . . . . . . 116

    6.1.4 Prefetch Page Prediction . . . . . . . . . . . . . . . . . . . . . . 116

    6.1.5 Prefetch Request and Event Handling . . . . . . . . . . . . . . 117

    6.1.6 Page State Transition . . . . . . . . . . . . . . . . . . . . . . . . 118

    6.1.7 Garbage Collection Mechanism . . . . . . . . . . . . . . . . . . 119

    6.2 Theoretical Performance of the ReP Enhanced CLOMP . . . . . . . . 120

    6.3 Performance Evaluation of the ReP Enhanced CLOMP . . . . . . . . 123

    6.3.1 MCBENCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    6.3.2 NPB OpenMP Benchmarks . . . . . . . . . . . . . . . . . . . . 130

    6.3.3 LINPACK Benchmarks . . . . . . . . . . . . . . . . . . . . . . . 138

    6.3.4 ReP Techniques with Multiple Threads per Process . . . . . . 142

    6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

    IV Conclusions and Future Work 147

    7 Conclusions and Future Work 149

    7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

    7.1.1 Performance Evaluation of CLOMP . . . . . . . . . . . . . . . 150

    7.1.2 SIGSEGV Driven Performance Models . . . . . . . . . . . . . . 152

    7.1.3 Performance Enhancement by RePs . . . . . . . . . . . . . . . 152

    7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

    7.2.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . 156

    7.2.2 Performance Optimizations . . . . . . . . . . . . . . . . . . . . 156

    7.2.3 Adapting ReP Techniques to the Latest Technologies . . . . . 156

    7.2.4 Potential Use of sRLE . . . . . . . . . . . . . . . . . . . . . . . 157

    V Appendices 159

    A Algorithms Used in DReP 161

    A.1 Stride-augmented Run-length Encoding Algorithms . . . . . . . . . . 162

    A.1.1 Algorithm 1: Page Fault Record Reconstruction Step (a) . . . 162

    A.1.2 Algorithm 2: Page Fault Record Reconstruction Step (b) . . . 162

    A.1.3 Algorithm 3: Page Fault Record Reconstruction Step (c) . . . . 163

    A.2 Algorithm 4: DReP Predictor . . . . . . . . . . . . . . . . . . . . . . . 163


    B Tsegv,local and Nftotal for Theoretical ReP Speedup Calculation 165

    B.1 NPB-OMP Benchmarks Datasheet . . . . . . . . . . . . . . . . . . . . 166

    B.2 LINPACK Benchmarks Datasheet . . . . . . . . . . . . . . . . . . . . 166

    C TReP and DReP Performance Results of the NPB-OMP benchmarks

    on a 4-node Intel Cluster 169

    C.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

    C.2 Sequential Elapsed Time . . . . . . . . . . . . . . . . . . . . . . . . . . 170

    C.3 TReP and DReP Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 170

    C.3.1 Elapsed Time over Gigabit Ethernet . . . . . . . . . . . . . . . 170

    C.3.2 Elapsed Time over DDR InfiniBand . . . . . . . . . . . . . . . 173

    D MultiRail Networks Optimization for the Communication Layer 177

    D.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

    D.2 Micro-Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

    D.2.1 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

    D.2.2 Single-Rail Benchmark . . . . . . . . . . . . . . . . . . . . . . . 179

    D.2.3 Multirail Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 180

    D.3 Bandwidth and Latency Experiments . . . . . . . . . . . . . . . . . . 181

    D.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 182

    D.3.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

    D.3.3 Uni-directional Bandwidth . . . . . . . . . . . . . . . . . . . . . 185

    D.3.4 Bi-directional Bandwidth . . . . . . . . . . . . . . . . . . . . . 186

    D.3.5 Elapsed Time Breakdown . . . . . . . . . . . . . . . . . . . . . 188

    D.4 Related Work on Multirail InfiniBand Network . . . . . . . . . . . . . 190

    D.5 Challenge and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 191

    E Performance of CAL 193

    E.1 Bandwidth and Latency of CAL . . . . . . . . . . . . . . . . . . . . . . 194

    E.2 Comparison Between OpenMPI and CAL . . . . . . . . . . . . . . . . 194

    Bibliography 197


    List of Figures

    2.1 OpenMP fork-join multi-threading parallelism mechanism [93] . . . . 13

    2.2 OpenMP parallel directives and associated clauses in C and C++. . . . 13

    2.3 OpenMP for directives and associated clauses in C and C++. . . . . . 14

    2.4 An example OpenMP program in C using parallel for directives. . . . . 15

    2.5 OpenMP synchronization directives in C and C++ languages: (a)

    barrier, and (b) flush. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.6 OpenMP threadprivate directive in C and C++ languages. . . . . . . . 16

    2.7 Processes and threads in CLOMP . . . . . . . . . . . . . . . . . . . . . 23

    2.8 State machine of CLOMP (derived from [47], [38], and experimental

    observation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.9 Illustration of two prefetch modes for Adaptive++ techniques. . . . . 34

    3.1 Comparison of performance between native Intel OpenMP and

    CLOMP on a XE compute node. . . . . . . . . . . . . . . . . . . . . . . 45

    3.2 Comparison of performance between native Intel OpenMP and

    CLOMP on a VAYU compute node. . . . . . . . . . . . . . . . . . . . . 46

    3.3 Performance of CLOMP on XE with a single thread per compute node. 49

    3.4 Performance of CLOMP on VAYU with a single thread per compute

    node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    3.5 Performance of CLOMP on XE with multi-threads per compute node. 52

    3.6 Performance of CLOMP on VAYU with multi-threads per compute

    node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.7 MCBENCH: An array of size a-bytes is divided into chunks of c-

    bytes. The benchmark consists of Change and Read phases that can

    be repeated for multiple iterations. Entering the Change phase of

    the first iteration, the chunks are distributed to the available threads

    (four in this case) in a round-robin fashion. In the Read phase after

    the barrier, each thread reads from the chunk that its neighbour had

    written to. This is followed by a barrier which ends the first iteration.

    For the subsequent iteration, the chunks to Change are the same

    as in the previous Read phase. That is, the shifting of the chunk

    distribution only takes place when moving from the Change to Read

    phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


    3.8 MCBENCH evaluation results of CLOMP on XE with both Ethernet

    and InfiniBand interconnects: 64KB, 4MB and 8MB array sizes

    are used in these three figures respectively; comparison among

    different chunk sizes 4B, 2KB and 4KB is illustrated in each figure

    for both Ethernet and InfiniBand. . . . . . . . . . . . . . . . . . . . . . 59

    4.1 Illustration of regions in an OpenMP parallel program. . . . . . . . . 67

    4.2 Schematic illustration of timing breakdown for parallel region using

    the SDP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    4.3 The algorithm used to determine the SDP coefficients. The code

    shown is in a parallel region. R is a private array while S is a shared

    one. Variables Dw and Dr represent reference times for accessing

    private array R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

    5.1 Pseudo code to demonstrate the memory access patterns of the naive

    LINPACK OpenMP benchmark implementation for an n × n column-

    major matrix A with blocking factor nb. . . . . . . . . . . . . . . . . . 83

    5.2 Naive OpenMP LINPACK program with an n × n matrix: (a) memory

    access areas for different iterations. (b) page fault areas for different

    iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    5.3 Pseudo code to demonstrate the memory access patterns of the

    optimized LINPACK OpenMP benchmark implementation for an n × n column-major matrix A with blocking factor nb. . . . . . . . . . 86

    5.4 Optimized OpenMP LINPACK program: (a) memory access areas for

    different iterations illustrated on an n × n matrix panel. (b) page fault

    areas for different iterations illustrated on the n × n matrix panel. . 87

    5.5 The page fault record entry for TReP and HReP prefetch techniques. 90

    5.6 A flowchart of the HReP predictor. . . . . . . . . . . . . . . . . . . . . 92

    5.7 Two levels of stride-augmented run-length encoding (sRLE) method:

    (a) Based on strides between consecutive pages, sorted missed

    pages are broken into small sub-arrays, and those consecutive pages with the same stride are stored in the same array. (b)

    The sub-arrays are compressed into the first level sRLE records

    in a (StartPageID, CommonStride, RunLength) format. (c) Based

    on the stride between the start pages of consecutive first level

    sRLE records, they are further compressed into the second level

    sRLE format, (FirstLevelRecord, CommonStride, RunLength) (more

    details in Section 5.5.1). . . . . . . . . . . . . . . . . . . . . . . . . . . 94

    5.8 Page fault record of region execution reconstructed via run-length

    encoding method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


    5.9 The effective page miss rate reduction for different prefetch tech-

    niques on 2 threads (a), 4 threads (b) and 8 threads (c). . . . . . . . . 103

    6.1 Intel Cluster OpenMP runtime structure. . . . . . . . . . . . . . . . . 112

    6.2 Data structure for stride-augmented run-length encoded page fault

    records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

    6.3 ReP prefetch record data structure. . . . . . . . . . . . . . . . . . . . . 115

    6.4 User interactive interface of new region notification. . . . . . . . . . . 115

    6.5 The round-robin prefetch request communication pattern. . . . . . . 118

    6.6 New page state machine after introducing the Prefetched diff and

    Prefetched page states. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

    6.7 RePs VS. Original CLOMP: MCBENCH with 4B chunk size over both

    the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

    6.8 RePs VS. Original CLOMP: MCBENCH with 2048 bytes chunk size

    over both the GigE and IB networks. (a) 64KB array size, (b) 4MB

    array size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . 127

    6.9 RePs VS. Original CLOMP: MCBENCH with 4KB chunk size over

    both the GigE and IB networks. (a) 64KB array size, (b) 4MB array

    size, (c) 8MB array size. . . . . . . . . . . . . . . . . . . . . . . . . . 130

    6.10 RePs VS. Original CLOMP: BT speedup comparison on both GigE

    and IB networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

    6.11 RePs VS. Original CLOMP: the naive LINPACK evaluation results

    comparison using an N × N matrix (N = 4096) with blocking factor NB =

    64 via both GigE and IB. . . . . . . . . . . . . . . . . . . . . . . . . . . 140

    6.12 RePs VS. Original CLOMP: the optimized LINPACK evaluation

    results comparison using an N × N matrix (N = 8192) with blocking

    factor NB = 64 via both GigE and IB. . . . . . . . . . . . . . . . . . . 141

    6.13 DReP vs Original CLOMP: the optimized LINPACK benchmark (N =

    8192 and NB = 64) results comparison with multiple threads per

    process via both GigE and IB. (a) 2 threads per process, (b) 4 threads

    per process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

    C.1 Speedup of the BT and CG benchmarks over Gigabit Ethernet. . . . 171

    C.2 Speedup of IS and LU benchmarks over Gigabit Ethernet. . . . . . . 172

    C.3 Speedup of BT and CG benchmarks over DDR InfiniBand. . . . . . . 174

    C.4 Speedup of IS and LU benchmarks over DDR InfiniBand. . . . . . . . 175

    D.1 Single-rail bandwidth benchmark . . . . . . . . . . . . . . . . . . . . . 179

    D.2 Multirail communication memory access pattern. . . . . . . . . . . . 180

    D.3 Non-threaded multirail bandwidth benchmark . . . . . . . . . . . . . 182


    D.4 Threaded multirail benchmark design. . . . . . . . . . . . . . . . . . . 183

    D.5 RDMA write latency comparison. . . . . . . . . . . . . . . . . . . . . . 184

    D.6 Uni-directional multi-port bandwidth. . . . . . . . . . . . . . . . . . . 185

    D.7 Uni-directional multi-HCA bandwidth. . . . . . . . . . . . . . . . . . . 186

    D.8 Bi-directional multi-port bandwidth. . . . . . . . . . . . . . . . . . . . 187

    D.9 Bi-directional multi-HCA bandwidth. . . . . . . . . . . . . . . . . . . 188

    D.10 Benchmarks elapsed time breakdown for 512-byte messages. . . . . . . 188

    D.11 Benchmarks elapsed time breakdown for 4KB messages. . . . . . . . . 189

    D.12 Different ways to configure an InfiniBand multirail network [62]. . . . 190


    List of Tables

    2.1 OpenMP synchronization operations. . . . . . . . . . . . . . . . . . . . 17

    3.1 Evaluation experimental hardware platforms. . . . . . . . . . . . . . 43

    3.2 Sequential elapsed time (sec) of NPB with CLOMP. . . . . . . . . . . 44

    3.3 Page faults handling cost (SEGV Cost) of CLOMP for NPB bench-

    marks as a ratio to corresponding elapsed time with single thread

    per process on XE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    3.4 Page faults handling cost breakdown for CLOMP for class A NPB

    benchmarks with multiple threads per process on XE. SEGV

    represents the ratio of page faults handling cost to the corresponding

    elapsed time; SEGV Lock in turn represents a ratio of pthread mutex

    cost within SEGV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.1 Critical path page faults counts for the NPB-OMP benchmarks run

    using CLOMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    4.2 Comparison between observed and estimated speedup for running

    NPB class A and C on the AMD cluster with CLOMP . . . . . . . . . 77

    4.3 Average relative errors for the predicted NPB speedups evaluated

    using the critical path and aggregate (f = 0) SDP models and data from Table 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    5.1 Threshold effects of ReP techniques for naive LINPACK benchmark. 98

    5.2 Simulation prefetch efficiency (E) and coverage (Nu/Nf) for Adapt-

    ive++, TODFCM (1 page), TReP, HReP and DReP techniques. . . . . 108

    5.3 Breakdown of prefetches issued by different prefetch modes and

    chosen list deployed in HReP. . . . . . . . . . . . . . . . . . . . . . . . 109

    5.4 Comparison of F-HReP and HReP with the LU benchmark. . . . . . 109

    6.1 Bandwidth and latency measured by the communication layer (CAL) of CLOMP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

    6.2 ReP techniques prefetch efficiency and coverage for MCBENCH with

    4MB array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    6.3 Message transfer counts (×1000) comparison between RePs enhanced

    CLOMP and the original CLOMP for MCBENCH with 4B chunk . . 126

    6.4 Message transfer counts (×1000) comparison between RePs enhanced

    CLOMP and the original CLOMP for MCBENCH with 2KB chunk . 128

    6.5 Message transfer counts (×1000) comparison between RePs enhanced

    CLOMP and the original CLOMP for MCBENCH with 4KB chunk . 129


    6.6 Page fault handling costs comparison for the BT benchmark among the

    original CLOMP, the theoretical, and the ReP techniques enhanced

    CLOMP. The computation part of elapsed time is common to all

    compared items. The page fault handling costs of the original

    CLOMP are presented in seconds, and those of the others are presented as a

    reduction ratio (e.g. (Orig - TReP)/Orig). . . . . . . . . . . . . . . . . 132

    6.7 Page fault handling costs reduction ratio ((Tsegv,orig - Tsegv) / Tsegv,orig) comparison for

    other NPB benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

    6.8 Detailed Tsegv breakdown analysis of the IS Class A Benchmark

    for the ReP techniques. Overall Tsegv stands for overall CLOMP

    overhead. TMK Comm stands for the communication time spent

    by TMK for data transfer. TMK local stands for the local software

    overhead of TMK layer. ReP Comm stands for the communication

    time spent on prefetching data. ReP local stands for the local

    software overhead introduced by using the ReP prefetch techniques.

    Tsegv is presented in seconds and its components are presented as a

    ratio to the overall Tsegv . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

    6.9 Detailed Tsegv breakdown analysis of the IS Class C Benchmark

    for the ReP techniques. Overall Tsegv stands for overall CLOMP

    overhead. TMK Comm stands for the communication time spent

    by TMK for data transfer. TMK local stands for the local software

    overhead of TMK layer. ReP Comm stands for the communication

    time spent on prefetching data. ReP local stands for the local

    software overhead introduced by using the ReP prefetch techniques.

    Tsegv is presented in seconds and its components are presented as a

    ratio to the overall Tsegv . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

    6.10 Sequential elapsed time for LINPACK benchmarks . . . . . . . . . . 138

    6.11 Page fault handling costs comparison for LINPACK benchmarks

    among the original CLOMP, the theoretical, and the ReP techniques enhanced CLOMP. The computation part of elapsed time is common

    to all compared items. The page fault handling costs of the original

    CLOMP are presented in seconds, and those of the others are presented as a

    reduction ratio (e.g. (Orig - TReP)/Orig). . . . . . . . . . . . . . . . . 139

    6.12 Page faults handling cost comparison between DReP and the ori-

    ginal CLOMP for the optimized LINPACK benchmark with multiple

    threads per process. SEGV represents the ratio of page faults

    handling cost to the corresponding elapsed time; SEGV Lock in

    turn represents a ratio of pthread mutex cost within SEGV. . . . . . 142


    B.1 Tsegv,local (sec) for some NPB-OMP benchmarks with different num-

    ber of processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

    B.2 Nftotal for some NPB-OMP benchmarks with different number of

    processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

    B.3 Tsegvtotal (sec) for LINPACK benchmarks with different number of

    processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

    B.4 Nftotal for LINPACK benchmarks with different number of processes. 167

    C.1 Elapsed Time (sec) of some NPB-OMP Benchmarks on one thread. . 170

    E.1 Complete bandwidth and latency measured by the communication

    layer (CAL) of CLOMP on XE. . . . . . . . . . . . . . . . . . . . . . . . 194

    E.2 Comparison of CAL and OpenMPI: bandwidth and latency measured

    on XE via GigE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

    E.3 Comparison of CAL and OpenMPI: bandwidth and latency measured

    on XE via DDR IB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
