GipherFS (fenix.tecnico.ulisboa.pt)

GipherFS

A GPU Accelerated Ciphered File System

José Duarte Gomes Pires Lourenço

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisors: Prof. Ricardo Jorge Fernandes Chaves, Prof. Aleksandar Ilic

Examination Committee

Chairperson: Prof. Alberto Manuel Rodrigues da Silva

Supervisor: Prof. Ricardo Jorge Fernandes Chaves

Member of the Committee: Prof. Miguel Ângelo Marques de Matos

June 2017


Acknowledgments

I would like to thank my supervisors, Professor Ricardo Chaves and Professor Aleksandar Ilic,

for their support and for the opportunity to finish this dissertation. Their knowledge and experience

provided me the right guidance to complete this work.

I would also like to thank my co-workers, especially Ricardo Ramalho, for their support, availability

and precious advice throughout the course of this work.

Finally, I would like to thank my family and especially my girlfriend Sofia for her patience, support,

understanding and motivation. My most sincere thanks to all of them.


Abstract

Information security is a major concern in computer systems, and its importance has only increased in recent years due to the pervasive use of information technology in everyday activities. The unauthorized release of confidential data is one of the main threats to information security and is commonly addressed by cryptographic mechanisms, through encryption. However, encryption algorithms are typically computationally demanding and can place a heavy burden on a machine's resources, especially in systems without accelerators, thus limiting and often degrading overall performance when several users run concurrent applications.

On the other hand, the Graphics Processing Unit (GPU) has evolved from a specialized computer graphics accelerator into a general-purpose co-processor. The vast amount of computing power provided by the GPU, combined with its wide availability in commodity computer systems, presents a great opportunity for delivering computationally expensive features without the penalty of degraded system performance.

The work presented in this dissertation focuses on improving file system security by transparently integrating efficient encryption mechanisms into its operation. It proposes to offload the AES encryption algorithm, used for protecting file data, to the GPU, using CUDA. For this purpose, a protection layer was added to the ext4 file system, using a Virtual File System based on FUSE with built-in confidentiality mechanisms, using the GPU as a cryptographic co-processor.

The experimental results obtained show that relocating this computation to the GPU minimizes the impact on the CPU of introducing cryptographic mechanisms and improves its throughput by a factor of 3.5.

Keywords

Confidentiality, File System, Cryptography, AES, CUDA, FUSE


Resumo

Information security is one of the main concerns in computer systems, and its importance has been increasing in recent years due to the widespread use of information technology in everyday activities. Unauthorized access to confidential information is one of the main threats to information security and is usually prevented through cryptographic mechanisms. However, these mechanisms are typically computationally demanding and consume many resources, especially in systems without accelerators, limiting and degrading the overall performance of systems used by several users simultaneously.

On the other hand, the GPU has evolved from a specialized graphics accelerator into a general-purpose co-processor. The vast computational power made available by the GPU, together with its wide availability, presents a great opportunity for implementing computationally demanding features without affecting system performance.

The work presented in this dissertation focuses on improving file system security by integrating cryptographic mechanisms into its operation, efficiently and transparently. It proposes relocating the AES algorithm, used to protect file data, to the GPU, using CUDA. To this end, a protection layer was added to the ext4 file system, using a virtual file system, based on FUSE, that incorporates confidentiality mechanisms and uses the GPU as a cryptographic co-processor.

The experimental results show that relocating this computation to the GPU minimizes the impact of introducing cryptographic mechanisms on the CPU and improves its throughput by a factor of 3.5.

Palavras Chave

Confidentiality, File System, Cryptography, AES, CUDA, FUSE


Contents

1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Requirements
  1.4 Dissertation Outline

2 Background and State of the Art
  2.1 Cryptography
    2.1.1 Symmetric-key cryptography
    2.1.2 Asymmetric-key cryptography
  2.2 Graphics Processing Unit
    2.2.1 Evolution of the GPU architecture
    2.2.2 CUDA Programming Model
  2.3 User Space File Systems
  2.4 State of the Art

3 Proposed Solution
  3.1 Automatic implicit encryption
  3.2 Encryption mechanism
    3.2.1 Block-based implementation of read and write operations
    3.2.2 Padding: PKCS#7

4 Solution Implementation
  4.1 plainfs: a non-ciphered file system
    4.1.1 Intercepting I/O operations
    4.1.2 Performing I/O operations on the underlying file system
  4.2 cipherfs: a non-accelerated ciphered file system
    4.2.1 Block-based I/O
    4.2.2 AES functions
  4.3 gipherfs: a GPU-accelerated ciphered file system
    4.3.1 GPU implementation of AES
    4.3.2 Maximizing payload length: FUSE big-writes mode
    4.3.3 Improving device memory management

5 Solution Evaluation
  5.1 Direct integration of GPU AES acceleration
  5.2 Enabling big-writes in the proposed solution
  5.3 Improving device memory management

6 Conclusions and Future Work

Bibliography


List of Figures

2.1 Private-key cipher [17]
2.2 AES round operations
2.3 ECB encryption [26]
2.4 ECB decryption [26]
2.5 CBC encryption [26]
2.6 CBC decryption [26]
2.7 CTR encryption [26]
2.8 CTR decryption [26]
2.9 GPU vs CPU performance evolution [30]
2.10 G80 architecture
2.11 Fermi architecture [38]
2.12 Full Maxwell chip block [40]
2.13 CUDA automatic load balancing
2.14 CUDA two-dimensional block grid
2.15 CUDA memory hierarchy
2.16 FUSE architecture [41]
3.1 Virtual File System: read and write operations
3.2 Straightforward read operation
3.3 Block-based read operation
3.4 Padding of a partial block
3.5 Padding of a complete block
4.1 Virtual File System: read and write operations
4.2 User-level file system I/O operations implementation
4.3 I/O operations redirection
4.4 Encryption transformations: adding padding and ciphering
4.5 Decryption transformations: deciphering and removing padding
4.6 Handling a 4096-byte payload, made of 256 AES blocks
5.1 Naive gipherfs: read bandwidth
5.2 Naive gipherfs: write bandwidth
5.3 AES encryption benchmark: average block processing time
5.4 AES encryption benchmark: average total processing time
5.5 AES encryption benchmark: average bandwidth
5.6 Increasing gipherfs write payload: write bandwidth
5.7 GPU memory operations benchmark: relative cost
5.8 Improved device memory management in gipherfs: read bandwidth
5.9 Improved device memory management in gipherfs: write bandwidth
5.10 gipherfs optimizations: increased read bandwidth
5.11 gipherfs optimizations: increased write bandwidth


List of Tables

2.1 Maxwell's key properties [40]

4.1 List of FUSE operations and ext4 counterparts


Abbreviations

3DES Triple Data Encryption Standard

AES Advanced Encryption Standard

AES-NI AES New Instructions

API Application Programming Interface

CBC Cipher Block Chaining

CPU Central Processing Unit

CTR Counter

CUDA Compute Unified Device Architecture

DES Data Encryption Standard

DRAM Dynamic Random Access Memory

ECB Electronic Codebook

ECC Error-Correcting Code

ext4 Fourth Extended Filesystem

FIPS Federal Information Processing Standards

FUSE Filesystem in Userspace

GBPS Gigabytes per second

GPC Graphical Processing Cluster

GPU Graphics Processing Unit

GPGPU General Purpose GPU

NIST National Institute of Standards and Technology

OS Operating System

PGA Professional Graphics Controller


PC Personal Computer

PKCS Public Key Cryptography Standards

RSA Rivest Shamir Adleman

SIMT Single-Instruction Multiple-Thread

SM Streaming Multiprocessor

SMM Maxwell Streaming Multiprocessor

SP Streaming Processor

SPA Scalable Processor Array

VFS Virtual File System

XTS XEX-based tweaked-codebook mode with ciphertext stealing


1 Introduction

Contents

1.1 Motivation
1.2 Goals
1.3 Requirements
1.4 Dissertation Outline


1.1 Motivation

Security awareness has been growing among computer users, and one of the major concerns is the confidentiality of users' data, which is not trivial to achieve, especially in modern multi-user computer systems. Several mechanisms [1] [2] [3] have been proposed to ensure confidentiality, but the most typical approach, and one of the most practical, is to encrypt data [4] [5], particularly data stored on the hard drive. The basic principle of encryption is to transform data into a form unintelligible to an adversary. This transformation, especially when dealing with large amounts of data, is done by symmetric encryption algorithms, such as the Advanced Encryption Standard (AES) [7]. These algorithms are based on a secret value called the secret key, which must be known in order to encrypt and decrypt the data. While this is a relatively efficient encryption approach, when large amounts of data need to be accessed and protected the cost of this process can be significant, placing an additional load on the Central Processing Unit (CPU). Even with multi-core CPUs and the introduction of specialized dedicated instructions at the hardware level, the impact of cryptographic primitives on a system's CPU load can still be quite noticeable. This is especially the case when the CPU load is already very high and few computational resources are left available. In these situations, security mechanisms, such as the data confidentiality provided by the AES encryption algorithm, tend to be overlooked and considered expendable in favour of the system's functional requirements.

On the other hand, Graphics Processing Units (GPUs) have become pervasive in most computational systems, from servers and desktops to mobile phones [8] and even some embedded devices [9]. In recent years GPU computational power has increased significantly, mainly driven by the high requirements of graphical applications. However, this computing power is no longer restricted to graphical computation: thanks to the introduction of general-purpose programming environments [10] [11], other types of applications can now leverage GPU compute resources. This has made the GPU a very popular parallel computing platform, especially in the scientific community, where it has enabled tackling problems previously considered intractable. Motivated by this fact, the work proposed in this dissertation focuses on exploiting the computational power of the GPU to enhance the file system with cryptographic protection mechanisms, with the goal of not only providing faster encryption but also offloading the CPU.

In brief, this dissertation proposes a solution to the problem of building a secure system that provides confidentiality of user file data, by using the AES algorithm, the standard for data encryption, while minimizing the penalty to overall system performance caused by the additional load on the machine's CPU. The proposed approach offloads the computationally heavy operations required by AES from the CPU to the seldom-used and computationally powerful GPU. This relocation drastically reduces the impact of the AES cryptographic primitives on the CPU, apart from the required execution control of GPU kernel launches and data transfers. At the same time, it aims at making the most of the usually underused GPU capabilities to support this widely adopted data encryption standard. Moreover, the protection is added as an additional layer at the file system level, so as to make it as transparent to the end user as possible. A Virtual File System is implemented on top of ext4, the Fourth Extended Filesystem [12], using the Filesystem in Userspace (FUSE) framework [13] [14].
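Conceptually, such a file system layer is a wrapper that encrypts on write and decrypts on read, so callers only ever see plain text while the backing store holds cipher text. The miniature Python sketch below illustrates that idea outside FUSE; the `CipheredFile` class, its key handling and its SHA-256-based keystream (a stand-in for AES-CTR) are illustrative assumptions, not the thesis implementation.

```python
import hashlib
import io

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a pseudo-random keystream by hashing key || nonce || counter.
    Illustrative stand-in for AES-CTR; not a vetted construction."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:length])

class CipheredFile:
    """Transparent layer: callers read/write plain text at byte offsets,
    while the backing store only ever contains cipher text."""

    def __init__(self, backing: io.BytesIO, key: bytes, nonce: bytes):
        self.backing, self.key, self.nonce = backing, key, nonce

    def write(self, offset: int, data: bytes) -> None:
        # XOR the plain text with the keystream bytes for this offset range.
        ks = keystream(self.key, self.nonce, offset + len(data))[offset:]
        self.backing.seek(offset)
        self.backing.write(bytes(a ^ b for a, b in zip(data, ks)))

    def read(self, offset: int, length: int) -> bytes:
        # XOR with the same keystream bytes to recover the plain text.
        self.backing.seek(offset)
        ct = self.backing.read(length)
        ks = keystream(self.key, self.nonce, offset + len(ct))[offset:]
        return bytes(a ^ b for a, b in zip(ct, ks))
```

Because the keystream is addressable by byte offset, reads and writes at arbitrary positions round-trip without touching the rest of the file, which is the property the block-based FUSE handlers need.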

Other similar approaches have been proposed for solving this problem, in diverse settings [15] [16]. Our approach goes beyond the existing state of the art by proposing a complete file system design from scratch, rather than the modification of a pre-existing encrypted file system. This makes it possible to tune specific aspects of the file system's operation, making it a better fit for the GPU computing paradigm and allowing full use of the GPU's computing power, thus maximizing the integration benefits. Our experiments further support the validity of this approach: read and write throughputs on the order of 580 MB/s and 95 MB/s, respectively, are presented, improving on previous state-of-the-art solutions by a factor of 2.

1.2 Goals

The importance of information security has long been recognised as a major concern in computer science, and data confidentiality is probably one of its most valued properties for the general public. This is no longer an exclusive concern of the expert user; even less security-aware users can easily recognise the value of keeping their private data safe from unauthorised access.

This dissertation aims at implementing a secure file system that provides confidentiality of its users' data automatically, while minimising the impact on overall system performance. Its main goal is thus to enable widespread use of cryptographic encryption mechanisms ensuring data privacy, by making them as transparent and efficient as possible, so as to minimise any potential obstacles hindering their use.

Confidentiality

The system must keep user data private and accessible only to its legitimate owners. Any unauthorised access attempt should be prevented by making the data unintelligible to those not allowed to access it.

Performance

The system mechanisms providing privacy of user data must be efficient, and their impact on overall system performance must be kept to a minimum. Moreover, the solution should make full use of all the GPU computing resources available in the system in order to maximize its efficiency.

Usability

Security mechanisms tend to be disabled by their users when they are not easy to use, for instance when they require repetitive user interaction. To prevent these shortcomings, the system must operate transparently to the end user. Ideally, the user should not even need to be aware of the system's additional security features in order to benefit from them.

The main goal of this dissertation is thus to design and implement a system that fulfils the aforementioned properties, delivering a secure system that ensures the privacy of its users' data in an efficient and transparent way.

1.3 Requirements

Based on the goals of this dissertation, a formal set of requirements can be specified. The system proposed in this document must fulfil the following requirements:

1. User file data should be stored encrypted on the hard disk;

2. The cryptographic mechanisms should be executed on the GPU whenever it provides better performance than the CPU;

3. Users without administrator rights should be able to use the system;

4. The cryptographic mechanisms implemented on a GPU with sufficient resources should perform better than an equivalent CPU implementation and provide a higher access bandwidth to encrypted files;

5. The system's cryptographic mechanisms should incur negligible load on the system's CPU;

6. File data privacy should be transparent and must not require additional user interaction.
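Requirement 2 implies a dispatch decision rather than an unconditional GPU path. One hypothetical way to express it is a size-based policy, since small buffers are dominated by host-to-device transfer cost; the function, the backend names and the 64 KiB threshold below are assumptions for illustration, to be calibrated by benchmarking both paths.

```python
def choose_backend(payload_len: int, gpu_available: bool,
                   threshold: int = 64 * 1024) -> str:
    """Pick where to run the AES work for one request.

    Small payloads stay on the CPU, where there is no transfer
    overhead; large payloads amortize the host-device copies and
    benefit from the GPU's parallelism. The threshold is a
    placeholder, not a measured crossover point."""
    if gpu_available and payload_len >= threshold:
        return "gpu"
    return "cpu"

# A 4 KiB request stays on the CPU; a 1 MiB request goes to the GPU.
assert choose_backend(4096, gpu_available=True) == "cpu"
assert choose_backend(1 << 20, gpu_available=True) == "gpu"
```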

1.4 Dissertation Outline

This document describes a secure file system which leverages state-of-the-art mechanisms for using the GPU as a cryptographic co-processor, ensuring the privacy of its users' file data. The proposed solution, its implementation and the experimental results are presented in the following chapters, organised as follows:

• Chapter 2 presents the background and the state-of-the-art analysis which determined the proposed solution's design. The background sections describe: cryptographic mechanisms, concerned with providing information security and, more specifically, data encryption algorithms for data privacy; the use of the GPU as a general-purpose processor, delivering a large amount of computing power in diverse application fields; and user space file systems, which can be used for implementing file systems outside kernel space. The state-of-the-art section presents previous attempts at using the GPU for accelerating data encryption algorithms, accelerating typical file system operations, and integrating both into GPU-accelerated cryptographic file systems.

• Chapter 3 describes the proposed solution: a secure encrypted file system with transparent cryptographic mechanisms, efficiently implemented using the GPU as a cryptographic co-processor.

• Chapter 4 discusses the implementation details of the proposed solution. It describes the implementation of a FUSE file system, the implementation of the AES encryption algorithm on the GPU, and the integration of both into a file system with enhanced cryptographic mechanisms.

• Chapter 5 presents the experimental evaluation of the proposed solution. This chapter describes how the system's design and implementation achieve the goals and requirements of the dissertation. For this purpose, a comparison is made between different file systems enhanced with cryptographic mechanisms implemented on the CPU and on the GPU.

• Chapter 6 concludes this document by summarising the work developed in the context of this dissertation and introducing future work directions which can be followed based on the proposed solution.


2 Background and State of the Art

Contents

2.1 Cryptography
2.2 Graphics Processing Unit
2.3 User Space File Systems
2.4 State of the Art


This chapter presents the relevant background on the main topics covered in this dissertation. Section 2.1 provides a brief introduction to cryptography and how it can be used to provide data confidentiality. Section 2.2 describes the GPU and its use as a general-purpose co-processor to accelerate primitives typically implemented on the CPU. The development of file systems in user space, and the frameworks allowing it, are presented in Section 2.3. Finally, Section 2.4 presents previous state-of-the-art solutions to the problem covered in this dissertation.

2.1 Cryptography

Cryptography [4] [17] [18] has historically been defined as the art of writing and solving codes. Its origins date back several centuries, when it was first used for exchanging secret messages between two parties in the presence of adversaries. Modern cryptography, however, is a much vaster field and has become a science in its own right, more than solely an art form. Its use is no longer restricted to secret communication; it can be described nowadays as the science concerned with problems that may arise in any distributed computation subject to internal or external attacks. These include not only privacy but also message authentication, digital signatures, secret key exchange and authentication protocols. Thus, modern cryptography can be more adequately defined as the scientific study of techniques for securing digital information, transactions and distributed computations.

Modern cryptography also differs from classical cryptography with respect to the entities using it. Roughly until the 1980s, cryptography was mainly used by military and intelligence organisations. In fact, the use of cryptography in World War II to decipher enemy communications was one of the main drivers for the development of computer science. This is no longer the case, as cryptography is an integral part of any modern computer system. Even if mostly unknowingly, users rely on cryptography for tasks such as securely accessing a website, ensuring the integrity of an e-mail message or enforcing access control in multi-user computer systems.

Due to its wide range of applications, cryptography is currently a central topic in computer science, and its main purposes can be summarised as follows:

• ensuring the confidentiality of private data exchanged between a sender and its intended recipient;

• authenticating data so that its source is correctly identified;

• ensuring message integrity, so that its content upon retrieval by the intended recipient is the same as when it was originally sent;

• preventing the sender of a message from refuting the claim of having sent it, also known as non-repudiation;

• specifying and controlling access to resources;

• ensuring the availability of resources to authorised parties at all times.


Figure 2.1: Private-key cipher [17]

Even though they are obviously related, these concerns do not overlap and each aims at provid-

ing distinct security properties which can then be used as the building blocks of a secure computer

system. The cryptographic mechanisms used to tackle each of these specific problems are based on

the following main concepts:

• plain text: the state in which information can be understood by the sender and recipient of a

message, but also by anyone who may intercept it;

• cipher text: obtained by transforming the original content of a message in such a way that it

becomes unintelligible to anyone but its sender and intended recipient;

• encryption: the process by which plain text is transformed into cipher text;

• decryption: the reverse process, which transforms cipher text back into its original legible form;

• key: a secret shared by two communicating parties, used to encrypt and decrypt data.

Despite its various potential uses, the remainder of this section will focus on the use of cryptogra-

phy to ensure confidentiality of user data in computer systems, as that is the goal of this dissertation.

Cryptographic mechanisms for ensuring data confidentiality can be divided into two main classes,

described next in the context of private communication between two parties, in the presence of ad-

versaries.

2.1.1 Symmetric-key cryptography

Symmetric-key cryptographic mechanisms rely on the existence of a secret piece of information

shared exclusively among the communicating parties in advance. This information is called a private

key and it is used by the communicating parties whenever they need to exchange messages. The

sender uses the private key to encrypt the message, transforming it into cipher text, unintelligible to any

adversary capable of intercepting the message. Symmetrically, the receiver uses the same private

key to decrypt the message back into its original plain text form. In this setting the possession of


the unique private key distinguishes the communicating parties from any adversaries. It is the sym-

metry of both the sender and the receiver using the same key that makes this class of cryptographic

mechanisms known as symmetric-key, which is depicted by Figure 2.1.

Symmetric-key schemes assume that the communicating parties are able to securely exchange

the secret private key prior to the exchange of messages. This may not always be easy to achieve in

the modern distributed setting, and is in itself a major concern limiting the applicability of symmetric-

key mechanisms. There are however several applications for which this is not a great concern, of

which disk encryption is an example. In this case the key does not actually need to be distributed

among distinct parties, as it must only be known by the owner of the data in order to encrypt and

decrypt its files.

The Data Encryption Standard (DES) [22], the Triple Data Encryption Standard (3DES) [24], the

Advanced Encryption Standard (AES) [7] and Blowfish [25] are some of the most relevant and widely

used symmetric encryption algorithms of all time. DES and 3DES are mentioned for historical rea-

sons, since their use is no longer recommended. AES and Blowfish are widely used today and AES

is actually considered as the de facto standard for data encryption.

AES

The Advanced Encryption Standard (AES), as defined by the Federal Information Processing

Standards (FIPS) Publication #197 in 2001 [7], is the standard symmetric encryption scheme rec-

ommended by the National Institute of Standards and Technology (NIST) of the United States. It

replaced the previous standard, the Data Encryption Standard (DES), and was the winner of a 3-year-

long competition, out of a total of 15 competing proposals. It is based on the Rijndael [19] algorithm,

created by Joan Daemen and Vincent Rijmen. The choice of holding a competition to choose the new

encryption standard has been regarded as an ingenious one because it was the perfect motivation for

competing groups, composed of the world's best cryptanalysts, to try to attack and exploit each

other's proposals, thereby greatly boosting confidence in the security of the winner.

Since its recognition by NIST, AES has been widely regarded as secure and efficient and has been

adopted by both the government and the industry worldwide. It is the de facto symmetric encryption

standard and almost any new protocol that requires symmetric encryption supports it.

AES is a block cipher that uses a block size of 128 bits and key sizes of 128, 192 or 256 bits. It is

essentially a substitution-permutation network [17] that operates on a 4x4 array of bytes, called state,

initialised with the cipher input, and an expanded key, derived from the master secret key.

It operates by repeatedly applying rounds of transformations to the state. The number of rounds

varies with the key size and amounts to 10, 12 and 14 for 128-, 192- and 256-bit keys, respectively.

The four operations illustrated in Figure 2.2 are performed in each round of the algorithm, except

for the final round where the MixColumns stage is replaced by an additional AddRoundKey. These

operations consist of:

• AddRoundKey: XORs a 16-byte round key, derived from the master key, with the state array;


(a) Add round key operation

(b) SBox Translation operation

(c) Shift rows operation

(d) Mix columns operation

Figure 2.2: AES round operations


• SubBytes: replaces each byte of the state array, according to a single constant substitution

table, called S-box ;

• ShiftRows: cyclically rotates the bytes in each row of the state array to the left by the row

index, starting at zero;

• MixColumns: mixes each column via an invertible linear transformation in which each column is

interpreted as a polynomial. This step can also be viewed as a matrix multiplication.

Each round of the algorithm uses a different 128-bit round key. This key is derived from the original

private master key by applying a Key Expansion algorithm and is independent of the processed data,

meaning it can be done prior to the encryption/decryption phase.

The decryption process can be implemented in two equivalent ways: the inverse cipher and the

equivalent inverse cipher, which differ in the order of the round stages and in the way that the decryp-

tion round keys are defined.

AES has undergone intense scrutiny ever since the competition that led to its selection and the

only cryptanalytic attacks found so far [20] are for reduced-round variants of it. Such reduced-round

variants are not standard compliant, though. Side-channel attacks [21], a different class of attacks that

assume that the adversary is able to make measurements such as the time taken or the power con-

sumed by the processor during encryption, have been able to extract secret keys in a reasonable

amount of time. However, these kinds of attacks make very strong assumptions about the power of the

adversary and are not considered feasible in the common scenario. Besides its security guarantees,

AES is free and standardised, being available on virtually any platform. It is considered an excellent

choice for almost any cryptographic implementation.

AES-NI

Since its adoption as a NIST standard, AES has become very popular and its importance is ex-

pected to continue to rise in the near future. Any possible improvements to its performance and

security are therefore regarded as highly beneficial. With this in mind, in 2010, Intel introduced a new

set of processor instructions AES-NI (AES New Instructions) aimed at providing hardware accelera-

tion for AES [6].

The architecture consists of six instructions: AESENC, AESENCLAST, AESDEC and AESDE-

CLAST, supporting the encryption and decryption processes; and AESIMC and AESKEYGENAS-

SIST, for expanding the AES key. The instructions support the standard 128-, 192- and 256-bit key lengths,

use the 128-bit block size and are well suited to all common uses of AES: bulk encryption/de-

cryption using any of the ECB, CBC and CTR cipher modes; data authentication; random number

generation and authenticated encryption.

AESENC and AESDEC are short for AES encrypt and decrypt round, while AESENCLAST and

AESDECLAST are short for AES encrypt and decrypt last round. These instructions operate in

both register-register and register-memory modes and perform the longest AES encryption/decryption

sequence possible, without introducing any branch variation.


AESKEYGENASSIST and AESIMC are short for AES Key Generation Assist and AES Inverse

Mix Columns. They are used for assisting the generation of the encryption round keys and converting

them to a form suitable for decryption using the Equivalent AES Inverse Cipher method, respectively.

Given the input block and the AES key, it is possible to generate the required expanded key and

implement any key-size variation of the AES algorithm with this set of instructions. Moreover, non-

AES-compliant versions of the original Rijndael algorithm, e.g. versions with a different block size,

can also be implemented using these instructions.

In addition to the obvious performance benefits provided by these instructions, when used properly

they can also improve security, because they were designed to mitigate the previously mentioned

AES timing and cache side-channel attacks.

Modes of operation

A block cipher mode of operation determines how the blocks making up the input of the encryption

algorithm are combined during the encryption process. Several modes with distinct security properties

have been proposed over the years.

Figure 2.3: ECB encryption [26]

The Electronic Codebook (ECB) is the simplest of all modes of operation and encrypts all blocks in

the same way. It has no inter-block dependencies, making it easily parallelizable. Figures 2.3 and 2.4

show ECB’s encryption and decryption process, which consist of simply applying the block algorithm

on each data block. This means that two identical plain texts will result in two identical cipher texts,

which makes it rather insecure.

Figure 2.4: ECB decryption [26]


Figure 2.5: CBC encryption [26]

The Cipher Block Chaining (CBC) mode of operation was developed by IBM in 1976. It is based

on XORing each block of plain text with the output of ciphering the previous block, except for the first

block which is XORed with an initialization vector, as Figure 2.5 shows. The decryption process is

similar, as is shown in Figure 2.6. CBC is more secure than ECB, but the inter-dependency between

blocks prevents its encryption from being parallelized (decryption, in contrast, can be parallelized).

Figure 2.6: CBC decryption [26]
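A minimal C sketch of the CBC chaining follows, again with a hypothetical 4-byte toy block function standing in for a real cipher. Note how each iteration consumes the previous cipher text block, which is precisely what serialises encryption:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy 4-byte "block cipher" standing in for AES (NOT secure). */
static void toy_encrypt_block(const unsigned char in[4],
                              unsigned char out[4],
                              const unsigned char key[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (unsigned char)((in[i] ^ key[i]) + key[(i + 1) % 4]);
}

/* CBC: each plain text block is XORed with the previous cipher text
 * block (the IV for the first block) before being encrypted. */
static void cbc_encrypt(const unsigned char *pt, unsigned char *ct,
                        size_t nblocks, const unsigned char key[4],
                        const unsigned char iv[4]) {
    unsigned char prev[4];
    memcpy(prev, iv, 4);
    for (size_t b = 0; b < nblocks; b++) {
        unsigned char x[4];
        for (int i = 0; i < 4; i++)
            x[i] = pt[4 * b + i] ^ prev[i];  /* chain with previous block */
        toy_encrypt_block(x, ct + 4 * b, key);
        memcpy(prev, ct + 4 * b, 4);  /* this dependency serialises encryption */
    }
}
```

Unlike ECB, repeated plain text blocks now encrypt differently, because each block's input depends on everything encrypted before it.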

Figure 2.7: CTR encryption [26]

The Counter (CTR) mode differs from the previous modes in that it introduces a nonce which is

similar to an initialization vector, but includes a counter element in order to introduce variability. CTR

operates by encrypting the nonce/counter value, incrementing the counter by one for each block, and XORing the result

with the plain text block, as shown in Figure 2.7. The decryption process is equivalent, as is presented

in Figure 2.8. The lack of dependency between blocks makes it possible to parallelize CTR while the

nonce makes it a secure mode of operation.


Figure 2.8: CTR decryption [26]
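The counter-mode structure can be sketched as follows, with a toy block function standing in for AES and a nonce/counter byte layout chosen purely for illustration. Because the cipher is only used to generate a keystream, applying the same operation twice recovers the plain text, and every block is independent:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy 4-byte "block cipher" standing in for AES (NOT secure). */
static void toy_encrypt_block(const unsigned char in[4],
                              unsigned char out[4],
                              const unsigned char key[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (unsigned char)((in[i] ^ key[i]) + key[(i + 1) % 4]);
}

/* CTR: encrypt (nonce || counter) to produce a keystream block, then
 * XOR it with the data. Encryption and decryption are the same call. */
static void ctr_crypt(const unsigned char *in, unsigned char *out,
                      size_t nblocks, const unsigned char key[4],
                      unsigned char nonce) {
    for (size_t b = 0; b < nblocks; b++) {
        unsigned char ctr_block[4] = {nonce, 0, 0, (unsigned char)b};
        unsigned char ks[4];
        toy_encrypt_block(ctr_block, ks, key);    /* keystream block */
        for (int i = 0; i < 4; i++)
            out[4 * b + i] = in[4 * b + i] ^ ks[i]; /* blocks are independent */
    }
}
```

Each iteration of the outer loop depends only on the block index, so the blocks can be processed in any order or fully in parallel.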

2.1.2 Asymmetric-key cryptography

Asymmetric-key cryptography is the other major class of cryptographic mechanisms whose primary goal, in the

context of private communication, is to solve the problem of securely exchanging the private key

among all communicating parties. It achieves this goal by eliminating the need to exchange

a single unique private key and using two distinct keys for encrypting and decrypting messages. It is

able to do so by ensuring that the security of its encryption scheme is preserved even in the presence

of an adversary in possession of the encryption key. As such, instead of sharing a single private key

each communicating party has its own pair of keys, a public key and a private key. The public key is

used to encrypt messages and, as the name suggests, is made publicly available to everyone. The

private key on the other hand is kept private by its owner, and is used to decrypt messages. In this

setting, in order for two entities to communicate privately each need only obtain the other's

public key, which by definition is publicly available, and ensure that its own private key is kept secure.

The sender then encrypts the message with the receiver’s public key and the receiver uses its private

key to decrypt it. In this case, it is the asymmetry of the keys used by the sender and the receiver of

the messages that makes this class of cryptographic mechanisms known as asymmetric-key.

The development of asymmetric-key mechanisms, and more specifically how they solve the key

distribution problem, revolutionised the field of cryptography, allowing its widespread use, especially on

the Internet. Moreover, asymmetric-key mechanisms made it possible to address several of the cryp-

tographic purposes listed above, such as message authentication and non-repudiation. Asymmetric-

key mechanisms are not, however, without caveats, the most prominent being their susceptibility

to man-in-the-middle attacks. One such example is the case in which an adversary is able to provide

a false public key to a communicating party and is thereafter able to decipher all of its messages.

The Diffie-Hellman [27] key exchange protocol and the RSA [28] encryption algorithm are two of

the most notable and widely used asymmetric-key cryptographic mechanisms. Because its compu-

tational requirements are often 2 to 3 orders of magnitude higher than those of symmetric-key

encryption schemes, the RSA encryption algorithm is in practice mostly used to exchange symmetric

keys, which are thereafter used to encrypt messages on the Internet with an algorithm such as AES.

As a final remark, any secure cryptographic mechanism should obey the famous Kerckhoffs’ prin-

ciple [29]. This principle states that an encryption scheme’s security should not rely on the secrecy

of its operation, but rather only on that of its key. Several arguments can be presented in favour of this


principle: it is easier for the involved parties to maintain secrecy of a short, say 128-bit key, than doing

the same for a whole algorithm; in case of disclosure, it is much easier to replace a key than to do the

same for the encryption algorithm; most of the cryptographic community agrees that intense public

scrutiny is the best way to ensure the security and reliability of encryption schemes.

A private-key encryption mechanism is characterised by three algorithms: one for generating pri-

vate keys; another for encrypting messages; and finally one for decrypting them. The basic property

that must hold for these algorithms in order for the mechanism to be correct is that the decryption of a

ciphertext, with the same key used for encrypting it, yields the original message. Moreover, according

to Kerckhoffs' principle, these algorithms are not required to be secret, and their knowledge by an

adversary should not impact the mechanism’s security.

2.2 Graphics Processing Unit

The Graphics Processing Unit (GPU) is one of the most pervasive parallel processors to date, both

in commercial computer systems and in supercomputing platforms. It was first introduced by IBM

back in 1984, but the term GPU was only coined by NVIDIA in 1999. Since then

it has been constantly evolving. The GPU is a piece of hardware specifically designed to enable fast

graphics rendering, especially important in computer games and image processing applications. The

market's constant demand for real-time, high-definition and ever more realistic 3D graphics, mainly

due to the video game industry, has transformed it into a highly parallel, multi-threaded, many-core

processor with great computational power and memory bandwidth. In fact, today's GPU arithmetic

throughput and memory bandwidth far surpass and have evolved more quickly than those of the CPU

in the last decade, as Figure 2.9 shows. These characteristics make the GPU the ideal processor for

tackling a large amount of data parallel problems in all sorts of scientific application fields.

Many attempts to harness the computational power made available by the GPU outside the field

of graphics processing applications have been undertaken. As early as 2003, several data-parallel

algorithms were ported to the GPU using graphics-oriented APIs and shading languages such as DirectX,

OpenGL [31] and Cg [32]. These have spanned application domains as diverse as protein folding,

stock options pricing and SQL queries. Many of these attempts have been successful at achieving

remarkable performance speedups when run on the GPU. These early efforts at using the GPU

and its graphical APIs for general purpose computing have been designated General Purpose GPU

(GPGPU) [33] programs.

Despite the noticeable performance benefits, the mapping of general purpose algorithms into

graphical APIs is not an easy endeavour: deep knowledge of the APIs and the GPU architecture is

required; problems must be expressed in terms of vertex coordinates, texture and shader programs;

and common basic programming features, such as random reads and writes to memory, are not sup-

ported. The increased complexity and lack of appropriate tools often rendered the GPU approach

infeasible.

NVIDIA promptly recognised the potential of the GPU as a general purpose processor and intro-


duced two key technologies that greatly boosted its adoption: the G80 unified graphics and compute

architecture and CUDA. CUDA stands for Compute Unified Device Architecture, and is a software and

hardware architecture for programming the GPU in a variety of high level programming languages.

With the introduction of these new technologies the GPU no longer needed to be programmed as

a dedicated graphics processing unit, using specialised graphics APIs. General purpose problems

could now be solved on the GPU, using the C programming language and CUDA extensions, and

leverage a massively parallel processor.

Figure 2.9: GPU vs CPU performance evolution [30]

2.2.1 Evolution of the GPU architecture

The GPU architecture was originally based on the concept of the graphics pipeline. The graphics

pipeline is an abstraction of the stages through which graphics data is sent and its implementation

was rather consistent across all major GPU manufacturers at the time, which helped the adoption of

the GPU technology. The graphics pipeline can be thought of as a set of geometry and rendering

operations that take 3D space coordinates as input from the programmer and transform them into

2D pixels on the computer screen. Early GPUs only implemented the rendering stages, relying on the

CPU to perform geometry transformations, but more stages were added to the GPU as its technology

evolved, freeing the CPU for other tasks. The following are some of the most relevant milestones in the history

of the GPU [34]:

• The IBM Professional Graphics Controller (PGA) [35], released in 1984, was one of the first

video cards and processed all video related tasks using an on-board Intel 8088 microprocessor.


Its price, around $5500, and lack of compatibility, prevented it from achieving mass-market

success. However, its approach of using a dedicated auxiliary processor marked an important

step in the evolution of the GPU;

• NVIDIA’s GeForce256 and ATI’s Radeon 7500, both released in 1999, were the first GPUs,

available at the consumer level, to implement the entire graphics pipeline. All stages of the

pipeline were performed on the GPU by fixed-function units;

• The programmable pipeline was the next step in the evolution of the GPU. It addressed the

lack of flexibility of the fixed-function pipeline and was introduced in 2001 with NVIDIA's GeForce

3. NVIDIA’s GeForce FX and ATI Radeon 9700 followed in 2002 and were the first fully pro-

grammable graphics cards on the market;

• The unification of the programmable portions of the pipeline was the last major step in the

evolution of the GPU and was introduced by NVIDIA’s G80 architecture, with the release of

the GeForce 8800 in 2006. The G80 architecture was the first to feature a fully programmable

unified processor, called Streaming Multiprocessor (SM) that handled all kinds of computation,

be it vertex, pixel or geometry computation. The unification of the GPU hardware turned the

graphics pipeline into a purely software abstraction;

• Since the introduction of the original G80 architecture, several modifications have been made to

it by NVIDIA with the release of new generations of its GPU architecture, namely Fermi in 2010,

Kepler in 2012, Maxwell in 2015 and Pascal in 2016. A new architecture, codenamed Volta has

already been announced and is set to be released in 2018.

As many-core GPUs' computing power continues to grow according to Moore's law [36], the biggest

challenge posed by this type of parallel systems is to build application software that is able to trans-

parently scale its parallelism to leverage the increasing number of processor cores. CUDA is NVIDIA’s

answer to this challenge: a general-purpose parallel computing architecture. CUDA has its

own parallel programming model and instruction set architecture and leverages NVIDIA GPUs’ par-

allel compute engine to solve many complex computational problems that may not fit well under the

CPU architecture. In general, its simple programming model is designed to allow application paral-

lelism to scale with the number of processing cores available across different NVIDIA

GPU architectures.

Evolution of NVIDIA’s GPU architectures

The G80 architecture was introduced by NVIDIA in November 2006, along with CUDA. Together,

these two key technologies enabled the use of the GPU as a general purpose massively parallel

processor. G80 introduced several key innovations, which made GPU computing possible:

• support for C-like programming constructs to ease the development of GPU code;


• unification of the vertex and pixel pipelines into a single processor capable of running vertex,

geometry, pixel and computing programs;

• use of a scalar thread processor, eliminating the need for manually managing vector registers;

• introduction of the Single-Instruction Multiple-Thread (SIMT) execution model in which a single

instruction is concurrently executed by multiple independent threads;

• support of shared memory and barrier synchronization for inter-thread communication.

Figure 2.10: G80 architecture

Figure 2.10 shows a diagram of the G80 architecture. It is based on a Scalable Processor Array

(SPA) composed of 128 Streaming Processors (SPs) cores, organised as 16 Streaming Multiproces-

sors (SMs) composed of 8 SP cores each. Work flows from top to bottom, originating at the PCI-

Express host interface, and is handled by the SPA. An external Dynamic Random Access Memory

(DRAM) makes up the scalable memory system. Compute thread arrays are dispatched to the SPA

by the compute work distribution block which handles multiple logical streams simultaneously. The

remaining components are dedicated to implement the graphics pipeline and perform fixed-function

operations on graphics data prior to it being fed to the SPA.

GT200, introduced in 2008, was a major revision to G80 and featured a set of improvements,

namely:

• increased number of streaming processors, thereafter called CUDA cores, from 128 to 240;

• doubled size of each processor's register file, allowing the execution of a greater number of

threads at any given time;


• addition of hardware memory access coalescing enabling more efficient memory access;

• support for double precision floating point arithmetic, required by scientific and high-performance

computing applications.

Figure 2.11: Fermi Architecture [38]

In 2010, a new CUDA architecture generation, codenamed Fermi [38], was released based on

the groundwork of G80 and GT200. It employed a completely new approach to the GPU architecture

design and introduced improvements focused on the following key areas:

• improving the performance of double precision arithmetic and bringing it up to par with single

precision floating point performance;

• adding Error-Correcting Code (ECC) support in order to protect data-sensitive applications like

medical imaging and financial options pricing from memory errors;

• introducing an improved memory cache hierarchy;

• increasing the amount of shared memory to more than 16 KB per SM and providing faster

context switching.

These requirements were met by increasing the GPU's raw computing power through architectural

innovations which dramatically improved the GPU programmability and compute efficiency. Figure

2.11 presents an overview of the Fermi architecture. It features 512 CUDA cores organised in 16 SMs

of 32 cores each; has a 384-bit memory interface, divided into six 64-bit memory partitions supporting a

total of 6 GB of GDDR5 DRAM memory; connects with the CPU via a PCI-Express host interface; and

uses the GigaThread global scheduler to distribute thread blocks among the SM thread schedulers.

Kepler was released in 2012 and offered even higher processing power than its predecessors by

providing new methods to optimize and increase parallel workload execution on the GPU. It provided


over 1 TFLOP of double precision throughput, while also representing a huge leap in power efficiency,

delivering up to 3x the performance per watt of Fermi. Kepler enabled increased GPU utilisation and

simplified parallel program design. The key characteristics of the Kepler [39] architecture making

these features possible were a new SMX processor architecture, an enhanced memory subsystem

and hardware support throughout the design to enable new programming model capabilities. Be-

sides its new and improved computing capabilities, Kepler's improved power efficiency made GPU

computing in devices such as laptops a more affordable option.

Figure 2.12: Full Maxwell chip-block [40]

Maxwell was released in 2015 and was mainly driven by the leap in PC displays technology at

the time. After years of little or no progress, the demand for higher screen resolutions increased

dramatically with the introduction of 4K and Virtual Reality technologies, both at an affordable cost.

This resulted in an even greater need for powerful and power-efficient GPUs. Maxwell aimed exactly

at delivering the performance required by this new type of technologies while still improving on the

already efficient power consumption of the previous architecture generation by a factor of 2.


Table 2.1: Maxwell’s key properties [40]

Figure 2.12 presents the diagram of Maxwell's [40] full-chip block. It consists of four Graphics

Processing Clusters (GPCs), 16 Maxwell SMs (SMMs) and four memory controllers. Each GPC has

four SMMs and a dedicated raster engine. Each SMM is composed of 128 CUDA cores, a PolyMorph

Engine and eight texture units, making a total of 2048 CUDA cores and 128 texture units. Table 2.1

summarises some of Maxwell's key properties, most notably the high number of CUDA cores (2048)

and a memory bandwidth exceeding 200 GB/s.

2.2.2 CUDA Programming Model

The CUDA programming model's main design goal is to enable the development of GPU programs

that transparently scale their parallelism according to the number of processing cores available, while

maintaining a low learning curve for programmers already familiar with standard programming lan-

guages, such as C. It is based on three core concepts, exposed as a minimal set of language ex-

tensions: a hierarchy of thread groups, shared memories and barrier synchronisation. These provide

both fine- and coarse-grained data parallelism at thread and task level, respectively, allowing the pro-

grammer to partition the problem into coarse sub-problems, solved in parallel by independent groups

of threads, called thread blocks. Each coarse sub-problem can then in turn be further refined into

finer sub-problems, solved cooperatively by all threads within the same block.


Figure 2.13: CUDA automatic load balancing

In this model, threads are able to work together in groups, solving problems cooperatively by shar-

ing data in low-latency shared memory and by synchronizing among themselves within the same Stream-

ing Multiprocessor. At the same time, groups of threads can leverage the high number of SMs to

scale program execution by running independently and without reliance on execution order. The lack

of dependence between independent groups of threads allows them to be scheduled by the CUDA

runtime on any of the available SMs without impacting program correctness. This model relieves the

programmer from knowing the physical characteristics of the GPU, which are handled automatically

by the CUDA runtime in order to optimize workload scheduling. This mechanism achieves automatic

load balancing, as depicted by Figure 2.13, by mapping work to available GPU resources. In this

example, the same amount of work, 8 thread blocks, is assigned to 2 different GPUs: one with 2 SMs

and the other with 4 SMs. On the GPU with 2 SMs, each SM is expected to handle 4 blocks of threads,

while the one with 4 SMs is expected to only handle 2 blocks of threads per SM. The assignment of

thread blocks to the available SMs is done automatically by the CUDA runtime and enables CUDA

programs to run in a multitude of graphics cards, ranging from high performance cards to those with

less resources, without modification of the source code or additional configuration.


Figure 2.14: CUDA two-dimensional block grid

The work performed by each CUDA thread is represented by a kernel in the CUDA C language

extension which corresponds to a typical C function, annotated with specific CUDA directives. Kernels

are concurrently executed by N different CUDA threads, each assigned a unique thread ID, used

to orchestrate work among threads in the same group. Each group of threads, composed of at most

1024 threads, is able to share memory and synchronise its work through low-latency shared memory

and synchronisation primitives. Multiple groups of threads can be executed by different SMs resulting

in a much higher total number of threads executing simultaneously at any given time in the GPU.

Thread blocks are themselves organized hierarchically into one-, two- or three-dimensional grids. Figure 2.14 shows a two-dimensional grid of thread blocks as an example.

Finally, Figure 2.15 depicts the CUDA memory hierarchy which is composed of multiple memory

spaces, with a distinct set of characteristics. Each individual thread has access to a local private

memory which can be exclusively accessed by it. Groups of threads are able to communicate through

two distinct types of memory: low-latency shared memory, accessible to all threads in the same block; and global memory, a larger memory accessible by all threads running on the GPU. Global memory is persistent across kernel launches, while private and shared memory spaces have the same lifetime as their associated thread and thread block, respectively.


Figure 2.15: CUDA memory hierarchy

2.3 User Space File Systems

The development of Unix in-kernel file systems is not a trivial task, for several reasons. It requires a

great understanding of complicated kernel code and data structures, making new code prone to bugs

which are usually hard to trace and fix for the inexperienced kernel developer. Moreover, writing

kernel code is much different from writing application code, and the learning curve for the typical

programmer can be a very steep one. This is due to the absence, in kernel code development,

of commonly available features when developing user space applications. When developing kernel

code there are no memory protection mechanisms in place, careful use of synchronization primitives

is required and the use of a somewhat limited version of the C programming language (without the

standard C library support) is mandatory. Moreover, debugging kernel code can be a very tedious

task, typically requiring repeatedly recompiling and rebooting the kernel and the machine.

On top of the hindrances faced during development, a fully functional in-kernel file system still

has several drawbacks. Even though the use of standard file system interfaces, such as the Virtual

File System (VFS) layer, can make the task of porting an in-kernel file system to a different Unix-like

operating system easier, it can still require significant design and implementation changes. Also, in-

kernel file systems must be mounted with superuser privileges. This can severely limit their usability


in centrally administered machines and their feasibility.

Most of the problems described above, faced in kernel development, are mitigated, or completely

eliminated, when developing file systems in user space. Thus research in the area of storage and

file systems has focused on the enhancement of basic file systems in user space, as opposed to

designing low-level file systems directly. When developing a feature in user space, the programmer

can leverage a wide range of familiar programming languages, third-party tools and libraries, other-

wise unavailable in the kernel. This allows the programmer to focus on the essential features that the

system must deliver without worrying about the intricacies posed by the kernel.

Regardless of its advantages, the development of file systems in user space also brings to practice

its own set of challenges. The more a file system implementation is coupled to a particular operating

system, the more effort will be required to port it to a different Unix flavour, and even more to a different

operating system. Furthermore, typically user space code cannot run as efficiently as a counterpart

kernel version by virtue of inherent user space overheads such as processor context switches. Yet

the ever increasing processor, memory and bus speeds have made performance less of a factor in

current commodity machines.

FUSE [13] is a widely used framework for developing file systems in user space. It is available for

Unix-like operating systems and has been merged into the Linux kernel since version 2.6.14. Several

ports for other mainstream operating systems are also available, such as Mac OS’s OSXFUSE.

FUSE

FUSE, which stands for Filesystem in Userspace, is a widely used framework that allows the

development of fully functional file systems in user space. FUSE’s purpose is to solve the previously

mentioned issues faced when developing in-kernel file systems. It does so by exposing a simple API

to the developer, made of a set of familiar file system operations, which need only be implemented in user space. File systems can be implemented using any of the programming language bindings available for FUSE (e.g. Java, Python, Ruby) and can leverage the large number of libraries available for them. The simplicity of both its development and its use, since it can be mounted by regular non-privileged users, makes file system development accessible to a much larger number of programmers, greatly broadening its range of applications.

The FUSE framework is relatively simple and it is composed of two main components: a loadable

kernel module and a user space library, libfuse. The fusermount utility is also part of the framework

and enables non-privileged users to mount and unmount FUSE file systems. The remainder of this

section describes how a typical FUSE file system operation works in order to present the framework’s

internals.

Figure 2.16 depicts the steps involved in a FUSE operation. In this example, the userfs executable

implements the set of FUSE API callbacks and the file system is mounted in the /fuse directory using

fusermount. Once mounted, a fusefs file system is registered with VFS and all file system calls target-

ing the /fuse mount point are forwarded to the FUSE kernel module. The kernel module then invokes

the appropriate callback in the userfs executable, which handles the operation and returns the result


Figure 2.16: FUSE architecture [41]

to the kernel. The communication between the FUSE kernel module and the user level executable

is achieved by using the /dev/fuse character device to forward system calls and their arguments to the

libfuse library. Finally, the result is propagated back through the same device to the kernel and to the

application from which the initial request originated.

All FUSE file system I/O operations follow an identical mechanism. The read operation is pre-

sented next. When a call to read a file hits fusefs at the VFS layer, it is forwarded to the FUSE kernel

module. If the requested data is found on the page cache, it is immediately returned. Otherwise the

read callback registered in the user provided executable is invoked. The actual implementation of the

file system callback is completely up to the developer and the operations it performs may be of any

kind. In the case of the read operation, data may be obtained from a remote server, or some type

of pre-processing can be made on the requested data, leveraging other file systems for storage. The

result is then forwarded back to the kernel and to the originating application.

FUSE provides two different APIs to the file system developer, a high-level and a low-level API.

The low-level API forces the programmer to directly deal with and manage inodes, perform pathname translations and manually fill the data structures used to reply to the kernel. This API is especially useful

for file systems developed from scratch. The high-level API allows the programmer to deal only with

pathnames, rather than inodes, mapping more directly to system calls. All the low level operations,

like inode to pathname translations and the filling and sending of the reply data structures to the

kernel, are automatically handled by libfuse. Most FUSE-based file systems use the high-level API,

as it is sufficient for most purposes and far more accessible and simple.

A wide variety of FUSE based file systems are available including traditional disk-based file sys-


tems, but the possibilities are limitless and many nontraditional uses have been proposed by several projects, as discussed in the following state-of-the-art section.

2.4 State of the art

Using the GPU as a cryptographic accelerator is not a new endeavour. The first efforts to efficiently

implement the AES encryption algorithm in the GPU date back to the time when the GPU was still a

dedicated graphics accelerator, which could only be programmed using graphical APIs and was very

limited by the graphics pipeline. This was neither an easy task nor did it provide great performance

benefits. [49] demonstrated the feasibility of implementing the AES encryption algorithm on the GPU

for the first time. However, the implementation of AES in OpenGL also showed that the APIs available

for programming the GPU at the time did not suit this kind of problem properly. Despite the

poor results achieved for offloading the computation of AES to the GPU for the general use-case,

the approach nonetheless proved worthwhile for the encryption and decryption of graphical displays and

images in order to avoid temporarily storing them on the system main memory in plain text. These

results were some of the first proving the merit of using the GPU for offloading computation typically

done on the CPU for very specific requirements.

The advent of General Purpose GPU (GPGPU) [33] made it possible for a wider group of re-

searchers to exploit the Graphics Processing Unit as a cryptographic accelerator, by eliminating the

need for a deep knowledge of the GPU graphical computing pipeline details and specific graphical

programming languages. Early CUDA based implementations of AES presented evidence that the

support for general purpose programming on the GPU could be leveraged for using the GPU as a

cryptographic co-processor. The first attempts at porting the AES encryption algorithm to the GPU

using CUDA underperformed when compared with optimised implementations on standard CPUs at

the time [50]. However, they successfully demonstrated that relocating the computational load incurred by cryptographic primitives from the CPU to the GPU increased overall system throughput, by making the CPU available for other tasks.

Eventually, GPU-accelerated implementations of AES presenting better performance than those

using only the CPU were proposed. [51] reported an AES CUDA implementation delivering a speedup of about 20 times when compared with a modest CPU at the time, on par with solutions using dedicated hardware. These results were the first to portray the GPU as an efficient cryptographic accelerator, and speedups as high as 50, achieving bandwidths of 60 GBPS, have been reported more recently on contemporary state-of-the-art GPUs [52].

The GPU has also been used in the context of computer file systems for diverse applications.

[53] proposed GPUfs, a POSIX-like API for GPU programs exploiting the concurrent capabilities of

the GPU for efficiency and extending the host CPU buffer cache into the GPU memory in order to

optimize GPU file access. The benefits of this type of mechanism were demonstrated in the form of

GPU programs for applications such as searching for a set of strings on the Linux kernel tree running


over seven times faster than an equivalent eight-core CPU implementation.

Harnessing the GPU computational power to improve the performance, reliability and security of

distributed storage systems by offloading and accelerating computationally intensive hashing prim-

itives was proposed by [54]. The approach was successful at bringing tangible performance gains

without negatively impacting the performance of concurrently running applications for preserving data

integrity and performing online similarity detection in content-addressable storage systems. [55] further enabled leveraging the GPU for this specific use case by designing StoreGPU, a library that accelerates on the GPU a number of hashing primitives popular in distributed storage system implementations.

Deeper integration of the GPU into the operating system has also been explored in order to provide

a further generalization of its execution model. [56] presented Gdev, a GPU resource management ecosystem that treats the GPU as a first-class computing resource: it allows the GPU memory to be used for swapping, virtualises the GPU into multiple logical GPUs to enhance isolation among the working sets of multi-tasking systems, and lets GPU programs leverage typical operating system mechanisms, such as scheduling and memory management [57].

The use of GPU for accelerating file encryption mechanisms has also been proposed before, both

for replacing the typical CPU implementation of encryption algorithms found in widely used encryption

suites, such as TrueCrypt [58], and for accelerating the cryptographic primitives used by secure file

systems. Some of these efforts are based on extending the operating system mechanisms, in order

to better support the integration of the GPU as a cryptographic accelerator [57].

The possibility of developing file systems in user space greatly reduced their implementation com-

plexity and made it possible to explore the integration of novel functionality into their operation by

leveraging established user space libraries providing all kinds of functionality. Data processing fea-

tures can be implemented as additional layers on top of pre-existing, stable and widely used standard

file systems by leveraging them for storage. This type of approach led to the development of several

types of user level file systems: LZOlayer_fs, FuseCompress [46], fuse.gunzip [44] and fuse-zip [45],

abstract data compression from the user and are able to save disk space by storing their files in a

compressed format; archived files can be directly accessed using rarfs [43]; sshfs [47] and DAVfs [48]

allow access to remote files through the network; and many more uses have been proposed [42]. The

integration of GPU-accelerated data processing features also becomes possible with user space file

systems.

Scientific studies are predominantly based on user level constructs to achieve encryption of file

data, typically through the use and extension of user level file systems. [59] proposes a very similar

approach to the one presented in this dissertation. It also proposes offloading the execution of the

AES encryption algorithm to the GPU, in the context of a user space file system enhanced with crypto-

graphic mechanisms, ensuring privacy of its users’ data. However, it mentions several shortcomings,

which the design of the solution proposed in this dissertation aims at avoiding from the ground up.

The solution is based on EncFS, a pre-existing FUSE-based file system implementing encryption extensions using OpenSSL. The decision to use a pre-existing file system, however, proved to be one

of its main limitations. Due to the relative complexity of the file system source code, the opportunity


for adapting its operation to better fit the GPU computing paradigm was very limited.

Our approach differs substantially in this respect. A complete Virtual File System was developed from scratch, making it possible to adapt its operation and to maximise the potential benefits

of integrating the GPU as a co-processor, as is described in the following sections. This allowed for a

better use of the GPU computing capabilities, resulting in a higher I/O throughput. While [59] reports a 30% reduction in AES execution time, our solution reduces it by up to about 70%.


3 Proposed Solution

Contents
3.1 Automatic implicit encryption
3.2 Encryption mechanism


The implementation of cryptographic mechanisms relies on operations that can place a noticeable

burden on a system’s CPU and can thus impact a system’s overall performance. Even with multi-core

CPUs and the introduction of specialised dedicated instructions at the hardware level, the impact of

cryptographic primitives on a system’s CPU load can still be quite noticeable. This is especially the

case when a system's CPU load is already very high and there are few computational resources left

available, or even none at all. In these situations, security mechanisms, such as data confidentiality,

tend to be overlooked and considered expendable in favour of a system’s functional requirements, due

to their prohibitive computational impact. This can ultimately lead to cryptographic security mecha-

nisms being disabled in a system with limited computational resources, in order to improve its overall

performance.

This dissertation presents a pragmatic solution to minimise the impact that the cryptographic mechanisms required to protect the file system have on a system's CPU, and to speed up their execution using a GPU. The solution aims at avoiding the degraded system performance caused by the penalty of additional CPU load, by offloading the operations required by the encryption algorithm from the highly contended CPU to the seldom used and computationally powerful GPU.

This relocation reduces the impact of the cryptographic primitives on the CPU to close to zero. At

the same time, it makes the most of the usually underused GPU capabilities. Additionally, it achieves

this as transparently as possible to the end user so as to maximise its adoption and minimise any

obstacles to its use.

The task of building a security system to provide confidentiality of user data can be broken down

into two major independent subtasks: determining the point at which to transform data into and from

an encrypted format, i.e. when to cipher and decipher the data; and the actual process of ciphering

and deciphering it. The proposed system is built on top of a fully functional physical file system by

incorporating the automatic and transparent encryption and decryption of the accessed files using the

GPU as a cryptographic engine. Rather than relying on low-level Operating System (OS) calls, this

solution uses FUSE, a user level framework allowing the creation of a Virtual File System (VFS) in

user space, enriched with CUDA support, allowing the integration of specific GPU cryptographic kernels,

also at the user level. FUSE was chosen for this task given its flexibility and availability in mainstream

Linux kernels since version 2.6.14.

This approach makes it possible to focus on the system's fundamental goal of adding cryptographic security as an additional layer of data processing, and to leverage the functionality already provided by stable and

widely adopted general-purpose file systems. Moreover, the ability for this solution to be deployed

by non-privileged users is another advantage, along with the resulting transparent usage, making it

easier to adopt.

3.1 Automatic implicit encryption

Two approaches can be taken when addressing the first subtask of determining the point at which

to introduce the encryption mechanism. One is to make the process explicit to the user and the other


is to embed it into the normal operation of the system, making it implicit.

The explicit approach typically relies on the existence of a tool, either built into the operating

system or in the form of an external standalone application, used by the user for ciphering and de-

ciphering data. This has the advantage of giving the user total control over the encryption process,

allowing it to only be triggered when there is a real necessity and thus saving potentially valuable sys-

tem resources. It however has the disadvantage of requiring explicit action from the user, which can

make it cumbersome after some time. Moreover, since it is neither automatic nor enforced by the sys-

tem, it can be overlooked or regarded as an unnecessary additional step by less security concerned

users, sometimes rendering it ineffective.

An implicit approach usually leverages mechanisms built into the operating system and requires

no additional action from the user. In this case the user does not even need to be aware of the mechanism's existence, since it is automatically enforced by the system. In contrast to the explicit

approach, this has the downside of giving the user less control over the mechanism’s operation.

Systems using this approach typically only give the option of completely enabling or disabling the

mechanism, not giving the user a finer control over its operation. This may potentially result in the

consumption of more computational resources than would be actually required. However, since it is

automatically and constantly enforced, this approach ensures that the security mechanism is always

in place and data privacy is always guaranteed.

The solution presented in this section proposes the use of an implicit approach. This is the one

that best suits the system’s usability goal, presented in Section 1.2. For this purpose, a FUSE-

based Virtual File System with pre- and post-data-processing extensions has been implemented.

These extensions present the ideal point at which to add an automatic extra layer of security, implementing the required encryption and decryption mechanisms.

3.2 Encryption mechanism

The second task, concerning the actual encryption process, consists of picking, among the several

encryption algorithms available in the computer security community, the one that best suits the sys-

tem’s requirements. Given the dissertation’s goals of minimising the mechanism’s load on the CPU by

offloading the execution of the encryption algorithm to the GPU, the chosen algorithm should be easily parallelizable, so as to make the most of the Graphics Processing Unit capabilities. Also, given the

relatively high complexity of this type of mechanism, an already available open source implementation

of the algorithm greatly favours its adoption.

The solution presented in this dissertation uses the AES algorithm, since it is a widely accepted

de facto standard, for which several implementations are available. In regard to the encryption it-

self, 128 bit keys and the ECB (Electronic Codebook) cipher mode were selected. Besides the data

transformation functions, the encryption and decryption process requires an additional component for

setting up and initializing all the AES parameters, particularly the secret key. It is important to note that

only by providing the same secret key between calls will it be possible to properly access previously


Figure 3.1: Virtual File System: read and write operations

created files.

The proposed system, codenamed gipherfs, consists of three processing steps, as illustrated

in Figure 3.1 for the Virtual File System’s read and write operations. In the case of a write, the

I/O operation, mapped to the virtual file system, is intercepted by FUSE in the first step. After this

interception step, the implementation transfers all the necessary data received from the call to the

GPU and calls the GPU encryption kernel. To conclude, the GPU encrypted data is transferred back

to the host (system main memory) and effectively written into the underlying physical file system by

performing the write system call to a physically mapped address. The corresponding physical address

is defined by the virtual to physical file system mapping. In the case of a read operation, a similar

three-step procedure is followed. The main difference lies in the second step, where the encrypted

data is retrieved from the physical file system and the GPU kernel is called for decrypting it before

passing it to the invoking application.

Although the interception by FUSE may impose some additional delays, the major limiting factor

for the overall system performance lies in the data transformation step, consisting of encrypting or

decrypting the data. These cryptographic operations can be performed either by the CPU or by a

cryptographic co-processor, in this case the GPU. The approach considered to speedup the cryp-

tographic operations relies on the GPU as a cryptographic co-processor, where the encryption and

decryption procedures are offloaded, after the data is intercepted by FUSE. As in any typical CUDA-

based GPU program, the implementation of AES on the GPU requires the three following steps:

1. Transfer data from the host (CPU) to the GPU

2. Process the data on the GPU

3. Transfer resulting data back from the GPU to the host


The data transfer steps are an obvious source of execution overheads, nevertheless required in

order to take advantage of the GPU computational power. The trade-off between this additional cost

and the performance benefits obtained from implementing AES on the GPU is one of the main aspects

of a successful GPU implementation. These tradeoffs are evaluated in Chapter 5 in order to deter-

mine the optimal amount of data processed in each GPU kernel invocation and the possible memory

management optimisations. The GPU AES code used in the proposed system, which resembles the

implementation described in [52], is open-source and was obtained from [61].

To properly evaluate the performance gains that can be achieved with the proposed GPU-based

approach, a CPU-based version of the system was also implemented, where the CPU is used for

the cryptographic processing. In this version, designated as cipherfs, the GPU is not used for the

cryptographic operations. cipherfs is identical to the gipherfs GPU-based version of the system but

implemented using only the CPU. In order to optimize the use of the CPU processing resources,

the CPU-based implementation of the AES cryptographic primitives was extracted from the wolfSSL

[62] cryptographic library implemented in the C programming language. This implementation takes

advantage of AES-NI (AES New Instructions), the x86 instruction set architecture extension for AES

acceleration [7], where the data blocks are sequentially processed using dedicated AES hardware

instructions.

3.2.1 Block based implementation of read and write operations

The block nature of AES requires the implementation of the read and write operations to be

adapted in order to enable its proper application. A block cipher must always process data in fixed

size block units, more specifically of 16 bytes (128 bits) each in the case of AES. Because of this,

instead of a more straightforward approach, the Virtual File System’s read and write operations also

read and write data from files in block units whose length is always a multiple of 16 bytes. Figures 3.2

and 3.3 show a comparison between a straightforward and a block based approach, respectively, for

reading data from a file. Both represent a request to read 4 bytes from a 16 byte file, starting at byte

3.

Figure 3.2: Straightforward read operation

In a straightforward approach, the file system reads data from the file exactly as requested. In

this example, it would read bytes 3 through 6 and return all of them to the requesting application.

This is depicted in Figure 3.2 in which bytes 3, 4, 5 and 6 are read from the file, and returned to the


application. Using this approach, only the data requested by the application is actually read by the file

system from the file.

A block based approach only reads data from the file in sizes multiple of a predefined block size.

In the example of Figure 3.3 the file is divided in four 4 byte blocks. When the same request to read

4 bytes, starting at byte 3, is made to a system following this approach, it must determine which

blocks contain the required data and read them entirely. In this case blocks 0 and 1 must be read,

which means all bytes from 0 through 7 are read by the file system from the file. However, since the

application only requested bytes 3 through 6, only those will actually be returned to the application.

Figure 3.3: Block based read operation

This may appear as a waste of additional logic, processing and I/O load for achieving the same

end results, but it is a requirement for the proper application of the AES encryption algorithm on

the file system data. Using again the example of Figure 3.2, if the file data were ciphered, it would not be possible to decipher the requested data by reading only bytes 3 through 6. In this setting, only by entirely reading blocks 0 and 1, which contain the requested data, as depicted in Figure 3.3, is it possible to decipher them, obtain the requested data, and return it to the

application.

3.2.2 Padding: PKCS#7

Another implication of the block nature of the AES encryption algorithm is that it can only handle full 16-byte data blocks and must first extend its input when it is composed of partial blocks. Padding is the

most commonly used technique to enforce this requirement and consists of extending a piece of data

by adding the necessary data units to it, until it has the required length. Assuming l and k are the

input data length and block size, in bytes, k − (l mod k) is the number of trailing bytes which must be

added to the input in order to make it a multiple of the block size.

The padding method implemented in the system presented in this dissertation is the one proposed

in PKCS#7. In PKCS#7 the value of each byte added to the original input data matches the number

of bytes added. Using again the l and k values defined above, where k = 16 in AES, PKCS#7 adds n

trailing bytes to the input data, where the value of each byte is n and n = 16 − (l mod 16). However,

for the mechanism to be reversible, it must always be possible to determine how many bytes were

padded. In order to achieve this, when the input data length is already a multiple of the block size, a


whole new full block is added with the value of all bytes equal to the block size. Without this exception,

it would not be possible to reverse the process. Consider a padded data chunk whose last byte

is 0x01: one cannot tell whether the original data length was already a multiple of

the block size and 0x01 was its original last byte, or whether the chunk was padded with a single

0x01-valued byte. With the aforementioned special case, the padded block’s last byte always matches the

number of pad bytes added, thus making it trivial to remove them and reverse the

process.
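The full mechanism, including the extra-block special case, fits in a few lines of C. This is a minimal sketch of the scheme described above (a robust unpad would additionally validate that every pad byte matches the count before trusting it):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK 16  /* AES block size in bytes */

/* PKCS#7: append n = BLK - (len % BLK) bytes, each with value n. When len
 * is already a multiple of BLK, n equals BLK, so a whole extra block with
 * byte value 16 (0x10) is appended; the pad length can therefore always be
 * recovered from the last byte. `out` must have room for len + BLK bytes.
 * Returns the padded length. */
static size_t pkcs7_pad(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = BLK - (len % BLK);   /* always in 1..BLK, never 0 */
    memcpy(out, in, len);
    memset(out + len, (int)n, n);
    return len + n;
}

/* Reverse operation: read the last byte and drop that many trailing bytes. */
static size_t pkcs7_unpad(const unsigned char *buf, size_t len)
{
    return len - buf[len - 1];
}
```

Padding a 13-byte chunk yields a 16-byte block ending in three 0x03 bytes, matching Figure 3.4; padding a 16-byte block appends a full block of value 16.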

Figures 3.4 and 3.5 depict the operation of the mechanism implemented in the system, for the

AES 16 byte block size. Figure 3.4 shows that a 13-byte long data chunk is padded by appending

16 − (13 mod 16) = 3 bytes with the value 0x03 to the original input. The reverse operation simply

consists of reading the padded block’s last byte, and removing as many trailing bytes from it as this

value, in this case 3. Figure 3.5 presents the case when the input block length is already 16. When

this happens, a complete new 16 byte block is appended after the original one, with every byte equal

to 0x10 (16 in decimal). Again, removing as many bytes as the value of the padded block’s last byte yields the original

block.

Figure 3.4: Padding of a partial block

Besides the already mentioned adaptation required by the read and write operations, padding of

file data also requires modification of all the operations reporting file size. This information is stored

as metadata about each file on the underlying physical file system on top of which the Virtual File

System described in this dissertation is built, and will therefore include padding, which is an artificial

artefact required by AES. In order to report the actual file size, the padding bytes must be subtracted

from the size reported by the underlying physical file system. This means that it is no longer

possible to retrieve the size of a file without first reading and deciphering its last block. Using again

the padded file in the example of Figure 3.5, the size reported by the underlying physical file system

would be 32 bytes. By reading the file’s last byte, one determines that the last 16 bytes were added

as padding and that the actual file size is 32 − 16 = 16 bytes. This logic is implemented in the getattr

FUSE I/O operation. The operation implementation starts by using the underlying stat file system

operation to obtain the physical file’s actual size, in this case 32 bytes. It then reads and deciphers

the file’s last block in order to obtain the pad value of 16 (0x10) and subtracts it from the obtained file size in

order to report the file’s length correctly.
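The size correction itself is a single subtraction once the last block has been deciphered. A hypothetical helper mirroring the getattr logic above:

```c
#include <assert.h>
#include <stddef.h>

/* Given the size reported by the underlying physical file system (via
 * lstat) and the last byte of the deciphered final block (the PKCS#7 pad
 * value), return the logical file size to report to applications.
 * Hypothetical helper mirroring the getattr correction described above. */
static size_t logical_file_size(size_t physical_size, unsigned char pad_value)
{
    return physical_size - pad_value;
}
```

For the example file, `logical_file_size(32, 16)` reports the actual 16-byte length; for the 13-byte file of Figure 3.4, `logical_file_size(16, 3)` reports 13.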


Figure 3.5: Padding of a complete block


4 Solution Implementation

Contents
4.1 plainfs: a non-ciphered file system . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 cipherfs: a non-accelerated ciphered file system . . . . . . . . . . . . . . . . . . 45
4.3 gipherfs: a GPU-accelerated ciphered file system . . . . . . . . . . . . . . . . . 49


The system developed in the context of this dissertation ensures confidentiality of user file data

through the use of cryptographic mechanisms. More specifically, a GPU version of the AES encryption

algorithm is used to cipher user files, in order to speed up its execution and avoid additional load on

a machine’s CPU. This is achieved through a Virtual File System, implementing automatic ciphering

and deciphering of its files on the GPU. This section starts by presenting the high-level design of a

FUSE-based user level file system with enhanced cryptographic features. The implementation details

are presented in the following subsections, which describe a series of iterative versions

resulting in the final system.

Figure 4.1 provides a high-level view of the file system’s basic operation, hereafter referred to as

gipherfs. The name gipherfs joins the keywords cipher and fs, standing

for ciphered data and file system, respectively. The replacement of the letter c with a g denotes

the use of the GPU as a cryptographic accelerator. As evidenced in Figure 4.1, the cryptographic

primitives are introduced in the write and read operations. Every time a file is written, data is first

automatically ciphered by the virtual file system, before being effectively written to the underlying

physical device. Conversely, data is deciphered upon being read from the physical device and before

being returned to the invoking application on a file read operation.

Figure 4.1: Virtual File System: read and write operations

A level of indirection can be identified in Figure 4.1. Even though I/O operations are targeted at the

/mnt/private_fs directory, they are actually performed in /mnt/private_fs_back. In fact, /mnt/private_fs

is only a logical endpoint used to intercept calls to the FUSE-based file system. The user-level im-

plementation of the Virtual File System I/O operations then relies on the existence of an underlying

physical file system, such as ext4, and on the use of its mechanisms. This is achieved by redirecting

the target of the I/O operations from the Virtual File System mount point to another directory, in this

case /mnt/private_fs_back, which is mounted on a physical file system and uses its I/O API.

By using this approach, one can focus on the system’s fundamental goal which is to add encryption

as an extra layer of data processing and leverage the functionality already provided by stable and


widely adopted general purpose file systems. Hence the task of building such a file system can be

broken down into three independent subtasks:

Subtask 1: intercepting I/O operations targeting the Virtual File System’s mount point;

Subtask 2: applying data transformation (ciphering/deciphering) to the I/O operation payload,

whenever it is required;

Subtask 3: performing I/O operations on the underlying file system.

Besides making it easier to reason about the system’s design, by making it more modular, this

decomposition makes it possible to build the system incrementally. This was the approach taken

during both the development of the system and for describing it in the remainder of this section.

The system was implemented as a set of incremental iterations, each version building on top of the

previous one by adding or replacing each of the aforementioned subtasks.

Subtasks 1 and 3 are a requirement of any functional virtual file system and their implementation is

rather static, regardless of how the system transforms its data. As described in Section 2.3, subtask 1

is provided out of the box by FUSE, by means of its architecture and how it enables the implementation

of user level file systems. However, subtask 3 needs to be implemented in the Virtual File System.

The implementation details of both of these subtasks are the topic of Section 4.1 which describes

the system’s first iteration, codenamed plainfs. plainfs consists of a simple Virtual File System, with

no transformation applied to its data, which simply intercepts calls to a virtual mount point and maps

them to another directory mounted on a fully functional general purpose file system. The codename

plainfs originates in the fact that file data is stored with no transformation applied to it, i.e. in plain text,

as it is usually referred to in the cryptographic literature.

Even though it is not possible to build a system of this kind without both subtasks 1 and 3, subtask

2 is the one that mostly defines the system, by allowing the developer to specify the transformation to

be applied to the file system data. This transformation can be of any type, or even none at all, as in

the case of the plainfs iteration of the system.

Section 4.2 describes cipherfs, the system’s second iteration, in which an implementation of sub-

task 2 is introduced for the first time. In this version, the transformation consists of the encryption and

decryption functions of the AES encryption algorithm implemented in the CPU. As such, the second

iteration of the system already provides confidentiality of user data, even though it still implements it

on the CPU, as is usually the case in most security systems.

Lastly, the final iteration of the system, codenamed gipherfs, is presented in Section 4.3. Here

the CPU AES functions are replaced by their counterparts implemented on the GPU. By using the

GPU, the system is expected to provide the same security features as the previous iteration, without

the CPU load penalty and with improved throughput.


4.1 plainfs: a non-ciphered file system

The system’s first iteration, designated as plainfs, was built with two main purposes in mind. The first

and foremost is to gain a deeper knowledge about the operation of FUSE and understand how it can

be used to implement user-level file systems. By only requiring the implementation of the already

described subtasks 1 and 3, plainfs makes it possible to focus on FUSE’s mechanisms and not on

any specific kind of data transformation. Moreover, the modular approach followed in the system’s

design also makes it possible to use plainfs as a bootstrap, on top of which the remaining versions of

the system can be implemented by adding specific implementations of subtask 2, consisting of a data

transformation step.

Second, in spite of its lack of additional functionality compared to the underlying general purpose

file system it is built on top of, plainfs is actually very useful in the context of this dissertation in other

aspects. It makes it possible to determine the overhead of using the FUSE interface by measuring

and comparing the time it takes to complete the same I/O operations on plainfs and directly on the

underlying physical file system. Also, plainfs can be used for setting a baseline against which to

compare the overhead of encrypting data using both the CPU and GPU based implementations of

the AES encryption algorithm, in the next iterations of the system.

As mentioned earlier, in order to provide the functionality of plainfs one needs to be able to inter-

cept I/O operations targeting the Virtual File System’s mount point and to redirect and perform them

on another directory mounted on a fully operational file system.

4.1.1 Intercepting I/O operations

The interception of I/O operations targeting the Virtual File System is provided by the fuse kernel

module, which includes the fusefs file system. Whenever a FUSE based file system is mounted on

a system directory, e.g. /mnt/test_fuse, the fuse kernel module registers an association between the

mount point and the fusefs file system at the Linux VFS (Virtual File System) layer. This is done by

using the fusermount utility, which communicates with the fuse kernel module through the /dev/fuse

character device.

When this association is established, all I/O operations targeting the /mnt/test_fuse mount point

are handled by the fusefs file system. The main function of fusefs is to call the corresponding user level

operation implementation, by communicating with the user level libfuse library. This again is done

using the /dev/fuse character device. This mechanism is depicted in Figure 2.16 and achieves the

subtask 1 goal of intercepting the I/O operations targeting the virtual file system’s mount point.

4.1.2 Performing I/O operations on underlying file system

Subtask 3 is implemented by using FUSE’s mechanisms for building file systems in user space. A

FUSE file system is implemented by a user level standalone application, which is linked against the

libfuse library. The application must provide the implementation of a set of file system I/O operations,

which are to be invoked by libfuse in response to requests from the fuse kernel module.


FUSE’s architecture resembles the client-server model. In this context, the fuse kernel module

is the client and the user defined application, together with libfuse, are the server. Whenever an I/O

operation targets the FUSE file system, it is redirected by the fuse kernel module to libfuse, which in

turn uses the callbacks defined in the user defined application to perform the operation and returns the

results back to the kernel module. The communication between kernel and user space is completely

handled by libfuse, allowing the file system developer to only focus on implementing the I/O callbacks

in the standalone application.

libfuse provides a set of function prototypes of the I/O operations that must be implemented as

callbacks by the standalone application linked against it. These are a variation of the typical I/O

system calls found on general purpose file system APIs, such as open, close, read and write. The

implementation of these callbacks is what defines the file system operation. Two different APIs are

provided by libfuse: the high- and low-level APIs. plainfs, as well as the subsequent versions of the

system, make use of the high-level API only, as it provides all the mechanisms required to implement

the systems’ functionality.

Figure 4.2 shows how a user level application is linked with libfuse, in order to implement a FUSE

file system. To simplify, the presented example only references the open, read, and write operations.

The figure’s blue box represents libfuse and the set of function prototypes it defines, in this case

composed of only open, read and write. These are implemented as a set of callbacks in the plainfs.c

C program, depicted by the green box. The bridge between libfuse and the standalone application is

the fuse_operations data structure, which pairs the functions specification and their implementation.

For clarity, the fuse_operations structure is depicted as a third external entity in Figure 4.2, but it is

typically also part of the plainfs.c program.

Figure 4.2: User level file system I/O operations implementation
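The pairing of prototypes and callbacks through fuse_operations follows the standard C pattern of a structure of function pointers. The fragment below is a simplified analogy of that wiring; the real structure and callback signatures come from libfuse’s fuse.h and differ in detail (paths, buffers, offsets and fuse_file_info arguments):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified analogy of libfuse's fuse_operations: a structure of function
 * pointers pairing each operation name with its user-level implementation. */
struct vfs_operations {
    int (*open)(const char *path);
    int (*read)(const char *path, char *buf, size_t size);
    int (*write)(const char *path, const char *buf, size_t size);
};

/* plainfs-style callbacks (stubs standing in for the real redirection code) */
static int plainfs_open(const char *path) { (void)path; return 0; }
static int plainfs_read(const char *path, char *buf, size_t size)
{ (void)path; (void)buf; return (int)size; }
static int plainfs_write(const char *path, const char *buf, size_t size)
{ (void)path; (void)buf; return (int)size; }

/* This structure is what libfuse receives (via fuse_main in the real API)
 * and uses to dispatch each intercepted I/O request to the user code. */
static const struct vfs_operations plainfs_ops = {
    .open  = plainfs_open,
    .read  = plainfs_read,
    .write = plainfs_write,
};
```

The designated-initialiser syntax also makes it explicit which operations a file system chooses to implement; unset members stay NULL and libfuse falls back to default behaviour for them.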

The goal of subtask 3 is to redirect the I/O operations to an underlying file system and is im-

plemented by applying the same principle in the implementation of all I/O operations. In addition to


the file system mount point, plainfs is parameterised with the path to a backing directory, hereafter

denominated back_dir. This directory must be located under the logical mount point of the under-

lying physical file system which is to be leveraged and is where I/O operations will be redirected to.

The redirection thus consists of performing all I/O operations targeting the plainfs mount point on the

equivalent location having back_dir as root instead.

Figure 4.3: I/O operations redirection

Figure 4.3 shows how this mechanism works, by using the open operation as an example. Here,

plainfs is mounted in /mnt/plainfs and uses /mnt/plainfs_back as its back_dir. When a request to open

file /mnt/plainfs/a hits plainfs, the user-defined open implementation converts it into a call to the same

open operation but to a path where the plainfs mount point is replaced by back_dir. This results in a

call to open(/mnt/plainfs_back/a), using the underlying file system API. The same principle is used for

implementing the remaining I/O operations.
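Since FUSE hands the high-level callbacks a path relative to the mount point (e.g. "/a" for /mnt/plainfs/a), the redirection is a simple concatenation with back_dir. A sketch using the example values of Figure 4.3 (real code would take back_dir as a parameter rather than a constant):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* back_dir from the example of Figure 4.3; illustrative constant. */
#define BACK_DIR "/mnt/plainfs_back"

/* Rewrite a mount-point-relative path into the equivalent path under
 * back_dir, as done at the start of every redirected I/O operation. */
static void redirect_path(const char *rel_path, char *out, size_t out_len)
{
    snprintf(out, out_len, "%s%s", BACK_DIR, rel_path);
}
```

A request to open /mnt/plainfs/a reaches the callback as "/a" and becomes a call to open on "/mnt/plainfs_back/a" through the underlying file system API.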

While most functions simply require calling their equivalent version on back_dir and filling the FUSE

data structures used to pass results back to the kernel, others are more complex. FUSE operations

are an abstraction of the actual I/O operations found on physical file systems, and their implemen-

tation on top of a physical file system must reflect that file system’s specificities. Table 4.1 lists all

FUSE operations implemented in plainfs, and the corresponding ext4 counterparts supporting their

implementation, since ext4 is the physical file system used in this work.


FUSE operation   ext4 operation    Notes
getattr          lstat             must decipher last file block in order to compensate for padding
getxattr         getxattr
readdir          opendir
rmdir            rmdir
open             open
flush            dup and close     call to dup followed by close
fsync            fdatasync
read             read              must decipher data
write            write             must cipher data
release          close
chmod            chmod
utimes           utimes
chown            lchown
truncate         truncate
mkdir            mkdir
create           creat
unlink           unlink
rename           rename
access           access

Table 4.1: List of FUSE operations and ext4 counterparts

4.2 cipherfs: a non-accelerated ciphered file system

cipherfs introduces file data confidentiality into the system. It does so by leveraging the already

functional plainfs iteration of the virtual file system, described in the previous section, and introducing

an implementation for subtask 2. In short, subtask 2 consists of transforming the file system data

actually stored on the physical medium. In this case, data must be transformed into an illegible form

to an attacker, i.e. ciphered, before being stored, and transformed back into its original form, i.e.

deciphered, upon being retrieved from storage.

Among the three subtasks identified as the ones composing the implementation of this type of

system, subtask 2 can be described as the most fundamental. While the other 2 are essential in order

to implement it correctly, subtask 2 is the one that defines what the system does, by determining how

it transforms the data. Without subtask 2, the Virtual File System is practically indistinguishable from

its underlying physical file system, as is the case with plainfs. In short, the transformation applied to the

file system data is what best defines it.

For the sake of consistency, plainfs can be described as implementing subtask 2, thus actually

including a data transformation step, which in this case does nothing. As such, cipherfs can be seen

as a modification of plainfs, which in subtask 2 performs the AES encryption or decryption, thus

maintaining most of the previous structure and implementation.

One of the main advantages of the modular approach followed in the system’s design is its de-

composition into well defined and self-contained functional blocks. This organisation makes it easy

to locate and modify each functional block, independently of each other, which also reflects on the

system’s source code. As a result, the replacement of the transformation applied to the system’s data

affects only a small portion of the system’s code, mainly the read and write file system operations.


Moreover, the fundamental structural modifications required to implement the AES-based data trans-

formation in cipherfs are the same as those required for gipherfs. The only difference between the two

reside on the AES implementation which is CPU-based in cipherfs and GPU-based in gipherfs. This

means that most of the required system structural modifications will be implemented in this iteration.

This allows to fully focus on the specificities of moving the execution of AES to the GPU, on gipherfs.

The introduction of the AES-based data transformation into the system actually requires data to

be transformed twice. First, as already mentioned, AES can only handle data in block units, thus

requiring all its input data size to be a multiple of its block size. The padding technique described in

Section 3.2.2, can be used in order to enforce this requirement and apply a first transformation step

to the data processed by the system, extending it to the appropriate length. Second, the actual AES

encryption and decryption functions must be applied to the data. Both of these steps must be applied

together in both directions of the transformation, in order for AES to be properly implemented.

Figure 4.4: Encryption transformations: adding padding and ciphering

The encryption direction thus consists of padding the original operation payload and applying the

AES encryption function to the resulting data. These steps, depicted in Figure 4.4, must be applied

whenever data is to be written to the underlying physical file system, and are implemented on the VFS

write operation. Conversely, Figure 4.5 shows the decryption direction which consists of applying the

AES decryption operation to the ciphered data and removing the padding, previously added. These

must be implemented in the VFS read operation, whenever data is retrieved from the physical storage

medium.

Figure 4.5: Decryption transformations: deciphering and removing padding
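The two directions can be sketched as compositions of the PKCS#7 helpers of Section 3.2.2 with per-block cipher calls. The stubs below only fix the order of the steps (pad before ciphering, deciphering before unpad); the cipher itself is a toy XOR placeholder so the round trip is checkable, not the wolfSSL implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an AES-128 block cipher: byte-wise XOR with a fixed
 * key. Used here only to make the round trip verifiable. */
static void toy_cipher_block(unsigned char *b) {
    for (int i = 0; i < 16; i++) b[i] ^= 0x5A;
}

/* Write path (Figure 4.4): pad the payload, then cipher each 16-byte
 * block. `buf` must have room for len + 16 bytes; returns ciphered length. */
static size_t encrypt_payload(unsigned char *buf, size_t len) {
    size_t n = 16 - (len % 16);
    memset(buf + len, (int)n, n);               /* PKCS#7 pad       */
    len += n;
    for (size_t off = 0; off < len; off += 16)  /* cipher per block */
        toy_cipher_block(buf + off);
    return len;
}

/* Read path (Figure 4.5): decipher each block, then strip the padding. */
static size_t decrypt_payload(unsigned char *buf, size_t len) {
    for (size_t off = 0; off < len; off += 16)
        toy_cipher_block(buf + off);            /* XOR is its own inverse */
    return len - buf[len - 1];                  /* remove PKCS#7 pad      */
}
```

Applying decrypt_payload to the output of encrypt_payload recovers the original payload and length, which is exactly the invariant the VFS read and write operations must preserve.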


Besides the actual data transformation functions, the introduction of AES requires an additional

component. The function of this component is to set up and initialise all the required AES parameters,

namely the symmetric secret key used for ciphering and deciphering data. This only needs to be done

once, during the system’s bootstrap. It is important to note that only by providing the same secret key

between system invocations will it be possible to gain access to previously created files.

The focus of the proposed system is to assess the feasibility and measure the benefits of offload-

ing the execution of the AES encryption algorithm from the CPU to the GPU, and not on the key

management. As such, the AES secret key has been hardcoded into the system code in order to

simplify this management and the same key is used in all executions of the system. A more secure

management of the AES secret key can be implemented in the system, independently of where the ciphering computation takes place.

The functional decomposition of the system reduces the effort of implementing a complex secret key

management approach, given its isolation from the remaining system functional blocks.

4.2.1 Block based I/O

Section 3.2.1 presented the block based approach used for implementing the Virtual File System

read and write operations. The description illustrated the importance of handling I/O operations in

blocks for introducing the AES data transformation functions into the system. This is due to the fact

that AES always handles data in 16 byte blocks. By already handling data in aligned

block units whose size is a multiple of 16 bytes, it becomes trivial to handle AES blocks.

Had the read and write operations not been implemented in this manner, additional logic would be

required for correctly handling them. When reading, it would be possible to fetch partial AES blocks,

which would not be decipherable without further access to the underlying physical file system in order

to obtain the missing AES block bytes. The same would be true for writes, which would require

fetching additional data from the underlying physical file system in order to properly cipher the full

AES block. By taking the block nature of the encryption algorithm into account and always handling

data in a multiple of the AES block size, it becomes impossible to either retrieve or try to write partial

AES blocks, thus greatly simplifying the read and write operations implementation. When reading or

writing a file, the operation payload will always be made of aligned 16 byte blocks. It then becomes

trivial to split the payload into individual AES blocks which are then fed sequentially, one at a time, to

the AES decryption and encryption functions. Figure 4.6 summarizes the implemented mechanism,

using a 4 KB data payload as an example, which is split into 256 individual AES blocks processed

sequentially, one at a time, by the CPU.

A possible modification to cipherfs would be to process several AES blocks at a time, taking ad-

vantage of current commodity multicore CPUs. This was not implemented mainly because it would

make the implementation of cipherfs more complex, which is not the focus of this dissertation.

Moreover, one of the goals of the work proposed in this dissertation is to improve the encryption process’s

performance under an already heavily loaded CPU. In this setting, the use of additional cores would

not necessarily increase the overall system performance, since they would all probably be under a

high load. For these reasons, a multicore version of this system’s iteration was not implemented.


Figure 4.6: Handling a 4096 bytes payload, made of 256 AES blocks

4.2.2 AES functions

cipherfs AES data transformation functions, i.e. its encryption and decryption functions, are based

on an optimised implementation of AES which takes advantage of AES-NI. From a purely functional

perspective, and to a certain degree, the performance factor would be irrelevant for the selection of

the CPU-based AES implementation. However, AES-NI has been available in commodity processors

already for some years now and is currently commonly found in many CPUs. Therefore, an AES-NI-

enabled implementation of AES was chosen, in order to make a fair comparison between the CPU

and GPU versions of the system, without increasing its complexity.

It is important to mention once more that the focus of the solution presented in this document is

on building a system providing confidentiality of user file data with minimal impact on the CPU and

not on improving the performance of AES by proposing any modifications to it. Therefore, the imple-

mentations of the AES encryption algorithm used on both cipherfs and gipherfs are open source and

were made freely available by their developers to the community. More specifically, the AES

implementation used in cipherfs was extracted from the wolfSSL cryptographic library for the C programming

language [62], which has an open source license for non-commercial purposes.

The wolfSSL implementation was chosen mainly due to its ease of use, especially when

compared to other cryptographic libraries. It provides comprehensive documentation and has seen

wide adoption in the cryptographic community. This library includes an example on how to use the

AES implementation which was used as a prototype for the cipherfs data transformation functions, as

well as the auxiliary initial setup operation, used to generate the required AES parameters, including

the symmetric secret key.

One of the most important aspects about the operation of block ciphers, such as AES, is how

blocks are combined during the algorithm’s execution. This is called the algorithm’s mode of operation

and many different modes have been proposed over the years, each providing different types of


properties such as confidentiality, authentication and integrity.

Besides the obvious concern about security, the choice among the recommended modes for AES

in the context of this system was mainly driven by the mode’s suitability for parallelization. Due to the

ultimate goal of removing the computation of AES from the CPU, the most fundamental property of

any successful GPU implementation is the use of a mode of operation which does not rely on serial

execution. As a result, this also impacted the selection of the CPU-based implementation mode of

operation. Again, in order for the two versions to be as similar as possible and its comparison more

relevant, the same mode was chosen for both.

Given the main goals of proving the system’s feasibility and maximising the opportunity to parallelise

the execution of the AES functions, the Electronic Codebook (ECB) mode was the one selected for both

cipherfs and gipherfs. The parallelisation of ECB is straightforward, as the processing of each block is

completely independent of all the others. Moreover, it enables random access to ciphered data, due

to the lack of dependency between blocks.

The Counter (CTR) mode is also mentioned here for reference as an alternative to ECB. CTR

offers the same opportunities for parallelisation and random access as ECB, while providing a better

set of security guarantees by introducing a random counter which is mixed with each block before

applying the AES transformation.
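The independence between blocks that makes both modes parallelisable can be stated directly: in ECB, Ci = EK(Pi), so every block can be handled by a different thread; in CTR, Ci = Pi XOR EK(counter i). The toy sketch below (a XOR placeholder stands in for AES; only the structural properties of the modes are illustrated, not their security) also shows ECB’s well-known weakness, namely that equal plaintext blocks yield equal ciphertext blocks, which CTR avoids:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for AES-128: XOR with a fixed key byte. */
static void toy_encrypt(const unsigned char *in, unsigned char *out) {
    for (int i = 0; i < 16; i++) out[i] = in[i] ^ 0x5A;
}

/* ECB: each block ciphered independently; trivially parallel, but equal
 * plaintext blocks map to equal ciphertext blocks. */
static void ecb(const unsigned char *p, unsigned char *c, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++)
        toy_encrypt(p + 16 * i, c + 16 * i);
}

/* CTR: the cipher is applied to a per-block counter and XORed with the
 * plaintext; equally parallel, and equal blocks no longer collide. */
static void ctr(const unsigned char *p, unsigned char *c, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++) {
        unsigned char ctr_block[16] = {0}, keystream[16];
        ctr_block[15] = (unsigned char)i;  /* random nonce omitted for brevity */
        toy_encrypt(ctr_block, keystream);
        for (int j = 0; j < 16; j++)
            c[16 * i + j] = p[16 * i + j] ^ keystream[j];
    }
}
```

Because each iteration touches only its own 16-byte slice, both loops map directly onto one-thread-per-block GPU kernels.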

4.3 gipherfs: a GPU-accelerated ciphered file system

gipherfs is the final iteration of the system described in this document and focuses on the task of

offloading the computation required by the AES cryptographic functions from the CPU to the GPU.

It builds on the structure of cipherfs and replaces its data encryption extensions implemented on the

CPU with an equivalent implementation on the GPU, in order to speed it up and minimise its impact

on the CPU and the system’s overall performance. The integration of the GPU-accelerated implemen-

tation of the AES encryption and decryption mechanisms and its benefits are the core of the solution

proposed in this dissertation and also pose the greatest challenges. The modular approach followed

in the system design makes it possible to leverage an already fully functional Virtual File System

and to concentrate only on the specific challenges of using the GPU to accelerate the computation

intensive AES primitives.

Going back to the decomposition of the system into three independent subtasks, gipherfs is obtained from cipherfs by replacing the implementation of subtask 2. In this case, the AES data transformation functions are replaced by GPU equivalents that perform exactly the same transformation.

The description of the gipherfs implementation starts with the mapping of the AES encryption and decryption functions onto the CUDA programming model. It then focuses on the main aspects and tradeoffs of efficiently implementing the required cryptographic mechanisms on the GPU using CUDA, and of integrating them within the FUSE file system.


4.3.1 GPU implementation of AES

As described above, the CUDA implementation of AES used in gipherfs is open source and was obtained from an online repository [61]. Like any GPU program, the implementation of AES on the GPU requires the following three steps:

1. Transfer data from the host to the GPU

2. Process the data on the GPU

3. Transfer resulting data back from the GPU to the host

From a performance perspective, steps one and three consist of transferring data from the host to the GPU, and vice versa, through the system I/O bus. They are thus a source of overhead, required in

order to take advantage of the GPU computational power. The trade-off between the additional mem-

ory transfer overheads and the performance benefits obtained from running the AES cryptographic

functions on the GPU is the first aspect to consider in an efficient GPU implementation and greatly

determines the merit of this type of approach. The impact of the memory transfer overheads is eval-

uated in Section 5.2, in order to determine the minimum amount of data processed in each GPU

invocation so that the integration of the GPU into the system becomes beneficial.

The other main aspect of an efficient implementation of AES on the GPU is the ability to properly map its operations onto the CUDA programming model. Only by decomposing the processing steps of AES into a set of both coarse- and fine-grained processing tasks, as described in Section 2.2, is it possible to make the most of the GPU computing capabilities. Coarse-grained tasks must

be independent from each other, in order to take advantage of the high number of Streaming Multi-

processors and CUDA cores available in current GPUs. All data dependencies between tasks must

be handled in finer-grained tasks and kept to an absolute minimum in order to fully exploit the GPU

parallel processing capabilities.

gipherfs uses the ECB mode of operation in order to minimise the amount of data dependencies

in the AES encryption algorithm. In this way, there are no data dependencies between individual AES

blocks. Each block is processed independently of all the others, which maps perfectly to the CUDA

programming model and is highly beneficial for an efficient implementation on the GPU. As such, the

CUDA implementation of AES starts by decomposing the application of the encryption and decryption

functions to a data payload of n bytes into n/16 coarse tasks, consisting of processing each AES block

independently.

However, the processing of each AES block cannot be further refined into a set of parallel operations. The AES algorithm, as described in Section 2.1.1, applies rounds of successive transformations

to the input data block. The output of the previous round is used as the input of the next one, making

it impossible to parallelise the application of rounds. This is reflected in the CUDA implementation

by mapping CUDA threads to AES blocks, where each thread is responsible for processing a single

AES block. Therefore, a data payload of n bytes is split into n/16 AES blocks which are concurrently

processed by n/16 CUDA threads.


A further optimisation can be applied to the implementation of AES by taking advantage of the

CUDA thread block abstraction and by enforcing thread cooperation inside a thread block. In detail,

CUDA threads can be grouped together in CUDA thread blocks. Threads inside the same thread

block are guaranteed to execute on the same Streaming Multiprocessor and are able to share a

limited amount of data and synchronize their execution. Moreover, the CUDA memory hierarchy is

composed of several levels, each with its own characteristics. In particular, shared memory is a low-

level high bandwidth memory space, shared among all threads within a block, which can be highly

beneficial for any efficient GPU implementation.

The gipherfs CUDA implementation of AES also makes use of a fine-grained task decomposition

by having threads inside the same thread block cooperating among themselves. This is achieved

by having a single thread within the block copying the Substitution Box translation table and Key

Schedule, used by all threads in the rounds of AES, from the device main memory into the thread

block’s shared memory. The remaining threads must thus synchronise and wait for this transfer to

complete before proceeding. Still, the improved memory access is expected to make up for this

synchronisation point.
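The cooperative staging pattern described above can be sketched on the CPU with a barrier, which plays a role analogous to CUDA's `__syncthreads()`. The names and sizes below are illustrative: a single designated thread stages the lookup table into a shared area, and all threads synchronise before using it.

```python
import threading

def thread_block_round(tb_threads: int, sbox_global: list) -> list:
    # "shared" plays the role of the thread block's shared memory;
    # "sbox_global" plays the role of the device main memory.
    shared = {}
    barrier = threading.Barrier(tb_threads)  # analogue of __syncthreads()
    results = [None] * tb_threads

    def worker(tid: int) -> None:
        if tid == 0:
            # A single thread copies the table into shared memory.
            shared["sbox"] = list(sbox_global)
        # Every thread waits until the staging copy has completed.
        barrier.wait()
        # All threads now read from the (faster) shared copy.
        results[tid] = shared["sbox"][tid % len(sbox_global)]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(tb_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```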

In order to maximize the benefits yielded by this mechanism, the number of threads per block

should be as high as possible. The theoretical maximum number of threads per block in the Maxwell

architecture is 1024. However, this value is limited by the amount of shared memory and the number

of registers used by the AES kernels, which limits the effective number of threads per block to 256 in

the AES CUDA implementation. When the payload length is higher than 4 KB, and the corresponding

number of processing threads exceeds 256, multiple thread blocks are launched at a time. A data

payload of 32 KB, for example, would result in launching 8 thread blocks simultaneously, which would

be assigned to 8 different Streaming Multiprocessors. The case when the number of scheduled thread

blocks exceeds the number of available SMs is automatically handled by the CUDA runtime, which

queues thread blocks for execution as soon as resources become available.

Considering a payload of 4 KB, it will be split into 4096/16 = 256 independent AES blocks. The payload is then processed by a single CUDA thread block, made up of 256 CUDA threads, which process it cooperatively and concurrently. Each AES block is assigned to a single CUDA thread which

implements the AES block encryption or decryption operation. After being processed by the GPU, the

AES data blocks are merged together again to obtain the final ciphered/deciphered payload.
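The launch-geometry arithmetic described in this subsection can be summarized in a short sketch, assuming the 16-byte AES block size and the 256-thread-per-block limit mentioned above:

```python
BLOCK = 16            # AES block size in bytes
THREADS_PER_TB = 256  # effective limit imposed by shared memory and register usage

def launch_geometry(payload_len: int) -> tuple:
    # One CUDA thread per AES block; threads are grouped into
    # thread blocks of at most 256 threads each.
    aes_blocks = payload_len // BLOCK
    thread_blocks = (aes_blocks + THREADS_PER_TB - 1) // THREADS_PER_TB
    return aes_blocks, thread_blocks
```

For example, a 4 KB payload maps to 256 AES blocks in a single thread block, while a 32 KB payload maps to 2048 AES blocks spread over 8 thread blocks, matching the examples in the text.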

4.3.2 Maximizing payload length: FUSE big-writes mode

Compute-intensive tasks are the ones that benefit the most from the GPU computing paradigm. The

high density of computing units found on the GPU greatly favours applications which spend most of

their time doing computation. In contrast, I/O-intensive tasks, which spend most of their time doing I/O operations, do not fit well under the GPU computing model. This becomes rather evident from the

first aspect of an efficient GPU implementation mentioned previously, regarding the memory transfer

overheads imposed by the requirement of transferring data between the host and the GPU.

The operations making up the AES encryption algorithm can be categorized as compute-intensive


as they consist of the application of several rounds of data transformation functions to the same

piece of data, the AES block. However, the overall performance of the AES encryption algorithm will

only be able to benefit from the relocation of its computation from the CPU to the GPU if it is able to

compensate for the penalty of transferring the input and output data to and from the GPU. This can

only be achieved by maximizing the amount of data processed by each GPU invocation.

The amount of data processed by gipherfs in each GPU invocation is determined by the length of the payload of FUSE's read and write operations. The payload ultimately depends on how the client

application uses the file system API, but FUSE also plays a major role in it by modifying the original

request and handling it in smaller data chunks in the case of the write operation. By default, FUSE

handles the payload of incoming write operations in data chunks of at most 4 KB. This maps to

a single CUDA thread block composed of 256 threads, and is not sufficient to compensate for the

memory transfer overhead, as is demonstrated in Section 5.2.

FUSE provides an alternative mode which makes it possible to overcome this limitation and in-

crease the maximum payload of write operations. The mode is called big-writes and increases the

maximum write operation payload by a factor of 16, i.e. from 4 KB to 64 KB. As verified in Section 5.2,

this amount of data is already sufficient to compensate the overheads incurred by the host-to-device

memory transfers, thus greatly boosting the performance of gipherfs.
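A simple cost model helps illustrate why larger payloads are needed before offloading pays off: the CPU processes data serially at a fixed rate, while each GPU invocation pays a fixed overhead (allocation and transfer latency) followed by near-constant parallel processing time. The constants below are purely illustrative, not measured values; they were chosen so that the model reproduces the 64 KB break-even point reported in Section 5.2.

```python
def gpu_time(nbytes: int, fixed_overhead_s: float = 5e-4,
             gpu_bytes_per_s: float = 2e9) -> float:
    # Fixed per-invocation cost plus a small per-byte cost (illustrative constants).
    return fixed_overhead_s + nbytes / gpu_bytes_per_s

def cpu_time(nbytes: int, cpu_bytes_per_s: float = 1e8) -> float:
    # Serial processing: time grows linearly with the payload size.
    return nbytes / cpu_bytes_per_s

def break_even_chunk(start: int = 2 * 1024, limit: int = 512 * 1024) -> int:
    # Smallest power-of-two chunk size (within the benchmarked range)
    # for which the modelled GPU call beats the CPU.
    size = start
    while size <= limit:
        if gpu_time(size) < cpu_time(size):
            return size
        size *= 2
    return None
```

With these assumed constants, 4 KB chunks (FUSE's default write granularity) lose to the CPU, while 64 KB chunks (big-writes mode) win, mirroring the behaviour measured experimentally.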

4.3.3 Improving device memory management

Some of the most expensive operations in any GPU program implementation are the ones related to

memory management and transfer. Memory transfers to and from the GPU are perhaps the most

notable and introduce a significant amount of execution overhead. They can render the approach of using the GPU to accelerate compute-intensive tasks completely ineffective, by incurring more overhead than performing the computation directly on the CPU. Even though recent

GPU architectures already support overlapping of memory transfers and kernel execution, it is still not

possible to avoid them completely.

There are however some operations whose cost can be minimized, or even completely eliminated,

by employing more efficient memory management mechanisms. This is the case of device memory

allocation. In more detail, a typical memory transfer operation between the host and the GPU, required

whenever a kernel is launched on the GPU, consists of the following steps, assuming data is already

available at the host:

1. Allocate device memory

2. Copy data from the host to the device

3. Copy data from the device to the host

4. Free device memory

Steps two and three cannot be avoided, since memory must be transferred to the device for

processing and back to the host for collecting results. However, when a kernel is launched repeatedly,


the impact of steps one and four can be minimized by performing them only a single time. Memory

can be allocated once before launching kernels, and it can be freed only after all kernels finish, thus

reusing previously allocated memory on the device. However, this mechanism can only be applied to

systems where the required amount of allocated memory on the device can be estimated with a high

degree of confidence. Otherwise the program will either fail due to lack of memory or the reallocation

of device memory will render the mechanism ineffective.
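The effect of hoisting steps one and four out of the per-call sequence can be sketched with a small counting model (hypothetical, for illustration only): with reuse, an arbitrary number of kernel launches requires a single allocation, while the naive sequence allocates and frees on every call.

```python
def run_kernels(n_calls: int, reuse: bool = False) -> int:
    """Count device allocations for n_calls kernel launches."""
    allocs = 0
    buf = None
    for _ in range(n_calls):
        if buf is None or not reuse:
            buf = bytearray(64 * 1024)  # step 1: allocate device memory
            allocs += 1
        # steps 2 and 3: copy data in, run the kernel, copy results out
        # (unavoidable in either scheme)
        if not reuse:
            buf = None                  # step 4: free device memory
    return allocs
```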

This is the case of gipherfs where the maximum amount of required allocated memory on the

device is limited by the maximum amount of data FUSE tries to read or write, i.e. decipher or cipher,

simultaneously on the GPU. These limits can be determined by analyzing how FUSE works. With respect to

writes, and even though gipherfs uses FUSE in multithreaded mode, only a single write operation is

handled by FUSE at any given time. This is due to a lock on the fuse kernel module, which serializes

write operations. As a result, the maximum amount of memory required by a write operation will be

the maximum payload handled by a single write operation, which is 64 KB in FUSE big-writes mode

plus 16 bytes of potential padding, generated whenever write operations extend the previous file size.

On the other hand, it is possible to determine, analysing the operation of FUSE for different payloads,

that at most four read operations can execute concurrently, each capable of handling a maximum

payload of 128 KB. Padding need not be considered in this case, since it is not generated by read

operations.

gipherfs implements an efficient memory allocation mechanism by pre-allocating and reusing a

pool of device memory blocks, sized according to the previously mentioned observations. When the system starts, 64 blocks of 128 KB each are pre-allocated on the device. Even though at most five concurrent operations may be executing on the device, according to the above observations, blocks are over-allocated in order to prevent starvation in case the system actually tries to allocate

more memory on the device. Whenever there is a need to allocate memory on the device, a pointer

to one of the available device memory blocks is obtained from the pool, instead of performing the

actual memory allocation. Conversely, after kernel execution, instead of freeing the block from device

memory, the no-longer-required pointer to the device memory is returned to the pool. In the unlikely case when more than 64 blocks are required at the same time, the thread requiring the block will wait until one becomes available. The performance improvements provided by this

mechanism are demonstrated in Section 5.3, where a comparison is made between the system with

and without this mechanism in place.
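The pool mechanism described above can be sketched as follows, simulating device buffers with host-side byte arrays (illustrative only; the actual implementation manages pointers to device memory):

```python
import threading

class DevicePool:
    """Pre-allocated pool of fixed-size buffers standing in for device memory blocks."""

    def __init__(self, num_blocks: int = 64, block_size: int = 128 * 1024):
        # All blocks are allocated once, at system start.
        self._free = [bytearray(block_size) for _ in range(num_blocks)]
        self._cond = threading.Condition()

    def acquire(self) -> bytearray:
        # Replaces the per-call device allocation: hand out a pre-allocated
        # buffer, blocking if all blocks are in use (highly unlikely with 64).
        with self._cond:
            while not self._free:
                self._cond.wait()
            return self._free.pop()

    def release(self, buf: bytearray) -> None:
        # Replaces the per-call device free: return the buffer to the pool
        # and wake up any thread waiting for one.
        with self._cond:
            self._free.append(buf)
            self._cond.notify()
```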


5 Solution Evaluation

Contents
5.1 Direct integration of GPU AES acceleration
5.2 Enabling big-writes in the proposed solution
5.3 Improving device memory management


This section presents the evaluation of gipherfs, a secure Virtual File System implementing effi-

cient data confidentiality of its files by offloading the computation of the AES encryption algorithm from

the CPU to the GPU, according to the solution proposed in Section 3. The system evaluation is per-

formed by comparing the performance of gipherfs against both of its intermediate iterations cipherfs

and plainfs, in order to determine the advantages of offloading the AES computation to the GPU, as

well as the overhead of integrating GPU-accelerated cryptographic mechanisms into the operation

of a Virtual File System. The effectiveness of the optimizations enabling an efficient integration of GPU computing mechanisms, proposed in Section 4.3, is analyzed in detail by considering different

modes of gipherfs with and without including them. This section also provides evidence that the introduction of GPU-accelerated cryptographic mechanisms into the operation of a Virtual File System without any security features only reduces its performance by half in the worst-case scenario, while relocating the computation required by this kind of mechanism from the CPU to the GPU can boost its performance by a factor as high as 3.5.

All experimental results presented in this Section were obtained on a computing platform consisting

of an Intel Core i7-5960X CPU (3 GHz) with 32GB of main memory (DRAM) and an NVIDIA GeForce

GTX 980 GPU, powered by NVIDIA’s Maxwell GPU computing architecture, with Compute Capability

5.2.

5.1 Direct integration of GPU AES acceleration

The evaluation of the system starts with the performance of a naive implementation of gipherfs, which corresponds to the original system implementation without any of the optimizations described in Section 4.3. Even though these could have been included in the system

design since its inception, it was deemed worthy to present the evaluation of the system incrementally,

by adding each of the optimizations one at a time. This organisation provides an in-depth evaluation

of the performance improvements introduced with each optimization and makes it easier to highlight

their overall impact.

Figures 5.1 and 5.2 present the bandwidth, in MB/s, obtained when copying a file into and out of a

directory mounted on three different Virtual File Systems. These file systems correspond to the

three incremental iterations described in Chapter 4, namely: plainfs, cipherfs and gipherfs, where the

latter one corresponds to a naive GPU implementation without any optimizations. Figure 5.1 presents

the results obtained when copying a file out of the file system, and can be used to determine the

file system’s read bandwidth. Figure 5.2 provides the evaluation of the opposite operation, i.e. when

copying a file into a directory mounted in the file system, and can thus be used to determine the file

system’s write bandwidth. The results were obtained by measuring the time taken to perform a copy

operation, by using the cp bash command and by considering different file sizes, ranging from 64 KB

to 128 MB. The transfer time was averaged across 100 iterations of the copy operation performed in

each direction, in order to obtain a stable value, which was used to compute the average bandwidth.


Figure 5.1: Naive gipherfs: read bandwidth

The obtained results match the expected performance for each of the evaluated file systems.

plainfs is the fastest among them by a large margin (in both directions), yielding about 6 and

20 times higher bandwidth for read and write operations, respectively. This is obvious given the

absence of any data transformation applied to its file data, since plainfs does not provide any security

features and stores its file data in plaintext. The discrepancy becomes even more apparent for write

operations, which are serialized and limited to process at most 4 KB at a time by FUSE in this mode.

Focusing on the performance of cipherfs and gipherfs, it can be noted that both systems attain

similar read bandwidth, with gipherfs providing slightly higher bandwidth for bigger files. This is ex-

plained by the benefits of processing bigger chunks of data in the GPU. As the file grows, the amount

of data processed by FUSE in each read operation also increases, up to a maximum of 128 KB per

operation. This amount of data is already enough to compensate for the GPU memory transfer over-

heads incurred in each GPU invocation. As such, offloading the computation of AES to the GPU

already pays off in this situation, even if only by a small margin. In contrast, the limitations imposed by

FUSE on write operations specially hurt the performance of gipherfs. It is almost 5 times slower than

cipherfs due to the fact that processing at most 4 KB at a time in each GPU invocation is not enough

to compensate for the memory transfer overheads between the CPU and the GPU, required for each

kernel invocation.

Another interesting property that can be deduced from the presented evaluation is that, with respect

to read operations, only plainfs benefits from processing bigger files, which is evidenced by the almost

linear increase in the attained bandwidth with the increase of the file size. The same observation

does not hold for gipherfs and cipherfs. This can be explained by plainfs not being limited by any data

transformation computation, in contrast to the other two systems. The increased bandwidth of plainfs

means that the data processing time is not proportional to the total amount of data processed, thus it

benefits more from processing a larger amount of data. gipherfs and cipherfs however are limited by


the time taken to perform the AES transformations, which also limits their maximum bandwidth.

Figure 5.2: Naive gipherfs: write bandwidth

5.2 Enabling big-writes in the proposed solution

The naive integration of the GPU accelerated version of AES in the proposed system yields rather

disappointing results when compared with a typical implementation that relies only on the CPU. Even

though it provides equivalent read performance, it attains a much lower write bandwidth. This can

be explained by how FUSE handles write operations by default, in which payloads are split into data

chunks of at most 4 KB, which are serially processed. This mode of operation does not fit well under

the GPU programming paradigm, which favours compute intensive tasks and is highly penalized by

I/O operations. The limited amount of data processed by each call to the GPU and an excessive

amount of relatively small data transfers between the host and the GPU severely impact the write

bandwidth of gipherfs, as evidenced in Figure 5.2.

It seems plausible that increasing the amount of data processed by each write operation, and con-

sequently by each GPU invocation, would greatly benefit the write performance of gipherfs. In order

to determine the potential benefits of increasing the size of the FUSE write payload, as described in

Section 4.3.2, the bandwidth of the isolated AES encryption operation was benchmarked on both the

CPU and the GPU. For this purpose, the impact of increasing the amount of data processed by each

individual call was measured. This experimental evaluation consisted of processing an 8 MB file by

increasing the size of the processed data chunks, ranging from 2 to 512 KB, both on the CPU and on the

GPU and comparing the obtained bandwidth.


Figure 5.3: AES encryption benchmark: average block processing time

Figure 5.3 presents the average processing time of each data block across 100 iterations. It

shows that the processing time of each data block increases with its size in the CPU version, while

it remains relatively constant on the GPU. More specifically, as the block size doubles, so does the

CPU processing time. This can be explained by the sequential nature of the CPU AES encryption

implementation, in which each AES block is processed serially, one at a time. In contrast, the GPU

AES encryption implementation processes all AES blocks concurrently, thus incurring little overhead

when processing a larger amount of data. Actually, increasing the amount of data processed in a

single GPU call by a factor of 256 times, i.e. from 2 to 512 KB, only increases processing time by a

factor of 2. Figure 5.3 also shows that processing at least 64 KB of data in each GPU call already

makes the GPU implementation of the AES encryption operation more efficient than the one that

strictly relies on the CPU.


Figure 5.4: AES encryption benchmark: average total processing time

Figure 5.5: AES encryption benchmark: average bandwidth

Figure 5.4 presents the total time taken by each implementation to process the complete 8 MB file.

The obtained results are consistent with the results presented for the average block processing time in

Figure 5.3, and show that the GPU performance is highly penalized when the file is processed in small data chunks. The CPU version, on the contrary, is unaffected by this variation, as its total processing


time remains constant. However, the GPU implementation is able to outperform the CPU one when

the block size is at least 64 KB. Figure 5.5 presents an alternative perspective of the complete file

processing performance, based on the average bandwidth. These results are consistent with the ones

presented in Figure 5.2 and further sustain the argument for the poor write performance of the naive

implementation of gipherfs. Recalling that FUSE splits write operations in chunks of 4 KB by default,

the low write bandwidth of gipherfs becomes evident by comparing the bandwidth of the CPU and

GPU versions of the AES encryption operation for 4 KB data chunks in Figure 5.5. The benchmark

results further sustain the hypothesis formulated in Section 4.3.2 that increasing the amount of data

processed in each FUSE write operation would benefit the performance of gipherfs. This can be

observed in Figure 5.4 for block sizes from 64 KB up to 512 KB, which enable the GPU execution to outperform its CPU counterpart by a factor of up to 3.

Figure 5.6 presents the write performance of an optimized version of all Virtual File Systems,

using FUSE in big-writes mode. In this mode, each write operation is able to process at most 64

KB of data at a time, instead of the default maximum of 4 KB. By comparing these results with the

ones presented in Figure 5.2, it becomes clear that all iterations benefit from the introduction of big-

writes mode. plainfs and cipherfs write bandwidth increases by a factor of about 0.5 and 2 times,

respectively. However, gipherfs is the one that benefits the most. Its write bandwidth is boosted by a

factor of 10 for big files, and brings it to par with the performance of cipherfs for both read and write

operations, with writes actually being consistently more efficient in gipherfs.

Figure 5.6: Increasing gipherfs write payload: write bandwidth

5.3 Improving device memory management

The last GPU integration optimization proposed in Section 4.3.3 aims at minimizing the memory

management overheads required by GPU programs, more specifically by eliminating the device memory

allocation and deallocation steps. As in the previous experimental evaluation, a specific benchmark


was developed to evaluate the potential benefits of this optimization. For this purpose, all the individual

function calls making up a complete call to the GPU were analyzed in detail, in order to determine their

relative cost. The analysis was again based on the encryption of an 8 MB file, processed in different

block sizes ranging from 2 KB to 512 KB.

Figure 5.7: GPU memory operations benchmark: relative cost

Figure 5.7 presents the relative time spent when allocating memory on the GPU (MemAlloc);

transferring data from the host to the GPU (MemH2D); executing the AES encryption kernel on the

GPU (ExecKernel); transferring the results from the GPU to the host (MemD2H); and releasing the

GPU memory (MemFree). The results show that most of the time is spent allocating memory on

the device, followed by releasing it, especially for smaller block sizes. As the block size increases,

memory transfers and kernel execution amortize the cost of these operations. However, they always

represent at least 50% of total GPU execution time. The percentages for 64 and 128 KB block sizes

are especially relevant, as these are the maximum payloads handled by FUSE in big-writes mode for

write and read operations respectively. In this setting, device memory allocation and deallocation take

about 80% of the total GPU execution time. These results support the expected benefits of diminishing the impact of the memory allocation and deallocation steps by pre-allocating and reusing device memory buffers between GPU invocations, as described in Section 4.3.3.

Figures 5.8 and 5.9 present a final comparison between the three iterations of the system, using a

version of gipherfs implementing all optimization steps described in Sections 4.3.2 and 4.3.3. Focus-

ing first on the benefits enabled by avoiding device memory allocation, gipherfs performance is further

boosted by a factor of 2 for both read and write operations. This comes as a result of eliminating the

operations responsible for about 80% of the total GPU execution time in each kernel invocation, as


evidenced in Figure 5.7. Moreover, the final version of gipherfs outperforms cipherfs by a factor of

3.5 and 3 for read and write operations, respectively. This provides evidence of the benefits of

offloading the computation of the AES algorithm from the CPU to the GPU, especially when large

amounts of data are processed. Finally, comparing the obtained performance of gipherfs with plainfs,

it is possible to determine that the impact of the security guarantees offered by the proposed solution

only reduces the performance of a default non-secure file system by half, in the worst-case scenario.

Figure 5.8: Improved device memory management in gipherfs: read bandwidth

Figure 5.9: Improved device memory management in gipherfs: write bandwidth


Figure 5.10: gipherfs optimizations: increased read bandwidth

Figures 5.10 and 5.11 summarize the benefits introduced by each optimization to gipherfs in or-

der to better highlight their impact. Due to the concurrent nature of FUSE read operations, the read

bandwidth of gipherfs is consistently higher than its write bandwidth. Read operations do not ben-

efit from the introduction of big-writes mode, as it affects only the write operation, but the improved

device memory management boosts its performance by a factor of almost 3. On the other hand,

gipherfs write operations benefit the most from the introduction of big-writes mode, which increases

its throughput by a factor of more than 10, from 4 MB/s to 46 MB/s in the best case. When the im-

proved device memory management is introduced, the write bandwidth is further improved by a factor

of 2, approximately. These results demonstrate the relevance and the impact of the optimizations

introduced into the system, which provide different benefits for different types of I/O operations, but in

a balanced manner.

Figure 5.11: gipherfs optimizations: increased write bandwidth


6 Conclusions and Future Work


The work developed in this dissertation aimed at implementing a secure file system, using the standard AES encryption algorithm to provide confidentiality to a file system and its data in an efficient and transparent manner. To achieve these goals, a Virtual File System based on FUSE was implemented and extended with cryptographic mechanisms accelerated by a Graphics Processing Unit. Using the high-performance capabilities of the GPU to execute the compute-intensive cryptographic primitives of the AES algorithm made it possible to speed up their execution and increase the file processing throughput. At the same time, using the GPU as a cryptographic co-processor maximized its utility and minimized the impact of the additional security features on the CPU and on the overall system performance. Moreover, the decision to automatically integrate the encryption mechanism into the operation of the Virtual File System facilitates its adoption, since no explicit user intervention is required once it is in place.

The implemented system was evaluated by comparing its performance with an equivalent CPU-only alternative, in order to determine the benefits of offloading the computation to the GPU.

The experimental results were obtained by measuring the Virtual File System's read and write bandwidth when performing typical file system operations, namely accessing files. The results obtained with state-of-the-art GPUs confirm the advantages and merit of this approach by showing a consistent increase of the Virtual File System's file processing throughput, especially for large files of hundreds of Megabytes. The system was able to increase write and read bandwidths by factors of up to 3 and 3.5, respectively, when compared with an equivalent CPU-only alternative. Moreover, the results also show that the introduction of GPU-accelerated encryption primitives into the Virtual File System reduces its throughput by at most a half in the worst case.

Future Work

Several directions could be pursued in future work. In terms of the system's security, a better key management mechanism and a more secure mode of operation for the AES encryption algorithm, such as CTR [63] or XTS [64], could be adopted, increasing the system's security without compromising its performance. Additionally, other security concerns, such as data integrity and authentication, could also be addressed using a similar GPU-based approach. In terms of performance, more advanced CUDA mechanisms, such as CUDA streams, could be leveraged to further increase the overall system throughput. FUSE's limitations, namely the serialization imposed on write operations, could also be addressed to further improve the system's write bandwidth.
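One reason CTR mode suits a GPU-accelerated design is that each counter block is encrypted independently, so keystream blocks can be computed in parallel and no chaining is needed across blocks. The sketch below illustrates only the counter/XOR construction; it is not GipherFS code, and a hash-based function stands in for the AES block cipher purely for self-containment (a real system would call AES, e.g. on the GPU).

```python
import hashlib

BLOCK_SIZE = 16  # AES block size in bytes


def block_cipher(key: bytes, block: bytes) -> bytes:
    # Stand-in for the AES block cipher, used only to demonstrate the
    # CTR structure. NOT secure; a real implementation would use AES.
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def ctr_keystream(key: bytes, nonce: bytes, num_blocks: int) -> bytes:
    # Each keystream block is E_K(nonce || counter). The blocks are
    # mutually independent, which is what makes CTR parallelizable.
    out = b""
    for counter in range(num_blocks):
        counter_block = nonce + counter.to_bytes(8, "big")
        out += block_cipher(key, counter_block)
    return out


def ctr_crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Encryption and decryption are the same operation: XOR with the
    # keystream. Partial final blocks need no padding.
    num_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    stream = ctr_keystream(key, nonce, num_blocks)
    return bytes(a ^ b for a, b in zip(data, stream))


key = b"0123456789abcdef"          # hypothetical 16-byte key
nonce = b"unique8B"                # 8-byte nonce; counter fills the rest
plaintext = b"GipherFS could use CTR mode for parallel encryption."
ciphertext = ctr_crypt(key, nonce, plaintext)
assert ctr_crypt(key, nonce, ciphertext) == plaintext  # round trip
```

Note that the security of CTR mode depends on never reusing a (key, nonce) pair, which is one reason the key management mechanism mentioned above matters.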


Bibliography

[1] J. Barkley, "Comparing simple role based access control models and access control lists", Proceedings of the Second ACM Workshop on Role-Based Access Control, 1997, pp. 127-132, ACM Press

[2] Y. Cherdantseva and J. Hilton, "A Reference Model of Information Assurance & Security", 2013 International Conference on Availability, Reliability and Security, Regensburg, 2013, pp. 546-555

[3] J. Saltzer and M. Schroeder, "The protection of information in computer systems", Proceedings of the IEEE 63, 1975, pp. 1278-1308

[4] R. L. Rivest, "Cryptography", in J. van Leeuwen (ed.), Handbook of Theoretical Computer Science, Chapter 13, Elsevier, 1990, pp. 717-755

[5] J. Thakur and N. Kumar, "DES, AES and Blowfish: Symmetric Key Cryptography Algorithms Simulation Based Performance Analysis", International Journal of Emerging Technology and Advanced Engineering, December 2011

[6] S. Gueron, "Intel® Advanced Encryption Standard (AES) New Instructions Set", Intel Architecture Group, Israel Development Center, Revision 3.01, 2012, https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf

[7] FIPS-197: Advanced Encryption Standard (AES), Federal Information Processing Standards Publication, National Institute of Standards and Technology (NIST), November 2001, http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf

[8] K. T. Cheng and Y. C. Wang, "Using mobile GPU for general-purpose computing – a case study of face recognition on smartphones", in Proceedings of the 2011 International Symposium on VLSI Design, Automation and Test, pp. 1-4, Hsinchu, 2011

[9] J. Leskela, J. Nikula and M. Salmela, "OpenCL Embedded Profile Prototype in Mobile Device", IEEE Workshop on Signal Processing Systems, pp. 279-284, October 2009

[10] NVIDIA, CUDA 4.0, http://developer.nvidia.com/cuda-toolkit-40, 2011

[11] J. E. Stone, D. Gohara and G. Shi, "OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems", Computing in Science & Engineering, 2010, pp. 66-72

[12] C. Mitasch, "Ext4 Filesystem", https://www.thomas-krenn.com/en/wiki/Ext4_Filesystem


[13] M. Szeredi, File system in user space, http://fuse.sourceforge.net

[14] FUSE, File system in user space, http://northstar-www.dartmouth.edu/~richard/WhitePapers/FUSE.html

[15] Z. Wang, R. Murmuria and A. Stavrou, "Implementing and Optimizing an Encryption Filesystem on Android", 2012 IEEE 13th International Conference on Mobile Data Management, Bengaluru, Karnataka, 2012, pp. 52-62

[16] W. Sun, R. Ricci and M. L. Curry, "GPUstore: harnessing GPU computing for storage systems in the OS kernel", in Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR '12), ACM, New York, NY, USA, 2012, Article 9, 12 pages

[17] J. Katz and Y. Lindell, Introduction to Modern Cryptography, 2nd edition, Chapman & Hall/CRC, 2014, ISBN 9781466570269

[18] Mitali, V. Kumar and A. Sharma, "A Survey on Various Cryptography Techniques", International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Volume 3, Issue 4, July-August 2014, ISSN 2278-6856

[19] J. Daemen and V. Rijmen, "AES Proposal: Rijndael", National Institute of Standards and Technology, March 9, 2003

[20] H. Gilbert and T. Peyrin, "Super-Sbox Cryptanalysis: Improved Attacks for AES-Like Permutations", in Fast Software Encryption: 17th International Workshop (FSE 2010), Seoul, Korea, February 7-10, 2010, pp. 365-383

[21] Y. Zhou and D. Feng, "Side-Channel Attacks: Ten Years After Its Publication and the Impacts on Cryptographic Module Security Testing", IACR ePrint Archive, http://eprint.iacr.org/2005/388, 2005

[22] W. Tuchman, "A Brief History of the Data Encryption Standard", 1998, pp. 275-280, ACM Press/Addison-Wesley Publishing Co., New York, NY, USA

[23] FIPS Pub 46-3, Federal Information Processing Standards Publication: Data Encryption Standard (DES), October 1999

[24] J. Grabbe, "Data Encryption Standard: The Triple DES algorithm illustrated", Laissez Faire City Times, Volume 2, No. 28, 2003

[25] B. Schneier, "Fast Software Encryption", Cambridge Security Workshop Proceedings, December 1993, pp. 191-204

[26] Wikipedia, "Block cipher mode of operation", https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation

[27] W. Diffie and M. Hellman, "New Directions in Cryptography", IEEE Trans. Inf. Theor., November 1976


[28] R. L. Rivest, A. Shamir and L. Adleman, "A Method for Obtaining Digital Signatures and Public-key Cryptosystems", Commun. ACM, Feb. 1978, pp. 120-126

[29] A. Kerckhoffs, "La cryptographie militaire", Journal des sciences militaires, vol. IX, pp. 5-83, January 1883

[30] M. Galloy, "CPU vs GPU performance", http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html

[31] M. Segal and K. Akeley, The OpenGL Graphics System: A Specification, Version 4.0, March 11, 2010

[32] R. Fernando and M. J. Kilgard, The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics, 2003

[33] J. Owens, D. Luebke, M. Harris, J. Krüger, A. Lefohn and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware", Computer Graphics Forum, March 2007

[34] C. McClanahan, "History and Evolution of GPU Architecture", Paper Survey, 2010

[35] IBM Personal Computer Professional Graphics Controller Technical Reference, August 1984

[36] R. R. Schaller, "Moore's Law: Past, Present, and Future", IEEE Spectr., June 1997

[37] K. Akeley, "Reality Engine Graphics", Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '93), 1993, pp. 109-116

[38] NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Fermi, White Paper, http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf, 2009

[39] NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, White Paper, https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012

[40] NVIDIA, NVIDIA GeForce GTX 980, White Paper, https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF, 2014

[41] A. Rajgarhia and A. Gehani, "Performance and Extension of User Space File Systems", in Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland, pp. 206-213, 2010

[42] List of FUSE file systems, https://github.com/libfuse/libfuse/wiki/Filesystems

[43] rarfs, https://github.com/vadmium/rarfs

[44] gunzip, http://freecode.com/projects/fusegunzip

[45] fuse-zip, https://linux.die.net/man/1/fuse-zip


[46] fusecompress, https://github.com/tex/fusecompress

[47] sshfs, https://wiki.archlinux.org/index.php/SSHFS

[48] davfs, https://wiki.archlinux.org/index.php/Davfs

[49] D. Cook, J. Ioannidis, A. Keromytis and J. Luck, "CryptoGraphics: Secret Key Cryptography Using Graphics Cards", RSA Conference, Cryptographer's Track (CT-RSA), February 2005

[50] O. Harrison and J. Waldron, "AES Encryption Implementation and Analysis on Commodity Graphics Processing Units", in Cryptographic Hardware and Embedded Systems - CHES 2007: 9th International Workshop, pp. 209-226, Vienna, Austria, September 10-13, 2007

[51] S. Manavski et al., "CUDA compatible GPU as an efficient hardware accelerator for AES cryptography", IEEE Int. Conf. on Signal Processing & Communications, pp. 65-68, 2007

[52] Q. Li, C. Zhong, K. Zhao, X. Mei and X. Chu, "Implementation and Analysis of AES Encryption on GPU", 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, pp. 843-848, Liverpool, 2012

[53] M. Silberstein, B. Ford, I. Keidar and E. Witchel, "GPUfs: Integrating a File System with GPUs", SIGARCH Comput. Archit. News, pp. 485-498, March 2013

[54] A. Gharaibeh, S. Al-Kiswany, S. Gopalakrishnan and M. Ripeanu, "A GPU Accelerated Storage System", in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 167-178, Chicago, Illinois, 2010

[55] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan and M. Ripeanu, "StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems", in Proceedings of the 17th International Symposium on High Performance Distributed Computing, pp. 165-174, Boston, MA, USA, 2008

[56] S. Kato, M. McThrow, C. Maltzahn and S. Brandt, "Gdev: First-Class GPU Resource Management in the Operating System", presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC 12), pp. 401-412, Boston, MA, 2012

[57] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray and E. Witchel, "PTask: Operating System Abstractions to Manage GPUs As Compute Devices", in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, Cascais, Portugal, 2011

[58] G. Agosta, A. Barenghi, F. D. Santis, A. D. Biagio and G. Pelosi, "Fast Disk Encryption through GPGPU Acceleration", 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, Higashi Hiroshima, pp. 102-109, 2009


[59] M. Eimot, Offloading an encrypted user space file system on Graphical Processing Units, Master's Thesis, University of Oslo, Department of Informatics, Oslo, Norway, 2009

[60] PKCS #7: Cryptographic Message Syntax Standard, RSA Laboratories, Version 1.5, November 1993

[61] cuda_aes, https://github.com/nablahero/cuda_aes

[62] wolfSSL, https://www.wolfssl.com

[63] W. Diffie and M. E. Hellman, "Privacy and authentication: An introduction to cryptography", Proceedings of the IEEE, vol. 67, no. 3, pp. 397-427, March 1979

[64] IEEE P1619/D16, Standard for Cryptographic Protection of Data on Block-Oriented Storage Devices, p. 34, 2007
