biological engineering - ulisboa...acelerar o desenvolvimento de ferramentas de diagnóstico...

59
1 Towards the identification of biomarker-genes for the diagnosis of invasive candidemia caused by Candida glabrata Carolina de Carvalho Nunes Thesis to obtain the Master of Science Degree in Biological Engineering Supervisors: Prof. Sara Alexandra Cordeiro Madeira Prof. Nuno Gonçalo Pereira Mira Examination Committee: Chairperson: Prof. Jorge Humberto Gomes Leitão Supervisor: Prof. Nuno Gonçalo Pereira Mira Members of the committee: Prof. Miguel Nobre Parreira Cacho Teixeira Prof. Susana de Almeida Mendes Vinga Martins July 2015

Upload: others

Post on 26-Apr-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

1

Towards the identification of biomarker-genes for the diagnosis of invasive candidemia caused by Candida

glabrata

Carolina de Carvalho Nunes

Thesis to obtain the Master of Science Degree in

Biological Engineering

Supervisors:

Prof. Sara Alexandra Cordeiro Madeira

Prof. Nuno Gonçalo Pereira Mira

Examination Committee:

Chairperson: Prof. Jorge Humberto Gomes Leitão

Supervisor: Prof. Nuno Gonçalo Pereira Mira

Members of the committee: Prof. Miguel Nobre Parreira Cacho Teixeira

Prof. Susana de Almeida Mendes Vinga Martins

July 2015

Page 2: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

2

Page 3: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

3

Acknowledgments

I would like to express my sincere gratitude for the great opportunity to work in this

project given by Professors Nuno Mira and Sara Madeira, with their advice and constant

support. My acknowledgements are also going to Professor Geraldine Butler and CanWang for

all the help with the transcriptomic analysis. Acknowledgments to Professor Maria Manuel

Lopes, Dr. Rosa Barros and Dr. Teresa Ferreira of CHLC for help in facilitating access to the

cohort of clinical isolates used in this study. I would also like to thank financial support of project

PanCandida – Towards the development of a pan- genomic DNA chip for the early detection of

invasive candidiasis caused by C. albicans and C. glabrata, sponsored by the Gilead Génese

Program; of Pfizer research program WI178570; and of FCT (UID/BIO/04565/2013). I also

would like to thank Ruben Bernardo who made available the transcriptomes data, used in this

master thesis. I also wish to express my gratitude for all the members of the Examination

Committee for having accepted the task of evaluating and discussing this master thesis.

Again and always thanks to Mr. Paulo Xavier, for the exchanges of ideas about this theme,

and to Mrs. Cristina Camoêsas and Miss Sofia Camoêsas, for the orthographic review. My

endless gratitude to my family, especially to my mother, father and sister, thanks for their moral

support! I would also like to thank my special friend Ricardo Xavier.

At last but not less, to my grandfather, Carlos, “Thanks” for making it all possible!

Page 4: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

4

Page 5: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

5

Abstract

The incidence of infections caused by the opportunist pathogenic fungus Candida

glabrata is increasing, these having associated high rates of mortality and morbidity. Infections

caused by C. glabrata range from superficial infections in mucosal surfaces to disseminated

mycoses (known as invasive candidiasis) in which the yeasts cross the bloodstream and may

colonize any major organ. Currently established protocols for diagnosis of invasive candidiasis

are essentially based on culture of samples collected from the patients who are known to lack

sensitivity, particularly in the initial stages of the infection when the fungal burden is low.

Diagnosis is further difficulted by the fact that C. glabrata is a common commensal of the

gastrointestinal and genitourinary tracts. In this project it was aimed to identity a set of genes

signatures that could serve as biomarkers of invasive candidiasis, that could be subsequently

used for diagnosis. For that the transcriptomes of a small cohort of commensal and invasive

isolates were compared using species-specific DNA microarrays. Unsupervised and supervised

methods were used to treat the data obtained, both these two approaches. A set of 24 genes

was identified as being those that have the higher potential to discriminate between commensal

and invasive isolates, this representing an interest set of candidates to be used in subsequent

follow-up studies. Furthermore, a large number of genes that was not previously associated with

C. glabrata virulence was also uncovered as being differently expressed in the commensal and

in the invasive isolates, a knowledge that can help in a better understanding of the mechanisms

underlying the transition from commensalism to pathogenesis in C. glabrata.

Keywords: Candida glabrata; invasive candidiasis; molecular diagnosis; unsupervised and

supervised methods to treat transcriptomics data

Page 6: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

6

Page 7: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

7

Resumo

A incidência de infecções sistémicas causadas pela C. glabrata, um fungo patogénico

oportunista, tem amentado ao longo dos anos, contribuindo para um aumento da mortalidade e

mobilidade de populações em risco. Os métodos actualmente usados no diagnóstico de

candidiase sistémica são baseiam-se na cultura de amostras recolhidas do doente, em

particular amostras de sangue. Contudo estes métodos apresentam baixa sensibilidade,

particularmente nos estádios iniciais de infecção em que a carga fúngica é baixa.

Consequentemente ocorrem atrasos no inicio da terapia fúngica levando muitas vezes ao

aumento da taxa de mortalidade dos pacientes Para além disto, o diagnóstico é também

dificultado pelo facto da C. glabrata ser um comensal comum do Homem tornando-se difícil

avaliar o potencial patogénico das estirpes isoladas. Neste projecto tentou identificar-se um

conjunto de genes que possam servir como biomarcadores para a detecção de candidiasis

invasiva causada por C. glabrata. Partindo da análise de transcriptomas de um pequeno

conjunto de isolados comensais e invasivos recorreu-se a um conjunto de abordagens

bioinformáticas supervisionadas e não-supervisionadas, identificando 24 genes como sendo

aqueles que têm um maior potencial para discriminar os isolados comensais dos invasivos.

Além disso, um grande número de genes que não foram até agora associados à virulência de

C. glabrata, foram identificados como sendo diferencialmente expressos em isolados

comensais e invasivos. Espera-se que os resultados gerados possam contribuir não só para

acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como

também para uma melhor compreensão dos mecanismos subjacentes à transição comensal-

patogénico em C. glabrata.

Palavras-chave: Candida glabrata; candidisis invasiva; diagnósticos moleculares; métodos não

supervisionados e supervisionados para tratar os dados.

Page 8: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

8

Page 9: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

9

Table of Contents

Acknowledgments 3

Abstract 5

Resumo 7

Table of Contents 9

List of Figures 11

List of Tables 12

Abbreviations 13

1. Introduction 14

1.1. Overview on biology and physiology of Candida species. 14

1.2. Diagnosis of Fungal Infections 15

1.3. Use of microarrays analysis in medical diagnosis 16

1.4. Bioinformatics approaches used in microarrays-based diagnosis 17

1.4.1. Unsupervised Data Analysis 18

1.4.2. Supervised Data Analysis 19

1.5. Introduction to the theme of the thesis 20

2. Materials and Methods 24

2.1. Strains and growth media 24

2.2. Transcriptomic profiling 25

2.3. Data processing of the microarray data 26

2.4. Exploratory analysis 26

2.5. Gene selection 28

3. Results 30

3.1. Overview 30

3.2. Exploratory analysis of the transcriptomics data 31

3.3. Gene Selection 42

4. Discussion 47

5. References 51

6. Appendix 57

Page 10: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

10

Page 11: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

11

List of Figures

Figure 1: Scheme of exploratory analysis 27

Figure 2: Scheme of K-means method applied in submatrix of each cluster. 27

Figure 3: Scheme of gene selection analysis. 28

Figure 4: PCA applied to samples. Purple and blue points represent commensal and invasive class,

respectively. 30

Figure 5: The HCL applied to samples, purple line, represent the cut in dendrogram 31

Figure 6: HCL applied to all genes. Purple line represents the cut in dendogram, originating the three

resulting groups Green, blue and red lines group 1106, 4022 and 110 genes, respectively. Color

schema used in the expression matrix: Minimum expression green and maximum expression red.

32

Figure 7: Enriched functional classes (p-value below 0,01) based on MIPS functional catalogue, of

genes that were found to be down-regulated in the tested cohort of clinical isolates in

comparison with the reference strain KChR606. 33

Figure 8: Enriched functional classes (p-value below 0,01) based on MIPS functional catalogue , of

genes that were found to be up-regulated in the tested cohort of clinical isolates in comparison

with the reference strain KChR606. 33

Figure 9: Resulting clusters from K-means (K = 4). Charts present the expression profile of the genes

grouped in each cluster. From top to bottom: cluster 1, cluster 2, cluster 3 and cluster 4. 34

Figure 10: Centroid expression profile of resulting clusters from K-means (K=4). From top to bottom:

cluster 1, cluster 2, cluster 3 and cluster 4. 34

Figure 11: Resulting clusters from K-means (K = 4) applied in submatrix of cluster 1. Charts present the

expression profile of the genes grouped in each cluster. From top to bottom: cluster 1.1, cluster

1.2, cluster 1.3 and cluster 1.4. 36

Figure 12: Centroid expression profile of resulting clusters from K-means (K=4) applied in submatrix of

cluster 1. From top to bottom: cluster 1.1, cluster 1.2, cluster 1.3 and cluster 1.4. 36

Figure 13: Resulting clusters from K-means (K = 4) applied in submatrix of cluster 2. Charts present the

expression profile of the genes grouped in each cluster. From top to bottom: cluster 2.1, cluster

2.2, cluster 2.3 and cluster 2.4. 37

Figure 14: Centroid expression profile of resulting clusters from K-means (K=4) applied in submatrix of

cluster 2. From top to bottom: cluster 2.1, cluster 2.2, cluster 2.3 and cluster 2.4. 37

Figure 15: Resulting clusters from K-means (K = 2) applied in submatrix of cluster 3. Charts present the

expression profile of the genes grouped in each cluster. From top to bottom: cluster 3.1, cluster

3.2. 38

Figure 16: Centroid expression profile of resulting clusters from K-means (K=2) applied in submatrix of

cluster 3. From top to bottom: cluster 3.1, cluster 3.2. 38

Page 12: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

12

Figure 17: Resulting clusters from K-means (K = 2) applied in submatrix of cluster 4. Charts present the

expression profile of the genes grouped in each cluster. From top to bottom: cluster 4.1, cluster

4.2.. 39

Figure 18: Centroid expression profile of resulting clusters from K-means (K=2) applied in submatrix of

cluster 4. From top to bottom: cluster 4.1, cluster 4.2. 40

Figure 19: Enriched functional classes (p-value below 0.01), based on MIPS functional catalogue, of

genes that were found to be less expressed in the invasive isolates compared to the commensal

ones. 40

Figure 20: Enriched functional classes (p-value below 0.01), based on MIPS functional catalogue, of

genes that were found to be higher expressed in the invasive isolates compared to the

commensal ones. 41

Figure 21: Resulting number of genes by intersection of SVM, Rank Product, Gain Ratio and Info Gain.

(Oliveros, J.C. (2007) VENNY. An interactive tool for comparing lists with Venn Diagrams). 42

Figure 22: Genes that allow to better distinguish the type of classes (commensal and invasive ones). 49

List of Tables

Table 1: C. glabrata strain and isolates used in this master thesis. 24

Table 2: Resulting data from K-means (K = 4), applied to 5239 genes. 35

Table 3: Resulting data from K-means (K = 4) applied in submatrix of cluster 1 (1865 genes). 36

Table 4: Resulting data from K-means (K=4) applied in submatrix of cluster 2 (2468 genes). 38

Table 5: Resulting data from K-means (K=2) applied in submatrix of cluster 3 (351 genes). 39

Table 6: Resulting data from K-means (K=2) applied in submatrix of cluster 4 (555genes). 40

Table 7: ORF, Function, expression levels for each sample, values of mean, variance and standard

deviation, regarding class, of the resulting 24 genes. 43

Page 13: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

13

Abbreviations

C. albicans Candida albicans

C. glabrata Candida glabrata

C. krusei Candida krusei

C. parapsilosis Candida parapsilosis

C. tropicalis Candida tropicalis

GI tract Gastrointestinal tract

HCL Hierarchical Cluster

Info Gain Information Gain algorithm

MAPKKK Mitogen-activated protein kinase kinase kinase

NCAC Non-Candida albicans Candida species

OD600nm Optical density at 600 nm

S. cerevisiae Saccharomyces cerevisiae

SVM Support Vector Machine

Page 14: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

14

1. Introduction

1.1. Overview on biology and physiology of Candida species.

The incidence and prevalence of fungal infections in humans have increased significantly

in the latest decades. This rise is attributed to the increase in the number of

immunocompromised population, mostly as the result of the increase in life-expectancy and

from the extended use of aggressive therapies, in particular, those used to treat cancer patients

that have a strong deleterious effect over the action of the patient’s immune system. The

emergence of fungal strains resistant to the antifungals drugs commonly used in the clinical

practice, in particular, to azoles, is another reason contributing for the increase in the incidence

of fungal infections (Sampaio & Pais, 2014) (Sardi, Scorzoni, Bernardi, Fusco- Almeida, &

Mendes Giannini, 2013) (Rodrigues, Silva, & Henriques, 2014).

Around 90% of all diagnosed fungal infections occurring in humans are attributed to

species of the Candida genus. Infections caused by Candida, generally known as candidiasis,

can range from superficial rashes in mucosas to life-threatening systemic infections in which the

yeasts cross the bloodstream and may colonize any major organ (Sampaio & Pais, 2014)

(Rodrigues, Silva, & Henriques, 2014). Invasive candidiasis infections have high rates of

morbidity and mortality ranging 40-60%, depending on the Candida species involved (Sendid, et

al., 2004) (Clancy & Nguyen, 2013). The more relevant Candida species associated to human

infections are C. albicans, the most common causative agent of candidiasis, and C. glabrata, C.

parapsilosis, C.tropicalis and C. krusei, generally known as non-albicans Candida species or

NAC (Murray, Rosenthal, & Pfaller, 2005) (Sendid, et al., 2004) (Rodrigues, Silva, & Henriques,

2014). Although C. albicans still remains the species more frequently isolated from patients with

diagnosed candidiasis, NAC, in particular, C. glabrata, has been emerging (Fidel, JR., Vazquez,

& Sobel, 1999) (Sampaio & Pais, 2014).

Up to now much of the knowledge gathered on the biology and pathophysiology of

Candida has been focused on C. albicans were several virulence factors have been identified

including its ability to undergo filamentation, which allows evasion from the host immune

system; adhesive properties; ability to form biofilm either in host tissues and in medical devices,

and secretion of a variety of hydrolytic enzymes (proteases, phospholipases and hemolysins)

which damage host tissues (Rodrigues, Silva, & Henriques, 2014). More recently, efforts have

been put into the characterization of the molecular mechanisms underlying C. glabrata

pathogenesis. Its remarkable innate tolerance to azoles and its ability to rapidly acquire

resistance to antifungals, reflecting a great genomic plasticity, are two factors underlying the

success of C. glabrata as a pathogen. Furthermore, its adhesive properties and its ability to

form biofilms and to secrete phospholipases, lipases and hemolysines, are other virulence

factors attributed to C. glabrata (Silva, Negri, Henriques, Oliveira, Williams, & Azeredo, 2012)

(Rodrigues, Silva, & Henriques, 2014). Very recently it was demonstrated that C. glabrata is

able to undergo filamention inside a macrophage, this representing a true paradigmatic change

Page 15: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

15

from what was the current knowledge on the infection properties of this yeast species, which

was mostly known for its inability to form true-hyphae ( Brunke, et al., 2014). The production of

these different virulence factors is dependent on the C. glabrata strain and also of the host

physiological conditions, in particular the severity of the disease, the patient's age, pre-exposure

to specific antibacterial agents or to fluconazole (Dabas, 2013) (Sampaio & Pais, 2014)

(Rodrigues, Silva, & Henriques, 2014).

1.2. Diagnosis of Fungal Infections

The methods for diagnosing fungal infections more commonly used in microbiology and

histopathology hospital laboratories rely on classic methodologies including microscopy of fluid

samples, analysis of tissues histopathology or culture in plates. Although practical, these

methods have two main disadvantages: limited sensitivity and dependence on the possibility of

obtaining deep tissue samples, which sometimes is not be possible due to the patients’ fragile

conditions. The European Fungal Infection Study Group (EFISG) of the European Society of

Clinical Microbiology and Infectious Diseases (ESCMID), recommends the use of microscopy

and histopathology methods, whenever possible (Sampaio & Pais, 2014). Nevertheless,

accurate identification of the infecting Candida species is highly dependent on the expertise and

experience of the technician undertaking the test. Commercial methods allowing the detection

of Candida species in agar plates (known as Chromagar Candida) is also used, although

identification based on morphology alone cannot be used as a definitive diagnostic, as it often

originates false-negative and false-positive results.

The culture method is the method most frequently used to diagnose fungal infections,

particularly for blood samples. However, this method generates around one-third of false-

negative cases in which the yeasts are not detected (Sampaio & Pais, 2014) (Clancy & Nguyen,

2013). To reduce the number of false-negatives, international guidelines recommend tight

procedures regarding the collection of blood samples, as a way to ensure the optimal isolation

of microorganisms. When these recommendations are taken into account, the technique

sensitivity is around 50 - 70 %, which is still a low sensibility (Sampaio & Pais, 2014). Another

constraint of the culture method is the time necessary to take a definitive conclusion (24 to 48

hours) on whether the blood sample is colonized or not, this being even more problematic when

the fungal burden is low. The precise identification of the Candida species is even more time-

consuming, being in most cases dependent on the use of MALDI-TOF or VITEK profilings.

Non-culture-based methods, generally known as alternative diagnostic procedures, are

being increasingly used in patients at risk of developing invasive candidiasis, as a way to

improve and anticipate the detection of candidemia (Sampaio & Pais, 2014). Some of these

methods are based on quantification and detection of fungal biomarkers or metabolites from

patients’ blood serum. The main serological tests used to detect Candida infections are focused

on quantification of mannan (Mn), anti-Mn antibodies or of β-1,3-D-glucan in blood samples.

Mannan and β-1,3-D-glucan are two of the main components present in the cell wall of Candida

Page 16: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

16

cells, being particularly useful for detection as they are not present in eukaryotic cells (Sampaio

& Pais, 2014) (Ostrosky-Zeichner, Vitale, & Nucci, 2012). The sensitivity and specificity of the

serological methods in diagnosing invasive candidiasis is of 77 % and 85 % for β-1,3-D-glucan,

respectively; these numbers increasing to 83 % and 86 % when combined with Mn and with

anti-Mn (Held, Kohlberger, Rappold, Grawitz, & Hacker, 2013). For the detection of invasive

candidiasis, including of candidemia, EFISG/ESCMID just recommends the use of β-1,3-D-

glucan test. The use of the Mn and Anti-Mn tests are not recommend, due to the lack of

available studies (Cuenca-Estrella1, et al., 2012) (Sampaio & Pais, 2014) (White, Archer, &

Barnes, 2005) (Clancy & Nguyen, 2013). Both culture and non-culture-based serological

methods lack the ability of discriminating between infection and colonization (Sampaio & Pais,

2014).

Other non-cultured-based methods used for detection of fungal infections include the use

of molecular typing techniques based on polymerase chain reaction (PCR). The use of this

technique in systemic fungal infections detection has been widely published and show

promising results regarding both the specificity and the sensitivity. Most frequently the genes

amplified are the RNA ribosomal genes (rRNA) (18S, 28S, and 5.8S rRNA genes), due to their

universal nature and higher abundance. Genes encoding cytochrome P450, heat shock proteins

or proteins involved in pH regulation have also been used (White, Archer, & Barnes, 2005).

Although very significant improvements have been made in these last years, it is still

clear that the methods currently used for the detection of Candida species require

improvements in what regards to sensitivity, speed and specificity. One difficulty that none of

these methods can surpass is the distinction between colonization and infection, mostly

because the information regarding the mechanisms underlying transition from these two states

in the several Candida species is still quite scarce. In this context future efforts should be

oriented towards the expansion of the availability of new biomarkers that could be used for the

successful detection of invasive candidiasis, this aspect being emphasized by several authors

as being highly relevant in the future development of fungal diagnosis tools (Shahzad, Cohrs,

Shen, & Lass-Florl, 2012) (Avni, Leibovici, & Paul, 2011).

1.3. Use of microarrays analysis in medical diagnosis

Microarrays and the bioinformatics approaches used to treat the massive amount of data

obtained have a recognized strong potential to be used in medical diagnostic, although up to

now most of the work performed in this field has been focused on cancer ( Tarca, Romero, &

Draghici, 2006) (Breitling R. , Armengaud, Amtamann, & Herzyk, 2004) (Jones & Pevzner,

2004). Essentially the data obtained from microarrays can be explored in three different

manners: to compare classes, to predict classes and to discover classes ( Tarca, Romero, &

Draghici, 2006) (Simon, McShane, Wright, Korn, Radmacher, & Zhao, 2010). The objective of

class comparison is to assess the genes that have a different expression within a set of

predefined classes and that may can be used for distinguishing purposes. The identification of

Page 17: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

17

these genes can lead to a better understanding of the biological differences among the classes

that are being compare, and it may also help in the functional analysis of poorly characterized

genes (Simon, McShane, Wright, Korn, Radmacher, & Zhao, 2010) ( Tarca, Romero, &

Draghici, 2006). The methods used for class comparison are supervised in the sense that, they

determine which genes are differentially expressed between each class. This is in contrast to

methods, like cluster analysis (indicated below), which only consider expression profiles and are

thus unsupervised (Simon, McShane, Wright, Korn, Radmacher, & Zhao, 2010).

The objective of class prediction is to develop statistical models or machine learning

methods that can predict to which class a new specimen belongs to base on a set gene

expression profile. Since class predictors can help predict the new sample class (phenotype),

these tools have been used with some success in improving diagnostic classification and in

prediction and selection of appropriate treatments (Golub, et al., 1999) (Simon, McShane,

Wright, Korn, Radmacher, & Zhao, 2010) ( Tarca, Romero, & Draghici, 2006). Class discovery

involves the use of unsupervised analysis methods to identify sets of genes with similar

expression patterns (co-expressed genes) and the identification of expression patterns in

different species. These patterns may consist of a classification into subgroups, or clusters, and

there may be a multilevel structure within the classification ( Tarca, Romero, & Draghici, 2006)

(Simon, McShane, Wright, Korn, Radmacher, & Zhao, 2010) (Malone & Oliver, 20011).

1.4. Bioinformatics approaches used in microarrays-based diagnosis

To identify relevant genes for discovering potential drug targets and biomarkers through

the microarray analysis it is necessary to identify the genes that are differently between two (or

more) groups of experiments. In the past the fold change was used to find differences, under

the assumption that changes above some threshold were biologically relevant. Later on several

univariate statistical methods approach were developed in order to determine the expression

level of a given gene in a normalized microarray these including t-tests, modified t-test (SAM),

two-sample t-tests, F-statistics and Bayesian models. For more complex data sets with multiple

classes Analysis of Variance (ANOVA) was also used (Dalman, Deeter, Nimishakavi, & Duan,

2012) (Selvaraj & Natarajan, 2011). However, Datman et al. demonstrated that fold change and

statistical cut-offs modulate the outcome of the microarray, once it is defined a threshold,

meaning which values are considered the most significant. To overcome this bias,

bioinformatics tools capable of comparing the expression of thousands of genes have been

developed, out of which those involving supervised and unsupervised techniques of data mining

are the more relevant ones. These techniques can be used independently or in a combined

manner (for example using an unsupervised method followed by a supervised one) in order to

extract useful knowledge from microarray data (Eisen, Spellman, Brown, & Botstein, 1998)

(Badu, 2004) (Do & Choi, 2008). The unsupervised and supervised techniques used in analysis

of microarrays aiming medical application are briefly described in the following sections.

Page 18: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

18

1.4.1. Unsupervised Data Analysis

Unsupervised data analysis, usually performed using clustering algorithms, allows the

grouping of genes and/or samples with common characteristics without a prior knowledge about

the classes or the phenotype to which samples correspond (Eisen, Spellman, Brown, &

Botstein, 1998) (Do & Choi, 2008) (Pan, Lin, & Le, 2002). This type of analysis is described by

some authors as an useful tool for identification of co-expressed gene and to study gene

function and transcriptional genomic regulation (Jiang, Tang, & Zhang, 2004) (Chandrasekhar,

Thangavel, & Elayaraja, 2011). Clustering is the partition process of a set of data objects (genes

or samples) in several subsets. Each subset is a cluster. This objects that belong to each

cluster should be similar between themselves but dissimilar to objects belonging to another

cluster. Most clustering methods are based on some similarity or distance measure

(dissimilarity) defined between objects, aiming to group similar objects together. When

considering distance or dissimilarity metric, it is understood that the larger calculated value, the

greater difference between objects. Euclidian distance, Manhattan distance, Mahalonobis

distance, angular distance and Pearson correlation, are the basis for some of the most common

distance and similarity metrics. Thus, the smaller is the intra-cluster similarity/distance and the

larger is the inter-cluster similarity, the better is the clustering (Simon, McShane, Wright, Korn,

Radmacher, & Zhao, 2010) (Do & Choi, 2008).

Several clustering algorithms have been applied to identify patterns in gene expression

data. The most widely used are: Hierarchical Clustering (HCL) and K-means clustering (Selvaraj

& Natarajan, 2011) (Brazma & Vilo, 2000). The HCL is a type of exploratory data analysis,

specifying the relationships among the objects in each cluster and between them

(similarity/dissimilarity) (Badu, 2004). This method allows to measure the similarity between the

biological samples, resulting in a grouped series of hierarchical clusters, which can be

represented in a tree shaped graph. This type of graphs is denominated dendrograms. The

represented genes (or samples) look like leaves and the branches represent the distance

between the leaves, indicating the similarity in the cluster (Jiang, Tang, & Zhang, 2004). This

algorithm can have two types of approaches: agglomerative or divisive (Brazma & Vilo, 2000).

The agglomerative approach, or bottom-up, regards each object as an individual cluster and at

each step, merges the closest pair of clusters, until all the groups are merged into one cluster

(Jiang, Tang, & Zhang, 2004). After each two clusters join, the distance between all the other

clusters and the new joined cluster, are recalculated (Brazma & Vilo, 2000). The complete

linkage, average linkage, and single linkage methods use the maximum, average and minimum

distance respectively, between the members of two clusters (Selvaraj & Natarajan, 2011) (Do &

Choi, 2008). The divisive approach, or top-down, starts with one cluster containing all the data

objects and, at each step split (Jiang, Tang, & Zhang, 2004). The HCL not only groups together

genes with similar expression patterns, but also provides a natural way to graphically represent

the data set. The graphic representation allows users a detailed inspection of the whole data

set, as well as an initial impression of the data distribution (Jiang, Tang, & Zhang, 2004) (Simon,

McShane, Wright, Korn, Radmacher, & Zhao, 2010). This algorithm has been applied to class

Page 19: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

19

discovery in disease subtype identification. Alizadeh et al. studied 96 samples of normal and

malignant lymphocytes (Diffuse Large B-Cell Lymhoma (DLBCL)). The HCL algorithm allowed

them to identify diversity among gene expression in patients, and to identify two molecular

distinct forms of DLBCL, with gene expression patterns distinguishing different stages. The

clusters significance was confirmed by the survival rates among patients (Brazma & Vilo, 2000)

(Alizadeh, et al., 2000).

The K-means clustering algorithm is one of the most commonly used partitioning

methods (Jiang, Tang, & Zhang, 2004). This algorithm is used to cluster observations into

groups of related observations without any prior knowledge (Selvaraj & Natarajan, 2011). This

algorithm typically uses the properties of the Euclidean vector space, and the user needs to

choose the number of K clusters before its use. The algorithm starts by randomly choosing K

patterns as the initial mean of each cluster. After, the patterns are assigned to each cluster by

finding the ones closest to the mean, a new mean is calculated for each cluster and patterns are

reassigned to new means. This process is iterated until the cluster means are such that no

patterns move from clusters (Do & Choi, 2008) (Brazma & Vilo, 2000). An issue that needs

careful consideration in using K-means is the choice of the number of clusters K, since prior

knowledge of this number is not available. However, combining this method with other

clustering algorithm, eg: HCL, can provide an useful conjunction to determine the number of K

groups to initiate the analysis. Since the HCL and K-means are different algorithms, the number

of K groups defined by the HCL may not correspond to the best group separation. As such, an

increment of the number of groups can be necessary. Aronow et al. in 2001, used K-means

clustering to analyse microarray profiles of four transgenic mouse models of cardiac

hypertrophy (Brazma & Vilo, 2000) (Simon, McShane, Wright, Korn, Radmacher, & Zhao,

2010).

1.4.2. Supervised Data Analysis

The goal of supervised analysis in expression data is to construct a classifier, or a class

predictor, such as decision trees, linear discriminants or Support Vector Machines (SVM), which

assign predefined classes to a given expression profile. Typically, classifiers are trained on a

data subset, with an initially given classification and tested on another subset, with a known

classification. After assessing the prediction quality, the classification can be applied to

unclassified data (Brazma & Vilo, 2000).

The classifiers have a promising application in determining the type of disease and

subtype in clinical diagnosis (Selvaraj & Natarajan, 2011). When they are constructed based on

gene expression profiles, they are distinguish between two different classes (Selvaraj &

Natarajan, 2011) (Brazma & Vilo, 2000). Golub et al. used a classifier to predict the class of

leukemia samples predictor. The 29 of 34 new samples were correctly predicted (Golub, et al.,

1999).

Page 20: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

20

One of the great challenges, in the classifier construction of gene expression data

derived, is the data dimension. Due to the large number of genes (features), compared to the

small sample sizes, the classifier construction becomes a challenge, in respect to its accuracy

and speed capacity. In order to way to overcome these problems, feature selections techniques

are used in machine learning, to remove irrelevant genes (George & Raj, 2011). This is a

systematic process, of data dimensionality reduction to obtain an optimal subset of attributes,

for classification purposes. This process has been applied in a pre-selection step in genomic

and proteomic classification problems (Do & Choi, 2008). Although a small number of genes is

preferred, an extremely small subset of genes should also avoided (Stiglic, Rodriguez, & Kokol,

2008). In other words, the feature selection methods intend to select informative genes, which

can improve the classification accuracy, avoiding overfitting (Stiglic, Rodriguez, & Kokol, 2008)

(Simon, McShane, Wright, Korn, Radmacher, & Zhao, 2010). These methods are used in gene

selection, since they select the ones that allow differentiation among classes (Saeys, Inza, &

Larranaga, 2007).

Brad et al. identified new signature genes, which can be used to distinguish among lung

cancer patients, with higher and low risk, of cancer recurrence within five years. They had in

consideration that the signature genes, for the classification, should have the best possible

accuracy. They used applied the Chi Squared, Filtered Attribute Selection, InfoGain, Gain

Ration, OneR, Relief and Symmetrical Uncertainty, feature selection algorithms (to the 22283

initial genes analysed in 229 samples. Only the ones common to at least 5 types of algorithms

were selected. This way, genes were reduced to a subset of 190. In the next step, they used

some statistical methods, such as Significance Analysis of Microarray (SAM) and Unequal

Variance t-test, reducing the genes set to 26. To these genes they applied a Genetic algorithm

was applied to find a subset aimed to produce a higher classification accuracy prognostic. In

this way, a 12 genes signature for lung cancer prognosis was identified (Bard & Hu, 2011).

1.5. Introduction to the theme of the thesis

One of the main issues that contribute to greatly increase the mortality and the severity of

invasive infections caused by Candida species is the delay in diagnosis which, consequently,

delays the initiation of antifungal therapy. Such difficulty in obtaining an early diagnosis is, in

large portion, caused by the modest sensitivity of the methods currently used that are unable to

detect low fungal burdens and therefore might not detect the yeast in the early stages of the

infection. Furthermore, the fact that most of the Candida species responsible for invasive

infections are also common commensals of humans is also challenging, as it turns difficult to

distinguish between non-pathogenic isolates (that is, those that may be harmless and belonging

to the endogenous patient’s microflora) from those that are indeed pathogenic and therefore

prone to cause disease. The work described in this thesis is integrated in the framework of the

PanCandida project, one of the research initiatives that was funded by the Gilead Génese

Program (2013 Edition), sponsored by Gilead Pharmaceuticals. The main objective of

Page 21: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

21

PanCandida is the development of a pan-genomic DNA chip containing genes that could be

used for the early detection of invasive candidiasis caused by C. albicans or C. glabrata, the two

Candida species that are known to cause a higher number of infections. Ideally the genes to be

used in the microarray would serve as biomarkers of invasive disease and thereby should be

differently expressed when the Candida cells are in the “commensal” and in the “pathogenic”

state. However, the molecular mechanisms underlying the transition from commensal to

pathogen in Candida species are still very poorly characterized, especially at a genome-wide

level. The use of a DNA chip as a diagnosis tool would turn possible not only to use this device

to detect the presence of the two Candida species using a highly-sensitive molecular method,

but also to have some idea on what is the “pathogenic potential” of the isolate identified to

cause invasive disease based on the profile of expression of the “pathogenesis-associated”

genes. The use of this differential diagnosis tool to distinguish between commensal and

pathogenic isolates would also allow a more rational use of antifungal therapies thereby

preventing the massive utilization of prophylactic therapies, which are now known to be one of

the most important factors that have led to the emergence of resistant strains. Ideally only

patients colonized by isolates having higher expression of the “pathogenesis-associated”

biomarker should be put under antifungal therapy. Within that line of thought this thesis

represents a first step in the identification of genes that may serve as biomarkers of invasive

disease in C. glabrata. For that, the transcriptome of isolates collected from blood of patients

with diagnosed candidemia (invasive isolates) and isolates recovered from non-sterile niches

where these species live as commensals (commensal isolates) were compared. The results of

that transcriptomic analysis, undertaken by Ruben Bernardo, were the starting point for this

thesis.

In order to identify the genes differentially expressed in the two classes of the examined

isolates (commensals and invasive ones) data mining techniques have been applied to the

microarray data obtained. These methodologies have been used with more success in cancer

research and, in particular, in distinguishing tumors based on gene expression profiles. A first

unsupervised exploratory analysis of the data was performed to allow the identification of

groups of differentially expressed genes among the samples, and the existence of more than

one type of class (Simon, McShane, Wright, Korn, Radmacher, & Zhao, 2010) (Chandrasekhar,

Thangavel, & Elayaraja, 2011). The HCL and K-means methods were used for this, since they

were described by several authors, as useful tools for expression patterns identification (Simon,

McShane, Wright, Korn, Radmacher, & Zhao, 2010) (Freyhult, Landfors, Onskog, Hvidsten, &

Rydén, 2010) (Jones & Pevzner, 2004). The divisive approach of HCL was used for genes and

samples group formation. The resulting number of gene groups, was then used for the groups

number determination, required by K-means analysis (Do & Choi, 2008). Implementation of the

HCL K-means was performed with the Genesis and MEV software.

After data exploratory analysis, the differentially expressed genes in commensal and

invasive samples were identified using several feature selection methods including Info Gain,

Gain Ratio, Support Vector Machines (SVM) and Rank Product. These methods were described

Page 22: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

22

by Geroge et al. Saeys et al. Furey et al., as being able to identify in an accurate manner the

genes that are most informative to characterize the class they belong to. In other words, these

genes should be those in which the differences of expression in commensal and invasive

isolates are most prominent, thus allowing their putative use as classifiers. The info Gain and

Gain Ratio are based on information gain measures, but the gain ratio is extended from the

information gain, which attempts to overcome this bias. These methods determine which the

best useful features (genes) are for the classifier construction. SVM determines which genes will

help in the discrimination between the two classes, the genes with a higher rank being those in

which the expression levels differ more in the two classes. This algorithm also favors those

genes that have smaller variation scores within the respective classes. Implementation of the

different bioinformatics methods was performed with the WEKA software. Rank Product is a

non-parametric statistical method, which uses replicated data measurements, allowing the

identification of genes up or down-regulated in the classes, but taking into account to which

class the samples belong (Breitling R. , Armengaud, Amtamann, & Herzyk, 2004). This method

was implemented using the MEV software. The genes identified by the all the methodologies

applied as being those more interesting to separate the commensal and invasive classes were

those selected for further discussion.

Page 23: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

23

Page 24: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

24

2. Materials and Methods

2.1. Strains and growth media

The C. glabrata isolates used in this work are described in Table 1. The KCHr606 wild-

type strain is derived from the same genetic background as the reference strain CBS138. The

clinical isolates were recovered by Professor Maria Manuel Lopes from Faculdade de Fármacia

da Universidade de Lisboa, and by Dr Rosa Barros, head of the microbiology laboratory from

Centro Hospitalar Lisboa Central, during longitudinal epidemiological surveys carried out in

hospitals of the Lisbon area. All strains were maintained at -80 º C in a rich growth medium

Yeast Peptone Dextrose (YPD) supplemented with 30 % glycerol (v/v).

Table 1: C. glabrata strain and isolates used in this master thesis.

Strain / GU clinical isolates

Type of isolate Source

KCHr606 Laboratorial strain derived

from CBS 138/ATCC2001

Professor Hiroji Chibana (from

Medical Mycology Research Center,

Chiba University, Japan)

VG216F Vaginal Professor Maria Manuel Lopes

(Faculdade de Farmácia da

Universidade de Lisboa) VG242F Vaginal

VG6 Vaginal Dra. Rosa Barros (Laboratório

de microbiologia hospitalar de Lisboa

central)

INV25 Blood

INV28 Blood

C .glabrata strains were batch-cultured at 30 ºC, with orbital agitation (250 rpm), in RPMI

growth media (synthetic rich medium used for fungi growth). RPMI contains, per liter, 20.8 g

RPMI-1640 synthetic medium (Sigma), 36 g glucose (Merck Millipore), 0.3 g of L-glutamine

(Sigmaa) and 0.165 mol/L of MOPS (3-(N-morpholino) propanesulfonic acid (Sigma). All media

were prepared in deionized water and sterilized by autoclaving them for 15 minutes at 121 ºC

and 1 atm. Solid media were obtained by supplementing the liquid growth medium with

2 % agar (Iberagar).

Page 25: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

25

2.2. Transcriptomic profiling

The transcriptomics profiling of the clinical isolates were done by Ruben Bernardo from

Instituto Superior Técnico.The isolates and the KCHr606 reference strain were cultivated over-

night in RPMI growth medium and then re-inoculated into fresh growth medium at an initial

OD600nm of 0.1. Cells were left to grow at 30 ºC, with orbital agitation (250 rpm) until the cell

culture reached 1.5-2.0, corresponding to mid-log phase. At this point of the growth curve cells

were harvested by centrifugation (8000 xg, 8 min, 4 ⁰C- Beckman J2.21 Centrifuge, rotor

JA.10), immediately frozen in liquid nitrogen and finally stored at -80 ⁰C until RNA extraction.

RNA extraction was performed using the RiboPure TM RNA Isolation Kit (Ambion, Life

Technologies, California, USA). The recovered cell pellet was resuspended in 480 µl of lysis

buffer. Afterwards 48 µl of SDS and 480 µl of a mixture of Phenol:Chloroform:IAA were added to

the cell suspension and the suspension was transferred to one of the prepared tubes containing

750 µl cold Zirconia Beads. The sample tubes were positioned horizontally on a vortex and cells

were beaten, at maximum speed, for 10 min. After the disruption process, the obtained lysate

was centrifuged for 5 min at 16,100 x g (at room temperature) to separate the aqueous phase,

containing the RNA, from the organic phase. The aqueous phase was collected to a new tube.

1.90 mL of Binding Buffer and 1.25 mL of 100 % Ethanol were added to this new tube

containing the RNA. The obtained suspension was applied into one of the filter cartridges

provided with the RNA extraction kit, which was subsequently washed with 700 µl of Wash

Solution 1. The filter was washed twice with 500 µl of Wash Solution 2/3 and total RNA was

eluted in 50 µl of elution solution, previously heated 95 ⁰C. The isolated RNA was treated with

DNase I to remove traces of chromosomal DNA. For this, the 100 µl-RNA sample was added to

8 U of DNase I and 10 µl of 10 X DNase I Buffer. The mixture was incubated at 37 ⁰C during

30 minutes. Then, 10 µl of DNase Inactivation Reagent were added to the mixture, which was

vortexed and left for 5 minutes at room temperature. Afterwards the suspension was centrifuged

(> 10000 rpm, 3 min) to separate the purified RNA from the Inactivation Reagent beads.

The DNA chips used for this microarray analysis were obtained from Agilent with an

optimized design by Dr. Geraldine’s Butler laboratory to be used in C. glabrata (Rossignol,

Logue, Reynolds, Grenon, Lowndes, & Butler, 2007). Microarray experiment was carried using

incorporation of Cy3 and Cy5 into cDNA targets, followed by quantification by measuring each

sample absorbance at 260 nm, 550 nm, or 650 nm. cDNA was synthesized through the

following process: 25 µg of total RNA was first incubated with 200 pmol of anchored

oligonucleotide deoxyribosylthymine (Invitrogen) for 10 min at 70 ⁰C. First-strand buffer (1X;

Invitrogen), 1.5 mM dATP, dTTP, and dGTP; 25 µl dCTP; 10 mM dithiothreitol; 2 µl

Superscript II reverse transcriptase (Invitrogen); and 37.5 µM Cy3-dCTP or 37.5 µM Cy5-dCTP

(Amersham) were added to a total volume of 40 µl. The mixture was incubated at 42 ⁰C for 2 h.

Then it was added to the mixture 1 µl of Superscript II and the incubation lasted 1 h at

42 ⁰C. RNA was degraded by addition of 1 µl of RNase A (QIAGEN) at 50 µg/ mL and 1 µl of

RNase H (Invitrogen) at 0.05 unit/ µl and incubated at 37 ⁰C for 30 min. The labeled cDNAs

Page 26: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

26

were purified using a QIAquick PCR purification kit (QIAGEN), following the manufacturer’s

instructions. The labeled cDNA were concentrated by drying them under vacuum, resuspended,

and pooled in 45 µl of hybridization buffer (1x DigEasy [Roche] containing 0.5 mg/ mL baker’s

yeast tRNA [Invitrogen] and 0.5 mg/ mL salmon sperm DNA [Sigma]). The mixture was

denatured at 95 ⁰C for 2 min and quickly cooled on ice then it was applied to the DNA

microarray, and covered with a 24 –mm by 60 –mm coverslip. Slides were placed in a

hybridization chamber (Corning, Palo Alto, CA) and incubated at 42 ⁰C for approximately

16 hours. Then the slides were washed in 1 x SSC at 42 ⁰C until the coverslip fell off and then

washed three times for 10 min in 1 x SSC, 0.1 % SDS at 42 ⁰C, and three times for 5 min in

1 x SSC at 42 ⁰C, and finally rinsed in 0.2 x SSC at room temperature. The slides were dried by

centrifugation at 1,000 rpm for 5 min. The microarrays were scanned with an Axon 4000B

scanner (Axon Instruments) at a 10 µm resolution. Data were acquired with GenePix Pro 5.0

software (Axon Instruments), and low-quality spots were automatically flagged. In addition,

saturated spots or with no background-correct intensities greater than 20 in the Cy3 or the Cy5

channel were flagged. Data normalization was performed using the Lowess method in

GeneSpring (Agilent Technologies).

2.3. Data processing of the microarray data

The expression matrix of the microarray data comprises the ratio between the expression

registered in the clinical isolates and in the KCHr606 strain, which was used as a reference

strain. In order to do this it was necessary to process the original expression matrix in two types

of matrices: one with a replicated log ratio of each sample and the other with the average of the

replicated log ratio of samples. For both expression matrices, the 5239 genes were normalized

to a scale comprised in the interval [-1, 1].

2.4. Exploratory analysis

The Genesis Software was used to apply the HCL and K-means clustering methods.

Before applying these methods the Euclidean distance measured the distances between genes

(and samples). The applied HCL algorithm (agglomerative approach), was based on the

average-linkage method for genes and samples. Then, the divisive approach was applied to

observe genes and sample groups. Through this procedure, it was possible to observe genes

and sample groups in the expression matrix. With a prior knowledge of the gene cluster numbers, the K-means clustering algorithm

started firstly with K equal to 3. However, in order to observe a better separation data profiles,

the cluster number was increased to 4. This process is represented in the above scheme,

Figure 1.

Page 27: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

27

.

Figure 1: Scheme of exploratory analysis

For each resulting clusters, it is obtained a submatrix of genes, in which the HCL and the

K-means previously presented steps were repeated, to perform a better exploratory analysis of

the microarray data. In order to identify the subclusters the nomenclature used was based on

the number of the previous clusters. For example: to genes grouped in cluster 1, the

implemented K-means, K=4, clusters 1.1, 1.2, 1.3 and 1.4 are obtained. This type of

enumeration is represented in the Figure 2, and used in this document.

Figure 2: Scheme of K-means method applied in submatrix of each cluster.

Distance matrix

•The expression matrix (n x m) was transformed into a distance matrix (n x n) and (m x m).

•The Euclidean distance was used.

Hierarchical Clustering

• Average linkage applied to group samples and genes• Divisive approach applied to cut the dendrogram, resulting in groups

K-means

•Applied method only to genes •With a prior knoweledg of number of possible groups•If the resulting clusters, show a bad split of expression patterns, the number of K- groups is incremented

Initial expression matrixK=4

Clustering 1-K=4 applied

Cluster 1.1

Cluster 1.2

Cluster 1.3

Cluster 1.4

Clustering 2-K=4 applied

Cluster 2.1

Cluster 2.2

Cluster 2.3

Cluster 2.4

Clustering 3-K=2 applied

Cluster 3.1

Cluster 3.2

Clustering 4-K=2 applied

Cluster 4.1

Cluster 4.2

Page 28: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

28

2.5. Gene selection

The Info Gain, Gain Ratio and SVM algorithms were applied to the expression matrix,

containing the average of replicates. For the expression matrix with replicated data, the Rank

Product was applied, where in the software the predetermined condition option of ‘two-class

unpaired’ was selected. For the Info Gain and Gain Ratio, the data was homogeneously discretized by the

software WEKA into five groups, since it was the maximum number of samples in this work. The

SVM ranks genes through decreasing ordering, from the most important to the less important

ones. Due to this, it was not possible to define which genes are considered significant, being

necessary establish a threshold. On the other hand Info Gain and Gain Ratio identify which

genes are the most informative, defining them with the same rank. The lowest number of genes

with the same rank obtained by these two methods was the number of genes selected as the

cutt-off for SVM.

Then, the common genes between Info Gain, Gain Ratio and SVM were combined with

the positively and negatively differentially expressed genes, which resulted from Rank Product.

The Figure 3 represents in scheme form these parts of process.

Figure 3: Scheme of gene selection analysis.

Expression matrix (n x m) with

replicate average

Expression matrix (n x m)

with replicate

InfoGain Rank Product GainRatio SVM

Genes selected

Page 29: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

29

Page 30: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

30

3. Results

3.1. Overview

In this project the transcriptome of clinical C. glabrata isolates recovered from different

infection sites were this species exists as a commensal or a pathogen was compared with the

transcriptome of the reference strain KCHr606, using species-specific Agilent DNA microarrays.

Such option of using the transcriptome of the KCHr606 strain as a reference was taken in order

to avoid any interference that could arise from the different genetic background of the isolates.

In specific, it was examined the transcriptome of two isolates (designated INV25 and INV28)

collected from the blood of patients diagnosed with candidemia and the transcriptome of three

isolates (designated VG216F, VG242F and VG6) that were recovered from the vaginal tract of

asymptomatic patients. For the comparative transcriptomic analysis the cells were cultivated in

RPMI growth medium and harvested in mid-log phase. In case of the commensal and invasive

experiments were performed in triplicate and duplicate, respectively, and only genes

consistently found to be up- or down-regulated in the replicates, comparing to the expression

levels registered in the lab strain Kchr606, were selected for further analysis. The isolates used

in this work were confirmed to belong to the C. glabrata species by sequencing of the D1/D2

region of ribosomal RNA.

In Figure 4 it is shown a PCA analysis of the samples rendering clear the close proximity

of samples VG216F and VG242F. Separation between the invasive and the commensal isolates

was also observed, although samples VG6 and INV28 clustered relatively close to each other.

Figure 4: PCA applied to samples. Purple and blue points represent commensal and invasive class, respectively.

Page 31: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

31

3.2. Exploratory analysis of the transcriptomics data

At first an exploratory analysis of the data was performed as this has been described to

be useful for a broad overview of the data and for the identification of different types of classes,

as well as for the identification of gene groups that may have a different expression profile

within the classes and within the samples (Simon, McShane, Wright, Korn, Radmacher, & Zhao,

2010) (Chandrasekhar, Thangavel, & Elayaraja, 2011). As described in the “materials and

methods” section, HCL was used for the initial analysis of the data, this being applied both to

samples and genes. After application of HCL to genes, implementation of K-means was

performed. Application of HCL to the samples results in the dendrogram that is shown in Figure

5. As the result of the divisive approach used the dendrogram has an intersecting line that

separates the obtained clusters. The divisive approach applied to the samples by HCL resulted

in two clusters, one corresponding to the commensal samples (VG216F, VG242F e VG6) and

the other to the invasive samples (INV25 and INV28). This result together with the above

described PCA analysis of the data, reinforce the idea that genomic expression of commensal

and invasive isolates is, indeed, different. Furthermore, this observation also supports the idea

that it may be possible to identifying gene groups having a different expression profile in the

commensal and in the invasive classes.

Figure 5: The HCL applied to samples, purple line, represent the cut in dendrogram

In Figure 6 it is shown the result of the application of HCL to genes being evident that the

genes can be clustered into 4 distinct groups: three big clusters (represented in the figure by a

green, blue and red line) and another containing only one gene (this is not shown by a line as

only its branch is visible). The clusters marked with green and red lines are composed by genes

that have a lower or higher (1106 and 110, respectively) expression in all the clinical isolates,

comparing to the reference strain. The group represented by the blue line is the one having a

higher number of genes (4022) and is composed by a very large set of genes that show almost

no difference to the expression registered in the reference strain. This method did not allowed

the observation of gene groups differentially expressed between the commensal and the

invasive isolates.

Page 32: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

32

Figure 6: HCL applied to all genes. Purple line represents the cut in dendogram, originating the three resulting groups

Green, blue and red lines group 1106, 4022 and 110 genes, respectively. Color schema used in the expression matrix:

Minimum expression green and maximum expression red.

After the use of the HCL divisive method, a k-means-based approach was implemented.

Despite the use of the HCL divisive approach has resulted in 3 big clusters, when we run K-

means with a K parameter of 3 it was not possible to obtain a good separation of the data in

gene groups (as shown in Appendix Figure 1). Increasing K to 4 resulted in the separation of

clusters that have genes down-regulated in all the isolates (grouped clusters 1 and 3) and

genes up-regulated in all the isolates (grouped in cluster 4), all compared to the reference strain

Figure 9. Genes whose expression variation in all the isolates is close to zero were also

clustered together, in cluster 2. The k-means method, compared to HCL, identified a higher

number of down or up-regulated genes in all the samples and also genes were differentially

expressed in the different gene groups. This difference is attributable to the different algorithms

used by the two methods: while HCL measures the similarity between genes, resulting in a

grouped series of hierarchical clusters; in K-means, given some specified number of clusters k,

determines to which of the current clusters centroids the object is closest.

Genes grouped in clusters 1, 3 and 4 are comprised by genes that had an identical

expression profile in all the tested isolates thereby differing from the reference strain in the

same manner. The reference strain KCHr606 derives from the CBS138 strain which was

retrieved from the gastrointestinal tract (GI tract ) of an asymptomatic patient and therefore it is

possible that the differential expression observed may result from the development of specific

adaptive responses to the GI tract that are not required for colonization of the blood or of the

vaginal tract. Functional clustering of the down and up-regulated genes, based on MIPS

Functional Catalogue, showed enrichment (p-value < 0.01) of several functional classes, as

shown in Figure 7 and Figure 8.

Page 33: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

33

Figure 7: Enriched functional classes (p-value below 0,01) based on MIPS functional catalogue, of genes that were

found to be down-regulated in the tested cohort of clinical isolates in comparison with the reference strain KChR606.

Figure 8: Enriched functional classes (p-value below 0,01) based on MIPS functional catalogue , of genes that were

found to be up-regulated in the tested cohort of clinical isolates in comparison with the reference strain KChR606.

In Table 2 it is shown the dispersion of gene expression around the centroid (average

distance from centroid) and the variance in expression levels regarding each sample (within

cluster variance) obtained for clusters 1, 2, 3 and 4. The first parameter indicates that the

smaller the mean distance to the centroid the smaller is the dispersion of expression profiles.

The second parameter shows the expression variance among the samples for a given gene

0 2 4 6 8 10 12 14 16 18 20

Metabolism of methionineNitrogen, sulfur and selenium metabolism

Phosphate metabolismC-compound and carbohydrate metabolismLipid, fatty acid and isoprenoid metabolism

EnergyProtein modification

Assembly of protein complexesProtein/ peptide degradation

Complex cofactor/cosubstrate/vitamine bindingTransported compounds (substrates)Enzyme mediated signal transduction

MAPKKK cascadeCell rescue, defense and virulence

Interaction with the environmentCell wall

Actin cytoskeleton

Gene percentage %

Genome Dataset

0 10 20 30 40 50 60

Metabolism of arginine

Nucleotide/ nucleoside/ nucleobase metabolism

Transcription

Protein synthesis

Protein with binding function or cofactorrequirement (structural or catalytic)

Unfolded protein response (e.g. ER qualitycontrol)

Gene percentage %

Genome Dataset

Page 34: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

34

group, the higher the variance observed in a cluster, the more different is the expression among

samples. These parameters, with the aid of cluster expression charts, allow the determination of

whether the clusters have, or not, a representative expression profile. The obtained results for

clusters 1 and 2 do not evidence the existence of expression differences among samples.

Differently, clusters 3 and 4 evidence a distinct expression pattern among samples.

Nevertheless, a high dispersion in the grouped genes was also evident in the genes that are

included in all the above mentioned clusters.

Figure 9: Resulting clusters from K-means (K = 4). Charts present the expression profile of the genes grouped in each

cluster. From top to bottom: cluster 1, cluster 2, cluster 3 and cluster 4.

Figure 10: Centroid expression profile of resulting clusters from K-means (K=4). From top to bottom: cluster 1, cluster 2,

cluster 3 and cluster 4.

Page 35: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

35

Table 2: Resulting data from K-means (K = 4), applied to 5239 genes.

Cluster Nº genes in Cluster

Average distance from cluster

mean

Within cluster

variance Description

Cluster 1 1865 0,1335 0,0032 Lower expression in all clinical isolates comparing

to the reference strain. Is not observed differences between samples

Cluster 2 2468 0,1245 0,0028 Lower and higher expression in all clinical isolates

comparing to the reference strain. Is not observed differences between samples

Cluster 3 351 0,2474 0,0185

Lower expression in all clinical isolates comparing to the reference strain.

Genes more expressed in VG242F than in the other samples

Cluster 4 555 0,2124 0,0210

Higher expression in all clinical isolates comparing to the reference strain.

Expressions differs in commensal and in invasive isolates

A specific analysis, to these clusters, aims the exploration of expression patterns able to

identify gene groups that may be dissimilar among the samples. As pointed in material and

methods chapter, before the K-means is applied to each cluster, HCL was applied to estimate

the initial group number. These results are shown in Appendix 2, as they were found not to

contribute for the identification of genes that could contribute to distinguish the commensal and

invasive isolates.

Cluster 1

Detailed analysis on the expression of the genes grouped in cluster 1 (shown in Figure 11

and in Table 3) revealed the existence of different expression profiles in sub-clusters 1.1, 1.3

and 1.4, with the exception of the genes that were grouped in cluster 1.2 that had a relatively

similar expression in all samples. In sub-cluster 1.1 are included the genes that seem to have a

lower expression in samples VG216F and VG241F but not in the other considered commensal

isolate VG6. In Cluster 1.3 are grouped those genes that have expression values lower in the

three commensal isolates than in the invasive isolates.

Page 36: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

36

Figure 11: Resulting clusters from K-means (K = 4) applied in submatrix of cluster 1. Charts present the expression

profile of the genes grouped in each cluster. From top to bottom: cluster 1.1, cluster 1.2, cluster 1.3 and cluster 1.4.

Figure 12: Centroid expression profile of resulting clusters from K-means (K=4) applied in submatrix of cluster 1. From

top to bottom: cluster 1.1, cluster 1.2, cluster 1.3 and cluster 1.4.

Table 3: Resulting data from K-means (K = 4) applied in submatrix of cluster 1 (1865 genes).

Cluster 1 Nº genes in Cluster

Average distance from cluster mean

Within cluster variance Differences among samples

Cluster 1.1 313 0,0959 0,0020 Genes less expressed in VG216F and VG242F than in the other samples

Cluster 1.2 777 0,0844 0,0015 No significant differences observed

Cluster 1.3 402 0,0925 0,0021 Expressions differs in commensal and in invasive isolates

Cluster 1.4 373 0,0988 0,0018 Expressions differences of VG242F and INV25 samples from the others

Page 37: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

37

Cluster 2

The genes grouped in cluster 2 do not display significant differences in expression in the

set of samples tested. However, a detailed analysis to this cluster led to the identification of a

particular interesting sub-cluster, denominated 2.4, which included a set of genes whose

expression differed in the commensal and in the invasive isolates (Figure 9 and Table 2). In

specific, these genes clustered in sub-cluster 2.4 were found to be down-regulated in the

commensal isolates and up-regulated in the invasive isolates, as shown in Figure 13.

Figure 13: Resulting clusters from K-means (K = 4) applied in submatrix of cluster 2. Charts present the expression

profile of the genes grouped in each cluster. From top to bottom: cluster 2.1, cluster 2.2, cluster 2.3 and cluster 2.4.

Figure 14: Centroid expression profile of resulting clusters from K-means (K=4) applied in submatrix of cluster 2. From

top to bottom: cluster 2.1, cluster 2.2, cluster 2.3 and cluster 2.4.

Page 38: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

38

Table 4: Resulting data from K-means (K=4) applied in submatrix of cluster 2 (2468 genes).

Cluster 2 Nº genes in Cluster

Average distance from cluster mean

Within cluster variance Description

Cluster 2.1 705 0,0823 0,0016 Genes less expressed in INV28 than in the other samples

Cluster 2.2 921 0,0821 0,0017 No significant differences observed

Cluster 2.3 401 0,0917 0,0018 Genes less expressed in VG242F than in the other samples

Cluster 2.4 441 0,0958 0,0018 Expressions differs in the commensal and in the invasive isolates

Cluster 3

Genes grouped in cluster 3 exhibit a high dispersion in comparison with its centroid, also

exhibiting higher expression variability among samples. The resulting analysis to this cluster

confirmed that the genes included are those considered to be outliers (Figure 15).

Figure 15: Resulting clusters from K-means (K = 2) applied in submatrix of cluster 3. Charts present the expression

profile of the genes grouped in each cluster. From top to bottom: cluster 3.1, cluster 3.2.

Figure 16: Centroid expression profile of resulting clusters from K-means (K=2) applied in submatrix of cluster 3. From

top to bottom: cluster 3.1, cluster 3.2.

Page 39: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

39

Table 5: Resulting data from K-means (K=2) applied in submatrix of cluster 3 (351 genes).

Cluster 3 Nº genes in Cluster

Average distance from cluster mean

Within cluster variance

Differences among samples

Cluster 3.1 89 0,2393 0,0123 No significant differences

observed Cluster

3.2 261 0,1677 0,0057 No significant differences observed

Cluster 4

Detailed analysis of the genes grouped in cluster 4 shows these genes are those that

have a different expression in the commensal and in the invasive samples, although some

dispersion has been observed, especially in what concerns to samples VG216F and INV28

(Figure 9and Table 2). In specific, a set of genes (clustered in sub-cluster 4.2) exhibiting a

higher expression in the invasive isolates than in the commensal isolates was identified (Figure

17and Table 5).

Figure 17: Resulting clusters from K-means (K = 2) applied in submatrix of cluster 4. Charts present the expression

profile of the genes grouped in each cluster. From top to bottom: cluster 4.1, cluster 4.2..

Page 40: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

40

Figure 18: Centroid expression profile of resulting clusters from K-means (K=2) applied in submatrix of cluster 4. From

top to bottom: cluster 4.1, cluster 4.2.

Table 6: Resulting data from K-means (K=2) applied in submatrix of cluster 4 (555genes).

Cluster 4 Nº genes in Cluster

Average distance from cluster mean

Within cluster variance Differences among samples

Cluster 4.1 106 0,2233 0,0290 Genes less expressed in VG6 than in the other samples

Cluster 4.2 449 0,1377 0,0035 Expressions differs in the commensal and in the invasive isolates

Taken into consideration the specific analysis of each submatrix derived from the four

clusters, it was possible to identify two sub-clusters of genes that are higher expressed in the

invasive isolates compared to the commensal (sub-cluster 2.4 and 4.2). Differently, genes

clustered in sub-cluster 1.3 are higher regulated in the commensal isolates but not in the

invasive ones. The higher expression of these genes in the invasive isolates may suggest their

involvement in C. glabrata ability to colonize the blood and therefore the physiological function

of these genes was assessed, taking into consideration the information available in the MIPS

database concerning their biological functions. The results obtained showed a significant

enrichment of several biological functions which are shown in Figure 19 and Figure 20.

Figure 19: Enriched functional classes (p-value below 0.01), based on MIPS functional catalogue, of genes that were

found to be less expressed in the invasive isolates compared to the commensal ones.

0 5 10 15 20 25

EnergyProtein targeting, sorting and translocation

Ion transportPeptide transportProtein transport

Intracellular traffickingHomeostasis of cations

Temperature perception and responsePeroxisome biogenesisEndosome biogenesis

Vacuole or lysosome biogenesis

Gene percentage %

Genome Dataset

Page 41: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

41

Figure 20: Enriched functional classes (p-value below 0.01), based on MIPS functional catalogue, of genes that were

found to be higher expressed in the invasive isolates compared to the commensal ones.

0 5 10 15 20 25 30

Amino acid metabolismNucleotide/ nucleoside/ nucleobase metabolism

RNA degradationDNA processing

RNA synthesisrRNA synthesistRNA synthesis

mRNA synthesis rRNA processing tRNA processing

Ribosome biogenesisTranslation

Protein folding and stabilizationProtein binding

Nucleic acid bindingNuclear transport

Non-vesicular ER transportUnfolded protein response (e.g. ER quality…

Gene percentage %

Genome Dataset

Page 42: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

42

3.3. Gene Selection

The exploratory analysis of the data that was described above has led to the identification

of genes that have a different expression profiles in the commensal and in the invasive isolates.

This observation reinforced the idea that there are genes that have a different pattern of

expression in the invasive and in the commensal isolates, according to the initially proposed

hypothesis. To identify which of these genes are better suited to distinguish the two classes of

isolates, a set of feature selection techniques were implemented. In specific, were used the Info

Gain, Gain Ratio, SVM and Rank Product algorithms. To avoid overfitting, these algorithms

were applied in whole data set and not only to the genes that were identified to be associated

with invasive colonization by the exploratory analysis above mentioned. The Info Gain, Rank

Product and Gain Ratio methods identified, respectively, 2199, 514 and 47 genes that can be

used to distinguish the commensal and the invasive isolates. From the SVM algorithm we

selected the firsts 2199 genes. The intersection of the genes that were identified by each

method (the common genes) resulted in a group of 24 genes, as evidenced in the Venn

diagram shown in Figure 21. By taking into account genes that were considered to have a high

distinguishing capacity by all the algorithms it is aimed to increased reliability of the gene

selection performed. The 24 genes that emerged from this analysis are listed in Table 7. These

genes are sorted in descending order, according their expression differences between the

commensal and invasive samples.

Figure 21: Resulting number of genes by intersection of SVM, Rank Product, Gain Ratio and Info Gain. (Oliveros, J.C.

(2007) VENNY. An interactive tool for comparing lists with Venn Diagrams).

Page 43: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

43

Table 7: ORF, Function, expression levels for each sample, values of mean, variance and standard deviation, regarding class, of the resulting 24 genes.

ORF

Commensal Isolates Invasive Isolates

푉퐺216퐹퐾퐶퐻푟606

푉퐺242퐹퐾퐶퐻푟606

푉퐺6퐾퐶퐻푟606

Average Standard deviation

퐼푁푉25퐾퐶퐻푟606

퐼푁푉28퐾퐶퐻푟606 Average Standard

deviation

A 1,08 0,95 0,83 0,96 0,12 4,75 4,58 4,67 0,11

B 0,44 0,43 0,49 0,45 0,03 0,15 0,13 0,14 0,02

C 2,30 2,07 2,28 2,21 0,13 6,78 6,76 6,77 0,02

D 2,55 2,30 2,57 2,48 0,15 6,99 6,83 6,91 0,11

E 1,33 1,47 1,40 1,40 0,07 3,04 3,07 3,05 0,02

F 1,55 1,47 1,49 1,51 0,04 3,17 3,35 3,26 0,12

G 2,00 2,08 1,79 1,96 0,15 4,50 3,81 4,15 0,49

H 1,21 1,27 1,16 1,21 0,06 2,52 2,36 2,44 0,11

I 1,16 1,29 1,22 1,23 0,06 2,43 2,47 2,45 0,03

J 0,80 0,78 0,74 0,77 0,03 0,38 0,40 0,39 0,01

K 1,17 1,20 1,02 1,13 0,10 2,09 2,41 2,25 0,23

M 1,66 1,86 1,94 1,82 0,14 3,40 3,75 3,58 0,25

N 1,26 1,14 1,21 1,21 0,06 2,31 2,37 2,34 0,05

O 1,61 1,43 1,60 1,55 0,10 3,12 2,86 2,99 0,18

P 1,02 1,09 1,15 1,08 0,07 2,06 2,05 2,06 0,01

Q 2,40 2,36 2,17 2,31 0,12 4,72 4,04 4,38 0,48

R 0,55 0,52 0,54 0,54 0,02 0,27 0,30 0,29 0,02

S 2,17 1,95 1,89 2,00 0,15 1,06 1,10 1,08 0,03

T 1,32 1,19 1,21 1,24 0,07 2,38 2,19 2,29 0,13

U 2,01 2,18 2,30 2,16 0,15 4,18 3,66 3,92 0,37

V 1,19 1,13 1,14 1,15 0,03 2,13 1,97 2,05 0,11

X 1,29 1,37 1,46 1,37 0,08 2,23 2,49 2,36 0,18

Page 44: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

44

ORF

Commensal Isolates Invasive Isolates

푉퐺216퐹퐾퐶퐻푟606

푉퐺242퐹퐾퐶퐻푟606

푉퐺6퐾퐶퐻푟606

Average Standard deviation

퐼푁푉25퐾퐶퐻푟606

퐼푁푉28퐾퐶퐻푟606 Average Standard

deviation

Z 1,55 1,69 1,55 1,60 0,08 0,93 0,94 0,93 0,01

W 0,97 0,94 0,95 0,95 0,02 0,62 0,61 0,62 0,01

Page 45: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

45

The physiological function of the most of these genes has not been fully elucidated in C.

glabrata, nevertheless it is clear the emergence of genes involved in processing of ribosomal

RNA.

Page 46: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

46

Page 47: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

47

4. Discussion

The use of unsupervised and supervised methods focused on the analysis of

transcriptomics data has been widely used for diagnosis and prediction of prognosis and

treatment in cancer patients ( Tarca, Romero, & Draghici, 2006) (Choudhuri, 2004). These

techniques also have the potential to be used in other problems involving medical diagnosis of

microbial infections, however up to now, this remains a relatively unexplored possibility. In this

work a computational approach was used to identify a set of genes that could serve as

biomarkers for the detection of invasive candidiasis caused by C. glabrata. This analysis was

based on the analysis of the transcriptomes of commensal isolates and of isolates retrieved

from the blood of patients with candidemia. It is important to stress that the transcriptomes used

for the analysis were obtained from commensal and invasive isolates cultivated in RPMI growth

medium in mid-log phase. This means that the differences registered in the transcriptomes of

the isolates correspond to stable alterations, that is, to alterations that are not induced when

these isolates are growing in their infection sites. Focus on those sets of genes makes sense

because it is difficult to obtain information on the transcriptomes of the isolates when they are

colonizing the patients in the different infection sites and also because it is more advantageous

to have a diagnosis tool that is focused on stable alterations as these are more prone to be less

dependent on the patient.

PCA analysis of the transcriptomic data revealed that the transcriptomes of the examined

invasive isolates differs from the one of the commensal isolates, nevertheless, some degree of

similarity between the commensal isolate VG6 and the invasive isolates was observed. Some

proximity of commensal and invasive isolates is expected due to dichotomic behavior of C.

glabrata between commensalism and pathogenesis. This is actually a recognized medical

problem for medical diagnosis of candidemia as often the isolates retrieved from sterile sites

(such as the blood) are actually part of the commensal flora of the patient. The proximity of

isolate VG6 with the invasive isolates, specially with invasive isolate INV28, suggests that the

later isolate could correspond to an isolate in the early stage of blood colonization or that isolate

VG6 could correspond to a commensal isolate transitioning to a “pathogenic state”. No

information regarding the medical condition of the patient where isolate VG6 was retrieved is

available and therefore it cannot be confirmed if this patient ended up by developing candidemia

or even a superficial candidiasis. An interesting observation that emerged from this work relates

to the identification of 313 genes, clustered in sub-cluster 1.1, that were found to be similarly

expressed in isolate VG6 and in the invasive isolates. It will be interesting to examine in the

future if these genes play a role in the transition of C. glabrata from commensalism to

pathogenesis.

Exploratory analysis of the data with unsupervised methods revealed the existence of

around 1292 genes differently expressed in the invasive isolates and in the commensal isolates.

In specific, 402 genes were found to be less expressed in the invasive isolates than in the

commensal isolates, while 890 genes were found to be more expressed in the invasive isolates.

Page 48: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

48

This comparison is performed taking into account that all isolates are compared against the

same reference strain. One of the biological functions that stood out within the set of genes

found to be more expressed in the invasive isolates is “ribosome biogenesis” and “proteins

synthesis”. The higher expression of these genes in the invasive isolates is interesting

considering that genes involved in these biological functions have also found to be up-regulated

during cultivation of C. albicans in human blood (Fradin, Kretschmar, Nichterlein, Gaillardin

Claude, d'Enfert, & Hube, 2003). Furthermore, several of the genes that were herein identified

as having a higher expression in the invasive isolates were also previously associated with

increased virulence in C. glabrata including several heat-shock chaperones, the Ace2

transcription factor and adhesins (Rodrigues, Silva, & Henriques, 2014). It is possible that other

genes included in this cluster may also play a role in virulence of C. glabrata and in the ability of

this microbe to colonize the blood, although confirmation of such hypothesis requires further

work.

In the second part of this work a set of methods for gene selection were employed to find out

a set of genes having the higher potential to discriminate the invasive and the commensal

isolates. 24 genes were identified by all the algorithms used as having a higher discriminatory

capacity, this representing only 0.46 % of the overall set of C. glabrata genes that were

examined in this work. This means that, the expression profiles of these genes are the ones that

allow us to better distinguish the type of classes (commensal and invasive ones), the

expression levels of these genes exhibiting a lower variance among the samples of each class

and having a higher variance between the classes. 18 genes were found to be more expressed

in the invasive isolates and 6 in the commensal isolates, as it can be observed by the graph

shown in Figure 22.

Page 49: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

49

Figure 22: Genes that allow to better distinguish the type of classes (commensal and invasive ones).

These genes are interesting candidates to be further examined as eventual biomarkers of C.

glabrata candidemia, however, extensive analysis of their expression in a larger cohort of

commensal and invasive isolates is required to confirm the usefulness of their use in diagnosis

of candidemia. Most of these 24 genes do not have a characterized biological function in C.

glabrata and the information available is based on the knowledge gathered in the corresponding

C. albicans and S. Cerevisiae orthologues. The herein obtained results open the door to a

future investigation to investigate what is the biological significance of the up-regulation of these

genes in the invasive isolates and to assess if these genes do play a role in C. glabrata

virulence potential. Interestingly, 6 of the genes that we have herein identified as being more

0,00

1,00

2,00

3,00

4,00

5,00

6,00

7,00

8,00

A B C D E F G H I J K L M N O P Q R S T U V X Z

Fold

( Cl

inca

l iso

late

s/ re

fere

nce

stra

in)

The most informative genes

VG216F/KCHr606 VG242F/KCHr606 VG6/KCHr606 INV25/KCHr606 INV28/KCHr606

Page 50: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

50

expressed in the invasive isolates are homologous to genes found to be up-regulated during C.

albicans growth in human blood.

In conclusion, the set of genes identified in this work through feature selection techniques

provided a cohort of 24 genes that may be used for the detection of invasive candidiasis caused

by C. glabrata. Necessarily the results need to be fine-tuned by measuring the expression of

these genes in a larger cohort of clinical isolates. Increase in the number of datasets coming

from invasive and commensal isolates would also be very beneficial in order to increase the

confidence in the results obtained and also aiming to develop a diagnosis classifier tool in which

it could be possible to predict what would be the outcome of a given isolate based on its

genomic expression profile. The results have also opened the door to a better understanding of

the molecular mechanisms underlying the transition between commensalism and pathogenesis

in C. glabrata since a very high number of genes that was not previously associated with

increased virulence in this pathogenic yeast has been uncovered in this work as having a higher

expression in the invasive isolates. The identification of these genes represents thus a step

forward in the better understanding of the pathophysiology of C. glabrata infections.

Page 51: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

51

5. References

Brunke, S., Seider, K., Fischer, D., Jacobsen, I., Kasper, L., Jablonowski, N., et al. (2014). One

Small Step for a Yeast - Microevolution within Macrophages Renders Candida glabrata

Hypervirulent Due to a Single Point Mutation. PLoS Pathog, 10(10):e1004478.

Mikulska, M., Furfaro, E., & Viscoli, C. (2012). Biomarkers for Diagnosis and Follow-Up of

Invasive Candidiasis:A Brief Review of the ECIL Recommendations. Current Fungal

Infection Reports, Volume 6, 192-197.

Tarca, A., Romero, R., & Draghici, S. (2006). Analysis of microarray experiments of gene

expression profiling. Am J Obstet Gynecol., 195(2): 373–388.

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., et al. (2000).

Distinct types of diffuse large B-cell lymphoma identfiied by gene expression profling.

Nature, 403(6769):503-11.

Avni, T., Leibovici, L., & Paul, M. (2011). PCR Diagnosis of Invasive Candidiasis: Systematic

Review and Meta-Analysis. J Clin Microbiol., 49(2): 665–670.

Badu, M. (2004). Introduction to Microarray Data Analysis. In R. Grant, Computational

Genomics. Horizon press.

Bard, E., & Hu, W. (2011). Identification ofa 12-Gene signature for lung cancer prognosis

through machine learning. Journal of Cancer Therapy, Vol.2 No.2, 2011.

Borekçi, G., Ersoz, G., Otag, F., Ozturhan, H., Sen, S., Soylemez, F., et al. (2010). Identification

of Candida Species from Blood Cultures with Fluorescent In Situ Hybridization (FISH),

Polymerase Chain Reaction-Restriction Fragment Length Polymorphism (PCR-RFLP)

and Conventional Methods. Medical Journal of Trakya University, 27(2):183-191.

Brazma, A., & Vilo, J. (2000). Gene expression data analysis. FEBS Lett. , 480(1):17-24.

Breitling, R., Armengaud, P., Amtamann, A., & Herzyk, P. (2004). Rank products: a simple, yet

powerful, new method to detect differentially regulated genes in replicated microarray

experiments. FEBS Lett, 573(1-3):83-92.

Chandrasekhar, T., Thangavel, K., & Elayaraja, E. (2011). Gene Expression Data clustering

using Unsupervised methods. Advanced Computing (ICoAC), 146 - 150.

Choudhuri, S. (2004). Microarray in biology and medicine. J Biochem Mol Toxicol., 18(4):171-9.

Clancy, C. J., & Nguyen, M. H. (2013). Finding the "Missing 50%" of invasive candidiasis: How

nonculture diagnostics will improve understanding of disease spectrum and transform

patient care. Clin Infect Dis, 56(9):1284-92.

Page 52: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

52

Cuenca-Estrella1, M., Verweij, P., Arendrup, M., Arikan-Akdagli, S., Bille, J., Donnelly, J., et al.

(2012). ESCMID guideline for the diagnosis and management of Candida diseases

2012: diagnosis procedures. Clin Microbiol Infect, 18 Suppl 7:9-18.

Cui, X., & Churchill, G. A. (2003). Review: Statistical tests for differential expression in cDNA

microarray experiments. Genome Biol., 4, 4(4):210.

Dabas, P. S. (2013). Review- An approach to etiology, diagnosis and management of different

types of candidiasis. Academic Journals, 4, Vol. 4(6), pp. 63-74.

Dalman, M., Deeter, A., Nimishakavi, G., & Duan, Z.-H. (2012). Fold change and p-value cutoffs

significantly alter microarray interpretations. BMC Bioinformatics, 13.

Do, J. H., & Choi, D.-K. (2008). Clustering Approaches to identifying gene expression patterns

from DNA microarray data. Mol Cells, 25(2):279-88.

Eichinger Group. (12 de September de 2013). Dictyostelium discoideum-Microarray analysis.

(Institute for Biochemistry I) Obtido em 4 de June de 2015, de http://www.uni-

koeln.de/med-fak/biochemie/transcriptomics/07_analysis.shtml

Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display

of genome-wide expression patterns. Genetics, vol. 95 no. 25, 14863-14868.

Fidel, P. L., Jr., Cutright, J. L., & Tait, L. (1995). A murine model of Candida glabrata vaginitis. J

Infect Dis, 173(2):425-31.

Fidel, P. L., JR., Vazquez, J. A., & Sobel, J. D. (1999). Candida glabrata : Review of

epidemiology, pathogenesis and clinical disease with comparison to C. albicans. Clin

Microbiol Rev., 12(1):80-96.

Fradin, C., Kretschmar, M., Nichterlein, T., Gaillardin Claude, d'Enfert, C., & Hube, B. (2003).

Stage-specific gene expression of Candida albicans in human blood. Mol Microbiol.,

47(6):1523-43.

Freyhult, E., Landfors, M., Onskog, J., Hvidsten, T. R., & Rydén, P. (2010). Challenges in

microarray class discovery: a comprehensive examination of normalization, gene

selection and clustering. BMC Bioinformatics, 11:503.

Fryer, R. M., Randall, J., Yoshida, T., Hsiao, L.-L., Blumenstock, J., Jensen, K. E., et al. (2002).

Global analysis of gene expression: methods, interpretation, and pitfalls. Exp Nephrol,

10(2):64-74.

Fukuda, Y., Tsai, H.-F., Myers, T., & Bennett, J. (2013). Transcriptional Profiling of Candida

glabrata during Phagocytosis by Neutrophils and in the Infected Mouse Spleen. Infect

Immun., 81(4):1325-33.

Page 53: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

53

Furey, T., Cristianini, N., Duffy, N., Bednarski, D., Schummer, M., & Haussler, D. (2000).

Support vector machine classification and validation of cancer tissue samples using

microarray expression data. Bioinformatics, 16 (10): 906-914.

George, G. V., & Raj, D. C. (2011). Review on feature selection techniques and the impact of

SVM for Cancer classification using gene expression profile. International Journal of

Computer Science & Engineering Survey (IJCSES), Vol.2, No.3.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Messirov, J. P., et al.

(1999). Molecular classification of cancer: class discovery and class prediction by gene

expression monitoring. Science, Vol. 286 no. 5439 pp. 531-537.

Guyon, I., Weston, J., & Barnhill, S. (2002). Gene selection for cancer classification using

support vector machines. Machine Learning, 46, 389–422.

Han, J., Kamber, M., & Pei, J. (2012). Data Mining concepts and techniques. Elsevier.

Held, J., Kohlberger, I., Rappold, E., Grawitz, A. B., & Hacker, G. (2013). Comparison of (1->3)-

b-1,3-D-glucan,Mannan/Anti-Mannan Antibodies, and Cand-tec Candida Antigen as

serum biomarkers for candidemia. J Clin Microbiol, 51(4): 1158–1164.

Herwig R, & Lehrach H. (2006). Expression profiling of drug response- from genes to pathways.

Dialogues Clin Neurosci., 8(3):283-93.

Innocent, N., & Kurian, M. (2014). Cancer prediction Based on gene expression data through

association rule based classification and fuzzy rough set attribute reduction on

information gain ratio. ISSN, Vol. 2;2321-9653.

Jiang, D., Tang, C., & Zhang, A. (2004). Cluster Analysis for Gene Expression Data: A survey.

Knowledge and Data Engineering, Volume:16 , 1370 - 1386.

Jones, N. C., & Pevzner, P. A. (2004). An introduction to bioinformatics algorithms.

Massachusetts Institute of Technology.

Linde, J., Duggan, S., Weber, M., Horn, F., Sieber, P., Hellwig, D., et al. (2015). Defining the

transcriptomic landscape of Candida glabrata by RNA-seq. Nucleic Acids Res,

43(3):1392-406.

Malone, J. H., & Oliver, B. (20011). Microarrays, deep sequencing and the true measure of the

transcriptome. BMC Biology, 9:34.

Mikulska, M., Calandra, T., Sanguinetti, M., Poulain, D., & Viscoli, C. (2010). The use of

mannan antigen and anti-mannan antibodies in the diagnosis of invasive candidiasis:

recommendations from the third european conference on infections in leukemia. Crit

Care, 14, 14(6):R222.

Page 54: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

54

Murray, P. R., Rosenthal, K. S., & Pfaller, M. A. (2005). Medical Microbiology (Vol. 5th Edition).

Elsevier.

Ostrosky-Zeichner, L., Vitale, R. G., & Nucci, M. (2012). New serological markers in medical

mycology: (1,3)-β-D-glucan and Aspergillus galactomannan. Infectio, 16(Supl 3): 59-63.

Pan, W., Lin, J., & Le, C. (2002). Model-based cluster analysis of microarray gene-expression

data. Genome Biology, 3;0009.1–0009.

Raychaudhuri, S., Stuart, J. M., & Altman, R. B. (2000). Principal components analysis to

summarize microarray experiments: aplication to sporulation time series. Pac Symp

Biocomput, 455-66.

Rodrigues, C. F., Silva, S., & Henriques, M. (2014). Candida glabrata: a review of its features

and resistance. Eur J Clin Microbiol Infect Dis., 33(5):673-88.

Rodríguez- Tudela, J. L., Brachiesi, F., Biller, J., Chryssanthou, E., Cuenca- Estrella, M.,

Denning, D., et al. (2002). Method for the determination of minimum inhibitory

concentration (MIC) by broth dilution of fermentative yeasts. Clinical Microbiology and

Infection, Volume 9, Issue 8, pages i–viii.

Rossignol, T., Logue, M. E., Reynolds, K., Grenon, M., Lowndes, N. F., & Butler, G. (2007).

Transcriptional Response of Candida parapsilosis following exposure to farnesol.

Antimicrob Agents Chemother, 51(7):2304-12.

Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techinques in

bioinformatics. Bioinformatics, 23 (19): 2507-2517.

Sampaio, P., & Pais, C. (2014). Epidemiology of invasive candidiasis and challenges for the

mycology laboratory: specificities of Candida glabrata. Current Clinical Microbiology

Reports, Volume 1, Issue 1-2, pp 1-9.

Sardi, J. C., Scorzoni, L., Bernardi, T., Fusco- Almeida, A. M., & Mendes Giannini, M. J. (2013).

Review- Candida species: current epidemiology pathogenicity, biofilm formation, natural

antifungal products and new therapeutic options. J Med Microbiol, 62(Pt 1):10-24.

Schabereiter-Gurtner, C., Selitsch, B., Rotter, M. L., Hirchl, A. M., & Willinger, B. (2007).

Development of novel real-time PCR assays for detection and differentiation of eleven

medically important Aspergillus and Candida species in clinical specimens. J Clin

Microbiol, 45(3): 906–914.

Selvaraj, S., & Natarajan, J. (2011). Microarray data analysis and mining tools. Bioinformation,

6(3): 95–99.

Page 55: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

55

Sendid, B., Jouault, T., Coudriau, R., Camus, D., Odds, F., Tabouret, M., et al. (2004).

Increased Sensitivity of Mannanemia Detection Tests by Joint Detection of α- and β-

Linked Oligomannosides during Experimental and Human Systemic Candidiasis. J Clin

Microbiol, 42(1):164-71.

Shahzad, A., Cohrs, R. J., Shen, A., & Lass-Florl, C. (2012). Human Fungal Infections: Need to

improve Diagnosis with New Biomarkers Developed by Translational Research. Mol

Med Ther , 1:1 doi:10.4172/2324-8769.1000e101.

Silva, S., Negri, M., Henriques, M., Oliveira, R., Williams, D. W., & Azeredo, J. (2012). Candida

glabrata,Candida parapsilosis and Candida tropicalis:biology, epidemiology,

pathogenicityand antifungal resistance. FEMS Microbiol Rev., 36(2):288-305.

Simon, R. M., McShane, L. M., Wright, G. W., Korn, E. L., Radmacher, M. D., & Zhao, Y.

(2010). Design and analysis of DNA microarray investigations. Springer.

Stiglic, G., Rodriguez, J. J., & Kokol, P. (2008). Feature selection and classification for small

gene sets. Pattern Recognition in Bioinformatics Lecture Notes in Computer Science,

Volume 5265;121-131.

Wang, J. (2008). Data warehousing and mining: concepts methodologies, tools and

applications. Information science reference.

White, P. L., Archer, A. E., & Barnes, R. A. (2005). Comparison of Non-Culture-Based methods

for Detection of systemic fungal infections, with an emphasis on invasive Candida

infections. J Clin Microbiol, 43(5):2181-7.

Page 56: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

56

Page 57: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

57

6. Appendix

Appendix 1

Appendix Figure 1: Resulting clusters from K-means (K = 3). The first three charts on the left, present the expression

profile of the genes grouped in each cluster (from top to bottom: cluster 1, cluster 2, and cluster 3). The other charts

represent the centroid expression profile of each cluster, respectively.

Appendix Table 1: Resulting data from K-means (K = 3).

Cluster Nº genes in Cluster

Average distance from cluster mean

Within cluster

variance

Differences among samples

Cluster1 3093 0,1323 0,0033 Is not observed

Cluster 2 1103 0,1962 0,0201 Expressions differences between commensal

from the invasive Cluster 3 1043 0,2170 0,0224 Expressions differences

between Vg242F from the others

Page 58: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

58

Appendix 2

Appendix Figure 2: HCL applied to the submatrix of cluster 1

Appendix Figure 3: HCL applied to the submatrix of cluster 2

Page 59: Biological Engineering - ULisboa...acelerar o desenvolvimento de ferramentas de diagnóstico molecular de candidiase como também para uma melhor compreensão dos mecanismos subjacentes

59

Appendix Figure 4: HCL applied to the submatrix of cluster 3

Appendix Figure 5 :HCL applied to the submatrix of cluster 4.