
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Hierarchical Dynamical Systems

Pedro Manuel Nunes Sequeira

PREPARAÇÃO DA DISSERTAÇÃO

Advisor: Jaime dos Santos Cardoso

Co-Advisor: José Carlos Príncipe

February 18, 2015


Contents

1 Introduction
2 Literature Review
   2.1 Speech Perception
      2.1.1 Intensity perception (Loudness)
      2.1.2 Frequency perception (Pitch)
      2.1.3 Masking
      2.1.4 Phonemes and allophones
   2.2 Algorithms
      2.2.1 Hidden Markov Models
3 Hierarchical Dynamical Systems
   3.1 Hierarchical architectures
   3.2 Hierarchical Linear Dynamical Systems
4 Work Plan
   4.1 Calendarization
   4.2 Tools and Resources
   4.3 Work Done
References


List of Figures

2.1 Normal equal-loudness-level contours
2.2 Pitch perception with frequency
2.3 Impulse response of cochlear filters (gammatone)
2.4 Example of a 3-state HMM (from Makhoul and Schwartz (1995))
2.5 Phonetic HMM (from Makhoul and Schwartz (1995))
3.1 DBN architecture
3.2 RBM architecture
3.3 DBN/DNN architecture
3.4 DNN-HMM architecture
3.5 Long Short-Term Memory Cell
3.6 Bidirectional Recurrent Neural Network
3.7 Deep Bidirectional Long Short-Term Memory
3.8 Hierarchical Dynamical System
4.1 Gantt Chart


Chapter 1

Introduction

This document is the final report of the course Preparação da Dissertação of the Master's degree in Electrical and Computer Engineering at FEUP.

The thesis consists of the study of the HLDS (Hierarchical Linear Dynamical Systems) algorithm described in (Cinar and Principe, 2014; Cinar et al., 2014), which showed promising results in pitch estimation of isolated musical notes.

The first stage of the dissertation is the replication of the results presented in those papers. In the second stage, an adaptation of the algorithm to speech recognition will be attempted, more specifically to phonetic classification and recognition.

The main motivation for choosing a hierarchical model is the fact that the auditory cortex has a layered hierarchical structure (Read et al., 2002). This approach is also useful for modeling sequences on several time scales.

According to the survey in (Liao, 2005), the state of the art in time series clustering consists mainly of adapting existing static-data methods to time series. However, (Cinar and Principe, 2014; Cinar et al., 2014) argue that those methods do not take advantage of the temporal information present in the data, ignoring the time dependency between features. For this reason, dynamical systems are used, since they intrinsically model temporal structure.

Chapter 2 explains the main characteristics of speech and the difficulties that its recognition entails, and presents the state-of-the-art algorithms. In Chapter 3, other hierarchical models and the algorithm of interest are described. Finally, Chapter 4 presents the planned time schedule for the semester.


Chapter 2

Literature Review

2.1 Speech Perception

To be able to adapt an algorithm to speech recognition, it is necessary to have some notion of the way the human perceptual system reacts to sound waves. It is important to realize that the relationship between the human auditory perception of sound and the associated physical quantities is neither simple nor linear (Huang et al., 2001).

The human ear has three sections: the outer ear, the middle ear and the inner ear. The relevant structure of the inner ear for sound perception is the cochlea, which behaves as a filter bank.

2.1.1 Intensity perception (Loudness)

Generally speaking, sounds with greater intensity usually sound louder. However, the sensitivity of the ear varies with the frequency and the quality of the sound. Figure 2.1 shows the equal-loudness contours adopted by ISO British Standard (2003), which describe this non-linearity in detail.

Figure 2.1: Normal equal-loudness-level contours


2.1.2 Frequency perception (Pitch)

Since the cochlea behaves as a spectrum analyzer, some effort has been made to model its behavior. As noted in Fletcher (1940), the cochlea behaves as a bank of overlapping auditory filters, whose bandwidths are called critical bandwidths.

Western musical pitch is described in octaves and semitones, which form a logarithmic frequency scale. Scales based on the human perceptual system, however, are roughly logarithmic at high frequencies and linear at low frequencies. Two of these scales (the Mel scale and the Bark scale) are given by Equations 2.2 and 2.1; they are normalized and plotted together in Figure 2.2.

Bark scale: b(f) = 13\arctan(0.00076\,f) + 3.5\arctan\!\left(\left(\frac{f}{7500}\right)^{2}\right) \quad (2.1)

Mel scale: B(f) = 1127\,\ln\!\left(1 + \frac{f}{700}\right) \quad (2.2)
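As an illustration, the following is a minimal Python sketch of these two frequency warpings (the function names are chosen here for illustration; frequencies are in Hz):

import numpy as np

def hz_to_bark(f):
    # Bark scale, Equation 2.1
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_mel(f):
    # Mel scale, Equation 2.2
    f = np.asarray(f, dtype=float)
    return 1127.0 * np.log(1.0 + f / 700.0)

# Example: warp a few frequencies and compare the two scales
freqs = [100.0, 1000.0, 4000.0, 8000.0]
print(hz_to_bark(freqs))
print(hz_to_mel(freqs))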

Figure 2.2: Pitch perception with frequency

2.1.3 Masking

Frequency masking is a phenomenon under which one sound cannot be perceived if another sound close in frequency has a high enough level; the louder sound masks the other one (Huang et al., 2001). If two sounds played at the same time have frequencies close enough, they will be interpreted as a combination tone instead of two separate sounds. This happens due to the filter bank associated with the cochlea, which splits the signal into different frequency components that are coded independently on the auditory nerve, which transmits the information to the brain.

One model of auditory filters widely used for speech recognition is the Gammatone filter, which is described in (Lyon et al., 2010; Qi et al., 2013) and is employed in the HLDS. Its impulse response is given by Equation 2.3 and represented in Figure 2.3.

Gammatone: g(t) = a\,t^{\,n-1} e^{-2\pi b t} \cos(2\pi f t + \phi) \quad (2.3)
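The following minimal Python sketch generates this impulse response (the parameter values are illustrative; in practice the bandwidth parameter b is tied to the critical bandwidth of each cochlear channel):

import numpy as np

def gammatone_ir(fc, fs=16000, n=4, b=125.0, phi=0.0, a=1.0, duration=0.025):
    # Equation 2.3: g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi)
    t = np.arange(0.0, duration, 1.0 / fs)
    return a * t ** (n - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t + phi)

# Example: 4th-order gammatone impulse response centered at 1 kHz
g = gammatone_ir(fc=1000.0)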

Figure 2.3: Impulse response of cochlear filters (gammatone)

2.1.4 Phonemes and allophones

In speech science, the term phoneme denotes any of the minimal units of speech sound in a language that can serve to distinguish one word from another (Huang et al., 2001). It is an abstraction and does not define a realization, as it is a context-independent and speaker-independent concept. The phonetic realizations of a phoneme are called allophones. The International Phonetic Alphabet is defined in the international standard (ipa, 2005).

2.2 Algorithms

2.2.1 Hidden Markov Models

The Hidden Markov Model (HMM) was for a long time the state of the art in speech recognition systems (Makhoul and Schwartz, 1995). It can be viewed as a state machine whose transitions are defined by probabilities. Also, unlike non-hidden Markov models, a state does not correspond to a single output symbol, but defines a probability distribution over output symbols. This model is illustrated in Figure 2.4, where the a_{ij} are the transition probabilities and the b's are the output probabilities.

Figure 2.4: Example of a 3-state HMM (from Makhoul and Schwartz (1995))

The procedure for modeling phonetic speech events using HMMs as a generative model is described by Makhoul and Schwartz (1995) and illustrated in Figure 2.5.

Figure 2.5: Phonetic HMM (from Makhoul and Schwartz (1995))

We start by noting that the structure of the model only allows transitions in one direction. This is known as a "left-to-right" model and represents the flow of time. Transitions from a state to itself serve to model different phoneme durations.


The reason three states are needed is the coarticulation effect: the acoustic realization of a phoneme is affected by the preceding and following phonemes, especially the two immediate neighbors.

The codebook of spectral templates represents the space of possible speech spectra, and the templates serve as the output symbols of the HMM. From the moment we enter state 1 until we leave state 3, a sequence of symbols is generated; this sequence corresponds to a single phoneme.

The recognition process uses the same model. There is one HMM for each phonetic context of interest. Usually the same structure is employed for every HMM, differing only in the transition probabilities. For a given input speech spectrum that has been quantized to one of the templates, we find the probability that the template was the output generated by the model. If we consider that the state sequence followed the path 1 → 2, we multiply the current path probability by the corresponding transition probability and, using the new frame of speech, by the probability that it was the output of state 2. We continue this process until the model is exited. This procedure is done for every phoneme model and all possible state paths. In the end, the result with the highest probability is considered the recognized sequence of phonemes.
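As a concrete illustration of this scoring procedure, the Python sketch below evaluates one quantized spectral sequence against a single 3-state left-to-right phoneme model using Viterbi decoding. The transition and output probabilities are made up for the example; in a full recognizer the score would be computed for every phoneme model and the highest one would win.

import numpy as np

A = np.array([[0.6, 0.4, 0.0],    # transition probabilities a_ij (left-to-right structure)
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.7, 0.2, 0.1],    # output probabilities over 3 spectral templates
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
pi = np.array([1.0, 0.0, 0.0])    # the model is always entered through state 1

def viterbi_log_score(obs):
    # Log-probability of the best state path for a sequence of quantized templates
    logA, logB = np.log(A + 1e-12), np.log(B + 1e-12)
    delta = np.log(pi + 1e-12) + logB[:, obs[0]]
    for o in obs[1:]:
        delta = np.max(delta[:, None] + logA, axis=0) + logB[:, o]
    return delta[-1]    # the path must end in the last state of the model

print(viterbi_log_score([0, 0, 1, 1, 2, 2]))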


Chapter 3

Hierarchical Dynamical Systems

3.1 Hierarchical architectures

Deep learning, also known as hierarchical learning, is a class of machine learning techniques in which information processing is done in many stages. Such models are composed of many layers of nonlinear processing in which each lower layer's outputs are the inputs to the layer above (Deng, 2012).

It has become increasingly popular since the development of new training algorithms and the increase in hardware capabilities (GPUs). These algorithms have shown success in many applications, such as audio processing, speech recognition, handwriting recognition, computer vision, object recognition and information retrieval.

Most deep learning architectures can be described as either generative, discriminative or hybrid:

• Generative deep architecture — a model that characterizes the joint probability distribution of the observed data and the corresponding classes. Since it models a probability distribution, it can be used to generate synthetic data in the input space. Furthermore, using Bayes' theorem, one can transform this model into a discriminative model.

• Discriminative deep architecture — used for class assignment. This is often done by characterizing the a posteriori class probabilities conditioned on the input data.

• Hybrid deep architecture — a model whose goal is class assignment but which uses the outcomes of generative models, or a model where discriminative criteria are used to learn the parameters of a generative model.

Deep learning originated in attempts to increase the number of layers of feed-forward neural networks, or multi-layer perceptrons (MLPs). This did not work at the time, since the available learning algorithms (back-propagation) would get trapped in poor local optima.

This difficulty in training deep models was eased by the research of (Hinton et al., 2006), which introduced the Deep Belief Network (DBN). As illustrated in Figure 3.1, this is a multi-layered probabilistic generative model whose two highest layers have symmetric connections and whose lower layers receive top-down connections from the layer above. The hidden layers consist of Restricted Boltzmann Machines (RBMs), networks of symmetrically connected neuron-like units that form a bipartite graph with respect to the visible and hidden units (see Figure 3.2).

Figure 3.1: DBN architecture
Figure 3.2: RBM architecture

The learning is done in a greedy, layer-by-layer fashion. This algorithm allows a much better initialization of the deep neural network model (see Figure 3.3) and has been shown to be effective in speech recognition (Hinton et al., 2012). It is shown that this method achieves maximum-likelihood learning. Since the learning is unsupervised, when classification is desired, a final layer of variables (corresponding to the labels) is added.
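For concreteness, the following is a minimal sketch of the contrastive-divergence (CD-1) update used to pre-train each RBM layer; the binary-unit assumption, variable names and learning rate are illustrative choices of this sketch, not the exact recipe of Hinton et al. (2006).

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    # Positive phase: hidden activations given a batch of data (batch x n_visible)
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step (reconstruction of the visible units)
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Update the parameters from the difference of correlations
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid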

Figure 3.3: DBN/DNN architecture

Another interesting deep model described in (Deng, 2012) is an interface between the previously described DBN-DNN and the HMM. This overcomes the limitation of the input vectors being restricted to a fixed dimensionality, which is relevant in applications such as speech recognition and video processing that require sequence recognition. The HMM is a convenient tool for enabling what was a static classifier to handle dynamic or sequential patterns.

This architecture, represented in Figure 3.4, has been successfully used for speech recognition in (Dahl et al., 2012).

Figure 3.4: DNN-HMM architecture

Recurrent neural networks (RNNs) have a larger state space and richer dynamics than HMMs, making them powerful for modeling sequential data such as speech, which is an intrinsically dynamic process. The depth in time of the RNN is given by the model's structure, which makes its hidden state a function of all previous hidden states, as can be observed in Equations 3.1. The non-linearity H usually represents an elementwise sigmoid function. There are noticeable similarities between this model and the state-space model of the HLDS.

h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y \quad (3.1)
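A minimal Python sketch of the forward pass in Equations 3.1 follows; here H is taken as an elementwise logistic sigmoid, and the weight shapes are assumptions of the sketch.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    # x_seq has shape (T, n_in); returns the outputs y_t stacked as (T, n_out)
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in x_seq:
        h = sigmoid(W_xh @ x_t + W_hh @ h + b_h)   # hidden state depends on the whole past
        outputs.append(W_hy @ h + b_y)             # linear output layer
    return np.stack(outputs)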

To adapt the standard RNN model for speech recognition, the authors of (Graves et al., 2013b,a) introduced three extensions, creating what they called the Deep Bidirectional Long Short-Term Memory (DBLSTM). This model achieved the lowest error rates recorded so far on the TIMIT database.

Firstly, they introduced a much more complex non-linearity H, represented in Figure 3.5. Secondly, they made it possible for the model to exploit not only past context but also future context. This is possible because, in speech recognition applications, whole utterances are transcribed at once. They added this functionality by including two hidden layers in the hierarchy that process the data in both directions of time, as illustrated in Figure 3.6.

Figure 3.5: Long Short-term Memory Cell
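For reference, the cell in Figure 3.5 replaces H by a gated update; the equations below follow the LSTM formulation used by Graves et al. (2013b), with peephole connections, where σ denotes the logistic sigmoid, i, f and o are the input, forget and output gates and c is the cell state:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \tanh(c_t)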

Lastly, due to the recent interest in deep architectures for their ability to build progressively higher-level representations of the data, they stacked several of these structures on top of each other, as shown in Figure 3.7.

Figure 3.6: Bidirectional Recurrent Neural Network

Figure 3.7: Deep Bidirectional Long Short-Term Memory

3.2 Hierarchical Linear Dynamical Systems

The HLDS model (Cinar and Principe, 2014; Cinar et al., 2014) has an architecture that consists of a hierarchical structure whose layers are coupled linear dynamical systems. A block diagram is presented in Figure 3.8. The system dynamics are described by Equations 3.2, 3.3 and 3.4.


The model's hidden states are x_t ∈ R^n, u_t ∈ R^k and z_t ∈ R^s for the first, second and third layer, respectively. The observation vector is y_t ∈ R^m.

The dimensionality decreases as we go up in the hierarchy (n > k > s), so that the states of higher layers are restricted to smaller representation spaces, which are used for clustering.

The authors insert some a priori information into the model by using Gammatone filters, which are reliable models of the cochlear filters, in the observation matrix. A fixed-point behavior is imposed by the identity transition matrix in the highest layer of the hierarchy. Since each layer is driven by the one above it, this stabilizes the system and results in the creation of clusters in the state space.

As the model's equations show, the model can be re-written in a joint state space (Equation 3.4). This enables the estimation of the hidden states of all layers simultaneously using the standard Kalman filter equations; a minimal sketch of this joint filtering step is given after the equations below.

This model learns by estimating the parameters of the matrices while inferring the states of the HLDS. This is called sequential estimation. For the same observation, we consider two dual systems: the usual state system and a second one that represents the dynamics of the parameters. To create this parameter system, we vectorize the original system's matrices and treat those parameters as if they were states, with an identity transition matrix. Therefore, two Kalman filters are used in parallel, one for estimating the states and another for estimating the parameters.

Figure 3.8: Hierarchical Dynamical System

\begin{aligned}
z_t &= z_{t-1} + p_t \\
u_t &= G u_{t-1} + D z_{t-1} + r_t \\
x_t &= F x_{t-1} + B u_{t-1} + w_t \\
y_t &= H x_t + v_t
\end{aligned} \quad (3.2)

\begin{bmatrix} z_t \\ u_t \\ x_t \end{bmatrix} =
\begin{bmatrix} I & 0 & 0 \\ D & G & 0 \\ 0 & B & F \end{bmatrix}
\begin{bmatrix} z_{t-1} \\ u_{t-1} \\ x_{t-1} \end{bmatrix} +
\begin{bmatrix} p_t \\ r_t \\ w_t \end{bmatrix}, \qquad
y_t = \begin{bmatrix} 0 & 0 & H \end{bmatrix}
\begin{bmatrix} z_t \\ u_t \\ x_t \end{bmatrix} + v_t \quad (3.3)

\tilde{X}_t = \tilde{F} \tilde{X}_{t-1} + \tilde{W}_t, \qquad
y_t = \tilde{H} \tilde{X}_t + v_t \quad (3.4)
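A minimal Python sketch of the joint Kalman filtering implied by Equations 3.3 and 3.4 is given below. The noise covariances Q and R are assumed known, the helper names are chosen for illustration, and the dual filter that learns the parameters (described above) is omitted.

import numpy as np

def build_joint(F, B, G, D, H, s, k, n):
    # Assemble the joint matrices of Equation 3.3 from the per-layer blocks
    F_tilde = np.block([
        [np.eye(s),        np.zeros((s, k)), np.zeros((s, n))],
        [D,                G,                np.zeros((k, n))],
        [np.zeros((n, s)), B,                F]])
    H_tilde = np.hstack([np.zeros((H.shape[0], s + k)), H])
    return F_tilde, H_tilde

def hlds_kalman_step(X_est, P, y, F_tilde, H_tilde, Q, R):
    # One predict/update cycle on the joint state X = [z; u; x] of Equation 3.4
    X_pred = F_tilde @ X_est
    P_pred = F_tilde @ P @ F_tilde.T + Q
    S = H_tilde @ P_pred @ H_tilde.T + R
    K = P_pred @ H_tilde.T @ np.linalg.inv(S)
    X_new = X_pred + K @ (y - H_tilde @ X_pred)
    P_new = (np.eye(P.shape[0]) - K @ H_tilde) @ P_pred
    return X_new, P_new

In the full algorithm, a second filter of the same form runs in parallel over the vectorized matrices, as described above.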


Chapter 4

Work Plan

4.1 Calendarization

The work plan for this thesis is illustrated in the Gantt chart in Figure 4.1. The thesis officially starts on the 18th of February and the due date is considered to be the 29th of June, which corresponds to a total available time of 131 days. The constituent tasks are the following:

• Web page development — As a requirement of Preparação da Dissertação, a personal website with weekly reports has to be created and updated during the thesis development. This will be carried out for the entire duration of the work, except for the time reserved for the writing;

• Implementation of the hierarchical model under study — This will be the starting point of the thesis; everything will be built on top of this initial system. The time expected to complete this task is 4 weeks;

• Testing the model performance on musical data — The results will be compared with the ones shown in the original papers. This is essential for verifying the correctness of the implementation. The time expected to complete this task is 2 weeks;

• Adaptation and implementation of the HLDS algorithm for speech — This is the point where most difficulties are expected to appear. The algorithm is not expected to perform well without some modification, due to the convergence time needed until a cluster is reached. The time expected to complete this task is 4 weeks;

• Testing the new model performance on speech data — The experimental results will be measured and compared with state-of-the-art algorithms. This will make it possible to test new ideas by observing what does and does not improve the performance of the algorithm. The time expected to complete this task is 4 weeks;

• Writing the thesis and scientific article — This is the last stage of the project and will consist in writing a report describing in detail all the work done, the experiments made and the results obtained. The writing of a scientific article is also expected. The time expected to complete this task is at least 4 weeks. However, depending on the workload, this task might begin earlier, in parallel with the remaining work.

Figure 4.1: Gantt Chart

4.2 Tools and Resources

The implementation of the algorithms and experiments will be done in Matlab. For the first stage of the thesis, which consists in implementing the initial algorithm and testing it on music, the "Musical Instrument Samples" from the University of Iowa Electronic Music Studios (iow, 1997) will be used, as it was the database employed in the original papers. For the adaptation and testing of the algorithm for speech, the TIMIT Speech Corpus (Garofolo et al., 1993) will be the database of choice. The Matlab toolboxes used will be MatlabADT (MATLAB Audio Database Toolbox), for easy access to the TIMIT database, and the Auditory Toolbox, for generating the auditory filters required by the algorithms.

4.3 Work Done

So far, the algorithm has been studied and its implementation is about halfway through. Moreover, the required tools and databases have been acquired. There was also a Skype meeting with the authors of the algorithm to clarify some implementation details.


References

University of Iowa Electronic Music Studios. "Musical Instrument Samples". theremin.music.uiowa.edu/, 1997. Accessed: 2015-02-17.

International Phonetic Association. "The International Phonetic Alphabet". internationalphoneticassociation.org/sites/default/files/IPA_chart_%28C%292005.pdf, 2005. Accessed: 2015-02-17.

ISO British Standard. 226:2003. Acoustics: normal equal-loudness-level contours. BSi, 2003.

Goktug T Cinar and Jose C Principe. Clustering of time series using a hierarchical linear dynamical system. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6741–6745. IEEE, 2014.

Goktug T Cinar, Carlos A Loza, and Jose C Principe. Hierarchical linear dynamical systems: A new model for clustering of time series. In Neural Networks (IJCNN), 2014 International Joint Conference on, pages 2464–2470. IEEE, 2014.

George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, 2012.

Li Deng. Three classes of deep learning architectures and their applications: A tutorial survey. APSIPA Transactions on Signal and Information Processing, 2012. URL http://research.microsoft.com/apps/pubs/default.aspx?id=192937.

Harvey Fletcher. Auditory patterns. Reviews of Modern Physics, 12(1):47, 1940.

John S Garofolo, Linguistic Data Consortium, et al. TIMIT: acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 1993.

Alex Graves, Navdeep Jaitly, and A-R Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278. IEEE, 2013a.

Alex Graves, A-R Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013b.

Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.


Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.

Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, and Raj Reddy (foreword). Spoken language processing: A guide to theory, algorithm, and system development. Prentice Hall PTR, 2001.

T Warren Liao. Clustering of time series data: a survey. Pattern Recognition, 38(11):1857–1874, 2005.

Richard F Lyon, Andreas G Katsiamis, and Emmanuel M Drakakis. History and future of auditory filter models. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 3809–3812. IEEE, 2010.

John Makhoul and Richard Schwartz. State of the art in continuous speech recognition. Proceedings of the National Academy of Sciences, 92(22):9956–9963, 1995.

Jun Qi, Dong Wang, Yi Jiang, and Runsheng Liu. Auditory features based on gammatone filters for robust speech recognition. In Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, pages 305–308. IEEE, 2013.

Heather L Read, Jeffery A Winer, and Christoph E Schreiner. Functional architecture of auditory cortex. Current Opinion in Neurobiology, 12(4):433–440, 2002.