
Page 1: Geometria de Distâncias Euclidianas e Aplicaçõesrepositorio.unicamp.br/jspui/bitstream/REPOSIP/306809/1/...Ficha catalográfica Universidade Estadual de Campinas Biblioteca do Instituto

Jorge Ferreira Alencar Lima

Geometria de Distâncias Euclidianas e Aplicações

CAMPINAS 2015


Cataloging record
Universidade Estadual de Campinas
Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467

Lima, Jorge Ferreira Alencar, 1986-
L628g  Geometria de distâncias euclidianas e aplicações / Jorge Ferreira Alencar Lima. – Campinas, SP : [s.n.], 2015.

Advisor: Carlile Campos Lavor.
Co-advisor: Tibérius de Oliveira e Bonates.
Doctoral thesis – Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica.

1. Geometria de distâncias. 2. Matrizes de distâncias euclidianas. 3. Escalonamento multidimensional. I. Lavor, Carlile Campos, 1968-. II. Bonates, Tibérius de Oliveira e. III. Universidade Estadual de Campinas. Instituto de Matemática, Estatística e Computação Científica. IV. Título.

Digital library information

Title in another language: Euclidean distance geometry and applications
Keywords in English: Distance geometry; Euclidean distance matrices; Multidimensional scaling
Concentration area: Applied Mathematics
Degree: Doctor in Applied Mathematics
Examination committee: Carlile Campos Lavor [Advisor]; José Mario Martínez Pérez; Douglas Soares Gonçalves; Marcos Napoleão Rabelo; Manoel Bezerra Campêlo Neto
Defense date: 23-01-2015
Graduate program: Applied Mathematics


Abstract

Euclidean distance geometry (EDG) is the study of Euclidean geometry based on the concept of distance. This is useful in several applications, where the input data consists of an incomplete set of distances and the output is a set of points in some Euclidean space realizing the given distances.

The key problem in EDG is known as the Distance Geometry Problem (DGP), where an integer 𝐾 > 0 is given, as well as a simple undirected weighted graph 𝐺 = (𝑉, 𝐸, 𝑑), whose edges are weighted by a non-negative function 𝑑. The problem consists in determining whether or not there is a (realization) function that associates the vertices of 𝑉 with coordinates of the 𝐾-dimensional Euclidean space, in such a way that those coordinates satisfy all distances given by 𝑑.

We considered both theoretical issues and applications of EDG. In theoretical terms, we proved the exact number of solutions of a subclass of DGP instances that is very important in molecular conformation problems. Moreover, we described necessary and sufficient conditions for determining whether a complete graph associated with a DGP is realizable, as well as the minimum dimension of such a realization. In practical terms, we developed an algorithm that computes such a realization and outperforms a classical algorithm from the literature. Finally, we showed a direct application of the DGP to multidimensional scaling.

Keywords: Distance Geometry, Euclidean Distance Matrices, Multidimensional Scaling.

Resumo

Euclidean Distance Geometry (EDG) is the study of Euclidean geometry based on the concept of distance. It is a theory useful in several applications, where the data consist of a set of distances and the possible solutions are points in some Euclidean space realizing the given distances.

The key problem in EDG is known as the Distance Geometry Problem (DGP): given an integer 𝐾 > 0 and a simple undirected weighted graph 𝐺 = (𝑉, 𝐸, 𝑑), whose edges are weighted by a non-negative function 𝑑, we want to determine whether there exists a (realization) function that maps the vertices of 𝑉 to coordinates in 𝐾-dimensional Euclidean space satisfying all the distance constraints given by 𝑑.


We considered both theoretical problems and applications of EDG. In theoretical terms, we proved the exact number of solutions of a class of DGPs that is very important for molecular conformation problems and, moreover, we obtained necessary and sufficient conditions for determining when a complete graph associated with a DGP is realizable and what the Euclidean space of minimum dimension for such a realization is. In practical terms, we developed an algorithm that computes such a realization in minimum dimension, with results superior to a classical algorithm from the literature. Finally, we showed a direct application of the DGP to multidimensional scaling problems.

Keywords: Distance Geometry, Euclidean Distance Matrices, Multidimensional Scaling.


Contents

Dedication
Acknowledgments
1 Introduction
2 Counting the number of solutions of KDMDGP instances
   2.1 Introduction
   2.2 Motivation
   2.3 Background material
      2.3.1 Incongruence
      2.3.2 Probability 1
      2.3.3 Partial reflections
   2.4 Counting incongruent realizations
3 An algorithm for realizing Euclidean distance matrices
   3.1 Introduction
   3.2 Some results about EDM
   3.3 Numerical Experiments
   3.4 Conclusions
4 A Distance Geometry-Based Combinatorial Approach to Multidimensional Scaling
   4.1 Introduction
      4.1.1 Application: MDS of Clustered Data
   4.2 Notation and Definitions
   4.3 Distance Geometry and Multidimensional Scaling
      4.3.1 An Approach to MDS via EDMCP
   4.4 Branch-and-Prune Algorithm for Multidimensional Scaling
   4.5 Cluster Partition-Preserving MDS
   4.6 The Backtrack Problem and a Naive Randomization Approach
   4.7 Computational Experiments
      4.7.1 Application as a Confirmatory MDS Technique
      4.7.2 BP-Based Confirmatory MDS for Large Datasets
   4.8 Conclusion
5 Conclusion
References
A Counting the number of solutions of the Discretizable Molecular Distance Geometry Problem
   A.1 Introduction
   A.2 The Euclidean Distance Matrix Completion Problem
   A.3 Counting the number of solutions of the DMDGP
B Branch-and-prune algorithm for multidimensional scaling preserving cluster partition
   B.1 Introduction
   B.2 A Cluster-Partition Preserving MDS Algorithm
   B.3 Computational Experiments
C Learning Forbidden Subtrees in Branch-and-Prune-Based MDS


To my mother and my daughter . . . .


Acknowledgments

I thank first and foremost my mother Olivia, for her unconditional support, and my daughter Nicolle, for being the main reason behind this achievement.

I am immensely grateful to my great friend Germano for the hours of work, conversation, exchange of ideas, advice and, above all, for his friendship. Likewise, I thank my friend Christianne for standing by my side through the ups and downs of these years. I also thank my friends Estevão, Felipe, Carlos, Douglas, Jardel, Mateus, Michael, Luciano, among others, for everything that happened throughout these doctoral years.

To Prof. Carlile, my gratitude for the opportunity to carry out this work, always with great respect and friendship. I will carry what I learned with me for the rest of my life.

To Prof. Tibérius, my immense thanks for the opportunity to work alongside someone who values quality and the work itself, always with exemplary respect and dedication that I intend to carry with me for the rest of my academic life.

To the professors and staff of IMECC, who directly or indirectly contributed in some way, my recognition and gratitude, especially to professors Aurélio, Márcia, Plínio, Cristiane and Laecio.

I also thank CNPq and CAPES for their financial support.


List of Figures

2.1 The action of the reflection 𝑅𝑣𝑥 in R^K.
2.2 On the left: the set 𝑋𝐷 of realizations of the graph induced by the discretization edges. On the right: the effect of the pruning edge {1, 4} on 𝑋𝐷.
3.1 A plot showing the Stress values associated with the resulting embedding for each artificial molecular instance. The instances are ordered with respect to the Stress value of isedm2.
3.2 A plot showing the Stress values obtained on protein instances. The instances are ordered with respect to the Stress value of isedm2.
C.1 Example of a failed subtree rooted at an embedding of point 𝑥𝑖.


List of Tables

3.1 Stress values obtained on Moré-Wu instances.
3.2 Stress values obtained on protein instances.
4.1 Comparison between the BP algorithm for MDS (Algorithm 4.1), rBP, and the Metric Multidimensional Scaling algorithm.
4.2 Comparison between the rBP algorithm and the Metric Multidimensional Scaling algorithm on the Parkinsons dataset using different orders of the points.
4.3 Comparison between the standard BP algorithm (Algorithm 4.1) and the proposed cluster partition-preserving BP algorithm (Algorithm 4.2).
B.1 Comparison between the standard BP algorithm and the proposed cluster-partition preserving BP algorithm.


List of Algorithms

3.1 𝐾 = edmsph(𝐷, 𝑥)
4.1 Branch-and-prune algorithm for MDS.
4.2 Cluster partition-aware branch-and-prune algorithm.
B.1 Pseudocode of the cluster-partition preserving BP algorithm.


Chapter 1

Introduction

In the first half of the twentieth century, Menger characterized several geometric concepts (for example, congruence and convexity) in terms of distances [45]. These and other results were organized, completed, and presented by Blumenthal [11], giving rise to an entire field of knowledge called Distance Geometry (DG)¹. This work concerns applications of that field, in particular the solution of its fundamental problem [36]:

Distance Geometry Problem (DGP)². Given an integer 𝐾 > 0 and a simple undirected graph 𝐺 = (𝑉, 𝐸), whose edges are weighted by a function 𝑑 : 𝐸 → R+, determine whether there exists a function 𝑥 : 𝑉 → R^K such that:

∀{𝑢, 𝑣} ∈ 𝐸, ‖𝑥(𝑢) − 𝑥(𝑣)‖ = 𝑑({𝑢, 𝑣}). (1.0.1)

Throughout this work, we write 𝑥𝑣 instead of 𝑥(𝑣) and 𝑑𝑢𝑣 (or 𝑑(𝑢, 𝑣)) instead of 𝑑({𝑢, 𝑣}). Moreover, we assume that ‖·‖ denotes the Euclidean norm, which in fact places us within a subarea of DG called Euclidean Distance Geometry.

A function 𝑥 satisfying (1.0.1) is called a realization of 𝐺 in R^K. If 𝐻 is a subgraph of 𝐺 and x̄ is a realization of 𝐻, then x̄ is a partial realization of 𝐺. Given a graph 𝐺, we denote its vertex set by 𝑉(𝐺) and its edge set by 𝐸(𝐺).
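For illustration (a sketch of ours, with hypothetical names, not code from the thesis), checking that a candidate map is a realization amounts to verifying condition (1.0.1) edge by edge:

```python
import math

def is_realization(x, d, tol=1e-9):
    """Check condition (1.0.1): ||x(u) - x(v)|| = d({u, v}) for every edge.
    `x` maps vertices to coordinate tuples; `d` maps edges (u, v) to distances."""
    return all(abs(math.dist(x[u], x[v]) - w) <= tol for (u, v), w in d.items())

# A unit square in R^2; only some pairwise distances are given, since the
# DGP does not require the graph to be complete.
x = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (1.0, 1.0), 4: (0.0, 1.0)}
d = {(1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0, (4, 1): 1.0, (1, 3): math.sqrt(2)}
print(is_realization(x, d))  # -> True
```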

It is worth noting that, for Blumenthal, the fundamental problem of DG was what he called the "subset problem" [11]: finding necessary and sufficient conditions for deciding whether a given matrix is a distance matrix. Necessary conditions for the particular case of Euclidean distances were discovered implicitly by Cayley, who proved that five points in R³, four points in the plane, and three points on a line have a Cayley-Menger determinant equal to zero [16]. Some sufficient conditions were determined by Menger [44], who proved that it suffices to verify that every (𝐾 + 3) × (𝐾 + 3) square submatrix of the given matrix is itself a distance matrix [11]. The main difference is that a distance matrix represents a complete weighted graph, whereas the DGP imposes no structure on 𝐺. The first explicit mention of the DGP was probably the following [60]:

¹ From the English: Distance Geometry.
² From the English: Distance Geometry Problem.

The positioning problem arises when it is necessary to locate a set of geographically distributed objects using measurements of the distances between some object pairs.

(Yemini)

The explicit mention that only some pairs of objects have known distances marks the crucial transition from classical DG to the DGP. In the years following that publication, Yemini wrote another one on the computational complexity of some problems involving rigid graphs [59], in which he introduced the position-location problem: determining the coordinates of a set of objects in space from a sparse set of distances. This stood in contrast to the typical structural results, whose focus was determining the rigidity of given structures (see [56] and the references therein). Meanwhile, Saxe [50] introduced the DGP as a 𝐾-embedding problem and showed that it is NP-complete for 𝐾 = 1 and strongly NP-hard for 𝐾 > 1.

Interest in the DGP stems from its wide range of applications, as well as from the beauty of the associated mathematics. Moreover, the DGP has a long list of variants. Our work deals with one of these variants, called the Discretizable Distance Geometry Problem (DDGP)³:

Discretizable Distance Geometry Problem (DDGP). This is the subset of DGP instances for which there exists an order on the vertex set 𝑉(𝐺) such that:

1. a realization of the first 𝐾 vertices is given;

2. each vertex 𝑣 in a position 𝑖 > 𝐾 is adjacent to at least 𝐾 vertices that precede 𝑣 in the given order; these vertices form a 𝐾-clique and, given a partial realization x̄ of 𝐺, the realization of these vertices spans an affine subspace of dimension 𝐾 − 1.

The main idea behind the discretization is that the intersection of 𝐾 spheres in 𝐾-dimensional space can produce at most two points, under the hypothesis that their centers lie on a hyperplane but not in a (𝐾 − 2)-dimensional affine subspace. Consider 𝐾 + 1 points, 𝑢1, …, 𝑢𝐾 and 𝑣, in R^K. If the coordinates of 𝑢1, …, 𝑢𝐾 are known, as well as the distances 𝑑(𝑢𝑖, 𝑣) for 𝑖 = 1, …, 𝐾, then 𝐾 spheres can be defined and their intersection provides at most two possible positions for the point 𝑣. The definition of an order on the vertex set 𝑉(𝐺) satisfying these conditions suggests a recursive search over a binary tree containing the possible coordinates for the vertices. The binary tree of possible solutions is explored from its top, where the first 𝐾 vertices have been fixed, placing one vertex at a time. At each step, two possible positions for the current vertex 𝑣 are computed, and two new branches are added to the tree. As a consequence, the size of the tree can grow very quickly, but the presence of additional distances unrelated to the construction of the tree can help verify the feasibility of the added points. As soon as an infeasible position is found, the corresponding branch of the tree can be pruned and the search backtracks. This strategy defines an efficient algorithm called Branch-and-Prune (BP) [32]. It is worth emphasizing that the notion of infeasibility mentioned above may differ from the notions presented in [32], depending on the application at hand, as we will see in one of our applications.

³ From the English: Discretizable Distance Geometry Problem.
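As an illustration of the sphere-intersection step (our own sketch, not code from the thesis), the 𝐾 = 2 case computes the at-most-two candidate positions for a new vertex from two placed points and two distances:

```python
import math

def two_sphere_positions(c1, r1, c2, r2):
    """Candidate positions for a new vertex in R^2, given its distances r1, r2
    to two already-placed vertices c1, c2 (the K = 2 case of intersecting
    K spheres in R^K). Returns 0, 1 or 2 points."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(dx, dy)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []                      # empty intersection: prune this branch
    a = (r1 * r1 - r2 * r2 + d * d) / (2 * d)  # distance from c1 to chord midpoint
    h2 = r1 * r1 - a * a
    mx, my = c1[0] + a * dx / d, c1[1] + a * dy / d
    if h2 <= 1e-12:
        return [(mx, my)]              # tangent spheres: a single position
    h = math.sqrt(h2)
    return [(mx - h * dy / d, my + h * dx / d),
            (mx + h * dy / d, my - h * dx / d)]

# Branching step of BP: two feasible positions for the next vertex,
# here (0.5, +sqrt(3)/2) and (0.5, -sqrt(3)/2).
print(two_sphere_positions((0.0, 0.0), 1.0, (1.0, 0.0), 1.0))
```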

The next three chapters describe some contributions by the author, in collaboration with other researchers, whose aim is to show the applicability of the discretization-based approach mentioned above. Below, we briefly summarize each chapter.

In Chapter 2, we explore a particular case of the DGP, the MDGP⁴, associated with Nuclear Magnetic Resonance (NMR)⁵ experiments, which generate interatomic distances 𝑑𝑖𝑗 for certain pairs of atoms (𝑖, 𝑗) of a given protein [20]. The problem is how to use this set of distances to compute positions 𝑥1, …, 𝑥𝑛 ∈ R³ for the atoms that form the molecule [18]. A simple undirected weighted graph 𝐺 = (𝑉, 𝐸, 𝑑) can be associated with the problem, where 𝑉 is the set of atoms, 𝐸 models the set of pairs of atoms whose Euclidean distances are known, and the function 𝑑 : 𝐸 → R+ assigns a distance value to each pair in 𝐸. Thus, we can define the MDGP formally as:

Given a simple undirected weighted graph 𝐺 = (𝑉, 𝐸, 𝑑), is there a function 𝑥 : 𝑉 → R³ such that ‖𝑥𝑖 − 𝑥𝑗‖ = 𝑑𝑖𝑗, ∀(𝑖, 𝑗) ∈ 𝐸?

By exploiting some rigidity properties of the graph 𝐺, the search space can be discretized; the resulting subset of MDGP instances is called the DMDGP⁶, which is nothing more than the particular case of the DDGP with 𝐾 = 3, to which the following extra condition is added in item 2 of the DDGP definition: the set of adjacent predecessors of each vertex 𝑣 in a position 𝑖 > 3 contains at least the 3 vertices immediately preceding it in the order. As a first contribution [39], we propose a way of counting the number of solutions of a given DMDGP instance, based on the symmetry properties established in [42].

In Chapter 3, we deal with one of the most classical problems in Distance Geometry: the distance fitting problem, which has been studied since the first decades of the last century [36]. Given a nonnegative matrix with zero diagonal, one wants to determine whether or not it is a Euclidean distance matrix and, if so, to find a set of points realizing this matrix in a Euclidean space of the smallest possible dimension.

There are several works in the literature on this type of problem [52, 36], and various algorithms have been proposed [19, 21, 53]. Some of them require the "minimum" dimension as input, while others can prove quite sensitive in practical cases.

As a contribution, we designed an algorithm that solves this problem requiring only the matrix as input and identifying whether it is a Euclidean distance matrix. If it is, a realization and the minimum dimension are returned. The algorithm is based on the problem of determining the intersection of 𝐾 spheres in R^K, where 𝐾 varies throughout the algorithm [4].
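For context, the classical criterion for this problem (Schoenberg's condition, which also underlies classical MDS) can be sketched in a few lines of NumPy. This is the textbook check, not the sphere-intersection algorithm developed in the thesis:

```python
import numpy as np

def classical_edm_check(D, tol=1e-8):
    """Classical (Schoenberg/Torgerson) test: D (n x n, plain distances) is a
    Euclidean distance matrix iff the doubly centered matrix
    G = -J (D∘D) J / 2 is positive semidefinite; rank(G) is then the minimum
    embedding dimension, and a realization comes from the eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    G = -0.5 * J @ (D * D) @ J
    w, V = np.linalg.eigh(G)                  # eigenvalues in ascending order
    if w.min() < -tol:
        return False, None, None              # not an EDM
    k = int((w > tol).sum())                  # minimum dimension
    X = V[:, -k:] * np.sqrt(w[-k:])           # rows are points in R^k
    return True, k, X

# Four corners of a unit square: an EDM realizable in dimension 2.
D = np.array([[0, 1, np.sqrt(2), 1],
              [1, 0, 1, np.sqrt(2)],
              [np.sqrt(2), 1, 0, 1],
              [1, np.sqrt(2), 1, 0]])
ok, k, X = classical_edm_check(D)
print(ok, k)  # -> True 2
```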

⁴ From the English: Molecular Distance Geometry Problem.
⁵ From the English: Nuclear Magnetic Resonance.
⁶ From the English: Discretizable Molecular Distance Geometry Problem.

In Chapter 4, we develop a Multidimensional Scaling (MDS)⁷ technique associated with the following problem: given information about the dissimilarities between pairs of 𝑛 objects of a given set, find a low-dimensional representation of these objects that minimizes a loss function measuring the error between the original dissimilarities and the distances resulting from the low-dimensional embedding [14]. This low-dimensional representation is usually called an MDS representation.
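A minimal version of such a loss function, raw stress, can be sketched as follows (an illustration of ours with made-up data; practical MDS uses normalized or weighted variants):

```python
import math

def raw_stress(delta, X):
    """Raw stress: sum over pairs of (dissimilarity - embedded distance)^2.
    `delta` maps pairs (i, j) to dissimilarities; `X` maps objects to points."""
    return sum((delta[(i, j)] - math.dist(X[i], X[j])) ** 2 for (i, j) in delta)

# Three objects with the dissimilarities of an equilateral triangle of side 1,
# embedded (imperfectly) on a line in R^1.
delta = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
X = {0: (0.0,), 1: (1.0,), 2: (0.5,)}
print(raw_stress(delta, X))  # -> 0.5  (= 0 + 0.25 + 0.25)
```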

Consider a set of points in R^N to which a clustering procedure (for example, 𝑘-means) has been applied. Applying a standard MDS procedure does not guarantee that, if the previously used clustering method is applied again to the MDS representation, a clustering structure similar to the one obtained for the original dataset will be recovered.

Attempts to integrate MDS and clustering into a single technique already exist in the literature (Cluster Differences Scaling (CDS)⁸ is one such technique [27]). Unlike those methods, in which the clusters are determined during the process, our approach requires cluster information about the dataset from the start. More specifically, we assume that, in addition to the pairwise dissimilarity information, cluster membership data is given as part of the input, specifying the cluster to which each object belongs. Our goal is, given an initial clustering of the dataset, to obtain a low-dimensional representation of the data that preserves the dissimilarities while still making it possible to recover the initial cluster structure [3, 2].

In Chapter 5, we outline future work and close with the main conclusions.

⁷ From the English: Multidimensional Scaling.
⁸ From the English: Cluster Differences Scaling.

Chapter 2

Counting the number of solutions of KDMDGP instances

Leo Liberti1, Carlile Lavor2, Jorge Alencar3 and Germano Abud4

1 École Polytechnique, CNRS LIX, Paris, [email protected]
2 Universidade Estadual de Campinas, IMECC-Unicamp, Campinas, São Paulo, [email protected]
3 Instituto Federal de Educação, Ciência e Tecnologia do Sul de Minas Gerais, IFSULDEMINAS, Inconfidentes, Minas Gerais, Brazil. [email protected]
4 Universidade Federal de Uberlândia, FAMAT-UFU, Uberlândia, Minas Gerais, Brazil. [email protected]

Abstract

We discuss a method for finding the number of realizations in R𝐾 of certain simple undirectedweighted graphs.

2.1 Introduction

In this paper we deal with Euclidean realizations of weighted graphs such that the Euclidean distance between pairs of realization points is the same as the weight on the corresponding edge. The vertex sets of our graphs are assumed to be ordered in certain ways formally described below. An order is given as a rank function 𝜌 mapping the vertex set (having cardinality 𝑛) into and onto the set {1, . . . , 𝑛}. In general, for clarity of notation, we may identify a vertex with its rank, e.g. for any two vertices 𝑢, 𝑣 and an integer 𝐾, we write 𝑢 < 𝑣 or 𝑣 > 𝐾 to mean 𝜌(𝑢) < 𝜌(𝑣) and 𝜌(𝑣) > 𝐾.

The 𝐾-Discretizable Molecular Distance Geometry Problem (KDMDGP) is as follows. Given a positive integer 𝐾, a simple undirected weighted graph 𝐺 = (𝑉, 𝐸, 𝑑) where 𝑑 : 𝐸 → R+, an order < on 𝑉 such that {𝑢, 𝑣} ∈ 𝐸 for each 𝑣 > 𝐾 and 𝑣 − 𝐾 ≤ 𝑢 ≤ 𝑣 − 1, and a partial realization x̄ : {1, . . . , 𝐾} → R^K, does there exist a realization 𝑥 : 𝑉 → R^K such that

∀{𝑢, 𝑣} ∈ 𝐸, ‖𝑥𝑢 − 𝑥𝑣‖ = 𝑑𝑢𝑣 (2.1.1)


and such that 𝑥𝑣 = x̄𝑣 for each 𝑣 ∈ {1, . . . , 𝐾}? We remark that solving Eq. (2.1.1) for any given weighted graph and integer 𝐾 is known as the Distance Geometry Problem (DGP). Both the DGP [50] and the KDMDGP [41] are NP-hard, even for fixed 𝐾.

The motivation for the name (molecular) stems from the natural application to finding molecular conformation (so 𝐾 = 3). The vertices of the input graph 𝐺 are atoms, and the edges are pairs of atoms for which the distance is known. We focus on the important case of proteins: since all proteins consist of a backbone with some side chains, we consider the backbone as a natural vertex order. Since covalent bond lengths are known, and the angles between covalent bonds are also known [51], distances corresponding to pairs of atoms {𝑣 − 1, 𝑣} and {𝑣 − 2, 𝑣} are known. Nuclear Magnetic Resonance (NMR) experiments provide an estimation of distances shorter than around 6Å, which covers the case of pairs {𝑣 − 3, 𝑣} (as well as other pairs: the backbone folds in space, and it often happens that two atoms that are far apart in the order are actually close in Euclidean space) [51]. Moreover, for elementary geometrical reasons, it is always possible to fix the positions of the first, second and third atoms in the protein backbone so that the inter-atomic distances over {1, 2, 3} are satisfied. Thus, protein backbones provide natural examples of KDMDGP instances [32]. In the following, we shall partition the edge set 𝐸 into the discretization edges 𝐸𝐷 = {{𝑣 − 𝑗, 𝑣} | 𝑗 ∈ {1, . . . , 𝐾}} and the pruning edges 𝐸𝑃 = 𝐸 ∖ 𝐸𝐷. We let 𝑚 = |𝐸|.
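The edge partition just defined can be sketched as follows (illustrative code of ours, assuming 0-based vertex ranks and a hypothetical edge list):

```python
def partition_edges(edges, K):
    """Split the edge set E into discretization edges
    E_D = {(v - j, v) : 1 <= j <= K} and pruning edges E_P = E minus E_D,
    assuming each edge (u, v) is stored with u < v and 0-based ranks."""
    E_D = {(u, v) for (u, v) in edges if 1 <= v - u <= K}
    E_P = set(edges) - E_D
    return E_D, E_P

# Hypothetical instance: an initial 4-clique plus one long-range distance.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
E_D, E_P = partition_edges(edges, K=3)
print(sorted(E_P))  # -> [(0, 4)]
```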

Finding realizations for general graphs usually involves a continuous search [37], but if the graph is rigid [26] then a discrete search is possible [35]. It was observed in [31] that KDMDGP graphs are Henneberg graphs, which are known to be rigid [54]. In [40] we proposed a discrete search algorithm called Branch-and-Prune (BP), where the discretization edges are used to make sure that only a discrete set of points needs to be checked for feasibility w.r.t. Eq. (2.1.1), and the pruning edges are used to reduce the search space.

Since every vertex 𝑣 > 𝐾 is adjacent to (at least) its 𝐾 immediate predecessors, if we know the position 𝑥𝑢 of each of these predecessors 𝑢 of 𝑣, then 𝑥𝑣 is at the intersection of 𝐾 spheres in R^K [17]. Provided the strict simplex inequalities (a generalization of the triangle inequalities to R^K [31]) hold, this intersection is either empty or consists of exactly two points. This provides an inductive step to find the next vertex in the order, having placed all its predecessors. The base case is dealt with since we are given the partial realization x̄. Since at each step there may be two feasible positions 𝑥𝑣 for the next vertex 𝑣, in the worst case the BP yields an exponentially large search tree, where each node 𝑥𝑣 at level 𝑣 is a possible position for vertex 𝑣. Since the first branch occurs at level 𝐾 + 1, this worst-case tree has 2^(𝑛−𝐾) leaf nodes. Each leaf node 𝑥𝑛 corresponds to a unique path from the root 𝑥1 to 𝑥𝑛, which therefore encodes a valid realization 𝑥 = (𝑥1, . . . , 𝑥𝑛). We let 𝑋 be the set of all these realizations.
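A toy version of this search (our illustration, not the BP implementation of [40]), for 𝐾 = 1, shows both the branching into two candidate positions and the effect of a pruning edge:

```python
def bp_1d(n, d, x1=0.0, tol=1e-9):
    """Branch-and-Prune for K = 1: vertex v is placed at x[v-1] +/- d[(v-1, v)];
    any other known distance in d prunes infeasible branches. Returns all
    realizations with vertex 0 fixed at x1 (reflected copies are not removed)."""
    sols = []
    def extend(x):
        v = len(x)
        if v == n:
            sols.append(tuple(x))
            return
        for cand in (x[-1] + d[(v - 1, v)], x[-1] - d[(v - 1, v)]):
            # prune: every pruning edge {u, v} with u < v - 1 must be satisfied
            if all(abs(abs(cand - x[u]) - w) <= tol
                   for (u, vv), w in d.items() if vv == v and u < v - 1):
                extend(x + [cand])
    extend([x1])
    return sols

# Four vertices on a line; the pruning edge (0, 3) cuts the 2^(n-K) = 8 leaves
# down to the two realizations compatible with all distances.
d = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (0, 3): 3.0}
print(bp_1d(4, d))  # -> [(0.0, 1.0, 2.0, 3.0), (0.0, -1.0, -2.0, -3.0)]
```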

In this paper, we propose an efficient method for computing the cardinality of 𝑋.

2.2 Motivation

Knowing |𝑋| is important for at least two practical reasons. First, in the application of the DGP to proteomics, the set 𝑋 is of interest to biochemists, who will evaluate each potential backbone according to chemical criteria. If |𝑋| is too large, this evaluation might be too costly; on the other hand, if |𝑋| is too small, 𝑋 might not contain the "correct" backbone. This observation might sound strange to mathematicians, but one must not forget that the KDMDGP provides a model of reality, rather than being reality itself: none of the realizations in 𝑋 might be exactly correct from the point of view of the biochemical practitioner, but some may be close enough for him or her to recognize them. From a different point of view, the experimental data set usually contains errors [10], which might influence the number of realizations in 𝑋: a small |𝑋| might be evidence of wrong data.

Secondly, the class of globally rigid (also known as “uniquely realizable” [28]) graphs, i.e. those for which it can be shown that |𝑋| = 1, is interesting because in several DGP applications, such as wireless sensor localization, ensuring that sufficient distance data are known for the graph to have a unique realization is of paramount importance: recovering a large set of different possible networks obviously prevents practitioners from understanding the actual network geometry. Necessary (combinatorial) conditions for a graph to be globally rigid are given in [28]: informally speaking, if removing a certain edge from a rigid graph still yields a rigid graph, the edge is redundant; in a redundantly rigid graph, all edges are redundant. Redundant rigidity turns out to be a necessary condition for unique realizability. Although there are exact methods for verifying whether a graph is redundantly rigid for 𝐾 ∈ {1, 2}, no such method is known for higher dimensions. A randomized 𝑂(𝑛^2 𝑚) method is given in [28].


Although no necessary and sufficient condition for unique realizability is known so far, several different sufficient conditions are known. Cliques are obviously globally rigid, and the realization can be found in polynomial time [21]. Trilateration graphs are those for which there exists a vertex order where each 𝑣 > 𝐾 has at least 𝐾 + 1 adjacent predecessors: these can be shown to have a unique realization, which can be found in polynomial time [23]. The graphs occurring in the KDMDGP are a natural generalization of trilateration graphs, insofar as they require at least 𝐾 adjacent predecessors. As shown in [42], in general such graphs are not globally rigid, but the number of realizations can be counted in time 𝑂(𝑛 + 𝑚), as shown in Sect. 2.4 below; so those KDMDGP graphs that are globally rigid can be recognized in polynomial time (under some genericity assumptions, see Sect. 2.3.2). Uniquely localizable graphs possess a unique realization in a given dimension 𝐾, and no other realization for any higher value of 𝐾. It is shown in [43] that these graphs can be realized in polynomial time (up to some approximation constant) by solving a semidefinite programming problem.

2.3 Background material

Although our method for computing |𝑋| is straightforward, it rests on many known but nontrivial results, which we summarize here.

2.3.1 Incongruence

Two sets of points in R𝐾 are congruent if there is a sequence of translations, rotations and reflections that turns one into the other. Since any realizable graph has uncountably many congruent realizations, we are only interested in the number of incongruent ones. Unfortunately, the way we defined 𝑋 above (i.e. 𝑋 is the set of solutions found by the BP algorithm on KDMDGP instances) is only partially correct in this respect. Because the realizations of Henneberg graphs are rigid frameworks, each realization in 𝑋 is rigid; so the fact that the first 𝐾 vertices are fixed in given positions x̄1, . . . , x̄𝐾 eliminates rotations and translations. By [32, Thm. 2], with 𝐾 = 3 there is a “fourth-level symmetry” in 𝑋: half of the realizations in 𝑋 are reflections of the other half along the plane through x̄1, x̄2, x̄3. This was generalized in [42] for any 𝐾.

So that the definition of 𝑋 is consistent with 𝑋 being a set of incongruent realizations, we simply modify the BP algorithm to choose either of the two possible positions for 𝑥𝐾+1 (without exploring the other), and start branching from level 𝐾 + 2.

2.3.2 Probability 1

The theory supporting the BP algorithm is always based on the edge weight function 𝑑 satisfying the strict simplex inequalities (i.e. the Cayley-Menger determinant of each 𝐾-subsequence of vertices in the given order, multiplied by (−1)^(𝐾+1), is strictly positive). Otherwise, the intersection of 𝐾 spheres in R𝐾 might have uncountable cardinality, or be a singleton set. These occurrences only happen when the KDMDGP instance is YES and the values assigned to 𝑑 yield zero Cayley-Menger determinants [31], i.e. they satisfy a certain given system of polynomial equations. Such systems define manifolds of Lebesgue measure zero in R𝐾. Moreover, it is easy to prove that all points in all realizations in 𝑋 are in a ball centered at x̄1 with radius bounded by the sum of all edge distances. So, the probability of uniformly sampling 𝑑 satisfying these equations is zero. This in turn means that the probability of uniformly sampling 𝑑 such that it yields a YES KDMDGP instance satisfying the strict simplex inequalities is 1. Accordingly, we state most of our results “with probability 1”.

There are at least three related concepts in the literature. The first, genericity (in the standard sense), requires that there should be no rational polynomial satisfied by the instance data 𝑑. This condition is “too strong”, in the sense that it would require at least one value of 𝑑 to be transcendental, which makes little sense for computers. The second concept requires that all minors of the complete rigidity matrix are nontrivial [25]. The third requires that 𝑑 is a rational function contained in the (open) complement of the set of those rational functions 𝑑′ yielding zero Cayley-Menger determinants [46, 48]. The notion we employ is very similar to both the second and the third concepts.


2.3.3 Partial reflections

For any realization 𝑥 ∈ 𝑋 and 𝑣 ∈ 𝑉 with 𝑣 > 𝐾, we let 𝑅𝑣𝑥 be the reflection along the hyperplane through 𝑥𝑣−𝐾, . . . , 𝑥𝑣−1, as shown in Fig. 2.1. Now, for any 𝑣 > 𝐾, we define a partial reflection operator with respect to 𝑥 as:

𝑔𝑣(𝑥) = (𝑥1, . . . , 𝑥𝑣−1, 𝑅𝑣𝑥(𝑥𝑣), 𝑅𝑣𝑥(𝑥𝑣+1), . . . , 𝑅𝑣𝑥(𝑥𝑛)). (2.3.1)

Figure 2.1: The action of the reflection 𝑅𝑣𝑥 in R𝐾.

The partial reflection 𝑔𝑣 acts on a realization 𝑥 by reflecting all vectors from rank 𝑣 onwards. We define a product between partial reflections by setting 𝑔𝑢𝑔𝑣 = 𝑔𝑢 ∘ 𝑔𝑣 for all 𝑢, 𝑣 > 𝐾, i.e. 𝑔𝑢𝑔𝑣 is the operation consisting in applying 𝑔𝑣 first and then 𝑔𝑢 to a realization 𝑥 ∈ 𝑋. More precisely, for 𝑣 > 𝑢 > 𝐾 and 𝑥 ∈ 𝑋,

𝑔𝑢𝑔𝑣(𝑥) = 𝑔𝑢(𝑔𝑣(𝑥))
       = 𝑔𝑢(𝑥1, . . . , 𝑥𝑣−1, 𝑅𝑣𝑥(𝑥𝑣), . . . , 𝑅𝑣𝑥(𝑥𝑛))
       = (𝑥1, . . . , 𝑥𝑢−1, 𝑅𝑢𝑥(𝑥𝑢), . . . , 𝑅𝑢𝑥(𝑥𝑣−1), 𝑅𝑢𝑔𝑣(𝑥)(𝑥𝑣), . . . , 𝑅𝑢𝑔𝑣(𝑥)(𝑥𝑛))

(the case 𝑢 > 𝑣 is similar). Notice that the action of the left operand 𝑔𝑢 after rank 𝑣 does not apply 𝑅𝑢𝑥 to the components of the argument, but 𝑅𝑢𝑔𝑣(𝑥). By [41, Lemma 2], this product is commutative.
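To make the reflection operator concrete, the following numpy sketch (our own illustrative helper: the name `reflect_through` and the SVD-based normal computation are not from the paper) reflects a point of R𝐾 across the hyperplane through 𝐾 given affinely independent points, which is exactly what 𝑅𝑣𝑥 does to each component of rank at least 𝑣.

```python
import numpy as np

def reflect_through(points, p):
    """Reflect p across the hyperplane through the K rows of `points`
    (K affinely independent points in R^K): a sketch of the map R_v^x
    applied to a single component. Assumes general position."""
    A = points[1:] - points[0]       # (K-1) x K directions spanning the hyperplane
    _, _, Vt = np.linalg.svd(A)      # last right singular vector spans the null space
    normal = Vt[-1]                  # unit normal of the hyperplane
    dist = (p - points[0]) @ normal  # signed distance from p to the hyperplane
    return p - 2.0 * dist * normal
```

For instance, with 𝐾 = 2 and the hyperplane through (0, 0) and (1, 0) (the first coordinate axis), the point (0.5, 1) is sent to (0.5, −1); applying the map twice gives the point back, since each 𝑔𝑣 is its own inverse.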

Now let Γ𝐷 = {𝑔𝑣 | 𝑣 > 𝐾}, and consider the group 𝒢𝐷 = ⟨Γ𝐷⟩ generated by all possible products of elements in Γ𝐷. By [41], 𝒢𝐷 turns out to be the invariant group of the set of realizations 𝑋𝐷 consisting of all the possible realizations found by the BP algorithm on the graph 𝐺𝐷 = (𝑉, 𝐸𝐷) induced by the discretization edges (see Fig. 2.2). Our purpose is to find the invariant group 𝒢𝑃 of the set of realizations of the given graph 𝐺, which we assume to have a nontrivial set 𝐸𝑃 of pruning edges. Let the span of a pruning edge {𝑢, 𝑤} ∈ 𝐸𝑃 be the set 𝑆𝑢𝑤 = {𝑢 + 𝐾 + 1, . . . , 𝑤} (assuming 𝑢 < 𝑤; if 𝑢 > 𝑤 we let 𝑆𝑢𝑤 = 𝑆𝑤𝑢). By [41], 𝒢𝑃 is the subgroup of 𝒢𝐷 generated by

Γ𝑃 = {𝑔𝑣 | 𝑣 > 𝐾 ∧ ∀{𝑢, 𝑤} ∈ 𝐸𝑃 (𝑣 ∉ 𝑆𝑢𝑤)}. (2.3.2)

In other words, only those vertices that are not in the span of any pruning edge give rise to partial reflection operators, and these generate the pruning group 𝒢𝑃.


Figure 2.2: On the left: the set 𝑋𝐷 of realizations of the graph induced by the discretization edges. On the right: the effect of the pruning edge {1, 4} on 𝑋𝐷.

2.4 Counting incongruent realizations

By [41, Thm. 4], there is an integer ℓ such that |𝑋| = 2^ℓ with probability 1. We can easily refine the proof of this result so that it says something more precise about ℓ.

Proposition 2.4.1. With probability 1, |𝑋| = 2^|Γ𝑃|.

Proof. The following statements hold with probability 1. By [41], 𝒢𝐷 ≅ 𝐶2^(𝑛−𝐾) (where 𝐶2 is the cyclic group of order 2), so that |𝒢𝐷| = 2^(𝑛−𝐾). Since 𝒢𝑃 ≤ 𝒢𝐷, |𝒢𝑃| divides |𝒢𝐷|. By elementary group theory, |𝒢𝑃| = 2^|Γ𝑃|. By [42, Thm. 6.4], the action of 𝒢𝑃 on 𝑋 has only one orbit, i.e. 𝒢𝑃𝑥 = 𝑋 for any 𝑥 ∈ 𝑋. We remark that every partial reflection operator is an involution, i.e. 𝑔𝑣^2 = 1, and hence 𝑔𝑣^(−1) = 𝑔𝑣 for all 𝑣 > 𝐾. Thus, if 𝑔𝑥 = 𝑔′𝑥 for two 𝑔, 𝑔′ ∈ 𝒢𝑃 and 𝑥 ∈ 𝑋, then (𝑔′)^(−1)𝑔𝑥 = 𝑥, which implies 𝑔′𝑔𝑥 = 𝑥, which implies 𝑔′𝑔 = 1, whence 𝑔′ = 𝑔. This means that |𝒢𝑃𝑥| = |𝒢𝑃|. Thus, for any 𝑥 ∈ 𝑋, |𝑋| = |𝒢𝑃𝑥| = |𝒢𝑃| = 2^|Γ𝑃|.

Now, all that remains is to present an algorithm to compute |Γ𝑃|. It follows directly from the definition in Eq. (2.3.2). We let 𝑏 = (𝑏𝐾+1, . . . , 𝑏𝑛) be an array initialized so that 𝑏𝑖 = 1 for all 𝑖 ∈ {𝐾 + 1, . . . , 𝑛}. Then we scan every edge {𝑢, 𝑣} ∈ 𝐸𝑃 and, for each 𝑖 ∈ 𝑆𝑢𝑣, we set 𝑏𝑖 = 0. Finally, |Γ𝑃| = 𝑏𝐾+1 + · · · + 𝑏𝑛. This algorithm runs in 𝑂(𝑛 + 𝑚) time. We remark that, by Sect. 2.3.1, if 𝑋 is required to contain only incongruent realizations, then the first component of 𝑏 should be 𝑏𝐾+2 rather than 𝑏𝐾+1.

Obviously, if |Γ𝑃 |= 1, then the KDMDGP graph is globally rigid (with probability 1).
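The whole counting procedure fits in a few lines of Python; the sketch below (function and variable names are ours) combines Eq. (2.3.2), Proposition 2.4.1 and the incongruence remark above, counting generators starting from 𝑏𝐾+2.

```python
def count_incongruent_realizations(n, K, pruning_edges):
    """Number of incongruent realizations |X| = 2^ell (with probability 1),
    where ell = b_{K+2} + ... + b_n after zeroing out the span of every
    pruning edge {u, w}: S_uw = {u + K + 1, ..., w} (u < w)."""
    b = {v: 1 for v in range(K + 2, n + 1)}   # b_{K+2}, ..., b_n, all set to 1
    for u, w in pruning_edges:
        u, w = min(u, w), max(u, w)
        for v in range(u + K + 1, w + 1):     # vertices in the span S_uw
            if v in b:
                b[v] = 0
    return 2 ** sum(b.values())               # runs in O(n + m) overall
```

For the graph of Fig. 2.2 (𝑛 = 5, 𝐾 = 2, single pruning edge {1, 4}), this returns 2 incongruent realizations; with no pruning edges at all it returns 2^(𝑛−𝐾−1) = 4.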

Acknowledgments

Financial support is gratefully acknowledged from the French National Research Agency project ANR-10-BINF-03-08 “Bip:Bip”, and the Brazilian research agencies FAPESP, CNPq and CAPES.


Chapter 3

An algorithm for realizing Euclidean distance matrices

Jorge Alencar1, Tibérius Bonates2, Carlile Lavor3 and Leo Liberti4

1 Instituto Federal de Educação, Ciência e Tecnologia do Sul de Minas Gerais, IFSULDEMINAS, Inconfidentes, Minas Gerais, Brazil. [email protected]
2 Universidade Federal do Ceará, DEMA-UFC, Fortaleza, Ceará, Brazil. [email protected]
3 Universidade Estadual de Campinas, IMECC-Unicamp, Campinas, São Paulo, Brazil. [email protected]
4 École Polytechnique, CNRS LIX, Paris, France. [email protected]

Abstract

We present an efficient algorithm to find a realization of a (full) 𝑛 × 𝑛 Euclidean distance matrix in the smallest possible dimension. Most existing algorithms work in a given dimension: most of these can be transformed into an algorithm that finds the minimum dimension, but gain a logarithmic factor of 𝑛 in their worst-case running time. Our algorithm performs linearly in 𝑛 (and quadratically in another parameter which is fixed for most applications).

3.1 Introduction

The problem of adjusting distances among points has been studied since the first decades of the 20th century [36]. It can be formally defined as follows: let 𝐷 be an 𝑛 × 𝑛 symmetric hollow (i.e., with zero diagonal) matrix with nonnegative elements. We say that 𝐷 is a squared Euclidean Distance Matrix (EDM) if there are 𝑥1, 𝑥2, . . . , 𝑥𝑛 ∈ R𝐾, for a positive integer 𝐾, such that

𝐷(𝑖, 𝑗) = 𝐷𝑖𝑗 = ‖𝑥𝑖 − 𝑥𝑗‖2, 𝑖, 𝑗 ∈ {1, . . . , 𝑛},

where ‖·‖ denotes the Euclidean norm. The smallest 𝐾 for which such a set of points exists is called the embedding dimension of 𝐷, denoted by dim(𝐷). If 𝐷 is not an EDM, we define dim(𝐷) = ∞.


We are concerned with the problem of determining dim(𝐷) for a given symmetric hollow matrix 𝐷. If dim(𝐷) = 𝐾 < ∞, we also want to determine a sequence 𝑥 = (𝑥1, . . . , 𝑥𝑛) of 𝑛 points in R𝐾 such that 𝐷 is the EDM of 𝑥. We emphasize that 𝐷 is a full matrix.

In the literature we prevalently find efficient methods for solving a related problem, i.e. the one in which 𝐾 is given as part of the input (see e.g. [21]). Each of these algorithms can be used within a bisection search to determine the embedding dimension, incurring a multiplicative factor of 𝒪(log(𝑛)) in their running time. These algorithms also require the embedding of a clique in R𝐾, a procedure that incurs a multiplicative factor of 𝒪(𝐾^3) in their running time. Therefore, in the worst case, using these algorithms within a bisection search accomplishes the required task in 𝒪(𝑛^3 log(𝑛)). We propose an algorithm which accomplishes the required task in 𝒪(𝑛^3). If the embedding dimension is known, this reduces to linear time in 𝑛.

Our algorithm, detailed below, is based on the problem of determining the intersection of 𝐾 spheres in R𝐾, where 𝐾 varies during the algorithm. The problem of determining the intersection of spheres is well known, as are its applications, which include navigation problems, molecular conformation, network location, robotics, as well as many other problems of distance geometry (see, e.g., [36]). We also numerically compare the algorithm with an existing technique available in the literature, and show that it also does better in terms of realization quality.

3.2 Some results about EDM

It is well known [7] that a symmetric hollow matrix 𝐷 with nonnegative entries is an EDM if and only if 𝐷 is negative semidefinite on

𝑀 = {𝑥 ∈ R𝑛 : 𝑥𝑡𝑒 = 0},

the orthogonal complement of 𝑒, the 𝑛-dimensional vector of all ones. Letting 𝐽 = 𝐼𝑛 − (1/𝑛)𝑒𝑒𝑡 be the orthogonal projection matrix onto the subspace 𝑀, we can state this result as follows.

Theorem 3.2.1. Let 𝐷 be a symmetric hollow matrix with nonnegative entries. Then 𝐷 is an EDM if and only if 𝜏(𝐷) = −(1/2)𝐽𝐷𝐽 is positive semidefinite. Moreover, if 𝐷 is an EDM, then its embedding dimension is the rank of 𝜏(𝐷).

Actually, we can easily see that, if 𝐷 is an EDM, then 𝜏(𝐷) is the Gram matrix associated with the points (vectors) which realize 𝐷, i.e., the matrix of inner products of such points. Based on this result, Dattorro developed a routine to verify whether or not a matrix 𝐷 is an EDM and, if so, to determine an embedding in the least possible dimension. The routine, called isedm, was written in Matlab and can be downloaded for free at http://www.convexoptimization.com/wikimization. Due to the need for a spectral decomposition step, we can estimate the complexity order of this algorithm to be 𝒪(𝑛^3).
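As an illustration of Theorem 3.2.1, an isedm-style check can be sketched in a few lines of numpy (our own illustrative code, not Dattorro's Matlab routine): build 𝜏(𝐷), test positive semidefiniteness via a symmetric eigendecomposition, and recover the embedding dimension and a realization from the nonzero eigenpairs.

```python
import numpy as np

def tau(D):
    """tau(D) = -J D J / 2, with J the orthogonal projector onto e-perp."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ D @ J

def is_edm(D, tol=1e-9):
    """Return (True, X) with X an n x K realization of D if D is a squared
    EDM (K = rank of tau(D)), or (False, None) otherwise."""
    G = tau(D)
    w, V = np.linalg.eigh(G)                 # eigenvalues in ascending order
    if w.min() < -tol:
        return False, None                   # tau(D) is not PSD: not an EDM
    w = np.clip(w, 0.0, None)
    K = int((w > tol * max(w.max(), 1.0)).sum())   # numerical rank
    idx = np.argsort(w)[::-1][:K]            # K largest eigenpairs
    X = V[:, idx] * np.sqrt(w[idx])          # rows of X realize D (G = X X^t)
    return True, X
```

The eigendecomposition dominates the cost, which matches the 𝒪(𝑛^3) estimate above.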

Our algorithm does not rely on this classical result. In the rest of this section we present its theoretical basis. Let [𝑛] = {1, . . . , 𝑛} and [𝑛1, 𝑛2] = {𝑛1, 𝑛1 + 1, . . . , 𝑛2 − 1, 𝑛2}. Furthermore, if 𝑈, 𝑉 ⊆ [𝑛] are such that 𝑉 = {𝑣1, . . . , 𝑣𝑛1}, 𝑈 = {𝑢1, . . . , 𝑢𝑛2} and 𝐷 is an 𝑛 × 𝑛 matrix, then 𝐷(𝑉, 𝑈) = (𝑑𝑖𝑗) is the submatrix of 𝐷 such that 𝑑𝑖𝑗 = 𝐷(𝑣𝑖, 𝑢𝑗) with 𝑖 ∈ [𝑛1] and 𝑗 ∈ [𝑛2]. Given a positive integer 𝑛, we define {𝑥𝑖}𝑛𝑖=1 = {𝑥1, 𝑥2, . . . , 𝑥𝑛−1, 𝑥𝑛}.

The following is a well-known result in the literature; it provides an upper bound on the embedding dimension of a given EDM in terms of its order. For the sake of completeness, we prove it using a different approach.

Proposition 3.2.2. Let 𝐷 be an 𝑛 × 𝑛 EDM. Then dim(𝐷) ≤ 𝑛 − 1.

Proof. If 𝐷 is an 𝑛 × 𝑛 EDM, then there exists {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑚, for any 𝑚 ≥ dim(𝐷), which realizes 𝐷. Let 𝑘 be the dimension of the linear subspace generated by the vectors {𝑥𝑖 − 𝑥1}𝑛𝑖=2 ⊆ R𝑚. Since this space is a 𝑘-dimensional subspace of R𝑚, it is isomorphic to R𝑘 by a linear isometry 𝑄. Let 𝑦1 = 0 and 𝑦𝑖 = 𝑦1 + 𝑄(𝑥𝑖 − 𝑥1) for 𝑖 ∈ [𝑛]. Thus,

‖𝑦𝑖 − 𝑦𝑗‖ = ‖𝑦1 + 𝑄(𝑥𝑖 − 𝑥1) − 𝑦1 − 𝑄(𝑥𝑗 − 𝑥1)‖ = ‖𝑄(𝑥𝑖 − 𝑥𝑗)‖ = ‖𝑥𝑖 − 𝑥𝑗‖

for all 𝑖, 𝑗 ∈ [𝑛]. From this, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑘 also realizes 𝐷. Therefore dim(𝐷) ≤ 𝑘 ≤ 𝑛 − 1.

The next couple of results will help us with our main result and in establishing some interesting properties about the points that realize a given EDM.


Lemma 3.2.3. Let 𝐷 be an 𝑛 × 𝑛 EDM and {𝑥𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚, for any 𝑚 ≥ dim(𝐷), sets of points which realize 𝐷. For 𝑖, 𝑗, 𝑘 ∈ [𝑛], we have:

(𝑥𝑖 − 𝑥𝑗)𝑡(𝑥𝑘 − 𝑥𝑗) = (𝑦𝑖 − 𝑦𝑗)𝑡(𝑦𝑘 − 𝑦𝑗).

Proof. Without loss of generality, assume 𝑖 < 𝑗 < 𝑘. Let 𝐷′ be the EDM realized by both subsets of points {𝑥𝑖, 𝑥𝑗, 𝑥𝑘} and {𝑦𝑖, 𝑦𝑗, 𝑦𝑘}. From Proposition 3.2.2 there are {x̄𝑖, x̄𝑗, x̄𝑘}, {ȳ𝑖, ȳ𝑗, ȳ𝑘} ⊆ R2 which realize 𝐷′. We notice that, by the isometry used in Proposition 3.2.2, (𝑥𝑖 − 𝑥𝑗)𝑡(𝑥𝑘 − 𝑥𝑗) = (x̄𝑖 − x̄𝑗)𝑡(x̄𝑘 − x̄𝑗) and (𝑦𝑖 − 𝑦𝑗)𝑡(𝑦𝑘 − 𝑦𝑗) = (ȳ𝑖 − ȳ𝑗)𝑡(ȳ𝑘 − ȳ𝑗). Since

‖x̄𝑖 − x̄𝑗‖ = ‖ȳ𝑖 − ȳ𝑗‖,  ‖x̄𝑖 − x̄𝑘‖ = ‖ȳ𝑖 − ȳ𝑘‖,  ‖x̄𝑗 − x̄𝑘‖ = ‖ȳ𝑗 − ȳ𝑘‖,

the two triangles obtained are congruent. Therefore,

(x̄𝑖 − x̄𝑗)𝑡(x̄𝑘 − x̄𝑗) = (ȳ𝑖 − ȳ𝑗)𝑡(ȳ𝑘 − ȳ𝑗).

Thus,

(𝑥𝑖 − 𝑥𝑗)𝑡(𝑥𝑘 − 𝑥𝑗) = (x̄𝑖 − x̄𝑗)𝑡(x̄𝑘 − x̄𝑗) = (ȳ𝑖 − ȳ𝑗)𝑡(ȳ𝑘 − ȳ𝑗) = (𝑦𝑖 − 𝑦𝑗)𝑡(𝑦𝑘 − 𝑦𝑗).

We say that two subsets of points {𝑥𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚, for any 𝑚 ≥ dim(𝐷), are orthogonally similar if there is an orthogonal operator 𝑄 on R𝑚 such that 𝑄(𝑥𝑖 − 𝑥𝑗) = 𝑦𝑖 − 𝑦𝑗, for 𝑖, 𝑗 ∈ [𝑛].

Proposition 3.2.4. Let 𝐷 be an 𝑛 × 𝑛 EDM and {𝑥𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚, for any 𝑚 ≥ dim(𝐷), be subsets of points which realize 𝐷. Then {𝑥𝑖}𝑛𝑖=1 is orthogonally similar to {𝑦𝑖}𝑛𝑖=1.

Proof. We define the sets of vectors {𝑣𝑖1 = 𝑥𝑖 − 𝑥1}𝑛𝑖=2, {𝑢𝑖1 = 𝑦𝑖 − 𝑦1}𝑛𝑖=2 ⊆ R𝑚. Let 𝑇 : [𝑣𝑖1]𝑛𝑖=2 −→ [𝑢𝑖1]𝑛𝑖=2 be the linear transformation such that 𝑇(𝑣𝑖1) = 𝑢𝑖1 for 𝑖 ∈ [2, 𝑛]. If 𝑣 = ∑𝑛𝑖=2 𝑎𝑖𝑣𝑖1, then

𝑇(𝑣)𝑡𝑇(𝑣) = ∑𝑛𝑖=2 ∑𝑛𝑗=2 𝑎𝑖𝑎𝑗 𝑢𝑡𝑖1𝑢𝑗1.

By Lemma 3.2.3, we have 𝑢𝑡𝑖1𝑢𝑗1 = 𝑣𝑡𝑖1𝑣𝑗1. Thus

𝑇(𝑣)𝑡𝑇(𝑣) = ∑𝑛𝑖=2 ∑𝑛𝑗=2 𝑎𝑖𝑎𝑗 𝑢𝑡𝑖1𝑢𝑗1 = ∑𝑛𝑖=2 ∑𝑛𝑗=2 𝑎𝑖𝑎𝑗 𝑣𝑡𝑖1𝑣𝑗1 = 𝑣𝑡𝑣.

Therefore, 𝑇 is a linear isometry, i.e., an isomorphism. This implies that there is also a linear isometry 𝑇′ : ([𝑣𝑖1]𝑛𝑖=2)⊥ −→ ([𝑢𝑖1]𝑛𝑖=2)⊥, so we can define 𝑄 : R𝑚 −→ R𝑚 by 𝑄(𝑣) = 𝑇(𝑣1) + 𝑇′(𝑣2), where 𝑣 = 𝑣1 + 𝑣2 ∈ R𝑚 with 𝑣1 ∈ [𝑣𝑖1]𝑛𝑖=2 and 𝑣2 ∈ ([𝑣𝑖1]𝑛𝑖=2)⊥. Then 𝑄 is linear and

𝑄(𝑣)𝑡𝑄(𝑣) = 𝑇(𝑣1)𝑡𝑇(𝑣1) + 𝑇′(𝑣2)𝑡𝑇′(𝑣2) = 𝑣𝑡1𝑣1 + 𝑣𝑡2𝑣2 = 𝑣𝑡𝑣,

implying that 𝑄 is an orthogonal operator.

Corollary 3.2.5. Let 𝐷 be an 𝑛 × 𝑛 EDM and {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑚, for any 𝑚 ≥ dim(𝐷), a subset of points which realizes 𝐷. Then the dimension of [𝑥𝑖 − 𝑥1]𝑛𝑖=2 is equal to dim(𝐷).


In the proof of Proposition 3.2.4 we verified that, if {𝑥𝑖}𝑛𝑖=1 and {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚, for any 𝑚 ≥ dim(𝐷), are subsets of points which realize 𝐷, then the linear subspaces [𝑥𝑖 − 𝑥1]𝑛𝑖=2 and [𝑦𝑖 − 𝑦1]𝑛𝑖=2 have the same dimension. This means that any two subsets of points which realize 𝐷 generate linear subspaces of the same dimension; in particular, this holds for a subset of points in the embedding dimension.

Given an EDM of order 𝑛 + 1, the following lemma establishes that its embedding dimension is greater than the embedding dimension of its 𝑛-th leading principal submatrix by at most one.

Lemma 3.2.6. Let 𝐷 be an (𝑛 + 1) × (𝑛 + 1) EDM. If dim(𝐷([𝑛], [𝑛])) = 𝐾, then dim(𝐷) ∈ {𝐾, 𝐾 + 1}.

Proof. Let {𝑥𝑖}𝑛+1𝑖=1 be a subset of points in R𝐾+1 which realizes 𝐷. Defining the subset of vectors {𝑣𝑖1 = 𝑥𝑖 − 𝑥1}𝑛+1𝑖=2, we have that [𝑣𝑖1]𝑛𝑖=2 is a 𝐾-dimensional linear subspace, since {𝑥𝑖}𝑛𝑖=1 realizes 𝐷([𝑛], [𝑛]) and dim(𝐷([𝑛], [𝑛])) = 𝐾. Therefore, we have

[𝑣𝑖1]𝑛𝑖=2 ⊆ [𝑣𝑖1]𝑛+1𝑖=2 = [𝑣𝑖1]𝑛𝑖=2 + [𝑣(𝑛+1)1]
⇒ dim([𝑣𝑖1]𝑛𝑖=2) ≤ dim([𝑣𝑖1]𝑛+1𝑖=2) ≤ dim([𝑣𝑖1]𝑛𝑖=2) + dim([𝑣(𝑛+1)1])
⇒ dim(𝐷([𝑛], [𝑛])) ≤ dim(𝐷) ≤ dim(𝐷([𝑛], [𝑛])) + 1
⇒ 𝐾 ≤ dim(𝐷) ≤ 𝐾 + 1.

The next lemma ensures that, given a set 𝑆 ⊂ R𝑚 of 𝑛 points that realizes the 𝑛-th leading principal submatrix of an EDM of order 𝑛 + 1 and embedding dimension at most 𝑚, 𝑆 can be augmented into a realizing set for the full matrix without any change in the ambient dimension.

Lemma 3.2.7. Let 𝐷 be an (𝑛 + 1) × (𝑛 + 1) EDM, and let dim(𝐷) ≤ 𝑚. Additionally, let {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑚 be a set of points which realizes 𝐷([𝑛], [𝑛]), the 𝑛-th leading principal submatrix of 𝐷. Then there exists 𝑥𝑛+1 ∈ R𝑚 such that {𝑥𝑖}𝑛+1𝑖=1 realizes 𝐷.

Proof. Let {𝑦𝑖}𝑛+1𝑖=1 ⊆ R𝑚 be a subset of points which realizes 𝐷 and let {𝑥𝑖}𝑛𝑖=1 be a subset of points which realizes 𝐷([𝑛], [𝑛]). By Proposition 3.2.4, {𝑦𝑖}𝑛𝑖=1 and {𝑥𝑖}𝑛𝑖=1 are orthogonally similar, i.e., there is an orthogonal operator 𝑄 on R𝑚 such that 𝑄(𝑦𝑖 − 𝑦𝑗) = 𝑥𝑖 − 𝑥𝑗, for 𝑖, 𝑗 ∈ [𝑛]. If 𝑥𝑛+1 = 𝑥1 + 𝑄(𝑦𝑛+1 − 𝑦1), then

‖𝑥𝑛+1 − 𝑥𝑖‖ = ‖𝑥1 − 𝑥𝑖 + 𝑄(𝑦𝑛+1 − 𝑦1)‖ = ‖𝑄(𝑦1 − 𝑦𝑖) + 𝑄(𝑦𝑛+1 − 𝑦1)‖ = ‖𝑄(𝑦1 − 𝑦𝑖 + 𝑦𝑛+1 − 𝑦1)‖ = ‖𝑄(𝑦𝑛+1 − 𝑦𝑖)‖ = ‖𝑦𝑛+1 − 𝑦𝑖‖,

for 𝑖 ∈ [𝑛]. Therefore, {𝑥𝑖}𝑛+1𝑖=1 realizes 𝐷.

The following theorem establishes necessary and sufficient conditions for an 𝑛 × 𝑛 symmetric hollow matrix with nonnegative elements to be an EDM. If this matrix is an EDM with dim(𝐷) = 𝐾, then there exists a set of points which realizes the given matrix such that 𝐾 + 1 of them form a triangular structure in some sense, as explained below.

Theorem 3.2.8. Let 𝐾 be a positive integer and 𝐷 be an 𝑛 × 𝑛 symmetric hollow matrix with nonnegative elements, with 𝑛 ≥ 2. Then 𝐷 is an EDM with dim(𝐷) = 𝐾 if and only if there exist {𝑥𝑖}𝑛𝑖=1 ⊆ R𝐾 and an index set 𝐼 = {𝑖𝑗}𝐾+1𝑗=1 ⊆ [𝑛] such that

𝑥𝑖1 = 0,
𝑥𝑖𝑗(𝑗 − 1) ≠ 0 for 𝑗 ∈ [2, 𝐾 + 1],
𝑥𝑖𝑗(𝑖) = 0 for 𝑗 ∈ [2, 𝐾 + 1] and 𝑖 ∈ [𝑗, 𝐾],

where {𝑥𝑖}𝑛𝑖=1 realizes 𝐷.


Proof. Let 𝐾 be a positive integer and 𝐷 an 𝑛 × 𝑛 EDM such that dim(𝐷) = 𝐾. We want to prove that there exist {𝑥𝑖}𝑛𝑖=1 ⊆ R𝐾 and an index set 𝐼 = {𝑖𝑗}𝐾+1𝑗=1 ⊆ [𝑛] such that

𝑥𝑖1 = 0,
𝑥𝑖𝑗(𝑗 − 1) ≠ 0 for 𝑗 ∈ [2, 𝐾 + 1],
𝑥𝑖𝑗(𝑖) = 0 for 𝑗 ∈ [2, 𝐾 + 1] and 𝑖 ∈ [𝑗, 𝐾],

where {𝑥𝑖}𝑛𝑖=1 realizes 𝐷.

We remark that, since 𝐾 is a positive integer, 𝐷 ≠ 0. We proceed by induction on 𝑛. For 𝑛 = 2 we have

𝐷 = ( 0 𝐷(1, 2) ; 𝐷(1, 2) 0 ),

therefore dim(𝐷) = 1, and {𝑥1 = 0, 𝑥2 = √𝐷(1, 2)} ⊂ R1 with 𝐼 = {1, 2} satisfies the statement.

As induction hypothesis, suppose the statement is true for some 𝑛 ≥ 2, i.e., given an 𝑛 × 𝑛 EDM 𝐷 such that dim(𝐷) = 𝐾, there exist {𝑥𝑖}𝑛𝑖=1 ⊆ R𝐾 and an index set 𝐼 = {𝑖𝑗}𝐾+1𝑗=1 ⊆ [𝑛] with 𝐾 + 1 elements satisfying the conditions above.

Now let 𝐷 be an (𝑛 + 1) × (𝑛 + 1) EDM such that dim(𝐷) = 𝐾, so that D̄ = 𝐷([𝑛], [𝑛]) is an EDM for which, by Lemma 3.2.6, dim(D̄) = 𝑘, with 𝑘 = 𝐾 or 𝑘 = 𝐾 − 1. By the induction hypothesis, there exist {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑘 which realizes D̄ and an index set 𝐼 = {𝑖𝑗}𝑘+1𝑗=1 ⊆ [𝑛] with 𝑘 + 1 elements satisfying the conditions above (with 𝑘 in place of 𝐾). Without any loss of generality, we can assume {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑘+1: we can do that by defining the (𝑘 + 1)-st coordinate of each vector to be zero. Since dim(𝐷) ≤ 𝑘 + 1, by Lemma 3.2.7 there exists 𝑦 = (𝑦1, 𝑦2, . . . , 𝑦𝑘+1) such that {𝑥𝑖}𝑛𝑖=1 ∪ {𝑦} realizes 𝐷.

This means that 𝑦 belongs to the intersection of the spheres centered at the points {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑘+1, each with radius √𝐷(𝑖, 𝑛 + 1). Therefore, 𝑦 is a solution of the following nonlinear system:

‖𝑥1 − 𝑦‖^2 = 𝐷(1, 𝑛 + 1)
‖𝑥2 − 𝑦‖^2 = 𝐷(2, 𝑛 + 1)
...
‖𝑥𝑛 − 𝑦‖^2 = 𝐷(𝑛, 𝑛 + 1).

Reordering the equations in such a way that the 𝑗-th equation is ‖𝑥𝑖𝑗 − 𝑦‖^2 = 𝐷(𝑖𝑗, 𝑛 + 1) for 𝑗 ∈ [𝑘 + 1], we have

‖𝑥𝑖1 − 𝑦‖^2 = 𝐷(𝑖1, 𝑛 + 1)
‖𝑥𝑖2 − 𝑦‖^2 = 𝐷(𝑖2, 𝑛 + 1)
...
‖𝑥𝑖𝑘+1 − 𝑦‖^2 = 𝐷(𝑖𝑘+1, 𝑛 + 1)
‖𝑥𝑗1 − 𝑦‖^2 = 𝐷(𝑗1, 𝑛 + 1)
‖𝑥𝑗2 − 𝑦‖^2 = 𝐷(𝑗2, 𝑛 + 1)
...
‖𝑥𝑗𝑛−𝑘−1 − 𝑦‖^2 = 𝐷(𝑗𝑛−𝑘−1, 𝑛 + 1)


where {𝑗𝑖}𝑛−𝑘−1𝑖=1 = [𝑛] ∖ 𝐼. By the induction hypothesis, we know the points {𝑥𝑖𝑗}𝑘+1𝑗=1. Using this information and subtracting the first equation from the others, we obtain:

‖𝑦‖^2 = 𝐷(𝑖1, 𝑛 + 1)
𝑥𝑡𝑖2 𝑦 = 𝑏𝑖2
...
𝑥𝑡𝑖𝑘+1 𝑦 = 𝑏𝑖𝑘+1
𝑥𝑡𝑗1 𝑦 = 𝑏𝑗1
𝑥𝑡𝑗2 𝑦 = 𝑏𝑗2
...
𝑥𝑡𝑗𝑛−𝑘−1 𝑦 = 𝑏𝑗𝑛−𝑘−1

where

𝑏𝑖 = (‖𝑥𝑖‖^2 − 𝐷(𝑖, 𝑛 + 1) + 𝐷(𝑖1, 𝑛 + 1)) / 2

for 𝑖 ∈ [𝑛] − {𝑖1}. Let 𝐵 be the (𝑛 − 1) × 𝑘 matrix associated with the linear part of the nonlinear system and 𝑏 the right-hand-side vector, both ordered according to the system above. Then we can rewrite the system of equations as

‖𝑦‖^2 = 𝐷(𝑖1, 𝑛 + 1),   𝐵𝑦([𝑘]) = 𝑏.

By construction, the first 𝑘 rows of 𝐵 form a lower triangular matrix with no null elements on the diagonal; therefore the linear part of the system has only two possible outcomes: either it has a unique solution or it has no solution. If it has no solution, then the set generated by the intersection of the spheres in R𝑘+1 is empty and, thus, 𝐷 is not an EDM, which is a contradiction. So the linear part of the system has a unique solution 𝑦*.

Substituting this solution into ‖𝑦‖^2 = ‖𝑦([𝑘])‖^2 + (𝑦𝑘+1)^2 = 𝐷(𝑖1, 𝑛 + 1), we obtain

(𝑦𝑘+1)^2 = 𝐷(𝑖1, 𝑛 + 1) − ‖𝑦*‖^2.

If 𝐷(𝑖1, 𝑛 + 1) − ‖𝑦*‖^2 is negative, then the system has no solution, i.e., the intersection of the spheres in R𝑘+1 is empty, and, thus, 𝐷 is not an EDM, which is, again, a contradiction. Therefore, the difference is nonnegative. If the difference is zero, then 𝑘 = 𝐾, the last entry of each point is unnecessary, the index set of the induction hypothesis remains valid, and the statement is true.

If 𝑘 = 𝐾 − 1, then the difference is strictly positive, which yields two possible solutions, of which we choose one. We then set 𝑥𝑛+1 = 𝑦 and take 𝐼 ∪ {𝑛 + 1} as the index set, so there exist {𝑥𝑖}𝑛+1𝑖=1 ⊆ R𝑘+1 which realizes 𝐷 and an index set 𝐼 ⊆ [𝑛 + 1] with 𝐾 + 1 elements such that

𝑥𝑖1 = 0,
𝑥𝑖𝑗(𝑗 − 1) ≠ 0 for 𝑗 ∈ [2, 𝐾 + 1],
𝑥𝑖𝑗(𝑖) = 0 for 𝑗 ∈ [2, 𝐾 + 1] and 𝑖 ∈ [𝑗, 𝐾],

for {𝑖𝑗}𝐾+1𝑗=1 = 𝐼. Therefore, the statement is true.

The converse follows from the definition of an EDM and from the embedding dimension of the linear space generated by the defined set of points.

This induction process suggests an algorithm to verify whether or not a matrix 𝐷 is an EDM and, if so, to determine an embedding in the least possible dimension. The procedure is shown in Alg. 3.1, and we refer to it as edmsph, from “EDM” and “sphere”. The pseudocode makes use of a function expand(𝑥) which endows the point vectors in the sequence 𝑥 with an additional zero component. We denote the sphere centered at 𝑝 ∈ R𝐾+1 with radius 𝑟 by S𝐾(𝑝, 𝑟).

We remark that, given 𝐾 spheres in R𝐾, we assume their centers are in general position, i.e. they span a (𝐾 − 1)-dimensional affine space. Then there are at most two points in the intersection of these spheres. More specifically, there is no point if the intersection is empty, one point if the intersection lies in the (𝐾 − 1)-dimensional affine space generated by the centers, and two points if no point of the intersection lies in that affine space.

Using trilateration on the appropriately indexed points guaranteed by Thm. 3.2.8, finding Γ in Alg. 3.1 requires solving triangular linear systems of order from 2 up to the embedding dimension of the EDM, which can be carried out in time proportional to K̄^2, for K̄ ∈ [2, dim(𝐷)]. This leads to a total time of 𝒪(𝑛^3) in the worst case.

Alg. 3.1 𝐾 = edmsph(𝐷, 𝑥)
1: 𝐼 = {1, 2}
2: 𝐾 = 1
3: (𝑥1, 𝑥2) = (0, √𝐷12)
4: for 𝑖 ∈ {3, . . . , 𝑛} do
5:    Γ = ⋂𝑗∈𝐼 S𝐾(𝑥𝑗, √𝐷𝑖𝑗)
6:    if Γ = ∅ then
7:       return ∞
8:    else if Γ = {𝑝𝑖} then
9:       𝑥𝑖 = 𝑝𝑖
10:   else if Γ = {𝑝+𝑖, 𝑝−𝑖} then
11:      𝑥𝑖 = 𝑝+𝑖
12:      𝑥 ← expand(𝑥)
13:      𝐼 ← 𝐼 ∪ {𝑖}
14:      𝐾 ← 𝐾 + 1
15:   else
16:      error: dim aff(span(𝑥𝐼)) < 𝐾 − 1
17:   end if
18: end for
19: return 𝐾
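As a concrete (and simplified) companion to the pseudocode, the following numpy sketch implements the main loop for the generic case: Γ is obtained by solving the triangular linear system from the proof of Thm. 3.2.8, the degenerate branch of line 16 is not handled, and of the two points of a two-point intersection we always keep the one with positive last coordinate. The code and its names are ours, not a reference implementation.

```python
import numpy as np

def edmsph(D, tol=1e-8):
    """Simplified sketch of Alg. 3.1: realize a squared EDM D in the
    smallest possible dimension. Returns (K, X) where the rows of X
    realize D in R^K, or (inf, None) if D is not an EDM.
    Assumes n >= 2, D[0, 1] > 0, and anchor centers in general position."""
    n = D.shape[0]
    K = 1
    X = np.zeros((n, 1))
    X[1, 0] = np.sqrt(D[0, 1])          # x_1 = 0, x_2 = sqrt(D_12)
    I = [0, 1]                          # anchor index set; x_{I[0]} = 0
    for i in range(2, n):
        # linearized sphere intersection: for each anchor j != I[0],
        # x_j^t y = (||x_j||^2 - D(j, i) + D(I[0], i)) / 2
        B = X[I[1:]]                    # K x K, lower triangular by construction
        b = 0.5 * ((B ** 2).sum(axis=1) - D[I[1:], i] + D[I[0], i])
        y = np.linalg.solve(B, b)       # unique solution of the linear part
        res = D[I[0], i] - y @ y        # squared (K+1)-st coordinate of y
        if res < -tol:
            return np.inf, None         # spheres do not intersect: not an EDM
        if res <= tol:                  # y lies in the current space
            X[i] = y
        else:                           # the embedding dimension grows by one
            X = np.hstack([X, np.zeros((n, 1))])   # expand(x)
            X[i, :K] = y
            X[i, K] = np.sqrt(res)      # keep the "positive" of the two points
            I.append(i)
            K += 1
    return K, X
```

Each iteration solves one triangular system of order at most dim(𝐷), in line with the 𝒪(𝑛^3) worst-case bound stated above.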

3.3 Numerical Experiments

As mentioned earlier, in [19] Dattorro developed isedm, an algorithm for checking whether a given symmetric hollow matrix 𝐷 with nonnegative entries is an EDM. In what follows we compare the experimental results obtained with isedm and edmsph based on a series of tests.

In the first series of experiments, we used the constructions proposed by Moré and Wu [More1997]. Those constructions consist of structures with 𝑠^3 elements (𝑠 ∈ N) positioned on the three-dimensional lattice defined by {(𝑖1, 𝑖2, 𝑖3) ∈ R^3 | 0 ≤ 𝑖𝑘 ≤ 𝑠 − 1, 𝑘 ∈ [3]}, for 𝑠 ∈ [2, 10]. The second and third columns in Table 3.1 show the results of those experiments. Each entry reports the Stress value (the Frobenius norm) between the original EDM and the one obtained by the algorithm. In all experiments all algorithms obtained the correct embedding dimension.
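For reproducibility, the lattice instances and the Stress measure can be sketched as follows (our own helper functions with hypothetical names; `stress` is the Frobenius norm between the original EDM and the EDM of the computed realization).

```python
import numpy as np

def more_wu_edm(s):
    """Squared EDM of the s^3 points on the lattice {0, ..., s-1}^3."""
    grid = np.array([(i, j, k) for i in range(s)
                               for j in range(s)
                               for k in range(s)], dtype=float)
    diff = grid[:, None, :] - grid[None, :, :]
    return (diff ** 2).sum(axis=-1)

def stress(D, D_hat):
    """Frobenius-norm discrepancy between original and recovered EDMs."""
    return np.linalg.norm(D - D_hat)
```

By Theorem 3.2.1, the embedding dimension of such an instance is the rank of 𝜏(𝐷), which equals 3 for every 𝑠 ≥ 2.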

We can see a significant difference between the Stress values obtained by the algorithms. One possible explanation for such a difference is the sensitivity of the spectral decomposition applied in the isedm routine. The edmsph algorithm is more stable in that respect because it solves triangular linear systems in each step. In order to address this sensitivity issue of isedm, we re-implemented the algorithm using other routines, e.g. svd. This “new” routine is called isedm2 and its results on the Moré-Wu constructions are presented in the last column of Table 3.1; as can be seen, they are very similar to the ones obtained by edmsph. For the rest of the section we will use the routine isedm2 instead of the original isedm.


𝑠    edmsph           isedm            isedm2
6    4.2862 × 10−12   1.4213 × 10+02   4.3257 × 10−12
7    1.1311 × 10−11   2.2174 × 10+02   1.0699 × 10−11
8    2.6081 × 10−11   1.8635 × 10+03   2.3190 × 10−11
9    1.2382 × 10−10   1.9240 × 10+03   1.8667 × 10−10
10   1.2037 × 10−10   8.1544 × 10+03   2.8984 × 10−10

Table 3.1: Stress values obtained on Moré-Wu instances.

In the second series of experiments we generated 500 molecular instances with 1,000 points, built according to the ideas of Lavor [34], who suggested the generation of artificial molecular instances based on practical experimentation and on the constructions proposed by Moré and Wu [More1997]. Figure 3.1 presents the results obtained in this second series of experiments in a plot that shows the Stress value (i.e., the Frobenius norm between the original and obtained matrices) associated with the resulting embedding, for each dataset in our set of instances. We sorted the datasets according to the error obtained by the isedm2 algorithm. In all experiments both algorithms obtained the correct embedding dimension. The results obtained by edmsph were smaller than those obtained by isedm2 in 412 of the datasets, a figure that amounts to 82.4% of the experiments.

Figure 3.1: Stress values associated with the resulting embedding for each artificial molecular instance. The instances are ordered with respect to the Stress value of isedm2.

For the third, and last, experiment we took 69 proteins from the Protein Data Bank, with the number of atoms ranging from 356 to 1997. Table 3.2 shows the results of those experiments, and Figure 3.2 presents them in a plot that shows the Stress value associated with the resulting embedding, for each dataset in our set of instances. In Figure 3.2, we sorted the datasets according to the error obtained by the isedm2 algorithm. In all experiments both algorithms obtained the correct embedding dimension. The results obtained by edmsph were smaller than the results obtained by isedm2 in 40 of the datasets, a figure that amounts to 57.97% of the experiments.

Figure 3.2: Stress values obtained on the protein instances. The instances are ordered with respect to the Stress value of isedm2.

3.4 Conclusions

We have introduced a novel algorithm that determines whether a given symmetric hollow (i.e., zero-diagonal) matrix with nonnegative entries is an EDM. Additionally, if the matrix is indeed an EDM, the algorithm computes its embedding dimension, along with an actual embedding. Further work will focus on adapting edmsph to noisy distance data, and to the rank-deficient case, in which the centers of 𝐾 spheres in R𝐾 do not span a (𝐾 − 1)-dimensional subspace.



Protein Atoms edmsph isedm2

1AN1 1915 1.6495× 10−09 3.7035× 10−09

1BUN 1440 1.5776× 10−09 4.4280× 10−09

1C9P 1997 2.2125× 10−09 2.7637× 10−09

1CEA 1276 1.2177× 10−09 1.8378× 10−09

1CEB 1276 1.4791× 10−09 1.7857× 10−09

1CO7 1995 2.3975× 10−09 2.8108× 10−09

1CSO 1703 1.4353× 10−09 1.8663× 10−09

1CT0 1703 1.6678× 10−09 4.7817× 10−09

1CT2 1710 1.4789× 10−09 1.4745× 10−09

1CT4 1712 1.5556× 10−09 1.6678× 10−09

1DS2 1705 1.6220× 10−09 1.1859× 10−09

1EDM 643 2.8938× 10−10 3.0870× 10−10

1EN2 679 2.5515× 10−10 2.2819× 10−10

1ENM 663 2.6854× 10−10 7.0722× 10−10

1F2S 1847 1.4219× 10−09 2.7319× 10−09

1H9H 1911 1.5718× 10−09 2.1129× 10−09

1H9I 1883 1.5272× 10−09 2.4748× 10−09

1I5K 1737 2.0142× 10−09 1.9053× 10−09

1I71 683 2.5668× 10−10 1.7035× 10−10

1IQB 1276 1.8594× 10−09 6.0770× 10−09

1KIV 632 2.2024× 10−10 2.0969× 10−10

1KRN 685 3.0571× 10−10 2.1344× 10−10

1LDT 1992 1.9458× 10−09 3.1630× 10−09

1NL1 1096 9.8307× 10−10 3.0882× 10−09

1NL2 1059 8.5447× 10−10 1.6981× 10−09

1PK4 610 2.0620× 10−10 1.7962× 10−10

1PKR 641 2.5351× 10−10 2.6498× 10−10

1PMK 1242 8.7996× 10−10 8.7962× 10−10

1PML 1912 3.4932× 10−09 1.9197× 10−08

1PPE 1851 1.6840× 10−09 1.2841× 10−09

1SGD 1696 1.9804× 10−09 1.5572× 10−09

1SGE 1697 1.8138× 10−09 2.1639× 10−09

1SGN 1696 1.6647× 10−09 1.4251× 10−09

1SGP 1694 1.5422× 10−09 1.6623× 10−09

1SGQ 1693 1.5924× 10−09 1.5787× 10−09

1SGR 1697 1.3880× 10−09 2.5377× 10−09

1SGY 1700 3.2929× 10−09 2.8169× 10−09

1TAB 1904 1.8848× 10−09 2.5129× 10−09

2BTC 1855 1.6613× 10−09 2.4580× 10−09

2CMY 1798 1.3696× 10−09 3.9646× 10−09

2DOH 1538 1.6490× 10−09 5.5536× 10−09

2F3C 1954 1.7126× 10−09 3.1338× 10−09

2FYG 961 9.1740× 10−10 6.2604× 10−10

2G43 1757 3.7713× 10−09 1.0004× 10−08

2NU0 1706 1.7550× 10−09 1.9696× 10−09

2NU1 1712 1.6210× 10−09 1.4695× 10−09

2NU2 1722 2.8413× 10−09 1.1111× 10−09

2NU3 1699 2.7657× 10−09 3.4512× 10−09

2NU4 1717 1.5142× 10−09 2.4563× 10−09

2PF1 947 6.3205× 10−10 1.3097× 10−09

2PF2 1046 8.7116× 10−10 1.4785× 10−09

2PK4 630 2.1141× 10−10 2.0473× 10−10

2SGD 1696 1.6878× 10−09 1.6010× 10−09

2SGE 1697 2.4389× 10−09 2.1018× 10−09

2SGF 1699 1.9586× 10−09 1.3995× 10−09

2SGP 1695 1.4327× 10−09 1.5202× 10−09

2SGQ 1697 1.5672× 10−09 1.1469× 10−09

2SPT 1046 9.6905× 10−10 1.4284× 10−09

2STA 1901 1.5857× 10−09 3.7422× 10−09

2STB 1887 1.7552× 10−09 1.6179× 10−09

3BTW 1996 2.0703× 10−09 1.6479× 10−09

3D9T 1639 2.2721× 10−09 3.8062× 10−09

3HTC 356 3.6703× 10−10 3.0685× 10−10

3KIV 646 4.3559× 10−10 1.7527× 10−10

3SGB 1690 1.7802× 10−09 1.7192× 10−09

3SGQ 1697 1.9636× 10−09 1.6871× 10−09

4KIV 633 2.6762× 10−10 2.4183× 10−10

4SGB 1690 1.6166× 10−09 1.9758× 10−09

5HPG 1296 1.0357× 10−09 1.4561× 10−09

Table 3.2: Stress values obtained on the protein instances.



Chapter 4

A Distance Geometry-Based Combinatorial Approach to Multidimensional Scaling

Jorge Alencar¹, Tibérius Bonates² and Carlile Lavor³

¹ Instituto Federal de Educação, Ciência e Tecnologia do Sul de Minas Gerais, IFSULDEMINAS, Inconfidentes, Minas Gerais, Brazil. [email protected]
² Universidade Federal do Ceará, DEMA-UFC, Fortaleza, Ceará, Brazil. [email protected]
³ Universidade Estadual de Campinas, IMECC-Unicamp, Campinas, São Paulo, Brazil. [email protected]

Abstract

In standard Multidimensional Scaling (MDS) one is concerned with finding a low-dimensional representation of a set of 𝑛 objects, so that pairwise dissimilarities among the original objects are realized as distances in the embedded space with minimum error. We propose a combinatorial algorithm for MDS that, in addition to minimizing a usual Stress function, can accommodate additional optimization criteria, as well as side constraints associated with the underlying visualization task. Our algorithm exploits a class of Euclidean Distance Geometry problems, leading to a combinatorial procedure. We also present the theoretical bases of our method, as well as results on the existence of such a low-dimensional representation. Moreover, we show how a simple randomization strategy can be incorporated into our algorithm and substantially improve its performance. We illustrate the use of the algorithm on an application where a secondary objective function is to be minimized: the cluster membership discrepancy between a given cluster structure in the original data and the resulting cluster structure in the low-dimensional embedding. Finally, we discuss a few properties of the algorithm that make it an interesting choice for Big Data visualization.



4.1 Introduction

Multidimensional scaling (MDS) is a set of techniques concerned with variants of the following problem: given the information on pairwise dissimilarities between elements of a set of 𝑛 objects, find an embedding of the given objects in a low-dimensional Euclidean space, while minimizing a loss function that measures the error between the original dissimilarities and the distances resulting from the low-dimensional embedding [14]. Such an embedding of the given objects is usually referred to as an MDS embedding.

MDS is an important tool for visualization, with applications including psychometrics, scientific visualization and graph drawing. Every MDS technique attempts to minimize a particular objective function that measures the inaccuracies between the given dissimilarities and the distances in the embedded space [22]. Standard MDS techniques, however, are limited to minimizing a single optimization criterion and cannot account for side constraints. Thus, if a visualization task arising from a certain application domain demands a set of additional constraints to be met by any valid MDS embedding, then one is often forced to give up local optimality and use ad hoc techniques in order to produce an embedding of reasonable quality.

Among the MDS techniques that provide some support for enforcing side constraints on the resulting embedding, we mention the body of work referred to as Confirmatory MDS (see [14], [15], [13]), which includes the recently proposed Supervised Multidimensional Scaling (SMDS) [57]. In Confirmatory MDS, external constraints on the structure of the MDS embedding are integrated into an MDS algorithm by means of equations, penalties or pseudo-data [15]. The types of external constraints that have been studied include enforcing circular arrangements of points, rectangular lattice-like structures, and order constraints on the realized distances. In SMDS, the input points are assumed to be labeled with numbers from 1 to 𝑘, for some positive integer 𝑘. The actual loss function minimized by SMDS includes a typical measure of Stress and a term that promotes points labeled with higher values to be assigned higher coordinate values along every dimension of the embedding space.

In most cases, Confirmatory MDS requires the tuning of parameters such as penalty or scaling factors, which are used to specify whether one is interested in effectively enforcing the external constraints, or in attempting to produce a solution that approximates the intended structure. Deciding whether or not a Confirmatory MDS embedding is appropriate involves evaluating the increase in the Stress value with respect to a standard MDS embedding. Since most MDS techniques produce locally optimal solutions, it is important, in practice, to provide a starting embedding that partially (or approximately) satisfies the external constraints and has a good Stress value.

4.1.1 Application: MDS of Clustered Data

We illustrate our proposed technique on the following Confirmatory MDS problem. Let 𝑃 be a set of points in R𝑚 to which a clustering procedure (e.g., k-means) has been applied. The application of a standard MDS procedure to 𝑃 provides no guarantee that, if the same clustering procedure were also applied to the embedded data, a cluster partition similar to the one obtained for the original data would result. In other words, one cannot expect that the information on which points are assigned to the same cluster is preserved by standard MDS.

Cluster Differences Scaling (CDS) [27] is an example of a Confirmatory MDS technique that attempts to integrate MDS and clustering into a single procedure, in a way that resembles the problem discussed here. Given pairwise distances between a set of objects, CDS assigns objects to clusters and creates a low-dimensional representation for each cluster. Therefore, the resulting representation includes as many points as the number of clusters. The distance error is measured over the cluster representations for pairs of points that are assigned to different clusters. Another line of work relating clustering and MDS is the one described in [49]. There, an MDS embedding is determined with the property that a k-means partition of the embedded data is identical to the optimal partition in the original space given by a so-called pairwise clustering cost function. One of the advantages of such an approach is that, instead of carrying out an expensive pairwise clustering cost procedure on the original data, one can apply a standard k-means algorithm to the embedded data and recover precisely the same information.

Unlike these approaches, in which clusters are determined as part of the process, we require that a cluster partition be obtained prior to the application of MDS. More specifically, we assume that, in addition to the pairwise dissimilarity information, cluster membership data is given as part of the input, specifying to which cluster each point is assigned. The current availability of highly specialized algorithms for clustering (see, e.g., [58]) allows for



instances with very large numbers of entities and/or complex data types to be solved. Thus, it is justified to argue for an MDS algorithm that preserves the cluster partition but does not enforce the use of a specific clustering method, unlike [27] and [49]. By considering the cluster partition structure as part of the input, such an MDS algorithm can be applied to data that has been clustered with virtually any clustering algorithm. The Confirmatory MDS algorithm we present here can be applied to a variety of Confirmatory MDS scenarios, including this particular kind of cluster-preserving MDS task.

This paper is organized as follows. Section 4.2 introduces the notation and the main definitions used throughout the text. In Section 4.3, some results from distance geometry are presented and an approach to MDS based on the Euclidean Distance Matrix Completion Problem is introduced. In Sections 4.4 and 4.5, we describe an existing combinatorial algorithm for MDS and how it can be modified in order to take into account a secondary optimization criterion related to the preservation of cluster membership. In Section 4.7, we discuss the results of preliminary computational experiments carried out on a few clustering datasets. Finally, Section 4.8 highlights the main findings and discusses some future research directions.

4.2 Notation and Definitions

In what follows, we shall use the following notations and definitions:

1. [𝑛] = {1, · · · , 𝑛}, 𝑛 ∈ N.
2. [𝑛1, 𝑛2] = {𝑛1, 𝑛1 + 1, · · · , 𝑛2 − 1, 𝑛2}, 𝑛1, 𝑛2 ∈ N.
3. {𝑥𝑖}𝑛𝑖=1 = {𝑥1, 𝑥2, · · · , 𝑥𝑛−1, 𝑥𝑛}, 𝑛 ∈ N.
4. If {𝑣𝑖}𝑛𝑖=1 ⊆ R𝑚, then [𝑣𝑖]𝑛𝑖=1 is the subspace of R𝑚 spanned by the vectors 𝑣𝑖, 𝑖 = 1, . . . , 𝑛.
5. 𝑀(𝑛) is the space of real square matrices of order 𝑛.
6. Given 𝐷 = (𝑑𝑖,𝑗) ∈ 𝑀(𝑛), with 𝑇 = {𝑣𝑖}𝑝𝑖=1 and 𝑈 = {𝑢𝑖}𝑞𝑖=1, 𝐷(𝑇, 𝑈) = (𝑑𝑖𝑗) stands for the submatrix of 𝐷 such that 𝑑𝑖𝑗 = 𝑑𝑣𝑖,𝑢𝑗 for 𝑖 ∈ [𝑝] and 𝑗 ∈ [𝑞], where 𝑑𝑣𝑖,𝑢𝑗 is the Euclidean distance between 𝑣𝑖 and 𝑢𝑗. For simplicity, when 𝑇 = 𝑈, we denote 𝐷(𝑇, 𝑈) simply by 𝐷(𝑈).

All graphs in this paper are simple and undirected. We shall denote a graph by the pair 𝐺 = (𝑉, 𝐸), where 𝑉 is a finite set of vertices and the edge set 𝐸 is a family of unordered pairs of elements from 𝑉. A clique in a graph 𝐺 is a subset 𝑆 of vertices such that the subgraph induced by 𝑆, denoted by 𝐺[𝑆], is a complete graph. We shall refer to 𝑆 as a |𝑆|-clique and denote the fact that 𝐺[𝑆] is isomorphic to K|𝑆|, the complete graph on |𝑆| vertices, by 𝐺[𝑆] ≃ K|𝑆|. A clique is maximal if it is not contained in any other clique of 𝐺. We define 𝑁(𝑣) = {𝑢 ∈ 𝑉 : {𝑢, 𝑣} ∈ 𝐸} as the neighborhood of 𝑣 in 𝐺.

Finally, we shall use the fact that, given a set 𝐴 = {𝑥𝑖}𝐾+1𝑖=1 of points in R𝐾, with Euclidean distances 𝑑𝑖,𝑗 = ‖𝑥𝑖 − 𝑥𝑗‖ (𝑖, 𝑗 = 1, . . . , 𝐾 + 1), the volume of the 𝐾-dimensional simplex with vertices 𝑥1, . . . , 𝑥𝐾+1 is given by the Cayley–Menger formula [36]:

$$\Delta_K(A) = \sqrt{\frac{(-1)^{K+1}}{2^K\,(K!)^2}\,\det(D_A)},\qquad(4.2.1)$$

where

$$D_A = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1\\
1 & 0 & d_{1,2}^2 & \cdots & d_{1,K+1}^2\\
1 & d_{1,2}^2 & 0 & \cdots & d_{2,K+1}^2\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & d_{1,K+1}^2 & d_{2,K+1}^2 & \cdots & 0
\end{pmatrix}.\qquad(4.2.2)$$

In general, for any given set of points 𝐴, we shall use the notation 𝐷𝐴 to denote the matrix shown in Equation (4.2.2). We shall also use a similar notation in order to express the volume of the simplex formed by a set of points



whose actual coordinates might not be available, but whose pairwise distances are known. Let 𝐷 = (𝑑𝑖,𝑗) be a given Euclidean Distance Matrix of order 𝐾 + 1, i.e., there exists a set 𝐴 = {𝑥𝑖}𝐾+1𝑖=1 of points in R𝐾 with Euclidean distances 𝑑𝑖,𝑗 = ‖𝑥𝑖 − 𝑥𝑗‖² (𝑖, 𝑗 = 1, . . . , 𝐾 + 1). We shall denote the Cayley–Menger formula for these points using the notation Δ𝐾(𝐷), with Δ𝐾(𝐷) = Δ𝐾(𝐴).
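As a concrete illustration of formula (4.2.1), the simplex volume can be evaluated numerically from a distance matrix alone. The sketch below is not part of the original development, and the function name is ours:

```python
import math
import numpy as np

def cayley_menger_volume(D):
    """Volume of the K-simplex whose K+1 vertices have pairwise
    (unsquared) Euclidean distances D[i, j], via Equation (4.2.1)."""
    n = D.shape[0]              # n = K + 1 points
    K = n - 1
    # Bordered matrix of Equation (4.2.2): squared distances plus a
    # leading row and column of ones, with a zero in the corner.
    B = np.ones((n + 1, n + 1))
    B[0, 0] = 0.0
    B[1:, 1:] = D ** 2
    coef = (-1) ** (K + 1) / (2 ** K * math.factorial(K) ** 2)
    return math.sqrt(coef * np.linalg.det(B))

# Right triangle with legs of length 1 in R^2: area 1/2.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(round(cayley_menger_volume(D), 12))  # 0.5
```

For 𝐾 = 2 the formula reduces to the classical Heron-type area computation, which the example above reproduces.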

4.3 Distance Geometry and Multidimensional Scaling

The problem of fitting a set of known distances has been systematically studied since the early twentieth century (see [52], [31]). Formally, let 𝐷 = (𝑑𝑖,𝑗) ∈ 𝑀(𝑛) be a symmetric matrix with nonnegative entries and main diagonal elements equal to zero. 𝐷 is called a Euclidean Distance Matrix (EDM) if there exists {𝑥𝑖}𝑛𝑖=1 ⊂ R𝑟, for some positive integer 𝑟, such that

$$d_{i,j} = \|x_i - x_j\|^2,\quad i, j \in [n],\qquad(4.3.1)$$

where ‖·‖ denotes the Euclidean norm in R𝑟. In such a case, the set of points {𝑥𝑖}𝑛𝑖=1 is said to realize matrix 𝐷. The smallest integer 𝑟 for which such a set of points exists is called the embedding dimension of 𝐷, denoted by dim(𝐷).
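One standard way to decide whether a given matrix of squared distances is an EDM, and to read off dim(𝐷), is through the Gram matrix obtained by double centering (the classical MDS construction, i.e., Schoenberg's criterion). The sketch below is a generic illustration of the definition; it is not the edmsph or isedm2 algorithm of the previous chapter:

```python
import numpy as np

def embedding_dimension(D, tol=1e-9):
    """Return dim(D) if D (squared distances, symmetric, zero diagonal)
    is an EDM, or None otherwise, using the double-centered Gram matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering operator
    G = -0.5 * J @ D @ J                  # Gram matrix of centered points
    w = np.linalg.eigvalsh(G)
    scale = max(1.0, abs(w).max())
    if w[0] < -tol * scale:               # negative eigenvalue: not an EDM
        return None
    return int((w > tol * scale).sum())   # rank(G) = embedding dimension

# Vertices of a unit square: a planar configuration, so dim(D) = 2.
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
diff = pts[:, None, :] - pts[None, :, :]
D = (diff ** 2).sum(-1)
print(embedding_dimension(D))  # 2
```

A matrix whose entries violate the triangle inequality (after taking square roots) makes the Gram matrix indefinite and is rejected by the same test.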

In what follows, we present some results that help us lay the theoretical foundations for the construction of MDS models via discrete geometry techniques.

Lemma 4.3.1. Let 𝐷 be an 𝑛 × 𝑛 EDM and let {𝑥𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚, for some 𝑚 ≥ dim(𝐷), be sets of points that realize 𝐷. For 𝑖, 𝑗, 𝑘 ∈ [𝑛], we have that

$$(x_i - x_j)^t (x_k - x_j) = (y_i - y_j)^t (y_k - y_j).$$

In other words, given two sets of points {𝑥𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚 that realize a given EDM 𝐷, the planar angles formed by the vectors (𝑥𝑖 − 𝑥𝑗) and (𝑥𝑘 − 𝑥𝑗) and by the vectors (𝑦𝑖 − 𝑦𝑗) and (𝑦𝑘 − 𝑦𝑗), with 𝑖, 𝑗, 𝑘 ∈ [𝑛], are congruent.

Proof. Let us consider the triangles 𝑇𝑥 and 𝑇𝑦 formed by the points {𝑥𝑖, 𝑥𝑗, 𝑥𝑘} and {𝑦𝑖, 𝑦𝑗, 𝑦𝑘}, respectively. These triangles are congruent, since their three sides have coinciding lengths, and, consequently, they have congruent angles. Therefore, if 𝜃𝑥 is the angle formed by the vectors 𝑥𝑖 − 𝑥𝑗 and 𝑥𝑘 − 𝑥𝑗, and 𝜃𝑦 is the angle formed by the vectors 𝑦𝑖 − 𝑦𝑗 and 𝑦𝑘 − 𝑦𝑗, then

$$\frac{(y_i - y_j)^t (y_k - y_j)}{\|y_i - y_j\|\,\|y_k - y_j\|} = \cos(\theta_y) = \cos(\theta_x) = \frac{(x_i - x_j)^t (x_k - x_j)}{\|x_i - x_j\|\,\|x_k - x_j\|},$$

and the result follows, since the denominators on both sides coincide. ∎

Let {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑚1 and {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚2 be sets of points that realize the same EDM 𝐷, for 𝑚1, 𝑚2 ≥ dim(𝐷). Without loss of generality, let us assume that 𝑚1 ≥ 𝑚2 and define the set {𝑌𝑖}𝑛𝑖=1 ⊆ R𝑚1 such that 𝑌𝑖(𝑗) = 𝑦𝑖(𝑗), for 𝑗 = 1, · · · , 𝑚2, and 𝑌𝑖(𝑗) = 0, for 𝑗 > 𝑚2, with 𝑖 = 1, · · · , 𝑛. We have that {𝑌𝑖}𝑛𝑖=1 also realizes 𝐷 and

$$(Y_i - Y_j)^t (Y_k - Y_j) = (y_i - y_j)^t (y_k - y_j),\quad i, j, k = 1, \cdots, n.$$

By Lemma 4.3.1, we have

$$(x_i - x_j)^t (x_k - x_j) = (Y_i - Y_j)^t (Y_k - Y_j),\quad i, j, k = 1, \cdots, n.$$

Thus,

$$(x_i - x_j)^t (x_k - x_j) = (y_i - y_j)^t (y_k - y_j),\quad i, j, k = 1, \cdots, n.$$

This result can be stated as follows:



Lemma 4.3.2. Let 𝐷 be an 𝑛 × 𝑛 EDM and let {𝑥𝑖}𝑛𝑖=1 ⊆ R𝑚1 and {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚2, for 𝑚1, 𝑚2 ≥ dim(𝐷), be sets of points that realize 𝐷. Then,

$$(x_i - x_j)^t (x_k - x_j) = (y_i - y_j)^t (y_k - y_j),\quad i, j, k = 1, \cdots, n.$$

Lemma 4.3.2 generalizes Lemma 4.3.1: given two realizations of an EDM, the planar angles are preserved regardless of the dimensions of the Euclidean spaces in which the realizations lie.

We say that two sets of points {𝑥𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚, for any 𝑚 ≥ dim(𝐷), are orthogonally similar if there is an orthogonal operator 𝑄 on R𝑚 such that 𝑄(𝑥𝑖 − 𝑥𝑗) = 𝑦𝑖 − 𝑦𝑗, for 𝑖, 𝑗 ∈ [𝑛].

Proposition 4.3.3. Let 𝐷 be an 𝑛 × 𝑛 EDM and {𝑥𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑚, for some 𝑚 ≥ dim(𝐷), be sets of points that realize 𝐷. Then {𝑥𝑖}𝑛𝑖=1 is orthogonally similar to {𝑦𝑖}𝑛𝑖=1.

Proof. For convenience, let us define 𝑣𝑖𝑗 = 𝑥𝑖 − 𝑥𝑗 and 𝑢𝑖𝑗 = 𝑦𝑖 − 𝑦𝑗, for 𝑖, 𝑗 ∈ [𝑛] and 𝑖 ≠ 𝑗. Let us define the function 𝐹 : 𝑉 → 𝑈, where 𝑉 = [𝑣1𝑖]𝑛𝑖=2 and 𝑈 = [𝑢1𝑖]𝑛𝑖=2, such that, if $v = \sum_{i=2}^{n} a_i v_{1i}$, then $F(v) = \sum_{i=2}^{n} a_i u_{1i}$, with {𝑎𝑖}𝑛𝑖=2 ⊆ R. We claim that 𝐹 is linear. Indeed, if we take 𝑤, 𝑣 ∈ [𝑣1𝑖]𝑛𝑖=2 and 𝜆 ∈ R, then, by setting $v = \sum_{i=2}^{n} a_i v_{1i}$ and $w = \sum_{i=2}^{n} b_i v_{1i}$, we have:

$$\lambda F(v) + F(w) = \lambda\left(\sum_{i=2}^{n} a_i u_{1i}\right) + \left(\sum_{i=2}^{n} b_i u_{1i}\right) = \sum_{i=2}^{n} (\lambda a_i + b_i)\, u_{1i} = F(\lambda v + w).$$

From Lemma 4.3.2, we have that 𝐹(𝑤)ᵗ𝐹(𝑣) = 𝑤ᵗ𝑣, for any 𝑣, 𝑤 ∈ 𝑉. Thus, dim(𝑈) = dim(𝑉) and dim(𝑈⊥) = dim(𝑉⊥). Hence, there exists a linear transformation 𝑃 : 𝑉⊥ → 𝑈⊥ that takes an orthonormal basis of 𝑉⊥ into an orthonormal basis of 𝑈⊥. Therefore, 𝑃(𝑤)ᵗ𝑃(𝑣) = 𝑤ᵗ𝑣 for any 𝑣, 𝑤 ∈ 𝑉⊥. Let us now define 𝑄 : R𝑚 → R𝑚 by 𝑄(𝑣) = 𝐹(𝑣1) + 𝑃(𝑣2), where 𝑣 = 𝑣1 + 𝑣2 with 𝑣1 ∈ 𝑉 and 𝑣2 ∈ 𝑉⊥. We have that 𝑄 is linear by construction and, for any 𝑣, 𝑤 ∈ R𝑚,

$$w^t v = w_1^t v_1 + w_2^t v_2 = F(w_1)^t F(v_1) + P(w_2)^t P(v_2) = Q(w)^t Q(v),$$

i.e., 𝑄 is orthogonal. ∎
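Proposition 4.3.3 can also be verified numerically: given two realizations of the same EDM, an orthogonal operator 𝑄 sending one set of difference vectors to the other can be recovered by orthogonal Procrustes on the centered configurations. This SVD-based computation is an illustration of the statement, not the constructive proof above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                 # one realization in R^3
Q_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Y = X @ Q_true.T + rng.standard_normal(3)       # rotated/reflected + shifted copy

# Orthogonal Procrustes on the centered point sets recovers Q with
# Q(x_i - x_j) = y_i - y_j for all i, j.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
U, _, Vt = np.linalg.svd(Yc.T @ Xc)
Q = U @ Vt                                      # orthogonal by construction

print(np.allclose(Q @ Q.T, np.eye(3)))                          # True
print(np.allclose((X[:, None] - X) @ Q.T, Y[:, None] - Y))      # True
```

Here the translation drops out because only difference vectors are compared, exactly as in the definition of orthogonal similarity.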

Let 𝐺 = (𝑉, 𝐸) be an undirected graph and let us consider a weight function 𝜔 : 𝐸 → R+ on the edges of 𝐺, with 𝜔𝑖𝑗 = 𝜔({𝑣𝑖, 𝑣𝑗}), for each {𝑣𝑖, 𝑣𝑗} ∈ 𝐸. The graph embedding problem [7] consists of determining whether or not 𝐺 admits a mapping 𝜑 : 𝑉 → R𝑟, for some positive integer 𝑟, such that for each edge {𝑣𝑖, 𝑣𝑗} ∈ 𝐸 we have ‖𝜑(𝑣𝑖) − 𝜑(𝑣𝑗)‖ = 𝜔𝑖𝑗. In such a case, 𝐺 is said to be embeddable.

There is a straightforward equivalence (see [21, 36]) between the graph embedding problem and the so-called Euclidean Distance Matrix Completion Problem, which we define next.

Let 𝐷 = (𝑑𝑖,𝑗) ∈ 𝑀(𝑛) be a partial symmetric real matrix, i.e., a matrix in which some elements are specified (or fixed), while the remaining elements are non-specified (or free), and in which, whenever 𝑑𝑖𝑗 is specified, so is 𝑑𝑗𝑖, with 𝑑𝑖𝑗 = 𝑑𝑗𝑖. The problem of determining whether or not 𝐷 can be “completed” into a Euclidean distance matrix is known as the Euclidean Distance Matrix Completion Problem (EDMCP) (see [30]).

Let 𝐷 ∈ 𝑀(𝑛) be the matrix of an EDMCP instance and let 𝐺 = 𝐺(𝐷) be the graph of the corresponding graph embedding instance. If 𝑉 ⊆ 𝑉(𝐺) is a maximal clique in 𝐺, then we shall call the submatrix 𝐷(𝑉) associated with this clique a maximal submatrix. We denote by 𝑆𝑢𝑏𝑠(𝐷) the collection of all subsets of 𝑉(𝐺) associated with



maximal submatrices of 𝐷. A matrix 𝐴 = (𝑎𝑖𝑗) ∈ 𝑀(𝑛) is said to be (2𝑘 + 1)-diagonal if 𝑎𝑖𝑗 = 0 whenever |𝑖 − 𝑗| > 𝑘. 𝐴 is a complete (2𝑘 + 1)-diagonal matrix if 𝑎𝑖𝑗 = 0 whenever |𝑖 − 𝑗| > 𝑘 or 𝑖 = 𝑗, and 𝑎𝑖𝑗 ≠ 0 otherwise.

At this point we introduce the following result:

Theorem 4.3.4. Let 𝐷 be the square matrix of an EDMCP instance. If the adjacency matrix 𝐴(𝐺) of the graph 𝐺 = 𝐺(𝐷) is complete (2𝑘 + 1)-diagonal and, for all 𝑉 ∈ 𝑆𝑢𝑏𝑠(𝐷), 𝐷(𝑉) is an EDM, then there exists a solution to the EDMCP instance and its dimension is equal to max𝑉∈𝑆𝑢𝑏𝑠(𝐷) dim(𝐷(𝑉)) ≤ 𝑘.

Proof. Without loss of generality, suppose that all maximal submatrices of 𝐷 are EDMs and that the adjacency matrix of the graph 𝐺 = 𝐺(𝐷) is complete (2𝑘 + 1)-diagonal. We proceed by induction on 𝑛 = card(𝑆𝑢𝑏𝑠(𝐷)). For 𝑛 = 1, the statement is true, since in this case 𝐷 is a complete EDM and we have a trivial instance of the EDMCP. Now suppose that the statement is true for some positive integer 𝑛, i.e., there exists a solution to the EDMCP and the minimum dimension of a solution equals max𝑉∈𝑆𝑢𝑏𝑠(𝐷) dim(𝐷(𝑉)) ≤ 𝑘. Consider the case in which card(𝑆𝑢𝑏𝑠(𝐷)) = 𝑛 + 1. Then the graph 𝐺(𝐷) has 𝑛 + 𝑘 vertices, i.e., 𝐷 is a matrix of order 𝑛 + 𝑘. Let 𝐷̄ = 𝐷([𝑛 + 𝑘 − 1]). Since the adjacency matrix of 𝐺(𝐷) is complete (2𝑘 + 1)-diagonal, the adjacency matrix of 𝐺(𝐷̄) is also complete (2𝑘 + 1)-diagonal and satisfies card(𝑆𝑢𝑏𝑠(𝐷̄)) = 𝑛. By the inductive hypothesis, the EDMCP instance associated with 𝐷̄ has a solution whose minimum dimension equals max𝑉∈𝑆𝑢𝑏𝑠(𝐷̄) dim(𝐷̄(𝑉)) ≤ 𝑘. Consequently, there exists {𝑦𝑖}𝑛+𝑘−1𝑖=1 ⊆ R𝑘 solving that instance. By assumption, 𝐷([𝑛, 𝑛 + 𝑘]) is a maximal submatrix and, hence, an EDM; thus, there exist {𝑥𝑖}𝑛+𝑘𝑖=𝑛 ⊆ R𝑘 that realize 𝐷([𝑛, 𝑛 + 𝑘]). Since the sets {𝑦𝑖}𝑛+𝑘−1𝑖=𝑛 and {𝑥𝑖}𝑛+𝑘−1𝑖=𝑛 realize the same EDM, by Proposition 4.3.3 there exists an orthogonal operator 𝑄 on R𝑘 such that 𝑄(𝑥𝑖 − 𝑥𝑗) = 𝑦𝑖 − 𝑦𝑗, for 𝑖, 𝑗 ∈ [𝑛, 𝑛 + 𝑘 − 1]. Taking 𝑦𝑛+𝑘 = 𝑦𝑛 + 𝑄(𝑥𝑛+𝑘 − 𝑥𝑛), we obtain

$$\|y_{n+k} - y_i\| = \|Q(x_{n+k} - x_i)\| = \|x_{n+k} - x_i\|,$$

for 𝑖 ∈ [𝑛, 𝑛 + 𝑘 − 1]. Consequently, {𝑦𝑖}𝑛+𝑘𝑖=1 solves the EDMCP instance. Moreover, by construction,

$$\begin{aligned}
\dim(D) &= \max\{\dim(\bar{D}),\ \dim(D([n, n+k]))\} &\text{(4.3.2)}\\
&= \max\Big\{\max_{V \in Subs(\bar{D})} \dim(\bar{D}(V)),\ \dim(D([n, n+k]))\Big\} &\text{(4.3.3)}\\
&= \max_{V \in Subs(D)} \dim(D(V)) \le k, &\text{(4.3.4)}
\end{aligned}$$

and the proof is complete. ∎

Unfortunately, this result does not offer us a better way of solving the EDMCP than the existing methods based on continuous techniques [37]. We now introduce a related problem, the Discretizable Distance Geometry Problem, discussed in [36] and [37], which will allow us to solve the EDMCP via a combinatorial procedure. Before introducing the problem, we need a few additional definitions.

Consider a graph 𝐺 = (𝑉, 𝐸), a weight function 𝑑 : 𝐸 → R+, and a total order ≺ on 𝑉. Let 𝛾(𝑣) = {𝑢 ∈ 𝑉 : 𝑢 ≺ 𝑣} be the set of predecessors of 𝑣 in 𝐺 according to ≺, and let 𝜌(𝑣) = |𝛾(𝑣)| + 1 be the rank of 𝑣 given by ≺. An embedding 𝑥 : 𝑉 → R𝐾, for some positive integer 𝐾, of the vertices of 𝐺 is considered valid if

$$\forall \{u, v\} \in E,\quad \|x_u - x_v\| = d(\{u, v\}).\qquad(4.3.5)$$

We denote by 𝑈𝑣 ⊆ 𝑉 ∖ {𝑣} a subset of vertices adjacent to 𝑣 whose coordinates are known, i.e., 𝑈𝑣 ⊆ 𝑁(𝑣) and 𝑥𝑢 has been determined for each 𝑢 ∈ 𝑈𝑣.

Definition 4.3.5 (Discretizable Distance Geometry Problem (DDGP)). Given a simple undirected graph 𝐺 = (𝑉, 𝐸), a weight function 𝑑 : 𝐸 → R+, an integer 𝐾 > 0, a total order ≺ over 𝑉 satisfying:

1. ∀𝑣 ∈ 𝑉 : 𝜌(𝑣) > 𝐾 ⇒ |𝑁(𝑣) ∩ 𝛾(𝑣)| ≥ 𝐾,
2. ∀𝑣 ∈ 𝑉, ∃𝑈𝑣 ⊆ 𝑁(𝑣) ∩ 𝛾(𝑣) : 𝐺[𝑈𝑣] ≃ K𝐾 and Δ𝐾−1(𝑈𝑣) > 0,



and a valid partial embedding 𝑥̄ : 𝑉0 = {𝑣 ∈ 𝑉 : 𝜌(𝑣) ≤ 𝐾} → R𝐾 over 𝐺[𝑉0], decide whether or not there exists a valid extension 𝑥 : 𝑉 → R𝐾 of 𝑥̄.

An alternative, weaker definition of the DDGP has been discussed in [31] and [36]. In that definition, the requirements 𝐺[𝑈𝑣] ≃ K𝐾 and Δ𝐾−1(𝑈𝑣) > 0 in item 2 are replaced simply by |𝑈𝑣| = 𝐾. This apparently less restrictive definition is derived from two facts. First, we might not have 𝐺[𝑈𝑣] ≃ K𝐾 from the beginning, since the given distances might not be enough to enforce that requirement. However, in a sequential algorithm that assigns a point to each vertex of 𝐺 along the order ≺, additional distances become available as the algorithm progresses, ensuring the required structure 𝐺[𝑈𝑣] ≃ K𝐾 by the time 𝑣 is to be processed. Nevertheless, we still do not have any guarantee that Δ𝐾−1(𝑈𝑣) > 0. The second fact supporting the alternative definition argues that, in practice, it is not necessary to enforce this condition. The argument consists of the observation that the set of instances of the DDGP that do not satisfy the strict simplex inequality (i.e., the set of instances with Δ𝐾−1(𝑈𝑣) = 0 for some 𝑣) has zero Lebesgue measure in the set of all instances of the DDGP. Consequently, the chance of encountering such an instance in practice is zero.

In this paper, we shall consider only instances of the EDMCP such as those described in Theorem 4.3.4, ensuring that we have 𝐺[𝑈𝑣] ≃ K𝐾. Thus, the definition of the DDGP presented here is not limited in terms of generality. Furthermore, we shall be interested in a subset of DDGP instances known as the Discretizable Molecular Distance Geometry Problem (DMDGP𝐾), in which the set 𝑈𝑣 ⊆ 𝑁(𝑣) ∩ 𝛾(𝑣) is formed by the 𝐾 immediate predecessors of 𝑣 in the given order ≺, for 𝑣 with 𝜌(𝑣) > 𝐾.

Thus, if we consider the conditions imposed on the EDMCP instance in Theorem 4.3.4, together with the additional requirement that Δ𝐾(𝑊) > 0 for all 𝑊 ⊆ 𝑉 such that 𝐺[𝑊] is a maximal clique in 𝐺, then we can solve an EDMCP instance via a generalization of the so-called Branch-and-Prune algorithm, as shown in [36] and discussed in Section 4.4.

4.3.1 An Approach to MDS via EDMCP

Given a positive integer 𝐾 < 𝑛, let us consider the function 𝐻𝐾 : 𝑀(𝑛) → 𝑀(𝑛) such that 𝐻𝐾(𝑀) = (ℎ𝑖𝑗) satisfies ℎ𝑖𝑗 = 𝑀(𝑖, 𝑗) whenever |𝑖 − 𝑗| ≤ 𝐾, and ℎ𝑖𝑗 = 0 otherwise. In other words, 𝐻𝐾(𝑀) is the (2𝐾 + 1)-diagonal matrix associated with 𝑀.
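The banding operator 𝐻𝐾 is straightforward to realize in code; the sketch below (the function name is ours) keeps only the entries within 𝐾 of the main diagonal:

```python
import numpy as np

def H(M, K):
    """(2K+1)-diagonal matrix associated with M: keep M[i, j] only
    when |i - j| <= K, zeroing all remaining entries."""
    i, j = np.indices(M.shape)
    return np.where(np.abs(i - j) <= K, M, 0.0)

# Example: banding a 4x4 matrix to its tridiagonal (K = 1) part.
M = np.arange(16, dtype=float).reshape(4, 4)
print(H(M, 1))
```

For 𝐾 = 1 the result is tridiagonal; increasing 𝐾 widens the retained band until, at 𝐾 = 𝑛 − 1, the matrix is returned unchanged.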

Let {𝑦𝑖}𝑛𝑖=1 ⊆ R𝑟, for some positive integer 𝑟 ≥ 𝐾, and let 𝑀 be the EDM realized by this set of points, satisfying Δ𝐾(𝐹𝑖) > 0, where 𝐹𝑖 = (𝑀𝑝𝑞), 𝑝, 𝑞 ∈ [𝑖, 𝑖 + 𝐾 − 1], 𝑖 ∈ [𝑛 − 𝐾 + 1]. We are interested in embedding those points in R𝐾, with 𝐾 < 𝑟. In order to accomplish that, we solve the EDMCP instance associated with the matrix 𝐷 = 𝐻𝐾(𝑀). From Theorem 4.3.4, there exists a solution for this EDMCP instance and, since the problem is equivalent to an instance of the DMDGP𝐾, we can obtain all its solutions by applying the Branch-and-Prune algorithm of [36]. In other words, we can generate all embeddings that satisfy the structural properties of the corresponding DMDGP𝐾 instances.

The embeddings obtained in this way, despite fitting precisely the distances in the (2𝐾 + 1)-diagonal matrix 𝐷 = 𝐻𝐾(𝑀), do retain some error with respect to the distances in the original matrix 𝑀 that are not available in 𝐷. Since in an MDS setting one is interested in finding an embedding that minimizes a certain Stress function over the entire set of known distances, it is clear that not all solutions to the equivalent DMDGP𝐾 instance are desirable in terms of Stress value. Moreover, while Theorem 4.3.4 grants the existence of solutions in R𝐾 to the EDMCP instance associated with 𝐷 = 𝐻𝐾(𝑀), not all orders on the points {𝑦𝑖}𝑛𝑖=1 satisfy the DMDGP𝐾 conditions, and not every order results in a Stress-optimal embedding. Assuming that the corresponding EDMCP instance is valid for the DMDGP𝐾, for a fixed order we can obtain at most 2^{𝑛−(𝐾+1)} possible embeddings (see, e.g., [1, 39]).

The following result characterizes how one can permute the rows and columns of a given (2𝐾 + 1)-diagonal matrix and still obtain a matrix with the same (2𝐾 + 1)-diagonal structure. Given a matrix 𝑀 ∈ 𝑀(𝑛), let us denote by 𝜎(𝑀) ∈ 𝑀(𝑛) the matrix obtained from 𝑀 by simultaneously permuting its rows and columns according to 𝜎, where 𝜎 ∈ 𝑆𝑛 and 𝑆𝑛 is the set of permutations of [𝑛].

Lemma 4.3.6. The only elements of 𝑆𝑛 that commute with 𝐻𝐾 are the identity and the permutation 𝜙 ∈ 𝑆𝑛 satisfying 𝜙(𝑖) = 𝑛 + 1 − 𝑖, for 𝑖 ∈ [𝑛].



Proof. Let us take 𝑀 ∈ 𝑀(𝑛). Let 𝜎 ∈ 𝑆𝑛 be such that 𝜎 ∘ 𝐻𝐾(𝑀) = 𝐻𝐾 ∘ 𝜎(𝑀), and let 𝐺1 = (𝑉1, 𝐸1, 𝑤1) be the graph associated with 𝐻𝐾(𝑀) and 𝐺2 = (𝑉2, 𝐸2, 𝑤2) be the graph associated with 𝐻𝐾 ∘ 𝜎(𝑀). These graphs are isomorphic via 𝜎, since we assumed that 𝐻𝐾 ∘ 𝜎(𝑀) = 𝜎 ∘ 𝐻𝐾(𝑀). Thus, because of the degree of each vertex (i.e., the number of adjacent vertices), we have that 𝜎(𝑉1(1)) = 𝑉2(1) or 𝜎(𝑉1(1)) = 𝑉2(𝑛). If 𝜎(𝑉1(1)) = 𝑉2(1), then, for the same reason, we have that 𝜎(𝑉1(𝑖)) = 𝑉2(𝑖), for 𝑖 ∈ [2, 𝑛], and therefore 𝜎 is the identity. If 𝜎(𝑉1(1)) = 𝑉2(𝑛), then, for the same reason, 𝜎(𝑉1(𝑖)) = 𝑉2(𝑛 + 1 − 𝑖), for 𝑖 ∈ [2, 𝑛]. Consequently, 𝜎 = 𝜙, as intended. ∎

Let 𝐷 = 𝐻𝐾(𝑀) be an EDMCP instance that satisfies the conditions of the DMDGP𝐾. It is noteworthy that the two permutations in Lemma 4.3.6 result in the same set of solutions with respect to the space of distances. The following result gives an upper bound on the number of possible embeddings of an EDM.

Proposition 4.3.7. Let 𝑆𝑛 be the set of all permutations of [𝑛] and let 𝑀 = (𝑚𝑖,𝑗) ∈ 𝑀(𝑛) be an EDM of points in R𝑟, for some positive integer 𝑟 ≥ 𝐾, such that 𝐷 = 𝐻𝐾(𝜎(𝑀)) is an instance of the DMDGP𝐾 for all 𝜎 ∈ 𝑆𝑛, where 𝜎(𝑀) = (𝑚𝜎(𝑖),𝜎(𝑗)) and 𝑛 > 𝐾 + 1 is a positive integer. Then there are at most 2^{𝑛−(𝐾+2)}(𝑛!) possible embeddings in R𝐾.

Proof. Take 𝐷 = 𝐻𝐾(𝜎(𝑀)), for a given permutation 𝜎 ∈ 𝑆𝑛. Since we assumed that 𝐷 is an instance of the DMDGP𝐾, we obtain a total of 2^{𝑛−(𝐾+1)} solutions for it. Since |𝑆𝑛| = 𝑛!, this gives a total of 2^{𝑛−(𝐾+1)}(𝑛!) embeddings. Furthermore, from Lemma 4.3.6, we have that 𝜙 ∘ 𝐻𝐾(𝑀) = 𝐻𝐾 ∘ 𝜙(𝑀); hence, every set of solutions is counted twice, since 𝐻𝐾(𝑀) and 𝐻𝐾 ∘ 𝜙(𝑀) represent the same EDMCP and, therefore, share the same set of solutions. Thus, the number of embeddings is at most 2^{𝑛−(𝐾+2)}(𝑛!). ∎

Thus, if we have, for instance, 10 points in R4 and we intend to embed them in R3, then there are 2^{10−5}(10!) = 116,121,600 possible solutions. Therefore, there is an obvious need for good heuristics for deciding on an order of the points, as well as for appropriate procedures for pruning and error analysis.
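The figure in this example follows directly from the bound of Proposition 4.3.7 and is easy to check:

```python
from math import factorial

def max_embeddings(n, K):
    """Upper bound 2^(n-(K+2)) * n! from Proposition 4.3.7."""
    return 2 ** (n - (K + 2)) * factorial(n)

print(max_embeddings(10, 3))  # 116121600
```

The factorial term dominates, which is why heuristics for choosing a single good order of the points are essential in practice.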

4.4 Branch-and-Prune Algorithm for Multidimensional Scaling

Let us consider a set 𝑉 = {𝑦𝑖}𝑛𝑖=1 ⊂ R𝑚 of 𝑛 points, for which pairwise Euclidean distances (to which we shall refer as dissimilarities) 𝛿𝑖𝑗 are known. From the previous results, we may construct a Branch-and-Prune (BP) algorithm for finding an MDS embedding in R𝐾 while minimizing a Stress function given by

$$S(\mathbf{x}) = \sum_{i=1}^{n}\sum_{j=1}^{n} \big(d(x_i, x_j) - \delta_{ij}\big)^2,\qquad(4.4.1)$$

where x = (𝑥1, . . . , 𝑥𝑛) is the resulting MDS representation and 𝑑(𝑥𝑖, 𝑥𝑗) stands for the Euclidean distance between points 𝑥𝑖 and 𝑥𝑗.

We note here that, when dealing with datasets with a large number of points – such as those arising in Big Data applications – the computation of (4.4.1) might become excessively demanding and, therefore, nonviable from a practical point of view. Therefore, we shall make use of the following alternative Stress function, which is related to (4.4.1) but more amenable for use with our algorithm:

S_m(x) = Σ_{1 ≤ i, j ≤ n, |i−j| ≤ m} (d(x_i, x_j) − δ_ij)²,        (4.4.2)

where m is the dimension of the points in the dataset. Indeed, Equation (4.4.2) can be computed in subquadratic time and accounts for part of the total Stress. Namely, given an order σ on the points, S_m(x) corresponds to the distance error between points whose ranks in σ differ by at most m.
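A minimal sketch of how (4.4.2) can be evaluated in O(nm) time, assuming points are stored as coordinate tuples and dissimilarities as a matrix (function and variable names are ours):

```python
def stress_band(x, delta, m):
    """Banded Stress S_m: squared errors (d(x_i, x_j) - delta[i][j])^2
    summed over all ordered pairs with |i - j| <= m.  Runs in O(n*m)
    time instead of the O(n^2) needed for the full Stress (4.4.1)."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, min(n, i + m + 1)):  # j > i and |i - j| <= m
            d = sum((a - b) ** 2 for a, b in zip(x[i], x[j])) ** 0.5
            s += 2.0 * (d - delta[i][j]) ** 2      # counts (i, j) and (j, i)
    return s
```

The factor 2 accounts for the fact that the sum in (4.4.2) ranges over both orderings of each pair; the diagonal terms (i = j) vanish and are skipped.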

The algorithm, as well as its variant presented in the next section, can produce embeddings in R^K for any integer K ≥ 2. We chose to describe their pseudocodes for the case K = 3 for the sake of clarity.

Given a total order on the original points, the BP assigns standard positions to the first three points in such a way as to exactly match the dissimilarities among them. From the fourth point on, the algorithm determines the possible coordinates of each point x_i by exactly matching distances and dissimilarities of x_i with respect to the previous 3 points in the order. It has been shown in [31] that, with probability 1, there are exactly two possible positions for each such point, considering a DMDGP_K instance. Note that, by following this procedure, additional given distances (i.e., distances between points that are more than 3 positions apart in the order) are not guaranteed to match their corresponding realized distances.

The observation that, given a partial embedding, only two positions exist for the next point in the order naturally leads to a combinatorial procedure, which is the basis of the tree-search BP algorithm. Indeed, there are two possible positions in the embedding space for the fourth point, the fifth point admits a total of 4 positions, and so on. This process results in a binary tree (to which we refer as the BP tree), along which each point is assigned a position in a sequential manner. Full MDS embeddings are available at the (n − 2)th level of the BP tree, while partial embeddings are associated with internal nodes of the tree.

Since only a subset of the given distances is enforced, different MDS embeddings produced by the algorithm might have different values of the Stress function. An implicit enumeration scheme can then be applied based on the value of the Stress function, with tree nodes whose partial Stress values are higher than the total Stress of the best known MDS embedding being removed from further investigation.

Alg. 4.1 Branch-and-prune algorithm for MDS.
Require: Distances d_{i,j} (i, j ∈ [n]) between n points {y_i}_{i=1}^{n} ⊂ R^m.
Ensure: Embedding N* in R^3.
 1: T ← {N^0}, where N^0 = (k^0, S^0, x^0) is the initial node, with standard positions for the first 3 points;
 2: N* ← (0, ∞, []);
 3: B ← ∅;
 4: while (T ≠ ∅) do
 5:   Select N^t ← (k^t, S^t, x^t) ∈ T;
 6:   T ← T ∖ {N^t};
 7:   if not prunable(N^t) then
 8:     {N^{p1}, N^{p2}} ← branch(N^t);
 9:     for (N^q ∈ {N^{p1}, N^{p2}}) do
10:       if not prunable(N^q) then
11:         if (k^q = n) then
12:           N* ← N^q;
13:           B ← B ∪ {N*};
14:         else
15:           T ← T ∪ {N^q};
16:         end if
17:       end if
18:     end for
19:   end if
20: end while

The pseudocode of the BP search is shown in Algorithm 4.1. Each node of the search tree corresponds to a specific choice of positions for the points that have been embedded so far. A tree node N^t is denoted by a tuple (k^t, S^t, x^t), with k^t being the index of the last point embedded, S^t being the value of the Stress function corresponding to the points that have been embedded so far, and x^t being the actual embedded data. Line 2 initializes the incumbent embedding N* as an empty solution, with Stress value equal to infinity. Two high-level functions are used in the pseudocode's specification. The first, branch(N^t), takes as argument a node N^t = (k^t, S^t, x^t) and returns two nodes, N^{p1} and N^{p2}, each corresponding to a possible position of the (k^t + 1)th point in the order; the returned nodes have their corresponding Stress values and partial embeddings properly defined. The function prunable(N^t) encapsulates the rule for pruning nodes of the tree that are not worth exploring, in view of the best MDS embedding obtained so far. This logic consists solely of verifying whether an incumbent MDS embedding is known at this time and comparing the values of S^t and S*, where S* is the Stress value of the incumbent embedding. If S^t ≥ S*, a value of true is returned, meaning that the current node N^t must be discarded from further consideration; otherwise, the function returns false.
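The pruning rule just described can be sketched as follows (a simplified stand-in for the prunable function of Algorithm 4.1, keeping only the Stress comparison):

```python
import math

def prunable(node_stress: float, incumbent_stress: float) -> bool:
    """Pruning rule of Algorithm 4.1: discard a node whose partial
    Stress already matches or exceeds the incumbent's total Stress.
    An absent incumbent is represented by an infinite Stress value."""
    return node_stress >= incumbent_stress

# With no incumbent yet (S* = infinity), nothing is pruned:
print(prunable(1.5, math.inf))  # False
print(prunable(2.0, 2.0))       # True
```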

4.5 Cluster Partition-Preserving MDS

We next show how to extend Algorithm 4.1 to incorporate cluster membership information, assuming that a clustering procedure was applied to the original data and that a cluster label is given for each point as part of the input data.

First, we identify among the input points a reference point for each cluster. This reference point can be, for instance, a cluster centroid (included in the data for the specific purpose of running the algorithm), or simply an original point belonging to the cluster, preferably occupying a "central" position with respect to the other points in its cluster. The only requirement on the choice of a reference point y is that the dissimilarities between y and all other points (including other reference points) are known.

Thus, based on a total order on this augmented set of input points, we can apply the BP algorithm with the caveat that nodes corresponding to MDS representations having a high number of cluster-partition discrepancies (with respect to the original partition) are pruned. A cluster-partition discrepancy is said to take place in a node (k^t, S^t, x^t) of the search tree for each point that has already been embedded, i.e., each x_i (i ≤ k^t), that is closer (in the embedded space) to the embedded reference point of a different cluster than to the embedded reference point of its own cluster. Note that, for this kind of pruning to take place, it is necessary to have some reference points already embedded. We propose to order the input points in such a way that points belonging to the same cluster are grouped together, with the reference point of each cluster preceding the remaining points of its cluster.

Algorithm 4.2 summarizes the procedure. Here, the function prunable(N^t) has a more involved logic than its counterpart in Algorithm 4.1. Namely, the value of prunable(N^t) is true if and only if one of the following conditions holds:

(i) the cluster-partition discrepancy count of 𝑁 𝑡 is larger than that of the incumbent embedding;

(ii) the cluster-partition discrepancy count of N^t is equal to that of the incumbent embedding and S^t is greater than or equal to the Stress value of the incumbent embedding.

In addition to keeping track of the incumbent embedding N*, the algorithm also maintains a set B with all incumbent embeddings found during its execution. Among all embeddings in B, the algorithm will report, as the best solution found, one with the smallest value of cluster misclassification, a concept that we introduce in what follows. Let k be the number of clusters, and let p, q ∈ [k]^n be two cluster index vectors, each of which assigns a cluster index i ∈ [k] to each point in V. In order to compare two such point-cluster assignments, we must account for a possible permutation of cluster labels: two cluster index vectors can describe the same cluster partition even though p ≠ q. Thus, we define cluster misclassification as the function d_M : [k]^n × [k]^n → Z_+, such that d_M(p, q) = min_{σ ∈ P_k} d_H(σ(p), q), where P_k is the set of permutations of [k], d_H is the Hamming distance, and σ(p) is a vector in [k]^n obtained from p via the application of σ ∈ P_k (with σ(p)_i = σ(p_i), for i = 1, ..., n). The function d_M is a pseudo-metric that allows us to assess how dissimilar the index vector p produced by a clustering procedure applied to the embedded data is with respect to the original index vector q, obtained by clustering the original data. In Algorithm 4.2, we use d_M(N^i) to denote the cluster misclassification value obtained from the embedding x^i of node N^i.
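Under the assumption that cluster labels are encoded as 0, ..., k−1, the pseudo-metric d_M can be computed by brute force over label permutations; a small sketch (the function name is ours, and the k! enumeration is only practical for small k):

```python
from itertools import permutations

def cluster_misclassification(p, q, k):
    """d_M(p, q) = min over label permutations sigma of the Hamming
    distance between sigma(p) and q.  Labels are assumed to be 0..k-1."""
    best = len(p)
    for sigma in permutations(range(k)):
        hamming = sum(1 for a, b in zip(p, q) if sigma[a] != b)
        best = min(best, hamming)
    return best

# The same partition under swapped labels has misclassification 0:
print(cluster_misclassification([0, 0, 1, 1], [1, 1, 0, 0], 2))  # -> 0
```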


Alg. 4.2 Cluster partition-aware branch-and-prune algorithm.
Require: Distances d_{i,j} (i, j ∈ [n]) between n points {y_i}_{i=1}^{n} ⊂ R^m; cluster label for each y_i (i ∈ [n]); index set of reference points.
Ensure: Embedding N* in R^3.
 1: T ← {N^0}, where N^0 = (k^0, S^0, x^0) is the initial node, with standard positions for the first 3 points;
 2: N* ← (0, ∞, []);
 3: B ← ∅;
 4: while (T ≠ ∅) do
 5:   Select N^t ← (k^t, S^t, x^t) ∈ T;
 6:   T ← T ∖ {N^t};
 7:   if not prunable(N^t) then
 8:     {N^{p1}, N^{p2}} ← branch(N^t);
 9:     for (N^q ∈ {N^{p1}, N^{p2}}) do
10:       if not prunable(N^q) then
11:         if (k^q = n) then
12:           N* ← N^q;
13:           B ← B ∪ {N*};
14:         else
15:           T ← T ∪ {N^q};
16:         end if
17:       end if
18:     end for
19:   end if
20: end while
21: N+ ← argmin{d_M(N^i) : N^i ∈ B};

4.6 The Backtrack Problem and a Naive Randomization Approach

Analogously to the classical BP algorithm for distance geometry, described in [36], our algorithm employs a backtrack-style search, i.e., it constructs solutions iteratively based on a series of decisions and, when necessary, reverts some of those decisions in order to explore alternative choices that might lead to improving solutions. The performance of backtrack-based algorithms can vary dramatically, depending on the procedure used for selecting the next variable to branch on (the "variable selection heuristic") and on the order in which the alternative decisions available at each node of the tree are explored (the "value selection heuristic"). In our algorithm, the alternative decisions available at each node correspond to the two possible positions for each point. Because we have assumed that the points are embedded according to a fixed order, the variable selection heuristic follows a predefined sequence of decisions.

Branching (i.e., variable/value) heuristics can play a central role in guiding backtrack search procedures toward regions of the search space that contain improved solutions. The variants of the BP algorithm for MDS presented so far rely on a branching heuristic that is essentially the same as the one used in the original BP algorithm for molecular conformation problems. Namely, the feasibility of each partial embedding is verified by checking the known distances against the distances calculated between the position of the i-th point and the positions already calculated for the previous points. If the embedding obtained from the most recent branching decision is feasible, the search continues from that decision onward; otherwise, the decision is reverted and the node corresponding to it is said to be pruned. If neither decision is feasible, both nodes are pruned and the search backtracks, with the previous decision being reverted. Backtracking also takes place whenever the subtree under a given node has been fully explored.

Unlike the BP algorithm used for molecular conformation problems – in which, for a decision to be feasible, all distances involved must match the given distances exactly – our BP variant for the case of MDS attempts to minimize the Stress function (4.4.2), which cannot, in general, be expected to attain a specific value. Therefore, we cannot guarantee the optimality of any given solution (embedding) without exploring, explicitly or implicitly, the entire search tree. By explicitly exploring the tree we mean enumerating all possible solutions, while implicitly exploring the tree amounts to implicitly enumerating all solutions by taking advantage of additional pruning strategies that allow discarding portions of the tree whenever it can be determined that no solution of improved Stress value can be achieved by exploring the subtree under a given node.

As a result, this simple branching heuristic can be quite ineffective in terms of exploring a substantial fraction of the search tree, especially in the case of medium to large datasets. Consider, for instance, an instance with 1,000 points. After embedding the first 900 points, we still have a subtree that contains potentially as many as 2^100 solutions. A simplistic rule for selecting the next branch can lead a backtracking procedure to spend an enormous amount of time enumerating such a subtree, without ever evaluating alternatives to the embedding of earlier points in the order adopted. Therefore, the chances of exploring – or at least sampling – other parts of the tree are low, in general.

As a means of mitigating this negative effect, we introduce two modifications to our algorithm. First, we propose a simple randomization at the point of choosing which of the two possible positions for the current point is to be explored first. In line 8 of Algorithm 4.2 we compute the two possible positions for the i-th point, with i > K + 1, based on the positions determined for the K points immediately preceding it in the order adopted. Instead of pursuing one of those positions and storing the other for future exploration, we simply select one of the positions to be explored, while the other is discarded. This amounts to performing a single "dive" in the search tree, a process that can be carried out relatively quickly. By repeating this process several times, the algorithm effectively samples the search space. The application of this simple modification showed a marked improvement, in terms of both solution quality and running time, when compared to the simpler branching heuristic derived from the original BP algorithm.

Since this modified branching strategy does not take the actual value of the Stress into account, we introduced a second modification to the algorithm in order to account for the objective function under consideration. In [39] it was shown that the two possible positions for the i-th point in a branching step of the BP algorithm result in two distinct distances between the i-th point and the (i − K − 1)-th point in the order. Choosing the position for the i-th point that corresponds to the smaller of the two distances is a locally optimal choice. Due to the local nature of the objective function (4.4.2), this choice is arguably sensible. With the purpose of including this heuristic criterion in the algorithm, while not giving up the benefits of the randomized diving, we retain the randomized heuristic described above, while limiting the number of times the random choice favors the larger of the two possible distances: at most 50% of the branching decisions of each dive were allowed to favor the larger of the two distances. The results of this mixed strategy were superior to those of pure random selection and suggest that more careful strategies can be even more beneficial. We shall refer to this modified variant of Algorithm 4.1 as rBP. Its results are shown in the first column of Table 4.1.
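A schematic sketch of one randomized dive with the 50% budget on anti-greedy choices (the geometry of computing the two candidate positions is abstracted away; names and structure are ours):

```python
import random

def random_dive(n: int, K: int, far_ratio: float = 0.5):
    """One randomized 'dive' through the BP tree: at each branching step
    pick the locally optimal ('near') position, except that up to
    far_ratio of the steps may favor the farther candidate ('far').
    Returns only the sequence of choices; position computation is
    abstracted away in this sketch."""
    steps = n - (K + 1)                 # branching decisions after the base clique
    far_budget = int(far_ratio * steps)
    choices = []
    for _ in range(steps):
        if far_budget > 0 and random.random() < 0.5:
            choices.append('far')       # against the value selection heuristic
            far_budget -= 1
        else:
            choices.append('near')      # locally optimal choice
    return choices
```

Repeated independent calls to random_dive correspond to the sampling of the search space described above; the budget guarantees that no dive disagrees with the value selection heuristic in more than 50% of its decisions.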

4.7 Computational Experiments

In order to validate our BP-based MDS algorithm on large datasets, we conducted a series of computational experiments using artificially generated datasets, as well as the Parkinsons Telemonitoring dataset [55].

To allow pruning to take place from the early levels of the search tree – and still focus on producing MDS embeddings with small deviations from the given dissimilarities δ_ij – we attempted to minimize the Stress function


n        rBP(S_m(x))      BP(S_m(x))       MMDS(S_m(x))
5,000    2.531 × 10^8     2.602 × 10^8     3.4127 × 10^8
6,000    2.801 × 10^8     4.679 × 10^8     4.6849 × 10^8
7,000    3.228 × 10^8     3.490 × 10^8     5.2418 × 10^8
8,000    3.437 × 10^8     3.880 × 10^8     6.2557 × 10^8
9,000    3.773 × 10^8     4.501 × 10^8     7.7351 × 10^8
10,000   4.056 × 10^8     4.837 × 10^8     7.6877 × 10^8
5,875    2.0946 × 10^4    4.162 × 10^4     2.5661 × 10^3

Table 4.1: Comparison between the BP algorithm for MDS (Algorithm 4.1), rBP, and the Metric Multidimensional Scaling algorithm.

given by (4.4.2), while using the following function as an additional pruning criterion:

σ(x) = max_{i,j = 1,...,n} |d(x_i, x_j) − δ_ij|.        (4.7.1)

Thus, in addition to comparing the value of the Stress function S_m, prunable(N^t) compares the value of σ on node N^t with that of the incumbent solution. This heuristic induces a more aggressive pruning behavior, since it allows early nodes (i.e., nodes close to the root of the tree) to be pruned as soon as a single large deviation of a realized distance from its corresponding given distance is detected.
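A direct sketch of the pruning criterion (4.7.1), assuming the same list-of-tuples data layout as before (the function name is ours):

```python
def max_deviation(x, delta):
    """sigma(x) = max |d(x_i, x_j) - delta[i][j]| over all pairs,
    as in Equation (4.7.1)."""
    n = len(x)
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(x[i], x[j])) ** 0.5
            worst = max(worst, abs(d - delta[i][j]))
    return worst
```

In a node of the search tree, only the pairs among the already embedded points need to be considered, so the partial value of σ can be maintained incrementally as points are added.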

Table 4.1 contains the results of our BP-based MDS algorithm on 7 datasets (6 of which were artificially generated). Each artificial dataset was constructed according to the following procedure. A set of n points in R^50, uniformly distributed among 50 clusters, was generated as follows:

1. 100 vectors {v_1, ..., v_50} ∪ {d_1, ..., d_50} ⊆ R^50 were randomly generated according to the uniform distribution in the open box (0, 50)^50;

2. For i ∈ {1, ..., 50}, we generated n/50 random vectors according to a multivariate normal distribution with mean vector v_i and diagonal covariance matrix with diagonal d_i.
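The generation procedure above can be sketched as follows (parameter names are ours, and Python's standard random module stands in for whatever generator was actually used; a diagonal covariance lets each coordinate be drawn independently with standard deviation √d_i):

```python
import random

def make_dataset(n, k=50, dim=50, box=50.0, seed=0):
    """Synthetic clustered data in the spirit of the construction above:
    k mean vectors v_i and variance vectors d_i drawn uniformly from the
    open box (0, box)^dim, then n/k Gaussian points per cluster."""
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(k):
        v = [rng.uniform(0.0, box) for _ in range(dim)]  # mean vector
        d = [rng.uniform(0.0, box) for _ in range(dim)]  # diagonal variances
        for _ in range(n // k):
            points.append([rng.gauss(v[t], d[t] ** 0.5) for t in range(dim)])
            labels.append(c)
    return points, labels
```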

The first six rows of Table 4.1 show the results obtained by Algorithm 4.1 and by rBP on the random datasets, as compared to the Metric Multidimensional Scaling algorithm (MMDS) [14]. The number of points in the random datasets ranges from 5,000 to 10,000.

Our approach produced embeddings with lower error on all random datasets. We remark, however, that the smallest dataset was the only case in which the BP error was of the same order of magnitude as that of MMDS. Finally, the last row of Table 4.1 refers to the Parkinsons dataset. The BP and rBP errors in that case were greater, by one order of magnitude, than the MMDS error. On the other hand, it is possible that a different order of the points, or a different value selection heuristic, could have made a difference for our algorithm on that particular dataset. To exemplify how different orders can make an impressive difference, we generated additional orders (labeled order_i, for i ∈ {1, ..., 8}) for the points in the Parkinsons dataset and applied the rBP and MMDS algorithms to the modified datasets. The orders labeled order_i, i ∈ {1, 2, 3}, were generated randomly. The orders labeled order_i, for i ∈ {4, 5, 6, 7}, were constructed using heuristics designed to find advantageous orders for the rBP algorithm, as described below:

• Heuristic One – Based on the given distances between points, we select the largest distance in the dataset. Let us say that this distance involves points p_i and p_j. We now have a (yet incomplete) order starting at point p_i and ending at p_j. We then select the largest distance in the dataset involving either p_i or p_j, and extend the order by including a new point not yet in it. The process is repeated, giving rise to a path of large weight in the graph corresponding to the dataset. The order is given by either of the two


                 rBP(S_m(x))      MMDS(S_m(x))
Original order   2.0946 × 10^4    2.5661 × 10^3
order1           6.7756 × 10^3    3.5960 × 10^3
order2           4.1886 × 10^4    3.0736 × 10^3
order3           4.1925 × 10^4    3.1397 × 10^3
order4           7.6543 × 10^3    4.4157 × 10^3
order5           5.8678 × 10^3    6.3979 × 10^3
order6           4.9394 × 10^3    6.0898 × 10^3
order7           4.6252 × 10^3    6.2728 × 10^3
order8           4.9371 × 10^3    1.0162 × 10^4

Table 4.2: Comparison between the rBP algorithm and the Metric Multidimensional Scaling algorithm on the Parkinsons dataset using different orders of the points.

orders of the vertices along such a path. Apart from the cases where there are edges with equal weights, it is straightforward to show the uniqueness of the path obtained via this procedure.

• Heuristic Two – The k-means algorithm is used to cluster the dataset. Next, we apply Heuristic One to obtain an order of the centroids of the clusters. This order on the clusters establishes that, for any two clusters C_1 and C_2, with C_1 coming before C_2 in this order, all points of C_1 will appear before any point of C_2 in the final order. Finally, Heuristic One is used to provide a relative order for the points within each cluster.

More specifically, order_4 was generated using Heuristic One, while order_5, order_6 and order_7 were generated using Heuristic Two with k = 10, 15 and 20 clusters, respectively. The results of this experiment can be seen in Table 4.2. Note that, due to the nature of the S_m objective function, the order of the points is relevant for measuring the performance of MMDS, even though the algorithm itself is insensitive to the order.
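Heuristic One, as we read it, admits the following sketch (the function name and tie-breaking details are ours):

```python
def heuristic_one_order(dist):
    """Greedy long-path ordering (Heuristic One): start from the
    endpoints of the largest distance, then repeatedly extend either end
    of the partial order with the unused point at the largest distance
    from one of the current endpoints.  dist is a symmetric matrix."""
    n = len(dist)
    # Endpoints of the globally largest distance.
    i, j = max(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: dist[p[0]][p[1]])
    order, used = [i, j], {i, j}
    while len(order) < n:
        head, tail = order[0], order[-1]
        cand = max((p for p in range(n) if p not in used),
                   key=lambda p: max(dist[head][p], dist[tail][p]))
        used.add(cand)
        if dist[head][cand] >= dist[tail][cand]:
            order.insert(0, cand)   # extend at the head
        else:
            order.append(cand)      # extend at the tail
    return order
```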

As discussed before, the rBP error using the original order was greater than the corresponding MMDS error. On the basis of the results corresponding to the orders order_i, for i ∈ {1, 2, 3}, we can suppose that, for a generic order, the MMDS error is typically smaller than the rBP error. However, the results suggest that, using the heuristic orders, the rBP error remains within the same order of magnitude as the MMDS error, or even below it.

In order to show how an order can make a drastic difference in the final result, we constructed order_8 in an artificial way, by applying Heuristic One to the matrix of absolute differences between the original Euclidean distance matrix of the Parkinsons dataset and the one obtained by the MMDS algorithm. This results in an order that works against the MMDS algorithm, since we intentionally selected an order along which the value of S_m is heuristically maximal, i.e., along which MMDS produces its worst approximations of the original distances. On the other hand, any representation obtained by rBP using this order has several of the distances accounted for by S_m exactly matched. The last row of Table 4.2 makes it clear that such an order is indeed more favorable to rBP.

4.7.1 Application as a Confirmatory MDS Technique

We validate the BP-based confirmatory MDS algorithm proposed here with a series of computational experiments using the classic Fisher dataset [24] and the cluster partition-preserving application described in this paper. Prior to the application of our partition-preserving MDS algorithm, duplicated points were removed and the data was clustered using a standard k-means procedure, setting the number k of clusters equal to 3, 5 and 8. We used as the reference point of each cluster its centroid, defined as the average of the points belonging to the cluster. Therefore, the actual data fed to the algorithm was augmented by as many points as the number of clusters in each


         Alg. 4.1                       Alg. 4.2
k    Stress         d_M   Discr.   Stress         d_M   Discr.
3    9.863 × 10^2    2      2      1.937 × 10^3    0      0
5    8.372 × 10^2    2      8      6.967 × 10^3    0      0
8    1.017 × 10^3   12      5      3.098 × 10^4    0      2

Table 4.3: Comparison between the standard BP algorithm (Algorithm 4.1) and the proposed cluster partition-preserving BP algorithm (Algorithm 4.2).

experiment. Since the Fisher dataset involves a rather small number of points, we chose to run the experiments in Table 4.3 using (4.4.1) as the Stress function.

The first column of Table 4.3 displays the number of clusters used for clustering the original data. The next three columns refer to: (i) the value of the Stress function corresponding to the best MDS embedding found by applying the BP algorithm (Algorithm 4.1), (ii) the value of the cluster misclassification pseudo-metric, and (iii) the corresponding number of cluster-partition discrepancies. The following three columns provide the same information for our cluster partition-preserving BP algorithm. In both cases, the BP search was limited to a maximum of 5 million nodes.

The results show that our algorithm was able to construct MDS embeddings with low (in fact, zero) cluster misclassification and few cluster-partition discrepancies, while incurring a relatively small increase in the value of the Stress function.

It is important to remark that Table 4.3 shows a simultaneous decrease in misclassification and discrepancy for the cluster partition-preserving BP (Algorithm 4.2), for all values of k. On the other hand, the Stress value for Algorithm 4.2 was greater than that for the standard BP (Algorithm 4.1), for all values of k. Since the search tree was pruned with respect to cluster discrepancy, this scenario is to be expected: discarding certain solutions that were taken into consideration by Algorithm 4.1 might lead to an increase in Stress. However, since both BP searches were limited to exploring 5 million nodes, it is conceivable that the search carried out by Algorithm 4.2 could lead to a solution with a Stress value closer to that of the best embedding found by Algorithm 4.1, had the algorithm been allowed to explore a larger fraction of the tree.

As far as running time is concerned, Algorithm 4.2 has practically the same performance as Algorithm 4.1, since the discrepancy calculation introduces a negligible amount of extra computation at each node of the tree. The computation of the cluster misclassification metric is currently carried out as a post-processing phase, applied only to the set of elite embeddings stored in the set B of Algorithm 4.2.

4.7.2 BP-Based Confirmatory MDS for Large Datasets

A few desirable properties of the proposed algorithms can be identified that support their applicability to the processing of extremely large datasets arising in Big Data applications.

First, the algorithms do not require substantial pre-processing or initialization. In the case of Algorithm 4.2, the only actual requirement of the algorithm is that the input data points be sorted by cluster membership, i.e., the data must be sorted according to the label of the cluster to which each point belongs. This type of sorting can be accomplished in time that is linear in the number of points, via a one-pass scan of the dataset. The preprocessing required for finding suitable orders is also inexpensive, as discussed in the beginning of this section in connection with the heuristics used for computing Table 4.2.

Additionally, a first embedding of the data is found early in the search, during the initial descent of the algorithm along the BP tree. Thus, an initial embedding is obtained in time linear in the number of points of the dataset. This is also true for all additional embeddings found by the randomized version of the algorithm, since the proposed randomization procedure consists of a series of independent descents. Thus, the proposed algorithms qualify as anytime algorithms, in the sense that their execution can be suspended at any moment, and the best solution found so far can be obtained and inspected. While a large part of the BP tree (several million nodes) can be maintained in a few gigabytes of main memory, the algorithm can benefit from strategies that involve storing unexplored regions of the tree in secondary memory, or parallelizing the search along different subtrees.

Finally, while not required, an initial embedding can be used to further prune the search along the BP tree. This can be of particular interest in applications in which different embeddings must be produced in an interactive fashion, based on additional side constraints or slightly different optimization criteria imposed by a decision-maker. The incremental effort to produce the next embedding can be substantially reduced if an initial embedding is provided as input to the algorithm, replacing the naïve initialization shown in line 2 of Algorithm 4.2.

4.8 Conclusion

We have introduced an MDS algorithm based on a combinatorial procedure whose foundations lie in recent developments in distance geometry. The algorithm is capable of enumerating all elements of a large family of MDS embeddings. It consists of a combinatorial search that can handle a large variety of side constraints and additional optimization criteria, as illustrated in an application of MDS in which an embedding is sought that preserves a given cluster partition structure existing in the original space, i.e., the algorithm can be easily modified into a confirmatory MDS algorithm.

Two main components of the basic MDS algorithm described here can be modified in order to address a specific confirmatory MDS task: the order in which points are embedded (in our example, points were sorted by cluster membership) and the introduction of specific pruning criteria (in our example, a pseudo-metric called cluster discrepancy). Such artifacts can be used to account for a vast number of optimization criteria and side constraints required by the MDS application at hand, allowing the procedure to be applied to a large set of visualization applications of MDS.

We argued that the branching heuristics used in the classical BP algorithm for molecular conformation problems are not adequate for our proposed algorithm, especially when dealing with large datasets. In view of that, we recommend a variant of our algorithm that involves a value selection heuristic, along with a randomization strategy that ensures that: (i) the algorithm is guided to search for improving solutions, and (ii) the algorithm samples the search tree according to a (limited) randomization strategy. The randomization prevents the algorithm from spending excessive time exploring a limited fraction of the search tree as a result of backtracking. In practice, the algorithm works by performing several random "dives" in the search tree; each dive is independent of the previous ones and cannot consist of more than 50% of decisions that disagree with the value selection heuristic adopted.

Additional features of the algorithm are that: (i) it does not require substantial calibration of parameters, (ii) it is an anytime algorithm, delivering a series of improving solutions in the course of the search, and (iii) it does not suffer from local optimality issues (a starting embedding is not needed in order to reach a confirmatory embedding of good quality).


Chapter 5

Conclusion

In each chapter of this work, we presented new contributions to Euclidean Distance Geometry, whether as a theoretical result (Chapter 2), as a direct application (Chapter 3), or as a middle ground between theory and practice (Chapter 4).

Our first contribution appears in Chapter 2 and concerns the number of solutions of the DMDGP [39]. Knowing the cardinality of the solution set of a given DMDGP instance associated with the calculation of the three-dimensional structure of proteins is important, since this information is linked to the "quality" and quantity of the experimental NMR¹ data. It is also useful in other DGP applications to know when certain distances are sufficient for the graph to have a unique realization. Moreover, if we know the number of solutions of a given instance and wish to obtain all of them, we can use this value as a stopping criterion for the BP algorithm, reducing the associated search tree.

The main advantage of the method described in Chapter 3 is that it requires neither the minimum embedding dimension to be known nor an initial, already realized clique as part of the input data, both of which are required in the literature.

Chapter 4 is the one that suggests the most future work. As mentioned, in Confirmatory MDS extra constraints on the structure of the MDS representation are integrated into an MDS algorithm by means of equations or penalties. The types of extra constraints studied include forcing a spherical arrangement of the embedded points, rectangular lattice structures, and order constraints on the realized distances. As shown, the methodology presented here is clearly a confirmatory one, since our extra constraint requires the embedded points to retain the cluster structure coming from the original space.

A new version of the algorithm for the DDGP, instead of the DMDGP, is under development. Other possible types of pruning and ordering for this method also point to new applications of Confirmatory MDS, beyond the retention of cluster structure. Research on a more efficient exploration of the search tree is still needed; some randomization ideas were presented throughout Chapter 4.

It is worth noting that the initial results of Chapters 2 and 4 were published in the Proceedings of the I Workshop of Distance Geometry and Applications, held in Manaus, June 23–27, 2013 [1, 6], where the work related to Chapter 4 was selected as “Best Paper”. Moreover, a first direct consequence of this work was presented at the Many Faces of Distances meeting, held in Campinas, October 22–24, 2014 [12], and selected among the 4 best papers presented at the event. The papers mentioned above are included in the appendices.

1From the English: Nuclear Magnetic Resonance.


References

[1] G. Abud and J. Alencar. “Counting the number of solutions of the Discretizable Molecular Distance Geometry Problem”. In: Proceedings of the Workshop on Distance Geometry and Applications (DGA2013). June 2013, pp. 29–32.

[2] J. Alencar, C. Lavor and T. Bonates. “A Combinatorial Approach to Multidimensional Scaling”. In: 2014 IEEE International Congress on Big Data (BigData Congress). 2014, pp. 562–569.

[3] J. Alencar, T. Bonates and C. Lavor. “A Distance Geometry-Based Combinatorial Approach to Multidimensional Scaling”. Submitted.

[4] J. Alencar, T. Bonates, C. Lavor and L. Liberti. “An algorithm for realizing Euclidean distance matrices”. Submitted.

[5] J. Alencar, T. Bonates, G. Liberali and D. Aloise. “Branch-and-prune algorithm for multidimensional scaling preserving cluster partition”. In: Proceedings of the Workshop on Distance Geometry and Applications (DGA2013). June 2013, pp. 41–46.

[6] J. Alencar, C. Torezzan, S. I. R. Costa and A. Andrioni. “The Kissing Number Problem from a Distance Geometry Viewpoint”. In: Proceedings of the Workshop on Distance Geometry and Applications (DGA2013). June 2013, pp. 47–52.

[7] A. Y. Alfakih and H. Wolkowicz. On the Embeddability of Weighted Graphs in Euclidean Spaces. Tech. rep. CORR 98-12. Department of Combinatorics and Optimization, University of Waterloo, 1998.

[8] D. Aloise, P. Hansen and L. Liberti. “An improved column generation algorithm for minimum sum-of-squares clustering”. In: Mathematical Programming 131 (2012), pp. 195–220.

[9] A. C. R. Alonso, S. M. Carvalho, C. Lavor and A. R. Oliveira. “Escalonamento Multidimensional: uma Abordagem Discreta”. In: Proceedings of the Congreso Latino-Iberoamericano de Investigacion Operativa. 2012.

[10] B. Berger, J. Kleinberg and T. Leighton. “Reconstructing a Three-dimensional Model with Arbitrary Errors”. In: J. ACM 46 (1999), pp. 212–235.

[11] L. Blumenthal. Theory and Applications of Distance Geometry. Oxford University Press, 1953.


[12] T. Bonates and J. Alencar. “Learning Forbidden Subtrees in Branch-and-Prune-Based MDS”. In: Proceedings of Many Faces of Distances. 2014.

[13] I. Borg. Multidimensional similarity structure analysis. Springer-Verlag New York, Inc., 1987.

[14] I. Borg and P. J. Groenen. Modern multidimensional scaling: Theory and applications. Springer Verlag, 2005.

[15] I. Borg, P. J. Groenen and P. Mair. Applied multidimensional scaling. Springer, 2013.

[16] A. Cayley. “On a theorem in the geometry of position”. In: Cambridge Mathematical Journal 2 (1841), pp. 267–271.

[17] I. Coope. “Reliable computation of the points of intersection of 𝑛 spheres in R𝑛”. In: Australian and New Zealand Industrial and Applied Mathematics Journal 42 (2000), pp. C461–C477.

[18] G. M. Crippen and T. F. Havel. Distance geometry and molecular conformation. Vol. 15. Research Studies Press, Taunton, England, 1988.

[19] J. Dattorro. Convex Optimization and Euclidean Distance Geometry. Palo Alto: Mεβoo, 2005.

[20] B. R. Donald. Algorithms in structural molecular biology. The MIT Press, 2011.

[21] Q. Dong and Z. Wu. “A linear-time algorithm for solving the molecular distance geometry problem with exact inter-atomic distances”. In: Journal of Global Optimization 22 (2002), pp. 365–375.

[22] G. Dzemyda, O. Kurasova and J. Žilinskas. “Multidimensional Data and the Concept of Visualization”. In: Multidimensional Data Visualization. Vol. 75. Springer Optimization and Its Applications. Springer New York, 2013, pp. 1–4.

[23] T. Eren, O. Goldenberg, W. Whiteley, Y. Yang, A. Morse, B. Anderson and P. Belhumeur. “Rigidity, computation, and randomization in network localization”. In: INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies. Vol. 4. 2004, pp. 2673–2684.

[24] R. A. Fisher. “The use of multiple measurements in taxonomic problems”. In: Annals of Eugenics 7 (1936), pp. 179–188.

[25] J. Graver. “Rigidity Matroids”. In: SIAM Journal on Discrete Mathematics 4 (1991), pp. 355–368.

[26] J. Graver, B. Servatius and H. Servatius. Combinatorial Rigidity. American Mathematical Society, 1993.

[27] W. Heiser and P. Groenen. “Cluster differences scaling with a within-clusters loss component and a fuzzy successive approximation strategy to avoid local minima”. In: Psychometrika 62 (1997), pp. 63–83.

[28] B. Hendrickson. “Conditions for Unique Graph Realizations”. In: SIAM J. Comput. 21 (1992), pp. 65–84.


[29] M. Laurent. “Cuts, matrix completions and graph rigidity”. In: Mathematical Programming, Series B 79 (1997), pp. 255–283.

[30] M. Laurent. “Matrix completion problems”. In: Encyclopedia of Optimization. Ed. by C. A. Floudas and P. M. Pardalos. Springer US, 2009, pp. 1967–1975.

[31] C. Lavor, J. Lee, A. L.-S. John, L. Liberti, A. Mucherino and M. Sviridenko. “Discretization orders for distance geometry problems”. In: Optimization Letters 6 (2012), pp. 783–796.

[32] C. Lavor, L. Liberti, N. Maculan and A. Mucherino. “The discretizable molecular distance geometry problem”. In: Computational Optimization and Applications 52 (2012), pp. 115–146.

[33] C. Lavor, L. Liberti and A. Mucherino. “The interval Branch-and-Prune algorithm for the discretizable molecular distance geometry problem with inexact distances”. In: Journal of Global Optimization (2011), pp. 1–17.

[34] C. Lavor. “On generating Instances for the Molecular Distance Geometry Problem”. In: Global Optimization. Ed. by L. Liberti and N. Maculan. Vol. 84. Nonconvex Optimization and Its Applications. Springer US, 2006, pp. 405–414.

[35] C. Lavor, L. Liberti, N. Maculan and A. Mucherino. “Recent advances on the Discretizable Molecular Distance Geometry Problem”. In: European Journal of Operational Research 219 (2012), pp. 698–706.

[36] L. Liberti, C. Lavor, N. Maculan and A. Mucherino. “Euclidean Distance Geometry and Applications”. In: SIAM Review 56 (2014), pp. 3–69.

[37] L. Liberti, C. Lavor, A. Mucherino and N. Maculan. “Molecular distance geometry methods: From continuous to discrete”. In: International Transactions in Operational Research 18 (2011), pp. 33–51.

[38] L. Liberti and C. Lavor. “On a Relationship Between Graph Realizability and Distance Matrix Completion”. In: Optimization Theory, Decision Making, and Operations Research Applications. Ed. by A. Migdalas, A. Sifaleras, C. K. Georgiadis, J. Papathanasiou and E. Stiakakis. Vol. 31. Springer Proceedings in Mathematics & Statistics. Springer New York, 2013, pp. 39–48.

[39] L. Liberti, C. Lavor, J. Alencar and G. Abud. “Counting the Number of Solutions of KDMDGP Instances”. In: Geometric Science of Information. Ed. by F. Nielsen and F. Barbaresco. Vol. 8085. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2013, pp. 224–230.

[40] L. Liberti, C. Lavor and N. Maculan. “A Branch-and-Prune algorithm for the Molecular Distance Geometry Problem”. In: International Transactions in Operational Research 15 (2008), pp. 1–17.

[41] L. Liberti, C. Lavor and A. Mucherino. “The Discretizable Molecular Distance Geometry Problem seems Easier on Proteins”. In: Distance Geometry. Ed. by A. Mucherino, C. Lavor, L. Liberti and N. Maculan. Springer New York, 2013, pp. 47–60.


[42] L. Liberti, B. Masson, J. Lee, C. Lavor and A. Mucherino. “On the number of realizations of certain Henneberg graphs arising in protein conformation”. In: Discrete Applied Mathematics 165 (2014). 10th Cologne/Twente Workshop on Graphs and Combinatorial Optimization (CTW 2011), pp. 213–232.

[43] A. Man-Cho So and Y. Ye. “Theory of semidefinite programming for sensor network localization”. In: Mathematical Programming B 109 (2007), pp. 367–384.

[44] K. Menger. “New foundation of Euclidean geometry”. In: American Journal of Mathematics 53 (1931), pp. 721–745.

[45] K. Menger. “Untersuchungen über allgemeine Metrik”. In: Mathematische Annalen 103 (1930), pp. 466–501.

[46] J. Nie, K. Ranestad and B. Sturmfels. “The algebraic degree of semidefinite programming”. In: Mathematical Programming A 122 (2010), pp. 379–405.

[47] J. Porta, L. Ros and F. Thomas. “Inverse kinematics by distance matrix completion”. In: Proceedings of the 12th International Workshop on Computational Kinematics. 2005, pp. 1–9.

[48] K. Ranestad and B. Sturmfels. Personal communication. 2013.

[49] V. Roth, J. Laub, M. Kawanabe and J. M. Buhmann. “Optimal cluster preserving embedding of nonmetric proximity data”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003), pp. 1540–1551.

[50] J. B. Saxe. Embeddability of weighted graphs in k-space is strongly NP-hard. Tech. rep. Carnegie-Mellon University, Department of Computer Science, 1980.

[51] T. Schlick. Molecular modelling and simulation: an interdisciplinary guide. New York: Springer, 2002.

[52] I. Schoenberg. “Remarks to Maurice Fréchet’s article ‘Sur la définition axiomatique d’une classe d’espaces distanciés vectoriellement applicable sur l’espace de Hilbert’”. In: Annals of Mathematics 36 (1935), pp. 724–732.

[53] M. Sippl and H. Scheraga. “Solution of the embedding problem and decomposition of symmetric matrices”. In: Proceedings of the National Academy of Sciences 82 (1985), pp. 2197–2201.

[54] T.-S. Tay and W. Whiteley. “Generating isostatic frameworks”. In: Structural Topology 11 (1985), pp. 21–69.

[55] A. Tsanas, M. Little, P. McSharry and L. Ramig. “Accurate Telemonitoring of Parkinson’s Disease Progression by Noninvasive Speech Tests”. In: IEEE Transactions on Biomedical Engineering 57 (2010), pp. 884–893.

[56] W. Whiteley. “Infinitesimally rigid polyhedra. I. Statics of frameworks”. In: Transactions of the American Mathematical Society 285 (1984), pp. 431–465.

[57] D. M. Witten and R. Tibshirani. “Supervised multidimensional scaling for visualization, classification, and bipartite ranking”. In: Computational Statistics & Data Analysis 55 (2011), pp. 789–801.


[58] R. Xu and D. Wunsch. Clustering. Wiley-IEEE Press, Oct. 2008, p. 368.

[59] Y. Yemini. “Some theoretical aspects of position-location problems”. In: 20th Annual Symposium on Foundations of Computer Science. IEEE, 1979, pp. 1–8.

[60] Y. Yemini. The Positioning Problem – A Draft of an Intermediate Summary. Tech. rep. Defense Technical Information Center, University of Southern California, Marina Del Rey Information Sciences Institute, 1978.


Appendix A

Counting the number of solutions of the Discretizable Molecular Distance Geometry Problem

Germano Abud1 and Jorge Alencar2

1 Universidade Federal de Uberlândia, FAMAT-UFU, Uberlândia, Minas Gerais, Brazil. [email protected]
2 Universidade Estadual de Campinas, IMECC-Unicamp, Campinas, São Paulo, Brazil. [email protected].

Abstract

The Discretizable Molecular Distance Geometry Problem (DMDGP) is a subset of the Molecular Distance Geometry Problem, where the solution space has a finite number of solutions. We propose a way to count this value, based on the symmetric properties of the DMDGP.

Keywords

Branch-and-Prune, Molecular Distance Geometry Problem, Number of Solutions

A.1 Introduction

The Molecular Distance Geometry Problem (MDGP) arises in nuclear magnetic resonance (NMR) spectroscopy analysis, which provides a set of inter-atomic distances 𝑑𝑖𝑗 for certain pairs of atoms (𝑖, 𝑗) of a given protein [20]. The question is how to use this set of distances in order to calculate the positions 𝑥1, . . . , 𝑥𝑛 ∈ R3 of the atoms forming the molecule [18].

A simple undirected graph 𝐺 = (𝑉, 𝐸, 𝑑) can be associated to the problem, where 𝑉 represents the set of atoms, 𝐸 models the set of atom pairs for which a Euclidean distance is available, and the function 𝑑 : 𝐸 → R+ assigns


distance values to each pair in 𝐸. The MDGP can then be formally defined as follows: given a weighted simple undirected graph 𝐺 = (𝑉, 𝐸, 𝑑), is there a function 𝑥 : 𝑉 → R3 such that

    ‖𝑥𝑖 − 𝑥𝑗‖ = 𝑑𝑖𝑗, ∀(𝑖, 𝑗) ∈ 𝐸? (A.1.1)

Many algorithms have been proposed for the solution of the MDGP, and most of them are based on a search in a continuous space [37].
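For concreteness, checking whether a candidate realization satisfies (A.1.1) can be sketched as below. This is an illustrative helper only; the name `is_valid_realization` and the numeric tolerance are our own choices, not part of the cited literature.

```python
import math

def is_valid_realization(x, edges, tol=1e-9):
    """Check (A.1.1): does the realization x satisfy every given distance?
    x maps vertex -> coordinates in R^3; edges maps (i, j) -> d_ij."""
    return all(abs(math.dist(x[i], x[j]) - dij) <= tol
               for (i, j), dij in edges.items())
```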

Exploring some rigidity properties of the graph 𝐺, the search space can be discretized, defining a subset of MDGP instances known as the Discretizable MDGP (DMDGP) [32]. The main idea behind the discretization is that the intersection of three spheres in three-dimensional space consists of at most two points, under the hypothesis that their centers are not aligned. The definition of an ordering on the atoms of the protein, satisfying the condition that distances to at least three immediate predecessors are known, suggests a recursive search on a binary tree containing the potential coordinates for the atoms of the molecule [33]. The binary tree of possible solutions is explored starting from its top, where the first three atoms are positioned, placing one vertex at a time. At each step, two possible positions for the current vertex 𝑣 are computed, and two new branches are added to the tree. As soon as a position is found to be infeasible, the corresponding branch is pruned and the search backtracks. This strategy defines an efficient algorithm called Branch and Prune (BP) [33].
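The three-sphere intersection underlying the discretization can be illustrated with a standard trilateration computation. This is a sketch under the assumption of non-collinear centers; `trilaterate` is a hypothetical helper for illustration, not the BP code.

```python
import math

def trilaterate(c1, c2, c3, r1, r2, r3):
    """Intersect three spheres in R^3 with non-collinear centers.
    Returns the two candidate points (coincident when the spheres meet in
    a single point) or None when they do not intersect."""
    sub = lambda a, b: tuple(ai - bi for ai, bi in zip(a, b))
    add = lambda a, b: tuple(ai + bi for ai, bi in zip(a, b))
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))
    scale = lambda a, s: tuple(ai * s for ai in a)
    cross = lambda a, b: (a[1]*b[2] - a[2]*b[1],
                          a[2]*b[0] - a[0]*b[2],
                          a[0]*b[1] - a[1]*b[0])

    # local orthonormal frame with origin c1 and x-axis towards c2
    d = norm(sub(c2, c1))
    ex = scale(sub(c2, c1), 1.0 / d)
    i = dot(ex, sub(c3, c1))
    ey_dir = sub(sub(c3, c1), scale(ex, i))
    ey = scale(ey_dir, 1.0 / norm(ey_dir))
    ez = cross(ex, ey)
    j = dot(ey, sub(c3, c1))

    # coordinates of the intersection in the local frame
    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y = (r1**2 - r3**2 + i**2 + j**2) / (2 * j) - (i / j) * x
    z2 = r1**2 - x**2 - y**2
    if z2 < 0:
        return None
    z = math.sqrt(z2)
    base = add(add(c1, scale(ex, x)), scale(ey, y))
    return add(base, scale(ez, z)), add(base, scale(ez, -z))
```

The two returned points are exactly the two candidate positions that give rise to the two branches created at each BP step.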

We propose a way to count the number of solutions of the DMDGP, based on its symmetric properties established in [42].

A.2 The Euclidean Distance Matrix Completion Problem

Functions (or realizations) 𝑥 : 𝑉 → R3 satisfying (A.1.1) are called valid realizations. Once a valid realization is found, distances between all pairs of vertices can be determined, which extends 𝑑 : 𝐸 → R+ to a function 𝑑′ : 𝑉 × 𝑉 → R+, where the values of the function 𝑑′ can be arranged into a square Euclidean distance matrix on the set 𝐷 = {𝑥𝑣 : 𝑣 ∈ 𝑉 } ⊂ R3. The pair (𝐷, 𝑑′) is known as a distance space [11].

In the Euclidean Distance Matrix Completion Problem (EDMCP) [29], the input is a partial square symmetric matrix 𝑀 and the output is a pair (𝑀′, 𝑘), where 𝑀′ is a symmetric completion of 𝑀 and 𝑘 ∈ N is such that: (a) 𝑀′ is a Euclidean distance matrix in R𝑘 and (b) 𝑘 is as small as possible. We consider a variant of the EDMCP, called EDMCP𝑘, where 𝑘 = 3 is actually given as part of the input and the output certificate for YES instances consists only of the completion 𝑀′ of the partial matrix 𝑀 as a Euclidean distance matrix (𝑀′ is called a valid completion) [38].

There is a strong relationship between the MDGP and the EDMCP3: each MDGP instance 𝐺 can be transformed in linear time into an EDMCP3 instance (and vice versa [21]) by just considering the weighted adjacency matrix of 𝐺, where vertex pairs {𝑢, 𝑣} ∉ 𝐸 correspond to entries missing from the matrix of the EDMCP3 instance.
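This transformation can be sketched in a few lines (illustrative only; the name `mdgp_to_edmcp3` is our own, and `None` stands for a missing entry):

```python
def mdgp_to_edmcp3(n, edges):
    """Build the weighted adjacency matrix of an MDGP graph on vertices
    1..n, with None marking the entries missing from the corresponding
    EDMCP3 instance.  edges maps (i, j) -> d_ij."""
    M = [[None] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = 0.0               # zero diagonal of a distance matrix
    for (i, j), dij in edges.items():
        M[i - 1][j - 1] = M[j - 1][i - 1] = dij
    return M
```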

A.3 Counting the number of solutions of the DMDGP

As remarked in [47], the completion in R3 of a partial distance matrix with the structure

    ⎡  0    d12  d13  d14  ?   ⎤
    ⎢  d21  0    d23  d24  d25 ⎥
    ⎢  d31  d32  0    d34  d35 ⎥
    ⎢  d41  d42  d43  0    d45 ⎥
    ⎣  ?    d52  d53  d54  0   ⎦

can be carried out in constant time by solving a quadratic system in the unknown 𝑑15 (represented as a question mark in the matrix above), derived from setting the Cayley–Menger determinant [11] of the related distance space to zero.

The matrix above is an EDMCP3 instance related to some DMDGP instance. In fact, for any DMDGP instance, we have an EDMCP3 instance given by a matrix 𝑀 such that (at least) the elements 𝑀𝑖𝑗 satisfying |𝑖 − 𝑗| ≤ 3 are known [32].
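The quadratic completion described above can be sketched numerically: the Cayley–Menger determinant of five points is a quadratic polynomial in the unknown squared distance, so it can be recovered from three evaluations and its two roots computed. The helpers below (`cm_det`, `complete_d15`) are our own illustration, not code from [47]; they work on a 5×5 matrix of squared distances with entry (1, 5) unknown.

```python
import math

def cm_det(sq):
    """Cayley-Menger determinant of 5 points, given a 5x5 matrix of
    SQUARED distances, via Gaussian elimination on the bordered matrix."""
    m = [[0.0] + [1.0] * 5] + [[1.0] + [float(v) for v in row] for row in sq]
    det, n = 1.0, 6
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return det

def complete_d15(sq):
    """Roots (in d15^2) of cm_det(...) = 0, where sq[0][4] is treated as
    unknown.  The determinant is quadratic in that entry, so the
    polynomial is recovered from three sample evaluations."""
    def f(u):
        s = [row[:] for row in sq]
        s[0][4] = s[4][0] = u
        return cm_det(s)
    f0, f1, f2 = f(0.0), f(1.0), f(2.0)
    a = (f2 - 2 * f1 + f0) / 2          # quadratic coefficient
    b = f1 - f0 - a                     # linear coefficient; f0 is constant
    disc = math.sqrt(b * b - 4 * a * f0)
    return sorted(((-b - disc) / (2 * a), (-b + disc) / (2 * a)))
```

Each of the two roots corresponds to one of the (at most two) geometric placements of the fifth point.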

43

Page 64: Geometria de Distâncias Euclidianas e Aplicaçõesrepositorio.unicamp.br/jspui/bitstream/REPOSIP/306809/1/...Ficha catalográfica Universidade Estadual de Campinas Biblioteca do Instituto

We now need some results related to the symmetric properties of the DMDGP [42] (for a given DMDGP instance 𝐺 = (𝑉, 𝐸) with |𝑉 | = 𝑛, let the distances 𝑑𝑖𝑗 of the associated EDMCP3 instance be given according to the ordering on 𝑉 that guarantees that all 𝑑𝑖𝑗 satisfying |𝑖 − 𝑗| ≤ 3 are known, and consider that 𝑥1, 𝑥2, 𝑥3, 𝑥4 are fixed):

Theorem A.3.1. Given an EDMCP3 instance of order 𝑛, related to some DMDGP instance, the results below hold with probability 1 [42].

1. If the distance 𝑑1,𝑛 is known, there is just one solution to the given EDMCP3 instance.

2. If all the distances 𝑑𝑖,𝑖+4, 𝑖 = 1, . . . , 𝑛 − 4, are known, there is also just one solution to the given EDMCP3 instance.

3. There are just 2 possible (distinct) values for the unknown distances 𝑑𝑖,𝑖+4, 𝑖 = 1, . . . , 𝑛 − 4, related to the EDMCP3 instance.

In order to illustrate how to count the number of solutions of the DMDGP, consider the following example of the EDMCP3 associated to some DMDGP instance (by symmetry, we only consider 𝑑𝑖𝑗 such that 𝑖 ≤ 𝑗, for 𝑖, 𝑗 = 1, · · · , 𝑛):

    ⎡ 0  d12  d13  d14   ?   d16   ?    ?    ?     ?      ?      ?    ⎤
    ⎢    0    d23  d24  d25   ?    ?    ?    ?     ?      ?      ?    ⎥
    ⎢         0    d34  d35  d36   ?    ?    ?     ?      ?      ?    ⎥
    ⎢              0    d45  d46  d47   ?    ?    d4,10   ?      ?    ⎥
    ⎢                   0    d56  d57  d58   ?     ?      ?      ?    ⎥
    ⎢                        0    d67  d68  d69    ?      ?      ?    ⎥
    ⎢                             0    d78  d79   d7,10   ?      ?    ⎥
    ⎢                                  0    d89   d8,10  d8,11   ?    ⎥
    ⎢                                       0     d9,10  d9,11  d9,12 ⎥
    ⎢                                             0     d10,11 d10,12 ⎥
    ⎢                                                    0     d11,12 ⎥
    ⎣                                                           0     ⎦

Define the 𝑘-diagonal as the subdiagonal of a symmetric matrix 𝐴 of order 𝑛 whose elements 𝐴𝑖𝑗 satisfy |𝑗 − 𝑖| = 𝑘, for 𝑘 = 0, . . . , 𝑛 − 1.

Since the distance 𝑑16 is known, there is just one possible value for the distances 𝑑15 and 𝑑26 (by Result 1, considering 𝑉 = {𝑣1, 𝑣2, 𝑣3, 𝑣4, 𝑣5, 𝑣6}). Also, since the distance 𝑑4,10 is known, there is just one possible value for the distances 𝑑48, 𝑑49, 𝑑59, 𝑑5,10 and 𝑑6,10 (by Result 1, considering 𝑉 = {𝑣4, 𝑣5, 𝑣6, 𝑣7, 𝑣8, 𝑣9, 𝑣10}). In order to complete the 4-diagonal, the only missing distances are 𝑑37, 𝑑7,11, and 𝑑8,12. So, by Results 2 and 3, there are 2³ possible solutions to this EDMCP3 instance.

Based on these ideas, it is possible to define an efficient algorithm to count the number of solutions of a given EDMCP3 instance related to some DMDGP instance. From the example above, we can also notice that if we know, in fact, any 𝑘-diagonal of the matrix related to the EDMCP3 instance, for 𝑘 = 4, . . . , 𝑛 − 1, there is also just one solution to the EDMCP3 instance.
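A simplified sketch of such a counting procedure is given below. It applies Result 1 window-by-window, without chaining deductions; the function name and the input format are our own assumptions, not the formal algorithm alluded to in the text.

```python
def count_solutions(n, long_range):
    """Count the solutions of an EDMCP3 instance coming from a DMDGP
    instance of order n.  long_range is the set of known pairs (i, j)
    with j - i >= 4; all distances with |i - j| <= 3 are assumed known.
    A 4-diagonal entry d_{k,k+4} is fixed whenever it lies inside the
    window [i, j] of some known long-range distance (Result 1); every
    remaining free 4-diagonal entry doubles the count (Results 2 and 3)."""
    free = sum(1 for k in range(1, n - 3)
               if not any(i <= k and k + 4 <= j for (i, j) in long_range))
    return 2 ** free
```

On the example above, the free entries are exactly 𝑑37, 𝑑7,11 and 𝑑8,12, giving 2³ = 8 solutions.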

Now, given a DMDGP instance, if we know the number of solutions of the related EDMCP3 instance, then we also know the number of solutions (realizations) of the DMDGP instance. In fact, each solution of the given EDMCP3 instance is associated with two realizations (solutions) of the related DMDGP, up to rotations and translations.

In [38], a coordinate-free BP, called the dual BP, is proposed, which takes decisions about distance values on missing edges rather than about realizations of vertices in R3. The original algorithm (the primal BP) decides on points 𝑥𝑣 ∈ R3 to assign to the next vertex 𝑣, whereas the dual BP decides on distances 𝛿 to assign to the next missing distance incident to 𝑣 and to a predecessor of 𝑣. In addition to the formalization of the results of this work, we are studying the possibility of defining a primal-dual BP algorithm in order to obtain a more efficient method for solving DMDGP instances.


Appendix B

Branch-and-prune algorithm for multidimensional scaling preserving cluster partition

Jorge Alencar1, Tibérius Bonates2, Guilherme Liberali3 and Daniel Aloise4

1 Universidade Estadual de Campinas, IMECC-Unicamp, Campinas, São Paulo, [email protected]
2 Universidade Federal do Semiárido, UFERSA, Mossoró, Rio Grande do Norte, [email protected]
3 Erasmus University Rotterdam, EUR, Rotterdam, Netherlands. [email protected]
4 Universidade Federal do Rio Grande do Norte, UFRN, Natal, Rio Grande do Norte, [email protected].

Abstract

In standard Multidimensional Scaling (MDS) one is concerned with finding a low-dimensional representation of a set of 𝑛 objects, so that pairwise dissimilarities among the original objects are represented as distances in the embedded space with minimum error. We propose an MDS algorithm that simultaneously optimizes the distance error and the cluster membership discrepancy between a given cluster structure in the original data and the resulting cluster structure in the low-dimensional representation. We report on preliminary computational experience, which shows that the algorithm is able to find MDS representations that preserve the original cluster structure while incurring a relatively small increase in the distance error, as compared to standard MDS.

Keywords

Branch-and-Prune, Distance Geometry, Multidimensional Scaling


B.1 Introduction

Multidimensional scaling (MDS) is a set of techniques concerned with variants of the following problem: given the information on pairwise dissimilarities between elements of a set of 𝑛 objects, find a low-dimensional representation of the given objects, while minimizing a loss function that measures the error between the original dissimilarities and the distances resulting from the low-dimensional embedding [14]. This low-dimensional embedding of the given objects is usually referred to as an MDS representation.

Let us consider a set 𝑃 of points in R𝑁 to which a clustering procedure (e.g., k-means) has been applied. The application of a standard MDS procedure to 𝑃 provides no guarantee that, if the clustering procedure were also applied to the MDS representation, a cluster structure similar to the one obtained for the original data would result.

Despite this fact, attempts at integrating MDS and clustering into a single technique are not entirely absent from the MDS literature. Cluster Differences Scaling (CDS) is one such technique [27]. Given pairwise distances between a set of objects, CDS assigns objects to clusters and creates a low-dimensional representation for each cluster. Therefore, the resulting representation includes as many points as the number of clusters. The distance error is measured over the cluster representations for pairs of points that are assigned to different clusters. Another line of work relating clustering and MDS is the one described in [49]. There, an MDS representation is determined with the property that a k-means partition of the embedded data is identical to the optimal partition in the original space given by a so-called pairwise clustering cost function. One of the advantages of such an approach is that, instead of carrying out an expensive pairwise clustering cost procedure on the original data, one can apply a standard k-means algorithm to the embedded data and recover precisely the same information.

Unlike these approaches, in which clusters are determined as part of the process, our approach requires a cluster partition obtained a priori. More specifically, we assume that, in addition to the pairwise dissimilarity information, cluster membership data is given as part of the input, specifying to which cluster each point is assigned. The current availability of highly specialized optimization algorithms for clustering (see, e.g., [8]) allows for instances to be solved with good accuracy, even when the data involves a large number of entities and/or complex data types. Thus, it is justified to argue for an MDS algorithm that preserves the cluster partition but does not enforce the use of a specific clustering method, unlike [27, 49]. By considering the cluster partition structure as part of the input, the approach pursued in this paper can be applied in conjunction with virtually any clustering algorithm, including ones that are not based exclusively on dissimilarities. Given an appropriate cluster partition for the original data, the question is whether or not there is a low-dimensional representation of the data which preserves the dissimilarities to an extent that makes it still possible to recover the original cluster partition structure.

This presentation is organized as follows. In Section B.2 we describe an existing combinatorial algorithm for MDS and how it can be modified in order to take into account the preservation of cluster membership in the resulting MDS representation. In Section B.3 we discuss the results of computational experiments carried out on a classic clustering dataset.

B.2 A Cluster-Partition Preserving MDS Algorithm

Let us consider a set 𝑉 ⊂ R𝑁 of 𝑛 points, for which pairwise Euclidean distances (to which we shall refer as dissimilarities) 𝛿𝑖𝑗 are known. In [9] a Branch-and-Prune (BP) algorithm was proposed for finding an MDS representation in R3 while minimizing a Stress function given by

    𝑆(x) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ (𝑑(𝑥𝑖, 𝑥𝑗) − 𝛿𝑖𝑗)², (B.2.1)

where x = (𝑥1, . . . , 𝑥𝑛) is the resulting MDS representation and 𝑑(𝑥𝑖, 𝑥𝑗) stands for the Euclidean distance between points 𝑥𝑖 and 𝑥𝑗.
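As a small illustration (not the BP implementation of [9]), the Stress of a candidate representation can be evaluated directly from (B.2.1):

```python
import math

def stress(x, delta):
    """Raw Stress (B.2.1): sum over all ordered pairs (i, j) of the squared
    difference between the embedded distance d(x_i, x_j) and the
    dissimilarity delta[i][j]."""
    n = len(x)
    return sum((math.dist(x[i], x[j]) - delta[i][j]) ** 2
               for i in range(n) for j in range(n))
```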

Given a total order on the original points, the BP assigns standard positions to the first 3 points in such a way as to exactly match the dissimilarities among them. From the 4-th point on, the algorithm determines the possible coordinates of each point 𝑥𝑖 by exactly matching distances and dissimilarities of 𝑥𝑖 with respect to the


previous 3 points in the order. It is possible to show that, with probability 1, there are two possible positions for each such point [32].

This fact naturally leads to a combinatorial procedure, which is the basis of the tree-search BP algorithm. Since the algorithm determines the placement of points in a sequential manner, we shall say that a point has been mapped if its coordinates have already been determined. Thus, MDS representations are available at the (𝑛 − 2)-th level of the search tree, once all points have been mapped. Moreover, since the algorithm does not enforce that all distances match the corresponding dissimilarities, different MDS representations might have different values of the Stress function. An implicit enumeration scheme can then be applied based on the value of the Stress function, with tree nodes that correspond to Stress values higher than that of the best known MDS representation being removed from further investigation.

We next show how to extend this algorithm to incorporate cluster membership information, assuming that a clustering procedure was applied to the original data and that such information is available. First, we include among the input points a reference point for each cluster. This reference point can be, for instance, a cluster centroid, or simply an original point belonging to the cluster, preferably occupying a somewhat “central” position with respect to other points in the cluster. The only requirement on the choice of a reference point 𝑦 is that the dissimilarities between 𝑦 and all other points (including other reference ones) are known.

Thus, based on a total order on this augmented set of input points, we can apply the BP algorithm with the caveat that nodes corresponding to MDS representations having a high number of cluster-partition discrepancies (with respect to the original partition) are pruned. A cluster-partition discrepancy can be detected in a node of the search tree whenever a point that has already been mapped is closer (in the embedded space) to the mapped reference point of a different cluster than to the mapped reference point of its own cluster. Note that, for this kind of pruning to take place, it is necessary to have some reference points already mapped. We propose to order the input points in such a way that points belonging to the same cluster are grouped together, with the reference point of each cluster preceding the remaining points of its cluster.
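The discrepancy test just described can be sketched as follows (an illustrative helper with hypothetical names, not the paper's implementation):

```python
import math

def count_discrepancies(mapped, refs, cluster_of):
    """Count already-mapped points lying closer (in the embedded space) to
    the mapped reference point of another cluster than to the reference
    point of their own cluster.
    mapped: point -> embedded coordinates; refs: cluster label -> embedded
    coordinates of its (already mapped) reference point; cluster_of:
    point -> cluster label."""
    count = 0
    for p, xp in mapped.items():
        own = cluster_of[p]
        if own not in refs:
            continue  # own reference point not mapped yet: cannot judge
        d_own = math.dist(xp, refs[own])
        if any(math.dist(xp, xc) < d_own
               for c, xc in refs.items() if c != own):
            count += 1
    return count
```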

Algorithm B.1 summarizes the procedure. In line 10 of Alg. B.1, we refer to the property of a node being prunable. A node 𝑠 is said to be prunable if it has a larger Stress value (or cluster-partition discrepancy) than that of the best known MDS representation.

Alg. B.1 Pseudocode of the cluster-partition preserving BP algorithm.

Require: Pairwise dissimilarities 𝛿𝑖𝑗 between 𝑛 points (𝑖, 𝑗 = 1, . . . , 𝑛).
Ensure: An MDS representation.

 1: Establish total order on points, reference ones included;
 2: 𝑇 ← {𝑟}, where 𝑟 is the initial node, with positions for the first 3 points;
 3: while (𝑇 ≠ ∅) do
 4:   Select a node 𝑡 ∈ 𝑇, 𝑇 ← 𝑇 ∖ {𝑡};
 5:   for each (possible position of the first not yet mapped point in 𝑡) do
 6:     Create new node 𝑠, updated with newly placed point;
 7:     if (𝑠 is an MDS representation) then
 8:       Consider updating best known MDS representation;
 9:     else
10:       if (𝑠 is not prunable) then
11:         𝑇 ← 𝑇 ∪ {𝑠};
12:       end if
13:     end if
14:   end for
15: end while


Standard BP [9] Partition-Preserving BP𝑘 Stress Misclass. Discr. Stress Misclass. Discr.3 9.8625e+002 2 2 1.9366e+003 0 05 8.3719e+002 2 8 6.9674e+003 0 08 1.0173e+003 12 5 3.0981e+004 0 2

Table B.1: Comparison between the standard BP algorithm and the proposed cluster-partition preserving BP algorithm.

Among all solutions produced during the search, the algorithm will report, as the best solution found, one with the smallest value of cluster misclassification, a concept that we introduce in what follows. Let 𝑝, 𝑞 ∈ N_𝑘^𝑛, with N_𝑘 = {1, . . . , 𝑘}, be cluster index vectors, each of which assigns a cluster index 𝑖 (1 ≤ 𝑖 ≤ 𝑘) to each point in 𝑉. In order to compare two such point-cluster assignments, we must account for a possible permutation of cluster labels. Thus, we define cluster misclassification as the function 𝑑𝑀 : N_𝑘^𝑛 × N_𝑘^𝑛 → Z₊, such that

𝑑𝑀(𝑝, 𝑞) = min_{𝜎 ∈ 𝑃𝑘} 𝑑𝐻(𝜎(𝑝), 𝑞),

where 𝑃𝑘 is the set of permutations of N_𝑘, 𝑑𝐻 is the Hamming distance, and 𝜎(𝑝) is an index vector obtained from 𝑝 via the application of 𝜎 ∈ 𝑃𝑘 (with 𝜎(𝑝)𝑖 = 𝜎(𝑝𝑖), for 𝑖 = 1, . . . , 𝑛). Function 𝑑𝑀 is a metric that allows us to assess how dissimilar the index vector 𝑝 produced by a clustering procedure applied to the embedded data is with respect to the original index vector 𝑞, obtained by clustering the original data.
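For concreteness, 𝑑𝑀 can be computed by exhaustive minimization over the 𝑘! label permutations, as in the following sketch (the function name is ours; for larger 𝑘 one would instead solve a minimum-cost assignment problem):

```python
from itertools import permutations

def cluster_misclassification(p, q, k):
    """d_M(p, q): minimum Hamming distance between q and any relabeling of p.

    p, q -- cluster index vectors with entries in {1, ..., k}
    """
    best = len(p)
    for perm in permutations(range(1, k + 1)):
        # sigma maps the original label i+1 to perm[i]
        sigma = {i + 1: perm[i] for i in range(k)}
        hamming = sum(1 for pi, qi in zip(p, q) if sigma[pi] != qi)
        best = min(best, hamming)
    return best

# identical partitions up to swapping labels 1 and 2 -> distance 0
print(cluster_misclassification([1, 1, 2, 2], [2, 2, 1, 1], 2))  # → 0
```

The brute-force minimization is exponential in 𝑘, which is acceptable for the small values of 𝑘 used in our experiments (𝑘 ≤ 8).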

B.3 Computational Experiments

In order to validate the proposed MDS algorithm we conducted a series of computational experiments using the classical Fisher data set [24]. Prior to the application of the MDS algorithm, duplicate points were removed and the data was clustered using a standard 𝑘-means procedure, with the number 𝑘 of clusters equal to 3, 5 and 8. We used as the reference point of each cluster its centroid, defined as the average of the points belonging to the cluster.

To allow for pruning to take place from early levels of the search tree – and still focus on producing MDS representations with small deviations from the given dissimilarities 𝛿𝑖𝑗 – we attempted to minimize the Stress function given by (B.2.1), while using the following function as pruning criterion:

𝜎(x) = max_{𝑖,𝑗=1,...,𝑛} |𝑑(𝑥𝑖, 𝑥𝑗) − 𝛿𝑖𝑗| .    (B.3.1)
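Criterion (B.3.1) is simply the largest absolute deviation between realized distances and the given dissimilarities. A minimal sketch, assuming the embedded points are given as coordinate tuples and 𝛿 as a full symmetric matrix (names are ours):

```python
import math

def max_discrepancy(X, delta):
    """sigma(x) = max_{i,j} |d(x_i, x_j) - delta_ij|  (Eq. B.3.1).

    X     -- list of embedded coordinates (tuples of equal dimension)
    delta -- delta[i][j] is the original dissimilarity between points i and j
    """
    n = len(X)
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])
            worst = max(worst, abs(d - delta[i][j]))
    return worst
```

For an exact embedding the criterion evaluates to zero; any distance violation raises it to the size of the largest violation, which is what makes it suitable for threshold-based pruning.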

The first column of Table B.1 displays the number of clusters used for clustering the original data. The next three columns refer to: (i) the value of the Stress function corresponding to the best MDS representation found by applying the original BP algorithm of [9], (ii) the value of the cluster misclassification metric and (iii) the corresponding number of cluster-partition discrepancies. The following three columns provide similar information concerning our cluster-partition preserving BP algorithm. In both cases, the BP search was limited to a maximum of 5 · 10⁶ nodes.

The results show that our algorithm was able to construct MDS representations with low (in fact, zero) misclassification counts and low cluster-partition discrepancies, while incurring a relatively small increase in the value of the Stress function.

It is important to remark that Table B.1 shows a simultaneous decrease in misclassification and discrepancy for the Partition-Preserving BP, for all values of 𝑘. On the other hand, the Stress value for the Partition-Preserving BP is greater than that for the standard BP, for all values of 𝑘. Since the search tree is pruned with respect to discrepancy, this scenario is to be expected: discarding certain solutions that were taken into consideration by the Standard BP search might lead to an increase in Stress. However, since both BP searches were limited to exploring 5 million nodes, it is conceivable that the search carried out by the Partition-Preserving BP could lead to a solution with better Stress value than that of the best solution found by the Standard BP search.



As far as running time is concerned, the Partition-Preserving BP search has practically the same performance as that of the Standard BP search, since we introduce a negligible amount of extra computation in each node of the tree due to the discrepancy calculation. The computation of the cluster misclassification metric is currently carried out as a post-processing phase, applied only to a set of elite solutions generated during the search.

While different orders of the points – as well as different reference points – may be used, our preliminary experiments showed that the order suggested here provides a good compromise between quality of the MDS representation and running time.



Appendix C

Learning Forbidden Subtrees in Branch-and-Prune-Based MDS

Jorge Alencar¹ and Tibérius Bonates²

¹ Universidade Estadual de Campinas, IMECC-Unicamp, Campinas, São Paulo, Brazil. [email protected]
² Universidade Federal do Semiárido, UFERSA, Mossoró, Rio Grande do Norte, Brazil. [email protected]

Branch-and-Prune (BP) is a tree search procedure that has been recently applied to distance geometry problems [36]. In particular, it has been applied to multidimensional scaling (MDS), i.e., the problem of embedding a given set of 𝑛 points in a low-dimensional space, while attempting to preserve distances among the original points [5, 14]¹. The BP algorithm explores a discretization of the problem and implicitly enumerates the nodes of a binary tree, in which each branching decision corresponds to one of the two possible embeddings of a point in the low-dimensional space [32]. An ordering of points is established a priori, so that the embedding of any new point is influenced by the embedding of points that precede it in the order.

Based on such an order, the BP algorithm assigns standard positions for the first three points in such a way as to exactly match the distances among them. From the fourth point on, the algorithm determines the possible coordinates of each point by exactly matching its distances to the previous three embedded points in the order.
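Geometrically, the two candidate positions for each new point are the intersections of three spheres centered at the previously embedded points. The step can be sketched as classical trilateration (a pure-Python illustration with helper names of our own; it is not taken from the BP implementation):

```python
import math

def candidate_positions(A, B, C, dA, dB, dC):
    """Return the (up to) two points in R^3 at distances dA, dB, dC from
    the non-collinear points A, B, C (classical trilateration)."""
    def sub(u, v): return [a - b for a, b in zip(u, v)]
    def add(u, v): return [a + b for a, b in zip(u, v)]
    def scale(u, s): return [a * s for a in u]
    def dot(u, v): return sum(a * b for a, b in zip(u, v))
    def cross(u, v):
        return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]
    def norm(u): return math.sqrt(dot(u, u))

    # local orthonormal frame with A at the origin and B on the x-axis
    ex = scale(sub(B, A), 1.0 / norm(sub(B, A)))
    i = dot(ex, sub(C, A))
    ey = sub(sub(C, A), scale(ex, i))
    ey = scale(ey, 1.0 / norm(ey))
    ez = cross(ex, ey)
    d = norm(sub(B, A))
    j = dot(ey, sub(C, A))
    # coordinates of the sought point in the local frame
    x = (dA**2 - dB**2 + d**2) / (2 * d)
    y = (dA**2 - dC**2 + i**2 + j**2 - 2 * i * x) / (2 * j)
    z2 = dA**2 - x**2 - y**2
    if z2 < -1e-9:
        return []  # the three spheres do not intersect: infeasible distances
    z = math.sqrt(max(z2, 0.0))
    base = add(A, add(scale(ex, x), scale(ey, y)))
    return [add(base, scale(ez, z)), sub(base, scale(ez, z))]
```

The two returned points are mirror images with respect to the plane through A, B and C, which is exactly the binary branching decision enumerated by the BP tree.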

A particular type of pruning involves detecting large discrepancies between realized and original distances. Consider the original points to be labeled as 𝑥1, 𝑥2, . . . , 𝑥𝑛, with 𝑥𝑖 immediately preceding 𝑥𝑖+1 in the order. Then, when point 𝑥𝑖 is embedded, the distances between 𝑥𝑖 and points 𝑥𝑖−1, 𝑥𝑖−2, 𝑥𝑖−3 are exactly matched. Distances involving 𝑥𝑖 and any point 𝑥𝑗, with 𝑗 ∈ {1, . . . , 𝑖 − 4}, are not necessarily matched by the embedded points. Whenever such a discrepancy exceeds a threshold, the current part of the enumeration tree is pruned (discarded).
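Since the distances to the three immediately preceding points are matched exactly by construction, the pruning test only needs to examine points at least four positions earlier in the order. A hedged sketch of this check (0-based indexing; function and parameter names are our own):

```python
import math

def prunable(x_new, i, X, delta, tol):
    """Return True if placing point i at position x_new violates, beyond
    tolerance `tol`, the original distance to some point x_j with j <= i - 4
    (1-based), i.e., j < i - 3 in 0-based indexing.

    x_new -- candidate coordinates for point i
    X     -- coordinates of the already-embedded points x_0, ..., x_{i-1}
    delta -- delta[j][i] is the original dissimilarity between points j and i
    """
    for j in range(i - 3):
        if abs(math.dist(x_new, X[j]) - delta[j][i]) > tol:
            return True  # excessive discrepancy: discard this branch
    return False
```

In the BP search this test is run on each candidate position before the corresponding node is added to the tree.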

Even in the presence of pruning strategies, the remainder of the BP tree can still be excessively large. We propose a technique for recording information on local decision structures that cause certain subtrees to fail entirely due to pruning. Consider, for instance, the embedding of point 𝑥𝑖 along any branch of the tree and let us denote by 𝑇𝑖 the subtree explored by the BP algorithm from that point on, i.e., all the branching decisions attempted by the algorithm while trying to embed points 𝑥𝑖+1 through 𝑥𝑛. Tree 𝑇𝑖 might have some of its branches pruned, as

¹ Our discussion is based on the 3-dimensional case, but the ideas described here apply in a more general setting.



explained before. In particular, it is possible that pruning was so pronounced on 𝑇𝑖 that no single embedding of the full set of points was produced. This amounts to saying that along every possible path from the root of 𝑇𝑖 to a full embedding of the 𝑛 points, some excessive discrepancy between an original distance and a realized one was detected. We say that 𝑇𝑖 is a failed subtree.


Figure C.1: Example of a failed subtree rooted at an embedding of point 𝑥𝑖.

Additionally, consider that all pruning applied to 𝑇𝑖 could be justified in terms of discrepancies involving distances between points of rank not smaller than 𝑖 − 𝑘 in the order, for 𝑘 ≥ 4. This means that the reasons for 𝑇𝑖 being a failed subtree can be expressed in terms of branching decisions concerning exclusively points in a set 𝐽 ⊆ {𝑥𝑖−𝑘, . . . , 𝑥𝑛}. A direct conclusion from this fact is that, whenever the same set of branching decisions involving the embedding of points 𝑥𝑖−𝑘, . . . , 𝑥𝑖 is attempted along a subsequently explored branch of the tree, a failed subtree isomorphic to 𝑇𝑖 will be produced. Therefore, pruning can be leveraged by learning about failed subtrees and incorporating such knowledge in the search, avoiding recomputation that, otherwise, would be necessary.

Figure C.1 illustrates the concept of a failed subtree. Red nodes correspond to pruned branches of the tree. Solid nodes are feasible embeddings, while faded nodes correspond to embeddings that were not enumerated due to pruning. The dashed lines show the distances responsible for pruning.

Detecting the branching decisions associated with failed subtrees can be accomplished with a minor amount of bookkeeping during backtrack moves along the tree branches. Moreover, we can associate with each level ℓ of the tree a data structure that contains all forbidden tuples of branching decisions leading up to the embedding of 𝑥ℓ, i.e., all branching decisions that resulted in embeddings of 𝑥ℓ−𝑘, . . . , 𝑥ℓ that resulted in failed subtrees. For fixed 𝑘, this information can be stored in 𝒪(𝑛) space and can be both queried and updated in constant time for each branching decision.
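A possible realization of this per-level store, sketched under the assumption that each branching decision is encoded as a 0/1 choice between the two mirror embeddings (the class and method names are ours, not from an existing implementation):

```python
from collections import defaultdict

class ForbiddenSubtrees:
    """Per-level store of branching-decision tuples known to yield failed
    subtrees. Decision tuples are hashed, so membership queries and updates
    take constant time on average."""

    def __init__(self):
        # level l -> set of forbidden decision tuples for x_{l-k}, ..., x_l
        self.forbidden = defaultdict(set)

    def record_failure(self, level, decisions):
        """Mark the decisions leading to x_level as the root of a failed subtree."""
        self.forbidden[level].add(tuple(decisions))

    def is_forbidden(self, level, decisions):
        """Query before branching: would this tuple repeat a known failure?"""
        return tuple(decisions) in self.forbidden[level]

store = ForbiddenSubtrees()
store.record_failure(7, [0, 1, 1, 0, 1])
print(store.is_forbidden(7, [0, 1, 1, 0, 1]))  # → True: skip this subtree
print(store.is_forbidden(7, [1, 1, 1, 0, 1]))  # → False: explore it
```

Each of the 𝑛 levels holds tuples of fixed length 𝑘 + 1, which is the source of the 𝒪(𝑛)-space bound mentioned above (for fixed 𝑘).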
