sistemasde( bancos(de(dados( não1relacionais(wiki.dpi.inpe.br/lib/exe/fetch.php?media=nosql.pdf ·...
TRANSCRIPT
Sistemas de Bancos de Dados Não-‐Relacionais
Sistemas NoSQL, NewSQL e MapReduce
Gilberto Ribeiro de Queiroz ([email protected])
Evolução das Tecnologias de Bancos Dados
1960 1970 1980 1990 2000 2010
Programação Generalizada
Programas dependentes arquivos
Evolução das Tecnologias de Bancos Dados
1960 1970 1980 1990 2000 2010
Modelo Banco Dados Hierárquico (IBM IMS)
Fonte: W
ikiped
ia
Programação Generalizada
Programas dependentes arquivos
Evolução das Tecnologias de Bancos Dados
1960 1970 1980 1990 2000 2010
Modelo Banco Dados Hierárquico (IBM IMS)
Bancos Dados Relacionais (SGBD-‐R)
System R e INGRES
Programação Generalizada
Programas dependentes arquivos
Codd: Modelo Relacional
Evolução das Tecnologias de Bancos Dados
1960 1970 1980 1990 2000 2010
Modelo Banco Dados Hierárquico (IBM IMS)
Bancos Dados Relacionais (SGBD-‐R)
System R e INGRES
Programação Generalizada
Programas dependentes arquivos
Codd: Modelo Relacional
Relação (ou Tabela)
• Um banco de dados relacional é organizado em uma coleção de relações (ou tabelas) possivelmente relacionadas entre si.
paises id nome populacao fronteira
1 Alemanha 82.000.000
2 Brasil 190.000.000
... ... ... ...
Tabela
Colunas
Linha
Esquema Tabela
Instância
Linguagem de Consulta: SQL
• O modelo relacional (Codd, 1970) é a base para linguagens de alto nível: – Álgebra Relacional → Linguagem Declarafva → ISO/SQL (Structured
Query Language)
CREATE TABLE paises ( id INT4 PRIMARY KEY, nome VARCHAR(50), populacao INT4 );
paises
id nome populacao
Definição Dados
paises
id nome populacao
1 Alemanha 82.000.000
2 Brasil 190.000.000
... ... ...
Manipulação Dados
INSERT INTO paises VALUES (1, ‘Alemanha’, 82000000)
INSERT INTO paises VALUES (2, ‘Brasil’, 190000000)
Manipulação Dados
Linguagem de Consulta: SQL
• O modelo relacional (Codd, 1970) é a base para linguagens de alto nível: – Álgebra Relacional → Linguagem Declarafva → ISO/SQL (Structured
Query Language)
SELECT nome FROM paises WHERE populacao > 80000000
paises
id nome populacao
1 Alemanha 82.000.000
2 Brasil 190.000.000
... ... ...
Consulta (Não-‐Procedural)
Nota: stored procedures ou procedural languages: PL/SQL, T-‐SQL, PL/pgSQL
Métodos de Acesso (Indexação)
• Problema: Como processar de forma eficiente as consultas? – Através do uso de estruturas de dados conhecidas como Índices ou
Métodos de Acesso;
• Os índices reduzem o conjunto de objetos a serem verificados durante o processamento das consultas: – Normalmente, uma consulta envolve apenas uma
pequena parcela do banco de dados; – Neste caso, percorrer todo o banco pode
ser bastante ineficiente; – Portanto, um plano de execução eficiente para a
consulta fpicamente considera a existência de índices.
Independência Física dos Dados
paises
id nome populacao
1 Alemanha ...
2 Brasil ...
... ... ...
paises
Tabela Nome Colunas
municipios lotes paises quadras locais
Fonte: Adaptado de Gray (1996)
Organização do Sistema Arquivo
Tipos Dados Colunas,
Linha (registro)
Esquema Tabela
Esquema BD Índices Logs
Nível Interno/Esquema
Nível Lógico/Esquema
Restrições
Row-‐store Column-‐Store B+-‐tree, Hash
Visões (Views)
• Muitas vezes pode ser necessário fornecer diferentes perspecfvas do banco de dados dependendo do usuário. Uma visão (ou view) pode ser: – um subconjunto dos dados do banco de dados – pode conter dados derivados do banco de dados
paises cidades
capitais renda_us
Tabelas (Dado Bruto)
Visões (Dado Virtual)
Arquiteturas de SGBD-‐R
• Cliente/Servidor ou Embufdo (ou embarcado)
• Em memória (In-‐memory)
• Paralelos ou Distribuídos
• Armazenamento Linha x Coluna: – linha: atributos do registro são colocados confguamente no meio de
armazenamento. (bom para aplicações OLTP) – coluna: todos os valores de uma coluna são armazenados
confguamente. (bom para aplicações ofmizadas para leitura tais como data warehouses e OLAP)
Evolução das Tecnologias de Bancos Dados
1960 1970 1980 1990 2000 2010
Bancos Dados Orientado Objeto
Objeto Relacional
Novas aplicações: CAD, SIG, MulXmedia, OLAP, Real-‐Xme, CienYficas
Programação Generalizada
Programas dependentes arquivos
Modelo Banco Dados Hierárquico (IBM IMS)
Bancos Dados Relacionais (SGBD-‐R)
SGBD Objeto-‐Relacionais CREATE TYPE Address AS ( street VARCHAR(50), city VARCHAR(50), zip CHAR(8) );
CREATE FUNCTION ValidAddress RETURNS BOOLEAN EXTERNAL NAME ‘/address-‐module.so’ LANGUAGE ‘C’;
CREATE TYPE Geometry ( internallength = variable, input = geometry_in, output = geometry_out, send = geometry_send, receive = geometry_recv, typmod_in = geometry_typmod_in, typmod_out = geometry_typmod_out, delimiter = ':', alignment = double, analyze = geometry_analyze, storage = main);
CREATE TABLE Student ( name VARCHAR(50), address Address );
INSERT INTO student VALUES('eduardo', ('albino sartori', 'ouro preto', '35400')); SELECT (address).city FROM student;
Evolução das Tecnologias de Bancos Dados
1960 1970 1980 1990 2000 2010
Geoespacial
PostgreSQL → PostGIS MySQL → SpaXal Types SQLite → SpaXaLite and RasterLite Oracle → Oracle SpaXal, GeoRaster, Topology and Network Models IBM DB2 → SpaXal Extender SQL Server (2008) → SpaXal Types
Programação Generalizada
Programas dependentes arquivos
Bancos Dados Orientado Objeto
Objeto Relacional
Modelo Banco Dados Hierárquico (IBM IMS)
Bancos Dados Relacionais (SGBD-‐R)
Evolução das Tecnologias de Bancos Dados
1960 1970 1980 1990 2000 2010
NoSQL
Geoespacial
Programação Generalizada
Programas dependentes arquivos
Bancos Dados Orientado Objeto
Objeto Relacional
Modelo Banco Dados Hierárquico (IBM IMS)
Bancos Dados Relacionais (SGBD-‐R)
Interessante: o número de tecnologias de bancos de dados com caracterísXcas diferentes dos SGBD-‐R tem aumentado nos úlXmos 8 anos!
O “cardápio” de opções aumentou? • Sistemas Não-‐Relacionais ou Not Only SQL ou Pós-‐relacionais:
– h~p://nosql-‐database.org/ – h~ps://en.wikipedia.org/wiki/NoSQL
• Diferentes modelos de dados: – Document Oriented: MongoDB, CouchDB; – Column Stores: Cassandra; – Graph Databases: OrientDB, Neo4J; – Array Databases: SciDB, Rasdaman.
• Nem todos são baseados no paradigma de transações ACID.
• Escalabilidade: Horizontal x Verfcal
Document Oriented
#id nome sexo idade
0 Eduardo M 33
1 Emiliano M 35
2 Julia F 18
Tabela: estudantes
{ "nome": "Eduardo Queiroz", "sexo": "masculino", "idade": 33 , "local-‐nascimento": "Ouro Preto" }
{ "nome": "Emiliano Castejon", "sexo": "male", "idade": 35, "endereço": { … } }
SGBD-‐R Document Store: documentos JSON
schemaless todas as linhas comparflham o mesmo schema
Document Oriented x Relafonal: Schemaless
{ "name": "Gilberto", "addresses": [{ "type": "residential" "street": "Cidade Jardim", "city": "S. J. Campos", "state": "SP" }, { "type": "professional" "street": "Astronautas", "city": "S. J. Campos", "state": "SP" } ] }
#id name ...
0 Gilberto ..
... ... ...
#id type street ...
0 residenfal Cidade Jardim ...
0 professional Astronautas ...
... ... ... ...
Table: customers
Table: adresses
more natural? avoid data repeffon?
Document Oriented • The basic unit of work is the document
– in general they work with some kind of JSON notafon
• Schemaless: – the documents may not share a global schema – this is one of the greatest benefits of a document database:
• the ability to change the format of documents stored in the database at any fme without requiring a costly schema update
• Ability to replicate data between nodes
• Ideal for applicafons that need eventual or relaxed consistency
• Most widely used systems: CouchDB and MongoDB
CouchDB
• Based on a REST API: curl -‐X GET hHp://localhost:5984/geodb/_all_docs
curl -‐X GET hHp://localhost:5984/geodb/0
{ "total_rows":3, "rows":[{"id":"0","key":"0","value":{"rev":"2-‐0000099af25ecf610000000000000000"}}, {"id":"1","key":"1","value":{"rev":"2-‐000009a5c3ff15e40000000000000000"}}, {"id":"2","key":"2","value":{"rev":"2-‐000009b5e30f20c40000000000000000"}} ] }
* for Couchbase you should use port 8092
{ "name":"Eduardo Queiroz", "gender":"male", "age":33 }
CouchDB
• At its core there is a B-‐tree and then all data retrieval is key based: curl -‐X GET hHp://localhost:5984/geodb/_all_docs curl -‐X GET hHp://localhost:5984/geodb/0 curl -‐X GET hHp://localhost:5984/geodb/_all_docs?descending=true curl -‐X GET hHp://localhost:5984/geodb/_all_docs?key=\"2\" curl -‐X GET hHp://localhost:5984/geodb/_all_docs?startkey=\“1\" curl -‐X GET hHp://localhost:5984/geodb/_all_docs?startkey=\“1\“;endkey=\”2\”
CouchDB • Views:
Enables to index and query data based on Map/Reduce funcfons wri~en in JavaScript
Eduardo male 33 OP
Emiliano male 35
Uberaba
Julia female 18 BH
34
Documents
Output null
key value
View compute the average age for males
index
CouchDB
• Views: – Enables to index and query data based on Map/Reduce funcfons
wri~en in JavaScript – Generate a list of documents – Updated when a document changes
funcfon(doc) { if(doc.gender == "male") { emit(doc.age, doc.age); } }
funcfon(keys, values, rereduce) { return sum(values); }
Map Reduce
Index → Key → age Output data → doc
Allows to generate aggregates
CouchDB
• How Map/Reduce funcfons work? – The map funcfon outputs (or emits) : (key, id) and doc-‐value – The reduce funcfon receives:
• ([ [key1, id1], [key2, id2], [key3, id3] ], [value1, value2, value3], false) or, • ( null, [reduceResult, reduceResult, reduceResult], true)
funcfon(doc) { if(doc.gender == "male") { emit(doc.name, doc.age); } }
funcfon(keys, values, rereduce) { return sum(values); }
Map Reduce
Index → Key → age Output data → doc
Allows to generate aggregates
CouchDB
Parameter Value DescripXon
descending -‐ return the docs in descendent key order
key key-‐value return only docs that match the specified key
keys array of key-‐values return docs that match the specified array of keys
limit number limit the number of returned docs
skip number
startkey key-‐value return docs with key greater or equal to the value
endkey key-‐value stop returning docs with key greater than the value
Some supported arguments for querying a view
Querying through the REST API: GET hHp://localhost:5984/dbname/_design/designdocname/_view/viewname
CouchDB
• Querying a view that sums all of the ages: curl -‐X GET h~p://127.0.0.1:5984/geodb/_design/myviews/_view/sum_ages
GeoCouch
• It is an spafal extension for CouchDB
• Site: h~ps://github.com/couchbase/geocouch
• The spafal funcfonality is base on GeoJSON
• It uses an R-‐tree to index spafal data
• Spafal views are limited to map funcfons
• GDAL/OGR has a driver for it
GeoCouch
funcfon(doc) { if(doc.geometry) { emit(doc.geometry, { "image" : doc.image_url}); } }
Map
Key → geometry value → an image URL
Querying through the REST API: hHp://localhost:5984/geodb/_design/fotos/_spa0al/fotos?bbox=-‐180,-‐90,180,90
{"total_rows":0, "rows":[ {"id":"8758094944", "bbox":[16.95973,51.106573,16.95973,51.106573], "geometry":{"type":"Point", "coordinates":[16.95973,51.106573]}, "value":{"image":"h~p://farm3.stafc.flickr.com/2875/8758094944_a4e7a01fa9_s.jpg"}}, ... ]}
query result
CouchDB Implementafons
• Apache CouchDB: – Site: h~p://couchdb.apache.org – Web admin tool (Futon): h~p:// 127.0.0.1:5984/_ufls/
• Couchbase – Merge of CouchDB and Membase – Site: h~p://www.couchbase.com – Community Edifon x Enterprise Edifon – Version used: 2.0.1 – Installing a client library: pip install couchbase – Admin: h~p://localhost:8091/index.html – Default port: 8092 – GeoCouch is already present
CouchDB Hosfng
• Cloudant (h~p://cloudant.com): – limited size datasets and limited number of requests can be hosted for
free – Big Couch: h~p://cloudant.com/solufons/bigcouch
• Iris Couch: – Couchbase
MongoDB
• It is an open source document oriented database available for several pla�orms: Linux, Mac OS X and Microso� Windows.
• Designed from the ground to be scalable and to provide a dynamic schema.
• Site: h~p://www.mongodb.org
• Used version: 2.4.3
• The source code is wri~en in C++: h~ps://github.com/mongodb/mongo
Document Oriented? • Let’s think in MongoDB as a JSON database:
– as it fundamentally works with JSON objects; – although internally it uses a representafon called BSON (Binary JSON);
• Row → Document: – A set of keys with associated values; – A single record can be used to represent a complex hierarchical
relafonship (nested objects and arrays) → remember the impedance mismatch?
– Key and Values don’t have fixed types or sizes; – No fixed Schema → adding or removing fields become simpler.
• CollecXons → Tables – Dynamic schemas.
Embedded Objects and Arrays • In relafonal database systems we are used to normalizing the data, that
is, create a table schema to accommodate lists of values ��or to avoid data duplicafon.
• In a document-‐oriented database like MongoDB, it is natural to create nested objects and arrays.
{ "name" : "Gilberto", "email" : "[email protected]", "addresses" : [{"type" : "professional", "street" : "Av. Astronautas"}, {"type" : "residential", "street" : "Av. Cidade Jardim" }], "following" : ["Camara", "Miguel"] };
Each Key Has a Data Type
• Common JSON data types: Number (integer or float/real): { "int-key" : 10 } or { "real-key" : 15.2 }
internally it is used 64-‐bit float numbers for both cases.
String (UTF-‐8): { "string-key" : "any text value" }
Each Key Has a Data Type
• Common JSON data types: Boolean: { "boolean-key" : true } or { "boolean-key" : false }
Null: { "null-key" : null }
can be used to represent null values and also non-‐exisfng fields.
Each Key Has a Data Type
• Common JSON data types: Objects: { "name" : "Gilberto", "age" : 36, "address" : { "street" : "Cidade Jardim", "zip-code" : "12233-002" } }
Arrays: { "phones" : ["8123-0179", "3933-7794"] }
A document is an object and as in JSON it can have nested objects.
can be used to represent sets/lists of objects, numbers, strings, boolean values and any other allowed data type
although not usual, it can have mix data types.
Each Key Has a Data Type
• MongoDB specific data types: Date: { "last-login" : new Date() } or { "birthday" : ISODate("1976-12-30"); }
stored as milliseconds since the epoch.
Object IdenXfier: { "user-id" : ObjectId() } or { "user-id" : ObjectId("51ad29342d86c7052dca383a") }
Each Key Has a Data Type
• MongoDB specific data types: 4-‐byte signed integer: { "int4" : NumberInt("2147483648") }
8-‐byte signed integer: { "int8" : NumberLong("9223372036854775808") }
Using Query Operators
• Retrieving the documents of an author list:
• Retrieving all authors born between 1978 and 1998:
> db.authors.find( { "name": { $in : ["gribeiro", "eduardo"] } } ); > db.authors.find( { "birthday": { $gte : new Date(1978, 0, 1), $lte : new Date(1998, 0, 1)} } ); Or > db.authors.find( { $and: [ { "birthday": { $gte : new Date(1978, 0, 1) } }, { "birthday": { $lte : new Date(1998, 0, 1) } } ] } );
Defining a Map Funcfon > var map = function() { emit(this.key, this.value); };
> var map = function() { author = this["author"]; tags = this["tags"]; emit( author, { "tags": tags } ); };
Let’s create a map funcfon that signals the author of a post and the tags used in the post
Defining a Reduce Funcfon > var reduce = function(key, emits) { return same-type-as-emits; };
> var reduce = function(key, emits) { var alltags = []; for(var i in emits) { var tags = emits[i]["tags"]; for(var j in tags) { tag = tags[j]; if(alltags.indexOf(tag) == -1) { alltags.push(tag); } } } return {"tags" : alltags}; };
Let’s create a reduce funcfon that merges the tags used by a given author for all its posts
Running Map-‐Reduce > db.posts.mapReduce(map, reduce, {"out": "authors_tags" } ); { "result" : "authors_tags", "timeMillis" : 13, "counts" : { "input" : 4, "emit" : 4, "reduce" : 1, "output" : 3 }, "ok" : 1, } > db.authors_tags.find(); { "_id": "cassia", "value": { "tags": [ "meals", "meat", "chicken" ] } } { "_id": "gribeiro", "value": { "tags": [ "geo", "database", "nosql", "scidb", "mongodb", "couchdb" ] } } ...
The name of the temporary collecfon
Notes on Map-‐Reduce • We can:
– add a query clause before passing documents to the map funcfon; – limit the number of documents to sent to the map funcfon; – sort documents before sending to map funcfon; – use a finalize funcfon called a�er the last reduce output; – use a scoped value to pass to map, reduce and finalize funcfons.
• Store the result in a temporary collecfon: – By default these collecfons are dropped a�er the connecfon being
closed; – We can control this behaviour → “keeptemp “: true
• Can be performed in mulf-‐thread.
• mongos dispatch the jobs across all shards.
Replicafon
• Database replicafon is a well known technique for keeping copies of our data on mulfple servers: – we need to be prepared for situafons where one instance crashes or
it becomes unavailable.
• Replicafon ensures: – Data redundancy and backup; – Automafc failover.
Replicafon
• In MongoDB, replicafon occurs through groups of servers known as replica set: – One is designated as the primary and the others as secondaries; – All writes from the clients are directed to the primary; – While the secondaries replicate asynchronously from the primary; – The automafc failover will elect a new primary instance from the
secondaries.
• The algorithm used in MongoDB for replicafon is based on majorifes: – you need a majority of members to elect a primary; – a primary can only stay primary so long as it can reach a majority; – and a write is safe when it’s been replicated to a majority; – You can have an arbiter.
Replica Set and Clients
primary
secondary
secondary
Replica Set
Clients
reads/ writes
asynchronous replicafon/ elecfon
monitoring
What is Shard?
• Somefmes we need to split a large database across several instances: – resources can be to big to fit in a single machine; – each individual parffon is referred to as a shard or database shard.
• Do not get confused between sharding and replicafon, as they are very disfnct.
mongod mongod mongod
mongos Clients
• Instead of tables, rows and columns we have: – Nodes, Edges and Properfes
• Nodes and Edges are first class cifzens • This model fits well for
– Web 2.0 Social Networking Applicafons – Semanfc Web
Gilberto Ribeiro
Miguel
Gilberto Câmara
Advisor
Xavier
Advisor
Graph Databases
Neo4J
• Site: h~p://www.neo4j.org
firstNode = graphDb.createNode(); firstNode.setProperty( "message", "Hello, " ); secondNode = graphDb.createNode(); secondNode.setProperty( "message", "World!" ); relafonship = firstNode.createRelaXonshipTo( secondNode, RelTypes.KNOWS ); relafonship.setProperty( "message", "brave Neo4j " );
API Java
Source: neo4j manual
Neo4J Spafal
• It is a library that adds spafal funcfonalifes to Neo4j: – Support for geometry types – Topology operafons – R-‐tree – Layer: shapefile importer, Open Street Map file importer
• Work with: – GeoServer, Geotools and uDig
• Site: h~ps://github.com/neo4j/spafal
Graph Databases
• Other Implementafons: – OrientDB:
• License: Apache 2.0 • Language: Java • Site: h~p://www.orientechnologies.com/orient-‐db.htm
• Gremlin: – a graph traversal language – Site: h~p://wiki.github.com/fnkerpop/gremlin
Referências
• ELMASRI, R.; NAVATHE, S. B. Fundamentals of database systems. Addison Wesley, 2006. 1139p.
• GRAY, J. Evolu0on of Data Management. IEEE Computer 29(10): 38-‐46, 1996.
• NAUGHTON, J. F. DBMS Research: First 50 Years, Next 50 Years. Kynote speaker’ slides at ICDE 2010. Available at: h~p://pages.cs.wisc.edu/~naughton/naughtonicde.pptx. Access: April, 2013.