Pragmatic Java Architectures for Using Big Data in the Cloud

Fabiane Bizinella Nardon (@fabianenardon) Fernando Babadopulos (@babadopulos)


Us and Big Data

How big is BIG?

Hadoop, HBase, Hive, Crunch, HDFS, Cascading, Pig, Mahout, Redis, MongoDB, MySQL, Cassandra

Big Data + Cloud

Amazing Applications!

When you use Big Data technologies, make sure your data really is big

Nothing has more impact on your application's performance than optimizing your own code

In the cloud you have virtually unlimited resources. But the cost is virtually unlimited too

Nothing has more impact on your application's performance than optimizing your own code

u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

tailtarget.com - 2
cnn.com - 1

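import java.io.IOException;
import java.net.URL;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Both classes are nested inside the job's driver class.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Log line format: user - URL - IP, separated by spaces
        String line = value.toString();
        String[] parts = line.split(" ");
        Text page = new Text(new URL(parts[2]).getHost());
        context.write(page, one); // emit (host, 1)
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new IntWritable(count)); // emit (host, total)
    }
}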
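The same counting logic can be tried without a cluster. The sketch below is a plain-Java approximation, not the authors' job: the class name `HostCountSketch` is hypothetical, and it uses `java.net.URI` instead of `java.net.URL` to extract the host. Note that `getHost()` returns the full host name, so the first and third sample lines count under `www.tailtarget.com`.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the count-by-host logic; no Hadoop involved. */
class HostCountSketch {

    /** "Map" step: extract the host from one log line (user - URL - IP). */
    static String hostOf(String line) {
        String[] parts = line.split(" ");
        return URI.create(parts[2]).getHost();
    }

    /** "Reduce" step: add one occurrence per line to the host's total. */
    static Map<String, Integer> countByHost(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(hostOf(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
            "u=12070002 - http://cnn.com/news - 189.19.123.161",
            "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127");
        System.out.println(countByHost(log));
    }
}
```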

[Diagram: MapReduce data flow. HDFS chunks 1 and 2 each go through Record Reader, Map and Combine, with map output kept in local storage; the data is then copied, sorted and passed to Reduce.]

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

job.setMapperClass(Mapp.class);
job.setCombinerClass(Reduce.class); // the reducer doubles as the combiner
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.submit();
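To see why reusing the reducer as a combiner helps, the following plain-Java sketch counts how many records would cross the network with and without local pre-aggregation. It is only a simulation under simplifying assumptions: the class `CombinerSketch` is hypothetical, each inner list stands for the hosts emitted by one map task, and the combiner is assumed to run exactly once per task (Hadoop may run it several times, or not at all).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulates the effect of a combiner on the number of shuffled records. */
class CombinerSketch {

    /** Without a combiner, every (host, 1) pair is shuffled as is. */
    static int shuffledWithoutCombiner(List<List<String>> splits) {
        int records = 0;
        for (List<String> split : splits) {
            records += split.size();
        }
        return records;
    }

    /** With a combiner, each task sends one record per distinct host. */
    static int shuffledWithCombiner(List<List<String>> splits) {
        int records = 0;
        for (List<String> split : splits) {
            records += combine(split).size();
        }
        return records;
    }

    /** Local pre-aggregation: the same sum the reducer performs. */
    static Map<String, Integer> combine(List<String> hosts) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (String host : hosts) {
            partial.merge(host, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
            List.of("cnn.com", "cnn.com", "tailtarget.com"),
            List.of("cnn.com", "tailtarget.com", "tailtarget.com", "tailtarget.com"));
        System.out.println("without combiner: " + shuffledWithoutCombiner(splits));
        System.out.println("with combiner:    " + shuffledWithCombiner(splits));
    }
}
```

This only works because summing is associative and commutative, which is what allows the same class to serve as reducer and combiner.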

Naive Implementation

Counter                         Map             Reduce        Total
FILE_BYTES_WRITTEN              1,913,542,238   956,106,404   2,869,648,642
Map output materialized bytes   956,063,008     0             956,063,008
Map input records               33,809,720      0             33,809,720
Map output records              33,661,880      0             33,661,880
Combine output records          0               0             0
Reduce shuffle bytes            0               956,063,008   956,063,008
Spilled Records                 67,323,760      33,661,880    100,985,640
CPU time spent (ms)             465,750         78,900        544,650
Reduce input records            0               33,661,880    33,661,880
Reduce output records           0               343           343

With Combiner

Counter                         Map             Reduce        Total
FILE_BYTES_WRITTEN              3,674,362       721,578       4,395,940
Map output materialized bytes   677,942         0             677,942
Combine output records          74,154          0             74,154
Reduce shuffle bytes            0               677,942       677,942
Spilled Records                 75,510          22,622        98,132
CPU time spent (ms)             426,330         9,930         436,260
Reduce input records            0               22,622        22,622
Reduce output records           0               343           343

public static class Mapp extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Per-task counts, aggregated in memory before anything is emitted
    private final Map<String, Integer> items = new HashMap<String, Integer>();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString(), " ");
        st.nextToken(); // user id
        st.nextToken(); // separator
        String page = new URL(st.nextToken()).getHost();
        Integer count = items.get(page);
        if (count == null) {
            items.put(page, 1);
        } else {
            items.put(page, count + 1);
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one (host, count) pair per distinct host seen by this task
        for (Entry<String, Integer> item : items.entrySet()) {
            context.write(new Text(item.getKey()), new IntWritable(item.getValue()));
        }
    }
}

With Combiner vs. Optimized

Optimized (compare with the With Combiner counters)

Counter                         Map             Reduce        Total
FILE_BYTES_WRITTEN              2,073,462       701,288       2,774,750
Map output materialized bytes   657,920         0             657,920
Map output records              21,952          0             21,952
Combine output records          0               0             0
Reduce shuffle bytes            0               657,920       657,920
Spilled Records                 21,952          21,952        43,904
CPU time spent (ms)             270,540         8,770         279,310
Reduce input records            0               21,952        21,952
Reduce output records           0               343           343
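The counters can be turned into relative improvements with a little arithmetic. The sketch below, with a hypothetical class name `CounterComparison`, takes the total "Reduce shuffle bytes" and "CPU time spent (ms)" values from the three runs and prints the percentage reduction between them. Working it through, the combiner cuts shuffle bytes by roughly 99.9% and total CPU time by about 20%, and the optimized mapper cuts CPU time by a further 36% or so.

```java
import java.util.Locale;

/** Relative reductions between the naive, combiner and optimized runs. */
class CounterComparison {

    /** Percentage by which 'after' is smaller than 'before'. */
    static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Totals copied from the counter tables
        long shuffleNaive = 956_063_008L;
        long shuffleCombiner = 677_942L;
        long shuffleOptimized = 657_920L;
        long cpuNaive = 544_650L;
        long cpuCombiner = 436_260L;
        long cpuOptimized = 279_310L;

        System.out.printf(Locale.ROOT, "shuffle bytes, naive -> combiner: %.1f%%%n",
            reductionPercent(shuffleNaive, shuffleCombiner));
        System.out.printf(Locale.ROOT, "shuffle bytes, combiner -> optimized: %.1f%%%n",
            reductionPercent(shuffleCombiner, shuffleOptimized));
        System.out.printf(Locale.ROOT, "CPU time, naive -> combiner: %.1f%%%n",
            reductionPercent(cpuNaive, cpuCombiner));
        System.out.printf(Locale.ROOT, "CPU time, combiner -> optimized: %.1f%%%n",
            reductionPercent(cpuCombiner, cpuOptimized));
    }
}
```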

Bottlenecks are usually caused by the amount of data sent over the network

Build fault tolerance into your architecture.

If your Map and Reduce tasks keep restarting, you will never finish the job

* It's big data, remember?

Hadoop and HDFS do not work well with small files.
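One common mitigation, which the slides do not spell out, is to pack many small files into fewer large ones before loading them. The sketch below is a minimal plain-Java illustration of that idea on a local file system; the class `SmallFileMerger` is hypothetical, and it assumes the target file lives outside the directory being merged.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Concatenates many small files into one larger file. */
class SmallFileMerger {

    /**
     * Appends every regular file in dir (sorted by name) to target.
     * Assumes target is not inside dir. Returns the number of files merged.
     */
    static int merge(Path dir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(dir)) {
            parts = files.filter(Files::isRegularFile)
                         .sorted()
                         .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // stream the file's bytes into the target
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args[0]);
        Path target = Path.of(args[1]);
        System.out.println("merged " + merge(dir, target) + " files into " + target);
    }
}
```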

Processing Pipelines

http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/

http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology

u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology

u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/

MapReduce Pipelines

Orchestrate

Chain

Optimize

Example (with Crunch)

Pipeline pipeline = new MRPipeline(SiteAndUserClassifier.class, getConf());

// Read the crawled sites from Redis
RedisSetSource source = new RedisSetSource(Text.class, redisMasters);
PCollection<Text> crawledSites = pipeline.read(source);

// Classify each site
PTable<String, DNA> classifiedItems = crawledSites.parallelDo("classify sites",
    new SiteClassifier(modelPath, crawledFilesFolder),
    Writables.tableOf(Writables.strings(), Writables.writables(DNA.class)));

// Save the classification to Redis and keep a text log of what was saved
PCollection<String> logSavedRedis = classifiedItems.parallelDo("save classified",
    new SaveSiteToRedis(redisMasters), Writables.strings());
pipeline.writeTextFile(logSavedRedis, "/tmp/redisLog/classifier/redis");
pipeline.done();

MapReduce Pipelines

http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/

http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology

Merge

MapReduce Pipelines

http://www.tailtarget.com/home/
http://cnn.com/news

http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News

Redis

[Diagram: the data sets in the flow are numbered 1 to 6.]

MapReduce Pipelines

u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

http://www.tailtarget.com/home/
http://cnn.com/news

http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News

Redis

[Diagram: the data sets in the flow are numbered 1, 2, 3, 4 and 6.]

Pipeline A: Input: 1 Output: 2, 3, 4

Pipeline B: Input: 2 Output: 6
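Declaring each pipeline's inputs and outputs is what makes orchestration and chaining possible: a pipeline can start as soon as every data set it reads exists. The sketch below illustrates that ordering rule in plain Java. It is not how Crunch plans jobs internally; `PipelineScheduler` and its nested `Pipeline` class are hypothetical names, and data sets are identified by the numbers used in the diagram.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Orders pipelines so that each one runs only after its inputs exist. */
class PipelineScheduler {

    static final class Pipeline {
        final String name;
        final Set<Integer> inputs;
        final Set<Integer> outputs;

        Pipeline(String name, Set<Integer> inputs, Set<Integer> outputs) {
            this.name = name;
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    /** Returns the pipeline names in a runnable order, given the data sets already available. */
    static List<String> order(List<Pipeline> pipelines, Set<Integer> available) {
        Set<Integer> ready = new HashSet<>(available);
        List<Pipeline> pending = new ArrayList<>(pipelines);
        List<String> order = new ArrayList<>();
        boolean progress = true;
        while (!pending.isEmpty() && progress) {
            progress = false;
            Iterator<Pipeline> it = pending.iterator();
            while (it.hasNext()) {
                Pipeline p = it.next();
                if (ready.containsAll(p.inputs)) {
                    order.add(p.name);
                    ready.addAll(p.outputs); // its outputs unlock later pipelines
                    it.remove();
                    progress = true;
                }
            }
        }
        if (!pending.isEmpty()) {
            throw new IllegalStateException("some pipelines have inputs that nothing produces");
        }
        return order;
    }

    public static void main(String[] args) {
        List<Pipeline> pipelines = List.of(
            new Pipeline("B", Set.of(2), Set.of(6)),
            new Pipeline("A", Set.of(1), Set.of(2, 3, 4)));
        // Only data set 1 exists at the start, so A must run before B
        System.out.println(order(pipelines, Set.of(1)));
    }
}
```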

In the cloud you have virtually unlimited resources. But the cost is virtually unlimited too.

If you want to do magic, the more flexible the service, the better.

Amazon EC2

Pricing     Upfront   Hourly          1 year*
RESERVED    $1427     $0.104 / hour   2,338.04
ON-DEMAND   $0        $0.32 / hour    2,803.20
SPOT        $0        $0.042 / hour   ≅367.92

* Cost of running one Large instance for 1 year
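The yearly figures follow from the upfront fee plus the hourly rate times 8,760 hours. The short sketch below reproduces that arithmetic; the class name `Ec2YearlyCost` is hypothetical, and the prices are the ones quoted in the table, not current AWS prices.

```java
import java.util.Locale;

/** Yearly cost of one instance: upfront fee plus hourly rate for a full year. */
class Ec2YearlyCost {

    static final int HOURS_PER_YEAR = 24 * 365; // 8,760

    static double yearlyCost(double upfront, double hourlyRate) {
        return upfront + hourlyRate * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "reserved:  %.2f%n", yearlyCost(1427, 0.104));
        System.out.printf(Locale.ROOT, "on-demand: %.2f%n", yearlyCost(0, 0.32));
        System.out.printf(Locale.ROOT, "spot:      %.2f%n", yearlyCost(0, 0.042));
    }
}
```

The spot figure is only approximate because the spot price changes over time.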

Savings

Peaks of 60 servers throughout the day

                        On-demand      On-demand + Spot
1 year (≈8,760 hours)   ≈ USD 38,000   ≈ USD 24,580

≈ 35% savings in 1 year

40 servers running full time (18 spot)

Choosing the right instance type makes all the difference

Monitoring price variation over time can yield useful information for future purchases
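As an illustration of what such monitoring might compute, the sketch below summarizes a series of observed hourly spot prices and estimates how often a given bid would have been high enough. Everything here is an assumption for the sake of the example: the class `SpotPriceHistory` is hypothetical, the sample prices are made up, and in practice the series would come from the provider's price history.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

/** Simple statistics over observed hourly spot prices. */
class SpotPriceHistory {

    /** Min, max and average of the observed prices. */
    static DoubleSummaryStatistics summarize(double[] hourlyPrices) {
        return DoubleStream.of(hourlyPrices).summaryStatistics();
    }

    /** Fraction of observed hours in which the price was at or below the bid. */
    static double coverage(double[] hourlyPrices, double bid) {
        long covered = DoubleStream.of(hourlyPrices).filter(p -> p <= bid).count();
        return (double) covered / hourlyPrices.length;
    }

    public static void main(String[] args) {
        // Made-up sample: mostly cheap hours with one price spike
        double[] prices = {0.042, 0.040, 0.045, 0.320, 0.041};
        DoubleSummaryStatistics stats = summarize(prices);
        System.out.println("min=" + stats.getMin() + " max=" + stats.getMax()
            + " avg=" + stats.getAverage());
        System.out.println("coverage at bid 0.05: " + coverage(prices, 0.05));
    }
}
```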

How to use spot instances

Prefer shared-nothing architectures

Remember: you can lose the server at any moment

Choose machines in different availability zones

Use some on-demand instances
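The last two recommendations can be combined into a simple placement plan: spread the fleet over several zones and keep a floor of on-demand machines that survives a spot price spike. The sketch below shows one way to express that; `FleetPlan` and the zone names are hypothetical, and the 40-server, 18-spot split mirrors the example in the savings slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Round-robin placement of instances across zones with an on-demand floor. */
class FleetPlan {

    /** Returns one "zone:market" entry per instance; the first onDemand entries are on-demand. */
    static List<String> plan(int total, int onDemand, List<String> zones) {
        if (onDemand > total || zones.isEmpty()) {
            throw new IllegalArgumentException("invalid fleet definition");
        }
        List<String> placements = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            String zone = zones.get(i % zones.size()); // spread evenly over the zones
            String market = i < onDemand ? "on-demand" : "spot";
            placements.add(zone + ":" + market);
        }
        return placements;
    }

    public static void main(String[] args) {
        // 40 servers, 18 of them spot, spread over three hypothetical zones
        List<String> fleet = plan(40, 22, List.of("zone-a", "zone-b", "zone-c"));
        fleet.forEach(System.out::println);
    }
}
```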

Auto Scaling

How not to go crazy with all this?

#BeLazy

Take actions based on data

Monitor your indicators

Use templates for your servers

WebFront Server

Hadoop TaskTracker

Back-end Server
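Acting on data can be as simple as deriving the desired fleet size from a monitored indicator. The sketch below is one hedged way to do that: `ScalingDecision`, the requests-per-second indicator, and the per-server capacity are all assumptions chosen for illustration, and the minimum and maximum keep the script from scaling to zero or without bound.

```java
/** Derives a desired server count from a monitored load indicator. */
class ScalingDecision {

    /**
     * Servers needed to handle the load, clamped to [min, max].
     * capacityPerServer is an assumed, measured throughput per server.
     */
    static int desiredServers(double requestsPerSecond, double capacityPerServer,
                              int min, int max) {
        int needed = (int) Math.ceil(requestsPerSecond / capacityPerServer);
        return Math.max(min, Math.min(max, needed));
    }

    public static void main(String[] args) {
        int current = 8;
        int desired = desiredServers(950, 100, 2, 60);
        if (desired > current) {
            System.out.println("scale up by " + (desired - current));
        } else if (desired < current) {
            System.out.println("scale down by " + (current - desired));
        } else {
            System.out.println("hold");
        }
    }
}
```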

Auto scaling

Let your script decide the servers' addressing

Scale Up

10.0.15.104

10.0.15.101

10.0.15.102

10.0.15.103

10.0.15.[N]

* Random IPs only make administration more complex
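A scaling script can pick addresses deterministically by taking the lowest free number in a fixed range. The following sketch assumes the `10.0.15.[N]` scheme from the diagram, starting at 101; the class `IpAllocator` and the upper bound of the range are hypothetical.

```java
import java.util.Set;

/** Picks the next private IP in a fixed, predictable range. */
class IpAllocator {

    /** Returns prefix + N for the lowest N in [first, last] that is not in use. */
    static String nextAddress(String prefix, int first, int last, Set<String> inUse) {
        for (int n = first; n <= last; n++) {
            String candidate = prefix + n;
            if (!inUse.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no free address in range");
    }

    public static void main(String[] args) {
        Set<String> running = Set.of(
            "10.0.15.101", "10.0.15.102", "10.0.15.103", "10.0.15.104");
        // The next server to start gets the lowest free address in the range
        System.out.println(nextAddress("10.0.15.", 101, 199, running));
    }
}
```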

Buying a machine via the API

Collection<InstanceNetworkInterfaceSpecification> networkInterfaces;
InstanceNetworkInterfaceSpecification networkInterface;
. .
networkInterface.setDeviceIndex(0);
networkInterface.setPrivateIpAddress("10.0.1.101");
. .
networkInterfaces.add(networkInterface);
specs.setNetworkInterfaces(networkInterfaces);
. .
client.requestSpotInstances(request);

AWS Java SDK

Auto Scaling Hadoop

Pre-configure the 'conf/slaves' file with the hosts of the servers you intend to start when you need to scale

Use spot instances for the JobTrackers
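Because the addresses are predictable, the slaves file can be generated ahead of time. The sketch below writes one host per line for a planned range of workers; `SlavesFile` is a hypothetical helper, and the `10.0.15.` prefix, the starting number 101 and the count of 60 are assumptions borrowed from the earlier slides.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Generates a slaves file listing every worker that may be started later. */
class SlavesFile {

    /** Hosts prefix+first, prefix+(first+1), ... for count entries. */
    static List<String> hosts(String prefix, int first, int count) {
        List<String> hosts = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            hosts.add(prefix + (first + i));
        }
        return hosts;
    }

    /** Writes one host per line. */
    static void write(Path file, List<String> hosts) throws IOException {
        Files.write(file, hosts);
    }

    public static void main(String[] args) throws IOException {
        // Plan for up to 60 workers, whether or not they are running yet
        write(Path.of("slaves"), hosts("10.0.15.", 101, 60));
    }
}
```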

Things I wish I had known when we started

Via the API you can buy machines with a predetermined IP

Use provisioned IOPS on disks that need more write performance

You can do more through the API than through the web interface

When you use Big Data technologies, make sure your data really is big

HOW TO TELL WHETHER YOUR DATA REALLY IS BIG:

Your data does not all fit on a single machine

by Fernando Stankuns

You are talking in terabytes more than in gigabytes

The amount of data you process grows constantly. And it should double next year.

by Saulo Cruz

FOR EVERYTHING ELSE: KEEP IT SIMPLE!
