INF550 - Cloud Computing I: Apache Spark. Islene Calciolari Garcia, Instituto de Computação - Unicamp, July 2016


Page 1

INF550 - Cloud Computing I

Apache Spark

Islene Calciolari Garcia

Instituto de Computação - Unicamp

July 2016

Page 2

Outline

- Review of the previous class...
  - Objectives
  - HDFS
  - MapReduce

- The Hadoop ecosystem

- Spark
  - RDDs and SparkContext
  - How to test it?

- Quick Python review
  - pyspark

- Lab

Page 3

Review of the previous class...

Page 4

Objectives

- First part of the course
  - Cloud computing
  - Data centers

- Second part of the course
  - The MapReduce programming model
  - Spark and the Hadoop ecosystem

Page 5

HDFS Architecture

Source: http://hadoop.apache.org

Page 6

HDFS: Reading a file

Data Flow

Anatomy of a File Read

To get an idea of how data flows between the client interacting with HDFS, the namenode, and the datanodes, consider Figure 3-2, which shows the main sequence of events when reading a file.

Figure 3-2. A client reading data from HDFS

The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster's network; see "Network Topology and Hadoop" on page 70). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block (see also Figure 2-2 and "Short-circuit local reads" on page 308).

The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.


Source: Hadoop: The Definitive Guide, Tom White

Page 7

HDFS: Writing to a file

Anatomy of a File Write

Next we'll look at how files are written to HDFS. Although quite detailed, it is instructive to understand the data flow because it clarifies HDFS's coherency model.

We're going to consider the case of creating a new file, writing data to it, then closing the file. This is illustrated in Figure 3-4.

Figure 3-4. A client writing data to HDFS

The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.

As the client writes data (step 3), the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline.


Source: Hadoop: The Definitive Guide, Tom White

Page 8

MapReduce: a colorful view

http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html

Page 9

Word Count

http://www.cs.uml.edu/~jlu1/doc/source/report/img/MapReduceExample.png
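The flow in this word-count example (map each line to (word, 1) pairs, shuffle by key, reduce by summing) can be mimicked in plain Python, without Hadoop. This is a didactic sketch; the phase functions and sample lines are ours, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```

In real Hadoop the three phases run distributed across the cluster; only the two user-supplied pieces (map and reduce) look like the functions above.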

Page 10

The Hadoop Ecosystem

Page 11

The Hadoop Ecosystem

http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

Page 12

Hadoop 1.0

Source: http://hadoop.apache.org

Page 13

Hadoop 1.0

- JobTracker
  - manages the TaskTrackers (resources and failures)
  - manages the life cycle of jobs

- TaskTracker
  - starts and stops tasks
  - sends status to the JobTracker

- Scalability?

- Other programming models?

Page 14

YARN: Yet Another Resource Negotiator

Source: http://hadoop.apache.org

Page 15

YARN

- The JobTracker was overloaded
  - resource management
  - application management

- Container: an abstraction that bundles resources such as CPU, memory, disk, network...

- ResourceManager
  - resource scheduler

- NodeManager

- ApplicationMaster

Page 16

Testing YARN

$ sbin/start-yarn.sh

$ bin/hdfs dfs -put input /input

$ bin/yarn jar \
    share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
    wordcount /input /output

$ bin/hdfs dfs -get /output output

- Check the jobs at http://localhost:8088/

- When you are done:

$ sbin/stop-yarn.sh

Page 17

Managing multiple applications

http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/

Page 18

Page 19

Page 20

How can Spark be so much faster?

The MapReduce processing model

© 2014 MapR Technologies 12

MapReduce Processing Model

• Define mappers

• Shuffling is automatic

• Define reducers

• For complex work, chain jobs together

– Use a higher level language or DSL that does this for you


Typical MapReduce Workflows

[Figure: a chain of MapReduce jobs. Job 1 reads its input, runs maps and reduces, and writes its output to HDFS as a SequenceFile; that file is the input to Job 2, and so on until the last job writes the final output. Every hop between jobs goes through HDFS.]

Carol McDonald: An Overview of Apache Spark

Page 21

How can Spark be so much faster?

Resilient Distributed Datasets


Iterations

[Figure: an iterative computation as a chain of steps with in-memory caching: data partitions are read from RAM instead of disk.]

Carol McDonald: An Overview of Apache Spark

- RDD: the main abstraction in Spark

- Immutable

- Fault-tolerant
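Immutability and fault tolerance go together through lineage: an RDD records how it was derived from its parent, so a lost partition can be recomputed instead of restored from a replica. A toy Python analogy follows (these classes are invented for illustration; they are not Spark internals):

```python
class ToyRDD:
    # a toy "RDD": remembers its parent and the function to reapply (its lineage)
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # transformations return a NEW dataset; the original is never mutated
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # a lost result can always be rebuilt by replaying the lineage chain
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.compute()]

base = ToyRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.compute())  # [2, 4, 6]
print(base.compute())     # [1, 2, 3] -- base is unchanged
```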

Page 22

SparkContext


Spark Programming Model

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.map

[Figure: the driver program holds the SparkContext, which connects to the cluster and distributes tasks to worker nodes.]

Carol McDonald: An Overview of Apache Spark

Page 23

SparkContext: cluster overview

http://spark.apache.org/docs/latest/cluster-overview.html

Page 24

SparkContext: RDDs and partitions


Resilient Distributed Datasets (RDD)

Spark revolves around RDDs:

- fault-tolerant
- a read-only collection of elements
- operated on in parallel
- cached in memory, or on disk

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 25

Operations on RDDs

- Much more than just Map and Reduce

- Transformations and Actions

- Transformations are evaluated lazily
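Lazy evaluation of transformations can be illustrated with a plain Python generator: building the pipeline does no work, and elements are only processed when an action pulls on it. This is an analogy, not pyspark code:

```python
def trace_filter(pred, items, log):
    # "transformation": builds a generator; no element is examined yet
    for x in items:
        log.append(x)          # record when an element is actually processed
        if pred(x):
            yield x

log = []
pipeline = trace_filter(lambda x: x % 2 != 0, range(1, 10), log)
print(len(log))         # 0 -> defining the pipeline did no work
first = next(pipeline)  # the "action": pulls data through the chain
print(first, len(log))  # 1 1 -> only one element was processed so far
```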


http://databricks.com

Page 26

How to test it?

Page 27

How to test it?

Spark and Python

- Spark can easily be used with Scala, Java, or Python

- See the Spark Quick Start

- Shells
  - the Python shell
  - pyspark

Page 28

Quick Python review: string operations

Commands                  Output
astring = "Spark"
print astring             Spark
print len(astring)        5
print astring[0]          S
print astring[1:3]        pa
print astring[3:]         rk
print astring[0:5:2]      Sak
print astring[::-1]       krapS

Page 29

Python: more string operations

Commands                  Output
line = " GNU is not Unix. "
line = line.strip()
print line                GNU is not Unix.
words = line.split()
print words               ['GNU', 'is', 'not', 'Unix.']
print words[1]            is

Page 30

Quick Python review: functions

def soma(a, b):
    return a + b

def mult(a, b):
    return a * b

def invertString(s):
    return s[::-1]

Page 31

Quick Python review: lambda functions

Functions that are not given a name at run time:

>>> lista = range(1,10)
>>> print lista
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> def impar(i):
...     return i % 2 != 0
>>> filter(impar, lista)
[1, 3, 5, 7, 9]
>>> filter(lambda x: x % 2 != 0, lista)
[1, 3, 5, 7, 9]

Page 32

pyspark: a first SparkContext


Working With RDDs

textFile = sc.textFile("SomeFile.txt")

Carol McDonald: An Overview of Apache Spark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.11 (default, Jun 20 2016 14:45:23)
SparkContext available as sc, HiveContext available as sqlContext.

>>> lines = sc.textFile("tcpdump.list")

Page 33

1998 DARPA Intrusion Detection Evaluation

#   Start Date   Start Time   Duration   Serv   Src Port   Dest Port   Src IP   Dest IP   Score   Attack Name

1 01/27/1998 00:00:01 00:00:23 ftp 1755 21 192.168.1.30 192.168.0.20 0.31 -

2 01/27/1998 05:04:43 67:59:01 telnet 1042 23 192.168.1.30 192.168.0.20 0.42 -

3 01/27/1998 06:04:36 00:00:59 smtp 43590 25 192.168.1.30 192.168.0.40 12.0 -

4 01/27/1998 08:45:01 00:00:01 finger 1050 79 192.168.0.40 192.168.1.30 2.56 guess

5 01/27/1998 09:23:45 00:00:01 http 1031 80 192.168.1.30 192.168.0.40 -1.3 -

7 01/27/1998 15:11:32 00:00:12 sunrpc 2025 111 192.168.1.30 192.168.0.20 3.10 rpc

8 01/27/1998 21:53:17 00:00:45 exec 2032 512 192.168.1.30 192.168.0.40 2.95 exec

9 01/27/1998 21:58:21 00:00:01 http 1031 80 192.168.1.30 192.168.0.20 0.45 -

10 01/27/1998 22:57:53 26:59:00 login 2031 513 192.168.0.40 192.168.1.20 7.00 -

11 01/27/1998 23:57:28 130:23:08 shell 1022 514 192.168.1.30 192.168.0.20 0.52 guess

13 01/27/1998 25:38:00 00:00:01 eco/i - - 192.168.0.40 192.168.1.30 0.01 -

Page 34

1998 DARPA Intrusion Detection Evaluation

https://www.ll.mit.edu/ideval/docs/index.html

Page 35

Filter


Working With RDDs: transformations

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

Carol McDonald: An Overview of Apache Spark

Page 36

How to print and filter an RDD

>>> lines = sc.textFile("tcpdump.list")

>>> lines.take(10)

>>> for x in lines.collect():

... print x

>>> telnet = lines.filter(lambda x: "telnet" in x)

>>> for x in telnet.collect():

... print x

Page 37

Some simple transformations

map(func)             every element of the original RDD is transformed by func
flatMap(func)         every element of the original RDD is transformed into 0 or more items by func
filter(func)          returns only the elements selected by func
groupByKey()          given a (k, v) dataset, returns (k, Iterable<v>)
reduceByKey(func)     given a (k, v) dataset, returns another one with each key's values combined by func
sortByKey(ascending)  given a (k, v) dataset, returns another one sorted in ascending or descending key order

See more in the Spark Programming Guide
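As a rough mental model, the pair transformations above can be mimicked with plain Python dictionaries and sorting. The helper names below are ours; this is an analogy, not the pyspark API:

```python
pairs = [("ftp", 1), ("http", 1), ("ftp", 1), ("smtp", 1), ("http", 1)]

def group_by_key(pairs):
    # groupByKey analogue: (k, v) pairs -> {k: [v, v, ...]}
    grouped = {}
    for k, v in pairs:
        grouped.setdefault(k, []).append(v)
    return grouped

def reduce_by_key(func, pairs):
    # reduceByKey analogue: combine each key's values with func
    reduced = {}
    for k, v in pairs:
        reduced[k] = func(reduced[k], v) if k in reduced else v
    return reduced

def sort_by_key(pairs, ascending=True):
    # sortByKey analogue: sort the (k, v) pairs by key
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

print(group_by_key(pairs)["http"])               # [1, 1]
print(reduce_by_key(lambda a, b: a + b, pairs))  # {'ftp': 2, 'http': 2, 'smtp': 1}
print(sort_by_key(pairs)[0])                     # ('ftp', 1)
```

In Spark the same operations run partitioned across the cluster, with a shuffle moving equal keys to the same machine.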

Page 38

Some actions

count()     returns the number of elements in the dataset
collect()   returns all the elements of the dataset
take(n)     returns the first n elements of the dataset

See more in the Spark Programming Guide

Page 39

Lab

- Install Spark

- Get a copy of the DARPA dataset

- Come up with interesting questions and operate on the data

- Hand in code and a report via Moodle

Page 40

A simple example

How to sort accesses by service type:

>>> lines = sc.textFile("tcpdump.list")

>>> servicePairs = lines.map(lambda x: (str(x.split()[4]), str(x)))

>>> sortedServ = servicePairs.sortByKey()

How could we drill into the data to identify the attacks?
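Without a Spark installation, the same keying and sorting logic can be checked in plain Python on a couple of records in the format of the DARPA listing shown earlier (a list stands in for the RDD):

```python
# two sample records in the format of the 1998 DARPA listing above
lines = [
    "2 01/27/1998 05:04:43 67:59:01 telnet 1042 23 192.168.1.30 192.168.0.20 0.42 -",
    "1 01/27/1998 00:00:01 00:00:23 ftp 1755 21 192.168.1.30 192.168.0.20 0.31 -",
]

# same logic as the pyspark example: key each line by its service field (column 4)
servicePairs = [(x.split()[4], x) for x in lines]

# sortByKey() analogue: sort the pairs by the service name
sortedServ = sorted(servicePairs, key=lambda kv: kv[0])

print(sortedServ[0][0])  # ftp
```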

Page 41

References

- Python Tutorial
- Apache Spark
- Spark Programming Guide
- Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics, Juwei Shi et al., IBM Research, China