
Upload: amazon-web-services-latam

Post on 16-Apr-2017


Page 1: Benefícios e melhores práticas no uso do Amazon Redshift

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Daniel Bento, Arquiteto de Soluções, AWS

Novembro 2016 | São Paulo, Brasil

Workshop: Amazon Redshift

Page 2: Benefícios e melhores práticas no uso do Amazon Redshift

Agenda

• Introduction

• Architecture and Use Cases

• Data Loading and Migration

• Table and Schema Design

• Operations

Page 3: Benefícios e melhores práticas no uso do Amazon Redshift

We start with the customer… and innovate

Customers told us… | We created…
Managing databases is painful & difficult | Amazon RDS
SQL DBs do not perform well at scale | Amazon DynamoDB
Hadoop is difficult to deploy and manage | Amazon EMR
DWs are complex, costly, and slow | Amazon Redshift
Commercial DBs are punitive & expensive | Amazon Aurora
Streaming data is difficult to capture & analyze | Amazon Kinesis
BI tools are expensive and hard to manage | Amazon QuickSight

Page 4: Benefícios e melhores práticas no uso do Amazon Redshift

AWS Big Data Portfolio

[Diagram: services grouped into Collect, Store, and Analyze stages, with some marked New or Launched. Services shown: Kinesis, Kinesis Firehose, Direct Connect, Import/Export, Database Migration, S3, Glacier, DynamoDB, RDS, Aurora, Data Pipeline, CloudSearch, EMR, EC2, Redshift, Machine Learning, Elasticsearch, QuickSight, SQL over Streams]

Page 5: Benefícios e melhores práticas no uso do Amazon Redshift

Global Footprint

14 Regions; 33 Availability Zones; 54 Edge Locations

Redshift

Page 6: Benefícios e melhores práticas no uso do Amazon Redshift

Relational data warehouse

Massively parallel; Petabyte scale

Fully managed

HDD and SSD Platforms

$1,000/TB/Year; starts at $0.25/hour

Free Tier – 2 months

Amazon Redshift

A lot faster, a lot simpler, a lot cheaper

Page 7: Benefícios e melhores práticas no uso do Amazon Redshift

The legacy view of data warehousing ...

Multi-year commitment

Multi-year deployments

Multi-million dollar deals

Page 8: Benefícios e melhores práticas no uso do Amazon Redshift

… Leads to dark data

This is a narrow view

Small companies also have big data

(mobile, social, gaming, adtech, IoT)

Long cycles, high costs, administrative complexity all stifle innovation

[Chart: Enterprise Data compared with Data in Warehouse; vertical axis from 0 to 1200]

Page 9: Benefícios e melhores práticas no uso do Amazon Redshift

The Amazon Redshift view of data warehousing

10x cheaper

Easy to provision

Higher DBA productivity

10x faster

No programming

Easily leverage BI tools, Hadoop, Machine Learning, Streaming

Analysis in-line with process flows

Pay as you go, grow as you need

Managed availability & DR

Enterprise Big Data SaaS

Page 10: Benefícios e melhores práticas no uso do Amazon Redshift

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

The Forrester Wave™: Enterprise Data Warehouse, Q4 2015

Page 11: Benefícios e melhores práticas no uso do Amazon Redshift

Selected Amazon Redshift Customers

Customer | Industry
NTT Docomo | Telecom
FINRA | Financial Svcs
Philips | Healthcare
Yelp | Technology
NASDAQ | Financial Svcs
The Weather Company | Media
Nokia | Telecom
Pinterest | Technology
Foursquare | Technology
Coursera | Education
Coinbase | Bitcoin
Amazon | E-Commerce
Etix | Entertainment
Spuul | Entertainment
Vivaki | Ad Tech
Z2 | Gaming
Neustar | Ad Tech
SoundCloud | Technology
BeachMint | E-Commerce
Civis | Technology

Page 12: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Redshift Security

Petabyte-Scale Data Warehousing Service

Amazon Redshift Architecture

Page 13: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Redshift Architecture

Leader Node
• Simple SQL endpoint
• Stores metadata
• Optimizes query plan
• Coordinates query execution

Compute Nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS1/DS2: HDD; scale from 2 TB to 2 PB

[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]

Page 14: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #1: Amazon Redshift is fast

Dramatically less I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw

Zone map example (minimum and maximum value kept per block):
Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)

Page 15: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #1: Amazon Redshift is fast
Sort Keys and Zone Maps

SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'

Unsorted Table (zone map per block)
• MIN: 01-JUNE-2016, MAX: 20-JUNE-2016
• MIN: 08-JUNE-2016, MAX: 30-JUNE-2016
• MIN: 12-JUNE-2016, MAX: 20-JUNE-2016
• MIN: 02-JUNE-2016, MAX: 25-JUNE-2016

Sorted By Date (zone map per block)
• MIN: 01-JUNE-2016, MAX: 06-JUNE-2016
• MIN: 07-JUNE-2016, MAX: 12-JUNE-2016
• MIN: 13-JUNE-2016, MAX: 18-JUNE-2016
• MIN: 19-JUNE-2016, MAX: 24-JUNE-2016
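The block ranges on this slide can be checked with a small sketch. It is only an illustration of zone-map pruning, not Redshift's implementation; the function name `blocks_to_scan` and the helper `june` are hypothetical. A block has to be read only when the filter value falls inside its recorded min/max range.

```python
from datetime import date


def blocks_to_scan(zone_map, target):
    """Return the indexes of blocks whose [min, max] range could contain target."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]


def june(day):
    """Shorthand for a date in June 2016 (the month used on the slide)."""
    return date(2016, 6, day)


# (min, max) per block, copied from the slide.
unsorted_blocks = [(june(1), june(20)), (june(8), june(30)),
                   (june(12), june(20)), (june(2), june(25))]
sorted_blocks = [(june(1), june(6)), (june(7), june(12)),
                 (june(13), june(18)), (june(19), june(24))]

target = june(9)  # WHERE DATE = '09-JUNE-2016'
unsorted_hits = blocks_to_scan(unsorted_blocks, target)
sorted_hits = blocks_to_scan(sorted_blocks, target)

print(len(unsorted_hits), "of", len(unsorted_blocks), "blocks read (unsorted)")
print(len(sorted_hits), "of", len(sorted_blocks), "blocks read (sorted by date)")
```

With the slide's ranges, the unsorted table has three candidate blocks for 09-JUNE-2016, while the table sorted by date has one.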

Page 16: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #1: Amazon Redshift is fast

Parallel and Distributed
• Query
• Load
• Export
• Backup
• Restore
• Resize

Page 17: Benefícios e melhores práticas no uso do Amazon Redshift

ID Name

1 John Smith

2 Jane Jones

3 Peter Black

4 Pat Partridge

5 Sarah Cyan

6 Brian Snail

1 John Smith

4 Pat Partridge

2 Jane Jones

5 Sarah Cyan

3 Peter Black

6 Brian Snail

Benefit #1: Amazon Redshift is fast
Distribution Keys

Page 18: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #1: Amazon Redshift is fast

H/W optimized for I/O intensive workloads

Choice of storage type, instance size

Regular cadence of auto-patched improvements

Example: Our new Dense Storage (HDD) instance type
• Improved memory 2x, compute 2x, disk throughput 1.5x
• Cost: same as our prior generation!

Page 19: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #2: Amazon Redshift is inexpensive

DS2 (HDD) | Price per hour for DS2.XL single node | Effective annual price per TB compressed
On-Demand | $0.850 | $3,725
1 Year Reservation | $0.500 | $2,190
3 Year Reservation | $0.228 | $999

DC1 (SSD) | Price per hour for DC1.L single node | Effective annual price per TB compressed
On-Demand | $0.250 | $13,690
1 Year Reservation | $0.161 | $8,795
3 Year Reservation | $0.100 | $5,500

Pricing is simple
• Number of nodes x price/hour
• No charge for leader node
• No upfront costs
• Pay as you go
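The effective annual figures can be approximated from the hourly prices. The sketch below assumes 8,760 hours per year and a DS2.XL capacity of 2 TB (stated elsewhere in the deck); the function names are hypothetical, and the slide's numbers appear to be rounded, so small differences are expected.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_price_per_tb(price_per_hour, node_capacity_tb):
    """Yearly cost of one node divided by its storage capacity in TB."""
    return price_per_hour * HOURS_PER_YEAR / node_capacity_tb


def cluster_cost_per_hour(compute_nodes, price_per_hour):
    """Number of nodes x price/hour; the leader node is not charged."""
    return compute_nodes * price_per_hour


DS2_XL_TB = 2.0  # DS2.XL: 2 TB HDD per node

for label, hourly in [("On-Demand", 0.850),
                      ("1 Year Reservation", 0.500),
                      ("3 Year Reservation", 0.228)]:
    print(f"DS2.XL {label}: ${annual_price_per_tb(hourly, DS2_XL_TB):,.0f} per TB per year")

print(f"4-node DS2.XL cluster, on-demand: ${cluster_cost_per_hour(4, 0.850):.2f}/hour")
```

The same function applies to the DC1 table if the DC1.L capacity is supplied; the deck only implies that value (160 GB), so it is left out here.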

Page 20: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Redshift lets you start small and grow big

Dense Storage (DS2.XL) 2 TB HDD, 31 GB RAM, 2 slices/4 cores

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB – 64 TB)

Dense Storage (DS2.8XL) 16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE

Cluster 2-128 Nodes (32 TB – 2 PB)

Note: Nodes not to scale

Benefit #2: Amazon Redshift is inexpensive

Page 21: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Continuous/incremental backups

Multiple copies within cluster

Continuous and incremental backups to S3

Continuous and incremental backups across regions

Streaming restore

Amazon S3

Amazon S3

Region 1

Region 2

Page 22: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Amazon S3

Amazon S3

Region 1

Region 2

Fault tolerance

Disk failures

Node failures

Network failures

Availability Zone/Region level disasters

Page 23: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #4: Security is built-in

• Load encrypted from S3
• SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3 encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

[Diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]

Page 24: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #5: We innovate quickly

2013
• Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo, Singapore, Sydney) Regions
• MANIFEST option for the COPY & UNLOAD commands
• SQL Functions: Most recent queries
• Resource-level IAM, CRC32
• Data Pipeline
• Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
• Cross-Region Snapshot Copy
• Audit features, cursor support, 500 concurrent client connections
• EIP Address for VPC Cluster

2014
• System Tables for query tuning
• Dense Compute Nodes
• Gzip & Lzop; JSON, RegEx, Cursors
• EMR Data Loading & Bootstrap Action with COPY command; WLM concurrency limit to 50; support for the ECDH cipher suites for SSL connections; FedRAMP
• Cross-region ingestion
• Free trials & price reductions in Asia Pacific
• CloudWatch Alarm for Disk Usage
• AES 128-bit encryption; UTF-16; KMS Integration
• EU (Frankfurt); GovCloud Regions
• S3 Server-side encryption support for UNLOAD
• Tagging Support for Cost-allocation

2015
• New system views to tune table design and track WLM query queues
• Custom ODBC/JDBC drivers; Query Visualization
• Mobile Analytics auto export
• KMS for GovCloud Region; HIPAA BAA
• Interleaved sort keys
• New Dense Storage Nodes (DS2) with better RAM and CPU
• New Reserved Storage Nodes: No, Partial & All Upfront Options
• Cross-region backups for KMS encrypted clusters
• Scalar UDFs in Python
• AVRO Ingestion; Kinesis Firehose; Database Migration Service (DMS)
• Modify Cluster Dynamically
• Tag-based permissions and BZIP2

2016
• WLM Queue-Hopping for timed-out queries
• Append rows & Export to BZIP-2
• Lambda for Clusters in VPC; Data Schema Conversion Support from ML Console
• US West (N. California) Region

100+ new features added since launch
Release every two weeks
Automatic patching

Page 25: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #6: Amazon Redshift is powerful

• Approximate functions

• User defined functions

• Machine Learning

• Data Science

Amazon ML

Page 26: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #7: Amazon Redshift has a large ecosystem

[Partner categories shown: Data Integration; Systems Integrators; Business Intelligence]

Page 27: Benefícios e melhores práticas no uso do Amazon Redshift

Benefit #8: Service oriented architecture

DynamoDB

EMR

S3

EC2/SSH

RDS/Aurora

Amazon Redshift

Amazon Kinesis

MachineLearning

Data Pipeline

CloudSearch

Mobile Analytics

Page 28: Benefícios e melhores práticas no uso do Amazon Redshift

Use cases

Page 29: Benefícios e melhores práticas no uso do Amazon Redshift

Analyzing Twitter Firehose

Page 30: Benefícios e melhores práticas no uso do Amazon Redshift

• Amazon Redshift: starts at $0.25/hour
• EC2: starts at $0.02/hour
• S3: $0.030/GB-Mo
• Amazon Glacier: $0.010/GB-Mo
• Amazon Kinesis: $0.015/hour per shard (1 MB/s in; 2 MB/s out); $0.028/million PUTs

Analyzing Twitter Firehose

Page 31: Benefícios e melhores práticas no uso do Amazon Redshift

500MM tweets/day = ~ 5,800 tweets/sec

2k/tweet is ~12MB/sec (~1TB/day)

$0.015/hour per shard, $0.028/million PUTS

Amazon Kinesis cost is $0.765/hour

Amazon Redshift cost is $0.850/hour (for a 2TB node)

S3 cost is $1.28/hour (no compression)

Total: $2.895/hour
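The hourly total can be reproduced from the unit prices on the slides. The sketch below uses the slide's rounded rates (5,800 tweets/sec, 12 MB/sec). Two parts are assumptions rather than statements from the deck: one Kinesis shard per 1 MB/s of ingest, and an S3 figure obtained by pricing one day of data (about 1 TB, taken as 1,024 GB) at the monthly rate and spreading it over 24 hours, which happens to match the slide's $1.28.

```python
import math

tweets_per_day = 500_000_000
print(f"{tweets_per_day / 86_400:,.0f} tweets/sec before rounding")  # slide rounds to ~5,800

TWEETS_PER_SEC = 5_800   # rounded rate used on the slide
MB_PER_SEC = 12          # 2 KB per tweet, rounded up on the slide

# Kinesis: pay per shard-hour plus per million PUTs.
shards = math.ceil(MB_PER_SEC / 1)             # assumption: 1 MB/s ingest per shard
shard_cost = shards * 0.015                    # $/hour
puts_per_hour_millions = TWEETS_PER_SEC * 3600 / 1_000_000
put_cost = puts_per_hour_millions * 0.028      # $/hour
kinesis_cost = shard_cost + put_cost

redshift_cost = 0.850                          # one 2 TB node, on-demand, $/hour

# Assumption: one day of data (~1 TB) priced at $0.030/GB-month, spread over 24 hours.
s3_cost = 1024 * 0.030 / 24

total = kinesis_cost + redshift_cost + s3_cost
print(f"Kinesis ${kinesis_cost:.3f}/h, Redshift ${redshift_cost:.3f}/h, "
      f"S3 ${s3_cost:.2f}/h, total ${total:.3f}/h")
```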

Data warehouses can be

inexpensive and

powerful

Page 32: Benefícios e melhores práticas no uso do Amazon Redshift

Use only the services you need

Scale only the services you need

Pay for what you use

~40% discount with 1 year commitment

~70% discounts with 3 year commitment

Data warehouses can be

inexpensive and

powerful

Page 33: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon.com – Weblog analysis

Web log analysis for Amazon.com
• 1PB+ workload, 2TB/day, growing 67% YoY
• Largest table: 400 TB

Want to understand customer behavior

Page 34: Benefícios e melhores práticas no uso do Amazon Redshift

Query 15 months of data (1PB) in 14 minutes

Load 5B rows in 10 minutes

21B rows joined with 10B rows – 3 days (Hive) to 2 hours

Load pipeline: 90 hours (Oracle) to 8 hours

64 clusters

800 total nodes

13PB provisioned storage

2 DBAs

Data warehouses can be fast and simple

Page 35: Benefícios e melhores práticas no uso do Amazon Redshift

Sushiro – Real-time streaming from IoT & analysis

Page 36: Benefícios e melhores práticas no uso do Amazon Redshift

Sushiro – Real-time streaming & analysis

Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift

380 stores stream live data from Sushi plates

Inventory information combined with consumption information near real-time

Forecast demand by store, minimize food waste, and improve efficiencies

Amazon

Page 37: Benefícios e melhores práticas no uso do Amazon Redshift

Big data does not mean batch

Can be streamed in

Can be processed in near real time

Can be used to respond quickly to requests

You can mix and match

On premises and cloud

Custom development and managed services

Infrastructure with managed scaling, security

Data warehouses can support

real-time data

Page 38: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Redshift Data Loading and Migration

Petabyte-Scale Data Warehousing Service

Page 39: Benefícios e melhores práticas no uso do Amazon Redshift

Data Loading Process

Data Source Extraction Transformation Loading

Amazon Redshift

Target

Focus

Page 40: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Redshift Loading Data Overview

[Diagram: paths for loading data into Amazon Redshift, with an axis labeled Data Volume]
• Corporate data center: logs / files, source DBs
• Transfer options: VPN Connection, AWS Direct Connect, S3 Multipart Upload, AWS Import/Export, EC2 or on-prem (using SSH)
• AWS Cloud: Amazon DynamoDB, Amazon S3, Amazon Elastic MapReduce, Amazon RDS, Amazon Glacier, Amazon Redshift
• Loading services: Database Migration Service, Kinesis, AWS Lambda, AWS Data Pipeline

Page 41: Benefícios e melhores práticas no uso do Amazon Redshift

Uploading Files to Amazon S3

• Ensure that your data resides in the same Region as your Redshift clusters
• Split the data into multiple files to facilitate parallel processing
• Optionally, you can encrypt your data using Amazon S3 Server-Side or Client-Side Encryption
• Files should be individually compressed using GZIP or LZOP

[Diagram: Client.txt in the corporate data center is split into Client.txt.1, Client.txt.2, Client.txt.3, and Client.txt.4 and uploaded to the mydata bucket in the same Region as Amazon Redshift]
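One way to prepare the input is sketched below, assuming a line-oriented, pipe-delimited file. The function names and the `.gz` suffix are illustrative choices, not something the deck prescribes. Rows are dealt out round-robin so the parts end up similar in size, and each part is compressed individually.

```python
import gzip


def split_round_robin(lines, parts):
    """Distribute lines across `parts` buckets so the files are similar in size."""
    buckets = [[] for _ in range(parts)]
    for i, line in enumerate(lines):
        buckets[i % parts].append(line)
    return buckets


def compress_parts(buckets, base_name):
    """Return {file_name: gzip bytes}, one individually compressed file per bucket."""
    return {
        f"{base_name}.{n}.gz": gzip.compress("".join(bucket).encode("utf-8"))
        for n, bucket in enumerate(buckets, start=1)
    }


# Ten sample rows standing in for the contents of Client.txt.
rows = [f"{i}|customer-{i}\n" for i in range(1, 11)]
files = compress_parts(split_round_robin(rows, 4), "Client.txt")

for name, payload in files.items():
    print(name, len(payload), "bytes")
```

The resulting byte strings would then be uploaded to the bucket with whatever S3 client is already in use.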

Page 42: Benefícios e melhores práticas no uso do Amazon Redshift

Loading Data From Amazon S3

• Preparing input data files
• Uploading files to Amazon S3
• Using COPY to load data from Amazon S3

Page 43: Benefícios e melhores práticas no uso do Amazon Redshift

Splitting Data Files

[Diagram: 2 XL compute nodes (Node 0 and Node 1), each with Slice 0 and Slice 1; the files Client.txt.1 through Client.txt.4 in the mydata bucket are loaded one per slice]

copy customer from 's3://mydata/client.txt'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your_secret_key>'
delimiter '|';

Page 44: Benefícios e melhores práticas no uso do Amazon Redshift

Use the COPY command

Each slice can load one file at a time

A single input file means only one slice is ingesting data

Instead of 100MB/s, you’re only getting 6.25MB/s

Loading – Use multiple input files to maximize throughput

Page 45: Benefícios e melhores práticas no uso do Amazon Redshift

Use the COPY command

You need at least as many input files as you have slices

With 16 input files, all slices are working so you maximize throughput

Get 100MB/s per node; scale linearly as you add nodes

Loading – Use multiple input files to maximize throughput
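The two throughput figures follow from simple proportions. The sketch below assumes a node with 16 slices (the slides mention 16 input files, and 100 / 16 = 6.25) and that each busy slice contributes an equal share of the node's 100 MB/s; the function name is hypothetical.

```python
def load_throughput_mb_s(input_files, slices, node_throughput_mb_s=100.0):
    """Estimate COPY throughput for one node: each slice loads one file at a time."""
    busy_slices = min(input_files, slices)
    return node_throughput_mb_s * busy_slices / slices


print(load_throughput_mb_s(1, 16))    # single input file: one slice working
print(load_throughput_mb_s(16, 16))   # one file per slice: all slices working
```

Supplying more files than slices does not raise the estimate further in this simple model, which is why the guidance is phrased as "at least as many input files as you have slices".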

Page 46: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Confidential – Slides not intended for redistribution.

• Start your first migration in 10 minutes or less

• Keep your applications running during the migration

• Move data to the same engine or to a different one

AWS Database Migration Service

Page 47: Benefícios e melhores práticas no uso do Amazon Redshift

AWS DMS - Keep your apps running during the migration

• Start a replication instance
• Connect to source and target databases
• Select tables, schemas, or databases
• Let AWS Database Migration Service create tables, load data, and keep them in sync
• Switch applications over to the target at your convenience

[Diagram labels: Customer Premises; Application Users; Internet; VPN; AWS; AWS Database Migration Service]

Page 48: Benefícios e melhores práticas no uso do Amazon Redshift

Loading Data from an Amazon DynamoDB Table

Differences between Amazon DynamoDB and Amazon Redshift

Table names
• Amazon DynamoDB: up to 255 characters; may contain '.' (dot) and '-' (dash) characters; case-sensitive
• Amazon Redshift: limited to 127 characters; can't contain dots or dashes; are NOT case-sensitive; can't conflict with any Amazon Redshift reserved words

NULL
• Amazon DynamoDB: does not support the SQL concept of NULL
• Amazon Redshift: must specify how Amazon Redshift interprets empty or blank attribute values in Amazon DynamoDB

Data types
• STRING and NUMBER data types supported
• BINARY and SET data types not supported

Page 49: Benefícios e melhores práticas no uso do Amazon Redshift

Loading Data from Amazon Elastic MapReduce

• Load data from Amazon EMR in parallel using COPY
• Specify the Amazon EMR cluster ID and HDFS file path/name
• Amazon EMR must be running until COPY completes

copy sales from 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';

Page 50: Benefícios e melhores práticas no uso do Amazon Redshift

Loading Data using Lambda

The AWS Lambda-based Amazon Redshift loader lets you drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.

Blog post: https://blogs.aws.amazon.com/bigdata/post/Tx24VJ6XF1JVJAA/A-Zero-Administration-Amazon-Redshift-Database-Loader
GitHub: http://github.com/awslabs/aws-lambda-redshift-loader

Page 51: Benefícios e melhores práticas no uso do Amazon Redshift

Remote Loading using SSH

The Redshift COPY command can reach out to remote hosts (EC2 and on-premises) and load data over Secure Shell (SSH).

To load remotely, follow this process:
• Add the cluster's public key to the remote host's authorized keys file
• Configure the remote host to accept connections from all cluster IP addresses
• Create a manifest file in JSON format and upload it to an S3 bucket
• Issue a COPY command, including a reference to the manifest file
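The manifest and COPY steps might look like the sketch below. The host address, remote command, user name, bucket, and table are all hypothetical placeholders; the manifest keys (`entries`, `endpoint`, `command`, `mandatory`, `username`) follow the SSH manifest layout described in the Redshift COPY documentation, which should be checked before relying on it.

```python
import json


def build_ssh_manifest(hosts):
    """Build the JSON manifest that tells COPY which hosts to reach and what to run."""
    entries = [
        {
            "endpoint": host["endpoint"],   # host name or IP address of the remote host
            "command": host["command"],     # command whose output is loaded
            "mandatory": True,              # fail the COPY if this host cannot be reached
            "username": host["username"],   # account whose authorized keys hold the cluster key
        }
        for host in hosts
    ]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical remote host and command, for illustration only.
manifest = build_ssh_manifest([
    {"endpoint": "ec2-host.example.com",
     "command": "cat /data/sales.txt",
     "username": "ec2-user"},
])
print(manifest)

# COPY statement that references the uploaded manifest (bucket name is a placeholder).
copy_sql = (
    "copy sales from 's3://mybucket/ssh_manifest' "
    "credentials 'aws_access_key_id=<access-key-id>;"
    "aws_secret_access_key=<secret-access-key>' "
    "delimiter '|' ssh;"
)
print(copy_sql)
```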

Page 52: Benefícios e melhores práticas no uso do Amazon Redshift

Vacuuming Tables

1 | RFK | 900 Columbus | MOROCCO | MOROCCO | AFRICA | 25-989-741-2988 | BUILDING
2 | JFK | 800 Washington | JORDAN | JORDAN | MIDDLE EAST | 23-768-687-3665 | AUTOMOBILE
3 | LBJ | 700 Foxborough | ARGENTINA | ARGENTINA | AMERICA | 11-719-748-3364 | AUTOMOBILE
4 | GWB | 600 Kansas | EGYPT | EGYPT | MIDDLE EAST | 14-128-190-5944 | MACHINERY

Column 0: 1, 2, 3, 4
Column 1: RFK, JFK, LBJ, GWB
Column 2: 900 Columbus, 800 Washington, 700 Foxborough, 600 Kansas

Amazon Redshift serializes all of the values of a column together.

Page 53: Benefícios e melhores práticas no uso do Amazon Redshift

Vacuuming Tables

1 | RFK | 900 Columbus | MOROCCO | MOROCCO | AFRICA | 25-989-741-2988 | BUILDING
2 | JFK | 800 Washington | JORDAN | JORDAN | MIDDLE EAST | 23-768-687-3665 | AUTOMOBILE
3 | LBJ | 700 Foxborough | ARGENTINA | ARGENTINA | AMERICA | 11-719-748-3364 | AUTOMOBILE
4 | GWB | 600 Kansas | EGYPT | EGYPT | MIDDLE EAST | 14-128-190-5944 | MACHINERY

delete from customer where column_0 = 3;

Column 0: 1, 2, [3], 4
Column 1: RFK, JFK, [LBJ], GWB
Column 2: 900 Columbus, 800 Washington, [700 Foxborough], 600 Kansas

Values in brackets belong to the deleted row; they are crossed out on the slide but still stored.

Page 54: Benefícios e melhores práticas no uso do Amazon Redshift

Vacuuming Tables

VACUUM Customer;

Before VACUUM (row 3 marked as deleted):
1,2,[3],4 | RFK,JFK,[LBJ],GWB | 900 Columbus, 800 Washington, [700 Foxborough], 600 Kansas

After VACUUM:
1,2,4 | RFK,JFK,GWB | 900 Columbus, 800 Washington, 600 Kansas

Redshift does not automatically reclaim and reuse the space that is freed when you delete or update rows. The VACUUM command reclaims that space, which improves performance and increases available storage.
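The delete-then-vacuum behavior can be mimicked with a toy column store. This is a conceptual sketch only, with a hypothetical `ColumnTable` class; it does not reflect how Redshift lays out blocks internally. Deleting marks row positions, and only the vacuum step rewrites the columns without them.

```python
class ColumnTable:
    """Toy columnar table: each column is a list, and deletes only mark rows."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}
        self.deleted = set()  # row positions marked as deleted

    def delete_where(self, column, value):
        for pos, stored in enumerate(self.columns[column]):
            if stored == value:
                self.deleted.add(pos)

    def stored_rows(self):
        """Rows physically stored, including those marked as deleted."""
        return len(next(iter(self.columns.values())))

    def vacuum(self):
        """Rewrite every column without the deleted positions, reclaiming space."""
        self.columns = {
            name: [v for pos, v in enumerate(values) if pos not in self.deleted]
            for name, values in self.columns.items()
        }
        self.deleted.clear()


customer = ColumnTable({
    "column_0": [1, 2, 3, 4],
    "column_1": ["RFK", "JFK", "LBJ", "GWB"],
    "column_2": ["900 Columbus", "800 Washington", "700 Foxborough", "600 Kansas"],
})

customer.delete_where("column_0", 3)
rows_before_vacuum = customer.stored_rows()   # space is not reclaimed yet
customer.vacuum()
rows_after_vacuum = customer.stored_rows()
print(rows_before_vacuum, "->", rows_after_vacuum)
```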

Page 55: Benefícios e melhores práticas no uso do Amazon Redshift

Analyzing Tables

ANALYZE command scope:
• The entire current database
• A single table
• One or more specific columns in a single table

The ANALYZE command obtains a sample of rows, does some calculations, and saves the resulting column statistics.

You do not need to analyze all columns in all tables regularly. Analyze the columns that are frequently used in the following:
• Sorting and grouping operations
• Joins
• Query predicates

To maintain current statistics for tables:
• Run the ANALYZE command before running queries
• Run the ANALYZE command against the database routinely at the end of every regular load or update cycle
• Run the ANALYZE command against any new tables

Statistics

Page 56: Benefícios e melhores práticas no uso do Amazon Redshift

Managing Concurrent Write Operations

Concurrent COPY/INSERT/DELETE/UPDATE operations into the same table (CUSTOMER)

Session A, Transaction 1:
copy customer from 's3://mydata/client.txt'…;

Session B, Transaction 2:
copy customer from 's3://mydata/client.txt'…;

• Transaction 1 puts the write lock on the CUSTOMER table
• Transaction 2 waits until transaction 1 releases the write lock

Page 57: Benefícios e melhores práticas no uso do Amazon Redshift

Managing Concurrent Write Operations

Concurrent COPY/INSERT/DELETE/UPDATE operations into the same table (CUSTOMER)

Session A, Transaction 1:
Begin;
Delete one row from CUSTOMERS;
Copy…;
Select count(*) from CUSTOMERS;
End;

Session B, Transaction 2:
Begin;
Delete one row from CUSTOMERS;
Copy…;
Select count(*) from CUSTOMERS;
End;

• Transaction 1 puts the write lock on the CUSTOMER table
• Transaction 2 waits until transaction 1 releases the write lock

Page 58: Benefícios e melhores práticas no uso do Amazon Redshift

Managing Concurrent Write Operations

Potential deadlock situation for concurrent write transactions

Session A, Transaction 1:
Begin;
Delete 3000 rows from CUSTOMERS;
Copy…;
Delete 5000 rows from PARTS;
End;

Session B, Transaction 2:
Begin;
Delete 3000 rows from PARTS;
Copy…;
Delete 5000 rows from CUSTOMERS;
End;

• Transaction 1 puts the write lock on the CUSTOMER table, then waits for PARTS
• Transaction 2 puts the write lock on the PARTS table, then waits for CUSTOMER
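The deadlock on this slide comes from the two transactions locking the same tables in opposite order. A small check like the one below can flag such pairs before jobs are scheduled. It is a planning aid sketched under the assumption that each transaction's write order is known in advance; the function name is hypothetical, and a common mitigation is to have concurrent jobs update tables in the same order.

```python
from itertools import combinations


def conflicting_lock_order(tx_a, tx_b):
    """Return table pairs that the two transactions write-lock in opposite order."""
    pos_a = {table: i for i, table in enumerate(tx_a)}
    pos_b = {table: i for i, table in enumerate(tx_b)}
    shared = [table for table in tx_a if table in pos_b]
    return [
        (x, y)
        for x, y in combinations(shared, 2)
        if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    ]


# Write order taken from the slide's two transactions.
transaction_1 = ["CUSTOMER", "PARTS"]
transaction_2 = ["PARTS", "CUSTOMER"]

print(conflicting_lock_order(transaction_1, transaction_2))            # opposite order
print(conflicting_lock_order(transaction_1, ["CUSTOMER", "PARTS"]))    # same order
```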

Page 59: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Redshift Security

Petabyte-Scale Data Warehousing Service

Amazon Redshift Table and Schema Design

Page 60: Benefícios e melhores práticas no uso do Amazon Redshift

Goals of distribution

• Distribute data evenly for parallel processing
• Minimize data movement
• Co-located joins
• Localized aggregations

Distribution styles (each illustrated on Node 1 with Slices 1-2 and Node 2 with Slices 3-4):
• Key: same key to same location
• All: full table data on first slice of every node
• Even: round-robin distribution
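The three styles can be imitated on a toy cluster of two nodes with two slices each. This is a sketch, not Redshift's placement logic: `zlib.crc32` stands in for whatever hash the service actually uses, slices are numbered from 0 rather than 1, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict

NODES = 2
SLICES_PER_NODE = 2
SLICES = NODES * SLICES_PER_NODE


def distribute_key(rows, key):
    """KEY: rows with the same key value always land on the same slice."""
    placement = defaultdict(list)
    for row in rows:
        slice_id = zlib.crc32(str(row[key]).encode("utf-8")) % SLICES
        placement[slice_id].append(row)
    return placement


def distribute_even(rows):
    """EVEN: rows are dealt out round-robin, regardless of their values."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        placement[i % SLICES].append(row)
    return placement


def distribute_all(rows):
    """ALL: a full copy of the table on the first slice of every node."""
    return {node * SLICES_PER_NODE: list(rows) for node in range(NODES)}


rows = [{"user_id": 1234}, {"user_id": 2345}, {"user_id": 4312}, {"user_id": 1234}]
by_key = distribute_key(rows, "user_id")
by_even = distribute_even(rows)
by_all = distribute_all(rows)

for name, placement in [("KEY", by_key), ("EVEN", by_even), ("ALL", by_all)]:
    print(name, {s: len(r) for s, r in sorted(placement.items())})
```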

Page 61: Benefícios e melhores práticas no uso do Amazon Redshift

Choosing a distribution style

Key
• Large FACT tables
• Rapidly changing tables used in joins
• Localize columns used within aggregations

All
• Have slowly changing data
• Reasonable size (i.e., few millions but not 100s of millions of rows)
• No common distribution key for frequent joins
• Typical use case: joined dimension table without a common distribution key

Even
• Tables not frequently joined or aggregated
• Large tables without acceptable candidate keys

Page 62: Benefícios e melhores práticas no uso do Amazon Redshift

Data Distribution and Distribution Keys

[Diagram: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4)]

Sample records:
• cloudfronturi = /games/g1.exe, user_id = 1234, …
• cloudfronturi = /imgs/ad1.png, user_id = 2345, …
• cloudfronturi = /games/g10.exe, user_id = 4312, …
• cloudfronturi = /img/ad_5.img, user_id = 1234, …

Records per slice: 2M, 5M, 1M, 4M

Poor key choices lead to uneven distribution of records…

Page 63: Benefícios e melhores práticas no uso do Amazon Redshift

Data Distribution and Distribution Keys

[Diagram: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4)]

Sample records:
• cloudfronturi = /games/g1.exe, user_id = 1234, …
• cloudfronturi = /imgs/ad1.png, user_id = 2345, …
• cloudfronturi = /games/g10.exe, user_id = 4312, …
• cloudfronturi = /img/ad_5.img, user_id = 1234, …

Records per slice: 2M, 5M, 1M, 4M

Unevenly distributed data causes processing imbalances!

Page 64: Benefícios e melhores práticas no uso do Amazon Redshift

Data Distribution and Distribution Keys

[Diagram: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4)]

Sample records:
• cloudfronturi = /games/g1.exe, user_id = 1234, …
• cloudfronturi = /imgs/ad1.png, user_id = 2345, …
• cloudfronturi = /games/g10.exe, user_id = 4312, …
• cloudfronturi = /img/ad_5.img, user_id = 1234, …

Records per slice: 2M, 2M, 2M, 2M

Evenly distributed data improves query performance
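One simple way to quantify the imbalance shown on these slides is the ratio of the busiest slice to the average slice. The metric and function name are illustrative assumptions, not something the deck defines; the intuition is that a parallel step finishes only when the most loaded slice finishes.

```python
def skew_ratio(rows_per_slice):
    """Largest slice divided by the average slice; 1.0 means perfectly even."""
    average = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / average


# Record counts per slice, taken from the slides.
uneven = [2_000_000, 5_000_000, 1_000_000, 4_000_000]
even = [2_000_000, 2_000_000, 2_000_000, 2_000_000]

print(f"uneven distribution: skew {skew_ratio(uneven):.2f}")
print(f"even distribution:   skew {skew_ratio(even):.2f}")
```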

Page 65: Benefícios e melhores práticas no uso do Amazon Redshift

Single Column

Compound

Interleaved

Sort Keys

Page 66: Benefícios e melhores práticas no uso do Amazon Redshift

Goals of sorting

• Physically sort data within blocks and throughout a table
• Optimal SORTKEY is dependent on:
  • Query patterns
  • Data profile
  • Business requirements

Page 67: Benefícios e melhores práticas no uso do Amazon Redshift

Choosing a SORTKEY

COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data

INTERLEAVED
• Edge cases
• Large tables (>billion rows)
• No common filter criteria
• Non time-series data

• Primarily as a query predicate (date, identifier, …)
• Optionally, choose a column frequently used for aggregates
• Optionally, choose same as distribution key column for most efficient joins (merge join)

Page 68: Benefícios e melhores práticas no uso do Amazon Redshift

Table is sorted by 1 column
[ SORTKEY ( date ) ]

Best for:
• Queries that use 1st column (i.e. date) as primary filter
• Can speed up joins and group bys
• Quickest to VACUUM

Date Region Country

2-JUN-2015 Oceania New Zealand

2-JUN-2015 Asia Singapore

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Hong Kong

3-JUN-2015 Europe Germany

3-JUN-2015 Asia Korea

Sort Keys – Single Column

Page 69: Benefícios e melhores práticas no uso do Amazon Redshift

• Table is sorted by 1st column, then 2nd column, etc.
[ COMPOUND SORTKEY ( date, region, country ) ]

• Best for:
  • Queries that use 1st column as primary filter, then other cols
  • Can speed up joins and group bys
  • Slower to VACUUM

Date Region Country

2-JUN-2015 Oceania New Zealand

2-JUN-2015 Asia Singapore

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Hong Kong

3-JUN-2015 Europe Germany

3-JUN-2015 Asia Korea

Sort Keys – Compound

Page 70: Benefícios e melhores práticas no uso do Amazon Redshift

• Equal weight is given to each column.
[ INTERLEAVED SORTKEY ( date, region, country ) ]

• Best for:
  • Queries that use different columns in filter
  • Queries get faster the more columns used in the filter (up to 8)
  • Slowest to VACUUM

Date Region Country

2-JUN-2015 Oceania New Zealand

2-JUN-2015 Asia Singapore

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Hong Kong

3-JUN-2015 Europe Germany

3-JUN-2015 Asia Korea

Sort Keys – Interleaved
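The difference between filtering on the leading sort column and on a later one can be seen by sorting the sample rows the way a compound key would. This sketch only illustrates row ordering; it does not model interleaved keys or Redshift's block layout. Dates are rewritten in ISO form so they sort correctly as strings, and the helper name `is_contiguous` is hypothetical.

```python
rows = [
    ("2015-06-02", "Oceania", "New Zealand"),
    ("2015-06-02", "Asia", "Singapore"),
    ("2015-06-02", "Africa", "Zaire"),
    ("2015-06-02", "Asia", "Hong Kong"),
    ("2015-06-03", "Europe", "Germany"),
    ("2015-06-03", "Asia", "Korea"),
]

# Compound ordering: by date, then region, then country.
compound = sorted(rows)


def is_contiguous(sorted_rows, predicate):
    """True if all matching rows sit next to each other in the sorted order."""
    flags = [predicate(row) for row in sorted_rows]
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return all(flags[first:last + 1])


date_filter_contiguous = is_contiguous(compound, lambda r: r[0] == "2015-06-02")
region_filter_contiguous = is_contiguous(compound, lambda r: r[1] == "Asia")

print("filter on date (leading column):", date_filter_contiguous)
print("filter on region only:", region_filter_contiguous)
```

Rows matching the leading column form one run that min/max metadata can isolate, while rows matching only the second column are scattered across the table, which is the case the interleaved slide targets.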

Page 71: Benefícios e melhores práticas no uso do Amazon Redshift

Compressing data

• COPY automatically analyzes and compresses data when loading into empty tables

• ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column

• Changing column encoding requires a table rebuild

Page 72: Benefícios e melhores práticas no uso do Amazon Redshift

Automatic compression is a good thing (mostly)

• From the zone maps we know:
  • Which blocks contain the range
  • Which row offsets to scan
• Highly compressed SORTKEYs:
  • Many rows per block
  • Large row offset

Skip compression on just the leading column of the compound SORTKEY

Page 73: Benefícios e melhores práticas no uso do Amazon Redshift

During queries and ingestion, the system allocates buffers based on column width

Wider than needed columns mean memory is wasted

Fewer rows fit into memory; increased likelihood of queries spilling to disk

DDL – Make Columns as narrow as possible

Page 74: Benefícios e melhores práticas no uso do Amazon Redshift

Amazon Redshift Security

Petabyte-Scale Data Warehousing Service

Amazon Redshift Operations

Page 75: Benefícios e melhores práticas no uso do Amazon Redshift

Backup and Restore

Backups to Amazon S3 are automatic, continuous & incremental

Configurable system snapshot retention period

User-defined snapshots are on-demand

Streaming restores enable you to resume querying faster

[Diagram: three Redshift cluster nodes, each with 128 GB RAM, 16 TB disk, and 16 cores, backing up to Amazon S3]

Page 76: Benefícios e melhores práticas no uso do Amazon Redshift

Snapshots

Automated Snapshots
• Redshift enables automated snapshots by default
• Snapshots are incremental
• Redshift retains the incremental backup data required to restore a cluster from a manual or automated snapshot
• Only for clusters that haven't reached the snapshot retention period
• Amazon Redshift provides free storage for snapshots that is equal to the storage capacity of your cluster

Manual Snapshots
• The system will never delete a manual snapshot
• You can authorize access to an existing manual snapshot for as many as 20 AWS customer accounts
• Shared access between accounts means you can move data between dev/test/prod without reloading

Page 77: Benefícios e melhores práticas no uso do Amazon Redshift

Cross-Region Snapshots

Copy cluster snapshots to another region using the management console

Can’t modify destination region without disabling cross-region snapshots and re-enabling with a new destination region and retention period

Page 78: Benefícios e melhores práticas no uso do Amazon Redshift

Streaming Restore

During restore, your node is provisioned in less than two minutes

Query provisioned node immediately as data is automatically streamed from the S3 snapshot

Page 79: Benefícios e melhores práticas no uso do Amazon Redshift

― Cluster is put into read-only mode
― New cluster is provisioned according to resizing needs
― Node-to-node parallel data copy
― Only charged for source cluster

Resize – Phase 1

Page 80: Benefícios e melhores práticas no uso do Amazon Redshift

Resize - Phase 2

― Automatic SQL endpoint switchover via DNS
― Decommission the source cluster

Page 81: Benefícios e melhores práticas no uso do Amazon Redshift

Resizing a Redshift Data Warehouse via the AWS Console

Page 82: Benefícios e melhores práticas no uso do Amazon Redshift

Resizing a Redshift Data Warehouse via the AWS Console

http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro

Page 83: Benefícios e melhores práticas no uso do Amazon Redshift

Resizing without Production Impact

When you resize, your cluster is read-only
• Query cluster throughout resizing
• There will be a short outage at the end of cluster resizing

Page 84: Benefícios e melhores práticas no uso do Amazon Redshift

Parameter Groups

Allow you to set a variety of configuration options:

• date style, search path and query timeout
• user activity logging
• whether the leader node should accept only SSL protected connections

Apply to every database in a cluster

View events using the console, the SDK, or the CLI/API

Use SET to change some properties temporarily, for example:

SET wlm_query_slot_count TO 3;

Some changes to the parameter group may require a restart to take effect

Default Parameter Values

http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties
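To illustrate changing a parameter group through the API rather than with SET, here is a minimal sketch that turns on the SSL-only option mentioned above. It assumes a boto3 Redshift client and a custom (non-default) parameter group; the helper name `require_ssl` for the function is hypothetical, while `require_ssl` as a parameter name is the Redshift setting.

```python
def require_ssl(client, parameter_group_name, enabled=True):
    """Set the require_ssl parameter on a cluster parameter group.

    `client` is assumed to be a boto3 Redshift client. Clusters using the
    group may need a reboot before the change takes effect.
    """
    parameters = [
        {
            "ParameterName": "require_ssl",
            "ParameterValue": "true" if enabled else "false",
        }
    ]
    client.modify_cluster_parameter_group(
        ParameterGroupName=parameter_group_name,
        Parameters=parameters,
    )
    return parameters
```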

Page 85: Benefícios e melhores práticas no uso do Amazon Redshift

Events

Event information is available for a period of several weeks

Source types:
• Cluster
• Parameter Group
• Security Group
• Snapshot

Categories:
• Configuration
• Management
• Monitoring
• Security

Subscribe to events to receive SNS notifications

http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages
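A subscription to cluster events could be created along these lines. The sketch assumes a boto3 Redshift client and an existing SNS topic; the helper name `subscribe_to_cluster_events`, the subscription name, and the topic ARN are placeholders.

```python
def subscribe_to_cluster_events(client, name, sns_topic_arn,
                                categories=("monitoring", "security")):
    """Create an SNS event subscription for cluster-level events.

    `client` is assumed to be a boto3 Redshift client; `sns_topic_arn`
    must point at a topic that already exists.
    """
    request = {
        "SubscriptionName": name,
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "cluster",
        "EventCategories": list(categories),
        "Enabled": True,
    }
    client.create_event_subscription(**request)
    return request
```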

Page 86: Benefícios e melhores práticas no uso do Amazon Redshift

Resource Tagging

Define custom tags for your cluster

• provide metadata about a resource
• categorize billing reports for cost allocation
• activate tags in the billing and cost management service

A cluster resize, or a restore from a snapshot in the original region, preserves tags; tags are not preserved when restoring from a cross-region snapshot

http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
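Tagging a cluster through the API might look like the following sketch. It assumes a boto3 Redshift client; the helper name `tag_cluster`, the region, account ID, and tag values are placeholders.

```python
def tag_cluster(client, region, account_id, cluster_id, tags):
    """Attach key/value tags to a cluster, addressed by its ARN.

    `client` is assumed to be a boto3 Redshift client; `tags` is a plain
    dict such as {"env": "prod", "cost-center": "analytics"}.
    """
    arn = f"arn:aws:redshift:{region}:{account_id}:cluster:{cluster_id}"
    tag_list = [{"Key": key, "Value": value} for key, value in tags.items()]
    client.create_tags(ResourceName=arn, Tags=tag_list)
    return arn
```

Tags only show up in cost allocation reports after they have been activated in the billing console, as noted above.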

Page 87: Benefícios e melhores práticas no uso do Amazon Redshift

Concurrent Query Execution

Long-running queries may cause short-running queries to wait in a queue

• As a result users may experience unpredictable performance

By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max)

• 5 slots means 5 queries can execute concurrently while additional queries will queue until additional resources free up

[Diagram: running queries in the default queue]
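The queueing effect can be illustrated with a toy model. This is not how Redshift schedules queries internally; it is a simplified sketch that assumes all queries are submitted at time zero and are served first-in, first-out by a fixed number of slots. The function name `queue_waits` is hypothetical.

```python
import heapq


def queue_waits(durations, slots=5):
    """Return how long each query waits before it starts running.

    Queries are taken in order; each one grabs the slot that frees up
    earliest. Durations and waits are in the same (arbitrary) time unit.
    """
    free_at = [0.0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        waits.append(start)
        heapq.heappush(free_at, start + duration)
    return waits


# Five long queries fill every slot, so a short sixth query must wait.
print(queue_waits([100, 100, 100, 100, 100, 1]))
# [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
```

In this model the one-unit query waits 100 units behind the long ones, which is the unpredictable performance the slide describes.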

Page 88: Benefícios e melhores práticas no uso do Amazon Redshift

Workload Management

Workload management is about creating queues for different workloads

User Group A

Short-running queue

Long-running queue

Short Query Group

Long Query Group

Page 89: Benefícios e melhores práticas no uso do Amazon Redshift

Workload Management

Page 90: Benefícios e melhores práticas no uso do Amazon Redshift

Workload Management

Don’t set concurrency to more than you need

set query_group to allqueries;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;
reset query_group;
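Queues such as the short-running and long-running ones above are defined through the `wlm_json_configuration` parameter of a parameter group. The sketch below builds such a value under stated assumptions: the query group names `short` and `long` and the slot counts are hypothetical, and the total-slot check simply reuses the 50-slot maximum quoted earlier.

```python
import json


def build_wlm_config(short_slots, long_slots, default_slots=5):
    """Build a JSON string for the wlm_json_configuration parameter.

    Two queues are routed by query group; the last entry, which has no
    routing rule, acts as the default queue.
    """
    queues = [
        {"query_group": ["short"], "query_concurrency": short_slots},
        {"query_group": ["long"], "query_concurrency": long_slots},
        {"query_concurrency": default_slots},  # default queue goes last
    ]
    total = sum(queue["query_concurrency"] for queue in queues)
    if total > 50:
        raise ValueError("total concurrency across queues exceeds 50 slots")
    return json.dumps(queues)


print(build_wlm_config(short_slots=10, long_slots=2))
```

A session would then run `set query_group to short;` before its statements to be routed to the first queue, mirroring the `set query_group` example above.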

Page 91: Benefícios e melhores práticas no uso do Amazon Redshift

Redshift Utils

Script: Purpose

top_queries.sql: Get the top 50 most time-consuming statements in the last 7 days.

perf_alerts.sql: Get the top occurrences of alerts; join with table scans.

filter_used.sql: Get the filters applied to tables on scans, to help in choosing a sort key.

commit_stats.sql: Get information on consumption of cluster resources through COMMIT statements.

current_session_info.sql: Get information about sessions with currently running queries.

missing_table_stats.sql: Get EXPLAIN plans that flagged missing statistics on underlying tables.

queuing_queries.sql: Get queries that are waiting on a WLM query slot.

table_info.sql: Get table storage information (size, skew, etc.).

• Redshift Admin Scripts provide diagnostic information for tuning and troubleshooting

• https://github.com/awslabs/amazon-redshift-utils

Page 92: Benefícios e melhores práticas no uso do Amazon Redshift

Redshift Utils

View: Purpose

v_check_data_distribution.sql: Get data distribution across slices.
v_constraint_dependency.sql: Get the foreign key constraints between tables.
v_generate_group_ddl.sql: Get the DDL for a group.
v_generate_schema_ddl.sql: Get the DDL for schemas.
v_generate_tbl_ddl.sql: Get the DDL for a table. This will contain the distkey, sortkey, and constraints.
v_generate_unload_copy_cmd.sql: Generate unload and copy commands for an object.
v_generate_user_object_permissions.sql: Get the DDL for a user's permissions to tables and views.
v_generate_view_ddl.sql: Get the DDL for a view.
v_get_obj_priv_by_user.sql: Get the tables/views that a user has access to.
v_get_schema_priv_by_user.sql: Get the schemas that a user has access to.
v_get_tbl_priv_by_user.sql: Get the tables that a user has access to.
v_get_users_in_group.sql: Get all users in a group.
v_get_view_priv_by_user.sql: Get the views that a user has access to.
v_object_dependency.sql: Merge the different dependency views.
v_space_used_per_tbl.sql: Get the space used per table.
v_view_dependency.sql: Get the names of the views that are dependent on other tables/views.

• Redshift Admin Views provide information about user and group access, various table constraints, object and view dependencies, data distribution across slices, and space used per table

Page 93: Benefícios e melhores práticas no uso do Amazon Redshift

Open source tools

https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs

Admin scripts
Collection of utilities for running diagnostics on your cluster

Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc.

ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established schema with data already loaded

Page 94: Benefícios e melhores práticas no uso do Amazon Redshift

Q&A

Page 95: Benefícios e melhores práticas no uso do Amazon Redshift

PlayKids iFood MapLink Apontador Rapiddo Superplayer Cinepapaya ChefTime

Page 96: Benefícios e melhores práticas no uso do Amazon Redshift

“With AWS services we were able to pace our initial investments and project the costs of future expansion”

“Redshift allowed us to turn data into self-service information” - Wanderley Paiva, Database Specialist

Page 97: Benefícios e melhores práticas no uso do Amazon Redshift

The Challenge

• Scalability

• Availability

• Data centralization

• Lower costs, preferably spread over time

Page 98: Benefícios e melhores práticas no uso do Amazon Redshift

Solution

Page 99: Benefícios e melhores práticas no uso do Amazon Redshift

Solution v2