Benefits and Best Practices for Using Amazon Redshift


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Daniel Bento, Solutions Architect, AWS

November 2016 | São Paulo, Brazil

Workshop: Amazon Redshift

Agenda

• Introduction

• Architecture and Use Cases

• Data Loading and Migration

• Table and Schema Design

• Operations

We start with the customer… and innovate

Customers told us… → We created…

• Managing databases is painful & difficult → Amazon RDS
• SQL DBs do not perform well at scale → Amazon DynamoDB
• Hadoop is difficult to deploy and manage → Amazon EMR
• DWs are complex, costly, and slow → Amazon Redshift
• Commercial DBs are punitive & expensive → Amazon Aurora
• Streaming data is difficult to capture & analyze → Amazon Kinesis
• BI tools are expensive and hard to manage → Amazon QuickSight

AWS Big Data Portfolio

Collect: Import/Export, AWS Direct Connect, Amazon Kinesis, Kinesis Firehose (new)
Store: Amazon S3, Amazon Glacier, DynamoDB, RDS/Aurora, Database Migration Service (new)
Analyze: EMR, EC2, Redshift, Machine Learning, Data Pipeline, CloudSearch, Elasticsearch, QuickSight (new), SQL over Streams (new)

Global Footprint

14 Regions; 33 Availability Zones; 54 Edge Locations

Amazon Redshift

Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Free tier – 2 months

A lot faster, a lot simpler, a lot cheaper

The legacy view of data warehousing ...

Multi-year commitment

Multi-year deployments

Multi-million dollar deals

… Leads to dark data

This is a narrow view

Small companies also have big data

(mobile, social, gaming, adtech, IoT)

Long cycles, high costs, administrative complexity all stifle innovation

[Chart: enterprise data vs. data in the warehouse — total enterprise data far outgrows the fraction that ever reaches the warehouse]

The Amazon Redshift view of data warehousing

Enterprise: 10x cheaper • Easy to provision • Higher DBA productivity
Big Data: 10x faster • No programming • Easily leverage BI tools, Hadoop, Machine Learning, Streaming
SaaS: Analysis in-line with process flows • Pay as you go, grow as you need • Managed availability & DR

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

The Forrester Wave™: Enterprise Data Warehouse, Q4 2015

Selected Amazon Redshift Customers

NTT Docomo | Telecom
FINRA | Financial Svcs
Philips | Healthcare
Yelp | Technology
NASDAQ | Financial Svcs
The Weather Company | Media
Nokia | Telecom
Pinterest | Technology
Foursquare | Technology
Coursera | Education
Coinbase | Bitcoin
Amazon | E-Commerce
Etix | Entertainment
Spuul | Entertainment
Vivaki | Ad Tech
Z2 | Gaming
Neustar | Ad Tech
SoundCloud | Technology
BeachMint | E-Commerce
Civis | Technology

Amazon Redshift Architecture

Petabyte-Scale Data Warehousing Service

Leader Node
• Simple SQL endpoint
• Stores metadata
• Optimizes query plan
• Coordinates query execution

Compute Nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS1/DS2: HDD; scale from 2 TB to 2 PB

Clients connect to the leader node via JDBC/ODBC; compute nodes communicate over 10 GigE (HPC) and handle ingestion, backup, and restore.

Benefit #1: Amazon Redshift is fast

Dramatically less I/O:
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

[Diagram: sorted column values packed into large blocks; a zone map records the min/max value of each block so blocks outside a query's range are skipped]

SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'

Benefit #1: Amazon Redshift is fast — Sort Keys and Zone Maps

Unsorted table — block min/max:
• MIN: 01-JUNE-2016 / MAX: 20-JUNE-2016
• MIN: 08-JUNE-2016 / MAX: 30-JUNE-2016
• MIN: 12-JUNE-2016 / MAX: 20-JUNE-2016
• MIN: 02-JUNE-2016 / MAX: 25-JUNE-2016

Sorted by date — block min/max:
• MIN: 01-JUNE-2016 / MAX: 06-JUNE-2016
• MIN: 07-JUNE-2016 / MAX: 12-JUNE-2016
• MIN: 13-JUNE-2016 / MAX: 18-JUNE-2016
• MIN: 19-JUNE-2016 / MAX: 24-JUNE-2016

In the sorted table, the zone maps show that only one block can contain 09-JUNE-2016, so the rest are skipped; in the unsorted table every block's range overlaps the target date and all must be scanned.
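A minimal DDL sketch of declaring that sort order (the LOGS table's other columns are assumed for illustration):

create table logs (
  date       date not null,
  request_id bigint,
  uri        varchar(256)
)
sortkey (date);
-- sorted this way, each block covers a narrow date range, so the
-- zone maps let the COUNT(*) query above skip all but one block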

Benefit #1: Amazon Redshift is fast

Parallel and distributed: query, load, export, backup, restore, resize

Benefit #1: Amazon Redshift is fast — Distribution Keys

ID | Name
1  | John Smith
2  | Jane Jones
3  | Peter Black
4  | Pat Partridge
5  | Sarah Cyan
6  | Brian Snail

Rows are distributed across slices by hashing the distribution key — e.g., rows 1 and 4 on one slice, rows 2 and 5 on another, rows 3 and 6 on a third.

Benefit #1: Amazon Redshift is fast

• H/W optimized for I/O-intensive workloads
• Choice of storage type, instance size
• Regular cadence of auto-patched improvements

Example: our new Dense Storage (HDD) instance type — improved memory 2x, compute 2x, disk throughput 1.5x. Cost: same as our prior generation!

Benefit #2: Amazon Redshift is inexpensive

DS2 (HDD)            | Price per hour (DS2.XL single node) | Effective annual price per TB (compressed)
On-Demand            | $0.850                              | $3,725
1-Year Reservation   | $0.500                              | $2,190
3-Year Reservation   | $0.228                              | $999

DC1 (SSD)            | Price per hour (DC1.L single node)  | Effective annual price per TB (compressed)
On-Demand            | $0.250                              | $13,690
1-Year Reservation   | $0.161                              | $8,795
3-Year Reservation   | $0.100                              | $5,500

Pricing is simple: number of nodes × price/hour; no charge for the leader node; no upfront costs; pay as you go.
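Worked example under the prices above: a 10-node DS2.XL cluster (about 20 TB of compressed capacity) costs 10 × $0.850 = $8.50/hour on-demand; the leader node adds nothing to the bill.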

Amazon Redshift lets you start small and grow big

Dense Storage (DS2.XL): 2 TB HDD, 31 GB RAM, 2 slices/4 cores
• Single node (2 TB)
• Cluster of 2–32 nodes (4 TB – 64 TB)

Dense Storage (DS2.8XL): 16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE
• Cluster of 2–128 nodes (32 TB – 2 PB)

Note: nodes not to scale

Benefit #2: Amazon Redshift is inexpensive

Benefit #3: Amazon Redshift is fully managed

• Continuous/incremental backups
• Multiple copies within cluster
• Continuous and incremental backups to Amazon S3
• Continuous and incremental backups across regions
• Streaming restore

[Diagram: backups flow to Amazon S3 in Region 1 and are copied to Amazon S3 in Region 2]

Benefit #3: Amazon Redshift is fully managed


Fault tolerance

Disk failures

Node failures

Network failures

Availability Zone/Region level disasters

Benefit #4: Security is built-in

• Load encrypted from S3
• SSL to secure data in transit
• Amazon VPC for network isolation
• Encryption to secure data at rest
  • All blocks on disks & in Amazon S3 encrypted
  • Block key, cluster key, master key (AES-256)
  • On-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

[Diagram: clients reach the cluster via JDBC/ODBC through the customer VPC; the cluster itself runs in an internal VPC with a 10 GigE (HPC) interconnect and ingestion/backup/restore paths]

Benefit #5: We innovate quickly

2013
• Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo, Singapore, Sydney) Regions
• MANIFEST option for the COPY & UNLOAD commands
• SQL functions: most recent queries
• Resource-level IAM, CRC32
• Data Pipeline
• Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
• Cross-region snapshot copy
• Audit features, cursor support, 500 concurrent client connections
• EIP address for VPC clusters

2014
• System tables for query tuning
• Dense Compute nodes
• Gzip & Lzop; JSON, regex, cursors
• EMR data loading & bootstrap action with the COPY command; WLM concurrency limit raised to 50; support for ECDH cipher suites for SSL connections; FedRAMP
• Cross-region ingestion
• Free trials & price reductions in Asia Pacific
• CloudWatch alarm for disk usage
• AES 128-bit encryption; UTF-16; KMS integration
• EU (Frankfurt) and GovCloud Regions
• S3 server-side encryption support for UNLOAD
• Tagging support for cost allocation

2015
• New system views to tune table design and track WLM query queues
• Custom ODBC/JDBC drivers; query visualization
• Mobile Analytics auto export
• KMS for GovCloud Region; HIPAA BAA
• Interleaved sort keys
• New Dense Storage nodes (DS2) with better RAM and CPU
• New reserved node payment options: No, Partial & All Upfront
• Cross-region backups for KMS-encrypted clusters
• Scalar UDFs in Python
• AVRO ingestion; Kinesis Firehose; Database Migration Service (DMS)
• Modify cluster dynamically
• Tag-based permissions and BZIP2

2016
• WLM queue hopping for timed-out queries
• Append rows & export to BZIP2
• Lambda for clusters in VPC; data schema conversion support from the ML console
• US West (N. California) Region

100+ new features added since launch; a release every two weeks; automatic patching

Benefit #6: Amazon Redshift is powerful

• Approximate functions
• User-defined functions
• Machine learning (Amazon ML)
• Data science

Benefit #7: Amazon Redshift has a large ecosystem

Data integration | Systems integrators | Business intelligence

Benefit #8: Service oriented architecture

DynamoDB

EMR

S3

EC2/SSH

RDS/Aurora

Amazon Redshift

Amazon Kinesis

MachineLearning

Data Pipeline

CloudSearch

Mobile Analytics

Use cases

Analyzing Twitter Firehose

• Amazon Redshift: starts at $0.25/hour
• EC2: starts at $0.02/hour
• S3: $0.030/GB-mo
• Amazon Glacier: $0.010/GB-mo
• Amazon Kinesis: $0.015/shard-hour (1 MB/s in; 2 MB/s out); $0.028/million PUTs

Analyzing Twitter Firehose

500MM tweets/day = ~ 5,800 tweets/sec

2 KB/tweet is ~12 MB/sec (~1 TB/day)

$0.015/hour per shard, $0.028/million PUTS

Amazon Kinesis cost is $0.765/hour

Amazon Redshift cost is $0.850/hour (for a 2TB node)

S3 cost is $1.28/hour (no compression)

Total: $2.895/hour
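Checking the Kinesis figure from the assumptions above: ~12 MB/s of ingest at 1 MB/s per shard requires 12 shards, so 12 × $0.015 = $0.18/hour; ~5,800 PUTs/sec is ~20.9 million PUTs/hour, so 20.9 × $0.028 ≈ $0.585/hour; together ≈ $0.765/hour.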

Data warehouses can be inexpensive and powerful

Use only the services you need

Scale only the services you need

Pay for what you use

~40% discount with 1 year commitment

~70% discounts with 3 year commitment

Data warehouses can be inexpensive and powerful

Amazon.com – Weblog analysis

Web log analysis for Amazon.com: 1 PB+ workload, 2 TB/day, growing 67% YoY; largest table: 400 TB

Want to understand customer behavior

Query 15 months of data (1PB) in 14 minutes

Load 5B rows in 10 minutes

21B rows joined with 10B rows – 3 days (Hive) to 2 hours

Load pipeline: 90 hours (Oracle) to 8 hours

64 clusters

800 total nodes

13PB provisioned storage

2 DBAs

Data warehouses can be fast and simple

Sushiro – Real-time streaming from IoT & analysis

Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift

380 stores stream live data from Sushi plates

Inventory information is combined with consumption information in near real time

Forecast demand by store, minimize food waste, and improve efficiencies


Big data does not mean batch

Can be streamed in

Can be processed in near real time

Can be used to respond quickly to requests

You can mix and match

On premises and cloud

Custom development and managed services

Infrastructure with managed scaling, security

Data warehouses can support real-time data

Amazon Redshift Data Loading and Migration

Petabyte-Scale Data Warehousing Service

Data Loading Process

Data source → extraction → transformation → loading → Amazon Redshift (target)

This section focuses on the loading step.

Amazon Redshift Loading Data Overview

From the corporate data center, logs/files and source DBs reach AWS over:
• VPN connection or AWS Direct Connect
• S3 multipart upload
• AWS Import/Export
• SSH from EC2 or on-premises hosts
• AWS Database Migration Service

Within the AWS cloud, data flows into Amazon Redshift from:
• Amazon S3 and Amazon DynamoDB
• Amazon EMR and Amazon RDS
• Kinesis, AWS Lambda, and AWS Data Pipeline
• Amazon Glacier for archival

Uploading Files to Amazon S3

• Ensure that your data resides in the same region as your Redshift cluster
• Split the data into multiple files (e.g., client.txt.1 … client.txt.4) to facilitate parallel processing
• Compress files individually using GZIP or LZOP
• Optionally, encrypt your data using Amazon S3 server-side or client-side encryption
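A sketch of the split-compress-upload step with standard shell tools and the AWS CLI (the million-line chunk size is arbitrary; file and bucket names follow the example above):

split -d -l 1000000 client.txt client.txt.   # writes client.txt.00, client.txt.01, …
gzip client.txt.[0-9]*                       # compress each part individually
aws s3 cp . s3://mydata/ --recursive --exclude '*' --include 'client.txt.*.gz'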

Loading Data from Amazon S3

• Preparing input data files
• Uploading files to Amazon S3
• Using COPY to load data from Amazon S3

Splitting Data Files

With two XL compute nodes (two slices each), splitting the input into client.txt.1 … client.txt.4 lets every slice load one file in parallel:

copy customer
from 's3://mydata/client.txt'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
delimiter '|';

Loading – Use multiple input files to maximize throughput

• Use the COPY command
• Each slice can load one file at a time
• A single input file means only one slice is ingesting data: instead of 100 MB/s, you're only getting 6.25 MB/s
• You need at least as many input files as you have slices
• With 16 input files, all slices are working, so you maximize throughput
• Get 100 MB/s per node; scale linearly as you add nodes
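When the files to load don't share a convenient prefix, COPY can take an explicit manifest instead; a sketch, assuming the gzipped parts from the earlier example sit in the mydata bucket:

{"entries": [
  {"url": "s3://mydata/client.txt.1.gz", "mandatory": true},
  {"url": "s3://mydata/client.txt.2.gz", "mandatory": true},
  {"url": "s3://mydata/client.txt.3.gz", "mandatory": true},
  {"url": "s3://mydata/client.txt.4.gz", "mandatory": true}
]}

copy customer
from 's3://mydata/client.manifest'
credentials 'aws_access_key_id=<your-access-key>;aws_secret_access_key=<your-secret-key>'
manifest
gzip
delimiter '|';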


• Start your first migration in 10 minutes or less
• Keep your applications running during the migration
• Move data to the same database engine or a different one

AWS Database Migration Service

[Diagram: a source database on customer premises replicates through AWS Database Migration Service over the Internet or a VPN, while application users stay connected]

1. Start a replication instance
2. Connect to source and target databases
3. Select tables, schemas, or databases
4. Let AWS Database Migration Service create tables, load data, and keep them in sync
5. Switch applications over to the target at your convenience

AWS DMS - Keep your apps running during the migration


Loading Data from an Amazon DynamoDB Table

Table names
• Amazon DynamoDB: up to 255 characters; may contain '.' (dot) and '-' (dash); case-sensitive
• Amazon Redshift: limited to 127 characters; can't contain dots or dashes; not case-sensitive; can't conflict with any Amazon Redshift reserved words

NULL
• Amazon DynamoDB: does not support the SQL concept of NULL
• Amazon Redshift: you must specify how Redshift interprets empty or blank attribute values in Amazon DynamoDB

Data types
• STRING and NUMBER data types are supported
• BINARY and SET data types are not supported
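The load itself is a single COPY; a sketch, where the "Customer" table name and the READRATIO value are illustrative:

copy customer
from 'dynamodb://Customer'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
readratio 50;   -- consume at most 50% of the table's provisioned read throughput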

Loading Data from Amazon Elastic MapReduce

Load data from Amazon EMR in parallel using COPY:
• Specify the Amazon EMR cluster ID and the HDFS file path/name
• Amazon EMR must be running until COPY completes

copy sales
from 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';

Loading Data Using Lambda

The AWS Lambda-based Amazon Redshift loader lets you drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.

Blog post: https://blogs.aws.amazon.com/bigdata/post/Tx24VJ6XF1JVJAA/A-Zero-Administration-Amazon-Redshift-Database-Loader
GitHub: http://github.com/awslabs/aws-lambda-redshift-loader

Remote Loading Using SSH

The Redshift COPY command can reach out to remote hosts (EC2 or on-premises) and load the output of a command run over SSH.

To remote load, follow this process:
• Add the cluster's public key to the remote host's authorized keys file
• Configure the remote host to accept connections from all cluster IP addresses
• Create a manifest file in JSON format and upload it to an S3 bucket
• Issue a COPY command that references the manifest file
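A sketch of the two artifacts (host, user, and file path are hypothetical). The manifest, uploaded to, e.g., s3://mydata/ssh.manifest:

{"entries": [
  {"endpoint": "ec2-12-34-56-78.compute-1.amazonaws.com",
   "command": "cat /data/client.txt",
   "mandatory": true,
   "username": "ec2-user"}
]}

And the COPY that runs it:

copy customer
from 's3://mydata/ssh.manifest'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
ssh
delimiter '|';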

Vacuuming Tables

Sample CUSTOMER rows:
1 RFK 900 Columbus MOROCCO MOROCCO AFRICA 25-989-741-2988 BUILDING
2 JFK 800 Washington JORDAN JORDAN MIDDLE EAST 23-768-687-3665 AUTOMOBILE
3 LBJ 700 Foxborough ARGENTINA ARGENTINA AMERICA 11-719-748-3364 AUTOMOBILE
4 GWB 600 Kansas EGYPT EGYPT MIDDLE EAST 14-128-190-5944 MACHINERY

Amazon Redshift serializes all of the values of a column together:
Column 0: 1,2,3,4
Column 1: RFK,JFK,LBJ,GWB
Column 2: 900 Columbus, 800 Washington, 700 Foxborough, 600 Kansas

delete from customer where column_0 = 3;
-- the row is only marked deleted; its values still occupy space in every column block

vacuum customer;
-- blocks are rewritten without the deleted row:
-- Column 0: 1,2,4
-- Column 1: RFK,JFK,GWB
-- Column 2: 900 Columbus, 800 Washington, 600 Kansas

Redshift does not automatically reclaim and reuse the space freed when you delete or update rows. The VACUUM command reclaims that space, which improves performance and increases available storage.
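The main VACUUM modes, sketched against the example table:

vacuum full customer;        -- reclaim deleted space and re-sort (the default)
vacuum delete only customer; -- reclaim space without re-sorting
vacuum sort only customer;   -- re-sort without reclaiming space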

Analyzing Tables

The ANALYZE command can target:
• the entire current database
• a single table
• one or more specific columns in a single table

ANALYZE obtains a sample of rows, does some calculations, and saves the resulting column statistics.

You do not need to analyze all columns in all tables regularly. Analyze the columns frequently used in:
• sorting and grouping operations
• joins
• query predicates

To maintain current statistics for tables:
• run ANALYZE before running queries
• run ANALYZE against the database routinely at the end of every regular load or update cycle
• run ANALYZE against any new tables
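The three scopes, sketched against the listing table from the compression example:

analyze;                              -- the entire current database
analyze listing;                      -- a single table
analyze listing (listid, sellerid);   -- specific columns of one table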

Managing Concurrent Write Operations

Concurrent COPY/INSERT/DELETE/UPDATE operations against the same table:

Session A, Transaction 1:
copy customer from 's3://mydata/client.txt' …;

Session B, Transaction 2:
copy customer from 's3://mydata/client.txt' …;

Transaction 1 takes the write lock on the CUSTOMER table; Transaction 2 waits until Transaction 1 releases it.

Managing Concurrent Write Operations

The same serialization applies to multi-statement transactions. Sessions A and B each run:

Begin;
Delete one row from CUSTOMERS;
Copy …;
Select count(*) from CUSTOMERS;
End;

Whichever transaction takes the write lock on the CUSTOMER table first proceeds; the other waits until the lock is released.

Managing Concurrent Write Operations

Potential deadlock situation for concurrent write transactions:

Session A, Transaction 1:
Begin;
Delete 3000 rows from CUSTOMERS;
Copy …;
Delete 5000 rows from PARTS;
End;

Session B, Transaction 2:
Begin;
Delete 3000 rows from PARTS;
Copy …;
Delete 5000 rows from CUSTOMERS;
End;

Transaction 1 takes the write lock on CUSTOMER while Transaction 2 takes the write lock on PARTS; each then waits for the other's lock. To avoid this, have all transactions that touch multiple tables acquire their locks in the same table order.
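One way to enforce a consistent order is to take the locks explicitly; a sketch using the two tables from the example (the predicates and column names are hypothetical):

begin;
lock customer;                            -- every writer locks CUSTOMER first…
lock parts;                               -- …then PARTS, so no circular wait can form
delete from customer where custid < 3000;
delete from parts where partid < 5000;
end;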

Amazon Redshift Table and Schema Design

Petabyte-Scale Data Warehousing Service

Goals of distribution

• Distribute data evenly for parallel processing
• Minimize data movement
  • Co-located joins
  • Localized aggregations

Distribution styles:
• Key — same key to same location
• All — full table data on the first slice of every node
• Even — round-robin distribution

Choosing a distribution style

Key
• Large fact tables
• Rapidly changing tables used in joins
• Localize columns used within aggregations

All
• Slowly changing data
• Reasonable size (i.e., a few million rows but not hundreds of millions)
• No common distribution key for frequent joins
• Typical use case: a joined dimension table without a common distribution key

Even
• Tables not frequently joined or aggregated
• Large tables without acceptable candidate keys
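A DDL sketch of the three styles (table and column names are hypothetical):

create table sales (              -- large fact table: distribute on the join key
  saleid integer not null,
  listid integer not null,
  qtysold smallint
) distkey (listid);

create table calendar (           -- small, slowly changing dimension: copy to every node
  dateid smallint not null,
  caldate date
) diststyle all;

create table staging_events (     -- no acceptable candidate key: spread evenly
  rawline varchar(max)
) diststyle even;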

Data Distribution and Distribution Keys

Example: CloudFront log records (uri = /games/g1.exe, user_id=1234; uri = /imgs/ad1.png, user_id=2345; …) hashed on a poor key land unevenly across four slices: 2M, 5M, 1M, and 4M records.

• Poor key choices lead to uneven distribution of records
• Unevenly distributed data causes processing imbalances
• Evenly distributed data (2M records on every slice) improves query performance

Sort Keys

Three flavors: single column, compound, interleaved.

Goals of sorting:
• Physically sort data within blocks and throughout a table
• The optimal SORTKEY depends on query patterns, data profile, and business requirements

Choosing a SORTKEY:
• COMPOUND — most common; well-defined filter criteria; time-series data
• INTERLEAVED — edge cases; large tables (> 1 billion rows); no common filter criteria; non-time-series data
• Choose primarily a query-predicate column (date, identifier, …)
• Optionally, choose a column frequently used for aggregates
• Optionally, choose the same column as the distribution key for the most efficient joins (merge join)

Sort Keys – Single Column

Table is sorted by 1 column: [ SORTKEY ( date ) ]

Best for:
• Queries that use the 1st column (i.e., date) as the primary filter
• Can speed up joins and group bys
• Quickest to VACUUM

Date | Region | Country
2-JUN-2015 | Oceania | New Zealand
2-JUN-2015 | Asia | Singapore
2-JUN-2015 | Africa | Zaire
2-JUN-2015 | Asia | Hong Kong
3-JUN-2015 | Europe | Germany
3-JUN-2015 | Asia | Korea

Sort Keys – Compound

Table is sorted by the 1st column, then the 2nd column, etc.: [ SORTKEY COMPOUND ( date, region, country ) ]

Best for:
• Queries that use the 1st column as the primary filter, then other columns
• Can speed up joins and group bys
• Slower to VACUUM

(Same sample table as above.)

Sort Keys – Interleaved

Equal weight is given to each column: [ SORTKEY INTERLEAVED ( date, region, country ) ]

Best for:
• Queries that use different columns in the filter
• Queries get faster the more columns are used in the filter (up to 8)
• Slowest to VACUUM

(Same sample table as above.)
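How the bracketed clauses above look in full DDL (the events table is assumed for illustration):

create table events_compound (
  date date,
  region varchar(32),
  country varchar(64)
) compound sortkey (date, region, country);

create table events_interleaved (
  date date,
  region varchar(32),
  country varchar(64)
) interleaved sortkey (date, region, country);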

Compressing data

• COPY automatically analyzes and compresses data when loading into empty tables

• ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column

• Changing column encoding requires a table rebuild

Automatic compression is a good thing (mostly)

• From the zone maps we know which blocks contain the range and which row offsets to scan
• Highly compressed SORTKEYs mean many rows per block and large row offsets
• Skip compression on just the leading column of the compound SORTKEY
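A sketch of the resulting workflow, reusing the listing example — check what the analyzer proposes, then rebuild with explicit encodings, leaving the leading sort key uncompressed:

analyze compression listing;            -- proposes a per-column encoding

create table listing_new (
  listtime timestamp encode raw,        -- leading sort key column: skip compression
  listid integer encode delta,
  sellerid integer encode delta32k
) sortkey (listtime);
-- then repopulate with INSERT INTO listing_new SELECT … and swap the table names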

DDL – Make columns as narrow as possible

• During queries and ingestion, the system allocates buffers based on column width
• Wider-than-needed columns waste memory
• Fewer rows fit into memory, increasing the likelihood of queries spilling to disk

Amazon Redshift Operations

Petabyte-Scale Data Warehousing Service

Backup and Restore

Backups to Amazon S3 are automatic, continuous & incremental

Configurable system snapshot retention period

User-defined snapshots are on-demand

Streaming restores enable you to resume querying faster

[Diagram: a cluster of Redshift nodes (128 GB RAM, 16 TB disk, 16 cores each) writing snapshots to Amazon S3]

Automated Snapshots

• Redshift enables automated snapshots by default
• Snapshots are incremental
• Redshift retains the incremental backup data required to restore a cluster from a manual or automated snapshot; automated snapshots are kept only until the retention period expires
• Amazon Redshift provides free snapshot storage equal to the storage capacity of your cluster

Manual Snapshots

• The system will never delete a manual snapshot
• You can authorize access to an existing manual snapshot for as many as 20 AWS customer accounts
• Shared access between accounts means you can move data between dev/test/prod without reloading

Cross-Region Snapshots

• Copy cluster snapshots to another region using the management console
• To change the destination region, you must disable cross-region snapshots and re-enable them with a new destination region and retention period

Streaming Restore

• During a restore, your node is provisioned in less than two minutes
• You can query the provisioned node immediately while data streams in automatically from the S3 snapshot

Resize – Phase 1
• Cluster is put into read-only mode
• New cluster is provisioned according to resizing needs
• Node-to-node parallel data copy
• Only charged for the source cluster

Resize – Phase 2
• Automatic SQL endpoint switchover via DNS
• Decommission the source cluster

Resizing a Redshift Data Warehouse via the AWS Console

http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#cluster-resize-intro

Resizing Without Production Impact

While you resize, your cluster is read-only:
• You can query the cluster throughout the resize
• There will be a short outage at the end of the resize

Parameter Groups

Allow you to set a variety of configuration options:
• date style, search path, and query timeout
• user activity logging
• whether the leader node should accept only SSL-protected connections

Parameter groups apply to every database in a cluster. Some changes to a parameter group require a restart to take effect.

View events using the console, the SDK, or the CLI/API.

Use SET to change some properties temporarily, for example:

set wlm_query_slot_count to 3;

Default Parameter Values

http://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html#wlm-dynamic-and-static-properties

Events

Event information is available for a period of several weeks.

Source types: cluster, parameter group, security group, snapshot
Categories: configuration, management, monitoring, security

Subscribe to events to receive SNS notifications.

http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages

Resource Tagging

Define custom tags for your cluster to:
• provide metadata about a resource
• categorize billing reports for cost allocation
• activate tags in the billing and cost management service

Tags are preserved through a cluster resize or a restore from a snapshot in the original region, but not through cross-region snapshots.

http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html

Concurrent Query Execution

Long-running queries may cause short-running queries to wait in a queue, so users can experience unpredictable performance.

By default, a Redshift cluster is configured with one queue with 5 execution slots (50 max across all queues): 5 queries can execute concurrently, while additional queries queue until resources free up.

Workload Management

Workload management is about creating queues for different workloads — for example, routing User Group A and the Short Query Group to a short-running queue and the Long Query Group to a long-running queue.
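A sketch of what such queues look like in the wlm_json_configuration parameter (group names, concurrency values, and ordering are illustrative; the last entry is the default queue):

[
  {"query_group": ["short"], "user_group": ["usergroupa"], "query_concurrency": 10},
  {"query_group": ["long"],  "query_concurrency": 3},
  {"query_concurrency": 5}
]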

Workload Management

Don't set concurrency to more than you need. Queries can be routed to a queue by setting a query group:

set query_group to 'allqueries';
select avg(l.priceperticket * s.qtysold)
from listing l, sales s
where l.listid < 40000;
reset query_group;

Redshift Utils

• Redshift Admin Scripts provide diagnostic information for tuning and troubleshooting
• https://github.com/awslabs/amazon-redshift-utils

Script | Purpose
top_queries.sql | Get the top 50 most time-consuming statements in the last 7 days.
perf_alerts.sql | Get the top occurrences of alerts; join with table scans.
filter_used.sql | Get the filters applied to tables on scans, to aid in choosing a sortkey.
commit_stats.sql | Get information on the consumption of cluster resources by COMMIT statements.
current_session_info.sql | Get information about sessions with currently running queries.
missing_table_stats.sql | Get EXPLAIN plans that flagged missing statistics on underlying tables.
queuing_queries.sql | Get queries that are waiting on a WLM query slot.
table_info.sql | Get table storage information (size, skew, etc.).

Redshift Utils

• Redshift Admin Views provide information about user and group access, table constraints, object and view dependencies, data distribution across slices, and space used per table

View | Purpose
v_check_data_distribution.sql | Get data distribution across slices.
v_constraint_dependency.sql | Get the foreign key constraints between tables.
v_generate_group_ddl.sql | Get the DDL for a group.
v_generate_schema_ddl.sql | Get the DDL for schemas.
v_generate_tbl_ddl.sql | Get the DDL for a table, including the distkey, sortkey, and constraints.
v_generate_unload_copy_cmd.sql | Generate unload and copy commands for an object.
v_generate_user_object_permissions.sql | Get the DDL for a user's permissions to tables and views.
v_generate_view_ddl.sql | Get the DDL for a view.
v_get_obj_priv_by_user.sql | Get the tables/views that a user has access to.
v_get_schema_priv_by_user.sql | Get the schemas that a user has access to.
v_get_tbl_priv_by_user.sql | Get the tables that a user has access to.
v_get_users_in_group.sql | Get all users in a group.
v_get_view_priv_by_user.sql | Get the views that a user has access to.
v_object_dependency.sql | Merge the different dependency views.
v_space_used_per_tbl.sql | Get space used per table.
v_view_dependency.sql | Get the names of the views that are dependent on other tables/views.

Open Source Tools

https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs

Admin scripts — collection of utilities for running diagnostics on your cluster
Admin views — collection of utilities for managing your cluster, generating schema DDL, etc.
ColumnEncodingUtility — applies optimal column encoding to an established schema with data already loaded

Q&A

PlayKids iFood MapLink Apontador Rapiddo Superplayer Cinepapaya ChefTime

"With AWS services, we were able to pace our initial investments and forecast the costs of future expansions"

"Redshift allowed us to turn data into self-service information" – Wanderley Paiva, Database Specialist

The Challenge

• Scalability
• Availability
• Data centralization
• Reduced and, preferably, spread-out costs

Solution

Solution v2
