InterCon 2016 - SLA vs. Agility: Using Microservices and Cloud Monitoring

Upload: imasters

Post on 07-Jan-2017

TRANSCRIPT

Page 1: InterCon 2016 - SLA vs. Agility: Using Microservices and Cloud Monitoring

October 2016

First 90

SLA vs. Agile

Microservices and cloud monitoring

Page 2

Why this talk?

Page 3

This is our vision: Building the foundation to build a 3B company by FY20

Agenda

1. “Old World”: MercadoLivre’s original architecture

2. “Ground Zero”: shifting to microservices on the cloud

3. Monitoring the cloud

4. Alarms: when things go south

5. “Fury”: streamlining DevOps at MercadoLivre

Page 4

In numbers

400+ deploys/day on 650+ apps

1,000+ developers in 8 development centers

10+ programming languages

Page 5

In numbers

25,000,000+ requests per minute

22,000+ VMs in 7 data centers

700+ DBs on 4 different engines

Page 6

Old World

Page 7

Old world architecture

[Diagram: User, ml.jar, Huge DB]

Page 8

Old world properties

● Monolithic

● Highly coupled code

● Unified SVN repository

● Single DB

● Simple infrastructure with little overhead

● Single QA team

● Closed system

Page 9

Deployments as ML grew

Anyone, anytime

Page 10

Deployments as ML grew

Anyone, anytime

Some people, anytime

Page 11

Deployments as ML grew

Anyone, anytime

Some people, anytime

Some people, once a week

Page 12

Deployments as ML grew

Anyone, anytime

Some people, anytime

Some people, once a week

Only all the experts together, at 3 AM, on Thursdays not covered by any “freeze”

Page 13

Ground Zero

Page 14

Shifting to microservices

[Diagram labels: Frontend, Frontend, CRM, Mobile apps, 3rd party devs; API, API, API]

Page 15

Ground zero properties

● Multiple technologies and frameworks (dev’s choice)

● Completely decoupled code in multiple Github repositories

● One DB for each app, multiple engines

● Complex infrastructure with possible high overhead

● QA, testing and continuous integration are done by each team

● Independent deployments, environments and policies

● Open platform

Page 16

“With great power comes great responsibility”.

Stan Lee

Page 17

Developer responsibilities

● Developer gets ownership of the entire dev cycle

● Massive empowerment of dev team -> OWNERSHIP

Manage resources

● VMs: create & scale computing pools for dev/test/prod

● DBs and services: choose the support systems required and create them

● Networking: create rules and load balancers to route traffic to the application

Develop

● Code: choose your technology and keep your Github repository

● Test: create tests, regressions or CI as needed

● Deploy: write all routines for automatically deploying your app on any VM

Ensure quality

● Define uptime: define what “up” means for your own app (health.sh)

● Measure: create metrics to analyze performance and downtime

● React: react to critical events that affect your app
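
The slides do not show what a team's health.sh contains. As a rough sketch of the idea only (written in Python rather than shell, with hypothetical check names), "up" can be modeled as a set of named checks that must all pass, mapped to a shell-style exit code:

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: each check returns True when that part of the app looks healthy.
Check = Callable[[], bool]


def evaluate_health(checks: Dict[str, Check]) -> Tuple[bool, Dict[str, bool]]:
    """Run every check; the app counts as "up" only if all of them pass."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that crashes is treated as a failed check.
            results[name] = False
    return all(results.values()), results


def exit_code(checks: Dict[str, Check]) -> int:
    """Map health to a shell-style exit code: 0 means up, 1 means down."""
    up, _ = evaluate_health(checks)
    return 0 if up else 1


# Illustrative checks only; a real team would probe its own dependencies.
example_checks = {
    "db_reachable": lambda: True,
    "cache_reachable": lambda: True,
}
print(exit_code(example_checks))
```

Each team would swap in the checks that matter for its own app, which is the point of letting the owner define uptime.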

Page 18

DevTools in ML

Developer

Melicloud API

- Create apps

- Manage pools (test/prod)

- Manage VMs & loadbalancers

- Build & deploy

- Create queues

- Create DBaaS or KVSaaS

- Create caches

Github repo

- Code app

- Write test & deploy strategy

- Write uptime definitions

Nginx

- Write rules to route traffic to your pools

eventRouting & OpsGenie

- Write rules to manage alarms

- Define alarm escalation policies & schedules

- Manage contact channels

Page 19

Microservices in ML

Page 20

Mobile apps

[Diagram: Main app (automated build & store deployment) and three modules, each with its own test app, CI, repo and team]

Page 21

Monitoring mobile apps

[Diagram: Main app, three modules, crash reporting, three teams]

Page 22

Monitoring the cloud

Page 23

New Relic

● Default monitoring in the VMs’ golden image

● No configuration necessary (initially)

HTTP errors

● Unhandled errors: unexpected issues appear in production

● Unsupported params: see if other devs/clients misuse your entry params

● Other services: detect down services affecting you

Stack traces

● Fast debugging: see what’s going on in production

● Unified pool data: all instances’ traces in the same place

Performance metrics

● Transaction traces: see what’s taking so long

● Recognize deviations: graphs show whether traffic or response time varies with respect to another period

● Apdex Score
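
The slide only names the Apdex score. As a reminder of how the standard formula works (not New Relic's internal code), here is a small sketch: samples at or below a threshold T count as satisfied, samples up to 4T count as tolerating at half weight, and the rest count as frustrated. The sample values and the threshold are made up for illustration.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("need at least one sample")
    # Satisfied: at or below the threshold T.
    satisfied = sum(1 for t in response_times if t <= threshold)
    # Tolerating: above T but no more than 4T.
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Illustrative samples in seconds, with T = 0.5 s:
# 2 satisfied, 2 tolerating, 1 frustrated.
score = apdex([0.2, 0.4, 0.9, 1.5, 3.0], threshold=0.5)
print(round(score, 2))
```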

Page 24
Page 25

Datadog

● Easy to use for different frameworks

● Good for business-specific metrics

Custom metrics

● Complex metrics: you can send multiple parameters for events

● Online filtering: graphs filtered with different dimensions

● Alarms: flexible alarms based on custom metrics

Infra monitoring

● Full info: more data than NR on disk, memory, network

● Scalable: handles aggregating information from many different VMs well

Real time analysis

● Fast response: almost no latency

● Dashboards: customizable dashboards to show what’s most relevant for each app
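
To make "custom metrics with dimensions" concrete, here is a minimal sketch that builds a DogStatsD-style datagram (`name:value|type|#tag:value,...`) and can send it over UDP to a local agent. The metric name, the tags, and the assumption that an agent listens on 127.0.0.1:8125 are illustrative; the talk does not show how MercadoLivre actually emits metrics.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build a DogStatsD-style datagram: name:value|type|#key:value,..."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Tags are the "dimensions" that graphs can later be filtered by.
        tag_text = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram over UDP to a locally running agent (assumed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical business metric: one order, tagged by site and payment type.
datagram = format_metric(
    "checkout.orders", 1, "c", {"site": "MLB", "payment": "card"}
)
print(datagram)
```

Calling `send_metric(datagram)` would hand the counter to the agent; it is left out of the example so the sketch runs without one.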

Page 26
Page 27

Log collection

● Logs are collected by an agent on all VMs

● They are sent to Elasticsearch

● Access is via a Kibana frontend

● Developers can use a special syntax to create queryable dimensions for all logged events

● All instances’ logs in the same place

● Request tracing through multiple applications/APIs (request_id)
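
The talk does not show the "special syntax" used for queryable dimensions. As a stand-in, this sketch assumes JSON log lines: each event carries a `request_id` plus free-form dimensions, so the same id can be searched across every application a request passes through. The class and function names are hypothetical.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with queryable fields."""

    def format(self, record):
        # Start from the caller's dimensions, then set the fixed fields
        # so they cannot be overwritten by a dimension of the same name.
        payload = dict(getattr(record, "dimensions", {}))
        payload.update(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                "request_id": getattr(record, "request_id", None),
            }
        )
        return json.dumps(payload, sort_keys=True)


def new_request_id():
    """Create an id at the edge; downstream services reuse it unchanged."""
    return uuid.uuid4().hex


def log_event(logger, message, request_id, **dimensions):
    """Log one event with its request id and extra dimensions."""
    logger.info(
        message, extra={"request_id": request_id, "dimensions": dimensions}
    )


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-api")  # hypothetical app name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

log_event(logger, "order created", new_request_id(), app="orders-api", site="MLB")
```

Passing the id along in a request header would let a single Kibana query on `request_id` show the full path of one request.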

Page 28
Page 29

Alarms

Page 30

Unified handling of events

health.sh

Code triggered alarms

eventRouting

Page 31

Event routing

● Rules added by each team

● Check alarm origin, type and importance

● Check “quiet hours”

● Assign escalation policy and forward to OpsGenie
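
The routing steps above can be sketched as a small rule matcher. Everything here is an assumption made for illustration: the rule fields, the importance levels, the default quiet window, and the choice that critical alarms bypass quiet hours. Forwarding to OpsGenie is represented only by returning the name of the escalation policy.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Optional

LEVELS = {"low": 0, "high": 1, "critical": 2}  # hypothetical importance scale


@dataclass(frozen=True)
class Alarm:
    origin: str      # e.g. "health.sh" or "code"
    kind: str        # alarm type
    importance: str  # one of LEVELS
    app: str


@dataclass(frozen=True)
class Rule:
    app: str
    min_importance: str
    escalation_policy: str
    origin: Optional[str] = None     # None matches any origin
    quiet_start: time = time(22, 0)  # assumed quiet window
    quiet_end: time = time(7, 0)


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def route(alarm: Alarm, rules: List[Rule], now: time) -> Optional[str]:
    """Return the escalation policy to forward to, or None to stay silent."""
    for rule in rules:
        if rule.app != alarm.app:
            continue
        if rule.origin is not None and rule.origin != alarm.origin:
            continue
        if LEVELS[alarm.importance] < LEVELS[rule.min_importance]:
            continue
        quiet = in_quiet_hours(now, rule.quiet_start, rule.quiet_end)
        if quiet and alarm.importance != "critical":
            return None  # assumed: only critical alarms page during quiet hours
        return rule.escalation_policy
    return None


rules = [Rule(app="orders-api", min_importance="high", escalation_policy="orders-oncall")]
alarm = Alarm(origin="health.sh", kind="down", importance="critical", app="orders-api")
print(route(alarm, rules, time(3, 0)))
```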

Page 32

OpsGenie

● Manage teams to deal with escalation policies

● Set “on call” schedules (w/ substitutes & manager escalation)

● Everyone manages their own contact methods (SMS, mail, phone call, app)

Page 33

Fury

Page 34

Evolution

Old world -> Ground zero -> Fury

Page 35

Fury: DevOps to NoOps

● Still microservices

● Fully service-oriented

● Easier dev cycle and learning curve

● Pre-assembled flavors for popular frameworks

● Fewer bash scripts, more UI-based configuration

● Auto-scaling & auto-healing

● Docker-based (smaller dev/prod environment gap)

● Designed to run on AWS

● Continuous integration already included

Page 36

Fury dashboard

Page 37

Dev Cycle in Fury: create app

● Creates repository

● Creates Jenkins CI server

● Creates network infra

Page 38

Dev Cycle in Fury: create scope

● Creates load balancer (ELB)

● Creates auto scaling group (ASG) for scope instances

● Creates instances

● Initializes logs & metrics services

● Downloads containers to instances

● Starts traffic

Page 39

Dev Cycle in Fury: deploy

● Creates ASG for new version

● Creates instances for new ASG

● Initializes logs & metrics services

● Downloads containers to instances

● Progressive traffic switch

● If candidate is OK, destroys previous infrastructure
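
The last two deploy steps can be sketched as a loop over traffic percentages. This is a simulation of the idea, not Fury's code: the step sizes, the `candidate_is_ok` callback, and the rollback on a failed check are all assumptions (the slide only says what happens when the candidate is OK).

```python
from typing import Callable, List, Tuple


def progressive_switch(
    steps: List[int],
    candidate_is_ok: Callable[[int], bool],
) -> Tuple[str, int]:
    """Shift traffic to the candidate step by step.

    Returns ("promoted", 100) when every step passes its check, meaning
    the previous infrastructure can be destroyed, or ("rolled_back", 0)
    as soon as one check fails (assumed behavior).
    """
    if not steps or steps[-1] != 100:
        raise ValueError("the last step must send 100% of traffic")
    for percent in steps:
        # In a real system, load balancer weights would change here.
        if not candidate_is_ok(percent):
            return ("rolled_back", 0)
    return ("promoted", 100)


# Hypothetical rollout: 10%, then 50%, then all traffic.
print(progressive_switch([10, 50, 100], lambda percent: True))
```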

Page 40
Page 41

?

Page 42

Thank you!