Data-Driven Threat Intelligence: Metrics on Indicator Dissemination and Sharing

Alexandre Sieira, Alex Pinto

Upload: alexandre-sieira

Post on 09-Jan-2017


Agenda
- Cyber War... Threat Intel: What is it good for?
- Combine and TIQ-test
- Measuring indicators
- Threat Intelligence Sharing
- Future research direction (i.e. will work for data)

HT to @RCISCwendy

During this presentation I'll give a quick introduction to some important concepts and to our views on what threat intelligence is and the many useful things that can and cannot be done with it.

Then, Mr. Pinto will present a couple of open source tools made available at the MLSec Project GitHub repository and an analysis of the metrics we gathered from a set of publicly available and private threat intelligence feeds.

Presentation Metrics!!
- 50-ish Slides
- 3 Key Takeaways
- 2 Heartfelt and genuine defenses of Threat Intelligence Providers
- 1 Prediction on The Future of Threat Intelligence Sharing


What is TI good for? (1) Attribution

Based on what you read in the media and in some online marketing material, attribution is one of the sexiest parts of threat intelligence.

If you understand your adversaries, their intent and capabilities, there is a lot you can do within your risk management and even business decisions to better prepare yourself.

However, if you want to be realistic, you'll have to admit that, as in the case demonstrated here, attribution is really hard to do.

This is reflected in the cost of getting good CTI data with attribution.

Also, it is not reasonable to expect that any significant proportion of attacks will be attributed at all, given the sheer number of smaller threat actors out there.

What is TI good for anyway?

TY to @bfist for his work on http://sony.attributed.to

This is beautifully illustrated by the Sony breach controversy.

So data science to the rescue! The good-humored sony.attributed.to website creates plausible attribution reports for the Sony hack.

It is based on data sampled from actual attack data in the VERIS and DBIR databases, consistent with how frequently threat actors, locations and methods are actually used.

What is TI good for? (2) Cyber Maps!!

TY to @hrbrmstr for his work on https://github.com/hrbrmstr/pewpew

One other really common application of threat intelligence is building a threat map.

After all, how else will upper management know that your team really has what it takes to prevent pandas, bears and even maybe capybaras from infiltrating your networks?

Who knows what those samba-dancing Brazilian hackers are up to, after all?

Fear not, now there's an open source threat map that you can use, called Pew pew.

And you don't even need to stand up your own honeypots. Pew pew displays plausible attack patterns that are sampled randomly from public data made available by Arbor Networks.

So it is exactly as useful as the real thing.

What is TI good for anyway? (3) How about actual defense?
- Strategic and tactical: planning
- Technical indicators: DFIR and monitoring

But seriously, what about using the threat intelligence data to actually defend organizations?

The high level data is awesome to help with the high level decisions, of course.

However, how do you go about using the technical indicators?

It's straightforward enough to pull a list of IP addresses, domain names and URLs into SIEMs or IDS rules.

As Gavin Reid mentioned in his talk yesterday, this can be a great way to reduce the time it takes to detect a novel threat and bypass the detection, rule writing and policy update cycle of your sensors.

However, this has to be done with great care.

Affirming the Consequent Fallacy

If A, then B.
B.
Therefore, A.

Evil malware talks to 8.8.8.8.
I see traffic to 8.8.8.8.
ZOMG, APT!!!

Threat intelligence feeds are mostly providing indicators of things that malware does: for example IP addresses, domain names and URLs they communicate with.

Knowing this can be invaluable for all sorts of investigation and forensics activities.

But we really believe the holy grail is detection.

A breach detection example of the fallacy: the fact that some malware was observed talking to a destination does NOT mean that all communication towards that destination is indicative of malware.

Also, you need to evaluate the quality and the applicability of the indicators you are consuming (and perhaps paying for) to decide which combination of sources is optimal for your organization.

Now Mr. Pinto will tell you a bit about two open source projects we develop to help you perform such an evaluation.

But this is a Data-Driven talk!

Combine and TIQ-Test

Combine (https://github.com/mlsecproject/combine)
- Gathers TI data (ip/host) from Internet and local files
- Normalizes the data and enriches it (AS / Geo / pDNS)
- Can export to CSV, tiq-test format and CRITs
- Coming Soon: CybOX / STIX / SILK / ArcSight CEF

TIQ-Test (https://github.com/mlsecproject/tiq-test)
- Runs statistical summaries and tests on TI feeds
- Generates charts based on the tests and summaries
- Written in R (because you should learn a stat language)

https://github.com/mlsecproject/tiq-test-Summer2015

2014 was not a leap year

150k outbound
300k inbound

450k X 365 -> 164,250,000 / 165 MILLION

Using TIQ-TEST - Feeds Selected
- Dataset was separated into inbound and outbound

TY to @kafeine and John Bambenek for access to their feeds

28 feeds

Using TIQ-TEST - Data Prep
- Extract the raw information from indicator feeds
- Both IP addresses and hostnames were extracted

Using TIQ-TEST - Data Prep

Convert the hostname data to IP addresses:
- Active IP addresses for the respective date ("A" query)
- Passive DNS from Farsight Security (DNSDB)

For each IP record (including the ones from hostnames):
- Add asnumber and asname (from MaxMind ASN DB)
- Add country (from MaxMind GeoLite DB)
- Add rhost (again from DNSDB): most popular "PTR"

- For the hostname / domain feeds:
  - We extracted the "active" IP addresses for those hostnames on the dates they were reported (using pDNS from Farsight Security)
  - Passive DNS query of active "A" responses on the reported day (from 00:00 to 23:59)
- For this experiment, we got rid of "non-public IPs" (localhost, RFC1918)
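The "non-public IP" filter mentioned in these notes can be sketched with Python's standard `ipaddress` module. This is only an illustration, not the code used by Combine or TIQ-Test; the function names and the sample list are hypothetical. Note that `is_private` also covers a few other special-purpose ranges beyond RFC1918, so it is slightly broader than what the notes describe.

```python
import ipaddress


def is_public_ipv4(address: str) -> bool:
    """Return False for loopback, private (RFC1918 and similar) or malformed addresses."""
    try:
        ip = ipaddress.IPv4Address(address)
    except ipaddress.AddressValueError:
        return False
    # is_private is broader than RFC1918: it also flags other special-purpose ranges.
    return not (ip.is_loopback or ip.is_private)


def keep_public(addresses):
    """Keep only the addresses that look publicly routable."""
    return [a for a in addresses if is_public_ipv4(a)]


# Hypothetical sample of extracted indicators.
sample = ["10.1.2.3", "127.0.0.1", "192.168.0.10", "8.8.8.8", "172.16.5.4", "not-an-ip"]
public_only = keep_public(sample)
print(public_only)
```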

- Then for each IP record (including the ones obtained above), enrich it with:
  - asnumber and asname (from MaxMind ASN DB)
  - country (from MaxMind GeoLite DB)
  - rhost: the most popular reverse DNS entry ("PTR") from passive DNS on that date

Using TIQ-TEST Data Prep Done

- rhost: we are not playing around with this in this talk, the data is too sparse

Novelty Test - Measuring added and dropped indicators

Novelty Test - Inbound

NOVELTY:

- Always request a trial of the data feed (15/30 days)
- Measure addition and churn
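Measuring addition and churn during a trial can be sketched as set differences between consecutive daily snapshots of a feed. This is a minimal Python illustration under assumed inputs (a dict mapping each day to the set of indicators seen that day); it is not the TIQ-Test implementation, which is written in R, and the names and sample data are hypothetical.

```python
from datetime import date


def novelty(snapshots):
    """Compare each daily snapshot with the previous one.

    snapshots: dict mapping a date to the set of indicators in the feed that day.
    Returns one dict per day (from the second day on) with added/dropped counts
    and ratios (added over today's size, dropped over yesterday's size).
    """
    days = sorted(snapshots)
    report = []
    for prev, curr in zip(days, days[1:]):
        added = snapshots[curr] - snapshots[prev]
        dropped = snapshots[prev] - snapshots[curr]
        report.append({
            "day": curr,
            "added": len(added),
            "dropped": len(dropped),
            "added_ratio": len(added) / len(snapshots[curr]) if snapshots[curr] else 0.0,
            "dropped_ratio": len(dropped) / len(snapshots[prev]) if snapshots[prev] else 0.0,
        })
    return report


# Hypothetical three-day trial using documentation-range addresses.
snapshots = {
    date(2015, 3, 1): {"192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"},
    date(2015, 3, 2): {"192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5", "192.0.2.6"},
    date(2015, 3, 3): {"192.0.2.2", "192.0.2.3"},
}
report = novelty(snapshots)
for row in report:
    print(row)
```

A feed that never adds anything is stale, and a feed that never drops anything is probably not being cleaned up, so both ratios are worth watching over the trial period.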

Aging Test - Is anyone cleaning this mess up eventually?

(overlap charts: inbound and outbound)

Population Test
Let us use the ASN and GeoIP databases that we used to enrich our data as a reference of the true population.

But, but, human beings are unpredictable! We will never be able to forecast this!

POPULATION:

We will use the ASN and GEO databases as our population - They should cover all the existing IPv4 space, give or take a few anomalies

With this, we can simulate a draw from this population for the people who are going to be attacking us, because human beings are unpredictable and we will never be able to forecast where the attacks are coming from, right? :troll:
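The comparison described in these notes boils down to putting a feed's share per country (or ASN) next to that group's share of the reference population. The Python sketch below assumes the enriched feed is a list of country labels and that the population shares are already known; the labels "AA", "BB", "CC" and all numbers are invented placeholders, not data from the talk.

```python
from collections import Counter


def proportions(labels):
    """Share of each label in a list, e.g. country codes of enriched indicators."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def compare_to_population(feed_labels, population_share):
    """For each group in the population, report feed share, population share and their ratio."""
    feed_share = proportions(feed_labels)
    comparison = {}
    for label, expected in population_share.items():
        observed = feed_share.get(label, 0.0)
        comparison[label] = {
            "feed": observed,
            "population": expected,
            "ratio": observed / expected,  # > 1 means over-represented in the feed
        }
    return comparison


# Placeholder data: 10 indicators and made-up population shares.
feed_countries = ["AA"] * 6 + ["BB"] * 1 + ["CC"] * 3
population_share = {"AA": 0.5, "BB": 0.25, "CC": 0.25}
comparison = compare_to_population(feed_countries, population_share)
for label, row in comparison.items():
    print(label, row)
```

If attackers really were a random draw from the address space, every ratio would hover around 1; large departures are what the charts on the following slides show.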

-----COUNTRY / ASN

(graphs of country proportions on different feeds)

Inbound Turkey WINS!

US is highly above the population average, and CN is slightly below

Is your sampling poll as random as you think?

----HYPOTHESIS TESTING OF PROPORTIONS(explain exact binom test vs. Chi-squared)

Can we get a better look?
- Statistical inference-based comparison models (hypothesis testing)
- Exact binomial tests (when we have the true pop)
- Chi-squared proportion tests (similar to independence tests)
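An exact binomial test of one proportion against a known population share can be sketched in plain Python as follows. This is an illustration only, not the TIQ-Test code: the two-sided p-value here sums the probability of every outcome that is no more likely than the observed one, which is one common convention, and the example counts are invented.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def exact_binomial_test(successes, trials, p_expected):
    """Two-sided exact binomial p-value for H0: true proportion == p_expected."""
    if not 0.0 < p_expected < 1.0:
        raise ValueError("p_expected must be strictly between 0 and 1")
    observed = binom_logpmf(successes, trials, p_expected)
    tolerance = 1e-7  # guards against floating-point ties
    p_value = 0.0
    for k in range(trials + 1):
        log_prob = binom_logpmf(k, trials, p_expected)
        if log_prob <= observed + tolerance:
            p_value += math.exp(log_prob)
    return min(1.0, p_value)


# Invented example: 300 of 1000 feed indicators fall in a country that holds
# 25% of the reference population.
p_value = exact_binomial_test(300, 1000, 0.25)
print(p_value)
```

A small p-value says the feed's share for that country is unlikely to come from a random draw of the population, which is the kind of difference the speaker notes go on to discuss.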


- Explain the differences
- These differences describe a higher probability of specific actors targeting you from specific locations (GEO/ASN) in relation to a completely random actor.
- This was one of the first features I used for MLSec

Overlap Test - More data can be better, but make sure it is not the same data
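Checking whether two feeds are "the same data" can be sketched as a pairwise overlap matrix. The Python below assumes each feed is a set of indicators and reports, for every ordered pair, the fraction of the first feed that also appears in the second. Feed names and contents are hypothetical, and this is not the TIQ-Test implementation.

```python
def overlap_matrix(feeds):
    """matrix[a][b] = fraction of feed a's indicators that also appear in feed b."""
    matrix = {}
    for name_a, indicators_a in feeds.items():
        matrix[name_a] = {}
        for name_b, indicators_b in feeds.items():
            if name_a == name_b:
                continue
            shared = len(indicators_a & indicators_b)
            matrix[name_a][name_b] = shared / len(indicators_a) if indicators_a else 0.0
    return matrix


# Hypothetical feeds using documentation-range addresses.
feeds = {
    "feed_a": {"192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"},
    "feed_b": {"192.0.2.3", "192.0.2.4", "192.0.2.5"},
    "feed_c": {"198.51.100.9"},
}
matrix = overlap_matrix(feeds)
for name, row in matrix.items():
    print(name, row)
```

The matrix is not symmetric on purpose: a small feed can be almost entirely contained in a large one while covering only a sliver of it.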

Overlap Test - Inbound

Overlap Test - Outbound

Uniqueness Test

- Get 100 fish in a pond
- Tag all
- Get 100 more fish: how many were tagged? 5?
-> 20x more fish
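The fish example is a mark-and-recapture estimate. One standard way to write it is the Lincoln-Petersen estimator, N = (first sample x second sample) / recaptured, sketched here in Python with the numbers from the slide. The function name is illustrative.

```python
def lincoln_petersen(first_sample, second_sample, recaptured):
    """Estimate total population size from a mark-and-recapture experiment."""
    if recaptured <= 0:
        raise ValueError("at least one tagged individual must be recaptured")
    if recaptured > min(first_sample, second_sample):
        raise ValueError("recaptured cannot exceed either sample size")
    return first_sample * second_sample / recaptured


# Slide numbers: tag 100 fish, catch 100 more, find 5 of them tagged.
estimate = lincoln_petersen(100, 100, 5)
print(estimate)        # 2000.0
print(estimate / 100)  # 20.0, i.e. 20x more fish than were tagged
```

Read with threat feeds in mind, two lists play the role of the two catches: the fewer indicators they share, the larger the pool of indicators that neither list is seeing.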

Uniqueness Test

- Domain-based indicators are unique to one list between 96.16% and 97.37% of the time
- IP-based indicators are unique to one list between 82.46% and 95.24% of the time
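A uniqueness figure of this kind can be sketched as the share of distinct indicators that show up in exactly one list. The Python below is an illustration under that assumed definition, with hypothetical feed names and contents; the exact computation behind the percentages on the slide is in the TIQ-Test repository.

```python
from collections import Counter


def uniqueness(feeds):
    """Fraction of distinct indicators that appear in exactly one feed."""
    appearances = Counter(
        indicator
        for indicators in feeds.values()
        for indicator in set(indicators)  # count each feed at most once per indicator
    )
    if not appearances:
        return 0.0
    unique = sum(1 for count in appearances.values() if count == 1)
    return unique / len(appearances)


# Hypothetical domain feeds.
feeds = {
    "list_a": {"a.example", "b.example", "c.example"},
    "list_b": {"c.example", "d.example"},
    "list_c": {"d.example", "e.example"},
}
unique_share = uniqueness(feeds)
print(unique_share)  # 0.6
```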

I hate quoting myself, but

Key Takeaway #1
MORE != BETTER
Threat Intelligence Indicator Feeds
Threat Intelligence Program

Intermission


Key Takeaway #2

"These are the problems Threat Intelligence Sharing is here to solve!"

Right?

Herd Immunity, is it?

Source:www.vaccines.gov

Herd Immunity would imply that others in your sharing community being immune to malware A meant you wouldn't get it even if you were still vulnerable to it.

Threat Intelligence Sharing

How many indicators are being shared?

How many members do actually share and how many just leech?

Can we measure that? What a super-deeee-duper idea!

Threat Intelligence Sharing
We would like to thank the fine folks at Facebook Threat Exchange and Threat Connect for their kind contribution of data

and also the sharing communities that chose to remain anonymous. You know who you are, and we ♥ you too.

Threat Intelligence Sharing Data

From a period of 2015-03-01 to 2015-05-31:
- Number of Indicators Shared
  - Per day
  - Per member
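The sharing metrics listed on this slide (indicators shared per day and per member, plus how many members share at all) can be sketched from a log of sharing events. The Python below assumes events are (day, member, indicator) tuples and that the community size is known; member names, dates and counts are invented and do not reflect any community's data.

```python
from collections import defaultdict
from datetime import date


def sharing_metrics(events, member_count):
    """Summarize sharing activity from (day, member, indicator) events."""
    per_day = defaultdict(set)
    per_member = defaultdict(set)
    for day, member, indicator in events:
        per_day[day].add(indicator)
        per_member[member].add(indicator)
    return {
        "indicators_per_day": {day: len(items) for day, items in per_day.items()},
        "indicators_per_member": {member: len(items) for member, items in per_member.items()},
        # Members with at least one shared indicator, over all members.
        "sharing_member_ratio": len(per_member) / member_count if member_count else 0.0,
    }


# Invented events for a hypothetical community of 4 members.
events = [
    (date(2015, 3, 1), "member_1", "192.0.2.1"),
    (date(2015, 3, 1), "member_1", "192.0.2.2"),
    (date(2015, 3, 2), "member_2", "192.0.2.2"),
    (date(2015, 3, 2), "member_1", "bad.example"),
]
metrics = sharing_metrics(events, member_count=4)
print(metrics)
```

Here `sharing_member_ratio` is 0.5: two of the four hypothetical members contributed anything, and the rest only consumed.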

Not sharing this data: privacy concerns for the members and communities

LARGE is 36x bigger than SMALL

Could we be in a sharing community and not have paid feeds?

MATURITY?

Reddit of Threat Intelligence?

'How can sharing make me better understand what are attacks that are targeted and what are commodity?'

TELEMETRY > CONTENT
Key Takeaway #3 (Also Prediction #1)

More Takeaways (I lied)
- Analyze your data. Extract more value from it!
- If you ABSOLUTELY HAVE TO buy Threat Intelligence or data, evaluate it first.

Try the sample data, replicate the experiments:
- https://github.com/mlsecproject/tiq-test-Summer2015
- http://rpubs.com/alexcpsec/tiq-test-Summer2015

Share data with us. I'll make sure it gets proper exercise!

Alex Pinto
Chief Data Scientist, MLSec Project
@alexcpsec / @MLSecProject

Alexandre Sieira
CTO, Niddel
@AlexandreSieira / @NiddelCorp

"The measure of intelligence is the ability to change." - Albert Einstein

THANK YOU!