Post on 20-May-2020
Big Data?
Ricardo Campos
Instituto Politécnico de Tomar
Mestrado EI-IC – Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2017
What is Information Retrieval?
Agenda: What is this talk about?
1 Overview
2 Who
3 Where
4 Why
5 V’s
6 BD vs Traditional
7 Different Types of Data
8 Q&A
Big Data is used in the singular and refers to a collection of data sets so large and complex, it’s impossible to process them with the usual databases and tools.
Because of its size and associated numbers, Big Data is hard to capture, store, search,
share, analyze and visualize.
Consider reading:
https://www.simplilearn.com/whats-the-big-deal-about-big-data-article
https://storage.googleapis.com/supplemental_media/udacityu/306818608/Lesson%201%20Notes.pdf
The phenomenon came about in recent years due to the sheer amount of machine
data being generated today – thanks to:
mobile devices
tracking systems / RFID
sensor networks
social networks
internet searches
automated record keeping
video archives
e-commerce
coupled with the additional information derived by analyzing all this information,
which on its own creates another enormous data set.
Big Data analysis requires collecting massive amounts of messy data
The data is not in a uniform format as one would see in a traditional database, and it is not
annotated (semantically tagged).
Think of every tweet ever tweeted
Big Data analytics can reveal important patterns that would otherwise go unnoticed.
Taking the antidepressant Paxil together with the anti-cholesterol drug Pravachol could
result in diabetic blood sugar levels. Discovered by:
(1) using a symptomatic footprint characteristic of very high blood sugar levels
obtained by analyzing thirty years of reports in an FDA database, and
(2) then finding that footprint in the Bing searches using an algorithm that
detected statistically significant correlations. People taking both drugs also tended
to enter search terms (“fatigue” and “headache,” for example) that constitute the
symptomatic footprint.
Common use cases for Big Data:
• Fraud Detection;
• Risk Modeling;
• Social Sentiment Analysis;
• Image Classification;
• Graph Analysis
Please consider reading pages 29–39 of the
Hadoop for Dummies book
Please consider reading pages 3–9 of the
Harness the Power of Big Data: The IBM
Big Data Platform book
Big data must follow the same principles of data management:
Data collection (sensors etc)
Data storage (Oracle, SAP, IBM, EMC, Spark, Hadoop, Storm, BigQuery, Amazon
EC2 and EMR)
Data format conversion (voice2txt, txt2voice, natural language processing from
unstructured to structured)
Data integration (data linkage, metadata)
Data privacy (privacy-preserved data mining, computer security)
Technical Challenges:
Storage: How can we capture relevant data in time and then use the insight
derived from that data for business results?
Analysis: How can we understand and utilize it, when it comes in such a multitude
of unstructured formats?
Price: How can we analyze and manage the need for and the size of computational
capacity required to handle it safely?
Technical Challenges:
Storage: NoSQL (non-relational) databases: Hadoop, DynamoDB, Berkeley DB, MongoDB, CouchDB …
Analysis: Parallel computing (Hadoop’s MapReduce).
Price: Parallel computing (Hadoop’s MapReduce).
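To make the MapReduce idea mentioned above concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps for a word count. It is only an illustration in plain Python, not Hadoop's actual API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # On a real cluster each document (or block) would be mapped on a
    # different node; here the map calls simply run one after another.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data is big", "data is everywhere"]
    print(word_count(docs))
```

Because each map call only looks at its own document and each reduce call only looks at one key, both steps can be spread over many machines, which is what makes the approach attractive for large data sets.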
Big Data is so promising that IBM has created the Big Data University
Companies pursue Big Data because it can be revelatory in spotting business trends,
improving research quality, and gaining insights in a variety of fields, from IT to
medicine to law enforcement and everything in between and beyond.
A health care consultancy has made the data coming out of medical
practices the focus of its thriving business. The company collects billing
and diagnostic code data from 10,000 doctors on a daily, weekly and
monthly basis to create a virtual clinical integration model.
Health
Cloud services such as Ginger.io already allow care providers to monitor their patients
through sensor-based applications on their smartphones.
GPS (Global Positioning System) technology now allows trucking firms to track their trucks and the
merchandise inside them. Practically anything you can attach an RFID tag to can be tracked. How
a company uses that information – to re-route trucks to create efficient routes, alert customers to
deliveries, and forecast and price services – depends on the ability to manage and analyze data
effectively.
Walmart handles more than 1 million customer
transactions every hour, which are imported into
databases estimated to contain more than 2.5 petabytes
of data, the equivalent of 167 times the information
contained in all the books in the US Library of Congress.
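As a quick sanity check on the scale of these figures, the arithmetic can be sketched as follows. The only inputs are the numbers quoted above; the decimal conversion (1 PB = 1000 TB) is an assumption, and the variable names are purely illustrative.

```python
# Figures quoted on the slide.
transactions_per_hour = 1_000_000
database_size_pb = 2.5
library_of_congress_multiples = 167

# Assumption: decimal units, 1 petabyte = 1000 terabytes.
TB_PER_PB = 1000

# Transactions accumulated over a full day at the quoted hourly rate.
transactions_per_day = transactions_per_hour * 24

# Size the comparison implies for "all the books in the Library of Congress".
implied_library_size_tb = database_size_pb * TB_PER_PB / library_of_congress_multiples

if __name__ == "__main__":
    print(f"Transactions per day: {transactions_per_day:,}")
    print(f"Implied Library of Congress size: {implied_library_size_tb:.1f} TB")
```

In other words, the comparison implies roughly 15 TB for the library's books, and at least 24 million transactions per day.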
Consumer Products Companies
https://pplware.sapo.pt/informacao/amazon-go-fim-das-filas-caixas-supermercado/
Last month, I talked to Amazon customer service about my malfunctioning Kindle, and
it was great. Thirty seconds after putting in a service request on Amazon’s website, my
phone rang, and the woman on the other end--let’s call her Barbara--greeted me by
name and said, "I understand that you have a problem with your Kindle." We resolved
my problem in under two minutes, we got to skip the part where I carefully spell out my
last name and address, and she didn’t try to upsell me on anything. After nearly a
decade of ordering stuff from Amazon, I never loved the company as much as I did at
that moment.
Article by Sean Madden, May 2012, an expert in service design and innovation
strategy.
The fact is, Amazon has been collecting my information for years--not just addresses
and payment information but the identity of everything I’ve ever bought or even
looked at. And while dozens of other companies do that, too, Amazon’s doing
something remarkable with theirs. They’re using that data to build our relationship.
Sports Clubs
In one of the greatest sports stories of all
time, Leicester City won the Premier League
title of 2015/16.
Throughout the history of the Premier League, every champion, until now, has
finished in the top 3 in the season before winning the title. Leicester City, however, was
an exception, finishing the 2014/15 season in 14th place, 46 points behind winners
Chelsea. How did they do it?
Please consider reading this article: https://www.simplilearn.com/data-analytics-behind-leicester-city-16-epl-win-article
http://www.maisfutebol.iol.pt/benfica/formacao/video-imagens-nunca-vistas-sobre-a-maquina-do-seixal?_ga=2.67815961.940326162.1493990127-1281919722.1484127213
Big Pharmaceutical Companies
Government Agencies
Credit Card Companies
Telecoms
Facebook uses Hadoop, Hive, and HBase for data warehousing and real-time application serving. Their data warehousing clusters are petabytes in size with thousands of nodes.
Please consider reading these articles:
https://www.simplilearn.com/how-facebook-is-using-big-data-article
https://www.facebook.com/note.php?note_id=468211193919
Twitter uses Hadoop, Pig, and HBase for data analysis, visualization, social graph analysis, and machine learning
Yahoo
Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization, ETL, and more. Combined, it has over 40,000 servers running Hadoop with 170 PB of storage.
Google, in its MapReduce paper, indicated that it used its version of MapReduce to create its web index from crawl data.
In 2010 Google moved to a real-time indexing system called Caffeine:
Please consider reading this article: https://googleblog.blogspot.pt/2010/06/our-new-search-index-caffeine.html
eBay, Samsung, ….
eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL, Last.fm, and StumbleUpon are some of the other organizations that are also heavily invested in Hadoop and Spark.
IBM Watson
Sloan Kettering Cancer Center doctors are training IBM Watson to be an expert in
cancer diagnosis and treatment based on learning:
Over 600,000 diagnostic reports
Two million pages of medical journal articles
One and a half million patient records
Watson is an IBM supercomputer that combines artificial intelligence (AI) and sophisticated analytical software for optimal performance as a “question answering” machine (https://web.stanford.edu/~jurafsky/slp3/28.pdf and http://start.csail.mit.edu/index.php);
The supercomputer is named for IBM’s founder, Thomas J. Watson.
To replicate (or surpass) a high-functioning human’s ability to answer questions, Watson accesses 90 servers with a combined data store of over 200 million pages of information, which it processes against six million logic rules.
Apache's Hadoop, a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
SUSE operating system;
2,880 processor cores.
15 terabytes of RAM.
IBM's DeepQA software, which is designed for information retrieval and incorporates natural language processing and machine learning.
It performs text mining and complex analytics on huge volumes of unstructured data;
Not available through a Web interface;
Vertical applications such as healthcare and decision support;
Watson triumphs in Jeopardy's man vs. machine challenge
http://www.computerworld.com/article/2513199/high-performance-computing/watson-triumphs-in-jeopardy-s-man-vs--machine-challenge.html
IBM's Watson Supercomputer May Soon Be The Best Doctor In The World
http://www.businessinsider.com/ibms-watson-may-soon-be-the-best-doctor-in-the-world-2014-4
Watson is already capable of storing far more medical information than doctors;
Its decisions are all evidence-based and free of cognitive biases;
It's also capable of understanding natural language, generating hypotheses, evaluating the strength of those hypotheses, and learning — not just storing data, but finding meaning in it.
It’s based on all available medical knowledge. Human doctors can’t possibly hold this much information in their heads, or keep up with it as it changes over time. Dr. Watson knows it all and never overlooks or forgets anything.
It’s accurate. If Dr. Watson is as good at medical questions as the current Watson is at game show questions, it will be an excellent diagnostician indeed.
It has very low marginal cost. It’ll be very expensive to build and train Dr. Watson, but once it’s up and running the cost of doing one more diagnosis with it is essentially zero
It’s consistent. Given the same inputs, Dr. Watson will always output the same diagnosis. Inconsistency is a surprisingly large and common flaw among human medical professionals, even experienced ones. And Dr. Watson is always available and never annoyed, sick, nervous, hungover, upset, in the middle of a divorce, sleep-deprived, and so on.
It can be offered anywhere in the world. If a person has access to a computer or mobile phone, Dr. Watson is on call for them.
http://andrewmcafee.org/2011/03/mcafee-watson-ibm-healthcare-verghese/
https://www.youtube.com/watch?v=_Xcmh1LQB9I
https://www.youtube.com/watch?v=P18EdAKuC1U
http://tek.sapo.pt/noticias/computadores/artigo/robots_na_saude_ainda_estao_longe_de_substituir_os_m-49530btt.html
The Internet of Things (IoT) is a scenario in which
objects, animals or people are provided with
unique identifiers and the ability to automatically
transfer data over a network without requiring
human-to-human or human-to-computer
interaction
Sports
It’s now possible to get a basketball with over 200 built-in
sensors that provide players and coaches with detailed
feedback on performance
In tennis a system called SlamTracker can record a
player’s performance providing real-time statistics
and comprehensive match analytics.
If you’ve ever watched rugby you may have wondered what
the bump is between the players’ shoulder blades – it’s a GPS
tracking system that allows the coaching staff to assess
performance in real time.
The device will measure the players’ average speed, whether the player is performing
above or below their normal levels, and heart rate, to identify potential problems
before they occur.
Self-driving cars
Computers in cars know where you go, when you go, how fast you go, how many times
you stop along the way, whether you stay in your lane, what your average MPG is, how
you like your temperature, how close you get before stepping on the brake, and tens of
thousands of other facts….instantly.
The ethical dilemma of self-driving cars
https://www.youtube.com/watch?v=ixIoDYVfKA0
Analyzing all of this data rapidly allows a self-driving car to:
Anticipate where you are going by looking at driving history
Check road signs using sensors to know what the speed limit is or if a stop sign is approaching
Alert and activate your braking and steering systems if pedestrians are in the street or you’re too
close to the curb or you drift into another lane or you doze off.
By 2040, it is anticipated that people will no longer need to get driver’s licenses. Cars will be able to drop
someone off and then go find a parking space.
Homes
There are smart thermostats that monitor the home
and only heat the areas that are being used. The
temperature of your home can be changed while
you are still at work so that when you arrive on a
winter’s evening the house is cosy
Smart TVs use face recognition to make sure your children don’t ever watch anything unsuitable
for their age
Considering all the toys, gadgets and smart appliances there are now more machines
connected to the Internet than people. And all those smart things are gathering data
and communicating with each other.
http://exameinformatica.sapo.pt/noticias/insolitos/2016-10-21-Esta-coleira-diz-lhe-se-o-cao-esta-feliz
Gadgets
Social Networks
Online dating site eHarmony matches people based on
twenty-nine different variables such as personality traits,
behaviours, beliefs, values and social
skills.
Search Engines
Web Browsers
Electronic Devices
Movie Rental Sites
Apps
Restaurant reservations (Open Table)
Weather in L.A. in 3 days (Weather+)
Side effects of medications (MedWatcher)
3-star hotels in New Orleans (Priceline)
Which PC should I buy and where (PriceCheck)
From traffic patterns and music downloads to web history and medical records, data
is recorded, stored, and analyzed to enable the technology and services that the
world relies on every day. But what exactly is big data used for?
To send you catalogs for exactly the merchandise you typically purchase.
To suggest medications that precisely match your medical history.
To “push” television channels to your set instead of your “pulling” them in.
To send advertisements on those channels just for you!
To know what you need before you even know you need it based on past
purchasing habits!
To notify you of your expiring driver’s license or credit cards or last refill on a Rx, etc.
To give you turn-by-turn directions to a shelter in case of emergency.
Predict weather patterns to plan optimal wind turbine usage, and optimize capital
expenditure on asset placement
Make risk decisions based on real-time transactional data
Identify criminals and threats from disparate video, audio, and data feeds
(recordedfuture.com)
Detect life-threatening conditions at hospitals in time to intervene
Multi-channel customer sentiment and experience analysis
According to IBM scientists, big data can be broken down into four dimensions:
Volume, Velocity, Variety and Veracity.
Volume: 12+ terabytes of Tweets created daily.
Velocity: 5+ million trade events per second. Analysis of data to take decisions within seconds.
Variety: 100’s of different types of data (structured, unstructured, text, multimedia).
Veracity: Only 1 in 3 decision makers trust their information. Fact checking (https://poligrafo.sapo.pt/).
Please consider reading pages 9–15 of the
Harness the Power of Big Data: The IBM
Big Data Platform book
Cost efficiently processing the growing Volume: 50x growth from 2010 to 2020, reaching 35 ZB
Responding to the increasing Velocity: 30 billion RFID sensors and counting
Collectively analyzing the broadening Variety: 80% of the world’s data is unstructured
Establishing the Veracity of big data sources: 1 in 3 business leaders don’t trust the information they use to make decisions
Volume
Many factors contribute to the increase in data volume:
Transaction-based data stored through the years.
Unstructured data streaming in from social media.
Increasing amounts of sensor and machine-to-machine data being collected.
In the past, excessive data volume was a storage issue. But with decreasing storage
costs, other issues emerge, including how to determine relevance within large data
volumes and how to use analytics to create value from relevant data.
Variety
Data today comes in all types of formats.
Structured, numeric data in traditional databases.
Information created from line-of-business applications.
Unstructured text documents, email, video, audio, stock ticker data and financial
transactions.
Managing, merging and governing different varieties of data is something many
organizations still grapple with.
Velocity
Data is streaming in at unprecedented speed and must be dealt with in a timely
manner.
RFID tags, sensors and smart metering are driving the need to deal with torrents of
data in near-real time.
Reacting quickly enough to deal with data velocity is a challenge for most
organizations.
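One common way to deal with data "in near-real time" is to keep only a small window of the most recent readings and update a summary as each new value arrives, instead of storing everything first and analyzing later. The sketch below illustrates that idea with a rolling average; the function name rolling_average and the sample sensor values are hypothetical.

```python
from collections import deque


def rolling_average(readings, window_size=3):
    """Yield the mean of the most recent `window_size` readings as each one arrives."""
    # A deque with maxlen discards the oldest reading automatically.
    window = deque(maxlen=window_size)
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)


if __name__ == "__main__":
    # Stand-in for a live sensor feed; any iterable or generator would work.
    sensor_feed = [10, 20, 30, 40]
    for average in rolling_average(sensor_feed, window_size=2):
        print(average)
```

Memory use stays constant no matter how long the stream runs, which is the property that matters when data never stops arriving.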
Veracity
Big Data Veracity refers to the biases, noise and abnormality in data.
Is the data that is being stored and mined meaningful to the problem being
analyzed?
In scoping out your big data strategy, you need your team and partners to work
on keeping your data clean, with processes that keep ‘dirty data’ from accumulating in
your systems.
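A very simple process for keeping "dirty data" out could look like the sketch below: each incoming reading is checked for missing values and for values outside a plausible range before it is stored. The function name clean_readings, the range limits and the sample values are assumptions made for illustration only.

```python
def clean_readings(readings, low, high):
    """Split readings into kept (present and within [low, high]) and rejected."""
    kept, rejected = [], []
    for reading in readings:
        if reading is not None and low <= reading <= high:
            kept.append(reading)
        else:
            # Missing values and implausible values are set aside for review
            # rather than silently stored alongside the clean data.
            rejected.append(reading)
    return kept, rejected


if __name__ == "__main__":
    # Hypothetical room-temperature readings in degrees Celsius.
    raw = [21.5, None, 22.0, 999.0, -40.0]
    kept, rejected = clean_readings(raw, low=0, high=50)
    print("kept:", kept)
    print("rejected:", rejected)
```

Real pipelines use richer rules (duplicates, inconsistent units, biased samples), but the principle of validating data before it accumulates is the same.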
Value
Value is defined as the usefulness of data for an enterprise. The value characteristic is
intuitively related to the veracity characteristic in that the higher the data fidelity, the
more value it holds for the business;
Value is also dependent on how long data processing takes because analytics
results have a shelf-life;
The real value is not in the large volumes of data but what we can now do with it.
It is not the amount of data that is making the difference but our ability to analyze
vast and complex data sets beyond anything we could ever do before.
Innovations such as cloud computing combined with improved network speed as
well as creative techniques to analyse data have resulted in a new ability to turn vast
amounts of complex data into value.
What’s more, the analysis can now be performed without the need to purchase or build
large supercomputers.
The longer it takes for data to be turned into meaningful information, the less value
it has for a business
Structured vs. Exploratory
Traditional Approach: Structured & Repeatable Analysis
Business Users: Determine what question to ask
IT: Structures the data to answer that question
Examples: Monthly sales reports, Profitability analysis, Customer surveys
Big Data Approach: Iterative & Exploratory Analysis
IT: Delivers a platform to enable creative discovery
Business Users: Explore what questions could be asked
Examples: Brand sentiment, Product strategy, Maximum asset utilization
The data processed by Big Data solutions can be human-generated
or machine-generated
Human-generated
Machine-generated
Human-generated and machine-generated data can come from a variety of sources
and be represented in various formats or types. The primary types of data are:
• Structured Data
• Unstructured Data
• Semi-Structured Data
Structured Data
Structured data conforms to a data model or schema and is often stored in tabular
form. It is used to capture relationships between different entities and is therefore
most often stored in a relational database.
Examples of this type of data include banking transactions, invoices, and customer
records.
Unstructured Data
Data that does not conform to a data model or data schema is known as
unstructured data. It is estimated that unstructured data makes up 80% of the data
within any given enterprise.
This form of data is either textual or binary and often conveyed via files that are self-
contained and non-relational. A text file may contain the contents of various tweets
or blog postings. Binary files are often media files that contain image, audio or video
data
Basically, unstructured data is the data we can’t easily store and index in traditional
formats or databases and includes email conversations, social media posts, video
content, photos, voice recordings, sounds, etc.
In most businesses there are already huge amounts of text or word-based data in the
form of documents, reports, internal and external communication, customer
communication, emails, websites, social media updates and blogs.
And while all those words are structured to make sense to a human being, they are
unstructured from an analytics perspective, as they don’t fit neatly into a relational
database or the rows and columns of a spreadsheet.
But they still present a huge opportunity if we can just figure out how to use them.
What sets unstructured data apart from structured data is that its structure is
unpredictable.
Some people believe that the term unstructured data is misleading because each text
source may contain its own specific structure or formatting based on the software that
created it. In fact, it is the content of the document that is really unstructured
Semi-Structured Data
Semi-structured data has a defined level of structure and consistency, but is not
relational in nature. This kind of data is commonly stored in files that contain text. Due
to the textual nature of this data and its conformance to some level of structure, it is
more easily processed than unstructured data.
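The difference between the three data types can be illustrated by representing the same piece of information in each form. The sketch below uses CSV for structured data, JSON as one common example of semi-structured data, and a free-text sentence for unstructured data; the customer record and all names in it are invented for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured = "customer_id,name,amount\n42,Ana Silva,19.90\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys and nesting, but no fixed relational schema.
semi_structured = (
    '{"customer_id": 42, "name": "Ana Silva", '
    '"orders": [{"amount": 19.90, "tags": ["gift"]}]}'
)
record = json.loads(semi_structured)

# Unstructured: free text; the facts are there, but no field tells us where.
unstructured = "Ana Silva called to say her order of 19.90 euros arrived late."
mentions_amount = "19.90" in unstructured

if __name__ == "__main__":
    print(rows[0]["amount"])               # read by column name
    print(record["orders"][0]["amount"])   # read by navigating keys
    print(mentions_amount)                 # found only by searching the text
```

The first two forms can be queried by field name, while the third requires text search or natural language processing to recover the same fact.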
These data types refer to the internal organization of data and are sometimes called
data formats. Apart from these three fundamental data types, another important type
of data in Big Data environments is metadata.
Metadata
Metadata provides information about a dataset’s characteristics and structure. This type
of data is mostly machine-generated and can be appended to data
The tracking of metadata is crucial to Big Data processing, storage and analysis
because it provides information about the pedigree of the data and its provenance
during processing. Examples of metadata include:
• XML tags providing the author and creation date of a document
• Attributes providing the file size and resolution of a digital photograph
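As an illustration of the first example, the sketch below reads author and creation-date tags from a small XML document using Python's standard library. The document layout, the tag names (metadata, author, created) and the values are hypothetical; real formats define their own metadata tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <metadata> block describes the content in <body>.
document = """<document>
  <metadata>
    <author>Jane Doe</author>
    <created>2017-05-04</created>
  </metadata>
  <body>Quarterly sales summary.</body>
</document>"""

root = ET.fromstring(document)

# The metadata can be read without touching or interpreting the body text.
author = root.findtext("metadata/author")
created = root.findtext("metadata/created")

if __name__ == "__main__":
    print(f"author={author}, created={created}")
```

Keeping this kind of information alongside the data is what makes it possible to trace where a dataset came from as it moves through processing.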
https://www.youtube.com/watch?v=l-SVN3txo_4