INTRODUCTION TO DATA SCIENCE
NIKO VUOKKO
JYVÄSKYLÄ SUMMER SCHOOL
AUGUST 2013
DATA SCIENCE WITH A BROAD BRUSH
Concepts and methodologies
DATA SCIENCE IS AN UMBRELLA, A FUSION
• Databases and infrastructure
• Pattern mining
• Statistics
• Machine learning
• Numerical optimization
• Stochastic modeling
• Data visualization
… of specialties needed
for data-driven
business optimization
DATA SCIENTIST
• Data scientist is defined as DS : business problem → data solution
• Combination of strong programming, math, computational and business skills
• Recipe for success
1. Convert vague business requirements into measurable technical targets
2. Develop a solution to reach the targets
3. Communicate business results
4. Deploy the solution in production
UNDERSTANDING DATA
Monday 19 August 2013
PATTERN MINING AND DATA ANALYSIS
UNSUPERVISED LEARNING
• Could be called pattern recognition or structure discovery
• What kind of a process could have produced this data?
• Discovery of “interesting” phenomena in a dataset
• Now how do you define interesting?
• Learning algorithms exist for a huge collection of pattern types
• Analogy: You decide if you want to see westerns or comedies,
but the machine picks the movies
• But does “interesting” imply useful and significant?
EXAMPLES OF STRUCTURES IN DATA
• Clustering and mixture models: separation of data into parts
• Dictionary learning: a compact grammar of the dataset
• Single class learning: learn the natural boundaries of data
Example: Early detection of machine failure or network intrusion
• Latent allocation: learn hidden preferences driving purchase decisions
• Source separation: find independent generators of the data
Example: Independent phenomena affecting exchange rates
MORE EXAMPLES OF “INTERESTING” PATTERNS
• { charcoal, mustard } ⇒ sausage
• Grocery customer types with differing paths around the trading floor
• Pricing trend change in a web ad exchange
• Communities and topics in a social network
• Distinct features of a person’s face and fingerprints
• Objects emerging in front of a moving car
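
The first pattern above is an association rule. A minimal sketch of how such a rule could be scored, assuming a handful of made-up shopping baskets (the baskets and the helper names `support` and `confidence` are illustrative, not from the slides):

```python
# Hypothetical toy baskets; item names follow the slide's example.
baskets = [
    {"charcoal", "mustard", "sausage"},
    {"charcoal", "mustard", "sausage", "beer"},
    {"charcoal", "mustard"},
    {"mustard", "bread"},
    {"sausage", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Score the rule { charcoal, mustard } => sausage.
rule_support = support({"charcoal", "mustard", "sausage"}, baskets)
rule_confidence = confidence({"charcoal", "mustard"}, {"sausage"}, baskets)
```

High confidence alone does not make a rule useful: sausage may simply be popular in every basket, which is the "interesting versus significant" question raised earlier.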
KNOW YOUR EIGENS AND SINGULARS
• Eigenvalue and singular value decompositions are central data analysis tools
• They describe the energy distribution and static core structures of data
Examples
• Face detection, speaker adaptation
• Google PageRank is basically just the world’s largest EVD
• Zombie outbreak risk is determined by its eigenvalues
• As a sub-component in every second learning algorithm
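
To illustrate the PageRank remark: the rank vector is the dominant eigenvector of a damped link matrix, and power iteration is one simple way to approximate it. The sketch below is a toy version under stated assumptions (a four-page made-up graph, damping 0.85, every link target also appears as a key), not a description of Google's production system:

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for the dominant eigenvector of the damped link matrix.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "teleport" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Each iteration is one multiplication by the link matrix, so repeated application converges towards the eigenvector with eigenvalue 1.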
DIMENSION REDUCTION
• Some applications encounter large dimension counts up to millions
• Dimension reduction may either
1. Retain space: preserve the most “descriptive” dimensions
2. Transform space: trade interpretability for powerful rendition
• Transformations usually work obliviously to the data (they are simple functions)
• Curvilinear transformations try to see how the data is “folded” and build new
dimensions specific to the given dataset
DIMENSION REDUCTION EXAMPLE
• Singular value decomposition is commonly used to remove the “noise
dimensions” with little energy
• Example: gene expression data and movie preferences have lots of these
• After this, more complex methods can be used for unfolding the data
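
The noise-removal step can be written compactly. The notation here is an assumption, not from the slides: X is the data matrix of rank r, and k is the number of dimensions kept.

```latex
\begin{align}
X &= U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i u_i v_i^{\top},
  \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 \\
X_k &= \sum_{i=1}^{k} \sigma_i u_i v_i^{\top},
  \qquad
  \lVert X - X_k \rVert_F^2 = \sum_{i=k+1}^{r} \sigma_i^2
\end{align}
```

The discarded "energy" is exactly the sum of the squared singular values that were dropped, so dimensions with small singular values can be removed at little cost.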
DIMENSION REDUCTION EXAMPLE
BLIND SOURCE SEPARATION
• Find latent sources that generated the data
• Tries to discover the real truth beneath all noise and convolution
• Examples:
• Air defense missile guidance systems
• Error-correcting codes
• Language modeling
• Brain activity factors
• Industrial process dynamics
• Factors behind climate change
(STATISTICAL) SIGNIFICANCE TESTING
• Example: Rejection rate increase in a manufacturing plant
• “What is the probability of observing this increase if everything was OK?”
• “What is the probability of having a valid alert if there really was something
wrong?”
• Reliability of significance testing results is wholly dependent on correct
modeling of the data source and pattern type
• Statistical significance is different from material significance
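
The first question above can be sketched as a binomial tail probability. This assumes that items are rejected independently at a known baseline rate, which is exactly the kind of modeling assumption the slide warns about; the numbers are invented for illustration:

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more rejections."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers: the plant normally rejects 2% of items,
# and today 9 out of 200 items were rejected.
p_value = binomial_tail(9, 200, 0.02)
```

A small `p_value` says the increase would be unusual if everything was OK; it says nothing about whether the increase is large enough to matter for the business.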
CORRELATION IS NOT CAUSALITY
A correlation may hide an almost arbitrary truth
• Cities with more firemen have more fires
• Companies spending more in marketing have higher revenues
• Marsupials exist mainly in Australia
• However, making successful predictions does not require causality
MACHINE LEARNING
Basics
SUPERVISED LEARNING
• Simplistically, the task is to find a function f such that f(input) = output
• Examples: spam filtering, speech recognition, steel strength estimation
• Risks for different types of errors can be very skewed
• Complex inputs may confuse or slow down models
• Unsupervised methods often useful in improving results by simplifying the input
SEMI-SUPERVISED LEARNING
• Only part of the data is labeled
• Needed when labeling data is expensive
• Understanding the structure of unlabeled data enhances learning by bringing
diversity and generalization and by constraining learning
• Relates to multi-source learning, some sources labeled, some not
• Examples:
• Object detection from a video feed
• Web page categorization
• Sentiment analysis
• Transfer learning between domains
TRAINING, TESTING, VALIDATION
• A model is trained using a training dataset
• The quality of the model is measured by using it on a separate testing dataset
• A model often contains hyper-parameters chosen by the user
• A separate validation dataset is split off from the training data
• Validation data is used for testing and finding good hyper-parameter values
• Cross-validation is common practice and asymptotically unbiased
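
A minimal sketch of how k-fold cross-validation could split the training data, assuming rows are addressed by index (the function name `k_fold_indices` and its parameters are illustrative):

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, k=5))
```

Each row serves as validation data exactly once. The testing dataset stays outside this loop and is used only for the final quality measurement.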
BIAS AND VARIANCE
• Squared error of predictions consists of bias and variance (and noise)
• BIAS: the model's inability to approximate the underlying truth
• VARIANCE: the model's reliance on the whims of the observed data
• Complex models often have low bias and high variance
• Simple models often have high bias and low variance
• Having more data instances (rows) may reduce variance
• Having more detailed data (variables) may reduce bias
• Testing different types of models can explain how to improve your data
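
The decomposition in the first bullet can be written out. The notation is an assumption, not from the slides: the data follows y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over training sets and noise.

```latex
\begin{align}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{noise}}
\end{align}
```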
TRAINING AND TESTING, BIAS AND VARIANCE
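[Figure: training and testing error as a function of model complexity, from simple model to complex model, with the points of minimal testing error and minimal training error marked]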
MACHINE LEARNING
Learning new tricks
THE KERNEL TRICK
• Many learning methods rely on inner products of data points
• The “kernel trick” maps the data to an implicitly defined, high dimension space
• Kernel is the matrix of the new inner products in this space
• Mapping itself often left unknown
• Example: Gaussian kernel associates local Euclidean neighborhoods to similarity
• Example: String kernels are used for modeling DNA sequence structure
• Kernels can be combined and custom built to match expert knowledge
A kernel is a dataset-specific space transformation,
success depends on good understanding of the dataset
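
A small sketch of the idea, using a degree-2 polynomial kernel because its feature map can be written out by hand (this choice and the function names are illustrative, not from the slides). The kernel value computed in the input space equals the inner product in the mapped space, so the mapping never has to be computed:

```python
import math

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel: squared inner product in the input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: similarity decays with Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, y)  # never leaves the 2-D input space
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(y)))
```

For the Gaussian kernel the corresponding space is infinite-dimensional, which is why the mapping is left implicit in practice.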
ENSEMBLE LEARNING
• The power of many: combine multiple models into one
• Broad and strong empirical evidence of superior performance
• Extra bonus: often trivially parallelizable
“OUR EXPERIENCE IS THAT MOST EFFORTS SHOULD BE CONCENTRATED IN
DERIVING SUBSTANTIALLY DIFFERENT APPROACHES, RATHER THAN REFINING
A SINGLE TECHNIQUE.”
Netflix $1M prize winner (ensemble of 107 models)
ENSEMBLE LEARNING IN PRACTICE
• Boosting: weighted combination of simple models (⇒ low variance), each focused on the errors of the earlier ones (⇒ low bias)
• Bagging: average (⇒ low variance) the results of flexible models trained on bootstrap samples (⇒ low bias)
• What aspect of the data am I still missing?
• Variable mixing, discretized jumps, independent factors, transformations, etc.
• Questions about practical implementability and ROI
• Failure: Netflix winner solution never taken to production
• Success: Official US hurricane model is an ensemble of 43
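
A minimal bagging sketch: each base model is trained on a bootstrap resample and the predictions are averaged. The base learner here is a one-split regression stump chosen only for brevity, and the data is made up; both are assumptions for illustration.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump minimising squared error."""
    mean = sum(ys) / len(ys)
    best = (float("inf"), None, mean, mean)
    # Skip the largest x so the right side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:
        # All x values identical: fall back to a constant prediction.
        return lambda x: mean
    return lambda x: left_mean if x <= threshold else right_mean

def bagged_predictor(xs, ys, n_models=25, seed=0):
    """Average stumps trained on bootstrap resamples of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return lambda x: sum(m(x) for m in models) / len(models)

# Hypothetical noisy step-shaped data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, -0.2, 0.0, 0.3, 1.1, 0.9, 1.2, 0.8]
predict = bagged_predictor(xs, ys)
```

Each stump is fitted independently of the others, which is why this kind of ensemble parallelizes so easily.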
RANDOMIZED LEARNING
• Motivation: random variation beats expert guidance surprisingly often
• Introducing randomness can improve generalization performance (smaller
variance)
• Randomness allows methods to discover unexpected success
• Examples: genetic algorithms, simulated annealing, parallel tempering
• Increasingly useful to allow scale-out for large datasets
• Many successful methods combine random models as an ensemble
• Example: combining random projections or transformations can often beat optimized
unsupervised models
ONLINE LEARNING
• Instead of ingesting a training dataset, adjust the data model after every
incoming (instance, label) pair
• Allows quick adaptation and “always-on” operation
• Finds good models fast, but may miss the great one
⟹ suitable also as a burn-in for other models
• Useful especially for the present trend towards analyzing data streams
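
A minimal sketch of the update-per-pair idea, assuming a linear model trained with one stochastic gradient step on the squared error for each incoming example (the class name, learning rate and synthetic stream are illustrative assumptions):

```python
class OnlineLinearRegressor:
    """Linear model updated by one gradient step per incoming example."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """Adjust the model using a single (instance, label) pair."""
        error = self.predict(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error

# Hypothetical noise-free stream generated by y = 2*x + 1.
model = OnlineLinearRegressor(n_features=1)
for step in range(5000):
    x = [(step % 10) / 10.0]
    model.update(x, 2.0 * x[0] + 1.0)
```

The model is usable after every update and never stores the stream, which is what allows "always-on" operation.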
BAYESIAN BASICS
• Bayesians see data as fixed and parameters as distributions
• Parameters have prior assumptions that can encode expert knowledge
• Data is used as evidence for possible parameter values
• Final output is a set of posterior distributions for the parameters
• Models may employ only the most probable parameter values or their full
probability distribution
• Variational Bayes approximates the posterior with a simpler distribution
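
The prior-evidence-posterior cycle can be sketched with the conjugate Beta-Binomial pair, where the posterior has a closed form. The example (a conversion rate with a Beta(2, 8) prior) and the function names are assumptions for illustration:

```python
def beta_posterior(prior_alpha, prior_beta, successes, failures):
    """Conjugate update: Beta prior plus binomial evidence gives a Beta posterior."""
    return prior_alpha + successes, prior_beta + failures

def beta_mean(alpha, beta):
    """Mean of the Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_mode(alpha, beta):
    """Most probable parameter value (MAP); defined for alpha, beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)

# Hypothetical example: prior belief of roughly 20% conversion (Beta(2, 8)),
# then 30 conversions observed in 100 visits.
alpha, beta = beta_posterior(2, 8, successes=30, failures=70)
posterior_mean = beta_mean(alpha, beta)
map_estimate = beta_mode(alpha, beta)
```

Here `map_estimate` corresponds to using only the most probable parameter value, while the pair `(alpha, beta)` describes the full posterior distribution.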
MODEL COMPLEXITY
• Limiting model size and complexity can be used to avoid excessive variance (overfitting)
• Minimum description length and Akaike/Bayesian information criteria are the
Occam’s razor of data science
• VC dimension of a model provides a theoretical upper bound for generalization error
• Regularization can limit instance weights or parameter sizes
• Bayesian models use hyper-parameters to limit parameter overfit
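
The two information criteria mentioned above have simple standard formulas: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of parameters, n the number of samples and L the maximized likelihood. A small sketch with invented fit results:

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical fits on 100 samples: the complex model fits slightly better
# but uses three times as many parameters.
simple_aic = aic(-120.0, 2)
complex_aic = aic(-118.5, 6)
simple_bic = bic(-120.0, 2, 100)
complex_bic = bic(-118.5, 6, 100)
```

In this made-up case both criteria prefer the simple model: the small gain in likelihood does not pay for the extra parameters.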
THE END

 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Fact vs. Fiction: Autodetecting Hallucinations in LLMs
Fact vs. Fiction: Autodetecting Hallucinations in LLMsFact vs. Fiction: Autodetecting Hallucinations in LLMs
Fact vs. Fiction: Autodetecting Hallucinations in LLMs
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Introduction to Data Science

  • 1. INTRODUCTION TO DATA SCIENCE NIKO VUOKKO JYVÄSKYLÄ SUMMER SCHOOL AUGUST 2013
  • 2. DATA SCIENCE WITH A BROAD BRUSH Concepts and methodologies
  • 3. DATA SCIENCE IS AN UMBRELLA, A FUSION • Databases and infrastructure • Pattern mining • Statistics • Machine learning • Numerical optimization • Stochastic modeling • Data visualization … of specialties needed for data-driven business optimization
  • 4. DATA SCIENTIST • Data scientist is defined as DS : business problem → data solution • Combination of strong programming, math, computational and business skills • Recipe for success 1. Convert vague business requirements into measurable technical targets 2. Develop a solution to reach the targets 3. Communicate business results 4. Deploy the solution in production
  • 6. PATTERN MINING AND DATA ANALYSIS
  • 7. UNSUPERVISED LEARNING • Could be called pattern recognition or structure discovery • What kind of a process could have produced this data? • Discovery of “interesting” phenomena in a dataset • Now how do you define interesting? • Learning algorithms exist for a huge collection of pattern types • Analogy: You decide if you want to see westerns or comedies, but the machine picks the movies • But does “interesting” imply useful and significant?
  • 8. EXAMPLES OF STRUCTURES IN DATA • Clustering and mixture models: separation of data into parts • Dictionary learning: a compact grammar of the dataset • Single class learning: learn the natural boundaries of data Example: Early detection of machine failure or network intrusion • Latent allocation: learn hidden preferences driving purchase decisions • Source separation: find independent generators of the data Example: Independent phenomena affecting exchange rates
  • 9. MORE EXAMPLES OF “INTERESTING” PATTERNS • { charcoal, mustard } ⇒ sausage • Grocery customer types with differing paths around the trading floor • Pricing trend change in a web ad exchange • Communities and topics in a social network • Distinct features of a person’s face and fingerprints • Objects emerging in front of a moving car
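A rule such as { charcoal, mustard } ⇒ sausage is usually scored with support, confidence and lift. The following is a minimal sketch, assuming a made-up list of shopping baskets; the data and the function names are illustrative and do not come from the slides.

```python
# Hypothetical market baskets (illustrative data, not from the slides).
baskets = [
    {"charcoal", "mustard", "sausage"},
    {"charcoal", "mustard", "sausage", "beer"},
    {"charcoal", "mustard"},
    {"mustard", "bread"},
    {"sausage", "bread"},
    {"charcoal", "beer"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from basket counts."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the base rate of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_confidence = confidence({"charcoal", "mustard"}, {"sausage"}, baskets)
rule_lift = lift({"charcoal", "mustard"}, {"sausage"}, baskets)
print(rule_confidence, rule_lift)
```

A lift above 1 suggests that the antecedent raises the chance of the consequent relative to its base rate, which is one way to start asking whether an "interesting" pattern is also useful.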
  • 10. KNOW YOUR EIGENS AND SINGULARS • Eigenvalue and singular value decompositions are central data analysis tools • They describe the energy distribution and static core structures of data Examples • Face detection, speaker adaptation • Google PageRank is basically just the world’s largest EVD • Zombie outbreak risk is determined by its eigenvalues • As a sub-component in every second learning algorithm
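The PageRank remark can be illustrated with power iteration: repeatedly pushing rank along links converges toward the dominant eigenvector of the damped link matrix. This is a small sketch under stated assumptions; the toy link graph, the damping factor of 0.85 and the function name `pagerank` are choices made here for illustration only.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```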
  • 11. DIMENSION REDUCTION • Some applications encounter dimension counts running into the millions • Dimension reduction may either 1. Retain space: preserve the most “descriptive” dimensions 2. Transform space: trade interpretability for a more powerful representation • Transformations are usually oblivious to the data (they are simple fixed functions) • Curvilinear transformations try to see how the data is “folded” and build new dimensions specific to the given dataset
  • 12. DIMENSION REDUCTION EXAMPLE • Singular value decomposition is commonly used to remove the “noise dimensions” with little energy • Example: gene expression data and movie preferences have lots of these • After this more complex methods can be used for unfolding the data
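Removing low-energy "noise dimensions" with a singular value decomposition can be sketched as follows. The synthetic dataset (two latent dimensions embedded in twenty, plus small noise) and the choice `k = 2` are assumptions made for this illustration; NumPy is used for the decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 rows that really live in 2 dimensions, embedded in
# 20 dimensions and perturbed by a little noise (illustrative only).
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 20))
data = latent @ mixing + 0.01 * rng.normal(size=(100, 20))

U, s, Vt = np.linalg.svd(data, full_matrices=False)

# Share of total "energy" (squared singular values) per component.
energy = s**2 / np.sum(s**2)

k = 2                                      # number of components to keep
reduced = U[:, :k] * s[:k]                 # 100 x 2 coordinates
denoised = reduced @ Vt[:k, :]             # rank-k reconstruction in 20-D

print(np.round(energy[:4], 6))
```

In practice `k` is often chosen by looking for the point where the energy shares drop off sharply.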
  • 14. BLIND SOURCE SEPARATION • Find latent sources that generated the data • Tries to discover the real truth beneath all noise and convolution • Examples: • Air defense missile guidance systems • Error-correcting codes • Language modeling • Brain activity factors • Industrial process dynamics • Factors behind climate change
  • 15. (STATISTICAL) SIGNIFICANCE TESTING • Example: Rejection rate increase in a manufacturing plant • “What is the probability of observing this increase if everything was OK?” • “What is the probability of having a valid alert if there really was something wrong?” • Reliability of significance testing results is wholly dependent on correct modeling of the data source and pattern type • Statistical significance is different from material significance
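The first question on this slide ("what is the probability of observing this increase if everything was OK?") is a p-value. A minimal sketch, assuming rejections follow a binomial model with a known baseline rate; the numbers below are hypothetical and the binomial model itself is exactly the kind of modeling assumption the slide warns about.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


baseline_rate = 0.02   # hypothetical long-run rejection rate
inspected = 500        # hypothetical number of units inspected this week
rejected = 18          # hypothetical number of rejections observed

# Probability of seeing at least this many rejections if nothing changed.
p_value = binomial_tail(inspected, rejected, baseline_rate)
print(p_value)
```

A small p-value says the increase is unlikely under the baseline model; it says nothing about whether the increase matters financially, which is the material significance the slide distinguishes.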
  • 16. CORRELATION IS NOT CAUSALITY A correlation may hide an almost arbitrary truth • Cities with more firemen have more fires • Companies spending more in marketing have higher revenues • Marsupials exist mainly in Australia • However, making successful predictions does not require causality
  • 18. SUPERVISED LEARNING • Put simply, the task is to find a function f such that f(input) = output • Examples: spam filtering, speech recognition, steel strength estimation • The risks of different types of errors can be very skewed • Complex inputs may confuse or slow down models • Unsupervised methods are often useful for improving results by simplifying the input
  • 19. SEMI-SUPERVISED LEARNING • Only a part of the data is labeled • Needed when labeling data is expensive • Understanding the structure of unlabeled data enhances learning by bringing diversity and generalization and by constraining learning • Relates to multi-source learning, where some sources are labeled and some are not • Examples: • Object detection from a video feed • Web page categorization • Sentiment analysis • Transfer learning between domains
  • 20. TRAINING, TESTING, VALIDATION • A model is trained using a training dataset • The quality of the model is measured by using it on a separate testing dataset • A model often contains hyper-parameters chosen by the user • A separate validation dataset is split off from the training data • Validation data is used for testing and finding good hyper-parameter values • Cross-validation is common practice and asymptotically unbiased
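The training, validation and testing roles can be sketched with plain index bookkeeping. The split fractions, seed and function names below are assumptions for illustration; libraries such as scikit-learn provide equivalent utilities.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and cut them into training, validation and testing parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


train, val, test = train_val_test_split(range(100))
folds = list(k_fold_indices(10, 3))
print(len(train), len(val), len(test), [len(v) for _, v in folds])
```

Hyper-parameters would be tuned against `val` (or averaged over the folds), and `test` would be touched only once, for the final quality estimate.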
  • 21. BIAS AND VARIANCE • Squared error of predictions consists of bias and variance (and noise) • BIAS Model incapability of approximating the underlying truth • VARIANCE Model reliance on whims of the observed data • Complex models often have low bias and high variance • Simple models often have high bias and low variance • Having more data instances (rows) may reduce variance • Having more detailed data (variables) may reduce bias • Testing different types of models can explain how to improve your data
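The first bullet can be written out as the standard decomposition of expected squared error. The notation is an assumption made here, not taken from the slides: the data is modeled as y = f(x) + ε with zero-mean noise of variance σ², the fitted model is written f̂, and the expectation runs over training sets and noise.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```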
  • 22. TRAINING AND TESTING, BIAS AND VARIANCE [Figure: error as model complexity ranges from simple model to complex model, with the points of minimal testing error and minimal training error marked]
  • 24. THE KERNEL TRICK • Many learning methods rely on inner products of data points • The “kernel trick” maps the data to an implicitly defined, high dimension space • Kernel is the matrix of the new inner products in this space • Mapping itself often left unknown • Example: Gaussian kernel associates local Euclidean neighborhoods to similarity • Example: String kernels are used for modeling DNA sequence structure • Kernels can be combined and custom built to match expert knowledge A kernel is a dataset-specific space transformation, success depends on good understanding of the dataset
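The idea that a kernel computes inner products in an implicit space can be checked on a small case. This sketch assumes 2-D inputs and a degree-2 polynomial kernel, for which the explicit feature map is short enough to write down; the Gaussian kernel and its `gamma` value are included as a second, assumed example.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs matching poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2): nearby points are similar."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, y)                          # no mapping needed
explicit = dot(poly2_features(x), poly2_features(y))   # same value via mapping

points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
gram = [[gaussian_kernel(p, q) for q in points] for p in points]
print(implicit, explicit, gram[0])
```

For the Gaussian kernel the corresponding feature space is infinite-dimensional, so only the kernel values are ever computed.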
  • 25. ENSEMBLE LEARNING • The power of many: combine multiple models into one • Wide and strong proof of superior performance • Extra bonus: often trivially parallelizable • “Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique.” Netflix $1M prize winner (ensemble of 107 models)
  • 26. ENSEMBLE LEARNING IN PRACTICE • Boosting: weight (⇒ low bias) focused (⇒ low bias) simple models (⇒ low variance) • Bagging: average (⇒ low variance) results of complex models (⇒ low bias) • What aspect of the data am I still missing? • Variable mixing, discretized jumps, independent factors, transformations, etc. • Questions about practical implementability and ROI • Failure: Netflix winner solution never taken to production • Success: Official US hurricane model is an ensemble of 43
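Bagging can be sketched in a few lines: fit the same low-bias, high-variance base model on bootstrap resamples and average the predictions. The base model (1-nearest-neighbour regression), the synthetic data and all names below are assumptions chosen for illustration.

```python
import random


def fit_1nn(sample):
    """Return a 1-nearest-neighbour regressor fitted on (x, y) pairs."""
    def predict(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict


def bagging(data, n_models=25, seed=0):
    """Fit one model per bootstrap resample and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_1nn(bootstrap))

    def predict(x):
        return sum(model(x) for model in models) / len(models)

    return predict, models


# Hypothetical noisy samples of y = x^2.
noise = random.Random(1)
data = [(x / 10, (x / 10) ** 2 + noise.gauss(0, 0.1)) for x in range(-20, 21)]

bagged_predict, members = bagging(data)
member_preds = [model(0.55) for model in members]
bagged_pred = bagged_predict(0.55)
print(min(member_preds), bagged_pred, max(member_preds))
```

Each member model can be fitted independently, which is why ensembles of this kind are easy to parallelize.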
  • 27. RANDOMIZED LEARNING • Motivation: random variation beats expert guidance surprisingly often • Introducing randomness can improve generalization performance (smaller variance) • Randomness allows methods to discover unexpected success • Examples: genetic models, simulated annealing, parallel tempering • Increasingly useful to allow scale-out for large datasets • Many successful methods combine random models as an ensemble • Example: combining random projections or transformations can often beat optimized unsupervised models
  • 28. ONLINE LEARNING • Instead of ingesting a training dataset, adjust the data model after every incoming (instance, label) pair • Allows quick adaptation and “always-on” operation • Finds good models fast, but may miss the great one ⟹ suitable also as a burn-in for other models • Useful especially for the present trend towards analyzing data streams
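Adjusting a model after every incoming (instance, label) pair can be illustrated with the classic perceptron update. The perceptron is chosen here as an assumed example of an online learner, and the tiny stream below is made up.

```python
class OnlinePerceptron:
    """Linear classifier updated one (instance, label) pair at a time."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.mistakes = 0

    def predict(self, x):
        score = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if score >= 0 else -1

    def update(self, x, label):
        """Change the model only when the current prediction is wrong."""
        if self.predict(x) != label:
            self.mistakes += 1
            self.weights = [w + label * v for w, v in zip(self.weights, x)]
            self.bias += label


# Hypothetical stream of (instance, label) pairs with labels in {-1, +1}.
stream = [
    ((2.0, 1.0), 1),
    ((-1.0, -2.0), -1),
    ((1.5, 0.5), 1),
    ((-2.0, -0.5), -1),
] * 5

model = OnlinePerceptron(n_features=2)
for x, label in stream:
    model.update(x, label)  # no stored training set is needed

print(model.weights, model.bias, model.mistakes)
```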
  • 29. BAYESIAN BASICS • Bayesians see data as fixed and parameters as distributions • Parameters have prior assumptions that can encode expert knowledge • Data is used as evidence for possible parameter values • Final output is a set of posterior distributions for the parameters • Models may employ only the most probable parameter values or their full probability distribution • Variational Bayes approximates the posterior with a simpler distribution
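The prior, evidence and posterior steps can be followed in the simplest conjugate case: a Beta prior on a success rate updated with binomial counts. The prior parameters and the observed counts below are hypothetical, and the function names are illustrative.

```python
def beta_posterior(prior_alpha, prior_beta, successes, failures):
    """Beta prior plus binomial evidence gives a Beta posterior."""
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Most probable value (MAP estimate); defined for alpha, beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


# Hypothetical expert prior: the rate is probably around 20%.
prior = (2.0, 8.0)

# Hypothetical evidence: 30 successes and 70 failures.
alpha, beta = beta_posterior(*prior, successes=30, failures=70)

posterior_mean = beta_mean(alpha, beta)
posterior_map = beta_mode(alpha, beta)
print(alpha, beta, posterior_mean, posterior_map)
```

Using `posterior_map` corresponds to employing only the most probable parameter value, while keeping the pair `(alpha, beta)` keeps the full posterior distribution.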
  • 30. MODEL COMPLEXITY • Limiting model size and complexity can be used to avoid excessive variance (overfitting) • Minimum description length and Akaike/Bayesian information criteria are the Occam’s razor of data science • VC dimension of a model provides a theoretical limit for generalization error • Regularization can limit instance weights or parameter sizes • Bayesian models use hyper-parameters to limit parameter overfit
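For reference, the Akaike and Bayesian information criteria mentioned above are commonly written as follows. The symbols are assumptions introduced here: k is the number of free parameters, n the number of observations, and L̂ the maximized likelihood of the model; lower values are preferred.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both add a complexity penalty to the goodness of fit; the BIC penalty grows with the sample size, so it tends to favor smaller models on large datasets.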