- LECTURE 8
- What is Data Mining?
- Typical Kinds of Patterns
- Example: Clusters
- Example: Frequent Itemsets
- Applications (Among Many)
- Cultures
- Models vs. Analytic Processing
- (Way too Simple) Example
- Meaningfulness of Answers
- Examples
- Rhine Paradox --- (1)
- Rhine Paradox --- (2)
- Rhine Paradox --- (3)
- What is Web Mining?
- How does it differ from “classical” Data Mining?
- The World-Wide Web
- Size of the Web
- Netcraft survey
- The web as a graph
- Power-law degree distribution
- Power-laws galore
- Searching the Web
- Ads vs. search results
- Ads vs. search results
- Sidebar: What’s in a name?
- The Long Tail
- Web Mining topics
- Web search basics
- Search engine components
- Knowledge Discovery in
- Typical Tasks in Data Mining
- Typical Tasks in Data Mining
- Typical Tasks in Data Mining
- Typical Tasks in Data Mining
- Typical Tasks in Data Mining
- Typical Tasks in Data Mining
- Typical Tasks in Data Mining
- What is Data Mining?
- Data Mining Algorithms
- Data Mining Algorithms
- Data Mining Models
- Data Mining Models
- Data Mining Models
- Data Mining Models
- Data Mining Models
- Searching the Model Space
- Searching the Model Space
- THANK YOU
Data Mining Algorithms
Determine the preference criterion
Given two candidate models, which one is “better”?
Examples: goodness of fit, prediction accuracy, size/complexity, etc.
Search algorithm
Good models are found by searching the space of all possible models
How is this space organized and searched?
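One way to make a preference criterion concrete is to score each candidate by its fit error plus a charge for complexity, then keep the lowest score. The sketch below is illustrative only: the data points, the two candidate models, and the penalty weight are all made-up assumptions, not values from the lecture.

```python
# Hypothetical (x, y) observations, invented for this sketch.
data = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.0)]


def sse(model, points):
    """Goodness of fit: sum of squared errors (lower is better)."""
    return sum((y - model(x)) ** 2 for x, y in points)


def score(model, n_params, points, penalty=1.0):
    """Preference criterion: fit error plus a charge per parameter."""
    return sse(model, points) + penalty * n_params


# Two candidate models with their parameter counts (assumed for illustration).
candidates = {
    "constant": (lambda x: 4.0, 1),
    "linear": (lambda x: 1.0 + 2.0 * x, 2),
}

# "Search" here is just enumeration of a two-model space.
best = min(candidates, key=lambda name: score(*candidates[name], data))
print(best)
```

Raising `penalty` shifts the preference toward smaller models; with a large enough value the constant model would win despite its worse fit.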
Data Mining Models
Mathematical Functions
Mathematical combination of attribute values
E.g. linear model, non-linear model, support vectors, etc.
CPU performance prediction
PRP = -55.9 + 0.489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
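A linear model of this kind is a weighted sum of attribute values plus an intercept. The sketch below evaluates one using the coefficient magnitudes from the slide. The negative signs on the intercept and the CHMIN term are assumptions, and the sample machine's attribute values are hypothetical.

```python
# Coefficient magnitudes are taken from the slide; the signs of the
# intercept and of CHMIN are assumptions made for this sketch.
INTERCEPT = -55.9
WEIGHTS = {
    "MYCT": 0.489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}


def predict_prp(machine):
    """Linear model: intercept plus a weighted sum of attribute values."""
    return INTERCEPT + sum(WEIGHTS[name] * machine[name] for name in WEIGHTS)


# Hypothetical machine description, for illustration only.
sample = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
          "CACH": 256, "CHMIN": 16, "CHMAX": 128}
prp = predict_prp(sample)
print(round(prp, 2))
```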
Data Mining Models
Decision Trees
Study
  >= 10 hours: Do Homework
    Yes: Test Well
      Yes: A
      No: B
    No: C
  < 10 hours: Test Well
    Yes: C
    No: F
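A decision tree can be read as nested conditionals, one test per internal node and one return per leaf. The sketch below assumes the tree on this slide is rooted at Study, with Do Homework tested on the ">= 10 hours" branch and Test Well tested under it and on the "<10 hours" branch. The function and parameter names are hypothetical.

```python
def predict_grade(study_hours, does_homework, tests_well):
    """Walk the tree from the root (Study) down to a leaf grade."""
    if study_hours >= 10:
        if does_homework:
            # Leaves under Do Homework = Yes depend on Test Well.
            return "A" if tests_well else "B"
        return "C"
    # Fewer than 10 hours of study: only Test Well is consulted.
    return "C" if tests_well else "F"


print(predict_grade(12, True, True))
print(predict_grade(4, True, False))
```

Note that on the "<10 hours" branch the homework attribute is never consulted, which is typical of trees: each path tests only the attributes along it.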
Data Mining Models
Neural Networks
[Figure: example neural network with numeric weights on its connections]
Data Mining Models
Mixture Models
Data Mining Models
Bayesian Networks
Structure: Burglary -> Alarm <- Earthquake; Alarm -> John Calls; Alarm -> Mary Calls

P(B) = .001
P(E) = .002

B  E  P(A)
T  T  0.95
T  F  0.95
F  T  0.29
F  F  0.001

A  P(J)
T  0.90
F  0.05

A  P(M)
T  0.70
F  0.01
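A Bayesian network factors the joint distribution into one conditional table per node, so the probability of any full assignment is a product of table lookups. The sketch below uses the probabilities shown on this slide and assumes the usual structure implied by the table headings (Alarm depends on Burglary and Earthquake; John Calls and Mary Calls depend on Alarm). The helper names are hypothetical.

```python
from itertools import product

# Conditional probability tables, as read from the slide.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.95,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(John calls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(Mary calls | Alarm)


def prob(p_true, value):
    """Probability of a Boolean outcome given P(outcome = True)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of local factors."""
    return (prob(P_B, b) * prob(P_E, e) * prob(P_A[(b, e)], a)
            * prob(P_J[a], j) * prob(P_M[a], m))


# Alarm rings and both neighbours call, with no burglary or earthquake.
p = joint(b=False, e=False, a=True, j=True, m=True)

# Sanity check: the 32 full assignments should sum to 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(p, total)
```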
Searching the Model Space
Concept generalization is searching
Almost all search algorithms are heuristic
Optimal models are not guaranteed
Enumerating the space involves bias
Language bias – what the model can represent
Search bias – which models are ignored
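The point that heuristic search does not guarantee an optimal model can be seen with a toy hill climber. The sketch below treats the model space as a row of candidates, each with a quality score, and only ever moves to a better neighbour. The scores and the neighbourhood definition are made-up assumptions for illustration.

```python
def hill_climb(scores, start):
    """Greedy search: move to the best neighbour until none improves."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1)
                      if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[current]:
            return current  # local optimum: no neighbour is better
        current = best


# Hypothetical quality score for each candidate model (higher is better).
scores = [3, 5, 4, 2, 6, 9, 7]

found = hill_climb(scores, start=0)
optimal = max(range(len(scores)), key=lambda i: scores[i])
print(found, optimal)
```

Starting from index 0 the climber stops at a local optimum and never visits the best candidate. The models it skips are an instance of search bias; language bias would correspond to candidates that are not in `scores` at all.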
Searching the Model Space
Model 1
Study >= 10 hours: Do Homework
  Yes: Test Well
    Yes: A
    No: B
  No: C

Model 2
Study < 10 hours: Test Well
  Yes: C
  No: F

[Figure: a larger candidate tree rooted at Study (>= 10 hours / <10 hours), with internal nodes Homework, Test Well, and Good Project and leaf grades A, B, B, C, C, F]
THANK YOU