Tutorial on Event Detection
KDD 2009 
Weng-Keen Wong
School of EECS
Oregon State University
Daniel B. Neill
H. J. Heinz III College
Carnegie Mellon University   
wong@eecs.oregonstate.edu
  
neill@cs.cmu.edu 
Introduction
• Many real-world tasks in surveillance, 
scientific discovery and data cleaning 
involve monitoring routinely collected data
• Want to detect “events of interest” which are       
usually anomalous events that rarely occur
• These events typically affect a subgroup of      
the data rather than an individual data point
E l t f ll• xamp es o o ow…
2
Introduction
Early detection of disease outbreaks
• Bioterrorist attacks are a very real, and 
scary, possibility
100 kg anthrax released over D C  ,   . ., 
could kill 1-3 million and hospitalize 
millions more.
• Emerging infectious diseases
“Conservative estimate” of 2-7 
million deaths from pandemic avian     
influenza.
• Better response to common outbreaks 
3
(seasonal flu, GI)
Introduction
Benefits of early detection: 
Reduces cost to society, both in lives and in dollars!
Without 
treatment, 95% 
mortality rate
Post-symptomatic 
treatment, 40% 
mortality rate
Pre-symptomatic 
treatment, 1% 
mortality rate
incubation stage 1 stage 2
Day 0 Day 10Day 4
Exposure to 
inhalational
Acute respiratory 
distress high fever
Flu-like symptoms: 
headache cough fever 
anthrax
,  , 
shock, death
, , 
4
DARPA estimate: a two-day gain in detection 
time could reduce fatalities by a factor of six.
Introduction
Start of Definitive
Early detection is hard
  
symptoms
 
diagnosis
Lag time
incubation stage 1 stage 2
Day 0 Day 10Day 4
Buys OTC drugs?
Skips work/school?
5
Visits doctor/hospital/ED?
Introduction
Start of 
symptoms
Definitive 
diagnosis
incubation stage 1 stage 2
Day 0 Day 10Day 4
  
Buys OTC drugs? Cough 
medication 
sales in 
affected area
D ft
6
ays a er 
attack
Introduction
Start of 
symptoms
Definitive 
diagnosis
incubation stage 1 stage 2
We can achieve very early detection of outbreaks 
by gathering syndromic data and identifying
Day 0 Day 10Day 4
    ,   
emerging spatial clusters of symptoms.
Buys OTC drugs? Cough 
medication 
sales in 
affected area
D ft
7
ays a er 
attack
Introduction
Spike in sales of pediatric electrolytes near Columbus, Ohio
8
Introduction
Application to law enforcement: 
detecting crime hot-spots.
Crime hot-spot detection
Hot-spot = neighborhood or 
other spatial area with an 
unexpected rise in crime.
Goal: early detection to 
enable targeted enforcement.
Even better goal: predict where 
hot spots of crime are going to 
occur, and prevent them.
We demonstrated1 that hot-spots of violent crime can be predicted 
1-3 weeks in advance, by detecting clusters of “leading indicator” 
crimes such as disorderly conduct trespass and simple assault
9
    , ,   .
1D.B. Neill and W.L. Gorr, Proc. ISDS Annual Conf. 2007.
Introduction
Detecting clusters of pipe breaks
Application to civil engineering: 
Monitoring a city’s water distribution 
system to detect anomalous 
clusters of pipe breakage 1
Pittsburgh
   .
Different distance metric: flow 
distance along pipes, not 
Euclidean distance. 
Must account for pipe age, 
dimensions and material when,    
computing expected number of breaks.
10
1D. Olivera, et al., in preparation.  Thanks to 
Daniel Olivera for providing this picture.
Introduction
Goal is to detect patterns of suspicious 
hi t di t ill l
Detecting illicit container shipments
s pmen s correspon ng o ega  
activity (terrorism, smuggling, etc.)
We can achieve this goal by detecting 
anomalous, self-similar groups of records.
No “spatial” dimension in the standard 
sense, but we can define a dissimilarity       
metric between shipments and detect 
anomalous patterns in metric space.
FPORT USPORT COUNTRY SLINE VESSEL SHIPPER NAME F NAME COMMODITY SIZE MTONS VALUE
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRETRI_NET EMPTY_RACK 0 5.6 27579
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OUSED_TIRE 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OUSED_TIRE 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRETRI_NET CRUDE_IODINE_PURITY 1 17.68 251151
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
11
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OUSED_TIRES 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEMPTY_CONTAINERS 0 0 0
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEMPTY_CONTAINERS 0 0 0
Introduction
Environmental Monitoring
Remote Sensors are becoming the 
new standard for collecting field data
Nearly continuous observation of a 
given domain, generating large 
l f d tvo umes o  a a
Data must be cleaned before being 
given to outside researchers. 
R i l f l d tequ res remova  o  anoma ous a a 
points.
Anomalies can be simple or very 
challenging!
From: Dereszynski, E., Dietterich, T. (2007). Probabilistic 
Models for Anomaly Detection in Remote Sensor Data 
Streams. Proceedings of the 23rd Conference on Uncertainty 
in Artificial Intelligence (UAI-2007). 75-82. Thanks to Ethan 
Dereszynski for the slide materials.
Introduction
Central Met. Week 40, 1999
Simple Anomaly Types
2
5
3
0 1.5m
2.5m
3.5m
4.5m-6999.99 Deg. 
C. 
1
5
2
0
m
p
.
 
(
C
)
1
0
1
T
e
m
26900 27000 27100 27200 27300 27400 27500
0
5
13
15 Minute Interval
Introduction
More difficult anomaly types
Central Met Week 6 1996
2
5
3
0
 .  , 
1.5m
2.5m
3.5m
4.5m
2
0
C
)
1
0
1
5
T
e
m
p
.
 
(
Broken Sun shield
0
5
14
4100 4200 4300 4400 4500 4600 4700
0
15 Minute Interval
Introduction
2
0
Upper Lookout Met. Weeks 3-7, 1996
1.5m
2.5m
3 5m2 5m normal
1
0
1
5
.
4.5m
2.5m suppressed
.  
5
1
m
p
.
 
(
C
)
-
5
0
T
e
m
-
1
0
1.5m Buried 1.5m Exposed 
15
2000 2500 3000 3500 4000 4500 5000
-
1
5
15 Minute Interval
Introduction
Suppose you have data D = {x1, ..., xn} where 
xi arrives over time
• Can we detect a time t when an event of 
interest occurs? 
• The question we are asking is: at what point 
in time is the data is “different”?      
Introduction
Simple example: xi is a scalar eg. a count
10
12
6
8
C
o
u
n
t
0
2
4
0 5 10 15 20 25
Time
Introduction
Harder example: xi is a vector of categorical values
Gender Age Zip
M 60s 97332
Gender Age Zip
F 50 97330
Gender Age Zip
M 20s 97331
Gender Age Zip
F 20 97330
Gender Age Zip
F 20s 97331
Gender Age Zip
M 40s 97331
s
Gender Age Zip
M 20s 97331
s
Time06/01/09 06/01/09
09:00 09:30
Introduction
Even harder example: xi is spatial data
1. How do we determine if the 
data is “different”?
2. How can we figure out what 
makes the data different?
Time06/01/09 06/01/09
09:00 09:30
Introduction
Goals of event detection: 
• Identify if an event of interest has occurred
• Characterize the event
• Pinpoint the affected subgroup of the data ie. 
what features describe the event (eg. spatial 
area, time duration)?
/ f ?• What is the severity magnitude o  the event
• Detect as accurately as possible
• Detect as early as possible
Introduction
How is event detection different from:
1. Supervised Learning: 
• Abnormal events are extremely rare, normal events are 
plentiful
2. Clustering:
• Clustering = partitioning data into groups
N t th fi di t ti ti ll l• o  e same as n ng s a s ca y anoma ous groups
3. Outlier Detection:
Events of interest are usually not individual outliers•        
• The event typically affects a subgroup of the data rather 
than a single data point
Introduction
How do we evaluate event detection algorithms?
• Can’t use prediction accuracy for “event” vs 
“non-event”
¾ Class imbalance: many more “non-events” than 
“ t ”even s
¾ Guessing “non-event” all the time results in very good 
accuracy
• Most event detection algorithms have a tunable 
threshold for when an alarm is raised
¾ Trades off accuracy and false alarm rate
¾ Need performance over multiple thresholds
Introduction
How do we evaluate event detection algorithms?
Better
o
s
i
t
i
v
e
 
R
a
t
e
c
t
i
o
n
 
T
i
m
e
Worse
Worse
T
r
u
e
 
P
o
D
e
t
e
c
Better
False Positive Rate False Positive Rate
To evaluate accuracy, use a 
Receiver Operating 
Characteristic (ROC) curve
To evaluate timeliness of 
detection, use an Activity 
Monitoring Operating 
Characteristic (AMOC) curve 
(Fawcett and Provost 1999)
Introduction
Challenges:
• Incorporating spatial and/or temporal 
information
• Integrating information from multiple 
features or data streams 
• Distinguishing between multiple event types    
• Computational complexity
Outline
1. Introduction
2 Temporal Event Detection.   
3. Spatio-Temporal Event Detection
4 F t W k. u ure or
Univariate Temporal Methods  
Univariate Methods
Examples of univariate time series
3
0
3
5
Central Met. Week 35, 2006
1.5m
1.5m
4.5m
4.5m
0
1
5
2
0
2
5
T
e
m
p
.
 
(
C
)
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and 
Fienberg S E (2002) Early statistical detection of
23500 23600 23700 23800 23900 24000 24100 24200
5
1
0
15 Minute Interval
From: Dereszynski, E., Dietterich, T. (2007). Probabilistic 
M d l f A l D t ti i R t S D t, . . .      
anthrax outbreaks by tracking over-the-counter 
medication sales. Proceedings of the National Academy of 
Sciences (pp. 5237-5249) 
o e s or noma y e ec on n emo e ensor a a 
Streams. Proceedings of the 23rd Conference on Uncertainty 
in Artificial Intelligence (UAI-2007). 75-82.
Univariate Methods
This is a time series of counts of primary-
physician visits in data from Norfolk in 
D b 2001 I dd d f k tb kecem er .  a e  a a e ou rea , 
starting at a certain date. Can you guess 
when?
Univariate Methods
This is a time series of counts of primary-
physician visits in data from Norfolk in 
D b 2001 I dd d f k tb k
Here (much too 
high for a Friday)
ecem er .  a e  a a e ou rea , 
starting at a certain date. Can you guess 
when?
(Ramp attack)
Univariate Methods
20
25
10
15
S
i
g
n
a
l
0
5
0 2 4 6 8 10 12 14 16 18 20
Time
• Easy case: when does an “event” happen?
• How can we detect this with an algorithm?       
Univariate Methods
20
25
10
15
S
i
g
n
a
l
0
5
0 2 4 6 8 10 12 14 16 18 20
Time
General framework:
1 Learn model to predict expected signal value.      
2. Measure difference between actual and expected
3. Compute alarm value
Univariate Methods
Methods we will discuss
• Control Chart (Shewhart 1931)
• Moving Average
• Exponentially Weighted Moving Average (Roberts 
1959)
• CUSUM (Page 1954)
• Regression
For a reference on Statistical Quality Control techniques such as control charts, 
EWMA and CUSUM, see (Montgomery 2001)
Univariate Methods (Control Chart)
20
25
5
10
15
S
i
g
n
a
l
Upper Control Limit
0
0 2 4 6 8 10 12 14 16 18 20
Time
Control chart (from Statistical Quality Control)
• Estimate and from data up to current timeμ̂ σ̂             
• Upper control limit =
∑
=
=
N
i
iXN 1
1μ̂ ∑
=
−−=
N
i
iXN 1
2)ˆ(
1
1
ˆ μσ
ˆ3ˆ +   
• Raise alarm if upper control limit exceeded 
σμ
Univariate Methods (Control Chart)
20
25
5
10
15
S
i
g
n
a
l
0
0 2 4 6 8 10 12 14 16 18 20
Time
Control chart (from Statistical Quality Control)
• Alternately, use 
• And signal alarm when alarm level > threshold
⎟⎠
⎞⎜⎝
⎛ −Φ= σ
μ
ˆ
)ˆ,0max(
  level Alarm i
X N(0,1)for  CDF   where =Φ
       
Univariate Methods (Control Chart)
Alarm LevelControl chart 
applied to Norfolk   
data
Univariate Methods (Control Chart)
Alarm LevelControl chart 
applied to Norfolk   
data (long term)
Univariate Methods (Control Chart)
Alarm LevelControl chart 
applied to Norfolk   
data (long term)
Univariate Methods (Moving Average)
• Let W be the window size
• A moving average window predicts the following:
)(
1 +++= XXXX ... 111 −−−+ Wtttt W
Setting the alarm value:
• Fit a Gaussian to the W observations within the window ie. estimate      and
C l l t th l l l b f
μ̂ σ̂
• a cu a e e a arm eve  as e ore
⎟⎠
⎞⎜⎝
⎛ −Φ= σ
μ
ˆ
)ˆ,0max(
  level Alarm i
X
Univariate Methods (Moving Average)
Moving Average 
applied to Norfolk   
data
Univariate Methods (Moving Average)
Moving Average 
applied to Norfolk   
data (long term)
Univariate Methods (Moving Average)
Moving Average 
applied to Norfolk   
data (long term)
Univariate Methods (EWMA)
• Exponentially Weighted Moving Average 
(EWMA) - a variation on the moving average
• Let Zi be the EWMA statistic (which is 
monitored):
)1( ZXZ λλ 10h ≤< λ1−−+= iii  w ere
Observations in the past receive a 
decreasing amount of weight   
Univariate Methods (CUSUM)
• CUmulative SUM Statistics
• Good at detecting shifts from the mean more        
quickly than control chart
• Keep a running sum of “surprises”: a sum of         
excesses each day over the mean
Wh thi d th h ld H i l• en s sum excee s res o  , s gna  
alarm and reset sum
Univariate Methods (CUSUM)
• r = reference value eg. mean
X ith b ti• i =  o serva on
• Si = ith cumulative sum
+−=−+−=
−=
SrXrXrXS
rXS 11
)()()(
k
12122
M
∑
=
−+−=
i
kkk SrXS
1
1)(
When a shift from the 
mean occurs S will , i  
start to increase
Univariate Methods (CUSUM)
• If we are only tracking increases, we can do the 
f ll io ow ng:
))(,0max( 1−+−= kkk SrXS
Ensures we don’t go below 0
• We can also add a tolerance or a slack K
))(,0max( 1−++−= kkk SKrXS
Univariate Methods (CUSUM)
CUSUM applied 
to Norfolk data   
Univariate Methods (CUSUM)
CUSUM applied 
to Norfolk data   
(long term)
Univariate Methods (CUSUM)
CUSUM applied 
to Norfolk data   
(long term)
Univariate Methods
• Data often consists of trends eg. 
• Seasonal effect
• Day-of-week effects
• Holiday effect 
• None of the methods discussed so far 
explicitly model these trends   
• Regression (eg. linear regression) can be 
d ith t t f th t duse  w  ex ra erms or e ren s
Univariate Methods (Regression)
Regression example to model seasonal effects and 
Monday effects:
iiii IsMondaylightHoursOfDayY εβββ +++= )()( 210
 
Boolean feature – adds 
a “bump” to the value 
of Y if it is a Monday⎟⎞⎜⎛ −31)July  since days num(2sin ππ
Could be defined as: Normally distributed 
noise with mean 0, 
known variance σ2      ⎠⎝ 2365.25
Regression learns the β parameters from data to minimize the residual sum of 
  
squares 
Univariate Methods (Regression)
Regression applied to 
Norfolk data using 
H OfD li ht dours ay g an  
IsMonday terms
Univariate Methods (Regression)
Regression applied to 
Norfolk data using 
H OfD li ht dours ay g an  
IsMonday terms (long 
term)
Univariate Methods (Other)
Other state-of-the-art methods not discussed 
in this tutorial
• Box-Jenkins models eg. ARMA, ARIMA
• Wavelets
• Change-point detection
• Kalman filters
• Hidden Markov Models
Multivariate Temporal 
Methods
Multivariate Methods
Each data point (recorded at some time point) is now a multivariate vector
Date Time Gender Age Prodrome Home 
Location
Work 
Location
Many 
more…
eg. patient records from an Emergency Department 
6/1/09 9:12 M 20s Fever NE NE …
6/1/09 10:45 F 40s Diarrhea NE NE …
6/1/09 11:03 F 60s Respiratory NE N …
6/1/09 11:07 M 60s Diarrhea E W …
: : : : : : : :
Multivariate Methods
How are patient records from 6/1/09 different
Date Time Gender Age Prodrome Home 
Location
Work 
Location
Many 
more…
6/1/09 9:12 M 20s Fever NE NE …
6/1/09 10:45 F 40s Diarrhea NE NE …
6/1/09 11:03 F 60s Respiratory NE N …
: : : : : : : :
from patient records from 6/2/09?
Date Time Gender Age Prodrome Home 
Location
Work 
Location
Many 
more…
6/2/09 9:15 M 60s Respiratory E NE …
6/2/09 10:01 F 50s Respiratory N NW …
6/2/09 13:05 F 40s Respiratory SW SW …
: : : : : : : :
Multivariate Methods
Note: need to split data into two groups according to time:
1 T i i d t l d l. ra n ng: use  o earn mo e
Date Time Gender Age Prodrome Home 
Location
Work 
Location
Many 
more…
6/1/09 9:12 M 20s Fever NE NE …
6/1/09 10:45 F 40s Diarrhea NE NE …
6/1/09 11:03 F 60s Respiratory NE N …
: : : : : : : :
2. Testing: used to identify events with respect to the learned model
D t Ti G d A P d H W k Ma e me en er ge ro rome ome 
Location
or  
Location
any 
more…
6/2/09 9:15 M 60s Respiratory E NE …
6/2/09 10:01 F 50s Respiratory N NW …
6/2/09 13:05 F 40s Respiratory SW SW …
: : : : : : : :
Multivariate Methods
We make the following distinction
• Multivariate Changepoint Detection:  
• Detects that a change has happened
• Does not identify the subgroup of data that has         
changed the most
• Multivariate Event Detection  
• Detects that a change has happened
Id tifi th b th t h h d th• en es e su group a  as c ange  e 
most
Multivariate Methods
Outline
• Multivariate Changepoint Detection  
• Multivariate Statistical Quality Control
• Others
• Multivariate Event Detection
E i P tt• merg ng a erns
• STUCCO
• WSARE 2.0
• WSARE 3.0
Multivariate Changepoint Detection 
(Hotelling’s T2)
Multivariate version of control chart is Hotelling’s T2
ˆ
statistic (Hotelling 1931)
)ˆ()ˆ( 12
μXSμX ii −−= −cT
Sample size that covariance Estimated    
matrix was estimated from
 
covariance matrix
Estimated 
mean
Other multivariate statistical quality control methods:     
• Multivariate CUSUM (Crosier 1988)
• Multivariate EWMA (Lowry et al. 1992)
All make strong assumptions about the underlying model
Multivariate Changepoint Detection
Other methods:
• Cross-match test (Rosenbaum 2005)   
• kdq-tree (Dasu et al. 2006)
D it T t (S t l 2007)• ens y es  ong e  a . 
Multivariate Event Detection
General framework:
1.Learn model to predict expected signal value for 
the given subgroup
2. Measure difference between actual and expected
3.Compute alarm value (now more involved)
Multivariate Event Detection
Algorithm Data Model Measuring Differences
Emerging 
Patterns 
(Dong and Li 
1999)
Categorical Counts Increase in support ratio
STUCCO 
(Bay and 
Pazzani 1999)
Categorical Counts Chi-square, Bonferroni
 
WSARE 2.0 
(Wong et al. 
2005)
Categorical Counts Fisher’s Exact test, 
Randomization test
WSARE 3.0 
(Wong et al. 
2005)
Categorical Bayesian 
network
Fisher’s Exact test, 
Randomization Test
Multivariate Event Detection (Categorical Data)
How can we find differences in multivariate 
    
categorical data?
Date Time Gender Age Prodrome Home 
L ti
Work 
L ti
Many 
oca on oca on more…
6/2/09 9:15 M 60s Respiratory E NE …
6/2/09 10:01 F 50s Respiratory N NW …
6/2/09 13:05 F 40s Respiratory SW SW …
: : : : : : : :
Idea from association rule mining: 
Characterize differences by rules ie. 
conjunctions of attribute value pairs  -   
Multivariate Event Detection (Categorical Data)    
Training Date Time Gender Age Prodrome Home 
Location
Work 
Location
Many more…
6/1/09 9:12 M 20s Fever NE NE …
6/1/09 10:45 F 40s Diarrhea NE NE …
6/1/09 11:03 F 60s Respiratory NE N …
: : : : : : : :
Testing Date Time Gender Age Prodrome Home Location
Work 
Location
Many more…
6/2/09 9:15 M 60s Respiratory E NE …
6/2/09 10:01 F 50s Respiratory N NW …
Find which rules predict unusually high proportions in test data when compared
6/2/09 13:05 F 40s Respiratory SW SW …
: : : : : : : :
            
to the training data eg.
92/180 records from Testing have Gender = Male AND Age = 60s
43/200 records from Training have Gender = Male AND Age = 60s
Multivariate Event Detection 
(Emerging Patterns)
• Let D = {X1, …, XN} be a data set with N
 
data points
• Define the support of a rule R to be:
||
)(
supp
D
Rcount
(R) DD =
where countD(R) = number of data points 
th t t h l Ra  ma c  ru e 
Multivariate Event Detection 
(Emerging Patterns) 
Suppose we are given data sets D1 and D2       . 
Define the GrowthRate(R) from D1 to D2 as:
⎧
⎪⎪
⎪
⎨ ≠=∞
==
= 0(R)suppand 0(R)suppif,
0(R)supp and 0(R)supp if ,0
)( D2D1
D2D1
RGrowthRate
⎪⎪
⎪
⎩
otherwise ,
)(supp
)(supp
1
2
R
R
D
D
Multivariate Event Detection 
(Emerging Patterns)
• Given ρ > 1, a rule R is said to be a ρ-emerging 
 
pattern from D1 to D2 if
GrowthRate(R) ≥ ρ
• Goal: For a given ρ, find all ρ-emerging patterns
See (Dong and Li 1999) for efficient algorithms to find 
emerging patterns using borders to describe large 
collections of itemsets 
Search and Testing for 
U d d bl C in erstan a e ons stent 
Contrasts (STUCCO) 
Multivariate Methods (STUCCO)
• Define a contrast set as a conjunction of attribute-
value pairs (ie. what we defined as a rule)
• Search for contrast sets CS such that:
1. )|()|( TestingCSPTrainingCSP ≠
Multivariate Methods (STUCCO)
• Define a contrast set as a conjunction of attribute-
value pairs 
• Search for contrast sets CS such that:
1. )|()|( TestingCSPTrainingCSP ≠
This says that the distribution of the contrast set CS is different 
“ ff fin the Training and Testing data. Di erent” will be de ined 
shortly.
Multivariate Methods (STUCCO)
• Define a contrast set as a conjunction of attribute-
value pairs
• Search for contrast sets CS such that:
1. )|()|( TestingCSPTrainingCSP ≠
2. δ≥− |),(),(| TestingCSSupportTrainingCSSupport
Multivariate Methods (STUCCO)
• Define a contrast set as a conjunction of attribute-
value pairs
• Search for contrast sets CS such that:
1. )|()|( TestingCSPTrainingCSP ≠
2. δ≥− |),(),(| TestingCSSupportTrainingCSSupport
The support of a contrast set is the percentage of data points 
(in the Training / Testing data) where the contrast set is true
Multivariate Methods (STUCCO)
Search for contrast sets involves efficient breadth-first 
search of a set enumeration tree (Rymon 1992)
{ }
Gender=M Gender=F Age=Young Age=Old . . .
Gender=M 
AND 
Age=Young
Gender=M 
AND 
Age=OLD
Gender=F 
AND 
Age=Young
Gender=F 
AND 
Age=Old
. . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
A variety of 
pruning methods 
makes this search 
efficient
Multivariate Methods (STUCCO)
• How do we determine that the distributions of 
contrast sets are different between training and 
testing?
• For each contrast set, construct a 2x2 contingency 
t bla e
Count (Training) Count (Testing)
Age = Young 25 52
Age ≠ Young 101 206
Multivariate Methods (STUCCO)
• Perform Chi-Square test of independence
• Compute χ2 statistic:
• Where Oij = observed frequency count for the cell in row i and
∑∑
= =
−=
2
1
2
1
2
2 )(
i j ij
ijij
E
EOχ
           
column j
• Eij is the expected frequency count for the cell in row i and 
column j given independence of the row and column variables ie         .
NOOE
i j
ijijij /
2
1
2
1
⎟⎟⎠
⎞
⎜⎜⎝
⎛= ∑ ∑
= =
Where N = total number of 
observations in all cells
• To obtain a p-value, compare χ2 statistic to a chi-square 
distribution
Multivariate Methods (STUCCO)
• If we perform a Chi-Square test on the 2x2 
contingency table below, we get a p-value of 
0.0074
• This is significant at the α = 0.05 level
• Report contrast sets with p-value < 0.05
C t (T i i ) C t (T ti )oun  ra n ng oun  es ng
Age = Young 25 100
Age ≠ Young 101 206  
Multivariate Methods (STUCCO)
• But…we can’t interpret this p-value at face value
• The search suffers from a multiple hypothesis 
testing problem
• Need to correct the p-values to compensate for 
multiple hypothesis testing
Multivariate Methods (STUCCO) 
Multiple Hypothesis Testing
• Suppose we reject null hypothesis when p-value < 
α, where α = 0.05
• For a single hypothesis test, the probability of a 
false positive = α
• Suppose we do 1000 tests, one for each possible 
rule
• Probability(false positive) could be as bad as: 
1 – ( 1 – 0.05)1000 >> 0.05 
Multivariate Methods (STUCCO) 
Bonferroni Correction:
• If you are performing n hypothesis tests for        
hypotheses h1, …, hn, 
• Adjust significance level for test i to be      
i
αα =
• Reject hypothesis hi if
n
ii α≤pvalue
Multivariate Methods (STUCCO) 
Two problems:
1. We do not know n as we incrementally mine each level of 
th te ree
2. Same cutoff for contrast sets with different numbers of 
attribute-value pairs (want more power on smaller       
conjunctions)
STUCCO’s solution
α
)|,|/
2
min( 1−= llll C αα
Significance threshold at 
level l of the tree 
Number of candidates at level l
Multivariate Methods (STUCCO)
Traverse set-enumeration tree using Breadth-First Search
Summary
     
For each contrast set at depth l of tree:
1. Form 2x2 contingency table
2. Compute p-value using chi-square test
3. Account for multiple hypothesis testing by computing 
significance level αl
4. If p-value < αl , report contrast set
Multivariate Methods (WSARE 2.0)
For each rule in rule set:
Summary
     
1. Form 2x2 contingency table
2. Compute rule score using Fisher’s Exact test / Chi-
square test
3. Account for multiple hypothesis testing by 
randomization test 
4. If p-value from randomization test < alarm value, report rule
Multivariate Methods (WSARE 2.0)
For each rule in rule set:
Difference #1
Diff #2
Summary
     
1. Form 2x2 contingency table
2. Compute rule score using Fisher’s Exact test / Chi-
erence 
square test
3. Account for multiple hypothesis testing by 
randomization test 
4. If p-value from randomization test < alarm value, report rule
Difference #3
Multivariate Methods (WSARE 2.0)
Difference #1 (from STUCCO)
• Rule set defined as all rules with a maximum         
of k conjunctions of attribute-value pairs
• No need to search set enumeration tree    -  
Multivariate Methods (WSARE 2.0)
Difference #2 (from STUCCO)
• Use of Fisher’s Exact Test for rules with        
small counts that violate assumptions of Chi-
square test 
Multivariate Methods (WSARE 2.0)
Difference #3 (from STUCCO)
• Bonferroni correction is very conservative    
• Increases risk of Type II errors (not rejecting 
null hypothesis when it is false)     
• Very hesitant to declare something as an 
“ t”even
• Randomization test is a better alternative 
with more statistical power
Multivariate Methods (WSARE 2.0)
Aug 16, 2008 C2
Aug 17, 2008 C3
Aug 17, 2008 C4
Aug 16, 2008 C2
Aug 17, 2008 C3
Aug 24, 2008 C4  
Aug 17, 2008 C5
Aug 17, 2008 C6
Aug 17, 2008 C7
  
Aug 17, 2008 C5
Aug 24, 2008 C6
Aug 17, 2008 C7
Aug 21, 2008 C8
Aug 21, 2008 C9
Aug 22, 2008 C10
Aug 21, 2008 C8
Aug 21, 2008 C9
Aug 22, 2008 C10
Aug 23, 2008 C11
Aug 23, 2008 C12
Aug 24, 2008 C13
A 24 2008 C14
Aug 23, 2008 C11
Aug 23, 2008 C12
Aug 17, 2008 C13
A 17 2008 C14
Randomization Test
• Take the training data points and the testing data points Shuffle the
ug , ug , 
         .    
date field to produce a randomized dataset called DBRand
• Find the rule with the best score on DBRand.
Multivariate Methods (WSARE 2.0)
Repeat the procedure on the 
previous slide for 1000    
iterations.  Determine how 
many scores from the 1000 
iterations are better than the 
original score.
If the original score were here, it would 
place in the top 1% of the 1000 scores from 
the randomization test We would be  .    
impressed and an alert should be raised.  
Corrected p-value of the rule is:
# better scores / # iterations
Multivariate Methods (WSARE 3.0)
For each rule in rule set:
Summary
     
1. Learn a Bayesian network from training data. 
Sample a baseline data set using Bayesian 
network.
2. Form 2x2 contingency table using counts from 
Baseline data and Testing   .
3. Compute rule score using Fisher’s Exact test / Chi-
square test
4. Account for multiple hypothesis testing by 
randomization test
5 If l f d i i l l.  p-va ue rom ran om zat on test < a arm va ue, report 
rule
Multivariate Methods (WSARE 3.0)
For each rule in rule set:
Summary Difference #1
     
1. Learn a Bayesian network from training data. 
Sample a baseline data set using Bayesian 
network.
2. Form 2x2 contingency table using counts from 
Baseline data and Testing   .
3. Compute rule score using Fisher’s Exact test / Chi-
square test
4. Account for multiple hypothesis testing by 
randomization test
5 If l f d i i l l.  p-va ue rom ran om zat on test < a arm va ue, report 
rule
Multivariate Methods (WSARE 3.0)
Difference #1 (from WSARE 2.0)
• Need to account for trends in the data
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002).  Early statistical 
detection of anthrax outbreaks by tracking over-the-counter medication sales.  Proceedings of 
the National Academy of Sciences (pp. 5237-5249)
Multivariate Methods (WSARE 3.0)
• Temporal trends in health care data:
• Seasonal effects in temperature and weather
• Day of Week effects
• Holidays
• Etc.
• Not accounting for these trends can      
adversely affect the detection time and false 
positives rate 
Multivariate Methods (WSARE 3.0)
• “Taking into account that today is a public holiday ”
Generating the Baseline:
        …
• “Taking into account that this is Spring…”
• “Taking into account recent heatwave…”
• “Taking into account recent flu levels…”
• “Taking into account that there’s a known natural Food-
borne outbreak in progress ”   …
Multivariate Methods (WSARE 3.0)
Generating the Baseline:
• “Taking into account that today is a public holiday ”        …
• “Taking into account that this is Spring…”
• “Taking into account recent heatwave…”
• “Taking into account recent flu levels…”
• “Taking into account that there’s a known natural Food-
borne outbreak in progress ”   …
Use a Bayesian network (Pearl 1988) to       
model the joint probability distribution 
of the attributes.
Multivariate Methods (WSARE 3.0)
All Historical
D t
Obtaining Baseline Data
a a
Today’s 
1. Learn Bayesian Network 
using Optimal 
Reinsertion [Moore and
Environment
   
Wong 2003]
2. Generate baseline given 
today’s environment
Baseline
 
Multivariate Methods (WSARE 3.0)
All Historical
D t
Obtaining Baseline Data
a a
Today’s 
1. Learn Bayesian Network 
using Optimal 
Reinsertion [Moore and
Environment
   
Wong 2003]
2. Generate baseline given 
today’s environment
Baseline
 
Multivariate Methods (WSARE 3.0)
Divide the data into two types of attributes:
• Environmental attributes: attributes that   
cause trends in the data eg. day of week, 
season weather flu levels, ,  
• Response attributes: all other non-
environmental attributes eg age gender  . , 
Multivariate Methods (WSARE 3.0)
When learning the Bayesian network structure, do not allow 
environmental attributes to have parents.
Wh ?y  
• We are not interested in predicting their distributions
• Instead we use them to predict the distributions of the,          
response attributes
Season Day of Week Weather Flu Level
Multivariate Methods (WSARE 3.0)
All Historical
D t
Obtaining Baseline Data
a a
Today’s 
1. Learn Bayesian Network 
using Optimal 
Reinsertion [Moore and
Environment
   
Wong 2003]
2. Generate baseline given 
today’s environment
Baseline
 
Multivariate Methods (WSARE 3.0)
Season Day of Week Weather Flu Level
Today Winter Monday Snow High
Suppose we know the 
following for today:
Season = 
Winter
Day of Week = 
Monday
Weather = 
Snow
Flu Level = 
HighWe fill in these 
values for the 
environmental 
attributes in the 
learned Bayesian  
network
We sample 10000 records 
from the Bayesian network 
d k thi d t t th
Baseline
an  ma e s a a se  e 
baseline
Multivariate Methods (WSARE 3.0)
Season Day of Week Weather Flu Level
Today Winter Monday Snow High
Suppose we know the 
following for today:
Season = 
Winter
Day of Week = 
Monday
Weather = 
Snow
Flu Level = 
HighWe fill in these 
values for the 
Sampling is easy 
because
environmental 
attributes in the 
learned Bayesian
 
environmental 
attributes are at the 
  
network
top of the Bayes Net
We sample 10000 records 
from the Bayesian network 
d k thi d t t th
Baseline
an  ma e s a a se  e 
baseline
Multivariate Methods (WSARE 3.0)
Season Day of Week Weather Flu Level
Today Winter Monday Snow High
Suppose we know the 
following for today:
Season = 
Winter
Day of Week = 
Monday
Weather = 
Snow
Flu Level = 
HighWe fill in these 
values for the 
environmental 
attributes in the 
learned Bayesian
An alternate 
possible technique 
  
network
is to use inference
We sample 10000 records 
from the Bayesian network 
d k thi d t t th
Baseline
an  ma e s a a se  e 
baseline
Multivariate Methods (WSARE 3.0)
For each rule in rule set:
Summary
     
1. Learn a Bayesian network from training data. 
Sample a baseline data set using Bayesian 
network.
2. Form 2x2 contingency table using counts from 
Baseline data and Testing   .
3. Compute rule score using Fisher’s Exact test / Chi-
square test
4. Account for multiple hypothesis testing by 
randomization test
5 If l f d i i l l.  p-va ue rom ran om zat on test < a arm va ue, report 
rule
Multivariate Methods (WSARE 3.0)
Side Note:
• Conditional Anomaly Detection (Song et al. 2007) 
is a similar approach
• Uses a Gaussian Mixture Model instead of a 
Bayesian Network
• Applicable to continuous multivariate data
But:
It discovers individual data points that are anomalies, 
not anomalous groups of data points
Multivariate Methods
Open Questions
• What about continuous features?   
• What about mixed discrete and continuous 
features?
• Can we develop faster methods, especially 
th th t id d i ti t ti ?ose a  can avo  ran om za on es ng
• Can we discover interesting (not just 
statistically significant) events?
Acknowledgements
• We would like to thank the following 
individuals/groups for their slide materials:
• AUTON Lab (Carnegie Mellon University) 
• RODS Lab (University of Pittsburgh)
• Ethan Dereszynski
• Univariate temporal methods section based in part 
on an earlier tutorial on detection algorithms for 
biosurveillance by Andrew Moore
References
• [Bay and Pazzani 1999] Bay, S. D., and Pazzani, M. J.,  Detecting change in categorical data: Mining contrast 
sets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data 
Mining, pages 302–306, New York, NY, 1999. ACM.
• [Crosier 1988] Crosier, R. B. (1988). Multivariate generalizations of cumulative sum quality-control schemes. 
Technometrics 30 291 303, , - .
• [Dasu et al. 2006] Dasu, T., Krishnan, S., Venkatasubramanian, S, and Yi, K. (2006). An information-theoretic 
approach to detecting changes in multi-dimensional data streams. In Proceedings of the 38th Symposium on the 
Interface of Statistics, Computing Science, and Applications (Interface 06).
• [Dereszynski and Dietterich] Dereszynski, E., and Dietterich, T. (2007). Probabilistic Models for Anomaly             
Detection in Remote Sensor Data Streams. Proceedings of the 23rd Conference on Uncertainty in Artificial 
Intelligence (UAI-2007). 75-82. 
• [Dong and Li 1999] Dong, G. and Li, J. Efficient mining of emerging patterns: discovering trends and 
differences. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and 
Data Mining, pages 43–52, New York, NY, 1999. ACM.
• [Fawcett and Provost 1999] Fawcett, T., and Provost, F. (1999). Activity monitoring: Noticing interesting 
changes in behavior. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge 
Discovery and Data Mining, (pp. 53-62).
[G ld b t l 2002] G ld b A Sh li G C R A d Fi b S E (2002) E l• o en erg e  a .  o en erg, ., mue , ., aruana, . ., an  en erg, . . .  ar y 
statistical detection of anthrax outbreaks by tracking over-the-counter medication sales.  Proceedings of the 
National Academy of Sciences (pp. 5237-5249)
• [Hotelling 1931] Hotelling, H. (1931). The generalization of Student's ratio, Ann. Math. Statist., Vol. 2, pp 360–
378.
108
• [Lowry et al. 1992] Lowry, C. A., Woodall, W. H., Champ, C. W., and Rigdon, S. E. (1992). A Multivariate 
Exponentially Weighted Moving Average Chart, Technometrics, 34, 46-53.
References
• [Montgomery 2001] Montgomery, D. C. Introduction to Statistical Quality Control. John Wiley and Sons, Inc., 
2001.
• [Moore and Wong 2003] Moore, A., and Wong, W.-K. (2003). Optimal Reinsertion: A new search operator for 
accelerated and more accurate Bayesian network structure learning. Proceedings of the Twentieth International 
Conference on Machine Learning (ICML 2003) (pp 552 559) Menlo Park CA: AAAI Press      . - .  ,   .
• [Page  1954] Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41, 100-115.
• [Pearl 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 
San Francisco, CA: Morgan Kaufmann Publishers, Inc.
• [Roberts 1959] Roberts S W (1959) Control chart tests based on geometric moving averages Technometrics  , . . .        . , 
1, 239-250.
• [Rosenbaum 2005] Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate 
distributions based on adjacency. Journal of the Royal Statistical Society Series B, 67(4): 515-530.
• [Shewhart 1931] Shewhart, W. A. (1931). Economic control of quality of manufactured product. New York: D.                
Van Nostrand Company.
• [Song et al. 2007] Song, X., Wu, M., and Jermaine, C. (2007). Conditional Anomaly Detection. IEEE 
Transactions on Knowledge and Data Engineering, 19(5), 631-645.
• [Song et al. 2007b] Song, X., Wu, M., Jermaine, C., and Ranka, S. (2007). Statistical Change Detection for 
Multi-Dimensional Data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge 
Discovery and Data Mining, 667-676, New York, NY: ACM Press.
• [Wong et al. 2005] Wong, W.-K., Moore, A., Cooper, G. and Wagner, M. (2005). What's Strange About Recent 
Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks. Journal of Machine Learning 
R h 6 1961 1998
109
esearc , , - . 
Tutorial on Event Detection   
Part II: Spatial Event Detection
Daniel B. Neill
Carnegie Mellon University  
H.J. Heinz III College
neill@cs.cmu.edu
Thi k ti ll t d b NSF t
110
s wor  was par a y suppor e  y  gran  
IIS-0325581 and CDC grant 8-R01-HK000020-02.
Outline of this part
A. Introduction to spatial event detection
Problem statement and overview of 
approaches.
B. Univariate scan statistic approaches   
Spatial and space-time.
C Multivariate scan statistic approaches.    
Parametric, non-parametric, and Bayesian.
D C t d f t di ti. urren  an  u ure rec ons    
Incorporating learning; fast algorithms.
111
A Introduction to.   
Spatial Event Detection
1. The spatial event detection problem
2 Approaches to spatial event detection.     
a. Top-down and bottom-up approaches
b. Parallel monitoring approaches   
c. Scan statistic approaches
d. Other approaches from spatial statistics
©2009 Carnegie Mellon University 112
Spatial event detection
D1 = respiratory ED
Outbreak detection
Time series of counts 
ci,mt for each location si
D2 = constitutional ED
D3 = OTC cough/cold
D4 = OTC anti-fever
Spatial time series data from 
spatial locations si (e.g. zip codes)
for each data stream Dm.
G l f d t ti t k d t t i t ( di tb k )
(etc.)
oa s o  e ec on as : e ec any emerg ng even s e.g. sease ou rea s , 
pinpoint the affected spatial area, and characterize the type of event.
Informally, we want to know: Formally, we will distinguish between:
Is there anything happening?
If so, what and where?
Null hypothesis H0 (no events)
Set of alternative hypotheses H1(S, Ek)
= event of type E in spatial region S
113
    k    . 
Spatial event detection
D1 = respiratory ED
Outbreak detection
Time series of counts 
ci,mt for each location si
D2 = constitutional ED
D3 = OTC cough/cold
D4 = OTC anti-fever
Spatial time series data from 
spatial locations si (e.g. zip codes)
for each data stream Dm.
G l f d t ti t k d t t i t ( di tb k )
(etc.)
oa s o  e ec on as : e ec any emerg ng even s e.g. sease ou rea s , 
pinpoint the affected spatial area, and characterize the type of event.
This formulation assumes count data aggregated to                     
discrete time steps (e.g. days) and small areas (e.g. zips).
More generally, we can have a set of data records (observations) 
where each observation has a time-stamp, location information, and 
114
possibly other attributes.  Each count represents the number of 
observations with given attributes in a given area and time interval.
Spatial event detection
D1 = respiratory ED
Outbreak detection
Time series of counts 
ci,mt for each location si
D2 = constitutional ED
D3 = OTC cough/cold
D4 = OTC anti-fever
Spatial time series data from 
spatial locations si (e.g. zip codes)
for each data stream Dm.
G l f d t ti t k d t t i t ( di tb k )
(etc.)
oa s o  e ec on as : e ec any emerg ng even s e.g. sease ou rea s , 
pinpoint the affected spatial area, and characterize the type of event.
This formulation assumes count data aggregated to                     
discrete time steps (e.g. days) and small areas (e.g. zips).
We assume that an event will result in anomalously                      
high counts for some subset of data streams for                               
115
the affected spatial region and time interval.
Spatial event detection
D1 = respiratory ED
Outbreak detection
Time series of counts 
ci,mt for each location si
D2 = constitutional ED
D3 = OTC cough/cold
D4 = OTC anti-fever
Spatial time series data from 
spatial locations si (e.g. zip codes)
for each data stream Dm.
G l f d t ti t k d t t i t ( di tb k )
(etc.)
oa s o  e ec on as : e ec any emerg ng even s e.g. sease ou rea s , 
pinpoint the affected spatial area, and characterize the type of event.
We will initially make three additional assumptions:
Purely spatial detection problem                                               
(only a single time interval to consider)
Monitoring a single data stream D
116
     m
Attempting to detect a single event type Ek
A Introduction to.   
Spatial Event Detection
1. The spatial event detection problem
2 Approaches to spatial event detection.     
a. Top-down and bottom-up approaches
b. Parallel monitoring approaches   
c. Scan statistic approaches
d. Other approaches from spatial statistics
©2009 Carnegie Mellon University 117
Top-down and bottom-up detection
Top-down detection approaches Bottom-up detection approaches
1. Are there any globally 
interesting patterns?
1. Find individual data points 
with “interesting” local 
2. If so, find the most interesting 
sub-partition of the data and  
search it recursively
   
neighborhoods 
2. Aggregate interesting 
points into clusters  .   .
Bottom-up: density-based 
clustering (e.g. DBSCAN2)
Top-down: bump hunting1
118
2M. Ester et al., KDD 1996.
Thanks to Daniel Olivera for these examples. 1J. Friedman and N. Fisher, 1999.
Greedy approaches can fail!
Top-down detection approaches Bottom-up detection approaches
1. Are there any globally 
interesting patterns?
1. Find individual data points 
with “interesting” local 
2. If so, find the most interesting 
sub-partition of the data and  
search it recursively
   
neighborhoods 
2. Aggregate interesting 
points into clusters  .   .
Top-down fails when the affected 
region is too small to significantly 
Bottom-up fails when the affected 
region is not dense enough for the 
119
affect the global aggregate statistics. local neighborhoods to be interesting.
Greedy approaches can fail!
Top-down detection approaches Bottom-up detection approaches
1. Are there any globally 
interesting patterns?
1. Find individual data points 
with “interesting” local 
2. If so, find the most interesting 
sub-partition of the data and  
search it recursively
   
neighborhoods 
2. Aggregate interesting 
points into clusters  .   .
How can we detect both 
small, dense clusters and 
l l d l t ?
How can we move beyond 
cluster detection, to detect 
t th t i ti ?arger, ess ense c us ers even s a  emerge n me
One answer: Parallel Monitoring
Partition the monitored  
area into subregions.
Then separately monitor each 
subregion using purely temporal
Fixed partition: 
zip code
Fixed partition: 
uniform grid
Ad-hoc partition: 
data clustering
120
    
detection methods (see Part 1!)
  
boundaries
  
Challenges of parallel monitoring
One major challenge of parallel 
monitoring is choosing an appropriate 
partitioning of the monitored area. 
A given partitioning has high power to detect 
events corresponding to a single partition (red), 
but is suboptimal for events which affect       
multiple partitions (yellow), part of a partition 
(white), or parts of multiple partitions (pink).
Coarse partitions lose power for small regions      , 
fine partitions lose power for large regions, and 
both lose power for unaligned regions.
121
Challenges of parallel monitoring
One major challenge of parallel 
monitoring is choosing an appropriate 
partitioning of the monitored area. 
A second challenge of parallel 
monitoring is the problem of 
multiple hypothesis testing.
A given partitioning has high power to detect 
events corresponding to a single partition (red), 
but is suboptimal for events which affect
Monitoring thousands of spatial 
partitions, and performing a 
separate significance test for       
multiple partitions (yellow), part of a partition 
(white), or parts of multiple partitions (pink).
Coarse partitions lose power for small regions
    
each, leads to huge numbers 
of false positive alerts. 
The Bonferroni correction for      , 
fine partitions lose power for large regions, and 
both lose power for unaligned regions.
    
multiple tests leads to greatly 
reduced detection power.  
S l ti t th fi t h ll th ti l t ti ti
1. Form a very fine partitioning of the monitored area into individual locations
(e.g. zip codes or census tracts, depending on spatial resolution of the data).
o u on o e rs  c a enge: e spa a scan s a s c.
2. Rather than monitoring each partition separately, examine a huge number of 
overlapping spatial regions, each consisting of a group of locations.
122
Challenges of parallel monitoring
One major challenge of parallel 
monitoring is choosing an appropriate 
partitioning of the monitored area. 
A second challenge of parallel 
monitoring is the problem of 
multiple hypothesis testing.
A given partitioning has high power to detect 
events corresponding to a single partition (red), 
but is suboptimal for events which affect
Monitoring thousands of spatial 
partitions, and performing a 
separate significance test forS hi       
multiple partitions (yellow), part of a partition 
(white), or parts of multiple partitions (pink).
Coarse partitions lose power for small regions
    
each, leads to huge numbers 
of false positive alerts. 
The Bonferroni correction for
Spatial scan approaches 
have high power to 
detect events affecting 
small or large r gi ns
earc ng over so many 
regions makes the 
multiple hypothesis testing 
problem even worse…
S l ti t th fi t h ll th ti l t ti ti
      , 
fine partitions lose power for large regions, and 
both lose power for unaligned regions.
    
multiple tests leads to greatly 
reduced detection power.  
   .   
o u on o e rs  c a enge: e spa a scan s a s c.
1. Form a very fine partitioning of the monitored area into individual locations
(e.g. zip codes or census tracts, depending on spatial resolution of the data).
But we can solve the multiple testing problem by:
1. Finding the most significant regions.
2 D t i i h lik l ld b t2. Rather than monitoring each partition separately, examine a huge number of 
overlapping spatial regions, each consisting of a group of locations.
123
. e erm n ng ow e y we wou  e o see 
any regions that significant due to chance.
Other methods from spatial statistics
Clustered?
Tests for “general clustering”
(Whittemore, Tango, Knox, Mantel, etc.)
Not 
clustered?
Determine whether there is sufficient 
evidence of spatial or space-time 
clustering in the data, but without 
detecting specific clusters  .
Tests for “focused clustering”
(Lawson, Stone, Waller, Diggle, etc.)
Determine whether the risk is 
Focus?
Not a ?
significantly increased near a 
given point (e.g. possible 
environmental hazard).
focus?
124
Neither method detects cluster locations!
Spatial risk mapping approaches
Spatial 
smoothing
Original risk map 
(observed / expected)
Smoothed risk map
EB (Gangnon & Clayton; Mollie)
Often based on Bayesian modeling
Advantages: Explicit modeling of spatial correlation structure
FB (Lawson et al.)
Disadvantages: Cannot automatically determine whether an event
      , 
useful for data visualization, can detect areas with high risk.
125
       
has occurred; cannot identify the spatial area and time duration.
B Univariate Scan.   
Statistic Approaches
1. Kulldorff’s spatial scan statistic
2 Variants of spatial scan:.    
• Which spatial regions to search?
• How to evaluate the score of a region?
3. Extensions to space-time scanning 
(expectation-based scan statistic)
©2009 Carnegie Mellon University 126
The spatial scan statistic
(Kulldorff 1997)
Rather than monitoring 
individual locations, we
, 
   
examine groups of locations.
Imagine moving a spatial 
window around the monitored 
area, allowing the size and 
shape of the window to vary.
127
The spatial scan statistic
(Kulldorff 1997)
Rather than monitoring 
individual locations, we
, 
   
examine groups of locations.
Imagine moving a spatial 
window around the monitored 
area, allowing the size and 
shape of the window to vary.
128
The spatial scan statistic
(Kulldorff 1997)
Rather than monitoring 
individual locations, we
, 
   
examine groups of locations.
Imagine moving a spatial 
window around the monitored 
area, allowing the size and 
shape of the window to vary.
129
The spatial scan statistic
(Kulldorff 1997)
I have a population 
of 6000, of whom 90 
Rather than monitoring 
individual locations, we
, 
(1.5%) are sick.
   
examine groups of locations.
Imagine moving a spatial 
Everywhere else has a 
population of 2.2 
million of whom
window around the monitored 
area, allowing the size and 
shape of the window to vary.
Is there any position of the 
window such that the points 
inside form a significant cluster? 
,   
20,000 (0.9%) are sick.
We compute a score for each 
spatial region, and then test 
whether the highest scoring
How to evaluate a region?
Which regions to search?
130
    
regions are significant.How to search them efficiently?
Finding the most 
significant regions 
• Define models:
• of the null hypothesis 
H0: no events. 
• of the alternative 
K lld ff’ d l
hypotheses H1(S): 
event in region S.
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere
ci = count for location si (e.g. number of disease cases)
bi = baseline for location si (e.g. population at-risk, or 
expected count computed from historical data)
0    all
H1(S): q = qin inside S,   
q = qout outside,
q = risk (expected ratio of count to baseline)
131
  
qin > qout.
Finding the most 
significant regions 
• Define models:
• of the null hypothesis 
H0: no events. 
• of the alternative 
K lld ff’ d l
hypotheses H1(S): 
event in region S.
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere0    all
H1(S): q = qin inside S,   
q = qout outside,
132
  
qin > qout.
Finding the most 
significant regions 
• Define models:
• of the null hypothesis 
H0: no events. 
• of the alternative 
K lld ff’ d l
hypotheses H1(S): 
event in region S.
• Derive a score function:
)| DataPr(
))(| DataPr(
)(F
0
1
H
SH
S =
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere
• Likelihood ratio:
0    all
H1(S): q = qin inside S,   
q = qout outside,
133
  
qin > qout.
Finding the most 
significant regions
Score = 1.3
 
• Define models:
• of the null hypothesis 
H0: no events. 
• of the alternative 
K lld ff’ d l
hypotheses H1(S): 
event in region S.
• Derive a score function:
)| DataPr(
))(| DataPr(
)(F
0
1
H
SH
S =
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere
• Likelihood ratio:
totCCtotCC CCCC
−−
⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛
0    all
H1(S): q = qin inside S,   
q = qout outside,
134
tot
tot
tot
tot
BBBB
S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝
=)F(   
qin > qout.
Finding the most 
significant regions
Score = 1.3
 
• Define models:
• of the null hypothesis 
H0: no events. 
• of the alternative 
K lld ff’ d l
hypotheses H1(S): 
event in region S.
• Derive a score function:
)| DataPr(
))(| DataPr(
)(F
0
1
H
SH
S =
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere
• Likelihood ratio:
Total count and Total count and 
0    all
H1(S): q = qin inside S,   
q = qout outside,
baseline of region S baseline of search area
totCCtotCC CCCC
−−
⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛
135
  
qin > qout.tot
tot
tot
tot
BBBB
S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝
=)F(
Finding the most 
significant regions
Score = 1.3
 
• Define models:
• of the null hypothesis 
H0: no events. 
• of the alternative 
K lld ff’ d l
hypotheses H1(S): 
event in region S.
• Derive a score function:
)| DataPr(
))(| DataPr(
)(F
0
1
H
SH
S =
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere
• Likelihood ratio:
0    all
H1(S): q = qin inside S,   
q = qout outside,
totCCtotCC CCCC
−−
⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛
136
  
qin > qout.tot
tot
tot
tot
BBBB
S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝
=)F(
Finding the most 
significant regions
Score = 1.3
 
• Define models:
• of the null hypothesis 
H0: no events. 
• of the alternative 
K lld ff’ d l
hypotheses H1(S): 
event in region S.
• Derive a score function:
)| DataPr(
))(| DataPr(
)(F
0
1
H
SH
S =
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere)(Fmaxarg* SS
• Likelihood ratio:
• To find the most 
significant regions: 0    all
H1(S): q = qin inside S,   
q = qout outside,
S
= 
totCCtotCC CCCC
−−
⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛
137
  
qin > qout.tot
tot
tot
tot
BBBB
S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝
=)F(
Finding the most 
significant regions
2nd highest 
score = 8.4 
• Define models: Maximum region
• of the null hypothesis 
H0: no events. 
• of the alternative 
  
score = 9.8
K lld ff’ d l
hypotheses H1(S): 
event in region S.
• Derive a score function:
)| DataPr(
))(| DataPr(
)(F
0
1
H
SH
S =
u or s mo e
ci ~ Poisson(qbi)
H : q = q everywhere)(Fmaxarg* SS
• Likelihood ratio:
• To find the most 
significant regions: 0    all
H1(S): q = qin inside S,   
q = qout outside,
S
= 
totCCtotCC CCCC
−−
⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛
138
  
qin > qout.tot
tot
tot
tot
BBBB
S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝
=)F(
Which regions are 
significant?
2nd highest 
score = 8.4RB = 97, p = .098
• Randomly generate counts for 
R 999 li d t t d Maximum region =  rep ca a ase s un er 
H0 (i.e. assuming no events).
• Find maximum region score 
F*= max F(S) of each replica
  
score = 9.8
RB = 12, p = .013 S    .
• p-value of region S = (RB+1) / 
(R+1), where RB =  # of replicas 
with F* ≥ F(S).
Thi i i i ifi t t 05• All regions with p-values < α
are significant at level α.  
s reg on s s gn can  a  α = . ; 
no other regions are significant.
…
F* = 2.4 F* = 9.1 F* = 7.0
139
G1 G2 G999
B Univariate Scan.   
Statistic Approaches
1. Kulldorff’s spatial scan statistic
2 Variants of spatial scan:.    
• Which spatial regions to search?
• How to evaluate the score of a region?
3. Extensions to space-time scanning 
(expectation-based scan statistic)
140
Choosing the set of search regions
• Some practical considerations:
• Set of regions should cover entire search space.
• Regions should overlap, not partition the space.
• Choose a set of regions that corresponds well with 
the size/shape of the clusters we want to detect.
• Typical approaches consider some fixed shape (circles, 
rectangles) and vary the location and dimensions      .
Don’t search too few regions: Don’t search too many regions:
Overall power to detect any given 
subset of regions reduced because of 
multiple hypothesis testing.
Reduced power to detect clusters 
outside the search space.
141
Computational infeasibility!
Choosing the set of search regions
• Kulldorff’s original spatial scan 
searches over circular regions of 
varying radius, centered at each 
ti l l tispa a  oca on si.
• Since the score function F(S) 
depends only on which locations 
are included we need to search ,     
O(N2) regions, each consisting of 
a center location and its k-NN.
• Advantages: computationally  
efficient, generalizable to arbitrary 
metric spaces, high detection 
power for compact clusters.   
• Disadvantage: low power for 
elongated/irregular clusters.
April 1979: inadvertent release of 
anthrax from a Soviet biological 
weapons facility, 77 cases confirmed.
142
Disease cluster elongated due to wind.
Choosing the set of search regions
• Kulldorff’s original spatial scan 
searches over circular regions of 
varying radius, centered at each 
ti l l ti
Many recent spatial scan 
variants search over 
elongated clusters, e.g. 
spa a  oca on si.
• Since the score function F(S) 
depends only on which locations 
are included we need to search
rectangles1 or ellipses2
Other variants: heuristic 
h ll t d ,     
O(N2) regions, each consisting of 
a center location and its k-NN.
• Advantages: computationally
searc  over a  connec e  
regions3, or exhaustive 
search over a subset of 
connected regions4 5  
efficient, generalizable to arbitrary 
metric spaces, high detection 
power for compact clusters.
 ,
Main challenge: 
efficient computation!   
• Disadvantage: low power for 
elongated/irregular clusters.
1Neill and Moore, KDD 2004
2Kulldorff et al., Stat. Med., 2007
3D l d A CSDA 2004
143
uczma  an  ssuncao, , 
4Tango and Takahashi, IJHG, 2005
5Patil and Taillie, EES, 2004
Computing the score function
Method 1 (Frequentist, hypothesis testing approach):
Use likelihood ratio
)|P (
))(|Pr(
)( 1
HD t
SHData
SF =
r 0a a
Method 2 (Bayesian approach):
Use posterior probability ))(Pr())(|Pr()( 11 SHSHDataSF
Prior probability of region S
  
)Pr(Data
=
144
Computing the score function
Method 1 (Frequentist, hypothesis testing approach):
Use likelihood ratio
)|P (
))(|Pr(
)( 1
HD t
SHData
SF =
r 0a a
Method 2 (Bayesian approach):
Use posterior probability ))(Pr())(|Pr()( 11 SHSHDataSF
Prior probability of region S
  
)Pr(Data
=
What to do when each hypothesis has a parameter space Θ?          
Method A (Maximum likelihood approach)
)|P ()|P ( θHDHD
Method B (Marginal likelihood approach)
,rmaxr )(θ ataata HΘ∈=
∫
145
Θ∈
=
)(
)Pr(),|Pr()|Pr(
H
HDataHData
θ
θθ
Computing the score function
Method 1 (Frequentist, hypothesis testing approach):
Use likelihood ratio
)|P (
))(|Pr(
)( 1
HD t
SHData
SF =
r 0a a
Most common (frequentist) approach: use 
likelihood ratio statistic with maximum likelihood  ,    
estimates of any free parameters, and compute 
statistical significance by randomization1,2
Method A (Maximum likelihood approach)
)|P ()|P ( θHDHD
1Kulldorff, 1997
2Neill and Moore, 
,rmaxr )(θ ataata HΘ∈= ADKDD 2005.
Many possible variants, depending on how we model the 
lik lih d f th d t d h h th i H (S) d H
146
e oo  o  e a a un er eac  ypo es s 1  an  0
(Poisson, Gaussian, exponential, negative binomial, etc.)
Computing the score function
Advantages: Randomization testing 
unnecessary (1000x speedup), can be 
extended to multiple data streams and 
multiple event types (more on this later)
Method 2 (Bayesian approach):
Use posterior probability ))(Pr())(|Pr()( 11 SHSHDataSF
      .
  
)Pr(Data
=
Bayesian spatial scan statistic1,2: 
A Bayesian marginal likelihood approach, efficiently 
computable using conjugate priors (Gamma-Poisson).
Method B (Marginal likelihood approach)
∫
1Neill et al., NIPS 2005
2Neill and Cooper, Machine 
L i 2009 i
147
Θ∈
=
)(
)Pr(),|Pr()|Pr(
H
HDataHData
θ
θθ earn ng, , n press.
B Univariate Scan.   
Statistic Approaches
1. Kulldorff’s spatial scan statistic
2 Variants of spatial scan:.    
• Which spatial regions to search?
• How to evaluate the score of a region?
3. Extensions to space-time scanning 
(expectation-based scan statistic)
148
Expectation-based scan statistics
(K lld ff 2001 N ill t l KDD 2005)u or , ; e  e  a .,  
To detect emerging events, we can 
search for space-time regions where    
the recently observed counts are 
significantly higher than expected.
Imagine moving a space-time
window around the scan area, 
allowing the window size, 
h d d is ape, an urat on to vary.
149
Expectation-based scan statistics
(K lld ff 2001 N ill t l KDD 2005)u or , ; e  e  a .,  
To detect emerging events, we can 
search for space-time regions where    
the recently observed counts are 
significantly higher than expected.
Imagine moving a space-time
window around the scan area, 
allowing the window size, 
h d d i
Historical Current counts
s ape, an urat on to vary.
(Consider most recent w days, w = 1…Wmax)
 
counts
  
(1 day duration)
150
Expectation-based scan statistics
(K lld ff 2001 N ill t l KDD 2005)u or , ; e  e  a .,  
To detect emerging events, we can 
search for space-time regions where    
the recently observed counts are 
significantly higher than expected.
Imagine moving a space-time
window around the scan area, 
allowing the window size, 
h d d i
Historical Current counts
s ape, an urat on to vary.
(Consider most recent w days, w = 1…Wmax)
 
counts
  
(2 day duration)
151
Expectation-based scan statistics
(K lld ff 2001 N ill t l KDD 2005)u or , ; e  e  a .,  
To detect emerging events, we can 
search for space-time regions where    
the recently observed counts are 
significantly higher than expected.
Imagine moving a space-time
window around the scan area, 
allowing the window size, 
h d d i
Historical Current counts
s ape, an urat on to vary.
(Consider most recent w days, w = 1…Wmax)
For each space-time region, we 
compare the current counts for 
each location to the time series of 
 
counts
  
(3 day duration)
152
historical counts for that location. 
Expectation-based scan statistics
(K lld ff 2001 N ill t l KDD 2005)
For the standard scan statistic 
approach we assume that each
u or , ; e  e  a .,  
,     
count is drawn from a Poisson 
distribution with unknown mean.
We perform time series analysis 
to find the expected counts for 
each recent day, then compare 
Historical Current counts
actual to expected counts.
Expected 
t
 
counts
  
(3 day duration)For each space-time region, we 
compare the current counts for 
each location to the time series of 
153
coun shistorical counts for that location. 
Expectation-based scan statistics
(K lld ff 2001 N ill t l KDD 2005)
For the standard scan statistic 
approach we assume that each
u or , ; e  e  a .,  
,     
count is drawn from a Poisson 
distribution with unknown mean.
Similarly, we can compute a 
Gaussian scan statistic by 
obtaining the expectations and 
Historical Current counts
variances from historical data.
 
counts
  
(3 day duration)For each space-time region, we 
compare the current counts for 
each location to the time series of Expected 
t
154
historical counts for that location. coun s
Expectation-based scan statistics
(K lld ff 2001 N ill t l KDD 2005)
As before, we find the regions with 
highest values of the likelihood ratio
2nd highest 
score = 8 4
Not significant 
(p = 098)
u or , ; e  e  a .,  
      
statistic, and compute the p-value of 
each region by randomization.
Maximum region
  . .
)|DataPr(
))(| DataPr(
)(F
0
1
H
SH
S =
Alternative hypothesis: 
event in region S
  
score = 9.8
Significant! (p = .013)
Null hypothesis: 
no events
…
F1* = 2.4 F2* = 9.1 F999* = 7.0To compute p-value
Compare region score 
to maximum region 
scores of simulated
155
   
datasets under H0.
Poisson scan statistic models
Counts are Poisson distributed: cit ~ Poisson(qitbit)
qit is relative risk, 
bit is expected 
count under H0
Expectation-based Poisson (EBP) Population-based Poisson (PBP)
(Neill et al., KDD 2005) (Kulldorff, 1997, 2001)
H0: qit = 1 everywhere       
(counts = expected)
H0: qit = qall everywhere 
(inside = outside)
H1(S): qit = qin in S and qit = 1 
outside, for some qin > 1. 
(counts > expected in S)
H1(S): qit = qin in S and qit = qout 
outside, for some qin > qout. 
(inside > outside)
qin = 1.2 qin = 1.3
156
qout = 1.1
Poisson scan statistic models
Counts are Poisson distributed: cit ~ Poisson(qitbit)
qit is relative risk, 
bit is expected 
count under H0
Expectation-based Poisson (EBP) Population-based Poisson (PBP)
(Neill et al., KDD 2005) (Kulldorff, 1997, 2001)
H0: qit = 1 everywhere       
(counts = expected)
H0: qit = qall everywhere 
(inside = outside)
H1(S): qit = qin in S and qit = 1 
outside, for some qin > 1. 
(counts > expected in S)
H1(S): qit = qin in S and qit = qout 
outside, for some qin > qout. 
(inside > outside)
allC
all
all
outC
out
out
inC
in
in
B
C
B
C
B
C
)S(F
−
⎟⎟⎠
⎞
⎜⎜⎝
⎛
⎟⎟⎠
⎞
⎜⎜⎝
⎛
⎟⎟⎠
⎞
⎜⎜⎝
⎛=CB
C
e
B
C
F(S) −⎟⎠
⎞⎜⎝
⎛=
157
(if Cin / Bin > Cout / Bout)(if C > B)
Gaussian scan statistic models
Counts are Gaussian distributed: cit ~ Gaussian(qitbit, σit)
Let C’ = Σ citbit / (σit)2 and B’ = Σ (bit)2 / (σit)2
Expectation-based Gaussian (EBG) Population-based Gaussian (PBG)
(Neill, Ph.D. thesis, 2006) (Neill, Ph.D. thesis, 2006)
H0: qit = 1 everywhere       
(counts = expected)
H0: qit = qall everywhere 
(inside = outside)
H1(S): qit = qin in S and qit = 1 
outside, for some qin > 1. 
(counts > expected in S)
H1(S): qit = qin in S and qit = qout 
outside, for some qin > qout. 
(inside > outside)
( ) ( ) ( )
⎟⎟⎠
⎞
⎜⎜⎝
⎛ −+=
all
2
all
out
2
out
in
2
in
'B2
'C
'B2
'C
'B2
'C
exp)S(F⎟⎟⎠
⎞
⎜⎜⎝
⎛ −+= 'C
2
'B
'B2
)'C(
expF(S)
2
158
(if C’in / B’in > C’out / B’out)(if C’ > B’)
Comparison of models and methods
• Expectation-based space-time scan statistics typically outperform 
purely spatial and purely temporal scans1 and parallel monitoring2.
• EBP and EBG statistics have consistently high detection power 
whether the affected region is large or small in size 3         .
• Kulldorff’s statistic (PBP) has very low detection power for large 
regions.  For small regions, PBP beats EBP and EBG for large-count 
datasets, while EBP wins for small-count datasets.3
Diff t ti i th d b t f ti b li f• eren  me ser es me o s are es  or compu ng ase nes or 
different datasets; it is important to adjust for seasonal and day-of-week 
trends if these are present.2,3
• Randomization testing is often miscalibrated for public health datasets, 
resulting in lower detection power and high false positive rates.  We 
suggest using the empirical distribution of maximum scores from 
historical data instead.2
• Bayesian4,5 and nonparametric6 approaches often outperform typical     
parametric scan statistics (more on these later).
1Neill, Ph.D. thesis, 2006 4Neill et al., NIPS 2005
159
2Neill, IJF, 2009
3Neill, IJHG, 2009
5Neill and Cooper, MLJ, 2009
6Neill and Lingwall, ISDS 2007
Persistent vs. emerging clusters
Most space-time scan approaches assume that the 
relative risks qit are spatially uniform over the affected 
i d t t th d ti f th t
Good for detecting Not as good for detecting 
reg on, an  cons an  over e ura on o  e even .
persistent clusters 
(e.g. shift in mean)
clusters that emerge
gradually over time
Better idea: assume that relative risk is non-
160
decreasing over the duration of the event.1
1Neill et al., KDD 2005.
Persistent vs. emerging example
Estimated values 
of relative risk for 
each day
Actual values 
of C/B for 
each day
Efficiently computable: 
q = Σ cit / Σ bit
Can be computed just 
as efficiently1, using 
dynamic programming!
t tP i t t E i
161
ers s en merg ng
1Neill et al., KDD 2005.
Static vs. dynamic clusters
Most space-time scan approaches 
assume that the affected set of spatial 
locations remains constant over time
t
    .
We can think of this as a search over 
regions shaped like right prisms (with
Iyengar (KDD 2004) considers regions
      
both bases the same) in 3-D space-time. 
     
with truncated pyramidal shapes in 
space-time.  This models regions which 
move, grow, or shrink linearly over time.
Exact search is computationally 
infeasible; heuristic search is used 
162
From Iyengar, KDD 2004
to obtain an approximate solution.
C. Multivariate Scan   
Statistic Approaches
1. Advantages of multivariate approaches
2 Parametric multivariate scan statistics.    
3. Non-parametric scan statistics (NPSS)
4. Multivariate Bayesian scan statistics (MBSS)    
©2009 Carnegie Mellon University 163
Why multivariate event detection?
ED visits
(fever)
ED visits
(respiratory) …
OTC sales
(nasal)
OTC sales
(cough)
Multivariate Event 
Detection System
No A th I fl 
outbreak …
n rax 
outbreak
in region S
n uenza
outbreak
in region S
164
Why multivariate event detection?
ED visits
(fever)
ED visits
(respiratory) …
OTC sales
(nasal)
OTC sales
(cough)
W t t d t t
Multivariate Event 
Detection System
e wan  o e ec  
emerging events in the 
early stages, when they 
have a small impact on 
Goal 1: More 
timely detection.
No A th I fl
each individual stream.
 
outbreak …
n rax 
outbreak
in region S
n uenza
outbreak
in region S
165
Why multivariate event detection?
ED visits
(fever)
ED visits
(respiratory) …
OTC sales
(nasal)
OTC sales
(cough)
W t t di ti i h
Multivariate Event 
Detection System
e wan  o s ngu s  
between different 
event types based on 
the affected region 
Goal 2: Event 
characterization.
No A th I fl
and streams.
 
outbreak …
n rax 
outbreak
in region S
n uenza
outbreak
in region S
166
Why multivariate event detection?
ED visits
(fever)
ED visits
(respiratory) …
OTC sales
(nasal)
OTC sales
(cough)
W t t d l
Multivariate Event 
Detection System
e wan  o mo e  
various other causes of 
a detected pattern and 
distinguish these from 
Goal 3: Reduce 
false positives.
No PromotionalA th I fl
relevant events. 
 
outbreak
 
sale of OTC 
medications
n rax 
outbreak
in region S
n uenza
outbreak
in region S
167
C. Multivariate Scan   
Statistic Approaches
1. Advantages of multivariate approaches
2 Parametric multivariate scan statistics.    
3. Non-parametric scan statistics (NPSS)
4. Multivariate Bayesian scan statistics (MBSS)    
168
Parametric scan statistics
Parametric scan statistics find the 
regions with highest values of a 2nd highest 
score = 8 4
Not significant 
(p = 098) likelihood ratio statistic, and 
compute statistical significance of 
each region by randomization.
Maximum region
  . .
)H|D tP (
))S(H| DataPr(
)S(F
1=
Alternative hypothesis: 
outbreak in region S
  
score = 9.8
Significant! (p = .013)
a ar 0
Null hypothesis: 
no outbreak
1Typical multivariate approach : assume 
streams are independent, conditional 
on whether an event has occurred and 
the affected space-time region S.
169
      
1Kulldorff et al., Stat. Med., 2007
Parametric scan statistics
Parametric scan statistics find the 
regions with highest values of a 2nd highest 
score = 8 4
Not significant 
(p = 098) likelihood ratio statistic, and 
compute statistical significance of 
each region by randomization.
Maximum region
  . .
)H|D tP (
))S(H| DataPr(
)S(F
1=
Alternative hypothesis: 
outbreak in region S
  
score = 9.8
Significant! (p = .013)
a ar 0
Null hypothesis: 
no outbreak
1 Under this assumption we can multiplyTypical multivariate approach : assume 
streams are independent, conditional 
on whether an event has occurred and 
the affected space-time region S. ∏ == M 1m 01m )H|DPr( ))S(H| DPr()S(F
  ,    
the likelihood ratios for each stream:
170
      
m
1Kulldorff et al., Stat. Med., 2007
Parametric scan statistics
2nd highest 
score = 8 4
Not significant 
(p = 098)
This multivariate approach has 
several disadvantages:
1) Does not account for
Maximum region
  . .     
correlations between streams.
2) Cannot determine which subset 
of streams have been affected  
score = 9.8
Significant! (p = .013)
    .
3) Tends to focus detection on 
streams with highest counts.
1 Under this assumption we can multiply
4) Cannot distinguish between 
multiple types of event.
Typical multivariate approach : assume 
streams are independent, conditional 
on whether an event has occurred and 
the affected space-time region S. ∏ == M 1m 01m )H|DPr( ))S(H| DPr()S(F
  ,    
the likelihood ratios for each stream:
171
      
m
1Kulldorff et al., Stat. Med., 2007
C. Multivariate Scan   
Statistic Approaches
1. Advantages of multivariate approaches
2 Parametric multivariate scan statistics.    
3. Non-parametric scan statistics (NPSS)
4. Multivariate Bayesian scan statistics (MBSS)    
172
The nonparametric scan statistic
Neill and Lingwall ISDS 2007
Rather than assuming a parametric distribution and learning the mean 
and variance parameters from past counts, NPSS compares the 
t t t th ti i i l di t ib ti f hi t i l t
  ,  
curren  coun s o e en re emp r ca  s r u on o  s or ca  coun s.
Simple assumption: under H0, all counts for a given location and               
d t t d i d d tl f th di t ib tia a s ream are rawn n epen en y rom e same s r u on.
In this case, the proportion of historical counts that are greater than 
t t t ill b t ti ll if l di t ib t d [0 1]
Historical Current counts
curren  coun  ci,m w  e asymp o ca y un orm y s r u e  on , .
Compute the empirical p-value pi,mt
corresponding to each current count ci mt:  
counts
  
(3 day duration)
     ,   
pi,mt = (Tbeat + 1) / (T + 1)
173
Total # of 
historical counts
# of historical 
counts > ci,mt
The nonparametric scan statistic
Neill and Lingwall ISDS 2007  ,  
Rather than assuming a parametric distribution and learning the mean 
and variance parameters from past counts, NPSS compares the 
t t t th ti i i l di t ib ti f hi t i l tcurren  coun s o e en re emp r ca  s r u on o  s or ca  coun s.
Simple assumption: under H0, all counts for a given location and               
d t t d i d d tl f th di t ib tia a s ream are rawn n epen en y rom e same s r u on.
In this case, the proportion of historical counts that are greater than 
t t t ill b t ti ll if l di t ib t d [0 1]
Compute the empirical p-value pi,mt
corresponding to each current count ci mt:
Under H0, pi,mt ~ U[0,1]
curren  coun  ci,m w  e asymp o ca y un orm y s r u e  on , .
     ,   
pi,mt = (Tbeat + 1) / (T + 1) Under H1(S), the counts in region 
S will be higher than expected 
under H and thus the empirical p
174
Total # of 
historical counts
# of historical 
counts > ci,mt
 0,     -
values will be lower than expected.
The nonparametric scan statistic
We search for regions (D, S, W) 
with a surprisingly large number 
f l i i l l
D: subset of data streams               
S: set of spatial locations                
W: duration (number of days)o  ow emp r ca  p-va ues.     
Total number of empirical p-values in region: N = |D| x |S| x W
How many low empirical p-values (pi,mt < α) do we expect under H0?
L t N # { t ) Th N Bi i l(N )e  α =  pi,m < α .  en α ~ nom a , α , 
with mean Nα and variance Nα(1 – α).
Follo ing Donoho and Jin (2004) e define the higherw     , w     
criticism statistic F(D, S, W) = maxα (Nα – Nα) / √Nα(1 – α).
We find the multivariate space time regions (D S W) with highest scores
175
    -   , ,     
F(D, S, W), and compute statistical significance by randomization.
The nonparametric scan statistic
Advantages of the nonparametric scan statistic (NPSS) 
No parametric model assumptions.
Can easily combine information from multiple streams                       
and identify which subset of streams are affected.
Randomization testing is easy (draw each p t ~ U[0 1])      i,m  , .
NPSS assumes that all of the counts for a given time                        
series are drawn from the same (unknown) distribution       ,                      
which will not be true if the time series is nonstationary.
Solution: use the standardized residuals ri mt = (ci mt – bi mt) / √bi mt,     ,  , ,   ,  
where the expected counts bi,mt are inferred by time series analysis.
Other nonparametric score functions F(D, S, W) = max F (N , N)
176
        α α α                
can be defined, and in some cases these outperform higher criticism.
Preliminary comparison of methods
• We compared the parametric and nonparametric scan 
statistics on a variety of outbreak detection tasks using 
Emergency Department data from Allegheny County.
• Univariate detection performance was comparable; NPSS 
outperformed parametric scans for larger outbreaks, and 
for data that did not fit the parametric model assumptions         .
• NPSS achieved significant gains in detection power on 
multivariate tasks, especially when only a subset of the 
monitored streams ere affected  w  .
• NPSS was able to accurately identify the affected streams.
• A naïve implementation of NPSS is more computationally 
expensive than parametric scan, O(2M) for M streams.
• However, we have developed an efficient implementation 
that scales linearly with M using newly developed methods    ,     
for linear-time subset scanning (LTSS).1
177
1Neill, ISDS 2008
C. Multivariate Scan   
Statistic Approaches
1. Advantages of multivariate approaches
2 Parametric multivariate scan statistics.    
3. Non-parametric scan statistics (NPSS)
4. Multivariate Bayesian scan statistics (MBSS)    
178
Overview of the MBSS method
Neill and Cooper MLJ 2009
Priors Multivariate 
Bayesian
  , , 
Dataset  
Scan 
StatisticModels
Outputs
Given a set of event types Ek, a set of spatial regions S, and the multivariate 
dataset D MBSS outputs the posterior probability Pr(H (S E ) | D) of each type
We must provide the prior probability Pr(H1(S, Ek)) of each event type Ek
 ,      1 , k          
of event in each region, as well as the probability of no event, Pr(H0 | D).
in each region S, as well as the prior probability of no event, Pr(H0). 
MBSS uses Bayes’ Theorem to combine the data likelihood given each hypothesis 
ith th i b bilit f th t h th i P (H | D) P (D | H) P (H) / P (D)
179
w  e pr or pro a y o  a  ypo es s: r    = r    r   r .
The Bayesian hierarchical model
Type of event
Space-time 
region affected
Effects on each 
data stream
 
Effects of 
event
Parameter priors
Expected counts Relative risks
180
Observed counts
The Bayesian hierarchical model
ci,mt ~ Poisson(qi,mtbi,mt)
Count for data 
stream dm in 
location si at 
time t
bi,mt is expected value of ci,mt
under the null hypothesis, 
predicted from historical data.   
qi,mt is relative risk.
Null hypothesis H0 
(no events)
Alternative hypothesis H1(S, Ek) 
(event of type Ek in region S)
qi,mt ~ Gamma(αm, βm) 
everywhere
qi,mt ~ Gamma(xmαm, βm) inside region S, 
qi,mt ~ Gamma(αm, βm) elsewhere
αm and βm are learned from 
historical data for stream dm.
Event type Ek multiplies expected counts in S by 
some constant xm for each stream dm.
Gamma(α,β)   
μ= α / β      
σ2 = α / β2
Simple event model:  xm = 1 + θ (xkm,avg – 1)
Event severity
181
Average effect of event type Ek on stream dm.
Computing Bayesian likelihoods
• Marginal likelihood approach: integrate over possible values 
of the relative risks qi,mt, weighted by their prior probabilities.
C j t i ll l d f l ti• on uga e pr ors a ow a c ose  orm so u on.
• Gamma priors, Poisson counts Æ Negative binomial likelihoods.
182
Computing Bayesian likelihoods
• Marginal likelihood approach: integrate over possible values 
of the relative risks qi,mt, weighted by their prior probabilities.
C j t i ll l d f l ti• on uga e pr ors a ow a c ose  orm so u on.
• Gamma priors, Poisson counts Æ Negative binomial likelihoods.
∏ tt )()|( ∝
t,m,i
mmm,imi,0 ,,b,cNegBinHDPr βα
dq))qbPo(~cPr()),Ga(~qPr(),,b,cNegBin( where βαβα ∫=
)(βαΓ
183
)()b(
c
  c αβ
α
α Γ+
+∝ +
Computing Bayesian likelihoods
• Marginal likelihood approach: integrate over possible values 
of the relative risks qi,mt, weighted by their prior probabilities.
C j t i ll l d f l ti• on uga e pr ors a ow a c ose  orm so u on.
• Gamma priors, Poisson counts Æ Negative binomial likelihoods.
∏ tt )()|( ∝
t,m,i
mmm,imi,0 ,,b,cNegBinHDPr βα
∏∝ tt )xbcNegBin(})x{)ES(H|DPr( βα
∈St,m,i
mmmm,imi,mk1 ,,,,,
∏
∉
×
Stmi
mm
t
m,i
t
mi, ),,b,cNegBin( βα
dq))qbPo(~cPr()),Ga(~qPr(),,b,cNegBin( where βαβα ∫=
)(βαΓ
,,
184
)()b(
c
  c αβ
α
α Γ+
+∝ +
Computing Bayesian likelihoods
• Marginal likelihood approach: integrate over possible values 
of the relative risks qi,mt, weighted by their prior probabilities.
C j t i ll l d f l ti• on uga e pr ors a ow a c ose  orm so u on.
• Gamma priors, Poisson counts Æ Negative binomial likelihoods.
∏ tt )()|( ∝
t,m,i
mmm,imi,0 ,,b,cNegBinHDPr βα
∏∝ tt )xbcNegBin(})x{)ES(H|DPr( βα
∈St,m,i
mmmm,imi,mk1 ,,,,,
∏
∉
×
Stmi
mm
t
m,i
t
mi, ),,b,cNegBin( βα
,,
To compute the data likelihood given the alternative 
hypothesis H (S E ) we marginalize over the values of x
185
 1 , k ,       m.
Comparison to prior methods
We compared MBSS to the parametric multivariate scan 
statistic on outbreak detection using OTC medication sales.
Using uninformative priors, MBSS achieves similar detection 
performance to parametric scans, enabling its use as a general 
detector with high performance across many event types.
However, we can also incorporate prior information into event 
models, and thus use MBSS as a specific detector with much 
higher detection power for the given event type, achieving an 
average of 1.3 days faster detection than parametric scans.
Additionally, MBSS can be used to characterize events by 
specifying models for multiple event types and computing      
186
the probability that each type of event has occurred. 
Testing discrimination power
• We examined the ability of MBSS to differentiate 
between two types of influenza-like illness using 
two streams of OTC data (cough/cold antifever)     , .
Outbreak E1 affects cough/cold twice as much as fever.  
O tb k E ff t f t i h h/ ld
100
t
y
u rea  2 a ec s ever w ce as muc  as coug co .
40
60
80
d
 
P
r
o
b
a
b
i
l
i
t
Correct type
Wrong type
MBSS was able to 
accurately discriminate 
between the two event 
0
20
E
s
t
i
m
a
t
e
d
No outbreaktypes by the second 
outbreak day.
187
1 2 3 4 5 6 7
Day of outbreak
Interpretation and visualization
• MBSS gives the total posterior probability of each 
event type Ek, and the distribution of this 
probability over space time regions S  -   .
• Probabilistic basis for decision-making, given costs 
of false positives and false negatives     .
• Visualization: Pr(H1(si, Ek)) = Σ Pr(H1(S, Ek)) for all 
regions S containing location si.
Total posterior probability of a 
i t tb k i hresp ra ory ou rea  n eac  
Allegheny County zip code on 6/3/05. 
Darker shading = higher probability.
188
Advantages of MBSS
Results are easy to 
interpret, visualize, and 
use for decision-making.
Can incorporate prior knowledge of 
event prevalence, size, shape, 
duration, spread, and impact.      
Computation is fast in the Bayesian 
framework and randomization
P(anthrax) = 22% 
P(influenza) = 13%,   
testing is not necessary.
  
P(other ILI) = 33%
We can model and differentiate 
between multiple potential 
causes of an event.
We can detect faster and 
more accurately by integrating 
multiple data streams.
189
D Current Directions.   
in Spatial Event   
Detection
1. Incorporating learning into detection
2. Very fast detection algorithms   
3. Generalization of spatial methods to non-spatial data
4. Non-aggregated spatial and temporal data
5 I ti i h i f ti f b ti. ncorpora ng r c  n orma on rom o serva ons
6. Detecting multiple, dynamic, irregular clusters
7. Integrating detection, tracking, and response
©2009 Carnegie Mellon University 190
8. Combining sensor placement and sensor fusion
Incorporating learning into detection
We have made major advances in detecting anomalous patterns, 
but not in determining which of these anomalies are relevant.
9/20, 15213, cough/cold, …
9/21, 15207, antifever, …
9/22, 15213, CC = cough, ...
1,000,000  more records…
Huge mass of data Detection algorithm “What am I supposed 
to do with this?”
Too many 
alerts
We must model and differentiate between multiple causes of a 
detected pattern, and provide only the relevant patterns to the user.
How can we classify patterns, and 
determine which ones are relevant 
191
to a given user at a given time?
Incorporating learning into detection
We have made major advances in detecting anomalous patterns, 
but not in determining which of these anomalies are relevant.
9/20, 15213, cough/cold, …
9/21, 15207, antifever, …
9/22, 15213, CC = cough, ...
1,000,000  more records…
Huge mass of data
Feedback loop
We must model and differentiate between multiple causes of a 
detected pattern, and provide only the relevant patterns to the user.
How can we classify patterns, and 
determine which ones are relevant 
Incorporate user feedback 
into the detection process, 
192
to a given user at a given time? and use it to learn models!
Incorporating learning into MBSS
Many aspects of the MBSS framework can be learned from data.
The set of event types Ek, and the 
prevalence of each event type Pr(Ek).
We first consider the 
passive learning scenario, 
in which the user provides a
The space-time pattern of each 
event type, Pr(H1(S, Ek) | Ek).
      
label (S, Ek) for each day.
The effects of each event type on the 
multiple data streams, Pr(D | H1(S, Ek)).
The relevance of each type
This label can be then used 
by the system to update its 
models and improve its 
     
of event to the user.
performance for future days.
193
Incorporating learning into MBSS
Many aspects of the MBSS framework can be learned from data.
We need to generalize over 
huge # of possible regions S.
The set of event types Ek, and the 
prevalence of each event type Pr(Ek).
The space-time pattern of each 
event type, Pr(H1(S, Ek) | Ek).
The effects of each event type on the 
multiple data streams, Pr(D | H1(S, Ek)).
The relevance of each type     
of event to the user.
194
Latent center model 
(Makatchev and Neill, 2008)
Incorporating learning into MBSS
Many aspects of the MBSS framework can be learned from data.
We can learn different temporal     
patterns for different event types.The set of event types Ek, and the 
prevalence of each event type Pr(Ek).
The space-time pattern of each 
event type, Pr(H1(S, Ek) | Ek).
The effects of each event type on the 
multiple data streams, Pr(D | H1(S, Ek)).
The relevance of each type
We can detect events where the 
affected region changes over time.
     
of event to the user.
195(Das, Schneider, & Neill, 2009)
Passive vs. active learning
So far, we have considered the passive learning scenario, in which 
each day is assigned a label by the user (no event, or event type 
Ek and affected area S) independently of the system’s output        .
However, having a user in the loop allows for much interaction   
and learning than this simple framework In the active learning     .      
scenario, the system presents its output (detected clusters) to     
the user, and receives feedback on these clusters, each day.
This presents an interesting challenge: the system must strike a 
balance between exploration, presenting “unknown” examples that 
will best inform its models and exploitation presenting “known”    ,  ,   
examples that have highest probability of relevance to the user.
196
Active learning of new event types
This scenario allows the user to define new classes “on the fly”, by 
assigning a new label type to an example.  The system can then 
find other potential examples of the new class in historical data,            
ask the user to label these, and learn a model for the new class.
“Any clusters of interest today?”
“Yes, this appears to be an anthrax attack, 
based on increased OTC cough and fever.”
“No, we don’t see a corresponding increase in ED 
visits.  I think this cluster is just a promotional sale.”
“OK.  Can you identify which of these historical 
clusters also correspond to promotional sales?”
“Yes, all of these.  Any other clusters of interest?” MBSS
The end goal is to transform the process of pattern discovery into a 
“conversation” between the user and system in which the system takes
“Not really, just more seasonal flu and another promotional sale.”
197
     ,      
an active role in identifying and explaining potentially interesting patterns.
D Current Directions.   
in Spatial Event   
Detection
1. Incorporating learning into detection
2. Very fast detection algorithms   
3. Generalization of spatial methods to non-spatial data
4. Non-aggregated spatial and temporal data
5 I ti i h i f ti f b ti. ncorpora ng r c  n orma on rom o serva ons
6. Detecting multiple, dynamic, irregular clusters
7. Integrating detection, tracking, and response
198
8. Combining sensor placement and sensor fusion
Which subsets to scan?
Since there are exponentially many subsets of the data,      
it is often computationally infeasible to search all of them.
The most common approach is to use domain knowledge          
to restrict our search space: for example, we assume               
that an event will affect a contiguous spatial region,                  
and often further restrict the region size and shape        .
e.g. “search over circular regions centered at a data point” Æ only N2 regions instead of 2N.
Another common approach is to perform a heuristic                 
search.  For example, we can greedily grow subsets             
starting from each data record, repeatedly adding the 
additional record that gives the highest scoring subset.1
Tradeoff: much more efficient than naïve search, but not guaranteed to find highest scoring region.
In some cases, we can find the highest-scoring subsets         
199
without computing the scores of all possible subsets!
1Neill et al., in Scan Statistics: Methods and Applications, 2009.
Fast spatial scan over rectangles
Th b f h i l
Consider searching over all rectangular 
regions for data aggregated to a N x N grid.
e num er o  searc  reg ons sca es as 
O(N4), making an exhaustive search 
computationally infeasible for large N.
We can find the highest scoring clusters without an exhaustive search 
using branch and bound: we keep track of the highest region score 
found so far, and prune sets of regions with provably lower scores.1,2
A new multi-resolution data structure, 
the overlap-kd tree, enables us to 
k thi h ffi i t
We can now monitor nationwide 
health data in 20 minutes (vs. 1 week).
ma e s searc  e c en .
200
1Neill and Moore, KDD 2004
2Neill et al., NIPS 2004
Fast spatial scan over rectangles
Th b f h i l
Consider searching over all rectangular 
regions for data aggregated to a N x N grid.
e num er o  searc  reg ons sca es as 
O(N4), making an exhaustive search 
computationally infeasible for large N.
We can find the highest scoring clusters without an exhaustive search 
using branch and bound: we keep track of the highest region score 
found so far, and prune sets of regions with provably lower scores.1,2
Other recent work on efficiently maximizing scan statistics over gridded rectangles
W t l KDD 2009A l t l SODA 2006 KDD 2006 u e  a .,  
Compute scores for subset of rectangles,           
bound scores of other rectangles by               
tiling with evaluated rectangles.
garwa  e  a .,  ,  
Fast approximate optimization of convex 
score functions.  Solution within ε of 
optimal, runtime O((1/ε) N2 log2 N).
201
    
Can be used for non-convex score functions. 
  
Linear-time subset scanning
• In certain cases, we can optimize F(S) over 
the exponentially many subsets of locations, 
1while evaluating only O(N) regions.
• Many commonly used scan statistics have 
the property of linear-time subset scanning:
• Just sort the locations from highest to lowest 
priority according to some function    …
• … then search over groups consisting of the 
top-k highest priority locations for k = 1 N   ,    .. .
The highest scoring subset is 
guaranteed to be one of these!
202
     
1Neill, ISDS 2008
The LTSS property
• Example: Poisson statistics (Kulldorff, EBP)
• F(S) = F(C, B), where C = Σ ci and B = Σ bi are 
the aggregate count and baseline of region S.
• Sort locations si by the ratio of observed to 
expected count, ci / bi. 
• Given the ordering s(1) … s(N), we can prove that 
the top-scoring subset consists of the locations 
s(1) … s(k) for some k, 1 ≤ k ≤ N.
• This follows from the facts that F(S) is convex, 
increasing with C and decreasing with B.
203
How to use LTSS in practice?
• Simplest case: assume all subsets are equally 
likely (e.g. outbreak that does not cluster spatially)
• LTSS gives highest-scoring subset by evaluating N
subsets instead of 2N for naïve search.
B t hat if e ant to se spatial information to• u  w   w  w   u     
constrain our search over subsets?
• Soft constraints: some subsets of locations are more        
likely than others (non-uniform priors).
• Hard constraints: some subsets of locations are not 
allowed (e.g. non-contiguous or highly irregular regions).
• In most cases, we cannot use LTSS directly to find 
th ti l b t bj t t th t i t
204
e op ma  su se  su ec  o ese cons ra n s.
How to use LTSS in practice?
• We can use LTSS to speed up the 
constrained search problem in three ways:
1) For some hard constraints, we can compute the 
optimal subset by maximizing over multiple 
LTSS searches (e.g. fast localized scan).
2)We can use the unconstrained maximum score 
as an upper bound on the constrained 
maximum (e.g. fast scan over rectangles).
3) For heuristic search, we can use the 
unconstrained maximum as a starting point, or 
use the LTSS ordering to guide our search
205
       .
Fast localized scan
• Maximize the spatial scan statistic over regions 
consisting of a “center” location si and any subset
of its k-nearest neighbors, for a fixed constant k.
• This is similar to FlexScan1 but does not force the 
i t b tireg on o e con guous.
• Naïve search requires O(N · 2k) time and is 
t ti ll i f ibl f k > 20compu a ona y n eas e or   .
• For each center, we can search over all subsets 
of its k nearest neighbors in O(k) using LTSS  -      , 
thus requiring total time O(Nk) + O(N log N) for 
sorting by priority.
206
  
1Tango and Takahashi, IJHG, 2005
Evaluation on ED data
We compared the time needed to perform localized 
scans with and without LTSS, as a function of the number 
of neighbors k for 281 days of Emergency Department  ,       
visit data from Allegheny County.
k  = 15: 869x speedup 
(4.42 sec. vs. over 1 hour)
k = 20: 38 460x speedup   ,   
(4.69 sec. vs. over 50 hours)
k = 30: 5.21 sec. vs. ~9 yrs.
k = 88: 8.12 sec. vs. ~1026 yrs.
207
Linear-time subset scanning
Linear-time subset scanning is a powerful and useful tool that enables 
us to speed up a wide variety of spatial event detection methods.
The Poisson, Gaussian, and nonparametric spatial scan statistics all 
satisfy the LTSS property, as do many other possible statistics.
LTSS makes “all subsets” search computationally feasible, makes 
localized scans feasible even for large values of k, and speeds up 
searches over “all distinct rectangles” by 2-3 orders of magnitude.         
Many other LTSS-enabled searches are possible, and these will 
enable huge speedups for an even wider range of problems.
Current work includes extending LTSS to the multivariate and         
space-time scan statistics, developing fast graph scan algorithms,   
208
and incorporating LTSS into our Bayesian scan framework.
D Current Directions.   
in Spatial Event   
Detection
1. Incorporating learning into detection
2. Very fast detection algorithms   
3. Generalization of spatial methods to non-spatial data
4. Non-aggregated spatial and temporal data
5 I ti i h i f ti f b ti. ncorpora ng r c  n orma on rom o serva ons
6. Detecting multiple, dynamic, irregular clusters
7. Integrating detection, tracking, and response
©2009 Carnegie Mellon University 209
8. Combining sensor placement and sensor fusion
Anomalous Group Detection
1. Learn a Bayesian Network model for the null hypothesis 
H0 (no events) from the training data.
US Port F Port
Shipper Name
Importer Name
Shipping Line
Country
Commodity
Size
Weight
Value
2. To evaluate a group of records S:
1. Fit the alternate hypothesis Bayesian Network (H1(S)) parameters 
using Data S.
2. Compute the group score using the likelihood ratio:
3. Greedily grow a group from each record, and output the 
)H|P(S
(S))H | P(S
 F(S)
0
1=
©2009 Carnegie Mellon University 210
groups with highest score.
Neill et al., in Scan Statistics: Methods and Applications, 2009.
D Current Directions.   
in Spatial Event   
Detection
1. Incorporating learning into detection
2. Very fast detection algorithms   
3. Generalization of spatial methods to non-spatial data
4. Non-aggregated spatial and temporal data
5 I ti i h i f ti f b ti. ncorpora ng r c  n orma on rom o serva ons
6. Detecting multiple, dynamic, irregular clusters
7. Integrating detection, tracking, and response
©2009 Carnegie Mellon University 211
8. Combining sensor placement and sensor fusion
References
D Agarwal A McGregor J M Phillips S Venkatasubramanian and Z Zhu Spatial scan statistics:• . , . , . . , . ,  . .    
approximations and performance study. Proc. 12th ACM SIGKDD Conf. on Knowledge Discovery and 
Data Mining, 24–33, 2006.
• D. Agarwal, J. M. Phillips, and S. Venkatasubramanian. The hunting of the bump: On maximizing 
statistical discrepancy.  Proc. Symposium on Discrete Algorithms, 1137–1144, 2006.
• D Donoho and J Jin Higher criticism for detecting sparse heterogeneous mixtures Annals of.   . .       .   
Statistics, 32(3): 962–994, 2004.
• L. Duczmal and R. Assuncao. A simulated annealing strategy for the detection of arbitrary shaped 
spatial clusters. Computational Statistics and Data Analysis, 45:269–286, 2004.
• M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large 
spatial databases with noise. Proc. 2nd Intl. Conf. on Knowledge Discovery and Data Mining, 1996.              
• J. Friedman and N. Fisher. Bump hunting in high dimensional data. Statistics and Computing, 9(2):1–
20, 1999.
• V. Iyengar. On detecting space-time clusters. Proc. 10th ACM SIGKDD Conf. on Knowledge Discovery 
and Data Mining, 587–592, 2004.
• M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6): 1481–             
1496, 1997.
• M. Kulldorff. Prospective time-periodic geographical disease surveillance using a scan statistic.  
Journal of the Royal Statistical Society A, 164: 61–72, 2001.
• M. Kulldorff, L. Huang, L. Pickle, and L. Duczmal. An elliptic spatial scan statistic. Statistics in 
Medicine, 25:3929–3943, 2006.
• M. Kulldorff, F. Mostashari, L. Duczmal, W. K. Yih, K. Kleinman, and R. Platt. Multivariate scan 
statistics for disease surveillance. Statistics in Medicine, 26: 1824–1833, 2007.
• M. Makatchev and D.B. Neill. Learning outbreak regions in Bayesian spatial scan statistics. Proc. 
ICML/UAI/COLT Workshop on Machine Learning for Health Care Applications, 2008.
• D.B. Neill and A.W. Moore. Rapid detection of significant spatial clusters. Proceedings of the 10th 
212
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 256-265, 2004. 
• D.B. Neill, A.W. Moore, F. Pereira, and T. Mitchell. Detecting significant multidimensional spatial 
clusters. In L.K. Saul, et al., eds., Adv. Neural Information Processing Systems 17, 969-976, 2005.
References
• D.B. Neill, A.W. Moore, M. Sabhnani, and K. Daniel. Detection of emerging space-time clusters. Proc.               
11th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 218-227, 2005.
• D.B. Neill and A.W. Moore. Anomalous spatial cluster detection. Proceedings of the KDD 2005 
Workshop on Data Mining Methods for Anomaly Detection, 2005.
• D.B. Neill, A.W. Moore, and G.F. Cooper. A Bayesian spatial scan statistic. In Y. Weiss, et al., eds. 
Advances in Neural Information Processing Systems 18, 1003-1010, 2006.
• D.B. Neill. Detection of spatial and spatio-temporal clusters. Ph.D. thesis, Carnegie Mellon University, 
Department of Computer Science, 2006.
• D.B. Neill.  Incorporating learning into disease surveillance systems.  Advances in Disease 
Surveillance 4: 107, 2007.
• D.B. Neill and W.L. Gorr. Detecting and preventing emerging epidemics of crime. Advances in Disease 
Surveillance 4: 13, 2007.
• D.B. Neill and J. Lingwall. A nonparametric scan statistic for multivariate disease surveillance. 
Advances in Disease Surveillance 4: 106, 2007.
• D.B. Neill. Fast and flexible outbreak detection by linear-time subset scanning. Advances in Disease 
Surveillance 5: 48, 2008.
• D.B. Neill. An empirical comparison of spatial scan statistics for outbreak detection. International 
Journal of Health Geographics 8: 20, 2009.
• D.B. Neill, G.F. Cooper, K. Das, X. Jiang, and J. Schneider. Bayesian network scan statistics for 
multivariate pattern detection. In J. Glaz, et al., eds., Scan Statistics: Methods and Applications, 2009.
• D.B. Neill. Expectation-based scan statistics for monitoring spatial time series data. International 
J l f F ti 2009 iourna  o  orcas ng, , n press.
• D.B. Neill and G.F. Cooper.  A multivariate Bayesian scan statistic for early event detection and 
characterization.  Machine Learning, 2009, in press.
• G. P. Patil and C. Taillie. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Envir. 
Ecol. Stat., 11: 183–197, 2004.
T T d K T k h hi A fl ibl h d ti l t ti ti f d t ti l t I tl J l
213
• . ango an  . a a as .  ex y s ape  spa a  scan s a s c or e ec ng c us ers.  n . ourna  
of Health Geographics, 4: 11, 2005.
• M. Wu, X. Song, C. Jermaine, S. Ranka and J. Gums. A LRT Framework for Fast Spatial Anomaly 
Detection.  Proc. 15th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 2009.