Tutorial on Event Detection KDD 2009 Weng-Keen Wong School of EECS Oregon State University Daniel B. Neill H. J. Heinz III College Carnegie Mellon University wong@eecs.oregonstate.edu neill@cs.cmu.edu Introduction • Many real-world tasks in surveillance, scientific discovery and data cleaning involve monitoring routinely collected data • Want to detect “events of interest” which are usually anomalous events that rarely occur • These events typically affect a subgroup of the data rather than an individual data point E l t f ll• xamp es o o ow… 2 Introduction Early detection of disease outbreaks • Bioterrorist attacks are a very real, and scary, possibility 100 kg anthrax released over D C , . ., could kill 1-3 million and hospitalize millions more. • Emerging infectious diseases “Conservative estimate” of 2-7 million deaths from pandemic avian influenza. • Better response to common outbreaks 3 (seasonal flu, GI) Introduction Benefits of early detection: Reduces cost to society, both in lives and in dollars! Without treatment, 95% mortality rate Post-symptomatic treatment, 40% mortality rate Pre-symptomatic treatment, 1% mortality rate incubation stage 1 stage 2 Day 0 Day 10Day 4 Exposure to inhalational Acute respiratory distress high fever Flu-like symptoms: headache cough fever anthrax , , shock, death , , 4 DARPA estimate: a two-day gain in detection time could reduce fatalities by a factor of six. Introduction Start of Definitive Early detection is hard symptoms diagnosis Lag time incubation stage 1 stage 2 Day 0 Day 10Day 4 Buys OTC drugs? Skips work/school? 5 Visits doctor/hospital/ED? Introduction Start of symptoms Definitive diagnosis incubation stage 1 stage 2 Day 0 Day 10Day 4 Buys OTC drugs? Cough medication sales in affected area D ft 6 ays a er attack Introduction Start of symptoms Definitive diagnosis incubation stage 1 stage 2 We can achieve very early detection of outbreaks by gathering syndromic data and identifying Day 0 Day 10Day 4 , emerging spatial clusters of symptoms. Buys OTC drugs? Cough medication sales in affected area D ft 7 ays a er attack Introduction Spike in sales of pediatric electrolytes near Columbus, Ohio 8 Introduction Application to law enforcement: detecting crime hot-spots. Crime hot-spot detection Hot-spot = neighborhood or other spatial area with an unexpected rise in crime. Goal: early detection to enable targeted enforcement. Even better goal: predict where hot spots of crime are going to occur, and prevent them. We demonstrated1 that hot-spots of violent crime can be predicted 1-3 weeks in advance, by detecting clusters of “leading indicator” crimes such as disorderly conduct trespass and simple assault 9 , , . 1D.B. Neill and W.L. Gorr, Proc. ISDS Annual Conf. 2007. Introduction Detecting clusters of pipe breaks Application to civil engineering: Monitoring a city’s water distribution system to detect anomalous clusters of pipe breakage 1 Pittsburgh . Different distance metric: flow distance along pipes, not Euclidean distance. Must account for pipe age, dimensions and material when, computing expected number of breaks. 10 1D. Olivera, et al., in preparation. Thanks to Daniel Olivera for providing this picture. Introduction Goal is to detect patterns of suspicious hi t di t ill l Detecting illicit container shipments s pmen s correspon ng o ega activity (terrorism, smuggling, etc.) We can achieve this goal by detecting anomalous, self-similar groups of records. No “spatial” dimension in the standard sense, but we can define a dissimilarity metric between shipments and detect anomalous patterns in metric space. FPORT USPORT COUNTRY SLINE VESSEL SHIPPER NAME F NAME COMMODITY SIZE MTONS VALUE YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRETRI_NET EMPTY_RACK 0 5.6 27579 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OUSED_TIRE 2 13.43 9497 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OUSED_TIRE 2 13.43 9497 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRETRI_NET CRUDE_IODINE_PURITY 1 17.68 251151 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169 11 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OUSED_TIRES 2 13.43 9497 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEMPTY_CONTAINERS 0 0 0 YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEMPTY_CONTAINERS 0 0 0 Introduction Environmental Monitoring Remote Sensors are becoming the new standard for collecting field data Nearly continuous observation of a given domain, generating large l f d tvo umes o a a Data must be cleaned before being given to outside researchers. R i l f l d tequ res remova o anoma ous a a points. Anomalies can be simple or very challenging! From: Dereszynski, E., Dietterich, T. (2007). Probabilistic Models for Anomaly Detection in Remote Sensor Data Streams. Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI-2007). 75-82. Thanks to Ethan Dereszynski for the slide materials. Introduction Central Met. Week 40, 1999 Simple Anomaly Types 2 5 3 0 1.5m 2.5m 3.5m 4.5m-6999.99 Deg. C. 1 5 2 0 m p . ( C ) 1 0 1 T e m 26900 27000 27100 27200 27300 27400 27500 0 5 13 15 Minute Interval Introduction More difficult anomaly types Central Met Week 6 1996 2 5 3 0 . , 1.5m 2.5m 3.5m 4.5m 2 0 C ) 1 0 1 5 T e m p . ( Broken Sun shield 0 5 14 4100 4200 4300 4400 4500 4600 4700 0 15 Minute Interval Introduction 2 0 Upper Lookout Met. Weeks 3-7, 1996 1.5m 2.5m 3 5m2 5m normal 1 0 1 5 . 4.5m 2.5m suppressed . 5 1 m p . ( C ) - 5 0 T e m - 1 0 1.5m Buried 1.5m Exposed 15 2000 2500 3000 3500 4000 4500 5000 - 1 5 15 Minute Interval Introduction Suppose you have data D = {x1, ..., xn} where xi arrives over time • Can we detect a time t when an event of interest occurs? • The question we are asking is: at what point in time is the data is “different”? Introduction Simple example: xi is a scalar eg. a count 10 12 6 8 C o u n t 0 2 4 0 5 10 15 20 25 Time Introduction Harder example: xi is a vector of categorical values Gender Age Zip M 60s 97332 Gender Age Zip F 50 97330 Gender Age Zip M 20s 97331 Gender Age Zip F 20 97330 Gender Age Zip F 20s 97331 Gender Age Zip M 40s 97331 s Gender Age Zip M 20s 97331 s Time06/01/09 06/01/09 09:00 09:30 Introduction Even harder example: xi is spatial data 1. How do we determine if the data is “different”? 2. How can we figure out what makes the data different? Time06/01/09 06/01/09 09:00 09:30 Introduction Goals of event detection: • Identify if an event of interest has occurred • Characterize the event • Pinpoint the affected subgroup of the data ie. what features describe the event (eg. spatial area, time duration)? / f ?• What is the severity magnitude o the event • Detect as accurately as possible • Detect as early as possible Introduction How is event detection different from: 1. Supervised Learning: • Abnormal events are extremely rare, normal events are plentiful 2. Clustering: • Clustering = partitioning data into groups N t th fi di t ti ti ll l• o e same as n ng s a s ca y anoma ous groups 3. Outlier Detection: Events of interest are usually not individual outliers• • The event typically affects a subgroup of the data rather than a single data point Introduction How do we evaluate event detection algorithms? • Can’t use prediction accuracy for “event” vs “non-event” ¾ Class imbalance: many more “non-events” than “ t ”even s ¾ Guessing “non-event” all the time results in very good accuracy • Most event detection algorithms have a tunable threshold for when an alarm is raised ¾ Trades off accuracy and false alarm rate ¾ Need performance over multiple thresholds Introduction How do we evaluate event detection algorithms? Better o s i t i v e R a t e c t i o n T i m e Worse Worse T r u e P o D e t e c Better False Positive Rate False Positive Rate To evaluate accuracy, use a Receiver Operating Characteristic (ROC) curve To evaluate timeliness of detection, use an Activity Monitoring Operating Characteristic (AMOC) curve (Fawcett and Provost 1999) Introduction Challenges: • Incorporating spatial and/or temporal information • Integrating information from multiple features or data streams • Distinguishing between multiple event types • Computational complexity Outline 1. Introduction 2 Temporal Event Detection. 3. Spatio-Temporal Event Detection 4 F t W k. u ure or Univariate Temporal Methods Univariate Methods Examples of univariate time series 3 0 3 5 Central Met. Week 35, 2006 1.5m 1.5m 4.5m 4.5m 0 1 5 2 0 2 5 T e m p . ( C ) From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg S E (2002) Early statistical detection of 23500 23600 23700 23800 23900 24000 24100 24200 5 1 0 15 Minute Interval From: Dereszynski, E., Dietterich, T. (2007). Probabilistic M d l f A l D t ti i R t S D t, . . . anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249) o e s or noma y e ec on n emo e ensor a a Streams. Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI-2007). 75-82. Univariate Methods This is a time series of counts of primary- physician visits in data from Norfolk in D b 2001 I dd d f k tb kecem er . a e a a e ou rea , starting at a certain date. Can you guess when? Univariate Methods This is a time series of counts of primary- physician visits in data from Norfolk in D b 2001 I dd d f k tb k Here (much too high for a Friday) ecem er . a e a a e ou rea , starting at a certain date. Can you guess when? (Ramp attack) Univariate Methods 20 25 10 15 S i g n a l 0 5 0 2 4 6 8 10 12 14 16 18 20 Time • Easy case: when does an “event” happen? • How can we detect this with an algorithm? Univariate Methods 20 25 10 15 S i g n a l 0 5 0 2 4 6 8 10 12 14 16 18 20 Time General framework: 1 Learn model to predict expected signal value. 2. Measure difference between actual and expected 3. Compute alarm value Univariate Methods Methods we will discuss • Control Chart (Shewhart 1931) • Moving Average • Exponentially Weighted Moving Average (Roberts 1959) • CUSUM (Page 1954) • Regression For a reference on Statistical Quality Control techniques such as control charts, EWMA and CUSUM, see (Montgomery 2001) Univariate Methods (Control Chart) 20 25 5 10 15 S i g n a l Upper Control Limit 0 0 2 4 6 8 10 12 14 16 18 20 Time Control chart (from Statistical Quality Control) • Estimate and from data up to current timeμ̂ σ̂ • Upper control limit = ∑ = = N i iXN 1 1μ̂ ∑ = −−= N i iXN 1 2)ˆ( 1 1 ˆ μσ ˆ3ˆ + • Raise alarm if upper control limit exceeded σμ Univariate Methods (Control Chart) 20 25 5 10 15 S i g n a l 0 0 2 4 6 8 10 12 14 16 18 20 Time Control chart (from Statistical Quality Control) • Alternately, use • And signal alarm when alarm level > threshold ⎟⎠ ⎞⎜⎝ ⎛ −Φ= σ μ ˆ )ˆ,0max( level Alarm i X N(0,1)for CDF where =Φ Univariate Methods (Control Chart) Alarm LevelControl chart applied to Norfolk data Univariate Methods (Control Chart) Alarm LevelControl chart applied to Norfolk data (long term) Univariate Methods (Control Chart) Alarm LevelControl chart applied to Norfolk data (long term) Univariate Methods (Moving Average) • Let W be the window size • A moving average window predicts the following: )( 1 +++= XXXX ... 111 −−−+ Wtttt W Setting the alarm value: • Fit a Gaussian to the W observations within the window ie. estimate and C l l t th l l l b f μ̂ σ̂ • a cu a e e a arm eve as e ore ⎟⎠ ⎞⎜⎝ ⎛ −Φ= σ μ ˆ )ˆ,0max( level Alarm i X Univariate Methods (Moving Average) Moving Average applied to Norfolk data Univariate Methods (Moving Average) Moving Average applied to Norfolk data (long term) Univariate Methods (Moving Average) Moving Average applied to Norfolk data (long term) Univariate Methods (EWMA) • Exponentially Weighted Moving Average (EWMA) - a variation on the moving average • Let Zi be the EWMA statistic (which is monitored): )1( ZXZ λλ 10h ≤< λ1−−+= iii w ere Observations in the past receive a decreasing amount of weight Univariate Methods (CUSUM) • CUmulative SUM Statistics • Good at detecting shifts from the mean more quickly than control chart • Keep a running sum of “surprises”: a sum of excesses each day over the mean Wh thi d th h ld H i l• en s sum excee s res o , s gna alarm and reset sum Univariate Methods (CUSUM) • r = reference value eg. mean X ith b ti• i = o serva on • Si = ith cumulative sum +−=−+−= −= SrXrXrXS rXS 11 )()()( k 12122 M ∑ = −+−= i kkk SrXS 1 1)( When a shift from the mean occurs S will , i start to increase Univariate Methods (CUSUM) • If we are only tracking increases, we can do the f ll io ow ng: ))(,0max( 1−+−= kkk SrXS Ensures we don’t go below 0 • We can also add a tolerance or a slack K ))(,0max( 1−++−= kkk SKrXS Univariate Methods (CUSUM) CUSUM applied to Norfolk data Univariate Methods (CUSUM) CUSUM applied to Norfolk data (long term) Univariate Methods (CUSUM) CUSUM applied to Norfolk data (long term) Univariate Methods • Data often consists of trends eg. • Seasonal effect • Day-of-week effects • Holiday effect • None of the methods discussed so far explicitly model these trends • Regression (eg. linear regression) can be d ith t t f th t duse w ex ra erms or e ren s Univariate Methods (Regression) Regression example to model seasonal effects and Monday effects: iiii IsMondaylightHoursOfDayY εβββ +++= )()( 210 Boolean feature – adds a “bump” to the value of Y if it is a Monday⎟⎞⎜⎛ −31)July since days num(2sin ππ Could be defined as: Normally distributed noise with mean 0, known variance σ2 ⎠⎝ 2365.25 Regression learns the β parameters from data to minimize the residual sum of squares Univariate Methods (Regression) Regression applied to Norfolk data using H OfD li ht dours ay g an IsMonday terms Univariate Methods (Regression) Regression applied to Norfolk data using H OfD li ht dours ay g an IsMonday terms (long term) Univariate Methods (Other) Other state-of-the-art methods not discussed in this tutorial • Box-Jenkins models eg. ARMA, ARIMA • Wavelets • Change-point detection • Kalman filters • Hidden Markov Models Multivariate Temporal Methods Multivariate Methods Each data point (recorded at some time point) is now a multivariate vector Date Time Gender Age Prodrome Home Location Work Location Many more… eg. patient records from an Emergency Department 6/1/09 9:12 M 20s Fever NE NE … 6/1/09 10:45 F 40s Diarrhea NE NE … 6/1/09 11:03 F 60s Respiratory NE N … 6/1/09 11:07 M 60s Diarrhea E W … : : : : : : : : Multivariate Methods How are patient records from 6/1/09 different Date Time Gender Age Prodrome Home Location Work Location Many more… 6/1/09 9:12 M 20s Fever NE NE … 6/1/09 10:45 F 40s Diarrhea NE NE … 6/1/09 11:03 F 60s Respiratory NE N … : : : : : : : : from patient records from 6/2/09? Date Time Gender Age Prodrome Home Location Work Location Many more… 6/2/09 9:15 M 60s Respiratory E NE … 6/2/09 10:01 F 50s Respiratory N NW … 6/2/09 13:05 F 40s Respiratory SW SW … : : : : : : : : Multivariate Methods Note: need to split data into two groups according to time: 1 T i i d t l d l. ra n ng: use o earn mo e Date Time Gender Age Prodrome Home Location Work Location Many more… 6/1/09 9:12 M 20s Fever NE NE … 6/1/09 10:45 F 40s Diarrhea NE NE … 6/1/09 11:03 F 60s Respiratory NE N … : : : : : : : : 2. Testing: used to identify events with respect to the learned model D t Ti G d A P d H W k Ma e me en er ge ro rome ome Location or Location any more… 6/2/09 9:15 M 60s Respiratory E NE … 6/2/09 10:01 F 50s Respiratory N NW … 6/2/09 13:05 F 40s Respiratory SW SW … : : : : : : : : Multivariate Methods We make the following distinction • Multivariate Changepoint Detection: • Detects that a change has happened • Does not identify the subgroup of data that has changed the most • Multivariate Event Detection • Detects that a change has happened Id tifi th b th t h h d th• en es e su group a as c ange e most Multivariate Methods Outline • Multivariate Changepoint Detection • Multivariate Statistical Quality Control • Others • Multivariate Event Detection E i P tt• merg ng a erns • STUCCO • WSARE 2.0 • WSARE 3.0 Multivariate Changepoint Detection (Hotelling’s T2) Multivariate version of control chart is Hotelling’s T2 ˆ statistic (Hotelling 1931) )ˆ()ˆ( 12 μXSμX ii −−= −cT Sample size that covariance Estimated matrix was estimated from covariance matrix Estimated mean Other multivariate statistical quality control methods: • Multivariate CUSUM (Crosier 1988) • Multivariate EWMA (Lowry et al. 1992) All make strong assumptions about the underlying model Multivariate Changepoint Detection Other methods: • Cross-match test (Rosenbaum 2005) • kdq-tree (Dasu et al. 2006) D it T t (S t l 2007)• ens y es ong e a . Multivariate Event Detection General framework: 1.Learn model to predict expected signal value for the given subgroup 2. Measure difference between actual and expected 3.Compute alarm value (now more involved) Multivariate Event Detection Algorithm Data Model Measuring Differences Emerging Patterns (Dong and Li 1999) Categorical Counts Increase in support ratio STUCCO (Bay and Pazzani 1999) Categorical Counts Chi-square, Bonferroni WSARE 2.0 (Wong et al. 2005) Categorical Counts Fisher’s Exact test, Randomization test WSARE 3.0 (Wong et al. 2005) Categorical Bayesian network Fisher’s Exact test, Randomization Test Multivariate Event Detection (Categorical Data) How can we find differences in multivariate categorical data? Date Time Gender Age Prodrome Home L ti Work L ti Many oca on oca on more… 6/2/09 9:15 M 60s Respiratory E NE … 6/2/09 10:01 F 50s Respiratory N NW … 6/2/09 13:05 F 40s Respiratory SW SW … : : : : : : : : Idea from association rule mining: Characterize differences by rules ie. conjunctions of attribute value pairs - Multivariate Event Detection (Categorical Data) Training Date Time Gender Age Prodrome Home Location Work Location Many more… 6/1/09 9:12 M 20s Fever NE NE … 6/1/09 10:45 F 40s Diarrhea NE NE … 6/1/09 11:03 F 60s Respiratory NE N … : : : : : : : : Testing Date Time Gender Age Prodrome Home Location Work Location Many more… 6/2/09 9:15 M 60s Respiratory E NE … 6/2/09 10:01 F 50s Respiratory N NW … Find which rules predict unusually high proportions in test data when compared 6/2/09 13:05 F 40s Respiratory SW SW … : : : : : : : : to the training data eg. 92/180 records from Testing have Gender = Male AND Age = 60s 43/200 records from Training have Gender = Male AND Age = 60s Multivariate Event Detection (Emerging Patterns) • Let D = {X1, …, XN} be a data set with N data points • Define the support of a rule R to be: || )( supp D Rcount (R) DD = where countD(R) = number of data points th t t h l Ra ma c ru e Multivariate Event Detection (Emerging Patterns) Suppose we are given data sets D1 and D2 . Define the GrowthRate(R) from D1 to D2 as: ⎧ ⎪⎪ ⎪ ⎨ ≠=∞ == = 0(R)suppand 0(R)suppif, 0(R)supp and 0(R)supp if ,0 )( D2D1 D2D1 RGrowthRate ⎪⎪ ⎪ ⎩ otherwise , )(supp )(supp 1 2 R R D D Multivariate Event Detection (Emerging Patterns) • Given ρ > 1, a rule R is said to be a ρ-emerging pattern from D1 to D2 if GrowthRate(R) ≥ ρ • Goal: For a given ρ, find all ρ-emerging patterns See (Dong and Li 1999) for efficient algorithms to find emerging patterns using borders to describe large collections of itemsets Search and Testing for U d d bl C in erstan a e ons stent Contrasts (STUCCO) Multivariate Methods (STUCCO) • Define a contrast set as a conjunction of attribute- value pairs (ie. what we defined as a rule) • Search for contrast sets CS such that: 1. )|()|( TestingCSPTrainingCSP ≠ Multivariate Methods (STUCCO) • Define a contrast set as a conjunction of attribute- value pairs • Search for contrast sets CS such that: 1. )|()|( TestingCSPTrainingCSP ≠ This says that the distribution of the contrast set CS is different “ ff fin the Training and Testing data. Di erent” will be de ined shortly. Multivariate Methods (STUCCO) • Define a contrast set as a conjunction of attribute- value pairs • Search for contrast sets CS such that: 1. )|()|( TestingCSPTrainingCSP ≠ 2. δ≥− |),(),(| TestingCSSupportTrainingCSSupport Multivariate Methods (STUCCO) • Define a contrast set as a conjunction of attribute- value pairs • Search for contrast sets CS such that: 1. )|()|( TestingCSPTrainingCSP ≠ 2. δ≥− |),(),(| TestingCSSupportTrainingCSSupport The support of a contrast set is the percentage of data points (in the Training / Testing data) where the contrast set is true Multivariate Methods (STUCCO) Search for contrast sets involves efficient breadth-first search of a set enumeration tree (Rymon 1992) { } Gender=M Gender=F Age=Young Age=Old . . . Gender=M AND Age=Young Gender=M AND Age=OLD Gender=F AND Age=Young Gender=F AND Age=Old . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A variety of pruning methods makes this search efficient Multivariate Methods (STUCCO) • How do we determine that the distributions of contrast sets are different between training and testing? • For each contrast set, construct a 2x2 contingency t bla e Count (Training) Count (Testing) Age = Young 25 52 Age ≠ Young 101 206 Multivariate Methods (STUCCO) • Perform Chi-Square test of independence • Compute χ2 statistic: • Where Oij = observed frequency count for the cell in row i and ∑∑ = = −= 2 1 2 1 2 2 )( i j ij ijij E EOχ column j • Eij is the expected frequency count for the cell in row i and column j given independence of the row and column variables ie . NOOE i j ijijij / 2 1 2 1 ⎟⎟⎠ ⎞ ⎜⎜⎝ ⎛= ∑ ∑ = = Where N = total number of observations in all cells • To obtain a p-value, compare χ2 statistic to a chi-square distribution Multivariate Methods (STUCCO) • If we perform a Chi-Square test on the 2x2 contingency table below, we get a p-value of 0.0074 • This is significant at the α = 0.05 level • Report contrast sets with p-value < 0.05 C t (T i i ) C t (T ti )oun ra n ng oun es ng Age = Young 25 100 Age ≠ Young 101 206 Multivariate Methods (STUCCO) • But…we can’t interpret this p-value at face value • The search suffers from a multiple hypothesis testing problem • Need to correct the p-values to compensate for multiple hypothesis testing Multivariate Methods (STUCCO) Multiple Hypothesis Testing • Suppose we reject null hypothesis when p-value < α, where α = 0.05 • For a single hypothesis test, the probability of a false positive = α • Suppose we do 1000 tests, one for each possible rule • Probability(false positive) could be as bad as: 1 – ( 1 – 0.05)1000 >> 0.05 Multivariate Methods (STUCCO) Bonferroni Correction: • If you are performing n hypothesis tests for hypotheses h1, …, hn, • Adjust significance level for test i to be i αα = • Reject hypothesis hi if n ii α≤pvalue Multivariate Methods (STUCCO) Two problems: 1. We do not know n as we incrementally mine each level of th te ree 2. Same cutoff for contrast sets with different numbers of attribute-value pairs (want more power on smaller conjunctions) STUCCO’s solution α )|,|/ 2 min( 1−= llll C αα Significance threshold at level l of the tree Number of candidates at level l Multivariate Methods (STUCCO) Traverse set-enumeration tree using Breadth-First Search Summary For each contrast set at depth l of tree: 1. Form 2x2 contingency table 2. Compute p-value using chi-square test 3. Account for multiple hypothesis testing by computing significance level αl 4. If p-value < αl , report contrast set Multivariate Methods (WSARE 2.0) For each rule in rule set: Summary 1. Form 2x2 contingency table 2. Compute rule score using Fisher’s Exact test / Chi- square test 3. Account for multiple hypothesis testing by randomization test 4. If p-value from randomization test < alarm value, report rule Multivariate Methods (WSARE 2.0) For each rule in rule set: Difference #1 Diff #2 Summary 1. Form 2x2 contingency table 2. Compute rule score using Fisher’s Exact test / Chi- erence square test 3. Account for multiple hypothesis testing by randomization test 4. If p-value from randomization test < alarm value, report rule Difference #3 Multivariate Methods (WSARE 2.0) Difference #1 (from STUCCO) • Rule set defined as all rules with a maximum of k conjunctions of attribute-value pairs • No need to search set enumeration tree - Multivariate Methods (WSARE 2.0) Difference #2 (from STUCCO) • Use of Fisher’s Exact Test for rules with small counts that violate assumptions of Chi- square test Multivariate Methods (WSARE 2.0) Difference #3 (from STUCCO) • Bonferroni correction is very conservative • Increases risk of Type II errors (not rejecting null hypothesis when it is false) • Very hesitant to declare something as an “ t”even • Randomization test is a better alternative with more statistical power Multivariate Methods (WSARE 2.0) Aug 16, 2008 C2 Aug 17, 2008 C3 Aug 17, 2008 C4 Aug 16, 2008 C2 Aug 17, 2008 C3 Aug 24, 2008 C4 Aug 17, 2008 C5 Aug 17, 2008 C6 Aug 17, 2008 C7 Aug 17, 2008 C5 Aug 24, 2008 C6 Aug 17, 2008 C7 Aug 21, 2008 C8 Aug 21, 2008 C9 Aug 22, 2008 C10 Aug 21, 2008 C8 Aug 21, 2008 C9 Aug 22, 2008 C10 Aug 23, 2008 C11 Aug 23, 2008 C12 Aug 24, 2008 C13 A 24 2008 C14 Aug 23, 2008 C11 Aug 23, 2008 C12 Aug 17, 2008 C13 A 17 2008 C14 Randomization Test • Take the training data points and the testing data points Shuffle the ug , ug , . date field to produce a randomized dataset called DBRand • Find the rule with the best score on DBRand. Multivariate Methods (WSARE 2.0) Repeat the procedure on the previous slide for 1000 iterations. Determine how many scores from the 1000 iterations are better than the original score. If the original score were here, it would place in the top 1% of the 1000 scores from the randomization test We would be . impressed and an alert should be raised. Corrected p-value of the rule is: # better scores / # iterations Multivariate Methods (WSARE 3.0) For each rule in rule set: Summary 1. Learn a Bayesian network from training data. Sample a baseline data set using Bayesian network. 2. Form 2x2 contingency table using counts from Baseline data and Testing . 3. Compute rule score using Fisher’s Exact test / Chi- square test 4. Account for multiple hypothesis testing by randomization test 5 If l f d i i l l. p-va ue rom ran om zat on test < a arm va ue, report rule Multivariate Methods (WSARE 3.0) For each rule in rule set: Summary Difference #1 1. Learn a Bayesian network from training data. Sample a baseline data set using Bayesian network. 2. Form 2x2 contingency table using counts from Baseline data and Testing . 3. Compute rule score using Fisher’s Exact test / Chi- square test 4. Account for multiple hypothesis testing by randomization test 5 If l f d i i l l. p-va ue rom ran om zat on test < a arm va ue, report rule Multivariate Methods (WSARE 3.0) Difference #1 (from WSARE 2.0) • Need to account for trends in the data From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249) Multivariate Methods (WSARE 3.0) • Temporal trends in health care data: • Seasonal effects in temperature and weather • Day of Week effects • Holidays • Etc. • Not accounting for these trends can adversely affect the detection time and false positives rate Multivariate Methods (WSARE 3.0) • “Taking into account that today is a public holiday ” Generating the Baseline: … • “Taking into account that this is Spring…” • “Taking into account recent heatwave…” • “Taking into account recent flu levels…” • “Taking into account that there’s a known natural Food- borne outbreak in progress ” … Multivariate Methods (WSARE 3.0) Generating the Baseline: • “Taking into account that today is a public holiday ” … • “Taking into account that this is Spring…” • “Taking into account recent heatwave…” • “Taking into account recent flu levels…” • “Taking into account that there’s a known natural Food- borne outbreak in progress ” … Use a Bayesian network (Pearl 1988) to model the joint probability distribution of the attributes. Multivariate Methods (WSARE 3.0) All Historical D t Obtaining Baseline Data a a Today’s 1. Learn Bayesian Network using Optimal Reinsertion [Moore and Environment Wong 2003] 2. Generate baseline given today’s environment Baseline Multivariate Methods (WSARE 3.0) All Historical D t Obtaining Baseline Data a a Today’s 1. Learn Bayesian Network using Optimal Reinsertion [Moore and Environment Wong 2003] 2. Generate baseline given today’s environment Baseline Multivariate Methods (WSARE 3.0) Divide the data into two types of attributes: • Environmental attributes: attributes that cause trends in the data eg. day of week, season weather flu levels, , • Response attributes: all other non- environmental attributes eg age gender . , Multivariate Methods (WSARE 3.0) When learning the Bayesian network structure, do not allow environmental attributes to have parents. Wh ?y • We are not interested in predicting their distributions • Instead we use them to predict the distributions of the, response attributes Season Day of Week Weather Flu Level Multivariate Methods (WSARE 3.0) All Historical D t Obtaining Baseline Data a a Today’s 1. Learn Bayesian Network using Optimal Reinsertion [Moore and Environment Wong 2003] 2. Generate baseline given today’s environment Baseline Multivariate Methods (WSARE 3.0) Season Day of Week Weather Flu Level Today Winter Monday Snow High Suppose we know the following for today: Season = Winter Day of Week = Monday Weather = Snow Flu Level = HighWe fill in these values for the environmental attributes in the learned Bayesian network We sample 10000 records from the Bayesian network d k thi d t t th Baseline an ma e s a a se e baseline Multivariate Methods (WSARE 3.0) Season Day of Week Weather Flu Level Today Winter Monday Snow High Suppose we know the following for today: Season = Winter Day of Week = Monday Weather = Snow Flu Level = HighWe fill in these values for the Sampling is easy because environmental attributes in the learned Bayesian environmental attributes are at the network top of the Bayes Net We sample 10000 records from the Bayesian network d k thi d t t th Baseline an ma e s a a se e baseline Multivariate Methods (WSARE 3.0) Season Day of Week Weather Flu Level Today Winter Monday Snow High Suppose we know the following for today: Season = Winter Day of Week = Monday Weather = Snow Flu Level = HighWe fill in these values for the environmental attributes in the learned Bayesian An alternate possible technique network is to use inference We sample 10000 records from the Bayesian network d k thi d t t th Baseline an ma e s a a se e baseline Multivariate Methods (WSARE 3.0) For each rule in rule set: Summary 1. Learn a Bayesian network from training data. Sample a baseline data set using Bayesian network. 2. Form 2x2 contingency table using counts from Baseline data and Testing . 3. Compute rule score using Fisher’s Exact test / Chi- square test 4. Account for multiple hypothesis testing by randomization test 5 If l f d i i l l. p-va ue rom ran om zat on test < a arm va ue, report rule Multivariate Methods (WSARE 3.0) Side Note: • Conditional Anomaly Detection (Song et al. 2007) is a similar approach • Uses a Gaussian Mixture Model instead of a Bayesian Network • Applicable to continuous multivariate data But: It discovers individual data points that are anomalies, not anomalous groups of data points Multivariate Methods Open Questions • What about continuous features? • What about mixed discrete and continuous features? • Can we develop faster methods, especially th th t id d i ti t ti ?ose a can avo ran om za on es ng • Can we discover interesting (not just statistically significant) events? Acknowledgements • We would like to thank the following individuals/groups for their slide materials: • AUTON Lab (Carnegie Mellon University) • RODS Lab (University of Pittsburgh) • Ethan Dereszynski • Univariate temporal methods section based in part on an earlier tutorial on detection algorithms for biosurveillance by Andrew Moore References • [Bay and Pazzani 1999] Bay, S. D., and Pazzani, M. J., Detecting change in categorical data: Mining contrast sets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 302–306, New York, NY, 1999. ACM. • [Crosier 1988] Crosier, R. B. (1988). Multivariate generalizations of cumulative sum quality-control schemes. Technometrics 30 291 303, , - . • [Dasu et al. 2006] Dasu, T., Krishnan, S., Venkatasubramanian, S, and Yi, K. (2006). An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proceedings of the 38th Symposium on the Interface of Statistics, Computing Science, and Applications (Interface 06). • [Dereszynski and Dietterich] Dereszynski, E., and Dietterich, T. (2007). Probabilistic Models for Anomaly Detection in Remote Sensor Data Streams. Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI-2007). 75-82. • [Dong and Li 1999] Dong, G. and Li, J. Efficient mining of emerging patterns: discovering trends and differences. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43–52, New York, NY, 1999. ACM. • [Fawcett and Provost 1999] Fawcett, T., and Provost, F. (1999). Activity monitoring: Noticing interesting changes in behavior. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 53-62). [G ld b t l 2002] G ld b A Sh li G C R A d Fi b S E (2002) E l• o en erg e a . o en erg, ., mue , ., aruana, . ., an en erg, . . . ar y statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249) • [Hotelling 1931] Hotelling, H. (1931). The generalization of Student's ratio, Ann. Math. Statist., Vol. 2, pp 360– 378. 108 • [Lowry et al. 1992] Lowry, C. A., Woodall, W. H., Champ, C. W., and Rigdon, S. E. (1992). A Multivariate Exponentially Weighted Moving Average Chart, Technometrics, 34, 46-53. References • [Montgomery 2001] Montgomery, D. C. Introduction to Statistical Quality Control. John Wiley and Sons, Inc., 2001. • [Moore and Wong 2003] Moore, A., and Wong, W.-K. (2003). Optimal Reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003) (pp 552 559) Menlo Park CA: AAAI Press . - . , . • [Page 1954] Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41, 100-115. • [Pearl 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann Publishers, Inc. • [Roberts 1959] Roberts S W (1959) Control chart tests based on geometric moving averages Technometrics , . . . . , 1, 239-250. • [Rosenbaum 2005] Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society Series B, 67(4): 515-530. • [Shewhart 1931] Shewhart, W. A. (1931). Economic control of quality of manufactured product. New York: D. Van Nostrand Company. • [Song et al. 2007] Song, X., Wu, M., and Jermaine, C. (2007). Conditional Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering, 19(5), 631-645. • [Song et al. 2007b] Song, X., Wu, M., Jermaine, C., and Ranka, S. (2007). Statistical Change Detection for Multi-Dimensional Data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 667-676, New York, NY: ACM Press. • [Wong et al. 2005] Wong, W.-K., Moore, A., Cooper, G. and Wagner, M. (2005). What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks. Journal of Machine Learning R h 6 1961 1998 109 esearc , , - . Tutorial on Event Detection Part II: Spatial Event Detection Daniel B. Neill Carnegie Mellon University H.J. Heinz III College neill@cs.cmu.edu Thi k ti ll t d b NSF t 110 s wor was par a y suppor e y gran IIS-0325581 and CDC grant 8-R01-HK000020-02. Outline of this part A. Introduction to spatial event detection Problem statement and overview of approaches. B. Univariate scan statistic approaches Spatial and space-time. C Multivariate scan statistic approaches. Parametric, non-parametric, and Bayesian. D C t d f t di ti. urren an u ure rec ons Incorporating learning; fast algorithms. 111 A Introduction to. Spatial Event Detection 1. The spatial event detection problem 2 Approaches to spatial event detection. a. Top-down and bottom-up approaches b. Parallel monitoring approaches c. Scan statistic approaches d. Other approaches from spatial statistics ©2009 Carnegie Mellon University 112 Spatial event detection D1 = respiratory ED Outbreak detection Time series of counts ci,mt for each location si D2 = constitutional ED D3 = OTC cough/cold D4 = OTC anti-fever Spatial time series data from spatial locations si (e.g. zip codes) for each data stream Dm. G l f d t ti t k d t t i t ( di tb k ) (etc.) oa s o e ec on as : e ec any emerg ng even s e.g. sease ou rea s , pinpoint the affected spatial area, and characterize the type of event. Informally, we want to know: Formally, we will distinguish between: Is there anything happening? If so, what and where? Null hypothesis H0 (no events) Set of alternative hypotheses H1(S, Ek) = event of type E in spatial region S 113 k . Spatial event detection D1 = respiratory ED Outbreak detection Time series of counts ci,mt for each location si D2 = constitutional ED D3 = OTC cough/cold D4 = OTC anti-fever Spatial time series data from spatial locations si (e.g. zip codes) for each data stream Dm. G l f d t ti t k d t t i t ( di tb k ) (etc.) oa s o e ec on as : e ec any emerg ng even s e.g. sease ou rea s , pinpoint the affected spatial area, and characterize the type of event. This formulation assumes count data aggregated to discrete time steps (e.g. days) and small areas (e.g. zips). More generally, we can have a set of data records (observations) where each observation has a time-stamp, location information, and 114 possibly other attributes. Each count represents the number of observations with given attributes in a given area and time interval. Spatial event detection D1 = respiratory ED Outbreak detection Time series of counts ci,mt for each location si D2 = constitutional ED D3 = OTC cough/cold D4 = OTC anti-fever Spatial time series data from spatial locations si (e.g. zip codes) for each data stream Dm. G l f d t ti t k d t t i t ( di tb k ) (etc.) oa s o e ec on as : e ec any emerg ng even s e.g. sease ou rea s , pinpoint the affected spatial area, and characterize the type of event. This formulation assumes count data aggregated to discrete time steps (e.g. days) and small areas (e.g. zips). We assume that an event will result in anomalously high counts for some subset of data streams for 115 the affected spatial region and time interval. Spatial event detection D1 = respiratory ED Outbreak detection Time series of counts ci,mt for each location si D2 = constitutional ED D3 = OTC cough/cold D4 = OTC anti-fever Spatial time series data from spatial locations si (e.g. zip codes) for each data stream Dm. G l f d t ti t k d t t i t ( di tb k ) (etc.) oa s o e ec on as : e ec any emerg ng even s e.g. sease ou rea s , pinpoint the affected spatial area, and characterize the type of event. We will initially make three additional assumptions: Purely spatial detection problem (only a single time interval to consider) Monitoring a single data stream D 116 m Attempting to detect a single event type Ek A Introduction to. Spatial Event Detection 1. The spatial event detection problem 2 Approaches to spatial event detection. a. Top-down and bottom-up approaches b. Parallel monitoring approaches c. Scan statistic approaches d. Other approaches from spatial statistics ©2009 Carnegie Mellon University 117 Top-down and bottom-up detection Top-down detection approaches Bottom-up detection approaches 1. Are there any globally interesting patterns? 1. Find individual data points with “interesting” local 2. If so, find the most interesting sub-partition of the data and search it recursively neighborhoods 2. Aggregate interesting points into clusters . . Bottom-up: density-based clustering (e.g. DBSCAN2) Top-down: bump hunting1 118 2M. Ester et al., KDD 1996. Thanks to Daniel Olivera for these examples. 1J. Friedman and N. Fisher, 1999. Greedy approaches can fail! Top-down detection approaches Bottom-up detection approaches 1. Are there any globally interesting patterns? 1. Find individual data points with “interesting” local 2. If so, find the most interesting sub-partition of the data and search it recursively neighborhoods 2. Aggregate interesting points into clusters . . Top-down fails when the affected region is too small to significantly Bottom-up fails when the affected region is not dense enough for the 119 affect the global aggregate statistics. local neighborhoods to be interesting. Greedy approaches can fail! Top-down detection approaches Bottom-up detection approaches 1. Are there any globally interesting patterns? 1. Find individual data points with “interesting” local 2. If so, find the most interesting sub-partition of the data and search it recursively neighborhoods 2. Aggregate interesting points into clusters . . How can we detect both small, dense clusters and l l d l t ? How can we move beyond cluster detection, to detect t th t i ti ?arger, ess ense c us ers even s a emerge n me One answer: Parallel Monitoring Partition the monitored area into subregions. Then separately monitor each subregion using purely temporal Fixed partition: zip code Fixed partition: uniform grid Ad-hoc partition: data clustering 120 detection methods (see Part 1!) boundaries Challenges of parallel monitoring One major challenge of parallel monitoring is choosing an appropriate partitioning of the monitored area. A given partitioning has high power to detect events corresponding to a single partition (red), but is suboptimal for events which affect multiple partitions (yellow), part of a partition (white), or parts of multiple partitions (pink). Coarse partitions lose power for small regions , fine partitions lose power for large regions, and both lose power for unaligned regions. 121 Challenges of parallel monitoring One major challenge of parallel monitoring is choosing an appropriate partitioning of the monitored area. A second challenge of parallel monitoring is the problem of multiple hypothesis testing. A given partitioning has high power to detect events corresponding to a single partition (red), but is suboptimal for events which affect Monitoring thousands of spatial partitions, and performing a separate significance test for multiple partitions (yellow), part of a partition (white), or parts of multiple partitions (pink). Coarse partitions lose power for small regions each, leads to huge numbers of false positive alerts. The Bonferroni correction for , fine partitions lose power for large regions, and both lose power for unaligned regions. multiple tests leads to greatly reduced detection power. S l ti t th fi t h ll th ti l t ti ti 1. Form a very fine partitioning of the monitored area into individual locations (e.g. zip codes or census tracts, depending on spatial resolution of the data). o u on o e rs c a enge: e spa a scan s a s c. 2. Rather than monitoring each partition separately, examine a huge number of overlapping spatial regions, each consisting of a group of locations. 122 Challenges of parallel monitoring One major challenge of parallel monitoring is choosing an appropriate partitioning of the monitored area. A second challenge of parallel monitoring is the problem of multiple hypothesis testing. A given partitioning has high power to detect events corresponding to a single partition (red), but is suboptimal for events which affect Monitoring thousands of spatial partitions, and performing a separate significance test forS hi multiple partitions (yellow), part of a partition (white), or parts of multiple partitions (pink). Coarse partitions lose power for small regions each, leads to huge numbers of false positive alerts. The Bonferroni correction for Spatial scan approaches have high power to detect events affecting small or large r gi ns earc ng over so many regions makes the multiple hypothesis testing problem even worse… S l ti t th fi t h ll th ti l t ti ti , fine partitions lose power for large regions, and both lose power for unaligned regions. multiple tests leads to greatly reduced detection power. . o u on o e rs c a enge: e spa a scan s a s c. 1. Form a very fine partitioning of the monitored area into individual locations (e.g. zip codes or census tracts, depending on spatial resolution of the data). But we can solve the multiple testing problem by: 1. Finding the most significant regions. 2 D t i i h lik l ld b t2. Rather than monitoring each partition separately, examine a huge number of overlapping spatial regions, each consisting of a group of locations. 123 . e erm n ng ow e y we wou e o see any regions that significant due to chance. Other methods from spatial statistics Clustered? Tests for “general clustering” (Whittemore, Tango, Knox, Mantel, etc.) Not clustered? Determine whether there is sufficient evidence of spatial or space-time clustering in the data, but without detecting specific clusters . Tests for “focused clustering” (Lawson, Stone, Waller, Diggle, etc.) Determine whether the risk is Focus? Not a ? significantly increased near a given point (e.g. possible environmental hazard). focus? 124 Neither method detects cluster locations! Spatial risk mapping approaches Spatial smoothing Original risk map (observed / expected) Smoothed risk map EB (Gangnon & Clayton; Mollie) Often based on Bayesian modeling Advantages: Explicit modeling of spatial correlation structure FB (Lawson et al.) Disadvantages: Cannot automatically determine whether an event , useful for data visualization, can detect areas with high risk. 125 has occurred; cannot identify the spatial area and time duration. B Univariate Scan. Statistic Approaches 1. Kulldorff’s spatial scan statistic 2 Variants of spatial scan:. • Which spatial regions to search? • How to evaluate the score of a region? 3. Extensions to space-time scanning (expectation-based scan statistic) ©2009 Carnegie Mellon University 126 The spatial scan statistic (Kulldorff 1997) Rather than monitoring individual locations, we , examine groups of locations. Imagine moving a spatial window around the monitored area, allowing the size and shape of the window to vary. 127 The spatial scan statistic (Kulldorff 1997) Rather than monitoring individual locations, we , examine groups of locations. Imagine moving a spatial window around the monitored area, allowing the size and shape of the window to vary. 128 The spatial scan statistic (Kulldorff 1997) Rather than monitoring individual locations, we , examine groups of locations. Imagine moving a spatial window around the monitored area, allowing the size and shape of the window to vary. 129 The spatial scan statistic (Kulldorff 1997) I have a population of 6000, of whom 90 Rather than monitoring individual locations, we , (1.5%) are sick. examine groups of locations. Imagine moving a spatial Everywhere else has a population of 2.2 million of whom window around the monitored area, allowing the size and shape of the window to vary. Is there any position of the window such that the points inside form a significant cluster? , 20,000 (0.9%) are sick. We compute a score for each spatial region, and then test whether the highest scoring How to evaluate a region? Which regions to search? 130 regions are significant.How to search them efficiently? Finding the most significant regions • Define models: • of the null hypothesis H0: no events. • of the alternative K lld ff’ d l hypotheses H1(S): event in region S. u or s mo e ci ~ Poisson(qbi) H : q = q everywhere ci = count for location si (e.g. number of disease cases) bi = baseline for location si (e.g. population at-risk, or expected count computed from historical data) 0 all H1(S): q = qin inside S, q = qout outside, q = risk (expected ratio of count to baseline) 131 qin > qout. Finding the most significant regions • Define models: • of the null hypothesis H0: no events. • of the alternative K lld ff’ d l hypotheses H1(S): event in region S. u or s mo e ci ~ Poisson(qbi) H : q = q everywhere0 all H1(S): q = qin inside S, q = qout outside, 132 qin > qout. Finding the most significant regions • Define models: • of the null hypothesis H0: no events. • of the alternative K lld ff’ d l hypotheses H1(S): event in region S. • Derive a score function: )| DataPr( ))(| DataPr( )(F 0 1 H SH S = u or s mo e ci ~ Poisson(qbi) H : q = q everywhere • Likelihood ratio: 0 all H1(S): q = qin inside S, q = qout outside, 133 qin > qout. Finding the most significant regions Score = 1.3 • Define models: • of the null hypothesis H0: no events. • of the alternative K lld ff’ d l hypotheses H1(S): event in region S. • Derive a score function: )| DataPr( ))(| DataPr( )(F 0 1 H SH S = u or s mo e ci ~ Poisson(qbi) H : q = q everywhere • Likelihood ratio: totCCtotCC CCCC −− ⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛ 0 all H1(S): q = qin inside S, q = qout outside, 134 tot tot tot tot BBBB S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝ =)F( qin > qout. Finding the most significant regions Score = 1.3 • Define models: • of the null hypothesis H0: no events. • of the alternative K lld ff’ d l hypotheses H1(S): event in region S. • Derive a score function: )| DataPr( ))(| DataPr( )(F 0 1 H SH S = u or s mo e ci ~ Poisson(qbi) H : q = q everywhere • Likelihood ratio: Total count and Total count and 0 all H1(S): q = qin inside S, q = qout outside, baseline of region S baseline of search area totCCtotCC CCCC −− ⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛ 135 qin > qout.tot tot tot tot BBBB S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝ =)F( Finding the most significant regions Score = 1.3 • Define models: • of the null hypothesis H0: no events. • of the alternative K lld ff’ d l hypotheses H1(S): event in region S. • Derive a score function: )| DataPr( ))(| DataPr( )(F 0 1 H SH S = u or s mo e ci ~ Poisson(qbi) H : q = q everywhere • Likelihood ratio: 0 all H1(S): q = qin inside S, q = qout outside, totCCtotCC CCCC −− ⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛ 136 qin > qout.tot tot tot tot BBBB S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝ =)F( Finding the most significant regions Score = 1.3 • Define models: • of the null hypothesis H0: no events. • of the alternative K lld ff’ d l hypotheses H1(S): event in region S. • Derive a score function: )| DataPr( ))(| DataPr( )(F 0 1 H SH S = u or s mo e ci ~ Poisson(qbi) H : q = q everywhere)(Fmaxarg* SS • Likelihood ratio: • To find the most significant regions: 0 all H1(S): q = qin inside S, q = qout outside, S = totCCtotCC CCCC −− ⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛ 137 qin > qout.tot tot tot tot BBBB S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝ =)F( Finding the most significant regions 2nd highest score = 8.4 • Define models: Maximum region • of the null hypothesis H0: no events. • of the alternative score = 9.8 K lld ff’ d l hypotheses H1(S): event in region S. • Derive a score function: )| DataPr( ))(| DataPr( )(F 0 1 H SH S = u or s mo e ci ~ Poisson(qbi) H : q = q everywhere)(Fmaxarg* SS • Likelihood ratio: • To find the most significant regions: 0 all H1(S): q = qin inside S, q = qout outside, S = totCCtotCC CCCC −− ⎟⎞⎜⎛⎟⎞⎜⎛ −⎟⎞⎜⎛ 138 qin > qout.tot tot tot tot BBBB S ⎟⎠⎜⎝⎟⎠⎜⎝ −⎠⎝ =)F( Which regions are significant? 2nd highest score = 8.4RB = 97, p = .098 • Randomly generate counts for R 999 li d t t d Maximum region = rep ca a ase s un er H0 (i.e. assuming no events). • Find maximum region score F*= max F(S) of each replica score = 9.8 RB = 12, p = .013 S . • p-value of region S = (RB+1) / (R+1), where RB = # of replicas with F* ≥ F(S). Thi i i i ifi t t 05• All regions with p-values < α are significant at level α. s reg on s s gn can a α = . ; no other regions are significant. … F* = 2.4 F* = 9.1 F* = 7.0 139 G1 G2 G999 B Univariate Scan. Statistic Approaches 1. Kulldorff’s spatial scan statistic 2 Variants of spatial scan:. • Which spatial regions to search? • How to evaluate the score of a region? 3. Extensions to space-time scanning (expectation-based scan statistic) 140 Choosing the set of search regions • Some practical considerations: • Set of regions should cover entire search space. • Regions should overlap, not partition the space. • Choose a set of regions that corresponds well with the size/shape of the clusters we want to detect. • Typical approaches consider some fixed shape (circles, rectangles) and vary the location and dimensions . Don’t search too few regions: Don’t search too many regions: Overall power to detect any given subset of regions reduced because of multiple hypothesis testing. Reduced power to detect clusters outside the search space. 141 Computational infeasibility! Choosing the set of search regions • Kulldorff’s original spatial scan searches over circular regions of varying radius, centered at each ti l l tispa a oca on si. • Since the score function F(S) depends only on which locations are included we need to search , O(N2) regions, each consisting of a center location and its k-NN. • Advantages: computationally efficient, generalizable to arbitrary metric spaces, high detection power for compact clusters. • Disadvantage: low power for elongated/irregular clusters. April 1979: inadvertent release of anthrax from a Soviet biological weapons facility, 77 cases confirmed. 142 Disease cluster elongated due to wind. Choosing the set of search regions • Kulldorff’s original spatial scan searches over circular regions of varying radius, centered at each ti l l ti Many recent spatial scan variants search over elongated clusters, e.g. spa a oca on si. • Since the score function F(S) depends only on which locations are included we need to search rectangles1 or ellipses2 Other variants: heuristic h ll t d , O(N2) regions, each consisting of a center location and its k-NN. • Advantages: computationally searc over a connec e regions3, or exhaustive search over a subset of connected regions4 5 efficient, generalizable to arbitrary metric spaces, high detection power for compact clusters. , Main challenge: efficient computation! • Disadvantage: low power for elongated/irregular clusters. 1Neill and Moore, KDD 2004 2Kulldorff et al., Stat. Med., 2007 3D l d A CSDA 2004 143 uczma an ssuncao, , 4Tango and Takahashi, IJHG, 2005 5Patil and Taillie, EES, 2004 Computing the score function Method 1 (Frequentist, hypothesis testing approach): Use likelihood ratio )|P ( ))(|Pr( )( 1 HD t SHData SF = r 0a a Method 2 (Bayesian approach): Use posterior probability ))(Pr())(|Pr()( 11 SHSHDataSF Prior probability of region S )Pr(Data = 144 Computing the score function Method 1 (Frequentist, hypothesis testing approach): Use likelihood ratio )|P ( ))(|Pr( )( 1 HD t SHData SF = r 0a a Method 2 (Bayesian approach): Use posterior probability ))(Pr())(|Pr()( 11 SHSHDataSF Prior probability of region S )Pr(Data = What to do when each hypothesis has a parameter space Θ? Method A (Maximum likelihood approach) )|P ()|P ( θHDHD Method B (Marginal likelihood approach) ,rmaxr )(θ ataata HΘ∈= ∫ 145 Θ∈ = )( )Pr(),|Pr()|Pr( H HDataHData θ θθ Computing the score function Method 1 (Frequentist, hypothesis testing approach): Use likelihood ratio )|P ( ))(|Pr( )( 1 HD t SHData SF = r 0a a Most common (frequentist) approach: use likelihood ratio statistic with maximum likelihood , estimates of any free parameters, and compute statistical significance by randomization1,2 Method A (Maximum likelihood approach) )|P ()|P ( θHDHD 1Kulldorff, 1997 2Neill and Moore, ,rmaxr )(θ ataata HΘ∈= ADKDD 2005. Many possible variants, depending on how we model the lik lih d f th d t d h h th i H (S) d H 146 e oo o e a a un er eac ypo es s 1 an 0 (Poisson, Gaussian, exponential, negative binomial, etc.) Computing the score function Advantages: Randomization testing unnecessary (1000x speedup), can be extended to multiple data streams and multiple event types (more on this later) Method 2 (Bayesian approach): Use posterior probability ))(Pr())(|Pr()( 11 SHSHDataSF . )Pr(Data = Bayesian spatial scan statistic1,2: A Bayesian marginal likelihood approach, efficiently computable using conjugate priors (Gamma-Poisson). Method B (Marginal likelihood approach) ∫ 1Neill et al., NIPS 2005 2Neill and Cooper, Machine L i 2009 i 147 Θ∈ = )( )Pr(),|Pr()|Pr( H HDataHData θ θθ earn ng, , n press. B Univariate Scan. Statistic Approaches 1. Kulldorff’s spatial scan statistic 2 Variants of spatial scan:. • Which spatial regions to search? • How to evaluate the score of a region? 3. Extensions to space-time scanning (expectation-based scan statistic) 148 Expectation-based scan statistics (K lld ff 2001 N ill t l KDD 2005)u or , ; e e a ., To detect emerging events, we can search for space-time regions where the recently observed counts are significantly higher than expected. Imagine moving a space-time window around the scan area, allowing the window size, h d d is ape, an urat on to vary. 149 Expectation-based scan statistics (K lld ff 2001 N ill t l KDD 2005)u or , ; e e a ., To detect emerging events, we can search for space-time regions where the recently observed counts are significantly higher than expected. Imagine moving a space-time window around the scan area, allowing the window size, h d d i Historical Current counts s ape, an urat on to vary. (Consider most recent w days, w = 1…Wmax) counts (1 day duration) 150 Expectation-based scan statistics (K lld ff 2001 N ill t l KDD 2005)u or , ; e e a ., To detect emerging events, we can search for space-time regions where the recently observed counts are significantly higher than expected. Imagine moving a space-time window around the scan area, allowing the window size, h d d i Historical Current counts s ape, an urat on to vary. (Consider most recent w days, w = 1…Wmax) counts (2 day duration) 151 Expectation-based scan statistics (K lld ff 2001 N ill t l KDD 2005)u or , ; e e a ., To detect emerging events, we can search for space-time regions where the recently observed counts are significantly higher than expected. Imagine moving a space-time window around the scan area, allowing the window size, h d d i Historical Current counts s ape, an urat on to vary. (Consider most recent w days, w = 1…Wmax) For each space-time region, we compare the current counts for each location to the time series of counts (3 day duration) 152 historical counts for that location. Expectation-based scan statistics (K lld ff 2001 N ill t l KDD 2005) For the standard scan statistic approach we assume that each u or , ; e e a ., , count is drawn from a Poisson distribution with unknown mean. We perform time series analysis to find the expected counts for each recent day, then compare Historical Current counts actual to expected counts. Expected t counts (3 day duration)For each space-time region, we compare the current counts for each location to the time series of 153 coun shistorical counts for that location. Expectation-based scan statistics (K lld ff 2001 N ill t l KDD 2005) For the standard scan statistic approach we assume that each u or , ; e e a ., , count is drawn from a Poisson distribution with unknown mean. Similarly, we can compute a Gaussian scan statistic by obtaining the expectations and Historical Current counts variances from historical data. counts (3 day duration)For each space-time region, we compare the current counts for each location to the time series of Expected t 154 historical counts for that location. coun s Expectation-based scan statistics (K lld ff 2001 N ill t l KDD 2005) As before, we find the regions with highest values of the likelihood ratio 2nd highest score = 8 4 Not significant (p = 098) u or , ; e e a ., statistic, and compute the p-value of each region by randomization. Maximum region . . )|DataPr( ))(| DataPr( )(F 0 1 H SH S = Alternative hypothesis: event in region S score = 9.8 Significant! (p = .013) Null hypothesis: no events … F1* = 2.4 F2* = 9.1 F999* = 7.0To compute p-value Compare region score to maximum region scores of simulated 155 datasets under H0. Poisson scan statistic models Counts are Poisson distributed: cit ~ Poisson(qitbit) qit is relative risk, bit is expected count under H0 Expectation-based Poisson (EBP) Population-based Poisson (PBP) (Neill et al., KDD 2005) (Kulldorff, 1997, 2001) H0: qit = 1 everywhere (counts = expected) H0: qit = qall everywhere (inside = outside) H1(S): qit = qin in S and qit = 1 outside, for some qin > 1. (counts > expected in S) H1(S): qit = qin in S and qit = qout outside, for some qin > qout. (inside > outside) qin = 1.2 qin = 1.3 156 qout = 1.1 Poisson scan statistic models Counts are Poisson distributed: cit ~ Poisson(qitbit) qit is relative risk, bit is expected count under H0 Expectation-based Poisson (EBP) Population-based Poisson (PBP) (Neill et al., KDD 2005) (Kulldorff, 1997, 2001) H0: qit = 1 everywhere (counts = expected) H0: qit = qall everywhere (inside = outside) H1(S): qit = qin in S and qit = 1 outside, for some qin > 1. (counts > expected in S) H1(S): qit = qin in S and qit = qout outside, for some qin > qout. (inside > outside) allC all all outC out out inC in in B C B C B C )S(F − ⎟⎟⎠ ⎞ ⎜⎜⎝ ⎛ ⎟⎟⎠ ⎞ ⎜⎜⎝ ⎛ ⎟⎟⎠ ⎞ ⎜⎜⎝ ⎛=CB C e B C F(S) −⎟⎠ ⎞⎜⎝ ⎛= 157 (if Cin / Bin > Cout / Bout)(if C > B) Gaussian scan statistic models Counts are Gaussian distributed: cit ~ Gaussian(qitbit, σit) Let C’ = Σ citbit / (σit)2 and B’ = Σ (bit)2 / (σit)2 Expectation-based Gaussian (EBG) Population-based Gaussian (PBG) (Neill, Ph.D. thesis, 2006) (Neill, Ph.D. thesis, 2006) H0: qit = 1 everywhere (counts = expected) H0: qit = qall everywhere (inside = outside) H1(S): qit = qin in S and qit = 1 outside, for some qin > 1. (counts > expected in S) H1(S): qit = qin in S and qit = qout outside, for some qin > qout. (inside > outside) ( ) ( ) ( ) ⎟⎟⎠ ⎞ ⎜⎜⎝ ⎛ −+= all 2 all out 2 out in 2 in 'B2 'C 'B2 'C 'B2 'C exp)S(F⎟⎟⎠ ⎞ ⎜⎜⎝ ⎛ −+= 'C 2 'B 'B2 )'C( expF(S) 2 158 (if C’in / B’in > C’out / B’out)(if C’ > B’) Comparison of models and methods • Expectation-based space-time scan statistics typically outperform purely spatial and purely temporal scans1 and parallel monitoring2. • EBP and EBG statistics have consistently high detection power whether the affected region is large or small in size 3 . • Kulldorff’s statistic (PBP) has very low detection power for large regions. For small regions, PBP beats EBP and EBG for large-count datasets, while EBP wins for small-count datasets.3 Diff t ti i th d b t f ti b li f• eren me ser es me o s are es or compu ng ase nes or different datasets; it is important to adjust for seasonal and day-of-week trends if these are present.2,3 • Randomization testing is often miscalibrated for public health datasets, resulting in lower detection power and high false positive rates. We suggest using the empirical distribution of maximum scores from historical data instead.2 • Bayesian4,5 and nonparametric6 approaches often outperform typical parametric scan statistics (more on these later). 1Neill, Ph.D. thesis, 2006 4Neill et al., NIPS 2005 159 2Neill, IJF, 2009 3Neill, IJHG, 2009 5Neill and Cooper, MLJ, 2009 6Neill and Lingwall, ISDS 2007 Persistent vs. emerging clusters Most space-time scan approaches assume that the relative risks qit are spatially uniform over the affected i d t t th d ti f th t Good for detecting Not as good for detecting reg on, an cons an over e ura on o e even . persistent clusters (e.g. shift in mean) clusters that emerge gradually over time Better idea: assume that relative risk is non- 160 decreasing over the duration of the event.1 1Neill et al., KDD 2005. Persistent vs. emerging example Estimated values of relative risk for each day Actual values of C/B for each day Efficiently computable: q = Σ cit / Σ bit Can be computed just as efficiently1, using dynamic programming! t tP i t t E i 161 ers s en merg ng 1Neill et al., KDD 2005. Static vs. dynamic clusters Most space-time scan approaches assume that the affected set of spatial locations remains constant over time t . We can think of this as a search over regions shaped like right prisms (with Iyengar (KDD 2004) considers regions both bases the same) in 3-D space-time. with truncated pyramidal shapes in space-time. This models regions which move, grow, or shrink linearly over time. Exact search is computationally infeasible; heuristic search is used 162 From Iyengar, KDD 2004 to obtain an approximate solution. C. Multivariate Scan Statistic Approaches 1. Advantages of multivariate approaches 2 Parametric multivariate scan statistics. 3. Non-parametric scan statistics (NPSS) 4. Multivariate Bayesian scan statistics (MBSS) ©2009 Carnegie Mellon University 163 Why multivariate event detection? ED visits (fever) ED visits (respiratory) … OTC sales (nasal) OTC sales (cough) Multivariate Event Detection System No A th I fl outbreak … n rax outbreak in region S n uenza outbreak in region S 164 Why multivariate event detection? ED visits (fever) ED visits (respiratory) … OTC sales (nasal) OTC sales (cough) W t t d t t Multivariate Event Detection System e wan o e ec emerging events in the early stages, when they have a small impact on Goal 1: More timely detection. No A th I fl each individual stream. outbreak … n rax outbreak in region S n uenza outbreak in region S 165 Why multivariate event detection? ED visits (fever) ED visits (respiratory) … OTC sales (nasal) OTC sales (cough) W t t di ti i h Multivariate Event Detection System e wan o s ngu s between different event types based on the affected region Goal 2: Event characterization. No A th I fl and streams. outbreak … n rax outbreak in region S n uenza outbreak in region S 166 Why multivariate event detection? ED visits (fever) ED visits (respiratory) … OTC sales (nasal) OTC sales (cough) W t t d l Multivariate Event Detection System e wan o mo e various other causes of a detected pattern and distinguish these from Goal 3: Reduce false positives. No PromotionalA th I fl relevant events. outbreak sale of OTC medications n rax outbreak in region S n uenza outbreak in region S 167 C. Multivariate Scan Statistic Approaches 1. Advantages of multivariate approaches 2 Parametric multivariate scan statistics. 3. Non-parametric scan statistics (NPSS) 4. Multivariate Bayesian scan statistics (MBSS) 168 Parametric scan statistics Parametric scan statistics find the regions with highest values of a 2nd highest score = 8 4 Not significant (p = 098) likelihood ratio statistic, and compute statistical significance of each region by randomization. Maximum region . . )H|D tP ( ))S(H| DataPr( )S(F 1= Alternative hypothesis: outbreak in region S score = 9.8 Significant! (p = .013) a ar 0 Null hypothesis: no outbreak 1Typical multivariate approach : assume streams are independent, conditional on whether an event has occurred and the affected space-time region S. 169 1Kulldorff et al., Stat. Med., 2007 Parametric scan statistics Parametric scan statistics find the regions with highest values of a 2nd highest score = 8 4 Not significant (p = 098) likelihood ratio statistic, and compute statistical significance of each region by randomization. Maximum region . . )H|D tP ( ))S(H| DataPr( )S(F 1= Alternative hypothesis: outbreak in region S score = 9.8 Significant! (p = .013) a ar 0 Null hypothesis: no outbreak 1 Under this assumption we can multiplyTypical multivariate approach : assume streams are independent, conditional on whether an event has occurred and the affected space-time region S. ∏ == M 1m 01m )H|DPr( ))S(H| DPr()S(F , the likelihood ratios for each stream: 170 m 1Kulldorff et al., Stat. Med., 2007 Parametric scan statistics 2nd highest score = 8 4 Not significant (p = 098) This multivariate approach has several disadvantages: 1) Does not account for Maximum region . . correlations between streams. 2) Cannot determine which subset of streams have been affected score = 9.8 Significant! (p = .013) . 3) Tends to focus detection on streams with highest counts. 1 Under this assumption we can multiply 4) Cannot distinguish between multiple types of event. Typical multivariate approach : assume streams are independent, conditional on whether an event has occurred and the affected space-time region S. ∏ == M 1m 01m )H|DPr( ))S(H| DPr()S(F , the likelihood ratios for each stream: 171 m 1Kulldorff et al., Stat. Med., 2007 C. Multivariate Scan Statistic Approaches 1. Advantages of multivariate approaches 2 Parametric multivariate scan statistics. 3. Non-parametric scan statistics (NPSS) 4. Multivariate Bayesian scan statistics (MBSS) 172 The nonparametric scan statistic Neill and Lingwall ISDS 2007 Rather than assuming a parametric distribution and learning the mean and variance parameters from past counts, NPSS compares the t t t th ti i i l di t ib ti f hi t i l t , curren coun s o e en re emp r ca s r u on o s or ca coun s. Simple assumption: under H0, all counts for a given location and d t t d i d d tl f th di t ib tia a s ream are rawn n epen en y rom e same s r u on. In this case, the proportion of historical counts that are greater than t t t ill b t ti ll if l di t ib t d [0 1] Historical Current counts curren coun ci,m w e asymp o ca y un orm y s r u e on , . Compute the empirical p-value pi,mt corresponding to each current count ci mt: counts (3 day duration) , pi,mt = (Tbeat + 1) / (T + 1) 173 Total # of historical counts # of historical counts > ci,mt The nonparametric scan statistic Neill and Lingwall ISDS 2007 , Rather than assuming a parametric distribution and learning the mean and variance parameters from past counts, NPSS compares the t t t th ti i i l di t ib ti f hi t i l tcurren coun s o e en re emp r ca s r u on o s or ca coun s. Simple assumption: under H0, all counts for a given location and d t t d i d d tl f th di t ib tia a s ream are rawn n epen en y rom e same s r u on. In this case, the proportion of historical counts that are greater than t t t ill b t ti ll if l di t ib t d [0 1] Compute the empirical p-value pi,mt corresponding to each current count ci mt: Under H0, pi,mt ~ U[0,1] curren coun ci,m w e asymp o ca y un orm y s r u e on , . , pi,mt = (Tbeat + 1) / (T + 1) Under H1(S), the counts in region S will be higher than expected under H and thus the empirical p 174 Total # of historical counts # of historical counts > ci,mt 0, - values will be lower than expected. The nonparametric scan statistic We search for regions (D, S, W) with a surprisingly large number f l i i l l D: subset of data streams S: set of spatial locations W: duration (number of days)o ow emp r ca p-va ues. Total number of empirical p-values in region: N = |D| x |S| x W How many low empirical p-values (pi,mt < α) do we expect under H0? L t N # { t ) Th N Bi i l(N )e α = pi,m < α . en α ~ nom a , α , with mean Nα and variance Nα(1 – α). Follo ing Donoho and Jin (2004) e define the higherw , w criticism statistic F(D, S, W) = maxα (Nα – Nα) / √Nα(1 – α). We find the multivariate space time regions (D S W) with highest scores 175 - , , F(D, S, W), and compute statistical significance by randomization. The nonparametric scan statistic Advantages of the nonparametric scan statistic (NPSS) No parametric model assumptions. Can easily combine information from multiple streams and identify which subset of streams are affected. Randomization testing is easy (draw each p t ~ U[0 1]) i,m , . NPSS assumes that all of the counts for a given time series are drawn from the same (unknown) distribution , which will not be true if the time series is nonstationary. Solution: use the standardized residuals ri mt = (ci mt – bi mt) / √bi mt, , , , , where the expected counts bi,mt are inferred by time series analysis. Other nonparametric score functions F(D, S, W) = max F (N , N) 176 α α α can be defined, and in some cases these outperform higher criticism. Preliminary comparison of methods • We compared the parametric and nonparametric scan statistics on a variety of outbreak detection tasks using Emergency Department data from Allegheny County. • Univariate detection performance was comparable; NPSS outperformed parametric scans for larger outbreaks, and for data that did not fit the parametric model assumptions . • NPSS achieved significant gains in detection power on multivariate tasks, especially when only a subset of the monitored streams ere affected w . • NPSS was able to accurately identify the affected streams. • A naïve implementation of NPSS is more computationally expensive than parametric scan, O(2M) for M streams. • However, we have developed an efficient implementation that scales linearly with M using newly developed methods , for linear-time subset scanning (LTSS).1 177 1Neill, ISDS 2008 C. Multivariate Scan Statistic Approaches 1. Advantages of multivariate approaches 2 Parametric multivariate scan statistics. 3. Non-parametric scan statistics (NPSS) 4. Multivariate Bayesian scan statistics (MBSS) 178 Overview of the MBSS method Neill and Cooper MLJ 2009 Priors Multivariate Bayesian , , Dataset Scan StatisticModels Outputs Given a set of event types Ek, a set of spatial regions S, and the multivariate dataset D MBSS outputs the posterior probability Pr(H (S E ) | D) of each type We must provide the prior probability Pr(H1(S, Ek)) of each event type Ek , 1 , k of event in each region, as well as the probability of no event, Pr(H0 | D). in each region S, as well as the prior probability of no event, Pr(H0). MBSS uses Bayes’ Theorem to combine the data likelihood given each hypothesis ith th i b bilit f th t h th i P (H | D) P (D | H) P (H) / P (D) 179 w e pr or pro a y o a ypo es s: r = r r r . The Bayesian hierarchical model Type of event Space-time region affected Effects on each data stream Effects of event Parameter priors Expected counts Relative risks 180 Observed counts The Bayesian hierarchical model ci,mt ~ Poisson(qi,mtbi,mt) Count for data stream dm in location si at time t bi,mt is expected value of ci,mt under the null hypothesis, predicted from historical data. qi,mt is relative risk. Null hypothesis H0 (no events) Alternative hypothesis H1(S, Ek) (event of type Ek in region S) qi,mt ~ Gamma(αm, βm) everywhere qi,mt ~ Gamma(xmαm, βm) inside region S, qi,mt ~ Gamma(αm, βm) elsewhere αm and βm are learned from historical data for stream dm. Event type Ek multiplies expected counts in S by some constant xm for each stream dm. Gamma(α,β) μ= α / β σ2 = α / β2 Simple event model: xm = 1 + θ (xkm,avg – 1) Event severity 181 Average effect of event type Ek on stream dm. Computing Bayesian likelihoods • Marginal likelihood approach: integrate over possible values of the relative risks qi,mt, weighted by their prior probabilities. C j t i ll l d f l ti• on uga e pr ors a ow a c ose orm so u on. • Gamma priors, Poisson counts Æ Negative binomial likelihoods. 182 Computing Bayesian likelihoods • Marginal likelihood approach: integrate over possible values of the relative risks qi,mt, weighted by their prior probabilities. C j t i ll l d f l ti• on uga e pr ors a ow a c ose orm so u on. • Gamma priors, Poisson counts Æ Negative binomial likelihoods. ∏ tt )()|( ∝ t,m,i mmm,imi,0 ,,b,cNegBinHDPr βα dq))qbPo(~cPr()),Ga(~qPr(),,b,cNegBin( where βαβα ∫= )(βαΓ 183 )()b( c c αβ α α Γ+ +∝ + Computing Bayesian likelihoods • Marginal likelihood approach: integrate over possible values of the relative risks qi,mt, weighted by their prior probabilities. C j t i ll l d f l ti• on uga e pr ors a ow a c ose orm so u on. • Gamma priors, Poisson counts Æ Negative binomial likelihoods. ∏ tt )()|( ∝ t,m,i mmm,imi,0 ,,b,cNegBinHDPr βα ∏∝ tt )xbcNegBin(})x{)ES(H|DPr( βα ∈St,m,i mmmm,imi,mk1 ,,,,, ∏ ∉ × Stmi mm t m,i t mi, ),,b,cNegBin( βα dq))qbPo(~cPr()),Ga(~qPr(),,b,cNegBin( where βαβα ∫= )(βαΓ ,, 184 )()b( c c αβ α α Γ+ +∝ + Computing Bayesian likelihoods • Marginal likelihood approach: integrate over possible values of the relative risks qi,mt, weighted by their prior probabilities. C j t i ll l d f l ti• on uga e pr ors a ow a c ose orm so u on. • Gamma priors, Poisson counts Æ Negative binomial likelihoods. ∏ tt )()|( ∝ t,m,i mmm,imi,0 ,,b,cNegBinHDPr βα ∏∝ tt )xbcNegBin(})x{)ES(H|DPr( βα ∈St,m,i mmmm,imi,mk1 ,,,,, ∏ ∉ × Stmi mm t m,i t mi, ),,b,cNegBin( βα ,, To compute the data likelihood given the alternative hypothesis H (S E ) we marginalize over the values of x 185 1 , k , m. Comparison to prior methods We compared MBSS to the parametric multivariate scan statistic on outbreak detection using OTC medication sales. Using uninformative priors, MBSS achieves similar detection performance to parametric scans, enabling its use as a general detector with high performance across many event types. However, we can also incorporate prior information into event models, and thus use MBSS as a specific detector with much higher detection power for the given event type, achieving an average of 1.3 days faster detection than parametric scans. Additionally, MBSS can be used to characterize events by specifying models for multiple event types and computing 186 the probability that each type of event has occurred. Testing discrimination power • We examined the ability of MBSS to differentiate between two types of influenza-like illness using two streams of OTC data (cough/cold antifever) , . Outbreak E1 affects cough/cold twice as much as fever. O tb k E ff t f t i h h/ ld 100 t y u rea 2 a ec s ever w ce as muc as coug co . 40 60 80 d P r o b a b i l i t Correct type Wrong type MBSS was able to accurately discriminate between the two event 0 20 E s t i m a t e d No outbreaktypes by the second outbreak day. 187 1 2 3 4 5 6 7 Day of outbreak Interpretation and visualization • MBSS gives the total posterior probability of each event type Ek, and the distribution of this probability over space time regions S - . • Probabilistic basis for decision-making, given costs of false positives and false negatives . • Visualization: Pr(H1(si, Ek)) = Σ Pr(H1(S, Ek)) for all regions S containing location si. Total posterior probability of a i t tb k i hresp ra ory ou rea n eac Allegheny County zip code on 6/3/05. Darker shading = higher probability. 188 Advantages of MBSS Results are easy to interpret, visualize, and use for decision-making. Can incorporate prior knowledge of event prevalence, size, shape, duration, spread, and impact. Computation is fast in the Bayesian framework and randomization P(anthrax) = 22% P(influenza) = 13%, testing is not necessary. P(other ILI) = 33% We can model and differentiate between multiple potential causes of an event. We can detect faster and more accurately by integrating multiple data streams. 189 D Current Directions. in Spatial Event Detection 1. Incorporating learning into detection 2. Very fast detection algorithms 3. Generalization of spatial methods to non-spatial data 4. Non-aggregated spatial and temporal data 5 I ti i h i f ti f b ti. ncorpora ng r c n orma on rom o serva ons 6. Detecting multiple, dynamic, irregular clusters 7. Integrating detection, tracking, and response ©2009 Carnegie Mellon University 190 8. Combining sensor placement and sensor fusion Incorporating learning into detection We have made major advances in detecting anomalous patterns, but not in determining which of these anomalies are relevant. 9/20, 15213, cough/cold, … 9/21, 15207, antifever, … 9/22, 15213, CC = cough, ... 1,000,000 more records… Huge mass of data Detection algorithm “What am I supposed to do with this?” Too many alerts We must model and differentiate between multiple causes of a detected pattern, and provide only the relevant patterns to the user. How can we classify patterns, and determine which ones are relevant 191 to a given user at a given time? Incorporating learning into detection We have made major advances in detecting anomalous patterns, but not in determining which of these anomalies are relevant. 9/20, 15213, cough/cold, … 9/21, 15207, antifever, … 9/22, 15213, CC = cough, ... 1,000,000 more records… Huge mass of data Feedback loop We must model and differentiate between multiple causes of a detected pattern, and provide only the relevant patterns to the user. How can we classify patterns, and determine which ones are relevant Incorporate user feedback into the detection process, 192 to a given user at a given time? and use it to learn models! Incorporating learning into MBSS Many aspects of the MBSS framework can be learned from data. The set of event types Ek, and the prevalence of each event type Pr(Ek). We first consider the passive learning scenario, in which the user provides a The space-time pattern of each event type, Pr(H1(S, Ek) | Ek). label (S, Ek) for each day. The effects of each event type on the multiple data streams, Pr(D | H1(S, Ek)). The relevance of each type This label can be then used by the system to update its models and improve its of event to the user. performance for future days. 193 Incorporating learning into MBSS Many aspects of the MBSS framework can be learned from data. We need to generalize over huge # of possible regions S. The set of event types Ek, and the prevalence of each event type Pr(Ek). The space-time pattern of each event type, Pr(H1(S, Ek) | Ek). The effects of each event type on the multiple data streams, Pr(D | H1(S, Ek)). The relevance of each type of event to the user. 194 Latent center model (Makatchev and Neill, 2008) Incorporating learning into MBSS Many aspects of the MBSS framework can be learned from data. We can learn different temporal patterns for different event types.The set of event types Ek, and the prevalence of each event type Pr(Ek). The space-time pattern of each event type, Pr(H1(S, Ek) | Ek). The effects of each event type on the multiple data streams, Pr(D | H1(S, Ek)). The relevance of each type We can detect events where the affected region changes over time. of event to the user. 195(Das, Schneider, & Neill, 2009) Passive vs. active learning So far, we have considered the passive learning scenario, in which each day is assigned a label by the user (no event, or event type Ek and affected area S) independently of the system’s output . However, having a user in the loop allows for much interaction and learning than this simple framework In the active learning . scenario, the system presents its output (detected clusters) to the user, and receives feedback on these clusters, each day. This presents an interesting challenge: the system must strike a balance between exploration, presenting “unknown” examples that will best inform its models and exploitation presenting “known” , , examples that have highest probability of relevance to the user. 196 Active learning of new event types This scenario allows the user to define new classes “on the fly”, by assigning a new label type to an example. The system can then find other potential examples of the new class in historical data, ask the user to label these, and learn a model for the new class. “Any clusters of interest today?” “Yes, this appears to be an anthrax attack, based on increased OTC cough and fever.” “No, we don’t see a corresponding increase in ED visits. I think this cluster is just a promotional sale.” “OK. Can you identify which of these historical clusters also correspond to promotional sales?” “Yes, all of these. Any other clusters of interest?” MBSS The end goal is to transform the process of pattern discovery into a “conversation” between the user and system in which the system takes “Not really, just more seasonal flu and another promotional sale.” 197 , an active role in identifying and explaining potentially interesting patterns. D Current Directions. in Spatial Event Detection 1. Incorporating learning into detection 2. Very fast detection algorithms 3. Generalization of spatial methods to non-spatial data 4. Non-aggregated spatial and temporal data 5 I ti i h i f ti f b ti. ncorpora ng r c n orma on rom o serva ons 6. Detecting multiple, dynamic, irregular clusters 7. Integrating detection, tracking, and response 198 8. Combining sensor placement and sensor fusion Which subsets to scan? Since there are exponentially many subsets of the data, it is often computationally infeasible to search all of them. The most common approach is to use domain knowledge to restrict our search space: for example, we assume that an event will affect a contiguous spatial region, and often further restrict the region size and shape . e.g. “search over circular regions centered at a data point” Æ only N2 regions instead of 2N. Another common approach is to perform a heuristic search. For example, we can greedily grow subsets starting from each data record, repeatedly adding the additional record that gives the highest scoring subset.1 Tradeoff: much more efficient than naïve search, but not guaranteed to find highest scoring region. In some cases, we can find the highest-scoring subsets 199 without computing the scores of all possible subsets! 1Neill et al., in Scan Statistics: Methods and Applications, 2009. Fast spatial scan over rectangles Th b f h i l Consider searching over all rectangular regions for data aggregated to a N x N grid. e num er o searc reg ons sca es as O(N4), making an exhaustive search computationally infeasible for large N. We can find the highest scoring clusters without an exhaustive search using branch and bound: we keep track of the highest region score found so far, and prune sets of regions with provably lower scores.1,2 A new multi-resolution data structure, the overlap-kd tree, enables us to k thi h ffi i t We can now monitor nationwide health data in 20 minutes (vs. 1 week). ma e s searc e c en . 200 1Neill and Moore, KDD 2004 2Neill et al., NIPS 2004 Fast spatial scan over rectangles Th b f h i l Consider searching over all rectangular regions for data aggregated to a N x N grid. e num er o searc reg ons sca es as O(N4), making an exhaustive search computationally infeasible for large N. We can find the highest scoring clusters without an exhaustive search using branch and bound: we keep track of the highest region score found so far, and prune sets of regions with provably lower scores.1,2 Other recent work on efficiently maximizing scan statistics over gridded rectangles W t l KDD 2009A l t l SODA 2006 KDD 2006 u e a ., Compute scores for subset of rectangles, bound scores of other rectangles by tiling with evaluated rectangles. garwa e a ., , Fast approximate optimization of convex score functions. Solution within ε of optimal, runtime O((1/ε) N2 log2 N). 201 Can be used for non-convex score functions. Linear-time subset scanning • In certain cases, we can optimize F(S) over the exponentially many subsets of locations, 1while evaluating only O(N) regions. • Many commonly used scan statistics have the property of linear-time subset scanning: • Just sort the locations from highest to lowest priority according to some function … • … then search over groups consisting of the top-k highest priority locations for k = 1 N , .. . The highest scoring subset is guaranteed to be one of these! 202 1Neill, ISDS 2008 The LTSS property • Example: Poisson statistics (Kulldorff, EBP) • F(S) = F(C, B), where C = Σ ci and B = Σ bi are the aggregate count and baseline of region S. • Sort locations si by the ratio of observed to expected count, ci / bi. • Given the ordering s(1) … s(N), we can prove that the top-scoring subset consists of the locations s(1) … s(k) for some k, 1 ≤ k ≤ N. • This follows from the facts that F(S) is convex, increasing with C and decreasing with B. 203 How to use LTSS in practice? • Simplest case: assume all subsets are equally likely (e.g. outbreak that does not cluster spatially) • LTSS gives highest-scoring subset by evaluating N subsets instead of 2N for naïve search. B t hat if e ant to se spatial information to• u w w w u constrain our search over subsets? • Soft constraints: some subsets of locations are more likely than others (non-uniform priors). • Hard constraints: some subsets of locations are not allowed (e.g. non-contiguous or highly irregular regions). • In most cases, we cannot use LTSS directly to find th ti l b t bj t t th t i t 204 e op ma su se su ec o ese cons ra n s. How to use LTSS in practice? • We can use LTSS to speed up the constrained search problem in three ways: 1) For some hard constraints, we can compute the optimal subset by maximizing over multiple LTSS searches (e.g. fast localized scan). 2)We can use the unconstrained maximum score as an upper bound on the constrained maximum (e.g. fast scan over rectangles). 3) For heuristic search, we can use the unconstrained maximum as a starting point, or use the LTSS ordering to guide our search 205 . Fast localized scan • Maximize the spatial scan statistic over regions consisting of a “center” location si and any subset of its k-nearest neighbors, for a fixed constant k. • This is similar to FlexScan1 but does not force the i t b tireg on o e con guous. • Naïve search requires O(N · 2k) time and is t ti ll i f ibl f k > 20compu a ona y n eas e or . • For each center, we can search over all subsets of its k nearest neighbors in O(k) using LTSS - , thus requiring total time O(Nk) + O(N log N) for sorting by priority. 206 1Tango and Takahashi, IJHG, 2005 Evaluation on ED data We compared the time needed to perform localized scans with and without LTSS, as a function of the number of neighbors k for 281 days of Emergency Department , visit data from Allegheny County. k = 15: 869x speedup (4.42 sec. vs. over 1 hour) k = 20: 38 460x speedup , (4.69 sec. vs. over 50 hours) k = 30: 5.21 sec. vs. ~9 yrs. k = 88: 8.12 sec. vs. ~1026 yrs. 207 Linear-time subset scanning Linear-time subset scanning is a powerful and useful tool that enables us to speed up a wide variety of spatial event detection methods. The Poisson, Gaussian, and nonparametric spatial scan statistics all satisfy the LTSS property, as do many other possible statistics. LTSS makes “all subsets” search computationally feasible, makes localized scans feasible even for large values of k, and speeds up searches over “all distinct rectangles” by 2-3 orders of magnitude. Many other LTSS-enabled searches are possible, and these will enable huge speedups for an even wider range of problems. Current work includes extending LTSS to the multivariate and space-time scan statistics, developing fast graph scan algorithms, 208 and incorporating LTSS into our Bayesian scan framework. D Current Directions. in Spatial Event Detection 1. Incorporating learning into detection 2. Very fast detection algorithms 3. Generalization of spatial methods to non-spatial data 4. Non-aggregated spatial and temporal data 5 I ti i h i f ti f b ti. ncorpora ng r c n orma on rom o serva ons 6. Detecting multiple, dynamic, irregular clusters 7. Integrating detection, tracking, and response ©2009 Carnegie Mellon University 209 8. Combining sensor placement and sensor fusion Anomalous Group Detection 1. Learn a Bayesian Network model for the null hypothesis H0 (no events) from the training data. US Port F Port Shipper Name Importer Name Shipping Line Country Commodity Size Weight Value 2. To evaluate a group of records S: 1. Fit the alternate hypothesis Bayesian Network (H1(S)) parameters using Data S. 2. Compute the group score using the likelihood ratio: 3. Greedily grow a group from each record, and output the )H|P(S (S))H | P(S F(S) 0 1= ©2009 Carnegie Mellon University 210 groups with highest score. Neill et al., in Scan Statistics: Methods and Applications, 2009. D Current Directions. in Spatial Event Detection 1. Incorporating learning into detection 2. Very fast detection algorithms 3. Generalization of spatial methods to non-spatial data 4. Non-aggregated spatial and temporal data 5 I ti i h i f ti f b ti. ncorpora ng r c n orma on rom o serva ons 6. Detecting multiple, dynamic, irregular clusters 7. Integrating detection, tracking, and response ©2009 Carnegie Mellon University 211 8. Combining sensor placement and sensor fusion References D Agarwal A McGregor J M Phillips S Venkatasubramanian and Z Zhu Spatial scan statistics:• . , . , . . , . , . . approximations and performance study. Proc. 12th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 24–33, 2006. • D. Agarwal, J. M. Phillips, and S. Venkatasubramanian. The hunting of the bump: On maximizing statistical discrepancy. Proc. Symposium on Discrete Algorithms, 1137–1144, 2006. • D Donoho and J Jin Higher criticism for detecting sparse heterogeneous mixtures Annals of. . . . Statistics, 32(3): 962–994, 2004. • L. Duczmal and R. Assuncao. A simulated annealing strategy for the detection of arbitrary shaped spatial clusters. Computational Statistics and Data Analysis, 45:269–286, 2004. • M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Intl. Conf. on Knowledge Discovery and Data Mining, 1996. • J. Friedman and N. Fisher. Bump hunting in high dimensional data. Statistics and Computing, 9(2):1– 20, 1999. • V. Iyengar. On detecting space-time clusters. Proc. 10th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 587–592, 2004. • M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6): 1481– 1496, 1997. • M. Kulldorff. Prospective time-periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society A, 164: 61–72, 2001. • M. Kulldorff, L. Huang, L. Pickle, and L. Duczmal. An elliptic spatial scan statistic. Statistics in Medicine, 25:3929–3943, 2006. • M. Kulldorff, F. Mostashari, L. Duczmal, W. K. Yih, K. Kleinman, and R. Platt. Multivariate scan statistics for disease surveillance. Statistics in Medicine, 26: 1824–1833, 2007. • M. Makatchev and D.B. Neill. Learning outbreak regions in Bayesian spatial scan statistics. Proc. ICML/UAI/COLT Workshop on Machine Learning for Health Care Applications, 2008. • D.B. Neill and A.W. Moore. Rapid detection of significant spatial clusters. Proceedings of the 10th 212 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 256-265, 2004. • D.B. Neill, A.W. Moore, F. Pereira, and T. Mitchell. Detecting significant multidimensional spatial clusters. In L.K. Saul, et al., eds., Adv. Neural Information Processing Systems 17, 969-976, 2005. References • D.B. Neill, A.W. Moore, M. Sabhnani, and K. Daniel. Detection of emerging space-time clusters. Proc. 11th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 218-227, 2005. • D.B. Neill and A.W. Moore. Anomalous spatial cluster detection. Proceedings of the KDD 2005 Workshop on Data Mining Methods for Anomaly Detection, 2005. • D.B. Neill, A.W. Moore, and G.F. Cooper. A Bayesian spatial scan statistic. In Y. Weiss, et al., eds. Advances in Neural Information Processing Systems 18, 1003-1010, 2006. • D.B. Neill. Detection of spatial and spatio-temporal clusters. Ph.D. thesis, Carnegie Mellon University, Department of Computer Science, 2006. • D.B. Neill. Incorporating learning into disease surveillance systems. Advances in Disease Surveillance 4: 107, 2007. • D.B. Neill and W.L. Gorr. Detecting and preventing emerging epidemics of crime. Advances in Disease Surveillance 4: 13, 2007. • D.B. Neill and J. Lingwall. A nonparametric scan statistic for multivariate disease surveillance. Advances in Disease Surveillance 4: 106, 2007. • D.B. Neill. Fast and flexible outbreak detection by linear-time subset scanning. Advances in Disease Surveillance 5: 48, 2008. • D.B. Neill. An empirical comparison of spatial scan statistics for outbreak detection. International Journal of Health Geographics 8: 20, 2009. • D.B. Neill, G.F. Cooper, K. Das, X. Jiang, and J. Schneider. Bayesian network scan statistics for multivariate pattern detection. In J. Glaz, et al., eds., Scan Statistics: Methods and Applications, 2009. • D.B. Neill. Expectation-based scan statistics for monitoring spatial time series data. International J l f F ti 2009 iourna o orcas ng, , n press. • D.B. Neill and G.F. Cooper. A multivariate Bayesian scan statistic for early event detection and characterization. Machine Learning, 2009, in press. • G. P. Patil and C. Taillie. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Envir. Ecol. Stat., 11: 183–197, 2004. T T d K T k h hi A fl ibl h d ti l t ti ti f d t ti l t I tl J l 213 • . ango an . a a as . ex y s ape spa a scan s a s c or e ec ng c us ers. n . ourna of Health Geographics, 4: 11, 2005. • M. Wu, X. Song, C. Jermaine, S. Ranka and J. Gums. A LRT Framework for Fast Spatial Anomaly Detection. Proc. 15th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 2009.