11 Automatic Summarization Ani Nenkova University of Pennsylvania Sameer Maskey IBM Research Yang Liu University of Texas at Dallas 2 Why summarize? 23 Text summarization News articles Scientific Articles Emails Books Websites Social Media Streams 4 Speech summarization MeetingPhone Conversation Classroom Radio NewsBroadcast News Talk Shows Lecture Chat 35 How to summarize Text & Speech? -Algorithms -Issues -Challenges -Systems Tutorial 6 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Frequency, Lexical chains, TF*IDF, Topic Words, Topic Models [LSA, EM, Bayesian] Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 47 Motivation: where does summarization help?  Single document summarization  Simulate the work of intelligence analyst  Judge if a document is relevant to a topic of interest “Summaries as short as 17% of the full text length speed up decision making twice, with no significant degradation in accuracy.” “Query-focused summaries enable users to find more relevant documents more accurately, with less need to consult the full text of the document.” [Mani et al., 2002] 8 Motivation: multi-document summarization helps in compiling and presenting  Reduce search time, especially when the goal of the user is to find as much information as possible about a given topic  Writing better reports, finding more relevant information, quicker  Cluster similar articles and provide a multi-document summary of the similarities  Single document summary of the information unique to an article [Roussinov and Chen, 2001; Mana-Lopez et al., 2004; McKeown et al., 2005 ] 59 Benefits from speech summarization  Voicemail  Shorter time spent on listening (call centers)  Meetings  Easier to find main points  Broadcast News  Summary of story from mulitiple channels  Lectures  Useful for reviewing of course materials [He et al., 2000; Tucker and Whittaker, 2008; Murray et al., 2009] 10 Assessing summary quality: overview  Responsiveness  Assessor directly rate each summary on a scale  In official evaluations but rarely reported in papers  Pyramid  Assessors create model summaries  Assessors identifies semantic overlap between summary and models  ROUGE  Assessors create model summaries  ROUGE automatically computes word overlap 611 Tasks in summarization Content (sentence) selection  Extractive summarization Information ordering  In what order to present the selected sentences, especially in multi-document summarization Automatic editing, information fusion and compression  Abstractive summaries 12 Extractive (multi-document) summarization Input text2Input text1 Input text3 Summary 1. Selection 2. Ordering 3. 
Fusion (compute informativeness)
13 Computing informativeness
- Topic models (unsupervised)
  - Figure out what the topic of the input is
    - Frequency, lexical chains, TF*IDF
    - LSA, content models (EM, Bayesian)
  - Select informative sentences based on the topic
- Graph models (unsupervised)
  - Sentence centrality
- Supervised approaches
  - Ask people which sentences should be in a summary
  - Use any imaginable feature to learn to predict human choices
14 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, Lexical chains, TF*IDF, Topic Words, Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency
15 Frequency as document topic proxy: 10 incarnations of an intuition
- Simple intuition: look only at the document(s)
  - Words that repeatedly appear in the document are likely to be related to the topic of the document
  - Sentences that repeatedly appear in different input documents represent themes in the input
- But what appears in other documents is also helpful in determining the topic
  - Background corpus probabilities/weights for words
16 What is an article about?
- Word probability/frequency
  - Proposed by Luhn in 1958 [Luhn, 1958]
  - Frequent content words would be indicative of the topic of the article
- In multi-document summarization, words or facts repeated in the input are more likely to appear in human summaries [Nenkova et al., 2006]
17 Word probability/weights
[Figure: an INPUT cluster about the Libya bombing trial, Gadhafi, the suspects, and the UK and USA; a WORD PROBABILITY TABLE estimated from the input; and a SUMMARY sentence "Libya refuses to surrender two Pan Am bombing suspects". HOW do we get from the table to the summary?]
Word probability table (excerpt): pan 0.0798, am 0.0825, libya 0.0096, suspects 0.0341, gadhafi 0.0911, trial 0.0002, …, usa 0.0007
18 HOW: Main steps in sentence selection according to word probabilities
- Step 1: Estimate word weights (probabilities)
- Step 2: Estimate sentence weights: Weight(Sent) = CF(w_i ∈ Sent), a composition function over the weights of the words in the sentence
- Step 3: Choose the best sentence
- Step 4: Update word weights
- Step 5: Go to 2 if desired length not reached
19 More specific choices [Vanderwende et al., 2007; Yih et al., 2007; Haghighi and Vanderwende, 2009]
- Select the highest scoring sentence, with Score(S) = (1/|S|) Σ_{w∈S} p(w)
- Update word probabilities for the selected sentence to reduce redundancy: p_new(w) = p_old(w) · p_old(w)
- Repeat until the desired summary length is reached (a short code sketch of this loop appears below)
20 Is this a reasonable approach? Yes, people seem to be doing something similar
- Simple test
  - Compute a word probability table from the input
  - Get a batch of summaries written by H(umans) and S(ystems)
  - Compute the likelihood of the summaries given the word probability table
- Results
  - Human summaries have higher likelihood
  - [Figure: summaries ordered from LOW to HIGH likelihood: HSSSSSSSSSSHSSSHSSHHSHHHHH — human (H) summaries concentrate at the high-likelihood end]
21 Obvious shortcomings of the pure frequency approaches
- Does not take account of related words
  - suspects – trial
  - Gadhafi – Libya
- Does not take into account evidence from other documents
  - Function words: prepositions, articles, etc.
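Before completing the list of shortcomings, here is the sketch promised above: a minimal Python rendering of the SumBasic-style greedy loop from slides 18-19. The tokenization, sentence splitting, and length limit are illustrative assumptions, not part of the original slides.

```python
from collections import Counter
import re

def sumbasic(documents, max_words=100):
    """Greedy sentence selection by average word probability (SumBasic-style sketch)."""
    # Step 1: estimate word probabilities from the full input.
    sentences = [s for doc in documents for s in re.split(r"(?<=[.!?])\s+", doc) if s]
    words = [w for s in sentences for w in re.findall(r"[a-z]+", s.lower())]
    total = len(words) or 1
    prob = {w: c / total for w, c in Counter(words).items()}

    summary, length = [], 0
    while sentences and length < max_words:
        # Step 2: Score(S) = (1/|S|) * sum of p(w) over the words of S.
        def score(sent):
            toks = re.findall(r"[a-z]+", sent.lower())
            return sum(prob.get(w, 0.0) for w in toks) / max(len(toks), 1)

        # Step 3: choose the best sentence.
        best = max(sentences, key=score)
        sentences.remove(best)
        summary.append(best)
        length += len(best.split())

        # Step 4: p_new(w) = p_old(w) * p_old(w) for the words just used, to reduce redundancy.
        for w in set(re.findall(r"[a-z]+", best.lower())):
            if w in prob:
                prob[w] **= 2
    return summary
```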
Domain words: “cell” in cell biology articles  Does not take into account many other aspects 22 Two easy fixes  Lexical chains [Barzilay and Elhadad, 1999, Silber and McCoy, 2002, Gurevych and Nahnsen, 2005]  Exploits existing lexical resources (WordNet)  TF*IDF weights [most summarizers]  Incorporates evidence from a background corpus 12 23 Lexical chains and WordNet relations  Lexical chains  Word sense disambiguation is performed  Then topically related words represent a topic  Synonyms, hyponyms, hypernyms  Importance is determined by frequency of the words in a topic rather than a single word  One sentence per topic is selected  Concepts based on WordNet [Schiffman et al., 2002, Ye et al., 2007]  No word sense disambiguation is performed  {war, campaign, warfare, effort, cause, operation}  {concern, carrier, worry, fear, scare} 24 TF*IDF weights for words Combining evidence for document topics from the input and from a background corpus  Term Frequency (TF)  Times a word occurs in the input  Inverse Document Frequency (IDF)  Number of documents (df) from a background corpus of N documents that contain the word )/log(* dfNtfIDFTF ×= 13 25 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 26 Topic words (topic signatures)  Which words in the input are most descriptive?  Instead of assigning probabilities or weights to all words, divide words into two classes: descriptive or not  For iterative sentence selection approach, the binary distinction is key to the advantage over frequency and TF*IDF  Systems based on topic words have proven to be the most successful in official summarization evaluations 14 27 Example input and associated topic words  Input for summarization: articles relevant to the following user need Title: Human Toll of Tropical Storms Narrative: What has been the human toll in death or injury of tropical storms in recent years? Where and when have each of the storms caused human casualties? What are the approximate total number of casualties attributed to each of the storms? ahmed, allison, andrew, bahamas, bangladesh, bn, caribbean, carolina, caused, cent, coast, coastal, croix, cyclone, damage, destroyed, devastated, disaster, dollars, drowned, flood, flooded, flooding, floods, florida, gulf, ham, hit, homeless, homes, hugo, hurricane, insurance, insurers, island, islands, lloyd, losses, louisiana, manila, miles, nicaragua, north, port, pounds, rain, rains, rebuild, rebuilding, relief, remnants, residents, roared, salt, st, storm, storms, supplies, tourists, trees, tropical, typhoon, virgin, volunteers, weather, west, winds, yesterday. 
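Returning briefly to slide 24: a minimal sketch of TF*IDF term weighting, TF*IDF = tf × log(N/df), with term frequency taken from the input and document frequency from a background corpus of N documents. The toy input and background corpus below are invented for illustration.

```python
import math
from collections import Counter

def tfidf_weights(input_words, background_docs):
    """TF*IDF = tf * log(N / df): tf from the input, df from a background corpus."""
    tf = Counter(input_words)
    N = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(doc))  # each background document counts a word at most once
    # Words unseen in the background are treated as occurring in one document.
    return {w: tf[w] * math.log(N / max(df[w], 1)) for w in tf}

# Toy example: "temblor" is frequent in the input but rare in the background,
# so it gets a high weight; "the" occurs everywhere and is weighted down.
weights = tfidf_weights(
    ["the", "temblor", "hit", "the", "coast", "temblor"],
    [["the", "match", "ended"], ["the", "rain", "stopped"], ["a", "temblor", "hit"]],
)
```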
Topic Words 28 Formalizing the problem of identifying topic words  Given  t: a word that appears in the input  T: cluster of articles on a given topic (input)  NT: articles not on topic T (background corpus)  Decide if t is a topic word or not  Words that have (almost) the same probability in T and NT are not topic words 15 29 Computing probabilities  View a text as a sequence of Bernoulli trails  A word is either our term of interest t or not  The likelihood of observing term t which occurs with probability p in a text consisting of N words is given by  Estimate the probability of t in three ways  Input + background corpus combines  Input only  Background only t 30 Testing which hypothesis is more likely: log-likelihood ratio test has a known statistical distribution: chi-square At a given significance level, we can decide if a word is descriptive of the input or not. This feature is used in the best performing systems for multi-document summarization of news [Lin and Hovy, 2000; Conroy et al., 2006] Likelihood of the data given H1 Likelihood of the data given H2 λ = -2 log λ 16 31 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 32 The background corpus takes more central stage  Learn topics from the background corpus  topic ~ themes often discusses in the background  topic representation ~ word probability tables  Usually one time training step  To summarize an input  Select sentences from the input that correspond to the most prominent topics 17 33 Latent semantic analysis (LSA) [Gong and Liu, 2001, Hachey et al., 2006, Steinberger et al., 2007]  Discover topics from the background corpus with n unique words and d documents  Represent the background corpus as nxd matrix A  Rows correspond to words  Aij=number of times word I appears in document j  Use standard change of coordinate system and dimensionality reduction techniques  In the new space each row corresponds to the most important topics in the corpus  Select the best sentence to cover each topic TUPVA = 34 Notes on LSA and other approaches  The original article that introduced LSA for single document summarization of news did not find significant difference with TF*IDF  For multi-document summarization of news LSA approaches have not outperformed topic words or extensions of frequency approaches  Other topic/content models have been much more influential 18 35 Domain dependent content models  Get sample documents from the domain  background corpus  Cluster sentences from these documents  Implicit topics  Obtain a word probability table for each topic  Counts only from the cluster representing the topic  Select sentences from the input with highest probability for main topics 36 Text structure can be learnt  Human-written examples from a domain Location, time relief efforts magnitude damage 19 37 Topic = cluster of similar sentences from the background corpus  Sentences cluster from earthquake articles  Topic “earthquake location”  The Athens seismological institute said the temblor’s epicenter was located 380 kilometers (238 miles) south of the capital.  
- Seismologists in Pakistan's Northwest Frontier Province said the temblor's epicenter was about 250 kilometers (155 miles) north of the provincial capital Peshawar.
- The temblor was centered 60 kilometers (35 miles) northwest of the provincial capital of Kunming, about 2,200 kilometers (1,300 miles) southwest of Beijing, a bureau seismologist said.
38 Content model [Barzilay and Lee, 2004; Fung et al., 2003]
- Hidden Markov Model (HMM) based
- States: clusters of related sentences ("topics")
- Transition prob.: sentence precedence in the corpus
- Emission prob.: bigram language model
- p(⟨s_{i+1}, h_{i+1}⟩ | ⟨s_i, h_i⟩) = p_t(h_{i+1} | h_i) · p_e(s_{i+1} | h_{i+1}) — the transition from the previous topic times the probability of generating the sentence in the current topic
- [Figure: HMM over earthquake reports, with states such as "location, magnitude", "casualties", "relief efforts"]
39 Learning the content model
- Many articles from the same domain
- Cluster sentences: each cluster represents a topic from the domain
  - Word probability tables for each topic
- Transitions between clusters can be computed from sentence adjacencies in the original articles
  - Probabilities of going from one topic to another
- Iterate between clustering and transition probability estimation to obtain the domain model
40 To select a summary
- Find the main topics in the domain
  - using a small collection of summary-input pairs
- Find the most likely topic for each sentence in the input
- Select the best sentence per main topic
41 Historical note
- Some early approaches to multi-document summarization relied on clustering the sentences in the input alone [McKeown et al., 1999; Siddharthan et al., 2004]
- Clusters of similar sentences represent a theme in the input
- Clusters with more sentences are more important
- Select one sentence per important cluster
42 Example cluster — choose one sentence to represent the cluster
1. PAL was devastated by a pilots' strike in June and by the region's currency crisis.
2. In June, PAL was embroiled in a crippling three-week pilots' strike.
3. Tan wants to retain the 200 pilots because they stood by him when the majority of PAL's pilots staged a devastating strike in June.
43 Bayesian content models
- Takes a batch of inputs for summarization
- Many word probability tables
  - One for general English
  - One for each of the inputs to be summarized
  - One for each document in any input
- To select a summary S with L words from a document collection D given as input: S* = argmin_{S : words(S) ≤ L} KL(P_D || P_S)
- The goal is to select the summary as a whole, not a single sentence; greedy vs. global selection will be discussed in detail later.
44 KL divergence
- Distance between two probability distributions P and Q
- P, Q: input and summary word distributions
- KL(P || Q) = Σ_w p_P(w) log₂ [ p_P(w) / p_Q(w) ]
45 Intriguing side note
- In the full Bayesian topic models, word probabilities for all words are more important than a binary distinction between topic and non-topic words
- Haghighi and Vanderwende report that a system that chooses the summary with the highest expected number of topic words performs like SumBasic
46 Review
- Frequency-based informativeness has been used in building summarizers
  - Topic words are probably more useful
- Topic models
  - Latent Semantic Analysis
  - Domain-dependent content model
  - Bayesian content model
47 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency
48 Using graph representations [Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Leskovec et al., 2005]
- Nodes
  - Sentences
  - Discourse entities
- Edges
  - Between similar sentences
  - Between syntactically related entities
- Computing sentence similarity
  - Distance between their TF*IDF weighted vector representations
49-50 [Figures: a graph over input sentences (e.g., "Iraqi vice president…", "Ivanov contended…") with edges weighted by pairwise similarity, e.g. Sim(d1s1, d3s2)]
51 Advantages of the graph model
- Combines word frequency and sentence clustering
- Gives a formal model for computing importance: random walks
  - Normalize the weights of the edges to sum to 1
  - They now represent probabilities of transitioning from one node to another
52 Random walks for summarization
- Represent the input text as a graph
- Start traversing from node to node
  - following the transition probabilities
  - occasionally hopping to a new node
- What is the probability that you are in any particular node after doing this process for a certain time?
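A minimal sketch of that computation in the LexRank/TextRank style: build the sentence similarity graph, row-normalize edge weights into transition probabilities, and run power iteration with a small probability of hopping to a random node. The tokenizer, similarity measure, and damping value are illustrative choices.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def lexrank(sentences, damping=0.15, iters=50):
    """Sentence weights = stationary distribution of a random walk on the similarity graph."""
    vecs = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weights = pairwise similarity; normalize each row into transition probabilities.
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([v / total if total else 1.0 / n for v in row])
    # Power iteration: with prob. (1 - damping) follow an edge, with prob. damping hop anywhere.
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [damping / n + (1 - damping) * sum(p[i] * trans[i][j] for i in range(n))
             for j in range(n)]
    return p  # p[j] is the weight of sentence j
```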
Standard solution (stationary distribution)  This probability is the weight of the sentence 27 53 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 54 Supervised methods  For extractive summarization, the task can be represented as binary classification  A sentence is in the summary or not  Use statistical classifiers to determine the score of a sentence: how likely it’s included in the summary  Feature representation for each sentence  Classification models trained from annotated data  Select the sentences with highest scores (greedy for now, see other selection methods later) 28 55 Features  Sentence length  long sentences tend to be more important  Sentence weight  cosine similarity with documents  sum of term weights for all words in a sentence  calculate term weight after applying LSA 56 Features  Sentence position  beginning is often more important  some sections are more important (e.g., in conclusion section)  Cue words/phrases  frequent n-grams  cue phrases (e.g., in summary, as a conclusion)  named entities 29 57 Features  Contextual features  features from context sentences  difference of a sentence and its neighboring ones  Speech related features (more later):  acoustic/prosodic features  speaker information (who said the sentence, is the speaker dominant?)  speech recognition confidence measure 58 Classifiers  Can classify each sentence individually, or use sequence modeling  Maximum entropy [Osborne, 2002]  Condition random fields (CRF) [Galley, 2006]  Classic Bayesian Method [Kupiec et al., 1995]  HMM [Conroy and O'Leary, 2001; Maskey, 2006 ]  Bayesian networks  SVMs [Xie and Liu, 2010]  Regression [Murray et al., 2005]  Others 30 59 So that is it with supervised methods?  It seems it is a straightforward classification problem  What are the issues with this method?  How to get good quality labeled training data  How to improve learning  Some recent research has explored a few directions  Discriminative training, regression, sampling, co- training, active learning 60 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 31 61 Improving supervised methods: different training approaches  What are the problems with standard training methods?  
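For reference before those problems are listed, a minimal sketch of the standard setup they refer to (slides 54-58): a few of the listed features per sentence and a binary classifier. scikit-learn's logistic regression and the toy training labels are assumptions made for the example; any of the classifiers listed on slide 58 could be substituted.

```python
import math
import re
from collections import Counter
from sklearn.linear_model import LogisticRegression

def features(sentences):
    """Per-sentence features: length, relative position, cosine similarity to the whole document."""
    doc_vec = Counter(w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))
    feats = []
    for i, s in enumerate(sentences):
        vec = Counter(re.findall(r"[a-z]+", s.lower()))
        num = sum(vec[w] * doc_vec[w] for w in vec)
        den = (math.sqrt(sum(v * v for v in vec.values())) *
               math.sqrt(sum(v * v for v in doc_vec.values())))
        feats.append([len(s.split()),                    # sentence length
                      i / max(len(sentences) - 1, 1),    # relative position
                      num / den if den else 0.0])        # similarity to the document
    return feats

# Training: label 1 if a human put the sentence in the extract, else 0 (toy labels here).
train_sents = ["A strong quake hit the coast.", "Officials gave a briefing.",
               "Thousands were left homeless.", "The weather was mild."]
labels = [1, 0, 1, 0]
clf = LogisticRegression().fit(features(train_sents), labels)

# At test time, rank sentences by the classifier's summary-class probability.
test_sents = ["Rescue teams reached the area.", "He likes tea."]
scores = clf.predict_proba(features(test_sents))[:, 1]
```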
Classifiers learn to determine a sentence’s label (in summary or not)  Sentence-level accuracy is different from summarization evaluation criterion (e.g., summary-level ROUGE scores)  Training criterion is not optimal  Sentences’ labels used in training may be too strict (binary classes) 62 Improving supervised methods: MERT discriminative training  Discriminative training based on MERT [Aker et al., 2010]  In training, generate multiple summary candidates (using A* search algorithm)  Adjust model parameters (feature weights) iteratively to optimize ROUGE scores Note: MERT has been used for machine translation discriminative training 32 63 Improving supervised methods: ranking approaches  Ranking approaches [Lin et al. 2010]  Pair-wise training  Not classify each sentence individually  Input to learner is a pair of sentences  Use Rank SVM to learn the order of two sentences  Direct optimization  Learns how to correctly order/rank summary candidates (a set of sentences)  Use AdaRank [Xu and Li 2007] to combine weak rankers 64 Improving supervised methods: regression model  Use regression model [Xie and Liu, 2010]  In training, a sentence’s label is not +1 and -1  Each one is labeled with numerical values to represent their importance  Keep +1 for summary sentence  For non-summary sentences (-1), use their similarity to the summary as labels  Train a regression model to better discriminate sentence candidates 33 65 Improving supervised methods: sampling  Problems -- in binary classification setup for summarization, the two classes are imbalanced  Summary sentences are minority class.  Imbalanced data can hurt classifier training  How can we address this?  Sampling to make distribution more balanced to train classifiers  Has been studied a lot in machine learning 66 Improving supervised methods: sampling  Upsampling: increase minority samples  Replicate existing minority samples  Generate synthetic examples (e.g., by some kind of interpolation)  Downsampling: reduce majority samples  Often randomly select from existing majority samples 34 67 Improving supervised methods: sampling  Sampling for summarization [Xie and Liu, 2010]  Different from traditional upsampling and downsampling  Upsampling  select non-summary sentences that are like summary sentences based on cosine similarity or ROUGE scores  change their label to positive  Downsampling:  select those that are different from summary sentences  These also address some human annotation disagreement  The instances whose labels are changed are often the ones that humans have problems with 68 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-raining Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 35 69 Supervised methods: data issues  Need labeled data for model training  How do we get good quality training data?  Can ask human annotators to select extractive summary sentences  However, human agreement is generally low  What if data is not labeled at all? or it only has abstractive summary? 70  Distributions of content units and words are similar  Few units are expressed by everyone; many units are expressed by only one person Do humans agree on summary sentence selection? 
Human agreement on word/sentence/fact selection 36 71 Supervised methods: semi-supervised learning  Question – can we use unlabeled data to help supervised methods?  A lot of research has been done on semi- supervised learning for various tasks  Co-training and active learning have been used in summarization 72 Co-training  Use co-training to leverage unlabeled data  Feature sets represent different views  They are conditionally independent given the class label  Each is sufficient for learning  Select instances based on one view, to help the other classifier 37 73 Co-training in summarization  In text summarization [Wong et al., 2008]  Two classifiers (SVM, naïve Bayes) are used on the same feature set  In speech summarization [Xie et al., 2010]  Two different views: acoustic and lexical features  They use both sentence and document as selection units 74 Active learning in summarization  Select samples for humans to label  Typically hard samples, machines are not confident, informative ones  Active learning in lecture summarization [Zhang et al. 2009]  Criterion: similarity scores between the extracted summary sentences and the sentences in the lecture slides are high 38 75 Supervised methods: using labeled abstractive summaries  Question -- what if I only have abstractive summaries, but not extractive summaries?  No labeled sentences to use for classifier training in extractive summarization  Can use reference abstract summary to automatically create labels for sentences  Use similarity of a sentence to the human written abstract (or ROUGE scores, other metrics) 76 Comment on supervised performance  Easier to incorporate more information  At the cost of requiring a large set of human annotated training data  Human agreement is low, therefore labeled training data is noisy  Need matched training/test conditions  may not easily generalize to different domains  Effective features vary for different domains  e.g., position is important for news articles 39 77 Comments on supervised performance  Seems supervised methods are more successful in speech summarization than in text  Speech summarization is almost never multi- document  There are fewer indications about the topic of the input in speech domains  Text analysis techniques used in speech summarization are relatively simpler 78 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 40 79 Parameters to optimize  In summarization methods we try to find 1. Most significant sentences 2. Remove redundant ones 3. Keep the summary under given length  Can we combine all 3 steps in one?  
Optimize all 3 parameters at once 80 Summarization as an optimization problem  Knapsack Optimization Problem Select boxes such that amount of money is maximized while keeping total weight under X Kg  Summarization Problem Select sentences such that summary relevance is maximized while keeping total length under X words  Many other similar optimization problems  General Idea: Maximize a function given a set of constraints 41 81 Optimization methods for summarization  Different flavors of solutions  Greedy Algorithm  Choose highest valued boxes  Choose the most relevant sentence  Dynamic Programming algorithm  Save intermediate computations  Look at both relevance and length  Integer Linear Programming  Exact Inference  Scaling Issues We will now discuss these 3 types of optimization solutions 82 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 42 83 Greedy optimization algorithms  Greedy solution is an approximate algorithm which may not be optimal  Choose the most relevant + least redundant sentence if the total length does not exceed the summary length  Maximal Marginal Relevance is one such greedy algorithm proposed by [Carbonell et al., 1998] 84 Maximal Marginal Relevance (MMR) [Carbonell et al., 1998]  Summary: relevant and non-redundant information  Many summaries are built based on sentences ranked by relevance  E.g. Extract most relevant 30% of sentences Relevance Redundancyvs.  Summary should maximize relevant information as well as reduce redundancy 43 85 Marginal relevance  “Marginal Relevance” or “Relevant Novelty”  Measure relevance and novelty separately  Linearly combine these two measures  High Marginal relevance if  Sentence is relevant to story (significant information)  Contains minimal similarity to previously selected sentences (new novel information)  Maximize Marginal Relevance to get summary that has significant non-redundant information 86 Relevance with query or centroid  We can compute relevance of text snippet with respect to query or centroid  Centroid as defined in [Radev, 2004]  based on the content words of a document  TF*IDF vector of all documents in corpus  Select words above a threshold : remaining vector is a centroid vector 44 87 Maximal Marginal Relevance (MMR) [Carbonell et al., 1998]  Q – document centroid/user query  D – document collection  R – ranked listed  S – subset of documents in R already selected  Sim – similarity metric  Lambda =1 produces most significant ranked list  Lambda = 0 produces most diverse ranked list MMR≈ Argmax(Di∈R−S)[λ(Sim1(Di, Q))−(1−λ)max(Dj∈S)Sim2(Di, Dj)] 88 MMR based Summarization [Zechner, 2000] Iteratively select next sentence Next Sentence = Frequency Vector of all content words centroid 45 89 MMR based summarization  Why this iterative sentence selection process works?  1st Term: Find relevant sentences similar to centroid of the document  2nd Term: Find redundancy ─ sentences that are similar to already selected sentences are not selected 90  MMR is an iterative sentence selection process  decision made for each sentence  Is this selected sentence globally optimal? 
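A compact sketch of the MMR loop just described, MMR = argmax over Di in R−S of [ λ·Sim1(Di, Q) − (1−λ)·max over Dj in S of Sim2(Di, Dj) ], with cosine similarity standing in for both Sim1 and Sim2 and an illustrative λ; it makes the greedy, sentence-by-sentence nature of the decision explicit.

```python
import math
import re
from collections import Counter

def _vec(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cos(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def mmr_select(sentences, query, k=3, lam=0.7):
    """Iteratively pick the sentence with maximal marginal relevance:
    relevant to the query/centroid, minimally similar to what is already selected."""
    remaining = list(sentences)
    selected = []
    while remaining and len(selected) < k:
        def marginal_relevance(s):
            relevance = _cos(_vec(s), _vec(query))          # 1st term: similarity to query/centroid
            redundancy = max((_cos(_vec(s), _vec(t)) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy  # 2nd term penalizes redundancy
        best = max(remaining, key=marginal_relevance)
        remaining.remove(best)
        selected.append(best)
    return selected
```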
Sentence selection in MMR Sentence with same level of relevance but shorter may not be selected if a longer relevant sentence is already selected 46 91 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 92 Global inference D=t1, t2, , tn−1, tn  Modify our greedy algorithm  add constraints for sentence length as well  Let us define document D with tn textual units 47 93 Global inference  Let us define Relevance of ti to be in the summary Redundancy between ti and tj Length of til(i) Red(i,j) Rel(i) 94 Inference problem [McDonald, 2007]  Let us define inference problem as Summary Score Pairwise RedundancyMaximum Length 48 95 Greedy solution [McDonald, 2007] Sort by Relevance Select Sentence  Sorted list may have longer sentences at the top  Solve it using dynamic programming  Create table and fill it based on length and redundancy requirements No consideration of sentence length 96 Dynamic programming solution [McDonald, 2007] High scoring summary of length k and i-1 text unitsHigh scoring summary of length k-l(i) + ti Higher ? 49 97  Better than the previously shown greedy algorithm  Maximizes the space utilization by not inserting longer sentences  These are still approximate algorithms: performance loss? Dynamic programming algorithm [McDonald, 2007] 98 Inference algorithms comparison [McDonald, 2007] System 50 100 200 Baseline 26.6/5.3 33.0/6.8 39.4/9.6 Greedy 26.8/5.1 33.5/6.9 40.1/9.5 Dynamic Program 27.9/5.9 34.8/7.3 41.2/10.0 Summarization results: Rouge-1/Rouge-2 Sentence Length 50 99 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 100 Integer Linear Programming (ILP) [Gillick and Favre, 2009; Gillick et al., 2009; McDonald, 2007]  Greedy algorithm is an approximate solution  Use exact solution algorithm with ILP (scaling issues though)  ILP is constrained optimization problem  Cost and constraints are linear in a set of integer variables  Many solvers on the web  Define the constraints based on relevance and redundancy for summarization  Sentence based ILP  N-gram based ILP 51 101 Sentence-level ILP formulation [McDonald, 2007] 1 if ti in summary Constraints Optimization Function 102 N-gram ILP formulation [Gillick and Favre, 2009; Gillick et al., 2009]  Sentence-ILP constraint on redundancy is based on sentence pairs  Improve by modeling n-gram-level redundancy  Redundancy implicitly defined Ci indicates presence of n-gram i in summary and its weight is wi ∑ i wici 52 103 N-gram ILP formulation [Gillick and Favre, 2009] Constraints Optimization Function n-gram level ILP has different optimization function than one shown before 104 Sentence vs. 
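Before the sentence-level vs. n-gram-level comparison, a minimal sketch of a sentence-level ILP in the spirit of [McDonald, 2007], using the open-source PuLP package as an assumed solver; the pairwise redundancy term of the full formulation is omitted here to keep the sketch short.

```python
# pip install pulp  (PuLP is one freely available ILP solver interface; assumed here)
from pulp import LpProblem, LpVariable, LpMaximize, lpSum

def ilp_summary(sentences, relevance, lengths, max_len):
    """Sentence-level ILP sketch: maximize total relevance subject to a length budget.
    (The full formulation also subtracts pairwise redundancy; omitted for brevity.)"""
    prob = LpProblem("summarization", LpMaximize)
    # x[i] = 1 if sentence i is in the summary, 0 otherwise.
    x = [LpVariable(f"x{i}", cat="Binary") for i in range(len(sentences))]
    # Objective: total relevance of the selected sentences.
    prob += lpSum(relevance[i] * x[i] for i in range(len(sentences)))
    # Constraint: total length of the selected sentences stays within the budget.
    prob += lpSum(lengths[i] * x[i] for i in range(len(sentences))) <= max_len
    prob.solve()
    return [sentences[i] for i in range(len(sentences)) if x[i].value() == 1]
```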
n-gram ILP System ROUGE-2 Pyramid Baseline 0.058 0.186 Sentence ILP [McDonald, 2007] 0.072 0.295 N-gram ILP [Gillick and Favre, 2009] 0.110 0.345 53 105 Other optimization based summarization algorithms  Submodular selection [Lin et al., 2009]  Submodular set functions for optimization  Modified greedy algorithm [Filatova, 2004]  Event based features  Stack decoding algorithm [Yih et al., 2007]  Multiple stacks, each stack represents hypothesis of different length  A* Search [Aker et al., 2010]  Use scoring and heuristic functions 106 Submodular selection for summarization [Lin et al., 2009]  Summarization Setup  V – set of all sentences in document  S – set of extraction sentences  f(.) scores the quality of the summary  Submodularity been used in solving many optimization problems in near polynomial time  For summarization: Select subset S (sentences) representative of V given the constraint |S| =< K (budget) 54 107 Submodular selection [Lin et al., 2009]  If V are nodes in a Graph G=(V,E) representing sentences  And E represents edges (i,j) such that w(i,j) represents similarity between sentences i and j  Introduce submodular set functions which measures “representative” S of entire set V  [Lin et al., 2009] presented 4 submodular set functions 108 Submodular selection for summarization [Lin et al., 2009] Comparison of results using different methods 55 109 Review: optimization methods  Global optimization methods have shown to be superior than 2-step selection process and reduce redundancy  3 parameters are optimized together  Relevance  Redundancy  Length  Various Algorithms for Global Inference  Greedy  Dynamic Programming  Integer Linear Programming  Submodular Selection 110 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 56 111 Speech summarization  Increasing amount of data available in speech form  meetings, lectures, broadcast, youtube, voicemail  Browsing is not as easy as for text domains  users need to listen to the entire audio  Summarization can help effective information access  Summary output can be in the format of text or speech 112 Domains  Broadcast news  Lectures/presentations  Multiparty meetings  Telephone conversations  Voicemails 57 113 Example Meeting transcripts and summary sentences (in red) so it’s possible that we could do something like a summary node of some sort that me003 but there is some technology you could try to applyme010 yeahme010 now I don’t know that any of these actually apply in this case me010 uh so if you co- you could ima- and i-me010 mmmme003 there’re ways to uh sort of back off on the purity of your bayes-net-edness me010 andme010 uh i- i slipped a paper to bhaskara and about noisy- or’s and noisy-maxes me010 which is there are technical ways of doing itme010 uh let me just mention something that i don’t want to pursue today me010 there there are a variety of ways of doing itme010 Broadcast news transcripts and summary (in red) try to use electrical appliances before p.m. and after p.m. 
and turn off computers, copiers and lights when they're not being used set your thermostat at 68 degrees when you're home, 55 degrees when you're away energy officials are offering tips to conserve electricity, they say, to delay holiday lighting until after at night the area shares power across many states meanwhile, a cold snap in the pacific northwest is putting an added strain on power supplies coupled with another unit, it can provide enough power for about 2 million people it had been shut down for maintenance a unit at diablo canyon nuclear plant is expected to resume production today california's strained power grid is getting a boost today which might help increasingly taxed power supplies 114 Speech vs. text summarization: similarities  When high quality transcripts are available  Not much different from text summarization  Many similar approaches have been used  Some also incorporate acoustic information  For genres like broadcast news, style is also similar to text domains 58 115 Speech vs. text summarization: differences  Challenges in speech summarization  Speech recognition errors can be very high  Sentences are not as well formed as in most text domains: disfluencies, ungrammatical  There are not clearly defined sentences  Information density is also low (off-topic discussions, chit chat, etc.)  Multiple participants 116 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 59 117 What should be extraction units in speech summarization?  Text domain  Typically use sentences (based on punctuation marks)  Speech domain  Sentence information is not available  Sentences are not as clearly defined Utterance from previous example: there there are a variety of ways of doing it uh let me just mention something that i don’t want to pursue today which is there are technical ways of doing it 118 Automatic sentence segmentation (side note)  For a word boundary, determine whether it’s a sentence boundary  Different approaches:  Generative: HMM  Discriminative: SVM, boosting, maxent, CRF  Information used: word n-gram, part-of-speech, parsing information, acoustic info (pause, pitch, energy) 60 119 What is the effect of different units/segmentation on summarization?  Research has used different units in speech summarization  Human annotated sentences or dialog acts  Automatic sentence segmentation  Pause-based segments  Adjacency pairs  Intonational phrases  Words 120 What is the effect of different units/segmentation on summarization?  
Findings from previous studies  Using intonational phrases (IP) is better than automatic sentence segmentation, pause-based segmentation [Maskey, 2008 ]  IPs are generally smaller than sentences, also linguistically meaningful  Using sentences is better than words, between filler segments [Furui et al., 2004]  Using human annotated dialog acts is better than automatically generated ones [Liu and Xie, 2008] 61 121 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 122 Using acoustic information in summarization  Acoustic/prosodic features:  F0 (max, min, mean, median, range)  Energy (max, min, mean, median, range)  Sentence duration  Speaking rate (# of words or letters)  Need proper normalization  Widely used in supervised methods, in combination with textual features 62 123 Using acoustic information in summarization  Are acoustic features useful when combining it with lexical information?  Results vary depending on the tasks and domains  Often lexical features are ranked higher  But acoustic features also contribute to overall system performance  Some studies showed little impact when adding speech information to textual features [Penn and Zhu, 2008] 124 Using acoustic information in summarization  Can we use acoustic information only for speech summarization?  Transcripts may not be available  Another way to investigate contribution of acoustic information  Studies showed using just acoustic information can achieve similar performance to using lexical information [Maskey and Hirschberg, 2005; Xie et al., 2009; Zhu et al., 2009]  Caveat: in some experiments, lexical information is used (e.g., define the summarization units) 63 125 Speech recognition errors  ASR is not perfect, often high word error rate  10-20% for read speech  40% or even higher for conversational speech  Recognition errors generally have negative impact on summarization performance  Important topic indicative words are incorrectly recognized  Can affect term weighting and sentence scores 126 Speech recognition errors  Some studies evaluated effect of recognition errors on summarization by varying word error rate [Christensen et al., 2003; Penn and Zhu, 2008; Lin et al., 2009]  Degradation is not much when word error rate is not too low (similar to spoken document retrieval)  Reason: better recognition accuracy in summary sentences than overall 64 127 What can we do about ASR errors?  
Deliver summary using original speech  Can avoid showing recognition errors in the delivered text summary  But still need to correctly identify summary sentences/segments  Use recognition confidence measure and multiple candidates to help better summarize 128 Address problems due to ASR errors  Re-define summarization task: select sentences that are most informative, at the same time have high recognition accuracy  Important words tend to have high recognition accuracy  Use ASR confidence measure or n-gram language model scores in summarization  Unsupervised methods [Zechner, 2002; Kikuchi et al., 2003; Maskey, 2008]  Use as a feature in supervised methods 65 129 Address problems due to ASR errors  Use multiple recognition candidates  n-best lists [Liu et al., 2010]  Lattices [Lin et al., 2010]  Confusion network [Xie and Liu, 2010]  Use in MMR framework  Summarization segment/unit contains all the word candidates (or pruned ones based on probabilities)  Term weights (TF, IDF) use candidate’s posteriors  Improved performance over using 1-best recognition output 130 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 66 131 Disfluencies and summarization  Disfluencies (filler words, repetitions, revisions, restart, etc) are frequent in conversational speech  Example from meeting transcript: so so does i- just remind me of what what you were going to do with the what what what what's y- you just described what you've been doing  Existence may hurt summarization systems, also affect human readability of the summaries 132 Disfluencies and summarization  Natural thought: remove disfluenices  Word-based selection can avoid disfluent words  Using n-gram scores tends to select fluent parts [Hori and Furui, 2001]  Remove disfluencies first, then perform summarization  Does it work? 
not consistent results  Small improvement [Maskey, 2008; Zechner, 2002]  No improvement [Liu et al., 2007] 67 133 Disfluencies and summarization  In supervised classification, information related to disfluencies can be used as features for summarization  Small improvement on Switchboard data [Zhu and Penn, 2006]  Going beyond disfluency removal, can perform sentence compression in conversational speech to remove un-necessary words [Liu and Liu, 2010]  Help improve sentence readability  Output is more like abstractive summaries  Compression helps summarization 134 Review on speech summarization  Speech summarization has been performed for different domains  A lot of text-based approaches have been adopted  Some speech specific issues have been investigated  Segmentation  ASR errors  Disfluencies  Use acoustic information 68 135 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency 136 Manual evaluations  Task-based evaluations  too expensive  Bad decisions possible, hard to fix  Assessors rate summaries on a scale  Responsiveness  Assessors compare with gold-standards  Pyramid 69 137 Automatic and fully automatic evaluation  Automatically compare with gold-standard  Precision/recall (sentence level)  ROUGE (word level)  No human gold-standard is used  Automatically compare input and summary 138 Precision and recall for extractive summaries  Ask a person to select the most important sentences Recall: system-human choice overlap/sentences chosen by human Precision: system-human choice overlap/sentences chosen by system 70 139 Problems?  Different people choose different sentences  The same summary can obtain a recall score that is between 25% and 50% different depending on which of two available human extracts is used for evaluation  Recall more important/informative than precision? 140 More problems?  Granularity We need help. Fires have spread in the nearby forest and threaten several villages in this remote area.  Semantic equivalence  Especially in multi-document summarization  Two sentences convey almost the same information: only one will be chosen in the human summary 71 141 Pyramid Responsiveness ROUGE Fully automatic Model summaries Manual comparison/ ratings Evaluation methods for content 142 Pyramid method [Nenkova and Passonneau, 2004; Nenkova et al., 2007]  Based on Semantic Content Units (SCU)  Emerge from the analysis of several texts  Link different surface realizations with the same meaning 72 143 SCU example S1 Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile. S2 Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government. S3 Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London. 144 SCU: label, weight, contributors Label London was where Pinochet was arrested Weight=3 S1 Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile. S2 Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government. 
S3 Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.
145-150 Ideally informative summary
- Does not include an SCU from a lower tier unless all SCUs from higher tiers are included as well
- [Figures: the same principle illustrated step by step on the SCU pyramid across slides 145-150]
151 Different equally good summaries
- Pinochet arrested
- Arrest in London
- Pinochet is a former Chilean dictator
- Accused of atrocities against Spaniards
152 Different equally good summaries
- Pinochet arrested
- Arrest in London
- On Spanish warrant
- Chile protests
153 Diagnostic ─ why is a summary bad?
- [Figure: pyramid coverage of a good summary vs. a less relevant summary]
154 Importance of content
- Can observe its distribution in human summaries
- Assign relative importance
- Empirical rather than subjective
- The more people agree, the more important
155 Pyramid score for evaluation
- New summary with n content units
- Estimates the percentage of information that is maximally important
- Pyramid score = ( Σ_{i=1}^{n} ObservedWeight_i ) / ( Σ_{i=1}^{n} IdealWeight_i ), the total weight of the SCUs in the summary divided by the weight of an ideally informative summary with n SCUs
156 ROUGE [Lin, 2004]
- De facto standard for evaluation in text summarization
  - High correlation with manual evaluations in that domain
- More problematic for some other domains, particularly speech
  - Not highly correlated with manual evaluations
  - May fail to distinguish human and machine summaries
157 ROUGE details
- In fact a suite of evaluation metrics
  - Unigram
  - Bigram
  - Skip bigram
  - Longest common subsequence
- Many settings concerning
  - Stopwords
  - Stemming
  - Dealing with multiple models
158 How to evaluate without human involvement? [Louis and Nenkova, 2009]
- A good summary should be similar to the input
- Multiple ways to measure similarity
  - Cosine similarity
  - KL divergence
  - JS divergence
- Not all work!
159 JS divergence between input and summary
- Distance between two distributions as the average KL divergence from their mean distribution
- JS(Inp || Summ) = ½ [ KL(Inp || A) + KL(Summ || A) ], where A = (Inp + Summ) / 2 is the mean distribution of the input and summary distributions
160 Summary likelihood given the input
- Probability that the summary is generated according to the term distribution in the input; higher likelihood ~ better summary
- Unigram model: L = p_Inp(w_1)^{n_1} · p_Inp(w_2)^{n_2} · … · p_Inp(w_r)^{n_r}, where r is the size of the summary vocabulary and n_i is the count of word w_i in the summary
- Multinomial model: L = [ N! / (n_1! · … · n_r!) ] · p_Inp(w_1)^{n_1} · … · p_Inp(w_r)^{n_r}, where N = Σ_i n_i is the summary size
161
- Fraction of summary = input's topic words
- % of input's topic words also appearing in summary
- Capture variety
  - Cosine similarity: input's topic words and all summary words
  - Fewer dimensions, more specific vectors
- Topic words identified by the log-likelihood test
162 How good are these metrics? (48 inputs, 57 systems)
Metric | Pyramid | Responsiveness
JSD | -0.880 | -0.736
% input's topic in summary | 0.795 | 0.627
KL div summ-input | -0.763 | -0.694
Cosine similarity | 0.712 | 0.647
% of summary = topic words | 0.712 | 0.602
KL div input-summ | -0.688 | -0.585
Unigram summ prob. | -0.188 | -0.101
Multinomial summ prob. | 0.222 | 0.235
Topic word similarity | -0.699 | 0.629
Spearman correlation on the macro level for the query-focused task.
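A small sketch of two of the metrics discussed above: ROUGE-n recall against a set of model summaries (slides 156-157) and the model-free Jensen-Shannon divergence between input and summary word distributions (slide 159). The tokenization is an illustrative choice.

```python
import math
import re
from collections import Counter

def _ngrams(text, n):
    toks = re.findall(r"[a-z]+", text.lower())
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n_recall(summary, models, n=1):
    """ROUGE-n recall: fraction of the model summaries' n-grams recovered by the summary."""
    summ = _ngrams(summary, n)
    matched = total = 0
    for model in models:
        ref = _ngrams(model, n)
        matched += sum(min(c, summ[g]) for g, c in ref.items())
        total += sum(ref.values())
    return matched / total if total else 0.0

def js_divergence(input_text, summary):
    """JS(Inp || Summ) = 0.5 * [KL(Inp || A) + KL(Summ || A)], A the mean distribution."""
    p = Counter(re.findall(r"[a-z]+", input_text.lower()))
    q = Counter(re.findall(r"[a-z]+", summary.lower()))
    p_tot, q_tot = sum(p.values()) or 1, sum(q.values()) or 1
    js = 0.0
    for w in set(p) | set(q):
        pw, qw = p[w] / p_tot, q[w] / q_tot
        aw = (pw + qw) / 2
        if pw:
            js += 0.5 * pw * math.log2(pw / aw)
        if qw:
            js += 0.5 * qw * math.log2(qw / aw)
    return js
```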
163 How good are these metrics?
- JSD correlations with pyramid scores are even better than R1-recall
- R2-recall is consistently better
- Can extend the features using higher order n-grams

Metric | Pyramid | Resp.
JSD | -0.88 | -0.73
R1-recall | 0.85 | 0.80
R2-recall | 0.90 | 0.87

164 Motivation & Definition Topic Models Graph Based Methods Supervised Techniques Global Optimization Methods Speech Summarization Evaluation Frequency, TF*IDF, Topic Words Topic Models [LSA, EM, Bayesian] Manual (Pyramid), Automatic (Rouge, F-Measure) Fully Automatic Features, Discriminative Training Sampling, Data, Co-training Iterative, Greedy, Dynamic Programming ILP, Sub-Modular Selection Segmentation, ASR Acoustic Information, Disfluency
165 Current summarization research
- Summarization for various new genres
  - Scientific articles
  - Biography
  - Social media (blog, twitter)
  - Other text and speech data
- New task definitions
  - Update summarization
  - Opinion summarization
- New summarization approaches
  - Incorporate more information (deep linguistic knowledge, information from the web)
  - Adopt more complex machine learning techniques
- Evaluation issues
  - Better automatic metrics
  - Extrinsic evaluations
- And more…
166
- Check out summarization papers at ACL this year
- Workshop at ACL-HLT 2011: Automatic summarization for different genres, media, and languages [June 23, 2011]
  - http://www.summarization2011.org/
167 References
Ahmet Aker, Trevor Cohn, Robert Gaizauskas. 2010. Multi-document summarization using A* search and discriminative training. Proc. of EMNLP.
R. Barzilay and M. Elhadad. 1999. Text summarization with lexical chains. In: I. Mani and M. Maybury (eds.): Advances in Automatic Text Summarization.
Jaime Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
H. Christensen, Y. Gotoh, B. Kolluru, and S. Renals. 2003. Are Extractive Text Summarization Techniques Portable to Broadcast News? Proc. of ASRU.
John Conroy and Dianne O'Leary. 2001. Text Summarization via Hidden Markov Models. Proc. of SIGIR.
J. M. Conroy, J. D. Schlesinger, and D. P. O'Leary. 2006. Topic-Focused Multi-Document Summarization Using an Approximate Oracle Score. Proc. COLING/ACL 2006. pp. 152-159.
Thomas Cormen, Charles E. Leiserson, and Ronald L. Rivest. 1990. Introduction to Algorithms. MIT Press.
G. Erkan and D. R. Radev. 2004. LexRank: Graph-based Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research (JAIR).
Pascale Fung, Grace Ngai, and Percy Cheung. 2003. Combining optimal clustering and hidden Markov models for extractive summarization. Proceedings of the ACL Workshop on Multilingual Summarization.
Sadaoki Furui, T. Kikuchi, Y. Shinnaka, and C. Hori. 2004. Speech-to-text and Speech-to-speech Summarization of Spontaneous Speech. IEEE Transactions on Audio, Speech, and Language Processing. 12(4), pages 401-408.
Michel Galley. 2006. A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance. Proc. of EMNLP.
Dan Gillick, Benoit Favre. 2009. A scalable global model for summarization. Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing.
Dan Gillick, Korbinian Riedhammer, Benoit Favre, Dilek Hakkani-Tur. 2009.
A global optimization framework for meeting summarization. Proceedings of ICASSP.  Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. 168 References  Y. Gong and X. Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. Proc. ACM SIGIR.  I. Gurevych and T. Nahnsen. 2005. Adapting Lexical Chaining to Summarize Conversational Dialogues. Proc. RANLP.  B. Hachey, G. Murray, and D. Reitter.2006. Dimensionality reduction aids term co-occurrence based multi- document summarization. In: SumQA 06: Proceedings of the Workshop on Task-Focused Summarization and Question Answering.  Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. Proc. of NAACL-HLT.  L. He, E. Sanocki, A. Gupta, and J. Grudin. 2000. Comparing presentation summaries: Slides vs. reading vs. listening. Proc. of SIGCHI on Human factors in computing systems.  C. Hori and Sadaoki Furui. 2001. Advances in Automatic Speech Summarization. Proc. of Eurospeech.  T. Kikuchi, S. Furui, and C. Hori. 2003. Automatic Speech Summarization based on Sentence Extractive and Compaction. Proc. of ICSLP.  Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A Trainable Document Summarizer. Proc. of SIGIR.  J. Leskovec, N. Milic-frayling, and M. Grobelnik. 2005. Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts. Proc. AAAI.  Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries, Workshop on Text Summarization Branches Out.  C.Y. Lin and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. Proc. COLING.  Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. Proc. of NAACL.  Hui Lin and Jeff Bilmes and Shasha Xie. 2009. Graph-based Submodular Selection for Extractive Summarization. Proceedings of ASRU.  Shih-Hsiang Lin and Berlin Chen. 2009. Improved Speech Summarization with Multiple-hypothesis Representations and Kullback-Leibler Divergence Measures. Proc. of Interspeech.  Shih-Hsiang Lin, Berlin Chen, and H. Min Wang. 2009. A Comparative Study of Probabilistic Ranking Models for Chinese Spoken Document Summarization. ACM Transactions on Asian Language Information Processing. 85 169 References  Shih Hsiang Lin, Yu Mei Chang, Jia Wen Liu, Berlin Chen. 2010 Leveraging Evaluation Metric-related Training Criteria for Speech Summarization. Proc. of ICASSP.  Fei Liu and Yang Liu. 2009. From Extractive to Abstractive Meeting Summaries: Can it be done by sentence compression? Proc. of ACL.  Fei Liu and Yang Liu. 2010. Using Spoken Utterance Compression for Meeting Summarization: A pilot study. Proc. of IEEE SLT.  Yang Liu and Shasha Xie. 2008. Impact of Automatic Sentence Segmentation on Meeting Summarization. Proc. of ICASSP.  Yang Liu, Feifan Liu, Bin Li, and Shasha Xie. 2007. Do Disfluencies Affect Meeting Summarization: A pilot study on the impact of disfluencies. Poster at MLMI.  Yang Liu, Shasha Xie, and Fei Liu. 2010. Using n-best Recognition Output for Extractive Summarization and Keyword Extraction in Meeting Speech. Proc. of ICASSP.  Annie Louis and Ani Nenkova. 2009. Automatically evaluating content selection in summarization without human models. Proceedings of EMNLP  H.P. Luhn. 1958. The Automatic Creation of Literature Abstracts. 
 Inderjeet Mani, Gary Klein, David House, Lynette Hirschman, Therese Firmin, and Beth Sundheim. 2002. SUMMAC: a text summarization evaluation. Natural Language Engineering 8(1), 43-68.
 Manuel J. Mana-Lopez, Manuel De Buenaga, and Jose M. Gomez-Hidalgo. 2004. Multidocument summarization: An added value to clustering in interactive retrieval. ACM Transactions on Information Systems.
 Sameer Maskey. 2008. Automatic Broadcast News Summarization. Ph.D. thesis, Columbia University.
 Sameer Maskey and Julia Hirschberg. 2005. Comparing lexical, acoustic/prosodic, discourse and structural features for speech summarization. Proc. of Interspeech.
 Sameer Maskey and Julia Hirschberg. 2006. Summarizing Speech Without Text Using Hidden Markov Models. Proc. of HLT-NAACL.
 Ryan McDonald. 2007. A Study of Global Inference Algorithms in Multi-document Summarization. Lecture Notes in Computer Science, Advances in Information Retrieval.
 Kathleen McKeown, Rebecca J. Passonneau, David K. Elson, Ani Nenkova, and Julia Hirschberg. 2005. Do summaries help? Proc. of SIGIR.
 K. McKeown, J. L. Klavans, V. Hatzivassiloglou, R. Barzilay, and E. Eskin. 1999. Towards multidocument summarization by reformulation: progress and prospects. Proc. of AAAI.

170 References
 R. Mihalcea and P. Tarau. 2004. TextRank: Bringing order into texts. Proc. of EMNLP.
 G. Murray, S. Renals, J. Carletta, and J. Moore. 2005. Evaluating Automatic Summaries of Meeting Recordings. Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation.
 G. Murray, T. Kleinbauer, P. Poller, T. Becker, S. Renals, and J. Kilgour. 2009. Extrinsic Summarization Evaluation: A Decision Audit Task. ACM Transactions on Speech and Language Processing.
 A. Nenkova and R. Passonneau. 2004. Evaluating Content Selection in Summarization: The Pyramid Method. Proc. of HLT-NAACL.
 A. Nenkova, L. Vanderwende, and K. McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. Proc. of SIGIR.
 A. Nenkova, R. Passonneau, and K. McKeown. 2007. The Pyramid Method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing.
 Miles Osborne. 2002. Using maximum entropy for sentence extraction. Proc. of the ACL Workshop on Automatic Summarization.
 Gerald Penn and Xiaodan Zhu. 2008. A Critical Reassessment of Evaluation Baselines for Speech Summarization. Proc. of ACL-HLT.
 Dmitri G. Roussinov and Hsinchun Chen. 2001. Information navigation on the web by clustering and summarizing query results. Information Processing and Management 37(6), 789-816.
 B. Schiffman, A. Nenkova, and K. McKeown. 2002. Experiments in Multidocument Summarization. Proc. of HLT.
 A. Siddharthan, A. Nenkova, and K. McKeown. 2004. Syntactic Simplification for Improving Content Selection in Multi-Document Summarization. Proc. of COLING.
 H. Gregory Silber and Kathleen F. McCoy. 2002. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics 28(4), 487-496.
 J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Ježek. 2007. Two uses of anaphora resolution in summarization. Information Processing and Management 43(6).
 S. Tucker and S. Whittaker. 2008. Temporal compression of speech: an evaluation. IEEE Transactions on Audio, Speech and Language Processing, pages 790-796.
 L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management 43.
171 References
 Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008. Extractive Summarization using Supervised and Semi-supervised Learning. Proc. of ACL.
 Shasha Xie and Yang Liu. 2010. Improving Supervised Learning for Meeting Summarization using Sampling and Regression. Computer Speech and Language 24, pages 495-514.
 Shasha Xie and Yang Liu. 2010. Using Confusion Networks for Speech Summarization. Proc. of NAACL.
 Shasha Xie, Dilek Hakkani-Tur, Benoit Favre, and Yang Liu. 2009. Integrating Prosodic Features in Extractive Meeting Summarization. Proc. of ASRU.
 Shasha Xie, Hui Lin, and Yang Liu. 2010. Semi-supervised Extractive Speech Summarization via Co-training Algorithm. Proc. of Interspeech.
 S. Ye, T.-S. Chua, M.-Y. Kan, and L. Qiu. 2007. Document concept lattice for text understanding and summarization. Information Processing and Management 43(6).
 W. Yih, J. Goodman, L. Vanderwende, and H. Suzuki. 2007. Multi-Document Summarization by Maximizing Informative Content-Words. Proc. of IJCAI.
 Klaus Zechner. 2002. Automatic Summarization of Open-domain Multiparty Dialogues in Diverse Genres. Computational Linguistics 28, pages 447-485.
 Klaus Zechner and Alex Waibel. 2000. Minimizing word error rate in textual summaries of spoken language. Proc. of NAACL.
 Justin Zhang and Pascale Fung. 2009. Extractive Speech Summarization by Active Learning. Proc. of ASRU.
 Xiaodan Zhu and Gerald Penn. 2006. Comparing the Roles of Textual, Acoustic and Spoken-language Features on Spontaneous Conversation Summarization. Proc. of HLT-NAACL.
 Xiaodan Zhu, Gerald Penn, and F. Rudzicz. 2009. Summarizing Multiple Spoken Documents: Finding Evidence from Untranscribed Audio. Proc. of ACL.
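As noted on slide 163, the two automatic metrics in that table can be made concrete with a few lines of code. The sketch below is illustrative only: the function names, the toy sentences, and the add-epsilon smoothing are assumptions made here, not the exact formulations of ROUGE (Lin, 2004) or of the model-free JSD metric (Louis and Nenkova, 2009).

# Minimal sketch of the two automatic metrics discussed on slide 163 (illustrative
# only; function names and smoothing are assumptions, not a reference implementation).
from collections import Counter
from math import log2

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(summary_tokens, model_tokens, n=2):
    # Fraction of the human model summary's n-grams recovered by the system summary
    # (with clipped counts), i.e. ROUGE-n recall against a single model summary.
    model_counts = Counter(ngrams(model_tokens, n))
    summary_counts = Counter(ngrams(summary_tokens, n))
    overlap = sum(min(count, summary_counts[gram]) for gram, count in model_counts.items())
    total = sum(model_counts.values())
    return overlap / total if total else 0.0

def jsd(summary_tokens, input_tokens):
    # Jensen-Shannon divergence between the summary and input unigram distributions;
    # no human model summaries are needed. Add-epsilon smoothing (an assumption here)
    # keeps the KL terms finite for words that appear on only one side.
    p, q = Counter(summary_tokens), Counter(input_tokens)
    vocab = set(p) | set(q)
    def prob(counts, word):
        return (counts[word] + 1e-9) / (sum(counts.values()) + 1e-9 * len(vocab))
    divergence = 0.0
    for word in vocab:
        pw, qw = prob(p, word), prob(q, word)
        m = 0.5 * (pw + qw)
        divergence += 0.5 * pw * log2(pw / m) + 0.5 * qw * log2(qw / m)
    return divergence

# Toy usage with made-up sentences; the scores are only illustrative.
input_text = "libya refuses to surrender two pan am bombing suspects".split()
system_summary = "libya refuses to surrender the bombing suspects".split()
model_summary = "libya will not surrender the pan am bombing suspects".split()
print(rouge_n_recall(system_summary, model_summary, n=2))  # higher is better
print(jsd(system_summary, input_text))                     # lower is better

In practice ROUGE recall is averaged over several model summaries, while JSD compares the summary against the full input, so a lower divergence indicates a summary that better represents the input distribution.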