IJCAI 2016 Tutorial
Coreference Resolution: Successes and Challenges
Vincent Ng
Human Language Technology Research Institute, University of Texas at Dallas
http://www.hlt.utdallas.edu/~vince/ijcai-2016/coreference

2 Plan for the Talk
- Part I: Background
  - Task definition
  - Why coreference is hard
  - Applications
  - Brief history
- Part II: Machine learning for coreference resolution
  - System architecture
  - Computational models
  - Resources and evaluation (corpora, evaluation metrics, ...)
  - Employing semantics and world knowledge
- Part III: Solving hard coreference problems
  - Difficult cases of overt pronoun resolution
  - Relation to the Winograd Schema Challenge

4 Entity Coreference
- Identify the noun phrases (or entity mentions) that refer to the same real-world entity
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment...

9 Entity Coreference
- Inherently a clustering task
  - the coreference relation is transitive
  - Coref(A,B) ∧ Coref(B,C) → Coref(A,C)

10 Entity Coreference
- Typically recast as the problem of selecting an antecedent for each mention, mj
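Because the coreference relation is transitive, positive pairwise decisions must be closed into clusters. The following is a minimal sketch (not from the tutorial slides) of that clustering step using union-find; mention strings stand in for real mention objects, and the function name `coreference_clusters` is purely illustrative.

```python
# Illustrative sketch: closing pairwise coreference links under transitivity
# with union-find, so that Coref(A,B) and Coref(B,C) also cluster A with C.
from collections import defaultdict

def coreference_clusters(mentions, coreferent_pairs):
    parent = {m: m for m in mentions}

    def find(m):                          # find the cluster representative
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    def union(a, b):                      # merge the two clusters
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in coreferent_pairs:         # each positive pairwise decision
        union(a, b)

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    return list(clusters.values())

mentions = ["Queen Elizabeth", "her", "husband", "King George VI", "the King", "his"]
pairs = [("Queen Elizabeth", "her"), ("husband", "King George VI"),
         ("King George VI", "the King"), ("the King", "his")]
print(coreference_clusters(mentions, pairs))
# [['Queen Elizabeth', 'her'], ['husband', 'King George VI', 'the King', 'his']]
```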
11 Entity Coreference Identify the noun phrases (or entity mentions) that refer to the same real-world entity  Typically recast as the problem of selecting an antecedent for each mention, mj  Does Queen Elizabeth have a preceding mention coreferent with it? If so, what is it? Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment... 12 Entity Coreference Identify the noun phrases (or entity mentions) that refer to the same real-world entity  Typically recast as the problem of selecting an antecedent for each mention, mj  Does her have a preceding mention coreferent with it? If so, what is it? Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment... 13 Entity Coreference Identify the noun phrases (or entity mentions) that refer to the same real-world entity  Typically recast as the problem of selecting an antecedent for each mention, mj  Does husband have a preceding mention coreferent with it? If so, what is it? Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment... 14 Plan for the Talk  Part I: Background  Task definition  Why coreference is hard  Applications  Brief history  Part II: Machine learning for coreference resolution  System architecture  Computational models  Resources and evaluation (corpora, evaluation metrics, …)  Employing semantics and world knowledge  Part III: Solving hard coreference problems  Difficult cases of overt pronoun resolution  Relation to the Winograd Schema Challenge 15 Why it’s hard  Many sources of information play a role  lexical / word: head noun matches  President Clinton = Clinton =? Hillary Clinton  grammatical: number/gender agreement, …  syntactic: syntactic parallelism, binding constraints  John helped himself to... vs. John helped him to…  discourse: discourse focus, salience, recency, …  semantic: semantic class agreement, …  world knowledge  Not all knowledge sources can be computed easily 16 Why it’s hard  Many sources of information play a role  lexical / word: head noun matches  President Clinton = Clinton =? Hillary Clinton  grammatical: number/gender agreement, …  syntactic: syntactic parallelism, binding constraints  John helped himself to... vs. John helped him to…  discourse: discourse focus, salience, recency, …  semantic: semantic class agreement, …  world knowledge  Not all knowledge sources can be computed easily 17 Why It’s Hard No single source is a completely reliable indicator  number and gender  assassination (of Jesuit priests) =? these murders  the woman = she = Mary =? the chairman 18 Why It’s Hard Coreference strategies differ depending on the mention type  definiteness of mentions  … Then Mark saw the man walking down the street.  … Then Mark saw a man walking down the street.  pronoun resolution alone is notoriously difficult  There are pronouns whose resolution requires world knowledge  The Winograd Schema Challenge (Levesque, 2011)  pleonastic pronouns refer to nothing in the text I went outside and it was snowing. 
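To make the knowledge sources listed a few slides back concrete, here is a hedged sketch (not part of the tutorial) of how a few of those indicators, string match after stripping determiners and number/gender agreement, might be computed for a mention pair. The tiny lexicons and function names are illustrative assumptions only, and, as the slides stress, none of these cues is reliable on its own.

```python
# Illustrative sketch of a few pairwise indicators of the kind discussed above.
# The tiny lexicons are placeholders; real systems derive gender and number
# from parsers, name lists, and large corpora.
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
FEMALE = {"she", "her", "queen", "woman", "mary"}
MALE = {"he", "him", "his", "king", "man", "john"}

def strip_determiners(mention):
    tokens = mention.lower().split()
    return " ".join(t for t in tokens if t not in DETERMINERS)

def gender(mention):
    head = mention.lower().split()[-1]     # crude head = last token
    if head in FEMALE:
        return "female"
    if head in MALE:
        return "male"
    return "unknown"

def number(mention):
    head = mention.lower().split()[-1]
    # very crude heuristic; words like "his" or "bus" would be misjudged
    return "plural" if head.endswith("s") else "singular"

def pair_features(m_i, m_j):
    return {
        "exact_string_match": strip_determiners(m_i) == strip_determiners(m_j),
        "gender_agreement": gender(m_i) == gender(m_j)
                            or "unknown" in (gender(m_i), gender(m_j)),
        "number_agreement": number(m_i) == number(m_j),
    }

# "the woman" vs. "the chairman": string match fails, gender is inconclusive,
# number agrees -- exactly the kind of partial, unreliable evidence above.
print(pair_features("the woman", "the chairman"))
# {'exact_string_match': False, 'gender_agreement': True, 'number_agreement': True}
```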
19 Why it's hard
- Anaphoricity determination is a difficult task
  - determine whether a mention has an antecedent
  - check whether it is part of a coreference chain but is not the head of the chain
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment...
- Resolving a non-anaphoric mention causes a coreference system's precision to drop.
- With perfect anaphoricity determination, coreference systems can improve by 5% absolute in F-score (Stoyanov et al., 2009)

26 Not all coreference relations are equally difficult to resolve
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment...

33 Let's modify the paragraph
Queen Mother asked Queen Elizabeth to transform her sister, Margaret, into an elegant lady. Logue was summoned to help the princess improve her manner...
- Not all coreference relations are equally difficult to identify
- A system will be more confident in predicting some and less confident in predicting others
  - use the more confident ones to help predict the remaining ones? (see the sketch that follows)
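That "use the more confident ones first" idea is the easy-first strategy revisited later in the tutorial. The following is a hedged sketch of the general strategy, not any specific system described here: score every candidate pair, then commit to links in decreasing order of confidence, so that confident (easy) decisions shape the clusters within which the harder ones are made. `score_pair` stands in for whatever pairwise model is assumed to be available.

```python
# Illustrative easy-first linking: commit to the most confident pairwise
# decisions first; less confident pairs are then decided in a context where
# the easy links have already merged clusters (here, via transitivity only).
def easy_first_link(mentions, score_pair, threshold=0.5):
    cluster_of = {m: {m} for m in mentions}
    candidates = [(score_pair(a, b), a, b)
                  for i, a in enumerate(mentions) for b in mentions[i + 1:]]
    for score, a, b in sorted(candidates, reverse=True):
        if score < threshold:
            break                          # remaining pairs are too uncertain
        if cluster_of[a] is not cluster_of[b]:
            merged = cluster_of[a] | cluster_of[b]
            for m in merged:               # transitive closure by cluster merge
                cluster_of[m] = merged
    seen, clusters = set(), []
    for c in cluster_of.values():          # deduplicate clusters by identity
        if id(c) not in seen:
            seen.add(id(c))
            clusters.append(sorted(c))
    return clusters
```

More sophisticated variants let the already-built partial clusters feed back into the scoring of the remaining pairs, which is the point made on the slide above.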
Let’s modify the paragraph 44 Plan for the Talk  Part I: Background  Task definition  Why coreference is hard  Applications  Brief history  Part II: Machine learning for coreference resolution  System architecture  Computational models  Resources and evaluation (corpora, evaluation metrics, …)  Employing semantics and world knowledge  Part III: Solving hard coreference problems  Difficult cases of overt pronoun resolution  Relation to the Winograd Schema Challenge 45 Applications of Coreference  Question answering  Information extraction  Machine translation  Text summarization  Information retrieval  ... 46 text collection Question Answering System natural language question answer Application: Question Answering 47 Mozart was one of the first classical composers. He was born in Salzburg, Austria, in 27 January 1756. He wrote music of many different genres... Haydn was a contemporary and friend of Mozart. He was born in Rohrau, Austria, in 31 March 1732. He wrote 104 symphonies... Where was Mozart born? Coreference for Question Answering 48 Mozart was one of the first classical composers. He was born in Salzburg, Austria, in 27 January 1756. He wrote music of many different genres... Haydn was a contemporary and friend of Mozart. He was born in Rohrau, Austria, in 31 March 1732. He wrote 104 symphonies... Where was Mozart born? Coreference for Question Answering 49 Haydn was a contemporary and friend of Mozart. He was born in Rohrau, Austria, in 31 March 1732. He wrote 104 symphonies... Where was Mozart born? Coreference for Question Answering Mozart was one of the first classical composers. He was born in Salzburg, Austria, in 27 January 1756. He wrote music of many different genres... 50 Haydn was a contemporary and friend of Mozart. He was born in Rohrau, Austria, in 31 March 1732. He wrote 104 symphonies... Mozart was one of the first classical composers. He was born in Salzburg, Austria, in 27 January 1756. He wrote music of many different genres... Where was Mozart born? Coreference for Question Answering 51 Disaster Type: • location: • date: • magnitude: • magnitude-confidence: • damage: • human-effect: • victim: • number: • outcome: • physical-effect: • object: • outcome: AFGANISTAN MAY BE PREPARING FOR ANOTHER TEST Thousands of people are feared dead following... (voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale. (on camera) Details now hard to come by, but reports say entire villages were buried by the earthquake. Application: Information Extraction 52 Disaster Type: earthquake • location: Afghanistan • date: today • magnitude: 6.9 • magnitude-confidence: high • damage: • human-effect: • victim: Thousands of people • number: Thousands • outcome: dead • physical-effect: • object: entire villages • outcome: damaged AFGANISTAN MAY BE PREPARING FOR ANOTHER TEST Thousands of people are feared dead following... (voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale. (on camera) Details now hard to come by, but reports say entire villages were buried by the earthquake. Application: Information Extraction 53 Disaster Type: earthquake • location: Afghanistan • date: today • magnitude: 6.9 • magnitude-confidence: high • damage: • human-effect: • victim: Thousands of people • number: Thousands • outcome: dead • physical-effect: • object: entire villages • outcome: damaged AFGANISTAN MAY BE PREPARING FOR ANOTHER TEST Thousands of people are feared dead following... 
(voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale. (on camera) Details now hard to come by, but reports say entire villages were buried by the earthquake. Coreference for Information Extraction 54 Disaster Type: earthquake • location: Afghanistan • date: today • magnitude: 6.9 • magnitude-confidence: high • damage: • human-effect: • victim: Thousands of people • number: Thousands • outcome: dead • physical-effect: • object: entire villages • outcome: damaged AFGANISTAN MAY BE PREPARING FOR ANOTHER TEST Thousands of people are feared dead following... (voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale. (on camera) Details now hard to come by, but reports say entire villages were buried by the earthquake. The last major earthquake in Afghanistan took place in August 2001. Coreference for Information Extraction 55 Application: Machine Translation  Chinese to English machine translation 俄罗斯作为米洛舍夫维奇一贯的支持者,曾经提出 调停这场政治危机。 Russia is a consistent supporter of Milosevic, has proposed to mediate the political crisis. 56 Application: Machine Translation  Chinese to English machine translation 俄罗斯作为米洛舍夫维奇一贯的支持者,曾经提出 调停这场政治危机。 Russia is a consistent supporter of Milosevic, ??? has proposed to mediate the political crisis. 57 Application: Machine Translation  Chinese to English machine translation 俄罗斯作为米洛舍夫维奇一贯的支持者,曾经提出 调停这场政治危机。 Russia is a consistent supporter of Milosevic, ??? has proposed to mediate the political crisis. 58 Plan for the Talk  Part I: Background  Task definition  Why coreference is hard  Applications  Brief history  Part II: Machine learning for coreference resolution  System architecture  Computational models  Resources and evaluation (corpora, evaluation metrics, …)  Employing semantics and world knowledge  Part III: Solving hard coreference problems  Difficult cases of overt pronoun resolution  Relation to the Winograd Schema Challenge 59 Rule-Based Approaches  Popular in the 1970s and 1980s  A popular PhD thesis topic  Charniak (1972): Children’s story comprehension  “In order to do pronoun resolution, one had to be able to do everything else.”  Focus on sophisticated knowledge & inference mechanisms  Syntax-based approaches (Hobbs, 1976)  Discourse-based approaches / Centering algorithms  Kantor (1977), Grosz (1977), Webber (1978), Sidner (1979)  Centering algorithms and alternatives  Brennan et al. (1987), … 60 Rule-Based Approaches  Knowledge-poor approaches (Mitkov, 1990s) 61 Evaluation of Rule-Based Approaches  Small-scale evaluation  a few hundred sentences  sometimes by hand  algorithm not always implemented/implementable 62 MUC Coreference  The MUC conferences  Goal: evaluate information extraction systems  Coreference as a supporting task for information extraction  First recognized in MUC-6 (1995)  First large-scale evaluation of coreference systems. Need  Scoring program  MUC scoring program (Vilain et al., 1995)  Guidelines for coreference-annotating a corpus  Original task definition very ambitious  Final task definition focuses solely on identity coreference 63 Other Types of Coreference  Non-identity coreference: bridging  Part-whole relations  He passed by Jan’s house and saw that the door was painted red.  Set-subset relations  Difficult cases  Verb phrase ellipsis  John enjoys watching movies, but Mary doesn’t.  Reference to abstract entities  Each fall, penguins migrate to Fuji.  It happens just before the eggs hatch. 
64 Other Types of Coreference  Non-identity coreference: bridging  Part-whole relations  He passed by Jan’s house and saw that the door was painted red.  Set-subset relations  Difficult cases  Verb phrase ellipsis  John enjoys watching movies, but Mary doesn’t.  Reference to abstract entities  Each fall, penguins migrate to Fuji.  That’s why I’m going there next month. 65 MUC Coreference  MUC-6 coreference task  MUC-6 corpus: 30 training documents, 30 test documents  Despite the fact that 30 training documents are available, all but one resolver are rule-based  UMASS (Lenhert & McCarthy, 1995) resolver is learning-based  Best-performing resolver, FASTUS, is rule-based 66 MUC Coreference  MUC-6 coreference task  MUC-6 corpus: 30 training documents, 30 test documents  Despite the fact that 30 training documents are available, all but one resolver are rule-based  UMASS (Lenhert & McCarthy, 1995) resolver is learning-based  Best-performing resolver, FASTUS, is rule-based  MUC-7 coreference task  MUC-7 corpus: 30 training documents, 20 test documents  none of the 7 participating resolvers used machine learning 67 MUC Coreference  MUC-6 coreference task  MUC-6 corpus: 30 training documents, 30 test documents  Despite the fact that 30 training documents are available, all but one resolver are rule-based  UMASS (Lenhert & McCarthy, 1995) resolver is learning-based  Best-performing resolver, FASTUS, is rule-based  MUC-7 coreference task  MUC-7 corpus: 30 training documents, 20 test documents  none of the 7 participating resolvers used machine learning  Learning-based approaches were not the mainstream  But … interest grew when Soon et al. (2001) showed that a learning-based resolver can offer competitive performance 68 Plan for the Talk  Part I: Background  Task definition  Why coreference is hard  Applications  Brief history  Part II: Machine learning for coreference resolution  System architecture  Computational models  Employing semantics and world knowledge  Resources and evaluation (corpora, evaluation metrics, …)  Part III: Solving hard coreference problems  Difficult cases of overt pronoun resolution  Relation to the Winograd Schema Challenge 69 System Architecture: Training Preprocessing Mention Detection Model Training Training texts Coreference Model 70 System Architecture: Training Preprocessing Mention Detection Model Training Training texts Coreference Model  Tokenization  Sentence splitting  part-of-speech tagging 71 System Architecture: Training Preprocessing Mention Detection Model Training Training texts Coreference Model  Tokenization  Sentence splitting  part-of-speech tagging  Trivial: extract the hand-annotated (i.e., gold) mentions from training texts 72 System Architecture: Training Preprocessing Mention Detection Model Training Training texts Coreference Model  Tokenization  Sentence splitting  part-of-speech tagging  Trivial: extract the hand-annotated (i.e., gold) mentions from training texts  Feature extraction  Model design and learning 73 System Architecture: Testing Preprocessing Mention Detection Coreference Model Test text Coreference partition 74 System Architecture: Testing Preprocessing Mention Detection Coreference Model Test text Coreference partition  Tokenization  Sentence splitting  part-of-speech tagging 75 System Architecture: Testing Preprocessing Mention Detection Coreference Model Test text Coreference partition  Tokenization  Sentence splitting  part-of-speech tagging  Not-so-trivial: extract the mentions (pronouns, names, nominals, nested NPs)  Some 
researchers reported results on gold mentions, not system mentions  Substantially simplified the coref task  F-scores of 80s rather than 60s 76 System Architecture: Testing Preprocessing Mention Detection Coreference Model Test text Coreference partition  Tokenization  Sentence splitting  part-of-speech tagging  Not-so-trivial: extract the mentions (pronouns, names, nominals, nested NPs)  Some researchers reported results on gold mentions, not system mentions  Substantially simplified the coref task  F-scores of 80s rather than 60s 77 Plan for the Talk  Part I: Background  Task definition  Why coreference is hard  Applications  Brief history  Part II: Machine learning for coreference resolution  System architecture  Computational models  Employing semantics and world knowledge  Resources and evaluation (corpora, evaluation metrics, …)  Part III: Solving hard coreference problems  Difficult cases of overt pronoun resolution  Relation to the Winograd Schema Challenge 78 System Architecture: Training Preprocessing Mention Detection Model Training Training texts Coreference Model 79 The Mention-Pair Model  a classifier that, given a description of two mentions, mi and mj, determines whether they are coreferent or not  coreference as a pairwise classification task 80  Training instance creation  create one training instance for each pair of mentions from texts annotated with coreference information [Mary] said [John] hated [her] because [she] … How to train a mention-pair model? 81  Training instance creation  create one training instance for each pair of mentions from texts annotated with coreference information [Mary] said [John] hated [her] because [she] … How to train a mention-pair model? negative 82  Training instance creation  create one training instance for each pair of mentions from texts annotated with coreference information [Mary] said [John] hated [her] because [she] … How to train a mention-pair model? negative negative 83  Training instance creation  create one training instance for each pair of mentions from texts annotated with coreference information [Mary] said [John] hated [her] because [she] … How to train a mention-pair model? negative negative positive 84  Training instance creation  create one training instance for each pair of mentions from texts annotated with coreference information [Mary] said [John] hated [her] because [she] … negative How to train a mention-pair model? negative negative positive positive positive 85  Training instance creation  create one training instance for each pair of mentions from texts annotated with coreference information  Problem: all mention pairs produce a large and skewed data set  Soon et al.’s (2001) heuristic instance creation method [Mary] said [John] hated [her] because [she] … negative How to train a mention-pair model? negative negative positive positive positive 86 How to train a mention-pair model?  Soon et al.’s feature vector: describes two mentions  Exact string match  are mi and mj the same string after determiners are removed?  
Grammatical
  - gender and number agreement, Pronoun_i?, Pronoun_j?, ...
- Semantic
  - semantic class agreement
- Positional
  - distance between the two mentions
- Learning algorithm
  - C5 decision tree learner

87 Decision Tree Learned for MUC-6 Data
[Figure: the decision tree learned from the MUC-6 data, branching on STR-MATCH, ALIAS, APPOSITIVE, GENDER, NUMBER, J-PRONOUN, I-PRONOUN and DISTANCE]

88 Applying the mention-pair model
- After training, we can apply the model to a test text
- Classify each pair of mentions as coreferent or not coreferent
- Problem: the resulting classifications may violate transitivity!
[Jing] likes [him] but [she] ...
  (Jing, him): positive; (him, she): negative; (Jing, she): positive

89 How to resolve the conflicts?
- Given a test text,
  - process the mentions in a left-to-right manner
  - for each mj,
    - select as its antecedent the closest preceding mention that is classified as coreferent with mj
    - otherwise, no antecedent is found for mj
[Jing] likes [him] but [she] ...
  - Jing: no antecedent
  - him: antecedent is Jing
  - she: antecedent is Jing
- In the end, all three mentions will be in the same cluster
- Single-link clustering

99 MUC Scoring Metric
- Link-based metric
- Key: {A, B, C}, {D}    Response: {A, B}, {C, D}
  - Two links are needed to create the key clusters
  - Response recovered one of them, so recall is ½
  - Out of the two links in the response clusters, one is correct, so precision is ½
  - F-measure combines the two
- Key: {A, B, C}, {D}, {E}    Response: {A, B}, {C, D}, {E}
  - (singleton clusters contribute no links, so recall and precision are unchanged)

101 Soon et al. Results
           R     P     F
  MUC-6   58.6  67.3  62.6
  MUC-7   56.1  65.5  60.4

102 Anaphoricity Determination
- Determines whether a mention has an antecedent

103 Anaphoricity Determination
- In single-link clustering, when selecting an antecedent for mj,
  - select the closest preceding mention that is coreferent with mj
  - if no such mention exists, mj is classified as non-anaphoric
- Why not explicitly determine whether mj is anaphoric prior to coreference resolution (anaphoricity determination)?
- If mj is not anaphoric, we shouldn't bother to resolve it
  - pipeline architecture → filter non-anaphoric NPs prior to coreference resolution
  [Diagram: Anaphoricity Determination → Coreference Resolution]

106 Anaphoricity Determination (Ng & Cardie, COLING'02)
- train a classifier to determine whether a mention is anaphoric (i.e., whether a mention has an antecedent)
  - one training instance per mention
  - 37 features
  - class value is anaphoric or not anaphoric
- Result on MUC-6/7 (Ng & Cardie, 2002)
  - coreference F-measure drops
  - precision increases, recall drops abruptly
  - many anaphoric mentions are misclassified → error propagation

108 Some Questions (Circa 2003)
- Is there a model better than the mention-pair model?
- Can anaphoricity determination benefit coreference resolution?

110 Weaknesses of the Mention-Pair Model
- Limited expressiveness
  - information extracted from two mentions may not be sufficient for making an informed coreference decision
Mr.
Clinton Clinton she 112 Weaknesses of the Mention-Pair Model  Limited expressiveness  information extracted from two mentions may not be sufficient for making an informed coreference decision Mr. Clinton Clinton she 113 Weaknesses of the Mention-Pair Model  Limited expressiveness  information extracted from two mentions may not be sufficient for making an informed coreference decision Mr. Clinton Clinton she 114 Weaknesses of the Mention-Pair Model  Limited expressiveness  information extracted from two mentions may not be sufficient for making an informed coreference decision Mr. Clinton Clinton she ? 115 Weaknesses of the Mention-Pair Model  Limited expressiveness  information extracted from two mentions may not be sufficient for making an informed coreference decision  Can’t determine which candidate antecedent is the best  only determines how good a candidate is relative to the mention to be resolved, not how good it is relative to the others 116 Weaknesses of the Mention-Pair Model  Limited expressiveness  information extracted from two mentions may not be sufficient for making an informed coreference decision  Can’t determine which candidate antecedent is the best  only determines how good a candidate is relative to the mention to be resolved, not how good it is relative to the others Bush Clinton she ? ? 117  Limited expressiveness  information extracted from two mentions may not be sufficient for making an informed coreference decision  Can’t determine which candidate antecedent is the best  only determines how good a candidate is relative to the mention to be resolved, not how good it is relative to the others Weaknesses of the Mention-Pair Model 118 Improving Model Expressiveness  Want a coreference model that can tell us whether “she” and a preceding cluster of “she” are coreferent Mr. Clinton Clinton she? ? 119 Improving Model Expressiveness  Want a coreference model that can tell us whether “she” and a preceding cluster of “she” are coreferent Mr. Clinton Clinton she ? 120 The Entity-Mention Model  a classifier that determines whether (or how likely) a mention belongs to a preceding coreference cluster  more expressive than the mention-pair model  an instance is composed of a mention and a preceding cluster  can employ cluster-level features defined over any subset of mentions in a preceding cluster Pasula et al. (2003), Luo et al. (2004), Yang et al. (2004, 2008), Daume & Marcu (2005), Culotta et al. (2007), … 121 The Entity-Mention Model  a classifier that determines whether (or how likely) a mention belongs to a preceding coreference cluster  more expressive than the mention-pair model  an instance is composed of a mention and a preceding cluster  can employ cluster-level features defined over any subset of mentions in a preceding cluster  is a mention gender-compatible with all mentions in a preceding cluster?  is a mention gender-compatible with most of the mentions in it?  is a mention gender-compatible with none of them? Pasula et al. (2003), Luo et al. (2004), Yang et al. (2004, 2008), Daume & Marcu (2005), Culotta et al. (2007), … 122  Limited expressiveness  information extracted from two mentions may not be sufficient for making an informed coreference decision  Can’t determine which candidate antecedent is the best  only determine how good a candidate is relative to the mention to be resolved, not how good it is relative to the others Weaknesses of the Mention-Pair Model 123 How to address this problem?  
Idea: train a model that imposes a ranking on the candidate antecedents for a mention to be resolved  so that it assigns the highest rank to the correct antecedent 124 How to address this problem?  Idea: train a model that imposes a ranking on the candidate antecedents for a mention to be resolved  so that it assigns the highest rank to the correct antecedent  A ranker allows all candidate antecedents to be compared  allows us find the best candidate antecedent for a mention 125 How to address this problem?  Idea: train a model that imposes a ranking on the candidate antecedents for a mention to be resolved  so that it assigns the highest rank to the correct antecedent  A ranker allows all candidate antecedents to be compared  allows us find the best candidate antecedent for a mention  There is a natural resolution strategy for a mention-ranking model  A mention is resolved to the highest-ranked candidate antecedent Yang et al. (2003), Iida et al., (2003), Denis & Baldridge (2007, 2008), … 126 Caveat  Since a mention ranker only imposes a ranking on the candidates, it cannot determine whether a mention is anaphoric  Need to train a classifier to perform anaphoricity determination 127 Recap Cannot determine best candidate Limited expressiveness Mention Ranking Entity MentionProblem 128 Recap Cannot determine best candidate Limited expressiveness Mention Ranking Entity MentionProblem Can we combine the strengths of these two model? 129 Consider preceding clusters, not candidate antecedents Rank candidate antecedents Mention-ranking model Entity-mention model 130 Consider preceding clusters, not candidate antecedents Rank candidate antecedents Mention-ranking model Entity-mention model Rank preceding clusters 131 The Cluster-Ranking Model Consider preceding clusters, not candidate antecedents Rank candidate antecedents Mention-ranking model Entity-mention model Rank preceding clusters 132 The Cluster-Ranking Model  Training  train a ranker to rank preceding clusters  Testing  resolve each mention to the highest-ranked preceding cluster Rahman & Ng EMNLP’09 133 The Cluster-Ranking Model  Training  train a ranker to rank preceding clusters  Testing  resolve each mention to the highest-ranked preceding cluster After many years of hard work … finally came up with cluster rankers, which are conceptually similar to Lappin & Leass’ (1994) pronoun resolver --- Bonnie Webber (2010) Rahman & Ng EMNLP’09 134  As a ranker, the cluster-ranking model cannot determine whether a mention is anaphoric  Before resolving a mention, we still need to use an anaphoricity classifier to determine if it is anaphoric  yields a pipeline architecture  Potential problem  errors made by the anaphoricity classifier will be propagated to the coreference resolver The Cluster-Ranking Model Rahman & Ng EMNLP’09 135 Potential Solution  Jointly learn anaphoricity and coreference 136 How to jointly learn anaphoricity and coreference resolution? 137 How to jointly learn anaphoricity and coreference resolution?  Currently, the cluster-ranking model is trained to rank preceding clusters for a given mention, mj  In joint modeling, the cluster-ranking model is trained to rank preceding clusters + null cluster for a given mention, mj  want to train the model such that the null cluster has the highest rank if mj is non-anaphoric  Joint training allows the model to simultaneously learn whether to resolve an mention, and if so, which preceding cluster is the best 138 How to apply the joint model?  
During testing, resolve mj to the highest-ranked cluster  if highest ranked cluster is null cluster, mj is non-anaphoric  Same idea can be applied to mention-ranking models 139 Experimental Setup  The English portion of the ACE 2005 training corpus  599 documents coref-annotated on the ACE entity types  80% for training, 20% for testing  Mentions extracted automatically using a mention detector  Scoring programs: recall, precision, F-measure  B3 (Bagga & Baldwin, 1998)  CEAF (Luo, 2005) 140 B3 Scoring Metric  Mention-based metric  Computes per-mention recall and precision  Aggregates per-mention scores into overall scores  Key: {A, B, C}, {D}  Response: {A, B}, {C, D}  To compute the recall and precision for A:  A’s key cluster and response cluster have 2 overlapping mentions  2 of the 3 mentions in key cluster is recovered, so recall = 2/3  2 mentions in response cluster, so precision = 2/2 141 B3 Scoring Metric  Mention-based metric  Computes per-mention recall and precision  Aggregates per-mention scores into overall scores  Key: {A, B, C}, {D}, {E}  Response: {A, B}, {C, D}, {E}  To compute the recall and precision for A:  A’s key cluster and response cluster have 2 overlapping mentions  2 of the 3 mentions in key cluster is recovered, so recall = 2/3  2 mentions in response cluster, so precision = 2/2 142 CEAF Scoring Metric  Entity/Cluster-based metric  Computes the best bipartite matching between the set of key clusters and the set of response clusters  Key: {A, B, C}, {D, E}  Response: {A, B}, {C, D}, {E} 143 CEAF Scoring Metric  Entity/Cluster-based metric  Computes the best bipartite matching between the set of key clusters and the set of response clusters  Key: {A, B, C}, {D, E}  Response: {A, B}, {C, D}, {E}  Recall: (2+1)/(3+2)  Precision: (2+1)/3 144 CEAF Scoring Metric  Entity/Cluster-based metric  Computes the best bipartite matching between the set of key clusters and the set of response clusters  Key: {A, B, C}, {D, E}  Response: {A, B}, {C, D}, {E}  Recall: (2+1)/(3+2)  Precision: (2+1)/3 145 Experimental Setup  Three baseline coreference models  mention-pair, entity-mention, mention-ranking models 146 B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5 Results (Mention-Pair Baseline) 147 B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5 Results (Entity-Mention Baseline) 148 B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5 Results (Pipeline Mention-Ranking)  Apply an anaphoricity classifier to filter non-anaphoric NPs 149 B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention 
Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5 Results (Joint Mention-Ranking) 150 B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5 Results (Pipeline Cluster Ranking)  Apply an anaphoricity classifier to filter non-anaphoric mentions 151 B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5 Results (Joint Cluster Ranking) 152  B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5  In comparison to the best baseline (joint mention-ranking),  significant improvements in F-score for both B3 and CEAF  due to simultaneous rise in recall and precision Results (Joint Cluster Ranking) 153 B3 CEAF R P F R P F Mention-Pair Baseline 50.8 57.9 54.1 56.1 51.0 53.4 Entity-Mention Baseline 51.2 57.8 54.3 56.3 50.2 53.1 Mention-Ranking Baseline (Pipeline) 52.3 61.8 56.6 51.6 56.7 54.1 Mention-Ranking Baseline (Joint) 50.4 65.5 56.9 53.0 58.5 55.6 Cluster-Ranking Model (Pipeline) 55.3 63.7 59.2 54.1 59.3 56.6 Cluster-Ranking Model (Joint) 54.4 70.5 61.4 56.7 62.6 59.5 Results (Joint Cluster Ranking)  Joint modeling is better than pipeline modeling 154 The CoNLL Shared Tasks  Much recent work on entity coreference resolution was stimulated in part by the availability of the OntoNotes corpus and its use in two coreference shared tasks  CoNLL-2011 and CoNLL-2012  OntoNotes coreference: unrestricted coreference 155  Multi-pass sieve approach (Lee et al., 2011)  Winner of the CoNLL-2011 shared task  English coreference resolution  Latent tree-based approach (Fernandes et al.,2012)  Winner of the CoNLL-2012 shared task  Multilingual coreference resolution (English, Chinese, Arabic) Two Top Shared Task Systems 156  Multi-pass sieve approach (Lee et al., 2011)  Winner of the CoNLL-2011 shared task  English coreference resolution  Latent tree-based approach (Fernandes et al.,2012)  Winner of the CoNLL-2012 shared task  Multilingual coreference resolution (English, Chinese, Arabic) Two Recent Approaches 157 Stanford’s Sieve-Based Approach  Rule-based resolver  Each rule enables coreference links to be established  Rules are partitioned into 12 components (or sieves) arranged as a pipeline 158 The 12 Sieves  Discourse Processing  Exact String Match  Relaxed String Match  Precise Constructs  Strict Head Matching A,B,C  Proper Head Word Match  Alias Sieve  Relaxed Head Matching  Lexical Chain  Pronouns 159 The 12 Sieves  
- Each sieve is composed of a set of rules for establishing coreference links (Strict Head Matching A, B, C counts as three sieves, giving 12 in total)
  - Exact String Match: two mentions are coreferent if they have the same string
  - Relaxed String Match: two mentions are coreferent if the strings obtained by dropping the text after their head words are identical
  - Precise Constructs: two mentions are coreferent if they are in an appositive construction, if they are in a copular construction, if one is a relative pronoun that modifies the head of the antecedent NP, ...
  - Strict Head Matching, Proper Head Word Match, Alias, Relaxed Head Matching: sieves implementing different kinds of string matching
  - Lexical Chain: posits two mentions as coreferent if they are linked by a WordNet lexical chain
  - Pronouns: resolves a pronoun to a mention that agrees in number, gender, person and semantic class

166 A Few Notes on Sieves
- Each sieve is composed of a set of rules for establishing coreference links
- Sieves are ordered in decreasing order of precision
  - Rules in the Discourse Processing sieve have the highest precision
  - ... followed by those in Exact String Match (Sieve 2)
  - ... followed by those in Relaxed String Match (Sieve 3)
  - Those in the Pronouns sieve have the lowest precision

171 A Few Notes on Sieves
- Coreference clusters are constructed incrementally
  - Each sieve builds on the partial coreference clusters constructed by the preceding sieves
  - Enables the use of rules that link two clusters
  - Rules can employ features computed over one or both clusters
    - E.g., are there mentions in the 2 clusters that have the same head?
  - Resemble entity-mention models

172 Evaluation
- Corpus
  - Training: CoNLL-2011 shared task training corpus
  - Test: CoNLL-2011 shared task test corpus
- Scoring programs: recall, precision, F-measure
  - MUC (Vilain et al., 1995)
  - B3 (Bagga & Baldwin, 1998)
  - CEAFe (Luo, 2005)
  - CoNLL (unweighted average of MUC, B3 and CEAFe F-scores)

173 Results (Closed Track)
- Caveat
  - Mention detection performance could have played a role
  - Best system's mention detection results: R/75, P/67, F/71
  - 2nd best system's mention detection results: R/92, P/28, F/43

  System                         MUC    B3    CEAFe  CoNLL
  Rank 1: Multi-Pass Sieves      59.6   68.3  45.5   57.8
  Rank 2: Label Propagation      59.6   67.1  41.3   56.0

174 Lessons
- Easy-first coreference resolution
  - Exploit easy relations to discover hard relations
- Results seem to suggest that humans are better at combining features than machine learners
  - Better feature induction methods for combining primitive features into more powerful features?

175 Another Sieve-Based Approach
- Ratinov & Roth (EMNLP 2012)
- Learning-based
  - Each sieve is a machine-learned classifier
- Later sieves can override earlier sieves' decisions
  - Can recover from errors as additional evidence is available

176 Ratinov & Roth's 9 Sieves (Easy First)
- Each sieve is a mention-pair model applicable to a subset of mention pairs
  1. Nested (e.g., {city of {Jerusalem}})
  2. Same Sentence, both Named Entities (NEs)
  3. Adjacent (mentions closest to each other in the dependency tree)
  4. Same Sentence, NE & Nominal (e.g., Barack Obama, president)
  5. Different Sentence, two NEs
  6. Same Sentence, No Pronouns
  7. Different Sentence, Closest Mentions (no intervening mentions)
  8. Same Sentence, All Pairs
  9. All Pairs

177 Information Propagation
- Encoded as features
- Decision-encoding features at sieve i
  - whether mj and mk are posited as coreferent by sieve 1, sieve 2, ..., sieve i-1
  - whether mj and mk are in the same coreference cluster after sieve 1, sieve 2, ..., sieve i-1
  - the results of various set operations applied to the cluster containing mj and the cluster containing mk
    - set identity, set containment, set overlap, ...

178 Two Recent Approaches
- Multi-pass sieve approach (Lee et al., 2011)
  - Winner of the CoNLL-2011 shared task; English coreference resolution
- Latent tree-based approach (Fernandes et al., 2012)
  - Winner of the CoNLL-2012 shared task; multilingual coreference resolution (English, Chinese, Arabic)

179 Training the Mention-Pair Model
- The mention-pair model determines whether two mentions are coreferent or not
- Each training example corresponds to two mentions
  - Class value indicates whether they are coreferent or not
- Soon et al.
train the model using a decision tree learner  But we can train it using other learners, such as the perceptron learning algorithm 180 The Perceptron Learning Algorithm  Parameterized by a weight vector w  Learns a linear function  Output of perceptron y = w • x weight vector feature vector (features computed based on two mentions) 181 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + xi // update the weights until convergence  An iterative algorithm  An error-driven algorithm: weight vector is updated whenever a mistake is made 182  Observation (McCallum & Wellner, 2004):  Since the goal is to output a coreference partition, why not learn to predict a partition directly?  They modified the perceptron algorithm in order to learn to predict a coreference partition  each training example corresponds to a document  Class value is the correct coreference partition of the mentions 183 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + xi // update the weights until convergence 184 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence predict the most probable partition yi’ xi = (document, correct partition yi)document w/ correct partition yi 185 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence  Each example xi corresponds to a partition. What features should be used to represent a partition? xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ 186 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence  Each example xi corresponds to a partition. What features should be used to represent a partition?  still pairwise features, but they are computed differently  Before: do the two mentions have compatible gender?  Now: how many coreferent pairs have compatible gender? xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ 187 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence  Each example xi corresponds to a partition. What features should be used to represent a partition?  still pairwise features, but they are computed differently  Before: do the two mentions have compatible gender?  Now: how many coreferent pairs have compatible gender? 
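As a concrete illustration of this partition-level feature vector, F(y) can be computed by summing each pairwise feature over the mention pairs that the partition y places in the same cluster (e.g., how many coreferent pairs have compatible gender). This is a minimal sketch under the assumption that pairwise features come back as a name-to-value dictionary; it is not code from the tutorial.

```python
# Illustrative sketch: turning pairwise features into partition-level
# features F(y) by summing over all mention pairs that y puts in one cluster.
from itertools import combinations
from collections import Counter

def partition_features(partition, pairwise_features):
    """partition: list of clusters (lists of mentions);
    pairwise_features(a, b): dict mapping feature name to a binary/real value."""
    totals = Counter()
    for cluster in partition:
        for a, b in combinations(cluster, 2):    # coreferent pairs only
            for name, value in pairwise_features(a, b).items():
                totals[name] += value            # e.g. counts gender-compatible pairs
    return totals

# The score of a candidate partition y under weight vector w is then w . F(y),
# which is what the perceptron update on the following slides compares.
```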
xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ each value in F(yi’) is the sum f the fe ure values over all mention pairs in yi’ 188 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence  How can we predict most prob. partition using the current w? Method 1:  For each possible partition p, compute w • F(p)  Select the p with the largest w • F(p) xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ 189 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence  How can we predict most prob. partition using the current w? Method 1:  For each possible partition p, compute w • F(p)  Select the p with the largest w • F(p) Computationally intractable xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ 190 The Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence  How can we predict most prob. partition using the current w? Method 2:  Approximate the optimal partition given the current w using correlation clustering xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ 191 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence  How can we predict most prob. partition using the current w? Method 2:  Approximate the optimal partition given the current w using correlation clustering xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ 192 Observation  Now we have an algorithm that learns to partition  Can we further improve? 193 Observation  Now we have an algorithm that learns to partition  Can we further improve?  Recall that to compute each pairwise feature, we sum the values of the pairwise feature over the coreferent pairs  E.g., number of coreferent pairs that have compatible gender 194 Observation  Now we have an algorithm that learns to partition  Can we further improve?  
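As a side note on Method 2 above, one simple stand-in for the correlation-clustering step is a greedy merge of the highest-scoring positive pairs; the sketch below reuses the hypothetical pair_features helper and is not necessarily the procedure used in the original work.

```python
# A greedy approximation to correlation clustering, used here only as a simple
# stand-in for the inference step in Method 2 (the original work's exact
# procedure may differ).  Pairs are visited from highest to lowest score, and a
# pair with positive score w . x(m_i, m_j) merges the two clusters (union-find).
from itertools import combinations
import numpy as np

def greedy_correlation_clustering(mentions, w, pair_features):
    parent = {m: m for m in mentions}
    def find(m):                                   # find the cluster representative
        while parent[m] != m:
            parent[m] = parent[parent[m]]          # path compression
            m = parent[m]
        return m
    scored = [(float(np.dot(w, pair_features(a, b))), a, b)
              for a, b in combinations(mentions, 2)]
    for score, a, b in sorted(scored, key=lambda t: t[0], reverse=True):
        if score > 0 and find(a) != find(b):
            parent[find(a)] = find(b)              # link the two clusters
    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())                 # the predicted partition
```

The partition this returns is what the perceptron update compares against the gold partition.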
Recall that to compute each pairwise feature, we sum the values of the pairwise feature over the coreferent pairs  E.g., number of coreferent pairs that have compatible gender  We are learning a partition from all coreferent pairs  But … learning from all coreferent pairs is hard  Some coreferent pairs are hard to learn from  And … we don’t need to learn from all coreferent pairs  We do not need all coreferent pairs to construct a partition 195 Observation  To construct a coreference partition, we need to construct each coreference cluster  To construct a coreference cluster with n mentions, we need only n-1 links Queen Elizabeth her a viable monarch a renowned speech therapist speech impediment husband King George VI the King his 196 Observation  To construct a coreference partition, we need to construct each coreference cluster  To construct a coreference cluster with n mentions, we need only n-1 links Queen Elizabeth her a viable monarch a renowned speech therapist speech impediment husband King George VI the King his 197 husband King George VI the King his Observations  There are many ways links can be chosen husband King George VI the King his husband King George VI the King his husband King George VI his the King … 198 Coreference Tree husband King George VI the King his Queen Elizabeth her a viable monarch a renowned speech therapist speech impediment Root 199 Coreference Tree  A coreference tree is an equivalent representation of a coreference partition  Latent tree-based approach (Fernandes et al., 2012)  Learn coreference trees rather than coreference partitions  But … many coreference trees can be created from one coreference partition … which one should we learn? 200 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence predict the partition yi’ xi = (document, correct partition yi)document w/ correct partition yi predict the most probable partition yi’ 201 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’ document w/ correct tr e yi 202 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’  How can we predict most prob. tree yi’ using the current w? document w/ correct tr e yi 203 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’  How can we predict most prob. tree yi’ using the current w? 
Method 1:  For each possible tree t, compute w • F(t)  Select the t with the largest w • F(t) document w/ correct tr e yi 204 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’  How can we predict most prob. tree yi’ using the current w? Method 1:  For each possible tree t, compute w • F(t)  Select the t with the largest w • F(t) document w/ correct tr e yi each value in F(yi’) is the sum of the feature values over all the edges in yi’ 205 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’  How can we predict most prob. tree yi’ using the current w? Method 2:  Run the Chu-Liu/Edmonds’ algorithm to find max spanning tree document w/ correct tr e yi 206  Each xi is labeled with correct tree yi. Since many correct trees can be created from a partition, which one should be used? The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’ document w/ correct tr e yi 207  Each xi is labeled with correct tree yi. Since many correct trees can be created from a partition, which one should be used?  Heuristically? The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’ document w/ correct tr e yi 208 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’  Each xi is labeled with correct tree yi. Since many correct trees can be created from a partition, which one should be used?  Select the correct yi with the largest w • F(yi) document w/ correct tr e yi 209 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w  w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’  Each xi is labeled with correct tree yi. Since many correct trees can be created from a partition, which one should be used?  
Select the correct yi with the largest w • F(yi) Select the correct tree that is best given the current model document w/ correct tree yi 210 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (1) predict the class of xi using the current w (2) if the predicted class is not equal to the correct class w ← w + F(yi) – F(yi’) // update the weights until convergence xi = (document, correct tree yi) predict the most probable tree yi’  Each xi is labeled with correct tree yi. Since many correct trees can be created from a partition, which one should be used?  Select the correct yi with the largest w • F(yi) Select the correct tree that is best given the current model document w/ correct tree yi A different correct tree will be selected in each iteration 211 The Structured Perceptron Learning Algorithm  Initialize w  Loop for each training example xi (0) select the correct tree yi using the current w (1) predict the most prob. tree yi’ of xi using the current w (2) if the predicted tree is not equal to the correct tree w ← w + F(yi) – F(yi’) // update the weights until convergence 212 The Latent Structured Perceptron Learning Algorithm (Joachims & Yu, 2009)  Initialize w  Loop for each training example xi (0) select the correct tree yi using the current w (1) predict the most prob. tree yi’ of xi using the current w (2) if the predicted tree is not equal to the correct tree w ← w + F(yi) – F(yi’) // update the weights until convergence 213 What’s left?  Recall that …  Humans are better at combining features than machine learners  Better feature induction methods for combining primitive features into more powerful features?  The latent tree-based model employs feature induction  Entropy-based feature induction  given the same training set used to train a mention-pair model, train a decision tree classifier 214 [Decision tree diagram over the features STR-MATCH, GENDER, NUMBER, ALIAS, J-PRONOUN, I-PRONOUN, APPOSITIVE, and DISTANCE] 215-223 [The same decision tree diagram, repeated over nine build slides]  Generate feature combinations using all paths of all possible lengths starting at the root node 224 Feature Induction  Use the resulting feature combinations, rather than the original features, to represent a training/test instance 225 Evaluation  Corpus  Training: CoNLL-2012 shared task training corpus  Test: CoNLL-2012 shared task test corpus  Scoring programs: recall, precision, F-measure  MUC (Vilain et al., 1995)  B3 (Bagga & Baldwin, 1998)  CEAFe (Luo, 2005)  CoNLL (unweighted average of MUC, B3 and CEAFe F-scores) 226 Results (Closed Track)  Usual caveat  Mention detection performance could have played a role

System                        English  Chinese  Arabic   Avg
Rank 1: Latent Trees            63.4     58.5     54.2   58.7
Rank 2: Mention-Pair System     61.2     60.0     53.6   58.3
Rank 3: Multi-Pass Sieves       59.7     62.2     47.1   56.4

227 Latent Tree-Based Approach: Main Ideas  Represent a coreference partition using a tree  Avoid learning from the hard coreference pairs  Allow gold tree to change in each perceptron learning iteration  Reduce number of candidate trees using Stanford sieves  Use feature induction to better combine available features  Which of them are effective? 228 Recent Models  Revival of mention-ranking models  Durrett & Klein (2013): “There is no polynomial-time dynamic program for inference in a model with arbitrary entity-level features, so systems that use such features make decisions in a pipelined manner and sticking with them, operating greedily in a left-to-right fashion or in a multi-pass, sieve-like manner”  Two recent mention-ranking models  Durrett & Klein (EMNLP 2013)  Wiseman et al. (ACL 2015) 229 Recent Models  Revival of mention-ranking models  Durrett & Klein, 2013: “There is no polynomial-time dynamic program for inference in a model with arbitrary entity-level features, so systems that use such features make decisions in a pipelined manner and sticking with them, operating greedily in a left-to-right fashion or in a multi-pass, sieve-like manner”  Two recent mention-ranking models  Durrett & Klein (EMNLP 2013)  Wiseman et al.
(ACL 2015) 230 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) 231 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) 232 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) 233 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) document context 234 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) document context ith mention 235 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) document context ith mention antecedent chosen for mention i 236 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) document context ith mentionsum over all mentions antecedent chosen for mention i 237 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Similar to mention-ranking model, except that we train it to jointly maximize the likelihood of selecting all antecedents document context ith mentionsum over all mentions antecedent chosen for mention i Vector of antecedents chosen for the document, one per mention 238 Durrett & Klein (EMNLP 2013)  Goal  maximize the likelihood of the antecedent vector  Problem  a mention may have more than one antecedent, so which one should we use for training?  Solution  sum over all antecedent structures licensed by gold clusters 239 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x) 240 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function 241 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function weight parameters 242 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function weight parameters sum over all training docs 243 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function weight parameters sum over all training docs 244 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function weight parameters sum over all training docs sum over all antecedent structures 245 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function weight parameters sum over all training docs sum over all antecedent structures L1 regularizer 246 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function weight parameters sum over all training docs sum over all antecedent structures L1 regularizer 247 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function 248 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function 249 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function 250 Durrett & Klein (EMNLP 2013)  Log linear model of the conditional distribution P(a|x)  Log likelihood function 251 The Loss Function 252 The Loss Function antecedent vector 253 The Loss Function antecedent vector gold clusters 254 The Loss Function antecedent vector gold clusters 255 The Loss Function antecedent vector gold clusters False Anaphoric 256 The Loss Function antecedent vector gold clusters False Anaphoric False New 257 The Loss Function antecedent vector gold clusters 
False Anaphoric False New Wrong Link 258 The Loss Function antecedent vector gold clusters False Anaphoric False New Wrong Link 259 Surface Features  Features computed on each of the two mentions  mention type (pronoun, name, nominal)  complete string, semantic head  first word, last word, preceding word, following word  length (in words)  Features computed based on both mentions  Exact string match, head match  Distance in number of sentences and number of mentions  Feature conjunctions  Attach to each feature the mention type (or the pronoun itself if it’s a pronoun) 260 Results on the CoNLL-2011 test set  Easy victories  Using surface features (rather than standard coreference features) allows their system to outperform the state of the art 261 Why?  D&K’s explanation  These standard features do capture the same phenomena as standard coreference features, just implicitly  Examples  Rather than using rules targeting person, number, or gender of mentions, they use conjunctions of pronoun identity  Rather than using a feature encoding definiteness, the first word of a mention would capture this  Rather than encoding grammatical role (subject/object), such information can be inferred from the surrounding words 262 Recent Models  Revival of mention-ranking models  Durrett & Klein, 2013: “There is no polynomial-time dynamic program for inference in a model with arbitrary entity-level features, so systems that use such features make decisions in a pipelined manner and sticking with them, operating greedily in a left-to-right fashion or in a multi-pass, sieve-like manner”  Two recent mention-ranking models  Durrett & Klein (EMNLP 2013)  Wiseman et al. (ACL 2015) 263 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models 264 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models 265 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models 266 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models anaphor 267 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models anaphor candidate antecedent 268 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models  Another way of expressing the scoring function 269 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models  Another way of expressing the scoring function non-null antecedent 270 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models  Another way of expressing the scoring function non-null antecedent Features on anaphor 271 Wiseman et al. 
(ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models  Another way of expressing the scoring function non-null antecedent Features on anaphor Features on both anaphor and candidate antecedent 272 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models  Another way of expressing the scoring function null antecedent 273 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models  Another way of expressing the scoring function null antecedent Features on anaphor 274 Wiseman et al. (ACL 2015)  Observation: recent mention-ranking models are all linear; why not train a non-linear mention-ranking model?  Recap  Scoring function for linear mention-ranking models  Another way of expressing the scoring function Raw/Unconjoined features 275 Raw/Unconjoined Features  Mention’s head, complete string, first word, last word, …  Researchers don’t seem to like them  they almost always use conjoined features  created by hand or obtained via feature induction  can add some non-linearity to the linear model  Why?  Wiseman et al. empirically showed that raw/unconjoined features are not predictive for the coreference task 276  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal 277  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal 278  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal 279  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal 280  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal 281  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal 282  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal Network’s parameters 283  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal Can learn feature combinations 284  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal 285  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal  Option 1: let g be the 
identity function  Neural net model same as linear model except that it’s defined over non-linear feature representations 286  Learn feature representations that are useful for the task  Scoring function for linear mention-ranking model  Scoring function for Wiseman’s neural net-based model Wiseman et al.’s Proposal  Option 2: g functions as an additional hidden layer 287 Wiseman et al.’s Proposal  Pros  Can learn (non-linear) feature representations from raw features  Don’t have to conjoin features by hand  Cons  Training a non-linear model is more difficult than training a linear model  Model performance sensitive to weight initializations 288 Training 1. Train two neural nets separately  One for anaphoricity and one for coreference 2. Use the weight parameters learned as initializations for the combined neural net  Objective function similar to Durrett & Klein’s, except that it is based on margin loss rather than log loss 289 Results on CoNLL 2012 test set 290 Wiseman et al. (NAACL 2016)  Improved their neural net model by incorporating entity-level information  CoNLL score increased by 0.8  State-of-the-art results on the English portion of the CoNLL 2012 test data  Software available from the Harvard NLP group page 291 Unsupervised Models  EM  Cherry & Bergsma (2005), Charniak & Elsner (2009)  Clustering  Cardie & Wagstaff (1999), Cai & Strube (2010)  Nonparametric models  Haghighi & Klein (2007, 2010)  Markov Logic networks  Poon & Domingos (2008)  Bootstrapping  Kobdani et al. (2011) 292 Plan for the Talk  Part I: Background  Task definition  Why coreference is hard  Applications  Brief history  Part II: Machine learning for coreference resolution  System architecture  Computational models  Resources and evaluation (corpora, evaluation metrics, …)  Employing semantics and world knowledge  Part III: Solving hard coreference problems  Difficult cases of overt pronoun resolution  Relation to the Winograd Schema Challenge 293 Semantics and World Knowledge  Coreference resolution is considered one of the most difficult tasks in NLP in part because of its reliance on sophisticated knowledge sources  The importance of semantics and world knowledge in coreference resolution has long been recognized  Hobbs (1976)  Syntactic approach (the naïve algorithm)  Semantic approach  Shift in research trends  Knowledge-rich approaches (1970s and 1980s)  Knowledge-lean approaches (1990s)  Knowledge-rich approaches (2000 onwards) 294 Semantics and World Knowledge  Have researchers been successful in employing semantics and world knowledge to improve learning-based coreference resolution systems?  Are these features useful in the presence of morpho-syntactic (knowledge-lean, robustly computed) features? 295 Soon et al. (2001)  One of the first learning-based coreference systems  Mention-pair model trained using C4.5  12 features  Mostly morpho-syntactic features  One semantic feature computed based on WordNet (1st sense)  U if one/both mentions have undefined WordNet semantic class  T if the two have the same WordNet semantic class  F otherwise  No separate evaluation of how useful the WordNet semantic feature is, but… 296 Decision Tree Learned for MUC-6 Data [tree diagram over the features STR-MATCH, GENDER, NUMBER, ALIAS, J-PRONOUN, I-PRONOUN, APPOSITIVE, and DISTANCE] 297 Decision Tree Learned for MUC-6 Data [same tree diagram] Semantic features not useful???
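To see how shallow this single WordNet feature is, here is a rough sketch of a Soon et al.-style U/T/F semantic-class test using NLTK's WordNet interface (first noun sense only); the coarse class inventory below is an illustrative assumption, not the exact list used in the original system, and the WordNet corpus must be installed.

```python
# A sketch (not Soon et al.'s code) of their U/T/F WordNet semantic-class
# feature, using the first noun sense of each head word.
# Requires: nltk.download('wordnet').  The class inventory is assumed.
from nltk.corpus import wordnet as wn

COARSE_CLASSES = {                        # assumed anchor synsets, most specific first
    "PERSON": wn.synset("person.n.01"),
    "ORGANIZATION": wn.synset("organization.n.01"),
    "LOCATION": wn.synset("location.n.01"),
    "DATE": wn.synset("time_period.n.01"),
    "OBJECT": wn.synset("object.n.01"),
}

def semantic_class(head_word):
    senses = wn.synsets(head_word, pos=wn.NOUN)
    if not senses:
        return None                       # undefined semantic class
    first = senses[0]                     # 1st sense, as in Soon et al.
    ancestors = set(first.closure(lambda s: s.hypernyms())) | {first}
    for label, anchor in COARSE_CLASSES.items():
        if anchor in ancestors:
            return label
    return None

def semclass_feature(head_i, head_j):
    ci, cj = semantic_class(head_i), semantic_class(head_j)
    if ci is None or cj is None:
        return "U"                        # one/both undefined
    return "T" if ci == cj else "F"
```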
298 Ng & Cardie (ACL 2002)  A large-scale expansion of the features used by Soon et al.  53 lexical, grammatical, and semantic features 299 Four WordNet-based Semantic Features  Whether mj is the closest mention preceding mention that has the same WordNet semantic class as mk  Whether the two mentions have an ancestor-descendent relationship in WordNet; if yes,  Encode the sense numbers in WordNet that give rise to this ancestor-descendent relationship  compute the distance between the two WordNet synsets 300 Evaluation  No separate evaluation of how useful the WordNet semantic feature is, but…  A hand-selected subset of the features was used to train a mention-pair model that yielded better performance  This subset did not include any of these 4 semantic features 301 Evaluation  No separate evaluation of how useful the WordNet semantic feature is, but…  A hand-selected subset of the features was used to train a mention-pair model that yielded better performance  The subset did not include any of these 4 semantic features Semantic features not useful??? 302 Kehler et al. (NAACL 2004)  Approximate world knowledge using predicate-argument statistics  forcing_industry is a more likely verb-object combination in naturally-occurring data than forcing_initiative or forcing_edge  Such predicate-argument (e.g., subject-verb, verb-object) statistics can be collected from a large corpus  TDT-2 corpus (1.32 million subj-verb; 1.17 million verb-obj) He worries that Trump’s initiative would push his industry over the edge, forcing it to shift operations elsewhere. 303 Kehler et al. (NAACL 2004)  Goal: examine whether predicate-argument statistics, when encoded as features, can improve a pronoun resolver trained on state-of-the-art morpho-syntactic features  Morpho-syntactic features  Gender agreement  number agreement  Distance between the two mentions  Grammatical role of candidate antecedent  NP form (def, indef, pronoun) of the candidate antecedent  Train a mention-pair model using maximum entropy  Evaluate on the ACE 2003 corpus 304 Results 305 Results Semantic features not useful??? 306 Error Analysis  MaxEnt selected the temerity  Predicate-argument statistics selected the endowment  Need a better way to exploit the statistics? After the endowment was publicly excoriated for having the temerity to award some of its money to art that addressed changing views of gender and race, … 307 Error Analysis  MaxEnt selected the supporters  So did the predicate-argument statistics The dancers were joined by about 70 supporters as they marched around a fountain not far from the mayor’s office. 308 Kehler et al.’s Observations  In the cases in which statistics reinforced a wrong answer, no manipulation of features can rescue the prediction  For the cases in which statistics could help, their successful use will depend on the existence of a formula that can capture these cases without changing the predictions for examples that the model currently classifiers correctly  Conclusion: predicate-argument statistics are a poor substitute for world knowledge, and more to the point, they do not offer much predictive power to a state-of-the-art morphosyntactically-driven pronoun resolution system 309 Yang et al. 
(ACL 2005)  Goal: examine whether predicate-argument statistics, when encoded as features, can improve a pronoun resolver trained on state-of-the-art morpho-syntactic features  But… with the following differences in the setup  Web as corpus (to mitigate data sparsity)  Pairwise ranking model  Baseline morpho-syntactic features defined on m_i and m_j  DefNP_i, Pro_i, NE_i, SameSent, NearestNP_i, Parallel_i, FirstNPinSent_i, Reflexive_j, NPType_j  Learner: C4.5  Corpus: MUC-6 and MUC-7 310 Results

                                      Neutral Pro.    Personal Pro.    Overall
                                      Corp    Web     Corp    Web      Corp    Web
Baseline                              73.9            91.9             81.9
Baseline + predicate-arg statistics   76.7    79.2    91.4    91.9     83.3    84.8

311 Results Semantic features are useful!!! (same results table as above) 312 Ponzetto & Strube (NAACL 2006)  Goal: improve learning-based coreference resolution by exploiting three knowledge sources  WordNet  Wikipedia  Semantic role labeling 313 Using WordNet  Motivation:  Soon et al. employed a feature that checks whether two mentions have the same WordNet semantic class  Noisy: had problems with coverage and sense proliferation  Solution: measure the similarity between the WordNet synsets of the two mentions using six similarity measures  3 path-length based measures  3 information-content based measures  Two features  Highest similarity score over all senses and all measures  Average similarity score over all senses and all measures 314 Using Wikipedia  can also resolve the celebrity using syntactic parallelism, but  heuristics are not always accurate  does not mimic the way humans look for antecedents  Use world knowledge extracted from Wikipedia Martha Stewart is hoping people don’t run out on her.
The celebrity indicted on charges stemming from … 315 Using Wikipedia  Given mentions mi and mj, retrieve the Wiki pages they refer to, Pi and Pj, with titles mi and mj (or their heads)  Create features for coreference resolution  Features based on first paragraph of Wiki page  Whether Pi’s first paragraph contains mj  Create an analogous feature by reversing the roles of mi & mj 316 Using Wikipedia  Given mentions mi and mj, retrieve the Wiki pages they refer to, Pi and Pj, with titles mi and mj (or their heads)  Create features for coreference resolution  Features based on first paragraph of Wiki page  Features based on the hyperlinks of Wiki page  Whether at least one hyperlink in Pi contains mj  Create an analogous feature by reversing the roles of mi & mj 317 Using Wikipedia  Given mentions mi and mj, retrieve the Wiki pages they refer to, Pi and Pj, with titles mi and mj (or their heads)  Create features for coreference resolution  Features based on first paragraph of Wiki page  Features based on the hyperlinks of Wiki page  Features based on the Wiki categories  Whether the categories Pi belongs to contains mj (or its head)  Create analogous feature by reversing the roles of mi & mj 318 Using Wikipedia  Given mentions mi and mj, retrieve the Wiki pages they refer to, Pi and Pj, with titles mi and mj (or their heads)  Create features for coreference resolution  Features based on first paragraph of Wiki page  Features based on the hyperlinks of Wiki page  Features based on the Wiki categories  Features based on overlap of first paragraphs  Overlap score between first paragraphs of the two Wiki pages 319 Using Wikipedia  Given mentions mi and mj, retrieve the Wiki pages they refer to, Pi and Pj, with titles mi and mj (or their heads)  Create features for coreference resolution  Features based on first paragraph of Wiki page  Features based on the hyperlinks of Wiki page  Features based on the Wiki categories  Features based on overlap of first paragraphs  Highest & average relatedness score of all category pairs formed from the categories associated with the two Wiki pages 320 Semantic Role Labeling (SRL)  Knowing that “program trading” is the PATIENT of the “decry” predicate and “it” being the PATIENT of “denounce” could trigger the (semantic parallelism based) inference Peter Anthony decries program trading as “limiting the game to a few,” but he is not sure whether he wants to denounce it because … 321 Semantic Role Labeling (SRL)  Use the ASSERT semantic parser to identify all verb predicates in a sentence and their semantic arguments  Each argument is labeled with its PropBank-style semantic role  ARG1, …, ARGn  Two SRL features  The role-predicate pair of mention mi  The role-predicate pair of mention mj 322 Experimental Setup  Baseline  mention-pair model trained with Soon et al.’s 12 features using MaxEnt  ACE 2003 (Broadcast News + Newswire)  Evaluation metric: MUC scorer 323 Results 324 Results Semantic features are useful!!! 
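As a rough illustration of the Wikipedia features just described, the sketch below computes the first-paragraph, hyperlink, and category tests for a mention pair; WikiPage, get_wiki_page, and the overlap measure are simplified, hypothetical stand-ins rather than Ponzetto & Strube's implementation, and page retrieval is left to the caller.

```python
# A rough sketch of the Wikipedia-derived mention-pair features described
# above (first paragraph, hyperlinks, categories).  `WikiPage` is a
# hypothetical container; retrieving and parsing the pages is abstracted away.
from dataclasses import dataclass, field

@dataclass
class WikiPage:
    title: str
    first_paragraph: str = ""
    hyperlinks: list = field(default_factory=list)   # anchor texts of outgoing links
    categories: list = field(default_factory=list)   # category names

def contains(text, mention):
    return mention.lower() in text.lower()

def overlap_score(para_i, para_j):
    """Word overlap between the two first paragraphs (a crude stand-in for the
    overlap measure used in the paper)."""
    wi, wj = set(para_i.lower().split()), set(para_j.lower().split())
    return len(wi & wj) / max(1, min(len(wi), len(wj)))

def wiki_features(page_i, page_j, mention_i, mention_j):
    return {
        "i_first_para_contains_j": contains(page_i.first_paragraph, mention_j),
        "j_first_para_contains_i": contains(page_j.first_paragraph, mention_i),
        "i_links_contain_j": any(contains(l, mention_j) for l in page_i.hyperlinks),
        "j_links_contain_i": any(contains(l, mention_i) for l in page_j.hyperlinks),
        "i_categories_contain_j": any(contains(c, mention_j) for c in page_i.categories),
        "j_categories_contain_i": any(contains(c, mention_i) for c in page_j.categories),
        "first_para_overlap": overlap_score(page_i.first_paragraph,
                                            page_j.first_paragraph),
    }
```

Values like these would then be appended to the baseline mention-pair feature set, roughly in the spirit of the evaluated system.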
325 Rahman & Ng (ACL 2011)  Goal: improve learning-based coreference resolution by exploiting two knowledge sources  YAGO  FrameNet 326 YAGO (Suchanek et al., 2007)  contains 5 million facts derived from Wikipedia and WordNet  each fact is a triple describing a relation between two NPs  , rel can be one of 90 YAGO relation types  focuses on two types of YAGO relations: TYPE and MEANS (Bryl et al., 2010, Uryupina et al., 2011)  TYPE: the IS-A relation   MEANS: addresses synonymy and ambiguity  ,  provide evidence that the two NPs involved are coreferent 327 Why YAGO?  combines the information in Wikipedia and WordNet  can resolve the celebrity to Martha Stewart  neither Wikipedia nor WordNet alone can  How to use YAGO to resolve? 1. Heuristically maps each Wiki category in the Wiki page for Martha Stewart to its semantically closest WordNet synset  AMERICAN TELEVISION PERSONALITIES? synset for sense #2 of personality 2. Realizes personality is a direct hyponym of celebrity in WordNet 3. Extracts the fact Martha Stewart is hoping people don’t run out on her. The celebrity indicted on charges stemming from … 328 Using YAGO for Coreference Resolution  create a new feature for mention-pair model whose value is 1 if the two NPs are in a TYPE or MEANS relation 0 otherwise 329 FrameNet (Baker et al., 1998)  A lexico-semantic resource focused on semantic frames  A frame contains  the lexical predicates that can invoke it  the frame elements (i.e., the semantic roles)  E.g., the JUDGMENT_COMMUNICATION frame describes situations in which a COMMUNICATOR communicates a judgment of an EVALUEE to an ADDRESSEE  frame elements: COMMUNICATOR, EVALUEE, ADDRESSEE  lexical predicates: acclaim, accuse, decry, denounce, slam, … 330 Motivating Example  To resolve it to Peter Anthony, it may be helpful to know  decry and decounce are “semantically related”  the two mentions have the same semantic role Peter Anthony decries program trading as “limiting the game to a few,” but he is not sure whether he wants to denounce it because … 331 Motivating Example  To resolve it to Peter Anthony, it may be helpful to know  decry and decounce are “semantically related”  the two mentions have the same semantic role Peter Anthony decries program trading as “limiting the game to a few,” but he is not sure whether he wants to denounce it because …  Missing from Ponzetto & Strube’s application of semantic roles  We model this using FrameNet 332 Observation  Features encoding  the semantic roles of the two NPs under consideration  whether the associated predicates are “semantically related” could be useful for identifying coreference relations. 333 Observation  Features encoding  the semantic roles of the two NPs under consideration  whether the associated predicates are “semantically related” could be useful for identifying coreference relations. Use ASSERT  Provides PropBank-style roles (Arg0, Arg1, …) 334 Observation  Features encoding  the semantic roles of the two NPs under consideration  whether the associated predicates are “semantically related” could be useful for identifying coreference relations. 
Use ASSERT  Provides PropBank-style roles (Arg0, Arg1, …) Use FrameNet  Checks whether the two predicates appear in the same frame 335 Results on ACE 2005 and OntoNotes  Baseline: mention-pair model trained with 39 features from Rahman & Ng (2009)

                                     ACE              OntoNotes
                                     B3      CEAF     B3      CEAF
Baseline                             62.4    60.0     53.3    51.5
Baseline+YAGO types                  63.1    62.8     54.6    52.8
Baseline+YAGO types & means          63.6    63.2     55.0    53.3
Baseline+YAGO types & means & FN     63.8    63.1     55.2    53.4

336 Difficulty in Exploiting World Knowledge  World knowledge extracted from YAGO is noisy  Numerous entries for a particular mention, all but one are irrelevant 337 Ratinov & Roth (EMNLP 2012)  Goal: Improve their learning-based multi-pass sieve approach using world knowledge extracted from Wikipedia  Extract for each mention knowledge attributes from Wiki  Find the Wiki page the mention refers to using their context-sensitive entity linking system, which could reduce noise  Extract from the retrieved page three attributes  (1) gender; (2) nationality; (3) fine-grained semantic categories  Create features from the extracted attributes  Whether the two mentions are mapped to the same Wiki page  Agreement w.r.t. gender, nationality, and semantic categories  Augment each sieve’s feature set with these new features 338 Results on ACE 2004 Newswire texts  System shows a minimum improvement of 3 (MUC), 2 (B3), and 1.25 (CEAF) F1 points on gold mentions  Not always considered an acceptable evaluation setting  Improvements on gold mentions do not necessarily imply improvements on automatically extracted mentions 339 Durrett & Klein (EMNLP 2013)  Using only surface features, their log linear model achieved state of the art results on the CoNLL-2011 test set  Can performance be further improved with semantics?  Derive semantic features from four sources  WordNet hypernymy and synonymy  Number and gender for names and nominals  Named entity types  Latent clusters computed from English Gigaword  Each element in a cluster is a nominal head together with the conjunction of its verbal governor and its semantic role 340 Results on CoNLL-2011 Test Set 341 Results on CoNLL-2011 Test Set Semantic features not useful???
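For reference, the first of the four semantic sources listed above, WordNet hypernymy and synonymy between head words, might look roughly like the following sketch (NLTK WordNet; illustrative only, not Durrett & Klein's extractor).

```python
# A sketch of a WordNet hypernymy/synonymy indicator between two head words,
# in the spirit of one of the semantic sources above (illustrative only).
# Requires: nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def wordnet_relation(head_i, head_j):
    """Return 'synonym', 'hypernym', or 'none' for the two head words."""
    syns_i = wn.synsets(head_i, pos=wn.NOUN)
    syns_j = wn.synsets(head_j, pos=wn.NOUN)
    if any(si == sj for si in syns_i for sj in syns_j):
        return "synonym"                          # the heads share a synset
    for si in syns_i:
        ancestors = set(si.closure(lambda s: s.hypernyms()))
        if any(sj in ancestors for sj in syns_j):
            return "hypernym"                     # head_j generalizes head_i
    for sj in syns_j:
        ancestors = set(sj.closure(lambda s: s.hypernyms()))
        if any(si in ancestors for si in syns_i):
            return "hypernym"                     # head_i generalizes head_j
    return "none"
```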
342 Results on CoNLL-2011 Test Set  D&K’s explanation  only a small fraction of the mention pairs are coreferent  A system needs very strong evidence to overcome the default hypothesis that a pair of mentions is not coreferent  The weak (semantic) indicators of coreference will likely have high false positive rates, doing more harm than good 343 Results on CoNLL-2011 Test Set  D&K’s explanation  only a small fraction of the mention pairs are coreferent  A system needs very strong evidence to overcome the default hypothesis that a pair of mentions is not coreferent  The weak (semantic) indicators of coreference will likely have high false positive rates, doing more harm than good  Our explanation  It’s harder to use semantics to improve a strong baseline  D&K’s semantic features are shallow semantic features 344 Other (Failed) Attempts  Stanford’s multi-pass sieve system  best system in the CoNLL 2011 shared task  extended their system with 2 new sieves that exploit semantics from WordNet, Wikipedia infoboxes, and Freebase records  the semantics sieves didn’t help 345 Other (Failed) Attempts  Stanford’s multi-pass sieve system  best system in the CoNLL 2011 shared task  extended their system with 2 new sieves that exploit semantics from WordNet, Wikipedia infoboxes, and Freebase records  the semantics sieves didn’t help  Sapena et al.’s (2013) system  2nd best system in the CoNLL 2011 shared task  Proposed a constraint-based hypergraph partitioning approach  Used info extracted from Wikipedia as features/constraints  Conclusion: the problem seems to lie with the extracted info, which is biased in favor of famous/popular entities… including false positives… imbalance against entities with little or no info 346 Plan for the Talk  Part I: Background  Task definition  Why coreference is hard  Applications  Brief history  Part II: Machine learning for coreference resolution  System architecture  Computational models  Resources and evaluation (corpora, evaluation metrics, …)  Employing semantics and world knowledge  Part III: Solving hard coreference problems  Difficult cases of overt pronoun resolution  Relation to the Winograd Schema Challenge 347 Hard-to-resolve Definite Pronouns  Resolve definite pronouns for which traditional linguistic constraints on coreference and commonly-used resolution heuristics would not be useful 348 A Motivating Example (Winograd, 1972)  The city council refused to give the demonstrators a permit because they feared violence.  The city council refused to give the demonstrators a permit because they advocated violence. 349 Another Motivating Example (Hirst, 1981)  When Sue went to Nadia’s home for dinner, she served sukiyaki au gratin.  When Sue went to Nadia’s home for dinner, she ate sukiyaki au gratin. 350 Another Example  James asked Robert for a favor, but he refused.  James asked Robert for a favor, but he was refused. 351 Yet Another Example  Sam fired Tom but he did not regret doing so.  Sam fired Tom although he is diligent. 352 Focus on certain kinds of sentences  The target pronoun should  appear in a sentence that has two clauses with a discourse connective, where the first clause contains two candidate antecedents and the second contains the pronoun  agree in gender, number, semantic class with both candidates When Sue went to Nadia’s home for dinner, she served sukiyaki au gratin. When Sue went to Nadia’s home for dinner, she ate sukiyaki au gratin.
Rahman & Ng EMNLP’12 353 Focus on certain kinds of sentences  The target pronoun should  appear in a sentence that has two clauses with a discourse connective, where the first clause contains two candidate antecedents and the second contains the pronoun  agree in gender, number, semantic class with both candidates  We ensure that each sentence has a twin. Two sentences are twins if  their first clauses are the same  they have lexically identical pronouns with different antecedents When Sue went to Nadia’s home for dinner, she served sukiyaki au gratin. When Sue when to Nadia’s home for dinner, she ate sukiyaki au gratin. Rahman & Ng EMNLP’12 354 Dataset  941 sentence pairs composed by 30 students who took my undergraduate machine learning class in Fall 2011 355 Our Approach: Ranking  Create one ranking problem from each sentence  Each ranking problem consists of two instances  one formed from the pronoun and the first candidate  one formed from the pronoun and the second candidate  Goal: train a ranker that assigns a higher rank to the instance having the correct antecedent for each ranking problem 356 Eight Components for Deriving Features  Narrative chains  Google  FrameNet  Semantic compatibility  Heuristic polarity  Machine-learned polarity  Connective-based relations  Lexical features 357 Narrative Chains (Chambers & Jurafsky, 2008)  Narrative chains are learned versions of scripts  Scripts represent knowledge of stereotypical event sequences that can aid text understanding  Reach restaurant, waiter sits you, gives you a menu, order food,..  Partially ordered sets of events centered around a protagonist  e.g., borrow-s invest-s spend-s pay-s raise-s lend-s  Someone who borrows something may invest, spend, pay, or lend it  can contain a mix of “s” (subject role) and “o” (object role)  e.g., the restaurant script 358 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  Employ narrative chains to heuristically predict the antecedent 359 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  Employ narrative chains to heuristically predict the antecedent 1) Find the event in which the pronoun participates and its role  “he” participates in the “try” and “escape” events as a subject 360 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  Employ narrative chains to heuristically predict the antecedent 1) Find the event in which the pronoun participates and its role  “he” participates in the “try” and “escape” events as a subject 2) Find the event(s) in which the candidates participate  Both candidates participate in the “punish” event 361 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  Employ narrative chains to heuristically predict the antecedent 1) Find the event in which the pronoun participates and its role  “he” participates in the “try” and “escape” events as a subject 2) Find the event(s) in which the candidates participate  Both candidates participate in the “punish” event 3) Pair each candidate event with each pronoun event  Two pairs are created: (punish, try-s), (punish, escape-s) 362 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  
Employ narrative chains to heuristically predict the antecedent 1) Find the event in which the pronoun participates and its role  “he” participates in the “try” and “escape” events as a subject 2) Find the event(s) in which the candidates participate  Both candidates participate in the “punish” event 3) Pair each candidate event with each pronoun event  Two pairs are created: (punish, try-s), (punish, escape-s) 4) For each pair, extract chains containing both elements in pair  One chain is extracted, which contains punish-o and escape-s 363 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  Employ narrative chains to heuristically predict the antecedent 1) Find the event in which the pronoun participates and its role  “he” participates in the “try” and “escape” events as a subject 2) Find the event(s) in which the candidates participate  Both candidates participate in the “punish” event 3) Pair each candidate event with each pronoun event  Two pairs are created: (punish, try-s), (punish, escape-s) 4) For each pair, extract chains containing both elements in pair  One chain is extracted, which contains punish-o and escape-s 5) Obtain role played by pronoun in the candidate’s event: object 364 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  Employ narrative chains to heuristically predict the antecedent 1) Find the event in which the pronoun participates and its role  “he” participates in the “try” and “escape” events as a subject 2) Find the event(s) in which the candidates participate  Both candidates participate in the “punish” event 3) Pair each candidate event with each pronoun event  Two pairs are created: (punish, try-s), (punish, escape-s) 4) For each pair, extract chains containing both elements in pair  One chain is extracted, which contains punish-o and escape-s 5) Obtain role played by pronoun in the candidate’s event: object 6) Find the candidate that plays the extracted role: Tim 365 How can we apply narrative chains for pronoun resolution? Ed punished Tim because he tried to escape.  Employ narrative chains to heuristically predict the antecedent 1) Find the event in which the pronoun participates and its role  “he” participates in the “try” and “escape” events as a subject 2) Find the event(s) in which the candidates participate  Both candidates participate in the “punish” event 3) Pair each candidate event with each pronoun event  Two pairs are created: (punish, try-s), (punish, escape-s) 4) For each pair, extract chains containing both elements in pair  One chain is extracted, which contains punish-o and escape-s 5) Obtain role played by pronoun in the candidate’s event: object 6) Find the candidate that plays the extracted role: Tim  Creates a binary feature that encodes this heuristic decision 366 Search Engine (Google) Lions eat zebras because they are predators. 367 Search Engine (Google) Lions eat zebras because they are predators. 1) Replace the target pronoun with a candidate antecedent Lions eat zebras because lions are predators. Lions eat zebras because zebras are predators. 2) Generate search queries based on lexico-syntactic patterns  Four search queries for this example: “lions are”, “zebras are”, “lions are predators”, “zebras are predators 3) Create features where the query counts obtained for the two candidate antecedents are compared 368 FrameNet John killed Jim, so he was arrested. 369 FrameNet John killed Jim, so he was arrested.  
Both candidates are names, so search queries won’t return useful counts. 370 FrameNet John killed Jim, so he was arrested.  Both candidates are names, so search queries won’t return useful counts.  Solution: before generating search queries, replace each name with its FrameNet semantic role  “John” with “killer”, “Jim” with “victim”  Search “killer was arrested”, “victim was arrested”, … 371 Semantic Compatibility  Same as what we did in the Search Engine component, except that we obtain query counts from the Google Gigaword corpus 372 Heuristic Polarity Ed was defeated by Jim in the election although he is more popular. Ed was defeated by Jim in the election because he is more popular. 373 Heuristic Polarity Ed was defeated by Jim in the election although he is more popular. Ed was defeated by Jim in the election because he is more popular. 374 Heuristic Polarity Ed was defeated by Jim in the election although he is more popular. Ed was defeated by Jim in the election because he is more popular.  Use polarity information to resolve target pronouns in sentences that involve comparison 375 Heuristic Polarity Ed was defeated by Jim in the election although he is more popular. Ed was defeated by Jim in the election because he is more popular.  Use polarity information to resolve target pronouns in sentences that involve comparison 1) Assign rank values to the pronoun and the two candidates  In first sentence, “Jim” is better, “Ed” is worse, “he” is worse  In second sentence, “Jim” is better, “Ed” is worse, “he” is better 376 Heuristic Polarity Ed was defeated by Jim in the election although he is more popular. Ed was defeated by Jim in the election because he is more popular.  Use polarity information to resolve target pronouns in sentences that involve comparison 1) Assign rank values to the pronoun and the two candidates  In first sentence, “Jim” is better, “Ed” is worse, “he” is worse  In second sentence, “Jim” is better, “Ed” is worse, “he” is better 2) Resolve pronoun to the candidate that has the same rank value as the pronoun 377 Heuristic Polarity Ed was defeated by Jim in the election although he is more popular. Ed was defeated by Jim in the election because he is more popular.  Use polarity information to resolve target pronouns in sentences that involve comparison 1) Assign rank values to the pronoun and the two candidates  In first sentence, “Jim” is better, “Ed” is worse, “he” is worse  In second sentence, “Jim” is better, “Ed” is worse, “he” is better 2) Resolve pronoun to the candidate that has the same rank value as the pronoun  Create features that encode this heuristic decision and rank values 378 Machine-Learned Polarity  Hypothesis  rank values could be computed more accurately by employing a sentiment analyzer that can capture contextual information 379 Machine-Learned Polarity  Hypothesis  rank values could be computed more accurately by employing a sentiment analyzer that can capture contextual information  Same as Heuristic Polarity, except that OpinionFinder (Wilson et al., 2005) is used to compute rank values 380 Connective-Based Relations Google bought Motorola because they are rich. 381 Connective-Based Relations Google bought Motorola because they are rich.  To resolve “they”, we 1) Count number of times the triple <“buy”, “because”, “rich”> appears in the Google Gigaword corpus 382 Connective-Based Relations Google bought Motorola because they are rich.  
To resolve “they”, we 1) Count number of times the triple <“buy”, “because”, “rich”> appears in the Google Gigaword corpus 2) If count is greater than a certain threshold, resolve pronoun to candidate that has the same deep grammatical role as pronoun 383 Connective-Based Relations Google bought Motorola because they are rich.  To resolve “they”, we 1) Count number of times the triple <“buy”, “because”, “rich”> appears in the Google Gigaword corpus 2) If count is greater than a certain threshold, resolve pronoun to candidate that has the same deep grammatical role as pronoun 3) Generate feature based on this heuristic resolution decision 384 Lexical Features  Exploit information in the coreference-annotated training texts 385 Lexical Features  Exploit information in the coreference-annotated training texts  Antecedent-independent features  Unigrams  Bigrams (pairing word before connective and word after connective)  Trigrams (augmenting each bigram with connective) 386 Lexical Features  Exploit information in the coreference-annotated training texts  Antecedent-independent features  Unigrams  Bigrams (pairing word before connective and word after connective)  Trigrams (augmenting each bigram with connective)  Antecedent-dependent features  pair a candidate’s head word with  its governing verb  its modifying adjective  the pronoun’s governing verb  the pronoun’s modifying adjective 387 Evaluation  Dataset  941 annotated sentence pairs (70% training; 30% testing) 388 Results Unadjusted Scores Adjusted Scores Correct Wrong No Dec. Correct Wrong No Dec. Stanford 40.07 29.79 30.14 55.14 44.86 0.00 Baseline Ranker 47.70 47.16 5.14 50.27 49.73 0.00 Combined resolver 53.49 43.12 3.39 55.19 44.77 0.00 Our system 73.05 26.95 0.00 73.05 26.95 0.00 389 Unadjusted Scores Adjusted Scores Correct Wrong No Dec. Correct Wrong No Dec. Stanford 40.07 29.79 30.14 55.14 44.86 0.00 Baseline Ranker 47.70 47.16 5.14 50.27 49.73 0.00 Combined resolver 53.49 43.12 3.39 55.19 44.77 0.00 Our system 73.05 26.95 0.00 73.05 26.95 0.00  Evaluation metrics: percentages of target pronouns that are  correctly resolved  incorrectly resolved  unresolved (no decision) Results 390  Evaluation metrics: percentages of target pronouns that are  correctly resolved  incorrectly resolved  unresolved (no decision) Results Unadjusted Scores Adjusted Scores Correct Wrong No Dec. Correct Wrong No Dec. Stanford 40.07 29.79 30.14 55.14 44.86 0.00 Baseline Ranker 47.70 47.16 5.14 50.27 49.73 0.00 Combined resolver 53.49 43.12 3.39 55.19 44.77 0.00 Our system 73.05 26.95 0.00 73.05 26.95 0.00 391  Evaluation metrics: percentages of target pronouns that are  correctly resolved  incorrectly resolved  unresolved (no decision) Results Unadjusted Scores Adjusted Scores Correct Wrong No Dec. Correct Wrong No Dec. Stanford 40.07 29.79 30.14 55.14 44.86 0.00 Baseline Ranker 47.70 47.16 5.14 50.27 49.73 0.00 Combined resolver 53.49 43.12 3.39 55.19 44.77 0.00 Our system 73.05 26.95 0.00 73.05 26.95 0.00 392  Evaluation metrics: percentages of target pronouns that are  correctly resolved  incorrectly resolved  unresolved (no decision) Results Unadjusted Scores Adjusted Scores Correct Wrong No Dec. Correct Wrong No Dec. Stanford 40.07 29.79 30.14 55.14 44.86 0.00 Baseline Ranker 47.70 47.16 5.14 50.27 49.73 0.00 Combined resolver 53.49 43.12 3.39 55.19 44.77 0.00 Our system 73.05 26.95 0.00 73.05 26.95 0.00 393 Results  Unadjusted Scores  Raw scores computed based on a resolver’s output Unadjusted Scores Adjusted Scores Correct Wrong No Dec. 
388-394 Results

                      Unadjusted Scores             Adjusted Scores
                      Correct  Wrong  No Dec.       Correct  Wrong  No Dec.
  Stanford              40.07  29.79    30.14         55.14  44.86     0.00
  Baseline Ranker       47.70  47.16     5.14         50.27  49.73     0.00
  Combined resolver     53.49  43.12     3.39         55.19  44.77     0.00
  Our system            73.05  26.95     0.00         73.05  26.95     0.00

• Evaluation metrics: percentages of target pronouns that are
  - correctly resolved
  - incorrectly resolved
  - unresolved (no decision)
• Unadjusted Scores: raw scores computed from a resolver's output
• Adjusted Scores: "force" a resolver to resolve every pronoun by probabilistically assuming that it gets half of the unresolved pronouns right (see the sketch after the Ablation slide)

395-397 Results
• Three baseline resolvers
  - Stanford resolver (Lee et al., 2011)
  - Baseline Ranker: same as our ranking approach, except that the ranker is trained using the 39 features from Rahman & Ng (2009)
  - Combined resolver: combines Stanford and the Baseline Ranker; the Baseline Ranker is used only when Stanford cannot make a decision

398-399 Results
• Stanford outperforms the Baseline Ranker (on adjusted scores)
• The Combined resolver does not outperform Stanford
• Our system outperforms Stanford by 18 accuracy points

400 Ablation Experiments
• Remove each of the 8 components one at a time
• Accuracy drops significantly (paired t-test, p < 0.05) after each component is removed
• Most useful: Narrative Chains, Google, Lexical Features
• Least useful: FrameNet, Machine-Learned Polarity
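The adjustment can be reproduced directly from the unadjusted numbers; here is a minimal sketch, assuming the three columns are percentages of the test pronouns.

    # Minimal sketch of the score adjustment: a resolver is credited with half
    # of its unresolved ("no decision") pronouns.
    def adjust(correct, wrong, no_decision):
        return (round(correct + no_decision / 2, 2),
                round(wrong + no_decision / 2, 2),
                0.0)

    print(adjust(40.07, 29.79, 30.14))  # Stanford        -> (55.14, 44.86, 0.0)
    print(adjust(47.70, 47.16, 5.14))   # Baseline Ranker -> (50.27, 49.73, 0.0)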
401 Peng et al. (2015)
• An alternative approach to this task
• Define two types of predicate schemas
• Collect statistics for instantiated schemas from different knowledge sources: Gigaword, Wikipedia, and the Web

402 Motivating Example
  The bee landed on the flower because it had pollen.
• To correctly resolve "it", we need to know:
  S(have(m=[the flower], a=pollen)) > S(have(m=[the bee], a=pollen))
• Corresponding predicate schema: S(pred_m(m, a))

403 Motivating Example
  The bird perched on the limb and it bent.
• To correctly resolve "it", we need to know:
  S(bend(m=[the limb], a=*)) > S(bend(m=[the bird], a=*))
• Corresponding predicate schema: S(pred_m(m, *))

404-407 Predicate Schema Type 1
• We saw S(pred_m(m, a)) and S(pred_m(m, *))
• More generally, we also need S(pred_m(a, m)) and S(pred_m(*, m))

408-410 Motivating Example
  Ed was afraid of Tim because he gets scared around new people.
• To correctly resolve "he", we need to know:
  S(be afraid of(m, a), because, get scared(m, a')) > S(be afraid of(a, m), because, get scared(m, a'))
• Corresponding predicate schemas:
  S(pred1_m(m, a), dc, pred2_m(m, a'))
  S(pred1_m(a, m), dc, pred2_m(m, a'))

411-413 Predicate Schema Type 2
• So far, we have
  S(pred1_m(m, a), dc, pred2_m(m, a'))
  S(pred1_m(a, m), dc, pred2_m(m, a'))
• More generally, we also need:
  S(pred1_m(m, a), dc, pred2_m(a', m))
  S(pred1_m(a, m), dc, pred2_m(a', m))
  S(pred1_m(m, *), dc, pred2_m(*, m))
  S(pred1_m(*, m), dc, pred2_m(*, m))

414 Collecting Statistics for the Schemas
• ... from Gigaword, Wikipedia, and the Web

415 Using the Statistics
• As features and/or as constraints for their coreference resolver (a minimal scoring sketch follows below)

416 Peng et al.'s results on our dataset
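A minimal sketch of how Type 1 schema statistics could decide the bee/flower example above. The counts and helper names below are hypothetical stand-ins for the statistics collected from Gigaword, Wikipedia, and the Web; this illustrates the idea, not Peng et al.'s implementation.

    # Hypothetical Type 1 schema counts S(pred(m, a)); in Peng et al. (2015)
    # such statistics come from Gigaword, Wikipedia, and the Web.
    SCHEMA_COUNTS = {
        ("have", "flower", "pollen"): 5400,  # made-up counts, for illustration
        ("have", "bee", "pollen"): 320,
    }

    def schema_score(pred, mention, arg):
        return SCHEMA_COUNTS.get((pred, mention, arg), 0)

    def resolve_type1(pred, candidates, arg):
        """Pick the candidate m that maximizes S(pred(m, a))."""
        return max(candidates, key=lambda m: schema_score(pred, m, arg))

    # "The bee landed on the flower because it had pollen."
    print(resolve_type1("have", ["bee", "flower"], "pollen"))  # flower

In the actual system these scores are not used as a hard decision rule; as noted above, they enter the resolver as features and/or as constraints.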
417 Summary
• There is a recent surge of interest in these hard but incredibly interesting pronoun resolution tasks
• They could serve as an alternative to the Turing Test (Levesque, 2011)
  - This challenge is known as the Winograd Schema Challenge
  - Announced as a shared task at AAAI 2014, sponsored by Nuance
  - Additional test cases are available from Ernest Davis' website: https://www.cs.nyu.edu/davise/papers/WS.html

418 Challenges

419-420 Challenges: New Models
• Can we jointly learn coreference resolution with other tasks?
  - Exploit cross-task constraints to improve model learning
• Durrett & Klein (2014): jointly learn coreference with two tasks, using a graphical model that encodes soft constraints in factors
  - Named entity recognition (coarse semantic typing)
  - Entity linking (matching to Wikipedia entities)
  - Use semantic info in Wikipedia for better semantic typing
  - Use semantic types to disambiguate tricky Wikipedia links
  - Ensure consistent type predictions across coreferent mentions
  - ...
• Hajishirzi et al. (2013): jointly learn coreference with entity linking
• Can we jointly learn entity coreference with event coreference?

421-422 Challenges: New Features
• There is a limit on how far one can improve coreference resolution using machine learning methods alone
  - A good model can profitably exploit the available features, but if the knowledge needed is not present in the data, there isn't much the model can do
• We know that semantics and world knowledge are important, but it is hard to use them to improve state-of-the-art systems
• Wiseman: learn non-linear representations from raw features
  - What if we learn such representations from complex features, including those that encode world knowledge?
• Can we leverage recent advances in distributional lexical semantics (e.g., word embeddings)?

423-424 Challenges: New Languages
• Low-resource languages
• Large lexical knowledge bases may not be available
  - Can we learn world knowledge from raw text?
  - Idea: use appositive constructions, e.g., Barack Obama, president of the United States, ... (see the sketch below)
• Large coreference-annotated corpora may not be available
  - Can we employ weakly supervised learning or active learning?
  - Can we exploit resources from a resource-rich language?
  - Idea: translation-based coreference annotation projection
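A minimal sketch of the appositive idea: mining "X, Y, ..." constructions from raw text as (entity, description) knowledge pairs. The regex-based matcher below is a rough illustration of the idea; a real system would use a syntactic parser to identify appositives.

    import re

    # Rough illustration: harvest (name, description) pairs from appositive
    # constructions such as "Barack Obama, president of the United States, ...".
    APPOSITIVE = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+), ([a-z][\w ]+?),")

    def mine_appositives(text):
        return APPOSITIVE.findall(text)

    text = ("Barack Obama, president of the United States, met "
            "Angela Merkel, the chancellor of Germany, in Berlin.")
    print(mine_appositives(text))
    # [('Barack Obama', 'president of the United States'),
    #  ('Angela Merkel', 'the chancellor of Germany')]

Pairs like these can serve as a noisy substitute for a curated knowledge base when judging whether two mentions (e.g., Obama and the president) are compatible.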
425-433 Translation-Based Projection: Example
  Target-language document: 玛丽告诉约翰她非常喜欢他。
  1. Machine-translate the document from the target language to the source language
     Mary told John that she liked him a lot.
  2. Run a resolver on the translated document to extract mentions and produce coreference chains
     [Mary] told [John] that [she] liked [him] a lot.
  3. Project the annotations from the source back to the target
     - project mentions
     - project coreference chains
     [玛丽]告诉[约翰][她]非常喜欢[他]。

434 Challenges: New Coreference Tasks
• Bridging [non-identity coreference]
  - Set-subset relation, part-whole relation
• Event coreference resolution
  - Determines which event mentions refer to the same event
  - Difficult because, for two events to be coreferent, one needs to check whether their arguments/participants are compatible
• Partial coreference relations [non-identity coreference]
  - Subevent: subevent relations form a stereotypical sequence of events
    e.g., bombing → destroyed → wounding
  - Membership: multiple instances of the same kind of event
    e.g., I attended three parties last month. The 1st one was the best.

435 Challenges: New Evaluation Metrics
• Designing evaluation metrics is a challenging task
• There are four commonly used coreference evaluation metrics (MUC, B³, CEAF, BLANC), but it is not clear which of them is the best
  - Can we trust them? (Moosavi & Strube, 2016)
• Weaknesses
  - Linguistically agnostic: are all links equally important?
    E.g., 3 mentions: Hillary Clinton, she, she
    System 1: Clinton-she; System 2: she-she
    (the sketch below works through this example with the link-based MUC metric)
  - Hard to interpret the resulting F-scores: can the scores tell which aspects of a system can be improved?
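To make the "all links are equal" weakness concrete, here is a small, self-contained sketch of the link-based MUC metric (Vilain et al., 1995) applied to the three-mention example. It is a simplified illustration, not the official reference scorer.

    # Simplified MUC scorer: entities are sets of mentions.
    def muc(key, response):
        def score(gold, system):
            num = den = 0
            for entity in gold:
                # Partition the gold entity by the system entities; mentions not
                # covered by any system entity count as singleton parts.
                parts = {frozenset(entity & s) for s in system if entity & s}
                covered = set().union(*parts) if parts else set()
                num_parts = len(parts) + len(entity - covered)
                num += len(entity) - num_parts
                den += len(entity) - 1
            return num / den if den else 0.0
        recall = score(key, response)
        precision = score(response, key)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return recall, precision, f1

    key = [{"Hillary Clinton", "she_1", "she_2"}]
    system1 = [{"Hillary Clinton", "she_1"}]  # links the name to a pronoun
    system2 = [{"she_1", "she_2"}]            # links only the two pronouns
    print(muc(key, system1))  # recall 0.5, precision 1.0, F1 ≈ 0.67
    print(muc(key, system2))  # identical: recall 0.5, precision 1.0, F1 ≈ 0.67

Both outputs receive the same MUC score even though resolving the name is arguably the more useful link, which is exactly the "linguistically agnostic" weakness noted above.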