ARGUMENTATION MINING
Marie-Francine Moens
joint work with Raquel Mochales Palau and Parisa Kordjamshidi
Language Intelligence and Information Retrieval
Department of Computer Science, KU Leuven, Belgium
Dundee, 5-9-2014

OUTLINE
- Part 1: The setting
  - Definition of argumentation mining
  - Importance of the task
- Part 2: Introducing current methods
  - Machine learning
  - Features
  - Common techniques: logistic regression, conditional random fields, support vector machines
  - Joint recognition: grammars, graphical models, structured support vector machines
  - Features revisited
  - Textual entailment
- Part 3: Some applications
  - Legal field
  - Scientific texts
  - Blogs
  - Dialogues and debates
- Part 4: Conclusions and thoughts for future research

PART 1: THE SETTING

ARGUMENTATION MINING
- = the detection of an argumentative discourse structure in text or speech, and the detection and functional classification of its composing components
- Argumentation mining = recognition of a rhetorical structure in a discourse
- Rhetoric is the art of discourse that aims to improve the capabilities of writers and speakers to inform, persuade or motivate particular audiences in specific situations [Corbett, E. P. J. (1990). Classical Rhetoric for the Modern Student. New York: Oxford University Press, p. 1]

ARGUMENTATION
- Is probably as old as mankind
- Has been studied by philosophers throughout history

SOME HISTORY
- From Ancient Greece to the late 19th century, argumentation was a central part of Western education: the need to train public speakers and writers to move audiences to action with arguments
- The study of argumentation is very often based on theories of rhetoric and logic
- Argumentation was/is taught at universities
- Highlights:
  - Aristotle's (4th century BC) logical works: the Organon
  - George Pierce Baker, The Principles of Argumentation, 1895
  - Chaïm Perelman describes the techniques of argumentation used by people to obtain the approval of others for their opinions: Traité de l'argumentation – la nouvelle rhétorique, 1958
  - Stephen Toulmin explains how argumentation occurs in the natural process of an everyday argument: The Uses of Argument, Cambridge University Press, 1958

http://sokogskriv.no/en/reading/argumentation-in-text/

TODAY
- We find argumentation in:
  - Legal texts and court decisions
  - Biomedical cases
  - Scientific texts
  - Patents
  - Reviews, online fora, user-generated content
  - Debates, interactions, dialogues
  - ...

WHY ARGUMENTATION MINING?
- In the overload of information, users want to find the arguments that sustain a certain claim or conclusion
- Argumentation mining refines search and information retrieval, and provides the end user with instructive visualizations and summaries of an argumentative structure
- Argumentation mining is related to opinion mining, but the end user also wants to know the underlying grounds and possibly the counterarguments

WHAT IS THE STATE OF THE ART?
- Argumentative zoning
- Argumentation mining of legal cases
- Argumentation mining in online user comments and discussions
- ...
ARGUMENTATIVE ZONING
- = segmentation of a discourse into discourse segments or zones that each play a specific rhetorical role in the text
- Zone categories [PhD thesis of Simone Teufel 2000]:
  - BKG: general scientific background (yellow)
  - OTH: neutral descriptions of other people's work (orange)
  - OWN: neutral descriptions of the own, new work (blue)
  - AIM: statements of the particular aim of the current paper (pink)
  - TXT: statements of the textual organization of the current paper ("in chapter 1, we introduce ...") (red)
  - CTR: contrastive or comparative statements about other work; explicit mention of weaknesses of other work (green)
  - BAS: statements that the own work is based on other work (purple)
- Methods: seen as a classification task: a rule-based or statistical classifier (e.g., naïve Bayes, support vector machine) is trained with manually annotated examples (a minimal code sketch follows at the end of this part)
[Moens, M.-F. & Uyttendaele, C. Information Processing & Management 1997] [Teufel, S. & Moens, M. ACL 1999] [Teufel, S. & Moens, M. EMNLP 2000] [Hachey, B. & Grover, C. ICAIL 2005]

ARGUMENTATION MINING OF LEGAL CASES
- Legal field: precedent reasoning
  - Search for cases that use a similar type of reasoning, e.g., acceptance or rejection of a claim based on precedent cases
- Adds an additional dimension to argumentative zoning:
  - Needs detection of the argumentation structure and classification of its components
  - Components or segments are connected by argumentative relationships
[Moens, Boiy, Mochales & Reed ICAIL 2007]
[Figure slides: examples from the PhD thesis of Raquel Mochales Palau, 2011]

- Definitions [PhD thesis of Raquel Mochales Palau 2011; Mochales & Moens AI & Law 2011]:
  - Argumentation: a process whereby arguments are constructed, exchanged and evaluated in light of their interactions with other arguments
  - Argument: a set of premises - pieces of evidence - in support of a claim
  - Claim: a proposition put forward by somebody as true; the claim of an argument is normally called its conclusion
  - Argumentation may also involve chains of reasoning, where claims are used as premises for deriving further claims

PART 2: INTRODUCING CURRENT METHODS

TEXT MINING
- Text mining, also referred to as text data mining and roughly equivalent to text analytics: = deriving high-quality information from text
- Often done by learning statistical patterns => use of statistical machine learning techniques

ARGUMENTATION MINING
- Because argumentation is well studied, typical argumentation structures are defined:
  - => structuring of information: detecting the argumentation and its components
  - => assignment of metadata: labeling of argumentation components and relations
- Can be done manually:
  - But people are (often) expensive, slow and inconsistent
  - Can we perform this task automatically?
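To make the classification view concrete, here is a minimal sketch of argumentative zoning as supervised sentence classification. It is an illustration only, not the classifiers of the cited work: the four training sentences and their zone labels are invented, and a real system would be trained on thousands of manually annotated sentences.

```python
# Argumentative zoning as supervised sentence classification (sketch).
# Toy data: four invented sentences, one per zone label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "Machine translation has received much attention in recent years.",
    "Smith et al. propose a phrase-based decoding algorithm.",
    "In this paper we present a novel reordering model.",
    "Section 2 describes our training data.",
]
train_zones = ["BKG", "OTH", "AIM", "TXT"]  # Teufel-style zone labels

# Bag-of-words features with tf-idf weights, fed into a linear SVM.
zoner = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
zoner.fit(train_sentences, train_zones)

print(zoner.predict(["We present a new method for argument detection."]))
```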
ARGUMENTATION MINING
- Approaches: pattern recognition
  - Symbolic techniques: the knowledge, or part of it, is formally and manually implemented
  - Statistical machine learning techniques: the knowledge, or part of it, is automatically acquired
- Mostly: supervised machine learning techniques
- Why?
  - Argumentation structure is well studied
  - Manually labeled examples are available
  - Annotating examples is usually considered easier than pattern engineering
  - Current supervised learning techniques allow the integration of soft rules
- Argumentation mining needs a large amount of knowledge:
  - Linguistic knowledge of the vocabulary, syntax and semantics of the language and the discourse
  - Knowledge of the subject domains
  - Background knowledge of the person who uses the texts at a certain moment in time

SUPERVISED LEARNING
- Techniques of supervised learning:
  - training set: example objects classified by an expert or teacher
  - detection of general but high-accuracy classification patterns (a function or rules) in the training set, based on object features and their values
  - the learned patterns should generalize, i.e., correctly classify new, previously unseen objects in a test set given their features and feature values
- Text recognition or classification can be seen as:
  - a two-class learning problem: an object is classified as belonging or not belonging to a particular class; convenient when the classes are not mutually exclusive
  - a single multi-class learning problem
- Result = often a probability of belonging to a class, rather than simply a classification

GENERATIVE VERSUS DISCRIMINATIVE CLASSIFICATION
- In classification, given inputs x and their labels y:
  - A generative classifier learns a model of the joint probability p(x,y) = p(y) p(x|y), conditions on the observed features x to derive the class posterior p(y|x), and selects the most probable y for x
    - called generative because it specifies how to generate the observed features x for each class y
    - e.g., naïve Bayes, hidden Markov model
  - A discriminative classifier learns a model p(y|x) that directly models the mapping from inputs x to output y, and selects the most likely label y
    - called discriminative because it discriminates between classes
    - e.g., logistic regression model, conditional random field, support vector machine (discussed in this tutorial)
  - (both kinds are contrasted in the code sketch below)

MAXIMUM ENTROPY PRINCIPLE
- Text classifiers are often trained with incomplete information
- Probabilistic classification can adhere to the principle of maximum entropy: when we make inferences based on incomplete information, we should draw them from the probability distribution that has the maximum entropy permitted by the information we have
- E.g., multinomial logistic regression, conditional random fields
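To contrast the two kinds of classifier on the same data, here is a minimal sketch (with an invented two-document training set): naïve Bayes models p(x|y)p(y) and derives the posterior, while logistic regression, the maximum entropy model just mentioned, models p(y|x) directly.

```python
# Generative vs. discriminative classification on the same toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

texts = ["the court accepts the claim", "the applicant provides no evidence"]
labels = ["conclusion", "premise"]  # invented labels for illustration

vectorizer = CountVectorizer().fit(texts)
X = vectorizer.transform(texts)

# Generative: models p(x|y) and p(y), derives p(y|x) via Bayes' rule.
nb = MultinomialNB().fit(X, labels)
# Discriminative: models p(y|x) directly (multinomial logistic
# regression = the maximum entropy classifier).
lr = LogisticRegression().fit(X, labels)

x_new = vectorizer.transform(["the court notes substantial evidence"])
print(nb.predict_proba(x_new), lr.predict_proba(x_new))
```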
on: § the object itself § other objects and their class § the existing relations among the various classes §  e.g., hidden Markov model, conditional random fields, structured support vector machine, structured perceptron Tutorial Argumentation Mining 2014 § Local classification (i.e., learning a model for each class), applying the models on each input, and combining the outputs § Global classification (i.e., learning 1 model jointly, cf. context dependent classification) Tutorial Argumentation Mining 2014 34 LOCAL VERSUS GLOBAL CLASSIFICATION 35 FEATURE SELECTION AND EXTRACTION ¡  In classification tasks: object is described with set of attributes or features ¡  Typical features in text classification tasks: §  word, phrase, syntactic class of a word, text position, the length of a sentence, the relationship between two sentences, an n-gram, a document (term classification), …. §  choice of the features is application- and domain-specific ¡  Features can have a value, for text the value is often: §  numeric, e.g., discrete or real values §  nominal, e.g. certain strings §  ordinal, e.g., the values 0= small number, 1 = medium number, 2 = large number Tutorial Argumentation Mining 2014 36 FEATURE SELECTION AND EXTRACTION ¡  The features together span a multi-variate space called the measurement space or feature space: §  an object x can be represented as: §  a vector of features: x = [x1, x2, …, xp]T where p = the number of features measured §  as a structure: e.g., §  representation in first order predicate logic §  graph representation (e.g., tree) where relations between features are figured as edges between nodes and nodes can contain attributes of features Tutorial Argumentation Mining 2014 37 SWARM INTELLIGENCE Following a trail of insects as they work together to accomplish a task offers unique possibilities for problem solving. By Peter Tarasewich & Patrick R. McMullen Even with today’s ever-increasing computing power, there are still many types of problems that are very difficult to solve. Particularly combinatorial optimization problems continue to pose challenges. An example of this type of problem can be found in product design. Take as an example the design of an automobile based on the attributes of engine horsepower, passenger seating, body style and wheel size. If we have three different levels for each of these attributes, there are 3 4 , or 81, possible configurations to consider. For a slightly larger problem with 5 attributes of 4 levels, there are suddenly 1,024 combinations. Typically, an enormous amount of possible combinations exist, even for relatively small problems. Finding the optimal solution to these problems is usually impractical. Fortunately, search heuristics have been developed to find good solutions to these problems in a reasonable amount of time. Over the past decade or so, several heuristic techniques have been developed that build upon observations of processes in the physical and biological sciences. 
Examples of classification features, marked on an example text: words, sentence position, POS tag, sentence length, the following and preceding word.

Example text:
"SWARM INTELLIGENCE. Following a trail of insects as they work together to accomplish a task offers unique possibilities for problem solving. By Peter Tarasewich & Patrick R. McMullen. Even with today's ever-increasing computing power, there are still many types of problems that are very difficult to solve. Combinatorial optimization problems in particular continue to pose challenges. An example of this type of problem can be found in product design. Take as an example the design of an automobile based on the attributes of engine horsepower, passenger seating, body style and wheel size. If we have three different levels for each of these attributes, there are 3^4, or 81, possible configurations to consider. For a slightly larger problem with 5 attributes of 4 levels, there are suddenly 1,024 combinations. Typically, an enormous number of possible combinations exists, even for relatively small problems. Finding the optimal solution to these problems is usually impractical. Fortunately, search heuristics have been developed to find good solutions to these problems in a reasonable amount of time. Over the past decade or so, several heuristic techniques have been developed that build upon observations of processes in the physical and biological sciences. Examples of these techniques include Genetic Algorithms (GA) and simulated annealing ..."

FEATURE VECTORS FOR AN EXAMPLE TEXT
Example text: "A Java Applet that scans Java Applets"
- Binary values, based on lower-cased words: [a: 1, apple: 0, applet: 1, applets: 1, ..., java: 1, ...]
- After removing stopwords: [apple: 0, applet: 1, applets: 1, ..., java: 1, ...]
- Numeric values, based on term frequency (tf) in the text: [apple: 0, applet: 1, applets: 1, ..., java: 2, ...]
- Numeric values, based on the term frequency of lower-cased n-grams: [aa: 0, a_a: 2, a_b: 0, ...]
- Numeric attribute values based on latent semantic indexing: [F1: 0.38228938, F2: 0.000388, F3: 0.201033, ...]
- ...

FEATURE SELECTION
- = eliminating low-quality features:
  - redundant features
  - noisy features
- Decreases computational complexity
- Decreases the danger of overfitting in supervised learning (especially with a large number of features and few training examples)
- Overfitting: the classifier perfectly fits the training data, but fails to generalize sufficiently from the training data to correctly classify new cases

FEATURE EXTRACTION
- = creating new features by applying a set of operators to the current features:
  - a single feature can be replaced by a new feature (e.g., replacing words by their stem)
  - a set of features can be replaced by one feature or another set of features
  - use of logical operators (e.g., disjunction) or arithmetical operators (e.g., mean, LSI)
  - the choice of operators is application- and domain-specific

COMMON CLASSIFICATION METHODS
- Naïve Bayes, learning of rules and trees, nearest-neighbor or exemplar-based learning, logistic regression, support vector machines
- Here we discuss support vector machines, logistic regression, and conditional random fields
- Then we move to more advanced methods such as structured perceptrons, structured support vector machines and more general graphical models

MACHINE LEARNING FRAMEWORK
- Input space: objects are represented as feature vectors
- Output space:
  - Regression: the space of real numbers ℜ
  - Classification: the set of discrete categories C = {C1, C2, ..., Cm}
- Hypothesis space = the class of function mappings from the input space to the output space
- To learn a good hypothesis, supervised learning uses a training set that contains a number of objects and their ground-truth labels
- Loss function: measures to what degree the prediction generated by the hypothesis is in accordance with the ground-truth label

SUPPORT VECTOR MACHINE
- When two classes are linearly separable:
  - find a hyperplane in the p-dimensional feature space that best separates the positive and negative examples with maximum margin
  - maximum margin: maximum Euclidean distance (= margin d) to the closest training examples (the support vectors)
  - e.g., a decision surface in two dimensions
- The idea can be generalized to examples that are not necessarily linearly separable and to examples that cannot be separated by linear decision surfaces
[Burges DMKD 1998]
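Before the formal treatment that follows, a minimal sketch of the geometric idea on four invented 2-D points: the linear SVM finds the maximum-margin hyperplane, and only the support vectors determine it.

```python
# Maximum-margin separation of two linearly separable point sets (sketch).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates the hard margin
clf.fit(X, y)

print(clf.support_vectors_)        # only these points fix the hyperplane
print(clf.coef_, clf.intercept_)   # w and b of the decision surface
print(clf.predict([[2.0, 2.5]]))
```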
SUPPORT VECTOR MACHINE
- Linear support vector machine, trained on data that are separable (the simple case):
  - input is a set of n training examples S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℜ^p and y_i ∈ {-1, +1} indicates that x_i is a negative or positive example respectively
- In case the data objects are not necessarily completely linearly separable (soft-margin SVM), the amount of training error is measured using slack variables ξ_i, the sum of which must not exceed some upper bound:

  \min_{w, b, \xi} \; w \cdot w + G \sum_{i=1}^{n} \xi_i^2
  \text{subject to } y_i (w \cdot x_i + b) - 1 + \xi_i \ge 0, \quad i = 1, \dots, n

  where ξ_i = the penalty for misclassifying x_i and G = a weighting factor
[Burges DMKD 1998]
- A dual representation, which turns out to be easier to solve, is obtained by introducing Lagrange multipliers λ_i:

  \max_{\lambda} \; W(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{n} \lambda_i \lambda_j y_i y_j \, (x_i \cdot x_j)   (1)
  \text{subject to } \lambda_i \ge 0, \; i = 1, \dots, n, \quad \sum_{i=1}^{n} \lambda_i y_i = 0

- This yields the following decision function:

  f(x) = \sum_{i=1}^{n} \lambda_i y_i \, (x_i \cdot x) + b, \qquad h(x) = \operatorname{sign}(f(x))   (2)

- The decision function depends only on the support vectors, i.e., the training examples for which λ_i > 0; training examples that are not support vectors have no influence on the decision function
- When classifying natural language data, it is not always possible to linearly separate the data: in this case we can map the data into a feature space where they are linearly separable
- Working in a high-dimensional feature space causes computational problems, as one has to work with very large vectors
- In the dual representation the data appear only inside inner products (both in the training problem (1) and in the decision function (2)): in both cases a kernel function can be used in the computations

KERNEL FUNCTION
- A kernel function K is a mapping K: S × S → [0, ∞] from the instance space of examples S to a similarity score:

  K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)

- In other words, a kernel function is an inner product in some feature space
- The kernel function must be:
  - symmetric: K(x_i, x_j) = K(x_j, x_i)
  - positive semi-definite: if x_1, ..., x_n ∈ S, then the n × n matrix G (the Gram matrix or kernel matrix) defined by G_ij = K(x_i, x_j) is positive semi-definite (i.e., has non-negative eigenvalues)

SUPPORT VECTOR MACHINE
- Typical kernel functions: linear (mostly used in text categorization), polynomial, radial basis function (RBF)
- We can also define kernel functions that (efficiently) compare strings (string kernels) or trees (tree kernels)
- In the decision function f(x) we can simply replace the dot products with kernels K(x_i, x):

  f(x) = \sum_{i=1}^{n} \lambda_i y_i \, (\phi(x_i) \cdot \phi(x)) + b = \sum_{i=1}^{n} \lambda_i y_i K(x_i, x) + b, \qquad h(x) = \operatorname{sign}(f(x))
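The kernel trick in practice, as a minimal sketch: scikit-learn's SVC accepts a callable that returns the Gram matrix, so any symmetric positive semi-definite similarity (a string or tree kernel, say) can replace the dot product. Here the callable simply recomputes the linear kernel by hand to show the mechanics.

```python
# Plugging a custom kernel into an SVM: the learner only ever sees the
# Gram matrix K(x_i, x_j), never the feature map phi itself.
import numpy as np
from sklearn.svm import SVC

def my_kernel(A, B):
    # Linear kernel written out explicitly; a string or tree kernel would
    # go here instead, as long as it stays symmetric and PSD.
    return A @ B.T

X = np.array([[0.0, 1.0], [1.0, 0.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel=my_kernel).fit(X, y)
print(clf.predict(np.array([[2.5, 2.5]])))
```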
LINEAR REGRESSION
- Linear regression:

  y = w_0 + \sum_{i=1}^{N} w_i f_i = w \cdot f   (3)

- Class membership:

  p(y = \text{true} \mid x) = \sum_{i=0}^{N} w_i f_i = w \cdot f

- Training of the model (3):
  - assign each training example the target value y = 1 if it belongs to the class, and y = 0 if it does not
  - train the weight vector to minimize the predictive error from 1 (for observations in the class) or 0 (for observations not in the class)
- Testing: take the dot product of the learned weight vector with the feature vector x of the new example
- But the result is not guaranteed to lie in [0, 1]

LOGISTIC REGRESSION
- We predict the ratio of two probabilities as the log odds (or logit) function:

  \mathrm{logit}(p(x)) = \ln\left(\frac{p(x)}{1 - p(x)}\right)

- Logistic regression: a regression model in which a linear function estimates the logit of the probability:

  \ln\left(\frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)}\right) = w \cdot f
  \quad\Rightarrow\quad
  p(y = \text{true} \mid x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}}

MULTINOMIAL LOGISTIC REGRESSION
- = the maximum entropy classifier (Maxent); deals with a larger number of classes
- Let there be C different classes y_1, y_2, ..., y_C
- We estimate the probability that y is a particular class, given N feature functions, as:

  p(y \mid x) = \frac{1}{Z} \exp \sum_{i=0}^{N} w_i f_i(y, x)
  = \frac{\exp \sum_{i=0}^{N} w_i f_i(y, x)}{\sum_{y' \in C} \exp \sum_{i=0}^{N} w_i f_i(y', x)}

CONTEXT-DEPENDENT CLASSIFICATION
- Context-dependent classification = the class to which a feature vector is assigned depends on:
  1) the feature vector itself
  2) the values of other feature vectors and their classes
  3) the existing relations among the various classes
- Examples: conditional random field, structured output support vector machine

CONDITIONAL RANDOM FIELD
- Linear-chain conditional random field:
  - Let X = (x_1, ..., x_T) be a random variable over data sequences to be labeled, and Y a random variable over the corresponding label sequences
  - All components y_j of Y are assumed to range over a finite label alphabet Σ
  - We define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element y_v of Y
  - If each y_v obeys the Markov property with respect to G, then the model (Y, X) is a conditional random field
- In an information extraction task, X might range over the words or constituents of a sentence/discourse, while Y ranges over the semantic/pragmatic classes to be recognized in these sentences/discourses
- Template-based or general CRF: in theory the structure of the graph G may be arbitrary; in a template-based or general CRF you can define the dependencies in the Markov network or graph [Lafferty et al. ICML 2001]
- To classify a new instance, P(Y|X) is computed as follows:

  p(Y \mid X) = \frac{1}{Z} \exp\left( \sum_{j=1}^{T} \sum_{i=1}^{k} \lambda_i f_i(y_{j-1}, y_j, X, j) \right)

  where f_i(y_{j-1}, y_j, X, j) = one of the k binary-valued feature functions, λ_i = a parameter that models the observed statistics in the training examples, and Z = a normalizing constant
- The most probable label sequence Y* for input sequence X is:

  Y^* = \arg\max_Y \; p(Y \mid X)
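A usage sketch of a linear-chain CRF for sequence labeling, assuming the open-source sklearn-crfsuite package (a third-party wrapper, not part of scikit-learn, and not necessarily what the cited work used). The two token sequences and the PREM/CONC tags are invented.

```python
# Linear-chain CRF sequence labeling (sketch). Binary feature functions
# f_i(y_{j-1}, y_j, X, j) are induced from these per-position features
# plus the label transitions.
import sklearn_crfsuite

def token_features(sent, j):
    return {
        "word.lower": sent[j].lower(),
        "is_first": j == 0,
        "prev.lower": sent[j - 1].lower() if j > 0 else "<BOS>",
    }

sentences = [["The", "court", "notes", "delays"],
             ["Therefore", "the", "claim", "fails"]]
X_train = [[token_features(s, j) for j in range(len(s))] for s in sentences]
y_train = [["PREM", "PREM", "PREM", "PREM"],
           ["CONC", "CONC", "CONC", "CONC"]]  # invented tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```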
CONDITIONAL RANDOM FIELD
- CRF training:
  - As for the Maxent model, we need numerical methods to derive the parameters λ_i
  - E.g., for the linear-chain CRF: a variation of the Baum-Welch algorithm
  - In general CRFs we use approximate inference (e.g., a Markov chain Monte Carlo sampler)
- Advantages and disadvantages:
  - A very successful information extraction technique
  - Training is computationally expensive, especially when the graphical structure is complex

GLOBAL LEARNING
- Globally or jointly recognizing several labels and their relationships
- Can be realized by:
  - Inferring a grammar (with rules) from data
  - Structured support vector machines
  - Graphical models (Markov random fields and Bayesian networks)

MODELS THAT JOINTLY LEARN
- The machine recognizes fragmentary pieces (e.g., names, facts), and the recognition of related fragments of text is often limited to the sentence level
- Emerging recognition of integrated understanding: e.g., noun-phrase coreference resolution and entity recognition across a discourse
- Human understanding of text: inferencing, connecting content [Wikipedia]
[Figure: argumentation structure, Mochales & Moens AI & Law 2011]

INFERRING A GRAMMAR WITH RULES FROM DATA
- Can be done manually (cf. the PhD thesis of Raquel Mochales Palau)
- Can be learned from annotated data
- Could be learned from a very large unannotated corpus, but this is very difficult if the grammar is complex
[PhD thesis Raquel Mochales Palau 2011]
- Experiments with decisions of the European Court of Human Rights (ECHR) [Mochales & Moens AI & Law 2011]
- Works well (see the results further on)
- A deterministic grammar might overfit the data it is constructed from
- A probabilistic grammar needs annotated data
- If we have annotated data, we can learn the grammar

INTERMEZZO: SPATIAL RELATION
- The goal is to jointly assign the labels of an ontology to a text item

JOINT MACHINE LEARNING
- Joint or global learning ≠ local learning of independent classifiers
  - Independent classifiers and combination of the results (e.g., based on integer linear programming)
  - Joint training: one classification model for the global structure (cf. CRF); the output is a structure (e.g., a spatial ontology)
[PhD of Parisa Kordjamshidi 2013] [Kordjamshidi & Moens Journal of Web Semantics 2014]

OUTPUT
- Output variables = the labels in the structure

INPUT
- The object to which the classification model is applied: e.g., a sentence (in our case), a paragraph, a full document, ...
- Is usually composed of different input components: single words, phrases, ... depending on the type of text snippet to which a label will be assigned

FEATURE FUNCTIONS
- Each input component is assigned a set of features: e.g., lexical, syntactic, discourse distance, ...
- Feature functions link an input component with a possible label (the notion of feature templates)
- Each feature function receives a weight during training
- A feature template groups a set of feature functions => a block of corresponding weights W_i
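A minimal sketch of the feature-function idea: a template conjoins an input component's features with each candidate label, so every (feature, label) instantiation owns one weight. The role labels echo the spatial intermezzo but, like the weights, are invented for illustration.

```python
# Feature templates: a feature function links an input component to a
# candidate label; each instantiation owns one weight in the model.
from collections import defaultdict

LABELS = ["TRAJECTOR", "LANDMARK", "NONE"]  # illustrative role labels

def template(component_feats, label):
    # Conjoin every input feature with the candidate label.
    return {f"{name}={value}|y={label}": 1.0
            for name, value in component_feats.items()}

def score(component_feats, label, weights):
    # Linear discriminant: weights dotted with the instantiated features.
    return sum(weights[f] * v
               for f, v in template(component_feats, label).items())

weights = defaultdict(float)
weights["word=kite|y=TRAJECTOR"] = 2.0  # a single invented trained weight

feats = {"word": "kite", "pos": "NN"}
print(max(LABELS, key=lambda y: score(feats, y, weights)))
```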
OBJECTIVE FUNCTION
- The main objective discriminant function is a linear function in terms of the combined feature representation associated with each candidate input component and an output label, according to the template (Ψ) specifications
- It can be written in terms of the instantiations of the templates and their related blocks of weights W_p

TRAINING OF THE MODEL
- A popular discriminative training approach is to minimize a convex upper bound of the loss function over the N training examples
- Training proceeds with the most violated constraints/outputs (y) per training example
- In the experiments: structured support vector machines (SSVM), structured perceptrons and averaged structured perceptrons (a compact structured perceptron sketch follows at the end of this part)

CONSTRAINTS
- Constraints are linear and the variables take the form of integers
- Constraints are applied during training (finding the most violated outputs) and/or during testing

GRAPHICAL MODELS IN GENERAL
- E.g., Markov random fields
- Allow using rules as features, for which the weights are trained on the annotated data
- Concern: the computational complexity

JOINT RECOGNITION OF A CLAIM AND ITS COMPOSING ARGUMENTS
- Structured learning: modeling the interdependence among output labels:
  - Generalized linear models, e.g., structured support vector machines and structured perceptrons [Tsochantaridis et al. JMLR 2006]
  - Probabilistic graphical models [Koller and Friedman 2009]
- The interdependencies between output labels and other background knowledge can be imposed using constraint optimization techniques during prediction and training
  - Cf. recent work on structure analysis of scientific documents [Guo et al. NAACL-HLT 2013]

OTHER ARGUMENTATION STRUCTURES
- Joint recognition can also target the Toulmin model, or the many different argumentation schemes/structures discussed in Douglas Walton (1996), Argumentation Schemes for Presumptive Reasoning. Mahwah, New Jersey: Lawrence Erlbaum Associates
- Work of Prakken, Gordon, Bench-Capon, Atkinson, Wyner, Schneider, ...

DECOMPOSITIONS
- Complex graphical structures: considering the interdependencies and structural constraints over the output space easily leads to intractable training and prediction:
  - Models for decomposition, communicative inference, message passing, ...
  - A current research topic in machine learning
- Breaking the structured model into two or more pieces:
  - Build a model for each piece
  - Possibly: iteratively improve each model by communicating between the pieces
- Application to argumentation mining
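As announced above, a compact sketch of a structured perceptron for sequence labeling: emission and transition weights form the blocks, Viterbi search predicts the highest-scoring label sequence (the joint, global output), and the weights are pushed toward the gold structure whenever the prediction differs. The two training sequences and the PREM/CONC tags are invented.

```python
# Structured perceptron (sketch): predict a full label sequence with
# Viterbi, then update the weights toward the gold structure.
from collections import defaultdict

LABELS = ["PREM", "CONC"]

def viterbi(tokens, w):
    # delta[y] = (score of the best sequence ending in y, that sequence)
    delta = {y: (w[("emit", tokens[0], y)], [y]) for y in LABELS}
    for tok in tokens[1:]:
        new = {}
        for y in LABELS:
            best = max(LABELS, key=lambda yp: delta[yp][0] + w[("trans", yp, y)])
            s = delta[best][0] + w[("trans", best, y)] + w[("emit", tok, y)]
            new[y] = (s, delta[best][1] + [y])
        delta = new
    return max(delta.values(), key=lambda t: t[0])[1]

def features(tokens, labels):
    # Emission and transition counts of a (sequence, labeling) pair.
    f = defaultdict(float)
    for j, (tok, y) in enumerate(zip(tokens, labels)):
        f[("emit", tok, y)] += 1.0
        if j > 0:
            f[("trans", labels[j - 1], y)] += 1.0
    return f

def train(data, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for tokens, gold in data:
            pred = viterbi(tokens, w)
            if pred != gold:  # reward gold features, punish predicted ones
                for k, v in features(tokens, gold).items():
                    w[k] += v
                for k, v in features(tokens, pred).items():
                    w[k] -= v
    return w

data = [(["the", "delay", "is", "long"], ["PREM"] * 4),
        (["therefore", "the", "claim", "fails"], ["CONC"] * 4)]
w = train(data)
print(viterbi(data[0][0], w))
```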
FEATURES REVISITED
Example from an ECHR decision, segmented into argument components that are labeled as premises and a conclusion:
- "On the other hand the court notes that there are substantial delays attributable to the authorities."
- "In particular in the first set of proceedings there is a period of inactivity of more than two years ..."
- "In the second set of proceedings there is a period of inactivity of some three years."
- "The court cannot find that the government has given sufficient explanation for these delays that occurred."

- Because we input candidate arguments and their candidate components:
  - We can describe a component with different features than the ones used for describing the full argument
  - E.g., textual entailment relationships can be used to describe the full argument
- Our argumentation mining machine only uses information residing in the texts
- Human understanding of text: humans connect the text to their world/domain knowledge [Wikipedia]
- The discourse structure is often signaled by typical keywords (e.g., "in conclusion", "however", ...), but often it is not
- Humans who understand the meaning of the text can infer whether a claim is a plausible conclusion given a set of premises, or whether a claim rebuts another claim
  - => background or domain knowledge makes a certain discourse relation valid
  - => background or domain knowledge that an argumentation mining tool should also acquire: how?
- Work on textual entailment: [Cabrio & Villata 2012]; on event causality: [Xuan Do et al. EMNLP 2011]; ...

TEXTUAL ENTAILMENT
- Textual entailment: recognize, given two text fragments, whether one text can be inferred (entailed) from the other
- Has been studied widely in the computational linguistics and machine learning communities (e.g., the PASCAL Recognizing Textual Entailment challenge)
- Most of the work in textual entailment takes a distance-computation approach between the texts (e.g., edit distances, similarity metrics, kernels):
  - E.g., the EDITS system (Edit Distance Textual Entailment Suite), an open-source software package for textual entailment: http://edits.fbk.eu/

ENTAILMENT IN ARGUMENTATION
- Textual entailment (TE) provides techniques to detect both the argument components and the kind of relation underlying them: an entailment or a contradiction is detected [Cabrio & Villata ACL 2012]
- Similarity measures are rough approaches
- It is very difficult to acquire automatically the background knowledge needed for the entailment: a process that takes legal professionals years

PART 3: SOME APPLICATIONS

ARGUMENTATION MINING OF LEGAL CASES
- Cases of the European Court of Human Rights [PhD thesis Raquel Mochales Palau 2011] [Mochales & Moens AI & Law 2011]
- A context-free grammar also allows recognizing the full argumentation structure: accuracy: 60%
- Features of the classifier: clauses described by unigrams, bigrams, adverbs, legal keywords, word couples over adjacent clauses, ...
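One feature family from the list above, sketched in code: we read "word couples over adjacent clauses" as pairing each word of a clause with each word of the preceding clause (our illustrative reading, not necessarily the exact definition used in the cited work).

```python
# Word-couple features over adjacent clauses (illustrative reading).
from itertools import product

def word_couples(prev_clause, clause):
    # One binary feature per (word in previous clause, word in clause).
    return {f"couple={a}_{b}": 1.0
            for a, b in product(prev_clause.lower().split(),
                                clause.lower().split())}

prev = "the court notes substantial delays"
curr = "no sufficient explanation was given"
print(sorted(word_couples(prev, curr))[:5])
```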
SUPPORT FOR ONLINE USER COMMENTS
- Online user comments contain arguments with appropriate or missing justification
- [Park & Cardie FWAM 2014] classify comments into classes such as UNVERIFIABLE, VERIFIABLE NON-EXPERIENTIAL and VERIFIABLE EXPERIENTIAL
- Features: n-grams, POS tags, presence in a core or accessory clause, sentiment clues, speech event anchors, imperative expression count, emotion expression count, tense count, person count

RECOGNIZING ARGUMENTS IN ONLINE DISCUSSIONS
- [Boltužić & Šnajder FWAM 2014] identify properties of comment-argument pairs
- Features: entailment features (TE) from pretrained entailment decision algorithms (which use, among others, WordNet and VerbOcean); semantic text similarity features (STS); and a stance alignment feature (SA), with the stance known a priori
- Multiclass classification with a support vector machine

ARGUMENT ENRICHED OPINION MINING
- Opinion mining: finding arguments and counterarguments for an opinion expressed:
  - Find support for the opinion, explain the opinion
  - "An opinion, whether it is grounded in fact or completely unsupportable, is an idea that an individual or group holds to be true. An opinion does not necessarily have to be supportable or based on anything but one's own personal feelings, or what one has been taught. An argument is an assertion or claim that is supported with concrete, real-world evidence." [http://wiki.answers.com]

ARGUMENT MINING IN THE SCIENTIFIC LITERATURE
- Mining the supporting evidence of claims in scientific publications and patents, and its visualization for easy access [http://undsci.berkeley.edu/article/howscienceworks_07]

ARGUMENT MINING IN THE DIGITAL HUMANITIES
- Digital humanities: finding and comparing the arguments that politicians use in their speeches:
  - "Then that little man in black there, he says women can't have as much rights as men, 'cause Christ wasn't a woman! Where did your Christ come from? Where did your Christ come from? From God and a woman! Man had nothing to do with Him." [Sojourner Truth (1797-1883): Ain't I A Woman?, delivered 1851, Women's Convention, Akron, Ohio]

ANNOTATED DATA
- The Araucaria corpus (constructed by Chris Reed at the University of Dundee, 2003), now extended to AIF-DB
- The ECHR corpus, annotated by legal experts in 2006 under the supervision of Raquel Mochales Palau:
  - 25 legal cases
  - 29 admissibility reports
  - 12,904 sentences: 10,133 non-argumentative and 2,771 argumentative (2,355 premises and 416 conclusions)
- Plans to build a corpus of biomedical genetics research literature [Green FWAM 2014]
- Several smaller corpora described at FWAM 2014
- ...

PART 4: CONCLUSIONS AND THOUGHTS FOR FUTURE RESEARCH

CONCLUSIONS
- Argumentation mining: a novel and promising research domain
- Potential of joint learning of an argumentation structure, integrating known interdependencies between the structural components of the argumentation and expert knowledge
- Potential of better textual entailment techniques
- Numerous interesting applications of the technology!
THOUGHTS FOR FUTURE RESEARCH
- ?
- ISCH COST Action IS1312: Structuring Discourse in Multilingual Europe (TextLink)
  http://www.cost.eu/domains_actions/isch/Actions/IS1312
  http://textlinkcost.wix.com/textlink