ARGUMENTATION MINING
Marie-Francine Moens
joint work with Raquel Mochales Palau and Parisa Kordjamshidi
Language Intelligence and Information Retrieval
Department of Computer Science, KU Leuven, Belgium
Dundee, 5-9-2014

OUTLINE
- Part 1: The setting
  - Definition of argumentation mining
  - Importance of the task
- Part 2: Introducing current methods
  - Machine learning
  - Features
  - Common techniques: logistic regression, conditional random fields, support vector machines
  - Joint recognition: grammars, graphical models, structured support vector machines
  - Features revisited
  - Textual entailment
- Part 3: Some applications
  - Legal field
  - Scientific texts
  - Blogs
  - Dialogues and debates
- Part 4: Conclusions and thoughts for future research

PART 1: THE SETTING

ARGUMENTATION MINING
- = the detection of an argumentative discourse structure in text or speech, and the detection and functional classification of its composing components
- Argumentation mining = recognition of a rhetorical structure in a discourse
- Rhetoric is the art of discourse that aims to improve the capabilities of writers and speakers to inform, persuade or motivate particular audiences in specific situations [Corbett, E. P. J. (1990). Classical Rhetoric for the Modern Student. New York: Oxford University Press, p. 1]

ARGUMENTATION
- Is probably as old as mankind
- Has been studied by philosophers throughout history

SOME HISTORY
- From Ancient Greece to the late 19th century, argumentation was a central part of Western education: the need to train public speakers and writers to move audiences to action with arguments
- The study of argumentation is very often based on theories of rhetoric and logic
- Argumentation was/is taught at universities
- Highlights:
  - Aristotle's (4th century BC) logical works: the Organon
  - George Pierce Baker, The Principles of Argumentation, 1895
  - Chaïm Perelman describes the techniques of argumentation used by people to obtain the approval of others for their opinions: Traité de l'argumentation – la nouvelle rhétorique, 1958
  - Stephen Toulmin explains how argumentation occurs in the natural process of an everyday argument: The Uses of Argument, Cambridge University Press, 1958

http://sokogskriv.no/en/reading/argumentation-in-text/

TODAY
- We find argumentation in:
  - Legal texts and court decisions
  - Biomedical cases
  - Scientific texts
  - Patents
  - Reviews, online fora, user-generated content
  - Debates, interactions, dialogues
  - ...

WHY ARGUMENTATION MINING?
- In the overload of information, users want to find the arguments that sustain a certain claim or conclusion
- Argumentation mining refines search and information retrieval, and provides the end user with instructive visualizations and summaries of an argumentative structure
- Argumentation mining is related to opinion mining, but the end user also wants to know the underlying grounds and possibly the counterarguments

WHAT IS THE STATE OF THE ART?
- Argumentative zoning
- Argumentation mining of legal cases
- Argumentation mining in online user comments and discussions
- ...
ARGUMENTATIVE ZONING
- = segmentation of a discourse into discourse segments or zones that each play a specific rhetorical role in the text
- Zone categories [PhD thesis of Simone Teufel 2000]:
  - BKG: general scientific background (yellow)
  - OTH: neutral descriptions of other people's work (orange)
  - OWN: neutral descriptions of the own, new work (blue)
  - AIM: statements of the particular aim of the current paper (pink)
  - TXT: statements of the textual organization of the current paper ("in chapter 1, we introduce ...") (red)
  - CTR: contrastive or comparative statements about other work; explicit mention of weaknesses of other work (green)
  - BAS: statements that the own work is based on other work (purple)
- Methods: seen as a classification task: a rule-based or statistical classifier (e.g., naïve Bayes, support vector machine) is trained with manually annotated examples (a minimal code sketch follows at the end of this part)
[Moens, M.-F. & Uyttendaele, C. Information Processing & Management 1997] [Teufel, S. & Moens, M. ACL 1999] [Teufel, S. & Moens, M. EMNLP 2000] [Hachey, B. & Grover, C. ICAIL 2005]

ARGUMENTATION MINING OF LEGAL CASES
- Legal field: precedent reasoning
  - Search for cases that use a similar type of reasoning, e.g., acceptance or rejection of a claim based on precedent cases
- Adds an additional dimension to argumentative zoning:
  - Needs detection of the argumentation structure and classification of its components
  - Components or segments are connected by argumentative relationships
[Moens, Boiy, Mochales & Reed ICAIL 2007]
[Figure slides: examples from the PhD thesis of Raquel Mochales Palau, 2011]

- Definitions [PhD thesis of Raquel Mochales Palau 2011; Mochales & Moens AI & Law 2011]:
  - Argumentation: a process whereby arguments are constructed, exchanged and evaluated in light of their interactions with other arguments
  - Argument: a set of premises - pieces of evidence - in support of a claim
  - Claim: a proposition put forward by somebody as true; the claim of an argument is normally called its conclusion
  - Argumentation may also involve chains of reasoning, where claims are used as premises for deriving further claims

PART 2: INTRODUCING CURRENT METHODS

TEXT MINING
- Text mining, also referred to as text data mining and roughly equivalent to text analytics: = deriving high-quality information from text
- Often done by learning statistical patterns => use of statistical machine learning techniques

ARGUMENTATION MINING
- Because argumentation is well studied, typical argumentation structures are defined:
  - => structuring of information: detecting the argumentation and its components
  - => assignment of metadata: labeling of argumentation components and relations
- Can be done manually:
  - But people are (often) expensive, slow and inconsistent
  - Can we perform this task automatically?
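To make the classification view concrete, here is a minimal sketch of argumentative zoning as supervised sentence classification. It is an illustration only, not the classifiers of the cited work: the four training sentences and their zone labels are invented, and a real system would be trained on thousands of manually annotated sentences.

```python
# Argumentative zoning as supervised sentence classification (sketch).
# Toy data: four invented sentences, one per zone label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "Machine translation has received much attention in recent years.",
    "Smith et al. propose a phrase-based decoding algorithm.",
    "In this paper we present a novel reordering model.",
    "Section 2 describes our training data.",
]
train_zones = ["BKG", "OTH", "AIM", "TXT"]  # Teufel-style zone labels

# Bag-of-words features with tf-idf weights, fed into a linear SVM.
zoner = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
zoner.fit(train_sentences, train_zones)

print(zoner.predict(["We present a new method for argument detection."]))
```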
ARGUMENTATION MINING
- Approaches: pattern recognition
  - Symbolic techniques: the knowledge, or part of it, is formally and manually implemented
  - Statistical machine learning techniques: the knowledge, or part of it, is automatically acquired
- Mostly: supervised machine learning techniques
- Why?
  - Argumentation structure is well studied
  - Manually labeled examples are available
  - Annotating examples is usually considered easier than pattern engineering
  - Current supervised learning techniques allow the integration of soft rules
- Argumentation mining needs a large amount of knowledge:
  - Linguistic knowledge of the vocabulary, syntax and semantics of the language and the discourse
  - Knowledge of the subject domains
  - Background knowledge of the person who uses the texts at a certain moment in time

SUPERVISED LEARNING
- Techniques of supervised learning:
  - training set: example objects classified by an expert or teacher
  - detection of general but high-accuracy classification patterns (a function or rules) in the training set, based on object features and their values
  - the learned patterns should generalize, i.e., correctly classify new, previously unseen objects in a test set given their features and feature values
- Text recognition or classification can be seen as:
  - a two-class learning problem: an object is classified as belonging or not belonging to a particular class; convenient when the classes are not mutually exclusive
  - a single multi-class learning problem
- Result = often a probability of belonging to a class, rather than simply a classification

GENERATIVE VERSUS DISCRIMINATIVE CLASSIFICATION
- In classification, given inputs x and their labels y:
  - A generative classifier learns a model of the joint probability p(x,y) = p(y) p(x|y), conditions on the observed features x to derive the class posterior p(y|x), and selects the most probable y for x
    - called generative because it specifies how to generate the observed features x for each class y
    - e.g., naïve Bayes, hidden Markov model
  - A discriminative classifier learns a model p(y|x) that directly models the mapping from inputs x to output y, and selects the most likely label y
    - called discriminative because it discriminates between classes
    - e.g., logistic regression model, conditional random field, support vector machine (discussed in this tutorial)
  - (both kinds are contrasted in the code sketch below)

MAXIMUM ENTROPY PRINCIPLE
- Text classifiers are often trained with incomplete information
- Probabilistic classification can adhere to the principle of maximum entropy: when we make inferences based on incomplete information, we should draw them from the probability distribution that has the maximum entropy permitted by the information we have
- E.g., multinomial logistic regression, conditional random fields
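To contrast the two kinds of classifier on the same data, here is a minimal sketch (with an invented two-document training set): naïve Bayes models p(x|y)p(y) and derives the posterior, while logistic regression, the maximum entropy model just mentioned, models p(y|x) directly.

```python
# Generative vs. discriminative classification on the same toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

texts = ["the court accepts the claim", "the applicant provides no evidence"]
labels = ["conclusion", "premise"]  # invented labels for illustration

vectorizer = CountVectorizer().fit(texts)
X = vectorizer.transform(texts)

# Generative: models p(x|y) and p(y), derives p(y|x) via Bayes' rule.
nb = MultinomialNB().fit(X, labels)
# Discriminative: models p(y|x) directly (multinomial logistic
# regression = the maximum entropy classifier).
lr = LogisticRegression().fit(X, labels)

x_new = vectorizer.transform(["the court notes substantial evidence"])
print(nb.predict_proba(x_new), lr.predict_proba(x_new))
```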
on: § the object itself § other objects and their class § the existing relations among the various classes §  e.g., hidden Markov model, conditional random fields, structured support vector machine, structured perceptron Tutorial Argumentation Mining 2014 § Local classification (i.e., learning a model for each class), applying the models on each input, and combining the outputs § Global classification (i.e., learning 1 model jointly, cf. context dependent classification) Tutorial Argumentation Mining 2014 34 LOCAL VERSUS GLOBAL CLASSIFICATION 35 FEATURE SELECTION AND EXTRACTION ¡  In classification tasks: object is described with set of attributes or features ¡  Typical features in text classification tasks: §  word, phrase, syntactic class of a word, text position, the length of a sentence, the relationship between two sentences, an n-gram, a document (term classification), …. §  choice of the features is application- and domain-specific ¡  Features can have a value, for text the value is often: §  numeric, e.g., discrete or real values §  nominal, e.g. certain strings §  ordinal, e.g., the values 0= small number, 1 = medium number, 2 = large number Tutorial Argumentation Mining 2014 36 FEATURE SELECTION AND EXTRACTION ¡  The features together span a multi-variate space called the measurement space or feature space: §  an object x can be represented as: §  a vector of features: x = [x1, x2, …, xp]T where p = the number of features measured §  as a structure: e.g., §  representation in first order predicate logic §  graph representation (e.g., tree) where relations between features are figured as edges between nodes and nodes can contain attributes of features Tutorial Argumentation Mining 2014 37 SWARM INTELLIGENCE Following a trail of insects as they work together to accomplish a task offers unique possibilities for problem solving. By Peter Tarasewich & Patrick R. McMullen Even with today’s ever-increasing computing power, there are still many types of problems that are very difficult to solve. Particularly combinatorial optimization problems continue to pose challenges. An example of this type of problem can be found in product design. Take as an example the design of an automobile based on the attributes of engine horsepower, passenger seating, body style and wheel size. If we have three different levels for each of these attributes, there are 3 4 , or 81, possible configurations to consider. For a slightly larger problem with 5 attributes of 4 levels, there are suddenly 1,024 combinations. Typically, an enormous amount of possible combinations exist, even for relatively small problems. Finding the optimal solution to these problems is usually impractical. Fortunately, search heuristics have been developed to find good solutions to these problems in a reasonable amount of time. Over the past decade or so, several heuristic techniques have been developed that build upon observations of processes in the physical and biological sciences. 
Examples of classification features, marked on an example text: words, sentence position, POS tag, sentence length, the following and preceding word.

Example text:
"SWARM INTELLIGENCE. Following a trail of insects as they work together to accomplish a task offers unique possibilities for problem solving. By Peter Tarasewich & Patrick R. McMullen. Even with today's ever-increasing computing power, there are still many types of problems that are very difficult to solve. Combinatorial optimization problems in particular continue to pose challenges. An example of this type of problem can be found in product design. Take as an example the design of an automobile based on the attributes of engine horsepower, passenger seating, body style and wheel size. If we have three different levels for each of these attributes, there are 3^4, or 81, possible configurations to consider. For a slightly larger problem with 5 attributes of 4 levels, there are suddenly 1,024 combinations. Typically, an enormous number of possible combinations exists, even for relatively small problems. Finding the optimal solution to these problems is usually impractical. Fortunately, search heuristics have been developed to find good solutions to these problems in a reasonable amount of time. Over the past decade or so, several heuristic techniques have been developed that build upon observations of processes in the physical and biological sciences. Examples of these techniques include Genetic Algorithms (GA) and simulated annealing ..."

FEATURE VECTORS FOR AN EXAMPLE TEXT
Example text: "A Java Applet that scans Java Applets"
- Binary values, based on lower-cased words: [a: 1, apple: 0, applet: 1, applets: 1, ..., java: 1, ...]
- After removing stopwords: [apple: 0, applet: 1, applets: 1, ..., java: 1, ...]
- Numeric values, based on term frequency (tf) in the text: [apple: 0, applet: 1, applets: 1, ..., java: 2, ...]
- Numeric values, based on the term frequency of lower-cased n-grams: [aa: 0, a_a: 2, a_b: 0, ...]
- Numeric attribute values based on latent semantic indexing: [F1: 0.38228938, F2: 0.000388, F3: 0.201033, ...]
- ...

FEATURE SELECTION
- = eliminating low-quality features:
  - redundant features
  - noisy features
- Decreases computational complexity
- Decreases the danger of overfitting in supervised learning (especially with a large number of features and few training examples)
- Overfitting: the classifier perfectly fits the training data, but fails to generalize sufficiently from the training data to correctly classify new cases

FEATURE EXTRACTION
- = creating new features by applying a set of operators to the current features:
  - a single feature can be replaced by a new feature (e.g., replacing words by their stem)
  - a set of features can be replaced by one feature or another set of features
  - use of logical operators (e.g., disjunction) or arithmetical operators (e.g., mean, LSI)
  - the choice of operators is application- and domain-specific

COMMON CLASSIFICATION METHODS
- Naïve Bayes, learning of rules and trees, nearest-neighbor or exemplar-based learning, logistic regression, support vector machines
- Here we discuss support vector machines, logistic regression, and conditional random fields
- Then we move to more advanced methods such as structured perceptrons, structured support vector machines and more general graphical models

MACHINE LEARNING FRAMEWORK
- Input space: objects are represented as feature vectors
- Output space:
  - Regression: the space of real numbers ℜ
  - Classification: the set of discrete categories C = {C1, C2, ..., Cm}
- Hypothesis space = the class of function mappings from the input space to the output space
- To learn a good hypothesis, supervised learning uses a training set that contains a number of objects and their ground-truth labels
- Loss function: measures to what degree the prediction generated by the hypothesis is in accordance with the ground-truth label

SUPPORT VECTOR MACHINE
- When two classes are linearly separable:
  - find a hyperplane in the p-dimensional feature space that best separates the positive and negative examples with maximum margin
  - maximum margin: maximum Euclidean distance (= margin d) to the closest training examples (the support vectors)
  - e.g., a decision surface in two dimensions
- The idea can be generalized to examples that are not necessarily linearly separable and to examples that cannot be separated by linear decision surfaces
[Burges DMKD 1998]
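Before the formal treatment that follows, a minimal sketch of the geometric idea on four invented 2-D points: the linear SVM finds the maximum-margin hyperplane, and only the support vectors determine it.

```python
# Maximum-margin separation of two linearly separable point sets (sketch).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates the hard margin
clf.fit(X, y)

print(clf.support_vectors_)        # only these points fix the hyperplane
print(clf.coef_, clf.intercept_)   # w and b of the decision surface
print(clf.predict([[2.0, 2.5]]))
```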
SUPPORT VECTOR MACHINE
- Linear support vector machine, trained on data that are separable (the simple case):
  - input is a set of n training examples S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℜ^p and y_i ∈ {-1, +1} indicates that x_i is a negative or positive example respectively
- In case the data objects are not necessarily completely linearly separable (soft-margin SVM), the amount of training error is measured using slack variables ξ_i, the sum of which must not exceed some upper bound:

  \min_{w, b, \xi} \; w \cdot w + G \sum_{i=1}^{n} \xi_i^2
  \text{subject to } y_i (w \cdot x_i + b) - 1 + \xi_i \ge 0, \quad i = 1, \dots, n

  where ξ_i = the penalty for misclassifying x_i and G = a weighting factor
[Burges DMKD 1998]
- A dual representation, which turns out to be easier to solve, is obtained by introducing Lagrange multipliers λ_i:

  \max_{\lambda} \; W(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{n} \lambda_i \lambda_j y_i y_j \, (x_i \cdot x_j)   (1)
  \text{subject to } \lambda_i \ge 0, \; i = 1, \dots, n, \quad \sum_{i=1}^{n} \lambda_i y_i = 0

- This yields the following decision function:

  f(x) = \sum_{i=1}^{n} \lambda_i y_i \, (x_i \cdot x) + b, \qquad h(x) = \operatorname{sign}(f(x))   (2)

- The decision function depends only on the support vectors, i.e., the training examples for which λ_i > 0; training examples that are not support vectors have no influence on the decision function
- When classifying natural language data, it is not always possible to linearly separate the data: in this case we can map the data into a feature space where they are linearly separable
- Working in a high-dimensional feature space causes computational problems, as one has to work with very large vectors
- In the dual representation the data appear only inside inner products (both in the training problem (1) and in the decision function (2)): in both cases a kernel function can be used in the computations

KERNEL FUNCTION
- A kernel function K is a mapping K: S × S → [0, ∞] from the instance space of examples S to a similarity score:

  K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)

- In other words, a kernel function is an inner product in some feature space
- The kernel function must be:
  - symmetric: K(x_i, x_j) = K(x_j, x_i)
  - positive semi-definite: if x_1, ..., x_n ∈ S, then the n × n matrix G (the Gram matrix or kernel matrix) defined by G_ij = K(x_i, x_j) is positive semi-definite (i.e., has non-negative eigenvalues)

SUPPORT VECTOR MACHINE
- Typical kernel functions: linear (mostly used in text categorization), polynomial, radial basis function (RBF)
- We can also define kernel functions that (efficiently) compare strings (string kernels) or trees (tree kernels)
- In the decision function f(x) we can simply replace the dot products with kernels K(x_i, x):

  f(x) = \sum_{i=1}^{n} \lambda_i y_i \, (\phi(x_i) \cdot \phi(x)) + b = \sum_{i=1}^{n} \lambda_i y_i K(x_i, x) + b, \qquad h(x) = \operatorname{sign}(f(x))
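The kernel trick in practice, as a minimal sketch: scikit-learn's SVC accepts a callable that returns the Gram matrix, so any symmetric positive semi-definite similarity (a string or tree kernel, say) can replace the dot product. Here the callable simply recomputes the linear kernel by hand to show the mechanics.

```python
# Plugging a custom kernel into an SVM: the learner only ever sees the
# Gram matrix K(x_i, x_j), never the feature map phi itself.
import numpy as np
from sklearn.svm import SVC

def my_kernel(A, B):
    # Linear kernel written out explicitly; a string or tree kernel would
    # go here instead, as long as it stays symmetric and PSD.
    return A @ B.T

X = np.array([[0.0, 1.0], [1.0, 0.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel=my_kernel).fit(X, y)
print(clf.predict(np.array([[2.5, 2.5]])))
```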
LINEAR REGRESSION
- Linear regression:

  y = w_0 + \sum_{i=1}^{N} w_i f_i = w \cdot f   (3)

- Class membership:

  p(y = \text{true} \mid x) = \sum_{i=0}^{N} w_i f_i = w \cdot f

- Training of the model (3):
  - assign each training example the target value y = 1 if it belongs to the class, and y = 0 if it does not
  - train the weight vector to minimize the predictive error from 1 (for observations in the class) or 0 (for observations not in the class)
- Testing: take the dot product of the learned weight vector with the feature vector x of the new example
- But the result is not guaranteed to lie in [0, 1]

LOGISTIC REGRESSION
- We predict the ratio of two probabilities as the log odds (or logit) function:

  \mathrm{logit}(p(x)) = \ln\left(\frac{p(x)}{1 - p(x)}\right)

- Logistic regression: a regression model in which a linear function estimates the logit of the probability:

  \ln\left(\frac{p(y = \text{true} \mid x)}{1 - p(y = \text{true} \mid x)}\right) = w \cdot f
  \quad\Rightarrow\quad
  p(y = \text{true} \mid x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}}

MULTINOMIAL LOGISTIC REGRESSION
- = the maximum entropy classifier (Maxent); deals with a larger number of classes
- Let there be C different classes y_1, y_2, ..., y_C
- We estimate the probability that y is a particular class, given N feature functions, as:

  p(y \mid x) = \frac{1}{Z} \exp \sum_{i=0}^{N} w_i f_i(y, x)
  = \frac{\exp \sum_{i=0}^{N} w_i f_i(y, x)}{\sum_{y' \in C} \exp \sum_{i=0}^{N} w_i f_i(y', x)}

CONTEXT-DEPENDENT CLASSIFICATION
- Context-dependent classification = the class to which a feature vector is assigned depends on:
  1) the feature vector itself
  2) the values of other feature vectors and their classes
  3) the existing relations among the various classes
- Examples: conditional random field, structured output support vector machine

CONDITIONAL RANDOM FIELD
- Linear-chain conditional random field:
  - Let X = (x_1, ..., x_T) be a random variable over data sequences to be labeled, and Y a random variable over the corresponding label sequences
  - All components y_j of Y are assumed to range over a finite label alphabet Σ
  - We define G = (V, E) to be an undirected graph such that there is a node v ∈ V corresponding to each of the random variables representing an element y_v of Y
  - If each y_v obeys the Markov property with respect to G, then the model (Y, X) is a conditional random field
- In an information extraction task, X might range over the words or constituents of a sentence/discourse, while Y ranges over the semantic/pragmatic classes to be recognized in these sentences/discourses
- Template-based or general CRF: in theory the structure of the graph G may be arbitrary; in a template-based or general CRF you can define the dependencies in the Markov network or graph [Lafferty et al. ICML 2001]
- To classify a new instance, P(Y|X) is computed as follows:

  p(Y \mid X) = \frac{1}{Z} \exp\left( \sum_{j=1}^{T} \sum_{i=1}^{k} \lambda_i f_i(y_{j-1}, y_j, X, j) \right)

  where f_i(y_{j-1}, y_j, X, j) = one of the k binary-valued feature functions, λ_i = a parameter that models the observed statistics in the training examples, and Z = a normalizing constant
- The most probable label sequence Y* for input sequence X is:

  Y^* = \arg\max_Y \; p(Y \mid X)
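A usage sketch of a linear-chain CRF for sequence labeling, assuming the open-source sklearn-crfsuite package (a third-party wrapper, not part of scikit-learn, and not necessarily what the cited work used). The two token sequences and the PREM/CONC tags are invented.

```python
# Linear-chain CRF sequence labeling (sketch). Binary feature functions
# f_i(y_{j-1}, y_j, X, j) are induced from these per-position features
# plus the label transitions.
import sklearn_crfsuite

def token_features(sent, j):
    return {
        "word.lower": sent[j].lower(),
        "is_first": j == 0,
        "prev.lower": sent[j - 1].lower() if j > 0 else "<BOS>",
    }

sentences = [["The", "court", "notes", "delays"],
             ["Therefore", "the", "claim", "fails"]]
X_train = [[token_features(s, j) for j in range(len(s))] for s in sentences]
y_train = [["PREM", "PREM", "PREM", "PREM"],
           ["CONC", "CONC", "CONC", "CONC"]]  # invented tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```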
CONDITIONAL RANDOM FIELD
- CRF training:
  - As for the Maxent model, we need numerical methods to derive the parameters λ_i
  - E.g., for the linear-chain CRF: a variation of the Baum-Welch algorithm
  - In general CRFs we use approximate inference (e.g., a Markov chain Monte Carlo sampler)
- Advantages and disadvantages:
  - A very successful information extraction technique
  - Training is computationally expensive, especially when the graphical structure is complex

GLOBAL LEARNING
- Globally or jointly recognizing several labels and their relationships
- Can be realized by:
  - Inferring a grammar (with rules) from data
  - Structured support vector machines
  - Graphical models (Markov random fields and Bayesian networks)

MODELS THAT JOINTLY LEARN
- The machine recognizes fragmentary pieces (e.g., names, facts), and the recognition of related fragments of text is often limited to the sentence level
- Emerging recognition of integrated understanding: e.g., noun-phrase coreference resolution and entity recognition across a discourse
- Human understanding of text: inferencing, connecting content [Wikipedia]
[Figure: argumentation structure, Mochales & Moens AI & Law 2011]

INFERRING A GRAMMAR WITH RULES FROM DATA
- Can be done manually (cf. the PhD thesis of Raquel Mochales Palau)
- Can be learned from annotated data
- Could be learned from a very large unannotated corpus, but this is very difficult if the grammar is complex
[PhD thesis Raquel Mochales Palau 2011]
- Experiments with decisions of the European Court of Human Rights (ECHR) [Mochales & Moens AI & Law 2011]
- Works well (see the results further on)
- A deterministic grammar might overfit the data it is constructed from
- A probabilistic grammar needs annotated data
- If we have annotated data, we can learn the grammar

INTERMEZZO: SPATIAL RELATION
- The goal is to jointly assign the labels of an ontology to a text item

JOINT MACHINE LEARNING
- Joint or global learning ≠ local learning of independent classifiers
  - Independent classifiers and combination of the results (e.g., based on integer linear programming)
  - Joint training: one classification model for the global structure (cf. CRF); the output is a structure (e.g., a spatial ontology)
[PhD of Parisa Kordjamshidi 2013] [Kordjamshidi & Moens Journal of Web Semantics 2014]

OUTPUT
- Output variables = the labels in the structure

INPUT
- The object to which the classification model is applied: e.g., a sentence (in our case), a paragraph, a full document, ...
- Is usually composed of different input components: single words, phrases, ... depending on the type of text snippet to which a label will be assigned

FEATURE FUNCTIONS
- Each input component is assigned a set of features: e.g., lexical, syntactic, discourse distance, ...
- Feature functions link an input component with a possible label (the notion of feature templates)
- Each feature function receives a weight during training
- A feature template groups a set of feature functions => a block of corresponding weights W_i
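A minimal sketch of the feature-function idea: a template conjoins an input component's features with each candidate label, so every (feature, label) instantiation owns one weight. The role labels echo the spatial intermezzo but, like the weights, are invented for illustration.

```python
# Feature templates: a feature function links an input component to a
# candidate label; each instantiation owns one weight in the model.
from collections import defaultdict

LABELS = ["TRAJECTOR", "LANDMARK", "NONE"]  # illustrative role labels

def template(component_feats, label):
    # Conjoin every input feature with the candidate label.
    return {f"{name}={value}|y={label}": 1.0
            for name, value in component_feats.items()}

def score(component_feats, label, weights):
    # Linear discriminant: weights dotted with the instantiated features.
    return sum(weights[f] * v
               for f, v in template(component_feats, label).items())

weights = defaultdict(float)
weights["word=kite|y=TRAJECTOR"] = 2.0  # a single invented trained weight

feats = {"word": "kite", "pos": "NN"}
print(max(LABELS, key=lambda y: score(feats, y, weights)))
```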
OBJECTIVE FUNCTION
- The main objective discriminant function is a linear function in terms of the combined feature representation associated with each candidate input component and an output label, according to the template (Ψ) specifications
- It can be written in terms of the instantiations of the templates and their related blocks of weights W_p

TRAINING OF THE MODEL
- A popular discriminative training approach is to minimize a convex upper bound of the loss function over the N training examples
- Training proceeds with the most violated constraints/outputs (y) per training example
- In the experiments: structured support vector machines (SSVM), structured perceptrons and averaged structured perceptrons (a compact structured perceptron sketch follows at the end of this part)

CONSTRAINTS
- Constraints are linear and the variables take the form of integers
- Constraints are applied during training (finding the most violated outputs) and/or during testing

GRAPHICAL MODELS IN GENERAL
- E.g., Markov random fields
- Allow using rules as features, for which the weights are trained on the annotated data
- Concern: the computational complexity

JOINT RECOGNITION OF A CLAIM AND ITS COMPOSING ARGUMENTS
- Structured learning: modeling the interdependence among output labels:
  - Generalized linear models, e.g., structured support vector machines and structured perceptrons [Tsochantaridis et al. JMLR 2006]
  - Probabilistic graphical models [Koller and Friedman 2009]
- The interdependencies between output labels and other background knowledge can be imposed using constraint optimization techniques during prediction and training
  - Cf. recent work on structure analysis of scientific documents [Guo et al. NAACL-HLT 2013]

OTHER ARGUMENTATION STRUCTURES
- Joint recognition can also target the Toulmin model, or the many different argumentation schemes/structures discussed in Douglas Walton (1996), Argumentation Schemes for Presumptive Reasoning. Mahwah, New Jersey: Lawrence Erlbaum Associates
- Work of Prakken, Gordon, Bench-Capon, Atkinson, Wyner, Schneider, ...

DECOMPOSITIONS
- Complex graphical structures: considering the interdependencies and structural constraints over the output space easily leads to intractable training and prediction:
  - Models for decomposition, communicative inference, message passing, ...
  - A current research topic in machine learning
- Breaking the structured model into two or more pieces:
  - Build a model for each piece
  - Possibly: iteratively improve each model by communicating between the pieces
- Application to argumentation mining
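As announced above, a compact sketch of a structured perceptron for sequence labeling: emission and transition weights form the blocks, Viterbi search predicts the highest-scoring label sequence (the joint, global output), and the weights are pushed toward the gold structure whenever the prediction differs. The two training sequences and the PREM/CONC tags are invented.

```python
# Structured perceptron (sketch): predict a full label sequence with
# Viterbi, then update the weights toward the gold structure.
from collections import defaultdict

LABELS = ["PREM", "CONC"]

def viterbi(tokens, w):
    # delta[y] = (score of the best sequence ending in y, that sequence)
    delta = {y: (w[("emit", tokens[0], y)], [y]) for y in LABELS}
    for tok in tokens[1:]:
        new = {}
        for y in LABELS:
            best = max(LABELS, key=lambda yp: delta[yp][0] + w[("trans", yp, y)])
            s = delta[best][0] + w[("trans", best, y)] + w[("emit", tok, y)]
            new[y] = (s, delta[best][1] + [y])
        delta = new
    return max(delta.values(), key=lambda t: t[0])[1]

def features(tokens, labels):
    # Emission and transition counts of a (sequence, labeling) pair.
    f = defaultdict(float)
    for j, (tok, y) in enumerate(zip(tokens, labels)):
        f[("emit", tok, y)] += 1.0
        if j > 0:
            f[("trans", labels[j - 1], y)] += 1.0
    return f

def train(data, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for tokens, gold in data:
            pred = viterbi(tokens, w)
            if pred != gold:  # reward gold features, punish predicted ones
                for k, v in features(tokens, gold).items():
                    w[k] += v
                for k, v in features(tokens, pred).items():
                    w[k] -= v
    return w

data = [(["the", "delay", "is", "long"], ["PREM"] * 4),
        (["therefore", "the", "claim", "fails"], ["CONC"] * 4)]
w = train(data)
print(viterbi(data[0][0], w))
```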
FEATURES REVISITED
Example from an ECHR decision, segmented into argument components that are labeled as premises and a conclusion:
- "On the other hand the court notes that there are substantial delays attributable to the authorities."
- "In particular in the first set of proceedings there is a period of inactivity of more than two years ..."
- "In the second set of proceedings there is a period of inactivity of some three years."
- "The court cannot find that the government has given sufficient explanation for these delays that occurred."

- Because we input candidate arguments and their candidate components:
  - We can describe a component with different features than the ones used for describing the full argument
  - E.g., textual entailment relationships can be used to describe the full argument
- Our argumentation mining machine only uses information residing in the texts
- Human understanding of text: humans connect the text to their world/domain knowledge [Wikipedia]
- The discourse structure is often signaled by typical keywords (e.g., "in conclusion", "however", ...), but often it is not
- Humans who understand the meaning of the text can infer whether a claim is a plausible conclusion given a set of premises, or whether a claim rebuts another claim
  - => background or domain knowledge makes a certain discourse relation valid
  - => background or domain knowledge that an argumentation mining tool should also acquire: how?
- Work on textual entailment: [Cabrio & Villata 2012]; on event causality: [Xuan Do et al. EMNLP 2011]; ...

TEXTUAL ENTAILMENT
- Textual entailment: recognize, given two text fragments, whether one text can be inferred (entailed) from the other
- Has been studied widely in the computational linguistics and machine learning communities (e.g., the PASCAL Recognizing Textual Entailment challenge)
- Most of the work in textual entailment takes a distance-computation approach between the texts (e.g., edit distances, similarity metrics, kernels):
  - E.g., the EDITS system (Edit Distance Textual Entailment Suite), an open-source software package for textual entailment: http://edits.fbk.eu/

ENTAILMENT IN ARGUMENTATION
- Textual entailment (TE) provides techniques to detect both the argument components and the kind of relation underlying them: an entailment or a contradiction is detected [Cabrio & Villata ACL 2012]
- Similarity measures are rough approaches
- It is very difficult to acquire automatically the background knowledge needed for the entailment: a process that takes legal professionals years

PART 3: SOME APPLICATIONS

ARGUMENTATION MINING OF LEGAL CASES
- Cases of the European Court of Human Rights [PhD thesis Raquel Mochales Palau 2011] [Mochales & Moens AI & Law 2011]
- A context-free grammar also allows recognizing the full argumentation structure: accuracy: 60%
- Features of the classifier: clauses described by unigrams, bigrams, adverbs, legal keywords, word couples over adjacent clauses, ...
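One feature family from the list above, sketched in code: we read "word couples over adjacent clauses" as pairing each word of a clause with each word of the preceding clause (our illustrative reading, not necessarily the exact definition used in the cited work).

```python
# Word-couple features over adjacent clauses (illustrative reading).
from itertools import product

def word_couples(prev_clause, clause):
    # One binary feature per (word in previous clause, word in clause).
    return {f"couple={a}_{b}": 1.0
            for a, b in product(prev_clause.lower().split(),
                                clause.lower().split())}

prev = "the court notes substantial delays"
curr = "no sufficient explanation was given"
print(sorted(word_couples(prev, curr))[:5])
```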
SUPPORT FOR ONLINE USER COMMENTS
- Online user comments contain arguments with appropriate or missing justification
- [Park & Cardie FWAM 2014] classify comments into classes such as UNVERIFIABLE, VERIFIABLE NON-EXPERIENTIAL and VERIFIABLE EXPERIENTIAL
- Features: n-grams, POS tags, presence in a core or accessory clause, sentiment clues, speech event anchors, imperative expression count, emotion expression count, tense count, person count

RECOGNIZING ARGUMENTS IN ONLINE DISCUSSIONS
- [Boltužić & Šnajder FWAM 2014] identify properties of comment-argument pairs
- Features: entailment features (TE) from pretrained entailment decision algorithms (which use, among others, WordNet and VerbOcean); semantic text similarity features (STS); and a stance alignment feature (SA), with the stance known a priori
- Multiclass classification with a support vector machine

ARGUMENT ENRICHED OPINION MINING
- Opinion mining: finding arguments and counterarguments for an opinion expressed:
  - Find support for the opinion, explain the opinion
  - "An opinion, whether it is grounded in fact or completely unsupportable, is an idea that an individual or group holds to be true. An opinion does not necessarily have to be supportable or based on anything but one's own personal feelings, or what one has been taught. An argument is an assertion or claim that is supported with concrete, real-world evidence." [http://wiki.answers.com]

ARGUMENT MINING IN THE SCIENTIFIC LITERATURE
- Mining the supporting evidence of claims in scientific publications and patents, and its visualization for easy access [http://undsci.berkeley.edu/article/howscienceworks_07]

ARGUMENT MINING IN THE DIGITAL HUMANITIES
- Digital humanities: finding and comparing the arguments that politicians use in their speeches:
  - "Then that little man in black there, he says women can't have as much rights as men, 'cause Christ wasn't a woman! Where did your Christ come from? Where did your Christ come from? From God and a woman! Man had nothing to do with Him." [Sojourner Truth (1797-1883): Ain't I A Woman?, delivered 1851, Women's Convention, Akron, Ohio]

ANNOTATED DATA
- The Araucaria corpus (constructed by Chris Reed at the University of Dundee, 2003), now extended to AIF-DB
- The ECHR corpus, annotated by legal experts in 2006 under the supervision of Raquel Mochales Palau:
  - 25 legal cases
  - 29 admissibility reports
  - 12,904 sentences: 10,133 non-argumentative and 2,771 argumentative (2,355 premises and 416 conclusions)
- Plans to build a corpus of biomedical genetics research literature [Green FWAM 2014]
- Several smaller corpora described at FWAM 2014
- ...

PART 4: CONCLUSIONS AND THOUGHTS FOR FUTURE RESEARCH

CONCLUSIONS
- Argumentation mining: a novel and promising research domain
- Potential of joint learning of an argumentation structure, integrating known interdependencies between the structural components of the argumentation and expert knowledge
- Potential of better textual entailment techniques
- Numerous interesting applications of the technology!
THOUGHTS FOR FUTURE RESEARCH
- ?
- ISCH COST Action IS1312: Structuring Discourse in Multilingual Europe (TextLink)
  http://www.cost.eu/domains_actions/isch/Actions/IS1312
  http://textlinkcost.wix.com/textlink