Text mining workflows for indexing archives with automatically extracted semantic metadata

Riza Batista-Navarro (1), Axel Soto (1), William Ulate (2) and Sophia Ananiadou (1)
(1) University of Manchester
(2) Missouri Botanical Garden

20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016)

Outline (1)
• Introduction
  – challenges in information discovery/search
  – semantic metadata generation as a named entity recognition (NER) task
• The Argo text mining workbench
  – system overview and features
  – workflow construction, configuration and execution
  – visual inspection of generated annotations

Outline (2)
• Constructing NER workflows for generating semantic metadata
  – medical history archives
  – biodiversity legacy literature
• Exploring search indexes containing semantic metadata
  – introduction
  – overview of Elasticsearch
  – query examples

Outline (3)
• Example applications
  – Disambiguation in the History of Medicine search system
  – Biodiversity Heritage Library query expansion
• Conclusions

Biodiversity Heritage Library
• http://www.biodiversitylibrary.org/
• a consortium of botanical and natural history libraries
• stores digitised legacy literature on biodiversity
• currently holds 180,000 volumes, i.e., 50+ million pages (PDFs and OCR-generated text)
• open access

BHL's keyword-based search and browsing (screenshot)

BHL's advanced search functionality, also keyword-based (screenshot)

What's wrong with keyword-based search?
(1) Inability to disambiguate
• "California bay": a hardwood tree, or a location?
• "Emperor": a fish, or a person?
• "Boxwood": a historic place in Alabama, or the North American term for plants in the Buxaceae family?
• Implications:
  – less precise search results
  – documents irrelevant to the intended sense are also returned, e.g., a search for "Emperor" (the fish) also returns documents about "Emperor" (the person)
What's wrong with keyword-based search?
(2) Inability to account for variants
• Implications:
  – limited coverage
  – information is overlooked, e.g., a search for "Panthera leo" does not return documents that mention only "lion"

Solution: Semantic metadata generation using named entity recognition (NER)
• the task of automatically demarcating mentions
  – detecting their boundaries (e.g., character offsets)
  – placing them into predefined categories

Named entity recognition (NER)
• cast as a sequence labelling task
  – sequence = tokens in a sentence
• approaches
  – dictionary-based
  – rule-based
  – machine learning (ML)-based
  – hybrid

Dictionary-based NER
• Sample entries in a gazetteer:
  Ho Chi Minh        PROVINCE
  Ho Chi Minh City   CITY
  …
  Johannesburg       CITY
  Johannesburg       PROVINCE
  …
  Mexico             PROVINCE
  Mexico Beach       CITY
  Mexico City        CITY
  Mexico Crossing    CITY
  …
  Riyadh             CITY
  …
  Tehran             CITY
  Tehran             PROVINCE
• Sample text: "The final five include Mexico City, Riyadh, Johannesburg, Ho Chi Minh City and …"
• The same text matched against the gazetteer, in BIO format (one column per category: CITY, then PROVINCE):
  The            O        O
  final          O        O
  five           O        O
  include        O        O
  Mexico         B-CITY   B-PROVINCE
  City           I-CITY   O
  ,              O        O
  Riyadh         B-CITY   O
  ,              O        O
  Johannesburg   B-CITY   B-PROVINCE
  ,              O        O
  Ho             B-CITY   B-PROVINCE
  Chi            I-CITY   I-PROVINCE
  Minh           I-CITY   I-PROVINCE
  City           I-CITY   O
  and            O        O

Dictionary-based NER
✓ Advantages
• simple
• many readily available dictionaries/lexica
✗ Disadvantages
• dictionaries can become too big
• yet none of them is complete or comprehensive enough
• overlaps between categories, e.g., many people and places share the same names
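A minimal sketch of the gazetteer matching shown above, using greedy longest-match over tokens; the entries and category names come from the slide, while the function names and tokenisation are illustrative. A real system would match each category's dictionary separately (which is how the two-column output above arises) and handle case and overlaps more carefully:

```python
# Greedy longest-match dictionary-based NER producing BIO labels (sketch).
GAZETTEER = {
    ("Ho", "Chi", "Minh", "City"): "CITY",
    ("Johannesburg",): "CITY",
    ("Mexico", "City"): "CITY",
    ("Riyadh",): "CITY",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def bio_tag(tokens):
    """Return one BIO label per token; the longest gazetteer match wins."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # try the longest span starting at position i first
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                cat = GAZETTEER[span]
                labels[i] = "B-" + cat
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + cat
                i += n
                break
        else:
            i += 1  # no entry starts here
    return labels

tokens = "The final five include Mexico City , Riyadh and Johannesburg".split()
print(list(zip(tokens, bio_tag(tokens))))
# Mexico City -> B-CITY I-CITY; Riyadh -> B-CITY; Johannesburg -> B-CITY
```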
Rule-based NER
• Regular expressions
  – checking for capitalisation
  – checking for numbers
• Function words for extracting, e.g., locations
  – Capitalised word + {city, centre, river} indicates a location
    Examples: New York city, Hudson river
  – Capitalised word + {street, boulevard, avenue} indicates a location
    Example: Fifth avenue

Rule-based NER
• Context patterns
  – [PERSON] earned [MONEY]
    Example: John earned £20
  – [PERSON] joined [ORGANISATION]
    Example: Sam joined IBM
  – [PERSON], the [JOBTITLE]
    Example: Mary, the teacher

Rule-based NER
• still not so simple:
  [PERSON|ORGANISATION] fly to [LOCATION]
    Examples: Jerry flew to Japan
              Delta flies to Europe
              Birds fly to the nest
• match patterns against entries defined in a gazetteer
  – dictionary of person names: [John, Jerry, Mary, Frank, David, …]
  – Jerry is a person's name, but Delta and Birds are not.

Rule-based NER
✓ Advantages
• handcrafted rules can be very precise
• only a small amount of development data is needed
✗ Disadvantages
• domain-dependent
• expensive development and test cycle
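As a rough illustration of the context patterns above, here is a sketch implementing the [PERSON] joined [ORGANISATION] rule with a regular expression, using a small person-name gazetteer to filter out capitalised words that are not names (the Delta problem). The names and the pattern are from the slides; the code itself is illustrative:

```python
import re

# Tiny person-name gazetteer, as on the slide.
PERSON_NAMES = {"John", "Jerry", "Mary", "Frank", "David", "Sam"}

# [PERSON] joined [ORGANISATION]: two capitalised words around "joined".
PATTERN = re.compile(r"\b([A-Z][a-z]+) joined ([A-Z][A-Za-z]+)\b")

def extract(text):
    """Yield (person, organisation) pairs; the gazetteer rejects
    capitalised words that are not known person names."""
    for m in PATTERN.finditer(text):
        if m.group(1) in PERSON_NAMES:
            yield m.group(1), m.group(2)

print(list(extract("Sam joined IBM. Delta joined SkyTeam.")))
# [('Sam', 'IBM')] -- "Delta" matches the regex but fails the gazetteer check
```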
Shortcomings of dictionary- and rule-based approaches
• Failure to generalise
  – the first word of a sentence is also usually capitalised
  – multiword expressions
• Inability to disambiguate
  – Jordan the person vs. Jordan the location
  – JFK the person vs. JFK the airport
  – May the person vs. May the month

Shortcomings of dictionary- and rule-based approaches
• Upkeep/maintenance
  – no gazetteer contains all existing proper names
  – new proper names constantly emerge
    • products, brands
    • scientific discoveries (e.g., planets, stars, medicines)
  – multiple variants can emerge for the same entity
    • John Smith
    • J. Smith
    • Prof. J Smith

ML-based approaches to NER
• Supervised learning
  – labelled training examples
  – methods
    • hidden Markov models (HMMs)
    • naïve Bayes
    • decision trees
    • support vector machines (SVMs)
    • conditional random fields (CRFs)

ML-based approaches to NER
• Semi-supervised learning
  – a small percentage of the training examples is labelled, the rest is unlabelled
  – methods
    • bootstrapping
    • active learning
    • co-training
    • self-training
• Unsupervised learning
  – labels must be automatically discovered
  – method: clustering

Conditional random fields (CRFs)
• a widely used algorithm for sequence labelling
• finds the most probable label sequence y given an observation sequence x, where x consists of the sequence of tokens from the input text

Conditional random fields (CRFs)
• computation of the probability, where \lambda_i is a feature function weight, the inner sum is over all feature functions, the outer sum is over all tokens, and Z(x) is a normalisation factor:

  p(y \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{t=1}^{T} \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, x, t) \Bigr)

• a feature function characterises the input, e.g.:

  f_i(x, y) = \begin{cases} 1 & \text{if the 1st letter of } x \text{ is uppercase and } y = \text{B-ORG} \\ 0 & \text{otherwise} \end{cases}

Conditional random fields (CRFs): Feature types
• character n-grams (e.g., 2-, 3-, 4-grams)
• lexical and contextual
  – current word, lemma, part-of-speech (POS) tag
  – word n-grams: around W0 in a [-3, …, +3] window
• suffixes and prefixes (e.g., with lengths 2 to 4)

Conditional random fields (CRFs): Feature types
• orthographic
  initial-caps    all-caps         lonely-initial   all-digits
  contains-dots   punctuation-mark single-char      contains-hyphen
• semantic
  – matches between tokens and names in gazetteers or controlled vocabularies

Pipelining various tools for NER
• Sentence splitting
  – to define a sequence
• Tokenisation
  – to generate the basic unit of analysis, i.e., tokens
• Lemmatisation, POS-tagging
  – to generate lexical and contextual features
• Gazetteer matching
  – to generate semantic features
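To make the pipeline concrete, here is a sketch of a per-token feature extractor of the kind a CRF tagger consumes. The feature names follow the slides; the NLTK calls (sent_tokenize, word_tokenize, pos_tag) are one convenient stand-in for a sentence splitter, tokeniser and POS tagger, not the tools used in the actual workflows:

```python
import nltk  # assumes the punkt and POS-tagger models have been downloaded

GAZETTEER = {"Mexico", "Riyadh", "Tehran"}  # toy semantic resource

def token_features(tokens, pos_tags, t):
    """Feature dict for token t: lexical, contextual, orthographic
    and semantic features, as listed on the slides."""
    w = tokens[t]
    feats = {
        "word": w.lower(),
        "pos": pos_tags[t],
        "prefix3": w[:3], "suffix3": w[-3:],
        "initial-caps": w[:1].isupper(),
        "all-caps": w.isupper(),
        "all-digits": w.isdigit(),
        "contains-hyphen": "-" in w,
        "single-char": len(w) == 1,
        "in-gazetteer": w in GAZETTEER,
    }
    # contextual features: neighbouring words in a small window
    for off in (-2, -1, 1, 2):
        if 0 <= t + off < len(tokens):
            feats["word[%+d]" % off] = tokens[t + off].lower()
    return feats

text = "The final five include Mexico City and Riyadh."
for sent in nltk.sent_tokenize(text):                 # sentence splitting
    tokens = nltk.word_tokenize(sent)                 # tokenisation
    pos_tags = [p for _, p in nltk.pos_tag(tokens)]   # POS-tagging
    X = [token_features(tokens, pos_tags, t) for t in range(len(tokens))]
    # X (plus BIO labels on training data) is the input a CRF toolkit
    # such as CRFsuite/NERsuite expects.
```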
Questions so far?

Argo: a generic text mining workbench (http://argo.nactem.ac.uk)
(System diagram: the Workflow Designer supports workflow diagramming over structured data; annotators/curators perform manual editing; developers contribute UIMA-compliant processing components; processing can run remotely.)

(Screenshots: the Workflows, Processes and Documents views; the Workflow Editor)

Components
• Readers
  – load corpora/document collections
  – provide support for various formats, e.g., plain text, XML, TSV, stand-off
• Analytics
  – natural language processing tools
  – tokenisers, POS taggers, parsers, named entity recognisers
• Consumers
  – serialisation to files (e.g., XML, TSV) and databases

(Screenshots: workflow configuration, execution and monitoring)

Visual inspection of results: the Manual Annotation Editor

Generating semantic metadata with NER workflows: biodiversity literature
• loads the BHL corpus (XML)
• extracts the text body from the relevant XML elements
• performs sentence splitting (LingPipe: http://alias-i.com/lingpipe)
• performs tokenisation, lemmatisation and POS-tagging (GENIA Tagger: http://www.nactem.ac.uk/GENIA/tagger)
• CRF-based biodiversity NER (NERsuite: http://nersuite.nlplab.org)
• removes unnecessary annotations
• launches an interface for visual inspection
• writes annotations to a search index

Entity types annotated: Taxon, Location, Habitat, Person, Temporal expression

Generating semantic metadata with NER workflows: medical archives
• CRF-based disease NER (NCBI Disease Corpus: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/)
• CRF-based chemical name NER (ChER: https://jcheminf.springeropen.com/articles/10.1186/1758-2946-7-S1-S6)

Questions so far?

Introduction to Search Indices
• A search engine is an information retrieval system designed to help find information stored on a computer system
• Intuitively (and simplistically): a query such as "caesar killed" is matched against the indexed documents d1, d2, …, dn, and the relevant ones (e.g., d8) are returned
• We will focus on Elasticsearch in this tutorial!

An overview of Elasticsearch
(Much of the following content is drawn from the Elasticsearch documentation.)
• Elasticsearch is an open-source distributed search (full-text or structured) and analytics engine:
  – timestamps or exact values
  – full-text search, synonym handling, scoring documents by relevance
  – analytics and aggregations over the same data in real time
• Notable examples:
  – Wikipedia (full-text search, highlighted snippets, search-as-you-type and did-you-mean suggestions)
  – The Guardian (combining visitor logs with social-network data to provide analytics)
  – Stack Overflow (full-text search with geolocation queries and more-like-this in Q&A)
  – GitHub (querying 130 billion lines of code)
• Elasticsearch can run on your laptop, or scale out to hundreds of servers and petabytes of data

An overview of Elasticsearch (cont.)
• built on top of Apache Lucene, a full-text search-engine library
• Lucene is arguably the most advanced, high-performance and fully featured search-engine library available
• Why not use Lucene directly, then?
  – complexity: it requires a deep understanding of IR concepts and of Lucene's inner workings
  – you need to work in Java and integrate Lucene directly with your application
  – Elasticsearch packages all this functionality into a standalone server that your application talks to via a RESTful API
  – it "works right out of the box": sensible defaults hide complicated search theory, while everything remains fully configurable and flexible

An overview of Elasticsearch
• Isn't Solr doing the same?
  – which one is better depends on the application
  – Elasticsearch was born in the age of REST APIs, so it is more aligned with Web 2.0 applications
  – in our case, the nested document structure made Elasticsearch a clear winner
  – http://solr-vs-elasticsearch.com

How to install Elasticsearch
• It's quite straightforward:
  – https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html
• For development and interactive querying, the recommended tool is Sense
  – also available as a Chrome extension
  – sends JSON data over HTTP
  – friendly syntax for the curl command

How to communicate with Elasticsearch?
• Java API
  – used within the Argo component
• RESTful API
  – used for the examples here
• We will follow a 'learn from example' philosophy in this tutorial
  – only emphasising important aspects of the query syntax

Elasticsearch key concepts
• Document oriented
  – similar to the NoSQL concept of a document
  – intuitively, a document is analogous to an object in OO programming
  – why? No need to squeeze or flatten your object into a table (usually one field per column), losing its richness
• JSON
  – the serialisation format for documents

Elasticsearch key concepts
• Glossary:
  – Index:
    • analogous to a database in SQL and NoSQL
    • can contain multiple types
  – Type:
    • analogous to a table (SQL) or a collection (MongoDB)
    • can contain multiple documents
  – Document:
    • analogous to a row (SQL)
    • can contain multiple fields
  – Field:
    • analogous to a column (SQL)
    • each field is associated with a field type: 'string', 'date', 'integer'
• "Index" is an overloaded word
  – used as a noun, as a verb, and in "inverted index"

Querying Elasticsearch
• We already ran Argo workflows, which inserted data into Elasticsearch
• Let's have a look at the existing indices…
• Let's search for all documents in an index…
  – format of the response
  – pagination

Querying Elasticsearch
• Let's refine the query, searching for a specific term…
• Let's search for entities…

Querying Elasticsearch using Sense (screenshot)

Some caveats
• There is no need to define a mapping (i.e., a schema)
  – Elasticsearch tries to guess it ("works out of the box")
  – but in most cases it is necessary to define one:
    • to declare nested objects as such (e.g., 'metadata')
    • to declare fields that do not need text processing (e.g., metadata fields)
• Let's have a look at our current mappings…
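In the spirit of the querying slides above, here is a sketch of what such requests look like over the RESTful API, sent from Python; the index and field names (bhl, text, metadata, metadata.type) are placeholders, not the actual index layout used in the tutorial:

```python
import json
import requests  # assumes an Elasticsearch node on localhost:9200

ES = "http://localhost:9200"

def search(index, body):
    """POST a query body to the index's _search endpoint."""
    r = requests.post("%s/%s/_search" % (ES, index),
                      headers={"Content-Type": "application/json"},
                      data=json.dumps(body))
    return r.json()

# 1. All documents in an index (paginated: 10 hits from offset 0).
print(search("bhl", {"query": {"match_all": {}}, "from": 0, "size": 10}))

# 2. Refine with a specific term in the full text.
print(search("bhl", {"query": {"match": {"text": "Aquila"}}}))

# 3. Search on the semantic metadata: a nested query over entity
#    annotations (requires 'metadata' to be mapped as a nested object).
print(search("bhl", {
    "query": {"nested": {
        "path": "metadata",
        "query": {"bool": {"must": [
            {"match": {"metadata.text": "cold"}},
            {"match": {"metadata.type": "Disease"}},
        ]}},
    }}
}))
```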
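And a sketch of an explicit mapping addressing the two caveats above: declaring 'metadata' as a nested object and keeping its fields out of full-text analysis. Again the index layout is illustrative, and the syntax shown ('string' with "index": "not_analyzed") is the pre-5.x one current at the time of this tutorial:

```python
import json
import requests

MAPPING = {
    "mappings": {
        "page": {                             # document type
            "properties": {
                "text": {"type": "string"},   # analysed: full-text search
                "metadata": {
                    "type": "nested",         # entity annotations as nested docs
                    "properties": {
                        # exact-match fields: no text processing
                        "text": {"type": "string", "index": "not_analyzed"},
                        "type": {"type": "string", "index": "not_analyzed"},
                    },
                },
            }
        }
    }
}

# Create the index with the explicit mapping (illustrative index name).
requests.put("http://localhost:9200/bhl",
             headers={"Content-Type": "application/json"},
             data=json.dumps(MAPPING))
```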
Much more…
• aggregations (faceting)
• horizontal scalability (sharding)
• sorting / relevance
• word proximity, partial matching, fuzzy matching, and language awareness
• geolocation and geohashes

Questions so far?

Applications: Disambiguation in the History of Medicine search system
• http://nactem.ac.uk/hom
• Archives
  – British Medical Journal articles (380,000)
  – London Medical Officer of Health reports (5,000)

Searching for "cold" based on keywords (screenshots)
• returns both "cold" as a medical condition and "cold" describing temperature

Searching for "cold" as a disease, based on semantic metadata (screenshot)

Applications: BHL Query Expansion
• http://nactem10.mib.man.ac.uk/va/MiBio/Search/queryExpansion.html?prot=thumb

Searching for "Aquila chrysaetos" (screenshot)

Searching for "Aquila chrysaetos": expanding with "Golden eagle" (screenshot)

Searching for "Aquila chrysaetos" in BHL (screenshot)

Conclusions
• Discussed challenges in information discovery and search
• Reviewed methods for NER
• Presented the Argo text mining workbench
• Extracted named entities, which are then indexed to facilitate semantic searches
• Presented the fundamentals of Elasticsearch: key concepts, search, mappings

Conclusions
• Illustrated some applications:
  – disambiguation in the History of Medicine system
  – improving recall in BHL
• Please get in touch with us if you're interested in applying Argo to your digital libraries!
  – riza.batista@manchester.ac.uk
  – axel.soto@manchester.ac.uk