Text mining workflows for indexing archives with automatically extracted semantic metadata

Riza Batista-Navarro (1), Axel Soto (1), William Ulate (2) and Sophia Ananiadou (1)
(1) University of Manchester
(2) Missouri Botanical Garden

20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016)

Outline (1)
• Introduction
  – challenges in information discovery/search
  – semantic metadata generation as a named entity recognition (NER) task
• The Argo text mining workbench
  – system overview and features
  – workflow construction, configuration and execution
  – visual inspection of generated annotations

Outline (2)
• Constructing NER workflows for generating semantic metadata
  – medical history archives
  – biodiversity legacy literature
• Exploring search indexes containing semantic metadata
  – introduction
  – overview of Elasticsearch
  – query examples

Outline (3)
• Example applications
  – Disambiguation in the History of Medicine search system
  – Biodiversity Heritage Library query expansion
• Conclusions

Biodiversity Heritage Library
• http://www.biodiversitylibrary.org/
• a consortium of botanical and natural history libraries
• stores digitised legacy literature on biodiversity
• currently holds 180,000 volumes, i.e., 50+ million pages (PDFs and OCR-generated text)
• open access

BHL's keyword-based search and browsing (screenshot)

BHL's advanced search functionality, also keyword-based (screenshot)

What's wrong with keyword-based search?
(1) Inability to disambiguate
• "California bay": a hardwood tree, or a location?
• "Emperor": a fish, or a person?
• "Boxwood": a historic place in Alabama, or the North American term for plants in the Buxaceae family?
• Implications:
  – less precise search results
  – documents irrelevant to the intended sense are also returned, e.g., a search for "Emperor" (the fish) also returns documents about "Emperor" (the person)
What's wrong with keyword-based search?
(2) Inability to account for variants
• Implications:
  – limited coverage
  – information is overlooked, e.g., a search for "Panthera leo" does not return documents that mention only "lion"

Solution: Semantic metadata generation using named entity recognition (NER)
• the task of automatically demarcating mentions
  – detecting their boundaries (e.g., character offsets)
  – placing them into predefined categories

Named entity recognition (NER)
• cast as a sequence labelling task
  – sequence = tokens in a sentence
• approaches
  – dictionary-based
  – rule-based
  – machine learning (ML)-based
  – hybrid

Dictionary-based NER
• Sample entries in a gazetteer:
  Ho Chi Minh        PROVINCE
  Ho Chi Minh City   CITY
  …
  Johannesburg       CITY
  Johannesburg       PROVINCE
  …
  Mexico             PROVINCE
  Mexico Beach       CITY
  Mexico City        CITY
  Mexico Crossing    CITY
  …
  Riyadh             CITY
  …
  Tehran             CITY
  Tehran             PROVINCE
• Sample text: "The final five include Mexico City, Riyadh, Johannesburg, Ho Chi Minh City and …"
• The same text matched against the gazetteer, in BIO format (one column per category: CITY, then PROVINCE):
  The            O        O
  final          O        O
  five           O        O
  include        O        O
  Mexico         B-CITY   B-PROVINCE
  City           I-CITY   O
  ,              O        O
  Riyadh         B-CITY   O
  ,              O        O
  Johannesburg   B-CITY   B-PROVINCE
  ,              O        O
  Ho             B-CITY   B-PROVINCE
  Chi            I-CITY   I-PROVINCE
  Minh           I-CITY   I-PROVINCE
  City           I-CITY   O
  and            O        O

Dictionary-based NER
✓ Advantages
• simple
• many readily available dictionaries/lexica
✗ Disadvantages
• dictionaries can become too big
• yet none of them is complete or comprehensive enough
• overlaps between categories, e.g., many people and places share the same names
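A minimal sketch of the gazetteer matching shown above, using greedy longest-match over tokens; the entries and category names come from the slide, while the function names and tokenisation are illustrative. A real system would match each category's dictionary separately (which is how the two-column output above arises) and handle case and overlaps more carefully:

```python
# Greedy longest-match dictionary-based NER producing BIO labels (sketch).
GAZETTEER = {
    ("Ho", "Chi", "Minh", "City"): "CITY",
    ("Johannesburg",): "CITY",
    ("Mexico", "City"): "CITY",
    ("Riyadh",): "CITY",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def bio_tag(tokens):
    """Return one BIO label per token; the longest gazetteer match wins."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # try the longest span starting at position i first
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                cat = GAZETTEER[span]
                labels[i] = "B-" + cat
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + cat
                i += n
                break
        else:
            i += 1  # no entry starts here
    return labels

tokens = "The final five include Mexico City , Riyadh and Johannesburg".split()
print(list(zip(tokens, bio_tag(tokens))))
# Mexico City -> B-CITY I-CITY; Riyadh -> B-CITY; Johannesburg -> B-CITY
```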
Rule-based NER
• Regular expressions
  – checking for capitalisation
  – checking for numbers
• Function words for extracting, e.g., locations
  – Capitalised word + {city, centre, river} indicates a location
    Examples: New York city, Hudson river
  – Capitalised word + {street, boulevard, avenue} indicates a location
    Example: Fifth avenue

Rule-based NER
• Context patterns
  – [PERSON] earned [MONEY]
    Example: John earned £20
  – [PERSON] joined [ORGANISATION]
    Example: Sam joined IBM
  – [PERSON], the [JOBTITLE]
    Example: Mary, the teacher

Rule-based NER
• still not so simple:
  [PERSON|ORGANISATION] fly to [LOCATION]
    Examples: Jerry flew to Japan
              Delta flies to Europe
              Birds fly to the nest
• match patterns against entries defined in a gazetteer
  – dictionary of person names: [John, Jerry, Mary, Frank, David, …]
  – Jerry is a person's name, but Delta and Birds are not.

Rule-based NER
✓ Advantages
• handcrafted rules can be very precise
• only a small amount of development data is needed
✗ Disadvantages
• domain-dependent
• expensive development and test cycle
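As a rough illustration of the context patterns above, here is a sketch implementing the [PERSON] joined [ORGANISATION] rule with a regular expression, using a small person-name gazetteer to filter out capitalised words that are not names (the Delta problem). The names and the pattern are from the slides; the code itself is illustrative:

```python
import re

# Tiny person-name gazetteer, as on the slide.
PERSON_NAMES = {"John", "Jerry", "Mary", "Frank", "David", "Sam"}

# [PERSON] joined [ORGANISATION]: two capitalised words around "joined".
PATTERN = re.compile(r"\b([A-Z][a-z]+) joined ([A-Z][A-Za-z]+)\b")

def extract(text):
    """Yield (person, organisation) pairs; the gazetteer rejects
    capitalised words that are not known person names."""
    for m in PATTERN.finditer(text):
        if m.group(1) in PERSON_NAMES:
            yield m.group(1), m.group(2)

print(list(extract("Sam joined IBM. Delta joined SkyTeam.")))
# [('Sam', 'IBM')] -- "Delta" matches the regex but fails the gazetteer check
```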
Shortcomings of dictionary- and rule-based approaches
• Failure to generalise
  – the first word of a sentence is also usually capitalised
  – multiword expressions
• Inability to disambiguate
  – Jordan the person vs. Jordan the location
  – JFK the person vs. JFK the airport
  – May the person vs. May the month

Shortcomings of dictionary- and rule-based approaches
• Upkeep/maintenance
  – no gazetteer contains all existing proper names
  – new proper names constantly emerge
    • products, brands
    • scientific discoveries (e.g., planets, stars, medicines)
  – multiple variants can emerge for the same entity
    • John Smith
    • J. Smith
    • Prof. J Smith

ML-based approaches to NER
• Supervised learning
  – labelled training examples
  – methods
    • hidden Markov models (HMMs)
    • naïve Bayes
    • decision trees
    • support vector machines (SVMs)
    • conditional random fields (CRFs)

ML-based approaches to NER
• Semi-supervised learning
  – a small percentage of the training examples is labelled, the rest is unlabelled
  – methods
    • bootstrapping
    • active learning
    • co-training
    • self-training
• Unsupervised learning
  – labels must be automatically discovered
  – method: clustering

Conditional random fields (CRFs)
• a widely used algorithm for sequence labelling
• finds the most probable label sequence y given an observation sequence x, where x consists of the sequence of tokens from the input text

Conditional random fields (CRFs)
• computation of the probability, where \lambda_i is a feature function weight, the inner sum is over all feature functions, the outer sum is over all tokens, and Z(x) is a normalisation factor:

  p(y \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{t=1}^{T} \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, x, t) \Bigr)

• a feature function characterises the input, e.g.:

  f_i(x, y) = \begin{cases} 1 & \text{if the 1st letter of } x \text{ is uppercase and } y = \text{B-ORG} \\ 0 & \text{otherwise} \end{cases}

Conditional random fields (CRFs): Feature types
• character n-grams (e.g., 2-, 3-, 4-grams)
• lexical and contextual
  – current word, lemma, part-of-speech (POS) tag
  – word n-grams: around W0 in a [-3, …, +3] window
• suffixes and prefixes (e.g., with lengths 2 to 4)

Conditional random fields (CRFs): Feature types
• orthographic
  initial-caps    all-caps         lonely-initial   all-digits
  contains-dots   punctuation-mark single-char      contains-hyphen
• semantic
  – matches between tokens and names in gazetteers or controlled vocabularies

Pipelining various tools for NER
• Sentence splitting
  – to define a sequence
• Tokenisation
  – to generate the basic unit of analysis, i.e., tokens
• Lemmatisation, POS-tagging
  – to generate lexical and contextual features
• Gazetteer matching
  – to generate semantic features
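To make the pipeline concrete, here is a sketch of a per-token feature extractor of the kind a CRF tagger consumes. The feature names follow the slides; the NLTK calls (sent_tokenize, word_tokenize, pos_tag) are one convenient stand-in for a sentence splitter, tokeniser and POS tagger, not the tools used in the actual workflows:

```python
import nltk  # assumes the punkt and POS-tagger models have been downloaded

GAZETTEER = {"Mexico", "Riyadh", "Tehran"}  # toy semantic resource

def token_features(tokens, pos_tags, t):
    """Feature dict for token t: lexical, contextual, orthographic
    and semantic features, as listed on the slides."""
    w = tokens[t]
    feats = {
        "word": w.lower(),
        "pos": pos_tags[t],
        "prefix3": w[:3], "suffix3": w[-3:],
        "initial-caps": w[:1].isupper(),
        "all-caps": w.isupper(),
        "all-digits": w.isdigit(),
        "contains-hyphen": "-" in w,
        "single-char": len(w) == 1,
        "in-gazetteer": w in GAZETTEER,
    }
    # contextual features: neighbouring words in a small window
    for off in (-2, -1, 1, 2):
        if 0 <= t + off < len(tokens):
            feats["word[%+d]" % off] = tokens[t + off].lower()
    return feats

text = "The final five include Mexico City and Riyadh."
for sent in nltk.sent_tokenize(text):                 # sentence splitting
    tokens = nltk.word_tokenize(sent)                 # tokenisation
    pos_tags = [p for _, p in nltk.pos_tag(tokens)]   # POS-tagging
    X = [token_features(tokens, pos_tags, t) for t in range(len(tokens))]
    # X (plus BIO labels on training data) is the input a CRF toolkit
    # such as CRFsuite/NERsuite expects.
```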
Questions so far?

Argo: a generic text mining workbench (http://argo.nactem.ac.uk)
(System diagram: the Workflow Designer supports workflow diagramming over structured data; annotators/curators perform manual editing; developers contribute UIMA-compliant processing components; processing can run remotely.)

(Screenshots: the Workflows, Processes and Documents views; the Workflow Editor)

Components
• Readers
  – load corpora/document collections
  – provide support for various formats, e.g., plain text, XML, TSV, stand-off
• Analytics
  – natural language processing tools
  – tokenisers, POS taggers, parsers, named entity recognisers
• Consumers
  – serialisation to files (e.g., XML, TSV) and databases

(Screenshots: workflow configuration, execution and monitoring)

Visual inspection of results: the Manual Annotation Editor

Generating semantic metadata with NER workflows: biodiversity literature
• loads the BHL corpus (XML)
• extracts the text body from the relevant XML elements
• performs sentence splitting (LingPipe: http://alias-i.com/lingpipe)
• performs tokenisation, lemmatisation and POS-tagging (GENIA Tagger: http://www.nactem.ac.uk/GENIA/tagger)
• CRF-based biodiversity NER (NERsuite: http://nersuite.nlplab.org)
• removes unnecessary annotations
• launches an interface for visual inspection
• writes annotations to a search index

Entity types annotated: Taxon, Location, Habitat, Person, Temporal expression

Generating semantic metadata with NER workflows: medical archives
• CRF-based disease NER (NCBI Disease Corpus: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/)
• CRF-based chemical name NER (ChER: https://jcheminf.springeropen.com/articles/10.1186/1758-2946-7-S1-S6)

Questions so far?

Introduction to Search Indices
• A search engine is an information retrieval system designed to help find information stored on a computer system
• Intuitively (and simplistically): a query such as "caesar killed" is matched against the indexed documents d1, d2, …, dn, and the relevant ones (e.g., d8) are returned
• We will focus on Elasticsearch in this tutorial!

An overview of Elasticsearch
(Much of the following content is drawn from the Elasticsearch documentation.)
• Elasticsearch is an open-source distributed search (full-text or structured) and analytics engine:
  – timestamps or exact values
  – full-text search, synonym handling, scoring documents by relevance
  – analytics and aggregations over the same data in real time
• Notable examples:
  – Wikipedia (full-text search, highlighted snippets, search-as-you-type and did-you-mean suggestions)
  – The Guardian (combining visitor logs with social-network data to provide analytics)
  – Stack Overflow (full-text search with geolocation queries and more-like-this in Q&A)
  – GitHub (querying 130 billion lines of code)
• Elasticsearch can run on your laptop, or scale out to hundreds of servers and petabytes of data

An overview of Elasticsearch (cont.)
• built on top of Apache Lucene, a full-text search-engine library
• Lucene is arguably the most advanced, high-performance and fully featured search-engine library available
• Why not use Lucene directly, then?
  – complexity: it requires a deep understanding of IR concepts and of Lucene's inner workings
  – you need to work in Java and integrate Lucene directly with your application
  – Elasticsearch packages all this functionality into a standalone server that your application talks to via a RESTful API
  – it "works right out of the box": sensible defaults hide complicated search theory, while everything remains fully configurable and flexible

An overview of Elasticsearch
• Isn't Solr doing the same?
  – which one is better depends on the application
  – Elasticsearch was born in the age of REST APIs, so it is more aligned with Web 2.0 applications
  – in our case, the nested document structure made Elasticsearch a clear winner
  – http://solr-vs-elasticsearch.com

How to install Elasticsearch
• It's quite straightforward:
  – https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html
• For development and interactive querying, the recommended tool is Sense
  – also available as a Chrome extension
  – sends JSON data over HTTP
  – friendly syntax for the curl command

How to communicate with Elasticsearch?
• Java API
  – used within the Argo component
• RESTful API
  – used for the examples here
• We will follow a 'learn from example' philosophy in this tutorial
  – only emphasising important aspects of the query syntax

Elasticsearch key concepts
• Document oriented
  – similar to the NoSQL concept of a document
  – intuitively, a document is analogous to an object in OO programming
  – why? No need to squeeze or flatten your object into a table (usually one field per column), losing its richness
• JSON
  – the serialisation format for documents

Elasticsearch key concepts
• Glossary:
  – Index:
    • analogous to a database in SQL and NoSQL
    • can contain multiple types
  – Type:
    • analogous to a table (SQL) or a collection (MongoDB)
    • can contain multiple documents
  – Document:
    • analogous to a row (SQL)
    • can contain multiple fields
  – Field:
    • analogous to a column (SQL)
    • each field is associated with a field type: 'string', 'date', 'integer'
• "Index" is an overloaded word
  – used as a noun, as a verb, and in "inverted index"

Querying Elasticsearch
• We already ran Argo workflows, which inserted data into Elasticsearch
• Let's have a look at the existing indices…
• Let's search for all documents in an index…
  – format of the response
  – pagination

Querying Elasticsearch
• Let's refine the query, searching for a specific term…
• Let's search for entities…

Querying Elasticsearch using Sense (screenshot)

Some caveats
• There is no need to define a mapping (i.e., a schema)
  – Elasticsearch tries to guess it ("works out of the box")
  – but in most cases it is necessary to define one:
    • to declare nested objects as such (e.g., 'metadata')
    • to declare fields that do not need text processing (e.g., metadata fields)
• Let's have a look at our current mappings…
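In the spirit of the querying slides above, here is a sketch of what such requests look like over the RESTful API, sent from Python; the index and field names (bhl, text, metadata, metadata.type) are placeholders, not the actual index layout used in the tutorial:

```python
import json
import requests  # assumes an Elasticsearch node on localhost:9200

ES = "http://localhost:9200"

def search(index, body):
    """POST a query body to the index's _search endpoint."""
    r = requests.post("%s/%s/_search" % (ES, index),
                      headers={"Content-Type": "application/json"},
                      data=json.dumps(body))
    return r.json()

# 1. All documents in an index (paginated: 10 hits from offset 0).
print(search("bhl", {"query": {"match_all": {}}, "from": 0, "size": 10}))

# 2. Refine with a specific term in the full text.
print(search("bhl", {"query": {"match": {"text": "Aquila"}}}))

# 3. Search on the semantic metadata: a nested query over entity
#    annotations (requires 'metadata' to be mapped as a nested object).
print(search("bhl", {
    "query": {"nested": {
        "path": "metadata",
        "query": {"bool": {"must": [
            {"match": {"metadata.text": "cold"}},
            {"match": {"metadata.type": "Disease"}},
        ]}},
    }}
}))
```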
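And a sketch of an explicit mapping addressing the two caveats above: declaring 'metadata' as a nested object and keeping its fields out of full-text analysis. Again the index layout is illustrative, and the syntax shown ('string' with "index": "not_analyzed") is the pre-5.x one current at the time of this tutorial:

```python
import json
import requests

MAPPING = {
    "mappings": {
        "page": {                             # document type
            "properties": {
                "text": {"type": "string"},   # analysed: full-text search
                "metadata": {
                    "type": "nested",         # entity annotations as nested docs
                    "properties": {
                        # exact-match fields: no text processing
                        "text": {"type": "string", "index": "not_analyzed"},
                        "type": {"type": "string", "index": "not_analyzed"},
                    },
                },
            }
        }
    }
}

# Create the index with the explicit mapping (illustrative index name).
requests.put("http://localhost:9200/bhl",
             headers={"Content-Type": "application/json"},
             data=json.dumps(MAPPING))
```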
Much more…
• aggregations (faceting)
• horizontal scalability (sharding)
• sorting / relevance
• word proximity, partial matching, fuzzy matching, and language awareness
• geolocation and geohashes

Questions so far?

Applications: Disambiguation in the History of Medicine search system
• http://nactem.ac.uk/hom
• Archives
  – British Medical Journal articles (380,000)
  – London Medical Officer of Health reports (5,000)

Searching for "cold" based on keywords (screenshots)
• returns both "cold" as a medical condition and "cold" describing temperature

Searching for "cold" as a disease, based on semantic metadata (screenshot)

Applications: BHL Query Expansion
• http://nactem10.mib.man.ac.uk/va/MiBio/Search/queryExpansion.html?prot=thumb

Searching for "Aquila chrysaetos" (screenshot)

Searching for "Aquila chrysaetos": expanding with "Golden eagle" (screenshot)

Searching for "Aquila chrysaetos" in BHL (screenshot)

Conclusions
• Discussed challenges in information discovery and search
• Reviewed methods for NER
• Presented the Argo text mining workbench
• Extracted named entities, which are then indexed to facilitate semantic searches
• Presented the fundamentals of Elasticsearch: key concepts, search, mappings

Conclusions
• Illustrated some applications:
  – disambiguation in the History of Medicine system
  – improving recall in BHL
• Please get in touch with us if you're interested in applying Argo to your digital libraries!
  – riza.batista@manchester.ac.uk
  – axel.soto@manchester.ac.uk