Dr. Martin Krallinger, Spanish National Cancer Research Centre
Text mining

Text Mining

Relation extraction

Information Extraction
• Information extraction (IE): automatically extracting structured information from unstructured machine-readable documents
• IE approaches often focus on restricted domains (target domain)
• IE should facilitate logical reasoning to generate inferences from the structured output generated by those systems
• Assumption: entities and events in documents are described in a similar way, i.e. there are conventional, semantic, and syntactic constraints on how to express them

Information Extraction
• Typically IE systems simplify the problem by considering the events as a sort of template
• Templates are designed as a case frame or set of case frames, which in turn hold the information extracted from the documents
• Templates usually have slots for the entities and their relations
• IE systems need to understand the document at a level that allows filling the slots of the template with the correct information

Template: frames and slot filling
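Slot filling can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not part of any system named in these slides: the `InteractionTemplate` frame, its three slots, and the toy pattern with its short verb list are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InteractionTemplate:
    """Hypothetical case frame: two entity slots and one relation slot."""
    agent: Optional[str] = None
    relation: Optional[str] = None
    target: Optional[str] = None

    def is_filled(self) -> bool:
        # The template counts as filled only when every slot has a value.
        return None not in (self.agent, self.relation, self.target)


# Toy pattern "<agent> <verb> <target>"; the verb list is illustrative only.
PATTERN = re.compile(
    r"\b(?P<agent>[A-Za-z][A-Za-z0-9]+) "
    r"(?P<relation>binds|activates|inhibits) "
    r"(?P<target>[A-Za-z][A-Za-z0-9]+)\b"
)


def fill_template(sentence: str) -> InteractionTemplate:
    """Fill the slots from the first pattern match; leave them empty otherwise."""
    template = InteractionTemplate()
    match = PATTERN.search(sentence)
    if match:
        template.agent = match.group("agent")
        template.relation = match.group("relation")
        template.target = match.group("target")
    return template


if __name__ == "__main__":
    print(fill_template("We show that Mdm2 inhibits p53 in vitro."))
```

A real IE system would fill the slots from named entity recognition and parsing output rather than from a single surface pattern; the sketch only shows the frame-and-slot data structure.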

IE historical view
• Dates back to the late 1970s
• Commercial system (mid-1980s): JASPER, built for Reuters to provide real-time financial news to financial traders
• IE was strongly influenced by MUC*
• MUC: community challenge and conference, with each edition focused on a domain:
  • MUC-1 (1987), MUC-2 (1989): Naval operations
  • MUC-3 (1991), MUC-4 (1992): Terrorism in Latin America
  • MUC-5 (1993): Joint ventures and microelectronics
  • MUC-6 (1995): News articles on management
  • MUC-7 (1998): Satellite launch
MUC*: Message Understanding Conferences

Main IE strategies
Statistical associations:
• Association statistics (mutual information, chi-square, ...)
Hand-written regular expressions/rules:
• Use linguistic (syntactic/grammatical) aspects
• Use semantic aspects
• Often define trigger terms relevant for relations (e.g. ‘interact*’, ‘bind*’ for PPIs)
Using machine learning:
• Flat features, sentence classifiers
• Linguistic kernels (syntactic trees or shallow parsing for features)

IE subtasks and components (I)
• Named entity recognition
• Coreference resolution: detection of coreference and anaphoric relations between entities (associations between previously extracted named entities)
• Relationship extraction: identifying relations between entities/terms

IE subtasks and components (II)
• Automatic term recognition (ATR): finding relevant terms from documents
• Negation detection: affirmed and negated phrases (e.g. NegEx)

Relation extraction
Co-occurrence (frequency, MI, ...), association rules, textual patterns (e.g. interaction verbs, frames), shallow parsing, full parsing, machine learning (sentence classifiers), ...

Simplified Information Extraction pipeline

Example IE pipeline: regulation

Syntactic parser: Enju
http://www.nactem.ac.uk/enju/demo.html

Syntactic parsing
• Sentence (syntactic) parsing: divide a sentence (string of words) into its constituents to generate a parse tree that displays syntactic relations between words
• Method of understanding the meaning of a sentence
• Visualized with syntactic trees/diagrams

MEDIE: subject-verb-object relations
http://www.nactem.ac.uk/medie/

Example architecture: Relation extraction with NLP
From: Literature Mining and Systems Biology, by Lars Juhl Jensen
• Tokenization
  • Entity recognition with synonyms list
  • Word boundaries (multi words)
  • Sentence boundaries (abbreviations)
• Part-of-speech tagging
  • TreeTagger trained on GENIA
• Semantic labeling
  • Dictionary of regular expressions
• Entity and relation chunking
  • Rule-based system implemented in CASS

Gene regulation events: textual annotation

Qualifying co-mentions: tri-co mentions

iHOP
http://www.ihop-net.org/UniPub/iHOP/
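
The co-occurrence strategy listed under relation extraction (frequency, MI) can be sketched as sentence-level co-mention counting scored with pointwise mutual information. This is only an illustrative sketch: the function name `pmi_scores`, the whitespace tokenization, and the toy sentences are assumptions, and none of the tools in these slides is claimed to score co-mentions this way.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences, entities):
    """Score entity pairs by pointwise mutual information over sentences.

    PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), where probabilities are
    estimated as the fraction of sentences mentioning the entity or pair.
    """
    n = len(sentences)
    single = Counter()  # number of sentences mentioning each entity
    pair = Counter()    # number of sentences mentioning both entities
    for sentence in sentences:
        # Naive whitespace tokenization; punctuation is not handled.
        tokens = set(sentence.lower().split())
        present = sorted(e for e in entities if e.lower() in tokens)
        single.update(present)
        pair.update(combinations(present, 2))
    return {
        (a, b): math.log2((n_ab * n) / (single[a] * single[b]))
        for (a, b), n_ab in pair.items()
    }


if __name__ == "__main__":
    toy_sentences = [
        "mdm2 binds p53",
        "mdm2 and p53 interact",
        "actin is abundant",
        "p53 is a tumour suppressor",
    ]
    for pair_key, score in pmi_scores(toy_sentences, ["mdm2", "p53", "actin"]).items():
        print(pair_key, round(score, 3))
```

Pairs that never share a sentence get no score at all, which is the main limitation of pure co-occurrence: it signals association but not the type or direction of the relation.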

iHOP system: Defining information
[Screenshot: "Defining Information for this Gene"; colour legend: main gene, associated genes, relevant biomedical terms, compounds]

iHOP system: interaction information

Info-PubMed: PPI

IE for protein interactions: PPLook
http://meta.usc.edu/softs/PPLook/

STRING: Data integration: from literature to databases to experiments

Bio-entities to terms: CoPub Mapper
http://services.nbic.nl/copub5

COREMINE
http://www.coremine.com

Highly specialized IE: eGIFT
http://biotm.cis.udel.edu/eGIFT/index.php

IE: GoPubMed

PolySearch
http://polysearch.cs.ualberta.ca/

Highly specialized IE: E3Miner
http://e3miner.biopathway.org/e3miner.html

Highly specialized IE: eFIP
http://biotm.cis.udel.edu/eFIP/index.php

Highly specialized IE: miRTex
http://research.bioinformatics.udel.edu/miRTex/

Highly specialized IE
http://cbdm.mdc-berlin.de/tools/pescador/index.php

Chilibot
http://www.chilibot.net/

[Figure: diagram with nodes labelled A, C, B]

Arrowsmith
http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi

FACTA+ Direct relations: co-occurrences
http://www.nactem.ac.uk/facta/

FACTA+ Indirect relations

Question Answering
• Humans formulate questions using natural language.
• Example: What are the molecular functions of Glycogenin?
• QA: automatic generation of answers to queries in the form of natural language (NL) expressions from document collections.
• Most systems are limited to generic literature or newswire.
• QA is difficult here: heterogeneous, poorly formalized domain, new scientific terms
• Ad hoc retrieval task of the TREC Genomics Track 2005.
• Galitsky system (semantic skeletons (SSK), logical programming).

Question Answering simplified architecture
From: New directions in Question Answering, Mark Maybury
[Figure: a HUMAN poses questions (How, Why, What, Who, When, Where); stages shown: Question analysis, Source Retrieval, Answer Extraction, Answer Presentation; sources: documents, web pages, articles, reports, databases]

Question Answering for Alzheimer domain
http://celct.fbk.eu/QA4MRE/index.php?page=Pages/biomedicalTask.html

http://askmedline.nlm.nih.gov/ask/ask.php

http://www.wolframalpha.com/

Text mining at CNIO
http://limtox.bioinfo.cnio.es/

http://melanomamine.bioinfo.cnio.es/