Text-mining methods used for information extraction in plant scientific papers

2. From text to words

Knowledge model of entities: 16 entities

NER & Text segmentation

The processing order is important: it preserves words and sentences that do not follow the "classical" structure and would otherwise be split by these processes.

e.g. In the plant use-case, this is the case for gene names, which may contain numbers or punctuation inside the term:
- CRP810_1.3
- 1,2-DIOXYGENASE

In our case, we detect some entities using lexicons before the text segmentation.

NER: Named Entity Recognition

Entity Recognition

There are two types of entities to recognize:
- Named Entities (e.g. author names, genes, geographical locations …), denoted by rigid designators: NER Preprocessor
- Complex entities (e.g. development phase, pathway, tissue …), expressed in natural language: Entity Recognizer

NER Preprocessor

Aim: to annotate entities that are defined by rigid designators.
Examples: Person Name, Bibliographical reference, Gene, Protein and their families, RNA
Tools:
- Projection with Lexicon
- Regular Expressions

NER Preprocessor: Name detection

Named Entity Recognition: very useful for tagging persons, authors, organisations, geographical locations …
Tools: e.g. Stanford Named Entity Tagger (online: http://nlp.stanford.edu:8080/ner/process)
Potential tags: Organization, Location, Person
Example input:
“American Society of Plant Biologists MUCILAGE-MODIFIED4 Encodes a Putative Pectin Biosynthetic Enzyme Developmentally Regulated by APETALA2, TRANSPARENT TESTA GLABRA1, and GLABRA2 in the Arabidopsis Seed Coat1 Tamara L. Western2, Diana S. Young, Gillian H.”

NER Preprocessor with Regular Expressions

It may be useful to detect bibliographical references in text, which avoids some errors in the detection of other entities.
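As an illustration of this detection step, here is a minimal Python sketch that finds parenthesised author-year references and masks them before other entities are searched. It is only a sketch under stated assumptions: Python's built-in `re` module does not support `\p{L}`, so `[^\W\d_]` is used as a stand-in for "any letter"; the function names `find_citations` and `mask_citations` and the `<BIBREF>` placeholder are hypothetical and not part of the original pipeline.

```python
import re

# Assumption: Python's built-in `re` has no \p{L}; [^\W\d_] approximates
# "any Unicode letter". Hyphen, whitespace, dot and comma are also allowed
# inside the author part of a reference.
LETTER_OR_SEPARATOR = r"(?:[^\W\d_]|[-\s.,])"

# "(" + one or more "<authors> <year>[letter]" groups separated by ";" + ")"
CITATION = re.compile(
    r"\((?:" + LETTER_OR_SEPARATOR + r"+\s\d{4}[a-zA-Z]?[\s;]*)+?\)"
)


def find_citations(text):
    """Return every parenthesised author-year reference found in text."""
    return [match.group(0) for match in CITATION.finditer(text)]


def mask_citations(text, placeholder="<BIBREF>"):
    """Replace each reference with a placeholder (hypothetical convention)."""
    return CITATION.sub(placeholder, text)


if __name__ == "__main__":
    # Made-up sentence; only the citation strings come from the slides.
    sample = (
        "Seed maturation is controlled (Meinke et al., 1994) by several "
        "factors (Baumlein et al., 1994; Parcy et al., 1997)."
    )
    print(find_citations(sample))
    print(mask_citations(sample))
```

Masking the references first means that author names such as "Meinke" are no longer visible to later steps, which is one way to avoid tagging them as other entities.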
Bibliographical references are generally of the form: (Authora A., et al 2000)

An example of a pattern that could match such bibliographical references is:

\(([\p{L}-\s\.,]+\s\d{4}[a-zA-Z]?[\s;]*)+?\)

which matches:
- (Meinke et al., 1994)
- (Baumlein et al., 1994; Parcy et al., 1997)
- (Leung and Giraudat, 1998)

Explanation of this regular expression: https://regex101.com/r/ARHkEi/1

NER Preprocessor with Lexicon

Using a lexicon in "learning by rote" mode to detect known sequences of characters as entities.
e.g. words to be excluded from future predictions: stopwords
Lexicon: a, about, above, after, again, against, all, am, an, and, any, are, aren't, as, at, be, because, been ...

Using a lexicon in "learning by rote" mode for Gene / Protein detection: lexicon provided by TAIR annotation
Lexicon:
- AP2
- CRP810_1.3
- 1,2-Dioxygenase
- LEC 2

Morphosyntactic changes of the lexicon:
- e.g. adding a space between letters and numbers
- changes in hyphens and spaces
- ...
Lexicon:
- AP2
- CRP810_1.3
- 1,2-Dioxygenase
- LEC 2
New Lexicon:
- AP 2
- AP_2
- AP-2
- CRP810 1.3
- CRP 810 1.3
- ...

Parameterization of lexicon projection
e.g. CaseInsensitive: the match allows case substitutions on all characters
Lexicon:
- AP2
- CRP810_1.3
- 1,2-Dioxygenase
- LEC 2
could match, e.g.:
- ap2
- crp810_1.3
- 1,2-DIOXYGENASE
- Lec 2
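The two lexicon steps above (generating morphosyntactic variants, then projecting the lexicon with a case-insensitive option) can be sketched in Python as follows. This is an illustration under assumptions, not the tool used in the original pipeline: the names `variants` and `make_projector`, the choice of separators (space, underscore, hyphen), and the rule that a term must not be glued to other letters or digits are all hypothetical, and terms are assumed to be ASCII.

```python
import re

# Zero-width position between a letter and a digit (in either order).
LETTER_DIGIT_BOUNDARY = re.compile(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])")
SEPARATORS = (" ", "_", "-")  # assumption: the separators worth varying


def variants(term):
    """Generate simple morphosyntactic variants of one lexicon term."""
    forms = {term}
    # 1) Swap existing spaces, underscores and hyphens for each separator.
    for sep in SEPARATORS:
        forms.add(re.sub(r"[ _-]", sep, term))
    # 2) Insert a separator wherever a letter touches a digit (AP2 -> AP 2).
    for form in list(forms):
        for sep in SEPARATORS:
            forms.add(LETTER_DIGIT_BOUNDARY.sub(sep, form))
    return forms


def make_projector(lexicon, case_insensitive=True):
    """Build a function that returns (surface form, lexicon term) pairs."""
    normalise = str.lower if case_insensitive else (lambda s: s)
    canonical = {}  # normalised variant -> original lexicon term
    for term in lexicon:
        for form in variants(term):
            canonical.setdefault(normalise(form), term)
    # Longest variants first, so a long form wins over a shorter prefix.
    alternation = "|".join(
        re.escape(form) for form in sorted(canonical, key=len, reverse=True)
    )
    # The lookarounds keep a term from matching inside a longer token.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])",
        re.IGNORECASE if case_insensitive else 0,
    )

    def project(text):
        return [
            (m.group(0), canonical.get(normalise(m.group(0))))
            for m in pattern.finditer(text)
        ]

    return project


if __name__ == "__main__":
    lexicon = ["AP2", "CRP810_1.3", "1,2-Dioxygenase", "LEC 2"]
    project = make_projector(lexicon, case_insensitive=True)
    # Made-up sentence using the surface forms from the slides.
    print(project("Expression of ap2 and crp810_1.3 depends on Lec 2."))
```

Because the projection runs on the raw character string, terms such as CRP810_1.3 are matched as a whole before any tokenizer has a chance to split them at the underscore or the dot.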