Text-mining methods used for information extraction in plant scientific papers
CC-BY

3. From words to entity recognition

NER: Named Entity Recognition

Word Tagger
For each word, different tools can be applied to tag different features of the word, such as its form, POS (part of speech), lemma, stem …
e.g. “...expressed during early embryogenesis”
e.g. with NLTK tools:
➢ Form: expressed during early embryogenesis
➢ POS tagging: expressed|VBN during|IN early|JJ embryogenesis|NN
➢ WordNet lemma: express during early embryogenesis
➢ Snowball stemmer: express dure earli embryogenesi

Entity Recognizer

Entity Recognition with Regular Expressions
Interpretation of DNA sequences in text: a sequence of at least 3 characters among A, C, T, G, N, which may begin or end with 5’ or 3’
Definition: a short DNA sequence that corresponds to a binding site for a protein. It can be expressed as the target of genes (AFL target) or as a DNA sequence (AACA, (C/T)ACGTGGC, CCATTTTTGG …)
e.g. http://arabidopsis.med.ohio-state.edu/AtcisDB/bindingsites.html
Example of boxes:
➢ 5'ACGTACGTAATG'3
➢ AAAAAAACG
➢ (C/G/T)ACGTG(G/T)(A/C)
Regular expression matching these boxes:
(5.{0,3})?((A|C|G|T|N|\/|\(|\)){3,})+(.3{0,3}.)?
Explanation of this regular expression: https://regex101.com/r/nhgHxb/1

Entity Recognition with Pattern Matching
Prerequisite: detection of some entities with a lexicon or regular expressions
Aim: predict entities that follow patterns
e.g.: [Gene] transcript   [Gene] level   [RNA]   [Gene] mRNA

Entity Recognition with Machine Learning
Prerequisite: examples from the manual annotations of the BioNLP-ST SeeDev challenge
Aim: predict entities by learning features from manually annotated examples with a mathematical algorithm
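
The regular-expression approach to DNA boxes described in the slides can be illustrated with a minimal Python sketch. It is not the expression from the slides: `BOX_RE` is a simplified pattern written for this example, and the names `APOS`, `UNIT`, `BOX_RE` and `find_boxes` are hypothetical. It assumes that a box is at least three positions long, that each position is either a single base (A, C, G, T, N) or a parenthesised alternative such as (C/G/T), and that the sequence may carry an optional 5' or 3' marker at either end.

```python
import re

# Assumption: straight and curly apostrophes are both accepted in 5'/3' markers.
APOS = "['’]"

# One position of a box: a single base, or alternatives in parentheses, e.g. (C/G/T).
UNIT = r"(?:[ACGTN]|\([ACGTN](?:/[ACGTN])+\))"

# Simplified pattern written for this sketch (not the expression from the slides).
BOX_RE = re.compile(
    r"(?<![A-Za-z0-9])"                   # must not be glued to a preceding letter or digit
    rf"(?:[53]{APOS}-?)?"                 # optional leading 5' or 3'
    rf"{UNIT}{{3,}}"                      # at least three positions
    rf"(?:-?(?:{APOS}[53]|[53]{APOS}))?"  # optional trailing '3 / 3' / '5 / 5'
    r"(?![A-Za-z0-9])"                    # must not be glued to a following letter or digit
)


def find_boxes(text: str) -> list[str]:
    """Return every substring of `text` that looks like a DNA box."""
    return [match.group(0) for match in BOX_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "Boxes such as 5'ACGTACGTAATG'3, AAAAAAACG and "
        "(C/G/T)ACGTG(G/T)(A/C) were detected in the promoter."
    )
    for box in find_boxes(sample):
        print(box)
```

The look-around assertions keep the pattern from firing inside ordinary words, but an upper-case word built only from the letters A, C, G, T and N (for example an acronym) would still be reported, so in practice such a recognizer would probably need a lexicon or context filter on top.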