A one-stop shop computing platform for text mining of scientific literature
FORCE2017, 27 Oct 2017 @ Berlin
Natalia Manola
Athena Research and Innovation Centre

Partners

PART I
The problem

The global research community generates ~2.5 million new scholarly articles per year (English only)
STM report (2015)
… one paper published every 12 seconds …
… 70,000 papers published on a single protein, the tumor suppressor p53
Spangler et al., Automated Hypothesis Generation Based on Mining Scientific Literature, 2014
FORCE2017 @ Berlin - Oct 27, 2017

How can we make sense of this data?

PART II
TDM - an emerging solution

Machine reading
• process textual sources, organise and classify in various dimensions, extract main (indexical) information items
… and “understanding”
• identify and extract entities and relations between entities, facilitate the transformation of unstructured textual sources into structured data
… and predicting
• enable the multidimensional analysis of structured data to extract meaningful insights and improve the ability to predict
LIBER conference - Patras, 5 July 2017

However, …
Multitude of solutions catering for different
• Text types: Newswire, Scientific Literature, Tweets/blogs, Patents, Clinical/medical records, Textbooks and monographs, Online forums, …
• Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …
• Tasks: Translation, Information Extraction, Semantic Search, Question Answering, Sentiment Analysis, Summarization, Knowledge Discovery, …
• Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
Creating a fragmented landscape

A complex and fragmented landscape
• Text Mining Researchers
• Computing Infrastructures
• Content Providers
• End Users

PART III
The components

1. Share content
• Document literature content
• Share in a meaningful way: what does Open Access really mean?
IPR and licensing
• Study IPR restrictions for reuse of sources as well as possible exceptions
• Promote clarity and standardisation of legal rights and obligations
Challenges
• Rights statements vs. open licenses (for repositories)
• No access to full text. We live in a metadata world
• No standard protocols, formats and APIs for access and retrieval
• No capacity to handle extra traffic

2. Share TDM services
• Document language processing/text mining services and workflows in a meaningful way for domain discipline researchers
• Document language/knowledge resources, data categories, taxonomies, provenance information
Interoperable services
• Common way of presenting annotated results
• Combine services into workflows
• Combine content and language resources with services and workflows
• Combine automatic and manual/crowdsourcing annotation services
IPR and licensing
• Translate the legal & policy aspects into specifications for lawful user-to-service and service-to-service interactions
Challenges
• Bring text miners close to the researchers' problems and needs
• Semantic interoperability (not just technical)

3. Use/Share computing resources
• Capacities and capabilities
Interoperable services at the lower level
• Common way of deploying operations/jobs
• Authentication and Authorisation services: Single Sign On (SSO)
• Accounting
Challenges
• Legal, organisational, …

PART III
The OpenMinted platform

Our services
• Register and discover TDM services and tools
• Link to content hubs - share corpora
• Run a TDM job
• Store, document, publish and share results (annotated corpora)
• Build your own service: combine components into a workflow and share

PART IV
Who is OpenMinted for

End users as consumers
• Domain specific researchers & research communities: rather novice users who want to find end-to-end services that fill their needs off the shelf. (>100,000)
• Application developers / RI data scientists: understand basic usage of NLP and TDM services, but not the details. They know how to connect components and which content they must work on to get the required results. They need to develop end-to-end applications. (>10,000)
• Infrastructure operators: agnostic to the internal specifics of TDM, but they need to integrate and operate TDM services in daily workflows. (<100)

Content and services contributors
For content
• Publishers and repository managers (research libraries). (<1,000)
For services
• Expert language technology oriented people, who use specific technologies and frameworks to develop and enhance their services. (<500)
• Non-NLP-expert developers, creating TDM modules based on off-the-shelf libraries and tools (e.g. Python, Jupyter). They are not familiar with NLP frameworks and terminology but are eager to publish their small services. (<5,000)

PART V
Where we are now

Beta release
Real time
• Building corpora: OpenAIRE, CORE
• Uploading own corpora
• Registering a service
• Running a service
• Viewing annotations
• Storing results in Zenodo

THANK YOU!
Questions?
Natalia Manola
natalia@di.uoa.gr