Review

Text mining resources for the life sciences

Piotr Przybyła1,†, Matthew Shardlow1,*,†, Sophie Aubin2, Robert Bossy2, Richard Eckart de Castilho3, Stelios Piperidis4, John McNaught1 and Sophia Ananiadou1

1National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK, 2Institut National de la Recherche Agronomique, Jouy-en-Josas, France, 3Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt, Darmstadt, Germany and 4Institute for Language and Speech Processing, Athena Research Center, Athens, Greece

*Corresponding author: Tel: +44 161 306 3094; E-mail: matthew.shardlow@manchester.ac.uk

Citation details: Przybyła,P., Shardlow,M., Aubin,S. et al. Text mining resources for the life sciences. Database (2016) Vol. 2016: article ID baw145; doi:10.1093/database/baw145

†These authors contributed equally to this work.

Received 8 September 2016; Revised 13 October 2016; Accepted 17 October 2016

Abstract

Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability.

Introduction

Text mining empowers the researcher to rapidly extract relevant information from vast quantities of literature. Despite the power of this technology, the novice user may find text mining unapproachable, with an overload of resources, jargon, services, tools and frameworks. The focus of this article, and one of the obstacles that have limited the widespread uptake of text mining, is a lack of specialist knowledge about text mining among those researchers who could most benefit from its results. Our contributions in this article are intended to inform and equip these researchers such that they will be better placed to take full advantage of the panoply of resources available for advanced text mining in the life sciences. In this context, resources refers to anything that could help in such a process, such as annotation formats, content sharing mechanisms, tools and services for text processing, knowledge bases and the accepted standards associated with these.

As of 2016, PubMed, the most widely used database of biomedical literature, contains over 26 million citations1. It is growing constantly: over 800 000 articles are added yearly2 and this number has substantially increased over the previous few years (1). This constant increase is recognized as a major challenge for evidence-based medicine (2), as well as other fields (3). One of the tools for tackling this problem is text mining (TM). However, as many surveys have shown (4–6) the TM landscape is fragmented and, at its worst, it can be hostile to the uninitiated, giving rise to such questions as: Where should I look for resources? How should I assemble and manage my TM workflows? How should I encode and store their output? How can I ensure that others will be able to use the TM outputs I wish to share from my research? Moreover, one may be drawn to popular standards, while lesser known standards that may be more suitable go unnoticed.

When taking stock of the current literature, we identified three key areas which we address through this article. First, there is a lack of aggregated knowledge sources, entailing search through dozens of separate resources to find those that are appropriate to one's work. We have addressed this need by providing numerous tables that give an overview of key resources throughout this work. At the end of each section we also provide a table of repositories that can be browsed to find further resources of interest to the reader. Second, we found that there was no clear outline of the text mining process in the literature. The structure of this survey follows the text mining process, beginning with content discovery, moving to annotation formats and ending with workflow management systems that enable text mining in the life sciences. Third, we found a lack of focus on interoperability, i.e. the ability of resources to share information and cooperate, which is achieved by using widely accepted standards. Although the interoperability issue is known to most researchers, not enough is done in the literature to promote interoperable resources to the communities who may benefit from them. Interoperability was named as one of the major obstacles when implementing text mining in biocuration workflows (7). We have placed a high focus on interoperability throughout this report, suggesting where and when interoperable resources can be used. Interoperability is not appropriate for every task, but we take the view that in these cases the user should know about the interoperable options and make a conscious choice.

Interoperability is vital at multiple levels of granularity, such as the way that a TM tool encodes input and output (8–10); the format of metadata that is associated with a resource in a repository (11); or the licences associated with software and content. Each of these levels must be addressed if we wish to promote a culture of true interoperability within TM. We have addressed several levels that are relevant to the text-mining life scientist as seen through the sections of this report.
This paper has been split into the following sections to categorize and help understand the existing technologies: • In ‘Content discovery’ section, we start by explaining how and where to get the input text data, i.e. corpora of publications. We describe metadata schemata and vocab- ularies and discuss the most popular publication repositories. • In ‘Knowledge encoding’ section, we show how the ori- ginal document is encoded with additional information, describing annotation formats. We also outline formats for knowledge resources that are frequently employed in the TM process and describe a few examples of such databases, especially from the area of life sciences. We end by discussing content repositories that may be of use to the reader. • In ‘Tools and services’ section, we look at methods of annotating and transforming the acquired data, includ- ing software tools, workflow engines and web services. We also describe repositories that let users easily dis- cover such resources. • Finally, in ‘Discussion’ section, we discuss the landscape described in previous sections, focusing on interoperabil- ity and its importance for TM in the life sciences. Since the subject matter described in this survey is vast, the reader may wish to initially explore those parts that are pertinent to their own research. Biocurators with no back- ground in text mining may wish to begin with points 2.4, describing repositories of publications; ‘Annotation mod- els’ section explaining annotation formats; the ‘Useful knowledge resources’ section including most popular knowledge resources and ‘Text mining workflow manage- ment systems’ section introducing text mining workflow systems. Nevertheless, both biocurators and text mining experts should be able to take value from reading the sur- vey as a whole. We hope that as the novice biocurator grows in their knowledge of the field, both through reading this report and other materials that they will come to treat the information herein as a useful point of reference. We have categorized the information into structured tables 1 http://www.ncbi.nlm.nih.gov/pubmed 2 https://www.nlm.nih.gov/bsd/stats/cit_added.html Page 2 of 30 Database, Vol. 2016, Article ID baw145 throughout to help the reader quickly find and compare the information that they seek. Content discovery In this section, we will explore how users can access and retrieve publications that are relevant to their research. Although many other document types can be mined (e.g. Electronic Health Records, Patent Applications or Tweets), in this work we have focused on scholarly publi- cations. This is because there is a large amount of informa- tion to be mined from such literature, making it a very good starting point in most fields. Many of the resources and services described in this article can be easily trans- ferred to other types of content. We explain and discuss repositories, and aggregators, metadata, application pro- files and vocabularies. We mention several web-accessible repositories, aggregators and their features, which are con- sidered interesting and useful for the life sciences researcher. Publications are usually stored in searchable structured databases typically called repositories. Although many repositories stand alone, an aggregator may connect sev- eral repositories in looser or tighter networks by aggregat- ing publications, or information about them, from other repositories in the network. The internal mechanism of a repository relies on a set of structured labels, known as metadata. 
Metadata can generally be defined as data used to describe data, and as such metadata may themselves be stored and managed in repositories usually called metadata repositories (or registries, or simply catalogues). Usually, aggregators act as metadata repositories in that they har- vest metadata from repositories and make them available to facilitate the search and discovery of publications. Metadata for scientific articles, e.g. should include authors’ names and affiliations, date of publication, journal or con- ference name, publisher, sometimes scientific domain or subdomain, etc., in addition to article title and an appro- priate identifier. As the metadata needs of particular appli- cations or scientific communities may vary, metadata can be combined into application profiles. Application profiles specify and describe the metadata used for particular appli- cations, including, e.g. refinements in the definitions as well as the format and range of values permitted for spe- cific elements. Usually, aggregators design and make use of application profiles. We particularly focus on the format and vocabulary of the metadata used in repositories. Without a proper understanding of the operational prin- ciples of the repositories and/or aggregators, and the meta- data they use to document their content, users may struggle to retrieve publications. There is a wide range of metadata schemata and appli- cation profiles used for the description of content and re- sources. This variety is, to a great extent, due to the diverse needs and requirements of the communities for which they are developed. Thus, schemata for publications originally came from publishers, librarians and archivists. Currently, we also witness the cross-domain activities of the various scientific communities, as the objects of their interest ex- pand to those of the other communities, e.g. in order to link publications of different domains, publications and the supplementary material described in them, or services which can be used for processing publications and/or other datasets. Differences between the schemata are attested at vari- ous levels such as: • types of information (e.g. identification, resource typing, provenance, classification, licensing, etc.) covered by the schema; • the granularity of the schema, ranging from detailed schemata to very general descriptions, including manda- tory, recommended and optional description elements; • degree of freedom allowed for particular elements (e.g. use of free text statements vs. recommended values vs. entirely controlled vocabularies) • use of alternative names for the same element(s) or use of the same name with different semantics. All of the above features, especially the degree of granu- larity and control of the element values, influence the dis- coverability of publications via their metadata and, in consequence, the applicability and performance of the TM process. Metadata schemata and profiles To make publications and other types of content, data and services discoverable we use metadata. Metadata will en- able the biocurator to search repositories and aggregators for content that is appropriate for his/her purposes using specific metadata elements or filtering the retrieved results (i.e. publications, language and knowledge resources or TM tools and services) using specific values of each meta- data element. So, e.g. 
a biocurator wishing to compile a collection of publications can search a repository or aggre- gator for publications from a certain publishing body, in a particular language and topic, while he can further filter the retrieved results for those that are available under open access rights. This section presents the most common meta- data schemata and application profiles used for the de- scription of publications in the life sciences domain. In order to make metadata information interoperable, we use Database, Vol. 2016, Article ID baw145 Page 3 of 30 schemata that define common encoding formats and vocabularies (Table 1). Dublin Core (12) is a widely used metadata schema, best suited to resource description in the general domain. It consists of 15 generic basic elements used for describing any kind of digital resource. DCMI Metadata Terms con- sist of the full set of metadata vocabularies used in combin- ation with terms from other compatible vocabularies in the context of application profiles. For many metadata sche- mata, there are mappings to DC elements for metadata ex- change. DC is often criticized as being too minimal for expressing more elaborate descriptions required by specific communities. To remedy this defect, DC is usually ex- tended according to DCMI specifications. JATS (13) is a suite of metadata schemata for publica- tions, originally based on the National Library of Medicine (NLM) Journal Archiving and Interchange Tag Suite. The most recent3, defines a set of XML elements and attributes for tagging journal articles both with external (biblio- graphic) and internal (tagging the actual textual content of an article) metadata. In the life sciences area, DC and JATS are supported by PubMed Central4 (PMC) as formats for metadata retrieval and harvesting, as well as for authoring, publishing and archiving. DataCite (14) represents an initiative for a metadata schema, along the same lines as JATS, aspiring to cover all types of research data, while it is more tuned to metadata- based description of publications. It places a strong em- phasis on citation of research data in general, not only including publications, and for this reason it strongly supports the use of persistent identifiers in the form of digi- tal object identifiers (DOIs, see ‘Mechanisms used for the identification of resources’ section). Similarly, CrossRef (15) is a registry for scholarly publications, stretching out to research datasets, documented with basic bibliographic information and heavily also relying on DOIs for citation and attribution purposes. BibJSON is a convention for rep- resenting bibliographic metadata in JSON facilitating the sharing and use of such metadata. The Comprehensive Knowledge Archive Network (CKAN) (16) is essentially an open data management soft- ware solution, very popular among the public sector open data communities. Intending to be an inclusive solution, CKAN features a generic, albeit limited, set of metadata elements covering many types of datasets. Similarly, Common European Research Information Format (CERIF) (17) proposes a data model catering for the description of research information and all entities and relationships among them (researchers, publications, datasets, etc.) Building on metadata schemata, many initiatives have defined their own application profiles. 
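To make these schemata more concrete, the sketch below retrieves a single record in the simple Dublin Core (oai_dc) serialization over OAI-PMH, the harvesting protocol used by the aggregators discussed later in this section, and prints its elements. It is a minimal illustration only: the endpoint and record identifier are placeholders to be substituted with those of a real repository (such as the PMC OAI-PMH service referenced above), and error handling is omitted.

    # Minimal OAI-PMH GetRecord request returning Dublin Core (oai_dc) metadata.
    # The endpoint and identifier are placeholders for a real repository.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    ENDPOINT = "https://example.org/oai"            # hypothetical OAI-PMH base URL
    IDENTIFIER = "oai:example.org:record-12345"     # hypothetical record identifier

    params = {"verb": "GetRecord", "identifier": IDENTIFIER, "metadataPrefix": "oai_dc"}
    with urllib.request.urlopen(ENDPOINT + "?" + urllib.parse.urlencode(params)) as response:
        tree = ET.parse(response)

    # Dublin Core elements (title, creator, subject, date, rights, ...) are in the dc namespace.
    DC = "{http://purl.org/dc/elements/1.1/}"
    for element in tree.iter():
        if element.tag.startswith(DC):
            print(element.tag[len(DC):], ":", (element.text or "").strip())

The fifteen basic DC elements described above appear as children of the oai_dc container in such a harvested record, which is what makes them easy to extract uniformly across repositories.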
As indicated in the previous section, application profiles specify the metadata terms that an information provider uses in its metadata, identify the terms used to describe a resource and may also provide information about term usage by specifying vocabularies or other restrictions on potential values for metadata elements; they may go further to describe policies, as well as operational and legal frameworks. OpenAIRE5, for example, has proposed and used an application profile and harvests metadata from various sources, notably repositories of scholarly publications in OAI_DC format, data archives in DataCite format, etc., while they are currently considering publishing homogenized OpenAIRE metadata as Linked Open Data (LOD). RIOXX6 is a similar application profile targeting mainly open access repositories in the UK. It is also based on DC with references to other vocabularies, like JAV7, while adhering to many of the OpenAIRE guidelines.

Table 1. A comparison of popular metadata schemata, used to encode information about publications
Name | Last updated | Domain | Main use
Dublin Core (DC)/DC Metadata Initiative (DCMI) (a) | June 2012 | Generic | Widely accepted standard
Journal Article Tag Suite (JATS) (b) | Actively Maintained | Journal Articles | Open access journals
DataCite (c) | Actively Maintained | Research Data and Publications | Citations
CrossRef (d) | Actively Maintained | Research Data and Publications | Citations
BibJSON (e) | Actively Maintained | Bibliographic information | Bibliographic metadata
CERIF (f) | Actively Maintained | Research Information | European research
CKAN (g) | Actively Maintained | Generic | Data management portals
Different formats describe different types of items as shown in the 'Domain' and 'Main Use' columns.
(a) http://dublincore.org/
(b) https://jats.nlm.nih.gov/
(c) https://www.datacite.org/
(d) http://www.crossref.org/
(e) http://okfnlabs.org/bibjson/
(f) http://www.eurocris.org/cerif/main-features-cerif
(g) http://ckan.org/

3 http://www.niso.org/standards/z39-96-2015/ (November 2015) NISO JATS Version 1.1 (ANSI/NISO Z39.96-2015)
4 http://www.ncbi.nlm.nih.gov/pmc/tools/oai/
5 https://www.openaire.eu/

Metadata schemata and profiles for language and knowledge resources

Text mining processes are closely related to language and knowledge resources that are either used as conceptual reference material for annotating text (e.g. scientific publications) or as resources for the creation and operation of text mining tools and services. Language/knowledge resources have been, in the past three decades, recognized as the raw materials for language processing technologies and as one of the key strands of text mining research and development. In order to cover both the varieties of language use and the requirements of linguistic research, several initiatives have proposed metadata schemata for documenting language resources. Using such metadata, biocurators can search repositories and aggregators for vocabularies, terminologies, thesauri, corpora made up of (annotated) scientific publications or other types of content as well as text mining tools and services pertinent to the life sciences domain. Table 2 lists some of the most widespread of these schemata.

The Text Encoding Initiative (TEI) (18) represents a 'standard for the representation of texts in digital form', currently the most widely used format in the area of the humanities.
To some extent similarly to JATS, the TEI P5 guidelines8 include recommendations both for the bibliographic-style description of texts as well as for the representation of the internal structure of the texts themselves (form and content) and their annotations.

The Common Language Resources and Technology Infrastructure (CLARIN) Research Infrastructure (19) has proposed CMDI, a flexible mechanism for creating, storing and using various metadata schemata, in an attempt to accommodate the diverse needs of language technology and text mining research and to promote interoperability. Along the same lines, the META-SHARE metadata schema (11) is used in the META-SHARE infrastructure (20) to describe all kinds of language resources including datasets (e.g. corpora, ontologies, computational lexica, grammars, language models, etc.) and language processing tools/services (e.g. parsers, annotators, term extractors, etc.). A subset of these metadata components is common to all resource types (containing administrative information, e.g. contact points, identification details, versioning, etc.), while metadata referring to technical information (e.g. text format, size and language(s) for corpora, requirements for the input and output of tools/services, etc.) are specific to each resource type.

Finally, the LRE Map (21) features a minimal, yet practical for its purposes, metadata schema that is used for crowdsourcing metadata information for language resources, including datasets and software tools and services, directly by authors who submit their publications to the LREC Conferences9.

To facilitate interoperability between metadata schemata and the repositories that use them, including those described above, the World Wide Web Consortium (W3C) has published the Data Catalog (DCAT) Vocabulary10. DCAT is an RDF vocabulary catering for the description of catalogues, catalogue records, their datasets as well as their forms of distribution, e.g. as downloadable file, as web service that provides the data, etc. DCAT is now extensively used for government data catalogues and is also growing in popularity in the wider Linked Data community.

Table 2. A comparison of metadata schemata used for documenting language resources
Name | Last Updated | Domain | Main use
TEI (a) | Actively Maintained | Documents | Encoding text corpora
CMDI (b) | Actively Maintained | Generic | Infrastructure for metadata profiles
META-SHARE (c) | Actively Maintained | Language Resources | Metadata schema for language resources and services documentation
LRE Map (d) | Updated at each LREC conference (biennial) | Language Resources | Metadata schema for language resources
(a) http://www.tei-c.org/
(b) http://www.clarin.eu/content/component-metadata, http://www.clarin.eu/ccr/
(c) http://www.meta-net.eu/meta-share/metadata-schema, http://www.meta-share.org/portal/knowledgebase/home
(d) http://www.resourcebook.eu/searchll.php

6 http://www.rioxx.net/profiles/v2-0-final/
7 http://www.niso.org/apps/group_public/project/details.php?project_id=117
8 http://www.tei-c.org/Guidelines/P5/
9 http://www.lrec-conf.org/
10 https://www.w3.org/TR/vocab-dcat/

Vocabularies and ontologies for describing specific information types

Metadata schemata are not enough for comprehensive description of resources, as we also need to know what particular fields mean.
For example, different metadata schemata for scientific articles may include a field called ‘subject’, or ‘domain’ but this raises questions: Are ‘sub- ject’ and ‘domain’ intended to codify the same informa- tion? Are the values for these fields provided freely by the authors, or do they have to be selected from a controlled vocabulary or an ontology? Such questions are usually ad- dressed when designing application profiles where inter alia vocabularies and/or ontologies associated with par- ticular fields are specified. The resources in Table 3, mainly controlled vocabularies, authority lists and ontologies, are presented because they are used widely and can be useful for improving existing schemata in recording information. The vocabularies of Table 3 represent variably struc- tured conceptualizations of different aspects in the lifecycle of a resource (or in general of a content item) from basic bibliographic description to its reuse and associated intel- lectual property and distribution rights. Focusing on the medical domain, Medical Subject Headings (MeSH) (22) is one of the most widely used con- trolled vocabularies for classification, and EDAM (EMBRACE Data and Methods) (23) is an ontology of well established, familiar concepts that are prevalent within bioinformatics, including types of data and data identifiers, data formats, operations and topics. A range of controlled vocabularies that have evolved from flat lists of concepts into hierarchical classification systems or even full-fledged ontologies are employed for standardizing, to the extent possible, subject domain classes. Springing from library sciences as well as docu- mentation and information services, the Dewey Decimal Classification (DDC) (24), the Universal Decimal Classification (UDC) (25) and the Library of Congress Subject Headings are among the most widely used systems for the classification of documents and collections. EuroVoc is a similar system, represented as a thesaurus Table 3. A comparison of vocabularies and ontologies for metadata description, used in conjunction with metadata schemata to give meaningful descriptions of resources Title Domain Format Medical Subject Headings (MESH)a Medicine XML EDAM (EMBRACE Data and Methods) ontologyb Bioinformatics OWL, OBO Dewey Decimal Classification (DDC)c Library classification – Universal Decimal Classification (UDC)d Library classification – Library of Congress Subject Headings (LCSH)e Library classification – EuroVocf Document classification XML, SKOS/RDF Semantic Web for Research Communities (SWRC)g Research communities OWL CASRAI dictionaryh Research administration information HTML Bibliographic Ontology (BIBO)i Bibliographic information (citations and bibliographic references) RDF/RDFS COAR Resource Type Vocabularyj Open access repositories of research outputs SKOS PROV Ontology (PROV-O)k Provenance information OWL2 Open Digital Rights Language (ODRL)l Digital Rights Management, Licensing RDF/XML Creative Commons Rights Expression Language (ccREL)m Intellectual Property Rights, Digital Rights Management, Licensing RDF A wide variety of formats and sizes, suitable for different domains, is reported above. Although it is difficult to compare size due to different formats, we have presented the resources in approximate order of the number of items held in each at the time of writing from most to least. 
ahttps://www.nlm.nih.gov/mesh/ bhttp://edamontology.org/page chttps://www.oclc.org/dewey.en.html dhttp://www.udcc.org/index.php/site/page?view¼about ehttp://id.loc.gov/authorities/subjects.html fhttp://eurovoc.europa.eu/ ghttp://ontoware.org/swrc/ hhttp://dictionary.casrai.org/Main_Page ihttp://bibliontology.com/ jhttps://www.coar-repositories.org/ khttps://www.w3.org/TR/prov-o/ lhttps://www.w3.org/ns/odrl/2/ODRL21 mhttps://wiki.creativecommons.org/wiki/CC_REL Page 6 of 30 Database, Vol. 2016, Article ID baw145 covering the activities of the European Parliament, and gradually expanding to public sector administrative docu- ments in general. EuroVoc is multidisciplinary, as is the case for the previously mentioned classification systems, enriched however with a strong multilingual dimension in that its concepts and terms are rendered in all official lan- guages of the EU (and those of 3 EU accession countries), thus paving the way for cross-lingual interoperability. The Semantic Web for Research Communities (SWRC) is a generic ontology for modelling entities of research communities such as persons, organizations, publications and their relationships (26), while the Bibliographic Ontology (BIBO) caters mostly for bibliographic informa- tion providing classes and properties to represent citations and bibliographic references. COAR (27) is a controlled vocabulary, described in SKOS (a popular format for encoding thesauri, see ‘Formats for knowledge resources’ section), for types of digital resources, such as publications, research data, audio and video objects, etc. The PROV Ontology (PROV-O), a W3C recommendation, provides a model that can be used to represent and interchange prov- enance information generated in different systems and under different contexts. Finally, catering for expressing information about rights of use, reuse and distribution, the Creative Commons Rights Expression Language (ccREL) (28) formalizes the vocabulary for expressing licensing information in RDF and the ways licensing may be attached to resources, while Open Digital Rights Language (ODRL) (29) provides mechanisms to describe distribution and licensing informa- tion of digital content. Using these vocabularies, biocurators and text mining researchers can effectively search and retrieve content from digital repositories and also use them to annotate content and data both externally (e.g. tag a document or collection) and internally (e.g. annotate text spans as referring to a certain concept in an ontology or term in a vocabulary). Mechanisms used for the identification of resources Identification systems present the researcher with a means of assigning a persistent identifier to a resource (usually under their ownership). In contrast to simple identifiers, a persistent identifier is actionable on the Web and can be distributed to other researchers who can also use the same identifier to refer to the original resource. This facilitates deduplication, versioning and helps to indicate the relation between resources (e.g. raw and annotated text, scholarly articles before and after review, corrected or enriched ver- sions of a lexicon, etc.). Although persistent identifiers have so far been assigned primarily to publications, they are recently also applied elsewhere: e.g. datasets, software libraries or even the individual researcher. Below, we pre- sent the main mechanisms used for assigning Persistent Identifiers (PIDs). 
Similarly to persistent URL solutions (permalink and PURL11), the assignment of unique PIDs allows one to refer to a resource throughout time, even when it is moved between different locations. Some of the most popular PID systems are: • Handle PIDs: abstract IDs assigned to a resource in ac- cordance to the Handle schema (based on Request for Comment (RFC) 365012); resource owners have to regis- ter their resources to a PID service, get the ID and add it to the description of the resource; a PID resolver (included in the system) redirects the end users to the lo- cation where the resource resides • DOIs (Digital Object Identifiers)13: serial IDs used to uniquely identify digital resources; widely used for elec- tronic documents, such as digitally published journal art- icles; it is based on the Handle system and it can be accompanied with its own metadata. As with Handle PIDs the resource owner adds the DOI to the description of the resource (30). • ORCID (Open Researcher and Contributor ID)14: de- signed to allow researchers to create a unique ID for themselves which can be attached to publications and re- sources created by that researcher. This helps to clear up ambiguity when researchers have similar names to others in the field, or when a researcher changes their name (31). While these identifiers are widely used in the general re- search domain, there exist identification procedures, of dif- ferent scale and focus, like the PubMed Identifier (PMID) used for identifying articles in PubMed, or the Maven co- ordinates used to identify Java libraries. To facilitate search based on identifiers, utilities have been developed to search and match additional identifiers that may have been attached to the same object (article) in other contexts, e.g. find and match additional unique identifiers such as PMID (from PubMed), PMCID (from PMC), Manuscript ID (from a manuscript submission system, e.g. NIHMS, Europe PMC) or DOI (Digital Object Identifier). By using persistent identifiers a biocurator can unambiguously iden- tify and refer to resources of various types, e.g. from publi- cations to domain terminologies and possibly terms themselves, to authors and resource contributors, expect- ing that he/she will be able to locate such resources even when their initial locations on the web have changed. This 11 https://archive.org/services/purl/ 12 http://www.ietf.org/rfc/rfc3650.txt 13 http://www.doi.org/ 14 http://orcid.org/ Database, Vol. 2016, Article ID baw145 Page 7 of 30 property of persistent identification will likewise enable the biocurator or a bioinformatician to reproduce the re- sults of reported experiments conducted by other re- searchers by appropriately accessing the various types of resources involved in such experiments through resolving (usually by just clicking on) their persistent identifiers. Publication repositories This section describes the online repositories, aggregators and catalogues where publications can be deposited and subsequently discovered. Some have also been presented in ‘Metadata schemata and profiles’ section, from a different angle, i.e. the schema adopted/recommended for the de- scription of resources they host, especially when this is used widely by a variety of resource providers. Scholarly publications can be accessed through a wide range of sites and portals, e.g. publishers’ sites, project sites, institutional and thematic repositories, etc. 
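Whichever repository hosts an article, its persistent identifier usually offers a programmatic route to bibliographic metadata. The sketch below resolves a DOI through HTTP content negotiation on doi.org and reads the returned CSL-JSON record; this is only one of several resolution routes and assumes that the registration agency behind the DOI supports content negotiation. The DOI used here is that of the present article.

    # Resolve a DOI to bibliographic metadata via HTTP content negotiation.
    import json
    import urllib.request

    doi = "10.1093/database/baw145"   # the DOI of this article, used as an example
    request = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(request) as response:
        record = json.load(response)

    # The CSL-JSON record carries the usual citation metadata fields.
    print(record.get("title"))
    print(record.get("container-title"))
    print([author.get("family") for author in record.get("author", [])])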
We outline only widespread repositories and aggregators that make available content (or metadata about content) from different sources, mainly open access publications, given that they can function as central points of access. Repositories and aggregators for other types of objects (e.g. language and knowledge resources, language processing tools and services) are presented at the end of each section in this paper, namely, in the 'Language resources repositories' and 'Discovering tools and services' sections (Table 4).

Table 4. A comparison of popular sources for the discovery of and access to publications for TM
Title | Publications | Articles access | Type | Domain
OpenAIRE (a) | 14.6 million | Abstracts, some full text articles, reports and project deliverables, open access | Aggregator | Open
Connecting Repositories (CORE) (b) | 30.5 million | Abstracts, full text articles, open access | Aggregator | Open
Bielefeld Academic Search Engine (BASE) (c) | 91.9 million | Abstracts, full text articles, books and multimedia documents, software and datasets, many open access | Aggregator | Open
PubMed (d) | 26 million | Citations, abstracts, no full text articles (in principle) | Aggregator | Biomedical, life sciences
PubMed Central (PMC) (e) | 3.9 million | Abstracts and full text of journal articles, open access subset | Repository | Biomedical, life sciences
MEDLINE (f) | 22 million | Citations, abstracts | Aggregator | Biomedical, life sciences
Biodiversity Heritage Library (g) | 109,382 | Abstracts, full text articles, citations, open access | Repository | Biodiversity
arXiv (h) | 1.2 million | Full preprints and abstracts | Repository | Biology, physics, computer science, mathematics
We have made a distinction between modes of operation in the 'Type' column.
(a) https://www.openaire.eu/
(b) https://core.ac.uk/
(c) https://www.base-search.net/about/en/
(d) http://www.ncbi.nlm.nih.gov/pubmed
(e) http://www.ncbi.nlm.nih.gov/pmc/
(f) https://www.nlm.nih.gov/pubs/factsheets/medline.html
(g) http://www.biodiversitylibrary.org/
(h) http://arxiv.org/

While repositories are designated for data depositing, storage and maintenance, aggregators actively harvest data from multiple sources (i.e. repositories) and make them searchable and available in a uniform way. Aggregators can be conceived of as an evolution of hand-coded catalogues. Application profiles and metadata schemata, as discussed in 'Metadata schemata and profiles' section, and especially mappings between them to enhance interoperability, play a crucial role in the aggregation process and aggregators' operations.

Based on a pan-European network of institutional, thematic and journal repositories, OpenAIRE (32) brings together and makes accessible a variety of sources including links, publications and research data, improving their discoverability and reusability. Currently, OpenAIRE harvests over 700 data sources that span over 5000 repositories and Open Access journals. Text miners can use OpenAIRE for searching and downloading, where available, publications and/or abstracts of them, and increasingly make use of application programmatic interfaces for querying and mining specific information. In a similar vein, the Knowledge Media Institute of the Open University in the UK has built CORE (Connecting Repositories), aggregating all open access research outputs from repositories and journals worldwide and making them available to the public. CORE harvests openly accessible content available according to the definition of open access. Recently, CORE has started creating data dumps, i.e.
large collections of research publications, on the order of hundreds of thousands of documents, which are made available for mining information at different levels. One last example of a publication aggregator is the Bielefeld Academic Search Engine (BASE) (33), which also harvests all kinds of academically relevant material from content sources, normalizes and indexes these data and enables users to search and access the full texts of articles. All three aggregators rely on the widely used OAI-PMH protocol for harvesting publication data.

Specifically focusing on the area of Life Sciences are the repositories of MEDLINE, PubMed and PubMed Central. MEDLINE (34) is the U.S. National Library of Medicine (NLM) bibliographic database, containing >22 million references to journal articles in life sciences with a focus on biomedicine. Records in MEDLINE are indexed with MeSH. PubMed includes >26 million citations for biomedical literature from MEDLINE, life science journals and online books. PubMed citations and abstracts include the fields of biomedicine and health, covering portions of the life sciences, behavioural sciences, chemical sciences and bioengineering. In some cases, citations may include links to full-text articles from PubMed Central and other publisher web sites where the articles were originally published. PubMed Central (PMC), in its turn, is a repository of openly accessible biomedical and life sciences journal literature (35). Scientific publications are deposited by the participating journals and authors of articles that comply with the public access policies of research organizations and funding agencies. Finally, arXiv is a repository of electronic preprints that allows researchers to self-archive and share the full text of their articles before they get published in a journal. It is very popular in the field of physics, but contains documents from several domains, including quantitative biology.

Knowledge encoding

In the previous section, we covered the problem of acquiring the publications necessary to perform biocuration via TM. However, obtaining the data is not enough—we also need to understand it. A text, especially a scientific publication, is much more than a sequence of words. Some words represent structural elements of a document (headers, chapters, sections and paragraphs) or a sentence (subject, predicate and adjective). Others play special roles, such as a URL, the name of a person or a citation. Finally, some words or their sequences may be names of concepts that are interesting for a particular purpose. We typically refer to the identification of these special roles as annotating, and to the identified words, with their labels, as annotations. These may be obvious for a human reader, but need to be expressed in a strict machine-readable format to allow automatic text processing. The 'Annotation models' section describes the most important annotation formats.

During annotation we usually link words or sequences occurring in a text with labels describing their role, e.g. date, title, surname, protein, or the concept that they refer to, e.g. a cat, John Smith or Escherichia coli, possibly disambiguating between multiple concepts that go by the same name. In both cases, we may refer to existing knowledge resources, e.g. ontologies or dictionaries, as these references allow the annotations to be re-used in future and linked with other similar efforts.
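A minimal illustration of such a link between a text span and a knowledge resource is sketched below. The structure is a toy stand-off annotation rather than any of the standard formats described in the next section, and the concept is referred to by its OBO Foundry style IRI for the NCBI Taxonomy entry for Escherichia coli.

    # A toy stand-off annotation tying a text span to a concept in an external
    # knowledge resource. The structure is illustrative, not a standard format.
    text = "Colonies of Escherichia coli were grown overnight."

    annotation = {
        "span": {"begin": 12, "end": 28},        # character offsets into the text
        "surface": text[12:28],                  # "Escherichia coli"
        "label": "Organism",                     # the role assigned to the span
        "reference": "http://purl.obolibrary.org/obo/NCBITaxon_562",  # concept identifier
    }

    assert annotation["surface"] == "Escherichia coli"
    print(annotation)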
This problem is also related to the concept of linked data, which enables semantic search by publishing data in a way that links it with other re- sources available via the web. However, to create linked data, the target knowledge resource needs to be suitable for referencing, which can be ensured by using one of several suitable interoperable formats. The ‘Formats for knowledge resources’ section enumerates the most popular formats for encoding such resources, while the ‘Useful knowledge re- sources’ describes exemplar ontologies and vocabularies. Creating an annotated corpus or knowledge resource, in particular when done manually, is a time consuming process. The products of such efforts are sometimes used for many years, but they also may become inaccessible if an under-specified or poorly documented format has been employed. Furthermore, a lot can be gained by comparing or aggregating annotations from different corpora, which is only doable if the semantics of annotations across cor- pora are consistent. How can we make our research reus- able and permanent? We need to take care of the interoperability of every aspect of our work—protocols, formats, vocabularies, knowledge bases, etc. Annotation is a great example. If we use an interoperable standardized annotation format and refer to publicly available, well es- tablished knowledge resources, everyone will benefit. Annotation models Annotation is the process of adding supplemental informa- tion to a text in natural language. The annotations are pro- duced as an output of an automatic or manual tool and may be: • treated as an input for further automatic stages, • visualized for interpretation by humans, • stored as a corpus. In each of these cases, we need the annotations to fol- low a precisely defined data format. This section is devoted Database, Vol. 2016, Article ID baw145 Page 9 of 30 to presenting such formats, both domain-dependent and generic. Table 5 summarizes the different formats which are commonly used. All of the formats presented in Table 5 have a data model that allows the representation of annotations inde- pendently from the domain application. The Domain col- umn indicates the salient communities that use the format. Generic formats are used in very diverse domains, includ- ing the biomedical. Even though they are generic in their design, the format with the domain ‘biomedical’ is mainly used within the biomedical text mining and natural lan- guage processing communities. In order to grasp the specificities of each format and to be able to choose one, it is important to recall the goals that motivated the specification of each format. We can classify these objectives in three broad categories, as out- lined below. Formal representation and sharing The goal is to provide a formal representation framework for linguistic and semantic annotations of texts. The objectives are to normalize the representation of annota- tions inside the community of annotation producers, to allow the exposure of annotations to peers. Some of these models were designed by a committee of language annota- tion professionals, who attempts to cover the widest range of annotation situations in order to build a complete and expressive format. LAF (36), XMI15, and Open Annotation (37) are examples of models designed with this goal in mind. These formats are suited to expose and share annotations, especially if these annotations are complex. 
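For instance, the toy annotation sketched in the section introduction above could be exposed in a form modelled loosely on the Open Annotation vocabulary, serialized here as JSON-LD. This is a hedged sketch: the document URL is a placeholder and exact property names may differ between Open Annotation and its successor, the W3C Web Annotation model.

    # A sharable annotation sketch modelled loosely on the Open Annotation /
    # W3C Web Annotation vocabulary. The source URL is a placeholder.
    import json

    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": "http://purl.obolibrary.org/obo/NCBITaxon_562",   # what the span denotes
        "target": {
            "source": "https://example.org/articles/12345",       # hypothetical document
            "selector": {
                "type": "TextPositionSelector",                    # character offsets
                "start": 12,
                "end": 28,
            },
        },
    }

    print(json.dumps(annotation, indent=2))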
Normative formats also have the advantage of being known and recognized by a large number of tools and services, although the user should always take care to ensure that the format they choose is suitable for their purpose and has sufficient tool support to be useful.

Table 5. A comparison of annotation formats used in TM
Model | Domain | Serialization formats | API | Type
BioC (a) | Biomedical | XML | Reference APIs in multiple languages | Stand-off
BioNLP shared task TSV (b) | Biomedical | TSV | No | Stand-off
BRAT format (c) | Generic | TSV | No | Stand-off
PubTator (d) | Biomedical | TSV | No | Stand-off
TEI (e) | Generic | XML | Via XSLT (f) | Stand-off
NIF (g) | Generic | RDF | No | Stand-off
LIF (h) | Generic | RDF | Reference API in Java (i) | Stand-off
IOB | Generic | TSV | Third-party APIs in several languages | In-line
Open Annotation (j) | Generic | RDF | No | Stand-off
CAS (UIMA) (k) | Generic | XML (XMI) | Reference APIs in Java and C++ (l) | Stand-off and in-line
GATE annotation format (m) | Generic | Several | Reference API in Java (n) | Stand-off and in-line
LAF/GrAF (o) | Generic | XML | No | Stand-off
PubAnnotation (p) | Generic | JSON | REST API to annotation store (q) | Stand-off
'API' stands for application programming interface and refers to whether there is a suitable library for use with this format. The domain column denotes the typical category of information encoded with this format.
(a) http://bioc.sourceforge.net/
(b) http://2011.bionlp-st.org/home/file-formats
(c) http://brat.nlplab.org/standoff.html
(d) http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/
(e) http://www.tei-c.org/Guidelines/P5/
(f) http://www.tei-c.org/Tools/Stylesheets/
(g) http://persistence.uni-leipzig.org/nlp2rdf/
(h) http://wiki.lappsgrid.org/interchange/
(i) http://mvnrepository.com/artifact/org.lappsgrid/vocabulary
(j) http://www.w3.org/ns/oa
(k) https://uima.apache.org/d/uimaj-2.7.0/references.html#ugr.ref.cas
(l) https://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html
(m) https://gate.ac.uk/sale/tao/splitch5.html
(n) http://jenkins.gate.ac.uk/job/GATE-Nightly/javadoc/
(o) ISO 24612:2012 – http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326
(p) http://www.pubannotation.org/docs/annotation-format/
(q) http://www.pubannotation.org/docs/api/

15 http://www.omg.org/spec/XMI/

Operational interoperability

Some formats presented in Table 5, such as NLP Interchange Format (NIF) (38), were specifically designed to be flexible and generic such that they can be used as interchange formats in arbitrary analysis workflows. Workflows play an important role in TM and NLP because the operational results are rarely produced by a single piece of software or method. Useful outputs require an accumulation of coordinated process steps. For instance, most applications would need sentence splitting, tokenization, POS-tagging, several steps of named-entity recognition and so on. More complex applications may also need syntactic parsing, relation extraction, semantic labelling and more. Each step is designed to achieve a specific and atomic task and operates on the text as well as the output of previous steps. In the NLP community, workflow implementations commonly wrap individual tools in order to have uniform access to their input, output and parameters. In this case, the workflow works on a single annotation model called the 'pivot model'. The output of each component tool is translated into the pivot model; conversely, the annotations expressed in the pivot model are translated into the native tool input format.
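This wrapping can be pictured as a pair of conversion functions around each tool. The sketch below is purely schematic: the converter and tool names are hypothetical and no particular workflow engine's API is implied.

    # Schematic pivot-model wrapping: each tool is surrounded by converters
    # between its native representation and the workflow's pivot model.
    # All names here are hypothetical; no real workflow engine API is shown.
    def wrap(tool, to_native, to_pivot):
        def component(pivot_document):
            native_input = tool_input = to_native(pivot_document)   # pivot -> native input
            native_output = tool(native_input)                      # run the wrapped tool
            return to_pivot(pivot_document, native_output)          # merge results back into the pivot model
        return component

    def run_workflow(pivot_document, components):
        for component in components:            # each step reads and enriches the pivot model
            pivot_document = component(pivot_document)
        return pivot_document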
Performance is one of the main design principles of these formats. BioC (39), GATE annotation format, LIF (40) and CAS (41) are formats designed to be processed by BioC, GATE (42), LAPPS (43) and UIMA (8) workflows, respectively. These formats present the advantage of giving direct ac- cess to the processing tools available for the respective workflow engine. Although annotations can be shared, they usually confine the annotations to the ecosystem of the processing tools of the workflow engine. Other formats have been designed for a more specific use, such as storing outputs of manual annotation such as BRAT (44), or encoding corpora e.g. TEI (18). Human–machine readability Other formats are designed to be, at the same time, pro- cessed by machines and read by humans. Indeed NLP and Information Extraction developers need annotations in for- mats that they can use with their tools, especially machine learning tools, but also that they can read in order to grasp the data and analyse the errors of tools in development and production. Most of these formats have been designed as data for- mats supported by NLP and IE challenges: BioNLP Shared Task (BioNLP-ST TSV) (45), BioCreative (PubTator) (46) and CoNLL (TSV/IOB). Challenges are important events that gather the NLP and IE community. They allow the as- sessment of the performance of tools and methods against real-life data. Typically, challenge participants will feed annotations to automatic tools, as well as look into annotations. The main advantage in exposing one’s own annotations in these formats is that they can be processed by the most state-of-the-art research software. Format paradigm The annotations may be inserted in the text (in-line, similar to tags in HTML) or provided as a separate file (stand-off). The overwhelming majority of formats opt for stand-off annotations. On one hand in-line annotations have serious limitations for representing complex structures like over- lapping spans, discontinuous phrases, or relations. On the other hand stand-off formats allow the transmission of an- notations separately from the annotated text that cannot always be distributed for legal reasons. There is a strong tension between human readability and genericity of a format. The more complex the struc- tures to be encoded become, the more identifiers and cross- references need to be introduced which gradually erodes human readability. For example, the CONLL 2006 format is a fixed scheme format with a good human readability; the TSV format used by WebAnno (47) is a variable- scheme format that tries to strike a compromise here by scaling the encoding complexity. The simpler the annota- tions are, the more human-readable is the format, for ex- ample see PubAnnotation (48). Cross-references and identifiers are introduced on a by-need basis, not preemp- tively; the GrAF XML format is a variable-scheme format using references and identifiers a lot and is hardly human- readable even for documents with only simple annotations In fact, in-line annotations have two advantages in ra- ther niche situations. In-line annotations map directly to mark-up formats used natively by several visualization tools. In general, in-line formats are more easily read by humans. Also they are particularly well suited as input data for algorithms widely used in named-entity recogni- tion, like Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs). The transformation from in-line to stand-off is trivial as stand-off annotations are more ex- pressive. 
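The easy direction can be sketched in a few lines: IOB-tagged tokens (CoNLL-style in-line annotation) are folded into stand-off spans with character offsets. This assumes the token start offsets are known; a real converter would record them during tokenization.

    # Convert in-line IOB tags into stand-off spans with character offsets.
    def iob_to_standoff(tokens):
        """tokens: a list of (surface, start_offset, iob_tag) triples."""
        spans, current = [], None
        for surface, start, tag in tokens:
            if tag.startswith("B-"):                            # a new entity span begins
                current = {"label": tag[2:], "begin": start, "end": start + len(surface)}
                spans.append(current)
            elif tag.startswith("I-") and current is not None:  # the current span continues
                current["end"] = start + len(surface)
            else:                                               # an O tag closes any open span
                current = None
        return spans

    tokens = [("Escherichia", 12, "B-Organism"), ("coli", 24, "I-Organism"),
              ("were", 29, "O"), ("grown", 34, "O")]
    print(iob_to_standoff(tokens))   # [{'label': 'Organism', 'begin': 12, 'end': 28}]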
Transforming back to in-line from stand-off can be difficult, especially if the annotation has passed the boundaries of the expressivity of the in-line annotation for- mat. In-line formats may be chosen as they are easier for a human to read, however in the general case we recommend stand-off formats. Formats for knowledge resources In this and the next section, we focus on a special type of resources that play the role of knowledge sources. This may be purely linguistic information (e.g. morphology dic- tionaries), general world knowledge (e.g. open-domain ontologies) or very specific use (e.g. translation) or domain Database, Vol. 2016, Article ID baw145 Page 11 of 30 knowledge. First, we describe the formats used to encode such resources. Table 6 contains a list of them with the most basic features. A variety of formats is necessary to represent the organi- zation and actual content of the linguistic and conceptual resources that are used to feed TM and NLP software. Their adoption by resource developers can be explained as follows. The nature of content elements The formats listed in Table 6 correspond more or less to families of resources that allow the exploitation by soft- ware of different facets of knowledge ranging from words to concepts. Lexica provide descriptions of lexemes, i.e. a language’s words, focusing on morphology, syntax and sometimes se- mantics, all of which are elements precisely described by the Lexical Markup Framework (LMF) (49). In this work, vocabularies are to be understood as sets of elements of knowledge, possibly structured and controlled. Their typ- ical function is to represent consensual meaning of con- cepts inside domain communities. Vocabularies cover gazetteers, authority lists, thesauri, terminologies, classifi- cation schemes, etc. TMF/TBX and SKOS are particularly suited for this. OWL and OBO (50), initially developed for the ontological representation of concepts, are also com- monly used for implementing borderline vocabularies also called termino-ontologies. Finally, translation memories record pairs of segments of texts that have previously been translated. Both TMX and XLIFF like the majority of translation memory standards focus on the context rather than on the internal structure of the segments. Note that all formats listed support multilinguality. Considering the heterogeneity of the nature of contents represented in resources, most of the formats listed in Table 6 are not exchangeable in a TM workflow. Indeed, exchangeability is generally neither possible nor useful as each type of component, e.g. word disambiguation, con- sumes one specific or a given set of resource types, e.g. lex- ica or vocabularies. Only inside the same level of linguistic or knowledge representation are the formats exchangeable such as between OBO and OWL, for which translation routines already exist. Fitness towards TM It is worth noticing that LMF is the only format from the list above that was designed in the context of ISO/TC37 specifically to feed into the NLP process. UBY-LMF is an example of instantiation of the LMF model. It has been used in TM pipelines on, for instance, word sense disam- biguation or text classification. Other formats like OWL, SKOS and, to a lesser extent, OBO are central, especially since the emergence of the Semantic Web, as they are widely adopted by domain Table 6. 
A comparison of formats for the encoding of different types of knowledge resources
Format | Resource type | Serialization | Libraries available
TMF/TBX (a) | Terminologies | XML | Yes (b)
LMF (c) | Lexica | LMF | No
SKOS (d) | Thesauri | RDF | Yes (RDF) (e)
OWL (f) | Ontologies | several | Yes (g)
OBO (h) | Ontologies | own | Yes (i)
Ontolex (j) | Lexica relative to ontologies | RDF | Yes (k)
TMX (l) | Translation memories | XML | Yes (m)
XLIFF (n) | Translation memories | XML | Yes (o)
'Libraries available' refers to whether there is a suitable library for use with this format.
(a) http://www.tbxinfo.net/
(b) http://www.tbxinfo.net/tbx-downloads/
(c) http://www.lexicalmarkupframework.org/
(d) https://www.w3.org/TR/skos-reference/
(e) https://www.w3.org/2004/02/skos/tools
(f) https://www.w3.org/OWL/
(g) http://owlapi.sourceforge.net/
(h) ftp://ftp.geneontology.org/pub/go/www/GO.format.obo-1_4.shtml
(i) http://oboedit.org/?page=javadocs
(j) https://www.w3.org/community/ontolex/wiki/Final_Model_Specification
(k) https://github.com/cimiano/ontolex/blob/master/Ontologies/ontolex.owl
(l) http://xml.coverpages.org/tmxSpec971212.html
(m) http://docs.transifex.com/api/tm/
(n) http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html
(o) http://www.opentag.com/xliff.htm#Resources

specific communities. More and more pointed and high value knowledge resources are thus made available for TM advanced tasks in addition to already widespread general knowledge bases. However, OWL, OBO and SKOS resources often lack information on the lexical level, because the formats are not suited to it and such resources are created for information organization or reasoning purposes. They may need reworking before exploitation in a TM pipeline. Ontolex, the core of the lexicon model for ontologies (LEMON), was created to overcome these weaknesses of OWL.

Interoperability

The issue of the interoperability of knowledge resources has to be considered according to two aspects.

First, resources have to interact with TM pipelines both as inputs and outputs of specific components. The formats listed in Table 6 may not be sufficient to qualify the compliance of a resource with a tool. The user generally needs to read the resource's documentation along with the software's documentation. In the best case scenario, these sources of documentation specify the appropriate and necessary elements for a given task, however there is no guarantee that the documentation will be clearly written. This hinders the design of tailor-made TM workflows by non-specialists, as they require an in-depth knowledge of the resources and tools.

In addition, knowledge resources need to be interoperable with each other. Users may want to merge existing resources to answer their specific needs. This reduces development costs and allows them to benefit from expertise they probably do not have, particularly on specialized domains. Most of the formats above offer mechanisms towards this kind of interoperability thanks to available libraries. Yet, there are still many knowledge resources that do not use standards because, for instance, they are tied to a given piece of software, or standards were felt not to be necessary by the developer. The adoption of standards for resources is an essential driver towards flexible and reusable TM pipelines.

Useful knowledge resources

Although it is not possible to enumerate all knowledge sources used for TM and biocuration, we try to outline several examples here, focusing on their interoperability.
Useful knowledge resources

Although it is not possible to enumerate all knowledge sources used for TM and biocuration, we try to outline several examples here, focusing on their interoperability. We have focused on resources from the life sciences domain, but at the end of this section the reader will find some more general examples with an explanation of their role in text mining for biocuration. Table 7 describes the resources we have highlighted, including their most important features.

Table 7. A comparison of popular knowledge resources, typically used in TM for the life sciences

Name | Type | Domain | Size | Format | License
Uniprot (a) | Knowledge base | Proteomics | 63 million sequences | Own, RDF, FASTA | CC
UMLS (b) | Thesaurus | Biomedical | 3.2 million concepts | Own | Proprietary
Gene Ontology (c) | Ontology | Genetics | 44 000 terms | OBO | CC
Agrovoc (d) | Thesaurus | Agriculture | 32 000 concepts | RDF | CC
HPO (e) | Vocabulary | Human phenotype | 10 000 terms | OBO, OWL, RDF | Free to use
CNO (f) | Vocabulary | Neuroscience | 395 classes | OWL, RDF | CC
CARO (g) | Ontology | Anatomy | 96 classes | OBO, OWL | Unspecified

These resources differ in terms of type, domain and intended use. These differences make size difficult to compare, as different resources have different base elements. Nonetheless, we have presented the table in an approximate order of size, from largest to smallest.
a. http://www.uniprot.org/
b. https://www.nlm.nih.gov/research/umls/
c. http://geneontology.org/
d. http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
e. http://human-phenotype-ontology.github.io/
f. https://bioportal.bioontology.org/ontologies/CNO
g. http://www.obofoundry.org/ontology/caro.html

Domain specific resources

Domain specific resources capture the formal knowledge or the language used in a delimited scientific or technical domain. Domain specific resources are built by domain experts, so the entries are usually familiar to biocurators. These resources are used for labelling document topics, or to extract occurrences of entries in the content. The automatic processing of documents links documents to entries in domain specific resources, and thus helps biocurators in a systematic approach to their task.

The majority of domain specific resources mentioned in Table 7 are expressed in OWL or OBO. Resources in OBO can be easily translated into OWL/RDF, since the OBO model is a subset of OWL. The reverse is usually possible, at least partially. UMLS (51) uses a specific format because it is an aggregation of very diverse nomenclatures, some of them originating from before the development of OBO and OWL. Uniprot (52) is a curated resource containing protein sequences and information on their functions; it has used a specific format since its conception in 1986. There are two approaches for domain specific applications that require these resources: either they understand the formats natively, or they reformulate them into OWL/RDF. For some resources, e.g. Agrovoc (53), RDF is the only available format.

One of the main interoperability challenges for domain specific resources is that they do not always cover distinct subdomains. Therefore, different resources contain common objects, which do not necessarily carry the same name and identifier. For instance, both the UMLS and CARO (54) contain the concept 'basal lamina'. An application that uses overlapping resources has two courses of action: it can act as if the resources were distinct, at the risk of duplicating information, or it can map objects from different resources, which may prove difficult and costly.
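One low-cost starting point for such mappings is the cross-reference information that many OBO resources already ship with. The sketch below is an assumption-laden illustration rather than a recommended tool chain: it uses the obonet library (any OBO parser would do) to load the Gene Ontology from a local OBO file and print the declared cross-references of a few terms.

    import obonet  # assumption: obonet (a networkx-based OBO parser) is installed

    # Load the ontology from a local OBO file; the path is a placeholder for a
    # file downloaded from the Gene Ontology website.
    graph = obonet.read_obo("go-basic.obo")

    # Node attributes follow the OBO tag names; 'xref' is only present for
    # terms that declare cross-references to other resources.
    shown = 0
    for term_id, data in graph.nodes(data=True):
        if "xref" in data:
            print(term_id, data.get("name"), "->", data["xref"])
            shown += 1
        if shown == 5:
            break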
In the best case, resources already contain cross- references to objects in other resources, but cross-references are not typed and could mean object equivalence as well as just ‘see-also’ relationships. Sometimes term definitions may as well contain concepts from other ontologies and link to them, e.g. HPO (55). Ontology alignment and ter- minology are whole domains of research that aim to pro- duce such mappings automatically using the objects labels or the comparison of the structures of resources—for bio- medical examples, see Bio2RDF (56) and KaBOB (57). When adopting a TM system in order to assist their task, biocurators face the difficulty to choose among re- sources. The choice must be driven by three main criteria: the topic coverage, the quality of the resource and licensing. The topic coverage is the most obvious criterion: the re- source must address the domain or subdomain at hand. The main reason a resource becomes popular is that it cov- ers an exclusive topic, and it is well documented so the coverage is made well known. Even though several re- sources may seem to compete in a specific topic, they may adopt different points of view or address different levels of granularity. We advise that biocurators investigate the documentation in order to understand the precise bounda- ries and the point of view of the considered resources. The quality of knowledge resources is difficult to define and assess. Unfortunately, the most reliable way to assess the quality of a resource is experimenting with services using this resource. However, some properties can be checked beforehand to ensure that the service that inte- grates the resource will meet one’s expectations. Licensing is also key, as a biocurator must select resources which will be compatible with their final intended use. For example, some resources may come attached with a non-commercial licence which may not be suitable in an industrial setting. Development process and curation. The process of devel- opment and maintenance of a resource is a good indicator of its quality. Resources under active development, sup- ported either by a recognized institution, or by a stable committee are likely to be reliable. Quality resources also have to be curated, thus a clear and sensible policy for ac- cepting contributions indicates a coherent resource. In the best case, the methodology of construction is described in a scientific publication. Almost all the resources mentioned in Table 7 are manually curated, which means that they are the result of a process involving humans reading relevant publications or other knowledge sources and extracting necessary informa- tion. The only exception is Uniprot, which includes a sec- tion (UniProtKB/TrEMBL) of automatically annotated and classified entries. Automatic and semi-automatic solutions permit a biocurator to increase the coverage with reduced human effort, but also result in lower annotation quality because of limited accuracy of automatic methods. In the general domain (as shown below), most of the knowledge resources that we have covered are automatically extracted from textual databases, particularly from Wikipedia. Community of users. A widely popular resource might be a good one, however one has to check if the resources are used with the same objectives. For instance Gene Ontology (58) is extremely popular, however it was designed to nor- malize functional annotations of genes. This resource has drawn a remarkable attention from the TM and NLP com- munity. 
However, on close examination, the extraction of GO concepts from text content is still a research subject, since it has proven to be challenging (59).

Semantic strictness. Ontologies and knowledge bases contain intensional knowledge that will be used by TM tools for inferences. If a service uses inappropriate inferences on a resource, or if a resource contains approximate intensional knowledge, then the impact on the output can be dramatic. For instance, one can check whether 'is-a' relationships in an ontology actually denote strict specialization, and not related or weaker relations. A tool may, for example, take advantage of the taxonomic structure of a thesaurus in order to improve the extraction or the retrieval of information in the text; however, this tool can propagate errors if the terms are misplaced or inappropriate to the context at hand.

Lexicalization. Lexicalization is the property of a resource to capture the majority of the specific language associated with the domain specific concepts. A good lexicalization will ensure that information can be properly and comprehensively extracted from the text's content. For instance, one can check whether the most common synonyms and acronyms are present in the resource, or conversely whether ambiguous terms carry enough context within the resource to allow for automatic disambiguation.

General domain and linguistic resources

General-domain resources and linguistic resources are often used in TM tools to complement domain-specific resources. Indeed, linguistic resources are helpful, as domain-specific resources seldom capture the whole diversity of language used to express the objects they contain. Some domain-specific resources contain synonyms, but it is impossible to cover all the typographic, morphological and syntactic variations. This knowledge is nonetheless very important for the detection of entities in the text of the documents. Without this knowledge, the TM tools may miss mentions of concepts, or be confused by ambiguous mentions or concepts that have similar lexical manifestations. Princeton WordNet (60), OliA (61) and GOLD (62) are among the most widely used linguistic resources.

Not all the information needed by biocurators is domain-specific; for instance, TM tools can extract and present general-domain entities, like persons, countries or organizations, in order to assist them. Resources derived from Wikipedia are often used to this effect: Wikidata (63), DBpedia (64), Freebase (65) and YAGO (66).

Language resources repositories

It is important for the researcher to know where to look for resources. In the table below, we have listed repositories which are useful for TM in the life sciences. These repositories allow a user to browse for content, search for relevant resources, download resources (often for free) and upload their own resources for others to discover and use. Resources will typically be in the formats suggested in the 'Formats for knowledge resources' section. The format of the resource will usually be included as part of the metadata in the repositories, to help the researcher decide if the resource will be suitable for their needs. The repositories listed in Table 8 allow the visitor to identify and localize the resources that will answer their needs. Such repositories that are available on the web can be divided into three kinds: catalogues, directories and metadata repositories.
Their relation to metadata and the services they offer will be discussed hereafter.

Table 8. Repositories for the curation of language resources, indexing language resources that are useful for the general domain and the life sciences

Title | Available records | Type of content | Accessibility (Download) | Accessibility (Upload) | Domain
ELRA Catalogue of Language Resources (a) | 1137 | Corpora, lexica | Some paid | Restricted | Language technology
LDC catalogue (b) | Over 900 resources | Corpora | Some paid | Restricted | Language technology
VEST Registry (c) | 118 | Vocabularies, standards, tools | Open | Registration upon request | Agriculture, food, environment
AgroPortal (d) | 35 | Vocabularies | Open | Registration upon request | Agriculture, environment
BioPortal (e) | 576 | Vocabularies | Open | Registration upon request | Biology, health
CLARIN VLO (f) | 876 743 records | Various | Open | Upon request | Language technology
META-SHARE (g) | More than 2700 | Corpora, lexica, language descriptions, tools/services | Open | Registration upon request | Language technology
Stav corpora (h) | 30 | Annotated corpora | Open | Closed | Biomedical

a. http://catalog.elra.info/
b. https://catalog.ldc.upenn.edu/
c. http://aims.fao.org/vest-registry
d. http://agroportal.lirmm.fr/
e. http://bioportal.bioontology.org/
f. https://www.clarin.eu/content/virtual-language-observatory
g. http://metashare.elda.org/
h. http://corpora.informatik.hu-berlin.de/

Different types of store

Catalogues. The two catalogues above, namely ELRA (67) and LDC, meet the Longman definition of a catalogue as 'a complete list of things that you can look at, buy, or use, for example in a library or at an art show' (http://www.ldoceonline.com/dictionary/catalogue_1). The operators managing those catalogues play the role of brokers, carefully selecting the resources and their suppliers, who state the conditions of distribution (license, possibly price). The user is a member and accesses the resources on conditions depending on their status (academic/private or profit/non-profit) and sometimes the use they want to make of the resource (research or commercial). The resources are generally high-value ones. Samples, when provided, allow the user to evaluate the resource for the targeted task before paying.

Directories. The VEST Registry and CLARIN VLO (19) are directories, as they simply expose information on a selection of resources. They provide the necessary information for the user to discover the resources, generally through a web link (usually a persistent identifier like the ones mentioned in the 'Introduction' section) to the resource or its original web page. Contributors to the VEST Registry are a small community of registered users who notify the registry of valuable resources for the agriculture community. The CLARIN VLO harvests and presents metadata from many providers from a variety of European countries.

Metadata repositories. META-SHARE (20), BioPortal (68) and AgroPortal (69) store and make both metadata and their associated resources directly available to the user for download. While META-SHARE has a general thematic scope, BioPortal and AgroPortal are respectively dedicated to biomedicine and agriculture. Furthermore, the META-SHARE portal features an aggregator, in effect enabling federated access to a great number of repositories organized as nodes in a network. These reasons explain the noticeable difference in size between them, AgroPortal having, in addition, been launched only a few months before the time this paper was written.
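Metadata repositories such as BioPortal and AgroPortal also expose their records programmatically. As a rough illustration (assuming you have registered for a free BioPortal API key, and that the REST endpoint and field names below are still as documented at the time of writing), the following Python snippet lists a few of the ontologies hosted by BioPortal.

    import requests

    BIOPORTAL_API = "https://data.bioontology.org"
    API_KEY = "YOUR_BIOPORTAL_API_KEY"  # placeholder: obtained after registration

    # Retrieve the list of hosted ontologies; each record carries metadata
    # such as an acronym and a human-readable name.
    response = requests.get(f"{BIOPORTAL_API}/ontologies",
                            params={"apikey": API_KEY},
                            timeout=30)
    response.raise_for_status()

    for record in response.json()[:5]:
        print(record.get("acronym"), "-", record.get("name"))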
The Stav repository (70) differs from other repositories of biomedical corpora in the way that it presents documents to a user. Instead of downloading an annotated file, one can visualize the annotation in an on-line tool. The available corpora, although not numerous, cover a wide range of annotation types, from a variety of named entities to complex events.

The importance of metadata

In such repositories, especially large ones, poor metadata leaves the user looking for a needle in a haystack. In this respect, setting up the metadata schema that underpins a catalogue, a directory or a metadata repository is a crucial step in the whole system design process. Too few fields mean fewer services for the final user; too many, or too complex, may leave providers unable or unwilling to supply the information relating to their resources. Metadata have functions that are translated into functionalities or services in the repositories.

Discovery. Visitors use sets of metadata as relevant criteria to single out one or a few resources among all of them. These include descriptive (e.g. type, language, domain, curatorial information), technical (e.g. format, tool compatibility, creation process) and usage metadata (e.g. license, popularity). This information is generally presented on a repository's home pages as drop-down menus, and is further accessible through facets during the search process. META-SHARE proposes no fewer than 19 criteria to filter resources. In repositories and directories collecting data from various sources, a key challenge is the mapping of original metadata fields into a common, meaningful schema. The growing uptake of standard metadata schemata by both resource producers and stores, combined with the achievements of international initiatives and infrastructure projects like META-SHARE or CLARIN, makes the mapping work easier. However, the value lists associated with some metadata fields are still stumbling blocks. While lists for languages or countries are widely shared, building consensus on subjects, resource types, media types or formats is still work in progress, through initiatives like COAR.

Identification and localization. Having a multiplicity of access points to knowledge resources is a necessity to serve TM stakeholders with different cultures and habits. But this leads to a situation where resources are duplicated many times, with the risk of creating inconsistencies from one repository to another. In order to be reproducible, TM and NLP processes need to refer explicitly to resources and the specific versions they use. Elements of different metadata schemata, like the persistent identifiers mentioned in the 'Mechanisms used for the identification of resources' section, enable such referencing. The use of persistent identifiers for language resources has only recently become established, and the most widely used identification system is Handle PIDs. Still, generic ones like DOIs (see the 'Mechanisms used for the identification of resources' section), which allow the identification of both resources and their versions, are also used by some resource providers and/or distributors. Resource developers should be encouraged to use persistent identifiers in combination with relevant metadata elements when publishing. The sustainable hosting of resources is also a concern, in particular for repositories that reference distant content, as too many broken links are a reason for the user to give up on a directory. This sustainability in hosting, and, relatedly, in access to a resource, is also key to ensuring its reuse and popularity. Common repositories can offer this service, while other hosting solutions, like simple web pages, generally do not.
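Persistent identifiers such as DOIs can also be dereferenced programmatically, which is what makes them useful for pinning a workflow to an exact resource version. The sketch below uses standard DOI content negotiation (a widely supported convention, although whether a particular registration agency returns CSL JSON for a given DOI is an assumption) to fetch machine-readable metadata for a resource identified by a DOI.

    import requests

    def resolve_doi(doi: str) -> dict:
        """Fetch machine-readable metadata for a DOI via content negotiation."""
        response = requests.get(
            f"https://doi.org/{doi}",
            headers={"Accept": "application/vnd.citationstyles.csl+json"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    # Example: the DOI of this survey itself.
    metadata = resolve_doi("10.1093/database/baw145")
    print(metadata.get("title"), "-", metadata.get("publisher"))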
Value added services

Some repositories, particularly small domain specific ones, offer more than just the possibility to discover and download resources. BioPortal and AgroPortal propose an integrated environment for browsing, searching, sharing and commenting on resources. Advanced functionalities allow the user to easily evaluate the adequacy of one or several resources for a given text. More interesting is the possibility of computing and storing mappings between concepts, thus creating a conceptual network across resources. Such mappings are valuable for NLP-related tasks like annotation, resource building or word disambiguation.

The issues addressed so far have only concerned human users. Leaving aside catalogues, which address only people's needs, almost all recent stores also provide services to machines through Application Programming Interfaces (APIs). In addition, Semantic Web technologies, SPARQL in particular, increase the potential for communication between processes and repositories. From that perspective, standards for metadata and resource formats are even more crucial in allowing programs to select, identify and access resources from repositories in an unambiguous and consistent manner.

Tools and services

The needs of a text miner vary from task to task. In the best case scenario, another researcher will have already created a tool or web service that can be reused for another purpose. At the end of this section, we have listed several useful resources that can be used to discover tools and services for TM. If the text miner cannot find a pre-existing tool, then they must look to develop their own. However, not all tools need to be programmed from scratch. Some can be created simply by taking multiple existing tools and reengineering them to jointly act as a new tool. For such a task, workflow systems may be useful for novice and expert alike. A typical workflow management system (WMS) provides a collection of reusable components and the ability to link the processing of these together in an intelligible manner. The WMS typically consists of the following major blocks:
• a workflow description language;
• a workflow engine that interprets the workflow description language and runs the workflow;
• a collection of components from which workflows may be assembled;
• a repository where components and workflows are stored and may be shared with other users;
• possibly a workbench which allows a user to graphically access the repository, compose components into workflows and run these using the engine.
In particular, the ability to compose workflows by using other workflows as components makes such systems very flexible and powerful.

Most of the software packages examined here do not support all aspects of a WMS. Based on which aspects are supported, we apply a fine-grained categorization: the software packages mentioned in this section cover five categories, with most packages belonging to more than one category:
1. Interoperability frameworks: provide a data exchange model, a component model for analytics components, a workflow model for creating pipelines, and a workflow engine for executing pipelines;
2. Component collections: collections of components based on an interoperability framework, including analytics components, but also input/output converters;
3. Type systems: a data interchange vocabulary that enables the interoperability between components, typically within a single component collection;
4. Analytics tools: standalone natural language processing tools that are typically not interoperable with one another and are thus wrapped for use within a specific interoperability framework;
5. Workbenches: user-oriented tools with a graphical user interface or web interface by which the user can build and execute analytics pipelines.

In some cases, identifying which part of the software relates to which of the above categories is difficult. For example, in Stanford CoreNLP, which does offer a type system and a workflow engine, the separation is not as clearly reflected in the system architecture, nor advertised as distinctly, as in UIMA or GATE.

In the course of the present section, we will present a variety of WMSes that can be used to help the researcher in TM. In the 'Text mining workflow management systems' section, we show WMSes which are designed specifically for the purpose of TM. The 'General purpose workflow engines' section presents a further list of WMSes which are designed for general purpose research. It is possible to use these for TM, and this could be appropriate for a researcher who has previous experience with one of these platforms. Finally, the 'Discovering tools and services' section presents repositories that are useful for the discovery and storage of TM services.

Text mining workflow management systems

Almost any TM application is formulated as a workflow of operational modules. Each component performs a specific analysis step on the text and, when it is finished, the next component begins. Some components may be generic and can be used for many different applications (file lookup, sentence splitting, parsing, entity identification, relation extraction, etc.), whereas other components may be less common, or even built specifically for the task at hand. A TM WMS defines a common structure for these components and facilitates the creation of a workflow of existing components, as well as helping with the integration of new components. The variety of functionality that software packages in this area provide is rich; often packages provide more than one functionality. To make this approachable, we organize the software packages into four major categories, based on what we perceive to be the predominant functionality of a package:
• Processing frameworks: software frameworks which focus around one specific data model and component model (Table 9).
• Analytics packages: software libraries that provide NLP/TDM-related analytics (Table 10).
• Component collections: software packages that integrate analytics packages with a processing framework (Table 11).
• Analytics workbenches: user-facing tools which permit the composition of components into workflows, the execution of workflows, and the inspection of results (Table 12).

It is also notable that most of the software is implemented in Java. The Java platform provides interoperability across most hardware and operating system platforms (e.g. Windows, Linux, OS X). It also facilitates interoperability between the different software packages. For example, Java-based component collections can more easily integrate other Java-based software than software implemented in C/C++ or Python (although this is not impossible).
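To make the workflow-of-components idea concrete, the following minimal sketch (ours, not taken from any of the frameworks above) chains simple processing steps over a shared document object. This is, in spirit, what UIMA aggregates or GATE applications do at a much larger scale, with typed annotations and declarative workflow descriptions.

    from typing import Callable, List

    # A 'document' is just a dict carrying the text plus accumulated
    # annotations; real frameworks use typed data models instead.
    Component = Callable[[dict], dict]

    def sentence_splitter(doc: dict) -> dict:
        doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
        return doc

    def tokenizer(doc: dict) -> dict:
        doc["tokens"] = [s.split() for s in doc["sentences"]]
        return doc

    def run_pipeline(doc: dict, components: List[Component]) -> dict:
        # Each component runs when the previous one has finished, as in a WMS.
        for component in components:
            doc = component(doc)
        return doc

    result = run_pipeline({"text": "TM helps biocurators. It scales reading."},
                          [sentence_splitter, tokenizer])
    print(result["tokens"])

Swapping in a different tokenizer, e.g. one tuned to chemical names, only requires replacing one element of the component list, which is precisely the kind of flexibility that the interoperability discussion above is about.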
Processing frameworks

In terms of processing frameworks, the Apache UIMA framework and the GATE framework (42) appear to be stronger and more widely used in the TM community than Alvis (71) or Heart of Gold (72).

Table 9. A comparison of popular interoperability frameworks and supported workflows

Name | Workflow description language | Workflow engine | Programming language | License
Alvis (a) | Alvis | Alvis | Java | ALv2
Apache UIMA (b) | Aggregates, CPE, UIMA AS, RUTA, UIMA DUCC | Aggregates, CPE, UIMA AS, RUTA, UIMA DUCC | Java/C++ | ALv2
GATE Embedded (c) | GATE Applications | GATE Embedded | Java | LGPL
Heart of Gold (d) | Yes (unnamed) | MoCoMan | Java/Python | LGPL

a. http://www.quaero.org/module_technologique/alvis-nlp-alvis-natural-language-processing/
b. https://uima.apache.org/
c. https://gate.ac.uk/family/embedded.html
d. http://heartofgold.dfki.de/

Several of the component collections presented below are UIMA-based and in principle interoperable at the level of workflows and data model. In particular, we can infer that the expressiveness and flexibility of the UIMA data model appear to fulfil the needs of the community. However, each of the UIMA-based software packages uses its own specific annotation type system. This means that things that are conceptually the same, e.g. tokens or sentences, have different names and often different properties and relations to each other. Consequently, the practical interoperability here is limited.

Analytics packages

The list given here is by no means exhaustive, but it is rather representative of software packages that support a whole set of analysis tasks (tokenising, POS tagging, parsing, etc.) instead of only a single task.

Table 10. A comparison of popular analytics packages

Name | Native processing framework support | Programming language | Repository | License
Apache OpenNLP (a) | UIMA | Java | Maven | ALv2
NLP4J (aka Emory NLP) (b) | No | Java | Maven | ALv2
FreeLing (c) (73) | No | C++ | No | AGPL + commercial
NLTK (d) (74) | No | Python | PyPI | ALv2
LingPipe (e) | No | Java | Maven | AGPL + commercial
Stanford CoreNLP (f) (75) | No | Java | Maven | GPL + commercial

a. https://opennlp.apache.org/
b. https://github.com/emorynlp/nlp4j
c. http://nlp.lsi.upc.edu/freeling/
d. http://www.nltk.org/
e. http://alias-i.com/lingpipe/
f. http://stanfordnlp.github.io/CoreNLP/

Most of the analytics software presented here is in principle language-independent and is only specialized to a particular language or domain by models, e.g. machine learning classifiers trained for a specific language or domain, rules created for the extraction of specific information, domain-specific vocabularies and knowledge resources, etc. However, the level of support across languages varies dramatically. Models and resources for English are available in almost all software packages; further well-supported languages include German, French, Spanish, Chinese and Arabic, followed by a long tail of limited support for other languages.
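Of the packages in Table 10, NLTK is probably the quickest to try from Python. The following sketch runs the standard sentence splitting, tokenisation and part-of-speech tagging steps; it assumes the relevant NLTK models ('punkt' and the averaged perceptron tagger, the resource identifiers used by NLTK 3) have been downloaded once beforehand.

    import nltk

    # One-off model downloads (names are NLTK's own resource identifiers).
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = ("Protein p53 regulates the cell cycle. "
            "It also functions as a tumour suppressor.")

    for sentence in nltk.sent_tokenize(text):
        tokens = nltk.word_tokenize(sentence)
        print(nltk.pos_tag(tokens))

Comparable functionality is available in the Java packages listed above (OpenNLP, Stanford CoreNLP, LingPipe), usually with a wider choice of pre-trained models.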
Table 12. A comparison of popular analytics workbenches

Name | Processing framework | UI | Component collection | External repositories | License
Argo (a) | UIMA | Web-based (service) | NaCTeM | No | Proprietary
CLARIN-D WebLicht (b) | Proprietary | Web-based (service) | Built-in | No | Proprietary
GATE Developer (c) | GATE | Installable application | Built-in | External GATE repositories | LGPL
U-Compare (d) | UIMA | Installable application | Built-in | No | Proprietary
UIMA Ruta (e) | UIMA | Installable application (Eclipse plugin) | UIMA-based (e.g. DKPro Core) | Yes (via Maven) | ALv2
LAPPS Grid Galaxy (f) | UIMA + GATE via Galaxy | Web-based, installable application | Multiple (e.g. GATE, DKPro Core) | Galaxy Tool Shed | ALv2

a. http://argo.nactem.ac.uk/
b. http://www.clarin-d.de/en/language-resources-and-services/weblicht
c. https://gate.ac.uk/family/developer.html
d. http://nactem.ac.uk/ucompare/
e. https://uima.apache.org/ruta.html
f. http://galaxy.lappsgrid.org/

Table 11. A comparison of popular component collections

Name | Focus area | Processing framework | Repository | Programming language | License
Apache cTAKES (a) | Medical records | UIMA | Maven | Java | ALv2
Bluima (b) | Biomedical | UIMA | Maven | Java | ALv2
ClearTK (c) | Machine learning | UIMA | Maven | Java | BSD/GPL
DKPro Core (d) | Linguistic analysis | UIMA | Maven | Java | ALv2/GPL
JCoRe (e) | Biomedical | UIMA | Maven | Java | LGPL/GPL
BioNLP-UIMA (f) | Biomedical | UIMA | Maven | Java | BSD
GATE built-in component collection (g) | Linguistic analysis and information extraction | GATE | GATE | Java | LGPL/GPL
NaCTeM collection (h) | Biomedical | UIMA | None | Java | Proprietary
Semantic Software Lab collection (i) | Biomedical | GATE | GATE | Java | LGPL/GPL

a. http://ctakes.apache.org/
b. https://github.com/BlueBrain/bluima
c. https://cleartk.github.io/cleartk/
d. https://dkpro.github.io/dkpro-core/
e. http://julielab.github.io/
f. http://bionlp.sourceforge.net/
g. https://gate.ac.uk/
h. http://argo.nactem.ac.uk/
i. http://www.semanticsoftware.info

Component collections

Component collections represent a piece of software that sits between a processing framework and an analytics tool. The software acts as an adapter that allows analytics tools coming from different software packages and created by different providers to be combined into workflows. Often, component collections wrap third-party tools that are also separately available as software packages, but occasionally analytics are provided only in the form of a component for a specific framework.

Component collections typically focus on a particular area of language analysis and are centered around an annotation scheme which models this area in particular. Some collections are focused on a very specific use-case, e.g. cTAKES (76) on the analysis of clinical text, Bluima (77) on the extraction of neuroscientific content and ClearTK (78) on adding machine learning functionality to UIMA. Others host different tools for the general domain of life sciences, e.g. JCoRe (79), the NaCTeM (National Centre for Text Mining) collection (9), BioNLP UIMA (80) and the Semantic Software Lab collection. A third category, including collections like DKPro Core or ClearTK, provides a broad range of rather low-level analytics tools that act as a toolkit for the implementation of many different kinds of use-cases.

Giving a clear indication of the size of a component collection is difficult. For example, if one component can be parametrized with three different models for three different languages, should it be counted three times or only once? Some component collections offer separate wrappers for individual functions of a tool, e.g.
for a tool that combines part-of-speech tagging and lemmatizing. Other collections only offer a single wrapper offering both functionalities. Numbers found on the websites of the respective software packages use different counting strategies for components and are therefore incomparable. The licenses stated for the component collections refer to the primary license of the wrapper code. The actually wrapped third-party software packages often have other li- censes. Also, specific components in a collection may have other licenses, e.g. due to GPL copyleft provisions. Workbenches Using analytics software or components programmatically in the sense of a software library requires programming skills. This is a major problem for the larger adoption of NLP/TDM technologies in less programming-oriented do- mains. Workbenches aim to facilitate the use of analytics components by providing a graphical user interface that allows a user to browse components, assemble them into workflows, execute these workflows, and inspect the results. The workbenches listed here were created with a par- ticular focus on language analytics and build on one or more of the processing frameworks presented before. The more recent LAPPS Grid workbench is based on the gen- eric Galaxy WMS and integrates components across mul- tiple processing frameworks. If two components from the same processing framework are adjacent to each other, they communicate in their native formats, while a small piece of interfacing code called a ‘shim’ is inserted when two adjacent components come from different frame- works. The ‘shim’ then takes care of converting the data before passing it on. These workbenches are based on different philosophies with respect to use and deployment. Both Argo and WebLicht provide the user with a predefined set of compo- nents for text mining. A user cannot currently integrate their own components. These platforms expect that all processing is performed on computing resources which are part of the platform and under the control of the platform providers. Deploying arbitrary custom components on these platforms would present a legal and security risk to the providers and is therefore not appropriate for these platforms. The user, however, only requires a machine cap- able of browsing the web to use these services, rather than their own high performance computing infrastructure as with other workbenches. The same is true in principle for the LAPPS Grid with the difference that interested users can actually install their own instance of the LAPPS Galaxy. This instance can then either talk to LAPPS Grid services or to locally deployed components. Also, users can extend such a local installa- tion with new components. Being based on Galaxy, custom components can be installed either manually or through a Galaxy Tool Shed repository. U-Compare (10) is a standalone application and provides a documented mechanism for the integration of local components. However, there is no explicit sup- port for obtaining additional components from a repository. GATE Developer and the UIMA Ruta Workbench are locally installed applications. They also support the use of arbitrary custom components compatible with their under- lying processing frameworks. GATE components can be installed into the GATE Developer application from exter- nal websites hosting GATE component repositories. UIMA Ruta can be used in conjunction with components ob- tained from Maven repositories. The way the projects are driven also differs greatly. 
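Before comparing how the individual projects are driven, it is worth making the 'shim' idea above concrete. The sketch below is purely illustrative (the annotation dictionaries and field names are invented, not taken from LAPPS, UIMA or GATE): it converts a token annotation produced by one hypothetical framework into the shape expected by another, which is all a shim does.

    def shim_tokens(source_doc: dict) -> dict:
        """Convert token annotations between two hypothetical framework formats.

        The input uses character offsets under 'begin'/'end'; the output format
        expects 'start'/'length' plus the covered text. Field names are invented
        for illustration only.
        """
        text = source_doc["text"]
        converted = []
        for ann in source_doc["annotations"]:
            converted.append({
                "start": ann["begin"],
                "length": ann["end"] - ann["begin"],
                "covered_text": text[ann["begin"]:ann["end"]],
            })
        return {"text": text, "tokens": converted}

    doc = {"text": "BRCA1 mutations",
           "annotations": [{"begin": 0, "end": 5}, {"begin": 6, "end": 15}]}
    print(shim_tokens(doc)["tokens"])

Real shims must also reconcile type names and attribute semantics, which is where most of the practical difficulty lies.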
WebLicht (81) is a part of the CLARIN-D effort, a large-scale infrastructure in Germany and part of the multi-national EU CLARIN effort, which aims for a European infrastructure for language resources and technology in the social sciences and humanities. The LAPPS Grid project has a similar goal in the US but is comparatively much smaller.

The U-Compare system was superseded by Argo (9) at NaCTeM. The vision of Argo is to create an easy-to-use but highly functional WMS for the life-sciences community and beyond to engage in a variety of tasks around text mining, including biocuration (82). It provides a powerful mechanism to obtain and process multiple documents in a user-friendly environment. A variety of export options are available to obtain the final results of processing, including type systems tailored to a particular application (83) and web services supporting interoperable formats (84). Argo is accessible and usable via the web, where a large collection of ready-to-use components can be combined by a novice user to build a workflow. NaCTeM have also used Argo as a tool in collaborations, installing separate instances at partner institutions to enable others to benefit from the software. The system has been applied to discovering phenotypes in clinical records (85), implementing state-of-the-art chemical recognition algorithms (86) and the semi-automatic curation of disease phenotypes (87).

The GATE framework (42) is mainly developed at the University of Sheffield with partners such as Ontotext. However, it is developed as an open source project hosted on SourceForge with a public code repository. They also accept code contributions from the community at large. Additionally, there are community-provided repositories of GATE components, such as the Semantic Software Lab at Concordia University in Montréal, Canada.

UIMA Ruta (88) is part of the Apache UIMA project hosted at the Apache Software Foundation. Like all Apache projects, it is an independent, volunteer-driven community project providing its software under the liberal conditions of the Apache Software License, which suits research and education as well as commercial use. Contributions from the community are welcome.

General purpose workflow engines

As opposed to TM-specific workflow systems, many applications exist for the creation of general purpose scientific workflows. These systems provide functionality for reimplementing experiments and for saving, exporting and sharing experimental code which can easily be re-run by other researchers who are interested in the given experimental results. These systems provide a means to build multi-step computational analyses akin to a recipe. They typically provide a workflow editor with a graphical user interface for specifying what data to operate on, what steps to take, and what order to do them in. A general purpose solution can be adapted for use in a TM context by using TM resources if they are available, e.g. see a case study for SADI (89). Although general purpose workflow engines create an internal form of interoperability at the level of the process model, where all components within a workflow will work together, workflow engines from different providers cannot typically be expected to interact. Also, interoperability at the level of the process model does not automatically entail interoperability at the level of the data model, annotation type system, data serialization format, etc.
A comparison is given in Table 13 below.

Table 13. A comparison of general purpose workflow engines

Name | Description of modules | License | Example domains | Component creation | Language
ELKI (a) | Data mining algorithms; clustering; outlier detection; dataset statistics; benchmarking, etc. | GNU AGPL | Cluster benchmarking | Programming new Java components | Java
Galaxy (b) | Genome research; data access; visualization components | AFL 3 | Bioinformatics | Command line tools | Python
Kepler (c) | Wide variety of components | BSD | Bioinformatics, data monitoring | Java components, R scripts, Perl, Python, compiled C code, WSDL services | Java
KNIME (d) | Univariate and multivariate statistics; data mining; time series; image processing; web analytics; TM; network analysis; social media analysis | GPL3 | Business intelligence, financial data analysis | Java, Perl, Python code fragments | Java (Eclipse plugin)
Pegasus (e) | Shell scripts; command line tools | Apache | Astronomy, bioinformatics, earthquake science | Command line | Java, Python, C
Pipeline Pilot (f) | Chemistry; biology; materials modelling; simulation | Proprietary | Chemicals, energy, consumer packaged goods, aerospace | Users cannot create components | C++
Taverna (g) | Wide variety of components | LGPL | Bioinformatics, astronomy, chemo-informatics, health informatics | WSDL, SOAP and REST services, Beanshell scripts, local Java API, R scripts | Java
Triana (h) | Audio, image, signal and text processing; physics studies | Apache | Signal processing | Programming new Java components | Java
SADI (i) | Access to databases and analytical tools for bioinformatics | BSD | Bioinformatics | Web services | OWL, RDF, SPARQL

These can be used for a variety of scientific programming applications, of which one is TM. We have provided some examples of the typical usages of these resources in the table above.
a. http://elki.dbs.ifi.lmu.de/
b. https://galaxyproject.org/
c. https://kepler-project.org/
d. https://www.knime.org/knime-analytics-platform
e. https://pegasus.isi.edu/
f. http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/
g. http://www.taverna.org.uk/
h. http://www.trianacode.org/
i. http://sadiframework.org/content/

It can be a hard task to select the system that perfectly fits one's purpose. However, taking into account the unique characteristics of each system will help in the decision process. The initial design purpose of a system is one of the features that has a major effect on the overall usability of such systems. Among the discussed systems, Kepler (90), Pegasus (91) and Taverna (92) are those with the aim of creating a general purpose scientific workflow engine. Thus, it is assumed that they would be the least coupled with any particular domain, and the easiest to adapt to new domains. In contrast, ELKI (93), KNIME (94) and Triana (95) were originally created to perform data mining tasks; hence, their power resides in implementing and executing statistical algorithms. Other workflow engines were created for specific domain experiments and later applied to other domains as well.

The next important classifying feature is the creation of new components for a workflow engine. All of the mentioned tools except Pipeline Pilot (96) and SADI (97) are implemented using Java or Python. This makes them platform independent, and also facilitates the implementation of new components. SADI is not a typical workflow engine, but rather a set of design patterns that help to achieve interoperability and let users combine different tools into a pipeline. It uses well-known Semantic Web standards: each component is a RESTful service communicating using OWL, RDF and SPARQL. Kepler and Taverna also offer direct support for the integration of WSDL services as workflow components. Taverna also supports SOAP and REST services.

In addition to ease of component development, the provision of a publicly accessible repository of workflow components is also important. In this respect, Kepler, Galaxy (98) and Taverna are the only projects that offer a public central repository of components. Furthermore, the Kepler system enables the creation of a single executable KAR (Kepler Archive) file of a workflow, which conforms to the JAR (Java Archive) file format. ELKI creates native JAR files.

The deployment model and execution model of a workflow play a major role in the choice of a workflow engine. In this sense, ELKI, Kepler, Galaxy and Pegasus support executing workflows on a computing grid or cloud. Additionally, the fault tolerance of the Pegasus workflow engine is a feature that should not be neglected. This feature brings major benefits in cases where a process-intensive workflow fails in the middle of execution.

Another factor to be considered is the community effect: is there a strong developer community who maintains the product, and in which communities is the product being used? In this respect, we note that, recently, the LAPPS project and the Alveo project have adopted Galaxy. LAPPS is a project based in the USA that aims at creating a grid-like distributed architecture for NLP. The Alveo project has similar goals and is based in Australia.

As discussed before, choosing a suitable workflow engine is not a trivial task. However, considering the general properties of the different systems enables an informed decision. It should also be noted that reviewing the domains in which these workflow engines have already been applied, and their example usage scenarios, will be greatly beneficial.

Discovering tools and services

This section describes the online repositories where tools and services can be discovered.
Some of these also contain records for documents and corpora. We have organized the repositories into the following two categories:
1. Registries: registries facilitate discovery by maintaining metadata on tools, services and data. However, they do not actually host these, such that downloading or executing them requires the involvement of additional sites. It may not even be possible to access the referenced resources at all, e.g. because they have not been publicly released. Online registries of tools and services that (at least partially) concern text processing are numerous. For each of them, we provide the number of available services, accessibility, supported standards, status and domain. We also include references, when available.
2. Platforms: the final set of resources presents platforms that focus on the interaction between language resources and text processing services, i.e. they enable the running of web services either on data included in the platform or uploaded by the user. They target users with low expertise in technology and try to provide ready-made solutions for processing data rather than discovering resources.

The number of resources that can be discovered or obtained through these sites varies greatly.
For example, the CLARIN VLO tends to count each individual docu- ment or recording as a separate entry, even if these would otherwise be considered to be part of a collection or cor- pus. On the other hand, the LINDAT/CLARIN repository has only a single entry for each corpus, irrespective of its size. The information contained in the repositories can be seen from two perspectives: as a human or as a machine. From the perspective of a user who want to discover tools and services relevant for their task at hand and field of interest the following features may be acceptable: free text metadata, heterogenous forms of formatting and packag- ing resources (e.g. as separate files, ZIP files, various file formats), the need to authenticate in a web browser, or even the need to send a mail to the resource creator to re- quest access. For automated processing by a machine, the following features are mandatory: controlled vocabularies, the use of standard file and packaging formats, and the ability to obtain a URL to access a resource. Registries presently still target mostly the human user and offer only limited metadata related to programmatic access. As a con- sequence, it is not straightforward for platforms like LAPPS or ALVEO to make use of these sites as sources for workflow components or for content to be processed. Most of the sites listed above are based on open source software, often software created by the site maintainers themselves. Thus, it is possible to discover not only if the services are available and being used, but also if they are actively being maintained and or being further developed. We include relevant information about the service status (Running/Closed), about the last update to the service, and a link to the open source code repository in Table 14. Discussion Despite the obvious advantages of text mining, several obs- tacles have limited its widespread uptake amongst those in the life sciences who could profit from it most. The first obstacle is a lack of power in the computing resources that underpin text mining software, especially for large-scale processing. The advent of distributed, cloud-based com- puting has helped to put an end to this issue in recent years. The second obstacle is the prototypical nature of many sys- tems, especially those based on natural language process- ing techniques, whose designers were faced with adapting general language tools to the particular challenges pre- sented by scholarly communication in the life sciences. While research in the field is ever-on-going, there are now mature, robust tools and systems that achieve results com- parable with those of human analysis in many life sciences tasks. The third obstacle, complementing the second, is the lack of suitably annotated data to better understand the types of problem and train supervised machine-learning based systems. Collaboration with domain specialists within the context of community evaluation challenges (103), such as BioCreative (104), BioNLP (105), BioASQ (106) and other annotation efforts (107, 108), has miti- gated this lack through the provision of gold standard cor- pora for certain well-defined competitive text mining tasks designed to advance the state of the art. However, it re- mains true that a researcher interested in applying text mining to some particular research question concerning a particular sub-domain may be faced with a lack of some trained tool or annotated corpus that would hamper their efforts. 
We have seen throughout this paper, however, how other aspects of text mining can help reduce the time and effort it would otherwise cost to fill such a gap. The fourth obstacle, again somewhat related to the second, is a lack of interoperability. This presents itself in various guises. For example, a tool might split a text into tokens and then tag it for part of speech, but such a black box combination of processes means that one could not, for example, use a different tokenizer better suited to the task in hand, say, for tokenization of chemical compounds. Such issues have gradually become less important, due to a general move in software engineering towards component-based systems. However, natural language processing and text mining are further affected by interoperability issues at the linguistic and indeed conceptual level. A simple example will suffice to illustrate: a researcher finds two tools, one that recognizes the names of bio-entities in text and another that extracts relations among bio-entities. However, the entity types (labels) that the first produces are not the same as those that the relation finder expects to find in its input: there may be no intersection or only a partial one; even if there is an intersection in terms of names, there may be none in terms of what entities they refer to. This lack of interoperability has been a major blocking factor for developers and users. Fortunately, in recent years, much progress has been made on standardization and normalization in the field, such that interoperability is much enhanced. Although interoperability is not a totally solved problem, recent initiatives (shown below) have yielded benefits for both developers and users alike.

The fifth obstacle is that access to content for text mining is frequently limited for legal reasons (109). Publishers of non-open access journals usually require TM researchers to negotiate a licence agreement for every research project and impose several restrictions, e.g. non-commercial use (110). This has been alleviated by exceptions for TM introduced in several countries, which allow researchers to automatically mine the content they have lawful access to. The lack of such regulations in some areas, e.g. the European Union, significantly hampers data mining research (111).

Table 14. A comparison of repositories for tools and services that can be redeployed in the text miner's workflow

Title | Available records | Accessibility | Status | Domain | Category
BioCatalogue (a) (99) | 1,184 | Open access/use and open registration | Running, last updated in 2015 (b) | Life sciences | Registry
Biodiversity Catalogue (c) | 71 | Open access/use and open registration | Running, last updated in 2015 (d) | Biodiversity | Registry
Orbit (e) | 89 | Open access/registration requires approval | Running, last updated in 2015 | Biomedical informatics | Registry
AnnoMarket (f) (100) | 60 | Paid for (customers can pay to use any service, third parties can upload their own services and data to sell) | Closed, last updated in 2014 (g) | General | Platform
META-SHARE (h) (20) | More than 2,765 | Restricted (anyone can access but addition of new resources requires registering as a META-SHARE member) | Running, last updated in 2016 (i) | General | Registry
LRE Map (j) (21) | 3985 | Closed (no option to add own resources) | Running, closed source | General | Registry
ALVEO (k) (101) | 34 | Open use of services; uploading of services locked | Running, last updated in 2016 (l) | General | Platform
Language Grid (m) (102) | 142 | Open use of services for non-profit and research; uploading of services for members | Running, last updated in 2015 (n) | General | Platform
LAPPS Grid (o) (43) | 45 | Open use of services; uploading of services locked | Running, last updated in 2016 (p) | General | Platform
QT21 (q) | 598 | Open browsing and use of services, restricted registry | Beta, closed source | General | Platform
LINDAT/CLARIN (r) | 1162 | Open | Running, last updated 2016 (s) | Open | Registry
CLARIN Virtual Language Observatory (t) (19) | 880 915 | Open | Running, last updated 2016 (u) | Open | Registry

There is a large variation in the size and accessibility of these repositories.
a. https://www.biocatalogue.org/
b. https://github.com/myGrid/biocatalogue
c. https://www.biodiversitycatalogue.org/
d. https://github.com/myGrid/biocatalogue
e. https://orbit.nlm.nih.gov/
f. https://annomarket.com
g. https://github.com/annomarket
h. http://www.meta-share.eu
i. https://github.com/metashare/META-SHARE
j. http://www.resourcebook.eu
k. http://alveo.edu.au
l. https://github.com/Alveo
m. http://langrid.org
n. http://svn.code.sf.net/p/servicegrid/code
o. http://www.lappsgrid.org
p. https://github.com/lappst
q. http://www.qt21.eu
r. https://lindat.mff.cuni.cz/en/
s. https://github.com/ufal/lindat-dspace
t. https://vlo.clarin.eu/
u. https://github.com/clarin-eric/VLO

This study is a part of the efforts of the on-going OpenMinTeD project (http://openminted.eu/) to build an interoperable text and data mining (TDM) infrastructure, which could help to relieve some of these obstacles. As such, we have tried to promote interoperability standards throughout this report where possible. Interoperability is a topic which has already been discussed and studied at length over multiple large research projects. The CLARIN project (19) produced a review of accepted standards which were designed to promote interoperability in multiple fields.
The META-NET project (112) produced a language resources sharing and discovery infrastructure, META-SHARE (20), along with associated metadata standards (11). The FLaReNeT project, aiming to create a European strategy for the area of language processing and resources, prepared an assessment of the current standards landscape (113). Other on-going efforts include the Research Data Alliance (https://rd-alliance.org/), an initiative for promoting openness in research data sharing, which has several working groups interested in interoperability, and FutureTDM (http://project.futuretdm.eu/), which tries to assess and reduce legal and policy barriers to the growth of the text mining field. Another source of increasing interoperability in TDM is large ecosystems, such as UIMA and GATE, supporting open standards and attracting a lot of users, who are also able to contribute their own components.

In this study we aimed to give an accurate account of the landscape of available text mining resources for biocuration, but clearly our approach has its limitations. The field has been divided into areas and subareas corresponding to tasks in a TM application, and we selected the most important and representative resources in each. We could not include every possible item, as there is a long tail of resources created for a particular problem, often within a single project, and then abandoned with little or no support or documentation. We have shown that there are at least several options to choose from at every step of the text mining process, which makes it possible to construct a working end-to-end application. The choices a user makes could be motivated by factors other than core functionality, e.g. resource interoperability, usage of open standards, prior conventions or what has already been successfully applied in the target domain.

We have focused on text mining for the life sciences in this study. However, text mining is also growing in many other areas. We have chosen not to speak about these in this survey, but instead leave a more general overview of the text-mining field to future work. Our decision to take this focus is wholly appropriate, as the life sciences is the most common domain in text-mining research (114). The life sciences has very well-developed terminological resources, which make text mining easier, and many publications are published in open-access journals, making them accessible for text mining. There is also a great need for text mining in the life sciences, as evidenced by the now infamous 'data deluge'. Even a researcher in a minor subfield is expected to keep up with increasingly large volumes of new publications. Fortunately, the techniques that we have discussed in this report are transferable to other domains. Many of the repositories that we have discussed are not solely focused on the life sciences but also contain TM resources for other appropriate domains.

Throughout this study, we have tried to give prominence to those resources which are the result of efforts towards interoperability. We have seen a wide range of interoperability throughout the report. Some sections (e.g. Mechanisms used for the identification of resources, Annotation Models, Formats for Knowledge Resources, General Purpose Workflow Engines) relate to areas with comparatively low levels of interoperability. In these areas, there is little uptake of existing standards, or maybe no standards altogether. Other areas exhibit a high degree of potential for interoperable systems (e.g.
Metadata schemata and profiles, Vocabularies and ontologies for describing specific information types, and Text mining workflow management systems). These areas may have multiple competing standards, each of which allows a user to build and access resources that are easy to connect to pre-existing code thanks to their implementation of existing interoperability standards. It can sometimes be the case that standards exist but are not used because the community is not aware of them. We hope that this study goes some way towards addressing that gap. To this end, we have promoted interoperability standards wherever possible, alongside a discussion of the virtues of integrating resources with these standards. There are some cases where it may not be appropriate for a user to implement interoperability standards, e.g. in a closed ecosystem, during rapid prototyping or when integrating with third-party tools. However, a user should be able to consciously choose not to implement an interoperability standard, rather than not knowing about its existence in the first place.

We have presented a set of relevant repositories at the end of each section. These are intended to help the reader find a wider set of resources than those we have mentioned in this paper. The repositories are kept current, so by browsing them the user can find relevant and up-to-date resources. The lists of repositories are not meant to be binding or comprehensive, but are instead intended as useful starting points for the reader to get an idea of what is on offer. In most cases it will be beneficial for the user to search for repositories related to the type of work that they are doing. If none can be found, the reader may wish to consider starting their own repository, implementing some of the metadata standards which we have previously discussed. When browsing a repository, the user should consider questions such as: 'What other kinds of resources are typically stored in this repository?', 'What types of metadata are used to describe resources?' and 'How easy is it to upload new resources?'. We have tried to equip the reader throughout this report to be able to answer these questions for themselves.

A recent study by Thompson et al. (115) may serve as an example of combining all of the elements of text mining within a single project. The goal was to analyse medical vocabulary from a historical perspective, observing how certain terms and concepts appear, transform and wither across the years. The authors started by acquiring content from the British Medical Journal archive, which is accessible via CrossRef (see 'Metadata schemata and profiles' section), and London Medical Officer of Health reports, which are downloadable in multiple formats. Next, the texts were manually annotated with medical entities and saved in the BRAT format (see 'Annotation models' section). To create a time-sensitive inventory of medical terms, the authors both implemented an automatic method based on distributional semantics and employed a thesaurus aggregating over 150 terminological resources (see 'Useful knowledge resources' section).
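To make the first two steps of such a pipeline more concrete, the sketch below shows one possible way of (i) retrieving article metadata from the public CrossRef REST API and (ii) reading entity annotations from BRAT standoff (.ann) files. This is a minimal illustration only, not the method used by Thompson et al.; the journal query, date range and annotation file name are hypothetical placeholders.

```python
"""Minimal sketch: fetch article metadata via the CrossRef REST API and read
BRAT standoff entity annotations. Illustrative only; the journal query, date
range and file name below are hypothetical placeholders."""
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"


def fetch_article_metadata(container_title, from_date, until_date, rows=20):
    """Return CrossRef work records for a journal within a publication window."""
    params = {
        "query.container-title": container_title,
        "filter": f"from-pub-date:{from_date},until-pub-date:{until_date}",
        "rows": rows,
        "select": "DOI,title,issued",
    }
    response = requests.get(CROSSREF_WORKS, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["message"]["items"]


def read_brat_entities(ann_path):
    """Parse contiguous entity ('T') lines from a BRAT .ann standoff file."""
    entities = []
    with open(ann_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("T"):
                continue  # skip relations, events and notes in this sketch
            ann_id, type_and_span, surface_text = line.rstrip("\n").split("\t")
            entity_type, span = type_and_span.split(" ", 1)
            if ";" in span:
                continue  # skip discontinuous spans for simplicity
            start, end = (int(offset) for offset in span.split())
            entities.append((ann_id, entity_type, start, end, surface_text))
    return entities


if __name__ == "__main__":
    # Hypothetical query: early twentieth-century articles from a medical journal.
    for item in fetch_article_metadata("British Medical Journal",
                                       "1900-01-01", "1910-12-31", rows=5):
        print(item.get("DOI"), (item.get("title") or ["(no title)"])[0])
    # Hypothetical annotation file produced with BRAT:
    # for entity in read_brat_entities("example.ann"):
    #     print(entity)
```

In practice one would page through the full result set (CrossRef supports cursor-based paging) and convert the standoff annotations into whichever interchange format the downstream workflow expects, which is where the annotation models discussed earlier become relevant.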
The obtained corpus has been used to study the performance of named entity and event recognition techniques, implemented in the Argo workflow manager (see 'Text mining workflow management systems' section) using readily available components. Finally, both the annotated corpus and the term inventory, encoded in OBO format (see 'Formats for knowledge resources' section), were published in the META-SHARE repository (see 'Language resources repositories' section). Employing open standards, formats and services for publishing annotated content, created vocabularies and workflows makes it more likely that such a study will be useful for related research projects.

The on-going OpenMinTeD project is at the forefront of promoting text and data mining amongst the communities that need it most. The project is currently working on several fronts to further the cause of text and data mining. First, a platform will be produced which will allow a novice TM user to experiment with standard tools and their own data. Second, a set of flagship applications will be developed using the platform to demonstrate the power of TM and to promote TM within the communities for which the applications are developed. Third, the project will provide a set of interoperability guidelines which will allow third-party applications to integrate with the platform. This will make the platform a focal point for new technology. Application developers will benefit from implementing the interoperability specifications, as their tools will gain access to a wide market of users. Finally, the project will provide training and community engagement events to educate and equip users who may not have the technical expertise to use TM within their own research. The audience of this paper should also make themselves aware of such efforts, as they are designed to reduce the difficulty encountered by the novice text miner.

In this report, we have covered a wide variety of topics: where to find publications for text mining in the 'Content discovery' section, how resources are encoded in the 'Knowledge encoding' section and, finally, how to bring resources and components together in a text mining workflow in the 'Tools and services' section. We have equipped the reader with the knowledge they need to make informed choices about the resources that currently exist in the field. The final decision of how to use these resources to extract useful information from their data rests with the reader.

Funding

This work is jointly supported by the EC/H2020 project An Open Mining INfrastructure for TExt and Data (OpenMinTeD), Grant ID: 654021, and the BBSRC project Enriching Metabolic PATHwaY models with evidence from the literature (EMPATHY), Grant ID: BB/M006891/1.

References

1. Vardakas,K.Z., Tsopanakis,G., Poulopoulou,A. and Falagas,M.E. (2015) An analysis of factors contributing to PubMed's growth. J. Informetrics, 9, 592–617.
2. Druss,B.G. and Marcus,S.C. (2005) Growth and decentralization of the medical literature: implications for evidence-based medicine. J. Med. Libr. Assoc., 93, 499–501.
3. Larsen,P.O. and von Ins,M. (2010) The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics, 84, 575–603.
4. Simpson,M.S. and Demner-Fushman,D. (2012) Biomedical text mining: a survey of recent progress. In: Aggarwal,C.C., Zhai,C. (eds). Mining Text Data. Springer, New York, pp. 465–517.
5. Ananiadou,S., Kell,D.B. and Tsujii,J. (2006) Text mining and its potential applications in systems biology. Trends Biotechnol., 24, 571–579.
6. Stührenberg,M., Werthmann,A. and Witt,A. (2012) Guidance through the standards jungle for linguistic resources. In: Proceedings of the LREC 2012 Workshop on Collaborative Resource Development and Delivery, pp. 9–13.
7. Hirschman,L., Burns,G.A.P.C., Krallinger,M. et al. (2012) Text mining for the biocuration workflow. Database, 2012, bas020.
8. Ferrucci,D. and Lally,A. (2004) UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng., 10, 327–348.
9. Rak,R., Rowley,A., Black,W. and Ananiadou,S. (2012) Argo: an integrative, interactive, text mining-based workbench supporting curation. Database, 2012, bas010.
10. Kano,Y., Baumgartner,W.A., McCrohon,L. et al. (2009) U-Compare: share and compare text mining tools with UIMA. Bioinformatics, 25, 1997–1998.
11. Gavrilidou,M., Labropoulou,P., Desipri,E. et al. (2012) The META-SHARE Metadata Schema for the Description of Language Resources. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey. http://www.lrec-conf.org/proceedings/lrec2012/pdf/998_Paper.pdf.
12. Weibel,S. (2005) The Dublin core: a simple content description model for electronic resources. Bull. Am. Soc. Inform. Sci. Technol., 24, 9–11.
13. Huh,S. (2014) Journal Article Tag Suite 1.0: National Information Standards Organization standard of journal extensible markup language. Sci. Edit., 1, 99–104.
14. Brase,J. (2009) DataCite—A Global Registration Agency for Research Data. In: Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology. IEEE, pp. 257–261.
15. Pentz,E. (2001) CrossRef: a collaborative linking network. Issues in Science and Technology Librarianship, 2001, 10.5062/F4CR5RBK. http://istl.org/01-winter/article1.html.
16. Winn,J. (2013) Open data and the academy: an evaluation of CKAN for research data management. In: IASSIST 2013.
17. Jörg,B. (2010) CERIF: the common European research information format model. Data Sci. J., 9, CRIS24–CRIS31.
18. Ide,N. and Véronis,J. (1995) Text Encoding Initiative: Background and Contexts. Springer Science & Business Media, Dordrecht.
19. Varadi,T., Krauwer,S., Wittenburg,P. et al. (2008) CLARIN: Common Language Resources and Technology Infrastructure. In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
20. Piperidis,S. (2012) The META-SHARE Language Resources Sharing Infrastructure: Principles, Challenges, Solutions. In: Proceedings of the 8th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), Istanbul, Turkey. http://www.lrec-conf.org/proceedings/lrec2012/pdf/1086_Paper.pdf.
21. Calzolari,N., Gratta,R.D., Francopoulo,G. et al. (2012) The LRE Map. Harmonising Community Descriptions of Resources. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.
22. Lipscomb,C.E. (2000) Medical subject headings (MeSH). Bull. Med. Libr. Assoc., 88, 265–266.
23. Ison,J., Kalas,M., Jonassen,I. et al. (2013) EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics, 29, 1325–1332.
24. Dewey,M. (1876) A Classification and Subject Index for Cataloguing and Arranging the Books and Pamphlets of a Library. Brick Row Book Shop, Incorporated, New Haven, CT.
25. McIlwaine,I.C. (1986) The universal decimal classification: some factors concerning its origins, development, and influence. J. Am. Soc. Inform. Sci., 48, 331–339.
26. Sure,Y., Bloehdorn,S., Haase,P. et al. (2005) The SWRC Ontology – Semantic Web for Research Communities. In: Proceedings of the 12th Portuguese Conference on Artificial Intelligence. Springer, Berlin/Heidelberg.
27. Schirrwagen,J., Subirats-Coll,I. and Shearer,K. (2016) COAR Resource Types—a SKOSified Vocabulary for Open Repositories. In: Open Repositories 2016 (OR2016).
28. Abelson,H., Adida,B., Linksvayer,M. and Yergler,N. (2008) ccREL: The creative commons rights expression language. Technical Report, Creative Commons.
29. Iannella,R. (2004) The Open Digital Rights Language: XML for Digital Rights Management. Information Security Technical Report, 9, 47–55.
30. Chandrakar,R. (2006) Digital object identifier system: an overview. Electron. Libr., 24, 445–452.
31. Haak,L.L., Fenner,M., Paglione,L. et al. (2012) ORCID: a system to uniquely identify researchers. Learned Publishing, 25, 259–264.
32. Manghi,P., Manola,N., Horstmann,W. and Peters,D. (2010) An infrastructure for managing EC funded research output: the OpenAIRE project. Grey J., 6, 31–39.
33. Pieper,D. and Summann,F. (2013) Bielefeld Academic Search Engine (BASE): an end-user oriented institutional repository search service. Libr. Hi Tech, 24, 614–619.
34. Lindberg,D.A.B. (2000) Internet access to the National Library of Medicine. Effect. Clin. Pract., 4, 256–260.
35. Maloney,C., Sequeira,E., Kelly,C. et al. (2013) PubMed Central. In: The NCBI Handbook. National Center for Biotechnology Information (US), Bethesda, MD.
36. Ide,N. and Suderman,K. (2014) The Linguistic Annotation Framework: a standard for annotation interchange and merging. Lang. Resources Eval., 48, 395–418.
37. Sanderson,R., Ciccarese,P. and Van de Sompel,H. (2013) Designing the W3C open annotation data model. In: Proceedings of the 5th Annual ACM Web Science Conference on WebSci '13. ACM Press, New York, NY, USA, pp. 366–375.
38. Hellmann,S., Lehmann,J., Auer,S. and Brümmer,M. (2013) Integrating NLP using Linked Data. In: Proceedings of the 12th International Semantic Web Conference, Sydney, Australia.
39. Comeau,D.C., Dogan,R.I., Ciccarese,P. et al. (2013) BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013, bat064.
40. Verhagen,M., Suderman,K., Wang,D. et al. (2016) The LAPPS Interchange Format. In: Proceedings of the Second International Workshop on Worldwide Language Service Infrastructure (WLSI 2015). Springer International Publishing, pp. 33–47.
41. Götz,T. and Suhre,O. (2004) Design and implementation of the UIMA Common Analysis System. IBM Syst. J., 43, 476–489.
42. Cunningham,H., Maynard,D., Bontcheva,K. and Tablan,V. (2002) GATE: an architecture for development of robust HLT applications. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics—ACL '02. Association for Computational Linguistics, Morristown, NJ, USA, pp. 168–175.
43. Ide,N., Pustejovsky,J., Cieri,C. et al. (2016) The Language Application Grid. In: Proceedings of the 2nd International Workshop on Worldwide Language Service Infrastructure (WLSI 2015). Springer International Publishing, pp. 51–70.
44. Stenetorp,P., Pyysalo,S., Topic,G. et al. (2012) BRAT: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL'12), Avignon, France.
45. Kim,J.D., Ohta,T., Pyysalo,S. et al. (2009) Overview of BioNLP'09 shared task on event extraction. In: Proceedings of the Workshop on BioNLP Shared Task—BioNLP '09. Association for Computational Linguistics, Baltimore, MD.
46. Wei,C.H., Kao,H.Y. and Lu,Z. (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res., 41, W518–W522.
47. Eckart de Castilho,R., Biemann,C., Gurevych,I. and Yimam,S.M. (2014) WebAnno: a flexible, web-based annotation tool for CLARIN. In: Proceedings of the CLARIN Annual Conference (CAC) 2014. CLARIN ERIC, Utrecht, Netherlands.
48. Kim,J.D. and Wang,Y. (2012) PubAnnotation: a persistent and sharable corpus and annotation repository. In: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, Montreal, Canada, pp. 202–205.
49. Francopoulo,G., George,M., Calzolari,N. et al. (2006) Lexical Markup Framework (LMF). In: International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.
50. Smith,B., Ashburner,M., Rosse,C. et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol., 25, 1251–1255.
51. Lindberg,D.A., Humphreys,B.L. and McCray,A.T. (1993) The unified medical language system. Methods Inform. Med., 32, 281–291.
52. The UniProt Consortium. (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res., 36, D190–D195.
53. Caracciolo,C., Stellato,A., Morshed,A. et al. (2013) The AGROVOC Linked Dataset. Semantic Web, 4, 341–348.
54. Haendel,M.A., Neuhaus,F., Osumi-Sutherland,D. et al. (2008) CARO—The Common Anatomy Reference Ontology. In: Anatomy Ontologies for Bioinformatics. Springer London, London, pp. 327–349.
55. Robinson,P.N., Köhler,S., Bauer,S. et al. (2008) The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet., 83, 610–615.
56. Belleau,F., Nolin,M.A., Tourigny,N. et al. (2008) Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inform., 41, 706–716.
57. Livingston,K.M., Bada,M., Baumgartner,W.A. et al. (2015) KaBOB: ontology-based semantic integration of biomedical databases. BMC Bioinformatics, 16, 126.
58. Ashburner,M., Ball,C.A., Blake,J.A. et al. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
59. Mao,Y., Van Auken,K., Li,D. et al. (2014) Overview of the gene ontology task at BioCreative IV. Database, 2014, bau086.
60. Fellbaum,C. (1998) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
61. Chiarcos,C. and Sukhareva,M. (2015) OLiA—Ontologies of Linguistic Annotation. Semantic Web, 6, 379–386.
62. Farrar,S. and Langendoen,D.T. (2003) A linguistic ontology for the semantic web. GLOT Int., 7, 97–100.
63. Vrandecic,D. and Krötzsch,M. (2014) Wikidata: a free collaborative knowledgebase. Commun. ACM, 57, 78–85.
64. Bizer,C., Lehmann,J., Kobilarov,G. et al. (2009) DBpedia—a crystallization point for the Web of Data. J. Web Semant., 7, 154–165.
65. Bollacker,K., Evans,C., Paritosh,P. et al. (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM Press, pp. 1247–1250.
66. Suchanek,F.M., Kasneci,G. and Weikum,G. (2007) YAGO: a large ontology from Wikipedia and WordNet. In: Proceedings of the 16th International Conference on World Wide Web—WWW '07, Volume 6. ACM Press, pp. 697–706.
67. Maegaard,B., Choukri,K., Calzolari,N. and Odijk,J. (2005) ELRA—European Language Resources Association—Background, Recent Developments and Future Perspectives. Lang. Resour. Eval., 39, 9–23.
68. Noy,N.F., Shah,N.H., Whetzel,P.L. et al. (2009) BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res., 37, W170–W173.
69. Jonquet,C., Dzalé-Yeumo,E., Arnaud,E. and Larmande,P. (2015) AgroPortal: a proposition for ontology-based services in the agronomic domain. In: IN-OVIVE'15: 3ème atelier INtégration de sources/masses de données hétérogènes et Ontologies, dans le domaine des sciences du VIVant et de l'Environnement, Rennes, France.
70. Stenetorp,P., Topic,G., Pyysalo,S. et al. (2011) BioNLP shared task 2011: supporting resources. In: Proceedings of the BioNLP Shared Task 2011 Workshop. Association for Computational Linguistics, Portland, Oregon, USA, pp. 112–120.
71. Nédellec,C., Nazarenko,A. and Bossy,R. (2008) Ontology and information extraction. In: Staab,S., Studer,R. (eds), Ontology Handbook. Springer Verlag, Berlin.
72. Schäfer,U. (2006) Middleware for creating and combining multi-dimensional NLP markup. In: Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing (NLPXML'06), Trento, Italy.
73. Padro,L. and Stanilovsky,E. (2012) FreeLing 3.0: towards wider multilinguality. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012). ELRA, Istanbul, Turkey.
74. Bird,S. (2006) NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on Interactive Presentation Sessions (COLING-ACL '06). Association for Computational Linguistics, Morristown, NJ, USA, pp. 69–72. http://portal.acm.org/citation.cfm?doid=1225403.1225421 (8 July 2016, date last accessed).
75. Manning,C.D., Bauer,J., Finkel,J. et al. (2014) The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, Morristown, NJ. http://aclweb.org/anthology/P14-5010.
76. Savova,G.K., Masanz,J.J., Ogren,P.V. et al. (2010) Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J. Am. Med. Inform. Assoc., 17, 507–513.
77. Richardet,R., Chappelier,J.C. and Telefont,M. (2013) Bluima: a UIMA-based NLP Toolkit for Neuroscience. Unstructured Information Management Architecture (UIMA), Darmstadt, Germany.
78. Ogren,P.V., Wetzler,P.G. and Bethard,S.J. (2008) ClearTK: A UIMA Toolkit for Statistical Natural Language Processing. In: Proceedings of the LREC 2008 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP'.
79. Hahn,U., Buyko,E., Landefeld,R. et al. (2008) An overview of JCoRe, the JULIE lab UIMA component repository. In: Proceedings of the LREC 2008 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP'.
80. Baumgartner,W.A., Cohen,K.B., Hunter,L. et al. (2008) An open-source framework for large-scale, flexible evaluation of biomedical text mining systems. J. Biomed. Discov. Collab., 3, 1.
81. Hinrichs,E., Hinrichs,M. and Zastrow,T. (2010) WebLicht: web-based LRT services for German. In: Proceedings of the ACL 2010 System Demonstrations. Association for Computational Linguistics, pp. 25–29.
82. Rak,R., Batista-Navarro,R.T., Rowley,A. et al. (2014) Text-mining-assisted biocuration workflows in Argo. Database, 2014, bau070.
83. Rak,R., Carter,J., Rowley,A., Batista-Navarro,R.T. et al. (2014) Interoperability and Customisation of Annotation Schemata in Argo. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), Reykjavik, Iceland.
84. Rak,R., Batista-Navarro,R.T., Carter,J. et al. (2014) Processing biological literature with customizable Web services supporting interoperable formats. Database, 2014, bau064.
85. Fu,X., Batista-Navarro,R., Rak,R. and Ananiadou,S. (2015) Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows. J. Biomed. Semant., 6, 8.
86. Batista-Navarro,R., Rak,R. and Ananiadou,S. (2015) Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics. J. Cheminform., 7, S6.
87. Batista-Navarro,R., Carter,J. and Ananiadou,S. (2016) Argo: enabling the development of bespoke workflows and services for disease annotation. Database, 2016, baw066.
88. Kluegl,P., Toepfer,M., Beck,P.D. et al. (2016) UIMA Ruta: rapid development of rule-based information extraction applications. Nat. Lang. Eng., 22, 1–40.
89. Riazanov,A., Laurila,J., Baker,C.J. et al. (2011) Deploying mutation impact text-mining software with the SADI Semantic Web Services framework. BMC Bioinformatics, 12, S6.
90. Altintas,I., Berkley,C., Jaeger,E. et al. (2004) Kepler: an extensible system for design and execution of scientific workflows. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004. IEEE, pp. 423–424. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1311241 (8 July 2016, date last accessed).
91. Deelman,E., Blythe,J., Gil,Y. et al. (2004) Pegasus: Mapping Scientific Workflows onto the Grid. In: Proceedings of the 2nd European AcrossGrids Conference (AxGrids 2004). Springer Berlin Heidelberg, pp. 11–20. http://link.springer.com/10.1007/978-3-540-28642-4_2 (8 July 2016, date last accessed).
92. Wolstencroft,K., Haines,R., Fellows,D. et al. (2013) The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res., 41, W557–W561.
93. Schubert,E., Koos,A., Emrich,T. et al. (2015) A framework for clustering uncertain data. In: Proceedings of the 41st International Conference on Very Large Data Bases, pp. 1976–1979.
94. Berthold,M.R., Cebron,N., Dill,F. et al. (2009) KNIME—the Konstanz information miner. ACM SIGKDD Explorations Newsletter, 11, 26–31.
95. Taylor,I., Shields,M., Wang,I. and Harrison,A. (2007) The Triana workflow environment: architecture and applications. In: Taylor,I., Deelman,E., Gannon,D., Shields,M. (eds), Workflows for E-Science. Springer, London, pp. 320–339.
96. Kappler,M.A. (2008) Software for rapid prototyping in the pharmaceutical and biotechnology industries. Curr. Opin. Drug Discov. Dev., 11, 389–392.
97. Wilkinson,M.D., Vandervalk,B., McCarthy,L. et al. (2011) The Semantic Automated Discovery and Integration (SADI) Web service Design-Pattern, API and Reference Implementation. J. Biomed. Semant., 2, 8–30.
98. Goecks,J., Nekrutenko,A., Taylor,J. and The Galaxy Team. (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol., 11, R86.
99. Bhagat,J., Tanoh,F., Nzuobontane,E. et al. (2010) BioCatalogue: a universal catalogue of web services for the life sciences. Nucleic Acids Res., 38, W689–W694.
100. Dimitrov,M., Cunningham,H., Roberts,I. et al. (2014) AnnoMarket—multilingual text analytics at scale on the cloud. In: Proceedings of the Semantic Web Event at ESWC 2014. Springer International Publishing, pp. 315–319.
101. Estival,D. and Cassidy,S. (2014) Alveo, a human communication science virtual laboratory. In: Proceedings of the Australasian Language Technology Association Workshop. Association for Computational Linguistics, pp. 104–107.
102. Ishida,T. (2006) Language grid: an infrastructure for intercultural collaboration. In: International Symposium on Applications and the Internet (SAINT'06). IEEE. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1581319 (8 July 2016, date last accessed).
103. Huang,C.C. and Lu,Z. (2016) Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief. Bioinform., 17, bbv024.
104. Arighi,C.N., Lu,Z., Krallinger,M. et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformat., 12, S1.
105. Nédellec,C., Bossy,R., Kim,J.D. et al. (2013) Overview of BioNLP shared task 2013. In: BioNLP Shared Task 2013 Workshop. Association for Computational Linguistics.
106. Balikas,G., Kosmopoulos,A., Krithara,A. et al. (2015) Results of the BioASQ tasks of the Question Answering Lab at CLEF 2015. In: Proceedings of the Conference and Labs of the Evaluation Forum (CLEF 2015).
107. Bada,M., Eckert,M., Evans,D. et al. (2012) Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13, 161.
108. Funk,C., Baumgartner,W., Garcia,B. et al. (2014) Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinform., 15, 59.
109. Truyens,M. and Van Eecke,P. (2014) Legal aspects of text mining. Comput. Law Secur. Rev., 30, 153–170.
110. Williams,L.A., Fox,L.M., Roeder,C. and Hunter,L. (2014) Negotiating a text mining license for faculty researchers. Informat. Technol. Libr., 33, 5.
111. Handke,C., Guibault,L. and Vallbé,J.J. (2015) Is Europe falling behind in data mining? Copyright's impact on data mining in academic research. SSRN Electron. J., 2015, 10.2139/ssrn.2608513.
112. Rehm,G., Uszkoreit,H., Ananiadou,S. et al. (2016) The strategic impact of META-NET on the regional, national and international level. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 1517–1524.
113. Monachini,M., Quochi,V., Calzolari,N. et al. (2011) The Standards' Landscape Towards an Interoperability Framework. http://www.flarenet.eu/sites/default/files/FLaReNet_Standards_Landscape.pdf.
114. Li,R., Zhong,W. and Zhu,L. (2012) Feature screening via distance correlation learning. J. Am. Stat. Assoc., 107, 1129–1139.
115. Thompson,P., Batista-Navarro,R.T., Kontonatsios,G. et al. (2016) Text mining the history of medicine. PLoS One, 11, e0144717.