1 Machine accessibility of Open Access scientific publications from publisher systems via ResourceSync
Petr Knoth
Knowledge Media Institute, The Open University, United Kingdom
twitter.com/openminted_eu

2 Research literature contains some of the most important information we have assembled as a human species, such as cures for diseases and answers to many of the world’s challenges we are facing today.

3 Reading and systematically analysing this information is beyond human capacities.

4 Why machine accessibility of publications?
• Text and data mining (TDM) can only fulfil its potential if TDM tools can be applied:
  • to the widest possible set of publications
  • as soon as publications are made available
• Many publication providers => need for interoperability

8 Expertise Directory 1/2
• Contacted publishers to clarify the expected approaches
• Developed code to implement them:
  • In most cases the final approach was different from the suggestions we received.
• Tested the approach for scalability
• Documented the approach, justified why we followed it (including what did not work) and gave recommendations to publishers.
The expertise directory is available at: https://github.com/openminted/omtd-publisher-connector-harvester/blob/master/interoperability-layer/interoperability-layer.adoc

9 Expertise Directory 2/2
Example of limitations as described in the Expertise Directory for Elsevier:

10 The idea of the Publisher Connector
Provide seamless access over non-standard APIs
Why not OAI-PMH?
• Slow and very inefficient for big repositories.
• Standardised for metadata transfer but not for content transfer.
• Very difficult to represent the richness of metadata from a broad range of data providers.

11 The idea of the Publisher Connector
Provide seamless access over non-standard APIs
Why ResourceSync?
• Scalable implementation
• Metadata format agnostic
• Made for the Web

12 Integration with OpenMinTeD
Via CORE and the OMTD-SHARE schema.

13 How does it work?
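
As background for "Made for the Web": ResourceSync documents are Sitemap XML files with a few extension elements in the `rs:` namespace, so a consumer needs little more than an XML parser. The following is a minimal sketch of reading a resource list and selecting resources changed since a given time. The sample document, the example.org URLs and the function names are illustrative assumptions, not taken from the connector's deployment or code.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespaces used by Sitemap XML and the ResourceSync extension elements.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical resource list; the URLs and values are made up for illustration.
SAMPLE_RESOURCE_LIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-08-01T00:00:00Z"/>
  <url>
    <loc>http://example.org/content/article-1.pdf</loc>
    <lastmod>2017-07-30T10:00:00Z</lastmod>
    <rs:md type="application/pdf" length="12345"/>
  </url>
  <url>
    <loc>http://example.org/content/article-2.pdf</loc>
    <lastmod>2017-07-31T12:30:00Z</lastmod>
    <rs:md type="application/pdf" length="67890"/>
  </url>
</urlset>
"""


def _parse_time(value):
    """Parse a UTC timestamp of the form used in the sample above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def parse_resource_list(xml_text):
    """Return (capability, resources) for a ResourceSync resource list."""
    root = ET.fromstring(xml_text)
    list_md = root.find("rs:md", NS)
    capability = list_md.get("capability") if list_md is not None else None
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        length = md.get("length") if md is not None else None
        resources.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "type": md.get("type") if md is not None else None,
            "length": int(length) if length is not None else None,
        })
    return capability, resources


def changed_since(resources, since):
    """Keep only resources whose lastmod is at or after the cutoff."""
    cutoff = _parse_time(since)
    return [r for r in resources
            if r["lastmod"] and _parse_time(r["lastmod"]) >= cutoff]
```

A harvester that remembers the time of its last run could call `changed_since` on each new resource list to fetch only what is new, which is the kind of incremental synchronisation the standard is designed for.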
14 Architecture 1/3
• Microservices architecture with a message queue as a communication channel.
• Discover, Retrieve, Expose (DRE) workflow
Discovery → Retrieval → Expose

15 Architecture 2/3
• Ingestion services:
  • Harvester service: discovers new resources (publications) and schedules them for downloading (via the message queue).
  • Retriever service: retrieves scheduled publications from the queue and downloads them (both metadata and content), applying the appropriate data source download client for each publisher.
  • Data source download clients: publisher-specific methods for discovery and retrieval + a generic CrossRef API wrapper.
• Exposure service:
  • ResourceSync server service: exposes publications according to the ResourceSync standard.

16 Architecture 3/3
• Message queue module: interface to a message broker (RabbitMQ) that is populated with publication events scheduled for downloading.
• Database module: stores downloads and keeps them for incremental synchronisation.

17 Discovery
• Could be done via the CrossRef TDM API for some (typically smaller => scalability) publishers
• Filtering by date of publication and for a set of OA licences
• Sitemap crawling for Elsevier

18 Retrieval
• Each publisher employs different methods and rules to download and retrieve an article

19 Scalability analysis
Publisher | Discovery | Metadata + Content (single thread) | Identification (is OA?)
Elsevier  | 8m 1s    | 59m                                | instant
Springer  | 6m 52s   | 51m                                | n/a
Frontiers | 16m 40s  | 2h 46m                             | n/a
* On a sample of 10k documents, averaged over 2 trials
We can reprocess all Elsevier articles single-threaded in about 100h

20 Exposure
• Scalable implementation of a ResourceSync server: for each publisher, one ResourceSync “capability” is created for its metadata and one for its content (PDF).
The ResourceSync server is deployed at http://publisher-connector.core.ac.uk/resourcesync/

21 How many articles are provided as OA?
Publisher       | Metadata records | TDM-eligible license | OA articles
Springer Nature | 10,383,519       | 1,393,991            | 438,139
Elsevier        | 14,988,181       | ?                    | 1,005,768
Frontiers       | 68,790           | 68,790               | 68,790
Discovery of OA articles (May 2017)
OA Percentage: 7% - TDM eligible 9.7%

22 Volume sizes
As of August 2017
Total: 1,831,877
Elsevier  1,107,091
Springer  492,462
Frontiers 59,512
PLOS      172,812

23 The connector is for all
“The OpenAIRE infrastructure has recently started a collaboration with the CORE Team led by Petr Knoth in order to include in the OpenAIRE metadata and file aggregation chain the resources made available via the ResourceSync connector realised at the CORE Team Lab. To this aim, the OpenAIRE team is testing the CORE ResourceSync Connector code with the intention of integrating it in its production system before the end of 2017.”
Paolo Manghi, Technical Lead of OpenAIRE

24 Contributions
• Content:
  • We liberated over 1.8 million open access publications from publishers and made them available through a seamless layer.
  • As CORE integrates these papers, we now have over 8 million full-text papers in CORE.
• Technical:
  • First implementation and deployment of ResourceSync that scales to millions of items.
  • ResourceSync solves problems with aggregating content over OAI-PMH: faster & more efficient aggregation => fresher data in aggregators compared to OAI-PMH.
  • More work in this direction upcoming as part of COAR NGR.
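
The Discover, Retrieve, Expose workflow from the architecture slides can be sketched in a few lines. This is a minimal, single-process illustration under stated assumptions: an in-memory `queue.Queue` stands in for the RabbitMQ broker, a plain dict stands in for the database module, and `FakePublisherClient` stands in for a publisher-specific download client. All class and function names are hypothetical and are not taken from the connector's code.

```python
import queue
from dataclasses import dataclass


@dataclass
class Publication:
    doi: str
    publisher: str
    metadata: dict
    content: bytes


class FakePublisherClient:
    """Hypothetical stand-in for a publisher-specific download client."""

    def __init__(self, publisher, articles):
        self.publisher = publisher
        self._articles = articles  # maps doi -> (metadata, content)

    def discover(self):
        """Return identifiers of all publications the publisher offers."""
        return list(self._articles)

    def retrieve(self, doi):
        """Download metadata and content for one publication."""
        metadata, content = self._articles[doi]
        return Publication(doi, self.publisher, metadata, content)


def harvest(client, jobs, store):
    """Discovery: schedule publications that are not yet in the store."""
    scheduled = 0
    for doi in client.discover():
        if doi not in store:  # incremental: skip what is already downloaded
            jobs.put((client.publisher, doi))
            scheduled += 1
    return scheduled


def retrieve_all(clients, jobs, store):
    """Retrieval: drain the queue, using the matching client per publisher."""
    while not jobs.empty():
        publisher, doi = jobs.get()
        store[doi] = clients[publisher].retrieve(doi)
        jobs.task_done()


def expose(store, publisher):
    """Expose: list stored publications for one publisher, sorted by DOI."""
    return sorted(doi for doi, pub in store.items()
                  if pub.publisher == publisher)


# Usage with made-up identifiers.
client = FakePublisherClient("demo", {
    "10.0000/b": ({"title": "B"}, b"%PDF-b"),
    "10.0000/a": ({"title": "A"}, b"%PDF-a"),
})
jobs, store = queue.Queue(), {}
first_run = harvest(client, jobs, store)
retrieve_all({"demo": client}, jobs, store)
second_run = harvest(client, jobs, store)  # nothing new to schedule
exposed = expose(store, "demo")
```

Decoupling discovery from retrieval through a queue is what lets the retrieval step be scaled out with more workers, while the store check in `harvest` keeps repeated runs incremental.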