twitter.com/openminted_eu
Petr Knoth
Knowledge Media Institute, The Open University
United Kingdom
Machine accessibility of Open Access 
scientific publications from publisher 
systems via ResourceSync
Research literature contains some of the most important information we have assembled as a human species, such as cures for diseases and answers to many of the challenges the world is facing today.
Reading and systematically analysing this information is beyond human capacities.
Why machine accessibility of publications?
• Text and data mining (TDM) can only fulfil its potential if TDM tools can be applied:
  • to the widest possible set of publications
  • as soon as publications are made available
• Many publication providers => need for interoperability
Expertise Directory 1/2
• Contacted publishers to clarify the expected approaches
• Developed code to implement them:
  • In most cases the final approach was different from the suggestions we received.
• Tested the approach for scalability
• Documented the approach, justified why we followed it (including what did not work) and gave recommendations to publishers.
The expertise directory is available at: https://github.com/openminted/omtd-publisher-connector-harvester/blob/master/interoperability-layer/interoperability-layer.adoc  
Expertise Directory 2/2
Example of limitations as described in the Expertise Directory for Elsevier:
The idea of the Publisher Connector
Why not OAI-PMH?
• Slow and very inefficient for big repositories.
• Standardised for metadata transfer but not for content transfer.
• Very difficult to represent the richness of metadata from a broad range of data providers.
Provide seamless access over non-standard APIs
The idea of the Publisher Connector
Provide seamless access over non-standard APIs
Why ResourceSync?
• Scalable implementation
• Metadata format agnostic
• Made for the Web
Integration with OpenMinTeD
Via CORE and the OMTD-SHARE schema.
How does it work?
Architecture 1/3
• Microservices architecture with a message queue as a communication channel.
• Discover, Retrieve, Expose (DRE) workflow
Discovery Retrieval Expose
Architecture 2/3
• Ingestion services:
  • Harvester service: discovers new resources (publications) and schedules them for downloading (via the message queue).
  • Retriever service: retrieves scheduled publications from the queue and downloads them (both metadata and content), applying an appropriate data source download client for each publisher.
  • Data source download clients: publisher-specific methods for discovery and retrieval + a generic CrossRef API wrapper.
• Exposure service:
  • ResourceSync server service: exposes publications according to the ResourceSync standard.
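The split between the retriever service and the publisher-specific download clients can be illustrated with a minimal sketch. Everything named here (`Publication`, `DataSourceClient`, `InMemoryClient`, `retrieve_all`) is hypothetical and is not taken from the connector's code base; the in-memory client stands in for a real publisher API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Publication:
    """Hypothetical record handed from discovery to retrieval."""
    identifier: str   # e.g. a DOI
    publisher: str
    content_url: str


class DataSourceClient(ABC):
    """Hypothetical interface for publisher-specific discovery and retrieval."""

    @abstractmethod
    def discover(self) -> List[Publication]:
        """Return publications that are candidates for download."""

    @abstractmethod
    def retrieve(self, publication: Publication) -> bytes:
        """Return the full-text bytes of one publication."""


class InMemoryClient(DataSourceClient):
    """Stand-in for a real publisher client; serves documents from a dict."""

    def __init__(self, publisher: str, documents: Dict[str, bytes]) -> None:
        self.publisher = publisher
        self.documents = documents

    def discover(self) -> List[Publication]:
        return [
            Publication(doc_id, self.publisher, f"memory://{self.publisher}/{doc_id}")
            for doc_id in sorted(self.documents)
        ]

    def retrieve(self, publication: Publication) -> bytes:
        return self.documents[publication.identifier]


def retrieve_all(clients: Dict[str, DataSourceClient]) -> Dict[str, bytes]:
    """Discover with every client, then retrieve via the matching client."""
    downloaded: Dict[str, bytes] = {}
    for client in clients.values():
        for publication in client.discover():
            # The retriever picks the client registered for this publisher.
            owner = clients[publication.publisher]
            downloaded[publication.identifier] = owner.retrieve(publication)
    return downloaded
```

Keeping the publisher rules behind one interface means a new publisher only needs a new client class; the harvester and retriever logic stay unchanged.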
Architecture 3/3
• Message queue module: interface to a message broker (RabbitMQ) that is populated with publication events scheduled for downloading.
• Database module: stores and keeps track of downloads for incremental synchronisation.
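How the queue and the download record might cooperate for incremental synchronisation can be sketched as follows. This is an illustration under stated assumptions, not the connector's implementation: the standard-library `queue.Queue` stands in for RabbitMQ, an in-memory SQLite table stands in for the database module, and the function names are hypothetical.

```python
import queue
import sqlite3
from typing import Callable, Iterable, List


def open_store() -> sqlite3.Connection:
    """Create the (hypothetical) table that records finished downloads."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE downloads (identifier TEXT PRIMARY KEY, downloaded_at TEXT)"
    )
    return db


def schedule(db: sqlite3.Connection, jobs: "queue.Queue[str]",
             identifiers: Iterable[str]) -> int:
    """Harvester side: enqueue only identifiers with no recorded download."""
    scheduled = 0
    for identifier in identifiers:
        row = db.execute(
            "SELECT 1 FROM downloads WHERE identifier = ?", (identifier,)
        ).fetchone()
        if row is None:
            jobs.put(identifier)
            scheduled += 1
    return scheduled


def drain(db: sqlite3.Connection, jobs: "queue.Queue[str]",
          download: Callable[[str], None]) -> List[str]:
    """Retriever side: download each queued item and record it as done."""
    done: List[str] = []
    while True:
        try:
            identifier = jobs.get_nowait()
        except queue.Empty:
            break
        download(identifier)
        db.execute(
            "INSERT OR IGNORE INTO downloads VALUES (?, datetime('now'))",
            (identifier,),
        )
        done.append(identifier)
    db.commit()
    return done
```

On a second harvesting pass, only identifiers missing from the `downloads` table are scheduled again, which is what makes the synchronisation incremental.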
Discovery
• Could be done via the CrossRef TDM API for some publishers (typically smaller ones, for scalability reasons)
  • Filtering by date of publication and for a set of OA licences
• Sitemaps crawling for Elsevier
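Filtering by publication date and licence can be sketched as a query against the public Crossref REST API, assuming that is the API meant here. The sketch only builds the request URL and makes no network call; the member ID and start date in the usage note are placeholders, and `build_discovery_url` is a hypothetical helper.

```python
from typing import List
from urllib.parse import urlencode

CROSSREF_API_ROOT = "https://api.crossref.org"


def build_discovery_url(member_id: int, from_pub_date: str,
                        licence_urls: List[str], rows: int = 1000) -> str:
    """Build a Crossref works query for one publisher (member).

    Filters with different names are combined with AND, while repeated
    filters of the same name (here license.url) are combined with OR.
    """
    filters = [f"from-pub-date:{from_pub_date}"]
    filters += [f"license.url:{url}" for url in licence_urls]
    query = urlencode({
        "filter": ",".join(filters),
        "rows": rows,
        "cursor": "*",   # deep paging; later requests pass the returned cursor
    })
    return f"{CROSSREF_API_ROOT}/members/{member_id}/works?{query}"
```

For example, `build_discovery_url(1234, "2017-01-01", ["http://creativecommons.org/licenses/by/4.0/"])` yields a URL whose results can be paged through with the cursor value returned in each response.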
Retrieval
• Each publisher employs different methods and rules to download and retrieve an article
Scalability analysis
Publisher   Discovery   Metadata + Content (single thread)   Identification (is OA?)
Elsevier    8m 1s       59m                                  instant
Springer    6m 52s      51m                                  n/a
Frontiers   16m 40s     2h 46m                               n/a
* On a sample of 10k documents, averaged over 2 trials
We can reprocess all Elsevier articles single threaded in about 100h
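The 100h figure can be checked with a short back-of-the-envelope calculation. It assumes that retrieval time scales linearly with the number of documents and that "all Elsevier articles" refers to the 1,005,768 OA articles reported for Elsevier in May 2017; both are assumptions, and discovery time is left out.

```python
SAMPLE_SIZE = 10_000                 # documents in the timed sample
ELSEVIER_MINUTES_PER_SAMPLE = 59     # metadata + content, single thread
ELSEVIER_OA_ARTICLES = 1_005_768     # OA articles discovered (May 2017)


def estimated_hours(n_articles: int, minutes_per_sample: float,
                    sample_size: int = SAMPLE_SIZE) -> float:
    """Linearly extrapolate the sample timing to n_articles, in hours."""
    return n_articles / sample_size * minutes_per_sample / 60


hours = estimated_hours(ELSEVIER_OA_ARTICLES, ELSEVIER_MINUTES_PER_SAMPLE)
print(f"{hours:.1f} h")  # prints "98.9 h"
```

Under these assumptions the estimate is just under 99 hours, consistent with the quoted figure of about 100h.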
Exposure
• Scalable implementation of a ResourceSync server: for each publisher, one ResourceSync “capability” is created for its metadata and one for its content (PDF).
• The ResourceSync server is deployed at http://publisher-connector.core.ac.uk/resourcesync/
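A ResourceSync resource list is a Sitemap document extended with `rs:md` elements. The sketch below builds a small one for content files with the standard library. It illustrates the document format only and is not the deployed server's code; `build_resource_list` and the example URIs in the usage note are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"

# Serialise with the Sitemap namespace as default and "rs" as prefix.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)

# One resource: (URI, last-modified timestamp, payload bytes, MIME type).
Resource = Tuple[str, str, bytes, str]


def build_resource_list(resources: Iterable[Resource], generated_at: str) -> str:
    """Return a ResourceSync resource list as an XML string."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    # Document-level metadata declares which capability this document is.
    ET.SubElement(urlset, f"{{{RS_NS}}}md",
                  {"capability": "resourcelist", "at": generated_at})
    for uri, lastmod, payload, mime_type in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
        # Per-resource metadata lets a client verify what it downloaded.
        ET.SubElement(url, f"{{{RS_NS}}}md", {
            "hash": "md5:" + hashlib.md5(payload).hexdigest(),
            "length": str(len(payload)),
            "type": mime_type,
        })
    return ET.tostring(urlset, encoding="unicode")
```

Calling `build_resource_list([("http://example.org/content/1.pdf", "2017-08-01T00:00:00Z", b"%PDF-1.4", "application/pdf")], "2017-08-02T00:00:00Z")` returns a one-entry resource list; a metadata capability would follow the same pattern with a different MIME type.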
How many articles are provided as OA?
Discovery of OA articles (May 2017)
Publisher         Metadata records   TDM-eligible licence   OA articles
Springer Nature   10,383,519         1,393,991              438,139
Elsevier          14,988,181         ?                      1,005,768
Frontiers         68,790             68,790                 68,790
OA percentage: 7%; TDM-eligible: 9.7%
Volume sizes
As of August 2017
Elsevier    1,107,091
Springer    492,462
Frontiers   59,512
PLOS        172,812
Total       1,831,877
The connector is for all
“The OpenAIRE infrastructure has recently started a collaboration with the CORE Team led by Petr Knoth in order to include in the OpenAIRE metadata and file aggregation chain the resources made available via the ResourceSync connector realised at the CORE Team Lab. To this aim, the OpenAIRE team is testing the CORE ResourceSync Connector code with the intention of integrating it in its production system before the end of 2017.”
– Paolo Manghi, Technical Lead of OpenAIRE
Contributions
• Content:
  • We liberated over 1.8 million open access publications from publishers and made them available through a seamless layer.
  • As CORE integrates these papers, CORE now holds over 8 million full-text papers.
• Technical:
  • First implementation and deployment of ResourceSync that scales to millions of items.
  • ResourceSync solves problems with aggregating content over OAI-PMH: aggregation is faster and more efficient, so aggregators hold fresher data than with OAI-PMH.
  • More work in this direction is upcoming as part of COAR NGR.