Facilitate Open Science Training for European Research
Open access and research data management: Horizon 2020 and beyond
University College Cork, April 14th & 15th 2015

Successfully Using Research Data Management Principles (I hope!)
Dr Jonathan Tedds
jat26@le.ac.uk
@jtedds
Senior Research Fellow, Health And Research Data Informatics (HARDi) Group
Dept of Health Sciences, University of Leicester
PI BRISSKit http://www.brisskit.le.ac.uk

http://www.astrogrid.org (April 2008 1st public release)

Science as an Open Enterprise Report

Why open?
• As a first step towards this intelligent openness, data that underpin a journal article should be made concurrently available in an accessible database
• We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online and for the two to be interoperable. [p.7]
• Royal Society June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
• Issues linking data to the scientific record:
  • Data persistence
  • Data and metadata quality
  • Attribution and credit for data producers
• Geoffrey Boulton (Edinburgh), Lead author:
  • “Science has been sleepwalking into crisis of replicability...and of the credibility of science”
  • “Publishing articles without making the data available is scientific malpractice”

Data Reuse: asking new questions
Hubble Space Telescope
• Papers based upon reuse of archived observations now exceed those based on the use described in the original proposal.
• http://archive.stsci.edu/hst/bibliography/pubstat.html
• See also work by Piwowar & Vision re life sciences: “Data reuse and the open data citation advantage”
• http://peerj.com/preprints/1/

2012-02-07 DCC roadshow East Midlands - CC-BY-SA

Research and the long tail
[Figure labels: PDB, GenBank, UniProt, Pfam, CATH, SCOP (Protein Structure Classification), ChemSpider; High throughput experimental methods; Industrial scale; Commons based production; Public data sets; Cherry picked results; Preserved; Spreadsheets, Notebooks; Local, Lost]
Slide: Carole Goble

Our attempts at Leicester
• Some examples…

1. HALOGEN (History, Archaeology, Linguistics, Onomastics, GENetics): 2009+
Throwing light on the past through cross-disciplinary databasing
http://halogen.le.ac.uk
• Portable Antiquities Scheme (British Museum)
• Place-names (Nottingham)
• Surnames
• Genetics
• IT hosting and GIS
• Best practice: #JISCMRD, UKRDS, DCC, international

Halogen as template for research data management #jiscmrd
• Requirements Analysis – must be iterative!
• Data Management Plan – use DMPonline (UK Digital Curation Centre)
• Scalable research data management infrastructure
  • pilot phase to nationally available resource
  • LAMP stack IT infrastructure: host research database
• A model for the long term delivery of a data management service within the institution including
  • support, maintenance, governance & charging policies
• Include researchers, IT services, research support office, library services etc.
BENEFITS
• New research opportunities
  • Cross database work – seed new research samples
• Scholarly communication/access to national resources
  • Key to English Place Names (Nottingham)
  • Portable Antiquities Scheme (British Museum)
• Verification, re-purposing, re-use of data
  • Cleaning & enhancing private research datasets for reuse & correlation
  • No re-creation of data
  • Increased transparency
  • Excellent training for best practice in research data management
• Increasing research productivity
  • Build in cleaning, annotation, enhancement into normal research workflows
  • Research datasets may immediately be reusable and interoperable
• Impact & Knowledge Transfer
• Reuse IT infrastructure
• Increasing skills base of researchers/students/staff

CHALLENGES
• Interdisciplinary research database
  • ingest each input dataset in form such that sufficient information is carried forward to enable interoperation
  • Cultural differences!
  • versioning & provenance for input datasets
• Which software tools, infrastructure, query interface?
  • suitable for multidisciplinary researchers
• Requirements upon the institution for sustaining the research assets & skills
• Requirements upon the researchers
  • Annotating
  • Refreshing
  • Maintenance of datasets

Reward = Leverhulme Trust funding £1.3m!
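
The "versioning & provenance for input datasets" challenge listed above can be made concrete with a small sketch. This is not the HALOGEN implementation; it is a minimal illustration, assuming each input dataset arrives as a file whose bytes can be checksummed. The names `ProvenanceRecord` and `ingest`, and the sample dataset and supplier labels, are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance entry for one ingested input dataset."""
    dataset: str      # short name of the input dataset
    source: str       # who supplied it
    version: int      # incremented whenever the content changes
    sha256: str       # checksum of the exact bytes ingested
    ingested_at: str  # UTC timestamp, ISO 8601


def ingest(content: bytes, dataset: str, source: str, history: list) -> ProvenanceRecord:
    """Record a new version only if the content differs from the latest one."""
    digest = hashlib.sha256(content).hexdigest()
    previous = [r for r in history if r.dataset == dataset]
    if previous and previous[-1].sha256 == digest:
        return previous[-1]  # unchanged content: keep the existing version
    record = ProvenanceRecord(
        dataset=dataset,
        source=source,
        version=len(previous) + 1,
        sha256=digest,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )
    history.append(record)
    return record


if __name__ == "__main__":
    history = []
    # Made-up sample rows, for illustration only.
    ingest(b"name,place\nExample,Anytown\n", "sample-names", "example supplier", history)
    ingest(b"name,place\nExample,Anytown\nOther,Elsewhere\n", "sample-names", "example supplier", history)
    print(json.dumps([asdict(r) for r in history], indent=2))
```

Keeping the checksum alongside the version number means a later query result can be traced back to the exact input it was built from, which is one way to carry "sufficient information forward" at ingest time.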
Helped enable… http://www.brisskit.le.ac.uk/node/35

Research costing – only part of the answer
[Chart: Researcher Responses to Contacts Made. No Response 63%; Response Received 37%]

Suggested timeline for implementing institutional research data management
From Whyte & Tedds (2011), DCC Briefing http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm

Challenges for researchers in institutions
• Rise to scientific and research challenge
  • Not just a management challenge
  • Responsibility for the knowledge you create
  • Personal benefits as well as funder compliance
• Library
  • curation of data and publications
  • active support from data scientists
  • from centralised to dispersed support
  • Expert centres essential!
• IT Service
  • Provide research data platforms for researchers:
    • Active storage
    • Enable collaboration
    • Connect to preservation services through Library

Enabling Research Data Management
• Active Data Management Planning
  • built in at proposal stage
  • Local institutional tweaks of funder and local templates
  • Implemented and evolved in project
• Data Management Plan as a live, evolving object
• Annotate data on the fly – lab notebook approach
• Curated & preserved using permanent identifiers
• Appropriate repository and data collection descriptors

It’s a long road….

What do researchers need to make this all possible?
• Incentives - citations, promotion, support - long way to go
• Institutional and funder policy framework - mostly there now?
• Appropriate discipline specific community centres of expertise - rare, mostly limited to big science niches or very broad but poorly sustained
• Institutional support services for the basics - pilots to date
• Software tools that are open and can be adapted - on the way

But that’s not all…
• What about the software underpinning data driven research?
• If we’re going to publish as open data:
  • How do we help researchers to store, annotate and discover the datasets they create?
  • How do you sustain and reuse that?

2.
Biomedical Research Software as a Service
A vision for cloud-based open source research applications
#BRISSKit http://www.brisskit.le.ac.uk

BRISSKit context: The I4Health goal of applying knowledge engineering to close the ‘ICT gap’ between research and healthcare (Beck, T. et al 2012)
Email: brisskit@le.ac.uk

BRISSKit CiviCRM: patient cohort management
• Manages studies: enables end-to-end contact management for volunteers and research participants
• Tracks approaches, contact, responses, recruitment, exclusions
• Object model that reflects community building and non-profit relationships

BRISSKit OpenSpecimen: sample management
• Holds data on primary, derived and aliquot specimens, including linear and 2D barcodes
• Storage inventory, order tracking, e.g. 30,000+ NIHR UHL Cardiovascular Biomedical Research Unit samples stored and recorded

BRISSKit REDCap: survey management
• Web-based, secure questionnaire data entry by research or nursing staff
• E.g. used for all patient recruits in NIHR UHL Respiratory Biomedical Research Unit – mobile computing on wards and outpatient clinic

BRISSKit i2b2: data warehousing & querying
• Data from multiple data sources combined into multiple ontologies for flexible and sophisticated searching, cohort discovery and research

The semantic bridge?
• OBiBa Onyx: records participant consent, questionnaire data and primary specimen IDs
• i2b2: cohort selection and data querying
• Bio-ontology!

BRISSKit USPs
• Integrated support for core research processes
• Well-established mature open source applications as prototyped in e.g.
Cardiovascular, Respiratory, Cancer: fully UK customised
• A cloud based platform for seamless management and integration between applications
• A BRISSKit API allows integration with existing clinical systems
• Easy set up, use and administration through browser (including on mobile devices)
• Can be hosted by any compliant cloud provider including UHL (NHS information governance)
• Direct secure links through Jisc via Janet network under consideration

BRISSKit Funding & Partners
• New HEFCE/Jisc investment planned for 2014 – 2016
  • Jisc endorsed service
  • Co-design with reorganised Jisc
  • Key Janet Framework partners Farr, Crick, Infinity
• University of Leicester Cancer Biobank
  • Tissue sample management built on caTissue, OpenSpecimen
• NIHR Respiratory Biomedical Research Unit solutions: University Hospitals Leicester NHS Trust
  • linked to UoL Health Sciences Exceed Study
  • Links to Loughborough-Leicester Lifestyle BRU

BRISSKit highlighted collaborations
• University of Bristol
  • ALSPAC Birth Cohort Studies
  • DataShield: simultaneous remote, secure access to multiple large international cohorts
• SAIL-Farr secure NHS data hosting (Swansea)
• University Hospitals Leicester NHS Trust
  • Case Study Module Development
• UoL Health Sciences Longitudinal Studies
• NIHR BRUs: Cardiovascular, Respiratory, Lifestyle (Loughborough-Leicester)
• Leicester Diabetes Centre
• UoL Data to Knowledge for Practice strategic theme
• UoL Genomics, UHL NHS Trust – IBM IT Partnership

Research Software Sustainability
• Open Source community engagement
• standards compliance
• consortium approach
• work with the grain of researchers
• discipline specific forks?
• GitHub versioning an example for research data?
• Open Source Community Engagement Charter
  • defining engagement with existing & new OS communities
  • including adoption & code commitments

See Rob Baxter blog: “The research software engineer”
• http://dirkgorissen.com/2012/09/13/the-research-software-engineer/

http://openhealthdata.metajnl.com/

Latest:
3. Aerosol Science for Public Health and Public Policy through Commercial Avenues
Dr. Josh Vande Hey
NERC Knowledge Exchange Fellow
Air Quality Research Group / HARDi, University of Leicester
Email: jvh7@le.ac.uk
Website: https://www2.le.ac.uk/departments/physics/research/earth-observation-science/joshua-vande-hey

Why Aerosols?
DEFRA: The particulate pollution burden in the UK was estimated to be equivalent to 29,000 deaths and 340,000 life years lost in 2008 alone.
China is investing >£90 billion to reduce PM2.5 levels in Beijing by 25% by 2017.
WHO: AAP caused 3.7 million deaths in 2012
Hutton, 2011
[Figure: Projected increase in annual stagnant days (Horton et al, 2014)]
Climate change and global development making air quality situation worse
£9-19 billion estimated annual economic cost of air pollution in the UK (HoCEAC).

Objective
This fellowship will deliver NERC aerosol science and technology to the health sector and the public through targeted market development of integrated data.
1: Linking knowledge base to need
[Diagram labels]
• KNOWLEDGE BASE: AEROSOL SOURCES, CHEMISTRY, TRANSPORT AND MEASUREMENT
• STAKEHOLDER NEED: MANAGING HEALTH IMPACT AROUND PERSONAL EXPOSURE
• 1: IDENTIFY KEY GAPS THAT CAN BE ADDRESSED THROUGH KNOWLEDGE EXCHANGE
• Air Quality Expert Group
*Circled partners linked to letters of support

2: Scoping potential of data fusion
[Diagram labels]
• KNOWLEDGE BASE: AEROSOL SOURCES, CHEMISTRY, TRANSPORT AND MEASUREMENT
• STAKEHOLDER NEED: MANAGING HEALTH IMPACT AROUND PERSONAL EXPOSURE
• 2: MARKET AND SYSTEMS FEASIBILITY: INTEGRATION OF NEW TECHNOLOGY AND DATA STREAMS

WHAT KIND OF SYSTEM COULD LINK STATE-OF-THE-ART PARTICULATES DATA WITH STATE-OF-THE-ART HEALTH DATA?
BRISSKit, developed in Leicester, integrates tissue sample data, clinical trial data, and medical records data in an anonymised open database.

Some thoughts
• Can’t do it all in house but of course need specialist work!
• But many disciplines don’t have data centres
• Build coalition of institutional actors
  • Essential to have high level support
• Take and shape
  • Identify what you do have in-house
  • Access external tools, standards where possible
  • Active storage, collaboration, eprints…
• Propose best of breed for (inter)national reuse
  • Share benefits (and costs) over academic networks
• Sustainability the key challenge
  • As much cultural as technical – needs networks…
• Make use of DCC expertise and resources!

It is worth it & funders are paying attention!
• increase the trustworthiness and value of individual data sets
• strengthen the findings based on cited data sets
• increase the transparency and traceability of data and publications
• enable reuse and repurposing

Thank you for listening and thanks to FOSTER, UCC and project teams, partners

Dr Jonathan Tedds
jat26@le.ac.uk
@jtedds
Senior Research Fellow, Health And Research Data Informatics (HARDi)
Dept of Health Sciences, University of Leicester
#BRISSKit http://www.brisskit.le.ac.uk

Open access and research data management: Horizon 2020 and beyond
This event was funded by FOSTER through the European Union’s Seventh Framework Programme http://www.fosteropenscience.eu
and organised by
• University College Cork http://www.ucc.ie
• Teagasc http://www.teagasc.ie
• Repository Network Ireland http://rni.wikispaces.com