Modern Research Data Management workshop

Sarah Jones
DCC, University of Glasgow
sarah.jones@glasgow.ac.uk
Twitter: @sjDCC

Martin Donnelly
DCC, University of Edinburgh
martin.donnelly@ed.ac.uk
Twitter: @mkdDCC

European Medical Students Association, Berlin, 14-15 September 2015
http://emsa-europe.eu/6315

Agenda

Time | Topic | Who
09:00-09:30 | Intro to RDM | Sarah
09:30-09:45 | Benefits and challenges of RDM | Martin
09:45-10:00 | Dealing with sensitive data | Martin
10:00-10:30 | Data sharing exercise | All
10:30-11:00 | Coffee | All
11:00-11:30 | Data Management Planning (including a demo of DMPonline) | Sarah
11:30-12:00 | Exercise: writing a DMP | All
12:00-12:15 | Other useful tools and resources | Martin
12:15-12:30 | Questions and discussion | All

SOME DEFINITIONS

What is Research Data Management?
Image under licence by DCC

What is the DCC?
A UK service to support the Higher Education sector with Research Data Management (RDM)
www.dcc.ac.uk
“Helping to build capacity, capability and skills in data management and curation across the UK’s higher education research community.” DCC Phase 3 Business Plan
Training | Events | Tools | Advocacy | Tailored Support | IJDC | International Conference

What is research data?
• Research data is defined as recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings; although the majority of such data is created in digital format, all research data is included irrespective of the format in which it is created.
• Research data refers to information, in particular facts or numbers, collected to be examined and considered as a basis for reasoning, discussion or calculation.
• In a research context, examples of data include statistics, results of experiments, measurements, observations resulting from fieldwork, survey results, interview recordings and images. The focus is on research data that is available in digital form.

So, what might this include?
Anything & everything produced in the course of research
www.aoml.noaa.gov/phod/dac/array_growth.html
www.aoml.noaa.gov/phod/graphics/dacdata/globpop.gif
www.sbirc.ed.ac.uk/documents/lbc_protocol.pdf

What is Research Data Management?
The active management of data throughout the lifecycle
Create | Document | Use | Store | Share | Preserve
• Data Management Planning
• Creating data
• Documenting data
• Accessing / using data
• Storage and backup
• Selecting what to keep
• Sharing data
• Data licensing and citation
• Preserving data
• …
CC-BY-NC-SA

Why is RDM an issue?
• Digital technology is now used very widely in research, and is enabling new research and scientific paradigms
• Research funders and publishers know that digital research data can be expensive to produce but inexpensive to share, making reuse more feasible and desirable
• The challenge is to ensure digital research findings can be reproduced and cited

Reasons to manage and share data

Direct benefits for you
• To make your research easier!
• Stop yourself drowning in irrelevant stuff
• Make sure you can understand and reuse your data again later
• Advance your career – data is growing in significance

Research integrity
• To avoid accusations of fraud or bad science
• Evidence findings and enable validation of research methods
• Meet codes of practice on research conduct
• Many research funders worldwide now require Data Management and Sharing Plans

Potential to share data
• So others can reuse and build on your data
• To gain credit – several studies have shown higher citation rates when data are shared
• For greater visibility, impact and new research collaborations
• Promote innovation and allow research in your field to advance faster

Why YOU need a Data Management Plan
http://blogs.ch.cam.ac.uk/pmr/2011/08/01/why-you-need-a-data-management-plan

What if this was your laptop?

HOW TO MANAGE DATA?
Questions to consider
Image CC-BY-NC-SA by Leo Reynolds www.flickr.com/photos/lwr/13442910354

What file formats will you use?
If you want your data to be re-used and sustainable in the long-term, you typically want to opt for open, non-proprietary formats.
• Do you have a choice or do the instruments you use only export in certain formats?
• What is common in your field? Try to use something that is accepted and widespread
• Does your data centre recommend formats? If so it’s best to use these.

Type | Recommended | Avoid for data sharing
Tabular data | CSV, TSV, SPSS portable | Excel
Text | Plain text, HTML, RTF; PDF/A only if layout matters | Word
Media | Container: MP4, Ogg; Codec: Theora, Dirac, FLAC | Quicktime, H264
Images | TIFF, JPEG2000, PNG | GIF, JPG
Structured data | XML, RDF | RDBMS
www.data-archive.ac.uk/create-manage/format/formats-table

How will you organise all your stuff?
• Adopt file naming conventions:
  – http://www.jiscdigitalmedia.ac.uk/guide/choosing-a-file-name
• Design a good project folder structure
  – http://research-data-toolkit.herts.ac.uk/document/research-project-file-plan
• Develop a method for describing new versions of your files.

Good practice in file naming
• Keep file and folder names short, but meaningful
• Agree a method for versioning
• Include dates in a set format e.g. YYYYMMDD
• Avoid using non-alphanumeric characters in file names
• Use hyphens or underscores not spaces e.g. day-sheet, day_sheet
• Order the elements in the most appropriate way to retrieve the record
www.jiscdigitalmedia.ac.uk/guide/choosing-a-file-name

Example from ARM Climate Research Facility
www.arm.gov/data/docs/plan

Can others understand your data?
Think about what is needed in order to find, evaluate, understand, and reuse the data.
• Have you documented what you did and how?
• Did you develop code to run analyses? If so, this should be kept and shared too.
• Is it clear what each bit of your dataset means? Make sure the units are labelled and abbreviations explained.
• Record metadata so others can find your work e.g. title, date, creator(s), subject, format, rights…

Where will you store the data?
• Your own device (laptop, flash drive, server etc.)
  – And if you lose it? Or it breaks?
• Departmental drives or university servers
• “Cloud” storage
  – Do they care as much about your data as you do?
The decision will be based on how sensitive your data are, how robust you need the storage to be, and who needs access to the data and when.

Who will do the backup?
• Use managed services where possible (e.g. University filestores rather than local or external hard drives), so backup is done automatically
• 3… 2… 1… backup!
  – at least 3 copies of a file
  – on at least 2 different media
  – with at least 1 offsite
• Ask your central IT team for advice

How to keep your data secure?
• Develop a practical solution that fits your circumstances
• Ideally store your data on secure, managed servers
• Restrict access to those who need to use / view the data
• Keep anti-virus software up-to-date
• Encrypt mobile devices carrying sensitive information
www.wsj.com/articles/SB10001424052748703843804575534122591921594

Which data need to be kept?
Five steps to follow
① Could this data be re-used?
② Must it be kept as evidence or for legal reasons?
③ Should it be kept for its potential value?
④ Consider costs – do benefits outweigh cost?
⑤ Evaluate criteria to decide what to keep
5 steps to decide what data to keep: www.dcc.ac.uk/resources/how-guides/five-steps-decide-what-data-keep

Can you publish / share your data?
• Who owns the data?
• Have you got consent for sharing?
• Do any licences you’ve signed permit sharing?
• Is the data in suitable formats?
• Is there enough documentation?

Where can you deposit?
http://databib.org
http://service.re3data.org/search
• Does your publisher or funder suggest a repository?
• Are there data centres or community databases for your discipline?
• Does your university offer support for long-term preservation?
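
The "3… 2… 1… backup!" rule from the backup slide above can be checked mechanically. The following is a minimal sketch in Python, assuming you keep a simple inventory of where each copy of a file lives; the `BackupCopy` class, its fields and the `meets_3_2_1` function are hypothetical names introduced for illustration and are not part of any DCC tool.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    """One copy of a file (hypothetical inventory record)."""
    location: str   # free-text label, e.g. "university filestore"
    medium: str     # storage medium, e.g. "network drive", "cloud"
    offsite: bool   # True if held away from the main place of work


def meets_3_2_1(copies):
    """Check: at least 3 copies, on at least 2 media, with at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("laptop", "internal disk", offsite=False),
    BackupCopy("university filestore", "network drive", offsite=False),
    BackupCopy("cloud storage", "cloud", offsite=True),
]

print(meets_3_2_1(copies))      # True
print(meets_3_2_1(copies[:2]))  # False: only two copies, none offsite
```

A check like this only confirms that copies are recorded; it does not verify that each backup can actually be restored, which is still worth testing with your IT team.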
Zenodo
• OpenAIRE-CERN joint effort
• Multidisciplinary repository
• Multiple data types
  – Publications
  – Long tail of research data
• Citable data (DOI)
• Links funding, publications, data & software
www.zenodo.org

Managing and sharing data: a best practice guide
http://data-archive.ac.uk/media/2894/managingsharing.pdf

Benefits and challenges of RDM in the digital age (MD)
1. Helicopter view
2. A few drivers
3. A few challenges

Helicopter view: What are the benefits of RDM?
• SPEED: Sharing data leads to a faster research process
• TRANSPARENCY: The data that underpins research can be made open for anyone to scrutinise, and attempt to replicate findings
• EFFICIENCY: Data collection can be funded once, and used many times for a variety of purposes
• RISK MANAGEMENT: A pro-active approach to data management reduces the risk of inappropriate disclosure of sensitive data, whether commercial or personal
• PRESERVATION: Lots of data is unique, and can only be captured once. If lost, it’s irreplaceable.

Driver 1: Technology
• Developments in sensor technology, networking and digital storage enable new research and scientific paradigms
• As costs also fall, possibilities for data sharing, citation and re-use become much more widespread
• Journals dedicated solely to publishing data have even started to appear. That’s not to say it’s an entirely new thing: journals have always published data, just never before at such scale…
Rosse from Philosophical Transactions of the Royal Society, (MDCCCLXI) (or 1861 if you’d prefer)

Driver 2: VfM via data re-use
Ships’ log books build picture of climate change
14 October 2010
You can now help scientists understand the climate of the past and unearth new historical information by revisiting the voyages of First World War Royal Navy warships. Visitors to OldWeather.org will be able to retrace the routes taken by any of 280 Royal Navy ships.
These include historic vessels such as HMS Caroline, the last survivor of the 1916 Battle of Jutland still afloat. By transcribing information about the weather and interesting events from images of each ship's logbook, web volunteers will help scientists build a more accurate picture of how our climate has changed over the last century.
http://www.nationalarchives.gov.uk/news/503.htm
Detail from Royal Navy Recruitment poster, RNVR Signals branch, 1917 (Catalogue reference: ADM 1/8331)
Endeavour, 1768-71 (Captain Cook)
HMS Beagle, 1830-34
HMS Torch, 1918

Driver 3: Government pressure/support
6.9 The Research Councils expect the researchers they fund to deposit published articles or conference proceedings in an open access repository at or around the time of publication. But this practice is unevenly enforced. Therefore, as an immediate step, we have asked the Research Councils to ensure the researchers they fund fulfil the current requirements. Additionally, the Research Councils have now agreed to invest £2 million in the development, by 2013, of a UK ‘Gateway to Research’. In the first instance this will allow ready access to Research Council funded research information and related data but it will be designed so that it can also include research funded by others in due course. The Research Councils will work with their partners and users to ensure information is presented in a readily reusable form, using common formats and open standards.
http://www.bis.gov.uk/assets/biscore/innovation/docs/i/11-1387-innovation-and-research-strategy-for-growth.pdf

Driver 4: Increasing ‘openness’ in public life
• Open Data is a philosophy, underpinned by pragmatism… transparency + utility.
• “Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.” – Wikipedia
• Governments, cities etc are all getting on board
• Open Knowledge Foundation is basically the political / activist wing: http://okfn.org/
• From the government / industry side, we have the Open Data Institute: http://theodi.org/

Meanwhile, in the USA…
May 9, 2013
United States Chief Technology Officer, Todd Park, and United States Chief Information Officer, Steven VanRoekel, discuss the importance of President Obama's executive order that takes groundbreaking new steps to make information generated and stored by the Federal Government more open and accessible to innovators and the public, to fuel entrepreneurship and economic growth while increasing government transparency and efficiency. The move will make troves of previously inaccessible or unmanageable data easily available to entrepreneurs, researchers, and others who can use those files to generate new products and services, build businesses, and create jobs.
http://www.youtube.com/watch?v=n603rEnEGXA

Why don’t we live in a data sharing utopia?
• Five main reasons…
  i. Lack of widespread understanding of the fundamental issues
  ii. Lack of joined-up thinking within institutions, countries, internationally…
  iii. Issues around ownership / privacy
  iv. Technical/financial limitations, and the need for selection and appraisal of data
  v. Issues around reward and recognition for researchers

Overview
1. Benefits and challenges of research data management in the digital age (15 mins)
2. Dealing with sensitive data: data protection, privacy, informed consent, commercial issues (15 mins)
3. Exercise: Data Sharing (30 mins)
4. Other useful tools and resources (15 mins)

Dealing with sensitive data
• Data can be sensitive for two main reasons: commercial and ethical
• Commercially sensitive data is information that may be used to derive economic capital. It may be closely guarded by its generators as a source of future income. Commercial R&D can be the difference between a product failing or succeeding.
• Purely commercial data generation may not be subject to data sharing obligations, although interestingly in the case of drug trials it has started to be covered by legislation
• Data generated as a result of public funding is increasingly expected to be shared at some stage, even if it has commercial potential. Note that it doesn’t have to be shared immediately, and that expectations and norms vary from country to country.
• Furthermore, data may be ethically sensitive if it relates to living human subjects, or could be used to do harm (e.g. some areas of weapons and power research, disease studies, etc)
• This section of the workshop looks at different types of ethically sensitive data, and suggests ways in which this can be shared or reused appropriately…

Sharing ethically sensitive data
• Data relating to living humans is subject in many countries to data protection laws, which guard the privacy of (e.g.) the subjects of research
• In the UK, such data must not be shared without the subject’s express (informed) consent, or without performing actions on the data in order to make it impossible to identify individual subjects
• Such actions include anonymising (or pseudonymising) data, aggregating it (removing a degree of detail), or restricting access to appropriate audiences
• In some cases, two or more anonymised datasets can be compared in order to ‘deanonymise’ and thus identify an individual. While it can never be ruled out entirely, this is highly unethical and a gross breach of trust

Interactive exercise
• Now that you know the reasons why data sharing is A Good Thing, we’re going to do some role-playing (don’t panic)
• We’re going to put you in our shoes, not a researcher’s shoes
• Sarah and I will articulate some of the most commonly heard objections to data sharing, and you’re going to explain why we are wrong :)
• After each section, we’ll look at some suggested ripostes from the experts in sensitive data at the UK Data Archive…

No. | REASONS NOT TO SHARE DATA | REPLIES OR ARGUMENTS IN FAVOUR OF SHARING
1 | My data is not of interest or use to anyone else. | It is! Researchers want to access data from all kinds of studies, methodologies and disciplines. It is very difficult to predict which data may be important for future research. Who would have thought that amateur gardener’s diaries would one day provide essential data for climate change research? Your data may also be essential for teaching purposes. Sharing is not just about archiving your data but about sharing them amongst colleagues.
2 | I want to publish my work before anyone else sees my data. | Data sharing will not stand in the way of you first using your data for your publications. Most research funders allow you some period of sole use, but also want timely sharing. Also remember that you have already been working with your data for some time so you undoubtedly know the data better than anyone coming to use them afresh. If you are still concerned you can embargo your data for a specific period of time.
Role playing exercise derived from the UKDA’s “Potential barriers to data sharing – with suggested solutions” (CC-BY-NC-SA)
The original is available from http://data-archive.ac.uk/create-manage/training-resources

DATA MANAGEMENT PLANNING (SJ)
Building a structure to work to
Image under licence by DCC

Data Management Plans
It’s useful to consider how you will manage and share your data in practice. Many research funders and universities now ask for these details in a DMP.
• What types of data will the project generate/collect?
• What standards will be used?
• How will this data be shared/made available?
• If not, why? e.g. ethics & IP issues, embargoes, confidentiality
• How will this data be curated and preserved?
www.dcc.ac.uk/resources/data-management-plans/checklist

Lots of health funders require a DMP
Their focus is often on data sharing
• Which data will be shared?
• When will it be shared?
• With whom?
• How will the data be shared?
• Will any restrictions or conditions govern use?
• …

Cancer Research UK
The following should be considered when developing a data sharing plan:
• The volume, type, content and format of the final dataset
• The standards that will be utilised for data collection and management
• The metadata, documentation or other supporting material that should accompany the data for it to be interpreted correctly
• The method used to share data
• The timescale for public release of data
• The long-term preservation plan for the dataset
• Whether a data sharing agreement will be required
• Any reasons why there may be restrictions on data sharing

Wellcome Trust
Applicants should consider the following seven questions:
i. What data outputs will your research generate and what data will have value to other researchers?
ii. When will you share the data?
iii. Where will you make the data available?
iv. How will other researchers be able to access the data?
v. Are any limits to data sharing required – for example, to either safeguard research participants or to gain appropriate intellectual property protection?
vi. How will you ensure that key datasets are preserved to ensure their long-term value?
vii. What resources will you require to deliver your plan?

Guidance on writing a DMP
• Explains what is asked for
• Gives example answers
• Suggests best practices
• Provides links to standards, tools and support
www.lshtm.ac.uk/research/researchdataman/plan/wellcometrust_dmp.pdf

What data will be generated?
Why is this important? A good description of the data to be collected will help reviewers understand the characteristics of the data, their relationship to existing data, and any disclosure risks that may apply.

When will you share the data?
Why is this important? Research funders are looking for timely data sharing with minimal or no restrictions if possible. Embargo periods / delays to sharing should be justified and in line with standard practice for the field.

How can others access the data?
Why is this important? If the data aren’t discoverable, accessible and intelligible, they won’t be reused. Data should be shared in a meaningful way.

Are any limits to sharing required?
Why is this important? As funders expect data to be shared, any restrictions need to be valid. Protection of human subjects is a fundamental tenet of research and an important ethical obligation for everyone.

State the long-term preservation plan
Why is this important? Digital data need to be actively managed over time to ensure that they will always be available and usable. Depositing data resources with a trusted digital archive can ensure that they are curated and handled according to good practices in digital preservation.
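
The seven Wellcome Trust questions listed earlier can double as a completeness checklist for a draft plan. The following is a minimal sketch in Python, assuming the draft is held as a dictionary of answers; the short keys (`outputs`, `timing` and so on), the `unanswered_questions` function and the sample answers are all hypothetical, and this is not how DMPonline or any funder system works internally.

```python
# Hypothetical short keys for the seven Wellcome Trust questions quoted above.
WELLCOME_QUESTIONS = {
    "outputs": "What data outputs will your research generate and "
               "what data will have value to other researchers?",
    "timing": "When will you share the data?",
    "location": "Where will you make the data available?",
    "access": "How will other researchers be able to access the data?",
    "limits": "Are any limits to data sharing required?",
    "preservation": "How will you ensure that key datasets are preserved "
                    "to ensure their long-term value?",
    "resources": "What resources will you require to deliver your plan?",
}


def unanswered_questions(draft):
    """Return the keys of questions with no answer or only whitespace."""
    return [key for key in WELLCOME_QUESTIONS
            if not draft.get(key, "").strip()]


# Illustrative draft with made-up answers.
draft_plan = {
    "outputs": "Anonymised survey responses in CSV, with a codebook.",
    "timing": "Within 12 months of the end of data collection.",
    "limits": "   ",  # blank answers count as missing
}

missing = unanswered_questions(draft_plan)
for key in missing:
    print(f"Still to answer: {WELLCOME_QUESTIONS[key]}")
```

A list like this only shows which sections are empty; whether an answer is adequate is still a judgement for the researcher and the funder's reviewers.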
More example plans
• Technical appendix submitted to AHRC by Bristol Uni
  http://data.bris.ac.uk/files/2013/02/data.bris-AHRC-Technical-Plan-v21.pdf
• Rural Economy & Land Use (RELU) programme examples
  http://relu.data-archive.ac.uk/data-sharing/planning/examples
• UCSD example DMPs (20+ scientific plans for NSF)
  http://rci.ucsd.edu/dmp/examples.html
• My DMP – a satire (what not to write!)
  http://ivory.idyll.org/blog/data-management.html
• Further examples: www.dcc.ac.uk/resources/data-management-plans/guidance-examples

DCC support on Data Management Plans
• Checklist on what to include
• How-to guide on developing a plan
• Guidance on assessing plans (forthcoming)
• Webinars and training materials
• DMPonline tool
• Example DMPs
www.dcc.ac.uk/resources/data-management-plans

What is DMPonline?
• A web-based tool to help researchers write Data Management and Sharing Plans
• Matches requirements with guidance tailored to user
• Free to use for anyone
• Developed by the DCC
https://dmponline.dcc.ac.uk

Main features in DMPonline
• Templates for different requirements (funder or institution)
• Tailored guidance (funder, institutional, discipline-specific etc)
• Ability to provide examples and suggested answers
• Supports multiple phases (e.g. pre- / during / post-project)
• Granular read / write / share permissions
• Comment feature for collaboration
• Customised exports to a variety of formats
• Single-sign-on facility (for UK unis)

How the tool works
• Click to write a generic DMP
• Or choose your funder to get their specific template
• Pick your uni to add local guidance and to get the uni template if there isn’t a funder one
• Choose any additional optional guidance

DMPonline demo

Exercise on writing a DMP
Pick one of the themes below and think about the approach you will take:
• Data creation (including standards and metadata)
• Data storage, backup and security
• Ethics and intellectual property
• Data sharing
• Data preservation
Draft some text and discuss your ideas with partners

Thanks – any questions?
• DCC resources on Research Data Management: www.dcc.ac.uk/resources
• DMP guidance, tools and example plans: www.dcc.ac.uk/resources/data-management-plans
Follow us on Twitter: @digitalcuration #ukdcc #DMPonline

SHERPA services

EUDAT
• EUDAT offers common data services through a geographically distributed, resilient network of 35 European organisations. These shared services and storage resources are distributed across 15 European nations, and data is stored alongside some of Europe’s most powerful supercomputers
• The EUDAT services address the full lifecycle of research data, covering both access and deposit, from informal data sharing to long-term archiving, and addressing identification, discoverability and computability of both long-tail and big data
• The vision is to enable European researchers and practitioners from any academic discipline to preserve, find, access, and process data in a trusted environment, as part of a Collaborative Data Infrastructure (CDI) conceived as a network of collaborating, cooperating centres, combining the richness of numerous community-specific data repositories with the permanence and persistence of some of Europe’s largest scientific data centres
• DCC is a partner in EUDAT, and we’re working to integrate our DMPonline tool with the EUDAT suite of services / infrastructure

Zenodo
• Zenodo is a free-to-use data archive, run by the lovely people at CERN
• It accepts any kind of data, from any academic discipline
• It is generally preferable to store data in a disciplinary data centre, but not all scholarly subjects are equally well served with data centres, so this may make for a useful fallback option
• See http://zenodo.org/ for more details

Figshare

Arkivum
http://arkivum.com/life-sciences/

The project
Facilitate Open Science Training for European Research
• FOSTER is developing MOOC-type content and dedicated training modules in different aspects of Open Science, including research data management
• The FOSTER Portal also provides access to existing content that can be reused and/or repackaged to support e-learning, blended learning, self-learning, etc
• There is also a Helpdesk which can direct your enquiries to appropriate experts

DCC training and guidance

UK Data Archive guidance

Thank you / Danke
• For more information about the FOSTER project:
  – Website: www.fosteropenscience.eu
  – Principal investigator: Eloy Rodrigues (eloy@sdum.uminho.pt)
  – General enquiries: Gwen Franck (gwen.franck@eifl.net)
  – Twitter: @fosterscience
• My contact details:
  – Email: martin.donnelly@ed.ac.uk
  – Twitter: @mkdDCC
  – Slideshare: http://www.slideshare.net/martindonnelly
This work is licensed under the Creative Commons Attribution 2.5 UK: Scotland License.

Our contact details…

Sarah Jones
DCC, University of Glasgow
sarah.jones@glasgow.ac.uk
Twitter: @sjDCC

Martin Donnelly
DCC, University of Edinburgh
martin.donnelly@ed.ac.uk
Twitter: @mkdDCC

European Medical Students Association, Berlin, 14-15 September 2015
http://emsa-europe.eu/6315