VO Sandpit, November 2009

Open Research Data

Sarah Callaghan*
sarah.callaghan@stfc.ac.uk
@sorcha_ni

Autumn training school
Development and Promotion of Open Access to Scientific Information and Research
19 September, 2014, Veliko Tarnovo, Bulgaria

* and a lot of others, including, but not limited to: the NERC data citation and publication project team, the PREPARDE project team, the OpenAIREplus project and the CEDA team

Who are we and why do we care about data?

The UK’s Natural Environment Research Council (NERC) funds six data centres which between them have responsibility for the long-term management of NERC's environmental data holdings.

We deal with a variety of environmental measurements, along with the results of model simulations in:
• Atmospheric science
• Earth sciences
• Earth observation
• Marine Science
• Polar Science
• Terrestrial & freshwater science, Hydrology and Bioinformatics

The Scientific Method

http://www.mrsaverettsclassroom.com/bio2-scientific-method.php

This is often the only part of the process that anyone other than the originating scientist sees. We want to change this.

A key part of the scientific method is that it should be reproducible – other people doing the same experiments in the same way should get the same results.

Unfortunately observational data is not reproducible (unless you have a time machine!)

The way data is organised and archived is crucial to the reproducibility of science and our ability to test conclusions.

Journals have always published data…

Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665
The Scientific Papers of William Parsons, Third Earl of Rosse 1800-1867

…but datasets have gotten so big, it’s not useful to publish them in hard copy anymore

Why make data open?
http://www.evidencebased-management.com/blog/2011/11/04/new-evidence-on-big-bonuses/

• Pressure from (UK) government to make data from publicly funded research available for free.
• Scientists want attribution and credit for their work
• Public want to know what the scientists are doing
• Good for the economy if new industries can be built on scientific data/research
• Research funders want reassurance that they’re getting value for money
• Relies on peer-review of science publications (well established) and data (starting to be done!)
• Allows the wider research community and industry to find and use datasets, and understand the quality of the data

Need reward structures and incentives for researchers to encourage them to make their data open – data citation and publication

Why bother linking the data to the publication?

Surely the important stuff is in the journal paper?

If you can’t see/use the data, then you can’t test the conclusions or reproduce the results! It’s not science!

http://theupturnedmicroscope.com/comic/negative-data/

Most people have an idea of what a publication is

Some examples of data (just from the Earth Sciences)

1. Time series, some still being updated e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer
3. 2D scans e.g. satellite data, weather radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
6. Datasets consisting of data from multiple instruments as part of the same measurement campaign
7. Physical samples, e.g. fossils

What is a Dataset?
DataCite’s definition (http://www.datacite.org/sites/default/files/Business_Models_Principles_v1.0.pdf):

Dataset: "Recorded information, regardless of the form or medium on which it may be recorded including writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow, charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data."
(from the U.S. National Institutes of Health (NIH) Grants Policy Statement via DataCite's Best Practice Guide for Data Citation).

In my opinion a dataset is something that is:
• The result of a defined process
• Scientifically meaningful
• Well-defined (i.e. clear definition of what is in the dataset and what isn’t)

Should ALL data be open?

Most data produced through publicly funded research should be open. But!
• Confidentiality issues (e.g. named persons’ health records)
• Conservation issues (e.g. maps of locations of rare animals at risk from poachers)
• Security issues (e.g. data and methodologies for building biological weapons)

There should be a very good reason for publicly funded data to not be open.

The research data lifecycle

Creating data
Processing data
Analysing data
Preserving data
Giving access to data
Reusing data

See http://data-archive.ac.uk/create-manage/life-cycle for more detail

Researchers are used to creating, processing and analysing data. Data repositories generally have the job of preserving and giving access to data. Third parties, or even the original researchers, will reuse the data.

Creating a dataset is hard work!

"Piled Higher and Deeper" by Jorge Cham www.phdcomics.com

Italsat F1: Owned and operated by Italian Space Agency (ASI). Launched January 1991, ended operational life January 2001.
The problem: rain and cloud mess up your satellite radio signal. How can we fix this?

Creating data: a radio propagation dataset

Inside the receive cabin – the instruments my data came from
The receive cabin at Sparsholt in Hampshire

Creating/processing data

One day’s worth of raw data from one of the receivers
My job was to take this...
...turn it into this...

Analysing data

...with the final result being this.

…a process which involved 4 major steps, 4 different computer programmes, and 16 intermediate files for each day of measurements.

Each month of preprocessed data represented somewhere between a couple of days and a week's worth of effort.

It was a job where attention to detail was important, and you really had to know what you were looking at from a scientific perspective.

Preserving data (the wrong way!)

Part of the Italsat data archive – on CDs in a shelf in my office

What the processed data set looks like on disk
What the raw data files looked like. (I do have some Word documents somewhere which describe what all this is…)

I could make these files open easily, but no one would have a clue how to use them!

Example documentation

Note the software filenames in the documentation. I still have the IDL files on disk somewhere, but I’d be very surprised if they’re still compatible with the current version of IDL

Documentation can sometimes produce mixed feelings

"Piled Higher and Deeper" by Jorge Cham www.phdcomics.com

Publications – grey literature

Publications – journal paper

Where’s the data?

What it all came down to:

Composite image from Flickr user bnilsen and Matt Stempeck (NOI), shared under Creative Commons license

And I wasn’t even preserving my data properly!
As for giving access to the data…

I did share, but there were a lot of non-disclosure agreements (I am not a lawyer!)

And I didn’t feel like I got the credit for it. (The first publication based on the data wasn’t written by me, and I didn’t even get my name in the acknowledgements.)

Good news: the data is all open (and documented) on the BADC now

Another example: How is my scarf like a dataset?

• The raw material it’s made from doesn’t contain information
• But the act of knitting encodes information into the scarf
• The scarf is the result of a well defined process (knitting) and has a particular method used to create it
• I need to be able to describe it
• I need to be able to find it
• I need to store it properly so it doesn't get lost, or corrupted (i.e. eaten by moths or shredded by mice)
• I might need to recreate it so I need to keep information about it
• I put a lot of time and effort into making it, so I’m very attached to it!

http://www.flickr.com/photos/lovefibre/3251690074/
http://www.flickr.com/photos/maco_nix/5019885742/
http://www.flickr.com/photos/halfbisqued/8084145976/
http://www.flickr.com/photos/lucathegalga/2282305884/
http://www.flickr.com/photos/nazlicetiner/6448303541/
http://www.flickr.com/photos/ujkakevin/2303531028/

Just like not all scarves are the same, not all datasets are the same! How the dataset was created and used will determine how open it can be.

Metadata

It is generally agreed that we need methods to:
• define and document datasets of importance.
• augment and/or annotate data
• amalgamate, reprocess and reuse data

To do this, we need metadata – data about data

http://www.kcoyle.net/meta_purpose.html

For example: Longitude and latitude are metadata about the planet.
• They are artificial
• They allow us to communicate about places on a sphere
• They were principally designed by those who needed to navigate the oceans, which are lacking in visible features!

Metadata can often act as a surrogate for the real thing, in this case the planet.

Metadata for my scarf

• Descriptive: “teal blue”, “scarf”
• Dimensions: 200cm long, 20cm wide
• Location: “Around my neck”/“Hanging on the door of my wardrobe”
• Identifier: KOI (knitted object identifier)

Information needed to recreate it:
• The raw material: King Cole Haze Glitter DK, colourway 124 - Ocean, with dyelot 67233
• Needle size: 4mm
• Algorithm used to create it: 18 stitch feather and fan stitch with 2 stitch garter stitch border at the edges
• Number of stitches cast on: 54
• Tension (how tightly I knit in this pattern): 28 rows and 27 stitches for a 10cm by 10cm square

I can’t make my scarf Open Access, but I can make the metadata about it open – enabling other users to create it for themselves.

Dataset views and suggested uses

How to publish data/make data open

• Stick it up on a webpage somewhere
  – Issues with stability, persistence, discoverability…
  – Maintenance of the website
• Put it in the cloud
  – Issues with stability, persistence, discoverability…
• Attach it to a journal paper and store it as supplementary materials
  – Journals not too keen on archiving lots of supplementary data, especially if it’s large volume.
• Put it in a disciplinary/institutional repository
• Write a data article about it and publish it in a data journal

By David Fletcher http://www.cloudtweaks.com/2011/05/the-lighter-side-of-the-cloud-data-transfer/

Open/Closed/Published/unpublished

[Figure: CD, webpage, OA journal, subscription journal and data repository plotted on axes of openness and quality]

We want to encourage researchers to make their data:
• Open
• Persistent
• Quality assured:
  – through scientific peer review
  – or repository-managed processes

Unless there’s a very good reason not to!

Publishing = making something public after some formal process which adds value for the consumer (e.g. peer review) and provides a commitment to persistence

What do data centres do?

Data Curation Lifecycle Model
http://www.dcc.ac.uk/resources/curation-lifecycle-model

The Digital Curation Centre’s Curation Lifecycle Model provides a graphical, high-level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt through the iterative curation cycle.

Why should I bother putting my data into a repository?

"Piled Higher and Deeper" by Jorge Cham www.phdcomics.com

It’s ok, I’ll just do regular backups

These documents have been preserved for thousands of years! But they’ve both been translated many times, with different meanings each time.

Data Preservation is not enough, we need Active Curation to preserve Information

Phaistos Disk, 1700BC

Open is not enough!

“When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data.
When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.”
- http://ivory.idyll.org/blog/data-management.html

https://flic.kr/p/awnCQu

Example Big Data: CMIP5

CMIP5: Fifth Coupled Model Intercomparison Project

• Global community activity under the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP)
• Aim:
  – to address outstanding scientific questions that arose as part of the 4th Assessment Report process,
  – improve understanding of climate, and
  – to provide estimates of future climate change that will be useful to those considering its possible consequences.

Many distinct experiments, with very different characteristics, which influence the configuration of the models (what they can do, and how they should be interpreted).

CMIP5 numbers!

Simulations:
~90,000 years
~60 experiments
~20 modelling centres (from around the world) using
~30 major(*) model configurations
~2 million output “atomic” datasets
~10s of petabytes of output
~2 petabytes of CMIP5 requested output
~1 petabyte of CMIP5 “replicated” output, which is replicated at a number of sites (including ours)

Major international collaboration!
Funded by EU FP7 projects (IS-ENES, Metafor) and US (ESG) and other national sources (e.g. NERC for the UK)

Summary of the CMIP5 example

The Climate problem needs:
– Major physical e-infrastructure (networks, supercomputers)
– Comprehensive information architectures covering the whole information life cycle, including annotation (particularly of quality) … and hard work populating these information objects, particularly with provenance detail.
– Sophisticated tools to produce and consume the data and information objects
– State of the art access control techniques

Major distributed systems are social challenges as much as technical challenges.

CMIP5 is Big Data, with lots of different participants and lots of different technologies. It also has a community willing to work together to standardise and automate data and metadata production and curation, and with the willingness to support the effort needed for openness.

Big Data:
• Industrialised and standardised data and metadata production
• Large groups of people involved
• Established methods for making the data open, and for attribution and credit for data creation

Long Tail Data:
• Bespoke data and metadata creation methods
• Small groups/lone researchers
• No generally accepted methods for attribution and credit for data creation. Often data is closed due to lack of effort to open it

https://flic.kr/p/g1EHPR

Summary and maybe conclusions?

• Data is important, and becoming more so for a wider range of the population
• Conclusions and knowledge are only as good as the data they’re based on
• Science is supposed to be reproducible and verifiable
• It’s up to us as scientists to care for the data we’ve got and ensure that the story of what we did to the data is transparent
  – So we and others can use the data again
  – And so people will trust our results

Thanks! Any questions?

sarah.callaghan@stfc.ac.uk
@sorcha_ni
http://citingbytes.blogspot.co.uk/

Presentation funded by the European Commission as part of the project OpenAIREplus (FP7-INFRA-2011-2, Grant Agreement no. 283595)

Image credit: Borepatch http://borepatch.blogspot.com/2010/06/its-not-what-you-dont-know-that-hurts.html

“Publishing research without data is simply advertising, not science” - Graham Steel
http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/