VO Sandpit, November 2009
Open Research Data
Sarah Callaghan* 
sarah.callaghan@stfc.ac.uk
@sorcha_ni
Autumn training school Development and Promotion of Open Access to 
Scientific Information and Research
19 September, 2014, Veliko Tarnovo, Bulgaria
* and a lot of others, including, but not limited to: the NERC data citation and 
publication project team, the PREPARDE project team, the OpenAIREplus project 
and the CEDA team
Who are we and why do we 
care about data?
The UK’s Natural Environment Research Council (NERC) 
funds six data centres which between them have 
responsibility for the long-term management of NERC's 
environmental data holdings.
We deal with a variety of environmental measurements, 
along with the results of model simulations in:
• Atmospheric science
• Earth sciences
• Earth observation
• Marine science
• Polar science
• Terrestrial & freshwater science, hydrology and bioinformatics
The Scientific Method
http://www.mrsaverettsclassroom.com/bio2-scientific-method.php
This is often the only part of the process 
that anyone other than the originating 
scientist sees. We want to change this.
A key part of the scientific method is 
that it should be reproducible – other 
people doing the same experiments in 
the same way should get the same 
results.
Unfortunately observational data is not 
reproducible (unless you have a time 
machine!)
The way data is organised and archived 
is crucial to the reproducibility of 
science and our ability to test 
conclusions.
Journals have always published data…
Suber cells and mimosa leaves. Robert 
Hooke, Micrographia, 1665
The Scientific Papers of William Parsons, 
Third Earl of Rosse 1800-1867
…but datasets have become so big that it is no longer 
useful to publish them in hard copy
Why make data open?
http://www.evidencebased-management.com/blog/2011/11/04/new-evidence-on-big-bonuses/
• Pressure from (UK) government to make data from 
publicly funded research available for free. 
• Scientists want attribution and credit for their work
• Public want to know what the scientists are doing
• Good for the economy if new industries can be built 
on scientific data/research
• Research funders want reassurance that they’re getting 
value for money
• Relies on peer-review of science publications (well 
established) and data (starting to be done!)
• Allows the wider research community and industry to find 
and use datasets, and understand the quality of the data
Need reward structures and incentives for researchers to 
encourage them to make their data open – data citation 
and publication
Why bother linking the data to the publication? 
Surely the important stuff is in the journal paper?
If you can’t see/use the data, then you can’t test the conclusions or 
reproduce the results! It’s not science!
http://theupturnedmicroscope.com/comic/negative-data/
Most people have an idea of what a 
publication is
Some examples of data (just from 
the Earth Sciences)
1. Time series, some still being updated 
e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g. 
Climate, Oceanographic, Hydrological 
and Numerical Weather Prediction 
model data generated on a 
supercomputer
3. 2D scans e.g. satellite data, weather 
radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium, 
e.g. radiosonde launches, aircraft 
flights, ocean salinity and temperature 
6. Datasets consisting of data from 
multiple instruments as part of the 
same measurement campaign
7. Physical samples, e.g. fossils
What is a Dataset?
DataCite’s definition 
(http://www.datacite.org/sites/default/files/Business_Models_Principles_v1.0.pdf):
Dataset: "Recorded information, regardless of 
the form or medium on which it may be 
recorded including writings, films, sound 
recordings, pictorial reproductions, 
drawings, designs, or other graphic 
representations, procedural manuals, forms, 
diagrams, work flow, charts, equipment 
descriptions, data files, data processing or 
computer programs (software), statistical 
records, and other research data." 
(from the U.S. National Institutes of Health (NIH) 
Grants Policy Statement via DataCite's Best 
Practice Guide for Data Citation).
In my opinion a dataset is 
something that is:
• The result of a defined 
process
• Scientifically meaningful
• Well-defined (i.e. clear 
definition of what is in the 
dataset and what isn’t)
Should ALL data be open?
Most data produced through 
publicly funded research should 
be open.
But!
• Confidentiality issues (e.g. named 
persons’ health records)
• Conservation issues (e.g. maps of 
locations of rare animals at risk 
from poachers)
• Security issues (e.g. data and 
methodologies for building 
biological weapons)
There should be a very good 
reason for publicly funded 
data to not be open.
The research data lifecycle
1. Creating data
2. Processing data
3. Analysing data
4. Preserving data
5. Giving access to data
6. Reusing data
See http://data-archive.ac.uk/create-manage/life-cycle for more detail
Researchers are used to creating, 
processing and analysing data. 
Data repositories generally have 
the job of preserving and giving 
access to data.
Third parties, or even the original 
researchers, will reuse the data.
Creating a dataset is hard 
work!
"Piled Higher and Deeper" by Jorge Cham
www.phdcomics.com
Creating data: a radio propagation dataset
The problem: rain and cloud mess up your satellite radio signal. How can we fix this?
Italsat F1: Owned and operated by Italian Space Agency (ASI). Launched January 1991, ended operational life January 2001.
Inside the receive cabin – the 
instruments my data came from
The receive cabin at Sparsholt in 
Hampshire
One day’s worth of raw data from one of the 
receivers
My job was to take this...
Creating/processing data
...turn it into this....
...with the final result being this.
Analysing data
…a process which involved 4 
major steps, 4 different 
computer programs, and 
16 intermediate files for each 
day of measurements. 
Each month of preprocessed 
data represented somewhere 
between a couple of days and 
a week's worth of effort. 
It was a job where attention to 
detail was important, and you 
really had to know what you 
were looking at from a 
scientific perspective.
Part of the Italsat data archive – on CDs 
in a shelf in my office
Preserving data (the wrong way!)
What the processed data 
set looks like on disk
What the raw data files looked like.
(I do have some Word documents 
somewhere which describe what all 
this is…)
I could make these files open 
easily, but no one would have a 
clue how to use them!
Example documentation
Note the 
software 
filenames in the 
documentation.
I still have the 
IDL files on disk 
somewhere, but 
I’d be very 
surprised if 
they’re still 
compatible with 
the current 
version of IDL 
"Piled Higher and Deeper" by Jorge Cham
www.phdcomics.com
Documentation can sometimes 
produce mixed feelings
Publications – grey literature
Publications – journal paper
Where’s the data?
What it all came down to:
Composite image from Flickr user bnilsen and Matt Stempeck (NOI), shared 
under Creative Commons license
And I wasn’t even preserving my data properly! 
As for giving access to the data…
I did share, but there were a lot of non-disclosure agreements (I am not a lawyer!) 
And I didn’t feel like I got the credit for it. (The first publication based on the data wasn’t 
written by me, and I didn’t even get my name in the acknowledgements.)
Good news: the 
data is all open (and 
documented) on the 
BADC now
Another example: How is my 
scarf like a dataset?
• The raw material it’s made from doesn’t 
contain information
• But the act of knitting encodes information into 
the scarf 
• The scarf is the result of a well defined 
process (knitting) and has a particular method 
used to create it
• I need to be able to describe it 
• I need to be able to find it
• I need to store it properly  so it doesn't get lost, 
or corrupted (i.e. eaten by moths or shredded 
by mice)
• I might need to recreate it so I need to keep 
information about it 
• I put a lot of time and effort into making it, so 
I’m very attached to it!
http://www.flickr.com/photos/lovefibre/3251690074/
http://www.flickr.com/photos/maco_nix/5019885742/
http://www.flickr.com/photos/halfbisqued/8084145976/
http://www.flickr.com/photos/lucathegalga/2282305884/
http://www.flickr.com/photos/nazlicetiner/6448303541/
http://www.flickr.com/photos/ujkakevin/2303531028/
Just like not all 
scarves are the 
same, not all datasets 
are the same! 
How the dataset was created 
and used will determine how 
open it can be.
Metadata
It is generally agreed that we need methods to:
• define and document datasets of importance.
• augment and/or annotate data 
• amalgamate, reprocess and reuse data
To do this, we need metadata – data 
about data
http://www.kcoyle.net/meta_purpose.html
For example:
Longitude and latitude are metadata about the 
planet. 
• They are artificial 
• They allow us to communicate about places on 
a sphere 
• They were principally designed by those who 
needed to navigate the oceans, which are 
lacking in visible features! 
Metadata can often act as a 
surrogate for the real thing, in 
this case the planet.
Metadata for my scarf
• Descriptive: “teal blue”, “scarf” 
• Dimensions: 200cm long, 20cm wide
• Location: “Around my neck”/“Hanging on the door 
of my wardrobe”
• Identifier: KOI (knitted object identifier)
Information needed to recreate it:
• The raw material: King Cole Haze Glitter DK, 
colourway 124 - Ocean, with dyelot 67233
• Needle size: 4mm
• Algorithm used to create it: 18 stitch feather and 
fan stitch with 2 stitch garter stitch border at the 
edges
• Number of stitches  cast on: 54
• Tension (how tightly I knit in this pattern): 28 rows 
and 27 stitches for a 10cm by 10cm square
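The scarf metadata listed above can also be written down as a structured record. The following is only an illustrative sketch: the field names are hypothetical and do not follow any particular metadata standard. It also shows one thing good metadata allows, a consistency check: the cast-on count and the tension together imply a width that can be compared with the stated dimensions.

```python
# Illustrative sketch: the scarf metadata from this slide as a structured record.
# All field names are hypothetical, chosen only for this example.
scarf = {
    "identifier": "KOI (knitted object identifier)",
    "description": ["teal blue", "scarf"],
    "length_cm": 200,
    "width_cm": 20,
    "yarn": "King Cole Haze Glitter DK, colourway 124 - Ocean, dyelot 67233",
    "needle_size_mm": 4,
    "stitch_pattern": "18 stitch feather and fan, 2 stitch garter stitch border",
    "stitches_cast_on": 54,
    # Tension is measured over a 10cm by 10cm square.
    "tension_per_10cm": {"rows": 28, "stitches": 27},
}


def implied_width_cm(record):
    """Width implied by the cast-on count and the stitch tension."""
    return record["stitches_cast_on"] * 10 / record["tension_per_10cm"]["stitches"]


def implied_rows(record):
    """Number of rows implied by the stated length and the row tension."""
    return record["length_cm"] * record["tension_per_10cm"]["rows"] / 10


if __name__ == "__main__":
    print("Stated width (cm): ", scarf["width_cm"])
    print("Implied width (cm):", implied_width_cm(scarf))
    print("Implied rows:      ", implied_rows(scarf))
```

Someone with only this record, and not the scarf itself, could check that the description hangs together before trying to recreate it.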
I can’t make my scarf Open Access, but I can 
make the metadata about it open – enabling other 
users to create it for themselves.
Dataset views and suggested uses
How to publish data/make data open
• Stick it up on a webpage somewhere
  – Issues with stability, persistence, discoverability…
  – Maintenance of the website
• Put it in the cloud
  – Issues with stability, persistence, discoverability…
• Attach it to a journal paper and store it as supplementary materials
  – Journals not too keen on archiving lots of supplementary data, especially if it’s large volume.
• Put it in a disciplinary/institutional repository
• Write a data article about it and publish it in a data journal
By David Fletcher 
http://www.cloudtweaks.com/2011/05/the-lighter-side-of-the-cloud-data-transfer/
Open/Closed/Published/unpublished
[Figure: CD, webpage, OA journal, subscription journal and data repository plotted against two axes, Openness and Quality]
We want to encourage researchers to 
make their data:
• Open 
• Persistent
• Quality assured:
• through scientific peer review
• or repository-managed processes
Unless there’s a very good reason not 
to!
Publishing = making something public 
after some formal process which adds 
value for the consumer (e.g. peer review) 
and provides commitment to persistence
What do data centres do?
Data Curation Lifecycle Model
http://www.dcc.ac.uk/resources/curation-lifecycle-model
The Digital Curation Centre’s 
Curation Lifecycle Model 
provides a graphical, high-level 
overview of the stages required 
for successful curation and 
preservation of data from initial 
conceptualisation or receipt 
through the iterative curation 
cycle. 
Why should I bother putting 
my data into a repository?
"Piled Higher and Deeper" by Jorge Cham
www.phdcomics.com
It’s ok, I’ll just do regular backups
These documents have been preserved for thousands of years!
But they’ve both been translated many times, with different meanings each time.
Data Preservation is not enough; we need Active Curation to preserve 
Information
Phaistos Disk, 1700BC
Open is not enough!
“When required to make the data available by 
my program manager, my collaborators, and 
ultimately by law, I will grudgingly do so by 
placing the raw data on an FTP site, named 
with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. 
I will under no circumstances 
make any attempt to provide analysis source 
code, documentation for formats, or any 
metadata with the raw data. When requested 
(and ONLY when requested), I will provide an 
Excel spreadsheet linking the names to data 
sets with published results. This spreadsheet 
will likely be wrong -- but since no one will be 
able to analyze the data, that won't matter.” 
- http://ivory.idyll.org/blog/data-management.html
https://flic.kr/p/awnCQu
Example Big Data: CMIP5 
CMIP5: Fifth Coupled Model 
Intercomparison Project
• Global community activity under the 
World Meteorological Organisation 
(WMO) via the World Climate Research 
Programme (WCRP)
• Aim: 
– to address outstanding scientific 
questions that arose as part of 
the 4th Assessment Report 
process, 
– improve understanding of 
climate, and 
– to provide estimates of future 
climate change that will be useful 
to those considering its possible 
consequences. 
Many distinct experiments, with very 
different characteristics, which influence 
the configuration of the models (what 
they can do, and how they should be 
interpreted).
CMIP5 numbers!
Simulations:
~90,000 years
~60 experiments
~20 modelling centres (from around the world) using
~30 major(*) model configurations
~2 million output “atomic” datasets 
~10s of petabytes of output
~2 petabytes of CMIP5 requested output
~1 petabyte of CMIP5 “replicated” output
Which are replicated at a number of sites (including ours)
Major international collaboration!
Funded by EU FP7 projects (IS-ENES, Metafor) and US (ESG) and other national 
sources (e.g. NERC for the UK)
Summary of the CMIP5 example
The Climate problem needs:
– Major physical e-infrastructure (networks, supercomputers)
– Comprehensive information architectures covering the whole information life 
cycle, including annotation (particularly of quality)
… and hard work populating these information objects, particularly with 
provenance detail.
– Sophisticated tools to produce and consume the data and information objects
– State of the art access control techniques
Major distributed systems are social challenges as much as technical challenges.
CMIP5 is Big Data, with lots of different participants and lots of different 
technologies. 
It also has a community willing to work together to standardise and automate data 
and metadata production and curation, and with the willingness to support the 
effort needed for openness.
Big Data:
• Industrialised and standardised data 
and metadata production
• Large groups of people involved
• Methods for making the data open, 
attribution and credit for data creation 
established
Long Tail Data:
• Bespoke data and metadata creation 
methods
• Small groups/lone researchers
• No generally accepted methods for 
attribution and credit for data creation. 
Often data is closed due to lack of 
effort to open it
https://flic.kr/p/g1EHPR
Summary and maybe conclusions?
• Data is important, and becoming 
more so for a wider range of the 
population
• Conclusions and knowledge are 
only as good as the data they’re 
based on
• Science is supposed to be 
reproducible and verifiable 
• It’s up to us as scientists to care for 
the data we’ve got and ensure that 
the story of what we did to the data 
is transparent
• So we and others can use the 
data again
• And so people will trust our 
results
Thanks!
Any questions?
sarah.callaghan@stfc.ac.uk 
@sorcha_ni
http://citingbytes.blogspot.co.uk/
Presentation funded by the European 
Commission as part of the project 
OpenAIREplus (FP7-INFRA-2011-2, 
Grant Agreement no. 283595)
Image credit: Borepatch http://borepatch.blogspot.com/2010/06/its-not-what-you-dont-know-that-hurts.html
“Publishing research without data is simply 
advertising, not science” - Graham Steel
http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/