Tools for version control of research data
Raf Guns

[Figures; labels: Version control, Revision control, Electronic lab notebooks; Subversion, Git, Mercurial, Github]

Research data: not static
– Data cleaning
– Correcting errors
– Multiple data sources
– Data reuse
– Conversion between file formats
– …

Version control
“the management of changes to documents, computer programs, large web sites, and other collections of information” (Wikipedia)

Why version control?
– Revert to previous versions
– Find out what is different between two versions
– Find out what has changed in a specific time period
– Manage multiple versions
– Work with multiple people on the same data
– Transparency and integrity
☞ Version control is an integral part of research data management.

Research data management

Tools for version control

Built-in

Dropbox

Electronic lab notebooks (ELNs)
– Replacement of paper lab notebook
– Integrate text, figures, data, calculations…

Version control software: Git
– Alternatives: Mercurial, SVN, CVS, Perforce, Bazaar, Arch…
– We pick Git because
  ● probably most popular nowadays
  ● Github
– Open source, free of charge
– Command-line, several GUIs available (gitk, Github for Mac, Github for Windows…)

Git history

Data set history

Data set branches

Data set ‘diff’

Git for research data
Made for software development:
– many small files (e.g. Github has limitation of 100MB)
– text-based formats (e.g. CSV, XML, JSON…)
Some research data sets are:
– much larger
– ‘binary’ formats (non-readable without special software)
Solutions: git-annex, git-lfs…

Dat
In development, alpha software!
“real time replication and versioning for data sets”
Two kinds of data:
– tabular: can be expressed in table
– blobs: unstructured and/or large (e.g., images)
Automatically translates between different formats (e.g. JSON – XML)

Thank you!
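
Supplementary sketch: data set ‘diff’

The slides note that version control lets you find out what is different between two versions, and that this works best for text-based formats such as CSV. The following is a minimal sketch of that idea using Python's standard difflib module, not Git itself. The two CSV versions, the file names data_v1.csv and data_v2.csv, and the helper dataset_diff are all hypothetical and serve only as illustration.

```python
import difflib

# Two hypothetical versions of a small CSV data set (illustrative values only).
version_1 = """id,species,length_mm
1,A. alba,12.5
2,A. alba,13.1
3,B. nigra,9.8
"""

version_2 = """id,species,length_mm
1,A. alba,12.5
2,A. alba,11.3
3,B. nigra,9.8
4,B. nigra,10.2
"""


def dataset_diff(old, new, old_name="data_v1.csv", new_name="data_v2.csv"):
    """Return the line-level differences between two text versions
    of a data set, in unified diff format."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


diff_lines = dataset_diff(version_1, version_2)
# Lines starting with "-" exist only in the old version,
# lines starting with "+" exist only in the new version.
print("\n".join(diff_lines))
```

A line-based comparison like this is only meaningful for text formats. For the ‘binary’ formats mentioned above, it cannot show which values changed, which is one reason tools such as git-annex and git-lfs handle large binary files differently.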