Tools for version control of research data
Raf Guns
raf.guns@uantwerpen.be

Version control, Revision control, Electronic lab notebooks, Subversion, Git, Mercurial, GitHub

http://www.phdcomics.com/comics/archive.php?comicid=1323

Research data: not static
– Data cleaning
– Correcting errors
– Multiple data sources
– Data reuse
– Conversion between file formats
– …

Version control
“the management of changes to documents, computer programs, large web sites, and other collections of information” (Wikipedia)

Why version control?
– Revert to previous versions
– Find out what is different between two versions
– Find out what has changed in a specific time period
– Manage multiple versions
– Work with multiple people on the same data
– Transparency and integrity
http://www.scfbm.org/content/8/1/7

☞ Version control is an integral part of research data management.

Research data management
http://www.ands.org.au/resource/data-management-planning.html

Tools for version control

Built-in
http://www.gcflearnfree.org/word2010/20.2

Dropbox

Electronic lab notebooks (ELNs)
– Replacement of paper lab notebook
– Integrate text, figures, data, calculations…

Version control software: Git
– Alternatives: Mercurial, SVN, CVS, Perforce, Bazaar, Arch…
– We pick Git because
  ● probably the most popular nowadays
  ● GitHub (www.github.com)
– Open source, free of charge
– Command-line, several GUIs available (gitk, GitHub for Mac, GitHub for Windows…)

Git history

Data set history
https://github.com/datasets/glwd

Data set branches
https://github.com/datasets/country-list

Data set ‘diff’
https://github.com/datasets/s-and-p-500-companies

Git for research data
Made for software development:
– many small files (e.g. GitHub has a file size limit of 100 MB)
– text-based formats (e.g. CSV, XML, JSON…)
Some research data sets are:
– much larger
– ‘binary’ formats (not readable without special software)
Solutions: git-annex, git-lfs…

Dat
http://dat-data.com
In development, alpha software!
“real time replication and versioning for data sets”
Two kinds of data:
– tabular: can be expressed as a table
– blobs: unstructured and/or large (e.g., images)
Automatically translates between different formats (e.g. JSON – XML)

Thank you!
http://www.datamation.com/news/tech-comics-version-control-1.html
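
Supplement to the “Data set ‘diff’” slide: a minimal sketch of why text-based formats such as CSV suit version control. It compares two versions of a small data file line by line using Python's standard difflib module, which produces output in the same unified format that Git shows. The function name diff_versions, the file name data.csv, and the sample rows are hypothetical and chosen only for illustration; this is not how Git itself is implemented.

```python
import difflib


def diff_versions(old_text, new_text, name="data.csv"):
    """Return a unified diff (list of lines) between two versions of a text file."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )


# Hypothetical data: two versions of a small country list.
version_1 = "Name,Code\nBelgium,BE\nNetherlands,NL\n"
version_2 = "Name,Code\nBelgium,BE\nGermany,DE\nNetherlands,NL\n"

changes = diff_versions(version_1, version_2)
print("".join(changes))
```

The added row appears in the diff prefixed with “+”, while unchanged rows are shown as context. A binary format would not yield a readable line-by-line comparison like this, which is one reason the slides point to git-annex and git-lfs for such data.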