Workshop for Doctoral Students
RESEARCH DATA MANAGEMENT AND OPEN DATA
6th – 7th October 2015
University of Manchester

HANDLING QUANTITATIVE DATA AND PREPARING FOR SHARING AND REUSE, INCLUDING DATA CLEANING
Irena Vipavc Brvar, Social Science Data Archives

Content
• Which things should I save and how
  • Data (part 1)
  • Documentation (part 2)
• What tools are there

SHARING MY RESEARCH
Data should be user-friendly, shareable and with long-lasting usability.
-> ensure they can be understood and interpreted by any user
This requires clear data description, annotation, contextual information and documentation.

What should be captured?
Any useful documentation such as:
• final report, published reports, user guide, working paper, publications, lab books
Information on dataset structure
• inventory of data files
• relationships between those files
• records, cases...
Variable-level documentation
• labels, codes, classifications
• missing values
• derivations and aggregations
Source: UK Data Service

Data-level documentation
Certain types of data file may contain important information which should be preserved:
• variable/value labels; document metadata; table relationships and queries in relational databases; GIS data layers/tables
Some examples:
• SPSS: variable attributes documented in Variable View (label, code, data type, missing values)
• MS Access: relationships between tables
• ArcGIS: shapefiles (layers) and tables in geodatabase; metadata created in ArcCatalog
• MS Excel: document properties, worksheet labels (where multiple)
Source: UK Data Service

Data-level documentation: variable names
All structured, tabular data should have cases/records and variables adequately documented with names, labels and descriptions.
Variable names might include:
• question number system related to questions in a survey/questionnaire e.g. Q1a, Q1b, Q2, Q3a
• numerical order system e.g.
V1, V2, V3
• meaningful abbreviations or combinations of abbreviations referring to meaning of the variable e.g. oz%=percentage ozone, GOR=Government Office Region, moocc=mother occupation, faocc=father occupation
• for interoperability across platforms, variable names should be at most 8 characters long and contain no spaces
Source: UK Data Service

Data-level documentation: variable labels
Similar principles for variable labels:
• be brief, max. 80 characters
• include unit of measurement where applicable
• reference the question number of a survey or questionnaire e.g. variable 'q11hexw' with label 'Q11: hours spent taking physical exercise in a typical week' - the label gives the unit of measurement and a reference to the question number (Q11)
• Codes of, and reasons for, missing data: avoid blanks, system-missing or '0' values e.g. '99=not recorded', '98=not provided (no answer)', '97=not applicable', '96=not known', '95=error'
• Coding or classification schemes used, with a bibliographic ref e.g. Standard Occupational Classification 2000 - a list of codes to classify respondents' jobs; ISO 3166 alpha-2 country codes - an international standard of 2-letter country codes
Source: UK Data Service

Data-level documentation: transcripts
Qualitative data/text documents:
• interview transcript speech demarcation (speaker tags)
• document header with brief details of interview date, place, interviewer name, interviewee details, context
Source: UK Data Service

7EU-VET - Study on vocational education in seven European countries
The 7EU-VET project (Detailed Methodological Approach to Understanding the VET Education) is a research study on vocational education and training. It builds on theoretical backgrounds and secondary analyses of the existing documentation, as well as on national and EU data, in order to conduct quantitative and qualitative studies and derive empirical results.
The project is built upon one of the goals of the Lisbon strategy, which is the promotion and quality of vocational education and training.

Manuals
• EUVET 12
• Coding of Master questionnaire
• EUVET 12 (Manual for cleaning and entering data)
  • general instructions
  • defining missing variables
  • issues with specific question
  • entering data
  • quality control
  • cleaning the data
  • checking for errors

European Social Survey – Data Protocol
29 Countries
http://www.europeansocialsurvey.org/docs/round6/survey/ESS6_data_protocol_e01_4.pdf

Colectica for Excel

Nesstar Publisher
Nesstar Publisher is a sophisticated authoring environment that can publish data from a variety of sources (including SPSS, SAS, Excel etc.). The tool includes a specialised metadata editor, data and metadata validation routines and metadata templates that provide standardisation and control.
• Easy editing/creation and export of DDI documented datasets with no XML experience needed.
• Tools to compute/recode/label new, or existing, variables to be added to a dataset before publishing.
• Tools to validate metadata and variables.
• The ability to import and export data to the most common statistical formats, including delimited files.
• The ability to include automatically generated frequency and summary statistics for each variable.
• Multilingual: Arabic, Chinese, English, French, Portuguese, Russian and Spanish and more.

You can find more in
• UKDA – Create & Manage Data http://www.data-archive.ac.uk/create-manage
• ICPSR – Guide to Social Science Data Preparation and Archiving http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/chapter5.html
• IHSN – Data archiving and dissemination http://www.ihsn.org/home/archiving
• MANTRA – Research Data Management Training http://datalib.edina.ac.uk/mantra/
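
Worked example: checking variable names, labels and missing-value codes

The variable-level rules from the UK Data Service slides (names of at most 8 characters without spaces, labels of at most 80 characters, explicit codes for missing data) can be checked before a dataset is deposited. The following is a minimal Python sketch of such a check, not a prescribed procedure. The function names, the sample answers and the idea of running these checks in Python are assumptions made for illustration; only the variable 'q11hexw', its label and the codes 95 to 99 come from the slides.

```python
from collections import Counter

MAX_NAME_LEN = 8    # slide rule: variable names max 8 characters, no spaces
MAX_LABEL_LEN = 80  # slide rule: labels should be brief, max. 80 characters

# Missing-value codes as listed on the slide.
MISSING_CODES = {
    99: "not recorded",
    98: "not provided (no answer)",
    97: "not applicable",
    96: "not known",
    95: "error",
}


def check_variable(name, label):
    """Return a list of rule violations for one variable name/label pair."""
    problems = []
    if len(name) > MAX_NAME_LEN:
        problems.append(f"name '{name}' is longer than {MAX_NAME_LEN} characters")
    if any(ch.isspace() for ch in name):
        problems.append(f"name '{name}' contains a space")
    if len(label) > MAX_LABEL_LEN:
        problems.append(f"label of '{name}' is longer than {MAX_LABEL_LEN} characters")
    return problems


def summarise_values(values):
    """Split a column into valid values, coded missing counts and blank positions.

    Blanks (None or empty string) are reported separately because the slides
    advise against leaving missing data as blanks.
    """
    valid, blanks = [], []
    missing = Counter()
    for position, value in enumerate(values):
        if value is None or value == "":
            blanks.append(position)
        elif value in MISSING_CODES:
            missing[MISSING_CODES[value]] += 1
        else:
            valid.append(value)
    return valid, dict(missing), blanks


# Hypothetical answers to Q11 (hours of exercise in a typical week).
answers = [3, 5, 99, 97, None, 99, 0]
valid, missing, blanks = summarise_values(answers)

label = "Q11: hours spent taking physical exercise in a typical week"
print(check_variable("q11hexw", label))         # no violations expected
print(check_variable("mother occupation", ""))  # too long and contains a space
print(valid, missing, blanks)
```

Keeping the missing-value codes in a single dictionary means the same mapping can be reused in the codebook, so the documentation and the cleaning step stay consistent.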