Tentative steps in mining UK theses
OR 2016, Dublin, June 2016
www.bl.uk

Is there valuable content in theses?
“Anything worthwhile in a thesis would have been published separately anyway.”
-- bioscience researcher

UK PhD theses
• Cutting edge research
• Not published elsewhere
• Traditionally book, now usually electronic
• PDF – but new forms emerging
• 20,000 / year
• 300 pages each
• 6m pages of unique research every year

EThOS – e-theses online service

UK thesis collection & EThOS
http://ethos.bl.uk

Theses by Date
[Chart. Periods: Pre-20th Century, 1900-1949, 1950-1979, 1980-1999, 2000-2016. Labelled shares: 1%, 12%, 33%, 54%.]

Theses by Subject
[Chart. Axis scale: 0 to 70,000.]

TDM examples

Alzheimer’s Society report
http://www.rand.org/randeurope/research/projects/mapping-uk-dementia-research-landscape.html

TDM case study - Alzheimer’s Society & RAND Europe
Mapping the UK’s Dementia Research Landscape
- Workforce pipeline
- Tracked PhD to senior research
- 1/5 dementia PhD graduates remain in dementia research
- 70% leave dementia research within 4 years of completing PhD
- Used EThOS metadata to analyse trends
http://britishlibrary.typepad.co.uk/science/2015/09/a-novel-use-of-phd-data.html

Dementia search terms
• Alzheimer’s
• Dementia
• Cognitive impairment
• Mixed dementia
• Early onset dementia
• Vascular dementia
• Lewy bodies (Dementia with Lewy bodies)
• Frontotemporal dementia
• Posterior Cortical Atrophy
• Familial dementia
• Creutzfeldt Jakob
• Korsakoff’s syndrome
• Cognitive impairment
• Supranuclear palsy
• Binswanger’s
• Multiple sclerosis
• Motor neurone disease
• Parkinson’s
• Huntington’s

FLAX Interactive Language Learning
• http://flax.nzdl.org/greenstone3/flax?a=fp&sa=library
• Article - http://www.journals.elsevier.com/learning-culture-and-social-interaction/

TDM case study – FLAX interactive language learning
• Model writing at research level; domain-specific texts; co-located phrases
• Auto extraction & re-use for language learning
• Used EThOS metadata abstracts
• University of Waikato & Queen Mary, London

Metadata or full text theses?

              Metadata                 Full texts
Content       400,000 records          130,000 theses
Format        Data                     - Digitised from print
                                       - E-born
File format   XML or Excel             PDF, .wav, .mov …
Access        - Harvest via OAI-PMH    - Download from EThOS or other repository
              - Supplied data          - Supplied with permissions
Rights        In the public domain     Rights holders

TDM case study – National Compound Collection
• Are there useful molecules in PhD theses?
• Extract the compounds; re-draw in ChemDraw; input into ChemSpider
• Bristol Uni & Royal Society of Chemistry
• Manual pilot – could process be automated?
• Used theses “likely to reveal new compounds”
• 47k compounds discovered (50% new)

Data collection
Example compound: N-(3,5-Dinitrophenyl)-2-[(5-methyl-3,4-diphenyl-1H-pyrrol-2-yl)carbonyl]hydrazinecarboxamide
Louise Sarah Evans, University of Southampton, 2006
[Diagram labels: Data Collectors; Theses; Molecular Structures; Open Access Database; > 45,000 compounds]

EThOS – http://ethos.bl.uk
• Metadata for all UK doctoral (PhD) theses
• 430,000 records
• Top quality, accurate, consistent, unduplicated metadata
• Unique research, often not published elsewhere, cutting edge
• Data includes:
  – Author, title, year, university name
  – Abstracts (for 40%)
  – Supervisor names, funder/sponsor body
  – A few DOI and ORCiD identifiers
  – Subject discipline

Summary - EThOS data available
• Excel or XML via OAI-PMH harvest: http://simba.cs.uct.ac.za/~ethos/cgi-bin/OAI-XMLFile-2.21/XMLFile/ethos/oai.pl
• Data.bl.uk (coming soon)

Thank you
Sara.Gould@bl.uk
ethos.bl.uk
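
Appendix note: harvesting and filtering metadata (illustrative sketch)

The slides mention harvesting EThOS metadata via OAI-PMH and selecting theses with a list of dementia search terms. The sketch below shows roughly what that could look like in Python using only the standard library. It is not the method used in the RAND Europe study. It assumes the endpoint serves the standard `oai_dc` (Dublin Core) format and that abstracts appear in `dc:description`. The function names (`parse_list_records`, `harvest`, `matches_terms`) and the shortened term list are hypothetical, chosen for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A shortened, lower-cased subset of the dementia search terms on the slide.
DEMENTIA_TERMS = ["alzheimer", "dementia", "cognitive impairment", "lewy bodies"]


def _first(record, tag):
    """Text of the first Dublin Core element named `tag`, or '' if absent."""
    el = record.find(f".//dc:{tag}", NS)
    return (el.text or "").strip() if el is not None else ""


def parse_list_records(xml_data):
    """Return (records, resumption_token) from one ListRecords response.

    `xml_data` may be str or bytes. Records without a <metadata> element
    (for example, deleted records) are skipped.
    """
    root = ET.fromstring(xml_data)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        if rec.find("oai:metadata", NS) is None:
            continue
        records.append({
            "title": _first(rec, "title"),
            "creator": _first(rec, "creator"),
            "date": _first(rec, "date"),
            "description": _first(rec, "description"),  # assumed: abstract
        })
    token_el = root.find(".//oai:resumptionToken", NS)
    token = (token_el.text or "").strip() if token_el is not None else ""
    return records, (token or None)


def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield all records from an OAI-PMH endpoint, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            records, token = parse_list_records(resp.read())
        yield from records
        if not token:
            break
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}


def matches_terms(record, terms=DEMENTIA_TERMS):
    """True if any term occurs (case-insensitively) in the title or abstract."""
    text = (record["title"] + " " + record["description"]).lower()
    return any(term in text for term in terms)
```

A possible use would be `hits = [r for r in harvest(base_url) if matches_terms(r)]`, where `base_url` is the OAI-PMH harvest address given on the summary slide. Plain substring matching is a deliberate simplification. A real study would likely need the full term list and some manual review, since broad terms such as "cognitive impairment" can match theses outside dementia research.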