Classifying document types to enhance search and recommendations in digital libraries

Aristotelis Charalampous and Petr Knoth
Knowledge Media Institute, The Open University, United Kingdom

In this talk
• Why do we need systems for the automatic classification of research document types?
• Building a document type classification model
• Can we use the model to improve the performance of search and recommender systems?

Background
• Thousands of repositories contain research content
• The validity of their metadata is not guaranteed
• Poor metadata lowers the performance of services built on top of repository aggregations

Context – CORE (core.ac.uk)
Provides:
• 8 million OA full texts (~25 TB)
• 79 million metadata records
• 2,683 repositories
Services:
• Search
• Recommender system
• API
• Data dumps
• Research analytics

Context – recommender

The problem
• Research question 1: Can we automatically infer document types (thesis, slides, research paper) from content?
• Research question 2: If yes, can we use this information to improve the performance of search and recommender systems?

Analysing document types in repositories
• 2,461 repositories containing research
• 62% of documents in repositories do not have a document type
• Figure: most popular terms found in the dc:subjects field with >1% occurrence

Approach
• Asking repositories to improve their metadata: slow, unnecessarily complex, does not scale
• Instead, use features based on full-text statistics (a feature-extraction sketch follows the slides):
  • F1: Number of authors
  • F2: Total words
  • F3: Number of pages
  • F4: Average words per page
• Offline evaluation: train a model and evaluate it against baselines
• Online evaluation: simulate applying the classifier to CORE's search and recommender systems

Data sample
• Confidence level of 95%, confidence interval of 1%
• ~9.6k samples needed (worked calculation below)
• 55% research papers, 35% theses, 10% slides

Baselines
• Baseline 1: Random class assignment with class-proportional probability
• Baseline 2: A rule-based approach based on statistically drawn thresholds, using the upper 0.975 and lower 0.025 quantiles (sketched below)

Results – overall
• 10-fold cross-validation (evaluation sketch below)
• Stratified sampling
• 20% of the data held out for validation

Results – on individual classes

Predictive power of individual features

Evaluating engagement in search and recommender systems
• How do we measure engagement?
• Traditionally with Click-Through Rate (CTR), but class imbalance prevents using it directly. Consequently, we extend CTR by grouping impression types into sets.
• We define the Query-type Click-Through Rate (QTCTR) of a document type t as the fraction of queries that resulted in a click on a document of type t (a computation sketch follows the slides).
• To allow a comparison among different types, we define a Regularised QTCTR (RQTCTR).

Engagement results
• An order-of-magnitude preference for research papers and theses over slides
• Research papers are particularly desirable in top positions

Future work
• Boost research documents in our search and recommender engines and negatively boost slides to aid retrieval (a possible boosting query is sketched below)
• Evaluate the shift in user engagement as a direct effect of integrating the models
• Extend the classification model to identify more fine-grained document types
• Expose the document type classification models as a service
• Enhance the user engagement analysis by cross-validating our observations with metrics such as dwell time

Conclusions
• Scalable classification of document types, achieving a 0.96 F-measure using just 4 features
• Evidence that applying the model can help increase performance in information retrieval and recommender systems
• Manual curation of document type metadata in digital libraries is not particularly effective
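
The "~9.6k samples" figure on the Data sample slide is consistent with the standard sample-size formula for estimating a proportion. A worked calculation, assuming the conventional worst-case proportion p = 0.5 and z = 1.96 for 95% confidence (the slides do not state these inputs):

```latex
n = \frac{z^2 \, p(1-p)}{e^2}
  = \frac{1.96^2 \times 0.5 \times 0.5}{0.01^2}
  \approx 9604
```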
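
A minimal sketch of computing features F1–F4 from a document whose full text has already been extracted page by page. The function name and the assumption that the author count comes from metadata are mine, not the talk's:

```python
def full_text_features(pages: list[str], n_authors: int) -> dict:
    """Compute F1-F4 from a document's extracted page texts.

    Text extraction itself (e.g. from PDF) is out of scope here.
    """
    total_words = sum(len(page.split()) for page in pages)   # F2
    n_pages = len(pages)                                     # F3
    return {
        "n_authors": n_authors,                              # F1 (from metadata)
        "total_words": total_words,                          # F2
        "n_pages": n_pages,                                  # F3
        "avg_words_per_page": total_words / max(n_pages, 1), # F4
    }
```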
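
A sketch of how Baseline 2 could work. The slides only say that per-class thresholds are drawn from the 0.025 and 0.975 quantiles; the voting rule for combining features below is an assumption:

```python
import pandas as pd

def fit_thresholds(train: pd.DataFrame, features: list[str], label_col: str = "type"):
    """For each class and feature, keep the [0.025, 0.975] quantile range."""
    return {
        cls: {f: (grp[f].quantile(0.025), grp[f].quantile(0.975)) for f in features}
        for cls, grp in train.groupby(label_col)
    }

def rule_based_predict(row: pd.Series, bounds: dict, default: str = "research") -> str:
    """Assign the class whose quantile ranges cover the most features (assumed rule)."""
    scores = {
        cls: sum(lo <= row[f] <= hi for f, (lo, hi) in fb.items())
        for cls, fb in bounds.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```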
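
A sketch of the offline evaluation protocol: a stratified 20% hold-out plus stratified 10-fold cross-validation on the rest. The slides do not name the learning algorithm, so the random forest and the synthetic stand-in data are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

FEATURES = ["n_authors", "total_words", "n_pages", "avg_words_per_page"]  # F1-F4

# Synthetic stand-in for the labelled sample (hypothetical values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "n_authors": rng.integers(1, 6, 1000),
    "total_words": rng.integers(500, 80_000, 1000),
    "n_pages": rng.integers(2, 300, 1000),
    "type": rng.choice(["research", "thesis", "slides"], 1000, p=[0.55, 0.35, 0.10]),
})
df["avg_words_per_page"] = df["total_words"] / df["n_pages"]

X, y = df[FEATURES].to_numpy(), df["type"].to_numpy()

# 20% of the data left out for validation, stratified to preserve class proportions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Stratified 10-fold cross-validation on the training portion
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="f1_macro")
print(f"CV macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")

clf.fit(X_train, y_train)
print(f"Hold-out accuracy: {clf.score(X_val, y_val):.3f}")
```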
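
The QTCTR definition translates directly into a computation over a query log. A sketch with hypothetical column names (`query_id`, `doc_type`, `clicked`); the RQTCTR step is omitted because its exact formula is not reproduced in this text:

```python
import pandas as pd

def qtctr(log: pd.DataFrame) -> pd.Series:
    """QTCTR_t = |queries with >= 1 click on a document of type t| / |all queries|.

    `log` has one row per impression: query_id, doc_type, clicked (bool).
    """
    n_queries = log["query_id"].nunique()
    clicks = log[log["clicked"]]
    # Distinct queries that produced a click on each document type
    return clicks.groupby("doc_type")["query_id"].nunique() / n_queries

# Tiny usage example
log = pd.DataFrame({
    "query_id": [1, 1, 2, 2, 3],
    "doc_type": ["research", "slides", "thesis", "research", "slides"],
    "clicked":  [True, False, True, False, False],
})
print(qtctr(log))  # research: 1/3, thesis: 1/3
```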
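
One way the boosting planned under Future work could look. The talk does not specify CORE's search stack, so the Elasticsearch-style boosting query and the `document_type` field name are assumptions:

```python
# Promote predicted research papers and demote slides for a given user query.
boosted_query = {
    "query": {
        "boosting": {
            "positive": {
                "bool": {
                    "must": [{"match": {"fullText": "open access repositories"}}],
                    "should": [
                        {"term": {"document_type": {"value": "research", "boost": 2.0}}}
                    ],
                }
            },
            # Documents classified as slides keep matching but score lower
            "negative": {"term": {"document_type": "slides"}},
            "negative_boost": 0.3,
        }
    }
}
```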