1/15 A New Semantic Similarity Based Measure for Assessing Research Contribution Petr Knoth & Drahomira Herrmannova Knowledge Media institute, The Open University 2/15 Current impact metrics • Pros: simplicity, availability for evaluation purposes • Cons: insufficient evidence of quality and research contribution 3/15 Problems of current impact metrics • Sentiment, semantics, context and motives [Nicolaisen, 2007] • Popularity and size of research communities [Brumback, 2009; Seglen, 1997] • Time delay [Priem and Hemminger, 2010] • Skewness of the distribution [Seglen, 1992] • Differences between types of research papers [Seglen, 1997] • Ability to game/manipulate citations [Arnold and Fowler, 2010; Editors, 2006] 4/15 Alternative metrics • Alt-/Webo-metrics etc. – Impact still dependent on the number of interactions in a scholarly communication network • Full-text (Semantometrics) – Contribution to the discipline dependent on the content of the manuscript. 5/15 Approach Premise: Full-text needed to assess publication’s research contribution. Hypothesis: Added value of publication p can be estimated based on the semantic distance from the publications cited by p to publications citing p. 6/15 Contribution measure p A Bdist(a,b) dist(b1,b2) Average distance of the set members Contribution p( ) = B A× 1 |B |×| A | × dist(a,b)a∈A,b∈B,a≠b∑ X = 1 | A | =1∨ |B | =1 1 | X | | X | −1( ) × dist x1, x2( )x1∈X,x2 ∈X,x1≠x2∑ | A | >1∧ |B | >1    dist(a,b) =1− sim(a,b) 7/15 Datasets • Requirements – Availability of full-text – Density – Multidisciplinarity 8/15 Datasets (present as table) • Examined datasets – CORE – Open Citation Corpus – ACM Dataset – DBLP+Citation – KDD Cup Dataset – iSearch Collection • However... • TABLE 9/15 Our dataset • 10 seed publications from CORE with varying level of citations • missing citing and cited publications downloaded manually • only freely accessible English documents were downloaded • in total 716 documents (~50% of the complete network) • 2 days to gather the data 10/15 Results Publication no. |B| (Citation score) |A| (No. of references) Contribution 1 5 (9) 6 (8) 0.4160 2 7 (11) 52 (93) 0.3576 3 12 (20) 15 (31) 0.4874 4 14 (27) 27 (72) 0.4026 5 16 (30) 12 (21) 0.5117 6 25 (41) 8 (13) 0.4123 7 39 (71) 70 (128) 0.4309 8 53 (131) 3 (10) 0.5197 9 131 (258) 22 (32) 0.5058 10 172 (360) 17 (20) 0.5004 474 (958) 232 (428) 11/15 Results 12/15 Current impact metrics vs Semantometrics Unaffected by, CROSS (red), TICK (green) • Sentiment, semantics, context and motives • Popularity and size of research communities • Time delay [Reduced to 1 citation] • Skewness of the distribution • Differences between types of research papers • Ability to game/manipulate citations [solved providing that self-citations not allowed] TABLE 4 4 4 4 4 4 13/15 Conclusions • Full-text necessary • Semantometrics are a new class of methods. • We showed one method to assess the research contribution 14/15 References • Jeppe Nicolaisen. 2007. Citation Analysis. Annual Review of Information Science and Technology, 41(1):609-641. • Douglas N Arnold and Kristine K Fowler. 2010. Nefarious numbers. Notices of the American Mathematical Society, 58(3):434-437. • Roger A Brumback. 2009. Impact factor wars: Episode V -- The Empire Strikes Back. Journal of child neurology, 24(3):260-2, March. • The PLoS Medicine Editors. 2006. The impact factor game. PLoS medicine, 3(6), June. 15/15 References • Jason Priem and Bradely M. Hemminger. 2010. Scientometrics 2.0: Toward new metrics of scholarly impact on the social Web. First Monday, 15(7), July. • Per Ottar Seglen. 1992. The Skewness of Science. Journal of the American Society for Information Science, 43(9):628-638, October. • Per Ottar Seglen. 1997. Why the impact factor of journals should not be used for evaluating research. BMJ: British Medical Journal, 314(February):498-502.