@openaire_eu

Explore, model, analyze and visualize systematic research in OpenAIRE
…via text and data mining (topic modeling)
A bird’s eye view

Natalia Manola
University of Athens, Department of Informatics
Athena Research & Innovation Center

Force2017, Berlin, 27 Oct, 2017

Publications ARE data in TDM.
They should abide by the FAIR principles.
Big volumes of data

Meta research: Research analytics

WHY: Mining scientific/scholarly literature

[Diagram: entities and relations in scholarly literature]
Entities (attributes):
• Author (Name, Institution)
• Paper (Title, Key Words, Topics, Words (BoWs), Venue)
• User (Queries, Downloads, Sessions)
• Funding (Name, Grant No, Start - End)
Relations: Author Writes Paper; User Search for Paper; Paper Citing / Cited Paper; Is Funded; Is Related; User-User (?); Author-Author (?)

Questions:
• Thematic analysis: What are the topics / concepts?
• Entity Resolution: Do they refer to the same person?
• Similarity analysis & link prediction: Is it related?
• Analyze the role of funding
• Get or recommend relevant content: ranking & similarity analysis
• Structuring Effects: Identify & model research communities
• Attribute Prediction: What could be the (possible) venue?
• Research impact & timeliness

WHY: Mining scientific/scholarly literature

New models → new insights → better decisions
Real output vs. project & call descriptions

Analyze large collections of documents and metadata to:
• Assess research collaboration: authorship network analysis
• Identify active areas of research: discover hidden themes (topics)
• Understand what is actually produced
• Discover clusters and communities
• Identify emerging research areas
• Assess coverage, identify gaps or new challenges

HOW: Probabilistic Multi-View Topic Modeling of Text-Augmented Heterogeneous Information Networks

§ Interconnected (linked) entities characterized by TEXT
§ Related side information & links (e.g., taxonomies, venues, projects / research areas, citations, authors)
§ Side-information
  § structured or unstructured attributes, links / relations and metadata
  § form networks: e.g., authorship network, citation network, …
  § incomplete or missing, noisy or not related to textual attributes

Multi-View vs Text only: interoperability and coverage

MV_HDP views for two example topics

Topic: “Topic Modeling”
• Text: topic, latent, lda, document, dirichlet, probabilistic, mining, semantic, allocation, generative, word, mixture, topical, corpus, plsa, bayesian, unsupervised, …
• Citations (ranked list of citation net nodes): “Dynamic topic models”, “Topics over time”, “Joint latent topic models for text and citations”, “Topic modeling”, “Probabilistic topic models”, “Probabilistic latent semantic indexing”, …
• Taxonomy: H.3.3 IR: Information Search and Retrieval, H.3.1 IR: Content Analysis and Indexing, H.2.8 DB MNGMT: Database Applications, I.2.6 AI: Learning, I.2.7 AI: Natural Language Processing, I.5.1 PAT.REC.: Models
• Keywords: topic modeling, latent dirichlet allocation, latent semantic analysis, generative model, text mining
• Venues: SIGKDD, WSDM, CIKM

Topic: “Cloud/Distributed computing & Big Data Analytics”
• Text: mapreduce, big, hadoop, analytics, cluster, map, scalable, datasets, queries, cloud, intensive, jobs, databases, massive, google, job, scalability, node, computations, mining, hdfs, hive, machine, workloads, volume, …
• Citations (ranked list of citation net nodes): “A comparison of approaches to large-scale data analysis”, “Pig latin”, “Mesos”, “DryadLINQ”, “PREGEL”, “CIEL”, “Improving MapReduce performance in heterogeneous environments”, “MapReduce Online”, “MapReduce Merge”, …
• Taxonomy: H.2.4 DB MNGMT: Systems, D.1.3 PROGR. TECHNIQUES: Concurrent Programming, C.2.4 COMP.-COMM. NETS: Distributed Systems, H.2.8 DB MNGMT: Database Applications, H.3.4 INFO STORAGE AND RETRIEVAL: Systems and Software
• Keywords: big data, Map-Reduce, hadoop, cloud computing, distributed computing, data analytics, machine learning, parallel processing
• Venues: SIGMOD, BigSystem, CloudCP, EUROSEC, EUROSYS, …

Good metadata is important

What is involved?

1 ENRICH & PRE-PROCESS
• Extract features and annotate (enrich) content using NLP, Named Entity Recognition & Semantic Annotation
• Tokenize, remove stop words
• Refine stop words for specific domain

2 FIND TOPICS
• Identify topics: distribution over words & “side” information
• Automatic topic curation & entitling
• Assign topics to publications
• Evaluate & categorize topics
• Assess topic labels

3 CALCULATE TRENDS & SIMILARITIES
• Calculate topic proportions & trends of objects based on their publications
• Calculate similarity among different entities based on various metrics
• Analyze & validate the results

4 VISUALIZE
• Create interactive web visualizations with data-driven graphs, charts and layouts
• Design optimal views
• Validate modeling results

What is the result?

1. Linked information

How often is “Topic Modeling” encountered?
Rank  TopicId  Weight  Title
230   18       0.0028  Data management & file systems
231   132      0.0027  Image processing: Face & emotion recognition, facial animation
232   373      0.0027  Project management & software development
233   138      0.0027  Self-adaptive systems & autonomic computing
234   360      0.0026  S/W development, management & maintenance
235   96       0.0026  Gender differences (analysis, studies)
236   271      0.0025  Haptic technology, feedback & multimodal user interaction
237   322      0.0025  Information extraction, Named entity recognition, disambiguation, cleaning
238   348      0.0025  Cognitive psychology, cognitive and mental models
240   74       0.0025  HCI: Touch screen interaction & interactive surfaces
241   382      0.0025  Topic Modelling
242   230      0.0025  Trust & reputation analysis and management (IOT, Web, recom. systems)
243   2        0.0025  Wikipedia & collaborative editing
245   15       0.0025  Crowdsourcing & human computation
246   273      0.0024  Automatic programming, refactoring & transformations
248   323      0.0024  Reliability, fault tolerance and recovery
249   113      0.0024  Online / computational advertising

Out of 382

Association for Computing Machinery (ACM) Corpus

Is it trendy?
TopicId  Weight  Trend  Journal  Conference  Title
15       0.003   27.89  0.068    0.035       Crowdsourcing & human computation
194      0.004   23.56  0.077    0.011       Cloud Computing, Storage & Virtualization
201      0.004   10.82  0.119    0.066       Social network analysis: influence, info diffusion, communities
350      0.006   10.54  0.057    0.022       Distributed (Big) Data analytics (cloud, MapReduce)
41       0.005   9.86   0.135    0.019       Mobile applications
68       0.004   9.72   0.078    0.049       Social media analysis (twitter, blogs, news feed)
366      0.003   8.65   0.126    0.070       Persuasive technologies, gamification, user engagement
61       0.003   8.24   0.135    0.044       Wearable computing, technology & activity recognition
40       0.002   7.72   0.096    0.100       ICT in developing countries (India)
341      0.004   6.78   0.120    0.029       GPU computing
133      0.006   6.27   0.096    0.085       Recommendation, personalization and collaborative filtering
134      0.002   6.2    0.144    0.077       Flash memory structures, storage & systems
22       0.001   6.04   0.123    0.101       HCI: Organic & Flexible user interfaces
74       0.003   5.87   0.205    0.118       HCI: Touch screen interaction & interactive surfaces
2        0.003   5.33   0.079    0.083       Wikipedia & collaborative editing
52       0.013   5.15   0.156    0.082       HCI design & user experience
266      0.002   4.95   0.057    0.047       Sentiment analysis & opinion mining
10       0.006   4.91   0.082    0.048       Image retrieval & object recognition
382      0.003   4.57   0.111    0.069       Topic Modelling
228      0.003   3.92   0.128    0.094       Software product line engineering
100      0.005   3.88   0.115    0.037       Social tagging, annotation & tag recommendation
294      0.005   3.34   0.066    0.170       Robotics, human-robot interaction, anthropomorphism

Top 20

Concept driven search

PubId    Weight  Title
1646242  0.72    Dynamic hyperparameter optimization for bayesian topical trend analysis
1871521  0.67    Latent interest-topic model
2505555  0.64    On handling textual errors in latent document modeling
2398646  0.63    Automatic labeling hierarchical topics
1458337  0.63    Combining concept hierarchies and statistical topic models
2348335  0.63    Group matrix factorization for scalable topic modeling
2009977  0.63    Mining topics on participations for community discovery
1835890  0.62    Topic models with power-law using Pitman-Yor process
2398483  0.61    Hierarchical topic integration through semi-supervised hierarchical topic modeling
1150482  0.60    A mixture model for contextual text mining
1963244  0.60    Investigating topic models for social media user recommendation
1281249  0.60    Multiscale topic tomography
2086739  0.59    Sequential Modeling of Topic Dynamics with Multiple Timescales
1572095  0.59    A latent topic model for linked documents
2188143  0.59    Latent contextual indexing of annotated documents
1859210  0.58    Topic models vs. unstructured data
1487045  0.58    Linked Topic and Interest Model for Web Forums
2609471  0.58    Probabilistic text modeling with orthogonalized topics
2396861  0.57    Modeling topic hierarchies with the recursive chinese restaurant process
2433438  0.57    Group sparse topical coding
1935880  0.57    Trend analysis model
1390546  0.56    Improving text classification accuracy using topic modeling over an additional corpus
1553410  0.55    Accounting for burstiness in topic models

View top 23 most related publications to “Topic Modeling”

Visualization

Trendy, old-fashioned, common topics

Trendy topics
[Chart; topics shown: Distributed (Big) Data analytics, HCI design & user experience, GPU]

Trendy topics: compare topics
[Chart; topics shown: HCI design & user experience, GPU, Distributed (Big) Data analytics]

Old-fashioned topics
[Chart; topics shown: Relational DBs, Programming]

Do we need another venue?
Trendy, but evenly spread across many journals AND conferences

TopicId  Weight  Trend  Journal  Conference  Exclusivity  Title
15       0.003   27.89  0.068    0.035       0.103        Crowdsourcing & human computation
194      0.004   23.56  0.077    0.011       0.088        Cloud Computing, Storage & Virtualization
201      0.004   10.82  0.119    0.066       0.185        Social network analysis: influence, info diffusion, communities
350      0.006   10.54  0.057    0.022       0.079        Distributed (Big) Data analytics (cloud, MapReduce)
41       0.005   9.86   0.135    0.019       0.154        Mobile applications
68       0.004   9.72   0.078    0.049       0.127        Social media analysis (twitter, blogs, news feed)
366      0.003   8.65   0.126    0.070       0.196        Persuasive technologies, gamification, user engagement
61       0.003   8.24   0.135    0.044       0.179        Wearable computing, technology & activity recognition
40       0.002   7.72   0.096    0.100       0.196        ICT in developing countries (India)
341      0.004   6.78   0.120    0.029       0.149        GPU computing
133      0.006   6.27   0.096    0.085       0.181        Recommendation, personalization and collaborative filtering
134      0.002   6.2    0.144    0.077       0.221        Flash memory structures, storage & systems
22       0.001   6.04   0.123    0.101       0.224        HCI: Organic & Flexible user interfaces
74       0.003   5.87   0.205    0.118       0.323        HCI: Touch screen interaction & interactive surfaces
2        0.003   5.33   0.079    0.083       0.162        Wikipedia & collaborative editing
52       0.013   5.15   0.156    0.082       0.238        HCI design & user experience
266      0.002   4.95   0.057    0.047       0.104        Sentiment analysis & opinion mining
10       0.006   4.91   0.082    0.048       0.130        Image retrieval & object recognition
382      0.003   4.57   0.111    0.069       0.180        Topic Modelling
228      0.003   3.92   0.128    0.094       0.222        Software product line engineering
100      0.005   3.88   0.115    0.037       0.152        Social tagging, annotation & tag recommendation
294      0.005   3.34   0.066    0.170       0.236        Robotics, human-robot interaction, anthropomorphism

Exclusivity = Journal + Conference

[Chart: Exclusivity vs Trend (x-axis: Trend, y-axis: Exclusivity)]

Important but declining (?)
[Chart; topics shown: Genetic algorithms, P2P networks & content distribution]

Topic birth, death & fluctuation over time
[Chart; topic shown: Genetic algorithms]

Authors Similarity Analysis
[Graph: author similarity network]
• NODES represent Authors
• LINKS represent topic based similarity
• Legend: Root ACM Categories (level 0), Similar Authors, Topics, Highlighted Author
+FEATURES
• Zoom for drill down
• Search and filtering
• Dynamic configuration of thresholds

Categories correlations

What is the potential?
• Funders and institutions to assess research impact over time
  • Especially useful when combined with non-research data
  • OpenAIRE data and services already used by EC for ex-post FP7 evaluation
• Policy makers
  • Binding research to societal policy decisions
• Scholarly societies
  • Determine new conferences / merge existing ones. Introduce new themes…
  • New portal services (concept search)
• Publishers (incl. institutional publications)
  • Create, adapt journals…

Scratching the surface…

Thank you!

Natalia Manola
natalia@di.uoa.gr
+30 210 9876 432
Skype: natalia.manola
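
Illustration for step 3 of “What is involved?” (calculate topic proportions, trends and similarities): a minimal Python sketch, not the OpenAIRE implementation. The sample publications, the `trend` definition (a topic's mean yearly share from a split year onward divided by its mean yearly share before it) and the use of cosine similarity between author topic profiles are all assumptions made for illustration; the slides do not state how their Trend values or similarity metrics are computed.

```python
from collections import defaultdict
from math import sqrt

# Made-up sample data (assumption): (publication id, year, author, topic weights).
# In practice the weights would come from a fitted topic model.
PUBLICATIONS = [
    ("p1", 2005, "alice", {"topic_modeling": 0.8, "databases": 0.2}),
    ("p2", 2006, "bob", {"databases": 0.9, "topic_modeling": 0.1}),
    ("p3", 2014, "alice", {"topic_modeling": 0.7, "cloud": 0.3}),
    ("p4", 2015, "carol", {"topic_modeling": 0.6, "cloud": 0.4}),
]


def topic_share_by_year(pubs):
    """Return {year: {topic: mean weight over that year's publications}}."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for _pub_id, year, _author, weights in pubs:
        counts[year] += 1
        for topic, weight in weights.items():
            totals[year][topic] += weight
    return {
        year: {topic: total / counts[year] for topic, total in topics.items()}
        for year, topics in totals.items()
    }


def trend(pubs, topic, split_year):
    """Hypothetical trend score: mean yearly share from split_year onward
    divided by the mean yearly share before it (None if undefined)."""
    shares = topic_share_by_year(pubs)
    early = [s.get(topic, 0.0) for y, s in shares.items() if y < split_year]
    late = [s.get(topic, 0.0) for y, s in shares.items() if y >= split_year]
    if not early or not late or sum(early) == 0:
        return None
    return (sum(late) / len(late)) / (sum(early) / len(early))


def author_profiles(pubs):
    """Return {author: {topic: proportion}}, each profile normalized to sum to 1."""
    sums = defaultdict(lambda: defaultdict(float))
    for _pub_id, _year, author, weights in pubs:
        for topic, weight in weights.items():
            sums[author][topic] += weight
    profiles = {}
    for author, topics in sums.items():
        total = sum(topics.values())
        profiles[author] = {t: w / total for t, w in topics.items()}
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse topic vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    profiles = author_profiles(PUBLICATIONS)
    print("trend(topic_modeling):", trend(PUBLICATIONS, "topic_modeling", 2010))
    print("alice ~ carol:", cosine(profiles["alice"], profiles["carol"]))
    print("alice ~ bob:", cosine(profiles["alice"], profiles["bob"]))
```

In this toy data the author who publishes mostly on topic modeling comes out closer to the other topic-modeling author than to the database author, which is the kind of link an author similarity graph would draw once a threshold is applied.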