DEIM Seminar


Automatic identification of relevant concepts in scientific publications


Dr. Alessio Cardillo

Organizing professors

Manlio De Domenico, Alex Arenas


École polytechnique fédérale de Lausanne (EPFL), Lausanne, Switzerland.


29-02-2016 16:00


In recent years, the increasing availability of publication records has drawn the attention of the scientific community. In particular, many efforts have been devoted to studying the organization and evolution of science by exploiting the textual information contained in articles, such as keywords and terms extracted from the title or abstract. Despite the considerable number of contributions so far, only a few works analyze the core of an article, i.e., its body. Access to the full text of documents opens up the opportunity to study the organization of scientific knowledge using networks of similarity between articles based on their content. Given the scientific concepts extracted from the body of the articles available on the ScienceWISE platform, it is possible to build an article similarity network. However, such a network has a remarkably high link density (36%), which spoils any attempt to associate communities of papers with a given topic. The reason for this failure is that not all concepts are truly informative and, even worse, some may not help to discriminate between articles. The presence of "generic concepts", i.e., those that either have too general a meaning or appear in a macroscopic fraction of the articles, gives rise to spurious similarities that account for a considerable number of connections in the system. To get rid of generic concepts, we introduce a method that evaluates their relevance using an information-theoretic approach. The significance of a concept $c$ is defined in terms of the distance between its maximum entropy, $S_{max}$, and its empirical entropy, $S_c$, calculated from its frequency of occurrence inside documents, $tf_c$. Retaining only the concepts whose entropy is far from its maximum makes it possible to identify generic concepts automatically and discard them, keeping only the "meaningful" ones and also reducing the number of concepts per paper.
Considering only meaningful concepts when constructing the similarity network has two implications: the number of links decreases, and so does the noise in the strength of similarities between articles. As a consequence, the filtered network displays a better-defined community structure, in which each community contains articles related to a specific topic. Finally, the method also works in a coarse-grained mode: it can identify the relevant concepts for a given set of articles, allowing a document corpus to be studied at different scales.
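
The entropy-based filtering described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of $S_{max}$ or $S_c$, so everything below is an assumption made for illustration only: $S_c$ is taken to be the Shannon entropy of how a concept's occurrences are spread over documents, $S_{max}$ is taken to be $\log N$ for a corpus of $N$ documents, and the function names, the toy counts, and the threshold value are all hypothetical. It is not the speaker's implementation.

```python
import math


def concept_entropy(tf_per_doc):
    """Shannon entropy (in nats) of a concept's occurrences across documents.

    ASSUMPTION: this stands in for the empirical entropy S_c; the abstract
    does not specify how S_c is actually computed.
    """
    total = sum(tf_per_doc)
    if total == 0:
        return 0.0
    return -sum((tf / total) * math.log(tf / total) for tf in tf_per_doc if tf > 0)


def significance(tf_per_doc):
    """Normalised distance (S_max - S_c) / S_max.

    ASSUMPTION: S_max = log(N), the entropy of a concept spread evenly over
    all N documents. Values near 0 suggest a generic concept, values near 1
    a concept concentrated in few documents.
    """
    n_docs = len(tf_per_doc)
    if n_docs < 2:
        return 0.0
    s_max = math.log(n_docs)
    return (s_max - concept_entropy(tf_per_doc)) / s_max


def select_meaningful(tf_table, threshold=0.3):
    """Keep the concepts whose entropy is far enough from its maximum.

    The threshold is a hypothetical tuning parameter, not a value from the talk.
    """
    return {c for c, tfs in tf_table.items() if significance(tfs) >= threshold}


# Toy corpus of four documents: term frequency of each concept per document.
tf_table = {
    "energy": [3, 3, 3, 3],        # appears evenly everywhere
    "graphene": [0, 12, 0, 0],     # concentrated in one document
    "percolation": [5, 0, 4, 0],   # concentrated in two documents
}
meaningful = select_meaningful(tf_table)
print(sorted(meaningful))
```

Under these assumptions, a concept that appears evenly in every document sits at its maximum entropy and is discarded as generic, while concepts concentrated in a few documents are retained.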


Laboratory 231