Published on 12 February 2019
Jean d'Alembert grant program

Pierre Zweigenbaum is a Research Director at the Computer Science Laboratory for Mechanics and Engineering Sciences (CNRS-Limsi). In 2018, thanks to a d'Alembert grant, he invited Kevin Cohen, a linguist at the University of Colorado who specializes in natural language processing. Their collaboration made it possible to model programs for the automatic analysis of causes of death and to evaluate them across several countries.

"Between computer science and linguistics, automatic language processing is one of the branches of artificial intelligence," explains Pierre Zweigenbaum. The main challenge is to transform the use of a language, in an article for example, into digital data, exploitable by computer science. The machine will have to remove the ambiguities of certain words related to contexts of enunciation. The application domains are numerous but call upon the combined fundamentals of the two disciplines: linguistics identifies contexts and computer science finds the algorithm that "disambiguates" the use of words in this context.

A machine to read faster

Targeting a corpus for automatic semantic analysis is the main activity of Pierre Zweigenbaum's team. "Everything that is said about a given subject in natural language carries knowledge. We extract information available in computerized form (print, scientific articles, social networks, forums, etc.) and convert it into a structured form that is 'readable' by a computer." Having come from the APHP, Pierre Zweigenbaum has specialized in the biomedical sector, which according to him "produces a lot of knowledge". With the race to publish, in genomics for example, articles appear faster than specialists can read them to update their knowledge. The challenge is to create "reading machines" that quickly extract key information from a corpus. "Each time, there is a schema we want to instantiate: we define a type of task in which we precisely target the information to collect, such as the circumstances in which a gene influences a protein. We delimit the corpus and then we develop an algorithm."
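To make the idea of "instantiating a schema" concrete, here is a minimal Python sketch. It is not the team's method: the gene and protein names, the list of verbs and the pattern are all hypothetical, chosen only to show how free text can be turned into structured records once the target information has been defined.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Influence:
    """The schema to instantiate: which gene has which effect on which protein."""
    gene: str
    effect: str
    protein: str


# Hypothetical toy vocabularies; a real system would rely on curated resources.
GENES = {"geneA", "geneB"}
PROTEINS = {"protX", "protY"}

# Naive pattern: <word> <effect verb> <word>. Assumed for illustration only.
PATTERN = re.compile(r"\b(\w+)\s+(activates|inhibits|regulates)\s+(\w+)\b")


def extract(text):
    """Return one Influence record per match whose arguments are known names."""
    records = []
    for match in PATTERN.finditer(text):
        gene, effect, protein = match.groups()
        if gene in GENES and protein in PROTEINS:
            records.append(Influence(gene, effect, protein))
    return records


corpus = (
    "geneA activates protX in damaged cells. "
    "Smoking inhibits recovery. "
    "geneB regulates protY."
)
records = extract(corpus)
for record in records:
    print(record)
```

The second sentence matches the pattern but is discarded because its arguments are not known gene or protein names, which is the kind of filtering a precisely targeted task requires.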

Replicable methods

"In science, we are constantly wondering how to make research reproducible. But we realized that the same algorithms applied to identical corpora did not always produce the same results, "says Pierre Zweigenbaum. This subject of the "replicability" of the methods of analysis used on the same corpora is of particular interest to Kevin Cohen, a recognized linguist at the University of Colorado. The American researcher has made his specialty of automatic language processing, applied to the medical field. It is very complementary to those of the ILES team. "I immediately thought of him when I answered the call for tenders of Alembert, says Pierre Zweigenbaum, because there are few language skills in Saclay. As part of a project of the Center for Epidemiology on Medical Causes of Death (CépiDC), the pair decided to work on the causes of death whose statistics must correspond to the standards of the WHO disease classification. The challenge is to go faster in the task of passing from physicians' writings on the causes of death to their computer encoding (currently, the delay is 18 months), then to compare and evaluate the different algorithms developed by forty teams on death certificates French, American, Italian and Hungarian. In three years, a hundred researchers have worked on hundreds of thousands of death certificates. "We went through the methods used in four languages". The results were published in three articles co-authored by Kevin Cohen.

FOCUS Grant recipient: Kevin Bretonnel Cohen

An admirer of France and in particular of its research environment, Kevin Cohen received a d'Alembert grant to stay at Paris-Saclay University in 2017 and 2018. A researcher in the Department of Linguistics at the University of Colorado, Denver, he has published numerous articles on his work on information extraction from biomedical texts. He is "very happy" to have collaborated with the Limsi teams, in which he particularly appreciated "the openness of scientific debate". Aiming to be fully "integrated", he made it a point of honor to improve his French and was able to lead seminars in that language for the laboratory's doctoral students.

FOCUS Laboratory: Pierre Zweigenbaum

A multidisciplinary computer science research laboratory, LIMSI brings together researchers and faculty members from engineering sciences and information sciences as well as life sciences and the humanities and social sciences. The scientific field thus covered includes language science and technology in the broad sense, human-machine interaction, virtual and augmented reality, as well as fluid mechanics and transfers, and energetics. Within LIMSI, two teams work on natural language processing: TLP (Spoken Language Processing) and ILES (Information, Written and Signed Language), which Pierre Zweigenbaum heads. The ILES group is dedicated to the processing of written language data (its analysis, comprehension or production, as well as the acquisition of the knowledge needed to achieve this) and signed language data (modeling and automatic processing of sign languages). "There are many skills in all areas of automatic language analysis at LIMSI on which I can rely to build research projects; it is one of the strengths of Université Paris-Saclay."