Published on 3 April 2018

Today’s world is filled with digital information, and we have technology sophisticated enough to capture massive amounts of data. Over the last couple of decades there has been a shift in thinking around data and the value it brings. The question is not so much how to capture it as what to do with it.

This is what the Université Paris-Saclay’s Centre for Data Science (CDS2.0) aims to look at. The project is one of several Strategic Research Initiatives funded by the Université Paris-Saclay (UPSaclay), which aim to help laboratories combine forces to address high-level science and technology issues that are linked to other existing projects. At this point, UPSaclay is supporting the CDS2.0 with half a million euros in funding.

The centre is focused on finding new tools and methods to mine large amounts of data for meaningful information. In addition to providing seed funding to individual projects, the CDS2.0 is working on the design and management of tools that can support a variety of projects with diverse needs. It also offers a range of training opportunities and special events, including strategy workshops, coding sprints, hackathons, bootcamps, and collaborative data challenges.

2-Day Workshop, Scientific Programming with Python and Software Engineering Best Practices

CDS2.0 is led by Balázs Kégl, senior researcher at CNRS, along with a team of UPS researchers including Alexandre Gramfort, Sarah Cohen-Boulakia, Gaël Varoquaux, and Cécile Germain. Altogether, the CDS2.0 unites a network of more than 350 researchers at 50 institutions.

The centre brings together a real mix of expertise: about half the researchers involved specialize in data science, while the other half focus on subjects such as neuroscience, economics, and physics. This type of interdisciplinary collaboration sets the stage for non-traditional research projects that can push the boundaries of what we know. One thing is certain: when data and best practices for analysing them are shared, the information becomes far more powerful. This is why the CDS2.0 and other research initiatives around the world are increasingly interested in large-scale data sharing and the push towards more open science. In fact, the CDS2.0 has kicked off an initiative, IO Data Science, which aims to help data producers and consumers link, discover, and reuse datasets. Its goal is to contribute to a culture of more open science and to create new synergies between diverse researchers and institutions.

What kinds of problems can big data help us solve?

Large amounts of data, and powerful tools to analyse them, can help us with a wide range of real-world problems. The applications for data mining tools are seemingly endless. Large-scale data analysis can support urban planning and the push towards building “smart cities” by giving us insight into how something like traffic circulation works today and how we could make it work better in the future. It can also help us identify important trends when monitoring the weather or a natural disaster such as an earthquake or hurricane. In health projects, having the right tools to help doctors and researchers comb through hundreds of thousands of research papers and case studies can lead to better and faster diagnoses of certain diseases.

What is the CDS2.0 doing?

One of the flagship activities of the CDS2.0 is a series of collaborative data challenges called RAMPs (Rapid Analytics and Model Prototyping).

During the RAMPs, 30-50 researchers and students are brought together to work on a data science problem drawn from a scientific field. The data provider explains the problem and the data set, then the participants tackle the problem, guided by coaches from the core CDS2.0 team. The scientific problems range from astrophysics to climate science (El Niño and Arctic sea ice forecasting), biodiversity, and health care. Following its local success, CDS2.0 has also been invited to several events at institutions outside Saclay.

One of these RAMPs focused on quickly building an accurate classification of pollinating insects based on data from a crowdsourcing project of the Paris Museum of Natural History. The work that was carried out is of particular relevance to the agricultural sector, whose livelihood depends greatly on the health of pollinator communities.

Another recent RAMP aimed to detect Mars craters in satellite images. In planetary sciences, impact craters can be used to study the history of our solar system. This type of work is traditionally done through visual inspection of images, which can be highly time-consuming and ultimately impractical. This RAMP helped develop new algorithms that let researchers automatically detect impact structures in planetary images.

Upcoming RAMPs will focus on detecting solar storms and on identifying indications in brain scans that could relate to autism. Stay up to date on these projects: RAMPs.

And because some of these data sets can be very large and very complex, it’s important to be able to view them in a way that gives them more meaning. This is why the CDS2.0 is also involved in a platform development initiative to help view and interact with data.

Chasing opportunities and rising to new challenges

While there are many opportunities ahead, the work of the CDS2.0 will not be without its challenges. The field of data science is fast-moving and quickly evolving, which means that staying ahead of the curve takes teamwork, creativity, dedication, and funding support. There is also a wealth of issues around data storage, the speed of data processing, the level of sophistication of today’s artificial intelligence, and privacy, all of which require continuous effort to resolve and improve.

If you are interested in following the progress of the CDS2.0 or attending one of its events, visit the CDS2.0's news page on its website.