Sarah Cohen-Boulakia is an Associate Professor in Computer Science in the Bioinformatics group at Paris-Sud University. A graduate in computer science, her main interest lies in multidisciplinary research.
For 15 years, Sarah Cohen-Boulakia has been collaborating closely with biologists, physicians, and bioinformaticians from France (Institut Curie, INRA, CIRAD and INRIA groups in Montpellier), Europe (Universities of Berlin, Manchester, Newcastle) and North America (Children's Hospital of Pennsylvania, Universities of Pennsylvania, New York, Montreal…).
As a bioinformatician, what do you really work on?
Biological data are increasingly abundant and voluminous. They come in various formats (Excel sheets, text, databases, images...) and are spread across numerous data sources. Acquiring new knowledge in biology requires us to consider how these data complement one another.
I work in the field of data integration. My task is to unify and bring together these numerous and highly heterogeneous data.
What does bioinformatics have to do with Big Data?
Recent sequencing techniques have revolutionized molecular biology. In 2015, such techniques enable a single machine to sequence 200 human genomes per week at a cost of 0.03 dollars per megabase (a unit of length for DNA fragments equal to 1 million nucleotides). Compare this with the 1990s, when 12 years and hundreds of laboratories were needed to sequence the first human genome, at an estimated cost of 10,000 dollars per megabase.
As a result, the volume of available data is growing faster than the capacity of computers to store and process it.
While the physical sequencing used to be the primary expense of genome sequencing, the cost now lies mainly in the analysis of the data the sequencing techniques produce, particularly the assembly phase. In this context, the role of (bio)informaticians is crucial.
Is it possible to process biological data as is done for other data?
Biological data are different from other data. A book can be easily identified by its ISBN number, or a person by their social security number, but what about genes? Well, it is not that simple...
Several teams around the world may work on sequencing the same portion of a genome and realize only afterwards that they are working on the same gene.
Biological data thus reflect expertise. In some cases, seemingly contradictory results may actually be explained by the differing opinions of two experts or by different contexts of analysis.
The integration of biological data brings new challenges, revitalizing database research in this direction.
This month, you are invited as a speaker for the Labex DigiCosme SpringSchool on Big Data. What will you be talking about?
My lecture will present the challenges related to the analysis of large amounts of bioinformatic data.
One of them is to guarantee the reproducibility of bioinformatic experiments, namely allowing biologists to trust that they will be able to get the same experimental results (in silico) 6 months, a year, or 2 years after the initial experiment.
Three factors must be taken into account. First of all, there is the myriad of available tools, which are regularly updated and more or less well maintained. Then, we are faced with the mass of generated data, some of which may be very similar to one another without matching exactly. Last but not least is the inherent interdisciplinarity of the field, which results in the design of complex analysis protocols involving many steps. Taking all the above into account, the reproducibility of bioinformatic experiments poses a difficult problem.
Scientific workflow systems were designed to guide users through the analysis, help them link together the steps of the analysis, find the right tools and execute the analysis while keeping the provenance of the information (i.e. keeping track of the exact data used and generated). A workflow is the computer representation of an analysis protocol.
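To make the idea of a workflow that records what it used and produced more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the design of any particular workflow system: the `Step` and `Workflow` classes, the `fingerprint` helper and the two toy steps are all hypothetical names introduced for this example.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, List


def fingerprint(value) -> str:
    """Return a short, stable hash of a JSON-serializable value (hypothetical helper)."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class Step:
    """One step of an analysis protocol: a named tool at a given version."""
    name: str
    tool_version: str
    run: Callable


@dataclass
class Workflow:
    """An ordered chain of steps that logs the data each step used and generated."""
    steps: List[Step]
    provenance: List[dict] = field(default_factory=list)

    def execute(self, data):
        for step in self.steps:
            result = step.run(data)
            # Record which tool version ran, and hashes of its input and output.
            self.provenance.append({
                "step": step.name,
                "tool_version": step.tool_version,
                "input": fingerprint(data),
                "output": fingerprint(result),
            })
            data = result
        return data


# Toy steps, assumed purely for illustration.
def clean_reads(reads):
    return [r.upper() for r in reads if r]


def count_gc(reads):
    return sum(r.count("G") + r.count("C") for r in reads)


workflow = Workflow([
    Step("clean_reads", "1.0", clean_reads),
    Step("count_gc", "0.3", count_gc),
])
gc_total = workflow.execute(["acgt", "", "ggcc"])
```

Because every record stores the tool version together with hashes of the input and output, rerunning the same workflow months later and comparing the two provenance logs shows at which step, if any, the results started to diverge.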
What are the related issues?
Many issues remain open in this field. They relate not only to applied computer science but also - and above all - to fundamental algorithmic aspects.
For example, there are workflow repositories, where users deposit their analysis protocols. We wish to allow users to discover new workflows, which amounts to searching for workflows similar to the ones they already know. These protocols form complex graphs, bringing us back to the difficult problem of graph comparison. The objective is to find adequate structures to represent workflows, making the comparison possible and not too time-consuming. The same problem arises when one wants to compare experiment results.
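As a rough illustration of what searching for similar workflows can mean, the sketch below represents each workflow as a set of directed edges between step names and ranks the workflows of a repository by Jaccard similarity of their edge sets. This is a deliberately simple stand-in chosen for the example, not the comparison method discussed in the interview, and it ignores most of what makes graph comparison hard (for instance, steps that do the same thing under different names). The repository contents and step names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (upstream step, downstream step)


def jaccard(a: Set[Edge], b: Set[Edge]) -> float:
    """Share of edges common to both workflows among all edges of either one."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_similar(query: Set[Edge],
                 repository: Dict[str, Set[Edge]]) -> List[Tuple[str, float]]:
    """Return (workflow name, score) pairs, most similar first."""
    scores = [(name, jaccard(query, edges)) for name, edges in repository.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical repository: each workflow is described by its step-to-step links.
repository = {
    "variant_calling": {("align", "sort"), ("sort", "call_variants")},
    "expression": {("align", "sort"), ("sort", "count_reads"),
                   ("count_reads", "normalize")},
    "phylogeny": {("multi_align", "build_tree")},
}

query = {("align", "sort"), ("sort", "call_variants"),
         ("call_variants", "annotate")}

ranking = rank_similar(query, repository)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Comparing edge sets is cheap, which matters when a repository holds many workflows, but it only counts exact matches on step names; a more faithful comparison would need a richer representation of the graphs.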
The management and querying of provenance data raise new challenges for computer science: how can we efficiently keep track of workflow executions? How can we efficiently query those executions?
How is your work environment enhanced by the Campus Paris-Saclay?
The "plateau de Saclay" offers a wide range of teams for potential collaborations. For instance, I am involved in the Living Systems Modeling Institute (Institut de Modélisation des Systèmes Vivants – IMSV), studying the integration of available data on bacteria.
Part of the team is currently building an ontology (that is, a defined set of terms and the relationships between those terms). The aim is to describe the mechanisms that govern cell activity in bacteria. We base our work on existing ontologies, which must be complemented, generalized or made more specific. Processing the numerous and heterogeneous available data requires designing scientific workflows, which is where I come in.
For more information:
>> Each year, DigiCosme organizes a SpringSchool presenting subjects related to the Labex's research themes. The 2015 SpringSchool will present various aspects of Big Data Management: DigiCosme 2015 Summer School on Data Management
>> The Living Systems Modeling Institute (Institut de Modélisation des Systèmes Vivants – IMSV) is an interdisciplinary initiative of the Université Paris-Saclay. Its goal is to create models to predict and improve the behaviour of living systems (plants, bacteria).