In a few words

My research deals with the broad problem of high-throughput sequencing data analysis. The data produced by these sequencing technologies, called reads, make it possible to address a wide variety of problems in biology, such as variant or mutation detection, as well as de novo assembly, which aims at producing new reference genomes for species that lack one. More precisely, my work focuses on highly noisy long reads from third-generation sequencing technologies, and on the issues related to processing and correcting their errors. I am also interested in linked-reads data, which combine high sequencing quality with long-range information, and more specifically in structural variant calling from such data.

Keywords: bioinformatics, high-throughput sequencing, error correction, structural variants, alignment, assembly, indexing

Software

  • LEVIATHAN

    A structural variant calling tool that reduces resource consumption compared to the state of the art, and makes it possible to analyze non-model organisms, on which existing tools cannot be applied.

  • LRez

    A tool and a C++ library for processing the barcodes of linked-reads data (indexing, querying, ...), from both BAM and FASTQ files.

  • CONSENT

    A self-correction tool for long reads offering excellent scalability. To date, CONSENT is the only tool able to scale to ultra-long read data.

  • ELECTOR

    A tool for evaluating the quality of long-read error correction tools.

  • HG-CoLoR

    A hybrid error correction tool for long reads, mainly designed to process extremely noisy long reads.

In a lot of words

During my PhD, my work mainly focused on the analysis of reads from third-generation sequencing technologies. These reads, in contrast to those from second-generation sequencing technologies, reach much greater lengths (several tens of thousands of base pairs, as opposed to only a few hundred), but also display much higher error rates (15 to 30% on average, as opposed to about 1%). Second-generation reads are thus referred to as short reads, while third-generation reads are referred to as long reads. Although the length of long reads is particularly interesting, especially for solving assembly problems, their high error rates restrict their use in practice. Specific algorithmic developments are thus necessary to deal with these errors. Two main approaches exist in the field of long-read correction. On the one hand, hybrid correction uses the information carried by high-quality short reads to correct the long reads. On the other hand, self-correction avoids the use of short reads altogether, and corrects long reads solely based on the information they contain. Since the emergence of third-generation sequencing technologies, many error correction tools have been developed.
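
To give a concrete idea of the hybrid correction principle, here is a minimal, purely illustrative Python sketch: k-mers that occur often enough in the short reads are considered solid, and positions of a long read covered by no solid k-mer are flagged as likely errors. The k-mer size, count threshold and function names are arbitrary choices for the example; this is not the algorithm of any particular tool.

    # Illustrative sketch of the hybrid correction idea (not any tool's algorithm):
    # solid k-mers from high-quality short reads highlight erroneous regions
    # of a noisy long read.
    from collections import Counter

    K = 21          # k-mer size (arbitrary, typical order of magnitude)
    MIN_COUNT = 3   # k-mers seen at least this often in short reads are "solid"

    def solid_kmers(short_reads, k=K, min_count=MIN_COUNT):
        """Collect k-mers that occur frequently enough in the short reads."""
        counts = Counter()
        for read in short_reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        return {kmer for kmer, count in counts.items() if count >= min_count}

    def weak_positions(long_read, solid, k=K):
        """Positions of the long read covered by no solid k-mer: likely errors."""
        covered = [False] * len(long_read)
        for i in range(len(long_read) - k + 1):
            if long_read[i:i + k] in solid:
                for j in range(i, i + k):
                    covered[j] = True
        return [i for i, ok in enumerate(covered) if not ok]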

First, my PhD led to the development of a method for automatically evaluating the quality of the correction provided by the different available methods. This work was mainly motivated by the scalability issues, especially in terms of runtime, of the only other method allowing such an evaluation. Compared to that method, the method we proposed reduces the runtime of the evaluation by up to 22-fold. Using this evaluation method to carry out an in-depth performance analysis of all existing correction methods also revealed two major difficulties for the state of the art: the correction of reads with error rates higher than 30%, and the correction of reads longer than 50,000 base pairs.
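
As an illustration of the kind of metric such an evaluation reports, the following toy Python snippet measures the error rate of a (corrected) read against its reference sequence, as an edit distance divided by the reference length. Real evaluators rely on proper alignment and segmentation strategies; this is only a sketch of the underlying idea.

    # Toy illustration of a correction-quality metric: the remaining error
    # rate of a read measured against its reference sequence.

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    def error_rate(read, reference):
        return edit_distance(read, reference) / len(reference)

    # Example: a read with two substitutions against a 20 bp reference
    print(error_rate("ACGTACGTACGAACGTACGA", "ACGTACGTACGTACGTACGT"))  # 0.1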

Second, my PhD led to the development of two correction methods aiming to overcome the aforementioned difficulties. A first, hybrid correction method was developed with the aim of efficiently correcting long reads with error rates higher than 30%. Compared to other hybrid correction methods, it achieves the best compromise between execution time and quality of results, reducing, for instance, the error rate of a dataset from an initial 44% down to 0.3%. A second method, this time adopting a self-correction approach, was then developed with the aim of correcting extremely long reads. Compared to other self-correction methods, it achieved a greater reduction of the error rate on a human genome dataset, and thus led to a higher-quality assembly. In addition, this method also allowed the correction of reads of up to 340,000 base pairs, which none of the previously available self-correction methods could handle.
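
The self-correction principle can be illustrated with a toy consensus step: given several error-containing copies of the same region, assumed here to be already aligned and of equal length (real tools obtain this from read overlaps and multiple sequence alignment), a column-wise majority vote recovers the most likely sequence. This sketch is not the algorithm of either of the methods described above.

    # Minimal sketch of the consensus idea behind self-correction: column-wise
    # majority vote over a pile of (already aligned) error-containing copies
    # of the same region.
    from collections import Counter

    def consensus(aligned_reads):
        """Majority base at each column of a pile of aligned sequences."""
        return "".join(
            Counter(column).most_common(1)[0][0]
            for column in zip(*aligned_reads)
        )

    pile = [
        "ACGTACGTAC",
        "ACGAACGTAC",   # error at position 3
        "ACGTACCTAC",   # error at position 6
        "ACGTACGTAC",
    ]
    print(consensus(pile))  # ACGTACGTAC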

As part of my postdoc, I am currently interested in the processing of linked-reads data. These reads combine the high quality of second-generation reads with long-range information, obtained by adding identifiers, called barcodes, to the reads originating from the same DNA molecule. In this way, linked reads combine the advantages of both short and long reads. First, my postdoc led to the development of a tool and a C++ API for efficiently processing the barcodes contained in these data, notably via indexing and querying mechanisms. This contribution represents the first tool and the first library described in the literature enabling such processing. Second, my work led to the development of a method for detecting structural variants from linked-reads data. This method outperforms the state of the art in terms of memory consumption and execution time, and also allows the processing of non-model organisms, which existing tools are unable to analyze. In the future, my work will focus on improving the two aforementioned tools, and on developing new structural variant genotyping methods.
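
As an illustration of what barcode indexing and querying mean in practice, the sketch below uses the pysam library rather than the LRez API: it builds a map from each barcode (the standard BX tag of linked-read BAM files) to the positions of the alignments carrying it, then queries one barcode. The file name and the barcode value are placeholders, and the structure shown is only one possible way of organizing such an index.

    # Illustrative barcode index for linked-reads (not the LRez API): map each
    # BX barcode to the positions of the alignments carrying it.
    # Assumes pysam is installed and "sample.bam" is a coordinate-sorted,
    # indexed linked-read BAM file.
    from collections import defaultdict
    import pysam

    def index_barcodes(bam_path):
        index = defaultdict(list)
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for aln in bam.fetch():
                if aln.has_tag("BX"):
                    index[aln.get_tag("BX")].append(
                        (aln.reference_name, aln.reference_start)
                    )
        return index

    index = index_barcodes("sample.bam")
    # All positions covered by the molecule sharing this (placeholder) barcode:
    print(index.get("AACGTGATACGT-1", []))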
