A De Novo Robust Clustering Approach for Amplicon-Based Sequence Data

When analyzing microbial communities, an active computational challenge concerns the
categorization of 16S rRNA gene sequences into operational taxonomic units (OTUs).
Established clustering tools use a one-pass algorithm to handle large numbers of gene
sequences and produce OTUs in reasonable time. However, all of the current tools are based
on a crisp clustering approach, where each gene sequence is assigned to exactly one cluster.
The lower quality of the output, compared with that of more complex clustering algorithms,
forces the user to postprocess the obtained OTUs. Providing a membership degree when assigning a gene
sequence to an OTU will help the user during the postprocessing task. Moreover, it is
possible to use this membership degree to automatically evaluate the quality of the obtained
OTUs. The goal of this study is therefore to propose a new clustering approach that takes
uncertainty into account when producing OTUs and improves both the quality and the
presentation of the OTU results.

Keywords: algorithm, clustering, sequences.