Estimating sequence similarity from read sets for clustering next-generation sequencing data
From MaRDI portal
Abstract: To cluster sequences given only their read-set representations, one may try to reconstruct each sequence from its read set and then apply conventional (dis)similarity measures, such as the edit distance, to the assembled sequences. This approach is, however, problematic, and we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases. It avoids the NP-hard problem of sequence assembly. For low-coverage data, it yields a better approximation of the true sequence similarities, and consequently better clustering, than the first-assemble-then-cluster approach.
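To make the idea in the abstract concrete, the following is a minimal sketch of the classic Monge-Elkan similarity applied to two read sets. It is not the paper's adaptation, which this page does not describe. The inner read-to-read similarity (one minus the normalized edit distance), the symmetrization by averaging, and all function names are assumptions chosen for illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def read_similarity(a: str, b: str) -> float:
    """Assumed inner similarity: 1 - edit distance / longer read length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def monge_elkan(reads_a: Sequence[str], reads_b: Sequence[str]) -> float:
    """Classic (asymmetric) Monge-Elkan: for each read in reads_a take its
    best match in reads_b, then average those best-match scores."""
    if not reads_a or not reads_b:
        return 0.0
    best = (max(read_similarity(a, b) for b in reads_b) for a in reads_a)
    return sum(best) / len(reads_a)


def symmetric_monge_elkan(reads_a: Sequence[str], reads_b: Sequence[str]) -> float:
    """Assumed symmetrization: average of the two directed scores."""
    return 0.5 * (monge_elkan(reads_a, reads_b) + monge_elkan(reads_b, reads_a))
```

The resulting pairwise scores could then be passed to any standard clustering method in place of distances computed on assembled sequences. Note that this sketch compares whole reads, whereas real reads from the same sequence usually only partially overlap, so a practical inner measure would likely need to account for overlaps.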
Recommendations
- Algorithms for indexing highly similar DNA sequences
- Better greedy sequence clustering with fast banded alignment
- A heuristic clustering method based on neighbor-seeds for 454 sequencing data
- Distance measures for biological sequences: some recent approaches
- A novel method for sequence similarity analysis based on the relative frequency of dual nucleotides
Cites work
- scientific article; zbMATH DE number 3240929
- A Method for Comparing Two Hierarchical Clusterings
- A measure of the similarity of sets of sequences not requiring sequence alignment
- Approximate string-matching with q-grams and maximal matches
- The String-to-String Correction Problem
Cited in (6)
- De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm
- Subset Clustering of Binary Sequences, with an Application to Genomic Abnormality Data
- A heuristic clustering method based on neighbor-seeds for 454 sequencing data
- Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches
- Better greedy sequence clustering with fast banded alignment
- Can we replace reads by numeric signatures? Lyndon fingerprints as representations of sequencing reads for machine learning