Refine translation in regions with clustered substitutions #43

tmaklin · 2025-03-04T15:29:41Z

The current algorithm outputs only gaps when multiple changes are clustered close to each other, for example:

reference: <matching bases> A A A A C G A A A <matching bases>
query:     <matching bases> A A A T C T A A A <matching bases>
result:    <matching bases> A A A - - - A A A <matching bases>

In this case, it should be possible to produce the actual result (the query contains TCT instead of ACG) by traversing the SBWT, similarly to how 'X's are resolved.


tmaklin · 2025-03-04T16:09:58Z

Added a test case in https://github.com/tmaklin/kbo/tree/resolve-clustered-substitutions

This seems doable using the same logic as for 'X's, but care needs to be taken that k is large enough to overlap the region on both sides.

tmaklin · 2025-03-05T12:40:27Z

Basic implementation finished in eaf2c1c

TODOs

What to do with dummy k-mers?
Current implementation uses a number of k-mers equal to the gap length to determine the gap sequence; should it search in a wider range around the region?
There are a few heuristics that restrict when gap filling is performed; need to check whether they make sense.

tmaklin · 2025-03-07T15:01:04Z

Some improvements based on chatting with Jarno:

We could search k-mers that overlap the gap on both sides until we find a unique k-mer that matches the bases around the gap.
Since the k-mer is unique, it has to be the one that bridges the gap in the original sequence (assuming the overlapping parts from the k-mers left and right of the gap match the unique k-mer that was found).
This way we do not need to majority vote for which bases are used to fill the gap, and we have a good criterion for stopping the search (looking up k-mers in the SBWT is expensive).
The same approach can be used to resolve SNPs, simplifying the refine / gap filling step.
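
A minimal sketch of the unique-bridging-k-mer idea, under several assumptions: a plain Python set of query k-mers stands in for the SBWT, the gap is assumed to have the same length in the reference and the query (substitutions only), and `kmers`, `fill_gap` and the `threshold` parameter (minimum flank overlap on each side) are hypothetical names, not kbo's API.

```python
def kmers(seq, k):
    """All k-mers of seq, as a set (a stand-in for an SBWT index)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fill_gap(ref, gap_start, gap_len, query_kmers, k, threshold=1):
    """Return the query bases that replace ref[gap_start:gap_start + gap_len],
    or None if no unique bridging k-mer is found."""
    gap_end = gap_start + gap_len
    # The bridging k-mer must keep at least `threshold` bases on each flank.
    max_left = k - gap_len - threshold
    for left_len in range(max_left, threshold - 1, -1):
        right_len = k - gap_len - left_len
        start, end = gap_start - left_len, gap_end + right_len
        if start < 0 or end > len(ref):
            continue  # window would run off the reference
        left = ref[start:gap_start]
        right = ref[gap_end:end]
        # Candidate k-mers agree with the reference on both flanks.
        hits = [km for km in query_kmers
                if km.startswith(left) and km.endswith(right)]
        if len(hits) == 1:
            # Unique bridge found: stop searching, read off the gap bases.
            return hits[0][left_len:left_len + gap_len]
    return None


# The example from the top of this issue, with made-up flanking bases.
ref = "GTCAGG" + "AAAACGAAA" + "CCTGAT"
query = "GTCAGG" + "AAATCTAAA" + "CCTGAT"
filled = fill_gap(ref, gap_start=9, gap_len=3,
                  query_kmers=kmers(query, 11), k=11, threshold=2)
```

Here `filled` is `"TCT"`. The loop stops at the first window with exactly one hit, which is the stopping criterion described above; a real implementation would query SBWT intervals instead of scanning a set.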

tmaklin · 2025-03-12T08:28:59Z

Gap filling implemented according to previous comment, but need to investigate one more question:

Assuming we find a k-mer that overlaps the gap to the right, can we efficiently extend the k-mer to the left until its SBWT interval becomes non-unique? This way we could fill in longer gaps assuming that a unique context exists.
Current implementation can only fill in gaps that are of length <= k - 2*threshold.
Current implementation struggles with gaps that are ~k bases long but have changes at the end and beginning.
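
The length limit can be checked with simple arithmetic, if `threshold` is read as the minimum number of bases a bridging k-mer must share with each flank (an interpretation, not something confirmed in this thread). The function names and the values `k = 31`, `threshold = 5` below are illustrative only.

```python
def max_fillable_gap(k: int, threshold: int) -> int:
    """Longest gap one bridging k-mer can cover: k minus both flank overlaps."""
    return max(0, k - 2 * threshold)


def can_fill(gap_len: int, k: int, threshold: int) -> bool:
    """True if a gap of gap_len bases fits inside a single bridging k-mer."""
    return 0 < gap_len <= max_fillable_gap(k, threshold)


limit = max_fillable_gap(31, 5)  # 31 - 2 * 5 = 21
```

Under these assumed values a 21-base gap is still fillable and a 22-base gap is not, which is why extending a k-mer beyond k bases would be needed for longer gaps.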

tmaklin added the "enhancement" (Do something existing but better) label Mar 4, 2025

tmaklin changed the title ~~Refine translation in regions with small gaps~~ Refine translation in regions with clustered substitutions Mar 4, 2025

tmaklin mentioned this issue Mar 6, 2025

Increase default k-mer size #44

Open

tmaklin mentioned this issue Mar 20, 2025

Implement gap filling #45

Merged

tmaklin closed this as completed in #45 Mar 21, 2025
