Refine translation in regions with clustered substitutions #43

Closed
tmaklin opened this issue Mar 4, 2025 · 4 comments · Fixed by #45
Labels
enhancement Do something existing but better

Comments

@tmaklin
Owner

tmaklin commented Mar 4, 2025

The current algorithm outputs only gaps when multiple changes are clustered close to each other, for example:

reference: <matching bases> A A A A C G A A A <matching bases>
query:     <matching bases> A A A T C T A A A <matching bases>
result:    <matching bases> A A A - - - A A A <matching bases>

In this case, it should be possible to produce the actual result (the query contains TCT instead of ACG) by traversing the SBWT, similarly to how 'X's are resolved.
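
To make the failure mode concrete, here is a toy model in Python (not kbo's actual code, which works on k-mer matches in the SBWT). It assumes a base is only reported when it lies inside a run of matching bases at least `min_run` long; `mask_short_runs` and `min_run` are hypothetical names made up for this sketch. The single matching `C` between the two substitutions is too short to be reported, so the whole cluster turns into gaps:

```python
def mask_short_runs(reference: str, query: str, min_run: int) -> str:
    """Toy model: keep query bases only inside matching runs of >= min_run bases."""
    assert len(reference) == len(query)
    out = ["-"] * len(query)
    i = 0
    while i < len(query):
        if reference[i] != query[i]:
            i += 1
            continue
        # Find the end of the current run of matching bases.
        j = i
        while j < len(query) and reference[j] == query[j]:
            j += 1
        if j - i >= min_run:
            out[i:j] = query[i:j]
        i = j
    return "".join(out)


# Flanks are arbitrary matching bases; the middle is the example from above.
reference = "GATTACA" + "AAAACGAAA" + "GATTACA"
query = "GATTACA" + "AAATCTAAA" + "GATTACA"
masked = mask_short_runs(reference, query, min_run=3)
print(masked)  # GATTACAAAA---AAAGATTACA
```

The goal of this issue is to recover `TCT` in place of the `---`.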

@tmaklin tmaklin added the enhancement Do something existing but better label Mar 4, 2025
@tmaklin tmaklin changed the title Refine translation in regions with small gaps Refine translation in regions with clustered substitutions Mar 4, 2025
@tmaklin
Owner Author

tmaklin commented Mar 4, 2025

Added a test case in https://github.com/tmaklin/kbo/tree/resolve-clustered-substitutions

This seems doable using the same logic as for 'X's, but care needs to be taken that k is large enough to overlap the region on both sides.

@tmaklin
Owner Author

tmaklin commented Mar 5, 2025

Basic implementation finished in eaf2c1c

TODOs

  • What to do with dummy k-mers?
  • The current implementation uses a number of k-mers equal to the gap length to determine the gap sequence. Should it search in a wider range around the region?
  • There are a few heuristics that restrict when gap filling is performed; need to check whether they make sense.

@tmaklin
Owner Author

tmaklin commented Mar 7, 2025

Some improvements based on chatting with Jarno:

  • We could search k-mers that overlap the gap on both sides until we find a unique k-mer that matches the bases around the gap.
  • Since the k-mer is unique, it has to be the one that bridges the gap in the original sequence (assuming the overlapping parts from the k-mers left and right of the gap match the unique k-mer that was found).
  • This way we do not need to majority vote on which bases are used to fill the gap, and we have a good criterion for stopping the search (looking up k-mers in the SBWT is expensive).
  • The same approach can be used to resolve SNPs, simplifying the refine / gap filling step.
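
A rough Python sketch of the idea in the list above, under some loud assumptions: a plain set of k-mers stands in for the SBWT, the gap length is known from the translation, and `kmers`, `fill_gap`, and `min_overlap` are hypothetical names that do not come from kbo. The window slides across the gap and stops at the first offset where exactly one indexed k-mer agrees with the flanks on both sides:

```python
from typing import Optional


def kmers(seq: str, k: int) -> set[str]:
    """All k-mers of seq; a stand-in for the SBWT index of the query."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fill_gap(index: set[str], left: str, right: str, gap_len: int,
             k: int, min_overlap: int = 1) -> Optional[str]:
    """Return the gap sequence if a unique bridging k-mer is found, else None."""
    flank_total = k - gap_len  # bases of the k-mer that lie outside the gap
    for left_len in range(min_overlap, flank_total - min_overlap + 1):
        right_len = flank_total - left_len
        if left_len > len(left) or right_len > len(right):
            continue
        prefix = left[len(left) - left_len:]  # last bases before the gap
        suffix = right[:right_len]            # first bases after the gap
        hits = [km for km in index
                if km.startswith(prefix) and km.endswith(suffix)]
        if len(hits) == 1:
            # Unique bridging k-mer: its middle part is the gap sequence.
            return hits[0][left_len:left_len + gap_len]
    return None


query = "GATTACA" + "AAATCTAAA" + "GATTACA"
index = kmers(query, 9)
# Flanks as they appear in the translation on either side of the "---".
filled = fill_gap(index, left="GATTACAAAA", right="AAAGATTACA", gap_len=3, k=9)
print(filled)  # TCT
```

Scanning the whole set is only for illustration; with the SBWT the candidates would be found by traversal, which is why stopping at the first unique k-mer matters.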

@tmaklin
Owner Author

tmaklin commented Mar 12, 2025

Gap filling implemented according to previous comment, but need to investigate one more question:

  • Assuming we find a k-mer that overlaps the gap to the right, can we efficiently extend the k-mer to the left until its SBWT interval becomes non-unique? This way we could fill in longer gaps, assuming that a unique context exists.
  • Current implementation can only fill in gaps that are of length <= k - 2*threshold.
  • Current implementation struggles with gaps that are ~k bases long but have changes at the end and beginning.
