
Fault tolerant proposer election and sign subset creation #228

Open
ChaoticTempest opened this issue Feb 14, 2025 · 2 comments
@ChaoticTempest
Contributor

Currently, in SignQueue, we first create a subset of up to the threshold number of participants from all stable participants, and then elect a proposer from this subset.

This is not fault tolerant: if one of those nodes hops offline, that particular signature will not be completed unless the node hops back online. This is because when we retry signature generation (for whatever reason, e.g. a timeout), the same subset of participants is used.

Additionally, we use stable to create this subset/proposer, which is not ideal because all nodes would have to agree on the stable set of participants. We should probably use all participants here to be as deterministic as possible.

We should come up with a reasonable way to elect a new proposer and subset in such cases. We have access to entropy shared by all nodes, and we can potentially combine it with retry_count to form the seed for electing the proposer/subset.
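The seeded election could look roughly like the sketch below. Everything here is hypothetical (the function and parameter names are not the actual SignQueue API), and a real implementation would need a hash that is stable across builds and versions, e.g. SHA-256, rather than the std hasher:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch: deterministically derive a signer subset and a
/// proposer from shared entropy and the retry counter. Every node that
/// runs this with the same inputs gets the same result, so bumping
/// retry_count re-elects a (likely different) subset without coordination.
fn elect(
    participants: &[u32], // sorted, agreed-upon participant ids
    threshold: usize,
    entropy: [u8; 32], // entropy shared by all nodes
    retry_count: u64,
) -> (Vec<u32>, u32) {
    // Rank each participant by a hash of (entropy, retry_count, id).
    // NOTE: DefaultHasher is not guaranteed stable across Rust versions;
    // a production version should use a cryptographic hash instead.
    let mut ranked: Vec<u32> = participants.to_vec();
    ranked.sort_by_key(|id| {
        let mut h = DefaultHasher::new();
        entropy.hash(&mut h);
        retry_count.hash(&mut h);
        id.hash(&mut h);
        h.finish()
    });
    let subset: Vec<u32> = ranked[..threshold].to_vec();
    let proposer = subset[0]; // e.g. the lowest-ranked member of the subset
    (subset, proposer)
}
```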

@jakmeier let me know your thoughts on this

@jakmeier
Contributor

First of all, what kind of fault tolerance are we aiming for? Just solve the problem of nodes being offline? Or full fault tolerance?

If you just want fault tolerance against the simple case of nodes being offline sometimes, yeah, we can add a patchwork solution to fix it. Your suggestion to increase retry_count and change to a different set of viable participants sounds good to me. I guess we can do it after a timeout and rely on a stable list to optimize the choice beyond a purely random one.

But this would still not be fault tolerant against other problems.

  • Every single participant has what I would describe as "veto power". A single node trying to boycott or censor a sign request will succeed in doing so. It can work normally otherwise and just not participate properly in the protocol, making it hard to spot.
  • Nodes that are faulty in any way other than being completely offline will not be detected reliably. As long as their link to the proposer is up, they might be selected in the set. But they might have problems reaching other nodes, they might use a wrong version, or maybe they are just super slow in processing requests. All of these could lead to a 100% failure rate with that participant, and the current stable property is not helping with these issues.

To resolve such problems for good, byzantine fault tolerance would be the gold standard. There are different ways to achieve it. I would suggest we select redundant participants and ensure the protocol can finish with only a subset of them responding. This would make things more robust (fault tolerant) and possibly improve performance, since we don't have to wait for the slowest participant.

But for byzantine fault tolerance to work as I intend it, we would need to check if the underlying cryptographic protocol is suitable for such a setup. If not, maybe we can brute-force our way to redundancy in other ways, such as starting multiple protocol invocations in parallel with different subsets. Obviously, that brute-force approach would come at the cost of signature processing bandwidth.
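The over-provisioning idea could be sketched like this (invented names; whether the underlying signing protocol actually tolerates absent invitees is exactly the open question raised above):

```rust
/// Hypothetical sketch of over-provisioned participant selection:
/// invite `threshold + redundancy` nodes from a ranked list, but let
/// the protocol round finish as soon as any `threshold` of them respond.
fn invite(ranked: &[u32], threshold: usize, redundancy: usize) -> Vec<u32> {
    ranked.iter().copied().take(threshold + redundancy).collect()
}

/// The round can proceed once enough invitees have responded,
/// without waiting for the slowest participant.
fn can_proceed(responded: usize, threshold: usize) -> bool {
    responded >= threshold
}
```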

Additionally, we use stable to create this subset/proposer, which is not ideal because all nodes would have to agree on the stable set of participants. We should probably use all participants here to be as deterministic as possible.

Using all participants would lead to a higher failure rate, wouldn't it? I think only including participants we know to be responsive at the moment makes sense and we should keep it. Unless I'm misunderstanding or missing something.

I suggest solving this by letting the proposer decide on the list of participants and letting everyone know about it in the initial message. That's a very simple concept that makes it obviously true that everyone uses the same list of participants at all times.
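A minimal sketch of that initial message, with invented names and field types (the real message format would live in the node's protocol layer):

```rust
/// Hypothetical initial message: the proposer fixes the participant
/// list and echoes it to everyone, so no node has to derive it from
/// its own (possibly divergent) view of which peers are stable.
struct SignInit {
    sign_id: u64,
    retry_count: u64,
    participants: Vec<u32>, // chosen by the proposer
}

/// A receiver only joins if it is actually in the announced list and
/// the list is large enough for the signing threshold.
fn accept(msg: &SignInit, me: u32, threshold: usize) -> bool {
    msg.participants.contains(&me) && msg.participants.len() >= threshold
}
```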

@ChaoticTempest
Contributor Author

If you just want fault tolerance against the simple case of nodes being offline sometimes, yeah, we can add a patchwork solution to fix it. Your suggestion to increase retry_count and change to a different set of viable participants sounds good to me. I guess we can do it after a timeout and rely on a stable list to optimize the choice beyond a purely random one.

Pretty much this for now for our networking side of things.

To resolve such problems for good, byzantine fault tolerance would be the gold standard. There are different ways to achieve it. I would suggest we select redundant participants and ensure the protocol can finish with only a subset of them responding. This would make things more robust (fault tolerant) and possibly improve performance, since we don't have to wait for the slowest participant.

But for byzantine fault tolerance to work as I intend it, we would need to check if the underlying cryptographic protocol is suitable for such a setup. If not, maybe we can brute-force our way to redundancy in other ways, such as starting multiple protocol invocations in parallel with different subsets. Obviously, that brute-force approach would come at the cost of signature processing bandwidth.

I agree, will take a look to see how suitable BFT is for our system

Using all participants would lead to a higher failure rate, wouldn't it? I think only including participants we know to be responsive at the moment makes sense and we should keep it. Unless I'm misunderstanding or missing something.

Yeah, probably sticking to stable for now is best, because in the case that two nodes choose different subsets for stable, we're going to fail and time out either way.
