Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Rate limit trace routes to every 3 min #6173

Open
RCGV1 opened this issue Feb 27, 2025 · 23 comments
Open

[Feature Request]: Rate limit trace routes to every 3 min #6173

RCGV1 opened this issue Feb 27, 2025 · 23 comments
Assignees
Labels
enhancement New feature or request

Comments

@RCGV1
Copy link
Member

RCGV1 commented Feb 27, 2025

Platform

Cross-Platform

Description

Currently, our mesh has only 100 nodes but channel util is 40%+ because trace routes are being sent very often by the Mesh Sense software. Because of how trace routes force every node in its path to send a response this creates a huge amount of traffic, just like position requests were recently rate limited to 3 min the same should be done for trace routes.

If a node receives a trace route it should refuse to respond to another trace route request within the next 3 min. (Even if it is not the target).

@RCGV1 RCGV1 added the enhancement New feature or request label Feb 27, 2025
@thebentern thebentern added the high-priority Issues that affect core functionality or are "show stoppers" label Feb 27, 2025
@RCGV1
Copy link
Member Author

RCGV1 commented Feb 27, 2025

3 min may be too much, maybe 1 min.

@RCGV1
Copy link
Member Author

RCGV1 commented Feb 27, 2025

Since this will prevent the entire mesh from sending TRs

@rcarteraz
Copy link
Member

No, 3 mins is a good idea. There's no reason you should be sending that many trace routes that often. A person's desire to see all the connections doesn't trump the stability of the mesh. I agree, the Mesh Sense software is making a lot of networks too unstable because a huge chunk of the traffic is just trace routes.

@GUVWAF
Copy link
Member

GUVWAF commented Feb 27, 2025

I'm sorry but I would not be in favor of this. I regularly use traceroutes within three minutes to check connections to different nodes, or even to the same node when walking around. We already have a 30-second rate limit and I believe that suffices.

In my opinion one piece of software doing automatic traceroutes should not prohibit sending traceroutes manually within three minutes. First, let's try to convince them to stop doing that. But ultimately you're using a public channel where anyone can do anything they like; we can enforce suitable defaults but let's also keep the fun of the hobby and not solely focus on the super large meshes. If we'd add this, the next thing people will do is, for example, manually crafting NeighborInfo packets and broadcasting them.

With 2.6, traceroutes will also become more efficient.

@RCGV1
Copy link
Member Author

RCGV1 commented Feb 27, 2025

There's also the issue where a bad actor could use TRs with modified firmware to bring a mesh to its knees.

@GUVWAF
Copy link
Member

GUVWAF commented Feb 27, 2025

A bad actor can easily bring a mesh to its knees; there's nothing we can do about that unfortunately.

If a node receives a trace route it should refuse to respond to another trace route request within the next 3 min. (Even if it is not the target).

I'm definitely not in favor of doing it in this way as you'll never be able to know whether the traceroute fails or whether it was just because the recipient (or a node in-between) was rate-limited.

@GUVWAF GUVWAF removed the high-priority Issues that affect core functionality or are "show stoppers" label Feb 27, 2025
@rcarteraz
Copy link
Member

I'm sorry but I would not be in favor of this. I regularly use traceroutes within three minutes to check connections to different nodes, or even to the same node when walking around. We already have a 30-second rate limit and I believe that suffices.

In my opinion one piece of software doing automatic traceroutes should not prohibit sending traceroutes manually within three minutes. First, let's try to convince them to stop doing that. But ultimately you're using a public channel where anyone can do anything they like; we can enforce suitable defaults but let's also keep the fun of the hobby and not solely focus on the super large meshes. If we'd add this, the next thing people will do is, for example, manually crafting NeighborInfo packets and broadcasting them.

With 2.6, traceroutes will also become more efficient.

Sure, I definitely get where you're coming from but I'm not sure I would classify 100 devices as a super large mesh anymore. I do believe that we've learned that education and pleases don't work as well as we'd like. I think eventually, likely pretty soon, we're going to have to implement a restriction like this on the default channel. Especially if Mesh Sense continues gaining popularity. We recently saw an example where someone was sending automatic traceroutes continuously and channel utilization was 45-55%.

@GUVWAF
Copy link
Member

GUVWAF commented Feb 27, 2025

Then let's first try to convince Mesh Sense to increase the interval of their automatic requests: Affirmatech/MeshSense#64

@RCGV1
Copy link
Member Author

RCGV1 commented Feb 27, 2025

I was also about to make an issue, you beat me to it lol

@RCGV1
Copy link
Member Author

RCGV1 commented Feb 27, 2025

Added a comment, I hope this gets resolved, or else I would recommend everyone to stop using it.

@thebentern
Copy link
Contributor

thebentern commented Feb 27, 2025

@GUVWAF I have an idea. Perhaps instead of setting the static limits to traceroutes, we could scale the rate limit based on the number of online nodes similar to how we treat interval broadcasts. This obviously disproportionately affects larger meshes like @RCGV1's and the instance we documented in the Oregon mesh as well.

@GUVWAF
Copy link
Member

GUVWAF commented Feb 27, 2025

Perhaps instead of setting the static limits to traceroutes, we could scale the rate limit based on the number of online nodes similar to how we treat interval broadcasts.

That might be a reasonable approach, if the client notification then also says "You can request after x seconds again". Still I would apply the rate-limit only on the request, not on the reply or on intermediate hops.

@rcarteraz
Copy link
Member

That sounds like a good approach and I agree @GUVWAF it should be on the request.

@GUVWAF
Copy link
Member

GUVWAF commented Feb 27, 2025

I would then also limit it to if you have the default channel and frequency slot. I believe it should not affect private isolated meshes.

@thebentern
Copy link
Contributor

That might be a reasonable approach, if the client notification then also says "You can request after x seconds again". Still I would apply the rate-limit only on the request, not on the reply or on intermediate hops.

It may be useful to also have a slightly coarser grained limit on the reply side in general.

@rcarteraz
Copy link
Member

I would then also limit it to if you have the default channel and frequency slot. I believe it should not affect private isolated meshes.

Agreed, I was thinking specifically the default.

@nasiralamreeki
Copy link

I'd support rate limiting on all features that can congest the mesh. We need some serious QOL improvements like this.

@Woutvstk
Copy link
Contributor

Would it be possible to make a bad-actor-suppressor feature? It kicks in when channel utilisation is above X%. When too many packets (of any type) are received from a specific node, the feature will send a "reject" message to the bad-acting node. This way, the bad actor has to suppress its messages and/or receives a "traceroute might have failed because it was rejected" warning in the app. A 0-hop neighbour of the spamming node, could then stop relaying its messages to protect the rest of the mesh, whilst warning the bad node.

@dandrzejewski
Copy link

Would it be possible to make a bad-actor-suppressor feature? It kicks in when channel utilisation is above X%.

This seems like a good idea to me (although it certainly adds some complexity to the firmware). The firmware already has some logic for "backing off" on things, maybe this could be leveraged for this as well?

And perhaps make it specific to the type of message that is causing the spam, e.g., if it's a traceroute, don't prevent telemetry, DMs or locations from going through.

Just thinking out loud, definitely adds potentially unwanted complexity.

I also agree that the MeshSense folks should adjust their defaults.

@nullrouten0
Copy link

Here are examples of "traceroute storms". What seems to occur is many MeshSense instances tracing at each other as a result of being traced to... hop count constantly changing, and this amplifies over our very well connected meshes... ~140 nodes in MediumSlow, and 300-400 in LongFast. The traceroutes per hour fluctuate wildly from 6... to 500... back to 10... and so on. Since a traceroute is flooded, nodes get pulled into this fight when they see the inquiry.

sqlite> SELECT strftime('%Y-%m-%d %H:00', import_time) AS hour, COUNT(*) AS packet_count
...> FROM traceroute
...> GROUP BY hour
...> ORDER BY hour;
+------------------+--------------+
| hour | packet_count |
+------------------+--------------+
| 2025-02-24 06:00 | 384 |
| 2025-02-24 07:00 | 596 |
| 2025-02-24 08:00 | 183 |
| 2025-02-24 09:00 | 16 |
| 2025-02-24 10:00 | 12 |
| 2025-02-24 11:00 | 46 |
| 2025-02-24 12:00 | 17 |
| 2025-02-24 13:00 | 124 |
| 2025-02-24 14:00 | 107 |
| 2025-02-24 15:00 | 86 |
| 2025-02-24 16:00 | 136 |
| 2025-02-24 17:00 | 144 |
| 2025-02-24 18:00 | 61 |
| 2025-02-24 19:00 | 374 |
| 2025-02-24 20:00 | 152 |
| 2025-02-24 21:00 | 311 |
| 2025-02-24 22:00 | 242 |
| 2025-02-24 23:00 | 428 |
| 2025-02-25 00:00 | 126 |
| 2025-02-25 01:00 | 193 |
| 2025-02-25 02:00 | 108 |
| 2025-02-25 03:00 | 59 |
| 2025-02-25 04:00 | 6 |
| 2025-02-25 05:00 | 70 |
| 2025-02-25 06:00 | 37 |
| 2025-02-25 07:00 | 54 |
| 2025-02-25 08:00 | 60 |
| 2025-02-25 09:00 | 12 |
| 2025-02-25 10:00 | 49 |
| 2025-02-25 11:00 | 18 |
| 2025-02-25 12:00 | 41 |
| 2025-02-25 13:00 | 49 |
| 2025-02-25 14:00 | 44 |
| 2025-02-25 15:00 | 111 |
| 2025-02-25 16:00 | 92 |
| 2025-02-25 17:00 | 62 |
| 2025-02-25 18:00 | 24 |
| 2025-02-25 19:00 | 5 |
| 2025-02-25 20:00 | 6 |
| 2025-02-25 21:00 | 15 |
| 2025-02-25 22:00 | 32 |
| 2025-02-25 23:00 | 23 |
| 2025-02-26 00:00 | 55 |
| 2025-02-26 01:00 | 59 |
| 2025-02-26 02:00 | 325 |
| 2025-02-26 03:00 | 198 |
| 2025-02-26 04:00 | 18 |
| 2025-02-26 05:00 | 32 |
| 2025-02-26 06:00 | 54 |
| 2025-02-26 07:00 | 139 |
| 2025-02-26 08:00 | 60 |
| 2025-02-26 09:00 | 7 |
| 2025-02-26 10:00 | 13 |
| 2025-02-26 11:00 | 7 |
| 2025-02-26 12:00 | 7 |
| 2025-02-26 13:00 | 7 |
| 2025-02-26 14:00 | 51 |
| 2025-02-26 15:00 | 72 |
| 2025-02-26 16:00 | 73 |
| 2025-02-26 17:00 | 167 |
| 2025-02-26 18:00 | 71 |
| 2025-02-26 19:00 | 65 |
| 2025-02-26 20:00 | 232 |
| 2025-02-26 21:00 | 88 |
| 2025-02-26 22:00 | 138 |
| 2025-02-26 23:00 | 138 |
| 2025-02-27 00:00 | 392 |
| 2025-02-27 01:00 | 333 |
| 2025-02-27 02:00 | 107 |
| 2025-02-27 03:00 | 274 |
| 2025-02-27 04:00 | 150 |
| 2025-02-27 05:00 | 220 |
| 2025-02-27 06:00 | 308 |
| 2025-02-27 07:00 | 319 |
| 2025-02-27 08:00 | 64 |
| 2025-02-27 09:00 | 65 |
| 2025-02-27 10:00 | 71 |
| 2025-02-27 11:00 | 90 |
| 2025-02-27 12:00 | 110 |
| 2025-02-27 13:00 | 189 |
| 2025-02-27 14:00 | 57 |
| 2025-02-27 15:00 | 53 |
| 2025-02-27 16:00 | 77 |
| 2025-02-27 17:00 | 42 |
| 2025-02-27 18:00 | 376 |
| 2025-02-27 19:00 | 154 |
| 2025-02-27 20:00 | 106 |
| 2025-02-27 21:00 | 104 |
| 2025-02-27 22:00 | 149 |
| 2025-02-27 23:00 | 168 |
| 2025-02-28 00:00 | 215 |
| 2025-02-28 01:00 | 86 |
| 2025-02-28 02:00 | 360 |
| 2025-02-28 03:00 | 217 |
| 2025-02-28 04:00 | 198 |
| 2025-02-28 05:00 | 43 |
| 2025-02-28 06:00 | 278 |
| 2025-02-28 07:00 | 576 |
| 2025-02-28 08:00 | 3 |
| 2025-02-28 11:00 | 72 |
| 2025-02-28 12:00 | 47 |
| 2025-02-28 13:00 | 24 |
| 2025-02-28 14:00 | 115 |
| 2025-02-28 15:00 | 157 |
| 2025-02-28 16:00 | 25 |
+------------------+--------------+

Image

No humans would be running 576 traces per hour and then 3... the next. Thats a storm finally resolving by hitting everyone's mesh sense limit and finally deflating.

| 2025-02-28 07:00 | 576 |
| 2025-02-28 08:00 | 3 |
| 2025-02-28 11:00 | 72 |
| 2025-02-28 12:00 | 47 |

@garthvh
Copy link
Member

garthvh commented Feb 28, 2025

Would be good to open some issues on the meshsense repo especially if good logging is available like the previous comment, they have worked pretty quickly to implement the issues I have opened https://github.com/Affirmatech/MeshSense/issues

@dandrzejewski
Copy link

Would be good to open some issues on the meshsense repo especially if good logging is available like the previous comment, they have worked pretty quickly to implement the issues I have opened https://github.com/Affirmatech/MeshSense/issues

Affirmatech/MeshSense#64

@GUVWAF
Copy link
Member

GUVWAF commented Feb 28, 2025

MeshSense has now increased their defaults and enforced a minimum. I think this will already help a lot, since e.g. five manual traceroute requests within three minutes is much less harmful than continuously sending a traceroute every three minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

9 participants