Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Stop parsing the packet payload #485

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

[RFC] Stop parsing the packet payload #485

wants to merge 8 commits into from

Conversation

atenart
Copy link
Contributor

@atenart atenart commented Jan 29, 2025

This is an attempt at not parsing the payload in Retis and only provide the raw data as part of the event. I decided to keep the existing skb event sections (IP, ARP, etc) for backward compatibility but not to generate nor use them anymore. A middle ground could be to still generate them (but not use them) or to let this be controllable by a flag. The question is then if this is actually needed or just hypothetical; as this will impact maintenance.

One major benefit of this is the instant support for stacked VLANs, tunnels, etc. We can also remove the pnet_packet dependency.

Retis used to parse the payload data itself but this has major disadvantages:

  • It's extremely challenging and time consuming to support all fields and all protocols.
  • It requires more time when collecting data (and not displaying it).
  • It goes against the initial goal: extracting information from targets. While metadata has to be manually parsed and while it's OK to sometimes post-process some fields to add value, the payload itself is already self-contained. It makes sense to provide it as a single entity.
  • This kinda reinvents the wheel.

Instead we can reuse existing tools and libraries:

  • tcpdump for displaying the packet part of the events.
  • Existing libraries such as Scapy when post-processing in Python.

There are a few drawbacks:

  • No easy JSON access to fields like IP. But this wasn't working well and is actually challenging when e.g. thinking about tunnels. Python can be used instead.
  • No easy way to access fields in Rust. Note that we don't do this ATM.

Overall IMO the benefits outweigh the drawbacks.

If this is wanted, TODO:

  • Investigate performances (excluding startup time of tcpdump).
  • Update the output of all examples in the doc, including the gif.
  • Document using scapy in Python. Provide an example[1]?
  • Fix out tests.

[1]

from scapy.layers.l2 import Ether
from scapy.all import *

for e in reader.events():
    if 'skb' in e:
        p = Ether(e['skb'].packet.packet)
        print(p.summary())
        if IP in p:
            print("src: " + p[IP].src)

@atenart atenart force-pushed the at/tcpdump branch 3 times, most recently from 67713e7 to b711d6f Compare January 29, 2025 12:56
Instead of relying on our custom implementation and requiring a lot of
work to support all the various protocols, use tcpdump instead and
capture its output.

Signed-off-by: Antoine Tenart <[email protected]>
Previous logic relied on whether or not a LL sub-section is present. But
we know always have this information in the raw packet and we're moving
away from parsing the packet ourselves. This is better as it allows to
decide if the LL information is important at post-processing time.

The one drawback is the Python `show()` helper which isn't configurable
yet in this regards (it always prints LL info).

Signed-off-by: Antoine Tenart <[email protected]>
…ad info

Retis used to parse the payload data itself but this has major
disadvantages:

- It's extremely challenging and time consuming to support all fields
  and all protocols.
- It requires more time when collecting data (and not displaying it).
- It goes against the initial goal: extracting information from targets.
  While metadata has to be manually parsed and while it's OK to
  sometimes post-process some fields to add value, the payload itself is
  already self-contained. It makes sense to provide it as a single
  entity.
- This kinda reinvents the wheel.

Instead we can reuse existing tools and libraries:

- tcpdump for display the packet part of the events.
- Existing libraries such as Scapy when post-processing in Python.

There are a few drawbacks:

- No easy JSON access to fields like IP. But this wasn't working well
  and is actually challenging when e.g. thinking about tunnels.
- No easy way to access fields in Rust. Note that we don't do this ATM.

Overall IMO the benefits outweigh the drawbacks.

Signed-off-by: Antoine Tenart <[email protected]>
The outer VLAN header can be "accelerated" in Linux, aka. part of the
skb metadata and not the payload. Only report those as part of the skb
event as for all other cases the VLAN information will be part of the
payload and displayed correctly.

Signed-off-by: Antoine Tenart <[email protected]>
This was kept to 3 to fix an incompatibility with c8s, but this should
be fixed at the packaging level with an out-of-tree patch.

Signed-off-by: Antoine Tenart <[email protected]>
@atenart
Copy link
Contributor Author

atenart commented Feb 5, 2025

Adding one additional possibility here:

If not parsing the payload to make fields part of the event is agreed on (not saying is is right now nor that it'll be the case), and if concerns are only directed towards calling an external binary to format part of the events, we could keep parsing the packet with pnet but at display time and implement our own display logic only. This could be done in addition to what this PR does, and be either enabled by a flag, be the default behavior, or used as a fallback when the external binary does not exist / respond / work.

This would allow supporting all protocols right now (in the display) while still giving us a way forward, long term, for handling this ourselves. Note that it'll be far easier to parse a packet and display it than converting it to an event; also this should help for performances (-o), etc. We can even make this as a separate lib.

@atenart atenart mentioned this pull request Feb 5, 2025
@amorenoz
Copy link
Contributor

amorenoz commented Feb 5, 2025

I generally find the idea quite interesting but the implications are very big. My biggest concern is calling an external tool, the dependency it introduces.

In the ideal world, we would use tcpdump without having to start it. Using it as a library but it seems not possible atm.
@vlrpl did you look at tshark/wireshark?

If not parsing the payload to make fields part of the event is agreed on (not saying is is right now nor that it'll be the case), and if concerns are only directed towards calling an external binary to format part of the events, we could keep parsing the packet with pnet but at display time and implement our own display logic only. This could be done in addition to what this PR does, and be either enabled by a flag, be the default behavior, or used as a fallback when the external binary does not exist / respond / work.

This would help: keep the somewhat limited display logic while removing the need for header info in the event file.
Maybe we should even disable printing by default?

I am also concerned if we loose control of how the packet is displayed, things like the sorted output can get really messy really quickly.

@atenart
Copy link
Contributor Author

atenart commented Feb 6, 2025

I generally find the idea quite interesting but the implications are very big. My biggest concern is calling an external tool, the dependency it introduces.

In the ideal world, we would use tcpdump without having to start it.

+1

If not parsing the payload to make fields part of the event is agreed on (not saying is is right now nor that it'll be the case), and if concerns are only directed towards calling an external binary to format part of the events, we could keep parsing the packet with pnet but at display time and implement our own display logic only. This could be done in addition to what this PR does, and be either enabled by a flag, be the default behavior, or used as a fallback when the external binary does not exist / respond / work.

This would help: keep the somewhat limited display logic while removing the need for header info in the event file. Maybe we should even disable printing by default?

Meaning -o would be the default option?

I am also concerned if we loose control of how the packet is displayed, things like the sorted output can get really messy really quickly.

Any example of what could go wrong specifically for the sorted output?

@vlrpl
Copy link
Contributor

vlrpl commented Feb 6, 2025

I generally find the idea quite interesting but the implications are very big. My biggest concern is calling an external tool, the dependency it introduces.

In the ideal world, we would use tcpdump without having to start it. Using it as a library but it seems not possible atm. @vlrpl did you look at tshark/wireshark?

I took a (real) quick look and there seems to be libwireshark that could be interesting, but I guess it's not conceived as a public library (API stability and so forth).
Of course, it requires further checks, if we end up finding this interesting.

If not parsing the payload to make fields part of the event is agreed on (not saying is is right now nor that it'll be the case), and if concerns are only directed towards calling an external binary to format part of the events, we could keep parsing the packet with pnet but at display time and implement our own display logic only. This could be done in addition to what this PR does, and be either enabled by a flag, be the default behavior, or used as a fallback when the external binary does not exist / respond / work.

This would help: keep the somewhat limited display logic while removing the need for header info in the event file. Maybe we should even disable printing by default?

I am also concerned if we loose control of how the packet is displayed, things like the sorted output can get really messy really quickly.

@amorenoz
Copy link
Contributor

If not parsing the payload to make fields part of the event is agreed on (not saying is is right now nor that it'll be the case), and if concerns are only directed towards calling an external binary to format part of the events, we could keep parsing the packet with pnet but at display time and implement our own display logic only. This could be done in addition to what this PR does, and be either enabled by a flag, be the default behavior, or used as a fallback when the external binary does not exist / respond / work.

This would help: keep the somewhat limited display logic while removing the need for header info in the event file. Maybe we should even disable printing by default?

Meaning -o would be the default option?

Yes

I am also concerned if we loose control of how the packet is displayed, things like the sorted output can get really messy really quickly.

Any example of what could go wrong specifically for the sorted output?

It depends on what tcpdump / tshark prints but what I mean is that we no longer control it.

I think we could consider combining this feature with a binary format more fit for storing packets than base64 inside a JSON. Maybe retis could generate a pcap file with some unique packet IDS in the comments and a matching json file with the metadata?

@atenart
Copy link
Contributor Author

atenart commented Feb 11, 2025

If not parsing the payload to make fields part of the event is agreed on (not saying is is right now nor that it'll be the case), and if concerns are only directed towards calling an external binary to format part of the events, we could keep parsing the packet with pnet but at display time and implement our own display logic only. This could be done in addition to what this PR does, and be either enabled by a flag, be the default behavior, or used as a fallback when the external binary does not exist / respond / work.

This would help: keep the somewhat limited display logic while removing the need for header info in the event file. Maybe we should even disable printing by default?

Meaning -o would be the default option?

Yes

I'm not sure about that, adding -o is quite easy and aligns with many other utilities. Running live collection is used in many cases.

I am also concerned if we loose control of how the packet is displayed, things like the sorted output can get really messy really quickly.

Any example of what could go wrong specifically for the sorted output?

It depends on what tcpdump / tshark prints but what I mean is that we no longer control it.

OK, so that wouldn't be limited to the sorted output. I guess that's a trade-off between parsing more protocols and fields and controlling 100% what gets displayed. Using the above tools at least would produce a known format for many users. We can also add safeguards, e.g. controlling there is no newline in the output if that's a concern (although that adds processing). Let's also note if we go with using an external tool optional, we could still be in controlled when configured and in the mid/long-term.

I think we could consider combining this feature with a binary format more fit for storing packets than base64 inside a JSON. Maybe retis could generate a pcap file with some unique packet IDS in the comments and a matching json file with the metadata?

I'm not sure I see the link between that PR and this proposal.

This seems more about performances and storing less data on disk, while the PR is about supporting more protocols and fields. I'm also not convinced the above would help, as we can chose the format we want to store the packet in our event. Splitting it would make things more complex and wouldn't help to use the pcap file directly (which is larger than our current format too), we would still need retis pcap. I'm not really getting what would be the benefits?

@amorenoz
Copy link
Contributor

I'm really divided by this PR.
On the one hand, I very much understand the motivation and the great weight we take from our shoulders if we let tcpdump do what it does best.

On the other hand, it just seems too much of a change:

  • Making retis directly depend on having tcpdump installed
  • Output might change depending on what tcpdump version is installed
  • More blocking I/O in the collection event loop
  • Basic header info no longer available in the event (e.g: quick python exploration now is more complicated)

Could we at least ease the transition somehow? Thinking out loud here:

  • Make it an opt-in instead of the default behavior
  • Keep super-basic packet data in the event and clearly document that this is best-effort and that for full information, tcpdump or scapy need to be used to parse the packet
  • Have a builtin python helper that parses the packet using scapy and returns a scapy object that is easy to use

@atenart
Copy link
Contributor Author

atenart commented Mar 3, 2025

I'm really divided by this PR. On the one hand, I very much understand the motivation and the great weight we take from our shoulders if we let tcpdump do what it does best.

On the other hand, it just seems too much of a change:

  • Making retis directly depend on having tcpdump installed
  • Output might change depending on what tcpdump version is installed
  • More blocking I/O in the collection event loop
  • Basic header info no longer available in the event (e.g: quick python exploration now is more complicated)

Also note I chose an opinionated solution in this PR to start discussing the matter, but of course I'm expecting this to be more nuanced (if applied at all).

Could we at least ease the transition somehow? Thinking out loud here:

  • Make it an opt-in instead of the default behavior

Yes, I think that would make perfect sense. Default parser would still be implemented in retis-event (or in a new crate), while it could be overridden if the user chooses to.

We can also not use third party tools to display the data at all if we want, but IMO that only works in the short term if we're not parsing the raw packet to issue event sections.

  • Keep super-basic packet data in the event and clearly document that this is best-effort and that for full information, tcpdump or scapy need to be used to parse the packet

I don't think providing un-trusted partial data is helpful as it can't really be used. It also can be a chicken and egg issue: you need to know what should be the right data to use it, which conflicts with the purpose of the tool. Also note this would require to parse the raw packet twice (once to provide best effort sections, once to print it in a reliable way).

I'm not necessarily saying we should not provide parsed packet sections (although it's my opinion), but if we decide to keep sections we should make then reliable and complete so they can be used.

  • Have a builtin python helper that parses the packet using scapy and returns a scapy object that is easy to use

Making consumption easier is always nice. I would just avoid to make scapy a hard dependence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants