Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pekko: fix spans on server timeout #13435

Merged
merged 8 commits into from
Mar 10, 2025

Conversation

samwright
Copy link
Contributor

When pekko's HTTP server is configured with a request-timeout (default 20s) the span is never ended, and its response status code is never set. This fixes that, by:

  • Moving the span start/end logic from PekkoFlowWrapper to PekkoHttpServerTracer, which is injected higher up the stack so it sees the HttpResponse after the timeout logic has run
  • PekkoHttpServerTracer passes the otel context to PekkoFlowWrapper via the HttpRequest's attributes

@@ -69,7 +63,7 @@ public GraphStageLogic createLogic(Attributes attributes) {
}

private class TracingLogic extends GraphStageLogic {
private final Deque<TracingRequest> requests = new ArrayDeque<>();
private final Queue<PekkoTracingRequest> requests = new ArrayDeque<>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both here and in PekkoHttpServerTracer I'm using a queue (LIFO) because that's what the approach taken in Pekko by the request-timeout stage here.

The previous approach used the Deque as FIFO. Changing it here for consistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous usage was a bug since pipelined requests should complete in the oder they are received. Currently with our tests it seems that pekko does not have multiple requests in this queue, probably some extra configuration is needed. If we could figure out how to actually enable pipelining we could consider improving the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've enabled the http-pipelining test for the async tests, and managed to get it to work (see the comment on how). Nevertheless switching both deques from LIFO to FIFO has no effect - the tests pass in both cases.

Copy link
Contributor

@laurit laurit Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this. If it doesn't break then I'd guess that perhaps there isn't more than 1 request in the queue at the same time. Maybe the requests are processed too fast, idk. I think if there is interest we can tackle this in a follow up PR.
Looking at https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/13435/files#diff-dc7412c25079c85f7462b02768d4445cda72f0934517132253e0b9140f7d69dcR147-R153 this really isn't a good idea. The scope from makeCurrent must be closed. Leaving the scope open is almost never the best solution as not closing a scope will prevent closing all other scopes that were open for that thread, it could lead to a memory leak or the thread ending in a state where all subsequent spans are added to one trace. If this is just for pipelining I'd suggest you to revert this part for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I can revert it. This PR didn't break it so yeah a follow up should look into it.

A thought - how bad would it be to use the Context.root() as the parent context when creating the server span? i.e. change this to:

Context parentContext = Context.root();

The tests pass if I do that and remove the tracingRequest.parentContext.makeCurrent() bit, since the previous request's context is ignored when handling the next request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually it is better to reset the scope where the unwanted context propagation happens. We usually fail the tests when there are leaked scopes but this is disabled in

jvmArgs("-Dio.opentelemetry.javaagent.shaded.io.opentelemetry.context.enableStrictContext=false")
I think this could be removed as it is carried over from akka instrumentation where we used to leak the scopes, but since we rewrote the instrumentation we don't rely on leaking scopes any more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I've removed my dodgy workarounds - we can consider better solutions for http-pipelining in another issue.

I've also removed that system property and the tests still pass 👍

@@ -69,7 +63,7 @@ public GraphStageLogic createLogic(Attributes attributes) {
}

private class TracingLogic extends GraphStageLogic {
private final Deque<TracingRequest> requests = new ArrayDeque<>();
private final Queue<PekkoTracingRequest> requests = new ArrayDeque<>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've enabled the http-pipelining test for the async tests, and managed to get it to work (see the comment on how). Nevertheless switching both deques from LIFO to FIFO has no effect - the tests pass in both cases.

@@ -62,7 +64,7 @@ object PekkoHttpTestAsyncWebServer {
if (null == binding) {
import scala.concurrent.duration._
binding = Await.result(
Http().bindAndHandleAsync(asyncHandler, "localhost", port),
Http().bindAndHandleAsync(asyncHandler, "localhost", port, parallelism = 2),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change wasn't necessary, but I wanted to make sure the tests pass with parallelism enabled. I can revert though if preferred.

@samwright samwright force-pushed the pekko-timeout-issue branch from 382a324 to e97a1df Compare March 6, 2025 18:24
@laurit laurit added this to the v2.14.0 milestone Mar 7, 2025
@trask trask merged commit 11b59d0 into open-telemetry:main Mar 10, 2025
85 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants