We are seeing a race in the xDS client when an application repeatedly creates and shuts down multiple gRPC channels at the same time. One way to trigger it is for the application to open multiple gRPC channels to the same target at around the same time, make one RPC on each channel, shut each channel down, and repeat this process over and over.
The sequence of events that triggers the race is as follows:
Channel 1 is created to target foo. It goes through the whole xDS flow successfully and is being shut down.
At around the same time, Channel 2 is created to the same target foo.
As part of its shutdown, Channel 1 starts unsubscribing from the xDS resources.
The xDS client function that unsubscribes schedules a callback on the serializer. The function returns early, but the actual work of sending the ADS request to effect the unsubscription happens asynchronously.
Channel 2 starts out and gets a reference to the existing xDS client (when it attempts to create a new one). It registers a watch for the same LDS resource as channel 1.
The watch function in the xDS client sees that there is no such LDS resource in the cache (because it was deleted as part of the unsubscription by channel 1). So, it updates the cache and creates a placeholder for this resource.
The actual ADS request for the subscription will be sent out asynchronously through the ADS stream implementation.
By the time the ADS stream implementation gets around to writing the LDS request on the wire (both for the unsubscription by channel 1 and the subscription by channel 2), it sees that there is a single LDS resource to be requested, and it also sees a nonce and a version for that resource type. These are the same nonce and version that were received when the same resource was requested as part of channel 1. So, such a request is sent out, but the management server does not send a response back, because it already responded to a request with the same resource name, version, and nonce on the same ADS stream.
At this point, channel 2 is stuck until the LDS resource-not-found timer fires after 15s. From that point on, RPCs on the channel start to fail.
The fundamental reason behind this race is that the resource is deleted from the xDS client cache before the unsubscription is sent out on the ADS stream. It is also important to note that some resource state is maintained in the ADS stream implementation while some is maintained in the xDS client, and the two can go out of sync.
Code references (grpc-go at commit aa629e0):
Unsubscription path (callback scheduled on the serializer, ADS request sent asynchronously): grpc-go/xds/internal/xdsclient/authority.go, line 680; grpc-go/xds/internal/xdsclient/transport/ads/ads_stream.go, line 217; grpc-go/xds/internal/xdsclient/authority.go, line 715.
Watch path (no cached LDS resource, placeholder created): grpc-go/xds/internal/xdsclient/authority.go, line 625.