JWS anti-replay nonce error #1
Hello,
I regularly have an error when generating certificates:
The error usually disappears when relaunching the script.
Thanks for your help.

Comments
Hello, Are you behind a web proxy? The RFC says that the server should reply with the "Cache-Control: no-store" HTTP header field (as Let's Encrypt's prod and staging servers do), but some proxies may be broken. You may try to comment out the line in question.
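A quick way to check what actually reaches the client is to request the API's headers directly and see whether a Cache-Control header survives the path unmodified. This is only a sketch; it targets Let's Encrypt's current staging new-nonce endpoint, which may not be the exact URL the script used at the time:

```sh
# Show the caching- and nonce-related response headers as the client sees them.
# If a proxy in the path rewrites or caches these, it can explain stale nonces.
curl -sSI https://acme-staging-v02.api.letsencrypt.org/acme/new-nonce \
  | grep -i -E '^(cache-control|replay-nonce)'
```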
No, I'm not behind a web proxy.
Is there a long elapsed time between the different actions? There may be a timing issue on the server regarding how long a nonce stays valid. If it is not always the first domain that gets the error, then please try commenting out the same line in the send_req() function as well.
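For context, the idea behind commenting out that line is to stop reusing a nonce saved from an earlier response and instead take a fresh Replay-Nonce immediately before each signed request. A minimal sketch of that pattern, assuming a curl-based client; the variable and function names below are illustrative, not the script's actual code:

```sh
# Illustrative only: fetch a fresh Replay-Nonce right before building each JWS,
# so a nonce never sits around long enough to expire on the server side.
ACME_NONCE_URL="https://acme-staging-v02.api.letsencrypt.org/acme/new-nonce"  # placeholder

get_fresh_nonce() {
  curl -sSI "$ACME_NONCE_URL" \
    | tr -d '\r' \
    | awk 'tolower($1) == "replay-nonce:" { print $2 }'
}

NONCE="$(get_fresh_nonce)"   # put $NONCE into the "nonce" field of the JWS protected header
```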
A maximum of 1 or 2 seconds can pass between two requests.
Is only one job running at a time, renewing the certificates sequentially, or are the jobs running in parallel? Do all jobs use the same account key?
Only one job at a time, and all use the same account key. Domain names are not all configured on the same IP, but all IPs belong to the same server.
Today, with the two lines commented out, I got an error again. There were 2 certificates to renew, and both failed:
I ran the script again; the first one failed again, the second one worked:
The third time, the error was not exactly the same:
The fourth:
The fifth time, it worked!
Was that already with the modification I requested (the two lines commented out in the script)? What OS and version are you on, and is your curl the standard one shipped with the OS?
Yes, the two lines are commented out in the script. The script is running on Ubuntu 16.04 with the standard curl.
I created a new debug version for you in the badNonce-debug branch. Kindly use this one, and in case of failure please send me the output.
I set up the debug version, but for a week there were no errors. I'll let you know as soon as it happens again.
There was an error this morning, after a successful request:
I opened a topic on the Let's Encrypt community forum concerning your problem: https://community.letsencrypt.org/t/regular-badnonce-errors/57332 May I ask you to let me know the answer to the Boulder engineer's question? The Boulder engineer also asked me to add more debugging output to the code, which I am ready to do, but that would expose your domain names in question publicly. Normally that is not a problem, since the domain names of issued certificates are all made public in Certificate Transparency logs (e.g. https://crt.sh/?q=example.com).
I use only one IP to access the API, but the domains are set up on various IPs. The domain names belong to our customers, so it would be easier for me if I could send the debug logs only to you and the Boulder engineer. Is that possible?
@Kaaal 👋 I'm the Boulder engineer in question :-) You can email the unsanitized logs to
@Kaaal, I put more output into the debug version; please update your instance: https://github.com/bruncsak/letsencrypt.sh/tree/badNonce-debug If you have a badNonce failure again, there is no need to post the output here; please send it to the Boulder engineer's e-mail address.
@Kaaal Something else just came to mind. On the server where you run the script to get the certificates, do you have a dual IP stack running, both IPv4 and IPv6? There is a possibility that one connection uses IPv4 and another one IPv6.
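One way to check which address family the host actually uses towards the API, assuming a stock curl, is to force each family in turn and watch the connection lines; the staging URL below is only an example target:

```sh
# Force IPv4, then IPv6, and print the address curl actually connects to.
for flag in -4 -6; do
  echo "== curl $flag =="
  curl "$flag" -sv -o /dev/null \
    https://acme-staging-v02.api.letsencrypt.org/directory 2>&1 \
    | grep -E 'Trying|Connected to'
done
```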
@Kaaal I know earlier you said you weren't behind a web proxy. Can you confirm that there's no chance your ISP or IT department might have your server behind a proxy you weren't aware of? I ask because one of my coworkers points out a strange entry in your original log. As far as I'm aware there isn't any part of the Let's Encrypt API stack that would return an "HTTP/1.1 100 Continue" response, which makes me believe there might be a proxy meddling with requests. That would also explain the badNonce errors neatly.
@cpu , the relevant part of the output was:
request verification of XXX
HTTP/1.1 200 Connection established
HTTP/1.1 100 Continue
HTTP/1.1 500 Internal Server Error
Boulder has Akamai reverse proxies in front; that may be the other reason.
@bruncsak Interesting!
What was the HTTP request you sent that received that response? What URL was it being sent to? Can you share the results of running the following a few times:
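The exact command asked for here was not captured in this copy of the thread. Purely as an illustration of the kind of repeated probe being requested, one might fetch the same API URL a few times and compare the status lines and headers between attempts; the URL below is a stand-in, not necessarily the one @cpu meant:

```sh
# Hypothetical stand-in for the requested diagnostic: hit the same endpoint
# several times and keep the raw headers of each attempt for comparison.
for i in 1 2 3 4 5; do
  echo "--- attempt $i ---"
  curl -sSI https://acme-v02.api.letsencrypt.org/directory
  sleep 2
done
```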
@cpu Unfortunately I do not have the URL. I do not always run the client in full debug mode, and that error happened nearly 4 months ago. The URL must have been the value of the "uri" JSON field of the returned challenge. Would you be able to provide a generic URL to test? Anyhow, the local Squid proxy could not be responsible: it does not terminate the HTTPS session, it uses the CONNECT method to pass the connection through without interfering. Oops, I forgot something! This was the error text itself returned by the Akamai server:
<TITLE>Error</TITLE> An error occurred while processing your request. Reference #179.4df90a17.1512127845.3e011e
OK. Let's see what @Kaaal has to say. Errors from 4 months ago are too stale for me to do much with, given our at-hand log retention.
I didn't think of that; indeed, I have one IPv4 and one IPv6 address! Maybe we should trace the IPs used in the debug output?
@Kaaal If you share the IPv4 and IPv6 addresses your server uses, I can check the logs and see if we're seeing a split between the two or if it's uniformly the IPv6 address.
@cpu No problem, I'll send them to you.
@Kaaal @bruncsak Using the IP addresses @Kaaal sent, I was able to confirm this theory. I think we can conclusively say this is the root cause!

Over the past 7 days I saw 180 requests from @Kaaal's IPv6 address and 5 from @Kaaal's IPv4 address. 100% of the requests made by the IPv6 address went to one data centre; 100% of the requests made by the IPv4 address went to the other data centre. This would definitely cause a badNonce error if the ACME client used a nonce from an IPv4 request/response with an IPv6 request.

@Kaaal I admit I'm not sure what advice to give you on pinning your egress traffic to one address or the other, but I believe doing so will resolve your problem.

Edit: A colleague smartly points out this might be caused by "Happy Eyeballs" behaviour. The IPv4 requests may have been done as retries when an initial IPv6 connection failed for some reason (flaky upstream routes, etc.). On the Let's Encrypt side I think our options for load balancing are presently fairly restrictive, and we likely won't be able to pin an IPv4 and an IPv6 address to the same data centre reliably.
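For anyone hitting the same situation: with a curl-based client, the simplest way to pin egress to one family is curl's own -4/-6 switches (the same mechanism the script exposes later in this thread). A sketch, with the directory URL used only as an example:

```sh
# Pin every request (and therefore every nonce) to a single address family,
# so all traffic lands on the same data centre.
curl -6 -sS https://acme-v02.api.letsencrypt.org/directory > /dev/null   # IPv6 only
# or:
curl -4 -sS https://acme-v02.api.letsencrypt.org/directory > /dev/null   # IPv4 only
```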
Thank you both so much for solving my problem, @bruncsak and @cpu.
@Kaaal, I updated the code on the badNonce-debug branch. It is not really debug code anymore, but rather quality-assurance-level code. The new command line options "-4" and "-6" allow you to restrict the egress IP address family.
On my side, I added a "CURL_OPTS" variable and was waiting to see whether the problem was gone, which is the case. I was planning to open a pull request to send you this update, but it no longer seems necessary.
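For reference, the approach described here is a single variable of extra curl options that every request in the script passes through, so that one setting (for example -4 or -6) pins the egress family globally. A rough sketch of that pattern; the function name below is illustrative, not the script's actual send_req():

```sh
# One shared options variable read by every curl call in the script.
CURL_OPTS="${CURL_OPTS:--6}"   # e.g. export CURL_OPTS="-4" to force IPv4 instead

do_request() {
  url="$1"
  payload="$2"
  # Word splitting of $CURL_OPTS is intentional so multiple flags can be passed.
  curl $CURL_OPTS -sS -D - -o /dev/null --data-binary "$payload" "$url"
}
```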