JWS anti-replay nonce error #1

Closed
Kaaal opened this issue Mar 9, 2018 · 30 comments

Comments

@Kaaal

Kaaal commented Mar 9, 2018

Hello,
I regularly get an error when generating certificates:

request challenge for XXX
error while requesting challenge for XXX
  {
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce a_und1PinIcRvGo9HQ6HaOzlrmIgum_AfiwnaLllAD8",
  "status": 400
} ({
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce a_und1PinIcRvGo9HQ6HaOzlrmIgum_AfiwnaLllAD8",
  "status": 400
})

The error usually disappears when relaunching the script.

Thanks for your help

@bruncsak
Owner

bruncsak commented Mar 9, 2018

Hello,

Are you behind a web proxy? The RFC says that the server should reply with the "Cache-Control: no-store" HTTP header field (as Let's Encrypt's production and staging servers do), but some proxies may be broken.
Is the XXX domain name always the first domain in the list of domains when requesting challenges?

You may try to comment out the line
sed -e '/Replay-Nonce: / ! d; s/^Replay-Nonce: //' "$RESP_HEADER" | tr -d '\r\n' > "$LAST_NONCE"
in the send_get_req() function. Please let me know how it behaves then.
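
For context, that line appears to be what saves the nonce from each response for use in the next request. A minimal sketch of what the pipeline does, run against a made-up header file: the temporary files are stand-ins for the script's `$RESP_HEADER` and `$LAST_NONCE` (assumed here to be plain files), and the header values are copied from a response shown further down in this thread.

```shell
#!/bin/sh
# Stand-ins for the script's $RESP_HEADER and $LAST_NONCE (assumption: plain files).
RESP_HEADER=$(mktemp)
LAST_NONCE=$(mktemp)

# Sample response headers with CRLF line endings, as curl would dump them.
printf 'HTTP/1.1 400 Bad Request\r\nServer: nginx\r\nReplay-Nonce: qXpZgOyfCmPa6s5HMKgcyWqQqxUtzBTfLVS1sIVTz2k\r\nConnection: close\r\n\r\n' > "$RESP_HEADER"

# Delete every line that is not a Replay-Nonce header, strip the header name,
# then drop CR/LF so that only the bare nonce is stored.
sed -e '/Replay-Nonce: / ! d; s/^Replay-Nonce: //' "$RESP_HEADER" | tr -d '\r\n' > "$LAST_NONCE"

cat "$LAST_NONCE"; echo
```

With the line commented out, the stored nonce is simply not refreshed by that function's responses.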

@Kaaal
Author

Kaaal commented Mar 9, 2018

No, I'm not behind a web proxy.
Every day I have several certificates to request, it's not always the first one, sometimes many have the error, other times none.
I will try to comment out this line, thank you !

@bruncsak
Owner

bruncsak commented Mar 9, 2018

Is there a long delay between the different actions? There may be a timing issue on the server regarding how long a nonce stays valid.

If it is not always the first domain that gets the error, then please also comment out the same line in the send_req() function.

@Kaaal
Author

Kaaal commented Mar 9, 2018

At most 1 or 2 seconds pass between two requests.
I have several hundred certificates. My daily cron job checks the expiration date of each certificate (with openssl) and makes a request if necessary. If all certificates are OK, the cron job takes 2 seconds to check them all.
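
A check like the one described can be sketched with `openssl x509 -checkend`. This is only an illustration, not the actual cron job: the function name, the 30-day threshold, and the certificate path are assumptions.

```shell
#!/bin/sh
# needs_renewal CERT_FILE DAYS
# Succeeds (exit 0) when the certificate expires within DAYS days or cannot
# be read, so a renewal would be requested in either case.
needs_renewal() {
    # -checkend N exits non-zero if the certificate expires within N seconds.
    ! openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" >/dev/null 2>&1
}

# Hypothetical usage inside a daily cron job:
# for cert in /path/to/certs/*.pem; do
#     needs_renewal "$cert" 30 && echo "renew $cert"
# done
```

Because the check only reads local files, a run in which nothing needs renewal makes no ACME requests at all.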

@bruncsak
Owner

bruncsak commented Mar 9, 2018

Is only one job running at a time, renewing the certificates sequentially, or are the jobs running in parallel? Do all jobs use the same account key?

@Kaaal
Author

Kaaal commented Mar 10, 2018

Only one job at a time, and all use the same account key. Domain names are not all configured on the same IP, but all IPs belong to the same server.

@Kaaal
Author

Kaaal commented Mar 10, 2018

Today, with the two lines commented out, I got an error again. There were 2 certificates to renew, and both failed:

Generating RSA private key, 4096 bit long modulus
...............................................................++
....................................++
e is 65537 (0x10001)
generate certificate request
request challenge for XXX
error while requesting challenge for XXX
  {
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce yQ9e8_az0NQ_Fr8mj2MSWYCTI0z-LjxuVdZJ2t1fo3I",
  "status": 400
} ({
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce yQ9e8_az0NQ_Fr8mj2MSWYCTI0z-LjxuVdZJ2t1fo3I",
  "status": 400
})
Generating RSA private key, 4096 bit long modulus
..................................................................................++
..............................................................++
e is 65537 (0x10001)
generate certificate request
request challenge for YYY
error while requesting challenge for YYY
  {
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce K7VVkwRTeI75xBbANx5etZGUDX2WYClMKbItOY_4sA8",
  "status": 400
} ({
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce K7VVkwRTeI75xBbANx5etZGUDX2WYClMKbItOY_4sA8",
  "status": 400
})

I ran the script again. The first one failed again, the second one worked:

Generating RSA private key, 4096 bit long modulus
....................++
....................................................................................................++
e is 65537 (0x10001)
generate certificate request
request challenge for XXX
error while requesting challenge for XXX
  {
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce NRxuagOCP4a7uN5lfkX-IuHN8aV9bRSvx5Jx56DYq2s",
  "status": 400
} ({
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce NRxuagOCP4a7uN5lfkX-IuHN8aV9bRSvx5Jx56DYq2s",
  "status": 400
})
Generating RSA private key, 4096 bit long modulus
........................................++
.................................................................................++
e is 65537 (0x10001)
generate certificate request
request challenge for YYY
push response for YYY
request verification of YYY
check verification of YYY
YYY is valid
remove response for YYY
request certificate

The third time, the error was not exactly the same:

Generating RSA private key, 4096 bit long modulus
.......................++
.................++
e is 65537 (0x10001)
generate certificate request
request challenge for XXX
push response for XXX
request verification of XXX
check verification of XXX
XXX is valid
remove response for XXX
request certificate
unhandled response while requesting certificate

HTTP/1.1 100 Continue
Expires: Sat, 10 Mar 2018 09:23:16 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache

HTTP/1.1 400 Bad Request
Server: nginx
Content-Type: application/problem+json
Content-Length: 149
Boulder-Requester: 27228030
Replay-Nonce: qXpZgOyfCmPa6s5HMKgcyWqQqxUtzBTfLVS1sIVTz2k
Expires: Sat, 10 Mar 2018 09:23:16 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 10 Mar 2018 09:23:16 GMT
Connection: close

{
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce iTbp24xCF2YD8NxJVERnZovpTP1Mam2mvXMn9CdBpV8",
  "status": 400
}

The fourth time:

Generating RSA private key, 4096 bit long modulus
.....++
.................................................................................++
e is 65537 (0x10001)
generate certificate request
request challenge for XXX
push response for XXX
request verification of XXX
unhandled response while requesting verification of challenge of XXX

HTTP/1.1 100 Continue
Expires: Sat, 10 Mar 2018 09:24:40 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache

HTTP/1.1 400 Bad Request
Server: nginx
Content-Type: application/problem+json
Content-Length: 149
Boulder-Requester: 27228030
Replay-Nonce: PqJPXT7hUzAQAsqaPYcGpfRUzvYP5CBOo1DZZdmlSug
Expires: Sat, 10 Mar 2018 09:24:40 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 10 Mar 2018 09:24:40 GMT
Connection: close

{
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce z92hn5lKVcWqPq6Zpi3Zkk1iAOoKu4YDlB9wGGXOGMo",
  "status": 400
}

The fifth time, it worked!

@bruncsak
Owner

Was that already with the modification I requested (two lines commented out in the script)? Which OS and version are you on, and is your curl the standard one shipped with the OS?

@Kaaal
Author

Kaaal commented Mar 10, 2018

Yes, the two lines are commented out in the script. The script is running on Ubuntu 16.04 with standard curl.

@bruncsak
Owner

bruncsak commented Mar 12, 2018

I created a new debug version for you in the branch badNonce-debug. Kindly use this one and in case of failure please send me the output.

@Kaaal
Author

Kaaal commented Mar 18, 2018

I set up the debug version, but for a week there were no errors. I'll let you know as soon as it happens again.

@Kaaal
Author

Kaaal commented Mar 21, 2018

There was an error this morning, after a successful request:

Generating RSA private key, 4096 bit long modulus
................................++
....++
e is 65537 (0x10001)
-e 1521619207.108593240 Replay-Nonce: WF5izbGMEwrK-0lJ9JuPTaXliQ_pPrpaDPoh6Xm8stE^M$
generate certificate request
request challenge for XXX
-e 1521619214.011580466 Replay-Nonce: h6CMwHNkX1VIwi1wjLduaK98bqjsvmxMOooQENeL4HM^M$
push response for XXX
request verification of XXX
-e 1521619222.501540693 Replay-Nonce: 0Mmfaqd4ZaGVaPYnTkrvrn4wNc_NAFS_koLi_EtFQ70^M$
check verification of XXX
-e 1521619223.968245048 Replay-Nonce: mY5KJax5NfX_xwt4V48I6fS2RkszpydltoE05kiPtJs^M$
XXX is pending
check verification of XXX
-e 1521619225.458722309 Replay-Nonce: FT8utBITkRPV90lN73jk5hf9q4UMejZoFmecPVS--EU^M$
XXX is valid
remove response for XXX
request certificate
-e 1521619239.555913964 Replay-Nonce: USiMfNapPdK1PcgORwlpy6nzPPLm0hxO9RtmrRW95O0^M$
-e 1521619240.080922003 Replay-Nonce: LpU7xuQo9xpdC34dmDJ-DERj8QtYjvAoyScingGmmbA^M$
Generating RSA private key, 4096 bit long modulus
............................................................................++
.......................................................................................................................................................................................................................++
e is 65537 (0x10001)
-e 1521619243.788096334 Replay-Nonce: Y-vLQeG893Xv9qrYUHFMpkB7Mqn0EuBFcMCSViJZ4IA^M$
generate certificate request
request challenge for YYY
-e 1521619245.009965970 Replay-Nonce: nCd9wlYA22YTHLqWBvssYjSQKMaEoOLSlHkJuosnHVs^M$
error while requesting challenge for YYY
  {
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce Y-vLQeG893Xv9qrYUHFMpkB7Mqn0EuBFcMCSViJZ4IA",
  "status": 400
} ({
  "type": "urn:acme:error:badNonce",
  "detail": "JWS has invalid anti-replay nonce Y-vLQeG893Xv9qrYUHFMpkB7Mqn0EuBFcMCSViJZ4IA",
  "status": 400
})

@bruncsak
Owner

I opened a topic on the Let's Encrypt community forum concerning your problem:

https://community.letsencrypt.org/t/regular-badnonce-errors/57332

May I ask you to answer the Boulder engineer's question:
"Is there any chance the user in question is accessing the API from multiple egress public IP addresses?"

The Boulder engineer also asked me to add more debugging output to the code, which I am ready to do, but that would expose the domain names in question publicly. Normally that is not a problem, since the domain names of issued certificates are all made public in Certificate Transparency logs (e.g. https://crt.sh/?q=example.com).

@Kaaal
Author

Kaaal commented Mar 23, 2018

I use only one IP to access the API, but domains are set up on various IPs.

Domain names belong to our customers, it would be easier for me if I could send debug logs only to you and the boulder engineer. Is that possible?

@cpu

cpu commented Mar 23, 2018

> Domain names belong to our customers, it would be easier for me if I could send debug logs only to you and the boulder engineer. Is that possible?

@Kaaal 👋 I'm the Boulder engineer in question :-) You can email the unsanitized logs to cpu <at> letsencrypt.org. Thanks!

@bruncsak
Owner

bruncsak commented Mar 23, 2018

@Kaaal, I put more output into the debug version, please update your instance.

https://github.com/bruncsak/letsencrypt.sh/tree/badNonce-debug

If you get a badNonce failure again, there is no need to post the output here; please send it to the Boulder engineer's e-mail address.

@Kaaal
Author

Kaaal commented Mar 23, 2018

Thank you @bruncsak, I updated the script. And thank you @cpu, I will send you the logs when the failure happens again.

@bruncsak
Owner

@Kaaal Something else just came to mind. On the server where you run the script to get the certificates, do you have a dual IP stack, with both IPv4 and IPv6? There may be the possibility that one connection is using IPv4, the other one IPv6.

@cpu

cpu commented Mar 23, 2018

@Kaaal I know earlier you said you weren't behind a web proxy. Can you confirm that there's no chance your ISP or IT department might have your server behind a proxy you weren't aware of?

I ask because one of my coworkers points out in your original log there is a strange HTTP/1.1 100 Continue response immediately before the bad request response caused by the nonce error.

As far as I'm aware there isn't any part of the Let's Encrypt API stack that would return an "HTTP/1.1 100 Continue" response which makes me believe there might be a proxy meddling with requests. That would also explain the badNonce errors neatly.

HTTP/1.1 100 Continue
Expires: Sat, 10 Mar 2018 09:23:16 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache

HTTP/1.1 400 Bad Request
Server: nginx
Content-Type: application/problem+json
Content-Length: 149
Boulder-Requester: 27228030
Replay-Nonce: qXpZgOyfCmPa6s5HMKgcyWqQqxUtzBTfLVS1sIVTz2k
Expires: Sat, 10 Mar 2018 09:23:16 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 10 Mar 2018 09:23:16 GMT
Connection: close

@bruncsak
Owner

@cpu,
I had a '100 Continue' failure as well. I am behind two Squid proxies, but I am not sure that a proxy caused that error:

request verification of XXX
unhandled response while requesting verification of challenge of XXX

HTTP/1.1 200 Connection established

HTTP/1.1 100 Continue
Expires: Fri, 01 Dec 2017 11:30:45 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache

HTTP/1.1 500 Internal Server Error
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 175
Expires: Fri, 01 Dec 2017 11:30:45 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Fri, 01 Dec 2017 11:30:45 GMT
Connection: close

Boulder has Akamai reverse proxies in front of it, which may be the other explanation.

@cpu

cpu commented Mar 23, 2018

@bruncsak Interesting!

HTTP/1.1 500 Internal Server Error
Server: AkamaiGHost

What was the HTTP request you sent that received that response? What URL was it being sent to?

Can you share the results of running the following a few times:

curl -I <that url> -H "Pragma: akamai-x-cache-on, akamai-x-get-cache-key, akamai-x-get-true-cache-key, akamai-x-get-request-id"

@bruncsak
Owner

bruncsak commented Mar 23, 2018

@cpu Unfortunately I do not have the URL. I do not always run the client in full debug mode. That error happened near 4 months ago. The URL must have been the value of the "uri" JSON field of the returned challenge. Would you be able to provide a generic URL to test?

Anyhow, the local Squid proxy could not be responsible. It does not terminate the HTTPS session; it uses the CONNECT method to pass the connection through without interfering.

Oops, I forgot something! This was the error text returned by the Akamai server:

<TITLE>Error</TITLE> An error occurred while processing your request.

Reference #179.4df90a17.1512127845.3e011e

@cpu

cpu commented Mar 23, 2018

> That error happened near 4 months ago.

OK. Let's see what @Kaaal has to say. Errors from 4 months ago are too stale for me to do much with, given our log retention.

@Kaaal
Author

Kaaal commented Mar 23, 2018

I didn't think of that; indeed, I have one IPv4 and one IPv6 address! Maybe we should trace the IPs used in the debug output?
I just called the datacenter that hosts our servers, and they confirmed that there is no "hidden" proxy.
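
One way to trace the addresses can be sketched with curl's `--write-out` variables `%{local_ip}` and `%{remote_ip}`. The function name is hypothetical, the URL is a placeholder, and this is not the code in the badNonce-debug branch.

```shell
#!/bin/sh
# trace_ips URL
# Prints which local address curl used and which remote address it reached.
trace_ips() {
    curl --silent --head --output /dev/null \
         --write-out '%{local_ip} -> %{remote_ip}\n' "$1"
}

# Example (needs network access; replace the placeholder with the ACME server's URL):
# trace_ips https://acme-server.example/directory
```

An IPv6 address on the left-hand side for some requests and an IPv4 address for others would show that the address family changes between requests.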

@cpu

cpu commented Mar 23, 2018

@Kaaal If you share the IPv4 and IPv6 addresses your server uses I can check the logs and see if we're seeing a split between the two or if it's uniformly the IPv6 address.

@Kaaal
Author

Kaaal commented Mar 23, 2018

@cpu No problem, I'll send them to you.

@cpu

cpu commented Mar 23, 2018

> There may be the possibility that one connection is using IPv4, the other one IPv6.

@Kaaal @bruncsak Using the IP addresses @Kaaal sent I was able to confirm this theory. I think we can conclusively say this is the root cause!

Over the past 7 days I saw 180 requests from @Kaaal's IPv6 address and 5 from @Kaaal's IPv4 address. 100% of the requests made by the IPv6 address went to one data centre. 100% of the requests made by the IPv4 address went to the other data centre. This would definitely cause a badNonce error if the ACME client used a nonce from an IPv4 request/response with an IPv6 request.

@Kaaal I admit I'm not sure what advice to give you on pinning your egress traffic to one address or the other but I believe doing so will resolve your problem.

Edit: A colleague smartly points out this might be caused by "Happy Eyeballs" behaviour. The IPv4 requests may have been done as retries when an initial IPv6 connection failed for some reason (flaky upstream routes, etc).

On the Let's Encrypt side I think presently our options for load-balancing are fairly restrictive and we likely won't be able to pin an IPv4 and an IPv6 address to the same data centre reliably.
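
The failure mode described above can be illustrated with a toy model in which each data centre only accepts nonces it issued itself. This is a simplification for illustration and not how Boulder is implemented; the data centre names, function names, and nonce format are all made up.

```shell
#!/bin/sh
# Toy model: each data centre keeps its own list of issued nonces and
# accepts a nonce only once, and only if it issued that nonce itself.
STATE_DIR=$(mktemp -d)
COUNTER=0

# issue_nonce DC: record a fresh nonce for data centre DC and leave it in $NONCE.
issue_nonce() {
    COUNTER=$((COUNTER + 1))
    NONCE="nonce-$COUNTER"
    echo "$NONCE" >> "$STATE_DIR/$1"
}

# redeem_nonce DC NONCE: print "accepted" or "badNonce".
redeem_nonce() {
    if grep -qxF -- "$2" "$STATE_DIR/$1" 2>/dev/null; then
        # Single use: remove the nonce from this data centre's list.
        grep -vxF -- "$2" "$STATE_DIR/$1" > "$STATE_DIR/$1.tmp"
        mv "$STATE_DIR/$1.tmp" "$STATE_DIR/$1"
        echo accepted
    else
        echo badNonce
    fi
}

issue_nonce dc_ipv6                 # previous response arrived over IPv6
redeem_nonce dc_ipv6 "$NONCE"       # next request also over IPv6: accepted
issue_nonce dc_ipv6                 # previous response arrived over IPv6
redeem_nonce dc_ipv4 "$NONCE"       # next request fell back to IPv4: badNonce
```

In this model the error depends only on which data centre the next request lands on, which fits the intermittent pattern reported in this thread.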

@Kaaal
Author

Kaaal commented Mar 25, 2018

Thank you both so much for solving my problem, @bruncsak and @cpu.
I can pin traffic to IPv4 or IPv6 (really easy with curl). But other people may have the same problem, whether with @bruncsak's client or another one. If you can't solve the problem at Let's Encrypt, I think you should at least report it somewhere.
Thank you again!

@bruncsak
Owner

bruncsak commented Apr 6, 2018

@Kaaal, I updated the code on the badNonce-debug branch. It is no longer really debug code, but rather quality-assurance-level code. The new command line options "-4" and "-6" allow restricting the egress IP address to IPv4 or IPv6.

@Kaaal
Author

Kaaal commented Apr 9, 2018

On my side, I added a "CURL_OPTS" variable and was waiting to confirm that the problem was gone, which is the case. I was planning to make a pull request to send you this change, but it no longer seems necessary.
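
A variable like that might be wired in roughly as follows. Only the name `CURL_OPTS` comes from the comment above; the wrapper function, its default, and the placeholder URL are assumptions, not the actual patch.

```shell
#!/bin/sh
# CURL_OPTS holds extra curl options; "-4" (--ipv4) or "-6" (--ipv6) pins
# every request to one address family. Defaulting to -4 is an arbitrary choice.
CURL_OPTS="${CURL_OPTS:--4}"

# acme_curl ARGS...: single entry point for all HTTP requests.
acme_curl() {
    # $CURL_OPTS is deliberately unquoted so several options can be passed.
    curl $CURL_OPTS --silent --show-error "$@"
}

# Example (needs network access; the URL is a placeholder):
# acme_curl --head https://acme-server.example/directory
```

Routing every request through one function keeps the nonce-issuing response and the nonce-consuming request on the same address family.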
