Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nacos heartbeat issue: microservice health state is constantly changing while I am sure it is health #13123

Open
ThePeterQin opened this issue Feb 19, 2025 · 2 comments

Comments

@ThePeterQin
Copy link

ThePeterQin commented Feb 19, 2025

Describe the bug

issue description

Hi nacos experts,

I am using nacos server 2.23。 standalone mode. nacos client version is 1.4.2

We have many microservcies but only one of them has this issue.

I get 503 from the webpage for the API of this microsevices.

But I am sure the microserivces is heath.

nacos server log

I check the nacos logs ,two log

naming-server.log

2025-02-19 19:33:29,507 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:33:29,621 INFO Client connection 10.2.14.217:33030#true connect

below is the grep output
[root@gvadmin logs]# grep 33030 naming-server.log |grep dis

2025-02-19 19:07:04,447 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:08:34,450 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:09:24,450 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:10:04,451 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:11:24,456 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:14:24,465 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:15:19,466 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:17:29,468 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:19:49,476 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:22:44,484 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:24:09,485 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:25:59,491 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:29:09,496 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:30:49,502 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:32:49,506 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:33:29,507 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:35:39,522 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:36:49,523 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers
2025-02-19 19:38:34,527 INFO Client connection 10.2.14.217:33030#true disconnect, remove instances and subscribers

naming-event.log

I also check the naming-event.log
2025-02-19 13:55:39,277 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739944509276
2025-02-19 13:57:06,169 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739944595156
2025-02-19 13:58:29,181 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739944675508
2025-02-19 13:59:33,846 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739944743845
2025-02-19 14:01:11,792 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739944838972
2025-02-19 14:03:21,752 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739944968408
2025-02-19 14:04:27,376 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739945037374
2025-02-19 14:05:42,372 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739945110905
2025-02-19 14:06:20,943 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739945150942
2025-02-19 14:06:58,701 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739945188700
2025-02-19 14:07:48,971 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739945237444
2025-02-19 14:08:58,091 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739945308090
2025-02-19 14:09:56,895 INFO {POS} {IP-DISABLED} valid: 10.2.14.217:33030@DEFAULT@acx-collector, region: unknown, msg: client last beat: 1739945366894

nacos client log

I enable nacos client log

2025-02-19 19:42:41.874 [TraceId:,SpanId:][DEBUG][com.alibaba.nacos.naming.beat.sender][com.alibaba.nacos.client.naming:421] com.alibaba.nacos.client.naming - [BEAT] gridview sending beat to server: BeatInfo{port=33089, ip='10.10.10.10', weight=1.0, serviceName='DEFAULT_GROUP@@acx-collector', cluster='DEFAULT', metadata={preserved.heart.beat.timeout=60000, preserved.ip.delete.timeout=120000, preserved.register.source=SPRING_CLOUD, preserved.heart.beat.interval=3000}, scheduled=false, period=3000, stopped=false}
2025-02-19 19:42:44.944 [TraceId:,SpanId:][DEBUG][com.alibaba.nacos.naming.beat.sender][com.alibaba.nacos.client.naming:421] com.alibaba.nacos.client.naming - [BEAT] gridview sending beat to server: BeatInfo{port=33089, ip='10.10.10.10', weight=1.0, serviceName='DEFAULT_GROUP@@acx-collector', cluster='DEFAULT', metadata={preserved.heart.beat.timeout=60000, preserved.ip.delete.timeout=120000, preserved.register.source=SPRING_CLOUD, preserved.heart.beat.interval=3000}, scheduled=false, period=3000, stopped=false}
2025-02-19 19:42:47.945 [TraceId:,SpanId:][DEBUG][com.alibaba.nacos.naming.beat.sender][com.alibaba.nacos.client.naming:421] com.alibaba.nacos.client.naming - [BEAT] gridview sending beat to server: BeatInfo{port=33089, ip='10.10.10.10', weight=1.0, serviceName='DEFAULT_GROUP@@acx-collector', cluster='DEFAULT', metadata={preserved.heart.beat.timeout=60000, preserved.ip.delete.timeout=120000, preserved.register.source=SPRING_CLOUD, preserved.heart.beat.interval=3000}, scheduled=false, period=3000, stopped=false}
2025-02-19 19:42:50.947 [TraceId:,SpanId:][DEBUG][com.alibaba.nacos.naming.beat.sender][com.alibaba.nacos.client.naming:421] com.alibaba.nacos.client.naming - [BEAT] gridview sending beat to server: BeatInfo{port=33089, ip='10.10.10.10', weight=1.0, serviceName='DEFAULT_GROUP@@acx-collector', cluster='DEFAULT', metadata={preserved.heart.beat.timeout=60000, preserved.ip.delete.timeout=120000, preserved.register.source=SPRING_CLOUD, preserved.heart.beat.interval=3000}, scheduled=false, period=3000, stopped=false}
2025-02-19 19:43:51.502 [TraceId:,SpanId:][INFO][com.alibaba.nacos.client.naming.updater][com.alibaba.nacos.client.naming:81] com.alibaba.nacos.client.naming - [BEAT] adding beat: BeatInfo{port=33089, ip='10.10.10.10', weight=1.0, serviceName='DEFAULT_GROUP@@acx-collector', cluster='DEFAULT', metadata={preserved.heart.beat.timeout=60000, preserved.ip.delete.timeout=120000, preserved.register.source=SPRING_CLOUD, preserved.heart.beat.interval=3000}, scheduled=false, period=3000, stopped=false} to beat map.

I set the heartbeat interval to 3 seconds instead of default 5 seconds.
We can see the most time it will send heatbeat every 3 seconds as configured. But I also find that between 2025-02-19 19:42:50.947 and 2025-02-19 19:43:51.502, it is more than 60 seconds.

Arthas debug

I trace the nacos client hearbeat method. And find sometime it takes tens of seconds to get a response from nacos server

---ts=2025-02-19 19:56:22;thread_name=com.alibaba.nacos.naming.beat.sender;id=1981;is_daemon=true;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@5c7fa833 ---[1.686438ms] com.alibaba.nacos.client.naming.net.NamingProxy:callServer()
+---[1.44% 0.02433ms ] com.alibaba.nacos.client.naming.net.NamingProxy:injectSecurityInfo() #588
+---[1.92% 0.032399ms ] com.alibaba.nacos.client.naming.net.NamingProxy:builderHeader() #589
+---[0.52% 0.008784ms ] com.alibaba.nacos.common.utils.IPUtil:containsPort() #595
+---[0.31% 0.005271ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getInstance() #598
+---[0.28% 0.004714ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getPrefix() #598
+---[0.30% 0.005135ms ] com.alibaba.nacos.common.http.param.Query:newInstance() #603
+---[0.47% 0.008003ms ] com.alibaba.nacos.common.http.param.Query:initParams() #603
+---[82.76% 1.395717ms ] com.alibaba.nacos.common.http.client.NacosRestTemplate:exchangeForm() #603
+---[0.41% 0.006901ms ] com.alibaba.nacos.common.http.HttpRestResult:getCode() #606
+---[0.56% 0.009461ms ] com.alibaba.nacos.client.monitor.MetricsMonitor:getNamingRequestMonitor() #606
+---[0.60% 0.01019ms ] io.prometheus.client.Histogram$Child:observe() #607
+---[0.30% 0.005074ms ] com.alibaba.nacos.common.http.HttpRestResult:ok() #609
`---[0.30% 0.005124ms ] com.alibaba.nacos.common.http.HttpRestResult:getData() #610

---ts=2025-02-19 19:56:25;thread_name=com.alibaba.nacos.naming.beat.sender;id=1981;is_daemon=true;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@5c7fa833 ---[1.527801ms] com.alibaba.nacos.client.naming.net.NamingProxy:callServer()
+---[2.56% 0.039069ms ] com.alibaba.nacos.client.naming.net.NamingProxy:injectSecurityInfo() #588
+---[1.90% 0.028974ms ] com.alibaba.nacos.client.naming.net.NamingProxy:builderHeader() #589
+---[0.51% 0.007851ms ] com.alibaba.nacos.common.utils.IPUtil:containsPort() #595
+---[0.65% 0.009856ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getInstance() #598
+---[0.42% 0.006376ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getPrefix() #598
+---[0.34% 0.005146ms ] com.alibaba.nacos.common.http.param.Query:newInstance() #603
+---[0.53% 0.008146ms ] com.alibaba.nacos.common.http.param.Query:initParams() #603
+---[80.94% 1.236647ms ] com.alibaba.nacos.common.http.client.NacosRestTemplate:exchangeForm() #603
+---[0.41% 0.006272ms ] com.alibaba.nacos.common.http.HttpRestResult:getCode() #606
+---[0.51% 0.007815ms ] com.alibaba.nacos.client.monitor.MetricsMonitor:getNamingRequestMonitor() #606
+---[0.60% 0.009191ms ] io.prometheus.client.Histogram$Child:observe() #607
+---[0.31% 0.004791ms ] com.alibaba.nacos.common.http.HttpRestResult:ok() #609
`---[0.27% 0.004196ms ] com.alibaba.nacos.common.http.HttpRestResult:getData() #610

---ts=2025-02-19 19:56:28;thread_name=com.alibaba.nacos.naming.beat.sender;id=1981;is_daemon=true;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@5c7fa833 ---[29588.063595ms] com.alibaba.nacos.client.naming.net.NamingProxy:callServer()
+---[0.00% 0.037133ms ] com.alibaba.nacos.client.naming.net.NamingProxy:injectSecurityInfo() #588
+---[0.00% 0.048256ms ] com.alibaba.nacos.client.naming.net.NamingProxy:builderHeader() #589
+---[0.00% 0.008791ms ] com.alibaba.nacos.common.utils.IPUtil:containsPort() #595---ts=2025-02-19 19:56:27;thread_name=com.alibaba.nacos.client.naming.updater;id=1de2;is_daemon=true;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@5c7fa833 ---[31279.890666ms] com.alibaba.nacos.client.naming.net.NamingProxy:callServer()
+---[0.00% 0.026814ms ] com.alibaba.nacos.client.naming.net.NamingProxy:injectSecurityInfo() #588
+---[0.00% 0.031838ms ] com.alibaba.nacos.client.naming.net.NamingProxy:builderHeader() #589
+---[0.00% 0.008217ms ] com.alibaba.nacos.common.utils.IPUtil:containsPort() #595
+---[0.00% 0.004847ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getInstance() #598
+---[0.00% 0.00501ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getPrefix() #598
+---[0.00% 0.00509ms ] com.alibaba.nacos.common.http.param.Query:newInstance() #603
+---[0.00% 0.008567ms ] com.alibaba.nacos.common.http.param.Query:initParams() #603
+---[100.00% 31279.576339ms ] com.alibaba.nacos.common.http.client.NacosRestTemplate:exchangeForm() #603
+---[0.00% 0.006175ms ] com.alibaba.nacos.common.http.HttpRestResult:getCode() #606
+---[0.00% 0.011123ms ] com.alibaba.nacos.client.monitor.MetricsMonitor:getNamingRequestMonitor() #606
+---[0.00% 0.011019ms ] io.prometheus.client.Histogram$Child:observe() #607
+---[0.00% 0.010112ms ] com.alibaba.nacos.common.http.HttpRestResult:ok() #609
`---[0.00% 0.005058ms ] com.alibaba.nacos.common.http.HttpRestResult:getData() #610

    +---[0.00% 0.004265ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getInstance() #598
    +---[0.00% 0.005395ms ] com.alibaba.nacos.client.naming.net.NamingHttpClientManager:getPrefix() #598
    +---[0.00% 0.004958ms ] com.alibaba.nacos.common.http.param.Query:newInstance() #603
    +---[0.00% 0.007706ms ] com.alibaba.nacos.common.http.param.Query:initParams() #603
    +---[100.00% 29587.721065ms ] com.alibaba.nacos.common.http.client.NacosRestTemplate:exchangeForm() #603
    +---[0.00% 0.006188ms ] com.alibaba.nacos.common.http.HttpRestResult:getCode() #606
    +---[0.00% 0.009543ms ] com.alibaba.nacos.client.monitor.MetricsMonitor:getNamingRequestMonitor() #606
    +---[0.00% 0.009778ms ] io.prometheus.client.Histogram$Child:observe() #607
    +---[0.00% 0.005949ms ] com.alibaba.nacos.common.http.HttpRestResult:ok() #609
    `---[0.00% 0.004456ms ] com.alibaba.nacos.common.http.HttpRestResult:getData() #610

What we have tried:

we changed below configurations,
discovery:
enabled: true
namespace: ${nacos_namespace}
ip: 10.10.10.10
register-enabled: true
heart-beat-interval: 3000
heart-beat-timeout: 60000
ip-delete-timeout: 120000

after changing above configuration, issue occur less frequently. But it is not gone.

Any suggestions?? Thanks.

Expected behavior
A clear and concise description of what you expected to happen.

Actually behavior
A clear and concise description of what you actually to happen.

How to Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See errors

Desktop (please complete the following information):

  • OS: [centos]
  • Version [e.g. nacos-server 2.2.3, nacos-client 1.4.2]
  • Module [e.g. naming]
  • SDK [e.g. original]

Additional context
Add any other context about the problem here.

@ThePeterQin ThePeterQin changed the title nacos hearbeat issue: microservice health state is constantly changing while I am sure it is health nacos heartbeat issue: microservice health state is constantly changing while I am sure it is health Feb 19, 2025
@KomachiSion
Copy link
Collaborator

http request cost much time, please check your network and nacos-server status first.

Such as network rt, nacos server CPU, GC, thread pool size.

And change the relative configuration to optimaze it.

@Jevgens
Copy link

Jevgens commented Feb 20, 2025

#13123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants