feat: add a brod process monitor #124
base: main
Conversation
lib/kafee/process_manager.ex (outdated)
    Process.sleep(@restart_delay)
    {:noreply, state, {:continue, :start_child}}
Huh, TIL: I didn't realize you could return `{:continue, ...}` from a `handle_continue` function 🤯
lib/kafee/process_manager.ex (outdated)
      {:noreply, state, {:continue, :start_child}}
    end

    defp start_child(%{child_spec: child_spec, supervisor: supervisor} = state) do
Since we can just do `{:continue, :start_child}`, we can probably get rid of this function and make the `handle_continue` function more direct.
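For reference, a minimal sketch of what the more direct version might look like (illustrative only; it assumes a Supervisor.start_child/2 based start and the supervisor / child_spec / child_pid / monitor_ref fields shown in the diff, and is not the actual PR code):

    @impl GenServer
    def init(opts) do
      state = %{
        supervisor: Keyword.fetch!(opts, :supervisor),
        child_spec: Keyword.fetch!(opts, :child_spec),
        child_pid: nil,
        monitor_ref: nil
      }

      # Defer the real work until after init/1 returns.
      {:ok, state, {:continue, :start_child}}
    end

    @impl GenServer
    def handle_continue(:start_child, %{supervisor: supervisor, child_spec: child_spec} = state) do
      case Supervisor.start_child(supervisor, child_spec) do
        {:ok, pid} ->
          ref = Process.monitor(pid)
          {:noreply, %{state | child_pid: pid, monitor_ref: ref}}

        {:error, _reason} ->
          # Wait, then loop straight back into this same handle_continue clause.
          Process.sleep(@restart_delay)
          {:noreply, state, {:continue, :start_child}}
      end
    end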
So, just to be clear on trade-offs: this sets the brod client to a temporary process, so there will be times it does not exist. That means any sync work (like the sync producer) will error if it tries to send a message while the client is down; you'll want to use the async producer if you want better handling of errors. Second, there is a very small chance that this process manager process crashes or for some reason doesn't restart the client. I think it's pretty rare (thanks BEAM!) but worth mentioning. Lastly, it might be useful to purge some of the brod client logs and rewrite them into something more helpful. Otherwise, my much longer-term thought is writing our own Kafka client off the kpro library that brod uses 🤷 Code looks good to me. Limited testing seems to fix the issue.
For the first point... that's what I'd expect from the sync producer. We're typically outboxing all kafka messages, or throwing them into a transaction that would ultimately get retried. The second one is more concerning, but I don't currently have an idea for that.
I like this PR as it's more concise and seems to go more with the grain of OTP than #123.
There are some comments / questions surrounding the work section of the new process.
One thing to add to the above comments, regarding the possibility of ProcessManager not being able to start the child: I think you can have a threshold, or a max retry attempt number that, when reached, would just shut down ProcessManager. Its parent supervisor will restart ProcessManager (based on the supervisor's restart strategy), and ProcessManager will then reset, retry, and crash again when the max is reached. I think this is a good two-phase crash triage loop that could work, imho. A rough sketch of the idea follows.
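Something along these lines, for illustration (the attempts field, @max_attempts, and the try_start_child/1 helper are hypothetical and not part of the actual Kafee.ProcessManager):

    @max_attempts 5

    @impl GenServer
    def handle_continue(:start_child, %{attempts: attempts} = state) when attempts >= @max_attempts do
      # Give up: stopping with a non-normal reason hands control back to the parent
      # supervisor, whose own restart strategy decides whether to restart or escalate.
      {:stop, :max_restart_attempts_reached, state}
    end

    def handle_continue(:start_child, state) do
      case try_start_child(state) do
        {:ok, new_state} ->
          {:noreply, %{new_state | attempts: 0}}

        {:error, _reason} ->
          # Count the failure and retry after the delay.
          Process.sleep(@restart_delay)
          {:noreply, %{state | attempts: state.attempts + 1}, {:continue, :start_child}}
      end
    end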
lib/kafee/process_manager.ex (outdated)
    Process.sleep(@restart_delay)
    {:noreply, state, {:continue, :start_child}}
How about moving it into its own function?
Also, to my knowledge, a GenServer's way of doing things shouldn't involve `Process.sleep/1` most of the time if it can help it. How about `Process.send_after/3` -> `handle_info/2` -> running the abstracted-out logic inside `handle_continue/2`? That way you can also use `handle_continue/2` from `init/1`.
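Sketched roughly (hypothetical, not the PR's actual code; it reuses the @restart_delay / @log_prefix attributes and the :DOWN handler shape from this PR):

    @impl GenServer
    def handle_info({:DOWN, ref, :process, pid, reason}, %{monitor_ref: ref, child_pid: pid} = state) do
      Logger.info("#{@log_prefix} Child process down. Restarting in #{@restart_delay}ms...", reason: reason)

      # Schedule the retry instead of sleeping, so the process stays responsive.
      Process.send_after(self(), :restart_child, @restart_delay)
      {:noreply, state}
    end

    def handle_info(:restart_child, state) do
      # Funnel back into the same start logic used by init/1.
      {:noreply, state, {:continue, :start_child}}
    end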
An alternative way of shortening the timeout logic, with protection against repeated calls to start a child, is the GenServer timeout feature, instead of `Process.send_after/3` or `Process.sleep/1`. You'd write less boilerplate handler code.
For example:
    defmodule SimpleTimeoutServer do
      use GenServer

      def start_link do
        GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
      end

      @impl true
      def init(:ok) do
        # The third element here (or the last one after :continue) is the timeout in milliseconds.
        # If no messages arrive in the process inbox within that window, a :timeout message is
        # delivered to the process; if a message does come in first, the pending timeout is canceled.
        {:ok, 0, 5000}
      end

      @impl true
      def handle_info(:timeout, state) do
        IO.puts("Timeout occurred. Current state: #{state}")
        # Returning a timeout value again schedules the next :timeout message.
        {:noreply, state, 5000}
      end
    end
Example from wms-service is in the StreamReleaseAgent.
lib/kafee/process_manager.ex (outdated)
    @impl GenServer
    def handle_info({:DOWN, ref, :process, pid, reason}, %{monitor_ref: ref, child_pid: pid} = state) do
      Logger.info("#{@log_prefix} Child process down. Restarting in #{@restart_delay}ms...",
        reason: reason
Adding the `@log_prefix` here can allow for easier filtering by DataDog I think?
That was the idea, but I've pulled it off. I'll just make a view that handles this.
test/kafee/process_manager_test.exs (outdated)
      start: {Agent, :start_link, [fn -> %{} end]}
    }

    ProcessManager.start_link(opts)
Suggested change: replace

    ProcessManager.start_link(opts)

with

    assert {:error, {{:badkey, :supervisor}, _}} = ProcessManager.start_link(opts)
I think that'd be enough, and you don't need to check with `assert_receive`.
Checking the return value of this function showed:
{:error,
{{:badkey, :supervisor},
[
{:erlang, :map_get,
[
:supervisor,
%{
id: :test_child,
start: {Agent, :start_link,
[#Function<3.102245195/0 in Kafee.ProcessManagerTest."test start_link/1 fails to start without required supervisor option"/1>]}
}
], [error_info: %{module: :erl_erts_errors}]},
{Kafee.ProcessManager, :init, 1,
[file: ~c"lib/kafee/process_manager.ex", line: 28]},
{:gen_server, :init_it, 2, [file: ~c"gen_server.erl", line: 980]},
{:gen_server, :init_it, 6, [file: ~c"gen_server.erl", line: 935]},
{:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 241]}
]}}
I tested the consumer side by disconnecting the brod clients (stopping the kafka docker image) and then restarting it. As expected, the main branch version of kafee would just crash the application, but this branch recovered after retrying.
While I like all async adapters using the ProcessManager approach, I have a comment about keeping Producer.SyncAdapter as-is so that we don't introduce any async-like behavior to it.
I can make the change if you agree with the path forward, but it would probably mean switching over to using AsyncAdapter, or siloing the blast radius with supervisors wherever BrodAdapter is used.
      restart: :permanent,
      shutdown: 500
    }

    {Kafee.ProcessManager,
TL;DR: how about we keep the publishing sync adapter as-is? Below is a way to still create robustness by stacking supervisors.
Based on what @btkostner said about the time window where the brod clients won't exist, it seems that if we use `Kafee.ProcessManager` here, the end result would be that SyncAdapter would need to behave like AsyncAdapter, with a temporary queue to capture the messages that need to be published. Otherwise, we really would lose the message to the ether, and we'd need to at least log it to DataDog in order to track which messages were dropped while the brod client was being brought back online.
What if we keep this part as-is? That way, we still hold true to the word that this is a synchronous adapter, and if the brod client crashes, it means the entire application would have to restart.
Now, about that last part (the entire application having to restart): in Elixir in Action, 3rd ed., chapter 8, Sasa Juric goes over how to "silo" the application with multiple supervisors. The diagram below is from the book. `Todo.System` is the main supervisor that handles the "domain" of Todo; it's not the application-level supervisor, rather it'd be a child of the main application supervisor.

Now, if OMS services are structured similarly to WMS's, then each probably has one main application supervisor handling almost all of the processes. However, the book goes over creating boundaries between processes based on "timing strategies" / business context.
So if we were to keep `SyncAdapter` as-is, without the ProcessManager, what we could do is have a topology design such as:
graph TD
A[Application Supervisor]
B[KafkaPublisherSupervisor]
C[KafkaConsumerSupervisor]
D[Kafee.Producer.BrodAdapter]
E[Kafee.Consumer.BrodAdapter]
A --> B
A --> C
B --> D
C --> E
classDef supervisor fill:#006400,stroke:#333,stroke-width:2px;
classDef child fill:#228B22,stroke:#333,stroke-width:2px;
class A,B,C supervisor;
class D,E child;
That way, even if a brod client fails, it would only trigger its respective supervisor to restart. We can set a custom max restart count for their children that is greater than the default of 3, so that they keep retrying until they give up, at which point the supervisors themselves will crash.
When they crash, the application's top-level supervisor will then restart the child supervisors, and those restarts start tallying against its own max restart count.
So if the max restart attempts on the supervisors is 3, and we give the children of those supervisors 5 max restart attempts, the brod clients get 15 restart opportunities. We can choose a suitably robust number.
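A rough sketch of one of those sibling supervisors (module names are placeholders, the child spec is a stand-in for however the brod-backed adapter is actually started, and this is not code from this PR):

    defmodule MyApp.KafkaPublisherSupervisor do
      use Supervisor

      def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

      @impl Supervisor
      def init(_opts) do
        children = [
          # Placeholder child spec for whatever brod-backed producer process runs here.
          {Kafee.Producer.BrodAdapter, []}
        ]

        # A restart budget larger than the default (3 restarts in 5 seconds), so a
        # flapping Kafka connection is retried here before the crash escalates to
        # the application supervisor.
        Supervisor.init(children, strategy: :one_for_one, max_restarts: 5, max_seconds: 30)
      end
    end

The application supervisor would then start MyApp.KafkaPublisherSupervisor and MyApp.KafkaConsumerSupervisor as ordinary children, matching the topology in the diagram.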
That's basically what I was doing in my original approach 😅. I'm on board with it. The point is ultimately that if we're going to move away from elsa, we need to make sure we're "as robust" as it is. We can't tolerate the lack of a kafka connection tanking the entire application, so whatever we have to do to get there is 👍.
From an implementation perspective, we use the transactional outbox pattern for all publishing, so I'm not particularly concerned. As a library, though, we'd ideally want that solved without suggesting that it's handled solely in userland by using an outbox 😰.
Hopefully this makes sense :)
Now that I think about your comment, SyncAdapter probably still would need an outbox pattern, because messages can be attempted to be published while the brod client restarts. The distinction between Sync and Async hinges on whether we wait for the brod client to finish publishing before the next message is published, so the outbox requirement would probably need to hold for both approaches.
For AsyncAdapter, we already have the queue playing that outbox role. For SyncAdapter we don't; for now we can note that the responsibility of maintaining the outbox lies in userland when using SyncAdapter, as that won't be part of this PR. (A rough illustration of what that userland outbox tends to look like is below.)
We probably would need to introduce a queue to SyncAdapter in another PR, and maybe note that it is optional. SyncAdapter would probably have more messages pile up in the queue than AsyncAdapter, so it has a higher chance of hitting a memory ceiling when using the queue; the outbox option could default to an internal queue, or be offloaded to userland and managed outside of kafee.
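For context, the userland side of that usually looks something like this sketch (purely illustrative; MyApp.Repo, OutboxMessage, order_changeset, and the drain worker are hypothetical application code, not part of kafee or this PR):

    # The business record and the pending Kafka message commit in one DB transaction;
    # a separate worker later reads the outbox table, publishes through the sync
    # adapter, and only deletes rows after a successful publish.
    Ecto.Multi.new()
    |> Ecto.Multi.insert(:order, order_changeset)
    |> Ecto.Multi.insert(:outbox_message, fn %{order: order} ->
      OutboxMessage.changeset(%OutboxMessage{}, %{
        topic: "orders",
        key: to_string(order.id),
        payload: %{order_id: order.id}
      })
    end)
    |> MyApp.Repo.transaction()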
Checklist
Problem
The brod client crashes everything when it's not able to connect to Kafka in some instances -- ultimately meaning that an app will crash-loop instead of gracefully handling Kafka connection issues.
Details
You can test this by checking out this branch, `path`ing the dependency, and starting/stopping a broker container in a given service. Without this branch, things will 💣. With this, it'll keep trying to recover.

Scenarios that have been tested: