Skip to content

fix(engine): isolate search store from async load crashes#369

Open
QuinnWilton wants to merge 1 commit intoelixir-lang:mainfrom
QuinnWilton:fix/store-async-load-isolation
Open

fix(engine): isolate search store from async load crashes#369
QuinnWilton wants to merge 1 commit intoelixir-lang:mainfrom
QuinnWilton:fix/store-async-load-isolation

Conversation

@QuinnWilton
Copy link

@QuinnWilton QuinnWilton commented Feb 10, 2026

This is part of a set of 4 PRs that arose out of some static analysis tooling I'm working on:

That means that these aren't crashes or issues that I've observed in practice, however based on my reading of the code, they do represent issues worth addressing.

Problem:

State.prepare_backend_async/2 uses Task.async/1 to build the search index. If this task raises, the Store crashes, and the task silently fails. The store is supervised under a :one_for_one strategy, and as far as I can tell, no indexed data is permanently lost, however:

  1. During the restart window, callers of the Store will be returned :noproc instead of {:error, :loading}, and the crash will cascade
  2. If any project_compiled events are dispatched between the crash and the restart, these will be missed by the Store. Since these events are never replayed, the event will be lost, and the server will remain unloaded until the next compilation

Solution

By using Task.Supervisor.async_nolink/2 to start the search index task under a dedicated Engine.TaskSupervisor, we can instead handle a crash in the task by logging the error, and clearing async_load_ref, so that the server is immediately ready to process further requests.

During this time, the server remains available, callers are isolated from the crash, no project_compiled events are lost while waiting for the restart, and previously loaded indexes remain loaded.

In the successful case, the monitor is flushed to avoid :DOWN messages building up and causing a mailbox leak.

Replace Task.async with Task.Supervisor.async_nolink in
State.prepare_backend_async/2 so a crashing index build no longer
kills the Store GenServer.

Add a {:DOWN, ...} handler that logs the error and clears
async_load_ref, allowing the Store to retry on the next
project_compiled event. Demonitor the ref on successful completion
to avoid processing the normal DOWN.
@mhanberg mhanberg force-pushed the fix/store-async-load-isolation branch from ac24ce1 to 291a8b8 Compare February 13, 2026 19:29
Comment on lines +490 to +491
# Allow time for the task to crash and the DOWN message to be processed.
Process.sleep(100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest having the function that raises send a messsage to the test process instead, and we can assert_receive on that instead of waiting.

{:DOWN, ref, :process, _pid, reason},
{update_ref, %State{async_load_ref: ref} = state}
) do
Logger.error("Search index async load crashed: #{inspect(reason)}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Logger.error("Search index async load crashed: #{inspect(reason)}")
Logger.error("Failed to prepare search store backend: #{inspect(reason)}")

assert_eventually alive?()

on_exit(fn ->
after_each_test(Ets, project)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
after_each_test(Ets, project)
destroy_backend(Ets, project)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this after each test function can probably be deleted, but we can deal with that in a different pr (not you, just as a note for myself)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants