Handling failures in background workers with Elixir and supervisors
Elixir is built on top of the Erlang Virtual Machine. It allows us to write highly available systems that can run practically forever. Does that mean that we don't have to do anything to make our systems reliable?
In our system, we have a worker that pays drivers for their completed jobs.
defmodule Payment.Worker do
  use GenServer

  # One second, in milliseconds.
  @interval 1_000

  def start_link() do
    ...
  end

  def init(_) do
    # Schedule the first run; each run schedules the next one.
    Process.send_after(self(), :work, @interval)
    {:ok, %{interval: @interval}}
  end

  def handle_info(:work, state) do
    Repo.transaction(fn ->
      Job.Completed.fetch()
      |> Enum.map(fn %{job_id: job_id} ->
        :ok = Payment.pay_the_driver(job_id)
        job_id
      end)
      |> Job.Completed.delete_paid()
    end)

    Process.send_after(self(), :work, state.interval)
    {:noreply, state}
  end
end
Let's take a closer look at our Payment.pay_the_driver/1 function to see what it does.
defmodule Payment do
  # The worker calls this function with a bare job id.
  def pay_the_driver(job_id) do
    :ok = pay(job_id)
    :ok = verify_payment(job_id)
  end

  defp verify_payment(job_id) do
    ...
    if over_the_limit?(to_be_paid, already_paid) do
      raise "Invalid payment for job #{job_id}"
    end

    # Return :ok explicitly so the match in pay_the_driver/1 succeeds.
    :ok
  end
end
The system compares money already paid with the amount which a driver should receive. It guarantees that a driver won't receive more than they should.
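The comparison described above could look roughly like the sketch below. The article does not show `over_the_limit?/2`, so the module name `PaymentCheckSketch`, the three-argument `verify_payment/3`, and the choice of integer amounts in the smallest currency unit are all assumptions made for illustration.

```elixir
defmodule PaymentCheckSketch do
  # Hypothetical helper: both amounts are integers in the smallest
  # currency unit (for example cents), which avoids float rounding.
  def over_the_limit?(to_be_paid, already_paid)
      when is_integer(to_be_paid) and is_integer(already_paid) do
    already_paid > to_be_paid
  end

  # Raises when the driver has received more than they should,
  # otherwise returns :ok.
  def verify_payment(job_id, to_be_paid, already_paid) do
    if over_the_limit?(to_be_paid, already_paid) do
      raise "Invalid payment for job #{job_id}"
    end

    :ok
  end
end
```

Paying exactly the expected amount passes the check; only an overpayment raises.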
Unfortunately, developers make mistakes and there's a chance that incorrect code is released. Luckily, the verify_payment/1 function prevents incorrect payments. But what happens to our application if such a scenario occurs and the function raises an error?
To understand the consequences, let's see how the supervision tree works.
In the picture above, the main process supervises its child process - the Worker module.
When starting a supervisor, you can set what happens when one of its children crashes. By default, a supervisor itself crashes when its children are restarted more than 3 times within 5 seconds.
Our supervisor uses these default values:
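# MyApp is a placeholder namespace; a module named plain Application
# would clash with Elixir's own Application module.
defmodule MyApp.Application do
  use Application
  import Supervisor.Spec

  def start(_type, _args) do
    children = [
      worker(Payment.Worker, []),
      ...
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end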
It means that each time the verify_payment/1 function raises an error, our worker will be restarted.
defmodule Payment.Worker do
  use GenServer

  # One second, in milliseconds.
  @interval 1_000

  def start_link() do
    ...
  end

  def init(_) do
    Process.send_after(self(), :work, @interval)
    {:ok, %{interval: @interval}}
  end

  def handle_info(:work, state) do
    ...
  end
end
If it happens more than 3 times in 5 seconds, the main supervisor will also crash. As we can see, after every restart our worker handles its first message one second later. If the logic within it keeps raising an error, the worker will be restarted more than 3 times within 5 seconds and, consequently, our application will crash.
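The effect of the default limits can be reproduced in isolation with the minimal sketch below. It assumes Elixir 1.5 or later (module-based child specs rather than `Supervisor.Spec`), and the names `RestartLimitSketch` and `RestartLimitSketch.CrashingWorker` are made up for this example. The interval is shortened so the demo finishes quickly; the supervisor relies on the defaults of `max_restarts: 3` and `max_seconds: 5`.

```elixir
defmodule RestartLimitSketch.CrashingWorker do
  use GenServer

  def start_link(interval_ms) do
    GenServer.start_link(__MODULE__, interval_ms)
  end

  @impl true
  def init(interval_ms) do
    # Same pattern as the payment worker: schedule the first run.
    Process.send_after(self(), :work, interval_ms)
    {:ok, interval_ms}
  end

  @impl true
  def handle_info(:work, _interval_ms) do
    # Stands in for verify_payment/1 raising on every run.
    raise "Invalid payment"
  end
end

defmodule RestartLimitSketch do
  # Starts a supervisor with the default restart limits and reports
  # whether it terminated within five seconds.
  def run(interval_ms \\ 100) do
    # Trap exits so the linked supervisor's shutdown arrives as a
    # message instead of terminating the calling process.
    previous = Process.flag(:trap_exit, true)

    children = [{RestartLimitSketch.CrashingWorker, interval_ms}]
    {:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

    result =
      receive do
        {:EXIT, ^sup, reason} -> {:supervisor_down, reason}
      after
        5_000 -> :supervisor_still_alive
      end

    Process.flag(:trap_exit, previous)
    result
  end
end
```

Calling `RestartLimitSketch.run(100)` logs a few crash reports and then returns `{:supervisor_down, :shutdown}`: the fourth restart inside the 5-second window exceeds the limit, so the supervisor gives up.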
So what now?
Even if there's a problem with that part of the code, we still want the rest of the application to be up while we're investigating the issue.
That's why we can add a separate supervisor just for our worker:
defmodule Payment.Supervisor do
  import Supervisor.Spec

  def start_link() do
    children = [
      worker(Payment.Worker, [])
    ]

    # Worker interval converted from milliseconds to seconds.
    restart_interval = Payment.Worker.interval() / 1000
    # The supervisor's default max_seconds window.
    default_max_seconds = 5
    # One more restart than the worker can trigger within the window.
    max_restarts = ceil(default_max_seconds / restart_interval) + 1

    opts = [
      max_restarts: max_restarts,
      name: __MODULE__,
      strategy: :one_for_one
    ]

    Supervisor.start_link(children, opts)
  end
end
The supervisor needs to know the worker interval, so we have to replace the @interval attribute in the worker with the interval() public function.
defmodule Payment.Worker do
  use GenServer

  def start_link() do
    ...
  end

  def init(_) do
    Process.send_after(self(), :work, interval())
    {:ok, %{interval: interval()}}
  end

  ...

  def interval(), do: :timer.seconds(1)
end
Now the worker supervisor will crash only if the worker is restarted more than max_restarts times within 5 seconds.
The worker executes its function once per second, so it can fail at most about five times in any 5-second window. With the limit set to 6 restarts in 5 seconds, a worker that fails on every run can no longer bring its supervisor down.
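The arithmetic behind that limit can be checked on its own. The sketch below pulls the same formula into a standalone function; the module name `RestartBudgetSketch` is invented for this example, and `ceil/1` requires Elixir 1.8 or later.

```elixir
defmodule RestartBudgetSketch do
  # Returns one more restart than a worker running every `interval_ms`
  # milliseconds can trigger inside a window of `max_seconds` seconds.
  def max_restarts(interval_ms, max_seconds)
      when interval_ms > 0 and max_seconds > 0 do
    restart_interval_s = interval_ms / 1000
    ceil(max_seconds / restart_interval_s) + 1
  end
end
```

For a one-second interval and the default 5-second window this gives `ceil(5 / 1.0) + 1`, which is 6. Halving the interval to 500 ms raises the result to 11.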
It's time to modify the main supervisor to look after the worker supervisor instead of the worker itself.
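# MyApp is a placeholder namespace; a module named plain Application
# would clash with Elixir's own Application module.
defmodule MyApp.Application do
  use Application
  import Supervisor.Spec

  def start(_type, _args) do
    children = [
      supervisor(Payment.Supervisor, []),
      ...
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end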
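To see the isolation at work, here is a self-contained sketch of the same layout. It is written with module-based supervisors (Elixir 1.5+, with `ceil/1` from 1.8+) instead of `Supervisor.Spec`, and all `IsolationSketch` names are hypothetical. Intervals are scaled down (100 ms worker, 1-second window) so the demo runs quickly, and an `Agent` stands in for the rest of the application.

```elixir
defmodule IsolationSketch.FlakyWorker do
  use GenServer

  @interval_ms 100

  def interval_ms, do: @interval_ms

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    Process.send_after(self(), :work, @interval_ms)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:work, _state), do: raise("simulated payment failure")
end

defmodule IsolationSketch.WorkerSupervisor do
  use Supervisor

  @max_seconds 1

  def start_link(_arg) do
    Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  @impl true
  def init(:ok) do
    interval_ms = IsolationSketch.FlakyWorker.interval_ms()
    # Same formula as in the article, scaled to a 1-second window.
    max_restarts = ceil(@max_seconds * 1000 / interval_ms) + 1

    Supervisor.init([IsolationSketch.FlakyWorker],
      strategy: :one_for_one,
      max_restarts: max_restarts,
      max_seconds: @max_seconds
    )
  end
end

defmodule IsolationSketch do
  # Lets the worker crash repeatedly, then checks that neither the
  # worker supervisor nor its sibling had to be restarted.
  def run(observe_ms \\ 1_500) do
    children = [
      IsolationSketch.WorkerSupervisor,
      %{id: :sibling, start: {Agent, :start_link, [fn -> :healthy end]}}
    ]

    {:ok, top} = Supervisor.start_link(children, strategy: :one_for_one)
    before = pids(top)
    Process.sleep(observe_ms)
    later = pids(top)
    Supervisor.stop(top)

    %{
      sibling_untouched: before.sibling == later.sibling,
      worker_supervisor_untouched: before.worker_sup == later.worker_sup
    }
  end

  defp pids(top) do
    children = Supervisor.which_children(top)
    {_, sibling, _, _} = List.keyfind(children, :sibling, 0)
    {_, worker_sup, _, _} = List.keyfind(children, IsolationSketch.WorkerSupervisor, 0)
    %{sibling: sibling, worker_sup: worker_sup}
  end
end
```

While the worker keeps crashing and restarting, `IsolationSketch.run/1` reports both flags as `true`: the failures stay inside the worker supervisor and the sibling process is never restarted.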
Summing up
It's sometimes hard to avoid temporary failures. You have to make sure that you have a plan for what happens if some parts of the system stop working correctly.