Resilience4clj is a lightweight fault tolerance library set built on top of GitHub's Resilience4j and inspired by Netflix Hystrix. It was designed for Clojure and functional programming with composability in mind.
Read more about the motivation and details of Resilience4clj here.
The Resilience4clj circuit breaker lets you decorate a function call (usually one that may fail because of an external dependency) with a safety mechanism that interrupts the propagation of failures. Once a specified failure rate is reached, the "circuit" opens and no more calls go through until a "grace" period has passed.
The Resilience4clj circuit breaker is implemented on top of GitHub's Resilience4j and therefore inherits some of its benefits:
- Storage of call results in a Ring Bit Buffer without a statistical rolling time window. A successful call is stored as a `0` bit and a failed call is stored as a `1` bit. The Ring Bit Buffer has a configurable fixed size and stores the bits in a `long[]` array. This saves memory compared to a boolean array (example: the Ring Bit Buffer only needs an array of 16 long values, 64 bits each, to store the status of 1024 calls).
- The advantage is that this Circuit Breaker works out-of-the-box for low and high frequency backend systems, because execution results are not dropped when a time window is passed.
- To determine whether a Circuit Breaker can be closed again once it reaches the half-open state, the library performs a configurable number of executions and compares the results against a configurable threshold.
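As a rough illustration of the memory figure in the first bullet, the number of 64-bit words needed to hold one bit per call is a ceiling division. This is plain arithmetic, not code taken from Resilience4j, and `words-needed` is a hypothetical helper name.

```clojure
(defn words-needed
  "Number of 64-bit long values needed to hold one bit per recorded call."
  [calls]
  (quot (+ calls 63) 64))

(words-needed 1024) ;; => 16
(words-needed 100)  ;; => 2
```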
Refer to Implementation Details for a deeper look at the power behind Resilience4Clj.
- Getting Started
- Circuit Breaker Settings
- Fallback Strategies
- Effects
- Metrics
- Events
- Exception Handling
- Composing Further
- Implementation Details
- Bugs
- Help!
Add `resilience4clj/resilience4clj-circuitbreaker` as a dependency to your `deps.edn` file:
resilience4clj/resilience4clj-circuitbreaker {:mvn/version "0.1.3"}
If you are using `lein` instead, add it as a dependency to your `project.clj` file:
[resilience4clj/resilience4clj-circuitbreaker "0.1.3"]
Require the library:
(require '[resilience4clj-circuitbreaker.core :as cb])
Then create a circuit breaker by calling the function `create`. Give it a reference name so that you can later identify its events if you are interested in the health of your breaker:
(def breaker (cb/create "my-first-breaker"))
Now you can decorate any function you have with the circuit breaker you just defined.
For the sake of this example, let's create a function that fails if we tell it to, otherwise it's a simple "Hello World":
(defn may-fail-hello
  ([]
   "Hello World!")
  ([fail?]
   ;; fail only when explicitly told to
   (if fail?
     (throw (ex-info "Hello service failed!" {:reason :was-told-to}))
     (may-fail-hello))))
The function `decorate` takes the potentially failing function just above and the circuit breaker you created, and returns a protected function:
(def protected-hello (cb/decorate may-fail-hello breaker))
When you call `protected-hello`, it should eval to `"Hello World!"` as you would expect:
(protected-hello) ;; => "Hello World!"
When you extract the metrics from your circuit breaker (function `metrics`), this is what you get:
(cb/metrics breaker)
=> {:failure-rate -1.0,
:number-of-buffered-calls 1,
:number-of-failed-calls 0,
:number-of-not-permitted-calls 0,
:max-number-of-buffered-calls 100,
:number-of-successful-calls 1}
The key `:number-of-successful-calls` shows that your breaker has had `1` successful call.
Let's force a failure:
(protected-hello true) ;; => throws ExceptionInfo "Hello service failed!"
Now extracting the metrics would give you:
(cb/metrics breaker)
=> {:failure-rate -1.0,
:number-of-buffered-calls 2,
:number-of-failed-calls 1,
:number-of-not-permitted-calls 0,
:max-number-of-buffered-calls 100,
:number-of-successful-calls 1}
The key `:number-of-failed-calls` went from `0` to `1` and `:number-of-buffered-calls` from `1` to `2`.
By default, the breaker opens only at a failure rate of 50%. Failure rates are calculated only once the ring bit buffer is full (100 calls by default). Therefore, opening the breaker takes a series of failures:
(cb/state breaker) ;;=> :CLOSED
(dotimes [n 98]
  (try
    (protected-hello true)
    (catch Exception e)))
(cb/metrics breaker)
=> {:failure-rate 99.0,
:number-of-buffered-calls 100,
:number-of-failed-calls 99,
:number-of-not-permitted-calls 2,
:max-number-of-buffered-calls 100,
:number-of-successful-calls 1}
(cb/state breaker) ;;=> :OPEN
Now that the breaker is open, any call to `protected-hello` will throw a `CircuitBreakerOpenException`:
(protected-hello) ;;=> throws CircuitBreakerOpenException
You will need to wait 1 minute [configurable] for the circuit breaker to transition from `:OPEN` to `:HALF_OPEN` (where it will tentatively decide whether the backend is back online).
Refer to Exception Handling below for more details.
When creating a circuit breaker, you can fine-tune its settings with these options:
- `:ring-buffer-size-in-closed-state` - the size of the main ring buffer. The ring needs to be filled before failure rates are calculated. Default `100`.
- `:failure-rate-threshold` - the failure rate at which the breaker will open (in percent). Once the circuit breaker transitions from open to half-open, this failure rate is used to decide whether the circuit is good to be closed again (below failure rate) or to remain open (above failure rate). Default `50.0`.
- `:ring-buffer-size-in-half-open-state` - the size of the half-open state ring buffer. This ring buffer is used when the breaker transitions from open to half-open to decide whether the circuit is healthy or not. It's usually smaller than the main ring buffer. Default `10`.
- `:wait-duration-in-open-state` - the time in milliseconds that the breaker should wait before transitioning from open to half-open. Default `60000` (1 minute).
- `:automatic-transition-from-open-to-half-open-enabled?` - if set to `true`, the breaker transitions from open to half-open automatically once the wait duration has elapsed, without needing a call to trigger the transition. Default `false`.
These options can be sent to `create` as a map. In the following example, any function decorated with `breaker` will have a ring buffer of `20` and a failure rate threshold of `10.0`:
(def breaker (cb/create "MyBreaker" {:ring-buffer-size-in-closed-state 20
                                     :failure-rate-threshold 10.0}))
The function `config` returns the configuration of a breaker in case you need to inspect it. Example:
(cb/config breaker)
=> {:failure-rate-threshold 10.0,
:ring-buffer-size-in-closed-state 20,
:ring-buffer-size-in-half-open-state 10,
:wait-duration-in-open-state 60000,
:automatic-transition-from-open-to-half-open-enabled? false}
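The `config` output above is consistent with the options you pass being layered over the documented defaults. A minimal sketch of that idea in plain Clojure follows; `default-settings` and `effective-settings` are hypothetical names used for illustration and are not part of the library.

```clojure
;; Defaults as documented in the settings list above.
(def default-settings
  {:ring-buffer-size-in-closed-state 100
   :failure-rate-threshold 50.0
   :ring-buffer-size-in-half-open-state 10
   :wait-duration-in-open-state 60000
   :automatic-transition-from-open-to-half-open-enabled? false})

(defn effective-settings
  "Layers user-supplied options over the defaults (illustrative only)."
  [opts]
  (merge default-settings opts))

(effective-settings {:ring-buffer-size-in-closed-state 20
                     :failure-rate-threshold 10.0})
```

Evaluating the last form yields a map with the same keys and values as the `config` example above.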
When decorating your function with a circuit breaker you can opt to have a fallback function. This function is called instead of an exception being thrown, either when the circuit breaker is open or when the call itself fails (a traditional throw). This feature can be seen as an obfuscation of a try/catch to consumers.
This is particularly useful if you want to obfuscate from consumers that the circuit breaker is open and/or that the external dependency failed. Example:
(def breaker (cb/create "hello-service"))
(defn hello [person]
  (str "Hello " person))
(def protected-hello
  (cb/decorate hello breaker
               {:fallback (fn [e person]
                            (str "Hello from fallback to " person))}))
The signature of the fallback function is the same as the original function plus an exception as the first argument (`e` in the example above). This exception is an `ExceptionInfo` wrapping around the real cause of the error. You can inspect the `:cause` node of this exception to learn about the inner exception:
(defn fallback-fn [e]
  ;; keyword lookup on an exception returns nil, so read the node from its ex-data
  (str "The cause is " (-> e ex-data :cause)))
For more details on Exception Handling see the section below.
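To make the fallback signature concrete, here is a minimal sketch in plain Clojure of how such a decorator could behave. It is not the library's implementation: `with-fallback` and `failing-hello` are hypothetical names, and storing the inner exception under `:cause` in the `ex-data` map is an assumption based on the description above.

```clojure
(defn with-fallback
  "Hypothetical decorator: calls `f`; on any exception, calls `fallback`
  with a wrapping ExceptionInfo followed by the original arguments."
  [f fallback]
  (fn [& args]
    (try
      (apply f args)
      (catch Exception e
        ;; assumption: the real cause travels in the ex-data map
        (apply fallback (ex-info "Call failed" {:cause e}) args)))))

(defn failing-hello [person]
  (throw (ex-info "Hello service failed!" {:reason :was-told-to})))

(def safe-hello
  (with-fallback failing-hello
    (fn [e person]
      (str "Hello from fallback to " person
           " (cause: " (ex-message (:cause (ex-data e))) ")"))))

(safe-hello "Ana")
;; => "Hello from fallback to Ana (cause: Hello service failed!)"
```

The fallback receives the same arguments as the original call, which is what makes parameter-aware fallbacks possible.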
When considering fallback strategies there are usually three major strategies:
- Failure: the default way for Resilience4clj, which is to just let the exception flow, is called a "Fail Fast" approach (the call will fail fast once the breaker is open). Another approach is "Fail Silently". In this approach the fallback function would simply hide the exception from the consumer (something that can also be done conditionally).
- Content Fallback: some examples of content fallback are returning "static content" (where a failure would always yield the same static content), "stubbed content" (where a failure would yield some kind of related content based on the parameters of the call), or "cached" content (where a cached copy of a previous call with the same parameters could be sent back).
- Advanced: multiple strategies can also be combined in order to create even better fallback strategies.
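The content fallback strategies listed above can be sketched as ordinary functions with the fallback signature (exception first, then the original arguments). All names and the pre-filled `last-good` atom below are hypothetical and only serve as an illustration.

```clojure
(defn static-fallback
  "Static content: always the same answer."
  [e person]
  "Hello!")

(defn stubbed-fallback
  "Stubbed content: derived from the parameters of the call."
  [e person]
  (str "Hello " person " (offline)"))

;; Hypothetical store of previously successful results.
(def last-good (atom {"Ana" "Hello Ana"}))

(defn cached-fallback
  "Cached content, combined with the stubbed strategy when there is no hit."
  [e person]
  (get @last-good person (stubbed-fallback e person)))

(cached-fallback (ex-info "down" {}) "Ana") ;; => "Hello Ana"
(cached-fallback (ex-info "down" {}) "Bob") ;; => "Hello Bob (offline)"
```

`cached-fallback` is also a small instance of the "Advanced" case, since it chains two strategies.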
For more details on some of these strategies, read the section Effects below.
A common issue for some fallback strategies is to rely on a cache or other content source (see Content Fallback above). In these cases, it is good practice to persist the successful output of the function call as a side-effect of the call itself.
Resilience4clj circuit breaker supports this behavior in the following way:
(def breaker (cb/create "hello-service"))
(defn hello [person]
  (str "Hello " person))
(def protected-hello
  (cb/decorate hello breaker
               {:effect (fn [ret person]
                          ;; ret will have the successful return from `hello`
                          ;; you can save it on a memory cache, disk, etc
                          )}))
The signature of the effect function is the same as the original function plus a "return" argument as the first argument (`ret` in the example above). This argument is the successful return of the encapsulated function.
The effect function is called on a separate thread so it is non-blocking.
You can see an example of how to use effects for caching purposes at using Resilience4clj cache as an effect.
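As a small sketch of the pattern described in this section, the pair of functions below uses an atom as an in-memory cache: one function has the effect signature and stores successful results, the other has the fallback signature and reads them back. The names `cache`, `remember!` and `recall` are hypothetical, and the functions are called directly here rather than through the breaker.

```clojure
(def cache (atom {}))

(defn remember!
  "Effect signature: stores the successful return under the call's arguments."
  [ret & args]
  (swap! cache assoc (vec args) ret))

(defn recall
  "Fallback signature: returns the cached value for these arguments,
  or rethrows the exception when nothing was cached."
  [e & args]
  (if-let [hit (find @cache (vec args))]
    (val hit)
    (throw e)))

;; simulate one successful call followed by a failure
(remember! "Hello Ana" "Ana")
(recall (ex-info "service down" {}) "Ana") ;; => "Hello Ana"
```

Using `find` instead of `get` lets a cached `nil` result count as a hit.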
The function `metrics` returns a map with the metrics of the circuit breaker:
(cb/metrics breaker)
=> {:failure-rate 20.0,
:number-of-buffered-calls 100,
:number-of-failed-calls 20,
:number-of-not-permitted-calls 0,
:max-number-of-buffered-calls 100,
:number-of-successful-calls 80}
The nodes should be self-explanatory. The trickiest one is `:failure-rate`. It is only computed once the ring buffer is full (in the example above it is: `80` out of `100` calls succeeded and `20` out of `100` failed, therefore the failure rate is `20.0`). If the ring buffer were not full, the failure rate would be "not computed", which is indicated as `-1.0`.
Another interesting node is `:number-of-not-permitted-calls`. It shows the number of calls that were "blocked" by the circuit breaker when it was open. In the above example, no call has been rejected.
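When consuming these metrics programmatically, the `-1.0` sentinel is easy to mistake for a real rate. The helper below is a hypothetical sketch that turns a metrics map of the shape shown above into a few explicit flags; it only reads keys documented in this section.

```clojure
(defn summarize
  "Derives explicit flags from a metrics map (illustrative helper)."
  [{:keys [failure-rate
           number-of-buffered-calls
           max-number-of-buffered-calls
           number-of-not-permitted-calls]}]
  {:buffer-full? (= number-of-buffered-calls max-number-of-buffered-calls)
   ;; nil instead of -1.0 while the rate has not been computed yet
   :failure-rate (when-not (neg? failure-rate) failure-rate)
   :rejected?    (pos? number-of-not-permitted-calls)})

(summarize {:failure-rate 20.0
            :number-of-buffered-calls 100
            :number-of-failed-calls 20
            :number-of-not-permitted-calls 0
            :max-number-of-buffered-calls 100
            :number-of-successful-calls 80})
;; => {:buffer-full? true, :failure-rate 20.0, :rejected? false}
```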
Another useful metadata function is `state`. It returns the current state of the circuit breaker:
(cb/state breaker) ;;=> :CLOSED
There are three normal states: `:CLOSED`, `:OPEN` and `:HALF_OPEN`, and two special states: `:DISABLED` and `:FORCED_OPEN`. For more details about them read Implementation Details below.
The state, ring buffer, and metrics can be reset with a call to the `reset!` function:
(cb/reset! breaker)
You can listen to events generated by your circuit breakers. This is particularly useful for logging, debugging, or monitoring the health of your breakers.
(def breaker (cb/create "my-breaker"))
(cb/listen-event breaker
                 (fn [evt]
                   (println (str "Received event " (:event-type evt)))))
There are six types of events:
- `:SUCCESS` - the call has been successful
- `:ERROR` - the call has failed
- `:NOT_PERMITTED` - the call has been rejected due to an open breaker
- `:STATE_TRANSITION` - the breaker transitioned from one state to another
- `:IGNORED_ERROR` - an error has been ignored by the circuit breaker
- `:RESET` - the circuit breaker has been reset
As an alternative to listening to all events as above, you can listen to one specific type of event by specifying the event type you want:
(def breaker (cb/create "my-breaker"))
(cb/listen-event breaker
                 :ERROR
                 (fn [evt]
                   (println "An error has occurred")))
All event handlers receive a map containing the `:event-type`, the `:circuit-breaker-name` and the event `:creation-time`. Error events also carry a `:throwable` node containing a copy of the exception thrown. Successful and error events have `:ellapsed-duration` in nanoseconds. Finally, state transition events also have `:from-state` and `:to-state`.
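A handler usually branches on `:event-type` and reads the extra nodes that only some events carry. The function below is a hypothetical sketch of such a handler body; the sample event maps are made up for illustration and only use the keys described above.

```clojure
(defn describe-event
  "Builds a one-line description of an event map (illustrative helper)."
  [{:keys [event-type circuit-breaker-name] :as evt}]
  (case event-type
    :STATE_TRANSITION (str circuit-breaker-name ": "
                           (:from-state evt) " -> " (:to-state evt))
    :ERROR            (str circuit-breaker-name ": error "
                           (some-> (:throwable evt) ex-message))
    ;; every other event type gets a generic description
    (str circuit-breaker-name ": " (name event-type))))

(describe-event {:event-type :STATE_TRANSITION
                 :circuit-breaker-name "my-breaker"
                 :from-state :CLOSED
                 :to-state :OPEN})
;; => "my-breaker: :CLOSED -> :OPEN"
```

Such a function could then be passed to `listen-event`, for example composed with `println`.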
If you are not using a fallback function (see Fallback Strategies for more details) and the circuit breaker is open, an instance of `CircuitBreakerOpenException` will be thrown. If this is too intrusive on your systems, do consider using a fallback function instead.
When using the fallback function, be aware that its signature is the same as the original function plus an exception as the first argument. This exception is an `ExceptionInfo` wrapping around the real cause of the error. You can inspect the `:cause` node of this exception to learn about the inner exception; see Fallback Strategies for an example.
Resilience4clj is composed of several modules that easily compose together. For instance, if you are also using the time limiter and assuming your import and basic settings look like this:
(ns my-app
  (:require [resilience4clj-circuitbreaker.core :as cb]
            [resilience4clj-timelimiter.core :as tl]))
;; create time limiter with default settings
(def limiter (tl/create))
;; create circuit breaker with default settings
(def breaker (cb/create "HelloService"))
;; slow function you want to limit
(defn slow-hello []
  (Thread/sleep 1500)
  "Hello World!")
Then you can create a protected call that combines both the time limiter and the circuit breaker:
(def protected-hello (-> slow-hello
                         (tl/decorate limiter)
                         (cb/decorate breaker)))
The resulting function `protected-hello` will now trigger the breaker in case of a timeout.
TBD: list and links to all modules
The Circuit Breaker is implemented via a finite state machine with three normal states: `:CLOSED`, `:OPEN` and `:HALF_OPEN`, and two special states: `:DISABLED` and `:FORCED_OPEN`.
The state of the Circuit Breaker changes from `:CLOSED` to `:OPEN` when the failure rate is above a [configurable] threshold. Then, all access to the function call is blocked for a [configurable] amount of time.
The Circuit Breaker uses a Ring Bit Buffer in the `:CLOSED` state to store the success or failure statuses of the calls. A successful call is stored as a `0` bit and a failed call is stored as a `1` bit. The Ring Bit Buffer has a [configurable] fixed size. Internally, the Ring Bit Buffer uses a BitSet-like data structure to store the bits, which saves memory compared to a boolean array. The BitSet uses a `long[]` array to store the bits. That means the BitSet only needs an array of 16 long (64-bit) values to store the status of 1024 calls.
The following diagram shows what a Ring Buffer would look like for only 12 results:
The Ring Bit Buffer must be full before the failure rate can be calculated. For example, if the size of the Ring Buffer is 10, then at least 10 calls must be evaluated before the failure rate can be calculated. If only 9 calls have been evaluated, the Circuit Breaker will not trip open even if all 9 calls have failed.
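The "full before computed" rule can be sketched with a simplified ring buffer in plain Clojure. This is an illustration only: it keeps the bits in a vector rather than a packed `long[]`, and the names `ring-buffer`, `record` and `failure-rate` are hypothetical rather than Resilience4j's.

```clojure
(defn ring-buffer
  "Creates an empty ring buffer that remembers the last `size` call results."
  [size]
  {:size size :bits (vec (repeat size 0)) :index 0 :filled 0})

(defn record
  "Stores a result (0 = success, 1 = failure), overwriting the oldest slot."
  [{:keys [size index filled] :as buf} bit]
  (-> buf
      (assoc-in [:bits index] bit)
      (assoc :index (mod (inc index) size)
             :filled (min size (inc filled)))))

(defn failure-rate
  "Failure rate in percent, or -1.0 while the buffer is not yet full."
  [{:keys [size bits filled]}]
  (if (< filled size)
    -1.0
    (* 100.0 (/ (reduce + bits) size))))

;; nine failures in a buffer of ten: the rate is still not computed
(failure-rate (reduce record (ring-buffer 10) (repeat 9 1)))  ;; => -1.0
;; the tenth call fills the buffer
(failure-rate (reduce record (ring-buffer 10) (repeat 10 1))) ;; => 100.0
```

Because old slots are overwritten in place, results only leave the buffer when newer calls replace them, not when time passes.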
After the time duration has elapsed, the Circuit Breaker state changes from `:OPEN` to `:HALF_OPEN` and allows calls to see if the backend is still unavailable or has become available again. The Circuit Breaker uses another [configurable] Ring Bit Buffer to evaluate the failure rate in the `:HALF_OPEN` state. If the failure rate is above the configured threshold, the state changes back to `:OPEN`. If the failure rate is below or equal to the threshold, the state changes back to `:CLOSED`.
The Circuit Breaker supports resetting to its original state, losing all the metrics and effectively resetting its Ring Bit Buffer.
The Circuit Breaker supports two more special states, `:DISABLED` (always allow access) and `:FORCED_OPEN` (always deny access). In these two states no Circuit Breaker events (apart from the state transition) are generated, and no metrics are recorded. The only way to exit from those states is to trigger a state transition or to reset the Circuit Breaker.
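The transitions described in this section can be summarized as a small pure function. This is a sketch of the rules as stated here, not the actual Resilience4j state machine: the event names `:call-recorded` and `:wait-elapsed` are hypothetical, and a failure rate of `-1.0` stands for "not computed yet".

```clojure
(defn next-state
  "Returns the state that follows `state` after `event`."
  [state event {:keys [failure-rate threshold]}]
  (case [state event]
    [:CLOSED :call-recorded]    (if (> failure-rate threshold) :OPEN :CLOSED)
    [:OPEN :wait-elapsed]       :HALF_OPEN
    [:HALF_OPEN :call-recorded] (cond
                                  (neg? failure-rate)        :HALF_OPEN
                                  (> failure-rate threshold) :OPEN
                                  :else                      :CLOSED)
    ;; :DISABLED, :FORCED_OPEN and any other combination stay where they are
    state))

(next-state :CLOSED :call-recorded {:failure-rate 99.0 :threshold 50.0})
;; => :OPEN
(next-state :HALF_OPEN :call-recorded {:failure-rate 50.0 :threshold 50.0})
;; => :CLOSED
```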
If you find a bug, submit a GitHub issue.
This project is looking for team members who can help this project succeed! If you are interested in becoming a team member please open an issue.
Copyright © 2019 Tiago Luchini
Distributed under the MIT License.