Resilience Capabilities
The SAP Cloud SDK for Java provides abstractions for some frequently used resilience patterns like timeout, retry, rate limiter, or circuit breaker. Applying such patterns helps to make an application more resilient against failures it might encounter.
The following article describes which resilience features the SAP Cloud SDK offers and how to apply them. If you are looking for a quick start with resilience, also check out our dedicated tutorial on the topic!
Using the Resilience API
To make use of the resilience capabilities of the SAP Cloud SDK, add the following dependency to your project:
<dependency>
    <groupId>com.sap.cloud.sdk.cloudplatform</groupId>
    <artifactId>resilience</artifactId>
</dependency>
The SAP Cloud SDK allows running any code in the context of one or more resilience patterns. There are two essential building blocks for achieving this:
- The ResilienceConfiguration, which determines which patterns should be applied.
- The ResilienceDecorator, which applies the configuration to an operation.
The fluent Resilience Configuration API provides builders that help with assembling different resilience patterns and their associated parameters. Which patterns are available and how to use them is explained in the dedicated section below.
The Resilience Decorator is capable of applying such a configuration to a given Callable or Supplier.
Executing Operations
Consider the following code:
result = ResilienceDecorator.executeSupplier(() -> operation(), configuration);
This code executes operation() in a resilient manner according to a ResilienceConfiguration.
The decorator applies all patterns configured in configuration, together with the logic needed to combine them.
Some resilience patterns are applied over multiple executions of the same operation. For example, the circuit breaker will prevent further executions if a significant portion of previous attempts failed.
To understand how the SAP Cloud SDK applies this concept, consider the following snippet:
configuration1 = ResilienceConfiguration.of("config-id-1");
configuration2 = ResilienceConfiguration.of("config-id-1");
configuration3 = ResilienceConfiguration.of("config-id-2");
ResilienceDecorator.executeSupplier(() -> operation(), configuration1);
ResilienceDecorator.executeSupplier(() -> operation(), configuration1);
ResilienceDecorator.executeSupplier(() -> operation(), configuration2);
ResilienceDecorator.executeSupplier(() -> operation(), configuration3);
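Here, the first three executions all share the same "resilience state". This means that they share the same instance of a circuit breaker or bulkhead. The state is shared via the identifier of the associated configuration: configuration1 and configuration2 both use "config-id-1", while the fourth execution uses "config-id-2" and therefore has its own state.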
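The idea of state shared by identifier can be illustrated with a plain-Java sketch. This is not the SDK's implementation: the class SharedStateSketch and its method recordExecution are hypothetical names, and a simple execution counter stands in for real circuit breaker or bulkhead state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not SDK code: state is looked up by identifier,
// so equal identifiers always resolve to the same state object.
final class SharedStateSketch {
    // One counter per configuration identifier, created on first use.
    private final Map<String, AtomicInteger> executionsById = new ConcurrentHashMap<>();

    // Counts one execution for the identifier and returns the running total.
    int recordExecution(String identifier) {
        return executionsById
            .computeIfAbsent(identifier, id -> new AtomicInteger())
            .incrementAndGet();
    }
}
```

Calling recordExecution with "config-id-1" three times and then once with "config-id-2" returns 1, 2, 3 and 1, which mirrors the four executions in the snippet above.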
Operation Types
The decorator operates with two kinds of operations:
Callable | May throw checked or unchecked Exceptions |
Supplier | May only throw unchecked Exceptions |
The difference lies in the signatures: Callable declares a checked exception, while Supplier does not. Choose whichever fits your use case best.
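The difference between the two types shows up as soon as the operation declares a checked exception. The following sketch uses only java.util.concurrent.Callable and java.util.function.Supplier from the JDK; the class OperationTypesSketch and the operation readValue are hypothetical and exist only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

final class OperationTypesSketch {
    // Hypothetical operation that declares a checked exception.
    static String readValue(boolean fail) throws Exception {
        if (fail) {
            throw new Exception("could not read value");
        }
        return "value";
    }

    // Callable.call() declares "throws Exception", so the lambda may
    // invoke the operation directly.
    static Callable<String> asCallable(boolean fail) {
        return () -> readValue(fail);
    }

    // Supplier.get() declares no checked exception, so the lambda has to
    // convert the checked exception into an unchecked one.
    static Supplier<String> asSupplier(boolean fail) {
        return () -> {
            try {
                return readValue(fail);
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        };
    }
}
```

If the operation already throws only unchecked exceptions, a Supplier avoids the extra exception handling at the call site.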
Execution Variants
The decorator allows for three different ways of applying a configuration:
Execute | Immediately runs the operation |
Decorate | Returns a new operation to be run later |
Queue | Immediately runs the operation asynchronously |
If your operation should run asynchronously, we highly recommend that you use the queue functionality.
The decorator will ensure the Thread Context with Tenant and Principal information is propagated correctly to new Threads.
Note that the Resilience Decorator will try to propagate the current Thread Context at the time the decorator is invoked.
This is important when you are decorating a Callable or Supplier and running it later.
The Thread Context must be available whenever decorateCallable or decorateSupplier is evaluated.
If the call to ResilienceDecorator should take place asynchronously, you should follow these steps to ensure the Thread Context is available.
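Why the context has to be available at decoration time can be seen in a small plain-Java model of the three variants. This is a sketch under assumptions, not the SDK's code: the class VariantsSketch is hypothetical, and a single ThreadLocal named TENANT stands in for the Thread Context, which in the SDK carries more than a tenant identifier.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical model of execute, decorate and queue; not SDK code.
final class VariantsSketch {
    // Stand-in for the Thread Context of the current thread.
    static final ThreadLocal<String> TENANT = new ThreadLocal<>();

    // Decorate: capture the context now, return an operation to run later.
    static <T> Supplier<T> decorate(Supplier<T> operation) {
        final String captured = TENANT.get(); // read at decoration time
        return () -> {
            final String previous = TENANT.get();
            TENANT.set(captured);
            try {
                return operation.get();
            } finally {
                // Restore whatever the executing thread had before.
                if (previous == null) {
                    TENANT.remove();
                } else {
                    TENANT.set(previous);
                }
            }
        };
    }

    // Execute: decorate and run immediately on the calling thread.
    static <T> T execute(Supplier<T> operation) {
        return decorate(operation).get();
    }

    // Queue: decorate on the calling thread, then run asynchronously.
    static <T> CompletableFuture<T> queue(Supplier<T> operation) {
        return CompletableFuture.supplyAsync(decorate(operation));
    }
}
```

In this model, an operation decorated while TENANT is set still sees that value when it runs later or on another thread. If decorate were called from a thread without a context, there would be nothing to capture, which is the situation the paragraph above warns about.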
Failures and Fallbacks
An operation might fail for two reasons:
- The operation itself encounters a failure and throws an error or exception
- A resilience pattern causes the operation to fail (e.g. the circuit breaker prevents further invocations)
The SAP Cloud SDK wraps all kinds of checked and unchecked exceptions into a ResilienceRuntimeException and throws it.
To deal with failures, one can either catch the ResilienceRuntimeException or provide a fallback function:
result = ResilienceDecorator.executeCallable(() -> operation(), configuration,
    (throwable) -> {
        // Invoked only if the operation or a resilience pattern fails.
        log.debug("Encountered a failure in operation: ", throwable);
        log.debug("Proceeding with fallback value: {}", fallback);
        return fallback;
    });
In the case of Callable, this relieves you of the need to catch the exception at the outer level.
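The two ways of handling a failure can be modeled in plain Java as follows. This is only a sketch of the behavior described above: FallbackSketch is a hypothetical class, and its nested WrappedFailure is a stand-in for the SDK's ResilienceRuntimeException, not the real type.

```java
import java.util.concurrent.Callable;
import java.util.function.Function;

// Hypothetical model of failure wrapping and fallbacks; not SDK code.
final class FallbackSketch {
    // Stand-in for the SDK's ResilienceRuntimeException.
    static final class WrappedFailure extends RuntimeException {
        WrappedFailure(Throwable cause) {
            super(cause);
        }
    }

    // Without fallback: any exception is wrapped into an unchecked one.
    static <T> T execute(Callable<T> operation) {
        try {
            return operation.call();
        } catch (Exception e) {
            throw new WrappedFailure(e);
        }
    }

    // With fallback: the failure is handed to the fallback function,
    // whose return value becomes the result.
    static <T> T execute(Callable<T> operation, Function<Throwable, T> fallback) {
        try {
            return execute(operation);
        } catch (WrappedFailure e) {
            return fallback.apply(e);
        }
    }
}
```

With the two-argument variant, the caller needs neither a try/catch block nor a throws clause, even though the Callable may throw a checked exception.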
Building a Resilience Configuration
A new ResilienceConfiguration with default values is created by providing an identifier for it:
configuration = ResilienceConfiguration.of("identifier");
The identifier can be either a string or a class. In the case of the latter, the (full) class name will be used as the identifier. The identifier is used to apply resilience patterns across multiple invocations of an operation.
Check the Javadoc for which patterns and parameters will be applied by default. You can also create a configuration with all patterns disabled:
configuration = ResilienceConfiguration.empty("identifier");
Individual resilience patterns are configured via dedicated builder classes like TimeLimiterConfiguration and are added to the configuration via dedicated setters, e.g. timeLimiterConfiguration().
For details see the list of Resilience Capabilities below.
Multi Tenancy
The SAP Cloud SDK is capable of applying the different resilience patterns in a tenant and principal aware manner. Consider for example the Bulkhead pattern which limits the number of parallel executions. If the operation is tenant-specific then you would probably want to avoid one tenant blocking all others.
For this reason, the SAP Cloud SDK by default isolates resilience patterns based on tenant and principal, if they are available. This strategy can be configured, e.g. for running without any isolation use:
configuration.isolationMode(ResilienceIsolationMode.NO_ISOLATION);
Other than no isolation there are essentially two modes for tenant and/or principal isolation:
Required | Always isolates on tenant and/or principal level, will throw an exception if no tenant/principal is available |
Optional | Only isolates if tenant and/or principal information is available |
Details can be found in the API reference of ResilienceIsolationMode.
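One way to picture isolation is as an extension of the key under which resilience state is stored. The sketch below is an assumption about the concept, not a description of the SDK's internals: the class IsolationKeySketch, its Mode enum and the key format are all hypothetical, and only the tenant is modeled (a principal would be handled analogously).

```java
// Hypothetical sketch of tenant isolation; not SDK code.
final class IsolationKeySketch {
    // Illustrative modes; the real options are listed in ResilienceIsolationMode.
    enum Mode { NO_ISOLATION, TENANT_OPTIONAL, TENANT_REQUIRED }

    // Derives the key under which resilience state would be stored.
    // tenantId may be null if no tenant is available.
    static String stateKey(String identifier, Mode mode, String tenantId) {
        if (mode == Mode.NO_ISOLATION) {
            return identifier; // all tenants share one state
        }
        if (tenantId == null) {
            if (mode == Mode.TENANT_REQUIRED) {
                throw new IllegalStateException("No tenant available for " + identifier);
            }
            return identifier; // optional: fall back to shared state
        }
        return identifier + "|tenant=" + tenantId; // one state per tenant
    }
}
```

With such a key, two tenants using the same configuration identifier end up with separate bulkheads, so one tenant exhausting its limit does not block the other.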
Resilience Capabilities
The following resilience patterns are available and can be configured in a Resilience Configuration:
Timeout | TimeLimiterConfiguration | Limit how long an operation may run before it should be interrupted |
Rate Limiter | RateLimiterConfiguration | Limit the number of operations accepted in a window of time |
Retry | RetryConfiguration | Retry a failed operation a limited number of times before failing |
Circuit Breaker | CircuitBreakerConfiguration | Reject attempts if too many failures occurred in the past |
Bulkhead (also known as Shed Load or Load Shedding) | BulkheadConfiguration | Limit how many instances of this operation may run in parallel |
You can find good explanations of how the individual patterns behave in the documentation of resilience4j, which the SAP Cloud SDK uses under the hood to perform resilient operations.
Be aware that the patterns interact with each other. They are applied in the following order:
Fallback ( Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) ) )
Reading from right to left shows the order in which the aspects are applied.
For example, Bulkhead is the first aspect applied, while Fallbacks are called last.
Hence, exceptions are also propagated from right to left.
Based on the order, the following inferences (not exhaustive) can be made:
- Every timeout will trigger a retry, if configured.
- The fallback function is only considered once all retries have failed.
You can get more details in the Resilience4j official documentation.
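The two inferences above follow directly from the nesting and can be reproduced with a minimal plain-Java sketch of Fallback ( Retry ( TimeLimiter ( Function ) ) ). The class DecorationOrderSketch and its three helper methods are hypothetical; they are not how the SDK or resilience4j implement these patterns, and the circuit breaker, rate limiter and bulkhead are left out for brevity.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of nested decoration; not SDK or resilience4j code.
final class DecorationOrderSketch {

    // Innermost wrapper: fails if the operation takes longer than the timeout.
    static <T> Supplier<T> withTimeLimit(Supplier<T> operation, long timeoutMillis) {
        return () -> {
            CompletableFuture<T> future = CompletableFuture.supplyAsync(operation);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                throw new IllegalStateException("Operation timed out", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("Operation failed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        };
    }

    // Middle wrapper: re-runs the inner operation until it succeeds or the
    // attempts are exhausted, then rethrows the last failure.
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e; // a timeout surfaces here like any other failure
                }
            }
            throw last;
        };
    }

    // Outermost wrapper: only sees a failure after the retries have given up.
    static <T> Supplier<T> withFallback(Supplier<T> inner, Function<RuntimeException, T> fallback) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                return fallback.apply(e);
            }
        };
    }
}
```

Composing the wrappers as withFallback(withRetry(withTimeLimit(operation, 100), 3), e -> defaultValue) means that each timeout is handled by the retry wrapper first, and the fallback runs only after the third attempt has failed. Here operation and defaultValue are placeholders.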