Expert Tips
Zeebe: Cloud Native Workflow Orchestration and Decision Engine - How We Solved Key Challenges.
Years of Expertise
5 years
Skills
Development, Performance Tuning, Deployment
Authors
Sergey Grishin, Kirill Larionov, Vadim Eidlin
About Us
V4Scale is not just a company; it's an innovation powerhouse that enhances the R&D capabilities of prominent Israeli and US technology firms. With headquarters in Tel Aviv, we leverage the incredible talents of our diverse global remote workforce, welcoming candidates from any location worldwide.
We are experts in many open-source projects, one of which is Zeebe. Zeebe is the cloud-native workflow and decision engine that powers Camunda. Our expertise extends beyond building products on the Zeebe engine; we also specialize in tuning its performance, scaling it, securing it, and making it more resilient.
In this guide, we share our experience solving rare Zeebe runtime issues, with insights and solutions drawn from our extensive work with the engine.
Issues and How to Solve Them
Issue #1:
If your nodes or network are heavily loaded, the membership probe messages that confirm a Zeebe instance is operational may be lost or delayed. The default timeout for these probes is 100 ms, which may not be sufficient under such conditions.
Exception:
- java.util.concurrent.TimeoutException: Request atomix-membership-probe to zeebe.svc:26502 timed out in PT0.1S
Solution:
- The ZEEBE_GATEWAY_CLUSTER_MEMBERSHIP_PROBETIMEOUT configuration allows you to adjust this timeout value.
- If the communication channel is weak, you can also try enabling compression with ZEEBE_GATEWAY_CLUSTER_MESSAGECOMPRESSION. This will increase the CPU load but reduce the traffic between Zeebe instances.
Issue #2:
You have developed your own exporter for Zeebe, and its event processing is slow enough to slow down workflow processing in Zeebe itself.
Solution:
- Move the event recording logic into a separate thread or even a separate service. When the exporter receives an event, it can hand the event off (save it to a queue or other temporary store, or send it to a separate service that records events) and then mark it as processed by calling controller.updateLastExportedRecordPosition().
- If you need to process events periodically, do it in a separate thread instead of using controller.scheduleCancellableTask(). The latter may negatively impact Zeebe performance and slow down workflow processing.
Issue #3: Timeout Errors in Zeebe with High Load of Workflows
When you start a large number of workflows per second in Zeebe, a timeout set in your Go code can expire before the response arrives. The workflow then completes successfully in Zeebe, but the caller receives a timeout error.
Solution:
- Increase Timeouts: Ensure that the timeouts in your Go code are long enough to accommodate potential network and processing delays on the Zeebe side. Consider extending the context timeout in your code.
- Pause Workflow Initiation: Alternatively, set a sufficiently large timeout and measure how long each initiation takes. If an operation exceeds a specific limit, for example 1 second, temporarily pause new workflow initiations to reduce the load on Zeebe and ensure smoother processing.
Adjusting these parameters can mitigate timeout errors and enhance the reliability of your workflows in high-load scenarios with Zeebe.
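The two suggestions above can be combined in one loop. The following is a minimal sketch, not a definitive implementation: StartFunc is a hypothetical stand-in for whatever client call creates the workflow instance, the Throttle type and its values are illustrative assumptions, and the demo function simulates Zeebe with a fake that is slow on every third call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// StartFunc is a hypothetical stand-in for the client call that
// creates one workflow instance. It must honor ctx cancellation.
type StartFunc func(ctx context.Context) error

// Throttle holds illustrative tunables; the values are assumptions.
type Throttle struct {
	Timeout   time.Duration // generous per-request deadline
	SlowLimit time.Duration // latency above which we back off
	Pause     time.Duration // how long to stop starting workflows
}

// StartAll starts n instances one after another. Each call gets its own
// deadline, and whenever a call takes longer than SlowLimit the loop
// pauses before the next start. It returns the number of pauses taken.
func StartAll(parent context.Context, n int, start StartFunc, cfg Throttle) (int, error) {
	pauses := 0
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithTimeout(parent, cfg.Timeout)
		began := time.Now()
		err := start(ctx)
		elapsed := time.Since(began)
		cancel() // release the timer as soon as the call returns
		if err != nil {
			return pauses, fmt.Errorf("start %d failed after %v: %w", i, elapsed, err)
		}
		if elapsed > cfg.SlowLimit {
			pauses++
			time.Sleep(cfg.Pause)
		}
	}
	return pauses, nil
}

// demo simulates a loaded engine: every third call is slow.
func demo() (int, error) {
	calls := 0
	fake := func(ctx context.Context) error {
		calls++
		delay := time.Millisecond
		if calls%3 == 0 {
			delay = 50 * time.Millisecond
		}
		select {
		case <-time.After(delay):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	cfg := Throttle{
		Timeout:   time.Second,
		SlowLimit: 20 * time.Millisecond,
		Pause:     10 * time.Millisecond,
	}
	return StartAll(context.Background(), 6, fake, cfg)
}

func main() {
	pauses, err := demo()
	fmt.Println("pauses:", pauses, "err:", err)
}
```

In a real application the fake would be replaced by the actual instance-creation call, and SlowLimit would be closer to the 1 second mentioned above. Measuring latency on the caller side keeps the throttling independent of any particular client version.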
Issue #4: Failed to Write to Zeebe Partition(s) - Partition is Full
Solution:
- To handle this, make sure Zeebe’s backpressure mechanism is enabled so that the engine can reject requests under high load, and implement the following logic in your application code: when you receive a backpressure error from Zeebe, temporarily stop initiating new workflows and allow the engine to finish processing the workflows that are already running. This gives Zeebe the time it needs to manage and clear the partition, ensuring smoother workflow execution.
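The stop-and-wait logic described above might look like the sketch below. It makes several assumptions: ErrBackpressure is a hypothetical sentinel error used so the example stays self-contained (with the real client, a backpressure rejection typically surfaces as the gRPC status code RESOURCE_EXHAUSTED, and the check would be made on that code instead), and the doubling wait, the cap, and the retry limit are illustrative choices rather than recommended values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrBackpressure is a hypothetical sentinel standing in for the
// backpressure rejection returned by the engine.
var ErrBackpressure = errors.New("backpressure: partition is full")

// startWithBackoff calls start until it succeeds. On a backpressure
// error it stops sending, waits, and doubles the wait (capped at
// maxWait) before trying again. Any other error is returned at once.
// It returns how many retries were needed.
func startWithBackoff(start func() error, initial, maxWait time.Duration, maxRetries int) (int, error) {
	wait := initial
	for retries := 0; ; retries++ {
		err := start()
		if err == nil {
			return retries, nil
		}
		if !errors.Is(err, ErrBackpressure) {
			return retries, err // not a load problem, do not retry
		}
		if retries == maxRetries {
			return retries, fmt.Errorf("giving up after %d retries: %w", retries, err)
		}
		time.Sleep(wait) // let the engine drain already running work
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}

// demo simulates an engine that rejects the first three requests.
func demo() (int, error) {
	calls := 0
	fake := func() error {
		calls++
		if calls <= 3 {
			return ErrBackpressure
		}
		return nil
	}
	return startWithBackoff(fake, time.Millisecond, 8*time.Millisecond, 5)
}

func main() {
	retries, err := demo()
	fmt.Println("retries:", retries, "err:", err)
}
```

If several goroutines start workflows concurrently, they would all need to observe the pause, for example by sharing a single gate; this sketch covers only the single-caller case.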