Expert Tips

Zeebe: Cloud Native Workflow Orchestration and Decision Engine - How We Solved Key Challenges.

Years of Expertise

5 years

Skills

Development, Performance Tuning, Deployment

Authors

Sergey Grishin, Kirill Larionov, Vadim Eidlin

About Us

V4Scale is not just a company; it's an innovation powerhouse that enhances the R&D capabilities of prominent Israeli and US technology firms. With headquarters in Tel Aviv, we leverage the incredible talents of our diverse global remote workforce, welcoming candidates from any location worldwide.

We are experts in many open-source software products, one of which is Zeebe. Zeebe is a cloud-native workflow and decision engine that powers Camunda. Our expertise extends beyond developing products on top of the Zeebe engine; we also excel in performance tuning, scaling, securing, and adding resilience to Zeebe deployments.
In this guide, we share our experience in solving rare Zeebe runtime issues, with insights and solutions drawn from our extensive work with the engine.

Issues and How to Solve Them

Issue #1: If your nodes or network are heavily loaded, messages indicating that the Zeebe instance is operational may be lost or delayed. The default timeout for these messages is 100ms, which may not be sufficient under such conditions.

Exception:

  • java.util.concurrent.TimeoutException: Request atomix-membership-probe to zeebe.svc:26502 timed out in PT0.1S

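Solution:

  • Increase the membership probe timeout above the 100ms default so that probes still complete when nodes or the network are under load. In the broker configuration this is the probeTimeout setting under zeebe.broker.cluster.membership (environment variable ZEEBE_BROKER_CLUSTER_MEMBERSHIP_PROBETIMEOUT).
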
Issue #2: You have developed a custom exporter for Zeebe, and its slow event handling is slowing down workflow processing in Zeebe.

Solution:

  • Move the event recording logic into a separate thread or even a separate service. When the exporter receives an event, it can immediately mark the event as processed (by calling controller.updateLastExportedRecordPosition()) and then either place it in a queue or another temporary store, or forward it to a separate service that records events.
  • If you need to process events periodically, do it in a separate thread instead of using controller.scheduleCancellableTask(). The latter may negatively impact Zeebe performance and slow down workflow processing.
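The hand-off described above can be sketched as follows. Zeebe exporters themselves are written in Java; this minimal sketch uses Go (the language of the client code discussed later in this guide) purely to illustrate the pattern. Record, AsyncRecorder, and exportAll are hypothetical names, not part of any Zeebe API, and the store callback stands in for whatever slow I/O the exporter performs.

```go
package main

import (
	"fmt"
	"sync"
)

// Record is a hypothetical stand-in for an exported Zeebe record.
type Record struct {
	Position int64
	Payload  string
}

// AsyncRecorder hands records to a background goroutine so that the
// export callback itself returns quickly.
type AsyncRecorder struct {
	queue chan Record
	done  sync.WaitGroup
}

// NewAsyncRecorder starts one worker that calls store for every queued record.
func NewAsyncRecorder(capacity int, store func(Record)) *AsyncRecorder {
	r := &AsyncRecorder{queue: make(chan Record, capacity)}
	r.done.Add(1)
	go func() {
		defer r.done.Done()
		for rec := range r.queue {
			store(rec) // slow I/O happens here, off the export path
		}
	}()
	return r
}

// Export enqueues the record and returns its position, which the caller
// would then acknowledge. It blocks only while the queue is full.
func (r *AsyncRecorder) Export(rec Record) int64 {
	r.queue <- rec
	return rec.Position
}

// Close stops accepting records and waits for the worker to drain the queue.
func (r *AsyncRecorder) Close() {
	close(r.queue)
	r.done.Wait()
}

// exportAll is a small demo: it exports n records and reports how many
// were stored and the last position that would have been acknowledged.
func exportAll(n int) (stored int, lastAck int64) {
	rec := NewAsyncRecorder(16, func(Record) { stored++ })
	for i := 1; i <= n; i++ {
		lastAck = rec.Export(Record{Position: int64(i), Payload: "event"})
	}
	rec.Close()
	return stored, lastAck
}

func main() {
	stored, lastAck := exportAll(100)
	fmt.Printf("stored=%d lastAck=%d\n", stored, lastAck)
}
```

One trade-off to keep in mind: once a position has been acknowledged, the engine no longer needs to retain that record, so a purely in-memory queue can lose events if the process stops before the worker has stored them. A durable temporary store or a separate recording service avoids that risk.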


Issue #3: Timeout Errors in Zeebe Under a High Workflow Load. When you start a large number of workflows per second, a timeout specified in the Go client code can expire before Zeebe responds. The workflow then completes successfully in Zeebe, but the caller still receives a timeout error.


Solution:

  • Increase Timeouts: Ensure that the timeouts in your Go code are long enough to accommodate potential network and processing delays on the Zeebe side. Consider extending the context timeout in your code.

  • Pause Workflow Initiation: Alternatively, set a sufficiently large timeout and measure how long each initiation takes. If a call exceeds a specific limit, for example 1 second, pause new workflow initiations temporarily to reduce the load on Zeebe and ensure smoother processing.

Adjusting these parameters can mitigate timeout errors and enhance the reliability of your workflows in high-load scenarios with Zeebe.
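Both suggestions can be combined in a small wrapper. The following is a minimal sketch, assuming the actual Zeebe client call is wrapped in a function of type StartFunc; StartFunc, Starter, and simulate are hypothetical names introduced for illustration, and the durations in simulate are scaled down so the demo runs quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// StartFunc is a hypothetical stand-in for the client call that creates
// a workflow instance in Zeebe.
type StartFunc func(ctx context.Context) error

// Starter applies a generous per-call timeout and pauses after slow calls.
type Starter struct {
	Timeout   time.Duration // upper bound for a single initiation
	SlowLimit time.Duration // calls slower than this trigger a pause
	Pause     time.Duration // how long to hold back new initiations
}

// Start runs one initiation and reports whether it paused afterwards.
func (s Starter) Start(parent context.Context, start StartFunc) (paused bool, err error) {
	ctx, cancel := context.WithTimeout(parent, s.Timeout)
	defer cancel()

	began := time.Now()
	err = start(ctx)
	if time.Since(began) > s.SlowLimit {
		time.Sleep(s.Pause) // give the engine time to catch up
		paused = true
	}
	return paused, err
}

// simulate runs one fake initiation per latency (in milliseconds) and
// counts how many pauses and failures occurred.
func simulate(latenciesMs []int) (pauses, failures int) {
	s := Starter{
		Timeout:   time.Second,
		SlowLimit: 50 * time.Millisecond,
		Pause:     10 * time.Millisecond,
	}
	for _, ms := range latenciesMs {
		delay := time.Duration(ms) * time.Millisecond
		paused, err := s.Start(context.Background(), func(ctx context.Context) error {
			select {
			case <-time.After(delay): // pretend the engine answered
				return nil
			case <-ctx.Done(): // the timeout expired first
				return ctx.Err()
			}
		})
		if paused {
			pauses++
		}
		if err != nil {
			failures++
		}
	}
	return pauses, failures
}

func main() {
	pauses, failures := simulate([]int{0, 80, 0})
	fmt.Printf("pauses=%d failures=%d\n", pauses, failures)
}
```

In production code, Timeout would be sized for the worst expected delay and SlowLimit would be the threshold mentioned above, for example 1 second.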



Issue #4: Failed to Write to Zeebe Partition(s) - Partition is Full

Solution:

  • To handle this, make sure Zeebe's backpressure mechanism is enabled so that the engine rejects new requests under high load, and implement the following logic in your application code: when you receive a backpressure error from Zeebe, temporarily stop initiating new workflows and let the engine finish processing the workflows that are already running. This gives Zeebe the time it needs to work through the backlog on the partition and keeps workflow execution smooth.
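The stop-and-wait logic above might look like the following minimal sketch. Zeebe reports backpressure over gRPC with the RESOURCE_EXHAUSTED status; to keep the example self-contained, it uses a hypothetical sentinel error, ErrBackpressure, in place of the gRPC status check. StartWithBackoff, rejectFirst, and demo are likewise hypothetical names, and the wait durations are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrBackpressure is a hypothetical sentinel standing in for the
// backpressure rejection returned by Zeebe.
var ErrBackpressure = errors.New("backpressure: partition is overloaded")

// StartWithBackoff calls start until it succeeds, fails with a different
// error, or maxAttempts is reached. After each backpressure rejection it
// waits, doubling the wait up to maxWait, before trying again.
func StartWithBackoff(start func() error, maxAttempts int, initial, maxWait time.Duration) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	wait := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = start()
		if !errors.Is(err, ErrBackpressure) {
			return attempt, err // success or a non-retryable error
		}
		if attempt == maxAttempts {
			break
		}
		time.Sleep(wait) // hold back new work while the engine recovers
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
	return maxAttempts, err
}

// rejectFirst returns a fake start function that is rejected n times
// and then succeeds.
func rejectFirst(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return ErrBackpressure
		}
		return nil
	}
}

// demo reports how many attempts were needed and whether the start succeeded.
func demo(rejections, maxAttempts int) (attempts int, ok bool) {
	attempts, err := StartWithBackoff(rejectFirst(rejections), maxAttempts,
		time.Millisecond, 8*time.Millisecond)
	return attempts, err == nil
}

func main() {
	attempts, ok := demo(2, 5)
	fmt.Printf("attempts=%d ok=%v\n", attempts, ok)
}
```

If several goroutines start workflows concurrently, the same wait should apply to all of them, for example through a shared gate, so that the pause actually reduces the load on the engine.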


Start Living Your Life - Join Us
Our greatest value is our team: ultimate professionals who are passionate about their work and a pleasure to work with.