The Netflix Content Platform engineering team runs a number of business processes which are driven by asynchronous orchestration of tasks executing on microservices. Some of these are long-running processes spanning several days. These processes play an important role in getting titles ready for streaming to our audience worldwide.
A few processes are:
- Studio partner integration for content ingestion
- IMF based content ingestion from our partners
- Process of setting up new titles within Netflix
- Content ingestion, encoding, and deployment to CDN
We built Conductor as an orchestration engine to address the following requirements, remove the need for boilerplate in applications, and provide a reactive flow:
- Tracking and management of workflows.
- Ability to pause, resume and restart processes.
- A user interface to visualize process flows.
- Ability to synchronously process all the tasks when needed.
- Ability to scale to millions of concurrently running process flows.
- Backed by a queuing service abstracted from the clients.
- Be able to operate over HTTP or other transports.
To date, it has helped orchestrate over 2.6 million process flows, from simple linear workflows to very complex dynamic workflows that run over multiple days.
How Netflix uses Conductor
Conductor is one of the most heavily used services within Content Engineering at Netflix. Of the multitude of modules that can be plugged into Conductor, as shown in the image below, we use the Jersey server module, Cassandra for persisting execution data, Dynomite for persisting metadata, DynoQueues as the queuing recipe built on top of Dynomite, Elasticsearch as the secondary datastore and indexer, and Netflix Spectator + Atlas for metrics.
As of this writing, Conductor orchestrates 600+ workflow definitions owned by 50+ teams across Netflix. While we are not yet actively measuring the Nth percentiles, our production workloads speak for Conductor's performance. Below is a snapshot of our Kibana dashboard showing workflow execution metrics over a typical 7-day period.
Conductor enables orchestration across services while providing control and visibility into their interactions. Having the ability to orchestrate across microservices also helped us in leveraging existing services to build new flows or update existing flows to use Conductor very quickly, effectively providing an easier route to adoption.
Why not peer to peer choreography?
With peer-to-peer task choreography, we found it harder to scale with growing business needs and complexities. The pub/sub model worked for the simplest of flows, but quickly highlighted some of the issues associated with the approach:
- Process flows are “embedded” within the code of multiple applications
- Tight coupling and assumptions around inputs/outputs, SLAs, etc., make it harder to adapt to changing needs
- Almost no way to systematically answer "What is remaining for a movie's setup to be complete?"
At the heart of the engine is a state machine service, aka the Decider service. As workflow events occur (e.g., task completion), the Decider combines the workflow blueprint with the current state of the workflow, identifies the next state, and schedules appropriate tasks and/or updates the status of the workflow.
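The decider loop can be sketched as below for a purely linear blueprint. The names here are illustrative, not Conductor's actual API, and the real Decider also handles control tasks (forks, decisions, joins) and failure states:

```python
# Minimal sketch of a decider step for a linear blueprint (illustrative names,
# not Conductor's API): given the blueprint and the set of completed tasks,
# pick the next task to schedule, or report the workflow as complete.

def decide(blueprint, completed):
    """Return the next task to schedule, or None when the workflow is done."""
    for task in blueprint:
        if task not in completed:
            return task
    return None

blueprint = ["ingest", "encode", "deploy_to_cdn"]
print(decide(blueprint, set()))                  # ingest
print(decide(blueprint, {"ingest", "encode"}))   # deploy_to_cdn
```

Each workflow event triggers another `decide` pass, which is what makes pause, resume, and restart straightforward: the next step is always recomputed from the blueprint plus current state.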
Task Worker Implementation
Tasks are implemented by worker applications that communicate through the API layer. Workers are expected to be idempotent. The polling model allows us to handle backpressure on the workers and provide auto-scalability based on queue depth when possible. Conductor provides APIs to inspect the workload size for each worker, which can be used to autoscale worker instances.
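The polling model can be sketched with an in-memory stand-in; the queue, function, and task names here are made up for illustration, whereas real workers poll Conductor's API layer over HTTP:

```python
import queue

# Simplified in-memory stand-in for a polling worker (hypothetical names;
# real workers poll Conductor's API layer over HTTP).
task_queue = queue.Queue()
results = {}

def poll_and_work(max_polls):
    for _ in range(max_polls):
        try:
            task = task_queue.get_nowait()      # poll for work
        except queue.Empty:
            break                               # queue drained; worker can back off
        if task["id"] in results:               # idempotent: re-delivery is harmless
            continue
        results[task["id"]] = "done:" + task["payload"]

task_queue.put({"id": "t1", "payload": "encode"})
task_queue.put({"id": "t1", "payload": "encode"})   # duplicate delivery
poll_and_work(max_polls=5)
print(results)   # {'t1': 'done:encode'}
```

Because the worker pulls work at its own pace, a backlog shows up as queue depth rather than as dropped or timed-out pushes, and that depth is exactly the signal an autoscaler can act on.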
The APIs are exposed over HTTP; using HTTP allows for ease of integration with different clients. Adding another protocol (e.g., gRPC) should be possible and relatively straightforward.
Workflows are defined using a JSON-based DSL. A workflow blueprint defines a series of tasks that need to be executed. Each task is either a control task or a worker task.
Outline workflow code:
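A minimal sketch of a workflow definition in the JSON DSL; the field names follow Conductor's documented blueprint format, while the workflow and task names are made-up examples:

```json
{
  "name": "example_workflow",
  "description": "Description of workflow",
  "version": 1,
  "tasks": [
    {
      "name": "encode",
      "taskReferenceName": "encode_ref",
      "type": "SIMPLE",
      "inputParameters": {
        "fileLocation": "${workflow.input.fileLocation}"
      }
    }
  ],
  "outputParameters": {
    "encoded_url": "${encode_ref.output.url}"
  }
}
```

The `tasks` array is the blueprint the Decider walks; `outputParameters` defines what the workflow as a whole returns once its last task completes.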
Each task's behavior is controlled by its template, known as a task definition. A task definition provides control parameters for each task, such as timeouts and retry policies. Conductor provides out-of-the-box system tasks such as Decision, Fork, Join, and Sub Workflow, and an SPI that allows plugging in custom system tasks. We have also added support for HTTP tasks that facilitate making calls to REST services.
Task definition code:
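A task definition might look like the following sketch; the control parameters shown mirror Conductor's documented fields, and the specific values are illustrative:

```json
{
  "name": "encode",
  "retryCount": 3,
  "retryLogic": "FIXED",
  "retryDelaySeconds": 600,
  "timeoutSeconds": 1200,
  "timeoutPolicy": "TIME_OUT_WF",
  "responseTimeoutSeconds": 3600
}
```

Keeping retry and timeout policy in the task definition, rather than in worker code, is what lets operators tune these behaviors without redeploying applications.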
Inputs / Outputs
Input to a task is a map, with inputs coming either as part of the workflow instantiation or from the output of another task. Such configuration allows routing inputs/outputs from the workflow or other tasks as inputs to tasks that can then act upon them.
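For example, a task's `inputParameters` can mix a workflow input with another task's output (the parameter and reference names here are hypothetical):

```json
{
  "inputParameters": {
    "movieId": "${workflow.input.movieId}",
    "encodedUrl": "${encode_ref.output.url}"
  }
}
```

With this wiring, the output of an encoding task can be handed directly to a downstream publish task for deployment to the CDN.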