Clustering and Failover

A Flux cluster is a group of engines that communicate with one another (both directly over the network and through the database) to schedule flow charts as efficiently as possible and to provide a failover mechanism in case one or more engines go down. In a cluster, there is no single point of failure: any flow charts that were running on a failed engine will fail over and begin executing on another engine in the cluster.

Configuring the Cluster

To set up two or more Flux engines to run in a cluster, you only need to configure each engine in the cluster to use the same database tables. The rest of the cluster configuration is handled automatically; as long as all of the engines share the same set of database tables, you do not need to do any further work to enable clustering.

Engines that do not use the same database tables will not attempt to cluster or communicate with one another.
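
As a minimal sketch, assuming the engines are configured from a simple key=value (properties-style) configuration file, clustering comes down to giving every engine identical database settings. The database-related property names below are illustrative placeholders only; consult the Engine Configuration documentation for the exact names used by your version of Flux.

    # Shared by EVERY engine in the cluster (property names are placeholders).
    DATABASE_URL=jdbc:postgresql://dbhost:5432/flux
    DATABASE_USER=flux
    DATABASE_PASSWORD=secret
    TABLE_PREFIX=FLUX_

    # A second engine on another host uses exactly the same settings.
    # An engine pointed at a different database or table prefix will not
    # attempt to join this cluster.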

Maximum Cluster Size

Flux supports clusters of up to 35 engines.

In some cases, it may be appropriate to disable cluster networking for best performance. Consider doing so if your cluster satisfies the following criteria:

  • The cluster contains 5 or more engines
  • 50,000 or more flow charts run across the entire cluster per day
  • All engines in the cluster are expected to operate at maximum concurrency

In this situation, disabling cluster networking can increase throughput and performance. The benefits that cluster networking provides, such as load balancing, offer little advantage when the cluster is already running at maximum capacity at all times.
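
As a rough sketch, assuming a properties-style engine configuration file, disabling cluster networking might look like the following. The property name shown here is an assumption for illustration; check the Engine Configuration documentation for the exact option that controls cluster networking.

    # High-volume cluster (5+ engines, 50,000+ flow charts per day, always at
    # maximum concurrency): trade load balancing for raw throughput.
    # NOTE: the property name below is illustrative only.
    CLUSTER_NETWORKING=false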

Load Balancing

Flux engines in the cluster will work together to evenly distribute flow charts. Load balancing is provided in one of two ways:

  • By default, flow charts that are ready to execute will be assigned to the engine in the cluster that is running the smallest number of flow charts. This allows flow charts to be spread evenly across the cluster and ensures that no engines are starved of new flow charts to execute.
  • If the system resource monitor is enabled (by configuring the SYSTEM_RESOURCE_COMMAND and SYSTEM_RESOURCE_INTERVAL engine configuration options), flow charts will be assigned to the engine with the most resources available. This allows flow charts to be assigned to the engine that is under the least stress and ensures that the flow charts will run on the engine that is most likely to have sufficient resources available.

For more information on configuring the system resource monitor, refer to the /examples/software_developers/system_resource_monitoring example (while that example uses Java code to configure and start the engine, the process for setting the engine configuration options is the same even if you are configuring the engine from a file).
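
As a brief sketch, assuming a properties-style engine configuration file, the two options might be set as follows. The command path and interval values are examples only.

    # Enable the system resource monitor so load balancing favors the engine
    # with the most available resources.
    # SYSTEM_RESOURCE_COMMAND runs a command that reports available resources
    # on the engine's host; SYSTEM_RESOURCE_INTERVAL controls how often it runs.
    SYSTEM_RESOURCE_COMMAND=/opt/flux/bin/report_resources.sh
    SYSTEM_RESOURCE_INTERVAL=+30s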

Load balancing is not available if the engines cannot communicate across the network (for example, if a firewall is blocking traffic between the engines).

Concurrent Actions within a Flow Chart

A Flux cluster can also execute the actions within a flow chart across the cluster. If a flow chart contains a split (or, in general, if the flow chart contains two or more actions that will run at the same time), it is possible that the actions will run on different engines in the cluster. Since each execution flow updates the database with all necessary information as it runs, all engines in the cluster can see all information from the branched flow chart.

In short, even if multiple actions within a flow chart run on different engines, there will be no differences in behavior or execution of the flow chart. This clustering behavior allows actions to execute more quickly and reliably across the cluster without impacting the operation of your engine or flow chart.

Clustering and the Network

Clustered engines communicate automatically across the network. If the engines are not able to use the network (for example, if a firewall is blocking traffic between the engines), they will continue to communicate through the database, but load balancing will not be available.

Port Usage

Cluster networking is performed using the same port numbers set in the Engine Configuration; namely, REGISTRY_PORT and RMI_PORT.

By default, REGISTRY_PORT is set to 1099 and RMI_PORT to 0 (that is, an anonymous, random port number). If certain ports are restricted on the network between the servers where the clustered engines are running, make sure that REGISTRY_PORT and RMI_PORT are set to port values that are open and available on those servers.
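
For example, assuming a properties-style engine configuration file, both ports could be pinned to fixed values that your firewall allows (the RMI port number here is only an example):

    # Fixed, firewall-friendly ports for cluster networking.
    REGISTRY_PORT=1099
    # Replace the default of 0 (a random anonymous port) with a fixed port
    # that can be opened explicitly between the clustered servers.
    RMI_PORT=1100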

Clustering and the Database

For Flux clustering to work correctly, the database must be accessible. If an engine loses its connection to the database, it will not be able to participate in the cluster, even if network communication is still available.

Failover

If an engine in the cluster fails, the other engines in the cluster will cooperate to recover and resume any flow charts that were running on the failed engine. An engine is considered to have failed any time it becomes inaccessible (the machine crashes, the engine shuts down unexpectedly, database connectivity is lost, etc.).

When a flow chart is failed over to another engine instance, it will roll back to the last transaction break it reached successfully and begin running from that point.

A flow chart will only be failed over if it was running an action when its engine failed. A flow chart that was waiting on a trigger is not affected by failover, since a flow chart that is waiting for an event is not yet associated with a particular engine and can still run on any engine in the cluster.

The Failover Time Window

FAILOVER_TIME_WINDOW is an engine configuration property, specified as a time expression, that is used to determine how frequently an engine should check for failed flow chart instances (flow charts that were running on engines that have become inactive).

When a flow chart is running, it maintains a heartbeat in the database to let all of the engines in the cluster know that the flow chart is still active. A flow chart updates its heartbeat three times during the failover time window, so the heartbeat interval is (failover time window / 3).

By default, the failover time window is +3m, or three minutes. This means three things:

  1. The engine will check every 3 minutes for flow charts that were running on failed engines.
  2. Any flow chart running on the engine will update its heartbeat every minute (one-third of the failover time window).
  3. If the last heartbeat was recorded more than three minutes ago, the flow chart is failed over to the other engines in the cluster.
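
As a sketch, assuming a properties-style engine configuration file, a five-minute failover time window (an example value; +3m is the default) would be set as follows. With this setting, running flow charts record a heartbeat roughly every 100 seconds (one-third of the window), and a flow chart whose last heartbeat is more than five minutes old is failed over.

    # Failover time window as a Flux time expression (example value).
    # Heartbeat interval = window / 3, so about every 100 seconds here.
    FAILOVER_TIME_WINDOW=+5m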

To prevent flow charts from being prematurely failed over, make sure that the following items are consistent across all engines in the cluster:

  • All engines use the same failover time window. If different engines have different time windows, the flow charts might not update their heartbeats frequently enough to satisfy all of the failover time windows in the cluster even if they are actually running correctly.
  • All engines have their system clocks synchronized. The heartbeat is recorded to the database using the system clock of the engine where the flow chart is running, so if the clocks in the cluster are not synchronized the heartbeat can appear older than it actually is.

We recommend using NTP (Network Time Protocol) to synchronize the system clocks across the cluster.

Failover Time Window in a Non-Clustered Environment

The FAILOVER_TIME_WINDOW is still consulted even if your engine is not running in a cluster. When an engine recovers after shutting down or crashing, it will still wait the duration of the FAILOVER_TIME_WINDOW before attempting to reclaim and execute any flow charts that were running when the engine stopped.

The engine must wait before reclaiming its jobs because there is no way for an engine to guarantee that it is the only engine using the database, so the failover time window is used as a safeguard in case another engine instance has already claimed or begun running the failed over flow charts.

Disabling Failover

There is no built-in way to disable failover in Flux. You can, however, set the failover time window to a very large value (like +10y), which effectively prevents Flux from failing over flow charts. In general, the lower the failover time window, the faster flow charts are failed over; however, setting a value that is too low causes false alarms and flow charts that are failed over prematurely.

Make sure that the value set is a valid time expression, as setting something like '0' would cause an engine configuration error.
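
For example, assuming a properties-style engine configuration file, failover can be effectively disabled by setting a very large (but still valid) time expression:

    # Effectively disables failover: flow charts on a failed engine would not
    # be reclaimed for ten years.
    FAILOVER_TIME_WINDOW=+10y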

 
 