RDoc-3179 Update the Cluster Configuration #1979

Merged
@@ -3,7 +3,8 @@

{NOTE: }

* The primary goal of the `Cluster Observer` is to maintain the [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor) of each database in the cluster.
* The primary goal of the **Cluster Observer** is to monitor the health of each database in the cluster
and adjust its topology to maintain the desired [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor).

* This observer is always running on the Leader node.
{NOTE/}
@@ -38,17 +39,25 @@ The _Cluster Observer_ stores its information **in memory**, so when the `Leader
| `/admin/cluster/maintenance-stats` | GET | | Fetch the latest reports of the _Cluster Observer_ |
{PANEL/}

{NOTE: For Example}
{NOTE: }

**For example**:

* Let us assume a five node cluster, with servers A, B, C, D, E.
We create a database with a replication factor of 3 and define an ETL task.
* Let us assume a five-node cluster with servers A, B, C, D, E.
We create a database with a replication factor of 3 and define an ETL task.

* The newly created database will be distributed automatically to three of the cluster nodes.
Let's assume it is distributed to B, C and E (So the database group is [B,C,E]),
and the cluster decides that node C is the responsible for performing the ETL task.
Let's assume it is distributed to B, C, and E (so the database group is [B,C,E]),
and the cluster decides that node C is responsible for performing the ETL task.

* If node C goes offline or becomes unreachable, the Cluster Observer detects the issue.
Initially:
* After the duration specified in the [Cluster.TimeBeforeMovingToRehabInSec](../../../server/configuration/cluster-configuration#cluster.timebeforemovingtorehabinsec) configuration,
the observer moves node C to rehab mode, allowing time for recovery.
* The ETL task fails over to another available node in the Database Group.

* If node C remains offline beyond the period specified in the [Cluster.TimeBeforeAddingReplicaInSec](../../../server/configuration/cluster-configuration#cluster.timebeforeaddingreplicainsec) configuration, the observer begins replicating the database to another node in the Database Group as a last resort.

* If node C goes offline or is not reachable, the Observer will notice it and relocate the database from node C to another available node.
Meanwhile the ETL task will failover to be performed by another available node from the Database Group.
{NOTE/}

## Related articles
@@ -3,7 +3,8 @@

{NOTE: }

* The primary goal of the `Cluster Observer` is to maintain the [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor) of each database in the cluster.
* The primary goal of the **Cluster Observer** is to monitor the health of each database in the cluster
and adjust its topology to maintain the desired [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor).

* This observer is always running on the Leader node.
{NOTE/}
@@ -37,15 +38,23 @@ The _Cluster Observer_ stores its information **in memory**, so when the `Leader
| `/admin/cluster/maintenance-stats` | GET | | Fetch the latest reports of the _Cluster Observer_ |
{PANEL/}

{NOTE: For Example}
{NOTE: }

* Let us assume a five node cluster, with servers A, B, C, D, E.
We create a database with a replication factor of 3 and define an ETL task.
**For example**:

* Let us assume a five-node cluster with servers A, B, C, D, E.
We create a database with a replication factor of 3 and define an ETL task.

* The newly created database will be distributed automatically to three of the cluster nodes.
Let's assume it is distributed to B, C and E (So the database group is [B,C,E]),
and the cluster decides that node C is the responsible for performing the ETL task.
Let's assume it is distributed to B, C, and E (so the database group is [B,C,E]),
and the cluster decides that node C is responsible for performing the ETL task.

* If node C goes offline or becomes unreachable, the Cluster Observer detects the issue.
Initially:
* After the duration specified in the [Cluster.TimeBeforeMovingToRehabInSec](../../../server/configuration/cluster-configuration#cluster.timebeforemovingtorehabinsec) configuration,
the observer moves node C to rehab mode, allowing time for recovery.
* The ETL task fails over to another available node in the Database Group.

* If node C goes offline or is not reachable, the Observer will notice it and relocate the database from node C to another available node.
Meanwhile the ETL task will failover to be performed by another available node from the Database Group.
* If node C remains offline beyond the period specified in the [Cluster.TimeBeforeAddingReplicaInSec](../../../server/configuration/cluster-configuration#cluster.timebeforeaddingreplicainsec) configuration, the observer begins replicating the database to another node in the Database Group as a last resort.

{NOTE/}
@@ -3,7 +3,8 @@

{NOTE: }

* The primary goal of the `Cluster Observer` is to maintain the [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor) of each database in the cluster.
* The primary goal of the **Cluster Observer** is to monitor the health of each database in the cluster
and adjust its topology to maintain the desired [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor).

* This observer is always running on the Leader node.
{NOTE/}
@@ -37,15 +38,23 @@ The _Cluster Observer_ stores its information **in memory**, so when the `Leader
| `/admin/cluster/maintenance-stats` | GET | | Fetch the latest reports of the _Cluster Observer_ |
{PANEL/}

{NOTE: For Example}
{NOTE: }

**For example**:

* Let us assume a five node cluster, with servers A, B, C, D, E.
We create a database with a replication factor of 3 and define an ETL task.
* Let us assume a five-node cluster with servers A, B, C, D, E.
We create a database with a replication factor of 3 and define an ETL task.

* The newly created database will be distributed automatically to three of the cluster nodes.
Let's assume it is distributed to B, C and E (So the database group is [B,C,E]),
and the cluster decides that node C is the responsible for performing the ETL task.
Let's assume it is distributed to B, C, and E (so the database group is [B,C,E]),
and the cluster decides that node C is responsible for performing the ETL task.

* If node C goes offline or becomes unreachable, the Cluster Observer detects the issue.
Initially:
* After the duration specified in the [Cluster.TimeBeforeMovingToRehabInSec](../../../server/configuration/cluster-configuration#cluster.timebeforemovingtorehabinsec) configuration,
the observer moves node C to rehab mode, allowing time for recovery.
* The ETL task fails over to another available node in the Database Group.

* If node C remains offline beyond the period specified in the [Cluster.TimeBeforeAddingReplicaInSec](../../../server/configuration/cluster-configuration#cluster.timebeforeaddingreplicainsec) configuration, the observer begins replicating the database to another node in the Database Group as a last resort.

* If node C goes offline or is not reachable, the Observer will notice it and relocate the database from node C to another available node.
Meanwhile the ETL task will failover to be performed by another available node from the Database Group.
{NOTE/}
@@ -0,0 +1,73 @@
# Cluster Observer
---

{NOTE: }

* The primary goal of the **Cluster Observer** is to monitor the health of each database in the cluster
and adjust its topology to maintain the desired [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor).

* This observer is always running on the [Leader](../../../server/clustering/rachis/cluster-topology#leader) node.

* In this page:
* [Operation flow](../../../server/clustering/distribution/cluster-observer#operation-flow)
* [Interacting with the Cluster Observer](../../../server/clustering/distribution/cluster-observer#interacting-with-the-cluster-observer)

{NOTE/}

---

{PANEL: Operation flow}

* To maintain the Replication Factor, every newly elected [Leader](../../../server/clustering/rachis/cluster-topology#leader) starts measuring the health of each node
by creating dedicated maintenance TCP connections to all other nodes in the cluster.

* Each node reports the current status of _all_ its databases at intervals of [500 milliseconds](../../../server/configuration/cluster-configuration#cluster.workersampleperiodinms) (by default).
The `Cluster Observer` consumes those reports every [1000 milliseconds](../../../server/configuration/cluster-configuration#cluster.supervisorsampleperiodinms) (by default).

* Upon a **node failure**, the [Dynamic Database Distribution](../../../server/clustering/distribution/distributed-database#dynamic-database-distribution) sequence
takes place to ensure that the `Replication Factor` does not change.

{NOTE: }

**For example**:

* Let us assume a five-node cluster with servers A, B, C, D, E.
We create a database with a replication factor of 3 and define an ETL task.

* The newly created database will be distributed automatically to three of the cluster nodes.
Let's assume it is distributed to B, C, and E (so the database group is [B,C,E]),
and the cluster decides that node C is responsible for performing the ETL task.

* If node C goes offline or becomes unreachable, the Cluster Observer detects the issue.
Initially:
* After the duration specified in the [Cluster.TimeBeforeMovingToRehabInSec](../../../server/configuration/cluster-configuration#cluster.timebeforemovingtorehabinsec) configuration,
the observer moves node C to rehab mode, allowing time for recovery.
* The ETL task fails over to another available node in the Database Group.

* If node C remains offline beyond the period specified in the [Cluster.TimeBeforeAddingReplicaInSec](../../../server/configuration/cluster-configuration#cluster.timebeforeaddingreplicainsec) configuration,
the observer begins replicating the database to another node in the Database Group as a last resort.

{NOTE/}

{WARNING: }

**Note**:

* The _Cluster Observer_ stores its information **in memory**, so when the `Leader` loses leadership,
the collected reports of the _Cluster Observer_ and its decision log are lost.

{WARNING/}

{PANEL/}
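
The two-stage reaction described in the example above can be sketched as a small decision function. This is only an illustration of the timing logic, not RavenDB's implementation: the names `ObserverTimeouts` and `observer_action` are hypothetical, and the threshold values are placeholders standing in for whatever `Cluster.TimeBeforeMovingToRehabInSec` and `Cluster.TimeBeforeAddingReplicaInSec` are set to on your server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObserverTimeouts:
    """Hypothetical container for the two configuration values.

    The numbers below are placeholders for illustration only; the real
    values come from Cluster.TimeBeforeMovingToRehabInSec and
    Cluster.TimeBeforeAddingReplicaInSec.
    """
    move_to_rehab_sec: int = 60
    add_replica_sec: int = 900


def observer_action(seconds_offline: float,
                    timeouts: ObserverTimeouts = ObserverTimeouts()) -> str:
    """Return the stage an unreachable node is in, per the example above."""
    if seconds_offline < timeouts.move_to_rehab_sec:
        # The rehab threshold has not passed yet: nothing changes.
        return "wait"
    if seconds_offline < timeouts.add_replica_sec:
        # The node is moved to rehab, allowing time for recovery.
        return "move-to-rehab"
    # Last resort: the database is replicated to another node.
    return "add-replica"


if __name__ == "__main__":
    for elapsed in (10, 120, 1000):
        print(elapsed, observer_action(elapsed))
```

Keeping the two thresholds in one immutable object makes it easy to try different settings, for example `observer_action(30, ObserverTimeouts(move_to_rehab_sec=20, add_replica_sec=40))`.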

{PANEL: Interacting with the Cluster Observer}

You can interact with the `Cluster Observer` using the following REST API calls:

| URL | Method | Query Params | Description |
|-------------------------------------|---------|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `/admin/cluster/observer/suspend` | POST | value=[`bool`] | Setting `false` will suspend the _Cluster Observer_ operation for the current [Leader term](../../../studio/cluster/cluster-view#cluster-nodes-states-&-types-flow). |
| `/admin/cluster/observer/decisions` | GET     |                | Fetch the log of the recent decisions made by the _Cluster Observer_.                                                                                                |
| `/admin/cluster/maintenance-stats`  | GET     |                | Fetch the latest reports of the _Cluster Observer_.                                                                                                                  |

{PANEL/}