FabricHealer/PackageRoot/Config/LogicRules/MachineRules.guan

## Logic rules for scheduling Machine-level repair jobs in the cluster. EntityType fact is always Machine.
## FH does not conduct (execute) these repairs. It simply schedules them. InfrastructureService is always the Executor for machine-level Repair Jobs.
## You can use the LogRule predicate to have FabricHealer log a rule when Guan starts executing it. This is useful for debugging rules and for
## rule auditing (for example, you can emit telemetry/ETW that contains the rule that was running). You can also enable the EnableLogicRuleTracing
## application parameter in FabricHealer's ApplicationManifest, which logs executing rules that contain Repair predicates (predicates that can lead
## to an action, such as restarting or deactivating a node).

## Applicable Named Arguments for Mitigate. Facts are supplied by FabricObserver, FHProxy or FH itself.
## An argument marked (FO/FHProxy) below is a fact that only FO or FHProxy will present.
## | Argument Name             | Definition                                                             |
## |---------------------------|------------------------------------------------------------------------|
## | NodeName                  | Name of the node                                                       |
## | NodeType                  | Type of node                                                           |
## | ErrorCode (FO/FHProxy)    | Supported Error Code emitted by caller (e.g. "FO002")                  |
## | MetricName (FO/FHProxy)   | Name of the metric (e.g., CpuPercent or MemoryMB)                      |
## | MetricValue (FO/FHProxy)  | Corresponding Metric Value (e.g. "85" indicating 85% CPU usage)        |
## | OS                        | The name of the OS where FabricHealer is running (Linux or Windows)    |
## | HealthState               | The HealthState of the target entity: Error or Warning                 |
## | Source                    | The Source ID of the related SF Health Event                           |
## | Property                  | The Property of the related SF Health Event                            |

## Metric Names, from FO or FHProxy.
## | Name                           |
## |--------------------------------|
## | ActiveTcpPorts                 |
## | CpuPercent                     |
## | EphemeralPorts                 |
## | EphemeralPortsPercent          |
## | MemoryMB                       |
## | MemoryPercent                  |
## | Handles (Linux-only)           |
## | HandlesPercent (Linux-only)    |


## Don't proceed if the target entity is not in Error.
Mitigate(HealthState=?healthState) :- LogRule(36), not(?healthState == Error), !.

## Don't proceed if there are already 2 or more machine repairs currently active in the cluster.
Mitigate :- LogRule(39), CheckOutstandingRepairs(2), !.

## Don't proceed if FH scheduled a machine repair less than 10 minutes ago.
Mitigate :- LogRule(42), CheckInsideScheduleInterval(00:10:00), !.

## Don't proceed if target machine is currently in recovery probation.
Mitigate :- LogRule(45), CheckInsideNodeProbationPeriod(00:30:00), !.

## Don't proceed if the target node hasn't been in Error (including cyclic Up/Down) state for at least two hours.
Mitigate :- LogRule(48), CheckInsideHealthStateMinDuration(02:00:00), !.

## Fabric Node Deactivation Repairs. These address machine health Errors detected by a watchdog such as EventLogWatchdog (see the commented-out rules below).
## These are FH_Infra repairs, even though their impact is limited to Fabric nodes (not the underlying machines).
## This is because the root of the problem is the machine, not the Fabric node.
## The rules below ensure that when a watchdog detects OS issues, the Fabric node is taken out of the active ring and remains in the Disabled state until
## the related Repair Job is Completed. This is where the optional MaxDuration argument comes into play: the repair job lasts for the specified duration,
## and then FH marks it Completed.

## Don't proceed unless a specific watchdog, in this case EventLogWatchdog, put the Fabric node into Error state.
##Mitigate(Source=?source) :- LogRule(58), Empty(?source), !.
##Mitigate(Source=?source) :- LogRule(59), notmatch(?source, "EventLogWatchdog"), !.
## Don't proceed if the required Property facts are not present.
##Mitigate(Property=?property) :- LogRule(61), Empty(?property), !.

##Mitigate(Source=?source, Property=?property) :- LogRule(63), match(?property, "CriticalMachineFailure"), match(?property, "1"),
	##GetHealthEventHistory(?count, 00:30:00), ?count >= 3,
	##GetRepairHistory(?repairCount, 08:00:00, DeactivateFabricNode),
	##?repairCount < 1, !, DeactivateFabricNode(MaxDuration=00:02:00).

##Mitigate(Source=?source, Property=?property) :- LogRule(68), match(?property, "Ntfs"), match(?property, "55"),
	##GetHealthEventHistory(?count, 00:30:00), ?count >= 3,
	##GetRepairHistory(?repairCount, 08:00:00, DeactivateFabricNode),
	##?repairCount < 1, !, DeactivateFabricNode(MaxDuration=08:00:00).

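## Example (a sketch, not an active rule): an additional guard, in the style of the HealthState guard above, that stops rule processing for a
## given node type so that none of the machine repairs below are scheduled for it. "SeedNodeType" is a hypothetical node type name used only
## for illustration; replace it with a node type from your cluster. The rule is left commented out, so it does not change behavior.
##Mitigate(NodeType=?nodeType) :- ?nodeType == SeedNodeType, !.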

## Machine repairs - Reboot, Reimage, Heal.
## The logic below demonstrates how to specify a machine repair escalation path: Reboot -> Reimage -> Heal -> Triage (human intervention required).
## ScheduleMachineRepair predicate takes any repair action string. There are a handful that are supported by RepairManager/InfrastructureService, like below.

## System.Reboot.
## Schedule a reboot only if fewer than one System.Reboot repair has completed in the last 8 hours.
## Note the position of the cut operator (!): once the history check passes, no other rules are processed, whether scheduling succeeds OR fails.
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, !, ScheduleMachineRepair(System.Reboot).
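
## Example (a sketch, not an active rule): to allow up to two reboots within the 8-hour window before escalating, raise the threshold.
## The value 2 is an illustrative assumption. This variant would replace the System.Reboot rule above rather than run alongside it.
##Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 2, !, ScheduleMachineRepair(System.Reboot).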

## System.ReimageOS escalation. *This is not supported in VMSS-managed clusters*.
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.ReimageOS), ?repairCount < 1, !, ScheduleMachineRepair(System.ReimageOS).

## System.Azure.Heal escalation.
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Azure.Heal), ?repairCount < 1, !, ScheduleMachineRepair(System.Azure.Heal).

## Triage escalation.
## If we end up here, then human intervention is required. LogInfo will generate ETW/Telemetry/Health events containing the message.
## FabricHealer will also schedule a ManualTriageNeeded repair task. Once you have manually resolved the problem, cancel this repair task: until it is
## canceled, it blocks FabricHealer from scheduling any other machine repairs for the target node. It also counts against the number of concurrent
## Active repairs you specified above in the CheckOutstandingRepairs predicate.
Mitigate(NodeName=?nodeName) :- LogInfo("0042_{0}: Specified Machine repair escalations have been exhausted for node {0}. Human intervention is required.", ?nodeName), 
	ScheduleMachineRepair(ManualTriageNeeded).