## Logic rules for scheduling Machine-level repair jobs in the cluster. EntityType fact is always Machine.
## FH does not conduct (execute) these repairs. It simply schedules them. InfrastructureService is always the Executor for machine-level Repair Jobs.
## You can use the LogRule predicate to have FabricHealer log a rule when Guan starts executing it. This is very useful for debugging rules and also
## for rule auditing (for example, you can emit telemetry/ETW that contains the rule that was running). You can also enable the EnableLogicRuleTracing application parameter
## in FabricHealer's ApplicationManifest, which logs executing rules that contain Repair predicates (the predicates that can lead to some action, such as restarting
## or deactivating a node).
## Applicable Named Arguments for Mitigate. Facts are supplied by FabricObserver (FO), FHProxy or FH itself.
## An argument marked (FO/FHProxy) below is supplied only by FO or FHProxy.
## | Argument Name | Definition |
## |---------------------------|------------------------------------------------------------------------|
## | NodeName | Name of the node |
## | NodeType | Type of node |
## | ErrorCode (FO/FHProxy) | Supported Error Code emitted by caller (e.g. "FO002") |
## | MetricName (FO/FHProxy) | Name of the Metric (e.g., CpuPercent or MemoryMB) |
## | MetricValue (FO/FHProxy) | Corresponding Metric Value (e.g. "85" indicating 85% CPU usage) |
## | OS | The name of the OS where FabricHealer is running (Linux or Windows) |
## | HealthState | The HealthState of the target entity: Error or Warning |
## | Source | The Source ID of the related SF Health Event |
## | Property | The Property of the related SF Health Event |
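## Example (illustrative sketch, commented out; not part of the active rule set): any named argument above can be bound in a rule head, following the same pattern as
## the HealthState guard in this file. The guard below would stop rule processing when FabricHealer is not running on Windows. It assumes the OS fact is presented as the
## bare token Windows or Linux, as the table above suggests. If enabled, it would need to sit with the other guard rules, before any repair rule.
##Mitigate(OS=?os) :- not(?os == Windows), !.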
## Metric Names, from FO or FHProxy.
## | Name |
## |--------------------------------|
## | ActiveTcpPorts |
## | CpuPercent |
## | EphemeralPorts |
## | EphemeralPortsPercent |
## | MemoryMB |
## | MemoryPercent |
## | Handles (Linux-only) |
## | HandlesPercent (Linux-only) |
## Don't proceed if the target entity is not in Error.
Mitigate(HealthState=?healthState) :- LogRule(36), not(?healthState == Error), !.
## Don't proceed if there are already 2 or more machine repairs currently active in the cluster.
Mitigate :- LogRule(39), CheckOutstandingRepairs(2), !.
## Don't proceed if FH scheduled a machine repair less than 10 minutes ago.
Mitigate :- LogRule(42), CheckInsideScheduleInterval(00:10:00), !.
## Don't proceed if target machine is currently in recovery probation.
Mitigate :- LogRule(45), CheckInsideNodeProbationPeriod(00:30:00), !.
## Don't proceed if the target node hasn't been in Error (including cyclic Up/Down) state for at least two hours.
Mitigate :- LogRule(48), CheckInsideHealthStateMinDuration(02:00:00), !.
## Fabric Node Deactivation Repairs. These are related to machine health Errors detected by, for example, EventLogWatchdog (see rules above).
## These are FH_Infra repairs, even though their impact is limited to Fabric nodes (not the underlying machines).
## This is because the root of the problem is the machine, not the Fabric node.
## The rules below ensure that when a watchdog detects OS issues, the Fabric node is taken out of the active ring and remains in the Disabled state until
## the related Repair Job is Completed. This is where the optional MaxDuration argument comes into play: the repair job lasts for the specified duration,
## and then FH completes it.
## Don't proceed unless a specific watchdog, in this case EventLogWatchdog, put the Fabric node into Error state.
##Mitigate(Source=?source) :- LogRule(58), Empty(?source), !.
##Mitigate(Source=?source) :- LogRule(59), notmatch(?source, "EventLogWatchdog"), !.
## Don't proceed if the required Property facts are not present.
##Mitigate(Property=?property) :- LogRule(61), Empty(?property), !.
##Mitigate(Source=?source, Property=?property) :- LogRule(63), match(?property, "CriticalMachineFailure"), match(?property, "1"),
##GetHealthEventHistory(?count, 00:30:00), ?count >= 3,
##GetRepairHistory(?repairCount, 08:00:00, DeactivateFabricNode),
##?repairCount < 1, !, DeactivateFabricNode(MaxDuration=00:02:00).
##Mitigate(Source=?source, Property=?property) :- LogRule(68), match(?property, "Ntfs"), match(?property, "55"),
##GetHealthEventHistory(?count, 00:30:00), ?count >= 3,
##GetRepairHistory(?repairCount, 08:00:00, DeactivateFabricNode),
##?repairCount < 1, !, DeactivateFabricNode(MaxDuration=08:00:00).
## Machine repairs - Reboot, Reimage, Heal.
## The logic below demonstrates how to specify a machine repair escalation path: Reboot -> Reimage -> Heal -> Triage (human intervention required).
## ScheduleMachineRepair predicate takes any repair action string. There are a handful that are supported by RepairManager/InfrastructureService, like below.
## System.Reboot.
## Don't process any other rules, whether scheduling succeeds or fails (note the position of the cut operator, !), provided that fewer than 1 of these repairs has completed in the last 8 hours.
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, !, ScheduleMachineRepair(System.Reboot).
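## Example (illustrative sketch, commented out; not part of the active rule set): the System.Reboot rule above with the cut moved after the repair predicate.
## Assuming Guan's cut follows the usual Prolog semantics, the cut here is reached only when ScheduleMachineRepair succeeds, so a failed scheduling attempt would
## fall through to the next Mitigate rule (the next escalation step) instead of ending rule processing.
##Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, ScheduleMachineRepair(System.Reboot), !.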
## System.ReimageOS escalation. *This is not supported in VMSS-managed clusters*.
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.ReimageOS), ?repairCount < 1, !, ScheduleMachineRepair(System.ReimageOS).
## System.Azure.Heal escalation.
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Azure.Heal), ?repairCount < 1, !, ScheduleMachineRepair(System.Azure.Heal).
## Triage escalation.
## If we end up here, then human intervention is required. LogInfo will generate ETW/Telemetry/Health events containing the message.
## FabricHealer will also schedule a ManualTriageNeeded repair task. Once you have manually solved the problem, cancel this repair task: until it is canceled, it blocks
## FabricHealer from scheduling any other machine repairs for the target node. It also counts against the number of concurrent Active repairs you specified
## above in the CheckOutstandingRepairs predicate.
Mitigate(NodeName=?nodeName) :- LogInfo("0042_{0}: Specified Machine repair escalations have been exhausted for node {0}. Human intervention is required.", ?nodeName),
ScheduleMachineRepair(ManualTriageNeeded).