   1
   2 .. This work is licensed under a Creative Commons Attribution 4.0 International License.
   3 .. http://creativecommons.org/licenses/by/4.0
   4
   5 .. _feature-sm-label:
   6
   7 *************************
   8 Feature: State Management
   9 *************************
  10
  11 .. contents::
  12     :depth: 2
  13
  14 The State Management Feature provides:
  15
  16 - Node-level health monitoring
  17 - Monitoring the health of dependency nodes (nodes on which a particular node depends)
  18 - Ability to lock/unlock a node and suspend or resume all application processing
  19 - Ability to suspend application processing on a node that is disabled or in a standby state
  20 - Interworking/Coordination of state values
  21 - Support for ITU X.731 states and state transitions for:
  22         - Administrative State
  23         - Operational State
  24         - Availability Status
  25         - Standby Status
  26
  27
  28 Enabling and Disabling Feature State Management
  29 ===============================================
  30
  31 The State Management Feature is enabled from the command line, when logged in as the policy user, after configuring the feature properties file (see the Properties section).  From the command line:
  32
  33 - ``features status`` - Lists the status of features
  34 - ``features enable state-management`` - Enables the State Management Feature
  35 - ``features disable state-management`` - Disables the State Management Feature
  36
  37 The Drools PDP must be stopped prior to enabling/disabling features and then restarted after the features have been enabled/disabled.
  38
  39     .. code-block:: bash
  40        :caption: Enabling State Management Feature
  41
  42         policy@hyperion-4:/opt/app/policy$ policy stop
  43         [drools-pdp-controllers]
  44          L []: Stopping Policy Management... Policy Management (pid=354) is stopping... Policy Management has stopped.
  45         policy@hyperion-4:/opt/app/policy$ features enable state-management
  46         name                      version         status
  47         ----                      -------         ------
  48         controlloop-utils         1.1.0-SNAPSHOT  disabled
  49         healthcheck               1.1.0-SNAPSHOT  disabled
  50         test-transaction          1.1.0-SNAPSHOT  disabled
  51         eelf                      1.1.0-SNAPSHOT  disabled
  52         state-management          1.1.0-SNAPSHOT  enabled
  53         active-standby-management 1.1.0-SNAPSHOT  disabled
  54         session-persistence       1.1.0-SNAPSHOT  disabled
  55
  56 Description Details
  57 ~~~~~~~~~~~~~~~~~~~
  58
  59 State Model
  60 """""""""""
  61
  62 The state model follows the ITU X.731 standard for state management.  The supported state values are:
  63     **Administrative State:**
  64         - Locked - All application transaction processing is prohibited
  65         - Unlocked - Application transaction processing is allowed
  66
  67     **Administrative State Transitions:**
  68         - The transition from Unlocked to Locked state is triggered with a Lock operation
  69         - The transition from the Locked to Unlocked state is triggered with an Unlock operation
  70
  71     **Operational State:**
  72         - Enabled - The node is healthy and able to process application transactions
  73         - Disabled - The node is not healthy and not able to process application transactions
  74
  75     **Operational State Transitions:**
  76         - The transition from Enabled to Disabled is triggered with a disableFailed or disableDependency operation
  77         - The transition from Disabled to Enabled is triggered with an enableNotFailed or enableNoDependency operation, once neither a failure nor a dependency condition remains
  78
  79     **Availability Status:**
  80         - Null - The Operational State is Enabled
  81         - Failed - The Operational State is Disabled because the node is no longer healthy
  82         - Dependency - The Operational State is Disabled because all members of a dependency group are disabled
  83         - Dependency.Failed - The Operational State is Disabled because the node is no longer healthy and all members of a dependency group are disabled
  84
  85     **Availability Status Transitions:**
  86         - The transition from Null to Failed is triggered with a disableFailed operation
  87         - The transition from Null to Dependency is triggered with a disableDependency operation
  88         - The transition from Failed to Dependency.Failed is triggered with a disableDependency operation
  89         - The transition from Dependency to Dependency.Failed is triggered with a disableFailed operation
  90         - The transition from Dependency.Failed to Failed is triggered with an enableNoDependency operation
  91         - The transition from Dependency.Failed to Dependency is triggered with an enableNotFailed operation
  92         - The transition from Failed to Null is triggered with an enableNotFailed operation
  93         - The transition from Dependency to Null is triggered with an enableNoDependency operation
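
    The availability transitions listed above can be read as a lookup table. The following is a minimal sketch in Python, not the feature's Java implementation: the names ``Avail``, ``TRANSITIONS``, ``apply_operation`` and ``operational_state`` are hypothetical, and leaving the status unchanged for an unlisted (status, operation) pair is an assumption.

    .. code-block:: python

        from enum import Enum


        class Avail(Enum):
            NULL = "null"
            FAILED = "failed"
            DEPENDENCY = "dependency"
            DEPENDENCY_FAILED = "dependency.failed"


        # (current status, operation) -> next status, one entry per listed transition
        TRANSITIONS = {
            (Avail.NULL, "disableFailed"): Avail.FAILED,
            (Avail.NULL, "disableDependency"): Avail.DEPENDENCY,
            (Avail.FAILED, "disableDependency"): Avail.DEPENDENCY_FAILED,
            (Avail.DEPENDENCY, "disableFailed"): Avail.DEPENDENCY_FAILED,
            (Avail.DEPENDENCY_FAILED, "enableNoDependency"): Avail.FAILED,
            (Avail.DEPENDENCY_FAILED, "enableNotFailed"): Avail.DEPENDENCY,
            (Avail.FAILED, "enableNotFailed"): Avail.NULL,
            (Avail.DEPENDENCY, "enableNoDependency"): Avail.NULL,
        }


        def apply_operation(status, operation):
            """Return the next availability status.

            Assumption: a pair that is not listed leaves the status unchanged.
            """
            return TRANSITIONS.get((status, operation), status)


        def operational_state(status):
            """The Operational State is Enabled only when the status is Null."""
            return "enabled" if status is Avail.NULL else "disabled"

    For example, ``apply_operation(Avail.FAILED, "disableDependency")`` yields ``Avail.DEPENDENCY_FAILED``, and both enable operations are then needed before ``operational_state`` reports ``"enabled"`` again.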
  94
  95     **Standby Status:**
  96         - Null - The node does not support active-standby behavior
  97         - ProvidingService - The node is actively providing application transaction service
  98         - HotStandby - The node is capable of providing application transaction service, but is currently waiting to be promoted
  99         - ColdStandby - The node is not capable of providing application service because of a failure
 100
 101     **Standby Status Transitions:**
 102         - The transition from Null to HotStandby is triggered by a demote operation when the Operational State is Enabled
 103         - The transition from Null to ColdStandby is triggered by a demote operation when the Operational State is Disabled
 104         - The transition from ColdStandby to HotStandby is triggered by a transition of the Operational State from Disabled to Enabled
 105         - The transition from HotStandby to ColdStandby is triggered by a transition of the Operational State from Enabled to Disabled
 106         - The transition from ProvidingService to ColdStandby is triggered by a transition of the Operational State from Enabled to Disabled
 107         - The transition from HotStandby to ProvidingService is triggered by a Promote operation
 108         - The transition from ProvidingService to HotStandby is triggered by a Demote operation
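
    The standby transitions above depend on both the triggering event and the current Operational State. A minimal Python sketch of that rule set follows; the function name ``next_standby`` and the event names ``"opEnabled"`` and ``"opDisabled"`` are hypothetical labels for the Operational State changes, and any combination not listed above is assumed to leave the status unchanged.

    .. code-block:: python

        def next_standby(standby, event, op_enabled=True):
            """Return the next Standby Status for a hypothetical event name.

            standby: "null", "providingservice", "hotstandby" or "coldstandby"
            event: "demote", "promote", "opEnabled" or "opDisabled"
            op_enabled: Operational State at the time of a demote from "null"
            """
            if event == "demote":
                if standby == "null":
                    return "hotstandby" if op_enabled else "coldstandby"
                if standby == "providingservice":
                    return "hotstandby"
            elif event == "promote" and standby == "hotstandby":
                return "providingservice"
            elif event == "opDisabled" and standby in ("hotstandby", "providingservice"):
                return "coldstandby"
            elif event == "opEnabled" and standby == "coldstandby":
                return "hotstandby"
            # Assumption: combinations not listed leave the status unchanged.
            return standby

    In this sketch a node in ColdStandby cannot be promoted directly; it first has to return to HotStandby through an ``"opEnabled"`` event.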
 109
 110 Database
 111 ~~~~~~~~
 112
 113 The State Management feature creates a StateManagement database having three tables:
 114
 115     **StateManagementEntity** - This table has the following columns:
 116         - **id** - Automatically created unique identifier
 117         - **resourceName** - The unique identifier for a node
 118         - **adminState** - The Administrative State
 119         - **opState** - The Operational State
 120         - **availStatus** - The Availability Status
 121         - **standbyStatus** - The Standby Status
 122         - **created_Date** - The timestamp the resource entry was created
 123         - **modifiedDate** - The timestamp the resource entry was last modified
 124
 125     **ForwardProgressEntity** - This table has the following columns:
 126         - **forwardProgressId** - Automatically created unique identifier
 127         - **resourceName** - The unique identifier for a node
 128         - **fpc_count** - A forward progress counter which is periodically incremented if the node is healthy
 129         - **created_date** - The timestamp the resource entry was created
 130         - **last_updated** - The timestamp the resource entry was last updated
 131
 132     **ResourceRegistrationEntity** - This table has the following columns:
 133         - **ResourceRegistrationId** - Automatically created unique identifier
 134         - **resourceName** - The unique identifier for a node
 135         - **resourceUrl** - The JMX URL used to check the health of a node
 136         - **site** - The name of the site in which the resource resides
 137         - **nodeType** - The type of the node (i.e., pdp_xacml, pdp_drools, pap, pap_admin, logparser, brms_gateway, astra_gateway, elk_server, pypdp)
 138         - **created_date** - The timestamp the resource entry was created
 139         - **last_updated** - The timestamp the resource entry was last updated
 140
 141 Node Health Monitoring
 142 ~~~~~~~~~~~~~~~~~~~~~~
 143
 144 **Application Monitoring**
 145
 146     Application monitoring can be implemented using the *startTransaction()* and *endTransaction()* methods.  Whenever a transaction is started, the *startTransaction()* method is called.  If the node is locked, disabled or in a hot/cold standby state, the method will throw an exception.  Otherwise, it resets the timer which triggers the default *testTransaction()* method.
 147
 148     When a transaction completes, calling *endTransaction()* increments the forward progress counter in the *ForwardProgressEntity* DB table.  As long as this counter is updating, the integrity monitor will assume the node is healthy/sane.
 149
 150     If the *startTransaction()* method is not called within a provisioned period of time, a timer will expire which calls the *testTransaction()* method.  The default implementation of this method simply increments the forward progress counter.  The *testTransaction()* method may be overridden to perform a more meaningful test of system sanity, if desired.
 151
 152     If the forward progress counter stops incrementing, the integrity monitoring routine will assume the node application has lost sanity and it will trigger a *statechange* (disableFailed) to cause the operational state to become disabled and the availability status attribute to become failed.  Once the forward progress counter again begins incrementing, the operational state will return to enabled.
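
    The interplay between transactions, the forward progress counter and the monitor can be sketched as a toy model. This is an illustration in Python only and not the *IntegrityMonitor* API: the class ``ForwardProgressModel`` and its methods are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and dependency handling is left out. The 120 second default mirrors ``max_fpc_update_interval`` in the properties file.

    .. code-block:: python

        class ForwardProgressModel:
            """Toy stand-in for the behavior described above (hypothetical names)."""

            def __init__(self, max_fpc_update_interval=120):
                self.max_interval = max_fpc_update_interval
                self.fpc = 0              # forward progress counter
                self.last_update = 0.0    # time of the last counter increment
                self.locked = False
                self.op_enabled = True
                self.standby = "providingservice"

            def start_transaction(self):
                # Refuse work when locked, disabled, or in a hot/cold standby state.
                if (self.locked or not self.op_enabled
                        or self.standby in ("hotstandby", "coldstandby")):
                    raise RuntimeError("node cannot process transactions")

            def end_transaction(self, now):
                self.fpc += 1
                self.last_update = now

            def test_transaction(self, now):
                # Default behavior: simply increment the counter.
                self.end_transaction(now)

            def check(self, now):
                """Monitor pass: disable on a stale counter, enable on a fresh one."""
                stale = now - self.last_update > self.max_interval
                self.op_enabled = not stale
                return "disabled/failed" if stale else "enabled/null"

    Once ``check`` has reported ``"disabled/failed"``, ``start_transaction`` raises, so in this model only ``test_transaction`` can move the counter again and bring the node back to ``"enabled/null"``.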
 153
 154 **Application Monitoring with AllSeemsWell**
 155
 156     The IntegrityMonitor class provides a facility for applications to directly control updates of the forwardprogressentity table.  As previously described, *startTransaction()* and *endTransaction()* are provided to monitor the forward progress of transactions.  This, however, does not monitor things such as internal threads that may block or die.  An example is the feature-state-management *DroolsPdpElectionHandler.run()* method.
 157
 158     The *run()* method is monitored by a timer task, *checkWaitTimer()*.  If the *run()* method is stalled for an extended period of time, the *checkWaitTimer()* method will call *StateManagementFeature.allSeemsWell(<className>, <AllSeemsWell State>, <String message>)* with the AllSeemsWell state of Boolean.FALSE.
 159
 160     The IntegrityMonitor instance owned by StateManagementFeature will then store an entry in the allSeemsWellMap and block updates of the forwardprogressentity table.  This, in turn, causes the Drools PDP operational state to be set to “disabled” and the availability status to be set to “failed”.
 161
 162     Once the blocking condition is cleared, *checkWaitTimer()* will again call the *allSeemsWell()* method, this time with an AllSeemsWell state of Boolean.TRUE. This will cause the IntegrityMonitor to remove the entry for that className from the allSeemsWellMap and allow updating of the forwardprogressentity table, so long as there are no other entries in the map.
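
    The gating role of the allSeemsWellMap can be sketched as follows. This is a simplified Python model, not the Java implementation: the class ``AllSeemsWellGate`` and its method names are hypothetical, and the only behavior shown is that any remaining entry in the map blocks forward progress updates.

    .. code-block:: python

        class AllSeemsWellGate:
            """Hypothetical model of the map that blocks forward progress updates."""

            def __init__(self):
                # class name -> message describing why that reporter is not well
                self.not_well = {}

            def all_seems_well(self, class_name, is_well, message):
                if is_well:
                    # Clearing a condition removes only that reporter's entry.
                    self.not_well.pop(class_name, None)
                else:
                    self.not_well[class_name] = message

            def may_update_forward_progress(self):
                # Updates resume only when no reporter has an open entry.
                return not self.not_well

    With two reporters stalled, clearing one of them is not enough: ``may_update_forward_progress()`` stays ``False`` until the second entry is removed as well.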
 163
 164 **Dependency Monitoring**
 165
 166     When a Drools PDP (or other node using the *IntegrityMonitor* policy/common module) is dependent upon other nodes to perform its function, those other nodes can be defined as dependencies in the properties file. In order for the dependency algorithm to function, the other nodes must also be running the *IntegrityMonitor*.  Periodically the Drools PDP will check the state of dependencies.  If all nodes of a given type have failed, the Drools PDP will declare that it can no longer function and change the operational state to disabled and the availability status to dependency.
 167
 168     In addition to other policy node types, there is a *subsystemTest()* method that is periodically called by the *IntegrityMonitor*.  In Drools PDP, *subsystemTest()* has been overridden to execute an audit of the Database and of the Maven Repository.  If the audit is unable to verify the function of either the DB or the Maven Repository, the Drools PDP will declare that it can no longer function and change the operational state to disabled and the availability status to dependency.
 169
 170     When a failed dependency returns to normal operation, the *IntegrityMonitor* will change the operational state to enabled and availability status to null.
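
    The dependency rule can be illustrated with the ``dependency_groups`` format used in the properties file (groups separated by semicolons, members by commas). The Python sketch below is an assumption-laden illustration rather than the actual algorithm: the function names are hypothetical, and the set of disabled resources is supplied by the caller instead of being read from the database.

    .. code-block:: python

        def parse_dependency_groups(value):
            """Split 'a,b;c' into [['a', 'b'], ['c']], ignoring empty entries."""
            groups = []
            for group in value.split(";"):
                members = [name.strip() for name in group.split(",") if name.strip()]
                if members:
                    groups.append(members)
            return groups


        def dependency_failed(groups, disabled):
            """True if every member of at least one group is disabled."""
            return any(all(member in disabled for member in members)
                       for members in groups)

    With ``site_1.astra_1,site_1.astra_2;site_1.brms_1,site_1.brms_2`` as input, losing only ``site_1.astra_1`` does not trip the rule, while losing both astra nodes does.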
 171
 172 **External Health Monitoring Interface**
 173
 174     The Drools PDP has an HTTP test interface which, when called, returns HTTP status 200 if all seems well and 500 otherwise.  The test interface URL is defined in the properties file.
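
    An external monitor could poll this interface along the following lines. This is a minimal Python sketch using only the standard library; the function names are hypothetical, and the URL in the usage note (port 9981 from the properties file, path ``/test``) is an assumption that should be checked against the deployed configuration.

    .. code-block:: python

        import urllib.error
        import urllib.request


        def interpret_status(status_code):
            """200 means all seems well; anything else is treated as unhealthy."""
            return status_code == 200


        def node_is_healthy(url, timeout=5.0):
            """Call the test interface and map the HTTP status to a boolean."""
            try:
                with urllib.request.urlopen(url, timeout=timeout) as response:
                    return interpret_status(response.status)
            except urllib.error.HTTPError as err:
                # urlopen raises HTTPError for error statuses such as 500.
                return interpret_status(err.code)
            except (urllib.error.URLError, OSError):
                # Unreachable host, refused connection or timeout.
                return False

    A call might look like ``node_is_healthy("http://localhost:9981/test")``, where host, port and path are placeholders for the values configured on the node.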
 175
 176
 177 Site Manager
 178 ~~~~~~~~~~~~
 179
 180 The Site Manager is not deployed with the Drools PDP, but it is available in the policy/common repository in the site-manager directory.
 181 The Site Manager provides a lock/unlock interface for nodes and a way to display node information and status.
 182
 183 The following is from the README file included with the Site Manager.
 184
 185     .. code-block:: bash
 186        :caption: Site Manager README extract
 187
 188         Before using 'siteManager', the file 'siteManager.properties' needs to be
 189         edited to configure the parameters used to access the database:
 190
 191             javax.persistence.jdbc.driver - typically 'org.mariadb.jdbc.Driver'
 192
 193             javax.persistence.jdbc.url - URL referring to the database,
 194                 which typically has the form: 'jdbc:mariadb://<host>:<port>/<db>'
 195                 ('<db>' is probably 'xacml' in this case)
 196
 197             javax.persistence.jdbc.user - the user id for accessing the database
 198
 199             javax.persistence.jdbc.password - password for accessing the database
 200
 201         Once the properties file has been updated, the 'siteManager' script can be
 202         invoked as follows:
 203
 204             siteManager show [ -s <site> | -r <resourceName> ] :
 205                 display node information (Site, NodeType, ResourceName, AdminState,
 206                                           OpState, AvailStatus, StandbyStatus)
 207
 208             siteManager setAdminState { -s <site> | -r <resourceName> } <new-state> :
 209                 update admin state on selected nodes
 210
 211             siteManager lock { -s <site> | -r <resourceName> } :
 212                 lock selected nodes
 213
 214             siteManager unlock { -s <site> | -r <resourceName> } :
 215                 unlock selected nodes
 216
 217 Note that the 'siteManager' script assumes that the script,
 218 'site-manager-${project.version}.jar' file and 'siteManager.properties' file
 219 are all in the same directory. If the files are separated, the 'siteManager'
 220 script will need to be modified so it can locate the jar and properties files.
 221
 222
 223 Properties
 224 ~~~~~~~~~~
 225
 226 The feature-state-management.properties file controls the function of the State Management Feature.  In general, the properties have adequate descriptions in the file. Parameters which must be replaced prior to usage are indicated thus: ${{parameter to be replaced}}.
 227
 228     .. code-block:: bash
 229        :caption: feature-state-management.properties
 230
 231         # DB properties
 232         javax.persistence.jdbc.driver=org.mariadb.jdbc.Driver
 233         javax.persistence.jdbc.url=jdbc:mariadb://${{SQL_HOST}}:3306/statemanagement
 234         javax.persistence.jdbc.user=${{SQL_USER}}
 235         javax.persistence.jdbc.password=${{SQL_PASSWORD}}
 236
 237         # DroolsPDPIntegrityMonitor Properties
 238         # Test interface host and port defaults may be overwritten here
 239         http.server.services.TEST.host=0.0.0.0
 240         http.server.services.TEST.port=9981
 241         #These properties will default to the following if no other values are provided:
 242         # http.server.services.TEST.restClasses=org.onap.policy.drools.statemanagement.IntegrityMonitorRestManager
 243         # http.server.services.TEST.managed=false
 244         # http.server.services.TEST.swagger=true
 245
 246         #IntegrityMonitor Properties
 247
 248         # Must be unique across the system
 249         resource.name=pdp1
 250         # Name of the site in which this node is hosted
 251         site_name=site1
 252         # Forward Progress Monitor update interval seconds
 253         fp_monitor_interval=30
 254         # Failed counter threshold before failover
 255         failed_counter_threshold=3
 256         # Interval between test transactions when no traffic seconds
 257         test_trans_interval=10
 258         # Interval between writes of the FPC to the DB seconds
 259         write_fpc_interval=5
 260         # Node type Note: Make sure you don't leave any trailing spaces, or you'll get an 'invalid node type' error!
 261         node_type=pdp_drools
 262         # Dependency groups are groups of resources upon which a node operational state is dependent upon.
 263         # Each group is a comma-separated list of resource names and groups are separated by a semicolon.  For example:
 264         # dependency_groups=site_1.astra_1,site_1.astra_2;site_1.brms_1,site_1.brms_2;site_1.logparser_1;site_1.pypdp_1
 265         dependency_groups=
 266         # When set to true, dependent health checks are performed by using JMX to invoke test() on the dependent.
 267         # The default false is to use state checks for health.
 268         test_via_jmx=true
 269         # This is the max number of seconds beyond which a non incrementing FPC is considered a failure
 270         max_fpc_update_interval=120
 271         # Run the state audit every 60 seconds (60000 ms).  The state audit finds stale DB entries in the
 272         # forwardprogressentity table and marks the node as disabled/failed in the statemanagemententity
 273         # table. NOTE! It will only run on nodes that have a standbystatus = providingservice.
 274         # A value of <= 0 will turn off the state audit.
 275         state_audit_interval_ms=60000
 276         # The refresh state audit is run every (default) 10 minutes (600000 ms) to clean up any state corruption in the
 277         # DB statemanagemententity table. It only refreshes the DB state entry for the local node.  That is, it does not
 278         # refresh the state of any other nodes.  A value <= 0 will turn the audit off. Any other value will override
 279         # the default of 600000 ms.
 280         refresh_state_audit_interval_ms=600000
 281
 282         # Repository audit properties
 283         # Assume it's the releaseRepository that needs to be audited,
 284         # because that's the one BRMGW will publish to.
 285         repository.audit.id=${{releaseRepositoryID}}
 286         repository.audit.url=${{releaseRepositoryUrl}}
 287         repository.audit.username=${{repositoryUsername}}
 288         repository.audit.password=${{repositoryPassword}}
 289         repository2.audit.id=${{releaseRepository2ID}}
 290         repository2.audit.url=${{releaseRepository2Url}}
 291         repository2.audit.username=${{repositoryUsername2}}
 292         repository2.audit.password=${{repositoryPassword2}}
 293
 294         # Repository Audit Properties
 295         # Flag to control the execution of the subsystemTest for the Nexus Maven repository
 296         repository.audit.is.active=false
 297         repository.audit.ignore.errors=true
 298         repository.audit.interval_sec=86400
 299         repository.audit.failure.threshold=3
 300
 301         # DB Audit Properties
 302         # Flag to control the execution of the subsystemTest for the Database
 303         db.audit.is.active=false
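
Because every ``${{...}}`` parameter has to be replaced before the feature is enabled, a quick check for leftovers can be useful. The following Python sketch is a hypothetical helper, not part of the feature: it scans the text of a properties file, skips comments and blank lines, and reports the keys whose values still contain a placeholder.

    .. code-block:: python

        import re

        # Matches placeholders of the form ${{NAME}} and captures NAME.
        PLACEHOLDER = re.compile(r"\$\{\{([^}]*)\}\}")


        def unreplaced_placeholders(text):
            """Map each property key to the placeholder names left in its value."""
            found = {}
            for line in text.splitlines():
                stripped = line.strip()
                if not stripped or stripped.startswith("#") or "=" not in stripped:
                    continue
                key, _, value = stripped.partition("=")
                names = PLACEHOLDER.findall(value)
                if names:
                    found[key.strip()] = names
            return found

An empty result means no ``${{...}}`` markers remain in any property value; it does not validate the values themselves.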
 304
 305
 306 End of Document
 307
 308 .. SSNote: Wiki page ref. https://wiki.onap.org/display/DW/Feature+State+Management
 309
 310