X-Git-Url: https://gerrit.onap.org/r/gitweb?a=blobdiff_plain;f=docs%2Fdevelopment%2Fprometheus-metrics.rst;h=84699853a0d2fa11dcd07bf2710c63d327463122;hb=497107302c4295e30cdea18ec70bc90b606844ee;hp=7e9211ab0ec3d745818d000a4742963c6abf687e;hpb=178dea1eb3eb979994d8b99f317f536b85435b60;p=policy%2Fparent.git
diff --git a/docs/development/prometheus-metrics.rst b/docs/development/prometheus-metrics.rst
index 7e9211ab..84699853 100644
--- a/docs/development/prometheus-metrics.rst
+++ b/docs/development/prometheus-metrics.rst
@@ -11,3 +11,167 @@ Prometheus Metrics support in Policy Framework Components
     :depth: 3
 
 This page explains the prometheus metrics exposed by different Policy Framework components.
+
+
+1. Context
+==========
+
+Collecting application metrics is the first step towards gaining insights into Policy Framework services and infrastructure from the point of view of Availability, Performance, Reliability and Scalability.
+
+The goal of monitoring is to meet the following operational needs:
+
+1. Monitoring via dashboards: Provide visual aids to display health and key metrics for use by Operations.
+2. Alerting: Something is broken and the issue must be addressed immediately, or something might break soon and proactive measures must be taken to avoid such a situation.
+3. Conducting retrospective analysis: Rich information that is readily available to better troubleshoot issues.
+4. Analyzing trends: How fast is the usage growing? What does the incoming traffic look like? This helps assess scaling needs to meet forecasted demand.
+
+The principles outlined in the `Four Golden Signals `__ developed by Google Site Reliability Engineers have been adopted to define the key metrics for Policy Framework.
+
+- Request Rate: # of requests per second as served by Policy services.
+- Event Processing rate: # of requests/events per second as processed by the PDPs.
+- Errors: # of those requests/events processed that are failing.
+- Latency/Duration: Amount of time those requests take and, for PDPs, the relevant event processing times.
+- Saturation: Measures the degree of fullness or % utilization of a service, emphasizing the resources that are most constrained: CPU, Memory, I/O, custom metrics by domain.
+
+
+2. Policy Framework Key metrics
+===============================
+
+System Metrics common across all Policy components
+--------------------------------------------------
+
+These standard metrics have been exposed via a Prometheus endpoint since the Istanbul release and can be categorized as below:
+
+CPU Usage
+*********
+
+CPU usage percentage can be derived from *"system_cpu_usage"* for springboot applications and *"process_cpu_seconds_total"* for non springboot applications using `PromQL `__ .
+
+Process uptime
+**************
+
+The process uptime in seconds is available via *"process_uptime_seconds"* for springboot applications. For non springboot applications it can be derived from the process start time reported by *"process_start_time_seconds"*.
+
+Status of the applications is available via the standard *"up"* metric.
+
+JVM memory metrics
+******************
+
+These metrics begin with the prefix *"jvm_memory_"*.
+There is a lot of data here; however, one of the key metrics to monitor is the total heap memory usage, e.g. *sum(jvm_memory_used_bytes{area="heap"})*.
+
+`PromQL `__ can be leveraged to represent the total or rate of memory usage.
+
+JVM thread metrics
+******************
+
+These metrics begin with the prefix *"jvm_threads_"*. Some of the key data to monitor are:
+
+- *"jvm_threads_live_threads"* (springboot apps), or *"jvm_threads_current"* (non springboot) shows the total number of live threads, including daemon and non-daemon threads
+- *"jvm_threads_peak_threads"* (springboot apps), or *"jvm_threads_peak"* (non springboot) shows the peak total number of threads since the JVM started
+- *"jvm_threads_states_threads"* (springboot apps), or *"jvm_threads_state"* (non springboot) shows the number of threads by thread state
+
+JVM garbage collection metrics
+******************************
+
+There are many garbage collection metrics, with prefix *"jvm_gc_"*, available to get deep insights into how the JVM is managing memory. They can be broadly categorized into:
+
+- Pause duration: *"jvm_gc_pause_"* for springboot applications gives information about how long GC took. For non springboot applications, the collection duration metrics *"jvm_gc_collection_"* provide the same information.
+- Memory pool size increase: can be assessed using *"jvm_gc_memory_allocated_bytes_total"* and *"jvm_gc_memory_promoted_bytes_total"* for springboot applications.
+
+Average garbage collection time and rate of garbage collection per second are key metrics to monitor.
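+
+The derivations mentioned in this section can be sketched in PromQL as follows. These queries are illustrative only: the 5 minute window is an arbitrary assumption, and the *_seconds_sum* and *_seconds_count* suffixes on the GC metrics are assumed from the usual Micrometer and Prometheus Java client naming, so they should be checked against what each component actually exposes.
+
+.. code-block:: text
+
+    # CPU usage as a percentage (springboot); system_cpu_usage is a gauge between 0 and 1
+    system_cpu_usage * 100
+
+    # CPU usage as a percentage of one core (non springboot), from the CPU seconds counter
+    rate(process_cpu_seconds_total[5m]) * 100
+
+    # Uptime in seconds derived from the process start time (non springboot)
+    time() - process_start_time_seconds
+
+    # Average GC pause in seconds, then number of GC pauses per second (springboot)
+    rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])
+    rate(jvm_gc_pause_seconds_count[5m])
+
+    # Same two values for non springboot applications
+    rate(jvm_gc_collection_seconds_sum[5m]) / rate(jvm_gc_collection_seconds_count[5m])
+    rate(jvm_gc_collection_seconds_count[5m])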
+
+
+Key metrics for Policy API
+--------------------------
+
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| Metric name                                | Metric description                                 | Metric labels                                                     |
++============================================+====================================================+===================================================================+
+| process_uptime_seconds                     | Uptime of policy-api application in seconds.       |                                                                   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| http_server_requests_seconds_count         | Number of API requests filtered by uri, REST       | "exception": any exception string; "method": REST method used;    |
+|                                            | method and response status among other labels      | "outcome": response status string;                                |
+|                                            |                                                    | "status": http response status code; "uri": REST endpoint invoked |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| http_server_requests_seconds_sum           | Time taken for an API request filtered by uri,     | "exception": any exception string; "method": REST method used;    |
+|                                            | REST method and response status among other labels | "outcome": response status string;                                |
+|                                            |                                                    | "status": http response status code; "uri": REST endpoint invoked |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+
+Key metrics for Policy PAP
+--------------------------
+
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| Metric name                                | Metric description                                 | Metric labels                                                     |
++============================================+====================================================+===================================================================+
+| process_uptime_seconds                     | Uptime of policy-pap application in seconds.       |                                                                   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| http_server_requests_seconds_count         | Number of API requests filtered by uri, REST       | "exception": any exception string; "method": REST method used;    |
+|                                            | method and response status among other labels      | "outcome": response status string;                                |
+|                                            |                                                    | "status": http response status code; "uri": REST endpoint invoked |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| http_server_requests_seconds_sum           | Time taken for an API request filtered by uri,     | "exception": any exception string; "method": REST method used;    |
+|                                            | REST method and response status among other labels | "outcome": response status string;                                |
+|                                            |                                                    | "status": http response status code; "uri": REST endpoint invoked |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pap_policy_deployments                     | Number of TOSCA policy deploy/undeploy operations  | "operation": Possible values are deploy, undeploy;                |
+|                                            |                                                    | "status": Deploy/Undeploy status values - SUCCESS, FAILURE, TOTAL |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+
+Key metrics for APEX-PDP
+------------------------
+
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| Metric name                                | Metric description                                 | Metric labels                                                     |
++============================================+====================================================+===================================================================+
+| process_start_time_seconds                 | Start time of apex-pdp application in epoch        |                                                                   |
+|                                            | seconds, used to derive uptime                     |                                                                   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpa_policy_deployments_total              | Number of TOSCA policy deploy/undeploy operations  | "operation": Possible values are deploy, undeploy;                |
+|                                            |                                                    | "status": Deploy/Undeploy status values - SUCCESS, FAILURE, TOTAL |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpa_policy_executions_total               | Number of TOSCA policy executions                  | "status": Execution status values - SUCCESS, FAILURE, TOTAL       |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpa_engine_state                          | State of APEX engine                               | "engine_instance_id": ID of the engine thread                     |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpa_engine_last_start_timestamp_epoch     | Epoch timestamp at which the engine was last       | "engine_instance_id": ID of the engine thread                     |
+|                                            | started, from which uptime can be derived          |                                                                   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpa_engine_event_executions               | Number of APEX events executed per engine thread   | "engine_instance_id": ID of the engine thread                     |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpa_engine_average_execution_time_seconds | Average time taken to execute an APEX policy in    | "engine_instance_id": ID of the engine thread                     |
+|                                            | seconds                                            |                                                                   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+
+Key metrics for Drools PDP
+--------------------------
+
+Key metrics for XACML PDP
+-------------------------
+
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| Metric name                                | Metric description                                 | Metric labels                                                     |
++============================================+====================================================+===================================================================+
+| process_start_time_seconds                 | Start time of xacml-pdp application in epoch       |                                                                   |
+|                                            | seconds, used to derive uptime                     |                                                                   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpx_policy_deployments_total              | Counts the total number of deployment operations   | "deploy": Counts the number of successful or failed deploys;      |
+|                                            |                                                    | "undeploy": Counts the number of successful or failed undeploys   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+| pdpx_policy_decisions_total                | Counts the total number of decisions               | "permit": Counts the number of permit decisions;                  |
+|                                            |                                                    | "deny": Counts the number of deny decisions;                      |
+|                                            |                                                    | "indeterminant": Counts the number of indeterminant decisions;    |
+|                                            |                                                    | "not_applicable": Counts the number of not applicable decisions   |
++--------------------------------------------+----------------------------------------------------+-------------------------------------------------------------------+
+
+
+Key metrics for Policy Distribution
+-----------------------------------
+
++-----------------------------------+------------------------------------------------------+
+| Metric name                       | Metric description                                   |
++===================================+======================================================+
+| total_distribution_received_count | Total number of distributions received               |
++-----------------------------------+------------------------------------------------------+
+| distribution_success_count        | Total number of distributions successfully processed |
++-----------------------------------+------------------------------------------------------+
+| distribution_failure_count        | Total number of distribution failures                |
++-----------------------------------+------------------------------------------------------+
+| total_download_received_count     | Total number of downloads received                   |
++-----------------------------------+------------------------------------------------------+
+| download_success_count            | Total number of downloads successfully processed     |
++-----------------------------------+------------------------------------------------------+
+| download_failure_count            | Total number of download failures                    |
++-----------------------------------+------------------------------------------------------+
+
+
+3. OOM changes to enable prometheus monitoring for Policy Framework
+===================================================================
+
+Policy Framework uses the ServiceMonitor custom resource definition (CRD) to allow Prometheus to monitor the services it exposes. Label selection is used to determine which services are selected to be monitored.
+For label management and troubleshooting, refer to the documentation at: `Prometheus operator `__.
+
+`OOM charts `__ for policy include a ServiceMonitor, and its properties can be overridden based on the deployment specifics.
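+
+Once a ServiceMonitor is in place, scraping can be sanity-checked from the Prometheus side. The queries below are a minimal sketch: the *job* label pattern is a hypothetical example that depends on how the services are named in a given deployment, and the 5 minute window is an arbitrary choice.
+
+.. code-block:: text
+
+    # Selected targets that are currently failing to be scraped
+    # (the job pattern is an assumption; adjust it to the deployed service names)
+    up{job=~".*policy.*"} == 0
+
+    # Request rate per REST endpoint and status code (springboot components)
+    sum by (uri, status) (rate(http_server_requests_seconds_count[5m]))
+
+    # Average request duration in seconds per REST endpoint
+    sum by (uri) (rate(http_server_requests_seconds_sum[5m]))
+      / sum by (uri) (rate(http_server_requests_seconds_count[5m]))
+
+An empty result for the first query only means that no selected target is down. If the policy services do not show up in the *"up"* metric at all, the label selection is the first thing to check.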