.. This work is licensed under a
   Creative Commons Attribution 4.0 International License.
.. _integration-s3p:

Stability/Resiliency
====================

.. important::
    The Release stability has been evaluated by:

    - The daily Honolulu CI/CD chain
    - Stability tests
    - Resiliency tests

.. note::
    The scope of these tests remains limited and does not provide a full set of
    KPIs to determine the limits and the dimensioning of the ONAP solution.

CI results
----------

As usual, a daily CI chain dedicated to the release is created after RC0.
A Honolulu chain was created on the 6th of April 2021.

The daily results can be found in `LF daily results web site `_.

Infrastructure Healthcheck Tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These tests deal with the Kubernetes/Helm tests on the ONAP cluster.

The global expected criterion is **75%**.
The onap-k8s and onap-k8s-teardown tests, which provide a snapshot of the onap
namespace in Kubernetes, as well as the onap-helm tests are expected to be PASS.

The nodeport_check_certs test is expected to fail. Even though tremendous
progress has been made in this area, some certificates (unmaintained, upstream
or integration robot pods) are still not correct due to bad certificate issuers
(invalid Root CA certificate) or extra long validity. Most of the certificates
have been installed using cert-manager and will be easily renewable.

.. image:: files/s3p/honolulu_daily_infrastructure_healthcheck.png
   :align: center

Healthcheck Tests
~~~~~~~~~~~~~~~~~

These tests are the traditional robot healthcheck tests and additional tests
dealing with a single component.

Some tests (basic_onboard, basic_cds) may fail occasionally because the startup
of the SDC is sometimes not fully completed.

The same test is run as the first step of the smoke tests and is usually PASS.
The mechanism to detect that all the components are fully operational could be
improved; timer-based solutions are not robust enough.

The expectation is **100% OK**.

.. image:: files/s3p/honolulu_daily_healthcheck.png
   :align: center

Smoke Tests
~~~~~~~~~~~

These tests are end-to-end, automated use case tests.
See :ref:`the Integration Test page ` for details.

The expectation is **100% OK**.

.. figure:: files/s3p/honolulu_daily_smoke.png
   :align: center

An error was detected on the SDNC that prevents basic_vm_macro from working.
See `SDNC-1529 `_ for details.
SO timeouts also occurred more frequently than in Guilin.
See `SO-3584 `_ for details.

Security Tests
~~~~~~~~~~~~~~

These tests deal with security.
See :ref:`the Integration Test page ` for details.

The expectation is **66% OK**. The criterion is met.

The actual rate may even be higher, as 2 failing tests are almost correct:

- The unlimited pod test still fails due to a testing pod (DCAE-tca).
- The nonssl test is FAIL due to so and so-etsi-sol003-adapter, which were
  supposed to be managed with the ingress (not possible for this release) and
  got a waiver in Frankfurt. The pods cds-blueprints-processor-http and aws-web
  are used for tests.

.. figure:: files/s3p/honolulu_daily_security.png
   :align: center

Resiliency tests
----------------

The goal of the resiliency testing was to evaluate the capability of the
Honolulu solution to survive a stop or restart of a Kubernetes control or
worker node.

Controller node resiliency
~~~~~~~~~~~~~~~~~~~~~~~~~~

By default the ONAP solution is installed with 3 controllers for high
availability. The test for controller resiliency can be described as follows:

- Run tests: check that they are PASS
- Stop a controller node: check that the node appears in NotReady state
- Run tests: check that they are PASS

Two tests were performed on the weekly Honolulu lab. No problem was observed on
controller shutdown; the tests were still PASS with a stopped controller node.

More details can be found in .

Worker node resiliency
~~~~~~~~~~~~~~~~~~~~~~

In the community weekly lab, the ONAP pods are distributed over 12 workers.
The goal of the test was to evaluate the behavior of the pods on a worker
restart (disaster scenario assuming that the node was moved accidentally from
Ready to NotReady state).
The original conditions of such tests may differ, as the Kubernetes scheduler
does not distribute the pods on the same workers from one installation to
another.

The test procedure can be described as follows:

- Run tests: check that they are PASS (Healthcheck and basic_vm used)
- Check that all the workers are in Ready state::

    $ kubectl get nodes
    NAME                      STATUS   ROLES    AGE   VERSION
    compute01-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute02-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute03-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute04-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute05-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute06-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute07-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute08-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute09-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute10-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute11-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute12-onap-honolulu   Ready    <none>   18h   v1.19.9
    control01-onap-honolulu   Ready    master   18h   v1.19.9
    control02-onap-honolulu   Ready    master   18h   v1.19.9
    control03-onap-honolulu   Ready    master   18h   v1.19.9

- Select a worker, list the impacted pods::

    $ kubectl get pod -n onap --field-selector spec.nodeName=compute01-onap-honolulu
    NAME                                             READY   STATUS    RESTARTS   AGE
    onap-aaf-fs-7b6648db7f-shcn5                     1/1     Running   1          22h
    onap-aaf-oauth-5896545fb7-x6grg                  1/1     Running   1          22h
    onap-aaf-sms-quorumclient-2                      1/1     Running   1          22h
    onap-aai-modelloader-86d95c994b-87tsh            2/2     Running   2          22h
    onap-aai-schema-service-75575cb488-7fxs4         2/2     Running   2          22h
    onap-appc-cdt-58cb4766b6-vl78q                   1/1     Running   1          22h
    onap-appc-db-0                                   2/2     Running   4          22h
    onap-appc-dgbuilder-5bb94d46bd-h2gbs             1/1     Running   1          22h
    onap-awx-0                                       4/4     Running   4          22h
    onap-cassandra-1                                 1/1     Running   1          22h
    onap-cds-blueprints-processor-76f8b9b5c7-hb5bg   1/1     Running   1          22h
    onap-dmaap-dr-db-1                               2/2     Running   5          22h
    onap-ejbca-6cbdb7d6dd-hmw6z                      1/1     Running   1          22h
    onap-kube2msb-858f46f95c-jws4m                   1/1     Running   1          22h
    onap-message-router-0                            1/1     Running   1          22h
    onap-message-router-kafka-0                      1/1     Running   1          22h
    onap-message-router-kafka-1                      1/1     Running   1          22h
    onap-message-router-kafka-2                      1/1     Running   1          22h
    onap-message-router-zookeeper-0                  1/1     Running   1          22h
    onap-multicloud-794c6dffc8-bfwr8                 2/2     Running   2          22h
    onap-multicloud-starlingx-58f6b86c55-mff89       3/3     Running   3          22h
    onap-multicloud-vio-584d556876-87lxn             2/2     Running   2          22h
    onap-music-cassandra-0                           1/1     Running   1          22h
    onap-netbox-nginx-8667d6675d-vszhb               1/1     Running   2          22h
    onap-policy-api-6dbf8485d7-k7cpv                 1/1     Running   1          22h
    onap-policy-clamp-be-6d77597477-4mffk            1/1     Running   1          22h
    onap-policy-pap-785bd79759-xxhvx                 1/1     Running   1          22h
    onap-policy-xacml-pdp-7d8fd58d59-d4m7g           1/1     Running   6          22h
    onap-sdc-be-5f99c6c644-dcdz8                     2/2     Running   2          22h
    onap-sdc-fe-7577d58fb5-kwxpj                     2/2     Running   2          22h
    onap-sdc-wfd-fe-6997567759-gl9g6                 2/2     Running   2          22h
    onap-sdnc-dgbuilder-564d6475fd-xwwrz             1/1     Running   1          22h
    onap-sdnrdb-master-0                             1/1     Running   1          22h
    onap-so-admin-cockpit-6c5b44694-h4d2n            1/1     Running   1          21h
    onap-so-etsi-sol003-adapter-c9bf4464-pwn97       1/1     Running   1          21h
    onap-so-sdc-controller-6899b98b8b-hfgvc          2/2     Running   2          21h
    onap-vfc-mariadb-1                               2/2     Running   4          21h
    onap-vfc-nslcm-6c67677546-xcvl2                  2/2     Running   2          21h
    onap-vfc-vnflcm-78ff4d8778-sgtv6                 2/2     Running   2          21h
    onap-vfc-vnfres-6c96f9ff5b-swq5z                 2/2     Running   2          21h

- Stop the worker (shut down the machine for bare metal, or the VM if you
  installed your Kubernetes on top of an OpenStack solution)
- Wait for the pod eviction procedure to complete (5 minutes)::

    $ kubectl get nodes
    NAME                      STATUS     ROLES    AGE   VERSION
    compute01-onap-honolulu   NotReady   <none>   18h   v1.19.9
    compute02-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute03-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute04-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute05-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute06-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute07-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute08-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute09-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute10-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute11-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute12-onap-honolulu   Ready      <none>   18h   v1.19.9
    control01-onap-honolulu   Ready      master   18h   v1.19.9
    control02-onap-honolulu   Ready      master   18h   v1.19.9
    control03-onap-honolulu   Ready      master   18h   v1.19.9

- Run the tests: check that they are PASS

.. warning::
    In these conditions, **the tests will never be PASS**. In fact several
    components will remain in INIT state.
    A procedure is required to ensure a clean restart.

List the non-running pods::

  $ kubectl get pods -n onap --field-selector status.phase!=Running | grep -v Completed
  NAME                                             READY   STATUS     RESTARTS   AGE
  onap-appc-dgbuilder-5bb94d46bd-sxmmc             0/1     Init:3/4   15         156m
  onap-cds-blueprints-processor-76f8b9b5c7-m7nmb   0/1     Init:1/3   0          156m
  onap-portal-app-595bd6cd95-bkswr                 0/2     Init:0/4   84         23h
  onap-portal-db-config-6s75n                      0/2     Error      0          23h
  onap-portal-db-config-7trzx                      0/2     Error      0          23h
  onap-portal-db-config-jt2jl                      0/2     Error      0          23h
  onap-portal-db-config-mjr5q                      0/2     Error      0          23h
  onap-portal-db-config-qxvdt                      0/2     Error      0          23h
  onap-portal-db-config-z8c5n                      0/2     Error      0          23h
  onap-sdc-be-5f99c6c644-kplqx                     0/2     Init:2/5   14         156
  onap-vfc-nslcm-6c67677546-86mmj                  0/2     Init:0/1   15         156m
  onap-vfc-vnflcm-78ff4d8778-h968x                 0/2     Init:0/1   15         156m
  onap-vfc-vnfres-6c96f9ff5b-kt9rz                 0/2     Init:0/1   15         156m

Some pods are not rescheduled (e.g. onap-awx-0 and onap-cassandra-1 above)
because they are part of a statefulset. List the statefulset objects::

  $ kubectl get statefulsets.apps -n onap | grep -v "1/1" | grep -v "3/3"
  NAME                            READY   AGE
  onap-aaf-sms-quorumclient       2/3     24h
  onap-appc-db                    2/3     24h
  onap-awx                        0/1     24h
  onap-cassandra                  2/3     24h
  onap-dmaap-dr-db                2/3     24h
  onap-message-router             0/1     24h
  onap-message-router-kafka       0/3     24h
  onap-message-router-zookeeper   2/3     24h
  onap-music-cassandra            2/3     24h
  onap-sdnrdb-master              2/3     24h
  onap-vfc-mariadb                2/3     24h

For the pods that are part of a statefulset, a forced deletion is required.
As an example, if we consider the statefulset onap-sdnrdb-master, we must
follow this procedure::

  $ kubectl get pods -n onap -o wide |grep onap-sdnrdb-master
  onap-sdnrdb-master-0   1/1   Terminating   1   24h   10.42.3.92    node1
  onap-sdnrdb-master-1   1/1   Running       1   24h   10.42.1.122   node2
  onap-sdnrdb-master-2   1/1   Running       1   24h   10.42.2.134   node3

  $ kubectl delete -n onap pod onap-sdnrdb-master-0 --force
  warning: Immediate deletion does not wait for confirmation that the running
  resource has been terminated. The resource may continue to run on the cluster
  indefinitely.
  pod "onap-sdnrdb-master-0" force deleted

  $ kubectl get pods |grep onap-sdnrdb-master
  onap-sdnrdb-master-0   0/1   PodInitializing   0   11s
  onap-sdnrdb-master-1   1/1   Running           1   24h
  onap-sdnrdb-master-2   1/1   Running           1   24h

  $ kubectl get pods |grep onap-sdnrdb-master
  onap-sdnrdb-master-0   1/1   Running   0   43s
  onap-sdnrdb-master-1   1/1   Running   1   24h
  onap-sdnrdb-master-2   1/1   Running   1   24h

Once all the statefulsets are properly restarted, the other components should
complete their restart properly.
Once the restart of the pods is completed, the tests are PASS.

.. important::
    K8s node reboots/shutdowns reveal some deficiencies in ONAP components
    regarding their availability, as measured by the HC results. Some pods may
    still fail to initialize after a reboot/shutdown (pod rescheduled).

    However, the cluster as a whole behaves as expected: pods are rescheduled
    after a node shutdown (except pods that are part of a statefulset, which
    need to be deleted forcibly; this is normal Kubernetes behavior).

    On a rebooted node, if its downtime does not exceed the eviction timeout,
    pods are restarted once the node is available again.

Please see `Integration Resiliency page `_ for details.
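The statefulset check in this section filters the ``kubectl`` output with
``grep -v`` on the ``1/1`` and ``3/3`` patterns, which only covers these two
replica counts. As a minimal sketch (not part of the ONAP tooling), the same
check can be done by comparing the two numbers of the READY column. The function
name and the ``example-healthy-db`` entry are hypothetical; the other sample
lines are taken from the listing in this section.

```python
def degraded_statefulsets(kubectl_output: str) -> list[tuple[str, int, int]]:
    """Return (name, ready, desired) for statefulsets with ready < desired.

    Expects the text printed by `kubectl get statefulsets.apps -n onap`.
    """
    degraded = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/READY/AGE header.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        ready, _, desired = fields[1].partition("/")
        # Ignore anything that does not look like "<ready>/<desired>".
        if not (ready.isdigit() and desired.isdigit()):
            continue
        if int(ready) < int(desired):
            degraded.append((fields[0], int(ready), int(desired)))
    return degraded


# Sample input: three lines from the listing above, plus one hypothetical
# healthy entry (example-healthy-db) to show that it is filtered out.
SAMPLE = """\
NAME                            READY   AGE
onap-awx                        0/1     24h
onap-cassandra                  2/3     24h
example-healthy-db              3/3     24h
onap-message-router-kafka       0/3     24h
"""

if __name__ == "__main__":
    for name, ready, desired in degraded_statefulsets(SAMPLE):
        print(f"{name}: {ready}/{desired} replicas ready")
```

The pods of each reported statefulset can then be inspected and, if they are
stuck in Terminating state, force deleted as described above.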

Stability tests
---------------

Three stability tests have been performed in Honolulu:

- SDC stability test
- Simple instantiation test (basic_vm)
- Parallel instantiation test

SDC stability test
~~~~~~~~~~~~~~~~~~

In this test, we consider the basic_onboard automated test and we run 5
onboarding procedures in parallel for 72h.

The basic_onboard test consists of the following steps:

- [SDC] VendorOnboardStep: Onboard vendor in SDC.
- [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
- [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
- [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file in SDC.

The test was initiated on the Honolulu weekly lab on the 19th of April.

As already observed in the daily, weekly and gating chains, we hit race
conditions on some tests (https://jira.onap.org/browse/INT-1918).

The success rate is above 95% for the first 100 model uploads and stays above
80% until more than 500 models have been onboarded.

The test duration also increases continuously over time. At the beginning the
test takes about 200s; 24h later the same test takes around 1000s.
Finally, after 36h, the SDC systematically answers with an HTTP 500 response
code, which explains the linear decrease of the success rate.

The following graph provides a good view of the SDC stability test.

.. image:: files/s3p/honolulu_sdc_stability.png
   :align: center

.. important::
    SDC can support the onboarding of up to hundreds of models.
    The onboarding duration increases linearly with the number of onboarded
    models.
    After a while, the SDC is no longer usable.
    No major cluster resource issues have been detected during the test. The
    memory consumption is however relatively high with regard to the load.

.. image:: files/s3p/honolulu_sdc_stability_resources.png
   :align: center

Simple stability test
~~~~~~~~~~~~~~~~~~~~~

This test consists of running the basic_vm test continuously for 72h.
We observe the cluster metrics as well as the evolution of the test duration.

The test basic_vm is described in :ref:`the Integration Test page `.

The basic_vm test consists of the following steps:

- [SDC] VendorOnboardStep: Onboard vendor in SDC.
- [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
- [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
- [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file in SDC.
- [AAI] RegisterCloudRegionStep: Register cloud region.
- [AAI] ComplexCreateStep: Create complex.
- [AAI] LinkCloudRegionToComplexStep: Connect cloud region with complex.
- [AAI] CustomerCreateStep: Create customer.
- [AAI] CustomerServiceSubscriptionCreateStep: Create customer's service subscription.
- [AAI] ConnectServiceSubToCloudRegionStep: Connect service subscription with cloud region.
- [SO] YamlTemplateServiceAlaCarteInstantiateStep: Instantiate service described in YAML using SO a'la carte method.
- [SO] YamlTemplateVnfAlaCarteInstantiateStep: Instantiate vnf described in YAML using SO a'la carte method.
- [SO] YamlTemplateVfModuleAlaCarteInstantiateStep: Instantiate VF module described in YAML using SO a'la carte method.

The test was initiated on the Honolulu weekly lab on the 26th of April 2021.
This test was run after the test described in the next section.
A first error occurred after a few hours (mariadbgalera), then the system
automatically recovered for some hours before a full crash of MariaDB Galera.

::

  debian@control01-onap-honolulu:~$ kubectl get pod -n onap |grep mariadb-galera
  onap-mariadb-galera-0   1/2   CrashLoopBackOff   625    5d16h
  onap-mariadb-galera-1   1/2   CrashLoopBackOff   1134   5d16h
  onap-mariadb-galera-2   1/2   CrashLoopBackOff   407    5d16h

It was unfortunately not possible to collect the data needed to determine the
root cause (logs of the first restart of onap-mariadb-galera-1).
Community members reported that they had already faced such issues and
suggested deploying a single MariaDB instance instead of using MariaDB Galera.
Moreover, in Honolulu some changes were made in order to align with the Camunda
(SO) requirements for MariaDB Galera.

During the limited valid window, the success rate was about 78% (85% for the
same test in Guilin).
The duration of the test remains highly variable, as already reported in Guilin
(https://jira.onap.org/browse/SO-3419). The duration of the same test may vary
from 500s to 2500s, as illustrated in the following graph:

.. image:: files/s3p/honolulu_so_stability_1_duration.png
   :align: center

The changes in MariaDB Galera seem to have introduced some issues leading to
more unexpected timeouts.
A troubleshooting campaign has been launched to evaluate possible evolutions in
this area.

Parallel instantiations stability test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Still based on basic_vm, 5 instantiation attempts are made simultaneously on
the ONAP solution for 48h.

The results can be described as follows:

.. image:: files/s3p/honolulu_so_stability_5.png
   :align: center

For this test, we had to restart the SDNC once. The last failures are due to a
certificate infrastructure issue and are independent of ONAP.

Cluster metrics
~~~~~~~~~~~~~~~

.. important::
    No major cluster resource issues have been detected in the cluster metrics.

The metrics of the ONAP cluster were recorded over the full week of stability
tests:

.. csv-table:: CPU
   :file: ./files/csv/stability_cluster_metric_cpu.csv
   :widths: 20,20,20,20,20
   :delim: ;
   :header-rows: 1

.. image:: files/s3p/honolulu_weekly_cpu.png
   :align: center

.. image:: files/s3p/honolulu_weekly_memory.png
   :align: center

The Top Ten for CPU consumption is given in the table below:

.. csv-table:: CPU
   :file: ./files/csv/stability_top10_cpu.csv
   :widths: 20,15,15,20,15,15
   :delim: ;
   :header-rows: 1

CPU consumption is negligible and is not a dimensioning factor.
It should be reconsidered for use cases that include extensive computation
(loops, optimization algorithms).

The Top Ten for Memory consumption is given in the table below:

.. csv-table:: Memory
   :file: ./files/csv/stability_top10_memory.csv
   :widths: 20,15,15,20,15,15
   :delim: ;
   :header-rows: 1

Unsurprisingly, the Cassandra databases use most of the memory.

The Top Ten for Network consumption is given in the table below:

.. csv-table:: Network
   :file: ./files/csv/stability_top10_net.csv
   :widths: 10,15,15,15,15,15,15
   :delim: ;
   :header-rows: 1
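
The Top Ten tables above are built from semicolon-delimited CSV files. As a
rough sketch of how such a ranking could be derived from raw metrics, the
snippet below sorts the rows of a semicolon-delimited CSV by one numeric column.
The column names (``pod``, ``mean``) and the sample values are hypothetical;
they do not reflect the actual content of the CSV files referenced in this page.

```python
import csv
import io


def top_consumers(csv_text: str, value_column: str, limit: int = 10) -> list[dict]:
    """Return the `limit` rows with the highest value in `value_column`.

    The input is semicolon-delimited CSV text with a header row, the same
    layout that the csv-table directives above declare with `:delim: ;`.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    # Ignore rows where the value is missing or empty.
    rows = [row for row in reader if row.get(value_column)]
    rows.sort(key=lambda row: float(row[value_column]), reverse=True)
    return rows[:limit]


# Hypothetical sample data: column names and values are illustrative only.
SAMPLE = "pod;mean\nexample-pod-a;0.5\nexample-pod-b;2.0\nexample-pod-c;1.25\n"

if __name__ == "__main__":
    for row in top_consumers(SAMPLE, "mean", limit=2):
        print(row["pod"], row["mean"])
```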