-Honolulu solution to survive a stop or restart of a Kubernetes control or
-worker node.
-
-Controller node resiliency
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default the ONAP solution is installed with 3 controllers for high
-availability. The test for controller resiliency can be described as follows:
-
-- Run tests: check that they are PASS
-- Stop a controller node: check that the node appears in NotReady state
-- Run tests: check that they are PASS
-
-2 tests were performed on the weekly Honolulu lab. No problem was observed on
-controller shutdown: the tests were still PASS with a stopped controller node.
-
-More details can be found at <https://jira.onap.org/browse/TEST-309>.
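The NotReady check from the procedure above can be sketched as a small filter on the node listing. This is an illustrative helper, not part of the official test suite; the sample listing stands in for live cluster output.

```shell
# Helper sketch: extract the nodes reported NotReady from a
# 'kubectl get nodes' listing (assumption: standard column layout,
# with STATUS in the second column).
not_ready_nodes() {
  awk 'NR > 1 && $2 == "NotReady" { print $1 }'
}

# Sample listing standing in for live output; with a real cluster
# this would be: kubectl get nodes | not_ready_nodes
printf '%s\n' \
  'NAME                     STATUS    ROLES    AGE   VERSION' \
  'control01-onap-honolulu  NotReady  master   18h   v1.19.9' \
  'control02-onap-honolulu  Ready     master   18h   v1.19.9' \
  'control03-onap-honolulu  Ready     master   18h   v1.19.9' \
  | not_ready_nodes
# -> control01-onap-honolulu
```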
-
-Worker node resiliency
-~~~~~~~~~~~~~~~~~~~~~~
-
-In the community weekly lab, the ONAP pods are distributed over 12 workers. The
-goal of the test was to evaluate the behavior of the pods on a worker restart
-(disaster scenario assuming that the node moved accidentally from Ready to
-NotReady state).
-The initial conditions of such a test may differ from one run to another, as
-the Kubernetes scheduler does not distribute the pods on the same workers from
-one installation to another.
-
-The test procedure can be described as follows:
-
-- Run tests: check that they are PASS (Healthcheck and basic_vm used)
-- Check that all the workers are in Ready state
- ::
- $ kubectl get nodes
- NAME STATUS ROLES AGE VERSION
- compute01-onap-honolulu Ready <none> 18h v1.19.9
- compute02-onap-honolulu Ready <none> 18h v1.19.9
- compute03-onap-honolulu Ready <none> 18h v1.19.9
- compute04-onap-honolulu Ready <none> 18h v1.19.9
- compute05-onap-honolulu Ready <none> 18h v1.19.9
- compute06-onap-honolulu Ready <none> 18h v1.19.9
- compute07-onap-honolulu Ready <none> 18h v1.19.9
- compute08-onap-honolulu Ready <none> 18h v1.19.9
- compute09-onap-honolulu Ready <none> 18h v1.19.9
- compute10-onap-honolulu Ready <none> 18h v1.19.9
- compute11-onap-honolulu Ready <none> 18h v1.19.9
- compute12-onap-honolulu Ready <none> 18h v1.19.9
- control01-onap-honolulu Ready master 18h v1.19.9
- control02-onap-honolulu Ready master 18h v1.19.9
- control03-onap-honolulu Ready master 18h v1.19.9
-
-- Select a worker, list the impacted pods
- ::
- $ kubectl get pod -n onap --field-selector spec.nodeName=compute01-onap-honolulu
- NAME READY STATUS RESTARTS AGE
- onap-aaf-fs-7b6648db7f-shcn5 1/1 Running 1 22h
- onap-aaf-oauth-5896545fb7-x6grg 1/1 Running 1 22h
- onap-aaf-sms-quorumclient-2 1/1 Running 1 22h
- onap-aai-modelloader-86d95c994b-87tsh 2/2 Running 2 22h
- onap-aai-schema-service-75575cb488-7fxs4 2/2 Running 2 22h
- onap-appc-cdt-58cb4766b6-vl78q 1/1 Running 1 22h
- onap-appc-db-0 2/2 Running 4 22h
- onap-appc-dgbuilder-5bb94d46bd-h2gbs 1/1 Running 1 22h
- onap-awx-0 4/4 Running 4 22h
- onap-cassandra-1 1/1 Running 1 22h
- onap-cds-blueprints-processor-76f8b9b5c7-hb5bg 1/1 Running 1 22h
- onap-dmaap-dr-db-1 2/2 Running 5 22h
- onap-ejbca-6cbdb7d6dd-hmw6z 1/1 Running 1 22h
- onap-kube2msb-858f46f95c-jws4m 1/1 Running 1 22h
- onap-message-router-0 1/1 Running 1 22h
- onap-message-router-kafka-0 1/1 Running 1 22h
- onap-message-router-kafka-1 1/1 Running 1 22h
- onap-message-router-kafka-2 1/1 Running 1 22h
- onap-message-router-zookeeper-0 1/1 Running 1 22h
- onap-multicloud-794c6dffc8-bfwr8 2/2 Running 2 22h
- onap-multicloud-starlingx-58f6b86c55-mff89 3/3 Running 3 22h
- onap-multicloud-vio-584d556876-87lxn 2/2 Running 2 22h
- onap-music-cassandra-0 1/1 Running 1 22h
- onap-netbox-nginx-8667d6675d-vszhb 1/1 Running 2 22h
- onap-policy-api-6dbf8485d7-k7cpv 1/1 Running 1 22h
- onap-policy-clamp-be-6d77597477-4mffk 1/1 Running 1 22h
- onap-policy-pap-785bd79759-xxhvx 1/1 Running 1 22h
- onap-policy-xacml-pdp-7d8fd58d59-d4m7g 1/1 Running 6 22h
- onap-sdc-be-5f99c6c644-dcdz8 2/2 Running 2 22h
- onap-sdc-fe-7577d58fb5-kwxpj 2/2 Running 2 22h
- onap-sdc-wfd-fe-6997567759-gl9g6 2/2 Running 2 22h
- onap-sdnc-dgbuilder-564d6475fd-xwwrz 1/1 Running 1 22h
- onap-sdnrdb-master-0 1/1 Running 1 22h
- onap-so-admin-cockpit-6c5b44694-h4d2n 1/1 Running 1 21h
- onap-so-etsi-sol003-adapter-c9bf4464-pwn97 1/1 Running 1 21h
- onap-so-sdc-controller-6899b98b8b-hfgvc 2/2 Running 2 21h
- onap-vfc-mariadb-1 2/2 Running 4 21h
- onap-vfc-nslcm-6c67677546-xcvl2 2/2 Running 2 21h
- onap-vfc-vnflcm-78ff4d8778-sgtv6 2/2 Running 2 21h
- onap-vfc-vnfres-6c96f9ff5b-swq5z 2/2 Running 2 21h
-
-- Stop the worker (shut down the machine for bare metal, or the VM if your
-  Kubernetes runs on top of an OpenStack solution)
-- Wait for the pod eviction procedure to complete (5 minutes)
- ::
- $ kubectl get nodes
- NAME STATUS ROLES AGE VERSION
- compute01-onap-honolulu NotReady <none> 18h v1.19.9
- compute02-onap-honolulu Ready <none> 18h v1.19.9
- compute03-onap-honolulu Ready <none> 18h v1.19.9
- compute04-onap-honolulu Ready <none> 18h v1.19.9
- compute05-onap-honolulu Ready <none> 18h v1.19.9
- compute06-onap-honolulu Ready <none> 18h v1.19.9
- compute07-onap-honolulu Ready <none> 18h v1.19.9
- compute08-onap-honolulu Ready <none> 18h v1.19.9
- compute09-onap-honolulu Ready <none> 18h v1.19.9
- compute10-onap-honolulu Ready <none> 18h v1.19.9
- compute11-onap-honolulu Ready <none> 18h v1.19.9
- compute12-onap-honolulu Ready <none> 18h v1.19.9
- control01-onap-honolulu Ready master 18h v1.19.9
- control02-onap-honolulu Ready master 18h v1.19.9
- control03-onap-honolulu Ready master 18h v1.19.9
-
-- Run the tests: check that they are PASS
-
-.. warning::
- In these conditions, **the tests will never be PASS**. In fact, several
- components will remain in Init state.
- A procedure is required to ensure a clean restart.
-
-List the non-running pods::
-
- $ kubectl get pods -n onap --field-selector status.phase!=Running | grep -v Completed
- NAME READY STATUS RESTARTS AGE
- onap-appc-dgbuilder-5bb94d46bd-sxmmc 0/1 Init:3/4 15 156m
- onap-cds-blueprints-processor-76f8b9b5c7-m7nmb 0/1 Init:1/3 0 156m
- onap-portal-app-595bd6cd95-bkswr 0/2 Init:0/4 84 23h
- onap-portal-db-config-6s75n 0/2 Error 0 23h
- onap-portal-db-config-7trzx 0/2 Error 0 23h
- onap-portal-db-config-jt2jl 0/2 Error 0 23h
- onap-portal-db-config-mjr5q 0/2 Error 0 23h
- onap-portal-db-config-qxvdt 0/2 Error 0 23h
- onap-portal-db-config-z8c5n 0/2 Error 0 23h
- onap-sdc-be-5f99c6c644-kplqx 0/2 Init:2/5 14 156m
- onap-vfc-nslcm-6c67677546-86mmj 0/2 Init:0/1 15 156m
- onap-vfc-vnflcm-78ff4d8778-h968x 0/2 Init:0/1 15 156m
- onap-vfc-vnfres-6c96f9ff5b-kt9rz 0/2 Init:0/1 15 156m
-
-Some pods are not rescheduled (e.g. onap-awx-0 and onap-cassandra-1 above)
-because they are part of a statefulset. List the statefulset objects::
-
- $ kubectl get statefulsets.apps -n onap | grep -v "1/1" | grep -v "3/3"
- NAME READY AGE
- onap-aaf-sms-quorumclient 2/3 24h
- onap-appc-db 2/3 24h
- onap-awx 0/1 24h
- onap-cassandra 2/3 24h
- onap-dmaap-dr-db 2/3 24h
- onap-message-router 0/1 24h
- onap-message-router-kafka 0/3 24h
- onap-message-router-zookeeper 2/3 24h
- onap-music-cassandra 2/3 24h
- onap-sdnrdb-master 2/3 24h
- onap-vfc-mariadb 2/3 24h
-
-For the pods that are part of a statefulset, a forced deletion is required.
-As an example, for the statefulset onap-sdnrdb-master, the procedure is as
-follows::
-
- $ kubectl get pods -n onap -o wide |grep onap-sdnrdb-master
- onap-sdnrdb-master-0 1/1 Terminating 1 24h 10.42.3.92 node1
- onap-sdnrdb-master-1 1/1 Running 1 24h 10.42.1.122 node2
- onap-sdnrdb-master-2 1/1 Running 1 24h 10.42.2.134 node3
-
- $ kubectl delete -n onap pod onap-sdnrdb-master-0 --force
- warning: Immediate deletion does not wait for confirmation that the running
- resource has been terminated. The resource may continue to run on the cluster
- indefinitely.
- pod "onap-sdnrdb-master-0" force deleted
-
- $ kubectl get pods |grep onap-sdnrdb-master
- onap-sdnrdb-master-0 0/1 PodInitializing 0 11s
- onap-sdnrdb-master-1 1/1 Running 1 24h
- onap-sdnrdb-master-2 1/1 Running 1 24h
-
- $ kubectl get pods |grep onap-sdnrdb-master
- onap-sdnrdb-master-0 1/1 Running 0 43s
- onap-sdnrdb-master-1 1/1 Running 1 24h
- onap-sdnrdb-master-2 1/1 Running 1 24h
-
-Once all the statefulsets are properly restarted, the remaining components
-complete their own restart.
-Once the restart of the pods is completed, the tests are PASS.
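The force-delete step of the procedure above can be scripted. The sketch below (a hypothetical helper with sample pod names, not live cluster output) filters the pods stuck in Terminating state so they can then be force deleted in a loop.

```shell
# Helper sketch: keep only the pods whose STATUS column reads
# Terminating (assumption: standard 'kubectl get pods' layout,
# with STATUS in the third column).
terminating_pods() {
  awk '$3 == "Terminating" { print $1 }'
}

# Sample listing standing in for live output; with a real cluster:
#   kubectl get pods -n onap | terminating_pods \
#     | xargs -r -n1 kubectl delete -n onap pod --force
printf '%s\n' \
  'onap-sdnrdb-master-0  1/1  Terminating  1  24h' \
  'onap-sdnrdb-master-1  1/1  Running      1  24h' \
  'onap-sdnrdb-master-2  1/1  Running      1  24h' \
  | terminating_pods
# -> onap-sdnrdb-master-0
```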
+Istanbul solution to survive a stop or restart of a Kubernetes worker node.
+
+This test has been automated thanks to the Litmus chaos framework
+(https://litmuschaos.io/) and is run in the CI on the weekly chains.
+
+2 additional tests based on Litmus chaos scenarios have been added; they will
+be tuned in Jakarta:
+
+- node cpu hog (temporary CPU load increase on one Kubernetes node)
+- node memory hog (temporary memory load increase on one Kubernetes node)
+
+The main test for Istanbul is node drain, corresponding to the resiliency
+scenario previously executed manually.
+
+The system under test is defined in OOM.
+The resources are described in the table below:
+
+.. code-block:: shell
+
+ +-------------------------+-------+--------+-------+
+ | Name                    | vCPUs | Memory | Disk  |
+ +-------------------------+-------+--------+-------+
+ | compute12-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute11-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute10-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute09-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute08-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute07-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute06-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute05-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute04-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute03-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute02-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | compute01-onap-istanbul | 16    | 24 GB  | 10 GB |
+ | etcd03-onap-istanbul    | 4     | 6 GB   | 10 GB |
+ | etcd02-onap-istanbul    | 4     | 6 GB   | 10 GB |
+ | etcd01-onap-istanbul    | 4     | 6 GB   | 10 GB |
+ | control03-onap-istanbul | 4     | 6 GB   | 10 GB |
+ | control02-onap-istanbul | 4     | 6 GB   | 10 GB |
+ | control01-onap-istanbul | 4     | 6 GB   | 10 GB |
+ +-------------------------+-------+--------+-------+
+
+
+The test sequence can be defined as follows:
+
+- Cordon a compute node (prevent any new scheduling on it)
+- Launch the node drain chaos scenario: all the pods on the given compute node
+  are evicted
+
+Once all the pods have been evicted:
+
+- Uncordon the compute node
+- Replay a basic_vm test
+
+This test has been successfully executed.
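The cordon / drain / uncordon sequence performed around the chaos scenario can be sketched as a small helper. The node name and drain options are illustrative; by default the commands are only echoed (dry run), so the sketch can be inspected without touching a cluster.

```shell
# Dry-run sketch of the cordon / drain / uncordon sequence.
# Pass an empty string as the second argument to actually execute
# the kubectl commands instead of echoing them.
drain_sequence() {
  local node="$1" run="${2-echo}"
  $run kubectl cordon "$node"
  $run kubectl drain "$node" --ignore-daemonsets
  $run kubectl uncordon "$node"
}

drain_sequence compute01-onap-istanbul
# -> kubectl cordon compute01-onap-istanbul
# -> kubectl drain compute01-onap-istanbul --ignore-daemonsets
# -> kubectl uncordon compute01-onap-istanbul
```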
+
+.. image:: files/s3p/istanbul_resiliency.png
+ :align: center