.. This work is licensed under a
   Creative Commons Attribution 4.0 International License.
.. _integration-s3p:

Stability/Resiliency
====================

.. important::
    The stability of the release has been evaluated by:

    - The daily Honolulu CI/CD chain
    - Stability tests
    - Resiliency tests

.. note::
    The scope of these tests remains limited and does not provide a full set of
    KPIs to determine the limits and the dimensioning of the ONAP solution.

CI results
----------

As usual, a daily CI chain dedicated to the release is created after RC0.
A Honolulu chain was created on the 6th of April 2021.

The daily results can be found on the `LF daily results web site
<https://logs.onap.org/onap-integration/daily/onap_daily_pod4_honolulu/2021-04/>`_.

Infrastructure Healthcheck Tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These tests cover the Kubernetes and Helm checks on the ONAP cluster.

The global expected criterion is **75%**.
The onap-k8s and onap-k8s-teardown tests, which provide a snapshot of the onap
namespace in Kubernetes, as well as the onap-helm tests are expected to be PASS.

The nodeport_check_certs test is expected to fail. Even though tremendous
progress has been made in this area, some certificates (unmaintained, upstream
or integration robot pods) are still not correct due to bad certificate issuers
(Root CA certificate not valid) or extra long validity. Most of the certificates
have been installed using cert-manager and will be easily renewable.

.. image:: files/s3p/honolulu_daily_infrastructure_healthcheck.png
   :align: center

Healthcheck Tests
~~~~~~~~~~~~~~~~~

These tests are the traditional robot healthcheck tests and additional tests
dealing with a single component.

Some tests (basic_onboard, basic_cds) may fail episodically because the startup
of the SDC is sometimes not fully completed.

The same test is run as the first step of the smoke tests and is usually PASS.
The mechanism to detect that all the components are fully operational may be
improved, as timer-based solutions are not robust enough.

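One common alternative to a fixed timer is to poll a readiness condition until a
deadline. The sketch below only illustrates that idea and is not the mechanism
used by the ONAP test chains: ``wait_until`` and the ``is_ready`` callable are
hypothetical names, and the default timeout and interval are arbitrary.

.. code-block:: python

   import time


   def wait_until(is_ready, timeout_s=600.0, interval_s=10.0,
                  sleep=time.sleep, clock=time.monotonic):
       """Poll is_ready() until it returns True or timeout_s has elapsed.

       Returns True as soon as the condition holds, False on timeout.
       The condition is always checked at least once.
       """
       deadline = clock() + timeout_s
       while True:
           if is_ready():
               return True
           if clock() >= deadline:
               return False
           sleep(interval_s)

A caller would pass a function that checks the component itself (for instance a
query on its health endpoint), so that the test starts as soon as the component
answers instead of after an arbitrary delay.
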
The expectation is **100% OK**.

.. image:: files/s3p/honolulu_daily_healthcheck.png
  :align: center

Smoke Tests
~~~~~~~~~~~

These tests are end-to-end, automated use case tests.
See :ref:`the Integration Test page <integration-tests>` for details.

The expectation is **100% OK**.

.. figure:: files/s3p/honolulu_daily_smoke.png
  :align: center

An error has been detected on the SDNC, preventing basic_vm_macro from working.
See `SDNC-1529 <https://jira.onap.org/browse/SDNC-1529/>`_ for details.
SO timeouts also occurred more frequently than in Guilin.
See `SO-3584 <https://jira.onap.org/browse/SO-3584>`_ for details.

Security Tests
~~~~~~~~~~~~~~

These tests deal with security.
See :ref:`the Integration Test page <integration-tests>` for details.

The expectation is **66% OK**. The criterion is met.

The score may even be higher, as 2 failed tests are almost correct:

- The unlimited pod test still fails due to a testing pod (DCAE-tca).
- The nonssl test is FAIL due to so and so-etsi-sol003-adapter, which were
  supposed to be managed with the ingress (not possible for this release) and
  got a waiver in Frankfurt. The pods cds-blueprints-processor-http and aws-web
  are used for tests.

.. figure:: files/s3p/honolulu_daily_security.png
  :align: center

Resiliency tests
----------------

The goal of the resiliency testing was to evaluate the capability of the
Honolulu solution to survive a stop or restart of a Kubernetes control or
worker node.

Controller node resiliency
~~~~~~~~~~~~~~~~~~~~~~~~~~

By default the ONAP solution is installed with 3 controllers for high
availability. The test for controller resiliency can be described as follows:

- Run tests: check that they are PASS
- Stop a controller node: check that the node appears in NotReady state
- Run tests: check that they are PASS

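The NotReady check in the second step can be scripted by parsing the text
printed by ``kubectl get nodes``. The following is a minimal sketch, assuming
the default column layout (NAME first, STATUS second); the function name and
the sample output are illustrative and were not captured on the lab.

.. code-block:: python

   from typing import List


   def not_ready_nodes(kubectl_output: str) -> List[str]:
       """Return the names of the nodes whose STATUS is not exactly 'Ready'."""
       lines = [line for line in kubectl_output.splitlines() if line.strip()]
       nodes = []
       for line in lines[1:]:  # skip the header row
           name, status = line.split()[:2]
           if status != "Ready":
               nodes.append(name)
       return nodes


   # Illustrative sample only (hypothetical state of the three controllers).
   SAMPLE = """\
   NAME                      STATUS     ROLES    AGE   VERSION
   control01-onap-honolulu   NotReady   master   18h   v1.19.9
   control02-onap-honolulu   Ready      master   18h   v1.19.9
   control03-onap-honolulu   Ready      master   18h   v1.19.9
   """

   print(not_ready_nodes(SAMPLE))
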
Two tests were performed on the weekly Honolulu lab. No problem was observed on
controller shutdown; the tests were still PASS with a stopped controller node.

More details can be found in `TEST-309 <https://jira.onap.org/browse/TEST-309>`_.

Worker node resiliency
~~~~~~~~~~~~~~~~~~~~~~

In the community weekly lab, the ONAP pods are distributed on 12 workers. The
goal of the test was to evaluate the behavior of the pods on a worker restart
(disaster scenario assuming that the node was moved accidentally from Ready to
NotReady state).
The initial conditions of such tests may differ, as the Kubernetes scheduler
does not place the pods on the same workers from one installation to another.

The test procedure can be described as follows:

- Run tests: check that they are PASS (Healthcheck and basic_vm used)
- Check that all the workers are in Ready state::

    $ kubectl get nodes
    NAME                      STATUS   ROLES    AGE   VERSION
    compute01-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute02-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute03-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute04-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute05-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute06-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute07-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute08-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute09-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute10-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute11-onap-honolulu   Ready    <none>   18h   v1.19.9
    compute12-onap-honolulu   Ready    <none>   18h   v1.19.9
    control01-onap-honolulu   Ready    master   18h   v1.19.9
    control02-onap-honolulu   Ready    master   18h   v1.19.9
    control03-onap-honolulu   Ready    master   18h   v1.19.9

- Select a worker and list the impacted pods::

    $ kubectl get pod -n onap --field-selector spec.nodeName=compute01-onap-honolulu
    NAME                                             READY   STATUS    RESTARTS   AGE
    onap-aaf-fs-7b6648db7f-shcn5                     1/1     Running   1          22h
    onap-aaf-oauth-5896545fb7-x6grg                  1/1     Running   1          22h
    onap-aaf-sms-quorumclient-2                      1/1     Running   1          22h
    onap-aai-modelloader-86d95c994b-87tsh            2/2     Running   2          22h
    onap-aai-schema-service-75575cb488-7fxs4         2/2     Running   2          22h
    onap-appc-cdt-58cb4766b6-vl78q                   1/1     Running   1          22h
    onap-appc-db-0                                   2/2     Running   4          22h
    onap-appc-dgbuilder-5bb94d46bd-h2gbs             1/1     Running   1          22h
    onap-awx-0                                       4/4     Running   4          22h
    onap-cassandra-1                                 1/1     Running   1          22h
    onap-cds-blueprints-processor-76f8b9b5c7-hb5bg   1/1     Running   1          22h
    onap-dmaap-dr-db-1                               2/2     Running   5          22h
    onap-ejbca-6cbdb7d6dd-hmw6z                      1/1     Running   1          22h
    onap-kube2msb-858f46f95c-jws4m                   1/1     Running   1          22h
    onap-message-router-0                            1/1     Running   1          22h
    onap-message-router-kafka-0                      1/1     Running   1          22h
    onap-message-router-kafka-1                      1/1     Running   1          22h
    onap-message-router-kafka-2                      1/1     Running   1          22h
    onap-message-router-zookeeper-0                  1/1     Running   1          22h
    onap-multicloud-794c6dffc8-bfwr8                 2/2     Running   2          22h
    onap-multicloud-starlingx-58f6b86c55-mff89       3/3     Running   3          22h
    onap-multicloud-vio-584d556876-87lxn             2/2     Running   2          22h
    onap-music-cassandra-0                           1/1     Running   1          22h
    onap-netbox-nginx-8667d6675d-vszhb               1/1     Running   2          22h
    onap-policy-api-6dbf8485d7-k7cpv                 1/1     Running   1          22h
    onap-policy-clamp-be-6d77597477-4mffk            1/1     Running   1          22h
    onap-policy-pap-785bd79759-xxhvx                 1/1     Running   1          22h
    onap-policy-xacml-pdp-7d8fd58d59-d4m7g           1/1     Running   6          22h
    onap-sdc-be-5f99c6c644-dcdz8                     2/2     Running   2          22h
    onap-sdc-fe-7577d58fb5-kwxpj                     2/2     Running   2          22h
    onap-sdc-wfd-fe-6997567759-gl9g6                 2/2     Running   2          22h
    onap-sdnc-dgbuilder-564d6475fd-xwwrz             1/1     Running   1          22h
    onap-sdnrdb-master-0                             1/1     Running   1          22h
    onap-so-admin-cockpit-6c5b44694-h4d2n            1/1     Running   1          21h
    onap-so-etsi-sol003-adapter-c9bf4464-pwn97       1/1     Running   1          21h
    onap-so-sdc-controller-6899b98b8b-hfgvc          2/2     Running   2          21h
    onap-vfc-mariadb-1                               2/2     Running   4          21h
    onap-vfc-nslcm-6c67677546-xcvl2                  2/2     Running   2          21h
    onap-vfc-vnflcm-78ff4d8778-sgtv6                 2/2     Running   2          21h
    onap-vfc-vnfres-6c96f9ff5b-swq5z                 2/2     Running   2          21h

- Stop the worker (shut down the machine for baremetal, or the VM if you
  installed your Kubernetes on top of an OpenStack solution)
- Wait for the pod eviction procedure to complete (5 minutes)::

    $ kubectl get nodes
    NAME                      STATUS     ROLES    AGE   VERSION
    compute01-onap-honolulu   NotReady   <none>   18h   v1.19.9
    compute02-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute03-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute04-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute05-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute06-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute07-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute08-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute09-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute10-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute11-onap-honolulu   Ready      <none>   18h   v1.19.9
    compute12-onap-honolulu   Ready      <none>   18h   v1.19.9
    control01-onap-honolulu   Ready      master   18h   v1.19.9
    control02-onap-honolulu   Ready      master   18h   v1.19.9
    control03-onap-honolulu   Ready      master   18h   v1.19.9

- Run the tests: check that they are PASS

.. warning::
  In these conditions, **the tests will never be PASS**. In fact, several
  components will remain in INIT state.
  A procedure is required to ensure a clean restart.

List the non running pods::

  $ kubectl get pods -n onap --field-selector status.phase!=Running | grep -v Completed
  NAME                                             READY   STATUS      RESTARTS   AGE
  onap-appc-dgbuilder-5bb94d46bd-sxmmc             0/1     Init:3/4    15         156m
  onap-cds-blueprints-processor-76f8b9b5c7-m7nmb   0/1     Init:1/3    0          156m
  onap-portal-app-595bd6cd95-bkswr                 0/2     Init:0/4    84         23h
  onap-portal-db-config-6s75n                      0/2     Error       0          23h
  onap-portal-db-config-7trzx                      0/2     Error       0          23h
  onap-portal-db-config-jt2jl                      0/2     Error       0          23h
  onap-portal-db-config-mjr5q                      0/2     Error       0          23h
  onap-portal-db-config-qxvdt                      0/2     Error       0          23h
  onap-portal-db-config-z8c5n                      0/2     Error       0          23h
  onap-sdc-be-5f99c6c644-kplqx                     0/2     Init:2/5    14         156
  onap-vfc-nslcm-6c67677546-86mmj                  0/2     Init:0/1    15         156m
  onap-vfc-vnflcm-78ff4d8778-h968x                 0/2     Init:0/1    15         156m
  onap-vfc-vnfres-6c96f9ff5b-kt9rz                 0/2     Init:0/1    15         156m

Some pods are not rescheduled (e.g. onap-awx-0 and onap-cassandra-1 above)
because they are part of a statefulset. List the statefulset objects that are
not fully ready::

  $ kubectl get statefulsets.apps -n onap | grep -v "1/1" | grep -v "3/3"
  NAME                            READY   AGE
  onap-aaf-sms-quorumclient       2/3     24h
  onap-appc-db                    2/3     24h
  onap-awx                        0/1     24h
  onap-cassandra                  2/3     24h
  onap-dmaap-dr-db                2/3     24h
  onap-message-router             0/1     24h
  onap-message-router-kafka       0/3     24h
  onap-message-router-zookeeper   2/3     24h
  onap-music-cassandra            2/3     24h
  onap-sdnrdb-master              2/3     24h
  onap-vfc-mariadb                2/3     24h

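The two ``grep -v`` filters only exclude the ``1/1`` and ``3/3`` cases. A more
general check compares both sides of the READY column, whatever the replica
count. The following is a minimal sketch that parses the text output shown
above; the function name is hypothetical, and the ``example-healthy-set`` row
was added for illustration and does not come from the lab.

.. code-block:: python

   def degraded_statefulsets(kubectl_output):
       """Map statefulset name -> (ready, desired) when ready < desired."""
       degraded = {}
       for line in kubectl_output.splitlines()[1:]:  # skip the header row
           fields = line.split()
           if len(fields) < 2 or "/" not in fields[1]:
               continue
           ready, desired = (int(part) for part in fields[1].split("/"))
           if ready < desired:
               degraded[fields[0]] = (ready, desired)
       return degraded


   SAMPLE = """\
   NAME                            READY   AGE
   onap-awx                        0/1     24h
   onap-cassandra                  2/3     24h
   example-healthy-set             3/3     24h
   """

   print(degraded_statefulsets(SAMPLE))
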
For the pods that are part of a statefulset, a forced deletion is required.
As an example, if we consider the statefulset onap-sdnrdb-master, we must follow
this procedure::

  $ kubectl get pods -n onap -o wide |grep onap-sdnrdb-master
  onap-sdnrdb-master-0  1/1  Terminating 1  24h  10.42.3.92   node1
  onap-sdnrdb-master-1  1/1  Running     1  24h  10.42.1.122  node2
  onap-sdnrdb-master-2  1/1  Running     1  24h  10.42.2.134  node3

  $ kubectl delete -n onap pod onap-sdnrdb-master-0 --force
  warning: Immediate deletion does not wait for confirmation that the running
  resource has been terminated. The resource may continue to run on the cluster
  indefinitely.
  pod "onap-sdnrdb-master-0" force deleted

  $ kubectl get pods |grep onap-sdnrdb-master
  onap-sdnrdb-master-0  0/1  PodInitializing   0  11s
  onap-sdnrdb-master-1  1/1  Running           1  24h
  onap-sdnrdb-master-2  1/1  Running           1  24h

  $ kubectl get pods |grep onap-sdnrdb-master
  onap-sdnrdb-master-0  1/1  Running  0  43s
  onap-sdnrdb-master-1  1/1  Running  1  24h
  onap-sdnrdb-master-2  1/1  Running  1  24h

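When several statefulsets are impacted, the same procedure has to be repeated
for each pod stuck in Terminating state. The following sketch, given as an
assumption-laden helper rather than an official tool, builds the list of
deletion commands from the ``kubectl get pods`` text output. It only prints the
commands, reusing the exact ``kubectl delete ... --force`` form shown above, so
that an operator can review them before running anything. The function name is
hypothetical.

.. code-block:: python

   def force_delete_commands(kubectl_output, namespace="onap"):
       """Build one 'kubectl delete --force' command per Terminating pod."""
       commands = []
       for line in kubectl_output.splitlines():
           fields = line.split()
           # Expected columns: NAME READY STATUS RESTARTS AGE ...
           if len(fields) >= 3 and fields[2] == "Terminating":
               commands.append(
                   "kubectl delete -n {} pod {} --force".format(namespace, fields[0])
               )
       return commands


   # Output of the first command of the procedure above.
   SAMPLE = """\
   onap-sdnrdb-master-0  1/1  Terminating 1  24h  10.42.3.92   node1
   onap-sdnrdb-master-1  1/1  Running     1  24h  10.42.1.122  node2
   onap-sdnrdb-master-2  1/1  Running     1  24h  10.42.2.134  node3
   """

   for command in force_delete_commands(SAMPLE):
       print(command)
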
Once all the statefulsets are properly restarted, the other components
continue their restart properly.
Once the restart of the pods is completed, the tests are PASS.

.. important::

  K8s node reboots/shutdowns reveal some deficiencies in ONAP components with
  regard to their availability, as measured with the HC results. Some pods may
  still fail to initialize after a reboot/shutdown (pod rescheduled).

  However, the cluster as a whole behaves as expected: pods are rescheduled
  after a node shutdown (except pods that are part of a statefulset, which need
  to be deleted forcibly; this is normal Kubernetes behavior).

  On a rebooted node, should its downtime not exceed the eviction timeout, pods
  are restarted once the node is available again.

Please see the `Integration Resiliency page <https://jira.onap.org/browse/TEST-308>`_
for details.

Stability tests
---------------

Three stability tests have been performed in Honolulu:

- SDC stability test
- Simple instantiation test (basic_vm)
- Parallel instantiation test

SDC stability test
~~~~~~~~~~~~~~~~~~

In this test, we consider the basic_onboard automated test and we run 5
onboarding procedures in parallel for 72h.

The basic_onboard test consists of the following steps:

- [SDC] VendorOnboardStep: Onboard vendor in SDC.
- [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
- [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
- [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
  in SDC.

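The principle of running 5 onboardings at a time and tracking the success rate
can be sketched as follows. This is not the script used for the campaign:
``run_in_parallel``, ``success_rate`` and ``fake_onboard`` are hypothetical
names, the stand-in test function always succeeds, and the loop is bounded by a
number of rounds, whereas the real test was bounded by its 72h duration.

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor


   def run_in_parallel(test_fn, workers=5, rounds=1):
       """Run test_fn `workers` at a time, `rounds` times; return pass/fail flags."""
       results = []
       with ThreadPoolExecutor(max_workers=workers) as pool:
           for _ in range(rounds):
               futures = [pool.submit(test_fn) for _ in range(workers)]
               for future in futures:
                   try:
                       results.append(bool(future.result()))
                   except Exception:
                       # Any exception raised by the test counts as a failure.
                       results.append(False)
       return results


   def success_rate(results):
       """Percentage of successful runs (0.0 when there is no run)."""
       return 100.0 * sum(results) / len(results) if results else 0.0


   def fake_onboard():
       """Hypothetical stand-in for one basic_onboard run."""
       return True


   results = run_in_parallel(fake_onboard, workers=5, rounds=2)
   print(len(results), success_rate(results))
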
The test was initiated on the Honolulu weekly lab on the 19th of April.

As already observed in the daily, weekly and gating chains, we got race
conditions on some tests (https://jira.onap.org/browse/INT-1918).

The success rate is above 95% for the first 100 model uploads and above 80%
until we onboard more than 500 models.

The test duration also increases continuously over time. At the beginning the
test takes about 200s; 24h later the same test takes around 1000s.
Finally, after 36h, the SDC systematically answers with a 500 HTTP response
code, which explains the linear decrease of the success rate.

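Taking the two approximate figures quoted above (about 200s at the start,
around 1000s after 24h) and assuming a linear growth between them, the average
drift of the test duration can be estimated with a short calculation. The
function name is hypothetical and the result is only as precise as the two
rounded inputs.

.. code-block:: python

   def duration_slope(t0_h, d0_s, t1_h, d1_s):
       """Average growth of the test duration, in seconds per elapsed hour."""
       return (d1_s - d0_s) / (t1_h - t0_h)


   # About 200s at t=0h and around 1000s at t=24h (figures from the text).
   slope = duration_slope(0, 200, 24, 1000)
   print(round(slope, 1))  # 33.3

In other words, each hour of continuous onboarding adds roughly half a minute
to the duration of a single basic_onboard run.
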
The following graphs provide a good view of the SDC stability test.

.. image:: files/s3p/honolulu_sdc_stability.png
  :align: center

.. important::
   SDC can support up to hundreds of model onboardings.
   The onboarding duration increases linearly with the number of onboarded
   models.
   After a while, the SDC is no longer usable.
   No major cluster resource issues have been detected during the test. The
   memory consumption is however relatively high given the load.

.. image:: files/s3p/honolulu_sdc_stability_resources.png
 :align: center


Simple stability test
~~~~~~~~~~~~~~~~~~~~~

This test consists of running the basic_vm test continuously for 72h.

We observe the cluster metrics as well as the evolution of the test duration.

The basic_vm test is described in :ref:`the Integration Test page <integration-tests>`.

The basic_vm test consists of the following steps:

- [SDC] VendorOnboardStep: Onboard vendor in SDC.
- [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
- [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
- [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
  in SDC.
- [AAI] RegisterCloudRegionStep: Register cloud region.
- [AAI] ComplexCreateStep: Create complex.
- [AAI] LinkCloudRegionToComplexStep: Connect cloud region with complex.
- [AAI] CustomerCreateStep: Create customer.
- [AAI] CustomerServiceSubscriptionCreateStep: Create customer's service
  subscription.
- [AAI] ConnectServiceSubToCloudRegionStep: Connect service subscription with
  cloud region.
- [SO] YamlTemplateServiceAlaCarteInstantiateStep: Instantiate service described
  in YAML using SO à la carte method.
- [SO] YamlTemplateVnfAlaCarteInstantiateStep: Instantiate vnf described in YAML
  using SO à la carte method.
- [SO] YamlTemplateVfModuleAlaCarteInstantiateStep: Instantiate VF module
  described in YAML using SO à la carte method.

The test was initiated on the Honolulu weekly lab on the 26th of April 2021.
This test was run after the test described in the next section.
A first error occurred after a few hours (mariadb-galera), then the system
automatically recovered for some hours before a full crash of the MariaDB
Galera cluster.

::

  debian@control01-onap-honolulu:~$ kubectl get pod -n onap |grep mariadb-galera
  onap-mariadb-galera-0  1/2  CrashLoopBackOff   625   5d16h
  onap-mariadb-galera-1  1/2  CrashLoopBackOff   1134  5d16h
  onap-mariadb-galera-2  1/2  CrashLoopBackOff   407   5d16h


It was unfortunately not possible to collect the root cause (logs of the first
restart of onap-mariadb-galera-1).

Community members reported that they had already faced such issues and
suggested deploying a single MariaDB instance instead of using MariaDB Galera.
Moreover, some changes were made in Honolulu to align with the Camunda (SO)
requirements for MariaDB Galera.

During the limited valid window, the success rate was about 78% (85% for the
same test in Guilin).
The duration of the test remains highly variable, as already reported in Guilin
(https://jira.onap.org/browse/SO-3419). The duration of the same test may vary
from 500s to 2500s, as illustrated in the following graph:

.. image:: files/s3p/honolulu_so_stability_1_duration.png
 :align: center

The changes in MariaDB Galera seem to have introduced some issues leading to
more unexpected timeouts.
A troubleshooting campaign has been launched to evaluate possible evolutions in
this area.

Parallel instantiations stability test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Still based on basic_vm, 5 instantiation attempts are run simultaneously on the
ONAP solution for 48h.

The results can be described as follows:

.. image:: files/s3p/honolulu_so_stability_5.png
 :align: center

For this test, we had to restart the SDNC once. The last failures are due to
a certificate infrastructure issue and are independent from ONAP.

Cluster metrics
~~~~~~~~~~~~~~~

.. important::
   No major cluster resource issues have been detected in the cluster metrics.

The metrics of the ONAP cluster have been recorded over the full week of
stability tests:

.. csv-table:: CPU
   :file: ./files/csv/stability_cluster_metric_cpu.csv
   :widths: 20,20,20,20,20
   :delim: ;
   :header-rows: 1

.. image:: files/s3p/honolulu_weekly_cpu.png
  :align: center

.. image:: files/s3p/honolulu_weekly_memory.png
  :align: center

The Top Ten for CPU consumption is given in the table below:

.. csv-table:: CPU
  :file: ./files/csv/stability_top10_cpu.csv
  :widths: 20,15,15,20,15,15
  :delim: ;
  :header-rows: 1

CPU consumption is negligible and is not a dimensioning factor. It should be
reconsidered for use cases including extensive computation (loops, optimization
algorithms).

The Top Ten for Memory consumption is given in the table below:

.. csv-table:: Memory
  :file: ./files/csv/stability_top10_memory.csv
  :widths: 20,15,15,20,15,15
  :delim: ;
  :header-rows: 1

Unsurprisingly, the Cassandra databases use most of the memory.

The Top Ten for Network consumption is given in the table below:

.. csv-table:: Network
  :file: ./files/csv/stability_top10_net.csv
  :widths: 10,15,15,15,15,15,15
  :delim: ;
  :header-rows: 1