docs/integration-s3p.rst

   1 .. This work is licensed under a
   2    Creative Commons Attribution 4.0 International License.
   3 .. _integration-s3p:
   4
   5 Stability/Resiliency
   6 ====================
   7
   8 .. important::
   9     The Release stability has been evaluated by:
  10
  11     - The daily Istanbul CI/CD chain
  12     - Stability tests
  13     - Resiliency tests
  14
  15 .. note::
  16     The scope of these tests remains limited and does not provide a full set of
  17     KPIs to determine the limits and the dimensioning of the ONAP solution.
  18
  19 CI results
  20 ----------
  21
  22 As usual, a daily CI chain dedicated to the release is created after RC0.
  23 An Istanbul chain has been created on the 5th of November 2021.
  24
  25 The daily results can be found in `LF daily results web site
  26 <https://logs.onap.org/onap-integration/daily/onap_daily_pod4_istanbul/>`_.
  27
  28 .. image:: files/s3p/istanbul-dashboard.png
  29    :align: center
  30
  31
  32 Infrastructure Healthcheck Tests
  33 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  34
  35 These tests cover the Kubernetes and Helm checks on the ONAP cluster.
  36
  37 The global expected criterion is **75%**.
  38
  39 The onap-k8s and onap-k8s-teardown tests, which provide a snapshot of the onap
  40 namespace in Kubernetes, as well as the onap-helm tests are expected to pass.
  41
  42 The nodeport_check_certs test is expected to fail. Although tremendous progress
  43 has been made in this area, some certificates (unmaintained, upstream or
  44 integration robot pods) are still incorrect due to bad certificate issuers
  45 (invalid Root CA certificate) or an excessively long validity period. Most of
  46 the certificates have been installed using cert-manager and are easily renewable.
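
To illustrate the "extra long validity" criterion, the sketch below flags a
certificate whose validity period exceeds a threshold. It is not the
nodeport_check_certs implementation: the function names and the 825-day
threshold are assumptions made for this example, since the limit actually
applied by the test is not stated here.

.. code-block:: python

    import ssl

    # Hypothetical threshold: the limit used by the real test is not given here.
    MAX_VALIDITY_DAYS = 825


    def validity_days(not_before, not_after):
        """Return the validity period in days.

        Both arguments use the certificate date format handled by
        ssl.cert_time_to_seconds, e.g. 'Nov 15 00:00:00 2021 GMT'.
        """
        start = ssl.cert_time_to_seconds(not_before)
        end = ssl.cert_time_to_seconds(not_after)
        return (end - start) / 86400


    def has_long_validity(not_before, not_after, max_days=MAX_VALIDITY_DAYS):
        """Return True when the validity period exceeds max_days."""
        return validity_days(not_before, not_after) > max_days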
  47
  48 .. image:: files/s3p/istanbul_daily_infrastructure_healthcheck.png
  49    :align: center
  50
  51 Healthcheck Tests
  52 ~~~~~~~~~~~~~~~~~
  53
  54 These tests are the traditional robot healthcheck tests and additional tests
  55 dealing with a single component.
  56
  57 The expectation is **100% OK**.
  58
  59 .. image:: files/s3p/istanbul_daily_healthcheck.png
  60   :align: center
  61
  62 Smoke Tests
  63 ~~~~~~~~~~~
  64
  65 These tests are end-to-end, automated use case tests.
  66 See :ref:`the Integration Test page <integration-tests>` for details.
  67
  68 The expectation is **100% OK**.
  69
  70 .. figure:: files/s3p/istanbul_daily_smoke.png
  71   :align: center
  72
  73 A possible race condition in SDC has been reported since Guilin
  74 (https://jira.onap.org/browse/SDC-3508). It prevents the completion of the
  75 certification in SDC and leads to onboarding errors.
  76 This error may occur in case of parallel processing.
  77
  78 Security Tests
  79 ~~~~~~~~~~~~~~
  80
  81 These tests deal with security.
  82 See :ref:`the Integration Test page <integration-tests>` for details.
  83
  84 Waivers have been granted on different projects for the different tests.
  85 The list of waivers can be found in
  86 https://git.onap.org/integration/seccom/tree/waivers?h=istanbul.
  87
  88 The expectation is **100% OK**. The criterion is met.
  89
  90 .. figure:: files/s3p/istanbul_daily_security.png
  91   :align: center
  92
  93 Resiliency tests
  94 ----------------
  95
  96 The goal of the resiliency testing was to evaluate the capability of the
  97 Istanbul solution to survive a stop or restart of a Kubernetes worker node.
  98
  99 This test has been automated with the
 100 Litmus chaos framework (https://litmuschaos.io/) and integrated in the CI on the
 101 weekly chains.
 102
 103 Two additional tests based on Litmus chaos scenarios have been added but will be
 104 tuned in Jakarta.
 105
 106 - node CPU hog (temporary increase of CPU load on one Kubernetes node)
 107 - node memory hog (temporary increase of memory usage on one Kubernetes node)
 108
 109 The main test for Istanbul is the node drain, which corresponds to the resiliency
 110 scenario previously managed manually.
 111
 112 The system under test is defined in OOM.
 113 The resources are described in the table below:
 114
 115 .. code-block:: shell
 116
 117    +-------------------------+-------+--------+--------+
 118    | Name                    | vCPUs | Memory | Disk   |
 119    +-------------------------+-------+--------+--------+
 120    | compute12-onap-istanbul |   16  | 24 GB  |  10 GB |
 121    | compute11-onap-istanbul |   16  | 24 GB  |  10 GB |
 122    | compute10-onap-istanbul |   16  | 24 GB  |  10 GB |
 123    | compute09-onap-istanbul |   16  | 24 GB  |  10 GB |
 124    | compute08-onap-istanbul |   16  | 24 GB  |  10 GB |
 125    | compute07-onap-istanbul |   16  | 24 GB  |  10 GB |
 126    | compute06-onap-istanbul |   16  | 24 GB  |  10 GB |
 127    | compute05-onap-istanbul |   16  | 24 GB  |  10 GB |
 128    | compute04-onap-istanbul |   16  | 24 GB  |  10 GB |
 129    | compute03-onap-istanbul |   16  | 24 GB  |  10 GB |
 130    | compute02-onap-istanbul |   16  | 24 GB  |  10 GB |
 131    | compute01-onap-istanbul |   16  | 24 GB  |  10 GB |
 132    | etcd03-onap-istanbul    |    4  |  6 GB  |  10 GB |
 133    | etcd02-onap-istanbul    |    4  |  6 GB  |  10 GB |
 134    | etcd01-onap-istanbul    |    4  |  6 GB  |  10 GB |
 135    | control03-onap-istanbul |    4  |  6 GB  |  10 GB |
 136    | control02-onap-istanbul |    4  |  6 GB  |  10 GB |
 137    | control01-onap-istanbul |    4  |  6 GB  |  10 GB |
 138    +-------------------------+-------+--------+--------+
 139
 140
 141 The test sequence can be defined as follows:
 142
 143 - Cordon a compute node (prevent any new scheduling)
 144 - Launch the node drain chaos scenario: all the pods on the given compute node
 145   are evicted
 146
 147 Once all the pods have been evicted:
 148
 149 - Uncordon the compute node
 150 - Replay a basic_vm test
 151
 152 This test has been successfully executed.
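
For reference, the cordon, drain and uncordon cycle above can be approximated
manually with kubectl. The sketch below only builds and optionally runs the
commands. It is a hand-written stand-in for the Litmus node drain experiment
used in the CI, not a copy of it. The function names are hypothetical, and the
basic_vm replay is left out because its launch command is not described here.

.. code-block:: python

    import subprocess


    def node_drain_commands(node):
        """Build the kubectl commands for a manual cordon/drain/uncordon cycle."""
        return [
            # Prevent any new scheduling on the node.
            ["kubectl", "cordon", node],
            # Evict the pods; kubectl returns once the eviction is complete.
            ["kubectl", "drain", node,
             "--ignore-daemonsets", "--delete-emptydir-data"],
            # Make the node schedulable again.
            ["kubectl", "uncordon", node],
        ]


    def run_node_drain(node, dry_run=True):
        """Print each command and execute it unless dry_run is set."""
        for cmd in node_drain_commands(node):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)

Calling ``run_node_drain("compute01-onap-istanbul")`` only prints the three
commands; pass ``dry_run=False`` to execute them against the current
kubectl context.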
 153
 154 .. image:: files/s3p/istanbul_resiliency.png
 155    :align: center
 156
 157 .. important::
 158
 159   Please note that the chaos framework selects one compute node (the first one
 160   by default).
 161   The distribution of the pods is random. On our target architecture, about 15
 162   pods are scheduled on each node. The chaos therefore affects only a limited
 163   number of pods.
 164
 165 For the Istanbul tests, the evicted pods (compute01) were:
 166
 167
 168 .. code-block:: shell
 169
 170     NAME                                          READY STATUS RESTARTS AGE
 171     onap-aaf-service-dbd8fc76b-vnmqv               1/1  Running   0    2d19h
 172     onap-aai-graphadmin-5799bfc5bb-psfvs           2/2  Running   0    2d19h
 173     onap-cassandra-1                               1/1  Running   0    2d19h
 174     onap-dcae-ves-collector-856fcb67bd-lb8sz       2/2  Running   0    2d19h
 175     onap-dcaemod-distributor-api-85df84df49-zj9zn  1/1  Running   0    2d19h
 176     onap-msb-consul-86975585d9-8nfs2               1/1  Running   0    2d19h
 177     onap-multicloud-pike-88bb965f4-v2qc8           2/2  Running   0    2d19h
 178     onap-netbox-nginx-5b9b57d885-hjv84             1/1  Running   0    2d19h
 179     onap-portal-app-66d9f54446-sjhld               2/2  Running   0    2d19h
 180     onap-sdnc-ueb-listener-5b6bb95c68-d24xr        1/1  Running   0    2d19h
 181     onap-sdnc-web-8f5c9fbcc-2l8sp                  1/1  Running   0    2d19h
 182     onap-so-779655cb6b-9tzq4                       2/2  Running   1    2d19h
 183     onap-so-oof-adapter-54b5b99788-x7rlk           2/2  Running   0    2d19h
 184
 185 In the future, it would be worthwhile to define a resiliency testing strategy
 186 that checks the eviction of all the critical components.
 187
 188 Stability tests
 189 ---------------
 190
 191 Stability tests have been performed on the Istanbul release:
 192
 193 - SDC stability test
 194 - Parallel instantiation test
 195
 196 The results can be found in the weekly backend logs
 197 https://logs.onap.org/onap-integration/weekly/onap_weekly_pod4_istanbul.
 198
 199 SDC stability test
 200 ~~~~~~~~~~~~~~~~~~
 201
 202 In this test, we consider the basic_onboard automated test and we run 5
 203 onboarding procedures in parallel for 24h.
 204
 205 The basic_onboard test consists of the following steps:
 206
 207 - [SDC] VendorOnboardStep: Onboard vendor in SDC.
 208 - [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
 209 - [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
 210 - [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
 211   in SDC.
 212
 213 The test has been initiated on the Istanbul weekly lab on the 14th of November.
 214
 215 As already observed in the daily, weekly and gating chains, we got race
 216 conditions on some tests (https://jira.onap.org/browse/INT-1918).
 217
 218 The success rate is expected to be above 95% on the first 100 model uploads
 219 and above 80% until we onboard more than 500 models.
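
As a sketch, these two thresholds can be checked on a chronological list of
test outcomes. The function names are hypothetical, and reading the second
criterion as "success rate over the first 500 uploads" is an assumption about
how the expectation is evaluated.

.. code-block:: python

    def success_rate(results):
        """Return the percentage of successful runs in a list of booleans."""
        if not results:
            return 0.0
        return 100.0 * sum(results) / len(results)


    def meets_sdc_criteria(results):
        """Check both thresholds on chronological onboarding outcomes.

        results holds one boolean per onboarding attempt, oldest first.
        """
        first_100 = results[:100]
        first_500 = results[:500]
        return success_rate(first_100) > 95.0 and success_rate(first_500) > 80.0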
 220
 221 We may also notice that the function test_duration=f(time) increases
 222 continuously. At the beginning the test takes about 200s; 24h later the same
 223 test takes around 1000s.
 224 Finally, after 36h, the SDC systematically answers with an HTTP 500 response
 225 code, which explains the linear decrease of the success rate.
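
The growth rate implied by these two figures can be estimated with a simple
linear model. This is a back-of-the-envelope sketch that assumes the increase
is linear between the two quoted points; the helper names are hypothetical
and the result is an estimate, not a measurement.

.. code-block:: python

    def duration_slope(start_s, end_s, elapsed_h):
        """Return the average increase of the test duration in seconds per hour."""
        return (end_s - start_s) / elapsed_h


    def estimated_duration(start_s, slope_s_per_h, hours):
        """Estimate the test duration after `hours`, assuming linear growth."""
        return start_s + slope_s_per_h * hours


    # About 200s at the start and about 1000s after 24h, as quoted above.
    slope = duration_slope(200, 1000, 24)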
 226
 227 The following graph provides a good view of the SDC stability test.
 228
 229 .. image:: files/s3p/istanbul_sdc_stability.png
 230   :align: center
 231
 232 .. csv-table:: S3P Onboarding stability results
 233     :file: ./files/csv/s3p-sdc.csv
 234     :widths: 60,20,20
 235     :delim: ;
 236     :header-rows: 1
 237
 238 .. important::
 239    The onboarding duration increases linearly with the number of on-boarded
 240    models. This has already been reported and may be due to the fact that
 241    models cannot be deleted: the test client has to retrieve the list of
 242    models, which is continuously increasing. No limit tests have been
 243    performed.
 244    However, 1085 on-boarded models is already a very high figure regarding the
 245    possible ONAP usage.
 246    Moreover, the mean duration is much lower in Istanbul.
 247    This explains why it was possible to run 35% more tests within the same
 248    time frame.
 249
 250 Parallel instantiations stability test
 251 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 252
 253 The test is based on a single test (basic_vm) that can be described as follows:
 254
 255 - [SDC] VendorOnboardStep: Onboard vendor in SDC.
 256 - [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC.
 257 - [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC.
 258 - [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file
 259   in SDC.
 260 - [AAI] RegisterCloudRegionStep: Register cloud region.
 261 - [AAI] ComplexCreateStep: Create complex.
 262 - [AAI] LinkCloudRegionToComplexStep: Connect cloud region with complex.
 263 - [AAI] CustomerCreateStep: Create customer.
 264 - [AAI] CustomerServiceSubscriptionCreateStep: Create customer's service
 265   subscription.
 266 - [AAI] ConnectServiceSubToCloudRegionStep: Connect service subscription with
 267   cloud region.
 268 - [SO] YamlTemplateServiceAlaCarteInstantiateStep: Instantiate service described
 269   in YAML using SO à la carte method.
 270 - [SO] YamlTemplateVnfAlaCarteInstantiateStep: Instantiate vnf described in YAML
 271   using SO à la carte method.
 272 - [SO] YamlTemplateVfModuleAlaCarteInstantiateStep: Instantiate VF module
 273   described in YAML using SO à la carte method.
 274
 275 Ten instantiation attempts run simultaneously on the ONAP solution for 24h.
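
The load pattern can be sketched as a pool of workers that each replay the
same test until a deadline is reached. This is a minimal illustration and not
the harness used in the CI: ``run_in_parallel`` and its parameters are
hypothetical names, and ``task`` stands for any callable that raises an
exception on failure.

.. code-block:: python

    import time
    from concurrent.futures import ThreadPoolExecutor


    def run_in_parallel(task, workers, duration_s):
        """Replay `task` in `workers` threads until `duration_s` has elapsed.

        Returns one boolean per attempt: True on success, False if the task
        raised an exception.
        """
        deadline = time.monotonic() + duration_s

        def loop():
            outcomes = []
            while time.monotonic() < deadline:
                try:
                    task()
                    outcomes.append(True)
                except Exception:
                    outcomes.append(False)
            return outcomes

        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(loop) for _ in range(workers)]
            return [ok for future in futures for ok in future.result()]

With a hypothetical ``basic_vm`` callable, the scenario described above would
correspond to ``run_in_parallel(basic_vm, workers=10, duration_s=24 * 3600)``.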
 276
 277 The results can be described as follows:
 278
 279 .. image:: files/s3p/istanbul_instantiation_stability_10.png
 280  :align: center
 281
 282 .. csv-table:: S3P Instantiation stability results
 283     :file: ./files/csv/s3p-instantiation.csv
 284     :widths: 60,20,20
 285     :delim: ;
 286     :header-rows: 1
 287
 288 The results are good with a success rate above 95%. After 24h more than 1300
 289 VNFs have been created and deleted.
 290
 291 As for SDC, we can observe a linear increase of the test duration. This issue
 292 has been reported since Guilin. For SDC, as it is not possible to delete the
 293 models, the duration plausibly increases because the database of models
 294 grows continuously. The client therefore has to retrieve an ever-growing
 295 list of models.
 296 This is not the case for the instantiations, as the references
 297 (module, VNF, service) are cleaned at the end of each test and all the tests
 298 use the same model. The duration of an instantiation test should therefore be
 299 almost constant, which is not the case. Further investigations are needed.
 300
 301 .. important::
 302   The test has been executed with the mariadb-galera replicaset set to 1
 303   (3 by default). With this configuration the results during 24h are very
 304   good. When set to 3, the error rate is higher and after some hours
 305   most of the instantiations fail.
 306   However, even with a replicaset set to 1, a test on the Master weekly chain
 307   showed that the system hits another limit after about 35h
 308   (https://jira.onap.org/browse/SO-3791).