Merge pull request kata-containers#8988 from beraldoleal/ci-docs

docs: adding an initial CI documentation
apostasie · Apr 9, 2024 · 8935324 · 8935324
2 parents f60c9ea 959e565
commit 8935324
Show file tree

Hide file tree

Showing 3 changed files with 346 additions and 0 deletions.
diff --git a/ci/README.md b/ci/README.md
@@ -0,0  1,343 @@
 # Kata Containers CI
 
 > [!WARNING]
 > While this project's CI has several areas for improvement, it is constantly
 > evolving. This document attempts to describe its current state, but due to
 > ongoing changes, you may notice some outdated information here. Feel free to
 > modify/improve this document as you use the CI and notice anything odd. The
 > community appreciates it!
 
 ## Introduction
 
 The Kata Containers CI relies on [GitHub Actions][gh-actions], where the actions
 themselves can be found in the `.github/workflows` directory, and they may call
 helper scripts, which are located under the `tests` directory, to actually
 perform the tasks required for each test case.
 
 ## The different workflows
 
 There are a few different sets of workflows that are running as part of our CI,
 and here we're going to cover the ones that are less likely to get rotten.  With
 this said, it's fair to advise that if the reader finds something that got
 rotten, opening an issue to the project pointing to the problem is a nice way to
 help, and providing a fix for the issue is a very encouraging way to help.
 
 ### Jobs that run automatically when a PR is raised
 
 These are a bunch of tests that will automatically run as soon as a PR is
 opened, they're mostly running on "cost free" runners, and they do some
 pre-checks to evaluate that your PR may be okay to start getting reviewed.
 
 Mind, though, that the community expects the contributors to, at least, build
 their code before submitting a PR, which the community sees as a very fair
 request.
 
 Without getting into the weeds with details on this, those jobs are the ones
 responsible for ensuring that:
 
 - The commit message is in the expected format
 - There's no missing Developer's Certificate of Origin
 - Static checks are passing
 
 ### Jobs that require a maintainer's approval to run
 
 These are the required tests, and our so-called "CI".  These require a
 maintainer's approval to run as parts of those jobs will be running on "paid
 runners", which are currently using Azure infrastructure.
 
 Once a maintainer of the project gives "the green light" (currently by adding an
 `ok-to-test` label to the PR, soon to be changed to commenting "/test" as part
 of a PR review), the following tests will be executed:
 
 - Build all the components (runs on free cost runners, or bare-metal depending on the architecture)
 - Create a tarball with all the components (runs on free cost runners, or bare-metal depending on the architecture)
 - Create a kata-deploy payload with the tarball generated in the previous step (runs on free costs runner, or bare-metal depending on the architecture)
 - Run the following tests:
   - Tests depending on the generated tarball
     - Metrics (runs on bare-metal)
     - `docker` (runs on Azure small instances)
     - `nerdctl` (runs on Azure small instances)
     - `kata-monitor` (runs on Azure small instances)
     - `cri-containerd` (runs on Azure small instances)
     - `nydus` (runs on Azure small instances)
     - `vfio` (runs on Azure normal instances)
   - Tests depending on the generated kata-deploy payload
     - kata-deploy (runs on Azure small instances)
       - Tests are performed using different "Kubernetes flavors", such as k0s, k3s, rke2, and Azure Kubernetes Service (AKS).
     - Kubernetes (runs in Azure small and medium instances depending on what's required by each test, and on TEE bare-metal machines)
       - Tests are performed with different runtime engines, such as CRI-O and containerd.
       - Tests are performed with different snapshotters for containerd, namely OverlayFS and devmapper.
       - Tests are performed with all the supported hypervisors, which are Cloud Hypervisor, Dragonball, Firecracker, and QEMU.
 
 For all the tests relying on Azure instances, real money is being spent, so the
 community asks for the maintainers to be mindful about those, and avoid abusing
 them to merely debug issues.
 
 ## The different runners
 
 In the previous section we've mentioned using different runners, now in this section we'll go through each type of runner used.
 
 - Cost free runners:  Those are the runners provided by GIthub itself, and
   those are fairly small machines with no virtualization capabilities enabled - 
 - Azure small instances: Those are runners which have virtualization
   capabilities enabled, 2 CPUs, and 8GB of RAM.  These runners have a "-smaller"
   suffix to their name. 
 - Azure normal instances: Those are runners which have virtualization
   capabilities enabled, 4 CPUs, and 16GB of RAM.  These runners are usually
   `garm` ones with no "-smaller" suffix.
 - Bare-metal runners: Those are runners provided by community contributors,
   and they may vary in architecture, size and virtualization capabilities.
   Builder runners don't actually require any virtualization capabilities, while
   runners which will be actually performing the tests must have virtualization
   capabilities and a reasonable amount for CPU and RAM available (at least
   matching the Azure normal instances).
 
 ## Adding new tests
 
 Before someone decides to add a new test, we strongly recommend them to go
 through [GitHub Actions Documentation][gh-actions],
 which will provide you a very sensible background on how to read and understand
 current tests we have, and also become familiar with how to write a new test.
 
 On the Kata Containers land, there are basically two sets of tests: "standalone"
 and "part of something bigger".
 
 The "standalone" tests, for example the commit message check, won't be covered
 here as they're better covered by the GitHub Actions documentation pasted above.
 
 The "part of something bigger" is the more complicated one and not so
 straightforward to add, so we'll be focusing our efforts on describing the
 addition of those.
 
 > [!NOTE]
 > TODO: Currently, this document refers to "tests" when it actually means the
 > jobs (or workflows) of GitHub. In an ideal world, except in some specific cases,
 > new tests should be added without the need to add new workflows. In the
 > not-too-distant future (hopefully), we will improve the workflows to support
 > this.
 
 ### Adding a new test that's "part of something bigger"
 
 The first important thing here is to align expectations, and we must say that
 the community strongly prefers receiving tests that already come with:
 
 - Instructions how to run them
 - A proven run where it's passing
 
 There are several ways to achieve those two requirements, and an example of that
 can be seen in PR #8115.
 
 With the expectations aligned, adding a test consists in:
 
 - Adding a new yaml file for your test, and ensure it's called from the
   "bigger" yaml. See the [Kata Monitor test example][monitor-ex01].
 
 - Adding the helper scripts needed for your test to run. Again, use the [Kata Monitor script as example][monitor-ex02].
 
 Following those examples, the community advice during the review, and even
 asking the community directly on Slack are the best ways to get your test
 accepted.
 
 ## Running tests
 
 ### Running the tests as part of the CI
 
 If you're a maintainer of the project, you'll be able to kick in the tests by
 yourself.  With the current approach, you just need to add the `ok-to-test`
 label and the tests will automatically start.  We're moving, though, to use a
 `/test` command as part of a GitHub review comment, which will simplify this
 process.
 
 If you're not a maintainer, please, send a message on Slack or wait till one of
 the maintainers reviews your PR.  Maintainers will then kick in the tests on
 your behalf.
 
 In case a test fails and there's the suspicion it happens due to flakiness in
 the test itself, please, create an issue for us, and then re-run (or asks
 maintainers to re-run) the tests following these steps:
 
 - Locate which tests is failing
 - Click in "details"
 - In the top right corner, click in "Re-run jobs"
 - And then in "Re-run failed jobs"
 - And finally click in the green "Re-run jobs" button
 
 > [!NOTE]
 > TODO: We need figures here
 
 ### Running the tests locally
 
 In this section, aligning expectations is also something very important, as one
 will not be able to run the tests exactly in the same way the tests are running
 in the CI, as one most likely won't have access to an Azure subscription.
 However, we're trying our best here to provide you with instructions on how to
 run the tests in an environment that's "close enough" and will help you to debug
 issues you find with the current tests, or even provide a proof-of-concept to
 the new test you're trying to add.
 
 The basic steps, which we will cover in details down below are:
 
  1. Create a VM matching the configuration of the target runner
  2. Generate the artifacts you'll need for the test, or download them from a
     current failed run
  3. Follow the steps provided in the action itself to run the tests.
 
 Although the general overview looks easy, we know that some tricks need to be
 shared, and we'll go through the general process of debugging one non-Kubernetes
 and one Kubernetes specific test for educational purposes.
 
 One important thing to note is that "Create a VM" can be done in innumerable
 different ways, using the tools of your choice.  For the sake of simplicity on
 this guide, we'll be using `kcli`, which we strongly recommend in case you're a
 non-experienced user, and happen to be developing on a Linux box.
 
 For both non-Kubernetes and Kubernetes cases, we'll be using PR #8070 as an
 example, which at the time this document is being written serves us very well
 the purpose, as you can see that we have `nerdctl` and Kubernetes tests failing.
 
 ## Debugging tests
 
 ### Debugging a non Kubernetes test
 
 As shown above, the `nerdctl` test is failing.
 
 As a developer you can go ahead to the details of the job, and expand the job
 that's failing in order to gather more information.
 
 But when that doesn't help, we need to set up our own environment to debug
 what's going on.
 
 Taking a look at the `nerdctl` test, which is located here, you can easily see
 that it runs-on a `garm-ubuntu-2304-smaller` virtual machine.
 
 The important parts to understand are `ubuntu-2304`, which is the OS where the
 test is running on; and "smaller", which means we're running it on a machine
 with 2 CPUs and 8GB of RAM.
 
 With this information, we can go ahead and create a similar VM locally using `kcli`.
 
 ```bash
 $ sudo kcli create vm -i ubuntu2304 -P disks=[60] -P numcpus=2 -P memory=8192 -P cpumodel=host-passthrough debug-nerdctl-pr8070
 ```
 
 In order to run the tests, you'll need the "kata-tarball" artifacts, which you
 can build your own using "make kata-tarball" (see below), or simply get them
 from the PR where the tests failed.  To download them, click on the "Summary"
 button that's on the top left corner, and then scroll down till you see the
 artifacts, as shown below.
 
 Unfortunately GitHub doesn't give us a link that we can download those from
 inside the VM, but we can download them on our local box, and then `scp` the
 tarball to the newly created VM that will be used for debugging purposes.
 
 > [!NOTE]
 > Those artifacts are only available (for 15 days) when all jobs are finished.
 
 Once you have the `kata-static.tar.xz` in your VM, you can login to the VM with
 `kcli ssh debug-nerdctl-pr8070`, go ahead and then clone your development branch
 
 ```bash
 $ git clone --branch feat_add-fc-runtime-rs https://github.com/nubificus/kata-containers
 ```
 
 Add the upstream as a remote, set up your git, and rebase your branch atop of the upstream main one
 
 ```bash
 $ git remote add upstream https://github.com/kata-containers/kata-containers
 $ git remote update
 $ git config --global user.email "[email protected]"
 $ git config --global user.name "Your Name"
 $ git rebase upstream/main 
 ```
 
 Now copy the `kata-static.tar.xz` into your `kata-containers/kata-artifacts` directory
 
 ```bash
 $ mkdir kata-artifacts
 $ cp ../kata-static.tar.xz kata-artifacts/
 ```
 
 > [!NOTE]
 > If you downloaded the .zip from GitHub you need to uncompress first to see `kata-static.tar.xz`
 
 And finally run the tests following what's in the yaml file for the test you're
 debugging. 
 
 In our case, the `run-nerdctl-tests-on-garm.yaml`.
 
 When looking at the file you'll notice that some environment variables are set,
 such as `KATA_HYPERVISOR`, and should be aware that, for this particular example,
 the important steps to follow are:
 
 Install the dependencies
 Install kata
 Run the tests
 
 Let's now run the steps mentioned above exporting the expected environment variables
 
 ```bash
 $ export KATA_HYPERVISOR=dragonball
 $ bash ./tests/integration/nerdctl/gha-run.sh install-dependencies
 $ bash ./tests/integration/nerdctl/gha-run.sh install-kata
 $ bash tests/integration/nerdctl/gha-run.sh run
 ```
 
 And with this you should've been able to reproduce exactly the same issue found
 in the CI, and from now on you can build your own code, use your own binaries,
 and have fun debugging and hacking! 
 
 ### Debugging a Kubernetes test
 
 Steps for debugging the Kubernetes tests are very similar to the ones for
 debugging non-Kubernetes tests, with the caveat that what you'll need, this
 time, is not the `kata-static.tar.xz` tarball, but rather a payload to be used
 with kata-deploy.
 
 In order to generate your own kata-deploy image you can generate your own
 `kata-static.tar.xz` and then take advantage of the following script.  Be aware
 that the image generated and uploaded must be accessible by the VM where you'll
 be performing your tests.
 
 In case you want to take advantage of the payload that was already generated
 when you faced the CI failure, which is considerably easier, take a look at the
 failed job, then click in "Deploy Kata" and expand the "Final kata-deploy.yaml
 that is used in the test" section.  From there you can see exactly what you'll
 have to use when deploying kata-deploy in your local cluster.
 
 > [!NOTE]
 > TODO: WAINER TO FINISH THIS PART BASED ON HIS PR TO RUN A LOCAL CI
 
 ## Adding new runners
 
 Any admin of the project is able to add or remove GitHub runners, and those are
 the folks you should rely on.
 
 If you need a new runner added, please, tag @ac in the Kata Containers slack,
 and someone from that group will be able to help you.
 
 If you're part of that group and you're looking for information on how to help
 someone, this is simple, and must be done in private. Basically what you have to
 do is:
 
 - Go to the kata-containers/kata-containers repo
 - Click on the Settings button, located in the top right corner
 - On the left panel, under "Code and automation", click on "Actions"
 - Click on "Runners"
 
 If you want to add a new self-hosted runner:
 
 - In the top right corner there's a green button called "New self-hosted runner"
 
 If you want to remove a current self-hosted runner:
 
 - For each runner there's a "..." menu, where you can just click and the
   "Remove runner" option will show up
 
 ## Known limitations
 
 As the GitHub actions are structured right now we cannot: Test the addition of a
 GitHub action that's not triggered by a pull_request event as part of the PR.
 
 [gh-actions]: https://docs.github.com/en/actions
 [monitor-ex01]: https://github.com/kata-containers/kata-containers/commit/a3fb067f1bccde0cbd3fd4d5de12dfb3d8c28b60
 [monitor-ex02]: https://github.com/kata-containers/kata-containers/commit/489caf1ad0fae27cfd00ba3c9ed40e3d512fa492
diff --git a/docs/README.md b/docs/README.md
@@ -40,6  40,7 @@ Documents that help to understand and contribute to Kata Containers.
 ### Design and Implementations
 
 * [Kata Containers Architecture](design/architecture): Architectural overview of Kata Containers
 * [Kata Containers CI](../ci/README.md): Kata Containers CI document
 * [Kata Containers E2E Flow](design/end-to-end-flow.md): The entire end-to-end flow of Kata Containers
 * [Kata Containers design](./design/README.md): More Kata Containers design documents
 * [Kata Containers threat model](./threat-model/threat-model.md): Kata Containers threat model

diff --git a/tests/cmd/check-spelling/kata-dictionary.dic b/tests/cmd/check-spelling/kata-dictionary.dic
@@ -222,6  222,7 @@ deliverable/AB
 deploy
 dev
 devicemapper/B
 devmapper
 dialer
 dialog/A
 dind/B
@@ -333,6  334,7 @@ serverless
 signoff/A
 snapcraft/B
 snapd/B
 snapshotters
 stalebot/B
 startup
 stderr/AB