Running antrea on a windows node gets stuck while waiting for data path #6568

Open
mkaring opened this issue Jul 29, 2024 · 10 comments
Labels
area/OS/windows Issues or PRs related to the Windows operating system. kind/bug Categorizes issue or PR as related to a bug. kind/support Categorizes issue or PR as related to a support question. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. reported-by/end-user Issues reported by end users.

Comments

@mkaring

mkaring commented Jul 29, 2024

I'm trying to use antrea to get the networking in my cluster going.

I already got some help on Slack. For reference, the thread is here.

I currently have two nodes:

  • Ubuntu LTS 22.04.4 as control-plane with containerd 1.7.19
  • Windows Server 2022 with containerd 1.7.20

I currently have Antrea v2.1.0 running on the cluster. Kubernetes is installed with version 1.29.7.
The installation was done following the Antrea documentation. The control-plane part was installed using Helm, with the default values. The Windows node was set up using the Prepare-Node.ps1 script and the antrea-windows-with-ovs.yml from the v2.1.0 release. The only change made to this file is setting kubeAPIServerOverride.

I was not entirely sure how to set this last option. Currently I have it set to https://111.222.333.444:6443, which is the exact output of kubectl config view -o jsonpath='{.clusters[0].cluster.server}'. The documentation here shows an example without the https:// part, so I'm not sure which is correct. I tried both, and it does not seem to make a difference.
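To compare the two spellings of the server address, a minimal Python sketch can help. It is not part of Antrea, and `split_server` is a hypothetical helper; it only shows that the value with and without `https://` names the same host and port. It says nothing about which form antrea-agent expects.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_server(value: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (host, port) for a server address with or without a scheme.

    Hypothetical helper for illustration only; this is not how antrea-agent
    parses kubeAPIServerOverride.
    """
    # urlsplit only recognises the host when the value carries a "//"
    # authority marker, so add one when no scheme is present.
    if "://" not in value:
        value = "//" + value
    parts = urlsplit(value)
    return parts.hostname, parts.port


if __name__ == "__main__":
    for candidate in ("https://111.222.333.444:6443", "111.222.333.444:6443"):
        print(candidate, "->", split_server(candidate))
```

Both candidates resolve to the same host and port, so any difference in behaviour would have to come from how the agent handles the scheme.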

The issue I'm seeing is the antrea-agent-windows pod on the Windows Node not starting up and showing the following error:
https://gist.github.com/mkaring/d49ade0daa1a03f58a5b919ede6829f2

According to @antoninbas (thanks again for the help in Slack), it may be helpful to look at the conf.db that is created by Open vSwitch. I attached it just in case: conf.db.txt. Accessing the information directly using ovs-vsctl seems to be difficult with Open vSwitch running inside a container.

The cluster I'm running is just for testing right now. If you want me to try anything to get to the bottom of this, just tell me. If you need any additional information, I'll gladly provide.

Thank you in advance,
Martin

@mkaring mkaring added the kind/support Categorizes issue or PR as related to a support question. label Jul 29, 2024
@antoninbas antoninbas added area/OS/windows Issues or PRs related to the Windows operating system. reported-by/end-user Issues reported by end users. labels Jul 29, 2024
@antoninbas
Contributor

antoninbas commented Jul 29, 2024

Could you confirm the following:

Also, could you share the log files which are under C:\openvswitch\var\log\openvswitch?

I am also experiencing a similar issue with Windows Server 2022, so I have asked @wenyingd for some info.
Edit: I had a misconfiguration in my test environment. Things are working as expected.

@wenyingd
Contributor

wenyingd commented Jul 30, 2024

> Accessing the information directly using ovs-vsctl seems to be difficult with Open vSwitch running inside a container.

You could try this PowerShell command to make sure the OVS utilities path is added to the current shell:

$env:PATH=[System.Environment]::GetEnvironmentVariable("PATH", "Machine")

Then you can try to run OVS commands like ovs-vsctl.exe show or ovs-ofctl.exe -OOpenFlow15 dump-flows br-int

@antoninbas antoninbas added the kind/bug Categorizes issue or PR as related to a bug. label Jul 30, 2024
@mkaring
Author

mkaring commented Jul 30, 2024

@antoninbas

  • Test signing: I did not enable this. The documentation gave me the impression that this is not required when running the fully containerized version.
  • Hyper-V: Is fully installed and confirmed working.

Logs: C:\openvswitch\var\logs does not exist. var\ only contains run as subdirectory.

@wenyingd

The command is in the path. The idea you mentioned in Slack is the reason: the command does not work over a PowerShell remoting connection. It works fine over an RDP connection.

> ovs-vsctl.exe show
0416ab53-99a7-4ac0-b537-ae62ec5bf25e
    Bridge br-int
        datapath_type: system
    ovs_version: "3.0.5"
> ovs-ofctl.exe -OOpenFlow15 dump-flows br-int
ovs-ofctl: br-int is not a bridge or a socket

I'm not sure about the second message. I can see the br-int interface in the adapter overview of Windows, and also in the Hyper-V manager.
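The `ovs-vsctl.exe show` output pasted above lists br-int without any Port entries. A small Python sketch makes that check scriptable. `ports_by_bridge` is a hypothetical helper, and it assumes the usual indented `Bridge` / `Port` layout of `ovs-vsctl show`; the sample text is copied from the output above.

```python
from typing import Dict, List, Optional


def ports_by_bridge(show_output: str) -> Dict[str, List[str]]:
    """Map each bridge in `ovs-vsctl show` output to the ports listed under it."""
    bridges: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            # Names may be quoted in the output, so drop surrounding quotes.
            current = line[len("Bridge "):].strip('"')
            bridges[current] = []
        elif line.startswith("Port ") and current is not None:
            bridges[current].append(line[len("Port "):].strip('"'))
    return bridges


# Output copied from the comment above.
SHOW_OUTPUT = """\
0416ab53-99a7-4ac0-b537-ae62ec5bf25e
    Bridge br-int
        datapath_type: system
    ovs_version: "3.0.5"
"""

if __name__ == "__main__":
    print(ports_by_bridge(SHOW_OUTPUT))  # {'br-int': []}
```

An empty port list for br-int only restates what the output shows; the thread does not confirm what it implies about the datapath.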

@antoninbas
Contributor

@mkaring as long as you are using the OVS driver provided by Antrea (hosted at https://downloads.antrea.io), you will need to enable test-signed drivers. We do not provide a driver signed with a certificate from a trusted root authority. This could explain why initialization is failing.

@wenyingd is there a way to confirm that this is causing the issue? I was assuming that Install-OVS.ps1 would fail in that case (causing the initContainer to fail), but maybe that's not the case.
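One way to check whether test signing is enabled on the node is to look for a `testsigning` line in the output of `bcdedit`, run from an elevated prompt. The Python sketch below only parses that text. `is_test_signing_enabled` is a hypothetical helper, the sample excerpt is illustrative rather than taken from the reporter's machine, and English-language output is assumed. It does not confirm whether the OVS extension can actually be enabled.

```python
def is_test_signing_enabled(bcdedit_output: str) -> bool:
    """Return True if the bcdedit output contains a 'testsigning Yes' entry.

    When test signing has never been set, bcdedit normally prints no
    testsigning line at all, which is treated as disabled here.
    """
    for line in bcdedit_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].lower() == "testsigning":
            return fields[1].lower() == "yes"
    return False


# Illustrative excerpt only, not captured from the node in this issue.
SAMPLE_OUTPUT = """\
Windows Boot Loader
-------------------
identifier              {current}
testsigning             Yes
"""

if __name__ == "__main__":
    print(is_test_signing_enabled(SAMPLE_OUTPUT))
```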

@mkaring
Author

mkaring commented Aug 1, 2024

This is going to be a real problem. Are there any known providers for OVS that provide WHQL signed drivers? My server is set to boot using secure boot and that is something I can't change. Test-signed drivers and secure boot do not like each other.

@wenyingd
Contributor

wenyingd commented Aug 1, 2024

> @wenyingd is there a way to confirm that this is causing the issue? I was assuming that Install-OVS.ps1 would fail in that case (causing the initContainer to fail), but maybe that's not the case.

If I remember correctly, the step that installs the OVS driver can succeed, but the workload is affected when Antrea tries to enable the extension (ovsext) on a VMSwitch on the host. I can give it a try.

@antoninbas
Contributor

> This is going to be a real problem. Are there any known providers for OVS that provide WHQL signed drivers? My server is set to boot using secure boot and that is something I can't change. Test-signed drivers and secure boot do not like each other.

AFAIK, not for free / with an open license. VMware provides one for customers.
Antrea also expects a specific driver version for OVS, which we test with (and sometimes includes some necessary patches).
You can also get your own release signature.

@antoninbas
Contributor

@wenyingd it would be good to fail early or at least have a way to clearly identify that it is the issue

Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2024

3 participants