Support multiple Egress IPs per Egress #6591

Open
antoninbas opened this issue Aug 7, 2024 · 3 comments
Labels
area/transit/egress Issues or PRs related to Egress (SNAT for traffic egressing the cluster). kind/feature Categorizes issue or PR as related to a new feature. reported-by/end-user Issues reported by end users.

Comments

@antoninbas
Contributor

Problem statement

Currently an Egress resource only supports a single Egress IP. Workloads selected by the Egress will always use this Egress IP for all outgoing connections to the external network.

Supporting multiple Egress IPs per Egress can enable the following use cases:

  1. Ability to load-balance Egress traffic across multiple Egress Nodes, with each Egress Node responsible for one Egress IP. This can enable higher Egress traffic throughput for a given Namespace (for example), without having to manually carve the Namespace into smaller "units" (e.g., at the service level), each with its own Egress policy. This could also help scale the number of concurrent connections.
  2. Ability to select a specific Egress IP based on the cluster topology. For example, in a cluster deployed over multiple availability zones (AZs), an Egress could be assigned one Egress IP per AZ. When applying the Egress policy to a highly available application deployed across multiple AZs, outgoing Pod connections to the external network should select the (local) Egress IP for the AZ in which the Pod is deployed. See also "Egress IP with AZ affinity" (#5252).

While it is possible to create multiple Egresses (each with its own Egress IP) selecting the same workload, only one Egress IP will "win" for each Node in the cluster: all connections originating from selected Pods on the Node will use the same Egress IP. Moreover, it is likely that the same Egress IP will "win" on all Nodes in the cluster (it is determined by the order in which the Egress resources are processed by the Antrea Agent, and the order should be the same across all Nodes / Agents in the cluster). So in practice, a single Egress IP will be used, and creating multiple Egresses doesn't really achieve anything.

Changes to the Egress API

At the very least, we would need to allow multiple externalIPPool values and multiple egressIP values. Making a singular field plural in a K8s API is fairly "common", and well-documented: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#making-a-singular-field-plural. We should follow these directions when making changes to the Egress CRD.

In the first use case described above, we want to use a single ExternalIPPool and allocate more than one Egress IP from the pool.
In the second use case described above, we want to use multiple ExternalIPPool(s) and allocate at least one Egress IP from each pool.
This means that we may have len(egressIPs) > len(externalIPPools). We may need a new CRD field to tell Antrea how many IPs to allocate from each pool (same count would apply across all pools).
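To make the shape of the change concrete, here is a minimal Python sketch of the data model. It is only an illustration: the plural fields and the `ips_per_pool` count are hypothetical names for the fields discussed above, not part of the current Egress CRD, and `normalize` is a simplified version of the "singular mirrors the first plural element" compatibility rule from the K8s API conventions.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass
class EgressSpec:
    # Existing singular fields, kept for backward compatibility.
    egress_ip: Optional[str] = None
    external_ip_pool: Optional[str] = None
    # Hypothetical plural fields (names are assumptions).
    egress_ips: List[str] = field(default_factory=list)
    external_ip_pools: List[str] = field(default_factory=list)
    # Hypothetical count: how many IPs to allocate from each pool.
    ips_per_pool: int = 1


def _sync(singular, plural, name):
    """Keep a singular field consistent with element 0 of its plural field."""
    if plural:
        if singular is not None and singular != plural[0]:
            raise ValueError(f"{name}: singular value must equal first plural value")
        return plural[0], list(plural)
    if singular is not None:
        # Old-style object: promote the singular value to a one-element list.
        return singular, [singular]
    return None, []


def normalize(spec):
    """Return a copy of spec in which singular and plural fields agree."""
    ip, ips = _sync(spec.egress_ip, spec.egress_ips, "egressIP")
    pool, pools = _sync(spec.external_ip_pool, spec.external_ip_pools, "externalIPPool")
    return replace(spec, egress_ip=ip, egress_ips=ips,
                   external_ip_pool=pool, external_ip_pools=pools)


def expected_ip_count(spec):
    """Same count applies to every pool, so the total can exceed the pool count."""
    return len(spec.external_ip_pools) * spec.ips_per_pool
```

With two pools and `ips_per_pool=2`, `expected_ip_count` is 4, which is the `len(egressIPs) > len(externalIPPools)` case.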

Then there is the "topology-aware" part of the API, which would rely on the topology.kubernetes.io/zone Node label.
The "zone" of each Egress IP could be determined by the Node to which it is assigned.
The Egress resource should let the user specify that Egress IPs in the same zone should be preferred. Additionally, a user may want to be able to express that if an Egress IP is local to the Node on which a selected workload Pod is running, the local Egress IP should always be used. If multiple Egress IPs have the same preference level, one will be selected at random.
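The preference order described above (Node-local first, then same zone, then anything else, with random selection among ties) could be sketched as follows. This is an illustration under assumptions, not a proposed implementation: `AssignedEgressIP`, `preference` and `select_egress_ip` are hypothetical names, and the zone is assumed to come from the topology.kubernetes.io/zone label of the Node holding the IP.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AssignedEgressIP:
    ip: str
    node: str  # Node the Egress IP is currently assigned to
    zone: str  # zone label of that Node


def preference(egress_ip, pod_node, pod_zone):
    """Lower is better: 0 = same Node, 1 = same zone, 2 = anywhere else."""
    if egress_ip.node == pod_node:
        return 0
    if egress_ip.zone == pod_zone:
        return 1
    return 2


def select_egress_ip(candidates: List[AssignedEgressIP], pod_node, pod_zone,
                     rng=random) -> Optional[AssignedEgressIP]:
    """Pick one of the most-preferred Egress IPs, at random among ties."""
    if not candidates:
        return None
    best = min(preference(c, pod_node, pod_zone) for c in candidates)
    tied = [c for c in candidates if preference(c, pod_node, pod_zone) == best]
    return rng.choice(tied)
```

Because this only expresses a preference, a Pod in a zone without any Egress IP still gets one from another zone; a strict variant would return `None` instead (see the first open question below).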

Additionally, it would be nice to have the concept of "anti-affinity" for Egress IPs allocated from the same ExternalIPPool, so that they can be assigned to different Nodes whenever possible. Note that today, assignment of Egress IPs to specific Nodes already requires iterating over all Egress resources for every change (in particular, this is necessary to implement maxEgressIPsPerNode correctly, see #4627), so I believe that this should be straightforward to implement, without impacting performance.
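One possible way to express anti-affinity is a greedy pass that, for each Egress IP, picks the eligible Node holding the fewest IPs of the same Egress, while still honoring maxEgressIPsPerNode. The sketch below is an assumption made for illustration: it does not reflect how the Antrea Agent selects Egress Nodes today, and `assign_with_anti_affinity` is a hypothetical helper.

```python
from collections import Counter


def assign_with_anti_affinity(egress_ips, nodes, existing_load, max_ips_per_node):
    """Spread the IPs of one Egress across Nodes whenever possible.

    egress_ips: IPs belonging to a single Egress.
    nodes: names of eligible Egress Nodes.
    existing_load: dict of Node name -> Egress IPs already assigned (all Egresses).
    Returns a dict of IP -> Node; IPs that do not fit anywhere are left out.
    """
    load = Counter(existing_load)
    same_egress = Counter()  # IPs of *this* Egress per Node
    assignment = {}
    for ip in egress_ips:
        eligible = [n for n in nodes if load[n] < max_ips_per_node]
        if not eligible:
            break  # no capacity left; remaining IPs stay unassigned
        # Prefer Nodes with fewer IPs of this Egress, then less loaded Nodes;
        # the Node name is only a deterministic tie-breaker.
        node = min(eligible, key=lambda n: (same_egress[n], load[n], n))
        assignment[ip] = node
        load[node] += 1
        same_egress[node] += 1
    return assignment
```

Anti-affinity is "whenever possible" here: with fewer Nodes than IPs, some Nodes end up holding more than one IP of the same Egress rather than leaving IPs unassigned.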

Alternative considered

Rather than support multiple IPs per Egress, we could also change how the implementation handles multiple Egresses applied to the same workload (see description above). However, @tnqn has made the point that it would be harder to implement (and possibly less user-friendly as well), as the Egress controller would no longer be able to process Egress resources independently from each other.

Open questions

  1. Should users be able to restrict how Egress IPs are selected, or is it enough for users to express preferences? For example, in use case 2 above, is it enough for users to express a preference for Egress IPs in the same AZ, or should users be able to restrict selected Egress IPs to be in the same AZ (in which case, if there is no Egress IP in the same AZ, traffic will be dropped at the source Node)?
  2. Should connections be sticky? When an Egress IP is added for a given Egress resource, should we ensure that existing connections keep using the same Egress IP to avoid disruptions? And similarly, when an Egress IP is deleted, should we ensure that unaffected connections keep using the same IP? @tnqn has pointed out that if connection stickiness is desired, it will increase the complexity and statefulness of the implementation (we may have to rely on CT mark or OVS flow learning at the source Node).
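To illustrate why stickiness implies extra state, the sketch below compares a stateless, hash-based choice with one that remembers the IP picked for each connection. It is a toy model under assumptions: the CRC32-modulo selection is a stand-in and not how Antrea or OVS would load-balance, and the `StickySelector` dictionary only plays the role that CT mark or flow learning would play in the datapath.

```python
import zlib


def pick_stateless(flow, egress_ips):
    """Choose an Egress IP from a hash of the flow only (no per-flow state)."""
    h = zlib.crc32(flow.encode())
    return egress_ips[h % len(egress_ips)]


class StickySelector:
    """Remembers the Egress IP chosen for each flow (hypothetical helper)."""

    def __init__(self):
        self._pinned = {}

    def pick(self, flow, egress_ips):
        ip = self._pinned.get(flow)
        if ip not in egress_ips:
            # New flow, or its pinned IP was removed: choose again and pin.
            ip = pick_stateless(flow, egress_ips)
            self._pinned[flow] = ip
        return ip


# Made-up flows and documentation-range addresses.
flows = [f"10.10.0.{i % 250}:{30000 + i}->203.0.113.7:443" for i in range(1000)]
before = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
after = before + ["192.0.2.4"]  # one Egress IP added

moved = sum(pick_stateless(f, before) != pick_stateless(f, after) for f in flows)
print(f"stateless: {moved} of {len(flows)} flows would switch Egress IP")
```

With the stateless choice, adding a single IP moves many existing flows to a different Egress IP, which would break established connections. The sticky variant keeps every pinned flow on its IP, at the cost of tracking one entry per connection.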

@antoninbas added the kind/feature and area/transit/egress labels on Aug 7, 2024
@antoninbas added the reported-by/end-user label on Aug 7, 2024
@jainpulkit22
Contributor

Hi @antoninbas can you explain the anti-affinity part, what flexibility does it provide, and why is it needed?

@antoninbas
Contributor Author

> Hi @antoninbas can you explain the anti-affinity part, what flexibility does it provide, and why is it needed?

If we can load-balance traffic for a given Egress across multiple Egress IPs, then it is better if the different Egress IPs can be assigned to different Nodes. This way the traffic load for the Egress is split across multiple Nodes.
For example, if I have one Egress with 3 Egress IPs and 3Gbps of Egress traffic, it is better if I can have 3 different Nodes, each handling 1 Egress IP and 1Gbps of traffic. If there is no notion of anti-affinity, I can get unlucky (this becomes less likely as the number of eligible Nodes increases) and end up with one Node responsible for all 3 Egress IPs and the full 3Gbps of traffic.
Additionally, in case of a Node failure and Egress IP failover, if I have 3 Egress IPs assigned to 3 different Egress Nodes, only 1/3 of the connections are affected by the Node failure.
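As a back-of-the-envelope check of the "unlucky" case, the snippet below assumes each Egress IP is placed on a Node uniformly and independently at random. That is a simplifying model chosen for illustration, not a description of the actual Node selection logic, and both function names are hypothetical.

```python
def prob_all_on_one_node(num_ips, num_nodes):
    """Probability that every IP lands on the same Node (uniform random model)."""
    # Any of num_nodes Nodes can be the shared one; each IP picks it with 1/num_nodes.
    return num_nodes * (1 / num_nodes) ** num_ips


def share_affected_by_failure(ips_on_failed_node, num_ips):
    """Fraction of the Egress's IPs (and roughly its connections) hit by a Node failure."""
    return ips_on_failed_node / num_ips


for n in (3, 10, 50):
    print(n, "eligible Nodes:", prob_all_on_one_node(3, n))
```

Under this model the probability for 3 IPs is 1/n² for n eligible Nodes, so it shrinks quickly as Nodes are added but never reaches zero without explicit anti-affinity. With one IP per Node, a single Node failure affects 1/3 of the IPs instead of all of them.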

@jainpulkit22
Contributor

@antoninbas got it, thanks for the explanation.
