Unable to use Load Balancer's IP address for the ingress gateway #361

Closed
Gmerold opened this issue May 28, 2024 · 14 comments · Fixed by #420

Comments

@Gmerold

Gmerold commented May 28, 2024

Bug Description

A new version of pydantic-core breaks the fallback to the Load Balancer's IP for the ingress gateway when external-hostname is not configured:

pydantic_core._pydantic_core.ValidationError: 1 validation error for IngressProviderAppData
ingress.url
  Input should be a valid URL, invalid IPv4 address [type=url_parsing, input_value='http://sdcore-nms.10.0.0.2/', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/url_parsing

Potential solution here could be using nip.io to pretend LB IP is a legit URL (e.g. 10.0.0.2.nip.io)

To Reproduce

https://canonical-charmed-aether-sd-core.readthedocs-hosted.com/en/stable/tutorials/getting_started/

Environment

Juju 3.4
Microk8s 1.27-strict/stable
Traefik latest/stable

Relevant log output

pydantic_core._pydantic_core.ValidationError: 1 validation error for IngressProviderAppData
ingress.url
  Input should be a valid URL, invalid IPv4 address [type=url_parsing, input_value='http://sdcore-nms.10.0.0.2/', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/url_parsing

Additional context

No response

@PietroPasotti
Contributor

PietroPasotti commented May 30, 2024

We think the issue is that the url being submitted to traefik is wrong because it is in fact not a valid ipv4 address: http://sdcore-nms.10.0.0.2/
pydantic deduces it's ipv4 because it ends in digits.
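To illustrate the "ends in digits" point: as far as I understand, pydantic's URL parsing follows the WHATWG URL Standard, where a host whose last label is numeric is treated as an IPv4 address. Below is a simplified sketch of that check. The function name `host_ends_in_number` is made up for illustration, and the real rule also covers hexadecimal labels such as `0x1f`, which this sketch ignores.

```python
def host_ends_in_number(host: str) -> bool:
    """Simplified check: is the last dot-separated label all ASCII digits?

    Hypothetical helper, not part of pydantic or the traefik library.
    A host for which this returns True gets parsed as an IPv4 address,
    so anything other than four valid octets is rejected.
    """
    labels = host.rstrip(".").split(".")
    last = labels[-1]
    return last.isascii() and last.isdigit()


# The host from the bug report ends in "2", so it is treated as IPv4.
print(host_ends_in_number("sdcore-nms.10.0.0.2"))
# With a nip.io suffix the last label is "io", so it is an ordinary domain.
print(host_ends_in_number("sdcore-nms.10.0.0.2.nip.io"))
```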

Is it an option to turn the address around and let it be http://10.0.0.2.sdcore-nms/ instead, which would be a valid DNS record?

@Gmerold
Author

Gmerold commented Jun 3, 2024

I agree with your thinking ;)
That's why I proposed using nip.io. It turns the IP into a valid URL, eliminates the need to add entries to /etc/hosts, and makes the URL feel natural (unlike http://10.0.0.2.sdcore-nms/, which kinda reverses the natural order, don't you think?).

@mmkay
Contributor

mmkay commented Aug 21, 2024

@Gmerold: I see that the documentation is using nip.io at the moment. Is there anything that you think we should do on the traefik side as well? Or maybe this is something we should improve in traefik's documentation?

@Gmerold
Author

Gmerold commented Aug 23, 2024

Hello @mmkay, which documentation do you mean? SD-Core?
We are using nip.io indeed (as an alternative to setting up the DNS server), but Traefik is still broken. I don't think it's a matter of documentation, but rather of handling the case when the external-hostname is not set and the charm falls back to the LB's IP.

@lucabello
Contributor

Currently, the ingress library is using AnyHttpUrl to validate the field; however, that fails.

We could solve this by either contributing a change upstream to pydantic (so that AnyHttpUrl accepts this type of url), or by writing a custom validator to accept it.

@ca-scribner
Contributor

I think what @PietroPasotti is getting at is that the linked doc uses https://sdcore-nms.10.0.0.4.nip.io, but this bug report used https://sdcore-nms.10.0.0.4 (which should not be valid because a top level domain's end cannot be purely numerical)

Doing some pure pydantic testing (not with traefik's lib, just pydantic itself), we can see:

from pydantic import BaseModel, AnyHttpUrl, ValidationError


class MyModel(BaseModel):
    url: AnyHttpUrl


# Will pass validation
MyModel(url="http://valid.com")  # a control
MyModel(url="http://valid.com1")  # Valid even though it ends with a number
MyModel(url="http://10.0.0.4.nip.io")
MyModel(url="http://sdcore-nms.10.0.0.4.nip.io")

# Will fail validation
try:
    MyModel(url="http://invalid url")  # a control
except ValidationError:
    pass
else:
    raise Exception("I should have failed")

try:
    # fails because last segment is entirely numeric
    MyModel(url="http://sdcore-nms.10.0.0.4")
except ValidationError:
    pass
else:
    raise Exception("I should have failed")

This feels consistent with other places too. For example, type https://sdcore-nms.10.0.0.4 in your Chrome URL bar and it'll automatically notice it is not a URL and search on it instead.

So having said all that (and having not actually looked at the traefik charm), is the missing .nip.io in the url because it was missing in the input, or did traefik strip it somewhere?

@Gmerold
Author

Gmerold commented Sep 10, 2024

Hi @sed-i,
Actually it's neither :)
First of all, the behavior of Chrome you are describing is new. Chrome used to accept https://sdcore-nms.10.0.0.4. But that's not the main problem.
The external_hostname config of the Traefik charm is optional. If you don't specify it, LB IP will be used for building URLs of the proxied applications. In our case, we don't have an external, publicly available URL for Traefik. We're using nip.io to keep things as simple as possible. The problem is that the default "URL" produced by Traefik (client application name + Traefik's LB IP) doesn't pass the validation anymore and that fails the deployment of the bundle. On the other hand, we can't use nip.io to set the external_hostname config before Traefik is deployed, because we don't know the LB IP (it's assigned from the pool).
That's why I'm proposing using nip.io at the charm level - to make sure that if the optional external_hostname is not set by the user we still end up getting a valid URL instead of charm in error state.
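A minimal sketch of the fallback proposed above, assuming the charm knows the configured `external_hostname` and the Load Balancer address. The function name `effective_hostname` is hypothetical and not taken from the charm's code. It only handles dotted IPv4 addresses, which is the form nip.io resolves in names like `10.0.0.2.nip.io`.

```python
import ipaddress


def effective_hostname(external_hostname: str, lb_address: str) -> str:
    """Pick the hostname used to build ingress URLs (hypothetical helper).

    Prefers the user's external_hostname; otherwise, if the Load Balancer
    address is a bare IPv4 address, wraps it in a nip.io name so that
    subdomains of it are still valid hostnames.
    """
    if external_hostname:
        return external_hostname
    try:
        ipaddress.IPv4Address(lb_address)
    except ValueError:
        # Not an IPv4 literal, so assume it is already a DNS name.
        return lb_address
    return f"{lb_address}.nip.io"


print(effective_hostname("", "10.0.0.2"))
```

With this, a subdomain route would become something like `sdcore-nms.10.0.0.2.nip.io` instead of `sdcore-nms.10.0.0.2`. The trade-off is a runtime dependency on an external DNS service.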

@gruyaume

Can this issue be prioritised? Every one of our charmed 5G deployments is affected by it. In addition, our tutorials and documentation look bad, as we have to reference this issue and let users know that it's expected for traefik to be in an error state.

Reference:

Model      Controller                  Cloud/Region                Version  SLA          Timestamp
private5g  microk8s-classic-localhost  microk8s-classic/localhost  3.4.5    unsupported  08:08:50Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Address         Exposed  Message
amf                       1.4.4    active       1  sdcore-amf-k8s            1.5/edge       707  10.152.183.176  no       
ausf                      1.4.2    active       1  sdcore-ausf-k8s           1.5/edge       520  10.152.183.65   no       
grafana-agent             0.32.1   waiting      1  grafana-agent-k8s         latest/stable   45  10.152.183.221  no       installing agent
mongodb                            active       1  mongodb-k8s               6/beta          38  10.152.183.92   no       Primary
nms                       1.0.0    active       1  sdcore-nms-k8s            1.5/edge       580  10.152.183.141  no       
nrf                       1.4.1    active       1  sdcore-nrf-k8s            1.5/edge       580  10.152.183.130  no       
nssf                      1.4.1    active       1  sdcore-nssf-k8s           1.5/edge       462  10.152.183.62   no       
pcf                       1.4.3    active       1  sdcore-pcf-k8s            1.5/edge       512  10.152.183.144  no       
router                             active       1  sdcore-router-k8s         1.5/edge       341  10.152.183.218  no       
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  10.152.183.33   no       
smf                       1.5.2    active       1  sdcore-smf-k8s            1.5/edge       590  10.152.183.64   no       
traefik                   v2.11.0  waiting      1  traefik-k8s               latest/stable  194  10.152.183.198  no       installing agent
udm                       1.4.3    active       1  sdcore-udm-k8s            1.5/edge       489  10.152.183.31   no       
udr                       1.4.1    active       1  sdcore-udr-k8s            1.5/edge       486  10.152.183.82   no       
upf                       1.4.0    active       1  sdcore-upf-k8s            1.5/edge       591  10.152.183.164  no       

Unit                         Workload  Agent  Address      Ports  Message
amf/0*                       active    idle   10.1.10.181         
ausf/0*                      active    idle   10.1.10.186         
grafana-agent/0*             blocked   idle   10.1.10.133         grafana-cloud-config: off, logging-consumer: off
mongodb/0*                   active    idle   10.1.10.155         Primary
nms/0*                       active    idle   10.1.10.174         
nrf/0*                       active    idle   10.1.10.151         
nssf/0*                      active    idle   10.1.10.136         
pcf/0*                       active    idle   10.1.10.146         
router/0*                    active    idle   10.1.10.145         
self-signed-certificates/0*  active    idle   10.1.10.141         
smf/0*                       active    idle   10.1.10.154         
traefik/0*                   error     idle   10.1.10.160         hook failed: "ingress-relation-changed"
udm/0*                       active    idle   10.1.10.187         
udr/0*                       active    idle   10.1.10.176         
upf/0*                       active    idle   10.1.10.169

@simskij
Member

simskij commented Oct 4, 2024

@dstathis can you please make sure this is included in the pulse that starts on Monday? Thanks.

@dstathis
Contributor

dstathis commented Oct 4, 2024

Yup no problem

@ca-scribner
Contributor

I think the issue here is just misconfiguration. Traefik has two routing_modes:

  • path: (default) provides routes as paths, eg: http://1.2.3.4/mymodel-myapp
  • subdomain: provides routes as subdomains, eg: http://mymodel.myapp.1.2.3.4 (or maybe mymodel-myapp, can't remember)

If you're using the loadbalancer IP as the domain, then subdomain really isn't valid (since mymodel.myapp.1.2.3.4 isn't a valid domain based on the above conversation). Feels like path is the only valid config here.

Is there a reason why path wouldn't work here? That seems like the easy fix that can be implemented user-side, with none of the side-effect risk we'd take on by adding .nip.io.
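For reference, here is a rough sketch of how the two routing modes shape the URL. The helper `ingress_url` is hypothetical, and the `model-app` naming for the subdomain is an assumption based on the `sdcore-nms` host in the error log, not something confirmed from the charm's code.

```python
def ingress_url(routing_mode: str, host: str, model: str, app: str,
                scheme: str = "http") -> str:
    """Build an ingress URL for one app (illustrative only).

    path mode puts the model/app name after the host, so a bare IP works.
    subdomain mode puts it in front of the host, which yields an invalid
    hostname when the host is a bare IP.
    """
    name = f"{model}-{app}"  # assumed naming scheme
    if routing_mode == "path":
        return f"{scheme}://{host}/{name}"
    if routing_mode == "subdomain":
        return f"{scheme}://{name}.{host}/"
    raise ValueError(f"unknown routing_mode: {routing_mode!r}")


print(ingress_url("path", "1.2.3.4", "mymodel", "myapp"))
print(ingress_url("subdomain", "1.2.3.4", "mymodel", "myapp"))
```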

@Gmerold
Author

Gmerold commented Nov 6, 2024

This kinda reminds me of a story about my buddy. He used to have a car with a broken gearbox; only second and fourth gear would work. One day I had to drive this car and obviously I wanted to start in first gear. After I struggled for a short while, my buddy told me to use second gear instead. After starting in second gear, I had to push the RPMs really high to be able to change to fourth gear directly, because third wouldn't work either. When I asked him about fixing the gearbox, he was like "nah, two of them still work".
Traefik has two routing modes and it should be the user's decision which one to use. If a correct charm configuration produces incorrect output, it is a problem in the charm. If you're afraid of the side effects of using .nip.io, an alternative approach could be making the charm require external_hostname when subdomain is used.

@ca-scribner
Contributor

Yes agreed, the root issue here is that if subdomain is used, then we need to require an external_hostname to be configured. I'm working to implement that constraint now. In future, expect that this charm will (more gracefully) block someone from using IP+subdomain
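The constraint could look roughly like this. It is a sketch only; `config_problem` is a made-up name, and how the charm would surface the message (a log warning or a blocked status) is left open.

```python
from typing import Optional


def config_problem(routing_mode: str, external_hostname: str) -> Optional[str]:
    """Return a message if the config combination is invalid, else None.

    Hypothetical helper: subdomain routing needs a real hostname, because
    a subdomain of a bare Load Balancer IP is not a valid host.
    """
    if routing_mode == "subdomain" and not external_hostname:
        return "routing_mode=subdomain requires external_hostname to be set"
    return None


print(config_problem("subdomain", ""))
print(config_problem("path", ""))
```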

ca-scribner added a commit that referenced this issue Nov 12, 2024
…outing_path

Previously, this charm accepted the following configuration:
* routing_mode=subdomain
* external_hostname="" # (unset)

When external_hostname is unset, the url provided for any application related by ingress uses the LoadBalancer's external address, which may be an IP.  In these cases, it would provide charms urls like `model.app.1.2.3.4`, which are invalid URLs (the last segment of a URL cannot be all-numeric).  This led to an uncaught pydantic validation error when calling `ipa.publish_url()` because that method includes validation of the URL, putting the charm in Error state.  An example of this is shown in #361

Ideally, the fix here would be to validate the charm configuration+LoadBalancer details and halt charm execution if the configuration was invalid, putting the charm into BlockedStatus until resolved.  The problem is that the current architecture of this charm makes that solution challenging.  The charm is designed to atomically handle events (doing only the work a particular event needs) rather than holistically (recomputing the world on each event), meaning that skipping or losing track of events leads to undesired charm states.  Also, `ipa.publish_url()` is called deep in most (all?) event handlers, making it difficult to properly handle these errors at the charm level without a major refactor of the charm.

As a compromise, the following has been done:
* the traefik_k8s/v2/ingress.py library's `publish_url` method has been updated to catch the pydantic validation error cited in #361 and log it rather than raise it to the charm.  The library then writes ingress=None to the databag instead of the invalid URL, giving a soft indication to the user that the url is invalid.
* the config.yaml descriptions for routing_mode and external_hostname have been updated to explain the incompatibility in these settings
* config validation has been added to the __init__ of the charm for routing_mode and external_hostname.  If routing_mode==subdomain and hostname is unset, the charm will log warnings for the user about the possible incompatibility (but will not block the charm)

The upshot of these changes is that this charm will:
* not go into an unresponsive error state
* as best it can given the current charm architecture, warn the user of the misconfiguration
* not risk losing events through defer or getting event sequencing wrong

Fixes #361
@dstathis dstathis removed the Checked label Nov 13, 2024
ca-scribner added a commit that referenced this issue Nov 15, 2024
…#420)

@ca-scribner
Contributor

#420 adds a fix for this, in that we now clearly state that this charm should not be deployed with routing_mode=subdomain and an unset external_hostname. That's added to the config descriptions, and there are some warning messages that'll appear if this comes up.

#420 stops short of actually putting the charm into BlockedStatus and forcing a user to avoid this setting combination. tl/dr: the current architecture of the charm makes actually blocking on bad config difficult. There's a near-term plan (definitely this sprint, probably in the next month or two) to refactor the charm entirely and hopefully address this better, but for now we get just the warnings.
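The "log instead of raise" behaviour can be sketched as below. This is not the library's actual code: `validate_url` is a stdlib stand-in for the pydantic validation, and `publishable_ingress` is a hypothetical name for the step that decides what goes into the databag.

```python
import ipaddress
import logging
from typing import Optional
from urllib.parse import urlsplit

logger = logging.getLogger(__name__)


def validate_url(url: str) -> str:
    """Stand-in validator: reject hosts that end in a number but are not IPv4."""
    host = urlsplit(url).hostname or ""
    if not host:
        raise ValueError("missing host")
    if host.rstrip(".").split(".")[-1].isdigit():
        # Raises a ValueError subclass for hosts like "sdcore-nms.10.0.0.2".
        ipaddress.IPv4Address(host)
    return url


def publishable_ingress(url: str) -> Optional[dict]:
    """Return the data to publish, or None if the URL fails validation."""
    try:
        return {"url": validate_url(url)}
    except ValueError as e:
        # Log and carry on, so the hook does not fail on a bad URL.
        logger.error("not publishing invalid ingress url %r: %s", url, e)
        return None


print(publishable_ingress("http://sdcore-nms.10.0.0.2/"))
```

The cost of this approach is that the failure is only visible in the logs and as a missing ingress value, which is why it is a stopgap rather than a proper blocked status.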

@simskij simskij reopened this Nov 18, 2024
@simskij simskij closed this as completed Nov 18, 2024
@simskij simskij reopened this Nov 18, 2024
@simskij simskij closed this as completed Nov 18, 2024

8 participants