Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

dnsdist: Option to servfail on timeout? #13575

Open
leshow opened this issue Dec 5, 2023 · 8 comments
Open

dnsdist: Option to servfail on timeout? #13575

leshow opened this issue Dec 5, 2023 · 8 comments

Comments

@leshow
Copy link

leshow commented Dec 5, 2023

  • Program: dnsdist
  • Issue type: Feature request

Short description

We use setServFailWhenNoServer, which returns servfail if forwarders are 'down'. But queries still timeout with no response occasionally, would it be possible to add an option like setServFailWhenTimeout that responds with a servfail when the timeout is reached? We might be able to make this change in a PR if it's straightforward.

Usecase

We have a configuration with multiple dnsdist's, where if a query fails in one we want to move to the next. We could set a client side timeout also but it would be nice if dnsdist just let us know it hit a timeout.

I'll just add, I'm not sure if this is a bug but when I use setUDPTimeout I still get responses even after the timeout period has passed. For example:

setUDPTimeout(2)
newServer("127.0.0.1:9954")

if 127.0.0.1:9954 takes 3 seconds to respond, the client still receives a response. I expected the UDP timeout to occur first and not receive a response.

@rgacogne
Copy link
Member

rgacogne commented Dec 7, 2023

So dnsdist tracks timeouts as a best effort. Sometimes we discover them quickly when we try to reuse the query ID assigned to the query that timed out, but sometimes it takes an active scan of the current outstanding queries to find them (the data structure we use is optimized to be lock-free for the general case, and we don't want to optimize for the timeout case). Which means that, by the time we discover that a timeout occurred, it is usually too late to do anything useful for the initial client which has retried or given up already.
It also explains why we sometimes still forwards a response coming after the timeout delay: the state is still there because it has not been cleaned up yet, so we accept the response. We could check the elapsed time and drop it, assuming that the client is gone already, but I'm not sure the additional check is worth it.

@rgacogne
Copy link
Member

rgacogne commented Feb 2, 2024

I'm going to close this as we are not likely to implement it at the moment, but let me know if you disagree and I'll reconsider.

@rgacogne rgacogne closed this as completed Feb 2, 2024
@leshow
Copy link
Author

leshow commented Feb 2, 2024

Sure, thanks for the response.

@karelbilek
Copy link
Contributor

I have to say this would be extremely valuable to me, it would help me chain it with restarts https://dnsdist.org/reference/dq.html#DNSResponse:restart and make dnsdist into a robust local stub resolver

@rgacogne
Copy link
Member

rgacogne commented Nov 4, 2024

Just for my understanding, what exactly would you solve by being able to retry a query after a timeout? I'm asking because so far every single time someone looked into it it turned out that the client already sent a new query (application/libc DNS stack doing the fallback itself) or gave up before the time out was even detected by dnsdist.

@karelbilek
Copy link
Contributor

We have a custom app that does not do DNS retries and it's hard to change the logic on that end (for complex reasons).

I understand though that it's a very tiny usecase.

@rgacogne
Copy link
Member

rgacogne commented Nov 4, 2024

Understood. Technically I don't think it would very hard to implement these days: we would probably need a new chain of rules that are executed on a timeout to avoid breaking existing configurations if we just re-executed an existing chain, but that's easy.
I'm re-opening this feature request and tentatively assigning it to 2.0.0, but no promises.

@rgacogne rgacogne reopened this Nov 4, 2024
@karelbilek
Copy link
Contributor

@rgacogne I am so sorry about this

I have figured out the reason why the software doesn't do retries, and it's to be as fast as possible, with a possible risk of timeouts. Adding replies makes this useless again.

So - in the end, this is not necessary for my usecase - but it might be useful for someone else...

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

No branches or pull requests

3 participants