HttpClient fails to initialize SSL. #64492
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/ncl, @vcsjones

Issue Details

Hey! We've got a strange issue with SSL initialization after upgrading to .NET 6.0 from .NET 5.0.

We have multiple microservices deployed on our own Kubernetes. There are 4 physical nodes on Debian Buster, and each node hosts several virtual machines as Kubernetes nodes. The microservice images are based on mcr.microsoft.com/dotnet/aspnet:6.0 with some debugging/servicing additions like apt-get install curl and so on. We've also tried :6.0-focal and it seemed to help, but I'll describe that below.

The microservices mainly exchange data with each other over plain http through their internal hostnames, but for some specific reasons they sometimes connect to each other over the global network by https (in general it may be some external data provider which is not under our control).

The thing is, sometimes when an https connection is starting it fails to initialize the openssl library with some strange errors - error:04067084:rsa routines:rsa_ossl_public_decrypt:data too large for modulus and/or error:04067072:rsa routines:rsa_ossl_public_decrypt:padding check failed. As far as I know, there are no calls to rsa_ossl_public_decrypt in openssl initialization. It fails randomly on different pods, different nodes, etc. If a connection on a pod didn't fail, it won't ever fail. If a connection on a pod failed, it keeps failing until the pod is restarted, and then it repeats - after the restart it may fail again or not. It seems like there is some threading issue here, but that is just a guess. I've tried to figure out the SSL initialization process by reading the sources of System.Net.Security and the Crypto.Native wrapper, but no luck so far.

I still cannot reproduce this in a simple application, environment, or test. It does not reproduce on Windows or in local WSL2 Docker - only in our Kubernetes environment so far (we have some other environments, but the one mentioned is our dev, and we cannot upgrade the others while dev is failing).

The other thing I've found is that it happens only if we have /etc/ssl/openssl.cnf with some non-empty system default settings (system_default_sect, as described in #46271). If we leave system_default_sect empty or provide no config file at all (that is the case for :6.0-focal, which has no openssl config file by default), everything works fine. If we put there just the ciphers that are already the defaults as described at https://docs.microsoft.com/en-us/dotnet/core/compatibility/cryptography/5.0/default-cipher-suites-for-tls-on-linux, it starts failing. And we need these config defaults (just like in the :6.0 image) because we still need to connect to sites with some obsolete weak ciphers.

It worked fine on .NET 5.0.

Could you please suggest something or direct me to a way I can investigate this further?

Two exception stack examples attached: modulus.txt, padding.txt.
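(For reference, an /etc/ssl/openssl.cnf with a populated system_default_sect - roughly what the Debian-based :6.0 image ships - looks like the following. The values here are an illustration of the defaults described in the linked docs, not a copy of the attached openssl.txt.)

    openssl_conf = default_conf

    [default_conf]
    ssl_conf = ssl_sect

    [ssl_sect]
    system_default = system_default_sect

    [system_default_sect]
    MinProtocol = TLSv1.2
    CipherString = DEFAULT@SECLEVEL=2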
cc: @bartonjs
Any ideas? At least whether it is a bug or some configuration/behavior breaking change, etc.?
A few questions: First, can you include the exact openssl.cnf you're using?
Second, do you have an estimate for how often "sometimes" is?
Well, we are not using it explicitly; it was just a part of the investigation. We just use the aspnet:6.0 image with no modifications to openssl.cnf - and by the way, it is the very same as in the aspnet:5.0 image. Anyway, I've attached it for convenience (with a .txt extension because GitHub doesn't accept .cnf) - openssl.txt
Well, we have a deploy job in our GitLab CI which deploys every microservice we're using to our dev environment. There are about 20 microservices, but some of them are scaled and some of them don't make https connections at all. In summary, there are about 12 pods deployed which make https connections, and at least 5 of them are affected by the problem on every deploy - meaning that right after a deploy there are 5 pods which cannot make https requests until a restart (but a pod restart has the same probability to fix it or not) or the next deploy (where we will just have another 5 pods, or even the very same ones).
Found a similar-looking conversation on the internet: https://www.mail-archive.com/openssl-users@openssl.org/msg89470.html (based on which I have a guess). Maybe we just need to move this
One thing that jumped out at me (perhaps intentional?) is that we end up p/invoking it twice - see runtime/src/libraries/Common/src/Interop/Unix/System.Net.Security.Native/Interop.Initialization.cs, lines 23 to 26 at 57bfe47: once from CryptoInitializer's cctor and then again from line 26. On the native side, CryptoNative_EnsureOpenSslInitialized is called by CryptoNative_EnsureLibSslInitialized (see runtime/src/native/libs/System.Security.Cryptography.Native/pal_ssl.c, lines 126 to 128 at 53d5288) under a pthread_once call, but the rest of that function's body (including DetectCiphersuiteConfiguration()) runs on each call to CryptoNative_EnsureLibSslInitialized. If the first call encountered an error and the error list wasn't cleared, the next time we can end up with stale errors (per the conversation in the linked thread).
I don't think it's a double-init/concurrency problem (even though apparently both System.Net.Http and System.Net.Security will call EnsureLibSslInitialized), but a tainted error queue problem. The SSL initializer throws if there's any error in the error queue after it calls the P/Invoke. What that check is meant to express is "did any error get raised during initialization"... but if the error was left over by some previous operation (which perhaps hit the error and considered it skippable/part of a fallback without throwing) at just the core crypto layer, then we'll think SSL failed to initialize. Alternatively, it could be that the error is coming from inside our configuration code itself... but if that's the case then it probably just means we're reporting the wrong error (the last one from a cascade instead of the first/correct one for the operation... which is a problem we're already looking to systematically address for vNext). If you can get a repro under a debugger, doing something like
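To make the check described above concrete, here is a rough sketch of the shape of that initializer. This is an assumption for illustration, not the actual Interop.Initialization.cs source; the P/Invoke signatures are guessed from the native export names mentioned in this thread.

    using System;
    using System.Runtime.InteropServices;
    using System.Security.Cryptography;

    internal static class SslInitializerSketch
    {
        internal static void Initialize()
        {
            // Native libssl init; the core OpenSSL init inside is guarded by pthread_once.
            EnsureLibSslInitialized();

            // Peek (not pop) OpenSSL's thread-local error queue. Anything still in the
            // queue here is treated as an initialization failure -- even an entry left
            // over from an earlier, already-handled operation such as a failed RSA
            // signature check.
            ulong error = ErrPeekLastError();
            if (error != 0)
            {
                throw new CryptographicException($"OpenSSL error 0x{error:X} while initializing libssl");
            }
        }

        // Assumed wrappers around the native exports named in this thread.
        [DllImport("libSystem.Security.Cryptography.Native.OpenSsl", EntryPoint = "CryptoNative_EnsureLibSslInitialized")]
        private static extern void EnsureLibSslInitialized();

        [DllImport("libSystem.Security.Cryptography.Native.OpenSsl", EntryPoint = "CryptoNative_ErrPeekLastError")]
        private static extern ulong ErrPeekLastError();
    }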
Could you please elaborate? Does it mean I can attach to the dotnet process on an affected pod with some debugger like gdb/SOS, put a breakpoint at the right place somehow, and then call some code within it? Or should some part of the runtime (System.Net.Security in particular) or the whole runtime be rebuilt with that
SOS isn't required, since no managed code is involved. For LLDB it'd be something like:
$ lldb dotnet -- bin/Debug/netcoreapp6.0/HttpClientStartupBug.dll
Added Microsoft public symbol server
(lldb) target create "/usr/bin/dotnet"
Current executable set to '/usr/bin/dotnet' (x86_64).
(lldb) settings set -- target.run-args "bin/Debug/netcoreapp6.0/HttpClientStartupBug.dll"
(lldb) b CryptoNative_EnsureLibSslInitialized
Breakpoint 1: no locations (pending).
WARNING: Unable to resolve breakpoint to any actual locations.
(lldb) run
Process 21619 launched: '/usr/bin/dotnet' (x86_64)
1 location added to breakpoint 1
Process 21619 stopped
* thread #14, name = '.NET ThreadPool', stop reason = breakpoint 1.1
frame #0: 0x00007ffff4134bf0 libSystem.Security.Cryptography.Native.OpenSsl.so`CryptoNative_EnsureLibSslInitialized
libSystem.Security.Cryptography.Native.OpenSsl.so`CryptoNative_EnsureLibSslInitialized:
-> 0x7ffff4134bf0 <+0>: pushq %rbp
0x7ffff4134bf1 <+1>: pushq %r15
0x7ffff4134bf3 <+3>: pushq %r14
0x7ffff4134bf5 <+5>: pushq %r13
(lldb) finish
Process 21619 stopped
* thread #14, name = '.NET ThreadPool', stop reason = step out
frame #0: 0x00007fff7dce2a1f
-> 0x7fff7dce2a1f: leaq -0x80(%rbp), %rdi
0x7fff7dce2a23: callq *0x835af(%rip)
0x7fff7dce2a29: leaq -0x80(%rbp), %rdi
0x7fff7dce2a2d: callq *0x8359d(%rip)
(lldb) call (void)ERR_print_errors_fp(stdout);
(lldb)
Mine did nothing, because there were no errors in the queue. Had there been, it'd look like:
(lldb) call (void)ERR_print_errors_fp(stdout);
140737353987904:error:0406706C:rsa routines:rsa_ossl_public_decrypt:data greater than mod len:../crypto/rsa/rsa_ossl.c:539:
(lldb)
Thanks for such a detailed explanation! I've managed to debug our apps in a similar way, but still no luck. It seems that an attached debugger somehow affects how the app works, and the bug doesn't reproduce. I've made a script for gdb (not lldb, since it can't 'finish' and 'call' in one breakpoint command) to attach to the dotnet process in a running Kubernetes pod, and I run it right after the container start to catch the SSL initialization routine. And I've changed the breakpoint location to CryptoNative_ErrPeekLastError to be sure the thread suspends right after initialization, so the debugger has less effect. However, everything works fine with it. If I don't attach a debugger, the error comes back. There were also some thread suspends because of SIG34 and SIGPIPE; maybe these play a part somehow, but I don't know. I added ignores for these in the script. I'll try to run the applications with just an attached debugger (without breakpoints, just attach and continue) to see if it helps, and if it does, it would probably be a workaround (a very odd one though), but I wonder whether the attached debugger will affect the overall performance of the application. And as far as I understand, if I run the dotnet process under gdb (defining the entrypoint of the docker image as something like 'gdb run dotnet ...'), I won't be able to detach gdb after the breakpoint hit to ensure there is no performance hit, at least right after SSL initialization. Maybe some other ideas? The gdb script I've used (the whole shell command which runs it):
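(The script itself isn't included above. As an illustration only - not the author's actual script - a gdb batch invocation along the lines described, attaching to the dotnet process, ignoring SIG34/SIGPIPE, breaking on CryptoNative_ErrPeekLastError and dumping the error queue, might look like this:)

    cat > /tmp/ssl-debug.gdb <<'EOF'
    handle SIG34 nostop noprint pass
    handle SIGPIPE nostop noprint pass
    break CryptoNative_ErrPeekLastError
    commands
      call (void)ERR_print_errors_fp(stdout)
      continue
    end
    continue
    EOF
    gdb -p "$(pidof dotnet)" -batch -x /tmp/ssl-debug.gdb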
Well, the idea/hope is that hitting the error once will let the full error queue be displayed, and that from there it'll be "obvious" what's wrong.
If I remove that 'call (void)ERR_print_errors_fp(stdout)' from the script, the error comes back. It seems that it actually does the trick, because it pops all the errors from the queue, and then execution continues as if there were no errors. But for some reason it does not print these errors to stdout, nor to the gdb log or lldb stdout. Tested by 'call fprintf(stdout, "test")'. It seems that there is no tty for the dotnet process (ps aux shows a question mark). Will try to figure it out. At least changing the image entrypoint to 'gdb run dotnet...' will probably work.
ERR_print_errors_fp will empty the error queue. You can try stderr as well. Some of it depends on how the process is started and whether the output is closed or redirected. You can also try "docker logs" (or the Kubernetes equivalent).
Finally managed to get the output of 'call (void)ERR_print_errors_fp(stdout)' - it was just necessary to 'call fflush(stdout)' additionally to see the output in the pod logs. Unfortunately there is not much info:
and
Sometimes these errors appear twice, e.g. two 04067084 errors and/or two pairs of 0407008A and 04067072. I've also tried to build two custom aspnet images - one for Debian Buster with the .NET 6 runtime and one for Debian Bullseye with the .NET 5 runtime - and tried to use them as the base images for our microservices. Those with .NET 5 run ok, and the ones with .NET 6 reproduce the issue. Also, no parallel https connections are needed to reproduce this - just one API request right after the pod start which makes an https connection can cause the issue, so it seems this is not caused by multiple openssl inits nor by any threading synchronization. Perhaps it is some kind of memory corruption somewhere, or maybe some pointers are not NULL'ed on init... or maybe there really is something that calls rsa_ossl_public_decrypt. I've looked at the pal_ssl.c that @am11 was talking about, and I wonder why there is an ERR_clear_error call at all in the 'no system default' branch. It would be a kind of workaround if that were called unconditionally, though it would just hide the real issue happening somewhere. I'll try to see whether CryptoNative_EnsureOpenSslInitialized or OPENSSL_init_ssl get called before CryptoNative_EnsureLibSslInitialized, and also whether there is any call to rsa_ossl_public_decrypt.
So here is what is going on:
There is a call to CryptoNative_RsaVerifyHash right before (or, it looks like, even at the same time as) the call to CryptoNative_EnsureLibSslInitialized, so it seems the errors in the queue are left over from that former call. I've tried to find, through all the runtime sources of the 6.0.1 branch, where the CryptoNative_RsaVerifyHash call is, and the only place I've found so far is libraries/Common/tests/System/Security/Cryptography/AlgorithmImplementations/RSA/SignVerify.cs and its descendants, which contain several tests that call it. And it is very odd, but it looks like these tests are constantly fired by some timer in the running application, because I see those calls run periodically within the same thread when the application is idle (the number of calls differs - sometimes seven, sometimes five, some have shorter stacks - but they still go one by one in the same thread). Any thoughts?
I searched too little. Those are indeed tests - not the runtime's though, but rather our own e2e tests :) - and those calls came from the JWT signature validation routine in the authentication middleware. Looks like I'm stuck. I suspected there is some race or whatever with that signature validation, but the issue seems to reproduce even without using the authorization header at all (in our dev environment there is no strict requirement of authorization). Tried to regenerate the RSA keys within the IDM (we're using Keycloak) but still no luck.
Finally managed to reproduce it in a simple case. It turned out to be the combination of two things:
So in my simple case I have two schemes (AddJwtBearer), and the first added scheme is not the one suited for the incoming request; the first scheme tries to authenticate with the incoming token and fails, and then something happens so that errors are left in the openssl error queue, and then the outgoing https connection init fails. I'll try to reproduce this in local Docker a bit later, and will in any case attach a small reproducible solution.
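A minimal sketch of the authentication setup being described (scheme names, Keycloak URLs and the minimal-hosting Program.cs are illustrative assumptions, not copied from the repro solution; requires the Microsoft.AspNetCore.Authentication.JwtBearer package):

    var builder = WebApplication.CreateBuilder(args);

    builder.Services
        .AddAuthentication()
        // First scheme: not the one that issued the incoming token, so it fails to
        // authenticate the request -- and, per the observation above, can leave
        // entries in the OpenSSL error queue on Linux.
        .AddJwtBearer("RealmOne", options =>
        {
            options.Authority = "http://keycloak:8080/auth/realms/one"; // plain-http IDM
            options.RequireHttpsMetadata = false;
        })
        // Second scheme: the one that actually validates the token.
        .AddJwtBearer("RealmTwo", options =>
        {
            options.Authority = "http://keycloak:8080/auth/realms/two";
            options.RequireHttpsMetadata = false;
        });

    var app = builder.Build();
    app.UseAuthentication();
    // ... map controllers; an outgoing HttpClient https call made after the failed
    // authentication attempt is what then fails to initialize SSL.
    app.Run();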
Sure. Use anything other than SslStream ;)

// Fires the crypto initializer, but not the "Ssl" initializer
RSA.Create().Dispose();

// Re-fires the crypto initializer, and also the "Ssl" initializer
[something with SslStream]

// This might actually fire the crypto initializer a third time, and the Ssl one a second time, but we can probably remove that (unless we still have the curl handler?)
[make HttpClient do an HTTPS thing]

But there's not a problem with the crypto initializer firing twice (or more). It's designed for that. Looking at what's going on, we have
Given that there's already a clearing path, and nothing ever reports failure (meaning nothing should ever have a reason to look at the error queue), I don't understand why that last check is there. It got added in dotnet/corefx#29171, but I don't see anything particularly justifying it. I think my recommendation is to just... stop checking.
Hey! Here is the solution to reproduce this issue; I've added comments to clarify what's going on.
We've ended up with this workaround (calling this method at the very start of our services):

// Forces the internal Interop+SslInitializer type (and with it libssl) to initialize
// early, while the OpenSSL error queue is still clean.
// Requires using System.Net.Security and System.Reflection.
private void InitSsl()
{
    var sslInitializerType = typeof(SslStream).Assembly.GetType("Interop+SslInitializer");
    if (sslInitializerType != null)
    {
        try
        {
            sslInitializerType.InvokeMember("Initialize", BindingFlags.Static | BindingFlags.NonPublic | BindingFlags.InvokeMethod, null, null, null);
            _logger.LogInformation("OpenSSL successfully inited");
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex, "Failed to initialize OpenSSL");
        }
    }
    else
    {
        _logger.LogWarning("Failed to initialize OpenSSL: no initializer type found");
    }
}

Wonder if it could be done in a more 'intelligent' way. I'll also try to make a docker-compose for the whole reproducible solution, but it needs some time to prepare a test Keycloak db and build an image with it.
How do you trigger or observe the failure @meareindel? I build and run the docker image but nothing obvious shows up. |
I meant to describe this yesterday but was already too tired to think straight. For this solution to reproduce the issue, you need some additional IDM deployed, like Keycloak, Identity4 or whatever, reachable over plain http. Then you point both authentication schemes at it, modifying the constants in UrlsHelper (if it is Keycloak; I doubt it will work unchanged with some other IDM because of how the URLs are formed). Then you run both under Docker and hit the BreakSsl endpoint of ClientController. It will get a JWT from the second scheme and send it to the Authenticate endpoint of ServerController just to run the token validation routine, so that the first scheme rejects the token and then the second scheme validates it. Then it will make a call to the MakeHttps endpoint of ServerController, which in turn will try to make an https connection to github - and at this point it should fail. A bit complicated, which is why I'll try to make a docker-compose for the whole pipeline, but a bit later.
Here we are:
I've simplified the URL forming compared to the previous code (though it now depends on how docker-compose.yml is written) and added preconfigured realms to Keycloak so it imports them at start. Realm one is just an empty realm (no users), and realm two has a user 'test/test' with which the .NET app logs in to get the JWT.
Btw, I am getting this error @meareindel - do you think it could be the same issue or a different one?
@sagarsumant, it could be this issue if you have the same prerequisites (multiple authentication schemes with metadata and keys fetched over plain http - or at least set manually from some config at app start - and you send JWTs which the first scheme cannot authenticate); or it could be an issue with self-signed certificates, for example, like in #62810 - my workaround will not help you in that case; or it could be another issue entirely, maybe even a still-unknown one. You can try searching the existing runtime issues and/or Google, or create a new one. The type 'Interop+SslInitializer' exists only when you run your app on Unix/Linux, so on IIS Express or Kestrel in Visual Studio it will always be null. You can run on Docker by right-clicking the project in VS and choosing "Add -> Docker support... -> Linux" in case you've never done it before; you will also need Docker Desktop running on WSL2. After that you will be able to select the "Docker" option in the Debug target list. And you will also need to add a ports config to the "Docker" section of launchSettings.json to be able to access your app from the host browser or whatever:
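For example, something along these lines (the port value is illustrative; property names follow the Visual Studio Container Tools launch profile schema):

    "Docker": {
      "commandName": "Docker",
      "launchBrowser": true,
      "launchUrl": "{Scheme}://{ServiceHost}:{ServicePort}",
      "publishAllPorts": true,
      "useSSL": false,
      "httpPort": 8080
    }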
Triage: Regression against 5.0. We should look at it in 7.0 and consider backporting to servicing once we know scope and risk of the fix. |
I have been unable to reproduce this in main; I suspect #65148 has fixed it. Either way, I have noticed something strange: the errors in question ("data too large for modulus" or "padding check failed") should always be cleared when the method returns from
and gathered the threads from which the breakpoints were hit, and according to the log, the calls to
LLDB output data
@bartonjs am I missing something vital? Because this seems very wrong to me, as we might potentially be missing many errors. Anyway, removing the check in
Yeah, I was having trouble reproducing it, too. The best I've come up with is that .NET doesn't claim exclusive control over OpenSSL and the repro may involve something else also calling into OpenSSL which is contaminating the error queue. With #65148 we're still throwing based on the last reported error, but we're clearing at the start of operations to reduce the risk of cross-reporting errors. |