-
Notifications
You must be signed in to change notification settings - Fork 573
Zalenium versions that use Selenium 3.11 intermittently fail re: "No System TLS" #491
Comments
Hi @borisivan That sounds like it could be related to root certificates or something similar? I am not 100% sure, but I see the exception comes from the HttpRequest, we would need to investigate. |
Hi @diemol, thanks. I wasn't sure if that is coming from container<->container communication/forwarding, or from the selenium server back to the test, etc. Still not really sure. I have successfully placed the root CA certificates on both the zalenium and the elgalu container, ran update-ca-certificates, and tagged this container (based on latest from both, today). In the suite, I specify that version: ... 6 other elgalu containers I've verified that from the containers, I can now successfully reach the website I'm testing via https, which I couldn't do before. before, from within the container, only curl http://the-site would succeed.... https would not. now, from within the container, the curl command works with both http and https. I also changed the tests to only attempt to access the site via http, instead of https... same thing. Occasional failure with: org.openqa.selenium.WebDriverException: cannot forward the request No System TLS I've also verified that it's added to the java keystore in the containers. |
We have also been finding tests fail intermittently with the "No System TLS" error (along with a bunch of other Java timeouts; not sure if they're related? https://gist.github.com/matthew-horrocks/b947646aacd07b5c4a4ad4751cb5d01a) The failure seems to occur at a random point through the test. At the point of failure, screenshots of the test show the browser displaying a perfectly rendered page. Out of 1900 tests, the failures will occur in ~5% of tests. We're using the dotnet bindings for Selenium with the tests being run from a Windows 2012R2 server. Rolling back to Let me know what else you need. Here is our
Here is our docker-compose.yaml
|
Hi @diemol, taking a step back and looking at this, I think this is simply a different footprint for what has existed all along, which I still can't figure out. There are occasional failures where pre selenium 3.11 the footprint was along the lines of:
and after going to selenium 3.11 with the switch to okHttp the footprint is:
This happens randomly in that it is never the same test, and sometimes all 240 tests will succeed without hitting it. A couple others have logged issues from time to time re: the 'cannot forward' intermittent problem, and I think we need to find a way to narrow down what's happening. I think the biggest question I have is: how can I tell if this error is relevant to zalenium / docker, versus a problem on the site I'm testing (or a load balancer that sits in front of it)? Specifically, when we get the "cannot forward" error, I'm not sure what to conclude from the stack trace. If I think of the overall system as: [a] the test (webDriver client) on a jenkins slave -> maven -> java -> testNG suite And of course there is [x] -- the host that runs docker, which is where [a] [b] [c] live. ... where is the footprint failure message coming from, and what is is trying to tell us? Who can't forward a request to whom? I'm not sure if everything is working fine re: [a] [b] [c], and the failure message is relative to the load balancer / application saying that it's having a problem, or if the can't forward error is relative to some point of [b] or [c] saying that it can't forward something. Similar question for the "No System TLS" -- what entity is noticing a problem and making this declaration -- who is saying this, with regards to what attempted communication? chrome_driver.log doesn't show a problem. Things appear fine, the test is moving along and then as soon as the problem is hit, the next set of messages in chrome_driver.log appear to be me setting the cookie re: test failed. re: pre selenium 3.11 / 3.11, I can say that w/ 3.11 the problem happens about 2x more frequently, and I haven't adjusted the logic in my suite to successfully catch that footprint and mark the test as skipped yet, so at this point other than an experimental node that uses zalenium:latest, I've bound my production to use 3.10 (rebuilt with certs, though the certs have made no impact on anything regarding the frequency of this failure). |
Hi @matthew-horrocks -- I also see the footprint (w 3.11) sometimes appear as: unexpected end of stream on Connection, as in your github link above. With that many containers I assume you're running tests in parallel -- do you happen to know if the problem seems to occur to session "x" immediately after session "y" either launches a new browser or closes the one it was using? I wonder if when zalenium goes through it's logic of marking a container as in-use due to a new session being created (or freeing up on browser close), there is a small window of time in which there is some kind of lock or race condition leads to this problem. edited due to me constantly forgetting that github treats my comments as escape characters.... |
The only references I can find for "No System TLS" are in okhttp at and |
Yes, I think I wrote something similar in Slack. I believe the change comes from upstream, because Apache Http is being replaced in different parts with OkHttp. I'll keep you posted. |
@borisivan sorry, I completely missed your comment at #491 (comment) Yes, all tests run in parallel, using all available containers. The error seems to occur at a seemingly random point through a test; unfortunately I haven't been able to match the failure time to a browser in another container being torn down. |
@diemol, regarding the "It seems that Apache Http retries connections automatically and you need to explicitly set that in OkHttp" -- if I understand correctly, you'll be looking at that configuration change re: the okHttp usage, sounds good. But I was trying to understand more about the problem -- if this is happening mid-test like @matthew-horrocks and I are seeing, which connection is this referring to, and does this mean that a connection that was previously established is getting disconnected when we're not expecting it? Sounds like the 'retry connection' might help something keep moving along if it were to get disconnected, but I'm looking to understand what's getting disconnected, and the typical reason why. That way, maybe I can reduce the likelihood of occurrence via other change. I was hoping to understand which connection is getting broken -- do we know if it's the TCP connection to the webserver under test? Or is this a TCP connection between zalenium to elgalu (container to container), etc? Knowing which might also help us understand if there is any tuning re: keepalive / intervals that can be tweaked that would help reduce the likelihood of occurrence. |
@borisivan @matthew-horrocks, a couple of questions:
Regarding the retries parameter in OkHttp, I was checking the Selenium source code, and the OkHttp builder has by default the retry parameter set to true. So it might be that @borisivan is right, that this is the same stack trace we were getting before, but just that OkHttp prompts a readable message. Based on the feedback we can get from the two previous questions, we can:
|
One more question, are you both running tests only on Chrome? Or more specifically, do the errors happen only when Chrome is used? |
Hi @diemol, a couple of the questions might take some digging, but answering the ones I know the answers to now: To understand a bit more the situation, are you both testing with root CA certificates added to the containers? I saw the problems before adding the root CA cert to the containers, and continue to see the problem now that I rebuilt with containers with them added -- no impact. It is not 100% clear for me if the error happens when the test starts, or when the page under test is being loaded, do you have an idea around when the problem shows up? I think I saw it at initial URL get before, but not sure. Latest examples I've looked at are mid test -- initial URL loaded, many selenium steps performed successfully, can happen anywhere in the test. Is it possible to see the Zalenium log in debug mode when the bug happens? If not, at least in normal mode? At one point I was saving the Zalenium log by default, let me see if I can re-enable that, and I can enable debug. But I wasn't obtaining the various logs from the elgalu containers. If I were to obtain the elgalu container, is there a way to enable that with debug too, or will the debug option only enable debug in the master zalenium container? Are you always using a fixed number of containers and a big number of maxTestSessions? Or are containers being created on the fly as tests are executed? I certainly see it with 6 containers that have a large # of maxTestSessions, I don't run with the on-the-fly mode, but I can try that if needed. I found a way to use ApacheHttp instead of OkHttp, so we could configure that and release. If the problem goes away, we know that we can focus on tweaking OkHttp: When I use the 3.10 version of the container, I do see occasional problems that I suspect might be just the different footprint of the same issue that's occurring, though it also happens less frequently. The footprint seen with the 3.11+ version of the container happens 4-5x more frequently (around once every 250 tests). If the problem does not go 100% away, the problem might be that when a container is being created, it is not 100% ready (some system component didn't load completely) and then even if it registered to the hub it needs a bit more to be up completely. We could add a check for that if we are able to diagnose what has to be checked: problem happens mid-test, so I doubt that's the issue. |
https://github.com/zalando/zalenium/releases/tag/3.11.0c was just released, there are some fixes that solve some NPE introduced when adding more logging in the 3.11.0b release. I was bumping into similar (but not the same one) stack traces when running a similar scenario as yours @borisivan. After reading the source code in Selenium and comparing it to the reported stack trace, the error pops up in middle test, so it is not related to a container being ready or not. And it is unlikely related to be using Apache Http or OkHttp. My only guess right now is a TSL handshake timeout. Since it works for most of the tests in the same container but not for some. Before continuing, could you please try the latest release and see if we get a better error message? |
Also, would it be possible to try it without root CA? |
@diemol I am bumping into the issue as well. I have 3 tests running in parallel each test taking less than 3 seconds to execute. I am randomly seeing issue this issue where the tests fail with the above mentioned error .
This is how my docker compose file looks like |
This is not really a solution, but I discovered that you can switch the http client from okhttp back to apache via a system property, if somebody wants to try setting this environment variable on their zalenium container, then it'd be interesting to know if it's more stable:
@diemol says this isn't a long term solution as apache will be removed. But thanks to @gsaimadhu on #568 he posted a more complete stack trace with a curious caused by:
It seems when the okhttp client initialises it configures TLS (which we don't actually need) and randomly initialising TLS is failing because something is screwing up reading the java trust store. Maybe it ran out of file handles or something strange like that. It might be possible to cache the trust manager so that all instances of okhttp client use the same trust manager (since we don't actually even need to use it as communication to the nodes is not https). |
I found where the "Invalid keystore format" exception comes from: http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/sun/security/provider/JavaKeyStore.java#650 But it doesn't really help us, basically, it sounds like as it's reading the file the "magic" file headers aren't matching. |
With this issue unresolved for so many weeks, the development team is losing confidence in automated testing because of regular false alarms from our qa side. |
This has prevented me from using any new Zalenium builds as well, though we still use Zalenium. I provide the arguments to the docker run to force the specific version of Zalenium as well as the elgalu container to be the version that existed before they made this change (several months old at this point), and obviously don't see the TLS problem. But I don't get the benefit of other bug fixes or latest versions of chrome/FF/selenium inside the container, either, which is a serious problem, but not as a serious as having 1-2 random tests fail each suite execution with the TLS error message. |
I understand what you guys mean, but that is the current status. Of course we want to fix it, and if we had found what the exact reason is and how it could be fixed, we would have done it already. Please also understand that we have jobs and other tasks, and therefore we cannot spend all the time we want in this. With the recent info added in the duplicated issue we will investigate more about that stacktrace and hopefully find a fix. |
Hi @diemol --understood. It also wouldn't surprise me if this ends up being something like what @pearj mentions re: file handles or some such resource, which might be more affected with the okHTTP library instead of apache, etc, and in reality an overall docker-host-infrastructure problem as opposed to any zalenium-specific code. But alas, the effect is pretty severe. |
@diemol, i understand and appreciate you and your team initiatives and hardwork for community. My intention is not to put pressure on you. My intention is to request you an alternate way to unblock us with an option if possible to use direct docker selenium using Zalenium. |
@gsaimadhu one alternative to try for now was to do what @pearj suggested in this comment |
Sure, I want to try this. But, need clarification on where and how to use this variable setting:ZALENIUM_EXTRA_JVM_PARAMS="-Dwebdriver.http.factory=apache" Do i need to run as part of Zalenium command like below? |
Thanks diemol. I have run the command as specified in the above comment. This issue is NOT observed with the setting ZALENIUM_EXTRA_JVM_PARAMS="-Dwebdriver.http.factory=apache" For your reference, i am attaching the Zalenium logs. Please go through and let me know if any further investigation required. |
Thanks for the feedback @gsaimadhu! I think that means that the issue is in the Zalenium container and not in the docker-selenium ones. We will keep checking there. @borisivan Could you please try the same workaround to confirm? |
I was facing the same issue and the workaround did the trick. I am not facing the TLS error now. @diemol |
Thank you @amalsgit @gsaimadhu for confirming that using Apache makes the issue go away. Nevertheless, this is just a workaround since in the long term the Grid will use only OkHttp. Therefore, I have been working on this one most of the day and I have not been able to figure out the root cause. The provided logs show that this is happening after a session has been created and using nodes that have already executed more than one test. This kind of discards that there is an issue with the containers or their configuration. What I also tried was to run this code inside the containers to see what SSL supported and enabled protocols they have: import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.io.IOException;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
public class ProtocolTest {
public static void main (String[] args) throws NoSuchAlgorithmException, KeyManagementException, IOException {
SSLContext context = SSLContext.getInstance("TLS");
context.init(null,null,null);
SSLSocketFactory factory = context.getSocketFactory();
SSLSocket socket = (SSLSocket)factory.createSocket();
String[] protocols = socket.getSupportedProtocols();
System.out.println("Supported Protocols: " + protocols.length);
for(int i = 0; i < protocols.length; i++)
{
System.out.println(" " + protocols[i]);
}
protocols = socket.getEnabledProtocols();
System.out.println("Enabled Protocols: " + protocols.length);
for(int i = 0; i < protocols.length; i++)
{
System.out.println(" " + protocols[i]);
}
}
} And the result in both (Zalenium and docker-selenium) was: Supported Protocols: 5
SSLv2Hello
SSLv3
TLSv1
TLSv1.1
TLSv1.2
Enabled Protocols: 3
TLSv1
TLSv1.1
TLSv1.2 Which confirms that the underlaying configuration looks fine. I haven't been able to reproduce it constantly (which makes sense since the bug reports show that it does not happen always). So what we could try is to overwrite the Since I have not seen this bug reported anywhere else under similar conditions (OkHttp in docker) I am going into the direction of thinking that there is something not 100% clear and retry could work. I will prepare a PR and then we can release soon to get feedback. |
@diemol for retry you’d probably want to try an exponential backoff type scenario with a random wait, since it might be related to file handles you need to spread out the times that they are retrying. But I think we could get a long way by reimplementing the okhttp client in the registry, but caching the ssl bits across all requests, as they are the ones that failed. From my quick research they appear thread safe. Plus the regular docker proxies won’t use ssl anyway. Only the cloud proxies I’m guessing? Or maybe not even them? |
@pearj for the first approach I want to implement a very simple retry (3 times) with a random delay between them and see what feedback we get. If that actually solves the problem, then we can evolve it to something more sophisticated I would say. The good thing is that I was able to reproduce it using brute force, by this I mean that I executed 100 times a small test suite with two threads and three tests per thread, starting Zalenium from scratch and stopping it afterwards. I got the TLS thing two times. So I'll implement the retry and see what happens today. |
About reimplementing the OkHttp client, I don't know, I would not like to divert too much from the upstream implementation unless it is really needed. If there is really something that needs to be handled, ideally we should implement it, try it, and then send a PR upstream. |
@diemol yeah fair enough. I think the main thing we need to check is that SSL to different hosts still works after changing the OkHttp client, although I'm not immediately sure how to run a selenium node under SSL. |
Hmm, having a look at caching the trustmanager and the ssl connection factory. It would be a bit fiddly, or we'd have to copy/paste init code out of okhttp itself. Can't see the selenium guys accepting something like that... Maybe we need more information about this problem, ie if @amalsgit @gsaimadhu have unusually low file limits set for docker. @diemol you said that you'd seen this error before too? |
Hmm, I think one way to reuse the http client is to use an existing http client to build another one. That might be the way to do it. |
Ahh, you can grab the factories out of an existing client. I guess that'd be another way to do it. You'd have a build an initial client, and then grab the @diemol are the selenium guys likely to accept something we don't know how to reproduce, ie, we say we're fixing this "No System TLS" problem, so we refactor OkHttpClient usage a bit, but we can't reproduce the original problem? Although they could obviously verify that we haven't broken selenium by our refactor. |
@pearj I was able to reproduce just by running lots of tests, after executing ~500 tests I get the issue once or twice. The implemented retry is very naive, just catches the AssertionError, logs it and tries again up to 3 times, after that, it gets thrown. So I executed the ~500 tests a couple of times more, with a local version of Zalenium with the retry, and I can see how the issue pops up but the retry handles it and by the 2nd retry it goes well. I'll run the tests a couple of times more and then make a PR is things keep going well. |
Just created #597 which seems to solve the issue with a very simple retry. If things go well, we'll merge and release soon so you guys can check it. |
@diemol interesting, how many concurrent threads did you have running? I presume this was docker and not minikube? Also, just read up a bit and noticed you said you could reproduce before, missed that message, whoops! |
I wonder if this issue is reproducible on the official docker selenium containers? |
@pearj I only had two threads with 3 tests each one, basically I ran the tests we execute when we want to release The retry is being logged like this: I can try to reproduce only with the containers in the next days. Could you have a look at #597 please? |
Adding retry to getHttpClient when "No System TLS" pops up #491
Hi all, This has been fixed in the latest release, the workaround should not be needed anymore. Please try the https://github.com/zalando/zalenium/releases/tag/3.12.0c release and let us know how it went. |
@diemol I am still facing this issue when i run my test in parallel using the either 3.12.0c or the latest zalenium image on MacOS High Sierra version 1-.13.6 And I have actually about 4 tests running in parallel and about 50 tests running in total. The error also happened randomly during the test, about 1 to 3 tests would randomly fail when I run the test. Looks like there is 500 error when I try to connect to hub, so the request can not be forwarded to the node container....I am not sure why this happened...let me know if you need more info for debugging this... |
I am seeing this with the standard Docker Selenium images (I've tried with 3.13.0 and 3.14.0). Can I pass the Apache Dwebdriver flag to the standard images, or is that just an option with Zalenium? |
@pschroeder89 yes, you can pass that option since it is a standard Selenium thing. Bare in mind that 3.14.0 is the last version where the Apache Http Client is present. |
Got it, thanks @diemol! |
Zalenium Image Version(s):
All versions that use Selenium 3.11
Docker Version:
17.09, others
Docker Command to start Zalenium:
docker, run, -tid, --name=obfuscated, --net=host, -p, 4444:4444, -v, /var/run/docker.sock:/var/run/docker.sock, -v, /usr/bin/docker:/usr/bin/docker, -v, /dev/shm:/dev/shm, -v, /tmp/videos/obfuscated:/home/seluser/videos, -v, /tmp/Downloads/obfuscated:/tmp/node/home/seluser/Downloads, --privileged, dosel/zalenium:latest, start, --chromeContainers, 7, --screenWidth, 1920, --seleniumImageName, elgalu/selenium:latest, --screenHeight, 1080, --firefoxContainers, 0, --keepOnlyFailedTests, true, --videoRecordingEnabled, true, --debugEnabled, false, --maxTestSessions, 150
Expected Behavior -
With 6 tests running in parallel across those 7 static containers, tests run normally. (same thing happens w/ 6 containers, I have 7 to help re: availability during cleanup)
Actual Behavior -
Out of ~240 tests, I will have 1-5 fail with the following footprint:
I have let the suite run continuously for days, and only 1 out of 15 has no failures with that footprint. All others have 1-5 (always different tests) test fail with that footprint, scattered across the suite.
Two observations:
1) This does not happen when I run the suite with the following versions:
Zalenium 3.10.0c Elgalu 3.10.0-p3 -- I've just run 5 iterations of the suite forcing those versions, and not one failure, though I'm used to seeing an occasional failure there too, just with a less descriptive footprint re: connection refused / unable to forward / etc.. But even those failures (which I haven't seen in 5 suite runs) happened with much less frequency. than the "No System TLS" error above, when using versions that support Selenium 3.11.
2) When the problem happens, it is almost always immediately after another browser is launched.
I have 6 tests running in parallel, each lasting ~30 seconds <-> 4 minutes. I have all webDriver creation, acquisition, and closing via Java ThreadLocal. But I suspect there is some threadsafe problem that happens directly during the time that the RemoteWebDriver is instantiated, and I suspect that the transition to okHttp is either a new problem, of exacerbated the existing intermittent problem.
For example, with my 6 tests running in parallel, I have ~ 10-20 log messages generated per second for the various 6 tests (I have the testNG thread ID in the log format, differentiating). The log message IMMEDIATELY before the problem is more often than not:
[11:22:46 INFO ProtocolHandshake ] [Forwarding newSession on session null to remote] Detected dialect: OSS
The message above is related to the browser being created for a different test than the test that will have the exception. The very next message is the stack trace from the exception:
org.openqa.selenium.WebDriverException: cannot forward the request No System TLS
Command duration or timeout: 0 milliseconds
and that stack trace shows me that it is from a different test -- not the test that was just opening the browser re: the "Detected dialect: OSS" message above.
Also, remember how I said that I use the java thread name in my logging format? ALL of my log messages have the thread name in them, like this:
[11:22:45 INFO obfuscacted ] [obfuscatedTestName-1] click: next
The "obfuscatedTestName-1" is the name testNG gives that thread, and there are 5 others, keeping the log messages straight.
The only other time I ever see a non testNG thread name anywhere in my entire log is that one message from selenium, where you can see that the threadname is different:
[11:22:46 INFO ProtocolHandshake ] [Forwarding newSession on session null to remote] Detected dialect: OSS
... in which selenium (I assume it's selenium) has chosen to rename the actual thread to: "Forwarding newSession on session null to remote" -- that's the name of the thread itself.
That message occurs for every browser created, but when I have the intermittent problem I've been describing in this report, that log message is often the log message right before an existing test (not the one creating a browser) hits that exception.
The text was updated successfully, but these errors were encountered: