Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Server does not accept initial connection on remote start. #1447

Closed
totaam opened this issue Feb 21, 2017 · 47 comments
Closed

Server does not accept initial connection on remote start. #1447

totaam opened this issue Feb 21, 2017 · 47 comments

Comments

@totaam
Copy link
Collaborator

totaam commented Feb 21, 2017

Issue migrated from trac ticket # 1447

component: server | priority: major | resolution: fixed

2017-02-21 18:45:16: psycho_zs created the issue


Starting xpra with xpra start ssh:....

Server on remote machine is started, but this is being spammed continiously into the log:

2017-02-21 21:37:05,892 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:05,991 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:05,992 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:05,993 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,092 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,093 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,094 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,193 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,194 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,194 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,294 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,295 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,296 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,396 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,396 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,397 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,496 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,497 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,498 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,597 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,598 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,599 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,699 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,699 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,700 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,800 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,800 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,802 New unix-domain connection received on /var/run/xpra/hostname-100

Client soon exits, log spamming continues until I connect with xpra attach.
Then server's log is truncated and replaced with this message:

(EE) 
Fatal server error:
(EE) Server is already active for display 100
	If this server is no longer running, remove /tmp/.X100-lock
	and start again.
(EE) 
(EE) 
Please consult the The X.Org Foundation support 
	 at http://wiki.x.org
 for help. 
(EE) 
2017-02-21 21:37:21,079 
2017-02-21 21:37:21,079 Xvfb command has terminated! xpra cannot continue
2017-02-21 21:37:21,079  if the display is already running, try a different one,
2017-02-21 21:37:21,079  or use the --use-display flag
2017-02-21 21:37:21,079 
2017-02-21 21:37:21,080 killing xvfb with pid 14651
2017-02-21 21:37:21,080 failed to kill xvfb process with pid 14651:
2017-02-21 21:37:21,080  [Errno 3] No such process

@totaam
Copy link
Collaborator Author

totaam commented Feb 22, 2017

2017-02-22 04:06:17: antoine changed owner from antoine to psycho_zs

@totaam
Copy link
Collaborator Author

totaam commented Feb 22, 2017

2017-02-22 04:06:17: antoine commented


I did find a bug, fixed in r15133, but this would have caused the client side to also emit a large number of warning messages to stdout - did you not get those?

Please close if that fixes things.

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 09:39:21: psycho_zs changed status from new to closed

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 09:39:21: psycho_zs set resolution to fixed

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 09:39:21: psycho_zs commented


r15180 works well on remote start, no log spam on both ends.

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 18:22:14: psycho_zs uploaded file server.log (28.3 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 18:22:23: psycho_zs uploaded file client.log (1.2 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 18:23:29: psycho_zs changed status from closed to reopened

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 18:23:29: psycho_zs removed resolution (was fixed)

@totaam
Copy link
Collaborator Author

totaam commented Feb 28, 2017

2017-02-28 18:23:29: psycho_zs commented


If server has leftovers from previous run, client can not initiate remote start and same log spam happens on server.
Attached server log and client output.

@totaam
Copy link
Collaborator Author

totaam commented Mar 1, 2017

2017-03-01 10:22:59: antoine commented


I'm not sure what you mean by "has leftovers from previous run", or how I would reproduce this.

Assuming that you hit the case where a server was already running on the display that you tried to start remotely, one could argue that the code is working as intended by not matching this server, but r15190 will now allow us to match it and connect to it.
(could be backported I guess - not sure that's necessarily right)

@totaam
Copy link
Collaborator Author

totaam commented Mar 1, 2017

2017-03-01 15:54:52: psycho_zs commented


Sorry, I'm speaking about sockets:

/var/run/user/1000/xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:05,310 created unix domain socket: /var/run/user/1000/xpra/server-100
/home/user/.xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:09,315 created unix domain socket: /home/user/.xpra/server-100
/var/run/xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:13,321 created unix domain socket: /var/run/xpra/server-100

Maybe client doesn't wait for this process to finish?

@totaam
Copy link
Collaborator Author

totaam commented Mar 1, 2017

2017-03-01 15:59:26: antoine changed status from reopened to new

@totaam
Copy link
Collaborator Author

totaam commented Mar 1, 2017

2017-03-01 15:59:26: antoine changed owner from psycho_zs to antoine

@totaam
Copy link
Collaborator Author

totaam commented Mar 1, 2017

2017-03-01 15:59:26: antoine commented


Those sockets are usually left behind when a server crashes.
Those messages were not in the log samples your had provided.

Fixing this will be harder, we must wait in case the server is still alive, and the client also needs to give up after waiting too long...
What we should do in this case is wait for all the sockets in parallel to save time.
This will be a bigger change, too late for this version.

@totaam
Copy link
Collaborator Author

totaam commented Mar 1, 2017

2017-03-01 16:06:39: psycho_zs commented


I copied that snippet directly from the attached server.log )

Viewing server and client log live it seemed that client gives up just at the same time server finises dealing with sockets. So a dirty fix would be to give the client a couple more seconds of timeout.

I will try catching the crashes.

@totaam
Copy link
Collaborator Author

totaam commented Mar 1, 2017

2017-03-01 16:07:48: antoine commented


I copied that snippet directly from the attached server.log )

Sorry, yes - I missed it!

@totaam
Copy link
Collaborator Author

totaam commented Mar 19, 2017

2017-03-19 05:21:19: antoine commented


r15278 raised the socket timeout to 20s (15300 for v1.0.x branch).
This will help until we can do parallel socket probing.

@totaam
Copy link
Collaborator Author

totaam commented Mar 24, 2017

2017-03-24 13:19:28: antoine changed owner from antoine to psycho_zs

@totaam
Copy link
Collaborator Author

totaam commented Mar 24, 2017

2017-03-24 13:19:28: antoine commented


It's not really done in parallel as that would be too hard to implement, but r15381 should make things more reliable.
I've struggled to test this properly because if I "kill -9" we correctly detect that the socket is dead, and if I just hang the server then the socket still looks alive..

@psycho_zs: does that work for you? how can I test hung sockets?

@totaam
Copy link
Collaborator Author

totaam commented Mar 24, 2017

2017-03-24 14:38:43: psycho_zs commented


I was left with old sockets some time ago when xpra server crashed sometimes. Now it does not, so I do not know how to test this situation now.
I will test remote start in the next couple of days.

@totaam
Copy link
Collaborator Author

totaam commented Mar 25, 2017

2017-03-25 19:11:21: psycho_zs commented


In 2.0: killing all xpra-related processes with kill -9 leaves socket $HOME/.xpra/$SERVERNAME-$PORT. Trying to remote-start after that produces described behavior despite increased timeouts. It also seems that "New unix-domain connection received on..." message spam is continued to removed log with old file descriptor (seen if tailing log throughout the whole connection attempt).

I have troubles building current trunk, it does not seem to produce some files listed in xpra.install, like etc/xpra/xorg.conf, usr/lib/tmpfiles.d/xpra.conf. Would you add a fresh beta package?

@totaam
Copy link
Collaborator Author

totaam commented Mar 26, 2017

2017-03-26 12:01:40: antoine commented


Older branches will probably not be patched - as I can't think of a way to make things better without changing too much code.

@totaam
Copy link
Collaborator Author

totaam commented Mar 26, 2017

2017-03-26 16:56:37: psycho_zs commented


r15425 seems to be working even with leftover sockets. Although it failed once on the first test, right after upgrade. (I killed xpra processes before upgrading).

@totaam
Copy link
Collaborator Author

totaam commented Mar 27, 2017

2017-03-27 06:48:51: antoine changed status from new to closed

@totaam
Copy link
Collaborator Author

totaam commented Mar 27, 2017

2017-03-27 06:48:51: antoine set resolution to fixed

@totaam
Copy link
Collaborator Author

totaam commented Mar 27, 2017

2017-03-27 06:48:51: antoine commented


Closing, feel free to re-open if you can give steps to reproduce problems with trunk.

@totaam
Copy link
Collaborator Author

totaam commented Mar 28, 2017

2017-03-28 16:12:38: psycho_zs changed status from closed to reopened

@totaam
Copy link
Collaborator Author

totaam commented Mar 28, 2017

2017-03-28 16:12:38: psycho_zs removed resolution (was fixed)

@totaam
Copy link
Collaborator Author

totaam commented Mar 28, 2017

2017-03-28 16:12:38: psycho_zs commented


I reproduced the situation in which client again does not wait long enough to connect. Xpra server did not survive dbus restart after recent system upgrade, leaving behind xpra _proxy_start :100 process and and X. So I killed everything with killall -u username and did remote start via xpra start ssh:.... Server waited for 3 sockets, client disconnected just before server finished waiting on them:

Warning: some of the sockets are in an unknown state:
 /var/run/user/1000/xpra/servername-100
 /home/username/.xpra/servername-100
 /var/run/xpra/servername-100

Full server log in attachment...

@totaam
Copy link
Collaborator Author

totaam commented Mar 28, 2017

2017-03-28 16:16:17: psycho_zs uploaded file server_3sockets.log (10.9 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Mar 29, 2017

2017-03-29 06:30:34: antoine commented


Sorry but I cannot reproduce this.
Please try to provide specific steps.

@totaam
Copy link
Collaborator Author

totaam commented Apr 3, 2017

2017-04-03 08:13:02: psycho_zs commented


With new initialization mechanism are there any potential race conditions left?

I was able to reproduce it the second time:
On server with running xpra restart dbus with related daemons

systemctl restart dbus ; systemctl daemon-reexec ; systemctl restart systemd-logind systemd-journald NetworkManager

kill all user processes:

killall -u $USER

Try to do xpra start ssh:...

Client gives up and disconnects somewhere around the time server prints "created unix domain socket" messages. Attach works a moment later.

@totaam
Copy link
Collaborator Author

totaam commented Apr 3, 2017

2017-04-03 16:52:16: antoine commented


Timeouts are inherently racy.
It should be enough time for the server to start, even when waiting and then clearing the sockets, but there's no way we can be sure of that. And if the server is quite slow, it won't be enough. (add to that the ssh login delay)
Raising the client timeout indefinitely is not a good solution either. It is already at 15 seconds, which is quite a long time.
Are you sure your 2.1 server takes that long to start? (are you even testing with 2.1?)

No need to run "xpra shadow ssh:..." which just complicates things.
Just kill it and watch it start with "xpra shadow :0 --no-daemon" - see how long that takes.
The output should include timestamps, please post that information. (or the server log file if you wish)
Without hard data, it is difficult to know where to go.

Example on my relatively beefy system:

2017-04-03 22:53:23,078 created unix domain socket: /run/user/1000/xpra/desktop-10
2017-04-03 22:53:23,078 created unix domain socket: /home/antoine/.xpra/desktop-10
2017-04-03 22:53:23,079 created unix domain socket: /var/run/xpra/desktop-10
(..)
2017-04-03 22:53:24,038 xpra is ready.

So that's 1 second from the socket creation until fully ready.
(and the socket timeout + clearing should be around 8 seconds maximum)

@totaam
Copy link
Collaborator Author

totaam commented Apr 3, 2017

2017-04-03 18:16:58: psycho_zs commented


See server_3sockets.log​, it is the server log from the similar situation as described in my previous comment. 2.1-20170326r15425-1

@totaam
Copy link
Collaborator Author

totaam commented Apr 3, 2017

2017-04-03 18:28:14: antoine commented


See server_3sockets.log​, it is the server log from the similar situation as described in my previous comment.
That log is against a 2.1-r15340 server, therefore missing any of the fixes from comment:10 onwards. We want the timing information of the current code.

@totaam
Copy link
Collaborator Author

totaam commented Apr 3, 2017

2017-04-03 18:33:32: psycho_zs commented


I've just rechecked, the package version is 2.1-20170326r15425-1, but the log reports 2.1-r15340. OK, I'll try a fresher one.

@totaam
Copy link
Collaborator Author

totaam commented Apr 3, 2017

2017-04-03 18:47:24: psycho_zs commented


15494 package matches. Attaching log...
Remote start after dbus restart and killed user processes worked. (But there were no substantial system upgrade preceding it).

@totaam
Copy link
Collaborator Author

totaam commented Apr 3, 2017

2017-04-03 18:47:42: psycho_zs uploaded file server-r15494.log (4.1 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Apr 4, 2017

2017-04-04 07:24:32: antoine commented


OK, so in this log we can see:

2017-04-03 20:44:01,438 Warning: some of the sockets are in an unknown state:
19	2017-04-03 20:44:01,438  /var/run/user/1000/xpra/servername-100
20	2017-04-03 20:44:01,438  /home/username/.xpra/servername-100
21	2017-04-03 20:44:01,439  /var/run/xpra/servername-100
22	2017-04-03 20:44:01,439  please wait as we allow the socket probing to timeout
23	2017-04-03 20:44:09,464 created unix domain socket: /var/run/user/1000/xpra/servername-100
24	2017-04-03 20:44:09,465 created unix domain socket: /home/username/.xpra/servername-100
25	2017-04-03 20:44:09,465 created unix domain socket: /var/run/xpra/servername-100
(..)
41	2017-04-03 20:44:12,034 xpra is ready.

So that's:

  • 8 seconds for timing out the old sockets
  • 3 seconds for creating the new ones and starting up

For a total of 11 seconds, which is well under the client timeout of 15 seconds..
r15508 reduces from 8 to just 6 seconds, which should still be enough to distinguish between a server that is starting up (3 seconds in your case) and one that is just dead.

@totaam
Copy link
Collaborator Author

totaam commented May 11, 2017

2017-05-11 05:59:41: antoine commented


@psycho_zs: can I close this ticket or can you reproduce with r15508 or later?

@totaam
Copy link
Collaborator Author

totaam commented May 11, 2017

2017-05-11 09:11:55: psycho_zs commented


Thanks. I'll try to check in the next couple of days. Busy month.

@totaam
Copy link
Collaborator Author

totaam commented May 14, 2017

2017-05-14 08:23:33: psycho_zs commented


Xpra from latest svn can not start server

xpra start :100 -d all
using systemd-run to wrap 'start' server command
'systemd-run' '--description' 'xpra-start' '--scope' '--user' '/usr/bin/xpra' 'start' ':100' '-d' 'all' '--systemd-run=no'
Failed to create bus connection: No such file or directory

Latest deb build 2.1-20170409r15542-1 seem to be working, at least it connected successfully with leftover sockets on server a couple of times.
First timestamp visible in client output: 10:19:10
First timestamp visible in server log: 10:19:13
"Xpra is ready" message on server: 10:19:21
Handshake complete: 10:19:22

@totaam
Copy link
Collaborator Author

totaam commented May 14, 2017

2017-05-14 09:17:35: antoine changed status from reopened to closed

@totaam
Copy link
Collaborator Author

totaam commented May 14, 2017

2017-05-14 09:17:35: antoine set resolution to fixed

@totaam
Copy link
Collaborator Author

totaam commented May 14, 2017

2017-05-14 09:17:35: antoine commented


The timeout issue is fixed so I am closing this bug.

If you are having systemd startup problems, please file a separate ticket.
The exact systemd-run command we use is printed out, if that fails then it is quite unlikely to be an xpra bug.

@totaam totaam closed this as completed May 14, 2017
@totaam
Copy link
Collaborator Author

totaam commented Jun 20, 2017

2017-06-20 17:18:41: antoine commented


Saw something similar in #1537#comment:3.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant