This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

[Multi-ESX] DB connection issue while invoking "vmdkops_admin volume ls" observed only on one ESX #1144

Closed
shuklanirdesh82 opened this issue Apr 3, 2017 · 9 comments

Comments

@shuklanirdesh82
Contributor

Setup: VM1 on ESX1 and VM2 on ESX2 > both docker hosts are in the same tenant

Steps to reproduce

1. VM1: create volume vol4
2. VM1 & VM2: docker volume ls > make sure vol1 is listed out
3. VM1: attach container on vol4 and keep it running
4. VM2: attach the container on vol4 (observe some wait time here)
5. Expected: an error, as vol1 is already mounted and in use

While invoking vmdkops_admin.py volume ls during step #4, I ran into the following issue on ESX1, whereas ESX2 does not complain and shows the output correctly.

Note: I haven't found anything interesting in vmdk_ops.log, so I am not uploading it (logs will be uploaded on request).

[root@sc-rdops-vm08-dhcp-230-54:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py status
=== Service: 
Version: 0.12.9e2d6ad-0.0.1
Status: Running
Pid: 4257207
Port: 1019
LogConfigFile: /etc/vmware/vmdkops/log_config.json
LogFile: /var/log/vmware/vmdk_ops.log
LogLevel: INFO
=== Authorization Config DB: 
DB_SharedLocation: /vmfs/volumes/vsanDatastore/dockvols/vmdkops_config.db
DB_LocalPath: /etc/vmware/vmdkops/auth-db
DB_Mode: MultiNode (local symlink pointing to shared DB)

on ESX1:

[root@sc-rdops-vm08-dhcp-230-54:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py volume ls
Traceback (most recent call last):
  File "/usr/lib/vmware/vmdkops/bin/vmdkops_admin.py", line 1356, in <module>
    main()
  File "/usr/lib/vmware/vmdkops/bin/vmdkops_admin.py", line 58, in main
    args.func(args)
  File "/usr/lib/vmware/vmdkops/bin/vmdkops_admin.py", line 593, in ls
    rows = generate_ls_rows(tenant_reg)
  File "/usr/lib/vmware/vmdkops/bin/vmdkops_admin.py", line 622, in generate_ls_rows
    for v in vmdk_utils.get_volumes(tenant_reg):
  File "/usr/lib/vmware/vmdkops/Python/vmdk_utils.py", line 147, in get_volumes
    error_info, tenant_name = auth_api.get_tenant_name(sub_dir_name)
  File "/usr/lib/vmware/vmdkops/Python/auth_api.py", line 68, in get_tenant_name
    error_info, auth_mgr = get_auth_mgr_object()
  File "/usr/lib/vmware/vmdkops/Python/auth_api.py", line 38, in get_auth_mgr_object
    err_msg, auth_mgr = auth.get_auth_mgr()
  File "/usr/lib/vmware/vmdkops/Python/auth.py", line 46, in get_auth_mgr
    thread_local._auth_mgr.connect()
  File "/usr/lib/vmware/vmdkops/Python/auth_data.py", line 572, in connect
    self.__mode = self.__discover_mode_and_connect()
  File "/usr/lib/vmware/vmdkops/Python/auth_data.py", line 610, in __discover_mode_and_connect
    self.__connect()
  File "/usr/lib/vmware/vmdkops/Python/auth_data.py", line 559, in __connect
    self.conn = sqlite3.connect(self.db_path)
sqlite3.OperationalError: unable to open database file

on ESX2

[root@sc-rdops-vm08-dhcp-239-65:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py volume ls
Volume    Datastore    VM-Group  Capacity  Used  Filesystem  Policy  Disk Format  Attached-to  Access      Attach-as               Created By    Created Date              
--------  -----------  --------  --------  ----  ----------  ------  -----------  -----------  ----------  ----------------------  ------------  ------------------------  
vol1      shared_vmfs  _DEFAULT  100MB     15MB  ext4        N/A     thin         VM1          read-only   independent_persistent  VM1           Wed Mar 29 23:25:37 2017  
vol2      shared_vmfs  _DEFAULT  100MB     13MB  ext4        N/A     thin         detached     read-write  independent_persistent  VM1           Wed Mar 29 23:26:23 2017  
vol3      shared_vmfs  _DEFAULT  100MB     13MB  ext4        N/A     thin         detached     read-write  independent_persistent  VM1           Wed Mar 29 23:27:18 2017  
vol4      shared_vmfs  _DEFAULT  100MB     15MB  ext4        N/A     thin         detached     read-write  independent_persistent  VM1           Mon Apr  3 21:48:33 2017  
vol5      shared_vmfs  _DEFAULT  100MB     13MB  ext4        N/A     thin         detached     read-write  independent_persistent  photon-VM0.1  Mon Apr  3 22:13:37 2017  

/CC @msterin @lipingxue

@msterin
Contributor

msterin commented Apr 4, 2017

I suspect we hit the moment when VMFS on ESX2 was locking the sqlite file, thus preventing VMFS on ESX1 from opening it. Great catch!

I am concerned that this could actually be a problem for vmdk_ops.py, so the docker ops may fail.
We'd need to either open as VMFS nolock, or do wait-retry to work around this.
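
The wait-retry option could look roughly like the sketch below. It is only an illustration, not the vmdkops code: the function name connect_with_retry, the attempts and delay defaults, and the injectable connect argument are all assumptions made for the example.

```python
import sqlite3
import time


def connect_with_retry(db_path, attempts=5, delay=0.5, connect=sqlite3.connect):
    """Open db_path, waiting and retrying if the open fails.

    attempts (expected to be >= 1), delay and the injectable connect
    callable are illustrative assumptions, not vmdkops parameters.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect(db_path)
        except sqlite3.OperationalError as ex:
            last_error = ex
            if attempt < attempts:
                # Wait a little longer before each new attempt.
                time.sleep(delay * attempt)
    # Every attempt failed: surface the last sqlite error to the caller.
    raise last_error
```

A retry only helps if the other host releases the file within the retry window, so the total wait (attempts times delay) would have to be tuned against how long a docker operation keeps the DB open.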

@govint
Contributor

govint commented Apr 4, 2017

  1. VM1: attach container on vol4 and keep it running
  2. VM2: attach the container on vol4 (observe some wait time here)

@shuklanirdesh82, is it possible to share the VMDK on two different docker hosts?

@msterin
Contributor

msterin commented Apr 4, 2017

I think these parts were incidental. The issue is with opening the file, and it is opened only during a docker command, not while the container is already running.

@shuklanirdesh82
Contributor Author

@govint's comment #1144 (comment)
is it possible to share the VMDK on two different docker hosts?

Yeah, it is not possible in the absence of a multi-writer scenario; step #5 does mention the expected behavior, i.e. an exception/error should be shown.

The error is observed on only one of the ESX hosts, not on the other.

@shuklanirdesh82
Contributor Author

@msterin's comment #1144 (comment)
I think these parts were incidental.

Yeah, it seems so to me too.

@tusharnt tusharnt added P0 and removed P1 labels Apr 4, 2017
@ashahi1
Contributor

ashahi1 commented Apr 7, 2017

Steps:

1. Download the sqlite command-line tools for Linux (https://sqlite.org/2017/sqlite-tools-linux-x86-3180000.zip) and copy them to ESX.
2. Run the command-line shell. At the prompt, type ".open <filename>" (where filename is the shared DB file) and leave it at that.
3. On a **different** ESX, run admin 'ls'.


Steps and their output are as follows:

From ESX-1:

[root@sc2-rdops-vm05-dhcp-160-255:~] /tmp/sqlite-tools-linux-x86-3180000/sqlite3
SQLite version 3.18.0 2017-03-28 18:48:43
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open /vmfs/volumes/sharedVmfs-0/dockvols/vmdkops_config.db
sqlite>

From ESX-2:

[root@sc2-rdops-vm05-dhcp-188-189:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py vmgroup ls
Traceback (most recent call last):
  File "/usr/lib/vmware/vmdkops/bin/vmdkops_admin.py", line 1356, in <module>
    main()
  File "/usr/lib/vmware/vmdkops/bin/vmdkops_admin.py", line 58, in main
    args.func(args)
  File "/usr/lib/vmware/vmdkops/bin/vmdkops_admin.py", line 972, in tenant_ls
    error_info, tenant_list = auth_api._tenant_ls()
  File "/usr/lib/vmware/vmdkops/Python/auth_api.py", line 441, in _tenant_ls
    error_info, tenant_list = get_tenant_list_from_db(name)
  File "/usr/lib/vmware/vmdkops/Python/auth_api.py", line 132, in get_tenant_list_from_db
    error_info, auth_mgr = get_auth_mgr_object()
  File "/usr/lib/vmware/vmdkops/Python/auth_api.py", line 40, in get_auth_mgr_object
    err_msg, auth_mgr = auth.get_auth_mgr()
  File "/usr/lib/vmware/vmdkops/Python/auth.py", line 46, in get_auth_mgr
    thread_local._auth_mgr.connect()
  File "/usr/lib/vmware/vmdkops/Python/auth_data.py", line 572, in connect
    self.__mode = self.__discover_mode_and_connect()
  File "/usr/lib/vmware/vmdkops/Python/auth_data.py", line 610, in __discover_mode_and_connect
    self.__connect()
  File "/usr/lib/vmware/vmdkops/Python/auth_data.py", line 559, in __connect
    self.conn = sqlite3.connect(self.db_path)
sqlite3.OperationalError: unable to open database file

@tusharnt tusharnt assigned shaominchen and unassigned msterin May 2, 2017
@shaominchen
Contributor

Per offline discussion, currently we will not really address the DB file concurrent access issue. For now we will simply return a user-friendly error message instead of the sqlite error.
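
As a rough sketch of that approach (not the actual change in the PR): the connect call can be wrapped so that callers get an error string rather than a traceback. The function name open_config_db and the message wording are hypothetical; the (error, result) return shape only mirrors the `err_msg, auth_mgr = ...` pattern visible in the tracebacks above.

```python
import sqlite3


def open_config_db(db_path):
    """Return (error_message, connection); exactly one of the two is None."""
    try:
        return None, sqlite3.connect(db_path)
    except sqlite3.Error as ex:
        # Hypothetical wording; the real message is the one pasted in the PR.
        msg = "Failed to connect to the config DB at {0} ({1})".format(db_path, ex)
        return msg, None
```

The caller can then print the message and exit with a non-zero status instead of letting sqlite3.OperationalError propagate to the top level.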

@msterin
Contributor

msterin commented May 2, 2017

I also suggest adding a log error that explains the potential reason and suggests a retry.
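
Something along these lines, as a minimal sketch only: the logger name, the function name connect_and_log and the message text are made up for illustration and are not taken from vmdkops.

```python
import logging
import sqlite3


def connect_and_log(db_path, logger=None):
    """Open db_path; on failure log a hint and return None."""
    # "vmdkops_sketch" is a placeholder logger name for this example.
    logger = logger or logging.getLogger("vmdkops_sketch")
    try:
        return sqlite3.connect(db_path)
    except sqlite3.OperationalError as ex:
        # Hypothetical message: list likely causes and suggest a retry.
        logger.error(
            "Cannot open %s: %s. Possible reasons: the file is missing or "
            "unreadable, or another host currently holds it open. "
            "Retrying the command may help.",
            db_path, ex)
        return None
```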

@shaominchen
Contributor

@msterin I was thinking of adding some more messages, such as a suggestion to retry. However, the current concurrency issue is actually a rare case compared to other common DB failures (such as DB corruption, a missing file, etc.), so adding this message might cause confusion in other error scenarios. I think we may just keep it as is for now (please see the error message I have pasted in the PR).

In the future I think we should improve our DB access module to support more advanced features, such as concurrent access, by introducing a connection pool, appropriate read-write locks, etc.
