Unhandled stacks of python tracebacks when the agent is unable to resolve testflinger.c.c via DNS #109

Open
bladernr opened this issue Oct 2, 2023 · 0 comments

bladernr (Collaborator) commented Oct 2, 2023

I ran a bunch of tests and ONE of the runs resulted in a ton of cascaded tracebacks. These aren't immediately helpful, and some of them could be caught with some exception handling.

For some reason, the agent on this run was unable to resolve testflinger.canonical.com, and that led to all the tracebacks... Just for debugging purposes, perhaps these could be caught and handled with some friendlier messages.

This is the only run out of 30 that had this issue, all using the same agent, so I don't know what the actual problem was. This bug, as noted above, is just about hopefully making those traces a bit more friendly.
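To illustrate the kind of handling meant here, a minimal sketch follows. It is not the actual tfclient.py code: `TFClientError`, this `post`, and this `cancel_jobs` are hypothetical stand-ins, and the only library calls assumed are `requests.post`, `Response.raise_for_status`, and the exception classes in `requests.exceptions`. The idea is to turn a DNS or connection failure into a single log line and let the caller carry on with the remaining jobs.

```python
import logging

import requests

logger = logging.getLogger(__name__)


class TFClientError(Exception):
    """Hypothetical error type for any failed exchange with the server."""


def post(uri, data, timeout=15):
    """POST JSON to the server; report network failures as one log line."""
    try:
        resp = requests.post(uri, json=data, timeout=timeout)
        resp.raise_for_status()
    except requests.exceptions.Timeout:
        # Checked first: ConnectTimeout is both a Timeout and a ConnectionError.
        logger.error("Timeout while trying to communicate with the server.")
        raise TFClientError(f"timeout: {uri}") from None
    except requests.exceptions.ConnectionError:
        # Includes name resolution failures ("Name or service not known").
        logger.error("Unable to reach the server (connection or DNS failure).")
        logger.debug("Connection failure details", exc_info=True)
        raise TFClientError(f"connection failed: {uri}") from None
    except requests.exceptions.HTTPError as exc:
        logger.error(
            "Received status code %s from server.", exc.response.status_code
        )
        raise TFClientError(f"bad status: {uri}") from None
    return resp


def cancel_jobs(job_ids, base_url):
    """Try to cancel every job; return the ids that could not be cancelled."""
    failed = []
    for job_id in job_ids:
        try:
            post(f"{base_url}/v1/job/{job_id}/action", {"action": "cancel"})
        except TFClientError:
            logger.error("Unable to cancel job: %s", job_id)
            failed.append(job_id)
    return failed
```

Using `from None` keeps the urllib3 and requests chain out of any traceback that does escape, while the `debug` call with `exc_info=True` still records the full details for anyone who turns debug logging on.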

bladernr@weavile:~$ testflinger submit --poll 6md.yaml
Job submitted successfully!                                                                                                                 
job_id: b933b67f-a71c-4917-bc5c-ee846659be62                          
This job is waiting on a node to become available.                                                                                          
Jobs ahead in queue: 14                                               
Jobs ahead in queue: 13                                                                                                                     
Jobs ahead in queue: 12                                                                                                                     
Jobs ahead in queue: 11
Jobs ahead in queue: 10
Jobs ahead in queue: 9
Jobs ahead in queue: 8                                                
Jobs ahead in queue: 7                                                                                                                      
ERROR: 2023-09-29 19:24:01 client.py:61 -- Timeout while trying to communicate with the server.                                             
ERROR: 2023-09-29 19:25:16 client.py:61 -- Timeout while trying to communicate with the server.                                             
Jobs ahead in queue: 6                                                                                                                      
Jobs ahead in queue: 5                                                
Jobs ahead in queue: 4                                                                                                                      
Jobs ahead in queue: 3                                                                                                                      
Jobs ahead in queue: 2                                                                                                                      
ERROR: 2023-09-29 22:10:53 client.py:61 -- Timeout while trying to communicate with the server.                                             
Jobs ahead in queue: 1                                                                                                                      
Jobs ahead in queue: 0                                                                                                                      
ERROR: 2023-09-29 22:46:28 client.py:61 -- Timeout while trying to communicate with the server.                                             
***********************************************                                                                                             
                                                                                                                                            
* Starting testflinger setup phase on multi-3 *                                                                                             
                                                                                                                                            
***********************************************                                                                                             
                                                                                                                                            
Setup

***************************************************
* Starting testflinger provision phase on multi-3 *                                                                                         
                                                                                                                                            
***************************************************                                                                                         
2023-09-30 02:52:12,569 multi-3 INFO: DEVICE AGENT: BEGIN provision                                                                         
2023-09-30 02:52:12,569 multi-3 INFO: DEVICE AGENT: Provisioning device                                                                     
2023-09-30 02:52:12,569 multi-3 INFO: DEVICE AGENT: Creating test jobs                                                                      
2023-09-30 02:52:16,845 multi-3 INFO: DEVICE AGENT: Created job d0f6945c-6903-4522-b26d-a872bcdd72b5                                        
2023-09-30 02:52:21,316 multi-3 INFO: DEVICE AGENT: Created job 9114248e-1d0d-407a-84b0-7580349ba535                                        
2023-09-30 02:52:26,187 multi-3 INFO: DEVICE AGENT: Created job 6a5f97b5-4952-47c1-8b90-21c77aab9fa0                                        
2023-09-30 02:52:30,651 multi-3 INFO: DEVICE AGENT: Created job 32880554-e1c8-4a8f-87a8-e882c8602b8d                                        
2023-09-30 02:52:35,490 multi-3 INFO: DEVICE AGENT: Created job 39cd04f3-2938-42c5-9db1-a026af67d083                                        
2023-09-30 02:52:42,702 multi-3 INFO: DEVICE AGENT: Created job 7789a4a3-c9e3-4fe2-a685-f33b13943514                                        
2023-09-30 02:54:16,828 multi-3 ERROR: DEVICE AGENT: Unable to communicate with specified server.                                           
2023-09-30 02:54:16,828 multi-3 ERROR: DEVICE AGENT: Unable to get status for job 7789a4a3-c9e3-4fe2-a685-f33b13943514                      
2023-09-30 02:54:53,280 multi-3 ERROR: DEVICE AGENT: Job 39cd04f3-2938-42c5-9db1-a026af67d083 failed to allocate, cancelling remaining jobs
2023-09-30 02:55:19,313 multi-3 ERROR: DEVICE AGENT: Unable to communicate with specified server.                                           
2023-09-30 02:55:19,313 multi-3 ERROR: DEVICE AGENT: Unable to cancel job 32880554-e1c8-4a8f-87a8-e882c8602b8d                              
2023-09-30 02:55:19,313 multi-3 ERROR: DEVICE AGENT: Unable to cancel job: 32880554-e1c8-4a8f-87a8-e882c8602b8d                             
Traceback (most recent call last):                                    
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 203, in _new_conn                                               
    sock = connection.create_connection(                              
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 60, in create_connection                                   
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):                                                                  
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo                                                                             
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):                                                                 
socket.gaierror: [Errno -2] Name or service not known                 
                                                                                                                                            
The above exception was the direct cause of the following exception:                                                                        
                                                                      
Traceback (most recent call last):                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 790, in urlopen                                             
    response = self._make_request(                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 491, in _make_request                                       
    raise new_e                                                                                                                             
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 467, in _make_request                                       
    self._validate_conn(conn)                                                                                                               
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 1092, in _validate_conn                                     
    conn.connect()                                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 611, in connect                                                 
    self.sock = sock = self._new_conn()                                                                                                     
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 210, in _new_conn                                               
    raise NameResolutionError(self.host, self, e) from e                                                                                    
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f8913fcc1c0>: Failed to resolve 'testflinger.canonical.com' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:                                                                        

Traceback (most recent call last):                                    
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 486, in send                                                     
    resp = conn.urlopen(                                              
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 844, in urlopen                                             
    retries = retries.increment(                                      
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 515, in increment                                               
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]                                                           
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='testflinger.canonical.com', port=443): Max retries exceeded with url: /v1/job/32880554-e1c8-4a8f-87a8-e882c8602b8d/action (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8913fcc1c0>: Failed to resolve 'testflinger.canonical.com' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:                                                                         

Traceback (most recent call last):                                    
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/multi.py", line 182, in cancel_jobs
    self.client.cancel_job(job)                                       
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/tfclient.py", line 149, in cancel_job
    self.post(f"/v1/job/{job_id}/action", {"action": "cancel"})                                                                             
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/tfclient.py", line 79, in post
    req = requests.post(uri, json=data, timeout=timeout)                                                                                    
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 115, in post                                                          
    return request("post", url, data=data, json=json, **kwargs)                                                                             
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 59, in request                                                        
    return session.request(method=method, url=url, **kwargs)                                                                                
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 589, in request                                                  
    resp = self.send(prep, **send_kwargs)                             
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 703, in send                                                     
    r = adapter.send(request, **kwargs)                               
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 519, in send                                                     
    raise ConnectionError(e, request=request)                         
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='testflinger.canonical.com', port=443): Max retries exceeded with url: /v1/job/32880554-e1c8-4a8f-87a8-e882c8602b8d/action (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8913fcc1c0>: Failed to resolve 'testflinger.canonical.com' ([Errno -2] Name or service not known)"))

2023-09-30 02:55:27,110 multi-3 ERROR: DEVICE AGENT: Received status code 400 from server.                                                  
Traceback (most recent call last):                                    
  File "/usr/local/bin/snappy-device-agent", line 8, in <module>                                                                            
    sys.exit(main())                                                  
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/cmd.py", line 59, in main                                               
    raise SystemExit(args.func(args))                                 
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/__init__.py", line 55, in provision
    self.device.provision()                                           
  File "/usr/local/lib/python3.8/dist-packages/snappy_device_agents/devices/multi/multi.py", line 72, in provision
    raise ProvisioningError("Unable to allocate all devices")                                                                               
snappy_device_agents.devices.ProvisioningError: Unable to allocate all devices                                                              

*************************************************                     

* Starting testflinger cleanup phase on multi-3 *                     

*************************************************                     

2023-09-30 02:55:32,868 multi-3 ERROR: DEVICE AGENT: Unable to find multi-job data file, job_list.json not found
complete        
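
The last traceback in the log (the `ProvisioningError` that reaches `main()`) could probably be reduced to one line in a similar way. A rough sketch, assuming a small wrapper around the phase function: `run_phase` and the local `ProvisioningError` class are hypothetical stand-ins for illustration, not the real code in cmd.py or snappy_device_agents.devices.

```python
import logging

logger = logging.getLogger("device-agent")


class ProvisioningError(Exception):
    """Stand-in for snappy_device_agents.devices.ProvisioningError."""


def run_phase(func, *args):
    """Run one phase function and map expected failures to an exit code."""
    try:
        # Phase functions are assumed to return None or an int exit code.
        return func(*args) or 0
    except ProvisioningError as exc:
        # An expected failure: one clear line instead of a traceback.
        logger.error("Provisioning failed: %s", exc)
        return 1


def provision():
    """Example phase that fails the same way as the run in this report."""
    raise ProvisioningError("Unable to allocate all devices")
```

An entry point could then call something like `sys.exit(run_phase(provision))`, so the job still ends with a non-zero exit code but the log shows only the error message. Unexpected exception types are deliberately not caught here, so genuine bugs would still produce a full traceback.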