
Change SQL connection reuse strategy #1604

Merged: jpwhite4 merged 2 commits into ubccr:xdmod10.0 from etlv2timeouts on Feb 7, 2022

Conversation

@jpwhite4 (Member) commented Feb 1, 2022

Description

We were seeing database timeouts when running the upgrade procedure. It was thought that #1602 fixed some of the problems, but it actually didn't fix anything, since the `DB::factory('datawarehouse');` function can return a stale database handle. Calling it multiple times does not produce a "fresh" handle; you just get the same one (which could be stale).
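
To make the failure mode concrete, here is a minimal, self-contained sketch of a caching factory (illustration only, using a hypothetical `CachedFactory` rather than the actual XDMoD DB class): once the cached handle has been idle longer than `wait_timeout`, every later factory call still returns that same dead handle.

```
<?php
// Illustration of the stale-handle problem (hypothetical CachedFactory, not
// the real XDMoD DB class): the PDO handle is created once and cached, with
// no liveness check before it is handed back out.
class CachedFactory
{
    private static ?PDO $handle = null;

    public static function factory(string $dsn, string $user, string $pass): PDO
    {
        if (self::$handle === null) {
            self::$handle = new PDO($dsn, $user, $pass);
        }
        return self::$handle; // may already have been closed by the server
    }
}

$a = CachedFactory::factory('mysql:host=localhost;dbname=modw', 'user', 'pass');
sleep(45); // longer than wait_timeout = 30, so the server drops the connection
$b = CachedFactory::factory('mysql:host=localhost;dbname=modw', 'user', 'pass');
var_dump($a === $b); // bool(true): the "new" handle is the same stale object,
                     // so the next query fails with "2006 MySQL server has gone away"
```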

This commit has three changes:

  • Fix the DatabasesMigration to run the table existence checks first and then run each migration step as a separate EtlV2 run.
  • Change the EtlV2 data endpoint so that the initial connect call is guaranteed to get a fresh database connection and not a stale existing one. (Looking at the code design, it may be that the original author expected the DB::factory() function to actually return a 'fresh' connection, since the connection details are cached in the class.)
  • Change Maintenance/ExecuteSql.php to use a fresh connection for each file; the same connection is still reused for the different statements within a file (see the sketch after this list).
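
As a rough sketch of that last point (simplified, with hypothetical helper names; the real ETL\Maintenance\ExecuteSql implementation differs), each SQL file opens its own fresh connection and every statement in that file reuses it:

```
<?php
// Simplified sketch of the per-file connection strategy (hypothetical helper,
// not the actual ETL\Maintenance\ExecuteSql code).
function executeSqlFiles(array $files, string $dsn, string $user, string $pass): void
{
    foreach ($files as $file) {
        // Fresh connection per file: a slow previous file can no longer leave
        // this action holding a handle the server has already timed out.
        $dbh = new PDO($dsn, $user, $pass, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

        // Naive statement split for illustration only; the same connection is
        // reused for every statement within the file.
        foreach (array_filter(array_map('trim', explode(';', file_get_contents($file)))) as $statement) {
            $dbh->exec($statement);
        }

        $dbh = null; // release the handle before moving on to the next file
    }
}
```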

Tests performed

These changes were tested as follows:

In the Docker container, set the MySQL timeout to 30s:

```
echo "wait_timeout = 30" >> /etc/my.cnf.d/server.cnf
```

Then apply the following patch to deliberately make a couple of the SQL files take a "long" time to run:

```
diff --git a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql
index cb7c496a..28efba12 100644
--- a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql
+++ b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql
@@ -2,3 +2,4 @@ UPDATE
        mod_shredder.staging_storage_usage
 SET
     dt = CONCAT(DATE(dt), 'T', TIME_FORMAT(dt, "%T"), 'Z');
+SELECT SLEEP(45);
diff --git a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql
index 6b3cd725..b3cd67b2 100644
--- a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql
+++ b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql
@@ -44,3 +44,4 @@ DROP INDEX image_resource_idx ON modw_cloud.image;
 DROP INDEX image_resource_idx ON modw_cloud.instance_data;

 UNLOCK TABLES;
+SELECT SLEEP(45);
```

When you run the original code you get the following error:

```
2022-02-01 21:00:26 [notice] Finished processing section 'xdmod.storage-table-definition-update-9-5-0_10-0-0'
SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
```

With this change in place, the error does not occur.

As a separate test, set the database timeout to 10 seconds (and do not apply the SLEEP(45) patch). The original code gives this error:

```
2022-02-01 21:56:45 [warning] Stopping ETL due to exception in xdmod.migration-9_5_0-10_0_0.alter-shredded_job_slurm (ETL\Maintenance\ExecuteSql)
xdmod.migration-9_5_0-10_0_0.alter-shredded_job_slurm (ETL\Maintenance\ExecuteSql): Error executing SQL Exception: 'SQLSTATE[HY000]: General error: 2006 MySQL server has gone away'
```

This error is also not seen with the change in place.

@jpwhite4 (Member, Author) commented Feb 1, 2022

@eiffel777 I've slightly changed the migration file here so that the table exists checks are all performed first. Can you please confirm that this is OK to do and that you didn't intentionally have a table exists check that needed to run after a migration had completed?

@eiffel777 (Contributor) commented:

> @eiffel777 I've slightly changed the migration file here so that the table exists checks are all performed first. Can you please confirm that this is OK to do and that you didn't intentionally have a table exists check that needed to run after a migration had completed?

@jpwhite4 That change is fine. The more important thing is that the order in which the pipelines run stays the same; specifically, cloud-migration-9-5-0_10-0-0 needs to run before cloud-migration-innodb-9-5-0_10-0-0. There appears to be no change to that order here, so all is good.

@jpwhite4 merged commit 4b9dc81 into ubccr:xdmod10.0 on Feb 7, 2022
@jpwhite4 deleted the etlv2timeouts branch on February 7, 2022 16:44
@jtpalmer changed the title from "Change sql connection reuse strategy." to "Change SQL connection reuse strategy" on Feb 25, 2022