
Change SQL connection reuse strategy #1604

Merged: jpwhite4 merged 2 commits into ubccr:xdmod10.0 from etlv2timeouts on Feb 7, 2022

Conversation

@jpwhite4 (Member) commented Feb 1, 2022

Description

We were seeing database timeouts when running the upgrade procedure. It was thought that #1602 fixed some of the problems, but it actually didn't fix anything, since the `DB::factory('datawarehouse');` function can return a stale database handle. Calling it multiple times does not produce a "fresh" handle; you just get the same one (which could be stale).
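
To make the failure mode concrete, here is a minimal, self-contained sketch of a caching factory (illustration only, using a hypothetical `CachedFactory` rather than the actual XDMoD DB class): once the cached handle has been idle longer than `wait_timeout`, every later factory call still returns that same dead handle.

```
<?php
// Illustration of the stale-handle problem (hypothetical CachedFactory, not
// the real XDMoD DB class): the PDO handle is created once and cached, with
// no liveness check before it is handed back out.
class CachedFactory
{
    private static ?PDO $handle = null;

    public static function factory(string $dsn, string $user, string $pass): PDO
    {
        if (self::$handle === null) {
            self::$handle = new PDO($dsn, $user, $pass);
        }
        return self::$handle; // may already have been closed by the server
    }
}

$a = CachedFactory::factory('mysql:host=localhost;dbname=modw', 'user', 'pass');
sleep(45); // longer than wait_timeout = 30, so the server drops the connection
$b = CachedFactory::factory('mysql:host=localhost;dbname=modw', 'user', 'pass');
var_dump($a === $b); // bool(true): the "new" handle is the same stale object,
                     // so the next query fails with "2006 MySQL server has gone away"
```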

This commit has three changes:

  • Fix the DatabasesMigration to run the table existence checks first and then run each migration step as a separate EtlV2 run.
  • Change the EtlV2 data endpoint so that the initial connect call is guaranteed to get a fresh database connection and not a stale existing one. (Looking at the code design, it may be that the original author expected the DB::factory() function to actually return a 'fresh' connection, since the connection details are cached in the class.)
  • Change Maintenance/ExecuteSql.php to use a fresh connection for each file; the same connection is still reused for the different statements within a file (see the sketch after this list).
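
As a rough sketch of that last point (simplified, with hypothetical helper names; the real ETL\Maintenance\ExecuteSql implementation differs), each SQL file opens its own fresh connection and every statement in that file reuses it:

```
<?php
// Simplified sketch of the per-file connection strategy (hypothetical helper,
// not the actual ETL\Maintenance\ExecuteSql code).
function executeSqlFiles(array $files, string $dsn, string $user, string $pass): void
{
    foreach ($files as $file) {
        // Fresh connection per file: a slow previous file can no longer leave
        // this action holding a handle the server has already timed out.
        $dbh = new PDO($dsn, $user, $pass, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

        // Naive statement split for illustration only; the same connection is
        // reused for every statement within the file.
        foreach (array_filter(array_map('trim', explode(';', file_get_contents($file)))) as $statement) {
            $dbh->exec($statement);
        }

        $dbh = null; // release the handle before moving on to the next file
    }
}
```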

Tests performed

These changes were tested as follows:

In the Docker container, set the MySQL timeout to 30s:

```
echo "wait_timeout = 30" >> /etc/my.cnf.d/server.cnf
```

Then apply the following patch to deliberately make a couple of the SQL files take a "long" time to run:

```
diff --git a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql
index cb7c496a..28efba12 100644
--- a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql
+++ b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/mod_shredder/update_storage_datetimes.sql
@@ -2,3 +2,4 @@ UPDATE
        mod_shredder.staging_storage_usage
 SET
     dt = CONCAT(DATE(dt), 'T', TIME_FORMAT(dt, "%T"), 'Z');
+SELECT SLEEP(45);
diff --git a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql
index 6b3cd725..b3cd67b2 100644
--- a/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql
+++ b/configuration/etl/etl_sql.d/migrations/9.5.0-10.0.0/modw_cloud/update_image_index.sql
@@ -44,3 +44,4 @@ DROP INDEX image_resource_idx ON modw_cloud.image;
 DROP INDEX image_resource_idx ON modw_cloud.instance_data;

 UNLOCK TABLES;
+SELECT SLEEP(45);
```

When you run the original code you get the following error:

```
2022-02-01 21:00:26 [notice] Finished processing section 'xdmod.storage-table-definition-update-9-5-0_10-0-0'
SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
```

With this change in place, the error does not occur.

As a separate test, set the database timeout to 10 seconds (and do not apply the SLEEP(45) patch). The original code gives this error:

```
2022-02-01 21:56:45 [warning] Stopping ETL due to exception in xdmod.migration-9_5_0-10_0_0.alter-shredded_job_slurm (ETL\Maintenance\ExecuteSql)
xdmod.migration-9_5_0-10_0_0.alter-shredded_job_slurm (ETL\Maintenance\ExecuteSql): Error executing SQL Exception: 'SQLSTATE[HY000]: General error: 2006 MySQL server has gone away'
```

This error is also not seen with the change in place.

@jpwhite4 (Member, Author) commented Feb 1, 2022

@eiffel777 I've slightly changed the migration file here so that the table exists checks are all performed first. Can you please confirm that this is OK to do and that you didn't intentionally have a table exists check that needed to run after a migration had completed?

@eiffel777 (Contributor) commented:

> @eiffel777 I've slightly changed the migration file here so that the table exists checks are all performed first. Can you please confirm that this is OK to do and that you didn't intentionally have a table exists check that needed to run after a migration had completed?

@jpwhite4 That change is fine. The more important thing is that the order in which the pipelines run stays the same; specifically, cloud-migration-9-5-0_10-0-0 needs to run before cloud-migration-innodb-9-5-0_10-0-0. There appears to be no change to that order here, so all is good.

@jpwhite4 merged commit 4b9dc81 into ubccr:xdmod10.0 on Feb 7, 2022
@jpwhite4 deleted the etlv2timeouts branch on February 7, 2022 16:44
@jtpalmer changed the title from "Change sql connection reuse strategy." to "Change SQL connection reuse strategy" on Feb 25, 2022