Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

I/O error on XFS #230

Open
pfactum opened this issue Feb 22, 2025 · 11 comments
Open

I/O error on XFS #230

pfactum opened this issue Feb 22, 2025 · 11 comments

Comments

@pfactum
Copy link

pfactum commented Feb 22, 2025

(@tsmethurst asked me to file this bugreport here.)

Hello.

I'm upgrading my GoToSocial instance from v0.17.4 to v0.18.0. During the SQLite DB migration process the following error occurs:

2025/02/22 21:45:42 error executing command: error creating dbservice: db migration error: ]: error dropping old column: sqlite3: disk I/O error (code=10 extended=778)

This happens with db-sqlite-journal-mode set to WAL (it's a default value in GtS) and on XFS filesystem.

If I set db-sqlite-journal-mode to OFF, the migration succeeds on XFS filesystem.

If I keep db-sqlite-journal-mode set to WAL, but use Btrfs filesystem, the migration succeeds too.

Playing with db-sqlite-synchronous values didn't make any difference.

I didn't have this issue during previous migrations on GtS v0.17.x and earlier.

I assume GtS had go-sqlite3 updated in v0.18.0. I do not have any other details, but as I've got a reliable reproducer, I can collect any additional info as needed.

Thank you.

@ncruces
Copy link
Owner

ncruces commented Feb 22, 2025

I wanna be helpful here, but I have to be honest: we don't have a lot to go on here.

The problem with SQLite is that the API, by design, converts errors into an error code that gives us very little information about what's going on. Code 778 just means a write failed, little else.

There's a provision in the API to remember the last OS error… but it's ignored by SQLite. Then there's a form of debug logging, but (1) I dunno if GoToSocial has a way to enable it (through configuration) and (2) I haven't done my part for IO errors, so although more logging would always be helpful in a situation like this, it might not catch the specific error causing this.

The way I would go about this, if I had a reproducer, would be to add some hard coded logging in the source code and try to figure out more that way. But, at a minimum that involves building from source.

In the future, this is something I whish I could spend the time improving, has it has bitten me sometimes already.

For today, I'm interested, but I'm not sure how we could arrange a debugging session. You send me a reproducer (I don't really want your sensitive data)? I send you modified source code? Can we make the test case smaller, @tsmethurst a binary that just does the migrations?

I'm sorry if that's not very helpful.

@ncruces
Copy link
Owner

ncruces commented Feb 22, 2025

In principle, it's not even hard to know where the error is coming from, it pretty much has to be from this line:

go-sqlite3/vfs/vfs.go

Lines 180 to 181 in 93049c2

_, err := file.WriteAt(buf, iOfst)
return vfsErrorCode(err, _IOERR_WRITE)

As it's the only line in the entire source that generates that error. The problem is that the Go err (which has more info in it), is coerced into an int and SQLite as no provision for the rest of the info to go from Go->C/Wasm->Go, where it could be logged, etc.

The easy hack is to just to add a log there, rebuild, try again, see what pops.

The long and hard is to again reconsider how I could surface the information in a way the application decides what happens. We want observability when needed, but not needless log spam.

This is the kind of stuff where this driver is probably still too immature. 🫣

@tsmethurst
Copy link

All migrations are run on launch, including for brand new instances :) so you should be able to replicate this pretty easily! If you push something on a branch I'm happy to pull it into gts and build it for @pfactum to try out.

@ncruces
Copy link
Owner

ncruces commented Feb 22, 2025

Branch issue230 should be enough to log that error (and potentially others), without spamming the logs with transient conditions.

grep issue230 over the logs.

@ncruces
Copy link
Owner

ncruces commented Feb 23, 2025

For the future, the correct course of action, I guess, is that you call ConfigLog on connections (perhaps if configured to do so), and I do my bit and send IO errors there as well.

Right now, you'd only get internal SQLite error logs (not VFS and OS errors), because I haven't really done my job at the VFS level.

At a time, I implemented xGetLastError. The theory on in was amazing (save the error message), and exactly what I wanted. But then the unix implementation (the comment is from windows) is just nothing 1. And (surprise!) this is completely unused.

SQLite developers simply gave up on the idea of surfacing this through the normal error path, which would produce informative error messages at the place they reach the caller, database/sql, whatever ORM you might be using, and finally the app. The only thing that travels through that path is an error code, usually little more than IOERR. You have to use a logging side channel to know more.

And that's just the SQLite way… I'm kinda tied here.

Footnotes

  1. actually neither implementation even abides by the spec; both return the error code, whereas the return value was supposed to be if the message was truncated…

@tsmethurst
Copy link

Alright! I've built a binary with that branch of go-sqlite3 :) You can download the binary from here @pfactum: https://minio.s3.superseriousbusiness.org/browser/gotosocial-snapshots/xfs_poking%2Fgotosocial Could you run it against your db backup on one of your xfs filesystems, grep for issue230, and let us know what you see?

@pfactum
Copy link
Author

pfactum commented Feb 23, 2025

Interesting:

timestamp="23/02/2025 11:04:42.534" func=migrations.convertEnums level=INFO msg="migrated 415000 of 422038 statuses (up to 01GP7QXK6GQQENW6MQYB7Y780A)"
timestamp="23/02/2025 11:04:42.762" func=migrations.convertEnums level=INFO msg="migrated 420000 of 422038 statuses (up to 01GMP6E41GKXVP5SBW9JRH47TW)"
timestamp="23/02/2025 11:04:42.887" func=migrations.convertEnums level=INFO msg="migrated 422038 of 422038 statuses (up to 01GJ87NBKGA6MHFPT4X8T2JSEC)"
timestamp="23/02/2025 11:04:42.888" func=migrations.init.90.func1.1 level=INFO msg="dropping old index statuses_visibility_idx..."
timestamp="23/02/2025 11:04:43.100" func=migrations.init.90.func1.1 level=INFO msg="dropping old index statuses_profile_web_view_idx..."
timestamp="23/02/2025 11:04:44.556" func=bundb.queryHook.AfterQuery level=WARN duration="1.455545747s" query="DROP INDEX statuses_profile_web_view_idx" msg="SLOW DATABASE QUERY"
timestamp="23/02/2025 11:04:44.556" func=migrations.init.90.func1.1 level=INFO msg="dropping old index statuses_public_timeline_idx..."
timestamp="23/02/2025 11:04:45.751" func=bundb.queryHook.AfterQuery level=WARN duration="1.195423529s" query="DROP INDEX statuses_public_timeline_idx" msg="SLOW DATABASE QUERY"
2025/02/23 11:04:48 issue230 778: write /tmp/2870767786.db: no space left on device
timestamp="23/02/2025 11:04:48.596" func=bundb.queryHook.AfterQuery level=WARN duration="2.844363621s" query="ALTER TABLE \"statuses\" DROP COLUMN \"visibility\"" msg="SLOW DATABASE QUERY"
timestamp="23/02/2025 11:04:48.599" func=cache.(*Caches).Stop level=INFO msg="stop: 0xc0001a5988"
timestamp="23/02/2025 11:04:48.600" func=router.stopServer level=INFO msg="shutting down http server with 30s grace period"
timestamp="23/02/2025 11:04:48.602" func=router.stopServer level=INFO msg="http server closed connections and shut down gracefully"
timestamp="23/02/2025 11:04:48.602" func=workers.(*Workers).Stop level=INFO msg="stopped scheduler"
timestamp="23/02/2025 11:04:48.603" func=workers.(*Workers).Stop level=INFO msg="stopped delivery workers"
timestamp="23/02/2025 11:04:48.603" func=workers.(*Workers).Stop level=INFO msg="stopped client workers"
timestamp="23/02/2025 11:04:48.603" func=workers.(*Workers).Stop level=INFO msg="stopped federator workers"
timestamp="23/02/2025 11:04:48.603" func=workers.(*Workers).Stop level=INFO msg="stopped dereference workers"
timestamp="23/02/2025 11:04:48.603" func=workers.(*Workers).Stop level=INFO msg="stopped processing workers"
timestamp="23/02/2025 11:04:48.603" func=workers.(*Workers).Stop level=INFO msg="stopped WebPush workers"
timestamp="23/02/2025 11:04:48.603" func=server.init.func2.1 level=INFO msg="done! exiting..."
2025/02/23 11:04:48 error executing command: error creating dbservice: db migration error: ]: error dropping old column: sqlite3: disk I/O error (code=10 extended=778)

Why all of a sudden:

2025/02/23 11:04:48 issue230 778: write /tmp/2870767786.db: no space left on device

?

@pfactum
Copy link
Author

pfactum commented Feb 23, 2025

If so, I think the issue is not with XFS or anything like that, it's just that my local machine with Btrfs has got much more RAM to handle the migration with writes like this into /tmp. But why does GtS/go-sqlite3 dumps so much into /tmp anyway?

@ncruces
Copy link
Owner

ncruces commented Feb 23, 2025

2025/02/23 11:04:48 issue230 778: write /tmp/2870767786.db: no space left on device

This is already something I need to fix, and which is on my to-do: review all error codes (again), because this one should be singled out as SQLITE_FULL, not a generic write error.

@ncruces
Copy link
Owner

ncruces commented Feb 23, 2025

If so, I think the issue is not with XFS or anything like that, it's just that my local machine with Btrfs has got much more RAM to handle the migration with writes like this into /tmp. But why does GtS/go-sqlite3 dumps so much into /tmp anyway?

This is hardly unexpected in a migration. SQLite likely (even if not at explicit request of GoToSocial) creates a temporary table or index to facilitate the process.

In the default configuration of SQLITE_TEMP_STORE, which is 1, these are stored on disk, but the other option is memory (which will also require a lot of RAM…)

For more, start reading from here:
https://sqlite.org/tempfiles.html#temp_databases

As noted in the above page, apps can change the directory used for temporaries with a PRAGMA.

I'll look into supporting the SQLITE_TMPDIR environment variable.

@pfactum
Copy link
Author

pfactum commented Feb 23, 2025

So, just for the record, I've managed to complete the migration on a production host by temporarily setting TMPDIR= env var to /var/lib/gotosocial.

Since this is not an FS issue, you may want to repurpose this report for more accurate VFS error reporting and/or for SQLITE_TMPDIR support.

Thank you.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants