
Figure out what to do with achievements #290

Closed
Simbiat opened this issue Nov 28, 2024 · 24 comments

Simbiat commented Nov 28, 2024

The table grew extremely large: 55+ GBs. Optimization saved around 20GBs, but it's still too large. Regardless of the size, though, the issue was that it seemed to affect the server in general. Or at least that's my assumption, because at some point requests started waiting for more than a minute even for a server response. Optimizing the table and stopping writes to it (with 25b0cfe) seems to have helped, but I am monitoring for now.

One possible reason is that the insert query was

INSERT INTO `ffxiv__character_achievement` SET `characterid`=:characterid, `achievementid`=:achievementid, `time`=UTC_DATE() ON DUPLICATE KEY UPDATE `time`=:time;

instead of

INSERT INTO `ffxiv__character_achievement` SET `characterid`=:characterid, `achievementid`=:achievementid, `time`=:time ON DUPLICATE KEY UPDATE `time`=:time;

which obviously was causing unnecessary updates, since the original inserts were incorrect. That, as a result, would cause index rebuilds and fragmentation (hence the 20GBs difference after optimization).

If in the next week or so I do not notice any slowdowns without these inserts at all, I will re-enable them with the fix and see how it goes. If there are no slowdowns a week after that, then it's good, but probably still a stop-gap.

The reality is that I am not showing all the achievements a character has anyway, and the data is mostly used for 3 things:

  1. Update achievement details. Realistically this needs at least 10 characters that are not removed and not privated.
  2. Get statistics for the rarest achievements.
  3. Get only the 10 latest achievements.

The 1st and 3rd points will definitely overlap significantly (although it's hard to estimate exactly), but even if it ends up being thousands of characters per achievement, that's less than the current tens or hundreds of thousands (or even millions for some easy ones). So the main issue is the 2nd point. Perhaps limit the achievements processed on character updates to those earned after the last update, and then do a +1 in the achievements table (see the sketch after the list below)? Since transactions are used, it should be fine. The problem would be getting a "fixed point in time" from which I can start doing that, but this could be the approach:

  1. Disable character pages entirely, since the privating functionality will not work. Disable automatic updates for everything else, including cron jobs. This will probably require a blog post, and the notice should link to it.
  2. Generate cron jobs for all not-deleted and not-privated characters.
  3. Wait till all of them are completed.
  4. Implement the change in logic along with DB updates and cleanup.
  5. Re-enable character pages and automatic updates.
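
As a rough sketch of the "+1" idea mentioned above (not actual code from the repository; the ffxiv__achievement table layout and the earnedby counter column are assumptions made purely for illustration), the per-update logic could look something like this:

-- Sketch only: `ffxiv__achievement`.`earnedby` is an assumed counter column, and this
-- assumes only achievements earned after the character's last update reach this code
-- path (so each one is genuinely new). Inside the existing update transaction:
INSERT IGNORE INTO `ffxiv__character_achievement` (`characterid`, `achievementid`, `time`)
VALUES (:characterid, :achievementid, :time);

-- Bump the per-achievement counter instead of counting rows for rarity statistics
-- (in application code, only when the INSERT above actually affected a row).
UPDATE `ffxiv__achievement`
SET `earnedby` = `earnedby` + 1
WHERE `achievementid` = :achievementid;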
Simbiat converted this from a draft issue Nov 28, 2024
Simbiat self-assigned this Nov 28, 2024

Simbiat commented Nov 29, 2024

Backup processes in the morning may be a factor. At least it looked like today things were working fine approximately until the zipping of the files started. At the same time, the daily logical backup also failed:

mariadb-dump: Error 2026: TLS/SSL error: unexpected eof while reading when dumping table `ffxiv__character` at row: 940713

It probably also comes down to the size of the achievements table, though, since I am not stopping the website during backups (at least not until Simbiat/optimize-tables#1 is done), so it probably takes a toll on performance when we are already short on RAM. Backups alone need some RAM, and then zipping them does too. But if most of the memory (9GB allocated for MariaDB) is taken up by the achievements table, there is not much wiggle room for operations, and thus stuff starts to slow down and eventually fail (MariaDB literally restarts, although without clear errors in the logs).
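
For context, the kind of pressure I mean can be checked with standard MariaDB counters (nothing project-specific here); if the achievements table dominates the 9GB buffer pool, the disk-read counter keeps climbing relative to logical read requests:

-- How big the pool is and how often reads have to go to disk instead of memory:
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests'; -- logical reads
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';         -- reads that missed the pool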

As such, I am going through with the plan. Instead of a blog post, I will add a link to this issue in the respective notices.


Simbiat commented Dec 4, 2024

Cron update was done, so I disabled some of the tasks and added placeholders on the website pages. Now I need to process 8322902 characters. This will take a while.


Simbiat commented Dec 4, 2024

Things got more complicated: querying the schedule list for cron jobs takes a very long time (it does not finish even within 5 minutes). This is because of the huge list of items and, probably, because I am trying to UPDATE while using that query as a limit on what is to be updated. I guess this shows it's not a good approach, and it is another opportunity for Cron optimization, but it is unclear how I can optimize it, at least not yet.


Simbiat commented Dec 4, 2024

Some testing suggests that the main performance hit comes from GROUP BY, then from the join, and then from the calculations for ordering. Removing GROUP BY alone does not help, but removing it and using a WHERE subquery instead of a join, like this:

SELECT `task`, `arguments`, `instance` FROM `cron__schedule` AS `instances` WHERE `enabled`=1 AND `runby` IS NULL AND `nextrun`<=CURRENT_TIMESTAMP() AND (SELECT `enabled` FROM `cron__tasks` `tasks` WHERE `tasks`.`task`=`instances`.`task`) ORDER BY IF(`frequency`=0, IF(TIMESTAMPDIFF(day, `nextrun`, CURRENT_TIMESTAMP(6))>0, TIMESTAMPDIFF(day, `nextrun`, CURRENT_TIMESTAMP(6)), 0), CEIL(IF(TIMESTAMPDIFF(second, `nextrun`, CURRENT_TIMESTAMP(6))/`frequency`>1, TIMESTAMPDIFF(second, `nextrun`, CURRENT_TIMESTAMP(6))/`frequency`, 0)))+`priority` DESC, `nextrun` LIMIT 50;

at least completes the query. It is slow-ish (20 seconds), but that's fine. This one still fails:

SELECT `task`, `arguments`, `instance` FROM `cron__schedule` AS `instances` WHERE `enabled`=1 AND `runby` IS NULL AND `nextrun`<=CURRENT_TIMESTAMP() AND (SELECT `enabled` FROM `cron__tasks` `tasks` WHERE `tasks`.`task`=`instances`.`task`) GROUP BY `task`, `arguments` ORDER BY IF(`frequency`=0, IF(TIMESTAMPDIFF(day, `nextrun`, CURRENT_TIMESTAMP(6))>0, TIMESTAMPDIFF(day, `nextrun`, CURRENT_TIMESTAMP(6)), 0), CEIL(IF(TIMESTAMPDIFF(second, `nextrun`, CURRENT_TIMESTAMP(6))/`frequency`>1, TIMESTAMPDIFF(second, `nextrun`, CURRENT_TIMESTAMP(6))/`frequency`, 0)))+`priority` DESC, `nextrun` LIMIT 50;

with the error The table '(temporary)' is full.
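
For reference, the limits involved when GROUP BY materializes an internal temporary table can be checked like this (standard MariaDB variables; exhausting them, or the space available in tmpdir, is what typically produces "The table '(temporary)' is full"):

SHOW GLOBAL VARIABLES LIKE 'tmp_table_size';
SHOW GLOBAL VARIABLES LIKE 'max_heap_table_size';
SHOW GLOBAL VARIABLES LIKE 'tmpdir';
-- How often temporary tables already spill to disk:
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';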


Simbiat commented Dec 4, 2024

Looks like it worked only when there were no other connections to the DB. For now I have temporarily removed the grouping and ordering, so that it will at least start to process things.


Simbiat commented Dec 5, 2024

I just realized that I did not really have to run the update for all characters, and I am not sure why I even decided to do that or what the benefit was. But at least I found a bottleneck, and I will see if I can remediate it.


Simbiat commented Dec 5, 2024

Performance issue in Cron was addressed in Simbiat/Cron@20dd4bb. Probably not the best solution, but it seems to work well enough, and it's unlikely you will get 8 million records like this in a normal operation flow. A new Cron release will be created a bit later, in case other bugs pop up.
Blocking of updates was formally pushed to GitHub in 5bb8a1a, and these changes will be "reverted" as part of the next update, which will solve the database size issue.


Simbiat commented Dec 5, 2024

Based on my estimates, full processing of this queue needs 383 days at the current processing speed of ~16 records per minute 😅 So I probably will not keep the whole queue and will clear out most of it once the changes are done.


Simbiat commented Dec 5, 2024

Managed to speed up processing to ~40 records per minute. The resulting 144 days is still a lot, but it's a considerable improvement, and I expect processing to get faster at some point, too (as the number of jobs in cron goes down and as I implement the changes to achievements).
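
For reference, the estimate is simply the queue size divided by the throughput (queue size taken from the Dec 4 comment above):

-- ~8.3 million queued jobs at ~40 jobs per minute:
SELECT 8322902 / (40 * 60 * 24) AS days_remaining; -- ≈ 144.5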


Simbiat commented Dec 6, 2024

Today I remembered why I decided that I need to run updates on all characters, so I'm sharing that here for my future self and whoever finds this interesting. It's the dates. I found out that I was not using the Lodestone datetime for achievements on the initial insert, which means that a lot of achievements have the wrong date if the character was not updated since. Since I am using the date to determine the latest achievements, I can end up with more achievements than I need in some edge cases (at least based on the logic for the code as I imagine it at the moment), which would defeat the purpose of storing only those that are actually needed. The reverse effect may be possible as well, I guess, but I can't think of a clear scenario where it would happen. The point is that the data quality right now is not good, and since I want to maintain as high a level of quality as possible (even considering that I won't fix this for deleted or privated characters), I do need to run the updates for all characters.

In good news, at the time of writing 8262921 jobs remain, so almost 60k characters were updated in ~36 hours, leaving about 136 days at the current processing rate. That is even shorter than the 144 days I estimated yesterday, which means the rate is increasing slightly over time, as I expected. It won't finish by the end of the year, though, that's for sure.


Simbiat commented Dec 6, 2024

Increased cron threads from 7 to 10, to see if that will help further


Simbiat commented Dec 6, 2024

The above rate is, actually, with the possibility of jobs getting stuck because of deadlocks. So the issue right now is that jobs may not get deleted right away due to concurrent reads and updates, and that stalls processing for an hour if there are not enough threads. I need to find a way to prevent that, but since retrying deadlocks takes time anyway, I'm not sure I can do much.
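
For the record, the standard way to look at what is actually deadlocking (nothing project-specific here) is:

-- The "LATEST DETECTED DEADLOCK" section shows the two conflicting statements:
SHOW ENGINE INNODB STATUS;
-- Optionally log every deadlock to the error log instead of only the latest one:
SET GLOBAL innodb_print_all_deadlocks = ON;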


Simbiat commented Dec 7, 2024

Looks like the issues were not with the database at all (although reducing its size is still a good idea). Yesterday my test server stopped responding properly, but I was trying to use PHPMyAdmin at the time, and it was also exporting a large table (for the sake of experiments with table cleaning afterwards), so I paid no heed to that. I did not check availability after the export completed, and just went to sleep. Today it was still not available, even though nothing was happening on the DB side. Looks like this was due to some issue with Crowdsec, because in the Caddy logs I see this:

{"level":"error","ts":1733549067.0213037,"logger":"crowdsec","msg":"auth-api: auth with api key failed return nil response, error: dial tcp 172.21.0.3:8080: connect: connection refused","instance_id":"4157fc49","address":"http://crowdsec:8080/","error":"auth-api: auth with api key failed return nil response, error: dial tcp 172.21.0.3:8080: connect: connection refused"}
{"level":"error","ts":1733549067.0213053,"logger":"crowdsec","msg":"auth-api: auth with api key failed return nil response, error: dial tcp 172.21.0.3:8080: connect: connection refused","instance_id":"4157fc49","address":"http://crowdsec:8080/","error":"auth-api: auth with api key failed return nil response, error: dial tcp 172.21.0.3:8080: connect: connection refused"}
{"level":"warn","ts":1733549067.0219507,"logger":"crowdsec","msg":"failed to send metrics: Post \"http://crowdsec:8080/v1/usage-metrics\": dial tcp 172.21.0.3:8080: connect: connection refused","instance_id":"4157fc49","address":"http://crowdsec:8080/"}
{"level":"error","ts":1733549067.0219715,"logger":"crowdsec","msg":"get \"http://crowdsec:8080/v1/decisions/stream?\": dial tcp 172.21.0.3:8080: connect: connection refused","instance_id":"4157fc49","address":"http://crowdsec:8080/","error":"Get \"http://crowdsec:8080/v1/decisions/stream?\": dial tcp 172.21.0.3:8080: connect: connection refused"}

So it is refusing the connection, but it also does not seem to be closing the stream (the page just keeps on loading and loading, until the browser decides to shut it down). I can see similar patterns from yesterday, too. In the Crowdsec logs, though... I do not see anything for yesterday, but I do see this for today:

time="2024-12-07T05:18:36Z" level=error msg="unable to send usage metrics: Post \"https://api.crowdsec.net/v3/usage-metrics\": performing jwt auth: unexpected EOF"
time="2024-12-07T05:18:36Z" level=error msg="crowdsec - goroutine lapi/usageMetricsToAPIC crashed: runtime error: invalid memory address or nil pointer dereference"
time="2024-12-07T05:18:36Z" level=error msg="please report this error to https://github.com/crowdsecurity/crowdsec/issues"
time="2024-12-07T05:18:36Z" level=error msg="stacktrace/report is written to /var/lib/crowdsec/data/trace/crowdsec-crash.541315290.txt: please join it to your issue"
time="2024-12-07T05:18:36Z" level=fatal msg="crowdsec stopped"
time="2024-12-07T05:18:56Z" level=warning msg="Machine is not allowed to synchronize decisions, you can enable it with `cscli console enable console_management`"
time="2024-12-07T05:18:57Z" level=warning msg="No matching files for pattern /var/log/auth.log" type=file
time="2024-12-07T05:18:57Z" level=warning msg="No matching files for pattern /var/log/syslog" type=file

But since the Crowdsec log matches the time when I woke my PC from hibernation, I think hibernation caused this, which is fine (kind of). The Caddy log does match the times I was accessing pages, so it's probably an issue with it, and, I think, mostly due to it not closing the stream properly.
I will need to monitor production until this reoccurs there (if it does, since it has not happened there in a while now), and check the Crowdsec logs when it's happening.

In good news: I think I've figured out a way to clean up the database nicely, and it probably won't require me to have the 8 million jobs. I may still keep them, but they are not required. I am doing tests now to ensure I do not lose anything extra with the deletion. Hopefully I will be able to promote the fix today.


Simbiat commented Dec 7, 2024

A reminder: do not trust what GPT tells you with confidence. It's a good thing that I started to doubt whether the solution it suggested for a cleaning query matched my requirements, and asked it to explain portions of it. It then right away admitted that it did not match my expectations. I mean, it was working for some test characters, but I think there would be edge cases (especially after it has run a few times) where it would not work as expected and would select more than required.

Simbiat added a commit that referenced this issue Dec 7, 2024

Simbiat commented Dec 7, 2024

With 8127134 jobs left, I pushed the update. This is not fully resolved yet, since removals are done on character updates, so it will not suddenly clean up everything. For that I want to run a one-time script. But since I will need a good approach for such one-time scripts in the future too, I want to prepare the basis for it along with #14, so I will work on that tomorrow. I also probably won't clean up the jobs, since they are low priority and should not affect things much, although I will keep an eye on them, for sure.

Simbiat added a commit that referenced this issue Dec 8, 2024

Simbiat commented Dec 8, 2024

One-time job is running, Cron is running... Even with both of them it will take a long while to process things due to the sheer amount of data, but I expect there will be significant progress by the end of the year. Either way, I will be monitoring the situation and will close this issue once things are cleaned up nicely (more or less).


Simbiat commented Dec 9, 2024

I decided to dump all DELETE statements instead of processing characters one by one. This should, theoretically, further increase removal speed, since I will be able to feed them to MariaDB directly without the PHP layer around it. The size, though... The script has not even processed 10k characters and the SQL file is already 15GBs somehow. Even considering that there is some empty space, that's a lot.
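
To illustrate why the file balloons (this is a purely hypothetical shape, not the actual generated statements): one DELETE per character with an explicit list of achievement IDs to keep can easily run to kilobytes or more per character.

-- Hypothetical example of a single generated statement; real IDs and lists differ:
DELETE FROM `ffxiv__character_achievement`
WHERE `characterid` = 12345
  AND `achievementid` NOT IN (101, 102, 103 /* ...potentially thousands of IDs... */);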


Simbiat commented Dec 11, 2024

So... In the end I am trying a different approach, which I have now tested successfully on the test server. I need to do this, since all the estimates are just too long. The solution that drastically speeds things up is to export all the achievements I know I am going to keep into a file, then truncate the current table and reinsert everything. Below is the query I used to get the data:

WITH `LatestAchievements` AS (
    SELECT 
        `characterid`,
        `achievementid`,
        `time`
    FROM (
        SELECT 
            `characterid`,
            `achievementid`,
            `time`,
            ROW_NUMBER() OVER (PARTITION BY `characterid` ORDER BY `time` DESC) AS `rn`
        FROM `ffxiv__character_achievement`
    ) AS `ranked`
    WHERE `rn` <= 50
),
`LatestCharacters` AS (
    SELECT 
        `characterid`,
        `achievementid`,
        `time`
    FROM (
        SELECT 
            `a`.`characterid`,
            `a`.`achievementid`,
            `a`.`time`,
            ROW_NUMBER() OVER (PARTITION BY `a`.`achievementid` ORDER BY `a`.`time` DESC) AS `rn`
        FROM `ffxiv__character_achievement` AS `a`
        JOIN `ffxiv__character` AS `c` ON `a`.`characterid` = `c`.`characterid`
        WHERE `c`.`deleted` IS NULL AND `c`.`privated` IS NULL
    ) AS `ranked`
    WHERE `rn` <= 50
)
SELECT 
    CONCAT(
        "INSERT INTO `ffxiv__character_achievement`(`characterid`, `achievementid`, `time`) VALUES (",
        `characterid`, ", ",
        `achievementid`, ", '",
        `time`, "') ON DUPLICATE KEY UPDATE `time`='",
        `time`, "';"
    ) AS `sql_statement`
FROM (
    SELECT * FROM `LatestAchievements`
    UNION ALL
    SELECT * FROM `LatestCharacters`
) AS `combined`
INTO OUTFILE '/usr/local/logs/achievements_tokeep.sql'
FIELDS TERMINATED BY '' 
LINES TERMINATED BY '\n';

It gets the 50 latest achievements for each character, and the 50 latest (not deleted and not privated) characters for each achievement, and then dumps all of that into a file as INSERT statements (ON DUPLICATE KEY UPDATE is more of a precaution).
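
For clarity, the replay step itself is essentially this (run from the mariadb client; the exact invocation on the server may differ):

-- Wipe the table, then execute the generated INSERT statements from the dump:
TRUNCATE TABLE `ffxiv__character_achievement`;
SOURCE /usr/local/logs/achievements_tokeep.sql;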

My test table was 9GBs in size (including indexes). I have no idea how many records it had, since I did not write it down, but I think it was around 100 million, if not more. The resulting file was generated in roughly 40 minutes, was 9GBs (but it had a lot of unneeded whitespace for some reason), and it took around 12 hours to reimport it (since it's 1 INSERT per record). The new count is a bit less than 12 million, with a total size of... 758.4MB. A significant improvement.

The same query has been running on PROD for about 15 hours now, but it has 492010805 records, so... I expect it to run for a long time, and definitely not just 4x or 5x longer than the test one, due to the nature of the queries. But I expect that it will still generate a file and process it fully within maybe a month at worst, which will be faster than all my previous estimates.


Simbiat commented Dec 11, 2024

Query OK, 32687777 rows affected (22 hours 6 min 44.053 sec)... but for some reason there is no file. Something apparently removed it 😭 Probably my script which copies logs from there, since I have a 0-byte file with the same name on my PC :(


Simbiat commented Dec 12, 2024

24.4 GBs script. Let's go!


Simbiat commented Dec 13, 2024

The script completed in about 12 hours, and the total size of the table is now just 2GB (and that's with indexes on all 3 columns!). I would like to close this issue completely, but unfortunately it looks like even after such cleaning, processing of existing characters can take 2 minutes on average instead of the previous 2-5 seconds, so I need to optimize that, probably by using a variation of the CTE above and then deleting all achievements not listed in its results.
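
Something along these lines should work for the per-character half (a sketch only, not the final query; the "keep the 50 latest characters per achievement" condition would need to be combined in similarly): rank the rows per character with the same window function as above, materialize that ranking in a derived table, and delete everything past the newest 50.

-- Sketch of the cleanup: the derived table is materialized, so the same table can
-- be read and deleted from in one multi-table DELETE statement.
DELETE `a`
FROM `ffxiv__character_achievement` AS `a`
JOIN (
    SELECT `characterid`, `achievementid`,
           ROW_NUMBER() OVER (PARTITION BY `characterid` ORDER BY `time` DESC) AS `rn`
    FROM `ffxiv__character_achievement`
) AS `ranked`
    ON `ranked`.`characterid` = `a`.`characterid`
   AND `ranked`.`achievementid` = `a`.`achievementid`
WHERE `ranked`.`rn` > 50;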


Simbiat commented Dec 13, 2024

In fact, I think that if I am able to solve #180, I should be able to speed up the cleanup significantly, since I will not care whether a character is removed or not when getting achievement details, and thus I can simply keep only the latest 50 achievements per character, maybe without even caring about the 50 latest characters per achievement.


Simbiat commented Dec 22, 2024

While working on the update, I noticed that... I was counting achievements with disregard for the time when they were earned. Luckily, at most about 4000 characters were updated since the change, so it should not be that bad. I think.

Simbiat added a commit that referenced this issue Dec 22, 2024

Simbiat commented Dec 22, 2024

And, I think we're done here. The new query selects achievements to remove in less than a second (thanks, GPT), so it should be smooth sailing now. There is a chance that due to concurrent updates some extras will be removed, but that's negligible, especially since I am keeping 50 characters for each achievement that does not have a page on Lodestone (not in the "PlayGuide", that is). I will monitor the situation, of course, but in general, I am treating this as resolved. Finally.

Simbiat closed this as completed Dec 22, 2024
github-project-automation bot moved this from In Progress to Done in simbiat.dev Dec 22, 2024