# Database / needed file size optimisations #181
Comments
I'm going to just give an explanation of the things mentioned, as there are a few things you state you don't understand; I still have to finish reading your post.
Once Tromador updates the server, this will also be true with the eddblink plugin.
This is used to mark if the data is from Spansh's nightly dump or from the live feed from EDDN. The server runs tradedangerous-listener, a program that uses TD to provide the files eddblink imports. The server imports the Spansh data when that gets updated, and then performs its daily dump to `listings.csv`. Any entry updated by a message from EDDN is marked for inclusion in the differential dump to `listings-live.csv`. Making it an integer was a bad idea; blame my then SQL-ignorant self.
Explained above. Any format changes would have to be reflected in both the listener and the eddblink plugin. There are no other considerations I'm aware of that prevent or discourage making changes to the format.
That is a bug I thought I'd fixed. It should only download the file if the copy on the server is newer than the local copy, by checking the modified times.
Great to hear. Personally, I'd suggest renaming
As I'm not too familiar with the data contained in spansh's nightly dumps, what is the difference between the two? My understanding was that spansh is aggregating the data from EDDN in the end / EDDN is the primary source of data at this time in the community. So the spansh nightly dump of things updated in the last 24 hours should be almost equivalent to the server's listener (barring network blips on the server side, or spansh's side for that matter), no?

As for marking data changes within the last window between dumps, there's no real reason to store that within the main database, especially if it's primarily for the server. Just create an extra database copy of just the new items, or a listing of the updated IDs that the server can load up (in a separate application / process) to dump out the relevant items. Or just create that listing as a streaming file (either don't prune duplicates, or do so periodically).

Now, it could be worthwhile to make any client capable of doing what the server does directly from TD (i.e. wrap the listener in as a part of the repo), in which case perhaps having one database listing what needs to be dumped next could be useful even off the "server". But then, rather than paying the cost on every StationItem, make it a separate table that only stores the StationItem ids that have changed, and clear said table when a dump is finalised. (Or, again, make it a fully separate listing; especially with TD being a Python app, which is relatively unparallelisable, it's very likely you wouldn't see any real performance cost from having a separate process act as a dedicated "dump the latest change into it, consolidate those changes into a unique listing, and dump out the dump file on demand" type of service, maybe caching the item IDs immediately so it's not all in memory.)
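For concreteness, a minimal sketch of that "separate table of changed IDs" idea, assuming hypothetical table and column names rather than TD's actual schema:

```python
# Illustrative only: a side table of changed (station_id, item_id) pairs that the
# listener fills as EDDN messages arrive and the server clears after each dump.
# Table/column names are hypothetical, not TD's actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StationItem (
    station_id   INTEGER NOT NULL,
    item_id      INTEGER NOT NULL,
    demand_price INTEGER,
    PRIMARY KEY (station_id, item_id)
);
CREATE TABLE StationItemChanged (
    station_id INTEGER NOT NULL,
    item_id    INTEGER NOT NULL,
    PRIMARY KEY (station_id, item_id)
) WITHOUT ROWID;
""")

def mark_changed(station_id: int, item_id: int) -> None:
    """Record that this StationItem was touched since the last dump."""
    conn.execute("INSERT OR IGNORE INTO StationItemChanged VALUES (?, ?)",
                 (station_id, item_id))

def rows_for_differential_dump():
    """Only the entries that changed since the last dump."""
    return conn.execute("""
        SELECT si.* FROM StationItem si
        JOIN StationItemChanged c USING (station_id, item_id)
    """).fetchall()

def dump_finalised() -> None:
    """Reset the change log once the differential dump has been written."""
    conn.execute("DELETE FROM StationItemChanged")
    conn.commit()

conn.execute("INSERT INTO StationItem VALUES (1, 42, 1000)")
mark_changed(1, 42)
print(rows_for_differential_dump())   # [(1, 42, 1000)]
dump_finalised()
```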
I'm familiar with
Oh, you did. I was referring to whether the full amount needed to be sent every day to begin with. It sounds like it's truly a daily dump from what you were saying above, so that's a mistaken impression on my part. That being said...
That... is not supposed to happen on most any OS, or rather file system driver, that I know of? Modified times should usually be stored in UTC internally; it's just the display that'll be timezone aware. In python the straight

Anyway, thanks for working on all these things. I don't mean to come off as just complaining a lot and not doing anything, or anything like that. I'm just not familiar enough with the codebase to make or analyse these modifications beyond suggestions of datatypes at the moment; maybe I'll get the time to read the code more. Was meaning to refactor the
We should be able to dispose of "Added", now that we have schema changing capabilities.

On how to structure tables - replacing the 8 flags (has market, has black market, etc.) with two 8-bit flag masks might be a good idea, but temper that against how expensive the bit-twiddling will be from within Python. Also, there's a balance to temper in how you organize the data: my original intention was that SQLite would just be an ACID means of reading/writing to disk with transactions, but all queries would go through the TradeDB object. Now, however, there are lots of queries that instead lean into SQLite and depend on indexes etc.

I put together a Jupyter Notebook / Colab project examining the various effects of having a single flag (e.g. "has market") stored as a character or an int, nullable or not, with or without the rowid column (if we're going to store unique IDs, we don't need a row id):

https://gist.github.com/kfsone/5a44d68809d369423b9e93e001dbff91

aka (still running right now, and on a free-tier machine so the numbers are a bit higher)

https://colab.research.google.com/drive/13BbvYKpds6BFfac6C4eJNjGTIhl_R5J1?usp=sharing
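As a rough gauge of the bit-twiddling involved, a minimal Python sketch of packing 8 booleans into a single mask; the flag names are made up for the example and are not the actual Station columns:

```python
# Rough illustration only: packing 8 hypothetical per-station booleans into one
# integer mask and reading one back out.
MARKET, BLACK_MARKET, SHIPYARD, OUTFITTING, REARM, REFUEL, REPAIR, PLANETARY = (
    1 << n for n in range(8)
)

def pack_flags(values: dict[str, bool]) -> int:
    """Fold named booleans into a single 8-bit mask."""
    bits = {"market": MARKET, "black_market": BLACK_MARKET, "shipyard": SHIPYARD,
            "outfitting": OUTFITTING, "rearm": REARM, "refuel": REFUEL,
            "repair": REPAIR, "planetary": PLANETARY}
    mask = 0
    for name, value in values.items():
        if value:
            mask |= bits[name]
    return mask

def has_flag(mask: int, flag: int) -> bool:
    """The per-lookup cost is just this single AND."""
    return bool(mask & flag)

mask = pack_flags({"market": True, "refuel": True})
assert has_flag(mask, MARKET) and not has_flag(mask, SHIPYARD)
```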
@ultimatespirit I feel I must point out that you are requesting/suggesting a huge amount of work. TD is old, and much of the code comes from a time when the demands on the application were different, what could be done with Python was different, and even best programming practices were different.
From what documentation I've seen, SQL is supposed to support bitwise operations, so at least a query like "find things that have/don't have X" should be achievable on the SQL side with something like a bitwise AND in the WHERE clause.

Of course, if it ends up being that that sort of information is required / the database is just meant to be a datastore, it may make sense to stop paying the cost of SQL and go with a NoSQL key-value store solution that would read/write faster and involve less complicated search queries (at the cost of likely more storage space for the database).

And @Tromador, I think you may be misreading my tone? My intention for this one specifically is to essentially just document every place that this project could potentially squeeze out space savings / bandwidth savings, since I saw it being discussed in other issues and I read through the data formats being used here. Treat this as documentation; it's a GitHub issue solely because that's the best way to document things in GitHub. If it's deemed unnecessary or out of depth, it's free to be ignored. None of it is a condemnation of past decisions; it's not like any codebase is ever perfect from the get-go. Nor is there any expectation of continued support.

That being said, as you said, TD is old. And unfortunately, the cumulative burden from that past inertia / accumulated issues is more than starting to show. As it currently stands there's very little reason to use TD beyond as a curiosity, due to the noted performance issues. And to me at least, that's really unfortunate, as this project seems really cool. So if I can document (or fix) performance gaps, I figured why not, especially when it seemed to suddenly get such a flurry of attention recently.
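For reference, a standalone sketch of that kind of bitmask filter; the `services` column and bit values are hypothetical, purely to show SQLite accepting bitwise operators in a WHERE clause:

```python
# Hedged sketch: SQLite does support bitwise operators in expressions, so a
# "has service X" filter can stay on the SQL side. The `services` mask column
# and flag values here are hypothetical, not TD's current schema.
import sqlite3

SHIPYARD = 1 << 2  # assumed bit position, for illustration only

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Station (station_id INTEGER PRIMARY KEY, services INTEGER NOT NULL)")
conn.executemany("INSERT INTO Station VALUES (?, ?)", [(1, 0b00000101), (2, 0b00000001)])

# Stations that do have a shipyard...
with_yard = conn.execute(
    "SELECT station_id FROM Station WHERE (services & ?) != 0", (SHIPYARD,)
).fetchall()
# ...and those that don't.
without_yard = conn.execute(
    "SELECT station_id FROM Station WHERE (services & ?) = 0", (SHIPYARD,)
).fetchall()
print(with_yard, without_yard)  # [(1,)] [(2,)]
```

One caveat: a filter like `(services & ?) != 0` won't use an ordinary index on the column, which feeds into the SQLite-versus-TradeDB balance described above.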
If you understand the difference between a full backup and a differential backup, then you understand the difference between `listings.csv` and `listings-live.csv`.

Every day, Spansh dumps all the collected data for every system with at least one station in it to its nightly dump file. In the period between each dump, as EDDN broadcasts new market messages, the server updates the DB from those messages. The data from these messages will be in the next Spansh dump, but in the meantime it is what goes into `listings-live.csv`.

When I originally wrote the listener, I did it with the intention of people running it themselves, as until Tromador offered, there was no option for having a TD server post-Maddavo. This is why the
The twist here is whether you want to index it or not, and if you do, then you have to pay the indexing overhead. For a long-running process, you typically want to offload that to another thread/process so it doesn't block user operations; that's where sqlite+python was a poor long-term choice, and one I clung to until the moment I discovered that Python threading was worse than a single thread for CPU-bound work. (*if you are reading this after Python 3.13, go back to your future where Aubrey Plaza is president)
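A minimal sketch of that offloading pattern, assuming a hypothetical side process that owns the SQLite connection and batches commits; this is not TD's actual code:

```python
# Illustrative only: a worker process owns the SQLite connection and commits in
# batches, so CPU-bound import work doesn't block the process serving the user.
import multiprocessing as mp
import sqlite3

def writer(queue, db_path: str) -> None:
    """Consume (station_id, item_id, price) rows and commit them in batches."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS prices (station_id, item_id, price)")
    batch = []
    while True:
        row = queue.get()
        if row is None:          # sentinel: flush and exit
            break
        batch.append(row)
        if len(batch) >= 1000:   # commit in chunks to amortise transaction cost
            conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", batch)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=writer, args=(q, "example.db"))
    proc.start()
    for i in range(5000):        # stand-in for messages arriving from EDDN
        q.put((i % 100, i % 20, i))
    q.put(None)
    proc.join()
```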
Didn't have time earlier to double check this; I see that there is a 1-day dump [1]. I think spansh has offered in the past to include other types of data dumps too, so it could be possible to reach out and just get a dump created that's solely the systems with stations updated within the last 24 hours.

(Of course, since spansh's data is definitely EDDN, if the EDDN listener is running on the server constantly anyway there shouldn't be any missing data necessitating the server download spansh to begin with; it'd just be a replication of what it was listening to anyway. Nice redundancy though, I guess, to double check no messages were missed.)

[1] https://downloads.spansh.co.uk/galaxy_1day.json.gz (don't actually click that, it'll go straight to a download of a 600MB file or so)
As requested in the server hosting upgrade issue, here's a new issue to discuss the data storage problems / optimisations that can be done.
I already mentioned a few things here, albeit entirely based off of what I was reading in that issue rather than from actually looking at the database or related files. I'm going to list here every area that could be changed to reduce the on-disk and server bandwidth requirements of trade-dangerous. Many of these are likely already known, but I figure it is good to explicitly list them in one location (and since I don't know what is or is not known).
Note to start with, I am going into this with the following basic assumptions, assumptions I know do not hold true for the current form of the database, to some extent at least, due to historical considerations (everything used to be human input data):
With those out of the way, let's start with the SQL database itself, taking the format of listing how it currently is, and how it can be changed.
N.B. I am not an SQL expert; I am using this SQL datatype reference sheet for what a normal strictly-typed SQL database's type sizes would likely be. This project uses `sqlite3` it seems, so I'm also using this resource on types, with the caveat that I'm assuming the sane option of strict typing is being used, since there's no way variable typing is helpful here. Oh, also, I'm looking exclusively at what my own local database has inside of it from a working copy of trade-dangerous.

## SQL Database
First off, the database has 14 tables:
I'm going to ignore "Added" and "sqlite_sequence", as I have absolutely no idea what those tables mean. It seems "Added" is some way of marking from what era of the game some information is, but I'm not sure it's relevant anymore. I'm also going to ignore both of the FDev tables, they appear to be tracking data from a master fdev list or otherwise are not exactly under control of the program.
### Category
This is an integer ID field and a `VARCHAR(40)` name. It appears to represent the per-item categories from the in-game UI. This is of questionable value to the program and likely could be removed.

### Item
Contains:
- `item_id`: INTEGER
- `name`: VARCHAR(40)
- `category_id`: INTEGER
- `ui_order`: INTEGER
- `avg_price`: INTEGER
- `fdev_id`: INTEGER
If the Category table gets removed, then `category_id` can be removed. Similarly, `ui_order` is of questionable value. Otherwise, there isn't much to do here.

I'll elide any `INTEGER` tags for id variables going forward.

### RareItem
- `rare_id`
- `station_id`
- `category_id`
- `name`: VARCHAR(40)
- `cost`: INTEGER
- `max_allocation`: INTEGER
- `illegal`: TEXT(1), chars in `[?yn]`
- `suppressed`: TEXT(1), chars in `[?yn]`
More or less as good as it can get, except...
### Ternaries
I'm not quite sure what "suppressed" is, but it and "illegal" represent our first ternary datatype analogue. According to the SQL Server reference, `TEXT` being variable length requires 4 + number-of-characters bytes. SQLite says nothing specific about storage requirements, solely that it encodes it in the encoding of the database. I'm assuming that's UTF-8, and since it's variable length it needs at minimum a null byte or similar concept, so let's say 2 bytes per `TEXT(1)` minimum.

I don't know how null values are handled by SQLite with regards to storage size, so I will ignore that representing a ternary as "1, 0, or NULL" could potentially save a byte on the NULL case (by representing NULL through storing nothing at all). Instead, we have two options:
The biggest possible savings would be under the assumption, however, that it is of merit to combine values into one field. That's likely much harder to efficiently search for / filter for (unless it's something SQL engines can handle really well?), so understandable if they need to remain distinct. Still, likely can save at least 1 byte per ternary.
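For illustration, here's a small sketch of what packing several ternaries into one integer column could look like; the field order and names are made up for the example and are not TD's schema:

```python
# Each ternary ('?', 'y', 'n') fits in 2 bits, so four of them pack into a
# single byte-sized integer column. Illustrative only.
TERNARY_TO_BITS = {"?": 0, "y": 1, "n": 2}
BITS_TO_TERNARY = {v: k for k, v in TERNARY_TO_BITS.items()}

def pack_ternaries(values: list[str]) -> int:
    """Pack a list of '?'/'y'/'n' values into one integer, 2 bits each."""
    packed = 0
    for i, value in enumerate(values):
        packed |= TERNARY_TO_BITS[value] << (2 * i)
    return packed

def unpack_ternary(packed: int, index: int) -> str:
    """Read the ternary stored at position `index` back out of the integer."""
    return BITS_TO_TERNARY[(packed >> (2 * index)) & 0b11]

flags = pack_ternaries(["y", "n", "?", "y"])   # e.g. illegal, suppressed, ...
assert unpack_ternary(flags, 1) == "n"
```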
### Ship
- `ship_id`
- `name`: VARCHAR(40)
- `cost`: INTEGER

Basically optimal.
### ShipVendor
This and `UpgradeVendor` appear to be empty. This information actually is available to automated systems; it's just not written to a journal file but instead to transient files. My understanding is that the main data network systems after EDDB closed (under the assumption EDDB had this info) saw fit to just stop attempting to keep the outfitting / ship information.

I'm actually interested in the historical trends for this information though, so for what it's worth I may create an EDMC plugin that could extract this information and feed it to a database.
Otherwise I guess they could be removed, along with their corresponding ID tables.
### Station
This has 9 ternaries (`max_pad_size` is actually a quaternary, but that still fits in 2 bits or one byte), so at least 9 bytes could be saved from the ternary considerations above, more if they can be combined into packed integers. For the purposes of this program itself, I'm not sure `rearm` or `refuel` are that helpful, but they are generally useful information, so likely worth keeping anyway.

The date, however, is interesting for the `modified` field. Station modified time is of little use. All of its parameters are either constants, barring the edge case of marking a station as destroyed / abandoned / under attack, or have modifications out of sync with the station itself, like its commodities market vs outfitting vs shipyard. This likely could be removed, with some caveats covered in later sections.

Also, the edge case of an abandoned / under attack / reduced-services station is an admittedly interesting / useful one, but for a trading optimiser perhaps fine to just ignore?
### System
- `system_id`
- `name`
- `pos_{x,y,z}`: DOUBLE
- `added_id`
- `modified`
Added likely could be removed, and `modified` is of questionable use, as it's dependent on the internal latest update time of a station, or on outright new data being provided. It's likely not that useful to store for the edge case of someone wanting to check whether a completely newly discovered / expanded system was updated recently or not based on the existence of stations.

### StationItem
This one's the big ticket item. A representation of every commodity known to the database at every station. Lots and lots of entries.
- `station_id`
- `item_id`
- `demand_price`
- `demand_units`
- `demand_level`
- `supply_price`
- `supply_units`
- `supply_level`
- `modified`
- `from_live`
All but `modified` are integers. I do not know what `from_live` is, and I assume `{demand,supply}_level` refers to the little bar-graph things in the game UI.

It's highly doubtful those level variables are useful; remove them for 2 bytes of savings (assuming they're one byte each). More importantly, under the completeness assumption, we do not need `modified`, and can instead store that information in the associated station (using a new date field for last commodity market update, last shipyard update, last outfitting, etc.). That should save 4-8 bytes for every single item; in my current database that's 44,169,051 items, so around 300 megabytes (if there's no internal compression involved that cuts it down).

And that's it for the SQL database tables. Well, with one caveat: it was mentioned that for efficiency `StationItem` should likely be split up into two tables, one for selling and one for buying, so that the views `StationBuying` and `StationSelling` don't need to exist. I'm not sure on the specifics there, or if that's outdated information now (it was mentioned like a month ago in an issue), but it doesn't majorly affect the storage requirements provided that the date field gets placed into the station it's associated with. That being said, it's likely many operations that require sale info would also benefit from buy info (making a bi-directional trade route, for example) and vice-versa, so we may just not need the buying/selling specific views anyway; just process the `StationItem` data in Python after retrieval.
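To make the "date field on the station" idea concrete, here's a hedged sketch; the column names (`market_update`, etc.) are invented for illustration and are not TD's current schema:

```python
# Hypothetical sketch of moving per-item timestamps up to the station level;
# not TD's actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Station (
    station_id      INTEGER PRIMARY KEY,
    market_update   TEXT,   -- last commodity market update for the whole station
    shipyard_update TEXT,   -- last shipyard update
    outfit_update   TEXT    -- last outfitting update
);
CREATE TABLE StationItem (
    station_id   INTEGER NOT NULL,
    item_id      INTEGER NOT NULL,
    demand_price INTEGER,
    supply_price INTEGER,
    -- no per-row 'modified' column: the market as a whole is stamped once
    PRIMARY KEY (station_id, item_id)
) WITHOUT ROWID;
""")

# On receiving a market message, every item for the station is replaced and the
# single station-level timestamp is bumped, instead of stamping thousands of rows.
conn.execute("INSERT INTO Station (station_id, market_update) VALUES (1, '2024-01-01 00:00:00')")
conn.execute("UPDATE Station SET market_update = ? WHERE station_id = ?",
             ("2024-01-02 12:34:56", 1))
```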
## The CSV format
I'm not really sure what the distinction is between `listings.csv` and `listings-live.csv`. I'm assuming these files can be changed in their format, or at least what gets served by the `eddblink` server can be changed. Both have the same format: `id,station_id,commodity_id,supply,supply_bracket,buy_price,sell_price,demand,demand_bracket,collected_at`
### `id`

The `id` field appears to not be related to any of the SQL parameters above; in fact, it appears to literally just be the line number. It absolutely can be removed, saving around O(n log(n)) bytes, where n is the number of lines in the file, which is around 4-5 million lines, so something like 20-25MB.

### `station_id` and `commodity_id`

The data appears to be sorted by `station_id` followed by `commodity_id`; if it isn't, do that, and then either store the `station_id` info implicitly by making it appear only in the first entry (any empty entry afterwards represents the same id), or do it as a diff from the last entry. For `commodity_id` an empty value could mean "last value +1", or again just diff-encode it from the previous line. A reasonable baseline number that all commodity IDs start from could be chosen too, to make the first entry's diff less special. Same for `station_id`. Should save a decent amount of on-disk space (a rough sketch of this encoding is below).

### `supply`, `demand`, and `{buy,sell}_price`
Again, nothing much to really do here.
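Going back to the diff-encoding idea for `station_id` / `commodity_id`, here is a rough sketch of a writer/reader pair under those assumptions; this is not the current file format:

```python
# A rough sketch of the diff-encoding idea for the listings CSV: emit station_id
# only when it changes, and commodity_id as a delta from the previous row.
def encode_rows(rows):
    """rows: iterable of (station_id, commodity_id, ...rest), pre-sorted."""
    last_station, last_commodity = None, 0
    for station_id, commodity_id, *rest in rows:
        if station_id == last_station:
            station_field = ""                      # blank = same station as before
        else:
            station_field = str(station_id)
            last_station, last_commodity = station_id, 0
        delta = commodity_id - last_commodity       # delta from previous commodity
        last_commodity = commodity_id
        yield ",".join([station_field, str(delta), *map(str, rest)])

def decode_rows(lines):
    last_station, last_commodity = None, 0
    for line in lines:
        station_field, delta, *rest = line.split(",")
        if station_field:
            last_station, last_commodity = int(station_field), 0
        last_commodity += int(delta)
        yield (last_station, last_commodity, *rest)

encoded = list(encode_rows([(128, 5, 100), (128, 7, 200), (129, 5, 300)]))
assert list(decode_rows(encoded)) == [(128, 5, "100"), (128, 7, "200"), (129, 5, "300")]
```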
### `{supply,demand}_bracket`

This is presumably the `{supply,demand}_level` indicator. It should just be removed. Alternatively, make its range 0 to 4 instead of -1 to 3, as it appears the vast majority of items have -1 demand, and that's two bytes for no good reason. Granted, compression algorithms will likely handle either case adequately; it's still not great.

### `collected_at`
This appears to be the UNIX epoch date of last update. As the data gets compressed before sending, the duplication of these entries likely gets fairly well handled by the compression. However, at the cost of perhaps extra work to process the file, the datetime could be stored only in the very first item per station, or in a separate file altogether associating station ids to update times, which would save probably around 40MB on disk uncompressed.
Alternatively to much of this, create a binary data format that packs the data as tightly as it can. If applicable, use variable-length integers for supply and demand (you lose 1 bit per byte in exchange for lower values not needing as many bytes; you could do 1 bit per two bytes by using a base of two bytes per variable-length word). Store stations as one large variable-length struct and store its length explicitly, removing the date duplication implicitly. IDs could still be diffed from a baseline or stored implicitly in some ways.
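As a toy example of the variable-length integer idea, here is a LEB128-style encoding (7 data bits per byte plus a continuation bit); this is purely illustrative and not an actual TD format:

```python
# Small supply/demand values (the common case) take 1-2 bytes instead of 4.
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, next_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

packed = encode_varint(300) + encode_varint(0) + encode_varint(1_000_000)
assert decode_varint(packed)[0] == 300
```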
## Server Bandwidth / in-flight data
Running the `eddblink` import causes my system to download around 200MB or so of data every update, though perhaps if I called it every day it'd be less (the one time I called it within 24 hours of the last invocation it was much less greedy). This is mostly from downloading `listings.csv` gzipped.

There's a few things that would help here. Use `zstd` or `xz` to compress the data: `zstd` is very fast and more efficient than `gzip`, while `xz` is slow but compresses very well.

As the size seems to be consistent every time I get `listings.csv`, I'm guessing the whole file is getting sent every time regardless of whether the client has downloaded it before or not, aside from within the same day, where I think `listings-live.csv` comes into play as likely a rolling summary of the day's info. This is pretty inefficient, as I'd guess that, given the game's population, not nearly as many locations change their data in a day as the full `listings.csv` file covers.

If that assumption holds true, in exchange for slightly more storage usage on the server, it would likely make sense to store what are essentially diffs from day to day, as follows:
- `listings.csv` at some master-date point, let's say one from the start of the month.

The update process on the client side would then be:

`xz` level though. And the base starting file for the month.
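As a rough illustration of what that client-side update could look like, here's a sketch assuming a monthly base file plus small per-day diff files whose rows simply override the base; the file names and merge rule are assumptions, not an existing TD/eddblink format:

```python
# Purely illustrative: rebuild listings.csv from a monthly base file plus one
# small diff file per day, where the last row seen for a key wins.
import csv
from pathlib import Path

def apply_daily_diffs(base_path: str, diff_paths: list[str], out_path: str) -> None:
    rows = {}
    for path in [base_path, *diff_paths]:          # later files override earlier ones
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                key = (row[0], row[1])             # (station_id, commodity_id),
                rows[key] = row                    # assuming the id column is dropped
    with open(out_path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows[k] for k in sorted(rows))

# e.g. apply_daily_diffs("listings-2024-06-01.csv",
#                        sorted(str(p) for p in Path(".").glob("diff-2024-06-*.csv")),
#                        "listings.csv")
```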
listings.csv
file. If it's roughly equivalent then pre-made diffs will be less helpful bandwidth wise unless they're diffs from each day to current day, and even then barely.Conclusion
## Conclusion

Whew, this ended up being a real long one. Hope it can be helpful in discussion.