-
Notifications
You must be signed in to change notification settings - Fork 171
Using column average width in estimation
CStoreGetForeignPaths uses the following formula to calculate totalDiskAccessCost:
double queryColumnRatio = (double) queryColumnCount / relationColumnCount;
double queryPageCount = relationPageCount * queryColumnRatio;
double totalDiskAccessCost = seq_page_cost * queryPageCount;
This is the best we can do if we don't have column widths. But ANALYZE command collects information about average width of each column and puts it into pg_statistic catalog table. We can use this information to get better estimations.
-
pg_statistic stores the average width values in the stawidth column. The comments for this column say:
/* * stawidth is the average width in bytes of non-null entries. For * fixed-width datatypes this is of course the same as the typlen, but for * var-width types it is more useful. Note that this is the average width * of the data as actually stored, post-TOASTing (eg, for a * moved-out-of-line value, only the size of the pointer object is * counted). This is the appropriate definition for the primary use of * the statistic, which is to estimate sizes of in-memory hash tables of * tuples. */ int32 stawidth;
-
We can get this value using get_attavgwidth. Comments for this function say:
/* * get_attavgwidth * * Given the table and attribute number of a column, get the average * width of entries in the column. Return zero if no data available. * * Currently this is only consulted for individual tables, not for inheritance * trees, so we don't need an "inh" parameter. * * Calling a hook at this point looks somewhat strange, but is required * because the optimizer calls this function without any other way for * plug-ins to control the result. */ int32 get_attavgwidth(Oid relid, AttrNumber attnum);
-
We don't have toasting in cstore tables, so consulting this value should be fine. We need to double check this.
-
We have one bit of overhead per column for the "exists" values, that exists for both NULL and non-NULL values. If we have lots of NULLs this can be significant. We can get the ratio of nulls using pg_statistic->stanullfrac.