Using column average width in estimation

Description

CStoreGetForeignPaths uses the following formula to calculate totalDiskAccessCost:

double queryColumnRatio = (double) queryColumnCount / relationColumnCount;
double queryPageCount = relationPageCount * queryColumnRatio;
double totalDiskAccessCost = seq_page_cost * queryPageCount;

This is the best we can do if we don't have column widths. But ANALYZE command collects information about average width of each column and puts it into pg_statistic catalog table. We can use this information to get better estimations.

Technical Details

pg_statistic stores the average width values in the stawidth column. The comments for this column say:

/*
 * stawidth is the average width in bytes of non-null entries.  For
 * fixed-width datatypes this is of course the same as the typlen, but for
 * var-width types it is more useful.  Note that this is the average width
 * of the data as actually stored, post-TOASTing (eg, for a
 * moved-out-of-line value, only the size of the pointer object is
 * counted).  This is the appropriate definition for the primary use of
 * the statistic, which is to estimate sizes of in-memory hash tables of
 * tuples.
 */
int32		stawidth;

We can get this value using get_attavgwidth. Comments for this function say:

/*
  * get_attavgwidth
  *
  *	  Given the table and attribute number of a column, get the average
  *	  width of entries in the column.  Return zero if no data available.
  *
  * Currently this is only consulted for individual tables, not for inheritance
  * trees, so we don't need an "inh" parameter.
  *
  * Calling a hook at this point looks somewhat strange, but is required
  * because the optimizer calls this function without any other way for
  * plug-ins to control the result.
  */
 int32
 get_attavgwidth(Oid relid, AttrNumber attnum);

We don't have toasting in cstore tables, so consulting this value should be fine. We need to double check this.
We have one bit of overhead per column for the "exists" values, that exists for both NULL and non-NULL values. If we have lots of NULLs this can be significant. We can get the ratio of nulls using pg_statistic->stanullfrac.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using column average width in estimation

Description

Technical Details

Clone this wiki locally