Remove feature "granularity" and relegate to metadata #17
Comments
This would be for a future major version.
@pradithya @zhilingc any thoughts on how much this would affect core and serving?
I don't see it taking much effort on the serving side to remove the granularity concept. I might need to do a more comprehensive impact analysis if we later commit to this, though. What concerns me more is this:
I feel that having the timestamp as part of the key in the serving store creates at least two problems:
It would be great if we could use the entity ID as the key and maintain the N latest values by bucketing (Redis) or multiple versions (BigTable). We would still keep the timestamp as part of the value to know whether the feature is stale or not.
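A minimal sketch of the Redis bucketing idea above, assuming redis-py; the key layout, value encoding, and choice of N here are hypothetical illustrations, not Feast's actual storage format:

```python
# Sketch: entity ID as the key, keep the N latest values in a Redis sorted set.
# Key layout and value encoding are hypothetical, for illustration only.
import json
import redis

r = redis.Redis()
N_LATEST = 3  # number of versions to retain per entity/feature

def write_feature(entity_id: str, feature: str, value, event_ts: float) -> None:
    key = f"{entity_id}:{feature}"
    # The timestamp stays inside the value so staleness can still be checked on read.
    member = json.dumps({"value": value, "event_timestamp": event_ts})
    r.zadd(key, {member: event_ts})              # score by event time
    r.zremrangebyrank(key, 0, -(N_LATEST + 1))   # trim to the N most recent entries

def read_latest(entity_id: str, feature: str):
    key = f"{entity_id}:{feature}"
    latest = r.zrevrange(key, 0, 0)
    return json.loads(latest[0]) if latest else None
```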
@pradithya How is this any different from what we already have, though? We already do scans in BT; scanning across potentially more granular events won't affect it. We don't require scans for single-key (non-latest) lookups if we insist that the user knows the exact timestamp, e.g. 2019-01-07T00:00:00. There is still the question of whether we even need timestamped features in serving at all, or just in the warehouse. We thought we did, but we have yet to identify a solid use case in ML land. I would be in favour of dropping timestamps in serving completely and just storing by entity key, if we can't find clear ML use cases. With them, we are really providing workarounds for not doing more feature engineering upstream, and perhaps providing better automation and tools there would be a better approach. No timestamps at all in serving would obviously drastically simplify the serving API, and make it very neat for converting into dataframes and other consumables preferred by ML models. @woop thoughts?
Also, the keyspace pollution concern is identical to people using "seconds" as the granularity and putting whatever they want into it.
I don't think the user will ever know the exact timestamp, but is this actually a use case at all? If we only have
Wasn't the discussion about whether we need to store historical feature data for querying in serving? That is different from having timestamped features (which
Do you mean no timestamps at all? Because having the timestamp of a row returned in serving is definitely something we need. How does the user know whether a specific
I wouldn't say it is identical. Even seconds are buckets. Wouldn't using a more granular time actually make it easier to fill up the database if something goes wrong? Overall I agree that this is a problem that is worth looking at. I think we should split it up though:
Do we still want to use
I've renamed this issue from "custom granularities" to "Remove feature "granularity" and relegate to metadata #17". We have reached a consensus that:
Granularities as we know them should be removed. Any window information about a feature should be added as metadata, but we currently don't have a reason for that to be a structured field.
[edit] Granularities are no longer required in FeatureRow or FeatureSpecs, as we have removed history from the serving store and the serving API. Thus there is also no requirement for them in the warehouse store. Additionally, the notion of granularity has proven confusing to end users. The history of the issue is kept below:
I'd like to discuss feature granularities.
What is granularity
Currently we have a fixed set of Feast granularities: {seconds, minutes, hours, days}.
It is not always obvious what the Feast granularity refers to.
In general, a feature is associated with a few different datetimes throughout its lifecycle:
The storage event timestamp is derived by rounding the ingestion event timestamp to the start of the granularity for all the features in a feature row. E.g. for a granularity of 1 hour, we round the ingestion timestamp to the start of the enclosing hour.
For example, say we have a feature that is aggregated over 1 hour fixed windows and triggered every minute. Each minute, an update of the 1 hour window aggregation is provided. We would naturally use a 1 hour granularity for this. The ingestion event timestamp would be within the one hour window; the storage event timestamp would be the start of the window.
Another example: say we have a feature that is aggregated over a 10 minute sliding window, triggered only once at the end of every window. In this case the Feast granularity actually needs to be 1 minute, which can seem confusing.
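To make the rounding concrete, here is a minimal sketch of deriving a storage event timestamp from an ingestion event timestamp and a granularity (an illustration only, not Feast's actual implementation):

```python
from datetime import datetime, timedelta, timezone

def floor_to_granularity(ts: datetime, granularity: timedelta) -> datetime:
    """Round a timestamp down to the start of its enclosing granularity bucket."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    buckets = (ts - epoch) // granularity
    return epoch + buckets * granularity

ingestion_ts = datetime(2019, 1, 7, 14, 37, 12, tzinfo=timezone.utc)
print(floor_to_granularity(ingestion_ts, timedelta(hours=1)))    # 2019-01-07 14:00:00+00:00
print(floor_to_granularity(ingestion_ts, timedelta(minutes=1)))  # 2019-01-07 14:37:00+00:00
```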
Limitations of current approach
Feast rounds the ingested timestamps to the granularity provided at feature creation. This seemed like a convenience, but it hinders the use of custom granularities and can cause confusion.
For example, the granularities are an enum and there is no 5 minute option. If we wanted to store and overwrite a new key every five minutes, we would need to use a finer granularity and manually round the ingestion timestamps to 5 minute marks during feature creation.
Another example: let's say we have a feature called "product.day.sold". As it is updated throughout the day, it could represent the number of products sold so far that day, or just as easily the number of products sold in the last 24 hours at the time it was updated. It could also represent the last 7 days of sold products as it stood on that particular day. Basically, the meaning of this feature is determined by how the feature was created. The feature granularity is not enough information, and it could be misleading when feature creators are forced to work around its limitations.
I suggest that instead of attempting to handle granularities, we require that any rounding of timestamps always happens during feature creation, not within Feast, and that we simply store features against the event timestamp provided.
The problem of how to serve keys if we do not have a fixed granularity is not as bad as it sounds.
Another problem is how we prevent feature creators from polluting a key space with overly granular timestamps. We will have this problem regardless, as a feature creator can always use the "seconds" granularity.
My proposal
We would be committing to a requirement that timely short scans across a key range are supported by all stores.
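As an illustration of what such a short scan could look like against Bigtable, here is a sketch assuming a hypothetical row key of entity ID plus reverse timestamp, so the most recent row sorts first; this is not Feast's actual serving schema:

```python
from google.cloud import bigtable

# Hypothetical row-key layout: "<entity_id>#<reverse_timestamp>", so the newest
# row for an entity sorts first and a bounded, limit-1 scan returns it quickly.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("feature_serving")

def latest_row(entity_id: str):
    prefix = f"{entity_id}#".encode()
    # Short scan restricted to one entity's key range; the limit keeps it cheap.
    rows = table.read_rows(start_key=prefix, end_key=prefix + b"\xff", limit=1)
    return next(iter(rows), None)
```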
Benefits
What do people think?
Is there an issue with serving I have missed?