
Remove feature "granularity" and relegate to metadata #17

Closed
tims opened this issue Dec 26, 2018 · 8 comments
Labels
kind/feature New feature or request

Comments

@tims
Contributor

tims commented Dec 26, 2018

[edit] Granularities are no longer required in FeatureRow or FeatureSpecs, as we have removed history from the serving store and the serving API. Thus there is also no requirement for it to be in the warehouse store. Additionally, the notion of granularity has proven to be confusing to end users. The history of the issue is kept below:

I'd like to discuss feature granularities.

What is granularity

Currently we have a fixed set of Feast granularities: {seconds, minutes, hours, days}.
It is not always obvious what the Feast granularity refers to.

In general, a feature is associated with a few different datetimes throughout its lifecycle:

  • the window duration of an aggregation (this is upstream of Feast)
  • the trigger frequency at which an event is emitted per key, likely irregular if more than once per window (this is upstream of Feast)
  • the ingestion event timestamp that Feast receives during ingestion, determined by the feature creator
  • the storage event timestamp used to store and retrieve features in Feast, determined by Feast.

The storage event timestamp is derived by rounding the ingestion event timestamp down to the start of the granularity bucket for all the features in a feature row. E.g. for a granularity of 1 hour, we round the ingestion timestamp to the start of the enclosing hour.

For example, say we have a feature that is aggregated over 1 hour fixed windows and triggered every minute. Each minute an update of the 1 hour window aggregation is provided. We would naturally use a 1 hour granularity for this. The ingestion event timestamp should be within the one hour window, and the storage event timestamp would be the start of the window.
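As a minimal sketch of the rounding described above (Python is used only for illustration; the helper name and signature are not Feast's actual implementation):

```python
from datetime import datetime, timezone

def round_to_granularity_start(ingestion_ts: datetime, granularity_seconds: int) -> datetime:
    """Round an ingestion event timestamp down to the start of its granularity bucket."""
    epoch_seconds = int(ingestion_ts.timestamp())
    bucket_start = epoch_seconds - (epoch_seconds % granularity_seconds)
    return datetime.fromtimestamp(bucket_start, tz=timezone.utc)

# With a 1 hour granularity, an event ingested at 14:37:12 is stored against 14:00:00.
ingestion_ts = datetime(2019, 1, 7, 14, 37, 12, tzinfo=timezone.utc)
print(round_to_granularity_start(ingestion_ts, 3600))  # 2019-01-07 14:00:00+00:00
```

The same rounding works for any bucket width, e.g. 300 seconds for the 5 minute case discussed below, which is essentially what a feature creator has to do upstream today when the enum has no matching option.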

Another example: say we have a feature that is aggregated over a 10 minute sliding window and triggered only once at the end of every window. In this case, the Feast granularity actually needs to be 1 minute, which can seem confusing.

Limitations of current approach

Feast rounds the ingested timestamps to a granularity provided at creation. This seemed like a convenience, but it hinders the use of custom granularities and can cause confusion.

For example, the granularities are an enum and there is no 5 minute option. If we wanted to store and overwrite a new key every five minutes, we would need to use a finer granularity and manually round the ingestion timestamps to the 5 minute marks during feature creation.

Another example: let's say we have a feature called "product.day.sold". As it is updated throughout the day, it could represent the number of products sold on that day so far, or just as easily it could represent the number of products sold in the last 24 hours at the time it was updated. It could also represent the last 7 days of sold products as it stood on that particular day. Basically, the meaning of this feature is determined by how the feature was created. The feature granularity is not enough information, and could be misleading when feature creators are forced to work around its limitations.

I suggest that instead of attempting to handle granularities, we should require that any rounding of timestamps happens during feature creation, not within Feast, and simply store features against the event timestamp provided.

The problem of how to serve keys if we do not have a fixed granularity is not as bad as it sounds.

  • firstly, it is only an issue when a feature is requested across a time range, not "latest", and "latest" is the most common request.
  • secondly, our currently supported stores, BigTable and Redis, both support scans across key date ranges (Redis via our bucketing approach).
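For a rough idea of why such range scans stay cheap, here is a hypothetical sketch of bucketed keys (this is not Feast's actual Redis key schema, only an illustration of the bucketing idea):

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # hypothetical bucket width

def bucket_keys(entity_name: str, entity_id: str, start: datetime, end: datetime) -> list:
    """Enumerate the bucketed keys covering [start, end], so a time-range query
    becomes a handful of point lookups instead of a full keyspace scan."""
    first = int(start.timestamp()) // BUCKET_SECONDS
    last = int(end.timestamp()) // BUCKET_SECONDS
    return [f"{entity_name}:{entity_id}:{bucket}" for bucket in range(first, last + 1)]

print(bucket_keys("customer", "42",
                  datetime(2019, 1, 7, 10, 0, tzinfo=timezone.utc),
                  datetime(2019, 1, 7, 13, 0, tzinfo=timezone.utc)))
# prints four keys, one per hour bucket between 10:00 and 13:00 inclusive
```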

Another problem is how we prevent feature creators from polluting a key space with overly granular timestamps. We will still have this problem regardless, as a feature creator can always use the "seconds" granularity.

My proposal

  • The storage event timestamp should be the same thing as ingestion event timestamp.
  • We should drop granularity from FeatureRow and ignore it for ingestion and storage purposes.
  • We should drop the requirement that granularity is part of the featureId. So instead of {entityName}.{granularity}.{featureName}, it should just be {entityName}.{featureName} (a small illustration follows after this list).
  • BigQuery tables (which are currently separated by granularity) should instead be separated by a feature's group.
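As a small illustration of the id change mentioned above (the entity and feature names are made up):

```python
# Current format: {entityName}.{granularity}.{featureName}
old_id = "customer.day.total_purchases"

# Proposed format: {entityName}.{featureName}, with granularity dropped entirely
entity_name, _granularity, feature_name = old_id.split(".")
new_id = f"{entity_name}.{feature_name}"
print(new_id)  # customer.total_purchases
```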

We would be committing to a requirement that timely short scans across a key range are supported by all stores.

Benefits

  • An easier to understand data model.
  • Enables storing at custom granularities.
  • Simplified code

What do people think?
Is there an issue with serving I have missed?

@tims tims added the kind/feature New feature or request label Dec 26, 2018
@tims
Contributor Author

tims commented Dec 26, 2018

This would be for a future major version

@tims
Contributor Author

tims commented Dec 31, 2018

@pradithya @zhilingc any thoughts on how much this would affect core and serving?

@pradithya
Collaborator

I don't see that it will be much effort on the serving side to remove the granularity concept. I might need to do a more comprehensive impact analysis, though, if we later commit to this.

I am more concerned about this, though:

> Another problem is how we prevent feature creators from polluting a key space with overly granular timestamps. We will still have this problem regardless, as a feature creator can always use the "seconds" granularity.

I feel that having the timestamp as part of the key in the serving store creates at least two problems:

  1. the keyspace becomes very large --> we quickly run out of space in limited storage like Redis
  2. requires a scan --> slower compared to a point query

It would be great if we could use the entity ID as the key and maintain the N latest values by bucketing (Redis) or multiple versions (BigTable). We would still keep the timestamp as part of the value to know whether the feature is stale.
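For example, the Redis side of this could be approximated with a sorted set per entity and feature, scored by event timestamp and trimmed to the N most recent entries. This is a hypothetical sketch using the redis-py client, not the actual serving-store implementation:

```python
import json
import redis  # assumes the redis-py client is available

N_LATEST = 5  # hypothetical retention of the N most recent values per key

def write_feature(r: redis.Redis, entity_id: str, feature_name: str,
                  event_ts: int, value: float) -> None:
    """Keep only the N latest values per entity/feature key; the timestamp stays
    inside the stored value so consumers can check staleness without a scan."""
    key = f"{entity_id}:{feature_name}"
    payload = json.dumps({"ts": event_ts, "value": value})
    r.zadd(key, {payload: event_ts})             # score members by event timestamp
    r.zremrangebyrank(key, 0, -(N_LATEST + 1))   # drop everything older than the N latest

def read_latest(r: redis.Redis, entity_id: str, feature_name: str) -> dict:
    """Point lookup of the most recent value for an entity/feature key."""
    newest = r.zrevrange(f"{entity_id}:{feature_name}", 0, 0)
    return json.loads(newest[0]) if newest else {}
```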

@tims
Contributor Author

tims commented Jan 7, 2019

@pradithya How is this any different to what we already have though?

We already do scans in BigTable; scanning across potentially more granular events won't affect it.
For Redis, we scan via the bucketized indexes. That will be the same too.
Latest in both cases is still served via its own key without a timestamp.

We don't require scans for a single key (non-latest) lookup if we insist that the user knows the exact timestamp, e.g. 2019-01-07T00:00:00.
This would push the pain back on the client, who would need to know what to expect in the feature. I think we should do this for its simplicity. We could potentially add sugar around it in the client by storing a notion of granularity as metadata, but I think we shouldn't do this first.

There is still the discussion of whether we even need timestamped features at all in serving, or just in the warehouse. We thought we did, but we have yet to identify a solid use case in ML land.

I would be in favour of dropping timestamps in serving completely and just storing by entity key, if we can't find clear ML use cases. With them, we are sort of providing workarounds for not doing more feature engineering upstream, and perhaps providing better automation and tools there would be a better approach.

No timestamps at all in serving would obviously drastically simplify the serving API, and make it very neat for converting into dataframes and other preferred consumables for ML models.

@woop thoughts?

@tims
Contributor Author

tims commented Jan 7, 2019

Also, the keyspace pollution issue is identical to people using the "seconds" granularity and just putting whatever they want into it.

@woop
Member

woop commented Jan 7, 2019

> @pradithya How is this any different to what we already have though?
>
> We already do scans in BigTable; scanning across potentially more granular events won't affect it.
> For Redis, we scan via the bucketized indexes. That will be the same too.
> Latest in both cases is still served via its own key without a timestamp.
>
> We don't require scans for a single key (non-latest) lookup if we insist that the user knows the exact timestamp, e.g. 2019-01-07T00:00:00.
> This would push the pain back on the client, who would need to know what to expect in the feature. I think we should do this for its simplicity. We could potentially add sugar around it in the client by storing a notion of granularity as metadata, but I think we shouldn't do this first.

I don't think the user will ever know the exact timestamp, but maybe this isn't actually a use case at all? If we only have latest, a user can still make features which are buckets for the last x time, so there are workarounds.

> There is still the discussion of whether we even need timestamped features at all in serving, or just in the warehouse. We thought we did, but we have yet to identify a solid use case in ML land.

Wasn't the discussion on whether we need to store historic feature data for querying in serving? That is different from having timestamped features (which latest would be a part of).

> I would be in favour of dropping timestamps in serving completely and just storing by entity key, if we can't find clear ML use cases. With them, we are sort of providing workarounds for not doing more feature engineering upstream, and perhaps providing better automation and tools there would be a better approach.

> No timestamps at all in serving would obviously drastically simplify the serving API, and make it very neat for converting into dataframes and other preferred consumables for ML models.

Do you mean no timestamps at all? Because having the timestamp of a row returned in serving is definitely something we need. How does the user know whether a specific latest key is stale without seeing the timestamp?

@woop
Member

woop commented Jan 7, 2019

> Also, the keyspace pollution issue is identical to people using the "seconds" granularity and just putting whatever they want into it.

I wouldn't say it is identical. Even seconds are buckets. Wouldn't using a more granular time actually make it easier to fill up the database if something goes wrong?

Overall I agree that this is a problem that is worth looking at. I think we should split it up though:

  1. Force granularity definition upstream as it is essentially a transformation: I kind of like this idea. We ideally don't want to change the event timestamp. So we are basically just validating features and then writing them in the correct place.
  2. Add support for custom granularities (this could happen now with a field for specifying custom granularities): I personally still think granularities are useful as part of feature specifications, but maybe not as a major concept. What if the granularity was simply an option used during validation? Then we could reject features coming in with the wrong granularity (a validation sketch follows after this list).
  3. Remove granularities from the feature id and rethink how features are stored: I absolutely agree that this has to happen. I really dislike granularities as part of feature ids. However, we also can't allow just having a single entity table with infinite columns. The grouping is essential, which I assume is why you brought up feature groups. I think we should explore this separately and in more detail. The reason why this is so important is because users will often dive into the database to explore feature data, and they would want data to be logically grouped. So for example you would want to go to a customer_profile table to find profile features (split is [entity]_[grouping]).
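A possible reading of option 2, where a declared granularity is only an optional validation rule applied at ingestion rather than something Feast rounds to (a hypothetical sketch; the field and function names are not part of any existing spec):

```python
from typing import Optional

def aligns_with_granularity(event_ts_seconds: int, granularity_seconds: Optional[int]) -> bool:
    """If the feature spec declares a granularity, only accept rows whose event
    timestamp falls exactly on a bucket boundary; otherwise accept everything."""
    if granularity_seconds is None:
        return True
    return event_ts_seconds % granularity_seconds == 0

# Under a declared 5 minute (300 s) granularity, a row at 14:35:00 passes,
# while a row at 14:37:12 would be rejected during ingestion validation.
```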

Do we still want to use feature groups as a base for features to inherit from, or would it make more sense to use them to logically group incoming features for storage?

@tims tims changed the title from "Custom granularities" to "Remove granularities and move them to metadata" Feb 15, 2019
@tims
Contributor Author

tims commented Feb 15, 2019

I've renamed this issue from "Custom granularities" to "Remove feature "granularity" and relegate to metadata".

We have reached a consensus that:

Granularities as we know them should be removed.

Any window information about a feature should be added as metadata, but we currently don't have a reason for that to be a structured field.

@tims tims changed the title from "Remove granularities and move them to metadata" to "Remove feature "granularity" and relegate to metadata" Feb 15, 2019
dmartinol pushed a commit to dmartinol/feast that referenced this issue Jun 24, 2024