
Historical retrieval without an entity dataframe #1611

Open
woop opened this issue Jun 1, 2021 · 20 comments
Labels
Community Contribution Needed · good first issue · keep-open · kind/feature · kind/project · priority/p1

Comments

@woop
Member

woop commented Jun 1, 2021

Is your feature request related to a problem? Please describe.
The current Feast get_historical_features() method requires that users provide an entity dataframe, as follows:

training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_profile:rating',
    ],
)

However, many users would like the feature store to provide entities to them for training, instead of having to query or provide entities as part of the entity dataframe.

Describe the solution you'd like
Allow users to specify an existing feature view from which an entity dataframe will be queried.

training_df = store.get_historical_features(
    entity_df="drivers_activity",
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_profile:rating',
    ],
)

The same call, with the addition of time-range filtering:

training_df = store.get_historical_features(
    entity_df="drivers_activity",
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_profile:rating',
    ],
    from_date=(today - timedelta(days=7)),  # assumes: from datetime import datetime, timedelta; today = datetime.now()
    to_date=datetime.now(),
)
@tedhtchang
Contributor

training_df = store.get_historical_features(
    left_table="drivers_activity",
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_activity:rating',
    ],
)

Does this mean the resulting training_df would contain every row from the drivers_activity view (but only the driver_id, event_timestamp, trips_today, and rating columns)?

@woop
Member Author

woop commented Jun 7, 2021

training_df = store.get_historical_features(
    left_table="drivers_activity",
    feature_refs=[
        'drivers_activity:trips_today',
        'drivers_activity:rating',
    ],
)

Does this mean the resulting training_df would contain every row from the drivers_activity view (but only the driver_id, event_timestamp, trips_today, and rating columns)?

Actually my example was poor, so I've modified it to show that we can query multiple feature views. Essentially, it works like this: we still query the entity_df for all entities, but it can now be an existing feature view. We would query it only for timestamps and entity columns; features then get joined onto those rows as usual.
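
For illustration, the entity rows derived from the feature view might conceptually look like this (a hypothetical sketch; the table and column names are assumptions, not Feast internals):

# Hypothetical: entity rows pulled from the "drivers_activity" view's source.
entity_query = """
    SELECT driver_id, event_timestamp
    FROM drivers_activity_source
"""
# Features from drivers_activity and drivers_profile are then point-in-time
# joined onto these (driver_id, event_timestamp) rows as usual.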

@HeardACat
Contributor

HeardACat commented Jul 2, 2021

Should there also be an option to "keep latest" only, when used in conjunction with the time-range filtering?
Otherwise it's quite possible that the underlying entity dataframe could have duplicated entity keys.

The use case I have in mind is backtesting.

@woop
Member Author

woop commented Jul 2, 2021

Should there also be an option to "keep latest" only, when used in conjunction with the time-range filtering?
Otherwise it's quite possible that the underlying entity dataframe could have duplicated entity keys.

The use case I have in mind is backtesting.

Do you mean entity row or entity key? https://docs.feast.dev/concepts/data-model-and-concepts#entity-row

So you would not want to return features with the same entity key over different dates?

@HeardACat
Contributor

HeardACat commented Jul 2, 2021

I was thinking entity key, but only as an option - there are use cases for both behaviours.

For example, if our machine learning deployment is a daily batch job, then for back-testing we might call get_historical_features(from_date=my_date - timedelta(days=1), to_date=my_date), where my_date is the timestamp that simulates when our machine learning job "would run" each day.

Though this raises a good question about how this kind of workflow should be productionised. E.g. if I have an hourly/daily batch job that goes through our whole customer base to find fraudulent customers, how should this work in Feast? We wouldn't really use the online store for this, and the API could look something like:

my_daily_batch_scoring_df = store.get_historical_features(
    entity_df="my_df",
    feature_refs=[...],
    latest=True,
    from_date=(today - timedelta(days=1)),
    to_date=datetime.now(),
)

Probably a discussion for another thread...
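
(For concreteness, the "keep latest" deduplication being discussed amounts to something like the following pandas sketch; the data and column names are illustrative only:)

import pandas as pd

# Hypothetical illustration: with "keep latest", duplicated entity keys within
# the time window collapse to their most recent event_timestamp.
rows = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2021-07-01", "2021-07-02", "2021-07-01"]),
})
latest = rows.sort_values("event_timestamp").groupby("driver_id", as_index=False).last()
# latest now holds one row per driver_id: (1, 2021-07-02) and (2, 2021-07-01)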

@HeardACat
Contributor

HeardACat commented Jul 2, 2021

Can I give this a go and raise a PR for the file-based offline store only?

I'll stick to the spec as written(?), though I noticed elsewhere in the repo that the nomenclature used is start_date and end_date - should we align to that rather than from_date and to_date? https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/file.py#L219-L220
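
(For reference, a rough paraphrase of the offline-store signature that uses that nomenclature - see the linked file for the authoritative version:)

# Paraphrased sketch, not verbatim from the repo.
def pull_latest_from_table_or_query(
    config,
    data_source,
    join_key_columns,
    feature_name_columns,
    timestamp_field,
    created_timestamp_column,
    start_date,  # existing nomenclature
    end_date,
): ...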

@woop
Member Author

woop commented Jul 5, 2021

I was thinking entity key, but only as an option - there are use cases for both behaviours.

For example, if our machine learning deployment is a daily batch job, then for back-testing we might call get_historical_features(from_date=my_date - timedelta(days=1), to_date=my_date), where my_date is the timestamp that simulates when our machine learning job "would run" each day.

Though this raises a good question about how this kind of workflow should be productionised. E.g. if I have an hourly/daily batch job that goes through our whole customer base to find fraudulent customers, how should this work in Feast? We wouldn't really use the online store for this, and the API could look something like:

my_daily_batch_scoring_df = store.get_historical_features(
    entity_df="my_df",
    feature_refs=[...],
    latest=True,
    from_date=(today - timedelta(days=1)),
    to_date=datetime.now(),
)

Probably a discussion for another thread...

I can see the value in this. In fact, some other folks have asked for it as well. Would you mind creating a new issue and linking back to this one? I think it's worth a separate discussion - specifically, the need for a latest-only argument in get_historical_features().

@woop
Member Author

woop commented Jul 5, 2021

Can I give this a go and raise a PR for the file-based offline store only?

I'll stick to the spec as written(?), though I noticed elsewhere in the repo that the nomenclature used is start_date and end_date - should we align to that rather than from_date and to_date? https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/file.py#L219-L220

You can give it a go, but we probably won't release it until we have support for all our main stores. Perhaps a better middle ground is to add a new method to the FeatureStore class, have it raise a NotImplementedError for the other stores, and specifically print warnings that this functionality is experimental and will change.
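
A minimal sketch of that middle ground (the method name and capability check are hypothetical, not an actual Feast API):

import warnings

def get_historical_features_from_view(store, entity_view, features):
    # Hypothetical method: warn that the behaviour is experimental.
    warnings.warn(
        "Entity-less historical retrieval is experimental and will change.",
        RuntimeWarning,
    )
    if not _supports_entityless_retrieval(store):  # hypothetical capability check
        raise NotImplementedError(
            "Entity-less retrieval is currently only implemented for the file offline store."
        )
    # Build the entity dataframe from `entity_view`, then delegate to
    # store.get_historical_features(...) as usual.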

@HeardACat
Contributor

Sounds good, hopefully I'll pull something together "soon". I'll name the method something sensible as well.

@MattDelac
Collaborator

Hi there 👋,

As I already explained to Willem, we built a higher-level API on our side to make life easier for our users.

It basically does the following:

from datetime import date, datetime
from typing import List, Optional, Union

import pandas as pd

from feast import FeatureStore
from feast.infra.offline_stores.bigquery import BigQueryRetrievalJob

def get_historical_features(
    feature_refs: List[str],
    threshold: Optional[Union[datetime, date]] = None,
    sample_size: int = 1000,
    left_feature_view: Optional[Union[pd.DataFrame, str]] = None,
    full_feature_names: bool = False,
) -> BigQueryRetrievalJob:
    # If all the features come from the same FeatureView then we infer the `left_feature_view` parameter

    # We get the unique_join_keys in order to remove duplicate data if it exists
    # It's more or less the following
    query = f"""
        SELECT
            {', '.join(unique_join_keys)},
            TIMESTAMP '{str_timestamp}' AS {timestamp_column}
        FROM {source_table}
        {where_clause}
        GROUP BY {', '.join(unique_join_keys)}
        {limit_clause}
    """
    # The limit_clause only exists if we want a sample of the left FeatureView

    store = FeatureStore()
    # We build the query for our users and pass it to Feast
    return store.get_historical_features(
        entity_df=query,
        feature_refs=feature_refs,
        full_feature_names=full_feature_names,
    )
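
A caller might then use the wrapper roughly like this (the values and view name are illustrative):

from datetime import date

job = get_historical_features(
    feature_refs=["drivers_activity:trips_today", "drivers_profile:rating"],
    threshold=date(2021, 7, 1),
    sample_size=1000,
    left_feature_view="drivers_activity",
)
training_df = job.to_df()  # retrieval jobs materialize to pandas via to_df()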

Happy to have a chat about a similar API implemented in Feast

@stale

stale bot commented Nov 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Nov 14, 2021
@stale stale bot closed this as completed Nov 21, 2021
@adchia adchia added keep-open and removed wontfix This will not be worked on labels Nov 21, 2021
@adchia adchia reopened this Nov 22, 2021
@lara-marinelli

Any workaround for the problem of getting "training datasets" (without passing entity IDs)?

@fcas

fcas commented Mar 29, 2022

Any workaround for the problem of getting "training datasets" (without passing entity IDs)?

@HeardACat
Contributor

HeardACat commented Mar 31, 2022 via email

@adchia adchia added kind/project A top level project to be tracked in GitHub Projects good first issue Good for newcomers labels May 24, 2022
@adchia adchia moved this to Todo in Feast Roadmap May 24, 2022
@adchia adchia added the Community Contribution Needed We want community to contribute label May 24, 2022
@cpatrickalves

Hi there, could you please show an example of how to use pull_latest_from_table_or_query?

@aryndin9999

aryndin9999 commented Nov 21, 2022

Here is an example for Spark. It doesn't handle multiple sources or FeatureViews...

from datetime import datetime

import feast
from feast.infra.offline_stores.contrib.spark_offline_store.spark import SparkOfflineStore

fs = feast.FeatureStore(repo_path="/home/feast/feast_repo/large_foal/feature_repo")

feast_features = [
    "crim",
    "zn",
    "indus",
]

srj_latest = SparkOfflineStore.pull_latest_from_table_or_query(
    config=fs.config,
    data_source=fs.get_data_source("boston_source"),
    join_key_columns=["entity_id"],
    feature_name_columns=feast_features,
    timestamp_field="update_ts",
    created_timestamp_column="create_ts",
    start_date=datetime(2022, 11, 20),
    end_date=datetime(2022, 11, 21),
)
srj_latest.to_spark_df().show()

@haowu2651

Hello,
Is this implemented already? I use a parquet file as the source and want to retrieve historical features within a time range, without defining an entity df with event_timestamp. Is this possible, and how do I do it?

Thanks

@TalalZafar88

Hi everyone,
Is this implemented or not? I have built something similar to this.
We can also achieve this functionality so that entity IDs do not need to be passed.

@lokeshrangineni
Contributor

@TalalZafar88 - This is not yet implemented. Do you mind sharing your solution? If possible, would you be able to open a PR for it?

@franciscojavierarceo
Member

@TalalZafar88 if your code is available in a public fork, I can contribute it back to Feast.
