
Ecom-usecase-with-MongoDB-and-BigQuery

Machine Learning Models

User Clustering

Loyalty programs often categorize users into discrete segments based on the user’s purchase behavior and site engagement, such as Platinum / Gold / Silver / Bronze tiers.

In this example, we showcase how an ML-driven clustering model can be applied to segment users. K-means is an unsupervised learning technique for identifying customer segments, so model training requires neither labels nor a train/evaluation split of the data.

Step 1: Cluster users based on the following attributes

The following attributes can reliably predict user behavior and can be used to cluster users into various loyalty levels:

Session count
Total time spent
Average order value
No. of orders

Preprocess the training data:

WITH ViewStats AS (
  SELECT
    user_pseudo_id,
    COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')) AS session_count,
    SUM((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec')) AS total_time_spend_by_user_in_msec
  FROM
    -- Replace table name.
    ecommerce_public.events_flat_data
  GROUP BY 1
  ORDER BY total_time_spend_by_user_in_msec DESC
),
OrderStats AS (
  SELECT
    user_pseudo_id,
    AVG(items.price) AS average_order_value,
    COUNT(*) AS no_of_orders
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase'
  GROUP BY 1
  ORDER BY no_of_orders DESC
),
-- Debug: SELECT * FROM OrderStats
UserStats AS (
  SELECT v.*, o.average_order_value, o.no_of_orders
  FROM ViewStats v
  FULL OUTER JOIN OrderStats o ON v.user_pseudo_id = o.user_pseudo_id
  ORDER BY o.no_of_orders DESC
)
SELECT * EXCEPT(user_pseudo_id) FROM UserStats

image

Step 2: Create a k-means model

CREATE OR REPLACE MODEL ecommerce_public.user_clusters
OPTIONS (
  model_type = 'kmeans',
  kmeans_init_method = 'KMEANS++',
  num_clusters = 4
) AS
WITH ViewStats AS (
  SELECT
    user_pseudo_id,
    COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')) AS session_count,
    SUM((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec')) AS total_time_spend_by_user_in_msec
  FROM
    -- Replace table name.
    ecommerce_public.events_flat_data
  GROUP BY 1
  ORDER BY total_time_spend_by_user_in_msec DESC
),
OrderStats AS (
  SELECT
    user_pseudo_id,
    AVG(items.price) AS average_order_value,
    COUNT(*) AS no_of_orders
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase'
  GROUP BY 1
  ORDER BY no_of_orders DESC
),
-- Debug: SELECT * FROM OrderStats
UserStats AS (
  SELECT v.*, o.average_order_value, o.no_of_orders
  FROM ViewStats v
  FULL OUTER JOIN OrderStats o ON v.user_pseudo_id = o.user_pseudo_id
  ORDER BY o.no_of_orders DESC
)
SELECT * EXCEPT(user_pseudo_id) FROM UserStats

image
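Before using the model for prediction, it can be sanity-checked. A minimal sketch (not part of the original walkthrough): for a k-means model, ML.EVALUATE needs no input data and reports the Davies-Bouldin index and the mean squared distance to the assigned centroid.

-- Evaluate the clustering quality of the trained k-means model
SELECT *
FROM ML.EVALUATE(MODEL ecommerce_public.user_clusters)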

Step 3: Use the ML.PREDICT function to predict a user’s cluster

Query:

WITH ViewStats AS (
  SELECT
    user_pseudo_id,
    COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')) AS session_count,
    SUM((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec')) AS total_time_spend_by_user_in_msec
    -- ecommerce.getAverageOrderValue((userId)) AS average_order_value,
    -- ecommerce.getNoOfOrders((userId)) AS no_of_orders,
  FROM
    -- Replace table name.
    ecommerce_public.events_flat_data
  GROUP BY 1
  ORDER BY total_time_spend_by_user_in_msec DESC
),
OrderStats AS (
  SELECT
    user_pseudo_id,
    AVG(items.price) AS average_order_value,
    COUNT(*) AS no_of_orders
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase'
  GROUP BY 1
  ORDER BY no_of_orders DESC
),
-- Debug: SELECT * FROM OrderStats
UserStats AS (
  SELECT v.*, o.average_order_value, o.no_of_orders
  FROM ViewStats v
  FULL OUTER JOIN OrderStats o ON v.user_pseudo_id = o.user_pseudo_id
  ORDER BY o.no_of_orders DESC
)
SELECT * EXCEPT(nearest_centroids_distance)
FROM ML.PREDICT(
  MODEL ecommerce_public.user_clusters,
  (SELECT * FROM UserStats WHERE user_pseudo_id = '5053424.7664335544')
)

Output

image

Step 4: Use your model to make data-driven decisions

image

Inference

Cluster 1 has the most orders, the highest session count, and the most time engaged, making these users premium customers. We can offer them discounts to keep them buying. Cluster 2 has the highest average order value but very few orders; these are potential customers who tend to spend more and are very selective in their purchases. We can recommend highly rated products so that they buy more, increasing their order count. They also have low engagement time, so we can send advertising messages to increase engagement and thereby the number of orders. Cluster 3 has more orders than cluster 2 but a very low session count; here again we can send messages to increase user engagement. Cluster 4 has a very low session count as well as a low number of orders; this segment can be characterized as our bronze segment.
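To back this interpretation with concrete numbers, the cluster centroids can be inspected directly. A minimal sketch, assuming the user_clusters model created above (ML.CENTROIDS returns one row per centroid and feature):

-- List each centroid's value for every clustering feature
SELECT centroid_id, feature, ROUND(numerical_value, 2) AS value
FROM ML.CENTROIDS(MODEL ecommerce_public.user_clusters)
ORDER BY centroid_id, feature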

Product Recommendation

A common use case for eCommerce is product recommendations. Customers who are presented with contextual, personalized products tend to have higher average order values. Most retailers showcase "Customers who bought this also bought these" during the purchase process, which helps the user make quick purchase decisions based on recommendations. For contextual personalization, it is important that personalization also takes into account the context – which products the customer has been viewing recently. Other promotions, such as "10% off Product X", can be targeted to the customer based on the interest shown in a particular category or product.

Collaborative filtering is a method of making automatic predictions about the interests of a user by collecting preferences or taste information from many users. This example uses matrix factorization for creating a recommendation engine.

Matrix factorization

Matrix factorization (collaborative filtering) is one of the most common and effective methods of creating recommendation systems.

There are two types of matrix factorization based on the kind of user feedback:

  1. Explicit – with explicit feedback, the dataset must indicate a user’s preference for a product, such as a rating between 1 and 5 stars.

  2. Implicit – implicit preferences for the user must be derived from their purchase behavior (a minimal sketch contrasting the two options follows this list).
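For comparison, a minimal sketch of an explicit-feedback model is shown below. The ecommerce_public.product_ratings table and its star_rating column are hypothetical placeholders; this repository instead derives an implicit rating, as described in the following sections.

CREATE OR REPLACE MODEL ecommerce_public.product_ratings_model
OPTIONS (
  model_type = 'matrix_factorization',
  feedback_type = 'explicit',   -- ratings are taken at face value (e.g. 1-5 stars)
  user_col = 'user_pseudo_id',
  item_col = 'item_name',
  rating_col = 'star_rating'
) AS
SELECT user_pseudo_id, item_name, star_rating
FROM ecommerce_public.product_ratings  -- hypothetical table of explicit user ratings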

Pre-process the data into training data

To train a matrix factorization model, you need a table with a userId, an itemId, and the products viewed, derived from the click-stream events.

Recommendations

Product affinity – based on products bought together historically (implicit feedback)

Step 1: Preprocess the training data. Query:

WITH UniqueUsersList AS (
  -- Get all unique users
  SELECT DISTINCT(user_pseudo_id) FROM ecommerce_public.events_flat_data
),
UserPurchaseEvents AS (
  -- Get all purchase orders
  SELECT user_pseudo_id, items.item_name
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase'
  GROUP BY 1, 2
),
ProductBoughtTogetherHistorically AS (
  -- Get all items which were bought together
  SELECT user_pseudo_id AS prevUserId, STRING_AGG(items.item_name, ', ') AS productsBoughtTogether
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase' AND ecommerce.unique_items > 1
  GROUP BY 1
),
-- Debug: SELECT * FROM ProductBoughtTogetherHistorically
UserProductRecommendations AS (
  -- For every item purchased by a user, prepare the list of other items bought together
  SELECT user_pseudo_id, cust_prev_purchase, SPLIT(productsBoughtTogether, ',') AS product
  FROM (
    SELECT b.*, a.*
    FROM ProductBoughtTogetherHistorically a,
         (SELECT user_pseudo_id, item_name AS cust_prev_purchase
          FROM UserPurchaseEvents
          WHERE user_pseudo_id IN (SELECT user_pseudo_id FROM UniqueUsersList)) b
    -- Compare each item in an order with orders bought historically and build the list of items bought together
    WHERE a.productsBoughtTogether LIKE CONCAT('%', CAST(b.cust_prev_purchase AS STRING), '%')
      -- Skip rows where the compared historical order belongs to the same user
      AND user_pseudo_id != prevUserId
  )
),
-- Debug: SELECT * FROM UserProductRecommendations
UserProductMatrix AS (
  SELECT UserProductRecommendations.user_pseudo_id, product_recommended,
         ecommerce_public.getratingbasedonProduct(TRIM(product_recommended)) AS rating
  FROM UserProductRecommendations
  CROSS JOIN UNNEST(UserProductRecommendations.product) AS product_recommended
  -- Remove the case where the recommended item equals the previously purchased item
  WHERE TRIM(product_recommended) <> TRIM(UserProductRecommendations.cust_prev_purchase)
  GROUP BY user_pseudo_id, product_recommended
  ORDER BY user_pseudo_id
)
SELECT * FROM UserProductMatrix l
WHERE NOT EXISTS (
  -- Drop recommendations for items the user has already bought
  SELECT 1 FROM UserPurchaseEvents
  WHERE user_pseudo_id = l.user_pseudo_id AND TRIM(item_name) = TRIM(l.product_recommended)
)
ORDER BY user_pseudo_id

Function used:

CREATE FUNCTION ecommerce_public.getratingbasedonProduct(product STRING)
AS ((
  SELECT SUM(CAST(items.quantity AS FLOAT64))
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase' AND items.item_name = product
))
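As a quick check, the function can be invoked directly; the product name below is only an illustrative placeholder, not a value taken from this dataset.

-- Hypothetical spot-check of the rating UDF for a single item
SELECT ecommerce_public.getratingbasedonProduct('Google Navy Speckled Tee') AS rating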

Training data

image

Step 2: Create a BigQuery ML model

Query:

Create Model

#standardSQL
CREATE OR REPLACE MODEL ecommerce_public.product_affinity
OPTIONS (
  model_type = 'matrix_factorization',
  feedback_type = 'implicit',
  user_col = 'user_pseudo_id',
  item_col = 'product_recommended',
  rating_col = 'rating',
  l2_reg = 30,
  num_factors = 15
) AS
WITH UniqueUsersList AS (
  -- Get all unique users
  SELECT DISTINCT(user_pseudo_id) FROM ecommerce_public.events_flat_data
),
UserPurchaseEvents AS (
  -- Get all purchase orders
  SELECT user_pseudo_id, items.item_name
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase'
  GROUP BY 1, 2
),
ProductBoughtTogetherHistorically AS (
  -- Get all items which were bought together
  SELECT user_pseudo_id AS prevUserId, STRING_AGG(items.item_name, ', ') AS productsBoughtTogether
  FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
  WHERE event_name = 'purchase' AND ecommerce.unique_items > 1
  GROUP BY 1
),
UserProductRecommendations AS (
  -- For every item purchased by a user, prepare the list of other items bought together
  SELECT user_pseudo_id, cust_prev_purchase, SPLIT(productsBoughtTogether, ',') AS product
  FROM (
    SELECT b.*, a.*
    FROM ProductBoughtTogetherHistorically a,
         (SELECT user_pseudo_id, item_name AS cust_prev_purchase
          FROM UserPurchaseEvents
          WHERE user_pseudo_id IN (SELECT user_pseudo_id FROM UniqueUsersList)) b
    -- Compare each item in an order with orders bought historically and build the list of items bought together
    WHERE a.productsBoughtTogether LIKE CONCAT('%', CAST(b.cust_prev_purchase AS STRING), '%')
      -- Skip rows where the compared historical order belongs to the same user
      AND user_pseudo_id != prevUserId
  )
),
UserProductMatrix AS (
  SELECT UserProductRecommendations.user_pseudo_id, product_recommended,
         ecommerce_public.getratingbasedonProduct(TRIM(product_recommended)) AS rating
  FROM UserProductRecommendations
  CROSS JOIN UNNEST(UserProductRecommendations.product) AS product_recommended
  -- Remove the case where the recommended item equals the previously purchased item
  WHERE TRIM(product_recommended) <> TRIM(UserProductRecommendations.cust_prev_purchase)
  GROUP BY user_pseudo_id, product_recommended
  ORDER BY user_pseudo_id
)
SELECT * FROM UserProductMatrix l
WHERE NOT EXISTS (
  -- Drop recommendations for items the user has already bought
  SELECT 1 FROM UserPurchaseEvents
  WHERE user_pseudo_id = l.user_pseudo_id AND TRIM(item_name) = TRIM(l.product_recommended)
)
ORDER BY user_pseudo_id

Step 3: Evaluate

Query

evaluate

#standardSQL
SELECT *
FROM ML.EVALUATE(MODEL ecommerce_public.product_affinity, (
  WITH UniqueUsersList AS (
    -- Get all unique users
    SELECT DISTINCT(user_pseudo_id) FROM ecommerce_public.events_flat_data
  ),
  UserPurchaseEvents AS (
    -- Get all purchase orders
    SELECT user_pseudo_id, items.item_name
    FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
    WHERE event_name = 'purchase'
    GROUP BY 1, 2
  ),
  ProductBoughtTogetherHistorically AS (
    -- Get all items which were bought together
    SELECT user_pseudo_id AS prevUserId, STRING_AGG(items.item_name, ', ') AS productsBoughtTogether
    FROM ecommerce_public.events_flat_data, UNNEST(items) AS items
    WHERE event_name = 'purchase' AND ecommerce.unique_items > 1
    GROUP BY 1
  ),
  UserProductRecommendations AS (
    -- For every item purchased by a user, prepare the list of other items bought together
    SELECT user_pseudo_id, cust_prev_purchase, SPLIT(productsBoughtTogether, ',') AS product
    FROM (
      SELECT b.*, a.*
      FROM ProductBoughtTogetherHistorically a,
           (SELECT user_pseudo_id, item_name AS cust_prev_purchase
            FROM UserPurchaseEvents
            WHERE user_pseudo_id IN (SELECT user_pseudo_id FROM UniqueUsersList)) b
      -- Compare each item in an order with orders bought historically and build the list of items bought together
      WHERE a.productsBoughtTogether LIKE CONCAT('%', CAST(b.cust_prev_purchase AS STRING), '%')
        -- Skip rows where the compared historical order belongs to the same user
        AND user_pseudo_id != prevUserId
    )
  ),
  UserProductMatrix AS (
    SELECT UserProductRecommendations.user_pseudo_id, product_recommended,
           ecommerce_public.getratingbasedonProduct(TRIM(product_recommended)) AS rating
    FROM UserProductRecommendations
    CROSS JOIN UNNEST(UserProductRecommendations.product) AS product_recommended
    -- Remove the case where the recommended item equals the previously purchased item
    WHERE TRIM(product_recommended) <> TRIM(UserProductRecommendations.cust_prev_purchase)
    GROUP BY user_pseudo_id, product_recommended
    ORDER BY user_pseudo_id
  )
  SELECT * FROM UserProductMatrix l
  WHERE NOT EXISTS (
    -- Drop recommendations for items the user has already bought
    SELECT 1 FROM UserPurchaseEvents
    WHERE user_pseudo_id = l.user_pseudo_id AND TRIM(item_name) = TRIM(l.product_recommended)
  )
  ORDER BY user_pseudo_id
))

image

Step 4: Recommend

Top 5 recommendations for a user

RECOMMEND for a user

#standardSQL
SELECT *
FROM ML.RECOMMEND(MODEL ecommerce_public.product_affinity,
  (SELECT '2534169.4418845320' AS user_pseudo_id))
ORDER BY predicted_rating_confidence DESC
LIMIT 5

image
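To precompute recommendations for all users at once rather than one user at a time, ML.RECOMMEND can also be called with only the model. A minimal sketch that persists the full user–item prediction matrix to a table; the destination table name is illustrative, not part of the repository.

-- Materialize predictions for every user-item pair
CREATE OR REPLACE TABLE ecommerce_public.product_recommendations AS
SELECT user_pseudo_id, product_recommended, predicted_rating_confidence
FROM ML.RECOMMEND(MODEL ecommerce_public.product_affinity)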

Model Management - Vertex AI

We will need a Google Cloud Platform project with billing enabled to run this. To create a project, follow the instructions here.

Step 1: Enable the Compute Engine API
Navigate to Compute Engine and select Enable if it isn't already enabled. You'll need this to create your notebook instance.

Step 2: Enable the Vertex AI API
Navigate to the Vertex AI section of your Cloud Console and click Enable Vertex AI API.

image

Step 3: Create a Notebooks instance
We'll use Notebooks to get predictions after we've deployed our model. From the Vertex AI section of your Cloud Console, click on Notebooks. From there, select New Instance, then select the TensorFlow Enterprise 2.3 instance type without GPUs:

image

image

Use the default options and then click Create.

Train a BigQuery ML model

Step 1: Create a BigQuery dataset in your project
To train a model in BigQuery ML, you'll need to create a dataset within your project to store this model. Click on your project in the left menu bar, and then select Create Dataset.

Step 2: Run a CREATE MODEL query
In the BigQuery Query editor, run the following CREATE MODEL query to create and train a BigQuery ML model on the public dataset we'll be using:

image

Export your BigQuery ML model

With a trained BQML model, we can use the BQML SQL syntax to get predictions, or we can export the model to deploy it elsewhere. Here we'll export our model so that we can deploy it to Vertex AI to scalably serve the model and get predictions.

Step 1: Create a Cloud Storage bucket for your model
In the model details, click Export Model:

image

This will prompt you to enter the Google Cloud Storage (GCS) location where you'd like your model's assets to be exported. If you don't have a GCS bucket yet, don't worry! We're about to create one. First, click Browse:

image

Then click the + icon to create a new bucket:

image

Give it a unique name (Storage bucket names need to be globally unique). Click Continue. In the next step, under Location type select Region and choose any of the regions from the dropdown:

image

Use the default storage class, and under access control make sure Uniform is selected:

image

Click continue and use the defaults for the rest of the options. Then click Create.

Step 2: Export the BQML model
With your new bucket created, enter model-assets (or anything you'd like) in the Name field and then click Select:

image

Then click Export. This will create a job in BigQuery to export your model in TensorFlow's SavedModel format to the newly created GCS bucket you specified. This will take about a minute to export. While your model is being exported, navigate to the Storage section of your Cloud console. When your job completes, you should see your model assets exported to the bucket you just created under a model-assets subdirectory:

image
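The same export can also be run in SQL with an EXPORT MODEL statement instead of the console. A minimal sketch, shown here with the user_clusters model from earlier and an illustrative bucket name:

-- Export a trained BQML model to Cloud Storage in SavedModel format
EXPORT MODEL ecommerce_public.user_clusters
OPTIONS (URI = 'gs://your-model-bucket/model-assets/')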


Import the model to Vertex AI

In this step we'll reference the GCS storage location where we just exported our model assets, and use it to create and import the model into Vertex AI.

Step 1: Import the model
In your Cloud console, navigate to the Vertex AI Models section. From there, select Import:

image

In the first step, give your model the name predict_default. Select the same region where you created your bucket (either us-central1, europe-west4, or asia-east1), then click Continue. In Model settings, keep "Import model artifacts into a new pre-built container" selected. In the Model framework dropdown, select TensorFlow, then select 2.3 as the framework version. In the Model artifact location field, click Browse, click into the GCS bucket you just created, and click on the model-assets directory:

image

Then click Import. It will take a few minutes to import your model. Once it has been created, you'll see it in the models section of your Cloud console:

image

image


Deploy the model to an endpoint

Now that we've uploaded our model, the next step is to create an Endpoint in Vertex. A Model resource in Vertex can have multiple endpoints associated with it, and you can split traffic between endpoints.

Step 1: Creating an endpoint
On your model page, navigate to the Deploy and test tab and click Deploy to endpoint:

image

Give your endpoint a name, like default_pred_v1, leave the traffic splitting settings as is, and then select a machine type for your model deployment. We used an n1-highcpu-2 here, but you can choose whichever machine type you'd like. Then select Done and click Continue. Leave the selected location settings as is and then click Deploy. Your endpoint will take a few minutes to deploy. When it is completed you'll see a green check mark next to it:

image

image

Testing the model online

UI

image

Google Cloud Function

Request

curl -m 70 -X POST https://us-central1-ecomm-analysis.cloudfunctions.net/UserCluster \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{"userId":"2534169.4418845320"}'

REST API

Request

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://us-central1-aiplatform.googleapis.com/v1/projects/ecomm-analysis/locations/us-central1/endpoints/6123971903456542720:predict \
  -d '{"instances": [{"user_pseudo_id":"2534169.4418845320"}]}'

Sample request:

{"instances": [{"average_order_value": 12.8, "no_of_orders": 60, "session_count": 8, "total_time_spend_by_user_in_msec": 2224770}]}

Sample response:

{
  "predictions": [
    {
      "nearest_centroid_id": [3],
      "centroid_distances": [17.12813603951723, 142.2885807575315, 14.73747037975442, 15.66436157266043],
      "centroid_ids": [1, 2, 3, 4]
    }
  ],
  "deployedModelId": "5967942407382106112",
  "model": "projects/591702499920/locations/us-central1/models/7915265461103624192",
  "modelDisplayName": "userclustermodel",
  "modelVersionId": "1"
}

image

Model updating with only incremental records

Based on real-time incremental changes to the BigQuery table data, our model also has to be retrained incrementally. BigQuery ML provides a 'warm_start' option with which we can continue training our existing model on new data.

image
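A minimal sketch of what such a warm-started retraining statement could look like, reusing the k-means model defined earlier; the incremental feature table in the FROM clause is a hypothetical placeholder (the actual incremental query is shown in the screenshots below).

CREATE OR REPLACE MODEL ecommerce_public.user_clusters
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4,
  warm_start = TRUE  -- continue from the existing centroids instead of training from scratch
) AS
SELECT * EXCEPT(user_pseudo_id)
FROM ecommerce_public.user_stats_incremental  -- placeholder: UserStats features for newly appended rows only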

BigQuery change history lets us track the history of changes to a BigQuery table using the APPENDS table-valued function (TVF).

The APPENDS TVF returns a table of all rows appended to a table for a given time range (a sketch follows the list below). The following operations add rows to the APPENDS change history:

CREATE TABLE DDL statement
INSERT DML statement
MERGE DML statement
Loading data into BigQuery
Streaming ingestion
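A minimal sketch of querying the change history with the APPENDS TVF, using the events table from earlier and an illustrative one-day window:

-- Rows appended to the events table during the last day
SELECT *
FROM APPENDS(
  TABLE ecommerce_public.events_flat_data,
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY),  -- start of the window
  NULL                                                 -- NULL = no explicit upper bound
)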

image

Once we are able to get only the incremental records, we can retrain the existing model.

image

Once we have the incremental training query, we can schedule it with BigQuery scheduled queries as shown below:

image

Note: currently only the k-means model is supported for retraining. The matrix factorization model requires purchasing flex slots for model creation, so it cannot be scheduled.

Business Analytics

Looker – MongoDB Integration

The general steps for setting up a Google BigQuery Standard SQL or Google BigQuery Legacy SQL connection are:

  1. Create a service account with access to the Google project and download the JSON credentials certificate.

  2. Set up the Looker connection to your database.

  3. Test the connection.

Creating a service account and downloading the JSON credentials certificate

You must have Google Cloud admin permissions to create a service account. Google has documentation on creating a service account and generating a private key. Open the credentials page in the Google Cloud Platform API Manager and, if necessary, select your project:

image

Click CREATE CREDENTIALS and choose Service account:

image

Enter a name for the new service account, optionally add a description, and click CREATE:

image

Your service account requires two Google BigQuery predefined roles:

BigQuery > BigQuery Data Editor
BigQuery > BigQuery Job User

Select the first role in the Select a role field, then click ADD ANOTHER ROLE and select the second role:

image

After selecting both roles, click CONTINUE:

image

Click CREATE KEY:

image

Select JSON and click CREATE:

image

The JSON key will be saved to your computer. BE SURE TO REMEMBER WHERE IT IS SAVED. After noting the download location, click CLOSE:

image

Click DONE:

image

Find the email address corresponding to the service account. You will need this to configure the Looker connection to BigQuery.

image

  2. Set up the Looker connection to your database.

image

Use the service account created in the step above and the generated JSON key as the authentication mechanism.

  3. Test the connection.

image

Creating Projects

image

image

Configuring Projects

image

image

Create Views from table

image

Sample view file

image

Dashboard

image

Sales Distribution

image

Sales Trend / Sales Distribution by Category

image

Products with repeat orders

image
