Skip to content

Add example for writing an AnalyzerRule #10855

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
Tracked by #7013
alamb opened this issue Jun 10, 2024 · 7 comments · Fixed by #11089
Closed
Tracked by #7013

Add example for writing an AnalyzerRule #10855

alamb opened this issue Jun 10, 2024 · 7 comments · Fixed by #11089
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 10, 2024

Is your feature request related to a problem or challenge?

We have an example for writing a user defined optimizer rule in

.add_optimizer_rule(Arc::new(TopKOptimizerRule {}));

However, we don't have a corresponding example for writing a user defined AnalyzerRule which also means the APIs for using them are complicated (see #10849 for example)

Describe the solution you'd like

The idea example I think would be ti Add a file to https://github.com/apache/datafusion/tree/3773fb7fb54419f889e7d18b73e9eb48069eb08e/datafusion-examples

user_defined_analyzer.rs

Perhaps the example could show how to create an AnalyzerRule that replaced an experssion like a / b with a function call safe_div(a, b) where save_div

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jun 10, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 10, 2024

Perhaps @goldmedal is interested in this / has some example code to share

@goldmedal
Copy link
Contributor

I think we have an existing example showing how to create an AnalyzerRule in

impl AnalyzerRule for MyAnalyzerRule {

However, it only works with the analyzer and the optimizer. Maybe we can enhance it after #10849 to show how to apply the custom rules for the end-to-end DataFusion query flow.

@goldmedal
Copy link
Contributor

goldmedal commented Jun 10, 2024

Currently, in my personal work, I just use it like

let ctx = SessionContext::new();
let new_state = ctx
    .state()
    .add_analyzer_rule(Arc::new(ModelAnalyzeRule::new()))
    .add_analyzer_rule(Arc::new(ModelGenerationRule::new()));
let new_ctx = SessionContext::new_with_state(new_state);
// create a plan to run a SQL query
let df = new_ctx.sql("SELECT * FROM datafusion.default.orders").await?;

After the API updated, I guess it would be

let ctx = SessionContext::new();
ctx.add_analyzer_rule(Arc::new(ModelAnalyzeRule::new()))
    .add_analyzer_rule(Arc::new(ModelGenerationRule::new()));
// create a plan to run a SQL query
let df = ctx.sql("SELECT * FROM datafusion.default.orders").await?;

After #10849 is done, if no one else is working on it, I think I can help with it.

@alamb alamb changed the title Add example for writing an `AnalyzerRule Add example for writing an AnalyzerRule Jun 11, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 15, 2024

I have some time on a plane today that I may use to try and write up this example. I was inspired by some discussion I had this week

@goldmedal
Copy link
Contributor

Hi @alamb, I just want to share how I use user-defined AnalyzerRule. As you know, I'm working on reimplementing the semantic layer engine, wren-engine, for LLM using DataFusion. The project is still a work in progress. However, I think I have finished the part of integration with DataFusion. I think it could be a nice use case for AnalyzerRule.

The basic concept of wren-engine is that user can define a virtual modeling layer to apply on their physical data.
For example, given a csv file registered in DataFusion context called, datafusion.public.customers. We can wrap a virtual table for it called, wrenai.default.customers_model.
User can query it like

select * from wrenai.default.customers_model

Then , wren engine translate it to a physical query:

SELECT 
  "customers_model"."city", 
  "customers_model"."id", 
  "customers_model"."state" 
FROM 
  (
    SELECT 
      "customers_model"."city", 
      "customers_model"."id", 
      "customers_model"."state" 
    FROM 
      (
        SELECT 
          "datafusion"."public"."customers"."city" AS "city", 
          "datafusion"."public"."customers"."id" AS "id", 
          "datafusion"."public"."customers"."state" AS "state" 
        FROM 
          "public"."customers"
      ) AS "customers_model"
  ) AS "customers_model"

It's a simple use case to show how it works. We have other features, such as relationship between models or calculated fields in the model. I don't go into detail about them here. However, I implemented all of them through AnalyzerRule and UserDefinedLogicalNode.

The related PR, Canner/wren-engine#613, is still under review. However, you can find the usage of AnalyzerRule in rule.rs and the plan node in plan.rs. I also added some examples to show how to use the API or integrate with the DataFusion query flow.

Thanks to DataFusion for providing such amazing features, allowing me to implement a more stable and structured converter for the semantic layer.

@alamb
Copy link
Contributor Author

alamb commented Jun 17, 2024

Thank you @goldmedal - this is a great hint. I have some basic analyzer rule coded up but maybe your example is a better idea. I have a PR in progress that I need to polish up and then will ping you for review

@alamb
Copy link
Contributor Author

alamb commented Jun 24, 2024

Here is my proposed analyzer rule example: #11089

It isn't quite as fancy as the wren example, but I think it is an understandable enough example

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants