Demo of an In-database processing tool for scikit-learn

mllite/sklearn2sql-demo

sklearn2sql-demo

Note: a final presentation (PDF slides) is available here: https://github.com/antoinecarme/presentations_slides/blob/main/sklearn2sql_presentation_2022-08.pdf

This repository contains some demos of the usage of sklearn2sql.

sklearn2sql is a tool, under ongoing development, for generating deployment SQL code from scikit-learn objects.

Using sklearn2sql, it is possible to predict values from an already-fitted classifier or regressor simply by executing some SQL code. It can be seen as an alternative to PMML-based methods for in-database processing.
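To make the idea concrete, here is a minimal hand-written sketch (not sklearn2sql's actual output or API) of how a fitted binary linear classifier can be scored purely in SQL. The coefficients, the `samples` table, and the `linear_classifier_sql` helper are all hypothetical, and an in-memory SQLite database is assumed only because it needs no setup.

```python
import sqlite3

# Hypothetical parameters of an already-fitted binary linear classifier.
COEF = {"x1": 0.8, "x2": -1.5}
INTERCEPT = 0.25


def linear_classifier_sql(table, coef, intercept):
    """Build a SELECT that scores each row and thresholds the score at zero."""
    score = " + ".join(f"({w!r} * {col})" for col, w in coef.items())
    score = f"{score} + ({intercept!r})"
    return (
        f"SELECT id, {score} AS score, "
        f"CASE WHEN {score} >= 0 THEN 1 ELSE 0 END AS prediction "
        f"FROM {table} ORDER BY id"
    )


# Illustrative data set stored in the database, not in Python memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, 1.0, 0.0), (2, 0.0, 1.0), (3, 2.0, 1.0)],
)

# The prediction is computed entirely by the database engine.
query = linear_classifier_sql("samples", COEF, INTERCEPT)
rows = conn.execute(query).fetchall()
predictions = [pred for _id, _score, pred in rows]
print(predictions)
```

Once the query string exists, no Python model object is needed at prediction time: the database only sees plain arithmetic and a `CASE` expression.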

(NEW) sklearn2sql is available as a RESTful web service on Heroku. A sample Python client allows you to generate SQL from your own models. Your feedback is welcome.

The SQL code is produced in a database-agnostic way (the mechanism used does not depend on the database) and supports the most widely used relational databases.

It is designed to support all classification and regression methods in scikit-learn (SVMs, linear models, naive Bayes, decision trees, MLP, etc.), as well as transformations (PCA, imputers, scalers), feature selection, outlier detection and their derived objects (random forests, meta-estimators, pipelines, feature unions, ensembles, etc.).
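As an illustration of how a transformation followed by an estimator might map to SQL, the sketch below chains a scaler and a linear regressor as two common table expressions (CTEs). This is an assumed layout for illustration only, not the code sklearn2sql generates; the parameter values, the `raw_data` table, and the `pipeline_sql` helper are hypothetical.

```python
import sqlite3

# Hypothetical fitted parameters: a standard scaler followed by a linear regressor.
SCALER = {"x1": (10.0, 2.0), "x2": (5.0, 4.0)}  # column -> (mean, scale)
COEF = {"x1": 3.0, "x2": -1.0}
INTERCEPT = 0.5


def pipeline_sql(table):
    """Express each pipeline step as one CTE that reads from the previous one."""
    scaled_cols = ", ".join(
        f"({col} - ({mean!r})) / ({scale!r}) AS {col}_scaled"
        for col, (mean, scale) in SCALER.items()
    )
    score = " + ".join(f"({w!r} * {col}_scaled)" for col, w in COEF.items())
    return f"""
    WITH scaled AS (
        SELECT id, {scaled_cols} FROM {table}
    ),
    predicted AS (
        SELECT id, {score} + ({INTERCEPT!r}) AS estimate FROM scaled
    )
    SELECT id, estimate FROM predicted ORDER BY id
    """


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?)",
    [(1, 12.0, 9.0), (2, 10.0, 5.0), (3, 8.0, 13.0)],
)

estimates = [est for _id, est in conn.execute(pipeline_sql("raw_data"))]
print(estimates)
```

Keeping one CTE per step means a longer pipeline only adds more named stages, and each stage can be inspected on its own when debugging.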

Roughly speaking, sklearn2sql allows one to translate a scikit-learn model into a large, machine-friendly ;) piece of SQL code that can later be executed on your favorite database. For example, this is a multilayer perceptron on Oracle, and this is a random forest on PostgreSQL...
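The sketch below shows why such generated SQL tends to be large and machine-friendly rather than human-friendly: a small decision tree is translated mechanically into nested `CASE` expressions. It is a hand-rolled illustration under assumed names (the `TREE` dictionary, the `tree_to_case` helper, and the `flowers` table are hypothetical), not the translation scheme sklearn2sql uses.

```python
import sqlite3

# Hypothetical fitted tree: internal nodes test "feature <= threshold",
# leaves carry the predicted class.
TREE = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"value": 0},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"value": 1},
        "right": {"value": 2},
    },
}


def tree_to_case(node):
    """Recursively turn a tree node into a (possibly nested) SQL CASE expression."""
    if "value" in node:
        return str(node["value"])
    return (
        f"CASE WHEN {node['feature']} <= {node['threshold']!r} "
        f"THEN {tree_to_case(node['left'])} "
        f"ELSE {tree_to_case(node['right'])} END"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flowers (id INTEGER PRIMARY KEY, petal_length REAL, petal_width REAL)"
)
conn.executemany(
    "INSERT INTO flowers VALUES (?, ?, ?)",
    [(1, 1.4, 0.2), (2, 4.5, 1.5), (3, 6.0, 2.2)],
)

case_expr = tree_to_case(TREE)
sql = f"SELECT id, {case_expr} AS predicted_class FROM flowers ORDER BY id"
classes = [cls for _id, cls in conn.execute(sql)]
print(classes)
```

A two-level tree already yields a nested expression; a forest of deep trees multiplies this many times over, which is fine for a database engine but hard to read by eye.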

Extensions

Since the beginning of this project, some extensions have been added to support machine learning models built using tools similar to scikit-learn. The goal is to be able to generate the deployment SQL code for any kind of classification and regression model on any kind of SQL-capable database. These extensions share the same SQL generation layer used for scikit-learn.

  1. A caret2sql project has been added to support R caret models. Some R Jupyter notebook demos are available. It supports the most commonly used R machine learning models.

  2. For deep learning models (neural network models), the keras2sql project has been added to support models built using the Keras framework with TensorFlow, Theano, and CNTK. Some demo Python Jupyter notebooks are available.

  3. PyTorch deep learning models are also supported through pytorch2sql. Some demo Python Jupyter notebooks are available.

  4. A similar generation process has been added for C++ backends through ml2cpp.

    1. It generates simple, readable C++ code that maps closely to the model structure, which facilitates debugging and integration.
    2. The project uses the same low-level layers as sklearn2sql.
    3. It supports all the models supported by the SQL backend.
    4. It generates C++ code that can be executed on almost any hardware platform that has a serious C++ compiler (GCC welcome).
    5. Some demo Python Jupyter notebooks are available.
    6. The C++ code is even runnable on very small platforms (STM32, ESP32, Kendryte, etc.).
  5. A Heroku-based web service can be used to generate SQL code for a given model. scikit-learn, Keras and caret models are supported, with both SQL and C++ backends.

  6. ... (wip) ...

Supported Databases

Support for the most popular relational databases has been added progressively. sklearn2sql now supports almost all the leading relational databases referenced on DB-Engines.

  1. Open source databases: PostgreSQL (just perfect!!! The most derived database), MariaDB (this project led to some CTE-related bug reports. Very reactive team. All bugs were fixed!!!)
  2. Commercial databases: Oracle, MS SQL Server, IBM DB2, Teradata (to cover 95% of the market and get real-world tests)
  3. Embedded databases: SQLite (even in-memory ;). Nice for prototyping, documentation and development. Zero config. Available everywhere (on Android and iOS devices and inside Jupyter notebooks ;).
  4. Hadoop databases: Hive and Impala
  5. Other: Firebird (low memory footprint, a stress test ;), MonetDB (columnar, a SQL quality reminder ;)