theme | layout | highlighter | colorSchema | favicon | title |
---|---|---|---|---|---|
default | cover | shiki | light | favicon/url | How we used Polars to build functime, a next gen ML forecasting library |
How we used Polars and global forecasting to build a next-generation ML forecasting library
👤 Luca Baggi
💼 ML Engineer @xtream
🛠️ Maintainer @functime
A new paradigm to evaluate the forecasting process
"We spend far too many resources generating, reviewing, adjusting, and approving our forecasts, while almost invariably failing to achieve the level of accuracy desired." (source)
Mike Gilliland
Board of Directors of the International Institute of Forecasters
A new paradigm to evaluate the forecasting process
"The focus needs to change. We need to shift our attention from esoteric model building to the forecasting process itself – its efficiency and its effectiveness." (source)
Mike Gilliland
Board of Directors of the International Institute of Forecasters
Reframe the problem
Make forecasting just work at a reasonable scale (~90% of use cases).
- Forecast thousands of time series without distributed systems (e.g. PySpark).
- Feature-engineering and diagnostics API compatible with panel datasets.
- Smoothly translate from experimentation to production.
This can be achieved with two ingredients: Polars and global forecasting.
A brief description
Dataframes powered by a multithreaded, vectorized query engine, written in Rust
- A dataframe frontend: work with a Python object and not a SQL table.
- Utilises all cores on your machine, efficiently (more on this later).
- Uses 50+ years of relational database research to optimise the query.
- In-process, like SQLite (OLTP), DuckDB (OLAP) and LanceDB (vector) — see the sketch below.
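To make the "dataframe frontend on top of a query engine" idea concrete, here is a minimal sketch; the file and column names are made up for illustration:

```python
import polars as pl

# Hypothetical file and column names, purely for illustration.
lazy = (
    pl.scan_parquet("sales.parquet")       # lazy scan: nothing is read yet
    .filter(pl.col("country") == "IT")     # predicate can be pushed down to the scan
    .group_by("store_id")
    .agg(pl.col("revenue").sum())
)
df = lazy.collect()  # the optimised plan runs here, in-process, on all cores
```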
What makes it so fast
- Efficient data representation and I/O with Apache Arrow
- Work stealing, AKA efficient multithreading.
- Query optimisations through lazy evaluation (e.g. `DataFrame.sort("col1").head(5)` in pandas vs Polars), as sketched below.
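A minimal sketch of that last point, with a made-up dataset and column name: in eager pandas, `sort_values("col1").head(5)` sorts the whole frame before taking five rows, whereas Polars' lazy engine sees the full query and can rewrite it into a top-k selection.

```python
import polars as pl

# Hypothetical dataset and column name, just to show the query plan.
lf = pl.scan_parquet("data.parquet")

query = lf.sort("col1").head(5)
print(query.explain())   # inspect the optimised plan before running it
top5 = query.collect()   # only the top 5 rows are materialised
```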
A lesson from forecasting competitions
Global forecasting simply means fitting a single model on all the time series in your panel dataset.
This approach proved successful in multiple forecasting competitions, most notably M4 (1 2) and M5 (1).
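As a rough sketch of the idea (toy data; scikit-learn is used only for illustration and is not part of the talk's stack): every series lives in the same panel dataframe, identified by a series id, and one model is fit on the stacked rows of all series, rather than one model per series.

```python
import numpy as np
import polars as pl
from sklearn.linear_model import Ridge

# Toy panel dataset: three series, 100 observations each (all names illustrative).
rng = np.random.default_rng(42)
panel = pl.DataFrame({
    "series_id": np.repeat(["a", "b", "c"], 100),
    "t": np.tile(np.arange(100), 3),
    "y": rng.normal(size=300).cumsum(),
})

# One lag feature computed per series, then a SINGLE model fit on every series at once.
# Contrast with the "local" approach, which would fit one model per series_id.
train = (
    panel.with_columns(pl.col("y").shift(1).over("series_id").alias("lag_1"))
    .drop_nulls()
)
global_model = Ridge().fit(train.select("lag_1").to_numpy(), train["y"].to_numpy())
```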
A lesson from forecasting competitions
Gradient boosted regression trees secured the top spots, but linear models work well too, provided some thoughtful and deliberate feature engineering.
Here's the recipe to make functime: a powerful query engine to perform blazingly fast feature engineering, followed by a single `model.fit()`.
It doesn't have to be the best model, but it must be fast to iterate on and scale to thousands of time series on your laptop.
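A minimal sketch of that recipe, with made-up file and column names and scikit-learn standing in for the regressor (this is not functime's actual API, just the idea behind it): lazy Polars does the group-wise feature engineering across the whole panel, then a single fit covers every series.

```python
import polars as pl
from sklearn.ensemble import HistGradientBoostingRegressor

# Hypothetical panel file with columns: series_id, timestamp, y.
features = (
    pl.scan_parquet("panel.parquet")
    .sort("series_id", "timestamp")
    .with_columns(
        [pl.col("y").shift(i).over("series_id").alias(f"lag_{i}") for i in (1, 2, 3)]
    )
    .with_columns(
        pl.col("y").shift(1).rolling_mean(window_size=7).over("series_id").alias("rolling_mean_7")
    )
    .drop_nulls()
    .collect()  # feature engineering runs once, in parallel, over the whole panel
)

X = features.drop("series_id", "timestamp", "y").to_numpy()
y = features["y"].to_numpy()

# One model, one fit, for thousands of series.
model = HistGradientBoostingRegressor().fit(X, y)
```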
Time for some dangerous live coding 🥶
What I could not show
- Prediction intervals with conformal predictions.
- Hyperparameter tuning with `flaml`.
- Advanced feature extraction.
- Censored forecasts.
- LLM data analysis.
A deep dive into the Arrow ecosystem and Polars internals
- Apache Arrow and Substrait, the secret foundations of Data Engineering - Alessandro Molina @EuroPython 2023
- Polars: DataFrames in the multi-core era - Ritchie Vink @PyData NYC 2023
- Is the great dataframe showdown finally over? Enter: Polars - Luca Baggi (me) @PyConIt 2023
More PyData Global 2023 talks
- Polars and time zones: everything you need to know by Marco Gorelli
- Practical showcase on how to use Polars to master datetimes and time-zones
- We rewrote tsfresh in Polars and why you should too by Chris Lo and Mathieu Cayssol
- 90-minute workshop to dive deeper into functime internals, Polars integration and benchmarking (thanks to Polars-Rust plugins!)
Documentation and communities
- Polars website
- Polars discord server
- functime.ai website and docs
- functime.ai discord server
Please share your feedback! My address is lucabaggi [at] duck.com