Releases: quixio/quix-streams
v3.15.0
What's Changed
💎 New streaming join: StreamingDataFrame.join_asof
With `StreamingDataFrame.join_asof()`, you can join two topics into a new stream where each left record is merged with the latest right record that has the same key and a timestamp less than or equal to the left record's timestamp.
This join is built with time-series enrichment use cases in mind, where the left side represents measurements and the right side represents events.
Some examples:
- Matching sensor measurements with events in the system.
- Joining purchases with the effective prices of goods.
```python
from datetime import timedelta

from quixstreams import Application

app = Application(...)

sdf_measurements = app.dataframe(app.topic("measurements"))
sdf_metadata = app.dataframe(app.topic("metadata"))

# Join records from the topic "measurements"
# with the latest effective records from the topic "metadata",
# using the "inner" join strategy and keeping the "metadata" records
# stored for 14 days in event time.
sdf_joined = sdf_measurements.join_asof(
    right=sdf_metadata,
    how="inner",  # Emit updates only if a match is found in the store.
    on_merge="keep-left",  # Prefer columns from the left dataframe if they overlap with the right.
    grace_ms=timedelta(days=14),  # Keep the state for 14 days (measured in event time, similar to windows).
)

if __name__ == '__main__':
    app.run()
```
Learn more about it on the Joins docs page.
By @gwaramadze and @daniil-quix in #874 #841
State improvements
- Enable fsync in RocksDB by default by @gwaramadze in #883
- RocksDBStorePartition: log the number of bytes written by @daniil-quix in #890
- Optimize state operations by @daniil-quix in #891
- Add a parameter to clear corrupted RocksDBs by @gwaramadze in #888.
To re-create a corrupted RocksDB state store automatically, use the `RocksDBOptions` object:
```python
from quixstreams import Application
from quixstreams.state.rocksdb import RocksDBOptions

app = Application(..., rocksdb_options=RocksDBOptions(on_corrupted_recreate=True))
```
Dependencies
- Bump types-protobuf from 6.30.2.20250503 to 6.30.2.20250516 by @dependabot in #885
Full Changelog: v3.14.1...v3.15.0
v3.14.1
What's Changed
- PostgreSQLSink: allow dynamic table name selection based on record by @tim-quix in #867
- Fix: deserialization propagation after sliding window by @gwaramadze in #875
Full Changelog: v3.14.0...v3.14.1
v3.14.0
💎 New features
StreamingDataFrame.concat() for combining multiple topics and branches
In this release, we introduce a new API to combine multiple StreamingDataFrames together and process the underlying data as a single stream - `StreamingDataFrame.concat()`.
You can use it to either combine different topics or branches of the same dataframe:
```python
from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows import Mean

# Example: Aggregate e-commerce orders from different locations into one stream
# and calculate the average order amount in 1h windows.
app = Application(...)

# Define the topics with e-commerce orders
topic_uk = app.topic("orders-uk")
topic_de = app.topic("orders-de")

# Create StreamingDataFrames for each location
orders_uk = app.dataframe(topic_uk)
orders_de = app.dataframe(topic_de)

# Simulate the currency conversion step for each topic before concatenating them.
# convert_currency() is a user-defined function returning a conversion callable.
orders_uk["amount_usd"] = orders_uk["amount"].apply(convert_currency("GBP", "USD"))
orders_de["amount_usd"] = orders_de["amount"].apply(convert_currency("EUR", "USD"))

# Concatenate the orders from different locations into a new StreamingDataFrame.
# The new dataframe will have all records from both topics.
orders_combined = orders_uk.concat(orders_de)

# Calculate the average order size in USD within a 1h tumbling window.
orders_combined = (
    orders_combined.tumbling_window(timedelta(hours=1))
    .agg(avg_amount_usd=Mean("amount_usd"))
    .final()
)

# Print the aggregated results
orders_combined.print()

if __name__ == '__main__':
    app.run()
```
See the Concatenating Topics and Branching StreamingDataFrames pages to learn more about processing multiple topics and merging branches.
By @daniil-quix in #802 #866 #865 #857
New aggregations
Added new built-in aggregations to `quixstreams.dataframe.windows.aggregations`:
- `Earliest` and `Latest` - to store the values with the earliest or the latest timestamp within each window.
- `First` and `Last` - to store the first and last values within each window. These aggregations work based on the processing order and are agnostic of the timestamps.
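For example, they can be combined via the `.agg()` API introduced in v3.13.0. A minimal sketch (the "events" topic and "value" column are assumptions for illustration):
```python
from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows.aggregations import Earliest, First, Last, Latest

app = Application(...)
sdf = app.dataframe(app.topic("events"))

sdf = (
    sdf.tumbling_window(timedelta(minutes=1))
    .agg(
        earliest=Earliest("value"),  # value with the earliest timestamp in the window
        latest=Latest("value"),      # value with the latest timestamp in the window
        first=First("value"),        # first value in processing order
        last=Last("value"),          # last value in processing order
    )
    .final()
)
```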
Other improvements
- Skip the repartition topic when using `StreamingDataFrame.group_by()` with single-partition topics by @ovv in #836
- Accept multiple columns in `StreamingDataFrame.contains()` by @ovv in #830
- Add `State.get_bytes` and `State.set_bytes` methods to persist raw bytes in state stores by @ovv in #834 (see the sketch after this list)
- Add `Quix__State__Dir` environment variable to configure the app's state directory in Quix Cloud deployments by @ovv in #844
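A minimal sketch of the raw-bytes state API inside a stateful callback (the key name, payload handling, and the `default=` argument are assumptions for illustration):
```python
from quixstreams import Application, State

app = Application(...)
sdf = app.dataframe(app.topic("events"))

def track_payload_size(value: dict, state: State) -> dict:
    # Read the previously stored raw bytes (empty if nothing was stored yet)
    previous = state.get_bytes("last_payload", default=b"")
    # Persist the current payload as raw bytes
    state.set_bytes("last_payload", repr(value).encode("utf-8"))
    return {**value, "previous_payload_size": len(previous)}

sdf = sdf.apply(track_payload_size, stateful=True)
```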
🦠 Bugfixes
- Fix SyntaxWarning by @gwaramadze in #827
🔌 Connectors
- Updated file source by @tim-quix in #787
- Add additional doc info about adding metadata with `ListSink` by @tim-quix in #814
- Connectors: update the default sink connect error to avoid error str clashes by @tim-quix in #831
- `MongoDBSink`: fix issue with dumping a headers dict that is None by @tim-quix in #829
- `MongoDBSink`: simplify auth style by @tim-quix in #854
- `InfluxDB3Sink`: improve timestamp type handling by @tim-quix in #811
- New connector - `InfluxDB3Source` by @tim-quix in #788
- `FileSource`: fix requiring all optional file source deps to use a file source by @tim-quix in #859
🔧 Internal
- Include py.typed file to mark quixstreams as a typed package by @ovv in #816
- clean up (and fix) doc build list by @tim-quix in #813
- Remove usage of Self type hint by @ovv in #826
- Fix some documentation hyperlinks + update CI workflow by @ovv in #835
- CI: Bump Python version to 3.13 by @ovv in #843
- Add topic_type to the Topic objects by @daniil-quix in #847
- Refactor `RowConsumer` and `RowProducer` by @daniil-quix in #861
- Bump mypy-extensions from 1.0.0 to 1.1.0 by @dependabot in #858
- Update pre-commit requirement from <4.1,>=3.4 to >=3.4,<4.3 by @dependabot in #840
- Bump testcontainers from 4.8.2 to 4.10.0 by @dependabot in #821
- Merge PausingManager into RowConsumer by @daniil-quix in #853
Dependencies
- Bump types-protobuf from 5.29.1.20250403 to 6.30.2.20250503 by @dependabot in #864
- Update protobuf requirement from <6.0,>=5.27.2 to >=5.27.2,<7.0 by @dependabot in #822
- Remove schema_registry extra by @gwaramadze in #825
- Update confluent-kafka[avro,json,protobuf,schemaregistry] requirement from <2.9,>=2.8.2 to >=2.8.2,<2.10 by @dependabot in #820
- Update pydantic-settings requirement from <2.9,>=2.3 to >=2.3,<2.10 by @dependabot in #851
- Update rich requirement from <14,>=13 to >=13,<15 by @dependabot in #839
- Update pydantic requirement from <2.11,>=2.7 to >=2.7,<2.12 by @dependabot in #837
Full Changelog: v3.13.1...v3.14.0
v3.13.1
🦠 Bugfixes
- Fix the initial recovery getting stuck when `auto_offset_reset="latest"` in #817 by @daniil-quix
Full Changelog: v3.13.0...v3.13.1
v3.13.0
💎 More flexible API for windowed aggregations
Introducing a new API to perform windowed aggregations via the `.agg()` method:
```python
from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows import Min, Max, Count, Mean

app = Application(...)
sdf = app.dataframe(...)

sdf = (
    # Define a tumbling window of 10 minutes
    sdf.tumbling_window(timedelta(minutes=10))
    # Configure the aggregations to perform
    .agg(
        min_temp=Min("temperature"),
        max_temp=Max("temperature"),
        avg_temp=Mean("temperature"),
        total_events=Count(),
    )
    # Emit results only for closed windows
    .final()
)

# Output:
# {
#     'start': <window start>,
#     'end': <window end>,
#     'min_temp': 1,
#     'max_temp': 999,
#     'avg_temp': 34.32,
#     'total_events': 999,
# }
```
Main benefits of `.agg()`:
- The keyword parameters are mapped to the output columns (no `"value"` column in the window results anymore, unless you define it this way)
- Support for aggregations on specific columns
- Support for multiple aggregations on the same window
- You can also implement a custom aggregation class and re-use it across the code base (see the sketch after this list)
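As a rough illustration of the last point, a custom aggregation might look like this (a sketch assuming custom aggregations subclass `Aggregator` and implement `initialize`, `agg`, and `result`; check the Aggregations docs for the exact interface):
```python
from quixstreams.dataframe.windows.aggregations import Aggregator

class SumOfSquares(Aggregator):
    """Sum the squares of a column's values within each window."""

    def initialize(self):
        # Initial aggregated value for a new window
        return 0.0

    def agg(self, old, new, timestamp: int):
        # Extract the target column if one was passed, e.g. SumOfSquares("temperature")
        if self.column is not None:
            new = new[self.column]
        return old + new ** 2

    def result(self, value):
        # Value emitted in the window output
        return value

# Usage: sdf.tumbling_window(timedelta(minutes=10)).agg(ssq=SumOfSquares("temperature"))
```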
See the Aggregations docs to learn more about `.agg()`.
The previous API for windowed aggregations (e.g., `sdf.tumbling_window().reduce()`) remains available but is considered deprecated.
💎 New method StreamingDataFrame.fill to fill missing data
Added a new method to fill missing data - `StreamingDataFrame.fill()`.
Use it to adjust the schema of the incoming messages to the desired shape when you expect some keys to be missing in the values.
```python
sdf: StreamingDataFrame

# Input {"x": 1} becomes {"x": 1, "y": None}
sdf.fill("y")

# Input {"x": 1} becomes {"x": 1, "y": 0}
sdf.fill(y=0)
```
See the Missing Data docs to learn more.
By @gwaramadze in #792
💎 New Sink: Elasticsearch
Added a new sink to write data to Elasticsearch:
```python
from quixstreams import Application
from quixstreams.sinks.community.elasticsearch import ElasticsearchSink

app = Application(...)
sdf = app.dataframe(...)

# Configure the sink
elasticsearch_sink = ElasticsearchSink(
    url="http://localhost:9200",
    index="my_index",
)

sdf.sink(elasticsearch_sink)
```
Docs - ElasticsearchSink
🦠 Bugfixes
- Check if topic already exists before creating by @gwaramadze in #800
Dependencies
- Bump types-jsonschema from 4.23.0.20240813 to 4.23.0.20241208 by @dependabot in #684
- Bump types-requests from 2.32.0.20241016 to 2.32.0.20250328 by @dependabot in #808
Full Changelog: v3.12.0...v3.13.0
v3.12.0
What's Changed
- Gracefully handle `None` values in built-in windowed aggregations like `sum()` and `count()` by @gwaramadze in #789
- Add a flag to `InfluxDB3Sink` to automatically convert integers to floats by @tim-quix in #793
- Upgrade confluent-kafka to >=2.8.2,<2.9 by @gwaramadze in #799
- Include optional dependencies for confluent-kafka and fix anaconda dependencies by @gwaramadze in #804
Full Changelog: v3.11.0...v3.12.0
v3.11.0
What's Changed
💎 Stop conditions for Application.run()
Application.run() now accepts additional `count` and `timeout` parameters to stop the app once the condition is met.
It is intended for interactive debugging of applications on smaller portions of data.
How it works:
- `count` - the number of messages to process from the main SDF input topics (default 0 == infinite)
- `timeout` - the maximum time in seconds to wait for a new message to appear (default 0.0 == infinite)
Example:
```python
from quixstreams import Application

app = Application(...)

# ... some processing happening here ...

app.run(
    count=20,   # Process 20 messages from input topics
    timeout=5,  # Wait for 5s if fewer than 20 messages are available in the topics
)
```
For more info, see the Inspecting Data & Debugging docs page.
[breaking] 💥 Changes to Sink.flush()
This release introduces breaking changes to the `Sink.flush()` method and the Sinks API overall in order to accommodate future features like joins.
- Sinks are now flushed first on each checkpoint, before producing changelogs, to minimize potential over-production in case of a Sink failure. Previously, changelogs were produced first, and only then were Sinks flushed.
- `Sink.flush()` is now expected to flush all the accumulated data for all topic partitions, so custom Sink implementations need to be updated. Previously, `Sink.flush()` was called for each processed topic partition.
- `SinkBackpressureError` now pauses the whole assignment instead of only certain partitions.
By @daniil-quix in #786
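For custom Sinks, the update is roughly this shape (a sketch; `send_to_destination` is a hypothetical helper, and the exact `BaseSink.add()` signature should be checked against the Sinks docs):
```python
from quixstreams.sinks import BaseSink

class MyBufferingSink(BaseSink):
    """Illustrative sink buffering records across all topic partitions."""

    def __init__(self):
        super().__init__()
        self._buffer = []

    def add(self, value, key, timestamp, headers, topic, partition, offset):
        # Accumulate records from every assigned topic partition
        self._buffer.append(value)

    def flush(self):
        # New contract: flush ALL accumulated data on each checkpoint,
        # not once per topic partition as before.
        send_to_destination(self._buffer)
        self._buffer.clear()
```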
🛠️ Other changes
- Refactor recovery to support Stores belonging to multiple topics by @daniil-quix in #774
- windows: Split Aggregations and collectors into classes in #772
- tweak Source to have a defined setup method for easier simple use cases by @tim-quix in #783
- remove all source and sink setup abstracts by @tim-quix in #784
- InfluxDB v3 sink tags improvements by @tomas-quix in #795
⚠️ Upgrade considerations
In #774, the format of the changelog message headers was changed.
When updating an existing application running with `processing_guarantee="at-least-once"` (default), ensure the app is stopped normally and the last checkpoint is committed successfully before upgrading to the new version.
See #774 for more details.
Full Changelog: v3.10.0...v3.11.0
v3.10.0
What's Changed
💎 Window closing strategies
Previously, windows used the `"key"` closing strategy.
With this strategy, messages advance time and close only windows with the same message key.
It helps to capture more data when it's not aligned in time (e.g., some keys are produced irregularly), but the latest windows can remain unprocessed until a message with the same key is received.
In this release, we added a new `"partition"` strategy and an API to configure the strategy for tumbling and hopping windows (sliding windows don't support it yet).
With the `"partition"` closing strategy, messages advance time and close windows for the whole partition to which the key belongs.
It helps to close windows faster because different keys advance time, at the cost of potentially skipping more out-of-order messages.
Example:
```python
from datetime import timedelta

from quixstreams import Application

app = Application(...)
sdf = app.dataframe(...)

# Define a window with the "partition" closing strategy.
sdf = sdf.tumbling_window(timedelta(seconds=10)).sum().final(closing_strategy="partition")
```
Learn more about closing strategies in the docs - https://quix.io/docs/quix-streams/windowing.html#closing-strategies
Added by @quentin-quix in #747
💎 Connectors status callbacks
Sinks and Sources now accept optional `on_client_connect_success` and `on_client_connect_failure` callbacks and can trigger them to inform about the Connector status during setup (see the sketch below).
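A minimal sketch of wiring up these callbacks (`SomeSink` is a placeholder for any connector accepting them; the callback signatures shown here are assumptions):
```python
import logging

logger = logging.getLogger(__name__)

def on_connect_success():
    logger.info("Connector is set up successfully")

def on_connect_failure(exception: Exception):
    logger.error(f"Connector failed to set up: {exception}")

# `SomeSink` stands in for a real Sink or Source class
sink = SomeSink(
    on_client_connect_success=on_connect_success,
    on_client_connect_failure=on_connect_failure,
)
```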
🦠 Bugfixes
- fix bad timestamps in test_app by @tim-quix in #768
- Bound protobuf<6.0 in tests by @quentin-quix in #773
- Bugfix for recovering from exactly 1 changelog message by @tim-quix in #769
- print_table method handles non-dict values by @gwaramadze in #767
🛠️ Other changes
- Create topics eagerly the moment they are defined by @daniil-quix in #763
- Increase default timeout and retries for `Producer.produce` by @quentin-quix in #771
- Add rich license by @gwaramadze in #776
- update docs and tutorials based on connect callback addition by @tim-quix in #775
- typing: make state protocols and ABCs generic by @quentin-quix in #777
Full Changelog: v3.9.0...v3.10.0
v3.9.0
What's Changed
💎 Table-style printing of StreamingDataFrames
You can now examine the incoming data streams in a table-like format with the new `StreamingDataFrame.print_table()` feature.
For interactive terminals, it can print new rows one by one in a live mode with an artificial delay, allowing you to glance at the data stream easily.
For non-interactive environments (stdout, file, etc.) or if `live=False`, it will print rows in batches as soon as the data is available to the application.
This is an experimental feature, so feel free to submit an issue with your feedback 👍
See the docs to learn more about `StreamingDataFrame.print_table()`.
```python
sdf = app.dataframe(...)

# some SDF transformations happening here ...

# Print the last 5 records with metadata columns in live mode
sdf.print_table(size=5, title="My Stream", live=True)

# For wide datasets, limit columns to improve readability
sdf.print_table(
    size=5,
    title="My Stream",
    columns=["id", "name", "value"],
    column_widths={"name": 20},
)
```
Live output:
```
                              My Stream
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ _key       ┃ _timestamp ┃ id     ┃ name                 ┃ value   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ b'53fe8e4' │ 1738685136 │ 876    │ Charlie              │ 42.5    │
│ b'91bde51' │ 1738685137 │ 11     │ Alice                │ 18.3    │
│ b'6617dfe' │ 1738685138 │ 133    │ Bob                  │ 73.1    │
│ b'f47ac93' │ 1738685139 │ 244    │ David                │ 55.7    │
│ b'038e524' │ 1738685140 │ 567    │ Eve                  │ 31.9    │
└────────────┴────────────┴────────┴──────────────────────┴─────────┘
```
By @gwaramadze in #740, #760
Bugfixes
- ⚠️ Fix default state dir for Quix Cloud apps by @gwaramadze in #759.
  Please note that the state may be recovered to a different directory when updating existing deployments in Quix Cloud if `state_dir` is not set.
- [Issue #440] ignore errors in rmtree by @ulisesojeda in #753
- Fix `QuixPortalApiService` failing in multiprocessing environment by @daniil-quix in #755
Docs
- Add missing "how to install" section for `PandasDataFrameSource` by @daniil-quix in #751
New Contributors
- @ulisesojeda made their first contribution in #753
Full Changelog: v3.8.1...v3.9.0
v3.8.1
What's Changed
- New PandasDataFrameSource connector to stream data from pandas.DataFrames during development and debugging by @JotaBlanco and @daniil-quix in #748 (see the sketch after this list)
- Made logging of common Kafka ACL issues more helpful by providing potentially missing ACLs and topic names by @tim-quix in #742
- Fix docs for MongoDBSink by @tim-quix in #746
- Bump mypy from 1.13.0 to 1.15.0 by @dependabot in #744
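A rough sketch of how the new connector might be used (the import path and constructor parameters here are assumptions for illustration - see the connector docs for the exact API):
```python
import pandas as pd

from quixstreams import Application
from quixstreams.sources.community.pandas import PandasDataFrameSource

df = pd.DataFrame({
    "session_id": ["a", "a", "b"],
    "timestamp": [1700000000000, 1700000001000, 1700000002000],
    "value": [1.0, 2.0, 3.0],
})

app = Application(broker_address="localhost:9092")

# Stream the DataFrame rows as messages, using "session_id" as the message key
source = PandasDataFrameSource(
    df=df,
    key_column="session_id",
    timestamp_column="timestamp",
    name="pandas-source",
)

sdf = app.dataframe(source=source)
sdf.print()

if __name__ == "__main__":
    app.run()
```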
Full Changelog: v3.8.0...v3.8.1