
Releases: quixio/quix-streams

v3.15.0

27 May 15:55
a2f4ee7

What's Changed

💎 New streaming join: StreamingDataFrame.join_asof

With StreamingDataFrame.join_asof(), you can join two topics into a new stream where each left record is merged with the latest right record that has the same key and a timestamp less than or equal to the left record's timestamp.

This join is built with time-series enrichment use cases in mind, where the left side represents measurements and the right side represents events.

Some examples:

  • Matching sensor measurements with events in the system.
  • Joining purchases with the effective prices of goods.
from datetime import timedelta

from quixstreams import Application

app = Application(...)

sdf_measurements = app.dataframe(app.topic("measurements"))
sdf_metadata = app.dataframe(app.topic("metadata"))

# Join records from the topic "measurements"
# with the latest effective records from the topic "metadata".
# using the "inner" join strategy and keeping the "metadata" records stored for 14 days in event time.
sdf_joined = sdf_measurements.join_asof(
    right=sdf_metadata,
    how="inner",                 # Emit updates only if the match is found in the store.
    on_merge="keep-left",        # Prefer the columns from the left dataframe if they overlap with the right. 
    grace_ms=timedelta(days=14), # Keep the state for 14 days (measured in event time, similar to windows).
)

if __name__ == '__main__':
    app.run()

Learn more about it on the Joins docs page.

By @gwaramadze and @daniil-quix in #874 #841

State improvements

A new on_corrupted_recreate option on RocksDBOptions allows the application to recreate a corrupted RocksDB state store automatically instead of failing:

from quixstreams import Application
from quixstreams.state.rocksdb import RocksDBOptions

app = Application(..., rocksdb_options=RocksDBOptions(on_corrupted_recreate=True))

Dependencies

  • Bump types-protobuf from 6.30.2.20250503 to 6.30.2.20250516 by @dependabot in #885

Full Changelog: v3.14.1...v3.15.0

v3.14.1

12 May 11:17
d700dfb

What's Changed

  • PostgreSQLSink: allow dynamic table name selection based on the record by @tim-quix in #867 (a sketch follows this list)
  • Fix: deserialization propagation after sliding window by @gwaramadze in #875
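
A hedged sketch of the dynamic table selection, assuming table_name accepts a callable that receives the sink record and returns a table name (the record shape and connection values here are placeholders):

from quixstreams.sinks.community.postgresql import PostgreSQLSink

postgres_sink = PostgreSQLSink(
    host="localhost",
    port=5432,
    dbname="metrics",
    user="user",
    password="password",
    # Route each record to a table derived from its contents
    table_name=lambda row: f"events_{row.value['event_type']}",
)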

Full Changelog: v3.14.0...v3.14.1

v3.14.0

07 May 15:52
1504a44

💎 New features

StreamingDataFrame.concat() for combining multiple topics and branches

In this release, we introduce a new API to combine multiple StreamingDataFrames and process the underlying data as a single stream - StreamingDataFrame.concat().
You can use it to combine either different topics or branches of the same dataframe:

from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows import Mean

# Example:  Aggregate e-commerce orders from different locations into one stream and calculate the average order amount in 1h windows.

app = Application(...)

# Define the topics with e-commerce orders
topic_uk = app.topic("orders-uk")
topic_de = app.topic("orders-de")

# Create StreamingDataFrames for each location
orders_uk = app.dataframe(topic_uk)
orders_de = app.dataframe(topic_de)

# Simulate the currency conversion step for each topic before concatenating them.
# (convert_currency is a placeholder converter factory for this example.)
def convert_currency(source: str, target: str):
    rate = 1.25 if source == "GBP" else 1.10  # hypothetical fixed rates
    return lambda amount: amount * rate

orders_uk["amount_usd"] = orders_uk["amount"].apply(convert_currency("GBP", "USD"))
orders_de["amount_usd"] = orders_de["amount"].apply(convert_currency("EUR", "USD"))

# Concatenate the orders from different locations into a new StreamingDataFrame.
# The new dataframe will have all records from both topics.
orders_combined = orders_uk.concat(orders_de)

# Calculate the average order size in USD within a 1h tumbling window
# and emit results for closed windows.
orders_windowed = (
    orders_combined.tumbling_window(timedelta(hours=1))
    .agg(avg_amount_usd=Mean("amount_usd"))
    .final()
)

# Print the aggregated results
orders_windowed.print()


if __name__ == '__main__':
    app.run()

See the Concatenating Topics and Branching StreamingDataFrames pages to learn more about processing multiple topics and merging branches.

By @daniil-quix in #802 #866 #865 #857

New aggregations

Added new built-in aggregations to quixstreams.dataframe.windows.aggregations:

  • Earliest and Latest - to store the values with the earliest or the latest timestamp within each window.
  • First and Last - to store the first and the last values within each window. These aggregations work based on the processing order and are agnostic of the timestamps. A combined sketch follows this list.
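
A minimal sketch of combining these in one window (assuming a "temperature" column in the incoming values; see the Aggregations docs for the exact import paths):

from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows.aggregations import Earliest, First, Last, Latest

app = Application(...)
sdf = app.dataframe(...)

# Track both the timestamp-based and the processing-order-based
# boundary values of each 5-minute window.
sdf = (
    sdf.tumbling_window(timedelta(minutes=5))
    .agg(
        earliest_temp=Earliest("temperature"),  # value with the smallest timestamp
        latest_temp=Latest("temperature"),      # value with the largest timestamp
        first_temp=First("temperature"),        # first value processed
        last_temp=Last("temperature"),          # last value processed
    )
    .final()
)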

By @ovv in #812

Other improvements

  • Skip the repartition topic when using StreamingDataFrame.group_by() with single-partition topics by @ovv in #836
  • Accept multiple columns in StreamingDataFrame.contains() by @ovv in #830
  • Add State.get_bytes and State.set_bytes methods to persist raw bytes in state stores by @ovv in #834 (see the sketch after this list)
  • Add Quix__State__Dir environment variable to configure app's state directory in Quix Cloud deployments by @ovv in #844
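
A minimal sketch of the byte-level state access, assuming get_bytes/set_bytes mirror the signatures of the existing State.get/State.set (the "payload" key is illustrative):

from quixstreams import Application, State

app = Application(...)
sdf = app.dataframe(...)

def track_payload(value: dict, state: State):
    # Read the previously stored raw payload, if any
    previous = state.get_bytes("payload", default=b"")
    # Persist the new payload as raw bytes
    state.set_bytes("payload", repr(value).encode("utf-8"))
    value["previous_payload_size"] = len(previous)
    return value

sdf = sdf.apply(track_payload, stateful=True)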

🔧 Internal

  • Include py.typed file to mark quixstreams as a typed package by @ovv in #816
  • clean up (and fix) doc build list by @tim-quix in #813
  • Remove usage of Self type hint by @ovv in #826
  • Fix some documentation hyperlinks + update CI workflow by @ovv in #835
  • CI: Bump Python version to 3.13 by @ovv in #843
  • Add topic_type to the Topic objects by @daniil-quix in #847
  • Refactor RowConsumer and RowProducer by @daniil-quix in #861
  • Bump mypy-extensions from 1.0.0 to 1.1.0 by @dependabot in #858
  • Update pre-commit requirement from <4.1,>=3.4 to >=3.4,<4.3 by @dependabot in #840
  • Bump testcontainers from 4.8.2 to 4.10.0 by @dependabot in #821
  • Merge PausingManager into RowConsumer by @daniil-quix in #853

Dependencies

  • Bump types-protobuf from 5.29.1.20250403 to 6.30.2.20250503 by @dependabot in #864
  • Update protobuf requirement from <6.0,>=5.27.2 to >=5.27.2,<7.0 by @dependabot in #822
  • Remove schema_registry extra by @gwaramadze in #825
  • Update confluent-kafka[avro,json,protobuf,schemaregistry] requirement from <2.9,>=2.8.2 to >=2.8.2,<2.10 by @dependabot in #820
  • Update pydantic-settings requirement from <2.9,>=2.3 to >=2.3,<2.10 by @dependabot in #851
  • Update rich requirement from <14,>=13 to >=13,<15 by @dependabot in #839
  • Update pydantic requirement from <2.11,>=2.7 to >=2.7,<2.12 by @dependabot in #837

Full Changelog: v3.13.1...v3.14.0

v3.13.1

04 Apr 16:56

🦠 Bugfixes

  • Fix the initial recovery getting stuck when auto_offset_reset="latest" in #817 by @daniil-quix

Full Changelog: v3.13.0...v3.13.1

v3.13.0

04 Apr 16:56
63491f5

💎 More flexible API for windowed aggregations

Introducing a new API to perform windowed aggregations via the .agg() method:

from datetime import timedelta
from quixstreams import Application
from quixstreams.dataframe.windows import Min, Max, Count, Mean

app = Application(...)
sdf = app.dataframe(...)

sdf = (

    # Define a tumbling window of 10 minutes
    sdf.tumbling_window(timedelta(minutes=10))
    
    # Configure the aggregations to perform
    .agg(
        min_temp=Min("temperature"),
        max_temp=Max("temperature"),
        avg_temp=Mean("temperature"),
        total_events=Count(),
    )

    # Emit results only for closed windows
    .final()
)

# Output:
# {
#   'start': <window start>, 
#   'end': <window end>, 
#   'min_temp': 1,
#   'max_temp': 999,
#   'avg_temp': 34.32,
#   'total_events': 999,
# }

Main benefits of .agg():

  • The keyword parameters are mapped to the output columns (no "value" column in the window results anymore, unless you define it this way)
  • Support for aggregations on specific columns
  • Support for multiple aggregations on the same window
  • You can also implement a custom aggregation class and re-use it across the code base; a sketch follows this list.
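
A hedged sketch of a custom aggregation, assuming the Aggregator base class from quixstreams.dataframe.windows.aggregations exposes initialize/agg/result hooks (check the Aggregations docs for the exact interface):

from quixstreams.dataframe.windows.aggregations import Aggregator

class SumOfSquares(Aggregator):
    """Accumulate the sum of squared values of the "temperature" column."""

    def initialize(self):
        # Starting accumulator for a new window
        return 0.0

    def agg(self, old, new, timestamp: int):
        # `new` is the incoming message value
        return old + new["temperature"] ** 2

    def result(self, stored):
        # The value emitted in the window result
        return stored

# Usage: sdf.tumbling_window(timedelta(minutes=10)).agg(sq_temp=SumOfSquares()).final()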

See the Aggregations docs to learn more about .agg().

The previous API for windowed aggregations (e.g., sdf.tumbling_window().reduce()) remains available but is considered deprecated.

by @ovv in #785

💎 New method StreamingDataFrame.fill to fill missing data

Added a new method to fill missing data - StreamingDataFrame.fill().
Use it to adjust the schema of incoming messages to the desired shape when you expect some keys to be missing from the values.

sdf: StreamingDataFrame

# Input {"x": 1} would become {"x": 1, "y": None}
sdf.fill("y")

# {"x": 1} => {"x": 1, "y": 0}
sdf.fill(y=0)

See the Missing Data docs to learn more.

By @gwaramadze in #792

💎 New Sink: Elasticsearch

Added a new sink to write data to Elasticsearch:

from quixstreams import Application
from quixstreams.sinks.community.elasticsearch import ElasticsearchSink

app = Application(...)
sdf = app.dataframe(...)

# Configure the sink
elasticsearch_sink = ElasticsearchSink(
    url="http://localhost:9200",
    index="my_index",
)

sdf.sink(elasticsearch_sink)

Docs - ElasticsearchSink

By @tim-quix in #801

Dependencies

  • Bump types-jsonschema from 4.23.0.20240813 to 4.23.0.20241208 by @dependabot in #684
  • Bump types-requests from 2.32.0.20241016 to 2.32.0.20250328 by @dependabot in #808

Full Changelog: v3.12.0...v3.13.0

v3.12.0

27 Mar 13:31
51c3bf2

What's Changed

  • Gracefully handle None values in built-in windowed aggregations like sum() and count() by @gwaramadze in #789
  • Add a flag to automatically convert integers to floats to InfluxDB3Sink by @tim-quix in #793
  • Upgrade confluent-kafka to >=2.8.2,<2.9 by @gwaramadze in #799
  • Include optional dependencies for confluent-kafka and fix anaconda dependencies by @gwaramadze in #804

Full Changelog: v3.11.0...v3.12.0

v3.11.0

24 Mar 15:29
20db25a

What's Changed

💎 Stop conditions for Application.run()

Application.run() now accepts additional count and timeout parameters to stop the application when either condition is met.

It is intended for interactive debugging of applications on smaller portions of data.

How it works:

  • count - the number of messages to process from the main SDF input topics (default 0 == infinite)
  • timeout - the maximum time in seconds to wait for a new message to appear (default 0.0 == infinite)

Example:

from quixstreams import Application

app = Application(...)
# ... some processing happening here ...

app.run(
    count=20,   # Process 20 messages from input topics
    timeout=5,  # Wait for 5s if fewer than 20 messages are available in the topics.
)

For more info, see the Inspecting Data & Debugging docs page.

by @tim-quix in #780

[breaking] 💥 Changes to Sink.flush()

This release introduces breaking changes to the Sink.flush() method and Sinks API overall in order to accommodate future features like joins.

  • Sinks are now flushed first on each checkpoint, before producing changelogs, to minimize potential over-production in case of a Sink failure.
    Previously, changelogs were produced first, and only then were Sinks flushed.

  • Sink.flush() is now expected to flush all the accumulated data for all topic partitions (TPs).
    Custom Sink implementations need to be updated accordingly; a sketch follows this list.
    Previously, Sink.flush() was called once for each processed topic partition.

  • SinkBackpressureError now pauses the whole assignment instead of only certain partitions.
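
A hedged sketch of adapting a custom sink to the new contract, assuming the BaseSink interface from the Sinks docs (the buffering scheme and the destination call are illustrative, not part of the library API):

from quixstreams.sinks import BaseSink

class MyBufferedSink(BaseSink):
    def __init__(self):
        super().__init__()
        # Accumulated records across all assigned topic partitions
        self._buffer = []

    def add(self, value, key, timestamp, headers, topic, partition, offset):
        # Called for every processed record
        self._buffer.append((topic, partition, key, value))

    def flush(self):
        # New contract: write out ALL accumulated data in one call,
        # instead of being invoked once per topic partition.
        print(f"writing {len(self._buffer)} records to the destination")
        self._buffer.clear()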

By @daniil-quix in #786

🦠Bugfixes

  • default sink connect fail callback should raise by @tim-quix in #794

🛠️ Other changes

  • Refactor recovery to support Stores belonging to multiple topics by @daniil-quix in #774

  • windows: Split Aggregations and collectors into classes in #772

  • tweak Source to have a defined setup method for easier simple use cases by @tim-quix in #783

  • remove all source and sink setup abstracts by @tim-quix in #784

  • InfluxDb v3 sink tags improvements by @tomas-quix in #795

⚠️ Upgrade considerations

In #774, the format of the changelog message headers was changed.
When updating an existing application running with processing_guarantee="at-least-once" (the default), ensure that the app is stopped normally and that the last checkpoint is committed successfully before upgrading to the new version.
See #774 for more details.

Full Changelog: v3.10.0...v3.11.0

v3.10.0

07 Mar 14:53
0115ebb

What's Changed

💎 Window closing strategies

Previously, windows used the "key" closing strategy.
With this strategy, messages advance time and close only the windows with the same message key.
It helps to capture more data when it's not aligned in time (e.g. some keys are produced irregularly), but the latest windows can remain unprocessed until a new message with the same key is received.

In this release, we added a new "partition" strategy, and an API to configure the strategy for tumbling and hopping windows (sliding windows don't support it yet).

With "partition" closing strategy, messages advance time and close windows for the whole partition to which this key belongs.
It helps to close windows faster because different keys advance time at the cost of potentially skipping more out-of-order messages.

Example:

from datetime import timedelta
from quixstreams import Application

app = Application(...)
sdf = app.dataframe(...)

# Define a window with the "partition" closing strategy.
sdf = sdf.tumbling_window(timedelta(seconds=10)).sum().final(closing_strategy="partition")

Learn more about closing strategies in the docs - https://quix.io/docs/quix-streams/windowing.html#closing-strategies

Added by @quentin-quix in #747

💎 Connectors status callbacks

Sinks and Sources now accept optional on_client_connect_success and on_client_connect_failure callbacks and trigger them to report the connector's status during setup (a sketch follows below).
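
A hedged sketch of wiring these callbacks into a sink (the connection values are placeholders, and the assumption here is that the failure callback receives the raised exception):

import logging

from quixstreams.sinks.core.influxdb3 import InfluxDB3Sink

logger = logging.getLogger(__name__)

influx_sink = InfluxDB3Sink(
    token="<token>",
    host="<host>",
    organization_id="<org-id>",
    database="<database>",
    measurement="measurement-1",
    # Report connector status during setup
    on_client_connect_success=lambda: logger.info("InfluxDB sink connected"),
    on_client_connect_failure=lambda exc: logger.error("InfluxDB sink failed to connect: %s", exc),
)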

By @tim-quix in #708 #775

🦠 Bugfixes

  • fix bad timestamps in test_app by @tim-quix in #768
  • Bound protobuf<6.0 in tests by @quentin-quix in #773
  • Bugfix for recovering from exactly 1 changelog message by @tim-quix in #769
  • print_table method handles non-dict values by @gwaramadze in #767

🛠️ Other changes

  • Create topics eagerly the moment they are defined by @daniil-quix in #763
  • Increase default timeout and retries for Producer.produce by @quentin-quix in #771
  • Add rich license by @gwaramadze in #776
  • update docs and tutorials based on connect callback addition by @tim-quix in #775
  • typing: make state protocols and ABCs generic by @quentin-quix in #777

Full Changelog: v3.9.0...v3.10.0

v3.9.0

24 Feb 13:55
0923770

What's Changed

💎 Table-style printing of StreamingDataFrames

You can now examine incoming data streams in a table-like format using the new StreamingDataFrame.print_table() feature.

For interactive terminals, it can print new rows one by one in live mode with an artificial delay, allowing you to glance at the data stream easily.
For non-interactive environments (stdout, file, etc.) or if live=False, it prints rows in batches as soon as the data is available to the application.

This is an experimental feature, so feel free to submit an issue with your feedback 👍

See the docs to learn more about StreamingDataFrame.print_table().

from quixstreams import Application

app = Application(...)
sdf = app.dataframe(...)
# some SDF transformations happening here ...

# Print last 5 records with metadata columns in live mode
sdf.print_table(size=5, title="My Stream", live=True)

# For wide datasets, limit columns to improve readability
sdf.print_table(
    size=5,
    title="My Stream",
    columns=["id", "name", "value"],
    column_widths={"name": 20}
)


# Live output:
My Stream
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ _key       ┃ _timestamp ┃ id     ┃ name                 ┃ value   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ b'53fe8e4' │ 1738685136 │ 876    │ Charlie              │ 42.5    │
│ b'91bde51' │ 1738685137 │ 11     │ Alice                │ 18.3    │
│ b'6617dfe' │ 1738685138 │ 133    │ Bob                  │ 73.1    │
│ b'f47ac93' │ 1738685139 │ 244    │ David                │ 55.7    │
│ b'038e524' │ 1738685140 │ 567    │ Eve                  │ 31.9    │
└────────────┴────────────┴────────┴──────────────────────┴─────────┘

By @gwaramadze in #740, #760

Bugfixes

  • ⚠️ Fix default state dir for Quix Cloud apps by @gwaramadze in #759
    Please note that the state may be recovered to a different directory when updating existing deployments in Quix Cloud if state_dir is not set.

  • [Issue #440] ignore errors in rmtree by @ulisesojeda in #753

  • Fix QuixPortalApiService failing in multiprocessing environment by @daniil-quix in #755

Docs

  • Add missing "how to install" section for PandasDataFrameSource by @daniil-quix in #751

Full Changelog: v3.8.1...v3.9.0

v3.8.1

13 Feb 14:03
4150fbf

Full Changelog: v3.8.0...v3.8.1