Releases: quixio/quix-streams
v3.3.0
What's Changed
New Connectors for Google Cloud
In this release, 3 new connectors have been added:
- Google Cloud Pub/Sub Source by @tim-quix in #622
- Google Cloud Pub/Sub Sink by @gwaramadze in #616, #626
- Google Cloud BigQuery Sink by @daniil-quix in #621, #627
To learn more about them, see the respective docs pages.
Other updates
- Conda drop Python 3.8 support by @gwaramadze in #629
- Remove connectors docs from the nav by @daniil-quix in #630
- Update Documentation by @github-actions in #617
- Update connectors docs by @daniil-quix in #625
Full Changelog: v3.2.1...v3.3.0
v3.2.1
What's Changed
This is a bugfix release downgrading confluent-kafka to 2.4.0 because of the authentication issue introduced in 2.6.0.
Full Changelog: v3.2.0...v3.2.1
v3.2.0
What's Changed
[new] Sliding Windows
Sliding windows are overlapping time-based windows that advance with each incoming message rather than at fixed intervals like hopping windows.
They have a fixed 1 ms resolution, perform better, and are less resource-intensive than hopping windows with a 1 ms step.
Read more in Sliding Windows docs.
PR by @gwaramadze - #515
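To make the contrast with hopping windows concrete, here is a minimal pure-Python sketch of the idea. It is not the Quix Streams implementation. It assumes integer millisecond timestamps arriving in order, and that every message closes one window covering the `duration_ms` up to and including its own timestamp. The function name `sliding_window_sums` is hypothetical.

```python
from collections import deque


def sliding_window_sums(messages, duration_ms):
    """Yield (start, end, total) for the window that ends at each message.

    `messages` is an iterable of (timestamp_ms, value) pairs in timestamp order.
    Assumption: a window covers [timestamp - duration_ms, timestamp], inclusive.
    """
    buffer = deque()  # messages still inside the current window
    total = 0
    for timestamp, value in messages:
        buffer.append((timestamp, value))
        total += value
        start = timestamp - duration_ms
        # Evict messages that fell out of the window's left edge
        while buffer[0][0] < start:
            _, old_value = buffer.popleft()
            total -= old_value
        yield start, timestamp, total


events = [(1000, 1), (1050, 2), (1160, 3), (1300, 4)]
results = list(sliding_window_sums(events, duration_ms=200))
```

The window advances with each message instead of on a fixed step, so no empty or redundant windows are computed between messages.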
[new] FileSink and FileSource connectors
FileSink writes batches of data to files on disk in JSON and Parquet formats.
FileSource enables processing data streams from JSON or Parquet files.
The resulting messages can be produced in "replay" mode, where the time between produced records matches the original timing as closely as possible.
Learn more on the FileSink and FileSource pages.
PRs:
- local file sink by @tomas-quix in #560
- local file source by @tim-quix in #601
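The "replay" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not FileSource's code: records are assumed to carry millisecond timestamps, and the helper name `replay_delays` is hypothetical.

```python
def replay_delays(timestamps_ms):
    """Return the pause (in seconds) to wait before producing each record."""
    delays = []
    previous = None
    for ts in timestamps_ms:
        if previous is None:
            delays.append(0.0)  # the first record is produced immediately
        else:
            # Never wait a negative amount if records are out of order
            delays.append(max(ts - previous, 0) / 1000)
        previous = ts
    return delays


delays = replay_delays([1_000, 1_250, 1_250, 3_000])
```

A producer loop would then call `time.sleep(delay)` before sending each record to approximate the original pacing.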
[upd] Updated time tracking in windowed aggregations
In previous versions, windowed aggregations tracked time per topic-partition but expired windows per key.
This behavior was not fully consistent, and it also caused problems when processing data from misaligned producers.
For example, IoT and other physical devices may produce data at a certain frequency, which results in misaligned data streams within one topic-partition, so more data is considered "late" and dropped from processing.
To make the processing of such data more complete, Quix Streams now tracks event time for each message key in the windows.
PRs:
- #591 by @daniil-quix
- #607 by @daniil-quix
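The effect of this change can be shown with a simplified sketch. It is not the library's window logic. It assumes a message counts as "late" when its timestamp is older than the tracked maximum timestamp minus a grace period, and the function `count_late` and the device keys are hypothetical.

```python
def count_late(messages, grace_ms, per_key):
    """Count messages considered late under a given time-tracking mode.

    `messages` is a list of (key, timestamp_ms) pairs in arrival order.
    """
    latest = {}  # max timestamp seen, per key or shared for the whole partition
    late = 0
    for key, timestamp in messages:
        track_key = key if per_key else "partition"
        seen = latest.get(track_key, timestamp)
        if timestamp < seen - grace_ms:
            late += 1
        latest[track_key] = max(seen, timestamp)
    return late


# Two devices share one partition; device-b runs 5 seconds behind device-a
stream = [
    ("device-a", 10_000),
    ("device-b", 5_000),
    ("device-a", 11_000),
    ("device-b", 6_000),
]
late_per_partition = count_late(stream, grace_ms=1_000, per_key=False)
late_per_key = count_late(stream, grace_ms=1_000, per_key=True)
```

With one shared clock, every message from the lagging device is dropped as late. With a clock per key, each device is judged only against its own progress.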
[upd] Updated CSVSource
Some breaking changes were made to CSVSource to make it easier to use:
- It now accepts CSV files in arbitrary formats and produces each row as a message value, making it less opinionated about the data format.
- It now requires the name to be passed directly. Previously, it used the file name as the name of the source.
- Message keys and timestamps can be extracted from the rows via the key_extractor and timestamp_extractor params
- Removed params key_serializer and value_serializer
PR by @daniil-quix in #602
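The extractor pattern described above can be sketched with the standard library alone. Only the parameter names `key_extractor` and `timestamp_extractor` come from these notes. The function `rows_to_messages`, the column names, and the tuple layout are hypothetical, and the real CSVSource signature may differ.

```python
import csv
import io


def rows_to_messages(csv_text, key_extractor, timestamp_extractor):
    """Turn each CSV row into a (key, timestamp, value) tuple."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # The whole row is the message value; key and timestamp are derived from it
        yield key_extractor(row), timestamp_extractor(row), row


data = "device,ts,temperature\nsensor-1,1000,21.5\nsensor-2,1005,19.0\n"
messages = list(
    rows_to_messages(
        data,
        key_extractor=lambda row: row["device"],
        timestamp_extractor=lambda row: int(row["ts"]),
    )
)
```

Because the extractors are plain callables, the CSV layout stays arbitrary: the caller decides which columns, if any, become the key and timestamp.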
Bug fixes
- Fix invalid mapping for oauth_cb in BaseSettings by @daniil-quix in #606
Dependencies
- Update confluent-kafka requirement from <2.5,>=2.2 to >=2.6,<2.7 by @dependabot in #578
Docs
- Update README by @gwaramadze in #604
- Update sinks.md by @SteveRosam in #610
Full Changelog: v3.1.1...v3.2.0
v3.1.1
What's Changed
Fixes
Other
- Bump pydantic-settings for Conda by @gwaramadze in #589
- Create pre-commit hook that checks Conda requirements by @gwaramadze in #596
- Turn on isort check in Ruff by @gwaramadze in #597
- Update Documentation by @github-actions in #598
Full Changelog: v3.1.0...v3.1.1
v3.1.0
What's Changed
[NEW] Apache Iceberg sink
A new sink that writes batches of data to an Apache Iceberg table.
It serializes incoming data batches into Parquet format and appends them to the
Iceberg table, updating the table schema as necessary.
Currently, it supports Apache Iceberg hosted in AWS and AWS Glue data catalogs.
To learn more about the Iceberg sink, see the docs.
Added by @tomas-quix in #555
Docs
- Update import paths in sources docs by @daniil-quix in #570
- Fix missing imports in Windowing docs by @daniil-quix in #574
- Update README.md by @mikerosam in #582
- Iceberg sink docs by @daniil-quix in #586
- Chore/docs updates by @daniil-quix in #577
Dependencies
- Update pydantic-settings requirement from <2.6,>=2.3 to >=2.3,<2.7 by @dependabot in #583
- Bump testcontainers from 4.8.1 to 4.8.2 by @dependabot in #579
Misc
- Abstract away the state update cache by @daniil-quix in #576
- Add Conda release script by @gwaramadze in #571
- app: Add option to select store backend by @quentin-quix in #544
- Refactor WindowedRocksDBPartitionTransaction.get_windows by @gwaramadze in #558
New Contributors
- @tomas-quix made their first contribution in #555
Full Changelog: v3.0.0...v3.1.0
v3.0.0
Quix Streams v3.0.0
Why the "major" version bump (v2.X --> v3.0)?
Quix Streams v3.0 brings branching and multiple topic consumption support, which changed some functionality under the hood. We want users to be mindful when upgrading to v3.0.
❗ Potential breaking change ❗ - Dropping Python v3.8 support:
Python v3.8 reaches End of Life in October 2024, so we are dropping support for Python v3.8 accordingly.
We currently support Python v3.9 through v3.12.
❗ Potential breaking change ❗ - keyword arguments only for Application:
While not really a functional change (and most people are doing this anyway), v3.0 enforces that all arguments for Application are passed as keyword arguments rather than positional ones, so be sure to check this during your upgrade!
Previously (v2.X):
app = Application("localhost:9092")
Now (v3.0):
app = Application(broker_address="localhost:9092")
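One common way a Python class can enforce keyword-only arguments is a bare `*` in its signature. The sketch below uses a hypothetical `KeywordOnlyApp` class, not the real Application, and makes no claim about how Quix Streams implements the check. It only shows what the failure looks like for v2.X-style positional calls.

```python
class KeywordOnlyApp:
    """Hypothetical stand-in; not the real quixstreams.Application."""

    def __init__(self, *, broker_address=None, consumer_group="default"):
        # The bare `*` makes every following parameter keyword-only
        self.broker_address = broker_address
        self.consumer_group = consumer_group


app = KeywordOnlyApp(broker_address="localhost:9092")

try:
    KeywordOnlyApp("localhost:9092")  # positional call, as in v2.X-style code
    positional_allowed = True
except TypeError:
    positional_allowed = False
```

A positional call fails immediately with a `TypeError` at construction time, so this kind of breakage surfaces on startup rather than during processing.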
❗ Potential "data-altering" change ❗ - changelog topic name adjustment for "named" windows:
This change is primarily for accommodating windowing with branching.
If you have a windowed operation where the name parameter was provided (ex: sdf.tumbling_window(name=<NAME>)), the changelog topic naming scheme has changed, meaning a new topic will be created and the window will temporarily be inaccurate since it will start from scratch.
It's important to note that this change will not cause an exception to be raised, so be aware!!
❗ Existing Sources and Sinks have been moved ❗
To accommodate the new structure in Connectors, we moved existing Sinks and Sources to new modules.
To use them, you need to update the import paths:
- InfluxDB3Sink -> quixstreams.sinks.core.influxdb3.InfluxDB3Sink
- CSVSink -> quixstreams.sinks.core.csv.CSVSink
- KafkaReplicatorSource -> quixstreams.sources.core.kafka.KafkaReplicatorSource
- CSVSource -> quixstreams.sources.core.csv.CSVSource
- QuixEnvironmentSource -> quixstreams.sources.core.kafka.QuixEnvironmentSource
v3.0 General Backwards compatibility with v2.X
v3.0 should otherwise be fully backwards compatible with any code working with 2.X (assuming no other breaking changes between 2.X versions you upgraded from) and should produce the same results. However, pay close attention to your apps after upgrading, just in case!
To learn more about the specifics of the underlying StreamingDataFrame assignment pattern adjustments, along with some supplemental clarifications, check out the new assignment rules docs section, which also highlights the differences between v2.X and v3.0 (in short: always re-assign your SDFs and you'll be good).
❗ Potential Breaking Changes (summarized) ❗
- Dropping Support for Python v3.8
- Topic naming change for explicitly named StreamingDataFrame Window operations
- Enforcement of keyword argument usage only for Application
- Removal of deprecated Application.Quix() (can just use Application now)
- Moved Sinks and Sources
🌱 New Features 🌱
- StreamingDataFrame Branching
- Consuming multiple topics per Application ("multiple StreamingDataFrames")
- Automatic StreamingDataFrame tracking (no arguments needed for Application.run())
1. StreamingDataFrame (SDF) Branching
An SDF can now be "branched" (or forked) into multiple independent operations (with no limit on the number of branches).
Previously (v2.X), only linear operations were possible:
sdf
└── apply()
    └── apply()
        └── apply()
            └── apply()
But now (v3.0), things like this are possible:
sdf
└── apply()
    └── apply()
        ├── apply()
        │   └── apply()
        └── filter() - (does following operations only to this filtered subset)
            ├── apply()
            ├── apply()
            └── apply()
Or, as an (unrelated) simple pseudo code-snippet form:
sdf_0 = app.dataframe().apply(func_a)
sdf_0 = sdf_0.apply(func_b) # sdf_0 -> sdf_0: NOT a (new) branch
sdf_1 = sdf_0.apply(func_c) # sdf_0 -> sdf_1: generates new branch off sdf_0
sdf_2 = sdf_0.apply(func_d) # sdf_0 -> sdf_2: generates new branch off sdf_0
app.run()
What Branches enable:
- Handle multiple data formats/transformations in one Application
- Conditional operations
  - ex: producing to different topics based on different criteria
- Consolidating Applications that originally spanned multiple due to previous limitations
Limitations of Branching:
- Cannot filter or column assign using two different branches together at once (see docs for more info)
- Copies data for each branch, which can have performance implications (but may be better compared to running another Application).
To learn more, check out the in-depth branching docs.
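The data-copying limitation above can be illustrated with a toy sketch. It is not how Quix Streams executes branches. The function `run_branches` and the step functions are hypothetical, and the point is only that each branch needs its own copy of a value so that in-place changes in one branch do not leak into another.

```python
import copy


def run_branches(value, branches):
    """Run each branch (a list of step functions) on its own copy of `value`."""
    outputs = []
    for steps in branches:
        branch_value = copy.deepcopy(value)  # isolate branches from each other
        for step in steps:
            branch_value = step(branch_value)
        outputs.append(branch_value)
    return outputs


def add_fahrenheit(value):
    # Mutates the value in place, which is why the copy matters
    value["temp_f"] = value["temp_c"] * 9 / 5 + 32
    return value


def only_id(value):
    return {"id": value["id"]}


original = {"id": "sensor-1", "temp_c": 20}
outputs = run_branches(original, [[add_fahrenheit], [only_id]])
```

The copy per branch is the cost referred to above: it grows with the number of branches and the size of each value.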
2. Multiple Topic Consumption (multiple StreamingDataFrames)
Applications now support consuming multiple topics by initializing multiple StreamingDataFrames (SDFs) with an Application:
from quixstreams import Application

# v3.0 requires keyword arguments for Application
app = Application(broker_address="localhost:9092")
input_topic_a = app.topic("input_a")
input_topic_b = app.topic("input_b")
output_topic = app.topic("output")

# func_x and func_y are placeholders for your own processing functions
sdf_a = app.dataframe(input_topic_a)
sdf_a = sdf_a.apply(func_x).to_topic(output_topic)

sdf_b = app.dataframe(input_topic_b)
sdf_b = sdf_b.update(func_y).to_topic(output_topic)

app.run()
Each SDF can then perform any operations you could normally perform, including branching (but each SDF should be treated as if the others do not exist).
Also, note they run concurrently (1 consumer that's subscribed to multiple topics), NOT in parallel.
3. Automatic StreamingDataFrame tracking
As a result of branching and multiple SDFs, it was necessary to automate the tracking of SDFs, so you no longer need to provide any SDF when calling Application.run():
Previously (v2.X):
app = Application("localhost:9092")
sdf = app.dataframe(topic)
app.run(sdf)
Now (v3.0):
app = Application(broker_address="localhost:9092")
sdf = app.dataframe(topic)
app.run()
💎 Enhancements 💎
- Extensive Documentation improvements and additions
🦠 Bugfixes 🦠
- Fix issue with handling of Quix Cloud topics where topic was being created with the workspace ID appended twice.
- Overlapping window names now properly print a message saying how to solve it.
Full Changelog: v2.11.1...v3.0.0
v2.11.1
What's Changed
Fixes
- Fix QuixEnvironmentSource behavior when streaming data from one Quix environment to another by @quentin-quix in #520
- Fix consumers not fetching the data when connecting to the Quix broker by temporarily downgrading confluent-kafka to 2.4.0 by @daniil-quix in #522
Other changes
- Update custom-sources.md by @mikerosam in #509
- Add an example of custom websocket source by @daniil-quix in #505
- Update README.md by @daniil-quix in #512, #514
- Script to test Conda build by @gwaramadze in #493
- Document writing custom source in a Jupyter Notebook by @quentin-quix in #518
- Update Documentation by @github-actions in #521
Full Changelog: v2.11.0...v2.11.1
v2.11.0
What's Changed
[New] Source API and built-in Sources
With the new Sources API, you can stream data from any data source to a Kafka topic and process it with Streaming DataFrames in the same application.
You can either use one of the built-in sources (e.g. KafkaReplicatorSource, CSVSource, QuixEnvironmentSource) or create a custom one.
To learn more about Sources, please see the Sources documentation
PRs: #420, #448, #490, #494, #495, #498, #506
Dependencies updates
- Bump testcontainers from 4.8.0 to 4.8.1 by @dependabot in #492
- Update pydantic requirement from <2.9,>=2.7 to >=2.7,<2.10 by @dependabot in #491
- Update pydantic-settings requirement from <2.5,>=2.3 to >=2.3,<2.6 by @dependabot in #500
Documentation updates
- Fix typo in FixedTimeWindowDefinition.reduce docstring by @gwaramadze in #501
- Fix documentation link by @gwaramadze in #478
- Update README.md by @daniil-quix in #479
Other changes
- Add Conda configuration by @gwaramadze in #482
- Optimize get_window_ranges function by @gwaramadze in #489
Full Changelog: v2.10.0...v2.11.0
v2.10.0
What's Changed
Schema Registry Support
Introduced Schema Registry support for JSONSchema, Avro, and Protobuf formats.
To learn how to use Schema Registry, please follow the docs on the Schema Registry page.
PRs: #447, #449, #451, #454, #458, #472, #476
Dependencies updates
- Support confluent-kafka versions 2.5.x by @gwaramadze in #459
- Bump testcontainers from 4.5.1 to 4.8.0 by @dependabot in #462
- Update pydantic requirement from <2.8,>=2.7 to >=2.7,<2.9 by @dependabot in #463
- Update pydantic-settings requirement from <2.4,>=2.3 to >=2.3,<2.5 by @dependabot in #464
- Update pre-commit requirement from <3.5,>=3.4 to >=3.4,<3.9 by @dependabot in #465
- Update black requirement from <24.4,>=24.3.0 to >=24.3.0,<24.9 by @dependabot in #466
Documentation updates
- fix(docs): minor correction in an example by @shrutimantri in #444
- fix(docs): correcting the output showcased for word count with other minor corrections by @shrutimantri in #445
- Update docs headers structure by @daniil-quix in #456
Other changes
- Application config API by @quentin-quix in #470
Full Changelog: v2.9.0...v2.10.0
v2.9.0
What's Changed
NEW: Optional installs (extras)
With this release, we have introduced optional requirements for various features. Each requirement is noted alongside the feature that needs it.
To install one, simply do pip install quixstreams[{extra}] (or a comma-separated list like extra1,extra2).
There is also an option to install all extras with extra=all (pip install quixstreams[all]).
Features
More Message Serialization Options
Additional serialization options have been added:
- JSON Schema (original plain JSON option still supported)
- Avro (requires installed extra=avro)
- Protobuf (requires installed extra=protobuf)
For more details on their usage, see the Serialization docs.
Sinks (beta)
NOTE: This feature is in beta; functionality may change at any time!
We have introduced a new Sink API/framework for sending data from Kafka to an external destination in a robust manner. It also includes a template/class for users to build their own sink implementations!
We have also included two fully implemented sinks for use out of the box:
- InfluxDB v3
- CSV
Example usage with InfluxDB v3:
from quixstreams import Application
from quixstreams.sinks.influxdb3 import InfluxDB3Sink
app = Application(broker_address="localhost:9092")
topic = app.topic("numbers-topic")
# Initialize InfluxDB3Sink
influx_sink = InfluxDB3Sink(...params...)
sdf = app.dataframe(topic)
# Do some processing here ...
# Sink data to InfluxDB
sdf.sink(influx_sink)
For more details on their usage, see the Sinks docs
commit_every option for Applications
Applications can now commit every M consumed messages in addition to every N seconds (whichever occurs first for that checkpoint).
By default, it is 0, which means no limit (how it worked before introducing this setting).
For more details, see the Checkpoint docs
app = Application(commit_every=10000)
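The "whichever occurs first" rule can be written down as a small predicate. This is a sketch of the rule as described in these notes, not the library's checkpointing code. The function name `should_commit` and the parameter name `commit_interval` for the time limit are assumptions made for illustration.

```python
def should_commit(elapsed_seconds, messages_consumed, commit_interval, commit_every=0):
    """Return True when a checkpoint should be committed.

    commit_every=0 disables the message-count limit, leaving only the timer.
    """
    if elapsed_seconds >= commit_interval:
        return True
    return commit_every > 0 and messages_consumed >= commit_every


# Message limit reached before the timer expires
hit_by_count = should_commit(1.0, 10_000, commit_interval=5.0, commit_every=10_000)
# Timer expires even though few messages were consumed
hit_by_time = should_commit(5.0, 12, commit_interval=5.0, commit_every=10_000)
# Default of 0: message count alone never triggers a commit
no_limit = should_commit(1.0, 1_000_000, commit_interval=5.0)
```

A message limit mainly helps under high throughput, where waiting for the timer would let a single checkpoint grow very large.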
errors option for StreamingDataFrame.drop()
By default, an exception is raised when the specified column(s) are missing. You can now suppress it with errors="ignore".
app = Application()
sdf = app.dataframe()
sdf = sdf.drop(["col_a", "col_b"], errors="ignore")
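The semantics of the two modes can be sketched on a plain dictionary. This is an illustration only, not the StreamingDataFrame.drop() implementation. The helper `drop_columns` is hypothetical, and the mode name "raise" for the default is an assumption borrowed from the pandas convention.

```python
def drop_columns(record, columns, errors="raise"):
    """Return a copy of `record` without `columns`.

    errors="raise" raises KeyError for a missing column;
    errors="ignore" skips it silently.
    """
    if errors not in ("raise", "ignore"):
        raise ValueError(f"Unknown errors mode: {errors!r}")
    result = dict(record)
    for column in columns:
        if column in result:
            del result[column]
        elif errors == "raise":
            raise KeyError(column)
    return result


row = {"col_a": 1, "col_c": 3}
cleaned = drop_columns(row, ["col_a", "col_b"], errors="ignore")

try:
    drop_columns(row, ["col_b"])
    raised = False
except KeyError:
    raised = True
```

Ignoring missing columns is useful when messages in one topic do not all share the same fields.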
Enhancements
- README updates
- Various Documentation improvements
Changelog
Full Changelog: v2.8.1...v2.9.0