StreamSQL is a feature
store for machine learning.
StreamSQL
accelerates machine learning development by:
1.
Generating model
features for serving using declarative definitions
2.
Generating training
sets using the same feature definitions as serving
3.
Versioning,
monitoring, and managing features
4.
Allowing features to
be shared, re-used, and discovered features across teams and
models
How it
works
The general workflow to
getting the feature store up and running is to:
1.
Connect your data
sources or upload data directly to StreamSQL
2.
Optionally transform
and join your data using SQL
3.
Register your feature
definitions
4.
Serve features in
production or generate training datasets from your labels
At any point, you can
also:
Add new data sources
or transformations
Create or evolve
features
Analyze and discover
features in the feature registry
Why to use
StreamSQL
Guarantee consistent
features between training and serving
StreamSQL allow new model
features to be deployed confidently and with ease. It uses the feature
definitions that you declare to generate training datasets and to serve the same
features in production. This removes all the time spent re-engineering model
pipelines to generate the serving features, and removes a class of bugs stemming
from inconsistent features in serving and production.
Maintain a single source
of truth for features
StreamSQL allows
organizations to keep a repository of versioned features. It's common for
multiple models to require essentially the same features. Without a central
feature repository, teams will have to build and maintain their own feature
generation pipelines. This can lead to a large amount of inconsistent features
trying to model the same thing, and tons of wasted time and repeated
effort.
Share and re-use features
across teams and models
StreamSQL allows feature
engineering advancements made by one team to be shared by others. Feature
engineering is a creative and time-consuming effort. By treating features as
building blocks for your models, teams can share and re-use features to increase
model performance across the organization.
Unify stream and batch
processing for feature generation
StreamSQL allows machine
learning teams to think at a higher level of abstraction then is possible with
Flink and Spark. Files, tables, and streams can be connected to StreamSQL and
then transformed and joined using SQL before being turned into features. Once
the data is prepared features may be defined declaratively and StreamSQL will
handle generating them for training and serving.
Manage your feature
development with built-in versioning
Good feature management
simplifies and accelerates the machine learning process. Features are defined
with a consistent interface in a central repository. Anyone can dig into how a
feature is being generated and depend on a specific version without breaking
changes. Using the feature registry UI, you can quickly understand the features
datatype and statistical properties.