Introduction
StreamSQL is a feature store for machine learning.
StreamSQL accelerates machine learning development by:
Generating model features for serving using declarative definitions
Generating training sets using the same feature definitions as serving
Versioning, monitoring, and managing features
Allowing features to be shared, re-used, and discovered across teams and models
How it works
The general workflow for getting the feature store up and running is to:
Connect your data sources or upload data directly to StreamSQL
Optionally transform and join your data using SQL
Register your feature definitions
Serve features in production or generate training datasets from your labels
At any point, you can also:
Add new data sources or transformations
Create or evolve features
Analyze and discover features in the feature registry
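The workflow above can be illustrated with a small in-memory sketch. It is not the StreamSQL API: `ToyFeatureStore`, `FeatureDef`, and all method names are hypothetical stand-ins, and the transformation step uses a Python function where StreamSQL would use SQL. The point is only to show one registered definition feeding both serving and training set generation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical declarative feature definition (not StreamSQL syntax)."""
    name: str          # feature name
    source: str        # name of the source the feature reads from
    entity_key: str    # column that identifies the entity, e.g. a user
    column: str        # column to aggregate
    agg: Callable[[List[float]], float]  # aggregation over the entity's rows


class ToyFeatureStore:
    """In-memory stand-in used only to illustrate the workflow."""

    def __init__(self) -> None:
        self.sources: Dict[str, List[dict]] = {}
        self.features: Dict[str, FeatureDef] = {}

    def add_source(self, name: str, rows: List[dict]) -> None:
        # Step 1: connect or upload data.
        self.sources[name] = list(rows)

    def transform(self, name: str, source: str,
                  fn: Callable[[List[dict]], List[dict]]) -> None:
        # Step 2: derive a new source from an existing one.
        self.sources[name] = fn(self.sources[source])

    def register(self, feature: FeatureDef) -> None:
        # Step 3: register the feature definition.
        self.features[feature.name] = feature

    def serve(self, feature_name: str, entity: str) -> float:
        # Step 4a: compute the feature value for one entity.
        f = self.features[feature_name]
        values = [row[f.column] for row in self.sources[f.source]
                  if row[f.entity_key] == entity]
        return f.agg(values)

    def training_set(self, feature_names: List[str],
                     labels: List[Tuple[str, int]]) -> List[dict]:
        # Step 4b: join labels with features using the same definitions.
        return [
            {"entity": entity,
             **{n: self.serve(n, entity) for n in feature_names},
             "label": label}
            for entity, label in labels
        ]


store = ToyFeatureStore()
store.add_source("purchases", [
    {"user": "a", "amount": 10.0, "refunded": False},
    {"user": "a", "amount": 5.0, "refunded": True},
    {"user": "b", "amount": 7.0, "refunded": False},
])
store.transform("valid_purchases", "purchases",
                lambda rows: [r for r in rows if not r["refunded"]])
store.register(FeatureDef("total_spend", "valid_purchases",
                          "user", "amount", sum))

online_value = store.serve("total_spend", "a")
training_rows = store.training_set(["total_spend"], [("a", 1), ("b", 0)])
```

Because `training_set` calls `serve` internally, the value a model sees during training is produced by the same code path as the value it sees in production.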
Why use StreamSQL
Guarantee consistent features between training and serving
StreamSQL allows new model features to be deployed confidently and with ease. It uses the feature definitions that you declare to generate training datasets and to serve the same features in production. This removes the time spent re-engineering model pipelines to generate the serving features, and removes a class of bugs stemming from inconsistent features between training and serving.
Maintain a single source of truth for features
StreamSQL allows organizations to keep a repository of versioned features. It's common for multiple models to require essentially the same features. Without a central feature repository, teams have to build and maintain their own feature generation pipelines. This can lead to a large number of inconsistent features that try to model the same thing, along with wasted time and repeated effort.
Share and re-use features across teams and models
StreamSQL allows feature engineering advancements made by one team to be shared by others. Feature engineering is a creative and time-consuming effort. By treating features as building blocks for your models, teams can share and re-use features to increase model performance across the organization.
Unify stream and batch processing for feature generation
StreamSQL allows machine learning teams to think at a higher level of abstraction than is possible with Flink and Spark. Files, tables, and streams can be connected to StreamSQL and then transformed and joined using SQL before being turned into features. Once the data is prepared, features may be defined declaratively and StreamSQL will handle generating them for training and serving.
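As a rough illustration of what unifying batch and stream processing means, the sketch below evaluates one declarative spec in two ways: over a complete list of rows (batch) and incrementally, one event at a time (stream). The spec format, `batch_compute`, and `StreamingFeature` are all hypothetical and say nothing about how StreamSQL is implemented internally.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical declarative spec: count events of one type per key.
FEATURE_SPEC = {"name": "click_count", "key": "user", "event": "click"}


def batch_compute(spec: dict, rows: Iterable[dict]) -> Dict[str, int]:
    """Evaluate the spec over a complete dataset in one pass."""
    counts: Dict[str, int] = defaultdict(int)
    for row in rows:
        if row["type"] == spec["event"]:
            counts[row[spec["key"]]] += 1
    return dict(counts)


class StreamingFeature:
    """Evaluate the same spec incrementally as events arrive."""

    def __init__(self, spec: dict) -> None:
        self.spec = spec
        self.state: Dict[str, int] = defaultdict(int)

    def update(self, event: dict) -> None:
        if event["type"] == self.spec["event"]:
            self.state[event[self.spec["key"]]] += 1

    def lookup(self, key: str) -> int:
        # .get avoids creating an entry for keys that were never seen.
        return self.state.get(key, 0)


events = [
    {"user": "a", "type": "click"},
    {"user": "b", "type": "view"},
    {"user": "a", "type": "click"},
    {"user": "b", "type": "click"},
]

batch_result = batch_compute(FEATURE_SPEC, events)

stream = StreamingFeature(FEATURE_SPEC)
for event in events:
    stream.update(event)
stream_result = dict(stream.state)
```

Both paths read the same spec, so a change to the definition applies to the training (batch) and serving (stream) values together.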
Manage your feature development with built-in versioning
Good feature management simplifies and accelerates the machine learning process. Features are defined with a consistent interface in a central repository. Anyone can dig into how a feature is being generated and depend on a specific version without risking breaking changes. Using the feature registry UI, you can quickly understand a feature's data type and statistical properties.
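The idea of pinning a feature version can be sketched with a toy registry. `ToyRegistry` and its methods are hypothetical and do not reflect the StreamSQL registry interface. Each registration appends a new version and records a data type and a simple statistic, so an existing consumer can keep reading version 1 after version 2 is published.

```python
from statistics import mean
from typing import Dict, List, Optional


class ToyRegistry:
    """Hypothetical append-only feature registry with 1-based versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[dict]] = {}

    def register(self, name: str, definition: str,
                 sample_values: List[float]) -> int:
        entry = {
            "definition": definition,
            # Data type and a summary statistic taken from sample values.
            "dtype": type(sample_values[0]).__name__,
            "mean": mean(sample_values),
        }
        self._versions.setdefault(name, []).append(entry)
        return len(self._versions[name])  # version number just created

    def get(self, name: str, version: Optional[int] = None) -> dict:
        versions = self._versions[name]
        # No version means "latest"; otherwise return the pinned version.
        return versions[-1] if version is None else versions[version - 1]


registry = ToyRegistry()
v1 = registry.register("avg_basket", "AVG(amount) over 7 days",
                       [10.0, 20.0])
v2 = registry.register("avg_basket", "AVG(amount) over 30 days",
                       [10.0, 20.0, 30.0])

pinned = registry.get("avg_basket", version=1)  # unaffected by v2
latest = registry.get("avg_basket")
```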