StreamSQL is a feature store for machine learning.
StreamSQL accelerates machine learning development by:
1. Generating model features for serving from declarative definitions
2. Generating training sets from the same feature definitions used for serving
3. Versioning, monitoring, and managing features
4. Allowing features to be shared, reused, and discovered across teams and models
The general workflow for getting the feature store up and running is to:
1. Connect your data sources or upload data directly to StreamSQL
2. Optionally transform and join your data using SQL
3. Register your feature definitions
4. Serve features in production or generate training datasets from your labels
At any point, you can also:
- Add new data sources or transformations
- Create or evolve features
- Analyze and discover features in the feature registry
StreamSQL allows new model features to be deployed confidently and with ease. It uses the feature definitions that you declare both to generate training datasets and to serve the same features in production. This eliminates the time spent re-engineering model pipelines to generate serving features, and removes a class of bugs that stem from inconsistencies between the features used in training and those used in serving.
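The idea of one definition driving both paths can be illustrated with a minimal sketch. Everything here (`FeatureDefinition`, `compute`, `training_set`, `serve`, and the sample data) is hypothetical and written for illustration only; it is not StreamSQL's API. The point is that both the training path and the serving path call the same computation, so they cannot drift apart.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types and names for illustration; not StreamSQL's API.
Event = Dict[str, object]


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    entity_key: str                              # field that identifies the entity
    aggregate: Callable[[List[Event]], float]    # how to reduce an entity's events


def compute(defn: FeatureDefinition, events: List[Event], entity_id: str) -> float:
    """Single code path that evaluates a feature for one entity."""
    rows = [e for e in events if e[defn.entity_key] == entity_id]
    return defn.aggregate(rows)


def training_set(defn: FeatureDefinition, events: List[Event],
                 labels: Dict[str, int]) -> List[Tuple[float, int]]:
    """Pair each labeled entity with its feature value."""
    return [(compute(defn, events, entity_id), label)
            for entity_id, label in labels.items()]


def serve(defn: FeatureDefinition, events: List[Event], entity_id: str) -> float:
    """Serving reuses exactly the same computation as training."""
    return compute(defn, events, entity_id)


avg_purchase = FeatureDefinition(
    name="avg_purchase",
    entity_key="user",
    aggregate=lambda rows: sum(r["amount"] for r in rows) / len(rows) if rows else 0.0,
)

events = [
    {"user": "a", "amount": 10.0},
    {"user": "a", "amount": 30.0},
    {"user": "b", "amount": 5.0},
]
labels = {"a": 1, "b": 0}

train_rows = training_set(avg_purchase, events, labels)
online_value = serve(avg_purchase, events, "a")
```

In this sketch the value served for user `a` is the same value that appears in that user's training row, because both come from `compute`.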
StreamSQL allows organizations to keep a repository of versioned features. It's common for multiple models to require essentially the same features. Without a central feature repository, each team has to build and maintain its own feature generation pipelines. This can lead to a large number of inconsistent features that try to model the same thing, along with a great deal of wasted time and repeated effort.
StreamSQL allows feature engineering advancements made by one team to be shared by others. Feature engineering is a creative and time-consuming effort. When features are treated as building blocks for models, teams can share and reuse them to increase model performance across the organization.
StreamSQL allows machine learning teams to think at a higher level of abstraction than is possible with Flink and Spark. Files, tables, and streams can be connected to StreamSQL and then transformed and joined using SQL before being turned into features. Once the data is prepared, features can be defined declaratively, and StreamSQL handles generating them for training and serving.
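As a rough illustration of this flow, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the data layer. The table names, the `enriched_purchases` view, the dictionary-based feature definition, and the `materialize` helper are all assumptions made for this example and do not reflect how StreamSQL itself is configured. It shows the two stages: a SQL join that prepares the data, followed by a declarative description of the feature that a small helper turns into a query.

```python
import sqlite3

# In-memory SQLite database standing in for connected data sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE purchases (user_id TEXT, amount REAL);
    INSERT INTO users VALUES ('a', 'US'), ('b', 'DE');
    INSERT INTO purchases VALUES ('a', 10.0), ('a', 30.0), ('b', 5.0);

    -- Transformation step: join the two sources into one prepared view.
    CREATE VIEW enriched_purchases AS
        SELECT p.user_id, u.country, p.amount
        FROM purchases AS p
        JOIN users AS u ON u.user_id = p.user_id;
""")

# Hypothetical declarative feature definition: it states what to compute,
# not how to compute it.
feature = {
    "name": "total_spend",
    "source": "enriched_purchases",
    "entity": "user_id",
    "column": "amount",
    "agg": "SUM",
}


def materialize(conn: sqlite3.Connection, feature: dict) -> dict:
    """Turn a declarative definition into a query and return {entity: value}.

    Identifiers are interpolated directly, so this sketch assumes the
    definition comes from a trusted source.
    """
    entity, column = feature["entity"], feature["column"]
    agg, source = feature["agg"], feature["source"]
    query = f"SELECT {entity}, {agg}({column}) FROM {source} GROUP BY {entity}"
    return dict(conn.execute(query).fetchall())


total_spend = materialize(conn, feature)
```

Here `total_spend` maps each `user_id` to the sum of that user's purchase amounts. Changing the feature means editing the definition, not the query-building code.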
Good feature management simplifies and accelerates the machine learning process. Features are defined with a consistent interface in a central repository. Anyone can dig into how a feature is being generated and depend on a specific version without being exposed to breaking changes. Using the feature registry UI, you can quickly understand a feature's data type and statistical properties.
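Version pinning can be pictured with a small in-memory registry. This is a sketch under stated assumptions: `FeatureRegistry`, `FeatureVersion`, and their methods are hypothetical names invented for this example, not StreamSQL's registry interface. Each registration of an existing name creates a new immutable version, so a consumer that pins version 1 is unaffected when version 2 is published.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FeatureVersion:
    """Immutable record of one version of a feature (hypothetical shape)."""
    name: str
    version: int
    dtype: str
    definition: str


class FeatureRegistry:
    """Minimal in-memory registry; illustrative only."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, int], FeatureVersion] = {}
        self._latest: Dict[str, int] = {}

    def register(self, name: str, dtype: str, definition: str) -> FeatureVersion:
        # Re-registering a name never overwrites; it adds the next version.
        version = self._latest.get(name, 0) + 1
        record = FeatureVersion(name, version, dtype, definition)
        self._versions[(name, version)] = record
        self._latest[name] = version
        return record

    def get(self, name: str, version: Optional[int] = None) -> FeatureVersion:
        # With no version given, return the most recent one.
        if version is None:
            version = self._latest[name]
        return self._versions[(name, version)]


registry = FeatureRegistry()
registry.register("total_spend", "float", "SUM(amount) per user_id")
registry.register("total_spend", "float",
                  "SUM(amount) per user_id over the last 30 days")

pinned = registry.get("total_spend", version=1)   # stable for existing models
latest = registry.get("total_spend")              # picks up the newest version
```

A model that depends on `pinned` keeps the original definition, while new work can adopt `latest` when it is ready.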