Transform and Join Data
Users can use SQL to generate new tables or streams from existing ones. These generated data sources can be used like other data sources. Materialized views can be a transformation on a single table or stream, a join of a table and a stream, a union of multiple tables, or even joining multiple different sources.
Materialized views can be created to maintain different subsets to build features upon. For example, let's say I have a stream named
purchasesthat has the columns
user, item, stateand I would like to create CA specific features. I can create a materialized view for California purchases as follows:
query="SELECT id, user, item FROM purchases WHERE state = 'CA'",
nameis the name of the materialized stream and can used in the same way that a normal source stream can be used, except that you can not directly write to it.
querydefines the SQL transformation to apply on the
dependenciesthat are listed.
output_columnsallow columns to be renamed.
You must define your dependencies and output_columns. StreamSQL does not currently support implicitly defining them in the SQL.
Streams can be transformed into aggregate tables using a
GROUP BYtransformation. In this situation, the
materialize_tablecall should be used since we're creating a table from the stream. This is a common way to prepare streaming data for use in model features.
query="SELECT user, COUNT(*) FROM purchases GROUP BY user",
GROUP BY's can be used to generate arrays of items using the
collectfunction. For example, if I was to create a model feature that is an array of all a user's purchases, I can do that as follows:
query="SELECT user, COLLECT(item) FROM purchases GROUP BY user",
JOINbetween two tables or across a table and a stream are supported. This allows for streams to be enriched with immutable tables like CSV files. For example, we can enrich the stream with the price of the item as follows:
query="SELECT p.id, p.user, p.item, catalog.price " +
"FROM purchases p INNER JOIN catalog ON p.item = catalog.id",
output_columns=["id", "user", "item", "price"],
JOINbetween two streams or using any dynamic tables are currently unsupported.