(Local) Iris Dataset
This example builds features and trains a simple logistic regression model to classify flowers, using the Iris dataset. It highlights how StreamSQL can be used locally: the local version stores and processes all data on the machine it runs on.
The Iris dataset can be found here. It contains 150 rows in a CSV file with five columns: sepal length, sepal width, petal length, petal width, and species. The four length and width columns are numerical, whereas the species column is a categorical string with three possible values: setosa, versicolor, and virginica.
We will apply z-score normalization to each of the numerical columns to turn them into features, then feed them into the logistic regression implementation in scikit-learn.
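As a rough illustration of what z-score normalization does to a numerical column, the sketch below rescales a handful of made-up values so that they have mean 0 and standard deviation 1. The helper name z_score is hypothetical, and the sketch uses the population standard deviation; this guide does not state which variant StreamSQL's ZScore applies.

```python
import statistics


def z_score(values):
    """Rescale each value to (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Population standard deviation (an assumption for this sketch).
    # A constant column would make this zero and the division fail.
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up measurements, for illustration only.
sepal_lengths = [5.1, 4.9, 6.3, 5.8, 7.1]
normalized = z_score(sepal_lengths)
```

After normalization, columns measured on different scales become directly comparable, which tends to help the logistic regression solver converge.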
Make sure that Python and pip are installed on your local machine. Next, install StreamSQL's Python client using pip. Check out the Getting Started section for a more thorough walkthrough of setting up StreamSQL on your machine.
pip install streamsql
The Iris dataset can be downloaded here. The downloaded file is named iris.data and should be moved to your working directory. We can then initialize a local feature store instance and connect the Iris dataset to it. The LocalFeatureStore does not require an API key and does not upload to a server, as it's made for local development. When uploading the file, we have to specify the format as CSV since it's not implied by the .data extension.
from streamsql.local import LocalFeatureStore
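feat = LocalFeatureStore()
# Names follow the file's column order:
# sepal length, sepal width, petal length, petal width, species.
cols = ["sepal_l", "sepal_w", "petal_l", "petal_w", "species"]
table = feat.upload_file(
    "./iris.data",
    name="iris",
    format="csv",
    columns=cols,
)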
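Because the file has no header row, the column names have to be supplied separately and must follow the file's column order. The sketch below shows the same idea with Python's standard csv module instead of StreamSQL; the two rows are illustrative, and the short column names are hypothetical labels for sepal length, sepal width, petal length, petal width, and species.

```python
import csv
import io

# Two illustrative rows in the described layout: no header, five fields.
raw = "5.1,3.5,1.4,0.2,setosa\n7.0,3.2,4.7,1.4,versicolor\n"

# Hypothetical short names, listed in the file's column order.
cols = ["sepal_l", "sepal_w", "petal_l", "petal_w", "species"]

# fieldnames= tells DictReader that the first line is data, not a header.
rows = list(csv.DictReader(io.StringIO(raw), fieldnames=cols))

# The four measurement columns arrive as strings; convert them to floats.
for row in rows:
    for name in cols[:-1]:
        row[name] = float(row[name])
```

If the names are listed in the wrong order, every row still parses without error, so a mislabeled column is easy to miss.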
upload_file makes a local copy of the file to guarantee immutability. That means that later changes to the original file will not be reflected in the table.
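The copy semantics can be pictured with a plain file snapshot. The sketch below is not StreamSQL's implementation; it only uses the standard library to show why edits made to the source file after a copy is taken do not reach the copy. The directory layout and file contents are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
source = workdir / "iris.data"
source.write_text("5.1,3.5,1.4,0.2,setosa\n")

# Take a snapshot, standing in for the copy that upload_file is described as making.
snapshot = workdir / "snapshot" / "iris.data"
snapshot.parent.mkdir()
shutil.copy(source, snapshot)

# Editing the source afterwards leaves the snapshot untouched.
source.write_text("0,0,0,0,changed\n")
snapshot_text = snapshot.read_text()
```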
Our model features are the same as the first four columns in the CSV with z-score normalization applied.
from streamsql.operation import ZScore
from streamsql.feature import NumericFeature
# The features are the same as the CSV columns, excluding species.
features = cols[:-1]
for col in features:
    feat.register_feature(
        NumericFeature(
            name=col,
            table=table,
            column=col,
            normalization=ZScore(),
        ),
    )
dataset = feat.register_training_dataset(
    name="iris",
    label_table=table,
    label_field="species",
    features=features,
)
train, test = dataset.generate_training_data(
    training_size=0.9, test_size=0.1, shuffle=True,
)
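The split parameters above can be read as "shuffle the rows, then keep 90% for training and 10% for testing". The sketch below shows that logic in plain Python. It is an assumption about what the parameters mean, not StreamSQL's implementation, and the helper split_rows and its seed argument are hypothetical.

```python
import random


def split_rows(rows, training_size=0.9, test_size=0.1, shuffle=True, seed=0):
    """Split rows into (train, test) lists by the given fractions."""
    if abs(training_size + test_size - 1.0) > 1e-9:
        raise ValueError("training_size and test_size must sum to 1")
    rows = list(rows)  # copy so the caller's sequence is not reordered
    if shuffle:
        random.Random(seed).shuffle(rows)
    cut = round(len(rows) * training_size)
    return rows[:cut], rows[cut:]


# 150 row indices stand in for the 150 rows of the dataset.
train_rows, test_rows = split_rows(range(150))
```

With 150 rows this gives 135 training rows and 15 test rows. Shuffling matters when a file groups its rows by label, as the UCI copy of the Iris data does: an unshuffled tail of the file would then contain a single species.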
from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(train.X, train.y)
# Evaluate the model's accuracy on the training and testing data.
print(model.score(train.X, train.y))
print(model.score(test.X, test.y))
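For a point of comparison without StreamSQL, the same kind of pipeline can be sketched with scikit-learn alone, using its bundled copy of the Iris data. Two details here are assumptions rather than a description of StreamSQL's behavior: StandardScaler fits the mean and standard deviation on the training split only, and random_state is fixed just to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Four numerical columns in X, species encoded as integers in y.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.9, test_size=0.1, shuffle=True, random_state=0
)

# StandardScaler performs z-score normalization before the classifier.
baseline = make_pipeline(StandardScaler(), LogisticRegression())
baseline.fit(X_train, y_train)

# score() returns mean accuracy: the fraction of correctly classified rows.
train_accuracy = baseline.score(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

Because the test split holds only 15 rows, a single misclassified flower moves the test accuracy by almost seven percentage points, so small differences between runs are not meaningful.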
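# Complete example: the steps above combined into one script.
from streamsql.local import LocalFeatureStore
from streamsql.operation import ZScore
from streamsql.feature import NumericFeature
from sklearn.linear_model import LogisticRegression
feat = LocalFeatureStore()
# Names follow the file's column order:
# sepal length, sepal width, petal length, petal width, species.
cols = ["sepal_l", "sepal_w", "petal_l", "petal_w", "species"]
table = feat.upload_file(
    "./iris.data",
    name="iris",
    format="csv",
    columns=cols,
)
# The features are the same as the CSV columns, excluding species.
features = cols[:-1]
for col in features:
    feat.register_feature(
        NumericFeature(
            name=col,
            table=table,
            column=col,
            normalization=ZScore(),
        ),
    )
dataset = feat.register_training_dataset(
    name="iris",
    label_table=table,
    label_field="species",
    features=features,
)
train, test = dataset.generate_training_data(
    training_size=0.9, test_size=0.1, shuffle=True,
)
model = LogisticRegression().fit(train.X, train.y)
# Evaluate the model's accuracy on the training and testing data.
print(model.score(train.X, train.y))
print(model.score(test.X, test.y))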