(Local) Iris Dataset
Introduction
This example builds features and trains a simple logistic regression model to classify flowers, using the Iris dataset. It highlights how StreamSQL can be used locally: the local version stores and processes all data on the machine it runs on.
The Iris Dataset
The Iris dataset can be found here. It contains 150 rows in a CSV with five columns: sepal length, sepal width, petal length, petal width, and species. The four length and width columns are numerical, whereas the species column is a categorical string with three possible values: setosa, versicolor, and virginica.
Strategy
We will apply z-score normalization to each of the numerical columns to turn them into features, then feed them into scikit-learn's logistic regression implementation.
The Model
Set up the Environment
Make sure that Python and pip are installed on your local machine. Next, install StreamSQL's Python client using pip. Check out the Getting Started section for a more thorough walkthrough of setting up StreamSQL on your machine.
The Iris dataset can be downloaded here. The downloaded file is named iris.data and should be moved to your working directory.
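Before loading the file into StreamSQL, it can help to check that it parses as expected. The following is a minimal sketch using only the Python standard library, not the StreamSQL client. It assumes the file has no header row and that species labels carry an Iris- prefix (for example Iris-setosa), as in the UCI copy of the dataset. The column names and the read_iris helper are our own choices for illustration.

```python
import csv
import io

# Column names are our own labels; iris.data itself is assumed to have no header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def read_iris(lines):
    """Parse headerless Iris CSV lines into a list of dicts."""
    rows = []
    for record in csv.reader(lines):
        if not record:  # skip blank lines, such as a trailing one at the end of the file
            continue
        *measurements, species = record
        row = dict(zip(COLUMNS[:4], map(float, measurements)))
        row["species"] = species
        rows.append(row)
    return rows


# A short sample stands in for the file so that the sketch runs on its own.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "\n"
)
rows = read_iris(sample)
print(len(rows), rows[0]["species"])

# With the downloaded file in the working directory:
# with open("iris.data", newline="") as f:
#     rows = read_iris(f)
```

Running the commented-out lines against the real file should yield 150 rows if the download is complete.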
Load the data into StreamSQL
We can initialize a local feature store instance and then connect the Iris dataset to it. The LocalFeatureStore does not require an API key and does not upload to a server, as it's made for local development. When uploading the file, we have to specify the format as CSV since it's not implied by the .data ending.
upload_file makes a local copy of the file to guarantee immutability, so later changes to the original file will not be reflected in the new table.
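The effect of copying at upload time can be illustrated with plain Python. This sketch is not StreamSQL's implementation, and how upload_file stores its copy is not covered here. It only shows why a snapshot taken with a file copy is unaffected by later edits to the source. The file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "iris.data"
    source.write_text("5.1,3.5,1.4,0.2,Iris-setosa\n")

    # Take a snapshot, standing in for the copy made at upload time.
    snapshot = Path(tmp) / "snapshot.csv"
    shutil.copyfile(source, snapshot)

    # Editing the original afterwards leaves the snapshot untouched.
    source.write_text("0,0,0,0,edited\n")
    snapshot_text = snapshot.read_text()

print(snapshot_text)
```

In practice this means a corrected or extended file has to be uploaded again for the changes to appear.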
Define the Model Features
Our model features are the same as the first four columns in the CSV with z-score normalization applied.
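To make the transformation concrete, here is a minimal standard-library sketch of z-score normalization: each value has the column mean subtracted and is divided by the column standard deviation. It uses the population standard deviation, which is an assumption on our part; StreamSQL's own normalization may differ in that detail. The zscore helper and the sample values are for illustration only.

```python
from statistics import mean, pstdev


def zscore(values):
    """Return (x - mean) / std for every x, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot z-score a constant column")
    return [(x - mu) / sigma for x in values]


# A few petal lengths for illustration, not the full column.
petal_lengths = [1.4, 1.4, 4.7, 6.0]
scaled = zscore(petal_lengths)
print([round(z, 3) for z in scaled])
```

After normalization the column has mean 0 and standard deviation 1, which puts all four measurements on a comparable scale before they reach the model.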
Generate a Training Dataset
Train and Validate Model
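As a standalone point of comparison, the sketch below trains and validates the same kind of model using scikit-learn alone. It makes two assumptions that differ from the StreamSQL flow: scikit-learn's bundled copy of Iris stands in for the generated training dataset, and z-score normalization is done by StandardScaler inside the pipeline rather than by feature definitions in the feature store. The split ratio and random seed are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris copy stands in for the training dataset.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for validation, keeping the species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# StandardScaler applies z-score normalization, fitted on the training split only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training split only keeps statistics from the validation rows out of the model, which is worth mirroring when the normalization is defined elsewhere.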
Full Example