Tip
This page was generated from guide/transformer_quickstart.ipynb.
Transformer and Pipeline Quickstart#
Transformers deal with the engineering side of data preprocessing.
Applicable Scene#
Data preprocessing always involves repetitive work.
Once we finish processing the training dataset, we still need to sort those preprocessing steps out and turn them into a function, an API, or something similar so they can be reused.
Sample Data#
Note
All data are fictional.
These are sales data for some stores of one chain brand.
The stores are located in one region.
The time span is one specific year.
Sale is the total amount for the year.
Population is the daily number of people within the surrounding \(200m\) buffer.
Score is given by an expert and ranges from 0 to 10.
[1]:
import pandas as pd
[2]:
store_sale_dict = {
"code": ["811-10001", "811-10002", "811-10003", "811-10004"],
"name": ["A", "B", "C", "D"],
"floor": ["1F", "2F", "1F", "B2"],
"level": ["strategic", "normal", "important", "normal"],
"type": ["School", "Mall", "Office", "Home"],
"area": [100, 95, 177, 70],
"population": [3000, 1000, 2000, 1500],
"score": [10, 8, 6, 5],
"opendays": [300, 100, 250, 15],
"sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df
[2]:
| | code | name | floor | level | type | area | population | score | opendays | sale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
| 1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
| 2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
| 3 | 811-10004 | D | B2 | normal | Home | 70 | 1500 | 5 | 15 | 1500 |
Feature Types and Dealing Steps#
First of all, we should know that there are three types of features (\(X\)) and one label (\(y\)).
Additional information features: drop
code
name
Categorical features: encode to one-hot
floor
type: drop the 'Home' type, because there are very few stores of this type.
Number features: scale
level: it is not a categorical type, because its values can be compared (ranked).
area
population: this is the population within the buffer, but we are more interested in the population that enters the store, equal to \(\frac{score}{10} \times population\).
score
opendays: filter out stores with opendays <= 30, then drop this field.
Label: sale needs to be balanced, so transform it to daily sale, equal to \(\frac{sale}{opendays}\), then scale it.
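As a quick check of the two formulas, here is the arithmetic for store B from the sample data (score 8, population 1000, sale 5000, opendays 100):

\[
\frac{score}{10} \times population = \frac{8}{10} \times 1000 = 800,
\qquad
\frac{sale}{opendays} = \frac{5000}{100} = 50
\]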
Mission
Our mission is to find some relationships between these features and the label.
The Pandas Way#
In pandas code, most users might type something like this:
Set a series of feature name constants.
[3]:
features_category = ["floor", "type"]
features_number = ["level", "area", "population", "score"]
features = features_category + features_number
label = ["sale"]
Process X and y#
Filter out stores that were open for 30 days or fewer, because these samples are not normal stores.
[4]:
df = df.query("opendays > 30")
df
[4]:
| | code | name | floor | level | type | area | population | score | opendays | sale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
| 1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
| 2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
Filter out 'Home' stores.
[5]:
df = df[df["type"] != "Home"]
df
[5]:
| | code | name | floor | level | type | area | population | score | opendays | sale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
| 1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
| 2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
Transform sale to daily sale.
[6]:
df = df.eval("sale = sale / opendays")
df
[6]:
| | code | name | floor | level | type | area | population | score | opendays | sale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 26.666667 |
| 1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 50.000000 |
| 2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 12.000000 |
Transform population to the population that enters the store.
[7]:
df = df.eval("population = score / 10 * population")
df
[7]:
| | code | name | floor | level | type | area | population | score | opendays | sale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 811-10001 | A | 1F | strategic | School | 100 | 3000.0 | 10 | 300 | 26.666667 |
| 1 | 811-10002 | B | 2F | normal | Mall | 95 | 800.0 | 8 | 100 | 50.000000 |
| 2 | 811-10003 | C | 1F | important | Office | 177 | 1200.0 | 6 | 250 | 12.000000 |
Split df into df_x and y, and process them separately.
[8]:
df_x = df[features]
df_x
[8]:
| | floor | type | level | area | population | score |
|---|---|---|---|---|---|---|
| 0 | 1F | School | strategic | 100 | 3000.0 | 10 |
| 1 | 2F | Mall | normal | 95 | 800.0 | 8 |
| 2 | 1F | Office | important | 177 | 1200.0 | 6 |
[9]:
y = df[label]
y
[9]:
| | sale |
|---|---|
| 0 | 26.666667 |
| 1 | 50.000000 |
| 2 | 12.000000 |
Process y#
Scale y.
[10]:
from sklearn.preprocessing import MinMaxScaler
y_scaler = MinMaxScaler()
The scaler handles each column as a unit.
[11]:
y = y.values.reshape(-1, 1)
y = y_scaler.fit_transform(y)
y
[11]:
array([[0.38596491],
[1. ],
[0. ]])
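For reference, MinMaxScaler with its default feature_range of (0, 1) maps each column to (x - min) / (max - min). The following is a minimal NumPy sketch that reproduces the values above by hand; the variable names are only illustrative.

```python
import numpy as np

# Daily sale of the three remaining stores: sale / opendays
daily_sale = np.array([8000 / 300, 5000 / 100, 3000 / 250])

# Min-max scaling by hand: (x - min) / (max - min)
scaled = (daily_sale - daily_sale.min()) / (daily_sale.max() - daily_sale.min())
print(scaled)
```

Keeping the fitted y_scaler around is useful later: its inverse_transform method converts scaled values back to daily sale.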
Models usually expect y as a 1-D array and would otherwise give a warning.
[12]:
y = y.ravel()
y
[12]:
array([0.38596491, 1. , 0. ])
Process X#
Replace store levels with ranking numbers.
[13]:
df_x = df_x.replace({"normal": 1, "important": 2, "strategic": 3})
df_x
[13]:
| floor | type | level | area | population | score | |
|---|---|---|---|---|---|---|
| 0 | 1F | School | 3 | 100 | 3000.0 | 10 |
| 1 | 2F | Mall | 1 | 95 | 800.0 | 8 |
| 2 | 1F | Office | 2 | 177 | 1200.0 | 6 |
Encode categorical features.
[14]:
from sklearn.preprocessing import OneHotEncoder
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
x_encoder = OneHotEncoder(sparse_output=False)
x_category = x_encoder.fit_transform(df_x[features_category])
x_category
[14]:
array([[1., 0., 0., 0., 1.],
[0., 1., 1., 0., 0.],
[1., 0., 0., 1., 0.]])
Scale number features.
[15]:
x_scaler = MinMaxScaler()
x_number = x_scaler.fit_transform(df_x[features_number])
x_number
[15]:
array([[1. , 0.06097561, 1. , 1. ],
[0. , 0. , 0. , 0.5 ],
[0.5 , 1. , 0.18181818, 0. ]])
Merge all features into one array.
[16]:
import numpy as np
X = np.hstack([x_number, x_category])
X
[16]:
array([[1. , 0.06097561, 1. , 1. , 1. ,
0. , 0. , 0. , 1. ],
[0. , 0. , 0. , 0.5 , 0. ,
1. , 1. , 0. , 0. ],
[0.5 , 1. , 0.18181818, 0. , 1. ,
0. , 0. , 1. , 0. ]])
The Pipeline Way#
From The Pandas Way section, we can see that:
The steps are full of intermediate variables. Most of the time we don't care about them, except when debugging and reviewing.
The data workflow is messy. It is hard to separate data from operations.
The output data structure is inconvenient. The input type is pandas.DataFrame but the output type is numpy.ndarray.
The steps are hard to apply to prediction data.
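To make the point about prediction data concrete, here is a rough sketch of what applying the pandas way to new data involves: every manual step has to be repeated, and the already fitted scaler must be reused with transform rather than fit_transform. The helper prepare_numbers and the new store are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

features_number = ["level", "area", "population", "score"]
level_rank = {"normal": 1, "important": 2, "strategic": 3}


def prepare_numbers(df):
    """Repeat the manual pandas steps for the number features."""
    df = df.assign(
        population=df["score"] / 10 * df["population"],
        level=df["level"].map(level_rank).astype(int),
    )
    return df[features_number]


# The three training stores left after filtering
train = pd.DataFrame(
    {
        "level": ["strategic", "normal", "important"],
        "area": [100, 95, 177],
        "population": [3000, 1000, 2000],
        "score": [10, 8, 6],
    }
)
# Hypothetical new store, not part of the sample data
new = pd.DataFrame(
    {"level": ["important"], "area": [120], "population": [1500], "score": [8]}
)

x_scaler = MinMaxScaler()
x_train = x_scaler.fit_transform(prepare_numbers(train))

# Reuse the fitted scaler: transform only, never fit again on new data
x_new = x_scaler.transform(prepare_numbers(new))
print(x_new)
```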
Further One Step to Pipeline#
sklearn.pipeline.Pipeline is a good framework for fixing these problems.
The goal is to turn the code in the Process X and Process y sections into pipeline code.
In practice, though, these steps are hard to move into a pipeline. Most of them are pandas methods; only OneHotEncoder and MinMaxScaler can be added to sklearn.pipeline.Pipeline.
The code stays messy, because it has to be written and applied in two different ways.
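As a rough illustration of that mixed style, the sketch below keeps the row filtering and column arithmetic in pandas and hands only the encoder and scaler to scikit-learn. It uses sklearn.compose.ColumnTransformer to route the columns, which is one possible arrangement rather than the only one. Note that the result is a bare numpy.ndarray without column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

features_category = ["floor", "type"]
features_number = ["level", "area", "population", "score"]

df = pd.DataFrame(
    {
        "floor": ["1F", "2F", "1F", "B2"],
        "level": ["strategic", "normal", "important", "normal"],
        "type": ["School", "Mall", "Office", "Home"],
        "area": [100, 95, 177, 70],
        "population": [3000, 1000, 2000, 1500],
        "score": [10, 8, 6, 5],
        "opendays": [300, 100, 250, 15],
    }
)

# Way 1: pandas methods, which cannot go into the pipeline as they are
df = df.query("opendays > 30")
df = df[df["type"] != "Home"]
df = df.assign(
    population=df["score"] / 10 * df["population"],
    level=df["level"].map({"normal": 1, "important": 2, "strategic": 3}).astype(int),
)

# Way 2: scikit-learn transformers
x_transformer = ColumnTransformer(
    [
        ("number", MinMaxScaler(), features_number),
        ("category", OneHotEncoder(), features_category),
    ],
    sparse_threshold=0,  # always return a dense array
)
X = x_transformer.fit_transform(df)
print(type(X), X.shape)
```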
The dtoolkit.transformer Way#
The framework is good, but the Further One Step to Pipeline section shows that the core problem is missing transformers.
Pandas's methods can't be used as transformers.
NumPy's methods can't be used as transformers.
Sklearn's transformers can't take pandas in and give pandas out.
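To illustrate what such a missing transformer could look like, here is a minimal sketch that wraps DataFrame.query in scikit-learn's transformer interface. The class name QueryTransformer is hypothetical, and this is not how dtoolkit implements QueryTF; it only shows the idea of pandas in and pandas out.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class QueryTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that applies DataFrame.query."""

    def __init__(self, expr):
        self.expr = expr

    def fit(self, X, y=None):
        # Nothing to learn: the query expression is fixed
        return self

    def transform(self, X):
        # pandas in, pandas out
        return X.query(self.expr)


df = pd.DataFrame(
    {
        "type": ["School", "Mall", "Office", "Home"],
        "opendays": [300, 100, 250, 15],
    }
)

pipeline = make_pipeline(
    QueryTransformer("opendays > 30"),
    QueryTransformer("type != 'Home'"),
)
result = pipeline.fit_transform(df)
print(result)
```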
[17]:
from dtoolkit.transformer import (
EvalTF,
FilterInTF,
GetTF,
ReplaceTF,
OneHotEncoder,
QueryTF,
RavelTF,
)
from dtoolkit.pipeline import make_pipeline, make_union
[18]:
pl_xy = make_pipeline(
QueryTF("opendays > 30"),
FilterInTF({"type": ["School", "Mall", "Office"]}),
EvalTF("sale = sale / opendays"),
EvalTF("population = score / 10 * population"),
)
pl_xy
[18]:
Pipeline(steps=[('querytf',
<dtoolkit.transformer.pandas.QueryTF.QueryTF object at 0x7f6003d26a10>),
('filterintf',
<dtoolkit.transformer.pandas.FilterInTF.FilterInTF object at 0x7f6003d26e10>),
('evaltf-1',
<dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7f6003d26e50>),
('evaltf-2',
<dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7f6003d26e90>)])
[19]:
pl_x = make_pipeline(
GetTF(features),
ReplaceTF({"normal": 1, "important": 2, "strategic": 3}),
make_union(
make_pipeline(
GetTF(features_category),
OneHotEncoder(),
),
make_pipeline(
GetTF(features_number),
MinMaxScaler(),
),
),
)
pl_x
[19]:
Pipeline(steps=[('gettf',
<dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7f6003d30fd0>),
('replacetf',
<dtoolkit.transformer.pandas.ReplaceTF.ReplaceTF object at 0x7f6003d31d50>),
('featureunion',
FeatureUnion(transformer_list=[('pipeline-1',
Pipeline(steps=[('gettf',
<dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7f6003d31e10>),
('onehotencoder',
OneHotEncoder(sparse='deprecated'))])),
('pipeline-2',
Pipeline(steps=[('gettf',
<dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7f6003d31fd0>),
('minmaxscaler',
MinMaxScaler())]))]))])
[20]:
pl_y = make_pipeline(
GetTF(label),
MinMaxScaler(),
RavelTF(),
)
pl_y
[20]:
Pipeline(steps=[('gettf',
<dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7f6003fe8a90>),
('minmaxscaler', MinMaxScaler()),
('raveltf',
<dtoolkit.transformer.numpy.RavelTF.RavelTF object at 0x7f6003d306d0>)])
[21]:
store_sale_dict = {
"code": ["811-10001", "811-10002", "811-10003", "811-10004"],
"name": ["A", "B", "C", "D"],
"floor": ["1F", "2F", "1F", "B2"],
"level": ["strategic", "normal", "important", "normal"],
"type": ["School", "Mall", "Office", "Home"],
"area": [100, 95, 177, 70],
"population": [3000, 1000, 2000, 1500],
"score": [10, 8, 6, 5],
"opendays": [300, 100, 250, 15],
"sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df
[21]:
| | code | name | floor | level | type | area | population | score | opendays | sale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
| 1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
| 2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
| 3 | 811-10004 | D | B2 | normal | Home | 70 | 1500 | 5 | 15 | 1500 |
[22]:
xy = pl_xy.fit_transform(df)
xy
[22]:
| | code | name | floor | level | type | area | population | score | opendays | sale |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 811-10001 | A | 1F | strategic | School | 100 | 3000.0 | 10 | 300 | 26.666667 |
| 1 | 811-10002 | B | 2F | normal | Mall | 95 | 800.0 | 8 | 100 | 50.000000 |
| 2 | 811-10003 | C | 1F | important | Office | 177 | 1200.0 | 6 | 250 | 12.000000 |
[23]:
X = pl_x.fit_transform(xy)
X
[23]:
| | 1F | 2F | Mall | Office | School | level | area | population | score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.060976 | 1.000000 | 1.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.5 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.5 | 1.000000 | 0.181818 | 0.0 |
[24]:
y = pl_y.fit_transform(xy)
y
[24]:
0 0.385965
1 1.000000
2 0.000000
Name: sale, dtype: float64
We could also save these pipelines as binary files via pickle or joblib. When new data comes in, we can quickly transform it by loading the saved pipelines.
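A minimal sketch of that round trip follows. It assumes a plain scikit-learn pipeline as a stand-in for the pipelines above and uses the standard pickle module; the file name pl_y.pkl is arbitrary.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Fit a stand-in pipeline on the daily sale of the three training stores
daily_sale = np.array([[8000 / 300], [5000 / 100], [3000 / 250]])
pipeline = make_pipeline(MinMaxScaler())
pipeline.fit(daily_sale)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pl_y.pkl"

    # Save the fitted pipeline as a binary file
    with path.open("wb") as f:
        pickle.dump(pipeline, f)

    # Later, for example in a prediction service: load and reuse it
    with path.open("rb") as f:
        loaded = pickle.load(f)

# transform (not fit_transform) applies the scaling learned from the training data
new_scaled = loaded.transform(np.array([[31.0]]))
print(new_scaled)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.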
Other Ways to Handle This#
pandas.DataFrame.pipe and plain functions also work.
But they are:
hard to turn into application code correctly
hard to debug, and it is hard to check the data being processed
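For comparison, the function way might look like the sketch below, which chains plain functions with pandas.DataFrame.pipe. The function names are illustrative only.

```python
import pandas as pd


def drop_abnormal_stores(df, min_opendays=30):
    """Keep stores open for more than ``min_opendays`` days and drop 'Home' stores."""
    return df[(df["opendays"] > min_opendays) & (df["type"] != "Home")]


def to_daily_sale(df):
    """Replace the yearly sale with the daily sale."""
    return df.assign(sale=df["sale"] / df["opendays"])


def to_entry_population(df):
    """Replace the buffer population with the population entering the store."""
    return df.assign(population=df["score"] / 10 * df["population"])


df = pd.DataFrame(
    {
        "type": ["School", "Mall", "Office", "Home"],
        "population": [3000, 1000, 2000, 1500],
        "score": [10, 8, 6, 5],
        "opendays": [300, 100, 250, 15],
        "sale": [8000, 5000, 3000, 1500],
    }
)

xy = df.pipe(drop_abnormal_stores).pipe(to_daily_sale).pipe(to_entry_population)
print(xy)
```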
What’s Next - Learn or Build Transformers#
In this tutorial we've taken a quick glance at dtoolkit.transformer.
As a next step, learn about the other transformers in the Transformer API documentation. If those transformers don't meet your requirements, you can build your own by following the How to Build Transformer documentation.