Tip
This page was generated from guide/transformer_quickstart.ipynb.
Transformer and Pipeline Quickstart#
Transformer targets the engineering side of data preprocessing.
Applicable Scene#
Data preprocessing usually involves a lot of repetitive work.
Once the training dataset has been processed, the same preprocessing steps still have to be collected and packaged into a function, an API, or something similar so that they can be reused.
Sample Data#
Note
All data are virtual.
The sample is sales data for the stores of one chain brand.
All stores are located in one region.
The data covers one specific year.
Sale is the total amount for the year.
Population is the daily number of people within a \(200m\) buffer around the store.
Score is given by an expert and ranges from 0 to 10.
[1]:
import pandas as pd
[2]:
store_sale_dict = {
"code": ["811-10001", "811-10002", "811-10003", "811-10004"],
"name": ["A", "B", "C", "D"],
"floor": ["1F", "2F", "1F", "B2"],
"level": ["strategic", "normal", "important", "normal"],
"type": ["School", "Mall", "Office", "Home"],
"area": [100, 95, 177, 70],
"population": [3000, 1000, 2000, 1500],
"score": [10, 8, 6, 5],
"opendays": [300, 100, 250, 15],
"sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df
[2]:
code | name | floor | level | type | area | population | score | opendays | sale | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
3 | 811-10004 | D | B2 | normal | Home | 70 | 1500 | 5 | 15 | 1500 |
Feature Types and Dealing Steps#
First of all, note that there are three types of features (\(X\)) and one label (\(y\)).
Additional information features: drop
code
name
Categorical features: encode to one-hot
floor
type: drop the 'Home' type, because there are very few stores of this type.
Numeric features: scale
level: not treated as categorical, because its values can be ranked.
area
population: this is the population within the buffer, but the population that actually enters the store is more useful, estimated as \(\frac{score}{10} \times population\).
score
opendays: filter out stores with opendays <= 30, then drop this field.
Label: sale needs to be balanced, so transform it to daily sale, \(\frac{sale}{opendays}\), then scale.
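As a quick sanity check of the two formulas above, the following plain-Python sketch computes the daily sale and the entering population for store B from the sample table. The helper names `daily_sale` and `entering_population` are illustrative only and are not part of the notebook.

```python
def daily_sale(sale: float, opendays: int) -> float:
    """Yearly sale divided by the number of open days."""
    return sale / opendays


def entering_population(score: float, population: float) -> float:
    """Buffer population weighted by the expert score (0 to 10)."""
    return score / 10 * population


# Store B in the sample table: sale=5000, opendays=100, score=8, population=1000
b_daily_sale = daily_sale(5000, 100)
b_population = entering_population(8, 1000)
print(b_daily_sale, b_population)
```

These two values should match the `sale` and `population` columns for store B after the pandas steps later on this page.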
Mission
Our mission is to find relationships between these features and the label.
The Pandas Way#
With plain pandas, most users would write something like the following.
First, set the feature name constants.
[3]:
features_category = ["floor", "type"]
features_number = ["level", "area", "population", "score"]
features = features_category + features_number
label = ["sale"]
Process X and y#
Filter out stores that have been open for 30 days or fewer, because these samples are not normal stores.
[4]:
df = df.query("opendays > 30")
df
[4]:
code | name | floor | level | type | area | population | score | opendays | sale | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
Filter out the 'Home' stores.
[5]:
df = df[df["type"] != "Home"]
df
[5]:
code | name | floor | level | type | area | population | score | opendays | sale | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
Transform sale to daily sale.
[6]:
df = df.eval("sale = sale / opendays")
df
[6]:
code | name | floor | level | type | area | population | score | opendays | sale | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 26.666667 |
1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 50.000000 |
2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 12.000000 |
Transform population to the population entering the store.
[7]:
df = df.eval("population = score / 10 * population")
df
[7]:
code | name | floor | level | type | area | population | score | opendays | sale | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 811-10001 | A | 1F | strategic | School | 100 | 3000.0 | 10 | 300 | 26.666667 |
1 | 811-10002 | B | 2F | normal | Mall | 95 | 800.0 | 8 | 100 | 50.000000 |
2 | 811-10003 | C | 1F | important | Office | 177 | 1200.0 | 6 | 250 | 12.000000 |
Split df into df_x and y, then process them separately.
[8]:
df_x = df[features]
df_x
[8]:
floor | type | level | area | population | score | |
---|---|---|---|---|---|---|
0 | 1F | School | strategic | 100 | 3000.0 | 10 |
1 | 2F | Mall | normal | 95 | 800.0 | 8 |
2 | 1F | Office | important | 177 | 1200.0 | 6 |
[9]:
y = df[label]
y
[9]:
sale | |
---|---|
0 | 26.666667 |
1 | 50.000000 |
2 | 12.000000 |
Process y#
Scale y.
[10]:
from sklearn.preprocessing import MinMaxScaler
y_scaler = MinMaxScaler()
A scaler handles each column as a unit, so y is reshaped into a single column.
[11]:
y = y.values.reshape(-1, 1)
y = y_scaler.fit_transform(y)
y
[11]:
array([[0.38596491],
[1. ],
[0. ]])
Models usually expect y to be a 1-D array and otherwise give a warning.
[12]:
y = y.ravel()
y
[12]:
array([0.38596491, 1. , 0. ])
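Min-max scaling maps each value to (y - min) / (max - min). The sketch below reproduces the array above by hand and then uses `inverse_transform` to bring scaled values back to daily sale, which is what is needed to interpret a model's predictions. The daily sale numbers are copied from the sample table and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Daily sale of stores A, B and C (sale / opendays)
daily_sale = np.array([8000 / 300, 5000 / 100, 3000 / 250])

# Manual min-max scaling: (y - min) / (max - min)
manual = (daily_sale - daily_sale.min()) / (daily_sale.max() - daily_sale.min())

# The same result with the scaler, which expects a 2-D column
scaler = MinMaxScaler()
scaled = scaler.fit_transform(daily_sale.reshape(-1, 1)).ravel()

# Map scaled values back to the original unit
restored = scaler.inverse_transform(scaled.reshape(-1, 1)).ravel()
print(manual, scaled, restored)
```

Keeping the fitted `scaler` object around is what makes the inverse step possible.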
Process X#
Replace store levels with ranking numbers.
[13]:
df_x = df_x.replace({"normal": 1, "important": 2, "strategic": 3})
df_x
/tmp/ipykernel_3512/3708797549.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_x = df_x.replace({"normal": 1, "important": 2, "strategic": 3})
[13]:
floor | type | level | area | population | score | |
---|---|---|---|---|---|---|
0 | 1F | School | 3 | 100 | 3000.0 | 10 |
1 | 2F | Mall | 1 | 95 | 800.0 | 8 |
2 | 1F | Office | 2 | 177 | 1200.0 | 6 |
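One possible alternative, sketched here on a reduced copy of the frame, is to map only the `level` column with `Series.map`. It gives the same ranking without relying on the downcasting behaviour that the `FutureWarning` above refers to, and it does not touch the other columns. The names `level_rank` and `df_ranked` are illustrative.

```python
import pandas as pd

df_x = pd.DataFrame(
    {
        "floor": ["1F", "2F", "1F"],
        "level": ["strategic", "normal", "important"],
        "area": [100, 95, 177],
    }
)

level_rank = {"normal": 1, "important": 2, "strategic": 3}

# assign returns a new frame, so df_x itself is left untouched
df_ranked = df_x.assign(level=df_x["level"].map(level_rank))
print(df_ranked)
```

Note that `map` turns any value missing from the dictionary into `NaN`, so unexpected levels become visible instead of passing through silently.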
Encode categorical features.
[14]:
from sklearn.preprocessing import OneHotEncoder
x_encoder = OneHotEncoder()
x_category = x_encoder.fit_transform(df_x[features_category])
x_category
[14]:
<3x5 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
Scale number features.
[15]:
x_scaler = MinMaxScaler()
# Keep the fitted scaler and the scaled array under separate names
x_number = x_scaler.fit_transform(df_x[features_number])
x_number
[15]:
array([[1. , 0.06097561, 1. , 1. ],
[0. , 0. , 0. , 0.5 ],
[0.5 , 1. , 0.18181818, 0. ]])
Merge all features into one array.
[16]:
import numpy as np
# OneHotEncoder returns a SciPy sparse matrix, which np.hstack cannot stack
# with a dense array, so convert it to a dense numpy.ndarray first.
X = np.hstack([x_number, x_category.toarray()])
X
The Pipeline Way#
From The Pandas Way section, we can see that:
The code is full of intermediate variables. Most of the time we don't care about them, except when debugging and reviewing.
The data workflow is messy. It is hard to separate data from operations.
The output data structure is inconvenient. The input type is pandas.DataFrame but the output type is numpy.ndarray.
It is hard to apply to prediction data.
Further One Step to Pipeline#
sklearn.pipeline.Pipeline is a good framework for fixing these problems.
The next step would be to turn the code in the Process X and Process y sections into pipeline code.
In practice, though, these steps are hard to move into a pipeline. Most of them are pandas methods; only OneHotEncoder and MinMaxScaler can be added to sklearn.pipeline.Pipeline.
The code remains messy because it mixes two styles of writing and applying transformations.
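To make the limitation concrete, here is a rough sketch of how far plain scikit-learn can go on the already filtered data. It uses `ColumnTransformer` and `OrdinalEncoder`, which the original notebook does not use, so treat it as one possible arrangement rather than the author's approach. The row filtering and the `eval` steps still have to happen in pandas beforehand, and the result is a bare array without column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Rows as they look after the pandas filtering and eval steps
df = pd.DataFrame(
    {
        "floor": ["1F", "2F", "1F"],
        "type": ["School", "Mall", "Office"],
        "level": ["strategic", "normal", "important"],
        "area": [100, 95, 177],
        "population": [3000.0, 800.0, 1200.0],
        "score": [10, 8, 6],
    }
)

# Explicit category order so that normal < important < strategic
level_order = [["normal", "important", "strategic"]]

column_tf = ColumnTransformer(
    [
        ("category", OneHotEncoder(), ["floor", "type"]),
        (
            "level",
            make_pipeline(OrdinalEncoder(categories=level_order), MinMaxScaler()),
            ["level"],
        ),
        ("number", MinMaxScaler(), ["area", "population", "score"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = column_tf.fit_transform(df)
print(X.shape)
```

The columns come out in the order of the transformer list: five one-hot columns, then `level`, then the three scaled numeric columns.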
The dtoolkit.transformer Way#
The framework is good, but the Further One Step to Pipeline section shows that the core problem is the lack of transformers.
Pandas methods can't be used as transformers.
NumPy methods can't be used as transformers.
Sklearn's transformers can't take pandas in and give pandas out.
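To illustrate the idea of a missing transformer, the sketch below wraps a pandas method in a minimal scikit-learn compatible class. `PandasMethodTF` is a hypothetical name invented for this example; it is not dtoolkit's implementation and only shows that a stateless `fit` plus a `transform` that forwards to a DataFrame method is enough for the method to sit inside a pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class PandasMethodTF(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: call a DataFrame method by name in transform."""

    def __init__(self, method, kwargs=None):
        self.method = method
        self.kwargs = kwargs

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        # Forward to the pandas method, so a DataFrame goes in and comes out
        return getattr(X, self.method)(**(self.kwargs or {}))


df = pd.DataFrame(
    {"opendays": [300, 100, 250, 15], "sale": [8000, 5000, 3000, 1500]}
)

pipeline = make_pipeline(
    PandasMethodTF("query", {"expr": "opendays > 30"}),
    PandasMethodTF("eval", {"expr": "sale = sale / opendays"}),
)
result = pipeline.fit_transform(df)
print(result)
```

Because every step returns a DataFrame, the intermediate data stays inspectable with ordinary pandas tools.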
[17]:
from dtoolkit.transformer import (
EvalTF,
FilterInTF,
GetTF,
ReplaceTF,
OneHotEncoder,
QueryTF,
RavelTF,
)
from dtoolkit.pipeline import make_pipeline, make_union
[18]:
pl_xy = make_pipeline(
QueryTF("opendays > 30"),
FilterInTF({"type": ["School", "Mall", "Office"]}),
EvalTF("sale = sale / opendays"),
EvalTF("population = score / 10 * population"),
)
pl_xy
[18]:
Pipeline(steps=[('querytf', <dtoolkit.transformer.pandas.QueryTF.QueryTF object at 0x7fde0c334ad0>), ('filterintf', <dtoolkit.transformer.pandas.FilterInTF.FilterInTF object at 0x7fde0d6eab40>), ('evaltf-1', <dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7fde0d6ea300>), ('evaltf-2', <dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7fde0bf1fe30>)])
[19]:
pl_x = make_pipeline(
GetTF(features),
ReplaceTF({"normal": 1, "important": 2, "strategic": 3}),
make_union(
make_pipeline(
GetTF(features_category),
OneHotEncoder(),
),
make_pipeline(
GetTF(features_number),
MinMaxScaler(),
),
),
)
pl_x
[19]:
Pipeline(steps=[('gettf', <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6d970>), ('replacetf', <dtoolkit.transformer.pandas.ReplaceTF.ReplaceTF object at 0x7fde0d5abf20>), ('featureunion', FeatureUnion(transformer_list=[('pipeline-1', Pipeline(steps=[('gettf', <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0d6f52e0>), ('onehotencoder', OneHotEncoder())])), ('pipeline-2', Pipeline(steps=[('gettf', <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6d7c0>), ('minmaxscaler', MinMaxScaler())]))]))])
[20]:
pl_y = make_pipeline(
GetTF(label),
MinMaxScaler(),
RavelTF(),
)
pl_y
[20]:
Pipeline(steps=[('gettf', <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6dd00>), ('minmaxscaler', MinMaxScaler()), ('raveltf', <dtoolkit.transformer.numpy.RavelTF.RavelTF object at 0x7fde0bf6dca0>)])
[21]:
store_sale_dict = {
"code": ["811-10001", "811-10002", "811-10003", "811-10004"],
"name": ["A", "B", "C", "D"],
"floor": ["1F", "2F", "1F", "B2"],
"level": ["strategic", "normal", "important", "normal"],
"type": ["School", "Mall", "Office", "Home"],
"area": [100, 95, 177, 70],
"population": [3000, 1000, 2000, 1500],
"score": [10, 8, 6, 5],
"opendays": [300, 100, 250, 15],
"sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df
[21]:
code | name | floor | level | type | area | population | score | opendays | sale | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 811-10001 | A | 1F | strategic | School | 100 | 3000 | 10 | 300 | 8000 |
1 | 811-10002 | B | 2F | normal | Mall | 95 | 1000 | 8 | 100 | 5000 |
2 | 811-10003 | C | 1F | important | Office | 177 | 2000 | 6 | 250 | 3000 |
3 | 811-10004 | D | B2 | normal | Home | 70 | 1500 | 5 | 15 | 1500 |
[22]:
xy = pl_xy.fit_transform(df)
xy
[22]:
code | name | floor | level | type | area | population | score | opendays | sale | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 811-10001 | A | 1F | strategic | School | 100 | 3000.0 | 10 | 300 | 26.666667 |
1 | 811-10002 | B | 2F | normal | Mall | 95 | 800.0 | 8 | 100 | 50.000000 |
2 | 811-10003 | C | 1F | important | Office | 177 | 1200.0 | 6 | 250 | 12.000000 |
[23]:
X = pl_x.fit_transform(xy)
X
/home/docs/checkouts/readthedocs.org/user_builds/my-data-toolkit/conda/latest/lib/python3.12/site-packages/dtoolkit/transformer/base.py:91: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
Xt = self.transform_method(X, *self.args, **self.kwargs)
[23]:
1F | 2F | Mall | Office | School | level | area | population | score | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.060976 | 1.000000 | 1.0 |
1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.5 |
2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.5 | 1.000000 | 0.181818 | 0.0 |
[24]:
y = pl_y.fit_transform(xy)
y
[24]:
0 0.385965
1 1.000000
2 0.000000
Name: sale, dtype: float64
We could also save these pipelines to a binary file via pickle or joblib. When new data comes in, we can quickly transform it with the saved pipelines.
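A minimal sketch of that round trip, using a plain scikit-learn pipeline and in-memory bytes so that it runs without touching the disk. The training numbers are the `area` values of the three remaining stores; the new store's area of 136 is made up for illustration.

```python
import pickle

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

train = np.array([[100.0], [95.0], [177.0]])  # area of the training stores

pipeline = make_pipeline(MinMaxScaler())
pipeline.fit(train)

# Serialize to bytes; pickle.dump(pipeline, file) would write to disk instead
blob = pickle.dumps(pipeline)

# Later, possibly in another process: restore and reuse the fitted state
restored = pickle.loads(blob)
new_store = np.array([[136.0]])  # hypothetical new store
scaled = restored.transform(new_store)
print(scaled)
```

New data goes through `transform`, not `fit_transform`, so it is scaled with the minimum and maximum learned from the training set. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.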
Other Ways to Handle This#
pandas.DataFrame.pipe and plain functions also work.
But they are:
hard to turn into application code correctly
hard to debug and to check the intermediate data
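For comparison, the function-plus-pipe style looks roughly like the sketch below. The function names `drop_new_stores` and `to_daily_sale` are illustrative. It reads well for stateless steps, but anything that has to remember fitted values, such as a scaler, needs extra bookkeeping outside the chain.

```python
import pandas as pd


def drop_new_stores(df: pd.DataFrame, min_opendays: int = 30) -> pd.DataFrame:
    """Keep only stores that have been open for more than min_opendays."""
    return df[df["opendays"] > min_opendays]


def to_daily_sale(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the yearly sale with the daily sale."""
    return df.assign(sale=df["sale"] / df["opendays"])


df = pd.DataFrame(
    {
        "type": ["School", "Mall", "Office", "Home"],
        "opendays": [300, 100, 250, 15],
        "sale": [8000, 5000, 3000, 1500],
    }
)

result = df.pipe(drop_new_stores, min_opendays=30).pipe(to_daily_sale)
print(result)
```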
What’s Next - Learn or Build Transformers#
In this tutorial we took a quick glance at dtoolkit.transformer.
As a next step, learn about the other transformers in the Transformer API documentation. If those transformers don't meet your requirements, you can build your own by following the How to Build Transformer documentation.