Tip

This page was generated from guide/transformer_quickstart.ipynb.

Transformer and Pipeline Quickstart#

Transformers address the engineering side of data preprocessing.

Applicable Scene#

Data preprocessing usually involves repeating the same steps.

Once the training dataset has been processed, those preprocessing steps still have to be organized and packaged as a function, an API, or something similar so that they can be reused.

Sample Data#

Note

All data are fictional.

The sample is sales data for some stores of one chain brand.

All stores are located in the same region.
Time is one specific year.
Sale is the total amount for the year.
Population is the daily number of people within a surrounding \(200m\) buffer.
Score is given by an expert and ranges from 0 to 10.

[1]:

import pandas as pd

[2]:

store_sale_dict = {
    "code": ["811-10001", "811-10002", "811-10003", "811-10004"],
    "name": ["A", "B", "C", "D"],
    "floor": ["1F", "2F", "1F", "B2"],
    "level": ["strategic", "normal", "important", "normal"],
    "type": ["School", "Mall", "Office", "Home"],
    "area": [100, 95, 177, 70],
    "population": [3000, 1000, 2000, 1500],
    "score": [10, 8, 6, 5],
    "opendays": [300, 100, 250, 15],
    "sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df

[2]:

	code	name	floor	level	type	area	population	score	opendays	sale
0	811-10001	A	1F	strategic	School	100	3000	10	300	8000
1	811-10002	B	2F	normal	Mall	95	1000	8	100	5000
2	811-10003	C	1F	important	Office	177	2000	6	250	3000
3	811-10004	D	B2	normal	Home	70	1500	5	15	1500

Feature Types and Dealing Steps#

First of all, there are three types of features (\(X\)) and one label (\(y\)).

Additional information features: drop
- code
- name
Categorical features: encode to one-hot
- floor
- type: drop the 'Home' type, because there are very few stores of this type.
Number features: scale
- level: it is not a categorical type, because its values can be ordered.
- area
- population: this is the population within the buffer, but what we want is the population that enters the store, which equals \(\frac{score}{10} \times population\).
- score
- opendays: filter out stores with opendays <= 30, then drop this field
Label: needs to be balanced, so transform it to daily sale, which equals \(\frac{sale}{opendays}\), then scale

Mission

Our mission is to find relationships between these features and the label.

The Pandas Way#

In pandas code, most users might type something like this:

Set a series of feature name constants.

[3]:

features_category = ["floor", "type"]
features_number = ["level", "area", "population", "score"]
features = features_category + features_number
label = ["sale"]

Process X and y#

Filter out stores that have been open for 30 days or fewer, because these samples are not normal stores.

[4]:

df = df.query("opendays > 30")
df

[4]:

	code	name	floor	level	type	area	population	score	opendays	sale
0	811-10001	A	1F	strategic	School	100	3000	10	300	8000
1	811-10002	B	2F	normal	Mall	95	1000	8	100	5000
2	811-10003	C	1F	important	Office	177	2000	6	250	3000

Filter out 'Home' stores.

[5]:

df = df[df["type"] != "Home"]
df

[5]:

	code	name	floor	level	type	area	population	score	opendays	sale
0	811-10001	A	1F	strategic	School	100	3000	10	300	8000
1	811-10002	B	2F	normal	Mall	95	1000	8	100	5000
2	811-10003	C	1F	important	Office	177	2000	6	250	3000

Transform sale to daily sale.

[6]:

df = df.eval("sale = sale / opendays")
df

[6]:

	code	name	floor	level	type	area	population	score	opendays	sale
0	811-10001	A	1F	strategic	School	100	3000	10	300	26.666667
1	811-10002	B	2F	normal	Mall	95	1000	8	100	50.000000
2	811-10003	C	1F	important	Office	177	2000	6	250	12.000000

Transform population to the population entering the store.

[7]:

df = df.eval("population = score / 10 * population")
df

[7]:

	code	name	floor	level	type	area	population	score	opendays	sale
0	811-10001	A	1F	strategic	School	100	3000.0	10	300	26.666667
1	811-10002	B	2F	normal	Mall	95	800.0	8	100	50.000000
2	811-10003	C	1F	important	Office	177	1200.0	6	250	12.000000

Split df into df_x and y and process them separately.

[8]:

df_x = df[features]
df_x

[8]:

	floor	type	level	area	population	score
0	1F	School	strategic	100	3000.0	10
1	2F	Mall	normal	95	800.0	8
2	1F	Office	important	177	1200.0	6

[9]:

y = df[label]
y

[9]:

	sale
0	26.666667
1	50.000000
2	12.000000

Process y#

Scale y.

[10]:

from sklearn.preprocessing import MinMaxScaler

y_scaler = MinMaxScaler()

The scaler handles each column as a unit, so its input has to be 2-dimensional.

[11]:

y = y.values.reshape(-1, 1)
y = y_scaler.fit_transform(y)
y

[11]:

array([[0.38596491],
       [1.        ],
       [0.        ]])
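As a quick check of the scaled values above, min-max scaling maps each value \(x\) to \(\frac{x - min}{max - min}\). The sketch below redoes that arithmetic in plain Python on the daily sale values from cell [9]; the helper name `min_max_scale` is made up for illustration and is not part of scikit-learn.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Daily sale of the three remaining stores: sale / opendays.
daily_sale = [8000 / 300, 5000 / 100, 3000 / 250]
scaled = min_max_scale(daily_sale)
print([round(v, 6) for v in scaled])  # [0.385965, 1.0, 0.0]
```

The first value is (26.666667 - 12) / (50 - 12), which matches the first row of the array above.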

Models usually expect y as a 1D array and otherwise give a warning.

[12]:

y = y.ravel()
y

[12]:

array([0.38596491, 1.        , 0.        ])

Process X#

Replace store levels with ranking numbers.

[13]:

df_x = df_x.replace({"normal": 1, "important": 2, "strategic": 3})
df_x

/tmp/ipykernel_3512/3708797549.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_x = df_x.replace({"normal": 1, "important": 2, "strategic": 3})

[13]:

	floor	type	level	area	population	score
0	1F	School	3	100	3000.0	10
1	2F	Mall	1	95	800.0	8
2	1F	Office	2	177	1200.0	6

Encode categorical features.

[14]:

from sklearn.preprocessing import OneHotEncoder

x_encoder = OneHotEncoder()
x_category = x_encoder.fit_transform(df_x[features_category])
x_category

[14]:

<3x5 sparse matrix of type '<class 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>

Scale number features.

[15]:

x_scaler = MinMaxScaler()
x_number = x_scaler.fit_transform(df_x[features_number])
x_number

[15]:

array([[1.        , 0.06097561, 1.        , 1.        ],
       [0.        , 0.        , 0.        , 0.5       ],
       [0.5       , 1.        , 0.18181818, 0.        ]])

Merge all features to one. OneHotEncoder returns a sparse matrix by default, so convert it to a dense array before stacking.

[16]:

import numpy as np

X = np.hstack([x_number, x_category.toarray()])
X.shape

[16]:

(3, 9)

The Pipeline Way#

From The Pandas Way section, we can see that:

There are many intermediate variables. Most of the time we don't care about them, except when debugging and reviewing.
The data workflow is messy. It is hard to separate data from operations.
The output data structure is inconvenient. The input type is pandas.DataFrame but the output type is numpy.ndarray.
It is hard to apply the same steps to prediction data.

Further One Step to Pipeline#

sklearn.pipeline.Pipeline is a good framework for fixing these problems.

Transform the code in the Process X and Process y sections into pipeline code.

But in practice these steps are hard to move into a pipeline. Most of them are pandas methods; only OneHotEncoder and MinMaxScaler can be added to sklearn.pipeline.Pipeline.

The code is still messy, because it has to be written and applied in two different ways.
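To make the limitation concrete, here is a rough sketch of how far plain scikit-learn gets. It assumes `make_column_transformer` for routing columns to the encoder and the scaler, which this guide does not name, and it types the already filtered frame in by hand, because the pandas steps (query, eval, replace) still have to run outside the pipeline.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# The frame after the pandas-only steps (filter, eval, replace), typed in by
# hand because those steps cannot live inside the scikit-learn pipeline.
df_x = pd.DataFrame(
    {
        "floor": ["1F", "2F", "1F"],
        "type": ["School", "Mall", "Office"],
        "level": [3, 1, 2],
        "area": [100, 95, 177],
        "population": [3000.0, 800.0, 1200.0],
        "score": [10, 8, 6],
    }
)

ct = make_column_transformer(
    (OneHotEncoder(), ["floor", "type"]),
    (MinMaxScaler(), ["level", "area", "population", "score"]),
    sparse_threshold=0,  # always return a dense numpy array
)
X = ct.fit_transform(df_x)
print(type(X).__name__, X.shape)
```

Only the encoding and scaling are covered, and the result is a numpy array rather than a DataFrame, so the column names are lost along the way.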

The `dtoolkit.transformer` Way#

The framework is good, but the Further One Step to Pipeline section shows that the core problem is missing transformers.

Pandas methods can't be used as transformers.
NumPy methods can't be used as transformers.
Sklearn's transformers can't take pandas in and give pandas out.
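To illustrate the gap, the sketch below wraps a pandas method so that it behaves like a scikit-learn transformer. The class name `DataFrameMethodTF` is hypothetical and this is not how dtoolkit implements its transformers; it only shows the idea of a stateless step that takes a DataFrame in and returns a DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class DataFrameMethodTF(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: apply one named DataFrame method."""

    def __init__(self, method, kwargs=None):
        self.method = method
        self.kwargs = kwargs

    def fit(self, X, y=None):
        return self  # stateless: there is nothing to learn

    def transform(self, X):
        # Look up the method by name and call it with the stored keyword arguments.
        return getattr(X, self.method)(**(self.kwargs or {}))


df = pd.DataFrame(
    {
        "opendays": [300, 100, 250, 15],
        "sale": [8000, 5000, 3000, 1500],
    }
)

pl = make_pipeline(
    DataFrameMethodTF("query", {"expr": "opendays > 30"}),
    DataFrameMethodTF("eval", {"expr": "sale = sale / opendays"}),
)
out = pl.fit_transform(df)
print(out)
```

The output stays a pandas.DataFrame with its index and column names, which is the behavior the transformers imported in the next cell are meant to provide.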

[17]:

from dtoolkit.transformer import (
    EvalTF,
    FilterInTF,
    GetTF,
    ReplaceTF,
    OneHotEncoder,
    QueryTF,
    RavelTF,
)
from dtoolkit.pipeline import make_pipeline, make_union

[18]:

pl_xy = make_pipeline(
    QueryTF("opendays > 30"),
    FilterInTF({"type": ["School", "Mall", "Office"]}),
    EvalTF("sale = sale / opendays"),
    EvalTF("population = score / 10 * population"),
)
pl_xy

[18]:

Pipeline(steps=[('querytf',
                 <dtoolkit.transformer.pandas.QueryTF.QueryTF object at 0x7fde0c334ad0>),
                ('filterintf',
                 <dtoolkit.transformer.pandas.FilterInTF.FilterInTF object at 0x7fde0d6eab40>),
                ('evaltf-1',
                 <dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7fde0d6ea300>),
                ('evaltf-2',
                 <dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7fde0bf1fe30>)])

[19]:

pl_x = make_pipeline(
    GetTF(features),
    ReplaceTF({"normal": 1, "important": 2, "strategic": 3}),
    make_union(
        make_pipeline(
            GetTF(features_category),
            OneHotEncoder(),
        ),
        make_pipeline(
            GetTF(features_number),
            MinMaxScaler(),
        ),
    ),
)
pl_x

[19]:

Pipeline(steps=[('gettf',
                 <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6d970>),
                ('replacetf',
                 <dtoolkit.transformer.pandas.ReplaceTF.ReplaceTF object at 0x7fde0d5abf20>),
                ('featureunion',
                 FeatureUnion(transformer_list=[('pipeline-1',
                                                 Pipeline(steps=[('gettf',
                                                                  <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0d6f52e0>),
                                                                 ('onehotencoder',
                                                                  OneHotEncoder())])),
                                                ('pipeline-2',
                                                 Pipeline(steps=[('gettf',
                                                                  <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6d7c0>),
                                                                 ('minmaxscaler',
                                                                  MinMaxScaler())]))]))])

[20]:

pl_y = make_pipeline(
    GetTF(label),
    MinMaxScaler(),
    RavelTF(),
)
pl_y

[20]:

Pipeline(steps=[('gettf',
                 <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6dd00>),
                ('minmaxscaler', MinMaxScaler()),
                ('raveltf',
                 <dtoolkit.transformer.numpy.RavelTF.RavelTF object at 0x7fde0bf6dca0>)])

[21]:

store_sale_dict = {
    "code": ["811-10001", "811-10002", "811-10003", "811-10004"],
    "name": ["A", "B", "C", "D"],
    "floor": ["1F", "2F", "1F", "B2"],
    "level": ["strategic", "normal", "important", "normal"],
    "type": ["School", "Mall", "Office", "Home"],
    "area": [100, 95, 177, 70],
    "population": [3000, 1000, 2000, 1500],
    "score": [10, 8, 6, 5],
    "opendays": [300, 100, 250, 15],
    "sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df

[21]:

	code	name	floor	level	type	area	population	score	opendays	sale
0	811-10001	A	1F	strategic	School	100	3000	10	300	8000
1	811-10002	B	2F	normal	Mall	95	1000	8	100	5000
2	811-10003	C	1F	important	Office	177	2000	6	250	3000
3	811-10004	D	B2	normal	Home	70	1500	5	15	1500

[22]:

xy = pl_xy.fit_transform(df)
xy

[22]:

	code	name	floor	level	type	area	population	score	opendays	sale
0	811-10001	A	1F	strategic	School	100	3000.0	10	300	26.666667
1	811-10002	B	2F	normal	Mall	95	800.0	8	100	50.000000
2	811-10003	C	1F	important	Office	177	1200.0	6	250	12.000000

[23]:

X = pl_x.fit_transform(xy)
X

/home/docs/checkouts/readthedocs.org/user_builds/my-data-toolkit/conda/latest/lib/python3.12/site-packages/dtoolkit/transformer/base.py:91: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  Xt = self.transform_method(X, *self.args, **self.kwargs)

[23]:

	1F	2F	Mall	Office	School	level	area	population	score
0	1.0	0.0	0.0	0.0	1.0	1.0	0.060976	1.000000	1.0
1	0.0	1.0	1.0	0.0	0.0	0.0	0.000000	0.000000	0.5
2	1.0	0.0	0.0	1.0	0.0	0.5	1.000000	0.181818	0.0

[24]:

y = pl_y.fit_transform(xy)
y

[24]:

0    0.385965
1    1.000000
2    0.000000
Name: sale, dtype: float64

We can also save these pipelines as binary files via pickle or joblib. When new data comes in, we can quickly transform it with the saved pipeline.
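A minimal sketch of that round trip with the standard library's pickle. It uses a plain scikit-learn pipeline with a single MinMaxScaler as a stand-in for pl_y, and keeps the bytes in memory instead of writing a file; both choices are assumptions made to keep the example self-contained.

```python
import pickle

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Fit on the training values (daily sale from the sample data).
train = np.array([[8000 / 300], [50.0], [12.0]])
pl = make_pipeline(MinMaxScaler())
pl.fit(train)

# Serialize the fitted pipeline, then restore it as another process would.
blob = pickle.dumps(pl)
restored = pickle.loads(blob)

# New data is scaled with the min and max learned from the training data,
# so only transform is called here, not fit_transform.
new = np.array([[31.0]])
result = restored.transform(new)
print(result)
```

A daily sale of 31 lands halfway between the training minimum 12 and maximum 50, so the restored pipeline returns a value of about 0.5.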

Other Ways to Handle This#

pandas.DataFrame.pipe and plain functions also work.

But they are:

hard to turn into application code correctly
hard to debug, and hard to use for checking the intermediate data
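For comparison, here is a sketch of the function-and-pipe style applied to the same filtering and transformation steps. The function names are invented for illustration.

```python
import pandas as pd


def drop_short_lived(df, min_days=30):
    """Keep only stores that have been open for more than min_days."""
    return df[df["opendays"] > min_days]


def to_daily_sale(df):
    """Replace the yearly sale with the daily sale."""
    return df.assign(sale=df["sale"] / df["opendays"])


def to_entry_population(df):
    """Weight the buffer population by the expert score."""
    return df.assign(population=df["score"] / 10 * df["population"])


df = pd.DataFrame(
    {
        "population": [3000, 1000, 2000, 1500],
        "score": [10, 8, 6, 5],
        "opendays": [300, 100, 250, 15],
        "sale": [8000, 5000, 3000, 1500],
    }
)

out = df.pipe(drop_short_lived).pipe(to_daily_sale).pipe(to_entry_population)
print(out)
```

Each function is easy to read, but nothing in the chain stores fitted state such as a scaler's minimum and maximum, so reusing it on prediction data means passing that state around by hand.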

What’s Next - Learn or Build Transformers#

In this tutorial we took a quick glance at dtoolkit.transformer.

As a next step, learn about the other transformers in the Transformer API documentation. If those transformers don't meet your requirements, you can build your own by following the How to Build Transformer documentation.
