Tip

This page was generated from guide/transformer_quickstart.ipynb.

Transformer and Pipeline Quickstart#

Transformers address the engineering side of data preprocessing.

Applicable Scenario#

Data preprocessing always involves some repetitive work.

After we finish processing the training dataset, we still need to sort those preprocessing steps out and package them as a function, an API, or something similar.

Sample Data#

Note

All data are fictitious.

Below is the sales data for several stores of one chain brand.

  • The stores are all located in one region.

  • Time is one specific year.

  • Sale is the total amount for the year.

  • Population is the daily number of people within the surrounding \(200\,\mathrm{m}\) buffer.

  • Score is given by an expert and ranges from 0 to 10.

[1]:
import pandas as pd
[2]:
store_sale_dict = {
    "code": ["811-10001", "811-10002", "811-10003", "811-10004"],
    "name": ["A", "B", "C", "D"],
    "floor": ["1F", "2F", "1F", "B2"],
    "level": ["strategic", "normal", "important", "normal"],
    "type": ["School", "Mall", "Office", "Home"],
    "area": [100, 95, 177, 70],
    "population": [3000, 1000, 2000, 1500],
    "score": [10, 8, 6, 5],
    "opendays": [300, 100, 250, 15],
    "sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df
[2]:
code name floor level type area population score opendays sale
0 811-10001 A 1F strategic School 100 3000 10 300 8000
1 811-10002 B 2F normal Mall 95 1000 8 100 5000
2 811-10003 C 1F important Office 177 2000 6 250 3000
3 811-10004 D B2 normal Home 70 1500 5 15 1500

Feature Types and Processing Steps#

First of all, we should know there are three types of features (\(X\)) and one label (\(y\)).

  • Additional information features: drop

    • code

    • name

  • Categorical features: encode to one-hot

    • floor

    • type: drop the 'Home' type, because there are very few stores of this type.

  • Numerical features: scale

    • level: it is not a categorical type, because its values can be ranked.

    • area

    • population: this is the population within the buffer, but what we want is the population that enters the store, equal to \(\frac{score}{10} \times population\).

    • score

    • opendays: filter out stores with opendays <= 30, then drop this field

  • Label: needs to be balanced, so transform it to daily sale, equal to \(\frac{sale}{opendays}\), then scale

Mission

Our mission is to find relationships between these features and the label.

The Pandas Way#

With pandas, most users might write something like this:

First, define the feature name constants.

[3]:
features_category = ["floor", "type"]
features_number = ["level", "area", "population", "score"]
features = features_category + features_number
label = ["sale"]

Process X and y#

Filter out stores that have been open for 30 days or fewer, because these samples are not normal stores.

[4]:
df = df.query("opendays > 30")
df
[4]:
code name floor level type area population score opendays sale
0 811-10001 A 1F strategic School 100 3000 10 300 8000
1 811-10002 B 2F normal Mall 95 1000 8 100 5000
2 811-10003 C 1F important Office 177 2000 6 250 3000

Filter out 'Home' stores.

[5]:
df = df[df["type"] != "Home"]
df
[5]:
code name floor level type area population score opendays sale
0 811-10001 A 1F strategic School 100 3000 10 300 8000
1 811-10002 B 2F normal Mall 95 1000 8 100 5000
2 811-10003 C 1F important Office 177 2000 6 250 3000

Transform sale to daily sale.

[6]:
df = df.eval("sale = sale / opendays")
df
[6]:
code name floor level type area population score opendays sale
0 811-10001 A 1F strategic School 100 3000 10 300 26.666667
1 811-10002 B 2F normal Mall 95 1000 8 100 50.000000
2 811-10003 C 1F important Office 177 2000 6 250 12.000000

Transform population to the population that enters the store.

[7]:
df = df.eval("population = score / 10 * population")
df
[7]:
code name floor level type area population score opendays sale
0 811-10001 A 1F strategic School 100 3000.0 10 300 26.666667
1 811-10002 B 2F normal Mall 95 800.0 8 100 50.000000
2 811-10003 C 1F important Office 177 1200.0 6 250 12.000000

Split df into df_x and y, and process them separately.

[8]:
df_x = df[features]
df_x
[8]:
floor type level area population score
0 1F School strategic 100 3000.0 10
1 2F Mall normal 95 800.0 8
2 1F Office important 177 1200.0 6
[9]:
y = df[label]
y
[9]:
sale
0 26.666667
1 50.000000
2 12.000000

Process y#

Scale y.

[10]:
from sklearn.preprocessing import MinMaxScaler

y_scaler = MinMaxScaler()

The scaler treats each column as a unit, so it expects a 2D array.

[11]:
y = y.values.reshape(-1, 1)
y = y_scaler.fit_transform(y)
y
[11]:
array([[0.38596491],
       [1.        ],
       [0.        ]])

Most models require y to be a 1D array and would otherwise give a warning.

[12]:
y = y.ravel()
y
[12]:
array([0.38596491, 1.        , 0.        ])

Process X#

Replace store levels with ranking numbers.

[13]:
df_x = df_x.replace({"normal": 1, "important": 2, "strategic": 3})
df_x
/tmp/ipykernel_3512/3708797549.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_x = df_x.replace({"normal": 1, "important": 2, "strategic": 3})
[13]:
floor type level area population score
0 1F School 3 100 3000.0 10
1 2F Mall 1 95 800.0 8
2 1F Office 2 177 1200.0 6

Encode categorical features.

[14]:
from sklearn.preprocessing import OneHotEncoder

x_encoder = OneHotEncoder()
x_category = x_encoder.fit_transform(df_x[features_category])
x_category
[14]:
<3x5 sparse matrix of type '<class 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>

Scale numerical features.

[15]:
x_scaler = MinMaxScaler()
x_scaler = x_scaler.fit_transform(df_x[features_number])
x_scaler
[15]:
array([[1.        , 0.06097561, 1.        , 1.        ],
       [0.        , 0.        , 0.        , 0.5       ],
       [0.5       , 1.        , 0.18181818, 0.        ]])

Merge all features to one.

[16]:
import numpy as np

# OneHotEncoder returns a sparse matrix by default, so convert it to a
# dense array before stacking. Note that x_scaler holds the scaled array here.
X = np.hstack([x_scaler, x_category.toarray()])
X

The Pipeline Way#

From The Pandas Way section, we can see that:

  • The code is full of intermediate variables. Most of the time we don't care about them, except when debugging and reviewing.

  • The data workflow is messy. It is hard to separate data from operations.

  • The output data structure is inconvenient. The input type is pandas.DataFrame but the output type is numpy.ndarray.

  • It is hard to apply to prediction data.
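To illustrate the last point, here is a minimal sketch of what applying the pandas way to prediction data involves: the fitted encoder and scaler have to be kept around by hand and reused with transform. The small training frame, the new store, and the handle_unknown="ignore" option are assumptions made for illustration and are not part of the cells above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Reduced copy of the sample data, already filtered.
train = pd.DataFrame(
    {
        "floor": ["1F", "2F", "1F"],
        "type": ["School", "Mall", "Office"],
        "area": [100, 95, 177],
    }
)
# Hypothetical new store that arrives at prediction time.
new = pd.DataFrame({"floor": ["2F"], "type": ["Office"], "area": [136]})

# Fit on the training data only and keep the fitted objects.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["floor", "type"]])
scaler = MinMaxScaler().fit(train[["area"]])

# Use transform, not fit_transform, so the new row is encoded and scaled
# with the categories and ranges learned from the training data.
new_category = encoder.transform(new[["floor", "type"]]).toarray()
new_number = scaler.transform(new[["area"]])
```

Every manual pandas step (filtering, eval, replace) would also have to be repeated on the new data in the same order.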

Further One Step to Pipeline#

sklearn.pipeline.Pipeline is a good framework for fixing these problems.

The next step would be to transform the code in the Process X and Process y sections into pipeline code.

But in practice, these steps are hard to transform into a pipeline. Most of them are pandas methods; only OneHotEncoder and MinMaxScaler could be added to sklearn.pipeline.Pipeline.

The code would still be messy, because it has to be written and applied in two different ways.
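A rough sketch of what such half-converted code might look like, assuming scikit-learn's ColumnTransformer is used to route columns (the original notebook does not show this step, and the prepare helper is hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

features_category = ["floor", "type"]
features_number = ["level", "area", "population", "score"]
level_rank = {"normal": 1, "important": 2, "strategic": 3}


def prepare(df):
    """Pandas-only steps (hypothetical helper) that stay outside the pipeline."""
    df = df[(df["opendays"] > 30) & (df["type"] != "Home")]
    return df.assign(
        population=df["score"] / 10 * df["population"],
        level=df["level"].map(level_rank),
    )


# Only the sklearn transformers fit into the pipeline.
pl_x = make_pipeline(
    ColumnTransformer(
        [
            ("category", OneHotEncoder(), features_category),
            ("number", MinMaxScaler(), features_number),
        ],
        sparse_threshold=0,  # always return a dense array
    ),
)

df = pd.DataFrame(
    {
        "floor": ["1F", "2F", "1F", "B2"],
        "level": ["strategic", "normal", "important", "normal"],
        "type": ["School", "Mall", "Office", "Home"],
        "area": [100, 95, 177, 70],
        "population": [3000, 1000, 2000, 1500],
        "score": [10, 8, 6, 5],
        "opendays": [300, 100, 250, 15],
    }
)

# Two styles have to be chained by hand, for training and for prediction.
X = pl_x.fit_transform(prepare(df))
```

The pandas part and the pipeline part still live in separate places, and the result is a numpy.ndarray rather than a pandas.DataFrame.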

The dtoolkit.transformer Way#

The framework is good, but from the Further One Step to Pipeline section we can see that the core problem is missing transformers.

  • Pandas's methods can't be used as transformers.

  • NumPy's methods can't be used as transformers.

  • Sklearn's transformers can't take pandas in and give pandas out.
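To make the first gap concrete, here is a minimal sketch of wrapping a pandas method as a scikit-learn transformer. QueryStep is a hypothetical class written for illustration; it is not how dtoolkit implements its transformers.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class QueryStep(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper exposing DataFrame.query as a transformer."""

    def __init__(self, expr):
        self.expr = expr

    def fit(self, X, y=None):
        # Nothing to learn: the query expression is fixed.
        return self

    def transform(self, X):
        # pandas in, pandas out.
        return X.query(self.expr)


df = pd.DataFrame({"name": ["A", "B", "C", "D"], "opendays": [300, 100, 250, 15]})
pl = make_pipeline(QueryStep("opendays > 30"))
kept = pl.fit_transform(df)
```

Writing one such wrapper per pandas or NumPy method quickly becomes repetitive, which is the gap a ready-made set of transformers is meant to fill.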

[17]:
from dtoolkit.transformer import (
    EvalTF,
    FilterInTF,
    GetTF,
    ReplaceTF,
    OneHotEncoder,
    QueryTF,
    RavelTF,
)
from dtoolkit.pipeline import make_pipeline, make_union
[18]:
pl_xy = make_pipeline(
    QueryTF("opendays > 30"),
    FilterInTF({"type": ["School", "Mall", "Office"]}),
    EvalTF("sale = sale / opendays"),
    EvalTF("population = score / 10 * population"),
)
pl_xy
[18]:
Pipeline(steps=[('querytf',
                 <dtoolkit.transformer.pandas.QueryTF.QueryTF object at 0x7fde0c334ad0>),
                ('filterintf',
                 <dtoolkit.transformer.pandas.FilterInTF.FilterInTF object at 0x7fde0d6eab40>),
                ('evaltf-1',
                 <dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7fde0d6ea300>),
                ('evaltf-2',
                 <dtoolkit.transformer.pandas.EvalTF.EvalTF object at 0x7fde0bf1fe30>)])
[19]:
pl_x = make_pipeline(
    GetTF(features),
    ReplaceTF({"normal": 1, "important": 2, "strategic": 3}),
    make_union(
        make_pipeline(
            GetTF(features_category),
            OneHotEncoder(),
        ),
        make_pipeline(
            GetTF(features_number),
            MinMaxScaler(),
        ),
    ),
)
pl_x
[19]:
Pipeline(steps=[('gettf',
                 <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6d970>),
                ('replacetf',
                 <dtoolkit.transformer.pandas.ReplaceTF.ReplaceTF object at 0x7fde0d5abf20>),
                ('featureunion',
                 FeatureUnion(transformer_list=[('pipeline-1',
                                                 Pipeline(steps=[('gettf',
                                                                  <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0d6f52e0>),
                                                                 ('onehotencoder',
                                                                  OneHotEncoder())])),
                                                ('pipeline-2',
                                                 Pipeline(steps=[('gettf',
                                                                  <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6d7c0>),
                                                                 ('minmaxscaler',
                                                                  MinMaxScaler())]))]))])
[20]:
pl_y = make_pipeline(
    GetTF(label),
    MinMaxScaler(),
    RavelTF(),
)
pl_y
[20]:
Pipeline(steps=[('gettf',
                 <dtoolkit.transformer.pandas.GetTF.GetTF object at 0x7fde0bf6dd00>),
                ('minmaxscaler', MinMaxScaler()),
                ('raveltf',
                 <dtoolkit.transformer.numpy.RavelTF.RavelTF object at 0x7fde0bf6dca0>)])
[21]:
store_sale_dict = {
    "code": ["811-10001", "811-10002", "811-10003", "811-10004"],
    "name": ["A", "B", "C", "D"],
    "floor": ["1F", "2F", "1F", "B2"],
    "level": ["strategic", "normal", "important", "normal"],
    "type": ["School", "Mall", "Office", "Home"],
    "area": [100, 95, 177, 70],
    "population": [3000, 1000, 2000, 1500],
    "score": [10, 8, 6, 5],
    "opendays": [300, 100, 250, 15],
    "sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df
[21]:
code name floor level type area population score opendays sale
0 811-10001 A 1F strategic School 100 3000 10 300 8000
1 811-10002 B 2F normal Mall 95 1000 8 100 5000
2 811-10003 C 1F important Office 177 2000 6 250 3000
3 811-10004 D B2 normal Home 70 1500 5 15 1500
[22]:
xy = pl_xy.fit_transform(df)
xy
[22]:
code name floor level type area population score opendays sale
0 811-10001 A 1F strategic School 100 3000.0 10 300 26.666667
1 811-10002 B 2F normal Mall 95 800.0 8 100 50.000000
2 811-10003 C 1F important Office 177 1200.0 6 250 12.000000
[23]:
X = pl_x.fit_transform(xy)
X
/home/docs/checkouts/readthedocs.org/user_builds/my-data-toolkit/conda/latest/lib/python3.12/site-packages/dtoolkit/transformer/base.py:91: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  Xt = self.transform_method(X, *self.args, **self.kwargs)
[23]:
1F 2F Mall Office School level area population score
0 1.0 0.0 0.0 0.0 1.0 1.0 0.060976 1.000000 1.0
1 0.0 1.0 1.0 0.0 0.0 0.0 0.000000 0.000000 0.5
2 1.0 0.0 0.0 1.0 0.0 0.5 1.000000 0.181818 0.0
[24]:
y = pl_y.fit_transform(xy)
y
[24]:
0    0.385965
1    1.000000
2    0.000000
Name: sale, dtype: float64

We could also save these pipelines as binary files via pickle or joblib. When new data comes in, we can quickly transform it with the saved pipelines.
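A minimal sketch of that save-and-reload round trip, using a plain scikit-learn pipeline as a stand-in so the example is self-contained. The file name and the new store are hypothetical, and the pipelines on this page are assumed to pickle the same way.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"area": [100, 95, 177]})
pl = make_pipeline(MinMaxScaler()).fit(train)

# Hypothetical file name; any writable path works.
path = Path(tempfile.mkdtemp()) / "pl_area.pkl"
path.write_bytes(pickle.dumps(pl))

# Later, possibly in another process: load the fitted pipeline
# and transform new data without refitting.
pl_loaded = pickle.loads(path.read_bytes())
new = pd.DataFrame({"area": [136]})  # hypothetical new store
scaled = pl_loaded.transform(new)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.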

Other Ways to Handle This#

pandas.DataFrame.pipe and plain functions also work.

But they are:

  • hard to turn into application code correctly

  • hard to debug and to inspect the data being processed
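For comparison, a short sketch of the pandas.DataFrame.pipe approach on a reduced copy of the sample data. The two helper functions are hypothetical names chosen for illustration.

```python
import pandas as pd


def drop_new_stores(df, min_opendays=30):
    """Keep stores that have been open for more than min_opendays days."""
    return df[df["opendays"] > min_opendays]


def to_daily_sale(df):
    """Convert the yearly sale to a daily sale."""
    return df.assign(sale=df["sale"] / df["opendays"])


df = pd.DataFrame(
    {"opendays": [300, 100, 250, 15], "sale": [8000, 5000, 3000, 1500]}
)
daily = df.pipe(drop_new_stores).pipe(to_daily_sale)
```

This chains nicely for stateless steps, but fitted steps such as scalers and encoders still have to be stored and reapplied separately.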

What’s Next - Learn or Build Transformers#

In this tutorial we've taken a quick glance at dtoolkit.transformer.

As a next step, learn about the other transformers in the Transformer API documentation. If those transformers don't meet your requirements, you can build your own by following the How to Build Transformer documentation.