{ "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Transformer and Pipeline Quickstart" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "`Transformer` targets the engineering side of **data preprocessing**." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Applicable Scenarios" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Data preprocessing always involves some **repetitive work**.\n", "\n", "Once we have finished processing the training dataset, we still need to collect those\n", "preprocessing steps and package them as a function, an API, or something similar." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** All data here is fictional." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is the sales data for some stores of one chain brand.\n", "\n", "- The stores are all located in one region.\n", "- The time span is one specific year.\n", "- Sale is the total amount for that year.\n", "- Population is the daily number of people within a $200m$ buffer around the store.\n", "- Score is given by an expert and ranges from 0 to 10." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "store_sale_dict = {\n", "    \"code\": [\"811-10001\", \"811-10002\", \"811-10003\", \"811-10004\"],\n", "    \"name\": [\"A\", \"B\", \"C\", \"D\"],\n", "    \"floor\": [\"1F\", \"2F\", \"1F\", \"B2\"],\n", "    \"level\": [\"strategic\", \"normal\", \"important\", \"normal\"],\n", "    \"type\": [\"School\", \"Mall\", \"Office\", \"Home\"],\n", "    \"area\": [100, 95, 177, 70],\n", "    \"population\": [3000, 1000, 2000, 1500],\n", "    \"score\": [10, 8, 6, 5],\n", "    \"opendays\": [300, 100, 250, 15],\n", "    \"sale\": [8000, 5000, 3000, 1500],\n", "}\n", "df = pd.DataFrame(store_sale_dict)\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Types and Dealing Steps" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First of all, note that there are three types of features ($X$) and one label ($y$).\n", "\n", "- Additional information features: drop\n", "    - code\n", "    - name\n", "- Categorical features: encode as one-hot\n", "    - floor\n", "    - type: drop the `'Home'` type, because very few stores are of this type.\n", "- Numeric features: scale\n", "    - level: this is not a **categorical** feature, because its values can be ranked.\n", "    - area\n", "    - population: this is the population within the buffer, but what we want is the population entering the store, estimated as $\\frac{score}{10} \\times population$.\n", "    - score\n", "    - opendays: filter out stores with `opendays <= 30`, then drop this field\n", "- Label: needs to be normalized; transform it to daily sale, equal to $\\frac{sale}{opendays}$, then scale it\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Mission:** Find relationships between these features and the label." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The Pandas Way" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With plain pandas, most users would write something like the following.\n", "\n", "First, define the feature name constants." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_category = [\"floor\", \"type\"]\n", "features_number = [\"level\", \"area\", \"population\", \"score\"]\n", "features = features_category + features_number\n", "label = [\"sale\"]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Process X and y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Filter out stores that were open for 30 days or fewer,\n", "because these samples are not typical stores." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.query(\"opendays > 30\")\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Filter out `'Home'` stores." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df[df[\"type\"] != \"Home\"]\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Transform sale to daily sale." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.eval(\"sale = sale / opendays\")\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Transform population to the population entering the store." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.eval(\"population = score / 10 * population\")\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Split `df` into `df_x` and `y`, and process them separately."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_x = df[features]\n", "df_x" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = df[label]\n", "y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Process y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Scale `y`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "y_scaler = MinMaxScaler()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A scaler treats each column as a unit, so `y` must be shaped as a single column." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = y.values.reshape(-1, 1)\n", "y = y_scaler.fit_transform(y)\n", "y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Most models expect `y` to be a 1-D array and give a warning otherwise." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = y.ravel()\n", "y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Process X" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Replace store levels with ranking numbers." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_x = df_x.replace({\"normal\": 1, \"important\": 2, \"strategic\": 3})\n", "df_x" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Encode categorical features." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "\n", "x_encoder = OneHotEncoder()\n", "# The encoder returns a sparse matrix by default; convert it to a dense array\n", "# so that it can be stacked with the scaled numeric features later.\n", "x_category = x_encoder.fit_transform(df_x[features_category]).toarray()\n", "x_category" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Scale numeric features."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_scaler = MinMaxScaler()\n", "# Keep the fitted scaler in `x_scaler` and store the scaled values separately.\n", "x_number = x_scaler.fit_transform(df_x[features_number])\n", "x_number" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Merge all features into one array." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "X = np.hstack([x_number, x_category])\n", "X" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The Pipeline Way" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "From [The Pandas Way](#the-pandas-way) section, we can see that:\n", "\n", "- The steps are full of intermediate variables. Most of the time we don't care about them, except when debugging and reviewing.\n", "- The data workflow is messy. It is hard to separate the data from the operations.\n", "- The output data structure is inconvenient. The input type is `pandas.DataFrame` but the output type is `numpy.ndarray`.\n", "- It is hard to apply the same steps to prediction data." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Further One Step to Pipeline" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "`sklearn.pipeline.Pipeline` is a good framework for fixing these problems.\n", "\n", "Ideally, we would convert the code in the [process X](#process-x) and [process y](#process-y) sections into pipeline code.\n", "\n", "In practice, these steps are hard to convert into a pipeline.\n", "Most of them are pandas methods; only `OneHotEncoder` and `MinMaxScaler` can be added\n", "to a `sklearn.pipeline.Pipeline`.\n", "\n", "The code is still messy both to **write** and to **apply**."
] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The `dtoolkit.transformer` Way" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The framework is good, but the [Further One Step to Pipeline](#further-one-step-to-pipeline) section shows\n", "that the core problem is the **missing transformers**.\n", "\n", "- pandas methods can't be used as transformers.\n", "- NumPy methods can't be used as transformers.\n", "- scikit-learn transformers can't take pandas objects in and give pandas objects out." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dtoolkit.transformer import (\n", "    EvalTF,\n", "    FilterInTF,\n", "    GetTF,\n", "    ReplaceTF,\n", "    OneHotEncoder,\n", "    QueryTF,\n", "    RavelTF,\n", ")\n", "from dtoolkit.pipeline import make_pipeline, make_union" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pl_xy = make_pipeline(\n", "    QueryTF(\"opendays > 30\"),\n", "    FilterInTF({\"type\": [\"School\", \"Mall\", \"Office\"]}),\n", "    EvalTF(\"sale = sale / opendays\"),\n", "    EvalTF(\"population = score / 10 * population\"),\n", ")\n", "pl_xy" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pl_x = make_pipeline(\n", "    GetTF(features),\n", "    ReplaceTF({\"normal\": 1, \"important\": 2, \"strategic\": 3}),\n", "    make_union(\n", "        make_pipeline(\n", "            GetTF(features_category),\n", "            OneHotEncoder(),\n", "        ),\n", "        make_pipeline(\n", "            GetTF(features_number),\n", "            MinMaxScaler(),\n", "        ),\n", "    ),\n", ")\n", "pl_x" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pl_y = make_pipeline(\n", "    GetTF(label),\n", "    MinMaxScaler(),\n", "    RavelTF(),\n", ")\n", "pl_y" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rebuild the raw sample data, because `df` was modified in The Pandas Way section.\n", "store_sale_dict = {\n", "    \"code\": [\"811-10001\", \"811-10002\", \"811-10003\", \"811-10004\"],\n", "    \"name\": [\"A\", \"B\", \"C\", \"D\"],\n", "    \"floor\": [\"1F\", \"2F\", \"1F\", \"B2\"],\n", "    \"level\": [\"strategic\", \"normal\", \"important\", \"normal\"],\n", "    \"type\": [\"School\", \"Mall\", \"Office\", \"Home\"],\n", "    \"area\": [100, 95, 177, 70],\n", "    \"population\": [3000, 1000, 2000, 1500],\n", "    \"score\": [10, 8, 6, 5],\n", "    \"opendays\": [300, 100, 250, 15],\n", "    \"sale\": [8000, 5000, 3000, 1500],\n", "}\n", "df = pd.DataFrame(store_sale_dict)\n", "df" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xy = pl_xy.fit_transform(df)\n", "xy" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = pl_x.fit_transform(xy)\n", "X" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = pl_y.fit_transform(xy)\n", "y" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also save these pipelines to a binary file via `pickle` or `joblib`.\n", "When new data comes in, we can load the file and transform the data right away." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Other Ways to Handle This" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Using `pandas.DataFrame.pipe` or plain functions also works.\n", "\n", "But they are:\n", "\n", "- hard to convert correctly into application code\n", "- hard to debug, and hard to inspect the intermediate data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## What's Next - Learn or Build Transformers" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we took a quick look at `dtoolkit.transformer`.\n", "\n", "As a next step, learn about the other transformers;\n", "see the documentation on the [Transformer API](../reference/transformer.rst).\n", "If those transformers don't meet your requirements, you can build your own\n", "transformer by following the documentation on [How to Build Transformer](build_transformer.ipynb)."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 2 }