{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transformer and Pipeline Quickstart"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Transformer` faces the engineering of **data preprocessing**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applicable Scene"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In steps of data preprocessing, we always need to do some **duplication things**.\n",
    "\n",
    "When we finished dealing with the training dataset, we also need to sort those\n",
    "preprocessing steps out and make them to a function, an API, or something."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sample Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Note\n",
    "\n",
    "All data are virtual.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some stores sale data of one chain brand.\n",
    "\n",
    "- These stores place one region.\n",
    "- Time is one specific year.\n",
    "- Sale is a year total amount.\n",
    "- Population is surrounding $200m$ buffer daily people numbers.\n",
    "- Score is given by the expert, ranges from 0 to 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "store_sale_dict = {\n",
    "    \"code\": [\"811-10001\", \"811-10002\", \"811-10003\", \"811-10004\"],\n",
    "    \"name\": [\"A\", \"B\", \"C\", \"D\"],\n",
    "    \"floor\": [\"1F\", \"2F\", \"1F\", \"B2\"],\n",
    "    \"level\": [\"strategic\", \"normal\", \"important\", \"normal\"],\n",
    "    \"type\": [\"School\", \"Mall\", \"Office\", \"Home\"],\n",
    "    \"area\": [100, 95, 177, 70],\n",
    "    \"population\": [3000, 1000, 2000, 1500],\n",
    "    \"score\": [10, 8, 6, 5],\n",
    "    \"opendays\": [300, 100, 250, 15],\n",
    "    \"sale\": [8000, 5000, 3000, 1500],\n",
    "}\n",
    "df = pd.DataFrame(store_sale_dict)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Types and Dealing Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First of all, we should know there are three types of features ($X$) and one label ($y$).\n",
    "\n",
    "- Additional information features: drop\n",
    "  - code\n",
    "  - name\n",
    "- Categorical features: encode to one-hot\n",
    "  - floor\n",
    "  - type: drop `'Home'` type, this type store numbers are very small.\n",
    "- Number features: scale\n",
    "  - level: it is not **categorical** type, because it could be compared.\n",
    "  - area\n",
    "  - population: there is buffer ranging population, but more want to enter store population, equal to  $\\frac{score}{10} \\times population$.\n",
    "  - score\n",
    "  - opendays: filter `opendays <= 30` stores then drop this field\n",
    "- Label: need to balance, should transform to daily sale, equal to $\\frac{sale}{opendays}$ then scale\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Mission\n",
    "\n",
    "Our mission is to find some relationships between these features and label.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Pandas Way"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In pandas code, most users might type something like this:\n",
    "\n",
    "Set a series of feature name constants."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features_category = [\"floor\", \"type\"]\n",
    "features_number = [\"level\", \"area\", \"population\", \"score\"]\n",
    "features = features_category + features_number\n",
    "label = [\"sale\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Process X and y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filter opendays' store less than 30 days.\n",
    "Because these samples are not normal stores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.query(\"opendays > 30\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filter `'Home'` store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[df[\"type\"] != \"Home\"]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transform sale to daily sale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.eval(\"sale = sale / opendays\")\n",
    "df\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transform population to entry store population."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.eval(\"population = score / 10 * population\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split `df` to `df_x` and `y`and separately process them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_x = df[features]\n",
    "df_x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = df[label]\n",
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Process y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scale `y`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "y_scaler = MinMaxScaler()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scaler handle a column as a unit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = y.values.reshape(-1, 1)\n",
    "y = y_scaler.fit_transform(y)\n",
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model always requires a 1d array otherwise would give a warning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = y.ravel()\n",
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Process X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace store types to ranking numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_x = df_x.replace({\"normal\": 1, \"important\": 2, \"strategic\": 3})\n",
    "df_x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Encode categorical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "x_encoder = OneHotEncoder()\n",
    "x_category = x_encoder.fit_transform(df_x[features_category])\n",
    "x_category"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scale number features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_scaler = MinMaxScaler()\n",
    "x_scaler = x_scaler.fit_transform(df_x[features_number])\n",
    "x_scaler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge all features to one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "X = np.hstack([x_scaler, x_category])\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Pipeline Way"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From [The Pandas Way](#the-pandas-way) section, we can see that:\n",
    "\n",
    "- The intermediate variables are full of steps. We don't care about them atthe most time except debugging and reviewing.\n",
    "- Data workflow is messy. Hard to separate data and operations.\n",
    "- The outputting datastruct is not comfortable. The inputting type is `pandas.DataFrame` but the outputting type is `numpy.ndarray`.\n",
    "- Hard to apply in prediction data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further One Step to Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`sklearn.pipeline.Pipeline` is a good frame to fix these problems.\n",
    "\n",
    "Transform [process X](#process-x) and [process y](#process-y) section codes to pipeline codees.\n",
    "\n",
    "But actually, these things are hard to transform to pipeline.\n",
    "Most are pandas methods, only OneHotEncoder and MinMaxScaler is could be added\n",
    "into `sklearn.pipeline.Pipeline`.\n",
    "\n",
    "The codes are still messy on **typing** and **applying** two ways."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The `dtoolkit.transformer` Way"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Frame is good, but from [Further One Step to Pipeline](#further-one-step-to-pipeline) section we could\n",
    "see that the core problem is **missing transformer**.\n",
    "\n",
    "- Pandas's methods couldn't be used as a transformer.\n",
    "- Numpy's methods couldn't be used as a transformer.\n",
    "- Sklearn's transformers can't pandas in and pandas out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dtoolkit.transformer import (\n",
    "    EvalTF,\n",
    "    FilterInTF,\n",
    "    GetTF,\n",
    "    ReplaceTF,\n",
    "    OneHotEncoder,\n",
    "    QueryTF,\n",
    "    RavelTF,\n",
    ")\n",
    "from dtoolkit.pipeline import make_pipeline, make_union"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pl_xy = make_pipeline(\n",
    "    QueryTF(\"opendays > 30\"),\n",
    "    FilterInTF({\"type\": [\"School\", \"Mall\", \"Office\"]}),\n",
    "    EvalTF(\"sale = sale / opendays\"),\n",
    "    EvalTF(\"population = score / 10 * population\"),\n",
    ")\n",
    "pl_xy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pl_x = make_pipeline(\n",
    "    GetTF(features),\n",
    "    ReplaceTF({\"normal\": 1, \"important\": 2, \"strategic\": 3}),\n",
    "    make_union(\n",
    "        make_pipeline(\n",
    "            GetTF(features_category),\n",
    "            OneHotEncoder(),\n",
    "        ),\n",
    "        make_pipeline(\n",
    "            GetTF(features_number),\n",
    "            MinMaxScaler(),\n",
    "        ),\n",
    "    ),\n",
    ")\n",
    "pl_x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pl_y = make_pipeline(\n",
    "    GetTF(label),\n",
    "    MinMaxScaler(),\n",
    "    RavelTF(),\n",
    ")\n",
    "pl_y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "store_sale_dict = {\n",
    "    \"code\": [\"811-10001\", \"811-10002\", \"811-10003\", \"811-10004\"],\n",
    "    \"name\": [\"A\", \"B\", \"C\", \"D\"],\n",
    "    \"floor\": [\"1F\", \"2F\", \"1F\", \"B2\"],\n",
    "    \"level\": [\"strategic\", \"normal\", \"important\", \"normal\"],\n",
    "    \"type\": [\"School\", \"Mall\", \"Office\", \"Home\"],\n",
    "    \"area\": [100, 95, 177, 70],\n",
    "    \"population\": [3000, 1000, 2000, 1500],\n",
    "    \"score\": [10, 8, 6, 5],\n",
    "    \"opendays\": [300, 100, 250, 15],\n",
    "    \"sale\": [8000, 5000, 3000, 1500],\n",
    "}\n",
    "df = pd.DataFrame(store_sale_dict)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xy = pl_xy.fit_transform(df)\n",
    "xy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = pl_x.fit_transform(xy)\n",
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = pl_y.fit_transform(xy)\n",
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could also save these pipelines as a binary file via `pickle` or `joblib`.\n",
    "When new data coming we could quickly transform them via binary file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other Ways to Handle This"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pandas.DataFrame.pipe` and `function` ways are ok.\n",
    "\n",
    "But they are:\n",
    "\n",
    "- hard to transform to application codes rightly\n",
    "- hard to debug, and check the processing data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's Next - Learn or Build Transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial we've a quickly glance about `dtoolkit.transformer`.\n",
    "\n",
    "And the next steps, should learn about other transformers,\n",
    "see documentation on [Transformer API](../reference/transformer.rst).\n",
    "If those transformers don't meet your requirements, you could build your own\n",
    "transformer, follow the documentation on [How to Build Transformer](build_transformer.ipynb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}