Transformer and Pipeline Brief Description#

Real world Pipeline

Image by Robson Machado

What is Transformer?#

The Transformer is a protocol that defines the input port and output port.

It is just like a pipe in the real world.

Transformer methods

Direction:

  • fit, transform and fit_transform go forward.

  • inverse_transform goes backward.

Minimum required methods:

  • transform and fit_transform are required.

  • fit and inverse_transform are optional.
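The method list above can be sketched in plain Python. This is only an illustration: `TransformerLike` and `AddOneTF` are hypothetical names invented here, not part of dtoolkit or scikit-learn.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TransformerLike(Protocol):
    """Hypothetical protocol holding the two required methods."""

    def transform(self, X): ...

    def fit_transform(self, X, y=None): ...


class AddOneTF:
    """Toy transformer (illustrative only) that adds 1 to every value."""

    def fit(self, X, y=None):
        # Optional method: nothing to learn here, so just return self.
        return self

    def transform(self, X):
        # Forward direction.
        return [x + 1 for x in X]

    def fit_transform(self, X, y=None):
        # Forward direction: fit, then transform the same data.
        return self.fit(X, y).transform(X)

    def inverse_transform(self, X):
        # Optional method, backward direction: undo transform.
        return [x - 1 for x in X]


tf = AddOneTF()
forward = tf.fit_transform([1, 2, 3])
backward = tf.inverse_transform(forward)
```

Here `forward` holds the data after it has passed through the pipe, and `backward` recovers the original values by running the pipe in reverse.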

What is Pipeline?#

The Pipeline connects the ports of several Transformers into a single stream.

You can control the direction of the stream, forward or backward. You can also stop the stream wherever you want and inspect its internal state.

It is just like a real pipeline in the real world.
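As a rough illustration of forward, backward, and stopping midway, here is a sketch using scikit-learn's own `Pipeline` rather than dtoolkit's. The step names `"scale"` and `"log"` and the sample data are assumptions made for this example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

pipe = Pipeline(
    [
        ("scale", MinMaxScaler()),
        ("log", FunctionTransformer(np.log1p, inverse_func=np.expm1)),
    ]
)

# Forward: the data flows through every step in order.
forward = pipe.fit_transform(X)

# Backward: each step is undone in reverse order.
restored = pipe.inverse_transform(forward)

# Stop the stream after the first step and inspect the intermediate data.
halfway = pipe[:1].transform(X)
```

Slicing with `pipe[:1]` returns a sub-pipeline that shares the already fitted steps, so it can be used to look at the data at any point in the stream.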

Pipeline

What does transformer do?#

dtoolkit.transformer does two things:

DataFrame in and DataFrame out#

sklearn.base.TransformerMixin follows an array-like in, array-like out convention.

Suppose there is a pipeline [DropColumnTF, DropRowTF, Map, MapColumnTF]. If we put a DataFrame into this pipeline and later want to debug it or inspect the preprocessed data, the changes are hard to trace: only the data itself is left, and the other information is missing.

Keeping track of how the data changes is important, so dtoolkit.transformer keeps the data as a DataFrame throughout the stream: DataFrame in, DataFrame out.
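The difference can be seen in a small sketch. It uses scikit-learn's `MinMaxScaler` with default output settings, plus a hypothetical wrapper named `KeepFrameTF` that is written only for this example and is not dtoolkit's implementation. The wrapper assumes the inner transformer keeps the same rows and columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(
    {"height": [150.0, 160.0, 170.0], "weight": [50.0, 60.0, 70.0]},
    index=["a", "b", "c"],
)

# Plain scikit-learn (default output settings): the labels are dropped.
plain = MinMaxScaler().fit_transform(df)


class KeepFrameTF:
    """Hypothetical wrapper that restores the index and column labels."""

    def __init__(self, inner):
        self.inner = inner

    def fit_transform(self, X, y=None):
        values = self.inner.fit_transform(X, y)
        return pd.DataFrame(values, index=X.index, columns=X.columns)

    def transform(self, X):
        values = self.inner.transform(X)
        return pd.DataFrame(values, index=X.index, columns=X.columns)


# DataFrame in, DataFrame out: index and columns survive the step.
framed = KeepFrameTF(MinMaxScaler()).fit_transform(df)
```

`plain` is a bare NumPy array, while `framed` carries the same values together with the original index and column names, which makes each step of a pipeline easier to inspect.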

Transformer architecture

Engineering of Data Preprocessing#

dtoolkit.transformer can tidy up messy data preprocessing code.

As a bonus, it also brings an engineering approach to data preprocessing:

  • DataFrame provides a clear view of the data.

  • Transformer contains the processing and transforming methods.

  • Pipeline acts as the manager that controls the run.

DToolKit for transformer

What’s Next - Future Features#

There are two articles about some future features of Transformer and Pipeline.