.. _sec_pandas:

Data Preprocessing
==================


So far we have introduced a variety of techniques for manipulating data
that are already stored in ``ndarray``\ s. To apply deep learning to
real-world problems, however, we often begin by preprocessing raw data
rather than with data that are already nicely prepared in the
``ndarray`` format. Among the popular data analysis tools in Python, the
``pandas`` package is commonly used. Like many other extension packages
in the vast ecosystem of Python, ``pandas`` can work together with
``ndarray``. We will therefore briefly walk through the steps for
preprocessing raw data with ``pandas`` and converting the result into
the ``ndarray`` format. We will cover more data preprocessing techniques
in later chapters.

Reading the Dataset
-------------------

As an example, we begin by creating an artificial dataset that is stored
in a CSV (comma-separated values) file. Data stored in other formats may
be processed in similar ways.

.. code:: python

    # Write the dataset row by row into a CSV file
    import os

    data_folder = '../data/'
    os.makedirs(data_folder, exist_ok=True)  # No error if the folder exists

    data_file = os.path.join(data_folder, 'house_tiny.csv')
    with open(data_file, 'w') as f:
        f.write('NumRooms,Alley,Price\n')  # Column names
        f.write('NA,Pave,127500\n')  # Each row is a data point
        f.write('2,NA,106000\n')
        f.write('4,NA,178100\n')
        f.write('NA,NA,140000\n')

To load the raw dataset from the created csv file, we import the
``pandas`` package and invoke the ``read_csv`` function. This dataset
has :math:`4` rows and :math:`3` columns, where each row describes the
number of rooms (“NumRooms”), the alley type (“Alley”), and the price
(“Price”) of a house.

.. code:: python

    # If pandas is not installed, just uncomment the following line:
    # !pip install pandas
    import pandas as pd
    
    data = pd.read_csv(data_file)
    print(data)


.. parsed-literal::
    :class: output

       NumRooms Alley   Price
    0       NaN  Pave  127500
    1       2.0   NaN  106000
    2       4.0   NaN  178100
    3       NaN   NaN  140000


Handling Missing Data
---------------------

Note that “NaN” entries are missing values: by default, ``read_csv``
parses the string “NA” in the file as missing. To handle missing data,
typical methods include *imputation* and *deletion*, where imputation
replaces missing values with substituted ones, while deletion discards
the rows or columns that contain missing values. Here we will consider
imputation.
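
For contrast, deletion can be sketched as follows. This is a minimal
illustration rather than part of the pipeline in this section: it
rebuilds the same toy table from an in-memory string (the variable names
``csv_text``, ``rows_kept``, and ``complete_rows`` are chosen here for
illustration) and uses ``isna`` and ``dropna`` from ``pandas``.

.. code:: python

    import io

    import pandas as pd

    # Same toy table as house_tiny.csv, read from an in-memory string
    csv_text = (
        'NumRooms,Alley,Price\n'
        'NA,Pave,127500\n'
        '2,NA,106000\n'
        '4,NA,178100\n'
        'NA,NA,140000\n'
    )
    data = pd.read_csv(io.StringIO(csv_text))

    # Count the missing entries in each column
    missing_per_column = data.isna().sum()
    print(missing_per_column)

    # Deletion by row: keep only the rows where 'NumRooms' is present
    rows_kept = data.dropna(subset=['NumRooms'])
    print(rows_kept)

    # Strict deletion: drop every row with at least one missing value
    complete_rows = data.dropna()
    print(len(complete_rows))

In this tiny dataset every row has at least one missing entry, so strict
deletion would discard all four rows, which is one reason to prefer
imputation here.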

By integer-location based indexing (``iloc``), we split ``data`` into
``inputs`` and ``outputs``, where the former takes the first two columns
while the latter keeps only the last column. For numerical values in
``inputs`` that are missing, we replace the “NaN” entries with the mean
value of the same column.

.. code:: python

    inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
    # Average only the numeric columns; 'Alley' holds strings
    inputs = inputs.fillna(inputs.mean(numeric_only=True))
    print(inputs)


.. parsed-literal::
    :class: output

       NumRooms Alley
    0       3.0  Pave
    1       2.0   NaN
    2       4.0   NaN
    3       3.0   NaN
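
To see where the value :math:`3` comes from, the imputation of the
“NumRooms” column can be reproduced on its own. The sketch below assumes
a standalone ``Series`` named ``num_rooms`` (a name introduced here for
illustration); ``mean`` skips missing entries by default, so the column
mean is :math:`(2 + 4) / 2 = 3`.

.. code:: python

    import numpy as np
    import pandas as pd

    # The 'NumRooms' column on its own, with two missing entries
    num_rooms = pd.Series([np.nan, 2.0, 4.0, np.nan], name='NumRooms')

    # mean() ignores NaN by default, so only 2.0 and 4.0 are averaged
    column_mean = num_rooms.mean()

    # Replace each missing entry with that mean
    filled = num_rooms.fillna(column_mean)
    print(column_mean)
    print(filled.tolist())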


For categorical or discrete values in ``inputs``, we treat “NaN” as a
category of its own. Since the “Alley” column takes only two values,
“Pave” and “NaN”, ``pandas`` can automatically convert it into two
indicator columns, “Alley_Pave” and “Alley_nan”. A row whose alley type
is “Pave” sets “Alley_Pave” and “Alley_nan” to :math:`1` and :math:`0`,
respectively. A row with a missing alley type sets them to :math:`0` and
:math:`1`.

.. code:: python

    # dtype=int keeps the indicator columns as 0/1 rather than booleans
    inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
    print(inputs)


.. parsed-literal::
    :class: output

       NumRooms  Alley_Pave  Alley_nan
    0       3.0           1          0
    1       2.0           0          1
    2       4.0           0          1
    3       3.0           0          1
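
The two indicator columns can also be built by hand, which may make the
encoding easier to follow. The sketch below is only an illustration of
what the conversion computes for this column, not how ``pandas``
implements it; the names ``alley`` and ``manual`` are introduced here
for the example.

.. code:: python

    import numpy as np
    import pandas as pd

    # The 'Alley' column on its own: one 'Pave' entry, three missing
    alley = pd.Series(['Pave', np.nan, np.nan, np.nan], name='Alley')

    manual = pd.DataFrame({
        # 1 where the alley type is 'Pave', 0 otherwise
        'Alley_Pave': (alley == 'Pave').astype(int),
        # 1 where the alley type is missing, 0 otherwise
        'Alley_nan': alley.isna().astype(int),
    })
    print(manual)

Each row has exactly one indicator set to :math:`1`, since every entry
belongs to exactly one of the two categories.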


Conversion to the ``ndarray`` Format
------------------------------------

Now that all the entries in ``inputs`` and ``outputs`` are numerical,
they can be converted to the ``ndarray`` format. Once data are in this
format, they can be further manipulated with the ``ndarray``
functionality that we introduced in :numref:`sec_ndarray`.

.. code:: python

    from mxnet import np
    
    X, y = np.array(inputs.values), np.array(outputs.values)
    X, y


.. parsed-literal::
    :class: output

    (array([[3., 1., 0.],
            [2., 0., 1.],
            [4., 0., 1.],
            [3., 0., 1.]], dtype=float64),
     array([127500, 106000, 178100, 140000], dtype=int64))
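
If MXNet is not at hand, the same conversion can be sketched with plain
NumPy arrays through ``to_numpy``. This is an assumption of the example
below, not the route taken in this section; the preprocessed table is
retyped inline so that the snippet runs on its own.

.. code:: python

    import pandas as pd

    # The preprocessed inputs and outputs from above, retyped inline
    inputs = pd.DataFrame({
        'NumRooms': [3.0, 2.0, 4.0, 3.0],
        'Alley_Pave': [1, 0, 0, 0],
        'Alley_nan': [0, 1, 1, 1],
    })
    outputs = pd.Series([127500, 106000, 178100, 140000], name='Price')

    # to_numpy returns NumPy arrays; dtype=float unifies the input columns
    X = inputs.to_numpy(dtype=float)
    y = outputs.to_numpy()
    print(X.shape, y.shape)

The resulting ``X`` has one row per house and one column per feature,
while ``y`` is a vector with one price per house.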


Summary
-------

-  Like many other extension packages in the vast ecosystem of Python,
   ``pandas`` can work together with ``ndarray``.
-  Imputation and deletion can be used to handle missing data.

Exercises
---------

Create a raw dataset with more rows and columns.

1. Delete the column with the most missing values.
2. Convert the preprocessed dataset to the ``ndarray`` format.

`Discussions <https://discuss.mxnet.io/t/4973>`__
-------------------------------------------------

|image0|

.. |image0| image:: ../img/qr_pandas.svg