.. _sec_pandas: Data Preprocessing ================== So far we have introduced a variety of techniques for manipulating data that are already stored in ``ndarray``\ s. To apply deep learning to solving real-world problems, we often begin with preprocessing raw data, rather than those nicely prepared data in the ``ndarray`` format. Among popular data analytic tools in Python, the ``pandas`` package is commonly used. Like many other extension packages in the vast ecosystem of Python, ``pandas`` can work together with ``ndarray``. So, we will briefly walk through steps for preprocessing raw data with ``pandas`` and converting them into the ``ndarray`` format. We will cover more data preprocessing techniques in later chapters. Reading the Dataset ------------------- As an example, we begin by creating an artificial dataset that is stored in a csv (comma-separated values) file. Data stored in other formats may be processed in similar ways. .. code:: python # Write the dataset row by row into a csv file import os data_folder = '../data/' if not os.path.exists(data_folder): os.makedirs(data_folder) data_file = '../data/house_tiny.csv' with open(data_file, 'w') as f: f.write('NumRooms,Alley,Price\n') # Column names f.write('NA,Pave,127500\n') # Each row is a data point f.write('2,NA,106000\n') f.write('4,NA,178100\n') f.write('NA,NA,140000\n') To load the raw dataset from the created csv file, we import the ``pandas`` package and invoke the ``read_csv`` function. This dataset has :math:`4` rows and :math:`3` columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house. .. code:: python # If pandas is not installed, just uncomment the following line: # !pip install pandas import pandas as pd data = pd.read_csv(data_file) print(data) .. parsed-literal:: :class: output NumRooms Alley Price 0 NaN Pave 127500 1 2.0 NaN 106000 2 4.0 NaN 178100 3 NaN NaN 140000 Handling Missing Data --------------------- Note that “NaN” entries are missing values. To handle missing data, typical methods include *imputation* and *deletion*, where imputation replaces missing values with substituted ones, while deletion ignores missing values. Here we will consider imputation. By integer-location based indexing (``iloc``), we split ``data`` into ``inputs`` and ``outputs``, where the former takes the first 2 columns while the later only keeps the last column. For numerical values in ``inputs`` that are missing, we replace the “NaN” entries with the mean value of the same column. .. code:: python inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2] inputs = inputs.fillna(inputs.mean()) print(inputs) .. parsed-literal:: :class: output NumRooms Alley 0 3.0 Pave 1 2.0 NaN 2 4.0 NaN 3 3.0 NaN For categorical or discrete values in ``inputs``, we consider “NaN” as a category. Since the “Alley” column only takes 2 types of categorical values “Pave” and “NaN”, ``pandas`` can automatically convert this column to 2 columns “Alley_Pave” and “Alley_nan”. A row whose alley type is “Pave” will set values of “Alley_Pave” and “Alley_nan” to :math:`1` and :math:`0`. A row with a missing alley type will set their values to :math:`0` and :math:`1`. .. code:: python inputs = pd.get_dummies(inputs, dummy_na=True) print(inputs) .. parsed-literal:: :class: output NumRooms Alley_Pave Alley_nan 0 3.0 1 0 1 2.0 0 1 2 4.0 0 1 3 3.0 0 1 Conversion to the ``ndarray`` Format ------------------------------------ Now that all the entries in ``inputs`` and ``outputs`` are numerical, they can be converted to the ``ndarray`` format. Once data are in this format, they can be further manipulated with those ``ndarray`` functionalities that we have introduced in :numref:`sec_ndarray`. .. code:: python from mxnet import np X, y = np.array(inputs.values), np.array(outputs.values) X, y .. parsed-literal:: :class: output (array([[3., 1., 0.], [2., 0., 1.], [4., 0., 1.], [3., 0., 1.]], dtype=float64), array([127500, 106000, 178100, 140000], dtype=int64)) Summary ------- - Like many other extension packages in the vast ecosystem of Python, ``pandas`` can work together with ``ndarray``. - Imputation and deletion can be used to handle missing data. Exercises --------- Create a raw dataset with more rows and columns. 1. Delete the column with the most missing values. 2. Convert the preprocessed dataset to the ``ndarray`` format. `Discussions `__ ------------------------------------------------- |image0| .. |image0| image:: ../img/qr_pandas.svg