# Data Exploration with Pandas and Matplotlib

Originally created by Dr. [Jian Tao](https://orcid.org/0000-0003-4228-6089), Texas A&M University

March 11, 2021

"In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods." - Wikipedia. 

EDA was first proposed by John Tukey for data analysis in 1961. According to John Tukey, EDA involves "procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

With pandas and matplotlib, we can quickly and effectively carry out EDA for a data science project. Before we start doing EDA, we first need to analyze and categorize our data science problem, so that we understand it better and can find the right tools and strategies to solve it.
![Data Science Exploration](https://github.com/happidence1/AILabs/blob/master/images/ds_exploration.svg?raw=1)

## 1. Basics of Pandas 
Credits: The following are notes taken while working through [Python for Data Analysis](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney. Only a small part of the original notebook was kept due to the time limit of this lab. Please refer to the original notebook if you want to learn more about Pandas.

* Series
* DataFrame
* Dropping Entries
* Indexing, Selecting, Filtering
* Sorting
* Input and Output

In [None]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

### 1.1 Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels.  The data can be any NumPy data type and the labels are the Series' index.

Create a Series:

In [None]:
ser_1 = Series([1, 1, 2, -3, -5, 8, 13])
ser_1

Get the array representation of a Series:

In [None]:
ser_1.values

Index objects are immutable and hold the axis labels and metadata such as names and axis names.

Get the index of the Series:

In [None]:
ser_1.index
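The immutability of Index objects mentioned above can be checked directly. The following is a minimal sketch; the names `ser`, `mutated`, and `new_labels` are made up for this illustration and are not part of the original notes.

```python
import pandas as pd

ser = pd.Series([1, 1, 2], index=['a', 'b', 'c'])

# Assigning to a single element of an Index raises TypeError,
# because Index objects are immutable.
try:
    ser.index[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index with a new set of labels is allowed.
ser.index = ['x', 'y', 'z']
new_labels = list(ser.index)
```

Immutability is what makes it safe for several Series or DataFrames to share the same Index object.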

Create a Series with a custom index:

In [None]:
ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2

Get a value from a Series:

In [None]:
ser_2.iloc[4] == ser_2['e']  # iloc selects by position, ['e'] selects by label

Get a set of values from a Series by passing in a list:

In [None]:
ser_2[['a', 'b', 'c']]

Get values greater than 0:

In [None]:
ser_2 > 0

In [None]:
ser_2[ser_2 > 0]

Scalar multiply:

In [None]:
ser_2 * 2

### 1.2 DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns.  Each column can have a different type.  A DataFrame has both a row index and a column index and is analogous to a dict of Series that share the same index.  Row and column operations are treated roughly symmetrically.  A column returned when indexing a DataFrame may be a view of the underlying data rather than a copy (the exact behavior depends on the pandas version).  To obtain an independent copy, use the Series' copy method.

Create a DataFrame:

In [None]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1

Create a DataFrame specifying a sequence of columns:

In [None]:
df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2

Like Series, columns that are not present in the data are NaN:

In [None]:
df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3

Retrieve a column by key, returning a Series:


In [None]:
df_3['state']

Retrieve a column by attribute, returning a Series:

In [None]:
df_3.year
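Because a retrieved column may share data with its DataFrame, it is safest to take an explicit copy before modifying it. The sketch below uses a small hypothetical frame `df_demo` (not one of the frames defined in these notes) to show that changes to the copy do not reach the original.

```python
import pandas as pd

df_demo = pd.DataFrame({'year': [2012, 2013], 'pop': [5.0, 5.1]})

# copy() returns an independent Series
pop_copy = df_demo['pop'].copy()
pop_copy.iloc[0] = 99.0

original_value = df_demo['pop'].iloc[0]  # still the value stored in df_demo
copied_value = pop_copy.iloc[0]          # the modified value in the copy
```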

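Retrieve a row by its index label with `loc` (with the default integer index, label 0 is also the first row; use `iloc` for purely positional access):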

In [None]:
df_3.loc[0]

Update a column by assignment:
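One way to do this is sketched below on a separate frame, `df_demo` (a hypothetical name), so that `df_3` stays unchanged for the cells that follow. The `unempl` values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'state': ['VA', 'VA', 'VA', 'MD', 'MD'],
                        'year': [2012, 2013, 2014, 2014, 2015],
                        'pop': [5.0, 5.1, 5.2, 4.0, 4.1]},
                       columns=['year', 'state', 'pop', 'unempl'])

# Assigning a scalar broadcasts it to every row of the column
df_demo['unempl'] = 0.0

# Assigning an array (or list) requires a length equal to the number of rows
df_demo['unempl'] = np.arange(5) * 0.5
```

Assigning to a column name that does not exist yet creates a new column in the same way.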

### 1.3 Dropping Entries

Drop rows from a Series or DataFrame:

In [None]:
df_3

In [None]:
df_4 = df_3.drop([0,1])
df_4

Drop columns from a DataFrame:

In [None]:
df_4 = df_4.drop('unempl', axis=1)
df_4

### 1.4 Indexing, Selecting, Filtering

Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.

In [None]:
ser_2

Select a value from a Series:

In [None]:
ser_2.iloc[0] == ser_2['a']  # position 0 and label 'a' refer to the same element

Select a slice from a Series:

In [None]:
ser_2[1:4]

Select specific values from a Series:

In [None]:
ser_2[['b', 'c', 'd']]

Select from a Series based on a filter:

In [None]:
type(ser_2 > 0)

In [None]:
ser_2[ser_2 > 0]

Select a slice from a Series with labels (note the end point is inclusive):

In [None]:
ser_2['a':'b']

Assign to a Series slice (note the end point is inclusive):

In [None]:
ser_2['a':'b'] = 0
ser_2

Pandas supports indexing into a DataFrame.

In [None]:
df_3

Select specified columns from a DataFrame:

In [None]:
df_3[['pop', 'unempl']]

Select a slice from a DataFrame:

In [None]:
df_3[:2]

Select from a DataFrame based on a filter:

In [None]:
df_3[df_3['pop'] > 5]

Select a slice of rows from a DataFrame (note the end point is inclusive):
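A minimal sketch of such a slice, using a hypothetical frame `df_demo` rather than `df_3`. It contrasts label-based slicing with `loc` (end point included) against position-based slicing with `iloc` (end point excluded).

```python
import pandas as pd

df_demo = pd.DataFrame({'year': [2012, 2013, 2014, 2014, 2015],
                        'pop': [5.0, 5.1, 5.2, 4.0, 4.1]})

# loc slices by label and includes the end label: rows 0, 1 and 2
by_label = df_demo.loc[0:2]

# iloc slices by position and excludes the end position: rows 0 and 1
by_position = df_demo.iloc[0:2]
```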

### 1.5 Sorting

In [None]:
ser_2

Sort a Series by its index:

In [None]:
ser_2.sort_index()

Sort a Series by its values:

In [None]:
ser_2.sort_values(ascending=False)

In [None]:
df_12 = DataFrame(np.arange(12).reshape((3, 4)),
                  index=['three', 'one', 'two'],
                  columns=['c', 'a', 'b', 'd'])
df_12

Sort a DataFrame by its index:

In [None]:
df_12.sort_index()

Sort a DataFrame by columns in descending order:

In [None]:
df_12.sort_index(axis=1, ascending=False)

Sort a DataFrame's values by column:

In [None]:
df_12.sort_values(by=['d', 'c'])

### 1.6 Input and Output
* Reading
* Writing

#### Reading

Read data from a CSV file into a DataFrame (use sep='\t' for TSV):

In [None]:
df_1 = pd.read_csv("./data/ozone.csv")

Get a summary of the DataFrame:

In [None]:
df_1.describe()

List the first five rows of the DataFrame:

In [None]:
df_1.head()

#### Writing

Create a copy of the CSV file, encoded in UTF-8 and hiding the index and header labels:

In [None]:
df_1.to_csv('./data/ozone_copy.csv', 
            encoding='utf-8', 
            index=False, 
            header=False)
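When a file is written with `header=False`, reading it back needs `header=None`, otherwise the first data row is taken as the column names. The sketch below shows the round trip with an in-memory buffer instead of a file on disk; `df_demo`, `buffer`, and the column names are made up for this illustration.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]})

# Write without index and header labels, as in the cell above
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False, header=False)
csv_lines = buffer.getvalue().splitlines()

# Read back: header=None says the data has no header row,
# names= supplies the column labels explicitly
df_back = pd.read_csv(io.StringIO(buffer.getvalue()),
                      header=None, names=['a', 'b'])
```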

## 2. Matplotlib

Only a small part of the original notebook was kept due to the time limit of this lab. The complete notebook can be found at [this link](https://github.com/jtao/tamids/blob/master/ecen489/intro_matplotlib/matplotlib.ipynb).

### Figure
* Figure is the object that keeps the whole image output. Adjustable parameters include:
  * Image size (set_size_inches())
  * Whether to use tight_layout (set_tight_layout())

### Axes
* Axes object represents the pair of axes (x-axis and y-axis) that contain a single plot. The Axes object also has more adjustable parameters:
  * The plot frame (set_frame_on(True) or set_frame_on(False))
  * X-axis and Y-axis limits (set_xlim() and set_ylim())
  * X-axis and Y-axis Labels (set_xlabel() and set_ylabel())
  * The plot title (set_title())
![Anatomy of a Figure](https://github.com/happidence1/AILabs/blob/master/images/matplotlib.svg?raw=1)
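The setters listed above can be tried on a single plot. This is a small sketch rather than part of the original notebook; the data and label strings are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 10)

fig, ax = plt.subplots()

# Figure-level setting: size of the whole image in inches
fig.set_size_inches(6, 4)

ax.plot(x, x ** 2)

# Axes-level settings: limits, labels, title and frame
ax.set_xlim(0, 5)
ax.set_ylim(0, 25)
ax.set_xlabel('x')
ax.set_ylabel('x squared')
ax.set_title('Figure and Axes setters')
ax.set_frame_on(False)  # hide the rectangular frame around the plot
```

Each setter has a matching getter (for example `ax.get_xlim()`), which is handy for checking what a plot is currently using.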

### 2.1 The matplotlib object-oriented API

The main idea of object-oriented programming is to have objects that one can apply functions and actions on, so that no object or program state needs to be global (as it is in the MATLAB-like API). The real advantage of this approach becomes apparent when more than one figure is created, or when a figure contains more than one subplot.

To use the object-oriented API, we do not rely on an implicit global figure instance. Instead, we store a reference to the newly created figure instance in the `fig` variable, and from it we create a new axes instance `axes` using the `add_axes` method of the `Figure` class instance `fig`:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 10)
y = x ** 2

fig = plt.figure()

axes = fig.add_axes([0, 0, 1, 1]) # left, bottom, width, height (range 0 to 1)

axes.plot(x, y, 'r')

axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_title('title');

If we don't care about being explicit about where our plot axes are placed in the figure canvas, then we can use one of the many axis layout managers in matplotlib.

In [None]:
fig, axes = plt.subplots()

axes.plot(x, y, 'r')
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_title('title');

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)

for ax in axes:
    ax.plot(x, y, 'r')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title('title')

### 2.2 Colormap and contour figures

Colormaps and contour figures are useful for plotting functions of two variables. In most of these functions we will use a colormap to encode one dimension of the data. There are a number of predefined colormaps. It is relatively straightforward to define custom colormaps. For a list of pre-defined colormaps, see: http://www.scipy.org/Cookbook/Matplotlib/Show_colormaps
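As a sketch of how a custom colormap can be defined, the example below builds a two-color map with `LinearSegmentedColormap.from_list`. The name `'blue_to_red'` and the color choice are arbitrary and only serve as an illustration.

```python
from matplotlib.colors import LinearSegmentedColormap

# Interpolate linearly from blue (low values) to red (high values)
two_color = LinearSegmentedColormap.from_list('blue_to_red', ['blue', 'red'])

# A colormap is callable: it maps numbers in [0, 1] to RGBA tuples
rgba_low = two_color(0.0)
rgba_high = two_color(1.0)
```

The resulting object can be passed as the `cmap` keyword argument to plotting functions such as `pcolor`, `imshow`, and `contour`.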

In [None]:
alpha = 0.7
phi_ext = 2 * np.pi * 0.5

def flux_qubit_potential(phi_m, phi_p):
    return 2 + alpha - 2 * np.cos(phi_p) * np.cos(phi_m) - alpha * np.cos(phi_ext - 2*phi_p)

In [None]:
phi_m = np.linspace(0, 2*np.pi, 100)
phi_p = np.linspace(0, 2*np.pi, 100)
X,Y = np.meshgrid(phi_p, phi_m)
Z = flux_qubit_potential(X, Y).T

#### pcolor

In [None]:
fig, ax = plt.subplots()

p = ax.pcolor(X/(2*np.pi), Y/(2*np.pi), Z, vmin=abs(Z).min(), vmax=abs(Z).max())
cb = fig.colorbar(p, ax=ax)

#### imshow

In [None]:
fig, ax = plt.subplots()

im = ax.imshow(Z,  vmin=abs(Z).min(), vmax=abs(Z).max(), extent=[0, 1, 0, 1])
im.set_interpolation('bilinear')

cb = fig.colorbar(im, ax=ax)

#### contour

In [None]:
fig, ax = plt.subplots()

cnt = ax.contour(Z, vmin=abs(Z).min(), vmax=abs(Z).max(), extent=[0, 1, 0, 1])

### 2.3 3D figures

To use 3D graphics in matplotlib, we first need to create an instance of the `Axes3D` class. 3D axes can be added to a matplotlib figure canvas in exactly the same way as 2D axes; or, more conveniently, by passing a `projection='3d'` keyword argument to the `add_axes` or `add_subplot` methods.

In [None]:
from mpl_toolkits.mplot3d.axes3d import Axes3D

#### Surface plots

In [None]:
fig = plt.figure(figsize=(14,6))

# `ax` is a 3D-aware axis instance because of the projection='3d' keyword argument to add_subplot
ax = fig.add_subplot(1, 2, 1, projection='3d')

p = ax.plot_surface(X, Y, Z, rstride=4, cstride=4, linewidth=0)

# surface plot with color grading and color bar (the colormap drives both)
ax = fig.add_subplot(1, 2, 2, projection='3d')
p = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='coolwarm', linewidth=0, antialiased=False)
cb = fig.colorbar(p, ax=ax, shrink=0.5)

#### Wire-frame plot

In [None]:
fig = plt.figure(figsize=(8,6))

ax = fig.add_subplot(1, 1, 1, projection='3d')

p = ax.plot_wireframe(X, Y, Z, rstride=4, cstride=4)

#### Contour plots with projections

In [None]:
fig = plt.figure(figsize=(8,6))

ax = fig.add_subplot(1,1,1, projection='3d')

ax.plot_surface(X, Y, Z, rstride=4, cstride=4, alpha=0.25)
cset = ax.contour(X, Y, Z, zdir='z', offset=-np.pi)
cset = ax.contour(X, Y, Z, zdir='x', offset=-np.pi)
cset = ax.contour(X, Y, Z, zdir='y', offset=3*np.pi)

ax.set_xlim3d(-np.pi, 2*np.pi);
ax.set_ylim3d(0, 3*np.pi);
ax.set_zlim3d(-np.pi, 2*np.pi);

## 3. Case Study

#### Example Data


File name: king_county_house_data.csv

If it doesn't work, replace the file name with the path to the file. 

In [None]:
df = pd.read_csv("king_county_house_data.csv")

#### Check the data

Just some normal checks to see what kind of data we have. 

You aren't ready to plot until you know what you're plotting. 

##### Check 1
The first few lines

In [None]:
df.head(5)

##### Check 2
How many are there?


In [None]:
df.shape

##### Check 3
What kind is it?

In [None]:
df.info()

##### Check 4
What are the typical values?

In [None]:
df.describe().T

#### Pandas Plotting

Pandas can conveniently plot the contents of a dataframe.

It uses **matplotlib** behind-the-scenes, so it should look familiar. 



##### Example 1
Plot everything!

In [None]:
ax1=df.plot();

##### Example 2
Plot a specific pair of columns. 

Here we are using the keyword arguments `x` and `y` which refer simply to the column label. 

The `kind` keyword argument corresponds to which `matplotlib.pyplot.`*function*`()` to use. The default is line plots. 

In [None]:
ax1=df.plot(kind='scatter', 
            title="Price of Living Area", 
            x="sqft_living", 
            y="price", 
            );

##### Exercise 1
Pick two other columns and plot them. 

In [None]:
#write your code here



#### Pandas Histogram
A histogram is a count of how many values fall within a given range for a given series. 

Pandas can conveniently make a histogram of any column in the DataFrame. 
```
df.hist()
```
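To make the definition concrete, the counts behind a histogram can be computed directly with `np.histogram`. The values below are made up for illustration, and three bins are used only to keep the result easy to read by eye.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 8, 9])

# Split the range [1, 9] into 3 equal-width bins and count the values in each
counts, bin_edges = np.histogram(values, bins=3)

# counts holds one number per bin; bin_edges holds the 4 boundaries
# that delimit the 3 bins
```

Six of the values fall into the first bin, none into the middle bin, and two into the last one. A histogram plot draws one bar per bin with these counts as heights.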

##### Example 3
By default, it plots all of them.

Here we are using the keyword argument `figsize` to make sure the plots are big enough to read (because there are so many). 

In [None]:
df.hist(figsize=(20,20));

##### Exercise 2
Make a histogram of just one column. 

Hint: provide the name of the column as a regular argument.

In [None]:
# write your code here



#### Pandas correlation

Pandas can compute the pairwise correlation of all the columns. 

```
new_df = df.corr(numeric_only=True)
```
This is technically another DataFrame.
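A tiny sketch on made-up data shows what the result looks like: a square DataFrame whose rows and columns are both labeled with the original column names. The frame `df_demo` and its values are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1, 2, 3, 4],
                        'y': [2, 4, 6, 8],   # rises with x
                        'z': [4, 3, 2, 1]})  # falls as x rises

corr = df_demo.corr()

xy = corr.loc['x', 'y']  # perfectly correlated columns give +1
xz = corr.loc['x', 'z']  # perfectly anti-correlated columns give -1
```

Since the result is a regular DataFrame, it can be indexed, sorted, and styled like any other.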

##### Example 4

In [None]:
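# numeric_only=True (pandas 1.5+) skips non-numeric columns, which pandas 2.x would otherwise reject
df.corr(numeric_only=True)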

##### Example 5
It is helpful if we style the correlation dataframe with colors indicating the strength of the correlation. 



OK, OK - It's technically not a plot; but it is still a kind of graphic. 

In [None]:
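# numeric_only=True (pandas 1.5+) skips non-numeric columns
corr_df = df.corr(numeric_only=True)
# Styler.set_precision was removed in pandas 2.0; format(precision=2) replaces it
stylish_corr_df = corr_df.style.background_gradient(cmap='coolwarm').format(precision=2)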
stylish_corr_df

Notice any fun facts? 