# Python for Data Scientists

Presented by

<b><font color = "Maroon" size="+1">High Performance Research Computing <br> Texas A&M University</font></b>


# Matplotlib Module

The `matplotlib` module provides functions for making graphs and drawings.

The `matplotlib.pyplot` submodule supports common types of graphs. But we don't want to type it out every time we use it. Instead, we nickname it using the `as` keyword.

This lesson will also use some `numpy` functions.

Reminder: You should import a module *once*, usually at the beginning of the notebook.

Execute this cell to bring in matplotlib and all of its functions.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy

Execute the cell below to double-check that `plt` refers to the module we want:

In [None]:
plt.__name__

# Line Plots
Learn how to use the Python Matplotlib library for data visualization.

## The Plot function

The `plt.plot` function (which, recall, is now shorthand for the `matplotlib.pyplot.plot()` function) takes data as its argument and creates an image. The image is a *line plot* of the data.

You may provide one or two data objects as regular arguments.
```
plt.plot(y)
plt.plot(x, y)
```
If you provide two data objects, they must be the same length.
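As a minimal sketch of the two call forms (the sample values are made up for illustration): the one-argument form plots each value against its index, and two data objects of different lengths raise a `ValueError`.

```python
import matplotlib.pyplot as plt
import numpy

y = [4, 9, 2, 7]                  # made-up sample values
x = numpy.arange(len(y))          # 0, 1, 2, 3

(line_a,) = plt.plot(y)           # x defaults to the index of each value
(line_b,) = plt.plot(x, y)        # same curve, with x given explicitly
same_x = numpy.array_equal(line_a.get_xdata(), line_b.get_xdata())
print("same x coordinates:", same_x)

# Two data objects of different lengths are rejected.
length_error = None
try:
    plt.plot([0, 1, 2], [5, 6])
except ValueError as err:
    length_error = err
    print("plot refused:", err)
```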

To display the plot in the output of the cell, call:
```
plt.show()
```

(In many IDEs, this is the default behavior.)

<b><font size="+1">Example: Basic plotting</font></b>

This example uses some random-looking data stored in a `list`.

Execute the cell to see what happens.

In [None]:
random_data=[1,15,-2,6,11,12,12]
plt.plot(random_data)
plt.show()

Since the example did not specify both x and y coordinates, matplotlib used:
* the index as the x coordinate
* the value as the y coordinate.

If the last statement in a cell is an *expression* such as `plt.plot(random_data)`, Colab does its usual thing and shows you what the expression returned (here, a list of *lines* objects).

If you don't want to see that, end the cell with another statement such as `plt.show()` or `print()`.
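The object that `plt.plot()` returns can also be inspected directly. A short sketch, reusing the list from the example above and storing the return value in a variable (the name `lines` is just for illustration):

```python
import matplotlib.pyplot as plt

# Assigning the result to a name means the cell no longer ends in a bare expression.
lines = plt.plot([1, 15, -2, 6, 11, 12, 12])

print(type(lines).__name__, "of length", len(lines))  # the container plot() returns
print(type(lines[0]).__name__)                        # the object for the single curve
plt.show()
```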

<b><font color = "Crimson" size="+1">Exercise: Quadratic plot</font></b>

Plot a quadratic equation
$$y=ax^2+bx+c$$

Steps:

1. Create a numpy array `x` from an arithmetic sequence. E.g.:
```
x = numpy.linspace(0, 5, 10)
```
2. Create a second array `y` by doing a power-of-2 operation with the first array. E.g.:
```
y = x ** 2
```

3. Use the `plt.plot()` function with two regular arguments, providing `x` and `y` as the data.



In [None]:
#your code here

## Fancy Figures

Matplotlib has many controls for customizing figures.

## Plot Keyword Arguments

The `plot()` function has keyword arguments to control the appearance of the lines.  Here are a few:

* `color`
* `linestyle`
* `linewidth`
* `marker`
* `markersize`
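A minimal sketch that sets all five of these keyword arguments at once. The particular values (purple, dashed, circles) are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy

x = numpy.linspace(0, 5, 10)
y = x ** 2

# plot() returns a list with one line object per curve; unpack the single entry.
(line,) = plt.plot(
    x, y,
    color='purple',     # line and marker color
    linestyle='--',     # dashed line
    linewidth=2,        # line thickness in points
    marker='o',         # a circle at each data point
    markersize=8,       # marker size in points
)
plt.show()
```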

## Figure Functions

`plt` provides other functions that set other elements of the figure. Here are a few:

* `title()`
* `grid()`
* `xlabel()`
* `xlim()`
* `xscale()`
* `xticks()`
* `ylabel()`
* `ylim()`
* `yscale()`
* `yticks()`
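A short sketch of a few of these functions working together, with limits and tick positions picked arbitrarily for illustration. The call to `plt.gca()`, which returns the current axes, is only there so the settings can be inspected afterwards.

```python
import matplotlib.pyplot as plt
import numpy

x = numpy.linspace(1, 5, 10)
y = x ** 2

plt.plot(x, y)
plt.grid(True)               # draw grid lines
plt.xlim(0, 6)               # show x from 0 to 6
plt.xticks([0, 2, 4, 6])     # put tick marks only at these x values
plt.yscale('log')            # logarithmic y axis (all y values here are positive)

ax = plt.gca()               # current axes, kept for inspection
plt.show()
```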



<b><font size="+1">Example: Plot customization</font></b>

Read and execute the cell to set up some common figure parameters.

This plot is the example data from the solution to Exercise 1.

In [None]:
x = numpy.linspace(0, 5, 10)
y = x ** 2
plt.plot(x, y, color='red', linestyle='solid', linewidth=4)
plt.title("area of a square")
plt.xlabel("side length")
plt.ylabel("area")
plt.show()

<b><font color = "Crimson" size="+1">Exercise: Plot customization</font></b>

Modify your figure from Exercise 1.

Add the following elements:
* title
* xlabel
* ylabel
* grid

Modify your line by setting the following parameters:
* color `'green'`
* linestyle  `':'`
* marker  `'^'`

In [None]:
#your code here

## Multiple Curves

Execute the `plot()` function repeatedly within a single cell to put multiple curves into the same image.

It is best if they share a common $x$ data set.
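With more than one curve in an image, a legend helps tell them apart. The `label` keyword argument and the `plt.legend()` function are not covered elsewhere in this lesson, so treat this as a small optional sketch:

```python
import matplotlib.pyplot as plt
import numpy

x = numpy.linspace(0, 5, 10)                           # shared x data
plt.plot(x, x ** 2, color='red', label='x squared')
plt.plot(x, x ** 3, color='blue', label='x cubed')

legend = plt.legend()                                  # builds the legend from the labels
plt.show()
```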

<b><font size="+1">Example: Two quadratics</font></b>

In [None]:
x = numpy.linspace(0, 5, 10)
y1 = x ** 2
y2 = 2 * x ** 2
plt.plot(x, y1, color='red')
plt.plot(x, y2, color='blue')
plt.show()

<b><font color = "Crimson" size="+1">Exercise: Trig plots</font></b>

Plot the sine and cosine functions over the interval $[0, 2\pi]$ in the same image.

Make sure the two functions are different colors and/or styles.

Give your figure a title.

Hints:
* `numpy.linspace()` and `numpy.pi` to make the interval

* `numpy.sin()` and `numpy.cos()` to make the functions

Bonus:
* You can use $\LaTeX$ notation to make the title fancy.
* Example: `r'$y=\cos \pi$'` yields $y=\cos\pi$ (the `r` prefix makes a raw string, so Python leaves the backslashes alone)
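A minimal sketch of a $\LaTeX$-style title on a different curve, so the exercise is left for you. It uses a raw string (the `r` prefix) so the backslash in `\pi` reaches matplotlib unchanged.

```python
import matplotlib.pyplot as plt
import numpy

x = numpy.linspace(0, 5, 10)
plt.plot(x, numpy.pi * x ** 2)

plt.title(r'$y = \pi x^2$')          # text between the $ signs is rendered as math
title_text = plt.gca().get_title()   # the title string stored on the current axes
plt.show()
```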

In [None]:
#your code here

# Scatter Plot
Learn how to use the Python Matplotlib library for data visualization.

## The Scatter function

The `plt.scatter()` function takes data as its argument and creates an image. The image is a *scatter plot* of the data.

You must provide two data objects as regular arguments. They must be the same length.
```
plt.scatter(x, y)
```


<b><font size="+1">Example: Scatterplot</font></b>

Don't forget that we are using the nickname `plt` in this notebook.

In [None]:
data1=[ 1,15, 2, 6,11,12,12]
data2=[-6, 2, 4, 8, 0,14, 1]
plt.scatter(data1, data2)
plt.show()

<b><font color = "Crimson" size="+1">Exercise: Scatterplot</font></b>

Create a sequence of numbers with at least 20 elements.

Create a scatter plot that is the sine of your sequence vs the cosine of your sequence.

* Remember to use the nickname `plt`.

Use the figure function `axis('square')` to make this figure look pretty.

In [None]:
#your code here
plt.scatter()
plt.axis('square')
plt.show()

## Fancy Figures

Matplotlib has many controls for customizing figures.

## Scatter Keyword Arguments

The `scatter()` function has keyword arguments to control the appearance of the markers.  Here are a few:

* `color`
* `marker`
* `c` the color data for each marker
* `s` the size data for each marker
* `cmap` the rule for translating color data into colors
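A minimal sketch combining `c`, `s`, `cmap`, and `marker`. The data values are made up, and `'viridis'` is just one of matplotlib's built-in colormap names.

```python
import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 5]            # made-up sample data
y_values = [2, 3, 5, 4, 6]
shades = [10, 20, 30, 40, 50]         # one color value per marker
sizes = [20, 40, 80, 160, 320]        # one size per marker

points = plt.scatter(
    x_values, y_values,
    c=shades,          # color data, translated into colors by cmap
    s=sizes,           # marker sizes
    cmap='viridis',    # rule for turning color data into colors
    marker='s',        # square markers
)
plt.show()
```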


## Figure Functions

`plt` provides other functions that set other elements of the figure. Here are a few:

* `title()`
* `grid()`
* `xlabel()`
* `xlim()`
* `xscale()`
* `xticks()`
* `ylabel()`
* `ylim()`
* `yscale()`
* `yticks()`

<b><font size="+1">Example: A colorful bubble chart</font></b>

Same `x` and `y` data as the previous example.

Now with marker color and size data.

Execute the cell to see what happens.


In [None]:
data1=[ 1,15, 2, 6,11,12,12]
data2=[-6, 2, 4, 8, 0,14, 1]
data3=[ 2, 3, 4, 5, 2, 3, 6]
data4=[30,40,60,40,10,50,90]
plt.scatter(data1, data2, c=data3, s=data4)
plt.show()

<b><font color = "Crimson" size="+1">Exercise: A Colorful Quadratic</font></b>

Plot a quadratic equation as a scatter plot.

$$y=ax^2+bx+c$$

Choose the range of $x$ and choose constants $a$, $b$, and $c$ as necessary so that the largest $y$ values are at least 200.

Use the `y` values as the marker sizes (keyword argument `s`).

Use the `x` values as the marker colors (keyword argument `c`).

Select a value for the keyword argument `cmap`. Try different values. Find one that you like.

* Hint: the `plt.colormaps()` function returns a list of valid colormap names.

Give your figure a title.

In [None]:
plt.colormaps()

In [None]:
#your code here

## Regression

Finding the curve that best fits some data is called regression.

We use the NumPy function `polyfit()` to find regressions. It fits polynomials to data.

```
coefficients = numpy.polyfit(xdata, ydata, degree)
```
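As a quick sketch of what `polyfit()` returns, the data below is generated from a known quadratic (an arbitrary choice for illustration) and then fit with degree 2. The coefficients come back with the highest power first. `numpy.polyval()` evaluates a polynomial from such a coefficient array.

```python
import numpy

# Exact data from y = 3x^2 - 2x + 1 (coefficients chosen arbitrarily).
xdata = numpy.linspace(-3, 3, 13)
ydata = 3 * xdata ** 2 - 2 * xdata + 1

# Degree-2 fit; coefficients are ordered from the highest power down.
coefficients = numpy.polyfit(xdata, ydata, 2)
print(coefficients)        # approximately 3, -2, 1

# Evaluate the fitted polynomial at the same x values.
fitted = numpy.polyval(coefficients, xdata)
```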
### Linear Regression
Finding the *line* of best fit through some data is called linear regression.

A line is a polynomial of degree 1 and its coefficients are $m$ and $b$.

```
m, b = numpy.polyfit(xdata, ydata, 1)
```

Then you can simply compute points on the line using the equation.
$$y=mx+b$$

* Tip: you can re-use `xdata` as the values of `x` but the line may look better if you sort the values first.  

<b><font size="+1">Example: A scatter plot with a linear regression.</font></b>

In [None]:
data1=[ 1,15, 2, 6,11,12,12]
data2=[-6, 2, 4, 8, 0,14, 1]

m, b = numpy.polyfit(data1,data2,1)
x=numpy.sort(data1) # prefer x to be a sorted array (not an unordered list)
y=m*x+b

plt.plot(x,y) #plot the line
plt.scatter(data1,data2) #plot the data
plt.show()

<b><font color = "Crimson" size="+1">Exercise: Create a noisy line.</font></b>

Steps:

1. Create a number array `x` with sequential values between 0 and 10.

* Hint: Use the NumPy function `linspace()`

2. Create a number array `y` by computing $y=mx+b$ for some values (you choose) of $m$ and $b$.

3. Create two random arrays with elements between 0 and 1, the same length as the array `x`.

* Hint: Use a NumPy random function, like this:
```
random_array = numpy.random.rand(length)
```
4. Add the random arrays to `x` and `y` to make two data arrays `xdata` and `ydata`.

Do a linear regression of your noisy line. Are the best fit values of $m$ and $b$ close to the ones you actually used?

Plot the best fit line (you can re-use array `x`) and scatter plot the noisy data.


In [None]:
#your code here

In [None]:
#@title Double-click to see solution

#true line
x=numpy.linspace(0,10,21)
y=2*x+2

#noisy line
r1=numpy.random.rand(21)
r2=numpy.random.rand(21)
xdata=x+r1
ydata=y+r2

#linear regression
m, b = numpy.polyfit(xdata,ydata,1)
y=m*x+b
print("I hope ",m,"and",b,"are close to 2 and 2. \n")

#plot both the line and the data
plt.plot(x,y)
plt.scatter(xdata,ydata)
plt.show()
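
One thing to watch for when comparing the fitted values with the true ones: `numpy.random.rand()` returns values between 0 and 1, which average about 0.5 rather than 0. The noise therefore shifts the data as well as scattering it. Here $ydata = 2\,xdata + 2 + (r_2 - 2r_1)$, and $r_2 - 2r_1$ averages about $-0.5$, so the fitted intercept tends to land nearer 1.5 than 2. A possible variation, not part of the original exercise, subtracts 0.5 from each random array so the noise is centered on zero. The fixed seed is an added assumption so the sketch is repeatable.

```python
import numpy

numpy.random.seed(0)                        # fixed seed (assumption, for repeatability)

# true line
x = numpy.linspace(0, 10, 21)
y = 2 * x + 2

# noisy line, with the noise centered on zero (values between -0.5 and 0.5)
xdata = x + (numpy.random.rand(21) - 0.5)
ydata = y + (numpy.random.rand(21) - 0.5)

# linear regression
m, b = numpy.polyfit(xdata, ydata, 1)
print("slope:", m, "intercept:", b)
```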