# Python for Data Scientists

Presented by

<b><font color = "Maroon" size="+1">High Performance Research Computing <br> Texas A&M University</font></b>


# NumPy and Pandas

In Python, it is common to share code with the community. Two popular libraries for scientific programming are

* Numeric Python (NumPy)
* Pandas

A library of Python code intended for sharing is called a **module**.

## Modules
We can get access to code other people have written using the `import` statement.

```
import <module>
```

After a module is imported, you don't have to import it again until your runtime gets disconnected or restarted.

After you import a module, you can use `print`, `help`, and `dir` to learn about it.

## Numeric Python

One of the most commonly used add-ons in Python is **Numeric Python**, or NumPy for short (the "py" is pronounced as in "Python").

NumPy provides many useful mathematical features including new data types, functions, and numeric values.



<b><font size="+1">Example: Numpy Module</font></b>

Execute the cells below to try it out.

In [None]:
import numpy

In [None]:
print(numpy)

In [None]:
dir(numpy)

In [None]:
help(numpy)

## Module Features

Access a module's features using a dot:
```
module.feature
```

<b><font size="+1">Example: Accessing numpy components</font></b>

NumPy provides certain mathematical constants. Execute the cell to access one.

In [None]:
print(numpy.pi)

For comparison, execute the cell below to see if plain Python knows what pi is.

In [None]:
print(pi)

<b><font color = "Crimson" size="+1">Exercise: Accessing numpy components</font></b>

Find NumPy's `log10()` function and read about it. Use it to compute the log base 10 of a large number.

In [None]:
#your code here

<b><font color = "Crimson" size="+1">Exercise: Accessing (more) numpy components</font></b>

Find NumPy's `sin()` function and NumPy's `pi` value and compute

 $\sin(\pi/2)$.

In [None]:
#your code here

## Datetime


`numpy.datetime64` and `numpy.timedelta64` data types can store date and time values in a consistent way, and handle the conversions and comparisons between different representations automatically.

```
date     = numpy.datetime64( text )
interval = numpy.timedelta64( duration, unit )
```
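
As a minimal sketch of what "automatically" means here, the cell below compares values that are stored with different units. The variable names are illustrative and not part of the original notebook.

```python
import numpy

# The same instant written at day precision and at second precision
date_only = numpy.datetime64('2021-09-10')
date_time = numpy.datetime64('2021-09-10T00:00:00')
same_instant = bool(date_only == date_time)
print(same_instant)

# One day and 24 hours describe the same interval
one_day = numpy.timedelta64(1, 'D')
same_interval = bool(one_day == numpy.timedelta64(24, 'h'))
print(same_interval)

# Dividing two intervals gives a plain number: the hours in one day
hours_per_day = one_day / numpy.timedelta64(1, 'h')
print(hours_per_day)
```

Both comparisons succeed even though the units differ, because NumPy converts the values to a common unit before comparing.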


<b><font size="+1">Example: Dates </font></b>

Subtract two dates to make a time interval.  

Execute the cell below to try it out.

In [None]:
day0 = numpy.datetime64('1970-01-01T00:00:00')
day1 = numpy.datetime64('2021-09-10T11:30:00')
interval = day1 - day0
print(type(interval), interval)

<b><font size="+1">Example: Interval</font></b>

Add a time interval to a date to make another date.

Execute the cell below to try it out.

In [None]:
three_years = numpy.timedelta64(365*3,'D')
day2 = day1 + three_years
print(type(day2), day2)

Notice how the "day of month" value decreased from 10 to 9. This is because a leap year occurred in that interval.
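
One way to check the leap-year explanation is to count the days between the two matching calendar dates directly. This is only a sketch; the variable names are made up for illustration.

```python
import numpy

start = numpy.datetime64('2021-09-10')
same_day_three_years_later = numpy.datetime64('2024-09-10')

# Subtracting the dates counts the actual days, including 29 February 2024
actual_days = same_day_three_years_later - start
print(actual_days)

# 365*3 days falls short of that count by the one leap day
shortfall = actual_days - numpy.timedelta64(365 * 3, 'D')
print(shortfall)
```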

<b><font color = "Crimson" size="+1">Exercise: Date and Interval</font></b>



> I am so bored; I feel like I've been stuck in this chair for a million hours.
>
>    -- some student probably

What will the date and time be in 1 million hours from now?

 * Tip: try using `numpy.datetime64('now')`
 * Tip: the abbreviation for hour is  `'h'`


In [None]:
#your code here

## Submodules

NumPy groups its related features into submodules.

Examples:

* `numpy.random` provides
  * `numpy.random.randint()`, a function that returns a random integer each time it is called
  * `numpy.random.default_rng()`, a function that creates a random number `Generator` that can be used repeatedly

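In general, to use a submodule you import that submodule. (`import numpy` already makes `numpy.random` available, so the explicit import below is harmless but not strictly required.)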

In [None]:
import numpy.random
dir(numpy.random)
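
The `Generator` approach can be sketched as follows. The seed value and the variable names are arbitrary choices for illustration; any integer seed would do.

```python
import numpy

# Seed chosen arbitrarily so the results repeat from run to run
rng = numpy.random.default_rng(seed=2021)

# Five random integers from 1 to 6 (the upper bound, 7, is excluded)
rolls = rng.integers(1, 7, size=5)
print(rolls)

# Three random floats in the interval [0, 1)
fractions = rng.random(3)
print(fractions)
```

The same `rng` object is reused for both calls, which is what "used repeatedly" refers to.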

<b><font color = "Crimson" size="+1">Exercise: NumPy Random</font></b>


Use the `help()` function to learn about the `numpy.random.randint` function.

Use the `numpy.random.randint()` to create a random integer.

In [None]:
#your code here

## Pandas

The `pandas` module provides functions for handling realistic data.

Reminder: You should import a module *once*, usually at the beginning of the notebook.

Execute this cell to bring in `pandas` and all of its functions.

We will also be using some NumPy functions.

In [None]:
import pandas

## Pandas Series Class

Pandas provides a class named `Series` which is a 1-dimensional *array* data structure.

Note that the first letter is Capitalized.

In [None]:
print(pandas.Series)

`<class 'pandas.core.series.Series'>`

## NumPy Array Class

NumPy provides a class named `ndarray` which is also an *array* data structure.

In [None]:
print(numpy.ndarray)

`<class 'numpy.ndarray'>`

# Arrays
A Pandas `Series` is largely compatible with a NumPy `ndarray`: most NumPy functions accept either.

The term **array** can refer to either.

We will be using some NumPy functions alongside Pandas `Series`.  
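
A small sketch of that interplay, using made-up values: a NumPy function is applied to a `Series`, and the result is then converted to a plain NumPy array with the `Series.to_numpy()` method.

```python
import numpy
import pandas

areas = pandas.Series([1.0, 4.0, 9.0])

# A NumPy math function applied to a Series returns a Series
sides = numpy.sqrt(areas)
print(type(sides))

# to_numpy() gives the values as a NumPy array
sides_array = sides.to_numpy()
print(type(sides_array))
```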

## Arrays are like Lists

Reminder: this one-dimensional List has length N. The last index is N-1.

Index | Value
---|---
0|a
1|b
2|c
...| ...
N-1 | the last value

Series objects store the index data and the value data separately.
```
Series.index
Series.values
```
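
Because the index is stored separately, it does not have to be 0, 1, 2, and so on. The sketch below assumes some made-up labels to show the two parts side by side.

```python
import pandas

# The labels 'x', 'y', 'z' are invented for this illustration
labelled = pandas.Series([7.0, 8.0, 9.0], index=['x', 'y', 'z'])

print(list(labelled.index))   # the labels
print(list(labelled.values))  # the data
print(labelled['y'])          # look up one value by its label
```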

Unlike a List, <b>all the values of an array must have the *same data type*.</b>

Arrays have a `dtype` that represents the common data type of all the elements.

```
Series.dtype
ndarray.dtype
```
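
To see what a common data type means in practice, the sketch below builds arrays from mixed inputs and prints the resulting `dtype`. The example values are arbitrary.

```python
import numpy
import pandas

# Integers and a float together: every element is stored as a float
mixed_numbers = numpy.asarray([1, 2, 3.5])
print(mixed_numbers.dtype)

# Numbers and text together: pandas falls back to the general "object" dtype
mixed_series = pandas.Series([1, 'two', 3.0])
print(mixed_series.dtype)
```

In both cases a single `dtype` is chosen that can hold every element.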



## Creating Arrays

### Specify Data
The `Series` constructor function creates arrays from other data structures, such as lists:

```
pandas.Series(data)
```

<b><font size="+1">Example: Series from List</font></b>

Execute the cells to see what happens.

In [None]:
# create a Series
example_series = pandas.Series([7.0, 8.0, 9.0])
# print the Series
print(example_series)

<b><font size="+1">Example: Inspect a Series</font></b>

Execute the cells to see what happens.

In [None]:
# print the type
print("example_series is a", type(example_series) )

In [None]:
# print the dtype
print("example_series.dtype is", example_series.dtype )

In [None]:
print("example_series.values")
print(example_series.values )

In [None]:
print("example_series.index")
print(example_series.index  )

<b><font color = "Crimson" size="+1">Exercise: Series from List</font></b>

Use `pandas.Series()` to build an array from a List of one-character strings, e.g. `['a','b','c']`.

In [None]:
#your code here

### Specify a Sequence

Some functions create an array by stepping through a numerical sequence:

```
numpy.linspace()
numpy.arange()
```



<b><font size="+1">Example: arrays from sequences</font></b>

The NumPy function `numpy.linspace()` creates a one-dimensional array of values evenly spaced within a given numerical range.

*   `numpy.linspace()` needs three inputs:


1.   A starting value
2.   An ending value
3.   Number of values

```
sequential_array = numpy.linspace(start, stop, count)
```
Execute the cells.

Try replacing the values.


In [None]:
sequential_array = numpy.linspace(2.0, 3.0, 11)
print(sequential_array)
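
For a closer look at how `numpy.linspace()` spaces its values, the sketch below uses the optional `retstep=True` argument, which also returns the spacing. The range and count here are arbitrary.

```python
import numpy

# retstep=True also returns the spacing between neighbouring values
values, spacing = numpy.linspace(0.0, 1.0, 5, retstep=True)
print(values)
print(spacing)

# With both ends included, the spacing is (stop - start) / (count - 1)
expected_spacing = (1.0 - 0.0) / (5 - 1)
print(spacing == expected_spacing)
```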

`numpy.linspace()` returns a NumPy array.

To work with it as a Pandas Series, convert it using the `Series` function.

In [None]:
sequential_series = pandas.Series(sequential_array)
print(sequential_series)

<b><font color = "Crimson" size="+1">Exercise: arrays from sequences</font></b>

The NumPy function `numpy.arange()` also creates a one-dimensional array of values evenly spaced within a given numerical range.

*   `numpy.arange()` needs three inputs:


1.   A starting value
2.   An ending value
3.   A step size

```
sequential_array = numpy.arange(start, stop, step)
```

Try to create the same array values as the linspace example (above).

Convert the array to a pandas Series using the `Series()` function.

How is working with `arange` different from working with `linspace`?



In [None]:
#your code here

<b><font color = "Crimson" size="+1">Exercise: arrays of Dates</font></b>

Use `numpy.arange()` and `numpy.datetime64()` together to create an array of dates.


### Random Arrays

From the `numpy.random` submodule, the `numpy.random.rand()` function creates an array of random numbers.

The argument is how many random numbers to create.

<b><font color = "Crimson" size="+1">Exercise: Array from Random</font></b>

Use the `numpy.random.rand()` function to create an array of random numbers.

Convert to a pandas Series.

In [None]:
# your code here

## Array Operations

NumPy and Pandas arrays support a variety of **operations**.

Reminder: these are some operators we know:

```
+ - * / ** // %
== != >= <= > <
```

The operations here are understood to be applied *to each element* rather than to the array object itself.

I.e. if ★ is an operation:

<table>
<tr>
<td>

this:

```
★ array([
        item,
        item,
        ...
        ])
```

</td>
<td>

means:

```
array([
      ★ item,
      ★ item,
      ...
      ])
```

</td>
</tr>
</table>
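
The difference from a plain List can be sketched with the `*` operator. The names below are illustrative.

```python
import numpy

plain_list = [1, 2, 3]
array = numpy.asarray(plain_list)

# For a List, * repeats the whole list
repeated = plain_list * 2
print(repeated)

# For an array, * multiplies each element
doubled = array * 2
print(doubled)
```

The List treats `* 2` as an operation on the container, while the array applies it to every item.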

<b><font size="+1">Example: Array arithmetic</font></b>

Try some arithmetic operations `+ - * / // ** %`. Fill in what's missing (★) and execute the cells below.

In [None]:
A=numpy.asarray([1,2,3,4,5,6,7,8,9,10])
print(A★2) #your code here; try different math operations!

In [None]:
B=pandas.Series([1,2,3,4,5,6,7,8,9,10])
C=pandas.Series([2,3,4,0,1,2,3,4,0,1])
print(B★C) #your code here; try different math operations!

<b><font size="+1">Example: Array comparison</font></b>


Try some comparison operations `== != >= <= > <`. Fill in what's missing (★) and execute the cells below.

In [None]:
A=numpy.asarray([1,2,3,4,5,6,7,8,9,10])
print(A★5) #your code here; try different comparison operations!

In [None]:
B=pandas.Series([1,2,3,4,5,6,7,8,9,10])
C=pandas.Series([4,5,6,0,1,2,3,8,9,10])
print(B★C) #your code here; try different comparison operations!

<b><font color = "Crimson" size="+1">Exercise: Divisibility</font></b>


Write a program that tests for divisibility between two arrays of numbers with the same shape. Example arrays are provided.

Reminder: The test for divisibility is `a % b == 0`.

Result should be an array of Booleans.

In [None]:
a=pandas.Series([12,13,14,15,16,17,18,19,20,21])
b=pandas.Series([1,2,3,4,5,1,2,3,4,5])
#your code here

<b><font color = "Crimson" size="+1">Exercise: Date Arithmetic</font></b>

Write a program that converts an array of `datetime64` values into an array of `timedelta64` values by subtracting a common start date.

$$\Delta t=t_f-t_i$$

Print the resulting array and its type.

Example array provided.

In [None]:
end_dates =numpy.arange(numpy.datetime64('2021-09-17'), numpy.datetime64('2021-09-24'))
start_date=numpy.datetime64() #your value here
#your code here
print("times", )
print("type", .dtype)

## Array Functions
Many of NumPy's functions behave the same way: they are applied to each element.

I.e. if `f` is a function:

<table>
<tr>
<td>

this:

```
numpy.f(array([
              item,
              item,
              ...
              ]))
```

</td>
<td>

means:

```
array([
      numpy.f(item),
      numpy.f(item),
      ...
      ])
```

</td>
</tr>
</table>
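
For contrast, here is a sketch comparing a function from Python's standard `math` module, which works on one number at a time, with its NumPy counterpart. The variable names are illustrative.

```python
import math
import numpy

squares = numpy.asarray([1.0, 4.0, 9.0])

# math.sqrt() expects a single number, so a whole array is rejected
try:
    math.sqrt(squares)
    math_accepts_arrays = True
except TypeError as error:
    math_accepts_arrays = False
    print("math.sqrt failed:", error)

# numpy.sqrt() is applied to each element
roots = numpy.sqrt(squares)
print(roots)
```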

<b><font size="+1">Example: Array Functions</font></b>

The function `numpy.linspace()` creates an array of angles.

The function `numpy.sin()` creates a sinusoidal array from those angles:

$$y=\sin x$$

The function `numpy.printoptions()` makes the output pretty.

Execute the cell to see what happens.


In [None]:
x=numpy.linspace(0, numpy.pi, 7)

print("0≤x≤π")
with numpy.printoptions(precision=3, suppress=True):
  print("x", x)

y=numpy.sin(x) #this line executes the array function

print("y=sin(x)")
with numpy.printoptions(precision=3, suppress=True):
  print("y", y)

<b><font color = "Crimson" size="+1">Exercise: Array Functions</font></b>

Create a `Series` of numbers from a List or a sequence.

Use the function `numpy.exp()` as in

$$y=e^x$$

to make an array of large numbers.

If the result is a NumPy array, convert it back to a `Series`.


In [None]:
#your code here

## Optional Content

### More Array Functions

<b><font color = "Crimson" size="+1">Exercise: numpy and datetime</font></b>

NumPy has functions for working with `numpy.datetime64` data.

Use the `numpy.is_busday()` function to create a *mask* of booleans from an array of `datetime64`. Example dates provided.

Use the mask to select the business days from the dates, then print them.
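
A mask is an array of Booleans used to pick out elements of another array of the same length. The sketch below shows the idea with plain integers rather than dates; the names and values are illustrative only.

```python
import numpy

scores = numpy.asarray([3, 8, 5, 10, 1])

# The comparison produces one Boolean per element: this is the mask
mask = scores >= 5
print(mask)

# Indexing with the mask keeps only the elements where the mask is True
high_scores = scores[mask]
print(high_scores)
```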

In [None]:
dates = numpy.arange(numpy.datetime64('2021-09-17'), numpy.datetime64('2021-09-30'))
print("dates", dates)
#your code here
print("busdays",)

### Array Methods (Optional)



An array object itself provides some functions, called methods, for interacting with its data.

Warning! There are two very different kinds of methods!

**1. Methods that Return**
Some methods return a new array:

```
new_array = old_array.function()
```

These methods can be chained together (only the ones that **return** an array).

```
array.function().function().function()...
```
**2. In-place Methods**
Some methods alter an array in-place. They return `None` instead of a new array, and the same array variable holds the changes.

```
array.function()
```

You cannot chain these together.
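
The sketch below shows why the distinction matters, using a small made-up NumPy array. `flatten()` and `cumsum()` are return methods; `sort()` is an in-place method.

```python
import numpy

numbers = numpy.asarray([[3, 1], [2, 0]])

# flatten() and cumsum() each return a new array, so they chain
running_total = numbers.flatten().cumsum()
print(running_total)

# sort() works in place and returns None, so there is nothing to chain onto
flat = numbers.flatten()
result = flat.sort()
print(result)
print(flat)
```

Assigning the result of an in-place method to a variable is a common mistake: the variable ends up holding `None`, while the sorted data lives in the original array.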

<b><font size="+1">Example: array methods</font></b>

`array.flatten()` is a return method.

`array.sort()` is an in-place method.

Execute the cell to see what happens.

In [None]:
original_array = numpy.asarray([['d', 'c'],['b', 'a']])
print("original_array\n", original_array)
new_array =  original_array.flatten()
print("new_array after flatten", new_array)
new_array.sort()
print("new_array after sort", new_array)
print("original_array (again)\n", original_array)

<b><font color = "Crimson" size="+1">Exercise: array methods</font></b>

Help! My code doesn't work. It makes floats but I wanted integers.

Fix the code so that the final output array has an integer dtype.

Hint: use NumPy's `astype()` array method (it is a return method).

In [None]:
float_array=numpy.asarray([0.13, 14.99, -0.6, -7.1, 2.5])
round_array=numpy.round(float_array) # fix this
print(round_array)
print("dtype",round_array.dtype)