Introduction to PySpark

Overview

Instructor: Jian Tao

Time: Friday, March 6, 2020 — 10:00AM-12:30PM CT

Location: SCC 102.B

Prerequisites: Python

PySpark is the Python API for Apache Spark, an open-source, general-purpose framework for distributed cluster computing. Spark, written in the Scala programming language, provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since (wikipedia.org).

PySpark is well suited to performing exploratory data analysis (EDA) at scale, building machine learning models, and deploying large-scale data analysis pipelines.
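
For instance, a minimal sketch of such an EDA workflow might look like the following (assuming PySpark is installed and run with a local Spark session; the file name "sample.csv" and other names are purely illustrative, not part of the course material):

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session
    spark = SparkSession.builder \
        .appName("eda-sketch") \
        .master("local[*]") \
        .getOrCreate()

    # Load a CSV file into a distributed DataFrame (file name is hypothetical)
    df = spark.read.csv("sample.csv", header=True, inferSchema=True)

    # Quick look at the schema and summary statistics
    df.printSchema()
    df.describe().show()

    spark.stop()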

This short course will introduce the functionality of Apache Spark through its Python API and show how to use PySpark to perform common tasks on both laptops and supercomputers.

Agenda

Among other topics, this course covers the following (a brief code sketch illustrating a few of them appears after the list):

  • Introduction to PySpark
  • Running PySpark programs in Jupyter notebooks
  • Resilient Distributed Dataset (RDD)
  • Spark DataFrame
  • PySpark SQL
  • Streaming
  • Machine Learning Pipeline
  • Hands-on session
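
The sketch below is a hedged illustration of a few of these topics (RDDs, DataFrames, and PySpark SQL) using a local Spark session; the data and names are made up for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("agenda-sketch").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    # RDD: a low-level distributed collection with functional transformations
    rdd = sc.parallelize(range(1, 6))
    print(rdd.map(lambda x: x * x).collect())   # [1, 4, 9, 16, 25]

    # DataFrame: a higher-level, schema-aware API
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.show()

    # PySpark SQL: register the DataFrame as a view and query it with SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()

    spark.stop()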

This short course will make use of the Jupyter interactive environment; a brief introduction to Jupyter will be given if needed.

Course Materials