Introduction to Spark with Python

The Introduction to Spark with Python training class provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g., RDDs and the distributed compute engine) as well as higher-level constructs that provide a simpler and more capable interface (e.g., DataFrames and Spark SQL). Spark SQL and DataFrames, now the preferred programming API, are covered in depth, including potential performance issues and strategies for optimization.
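
For example, a minimal PySpark sketch of the DataFrame and Spark SQL APIs covered in the class might look like the following (this is an illustration, not course material; the file people.json and its columns age and state are hypothetical):

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession, the entry point to the DataFrame and SQL APIs
    spark = SparkSession.builder.appName("IntroExample").getOrCreate()

    # Load a JSON file into a DataFrame; the schema is inferred automatically
    # (people.json and its columns are hypothetical)
    people = spark.read.json("people.json")

    # Untyped DataFrame query DSL: filter, group, and aggregate
    people.filter(people.age > 21).groupBy("state").count().show()

    # The same data can be queried with SQL after registering a temporary view
    people.createOrReplaceTempView("people")
    spark.sql("SELECT state, COUNT(*) AS n FROM people WHERE age > 21 GROUP BY state").show()

    spark.stop()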

The course also covers more advanced capabilities, such as using Spark Streaming to process streaming data and integrating with Apache Kafka.
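
As a brief preview of that coverage, a minimal Structured Streaming sketch that reads from Kafka might look like the following; the broker address localhost:9092 and the topic name events are assumptions for illustration, and the matching spark-sql-kafka package must be supplied when the job is submitted:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("KafkaStreamExample").getOrCreate()

    # Read a stream from Kafka using the "kafka" source
    # (localhost:9092 and the topic "events" are hypothetical)
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "events")
           .load())

    # Kafka values arrive as bytes; cast them to strings for processing
    messages = raw.select(col("value").cast("string").alias("message"))

    # Write the running results to the console sink (for demonstration only)
    query = (messages.writeStream
             .format("console")
             .outputMode("append")
             .start())

    query.awaitTermination()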

Location

Public Classes: Delivered live online via WebEx and guaranteed to run. Join from anywhere!

Private Classes: Delivered at your offices, or any other location of your choice.

Course Topics
  1. Understand the need for Spark in data processing
  2. Understand the Spark architecture and how it distributes computations to cluster nodes
  3. Be familiar with basic installation / setup / layout of Spark
  4. Use the Spark shell for interactive and ad-hoc operations
  5. Understand RDDs (Resilient Distributed Datasets), along with data partitioning, pipelining, and computations
  6. Understand and use RDD operations such as map(), filter(), and others
  7. Understand and use Spark SQL and the DataFrame API
  8. Understand DataFrame capabilities, including the Catalyst query optimizer and Tungsten memory/CPU optimizations
  9. Be familiar with performance issues, and use DataFrames and Spark SQL for efficient computations
  10. Understand Spark’s data caching and use it for efficient data transfer
  11. Write/run standalone Spark programs with the Spark API
  12. Use Spark Streaming / Structured Streaming to process streaming (real-time) data
  13. Ingest streaming data from Kafka, and process via Spark Structured Streaming
  14. Understand performance implications and optimizations when using Spark
Outline
  1. Introduction to Spark
    1. Overview, Motivations, Spark Systems
    2. Spark Ecosystem
    3. Spark vs. Hadoop
    4. Acquiring and Installing Spark
    5. The Spark Shell, SparkContext
  2. RDDs and Spark Architecture
    1. RDD Concepts, Lifecycle, Lazy Evaluation
    2. RDD Partitioning and Transformations
    3. Working with RDDs - Creating and Transforming (map, filter, etc.)
  3. Spark SQL, DataFrames, and DataSets
    1. Overview
    2. SparkSession, Loading/Saving Data
    3. Introducing DataFrames (Creation and Schema Inference)
    4. Supported Data Formats (JSON, Text, CSV, Parquet)
    5. Working with the DataFrame (untyped) Query DSL (Column, Filtering, Grouping, Aggregation)
    6. SQL-based Queries
    7. Mapping and Splitting (flatMap(), explode(), and split())
    8. DataFrames vs. RDDs
  4. Shuffling Transformations and Performance
    1. Grouping, Reducing, Joining
    2. Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
    3. Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
    4. The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)
  5. Performance Tuning
    1. Caching - Concepts, Storage Type, Guidelines
    2. Minimizing Shuffling for Increased Performance
    3. Using Broadcast Variables and Accumulators
    4. General Performance Guidelines
  6. Creating Standalone Applications
    1. Core API, SparkSession.Builder
    2. Configuring and Creating a SparkSession
    3. Building and Running Applications - sbt/build.sbt and spark-submit (see the standalone application sketch after this outline)
    4. Application Lifecycle (Driver, Executors, and Tasks)
    5. Cluster Managers (Standalone, YARN, Mesos)
    6. Logging and Debugging
  7. Spark Streaming
    1. Introduction and Streaming Basics
    2. Structured Streaming (Spark 2+)
    3. Continuous Applications
    4. Table Paradigm, Result Table
    5. Steps for Structured Streaming
    6. Sources and Sinks
    7. Consuming Kafka Data
    8. Kafka Overview
    9. Structured Streaming - "kafka" format
    10. Processing the Stream
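
The standalone application sketch referenced in outline section 6 follows. It is a minimal illustration, not course material, showing a configured SparkSession, a cached DataFrame reused by two actions (outline section 5), and a program launched with spark-submit; the file sales.csv and its columns region and amount are hypothetical:

    # standalone_app.py - run with: spark-submit standalone_app.py
    from pyspark.sql import SparkSession

    def main():
        # Configure and create the SparkSession for this application
        spark = (SparkSession.builder
                 .appName("StandaloneExample")
                 .config("spark.sql.shuffle.partitions", "8")  # example tuning setting
                 .getOrCreate())

        # Load data; inferSchema scans the file to detect column types
        sales = (spark.read
                 .option("header", "true")
                 .option("inferSchema", "true")
                 .csv("sales.csv"))  # sales.csv is a hypothetical input file

        # Cache the DataFrame because more than one action below reuses it
        sales.cache()

        # Both actions reuse the cached data instead of re-reading the file
        print("rows:", sales.count())
        sales.groupBy("region").sum("amount").show()

        spark.stop()

    if __name__ == "__main__":
        main()

On a cluster, spark-submit also accepts options such as --master to select the cluster manager; the driver program then coordinates the executors that run the tasks.
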
Class Materials

Each student in our Live Online and our Onsite classes receives a comprehensive set of materials, including course notes and all the class examples.

Class Prerequisites

Experience in the following is required for this Spark class:

  • Working knowledge of a programming language. Java experience is not needed.

Training for Yourself

$1,875.00 or 3 vouchers

Training for your Team

Length: 3 Days
  • Private Class for your Team
  • Online or On-location
  • Customizable
  • Expert Instructors

What people say about our training

"Webucator allowed me to complete my training with a live instructor from home without the distraction of the office or a conference room!"
Jessica Bahry, Portage Path Behavioral Health

"Webucator is AWESOME!"
Dave Kuehl, 20Jeans

"The instructor was excellent and very well versed in MS Word training. I would recommend this class to anyone who wants a better understanding of MS Word."
Michael Dietz, Policy Studies Inc

"This class hits the mark. Very professional and engaging instruction. I highly recommend, particularly for the cost."
Elizabeth Dorso, Save the Children

No cancellation for low enrollment

Certified Microsoft Partner

Registered Education Provider (R.E.P.)

GSA schedule pricing

  • 65,181 students have taken instructor-led training
  • 12,014 organizations trust Webucator for their instructor-led training needs
  • 100% satisfaction guarantee and retake option
  • Students rated our trainers 9.30 out of 10 based on 30,146 reviews

Contact Us or call 1-877-932-8228