Spark Fundamentals Training

This Spark training course covers the theoretical and technical aspects of Spark programming. It teaches developers Spark fundamentals, APIs, common programming idioms, and more.


Public Classes: Delivered live online via WebEx and guaranteed to run. Join from anywhere!

Private Classes: Delivered at your offices or any other location of your choice.

Course Topics
  1. Learn about elements of functional programming.
  2. Learn about Spark Shell.
  3. Learn about RDDs.
  4. Learn about parallel processing in Spark.
  5. Learn about Spark SQL.
  6. Learn about ETL with Spark.
  7. Learn about the MLlib Machine Learning Library.
  8. Learn about Graph Processing with GraphX.
  9. Learn about Spark Streaming.
Course Outline
  1. Introduction to Functional Programming
    1. What is Functional Programming (FP)?
    2. Terminology: First-Class and Higher-Order Functions
    3. Terminology: Lambda vs Closure
    4. A Short List of Languages that Support FP
    5. FP with Java
    6. FP with JavaScript
    7. Imperative Programming in JavaScript
    8. The JavaScript map (FP) Example
    9. The JavaScript reduce (FP) Example
    10. Using reduce to Flatten an Array of Arrays (FP) Example
    11. The JavaScript filter (FP) Example
    12. Common Higher-Order Functions in Python
    13. Common Higher-Order Functions in Scala
    14. Elements of FP in R
    15. Summary
  2. Introduction to Apache Spark
    1. What is Spark
    2. A Short History of Spark
    3. Where to Get Spark?
    4. The Spark Platform
    5. Spark Logo
    6. Common Spark Use Cases
    7. Languages Supported by Spark
    8. Running Spark on a Cluster
    9. The Driver Process
    10. Spark Applications
    11. Spark Shell
    12. The spark-submit Tool
    13. The spark-submit Tool Configuration
    14. The Executor and Worker Processes
    15. The Spark Application Architecture
    16. Interfaces with Data Storage Systems
    17. Limitations of Hadoop's MapReduce
    18. Spark vs MapReduce
    19. Spark as an Alternative to Apache Tez
    20. The Resilient Distributed Dataset (RDD)
    21. Spark Streaming (Micro-batching)
    22. Spark SQL
    23. Example of Spark SQL
    24. Spark Machine Learning Library
    25. GraphX
    26. Spark vs R
    27. Summary
  3. Hadoop Distributed File System Overview
    1. Hadoop Distributed File System (HDFS)
    2. HDFS High Availability
    3. HDFS "Fine Print"
    4. Storing Raw Data in HDFS
    5. Hadoop Security
    6. HDFS Rack-awareness
    7. Data Blocks
    8. Data Block Replication Example
    9. HDFS NameNode Directory Diagram
    10. Accessing HDFS
    11. Examples of HDFS Commands
    12. Other Supported File Systems
    13. WebHDFS
    14. Examples of WebHDFS Calls
    15. Client Interactions with HDFS for the Read Operation
    16. Read Operation Sequence Diagram
    17. Client Interactions with HDFS for the Write Operation
    18. Communication inside HDFS
    19. Summary
  4. The Spark Shell
    1. The Spark Shell
    2. The Spark Shell UI
    3. Spark Shell Options
    4. Getting Help
    5. The Spark Context (sc) and SQL Context (sqlContext)
    6. The Shell Spark Context
    7. Loading Files
    8. Saving Files
    9. Basic Spark ETL Operations
    10. Summary
  5. Spark RDDs
    1. The Resilient Distributed Dataset (RDD)
    2. Ways to Create an RDD
    3. Custom RDDs
    4. Supported Data Types
    5. RDD Operations
    6. RDDs are Immutable
    7. Spark Actions
    8. RDD Transformations
    9. Other RDD Operations
    10. Chaining RDD Operations
    11. RDD Lineage
    12. The Big Picture
    13. What May Go Wrong
    14. Checkpointing RDDs
    15. Local Checkpointing
    16. Parallelized Collections
    17. More on the parallelize() Method
    18. The Pair RDD
    19. Where do I use Pair RDDs?
    20. Example of Creating a Pair RDD with Map
    21. Example of Creating a Pair RDD with keyBy
    22. Miscellaneous Pair RDD Operations
    23. RDD Caching
    24. RDD Persistence
    25. The Tachyon Storage
    26. Summary
  6. Shared Variables in Spark
    1. Shared Variables in Spark
    2. Broadcast Variables
    3. Creating and Using Broadcast Variables
    4. Example of Using Broadcast Variables
    5. Accumulators
    6. Creating and Using Accumulators
    7. Example of Using Accumulators
    8. Custom Accumulators
    9. Summary
  7. Parallel Data Processing with Spark
    1. Running Spark on a Cluster
    2. Spark Stand-alone Option
    3. The High-Level Execution Flow in Stand-alone Spark Cluster
    4. Data Partitioning
    5. Data Partitioning Diagram
    6. Single Local File System RDD Partitioning
    7. Multiple File RDD Partitioning
    8. Special Cases for Small-sized Files
    9. Parallel Data Processing of Partitions
    10. Spark Application, Jobs, and Tasks
    11. Stages and Shuffles
    12. The "Big Picture"
    13. Summary
  8. Introduction to Spark SQL
    1. What is Spark SQL?
    2. Uniform Data Access with Spark SQL
    3. Hive Integration
    4. Hive Interface
    5. Integration with BI Tools
    6. Spark SQL is No Longer an Experimental Developer API!
    7. What is a DataFrame?
    8. The SQLContext Object
    9. The SQLContext API
    10. Changes Between Spark SQL 1.3 and 1.4
    11. Example of Spark SQL (Scala Example)
    12. Example of Working with a JSON File
    13. Example of Working with a Parquet File
    14. Using JDBC Sources
    15. JDBC Connection Example
    16. Performance & Scalability of Spark SQL
    17. Summary
  9. Graph Processing with GraphX
    1. What is GraphX?
    2. Supported Languages
    3. Vertices and Edges
    4. Graph Terminology
    5. Example of Property Graph
    6. The GraphX API
    7. The GraphX Views
    8. The Triplet View
    9. Graph Algorithms
    10. Graphs and RDDs
    11. Constructing Graphs
    12. Graph Operators
    13. Example of Using GraphX Operators
    14. GraphX Performance Optimization
    15. The PageRank Algorithm
    16. GraphX Support for PageRank
    17. Summary
  10. Machine Learning Algorithms
    1. Supervised vs Unsupervised Machine Learning
    2. Supervised Machine Learning Algorithms
    3. Unsupervised Machine Learning Algorithms
    4. Choose the Right Algorithm
    5. Life-cycles of Machine Learning Development
    6. Classifying with k-Nearest Neighbors (SL)
    7. k-Nearest Neighbors Algorithm
    8. k-Nearest Neighbors Algorithm
    9. The Error Rate
    10. Decision Trees (SL)
    11. Random Forests
    12. Unsupervised Learning Type: Clustering
    13. K-Means Clustering (UL)
    14. K-Means Clustering in a Nutshell
    15. Regression Analysis
    16. Logistic Regression
    17. Summary
  11. The Spark Machine Learning Library
    1. What is MLlib?
    2. Supported Languages
    3. MLlib Packages
    4. Dense and Sparse Vectors
    5. Labeled Point
    6. Python Example of Using the LabeledPoint Class
    7. LIBSVM Format
    8. An Example of a LIBSVM File
    9. Loading LIBSVM Files
    10. Local Matrices
    11. Example of Creating Matrices in MLlib
    12. Distributed Matrices
    13. Example of Using a Distributed Matrix
    14. Classification and Regression Algorithms
    15. Clustering
    16. Summary
  12. Spark Streaming
    1. What is Spark Streaming?
    2. Spark Streaming as Micro-batching
    3. Use Cases
    4. Some "Competition"
    5. Spark Streaming Features
    6. How It Works
    7. Basic Data Stream Sources
    8. Advanced Data Stream Sources
    9. The DStream Object
    10. DStream - RDD Diagram
    11. The Operational DStream API
    12. DStream Output Operations
    13. The StreamingContext Object
    14. TCP Text Streams Example (in Scala)
    15. Accessing the Underlying RDDs
    16. The Sliding Window Concept
    17. The Sliding Window Diagram
    18. The Window Operations
    19. A Windowed Computation Example (Scala)
    20. Points to Remember
    21. Other Points to Remember
    22. Summary
Class Materials

Each student in our Live Online and our Onsite classes receives a comprehensive set of materials, including course notes and all the class examples.

Class Prerequisites

Experience in the following is required for this Spark class:

  • General knowledge of programming, as well as experience working in Unix-like environments (e.g., running shell commands).

Training for your Team

Length: 3 Days
  • Private Class for your Team
  • Online or On-location
  • Customizable
  • Expert Instructors

What people say about our training

The instructor is a great teacher and very passionate about what she teaches; I would highly recommend Webucator for classes.
Nalini Mahajan
Marianjoy Rehab Hospital
Lots of info! Our instructor clearly understood the concepts well.
George Promenschenkel
Great general overview of the subject.
Daniel Laird
This is the best advanced class ever. The teacher is great.
Carlette Daye
US Census Bureau

No cancelation for low enrollment

Certified Microsoft Partner

Registered Education Provider (R.E.P.)

GSA schedule pricing


Satisfaction guarantee and retake option


Students rated our trainers 9.30 out of 10 based on 30,409 reviews

Contact Us or call 1-877-932-8228