Hadoop Programming on the Cloudera Platform

This Hadoop Programming on the Cloudera Platform training class introduces students to Apache Hadoop and key Hadoop ecosystem projects: Pig, Hive, Sqoop, Impala, Oozie, HBase, and Spark. The class is appropriate for Business Analysts, IT Architects, Technical Managers, and Developers.


Public Classes: Delivered live online via WebEx and guaranteed to run. Join from anywhere!

Private Classes: Delivered at your offices, or any other location of your choice.

Class Goals

  1. Receive an overview of the Hadoop Ecosystem.
  2. Learn to use MapReduce.
  3. Learn to use the Pig Scripting Platform.
  4. Learn to use Apache Hive.
  5. Learn to use Apache Sqoop.
  6. Learn to use Cloudera Impala.
  7. Learn to use Apache HBase.
  8. Learn to use Spark.
  9. Learn to use Spark SQL.

Class Outline

  1. MapReduce Overview
    1. The Client – Server Processing Pattern
    2. Distributed Computing Challenges
    3. MapReduce Defined
    4. Google's MapReduce
    5. The Map Phase of MapReduce
    6. The Reduce Phase of MapReduce
    7. MapReduce Explained
    8. MapReduce Word Count Job
    9. MapReduce Shared-Nothing Architecture
    10. Similarity with SQL Aggregation Operations
    11. Example of Map & Reduce Operations using JavaScript
    12. Problems Suitable for Solving with MapReduce
    13. Typical MapReduce Jobs
    14. Fault-tolerance of MapReduce
    15. Distributed Computing Economics
    16. MapReduce Systems
    17. Summary
  2. Hadoop Overview
    1. Apache Hadoop
    2. Apache Hadoop Logo
    3. Typical Hadoop Applications
    4. Hadoop Clusters
    5. Hadoop Design Principles
    6. Hadoop Versions
    7. Hadoop's Main Components
    8. Hadoop Simple Definition
    9. Side-by-Side Comparison: Hadoop 1 and Hadoop 2
    10. Hadoop-based Systems for Data Analysis
    11. Other Hadoop Ecosystem Projects
    12. Hadoop Caveats
    13. Hadoop Distributions
    14. Cloudera Distribution of Hadoop (CDH)
    15. Cloudera Distributions
    16. Hortonworks Data Platform (HDP)
    17. MapR
    18. Summary
  3. Hadoop Distributed File System Overview
    1. Hadoop Distributed File System (HDFS)
    2. HDFS High Availability
    3. HDFS "Fine Print"
    4. Storing Raw Data in HDFS
    5. Hadoop Security
    6. HDFS Rack-awareness
    7. Data Blocks
    8. Data Block Replication Example
    9. HDFS NameNode Directory Diagram
    10. Accessing HDFS
    11. Examples of HDFS Commands
    12. Other Supported File Systems
    13. WebHDFS
    14. Examples of WebHDFS Calls
    15. Client Interactions with HDFS for the Read Operation
    16. Read Operation Sequence Diagram
    17. Client Interactions with HDFS for the Write Operation
    18. Communication inside HDFS
    19. Summary
  4. MapReduce with Hadoop
    1. Hadoop's MapReduce
    2. MapReduce 1 and MapReduce 2
    3. Why Discuss the Old MapReduce?
    4. MapReduce v1 ("Classic MapReduce")
    5. JobTracker and TaskTracker (the "Classic MapReduce")
    6. YARN (MapReduce v2)
    7. YARN vs MR1
    8. YARN As Data Operating System
    9. MapReduce Programming Options
    10. Java MapReduce API
    11. The Structure of a Java MapReduce Program
    12. The Mapper Class
    13. The Reducer Class
    14. The Driver Class
    15. Compiling Classes
    16. Running the MapReduce Job
    17. The Structure of a Single MapReduce Program
    18. Combiner Pass (Optional)
    19. Hadoop's Streaming MapReduce
    20. Python Word Count Mapper Program Example
    21. Python Word Count Reducer Program Example
    22. Setting up Java Classpath for Streaming Support
    23. Streaming Use Cases
    24. The Streaming API vs Java MapReduce API
    25. Amazon Elastic MapReduce
    26. Apache Tez
    27. Summary
  5. Apache Pig Scripting Platform
    1. What is Pig?
    2. Pig Latin
    3. Apache Pig Logo
    4. Pig Execution Modes
    5. Local Execution Mode
    6. MapReduce Execution Mode
    7. Running Pig
    8. Running Pig in Batch Mode
    9. What is Grunt?
    10. Pig Latin Statements
    11. Pig Programs
    12. Pig Latin Script Example
    13. SQL Equivalent
    14. Differences between Pig and SQL
    15. Statement Processing in Pig
    16. Comments in Pig
    17. Supported Simple Data Types
    18. Supported Complex Data Types
    19. Arrays
    20. Defining Relation's Schema
    21. Not Matching the Defined Schema
    22. The bytearray Generic Type
    23. Using Field Delimiters
    24. Loading Data with TextLoader()
    25. Referencing Fields in Relations
    26. Summary
  6. Apache Pig HDFS Interface
    1. The HDFS Interface
    2. FSShell Commands (Short List)
    3. Grunt's Old File System Commands
    4. Summary
  7. Apache Pig Relational and Eval Operators
    1. Pig Relational Operators
    2. Example of Using the JOIN Operator
    3. Example of Using the Order By Operator
    4. Caveats of Using Relational Operators
    5. Pig Eval Functions
    6. Caveats of Using Eval Functions (Operators)
    7. Example of Using Single-column Eval Operations
    8. Example of Using Eval Operators For Global Operations
    9. Summary
  8. Apache Pig Miscellaneous Topics
    1. Utility Commands
    2. Handling Compression
    3. User-Defined Functions
    4. Filter UDF Skeleton Code
    5. Summary
  9. Apache Pig Performance
    1. Apache Pig Performance
    2. Performance Enhancer - Use the Right Schema Type
    3. Performance Enhancer - Apply Data Filters
    4. Use the PARALLEL Clause
    5. Examples of the PARALLEL Clause
    6. Performance Enhancer - Limiting the Data Sets
    7. Displaying Execution Plan
    8. Compress the Results of Intermediate Jobs
    9. Example of Running Pig with LZO Compression Codec
    10. Summary
  10. Hive
    1. What is Hive?
    2. Apache Hive Logo
    3. Hive's Value Proposition
    4. Who uses Hive?
    5. Hive's Main Sub-Systems
    6. Hive Features
    7. The "Classic" Hive Architecture
    8. The New Hive Architecture
    9. HiveQL
    10. Where are the Hive Tables Located?
    11. Hive Command-line Interface (CLI)
    12. The Beeline Command Shell
    13. Summary
  11. Hive Command-line Interface
    1. Hive Command-line Interface (CLI)
    2. The Hive Interactive Shell
    3. Running Host OS Commands from the Hive Shell
    4. Interfacing with HDFS from the Hive Shell
    5. Hive in Unattended Mode
    6. The Hive CLI Integration with the OS Shell
    7. Executing HiveQL Scripts
    8. Comments in Hive Scripts
    9. Variables and Properties in Hive CLI
    10. Setting Properties in CLI
    11. Example of Setting Properties in CLI
    12. Hive Namespaces
    13. Using the SET Command
    14. Setting Properties in the Shell
    15. Setting Properties for the New Shell Session
    16. Setting Alternative Hive Execution Engines
    17. The Beeline Shell
    18. Connecting to the Hive Server in Beeline
    19. Beeline Command Switches
    20. Beeline Internal Commands
    21. Summary
  12. Hive Data Definition Language
    1. Hive Data Definition Language
    2. Creating Databases in Hive
    3. Using Databases
    4. Creating Tables in Hive
    5. Supported Data Type Categories
    6. Common Numeric Types
    7. String and Date / Time Types
    8. Miscellaneous Types
    9. Example of the CREATE TABLE Statement
    10. Working with Complex Types
    11. Table Partitioning
    12. Table Partitioning
    13. Table Partitioning on Multiple Columns
    14. Viewing Table Partitions
    15. Row Format
    16. Data Serializers / Deserializers
    17. File Format Storage
    18. File Compression
    19. More on File Formats
    20. The ORC Data Format
    21. Converting Text to ORC Data Format
    22. The EXTERNAL DDL Parameter
    23. Example of Using EXTERNAL
    24. Creating an Empty Table
    25. Dropping a Table
    26. Table / Partition(s) Truncation
    27. Alter Table/Partition/Column
    29. Create View Statement
    30. Why Use Views?
    31. Restricting Amount of Viewable Data
    32. Examples of Restricting Amount of Viewable Data
    33. Creating and Dropping Indexes
    34. Describing Data
    35. Summary
  13. Hive Data Manipulation Language
    1. Hive Data Manipulation Language (DML)
    2. Using the LOAD DATA statement
    3. Example of Loading Data into a Hive Table
    4. Loading Data with the INSERT Statement
    5. Appending and Replacing Data with the INSERT Statement
    6. Examples of Using the INSERT Statement
    7. Multi Table Inserts
    8. Multi Table Inserts Syntax
    9. Multi Table Inserts Example
    10. Summary
  14. Hive Select Statement
    1. HiveQL
    2. The SELECT Statement Syntax
    3. The WHERE Clause
    4. Examples of the WHERE Statement
    5. Partition-based Queries
    6. Example of an Efficient SELECT Statement
    7. The DISTINCT Clause
    8. Supported Numeric Operators
    9. Built-in Mathematical Functions
    10. Built-in Aggregate Functions
    11. Built-in Statistical Functions
    12. Other Useful Built-in Functions
    13. The GROUP BY Clause
    14. The HAVING Clause
    15. The LIMIT Clause
    16. The ORDER BY Clause
    17. The JOIN Clause
    18. The CASE … Clause
    19. Example of CASE … Clause
    20. Summary
  15. Apache Sqoop
    1. What is Sqoop?
    2. Apache Sqoop Logo
    3. Sqoop Import / Export
    4. Sqoop Help
    5. Examples of Using Sqoop Commands
    6. Data Import Example
    7. Fine-tuning Data Import
    8. Controlling the Number of Import Processes
    9. Data Splitting
    10. Helping Sqoop Out
    11. Example of Executing Sqoop Load in Parallel
    12. A Word of Caution: Avoid Complex Free-Form Queries
    13. Using Direct Export from Databases
    14. Example of Using Direct Export from MySQL
    15. More on Direct Mode Import
    16. Changing Data Types
    17. Example of Default Types Overriding
    18. File Formats
    19. The Apache Avro Serialization System
    20. Binary vs Text
    21. More on the SequenceFile Binary Format
    22. Generating the Java Table Record Source Code
    23. Data Export from HDFS
    24. Export Tool Common Arguments
    25. Data Export Control Arguments
    26. Data Export Example
    27. Using a Staging Table
    28. INSERT and UPDATE Statements
    29. INSERT Operations
    30. UPDATE Operations
    31. Example of the Update Operation
    32. Failed Exports
    33. Sqoop2
    34. Sqoop2 Architecture
    35. Summary
  16. Cloudera Impala
    1. What is Cloudera Impala?
    2. Impala's Logo
    3. Impala Architecture
    4. Benefits of Using Impala
    5. Key Features
    6. How Impala Handles SQL Queries
    7. Impala Programming Interfaces
    8. Impala SQL Language Reference
    9. Differences Between Impala and HiveQL
    10. Impala Shell
    11. Impala Shell Main Options
    12. Impala Shell Commands
    13. Impala Common Shell Commands
    14. Cloudera Web Admin UI
    15. Impala Browser-based Query Editor
    16. Summary
  17. Introduction to Functional Programming
    1. What is Functional Programming (FP)?
    2. Terminology: First-Class and Higher-Order Functions
    3. Terminology: Lambda vs Closure
    4. A Short List of Languages that Support FP
    5. FP with Java
    6. FP With JavaScript
    7. Imperative Programming in JavaScript
    8. The JavaScript map (FP) Example
    9. The JavaScript reduce (FP) Example
    10. Using reduce to Flatten an Array of Arrays (FP) Example
    11. The JavaScript filter (FP) Example
    12. Common Higher-Order Functions in Python
    13. Common Higher-Order Functions in Scala
    14. Elements of FP in R
    15. Summary
  18. Introduction to Apache Spark
    1. What is Spark
    2. A Short History of Spark
    3. Where to Get Spark?
    4. The Spark Platform
    5. Spark Logo
    6. Common Spark Use Cases
    7. Languages Supported by Spark
    8. Running Spark on a Cluster
    9. The Driver Process
    10. Spark Applications
    11. Spark Shell
    12. The spark-submit Tool
    13. The spark-submit Tool Configuration
    14. The Executor and Worker Processes
    15. The Spark Application Architecture
    16. Interfaces with Data Storage Systems
    17. Limitations of Hadoop's MapReduce
    18. Spark vs MapReduce
    19. Spark as an Alternative to Apache Tez
    20. The Resilient Distributed Dataset (RDD)
    21. Spark Streaming (Micro-batching)
    22. Spark SQL
    23. Example of Spark SQL
    24. Spark Machine Learning Library
    25. GraphX
    26. Spark vs R
    27. Summary
  19. The Spark Shell
    1. The Spark Shell
    2. The Spark Shell UI
    3. Spark Shell Options
    4. Getting Help
    5. The Spark Context (sc) and SQL Context (sqlContext)
    6. The Shell Spark Context
    7. Loading Files
    8. Saving Files
    9. Basic Spark ETL Operations
    10. Summary
  20. Spark RDDs
    1. The Resilient Distributed Dataset (RDD)
    2. Ways to Create an RDD
    3. Custom RDDs
    4. Supported Data Types
    5. RDD Operations
    6. RDDs are Immutable
    7. Spark Actions
    8. RDD Transformations
    9. Other RDD Operations
    10. Chaining RDD Operations
    11. RDD Lineage
    12. The Big Picture
    13. What May Go Wrong
    14. Checkpointing RDDs
    15. Local Checkpointing
    16. Parallelized Collections
    17. More on parallelize() Method
    18. The Pair RDD
    19. Where do I use Pair RDDs?
    20. Example of Creating a Pair RDD with Map
    21. Example of Creating a Pair RDD with keyBy
    22. Miscellaneous Pair RDD Operations
    23. RDD Caching
    24. RDD Persistence
    25. The Tachyon Storage
    26. Summary
  21. Parallel Data Processing with Spark
    1. Running Spark on a Cluster
    2. Spark Stand-alone Option
    3. The High-Level Execution Flow in Stand-alone Spark Cluster
    4. Data Partitioning
    5. Data Partitioning Diagram
    6. Single Local File System RDD Partitioning
    7. Multiple File RDD Partitioning
    8. Special Cases for Small-sized Files
    9. Parallel Data Processing of Partitions
    10. Spark Application, Jobs, and Tasks
    11. Stages and Shuffles
    12. The "Big Picture"
    13. Summary
  22. Shared Variables in Spark
    1. Shared Variables in Spark
    2. Broadcast Variables
    3. Creating and Using Broadcast Variables
    4. Example of Using Broadcast Variables
    5. Accumulators
    6. Creating and Using Accumulators
    7. Example of Using Accumulators
    8. Custom Accumulators
    9. Summary
  23. Introduction to Spark SQL
    1. What is Spark SQL?
    2. Uniform Data Access with Spark SQL
    3. Hive Integration
    4. Hive Interface
    5. Integration with BI Tools
    6. Spark SQL is No Longer an Experimental Developer API!
    7. What is a DataFrame?
    8. The SQLContext Object
    9. The SQLContext API
    10. Changes from Spark SQL 1.3 to 1.4
    11. Example of Spark SQL (Scala Example)
    12. Example of Working with a JSON File
    13. Example of Working with a Parquet File
    14. Using JDBC Sources
    15. JDBC Connection Example
    16. Performance & Scalability of Spark SQL
    17. Summary
  24. Graph Processing with GraphX
    1. What is GraphX?
    2. Supported Languages
    3. Vertices and Edges
    4. Graph Terminology
    5. Example of Property Graph
    6. The GraphX API
    7. The GraphX Views
    8. The Triplet View
    9. Graph Algorithms
    10. Graphs and RDDs
    11. Constructing Graphs
    12. Graph Operators
    13. Example of Using GraphX Operators
    14. GraphX Performance Optimization
    15. The PageRank Algorithm
    16. GraphX Support for PageRank
    17. Summary
  25. Machine Learning Algorithms
    1. Supervised vs Unsupervised Machine Learning
    2. Supervised Machine Learning Algorithms
    3. Unsupervised Machine Learning Algorithms
    4. Choose the Right Algorithm
    5. Life-cycles of Machine Learning Development
    6. Classifying with k-Nearest Neighbors (SL)
    7. k-Nearest Neighbors Algorithm
    8. k-Nearest Neighbors Algorithm
    9. The Error Rate
    10. Decision Trees (SL)
    11. Random Forests
    12. Unsupervised Learning Type: Clustering
    13. K-Means Clustering (UL)
    14. K-Means Clustering in a Nutshell
    15. Regression Analysis
    16. Logistic Regression
    17. Summary
  26. The Spark Machine Learning Library
    1. What is MLlib?
    2. Supported Languages
    3. MLlib Packages
    4. Dense and Sparse Vectors
    5. Labeled Point
    6. Python Example of Using the LabeledPoint Class
    7. LIBSVM format
    8. An Example of a LIBSVM File
    9. Loading LIBSVM Files
    10. Local Matrices
    11. Example of Creating Matrices in MLlib
    12. Distributed Matrices
    13. Example of Using a Distributed Matrix
    14. Classification and Regression Algorithms
    15. Clustering
    16. Summary
Class Materials

Each student in our Live Online and our Onsite classes receives a comprehensive set of materials, including course notes and all the class examples.

Class Prerequisites

Experience in the following is required for this Hadoop class:

  • General knowledge of programming.

Training for Yourself

$3,125.00 or 5 vouchers

Training for your Team

Length: 5 Days
  • Private Class for your Team
  • Online or On-location
  • Customizable
  • Expert Instructors

No cancelation for low enrollment

Certified Microsoft Partner

Registered Education Provider (R.E.P.)

GSA schedule pricing


Satisfaction guarantee and retake option


Students rated our trainers 9.30 out of 10 based on 30,188 reviews

Contact Us or call 1-877-932-8228