Hadoop Programming on the Hortonworks Data Platform Training

This training course introduces students to Apache Hadoop and key Hadoop ecosystem projects: Pig, Hive, Sqoop, Oozie, HBase, and Spark. The course is appropriate for Business Analysts, IT Architects, Technical Managers, and Developers.


Public Classes: Delivered live online via WebEx and guaranteed to run. Join from anywhere!

Private Classes: Delivered at your offices, or at any other location of your choice.

Course Topics
  1. Learn about the Hadoop Ecosystem.
  2. Learn MapReduce.
  3. Learn the Apache Pig Scripting Platform.
  4. Learn Apache Hive.
  5. Learn Apache Sqoop.
  6. Learn Apache HBase.
  7. Learn Spark.
  8. Learn Spark SQL.

Course Outline
  1. MapReduce Overview
    1. The Client-Server Processing Pattern
    2. Distributed Computing Challenges
    3. MapReduce Defined
    4. Google's MapReduce
    5. The Map Phase of MapReduce
    6. The Reduce Phase of MapReduce
    7. MapReduce Explained
    8. MapReduce Word Count Job
    9. MapReduce Shared-Nothing Architecture
    10. Similarity with SQL Aggregation Operations
    11. Example of Map & Reduce Operations using JavaScript
    12. Problems Suitable for Solving with MapReduce
    13. Typical MapReduce Jobs
    14. Fault-tolerance of MapReduce
    15. Distributed Computing Economics
    16. MapReduce Systems
    17. Summary
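The word-count job listed above is the "hello world" of MapReduce. As a rough illustration of the two phases (plain Python run locally, no Hadoop cluster involved), the map and reduce steps can be simulated like this:

```python
from collections import defaultdict

# Conceptual sketch of the two MapReduce phases of a word-count job,
# simulated in plain Python.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word. The shuffle step
    that groups pairs by key is modeled here by the defaultdict."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(map_phase(lines))
print(result["the"])  # 2
```

In a real Hadoop job the map and reduce phases run as separate tasks on different cluster nodes, which is what gives MapReduce its shared-nothing scalability.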
  2. Hadoop Overview
    1. Apache Hadoop
    2. Apache Hadoop Logo
    3. Typical Hadoop Applications
    4. Hadoop Clusters
    5. Hadoop Design Principles
    6. Hadoop Versions
    7. Hadoop's Main Components
    8. Hadoop Simple Definition
    9. Side-by-Side Comparison: Hadoop 1 and Hadoop 2
    10. Hadoop-based Systems for Data Analysis
    11. Other Hadoop Ecosystem Projects
    12. Hadoop Caveats
    13. Hadoop Distributions
    14. Cloudera Distribution of Hadoop (CDH)
    15. Cloudera Distributions
    16. Hortonworks Data Platform (HDP)
    17. MapR
    18. Summary
  3. Hadoop Distributed File System Overview
    1. Hadoop Distributed File System (HDFS)
    2. HDFS High Availability
    3. HDFS "Fine Print"
    4. Storing Raw Data in HDFS
    5. Hadoop Security
    6. HDFS Rack-awareness
    7. Data Blocks
    8. Data Block Replication Example
    9. HDFS NameNode Directory Diagram
    10. Accessing HDFS
    11. Examples of HDFS Commands
    12. Other Supported File Systems
    13. WebHDFS
    14. Examples of WebHDFS Calls
    15. Client Interactions with HDFS for the Read Operation
    16. Read Operation Sequence Diagram
    17. Client Interactions with HDFS for the Write Operation
    18. Communication inside HDFS
    19. Summary
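The WebHDFS interface covered in this module exposes HDFS over plain HTTP. As a rough sketch of how its REST URLs are formed (the host name, port, and paths below are placeholders; op=LISTSTATUS and op=OPEN are standard WebHDFS operations):

```python
# Sketch of WebHDFS v1 REST URL construction. The host, port, and file
# paths are illustrative placeholders, not values from a real cluster.

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 URL for a file system path and operation."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# List a directory (HTTP GET):
print(webhdfs_url("namenode.example.com", 50070, "/user/hadoop", "LISTSTATUS"))
# Read a file (HTTP GET; the NameNode redirects the client to a DataNode):
print(webhdfs_url("namenode.example.com", 50070, "/user/hadoop/data.txt", "OPEN"))
```

Because WebHDFS is plain HTTP, any client (curl, a browser, or any language with an HTTP library) can interact with HDFS without Hadoop client libraries installed.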
  4. MapReduce with Hadoop
    1. Hadoop's MapReduce
    2. MapReduce 1 and MapReduce 2
    3. Why Discuss the Old MapReduce?
    4. MapReduce v1 ("Classic MapReduce")
    5. JobTracker and TaskTracker (the "Classic MapReduce")
    6. YARN (MapReduce v2)
    7. YARN vs MR1
    8. YARN as a Data Operating System
    9. MapReduce Programming Options
    10. Java MapReduce API
    11. The Structure of a Java MapReduce Program
    12. The Mapper Class
    13. The Reducer Class
    14. The Driver Class
    15. Compiling Classes
    16. Running the MapReduce Job
    17. The Structure of a Single MapReduce Program
    18. Combiner Pass (Optional)
    19. Hadoop's Streaming MapReduce
    20. Python Word Count Mapper Program Example
    21. Python Word Count Reducer Program Example
    22. Setting up Java Classpath for Streaming Support
    23. Streaming Use Cases
    24. The Streaming API vs Java MapReduce API
    25. Amazon Elastic MapReduce
    26. Apache Tez
    27. Summary
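This module's Streaming examples implement word count as a Python mapper and reducer. As a hedged sketch of that pattern: Hadoop Streaming pipes input splits to the mapper's stdin, sorts the mapper output by key, and pipes the sorted stream to the reducer's stdin. The local dry run at the bottom stands in for that pipeline.

```python
from itertools import groupby

# Sketch of a word-count mapper and reducer in the Hadoop Streaming
# style. In a real job each would be a separate script reading stdin;
# here they are functions over iterables so the sketch runs locally.

def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word. Input must be sorted by key, which the
    Hadoop shuffle phase guarantees between mapper and reducer."""
    pairs = (line.strip().split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local dry run, equivalent to: cat input | mapper | sort | reducer
mapped = sorted(mapper(["to be or not to be"]))
for out_line in reducer(mapped):
    print(out_line)
```

The same pair of scripts works unchanged on a cluster because Streaming only cares about lines on stdin/stdout, which is what makes the API attractive for non-Java developers.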
  5. Apache Pig Scripting Platform
    1. What is Pig?
    2. Pig Latin
    3. Apache Pig Logo
    4. Pig Execution Modes
    5. Local Execution Mode
    6. MapReduce Execution Mode
    7. Running Pig
    8. Running Pig in Batch Mode
    9. What is Grunt?
    10. Pig Latin Statements
    11. Pig Programs
    12. Pig Latin Script Example
    13. SQL Equivalent
    14. Differences between Pig and SQL
    15. Statement Processing in Pig
    16. Comments in Pig
    17. Supported Simple Data Types
    18. Supported Complex Data Types
    19. Arrays
    20. Defining a Relation's Schema
    21. Not Matching the Defined Schema
    22. The bytearray Generic Type
    23. Using Field Delimiters
    24. Loading Data with TextLoader()
    25. Referencing Fields in Relations
    26. Summary
  6. Apache Pig HDFS Interface
    1. The HDFS Interface
    2. FSShell Commands (Short List)
    3. Grunt's Old File System Commands
    4. Summary
  7. Apache Pig Relational and Eval Operators
    1. Pig Relational Operators
    2. Example of Using the JOIN Operator
    3. Example of Using the ORDER BY Operator
    4. Caveats of Using Relational Operators
    5. Pig Eval Functions
    6. Caveats of Using Eval Functions (Operators)
    7. Example of Using Single-column Eval Operations
    8. Example of Using Eval Operators For Global Operations
    9. Summary
  8. Apache Pig Miscellaneous Topics
    1. Utility Commands
    2. Handling Compression
    3. User-Defined Functions
    4. Filter UDF Skeleton Code
    5. Summary
  9. Apache Pig Performance
    1. Apache Pig Performance
    2. Performance Enhancer - Use the Right Schema Type
    3. Performance Enhancer - Apply Data Filters
    4. Use the PARALLEL Clause
    5. Examples of the PARALLEL Clause
    6. Performance Enhancer - Limiting the Data Sets
    7. Displaying Execution Plan
    8. Compress the Results of Intermediate Jobs
    9. Example of Running Pig with LZO Compression Codec
    10. Summary
  10. Apache Oozie
    1. What is Oozie?
    2. Apache Oozie Logo
    3. Oozie Terminology
    4. Directed Acyclic Graph
    5. Oozie Job Types
    6. Oozie Architecture
    7. Oozie Configuration
    8. Oozie Workflows
    9. The Flow Of Oozie Workflows
    10. More on Oozie Workflows
    11. Oozie Workflow Control Nodes
    12. More Oozie Workflow Control Nodes
    13. A Workflow Example
    14. A More Complex Workflow Example
    15. Oozie Coordinator
    16. The Pig Action Template
    17. Summary
  11. Hive
    1. What is Hive?
    2. Apache Hive Logo
    3. Hive's Value Proposition
    4. Who uses Hive?
    5. Hive's Main Sub-Systems
    6. Hive Features
    7. The "Classic" Hive Architecture
    8. The New Hive Architecture
    9. HiveQL
    10. Where are the Hive Tables Located?
    11. Hive Command-line Interface (CLI)
    12. The Beeline Command Shell
    13. Summary
  12. Hive Command-line Interface
    1. Hive Command-line Interface (CLI)
    2. The Hive Interactive Shell
    3. Running Host OS Commands from the Hive Shell
    4. Interfacing with HDFS from the Hive Shell
    5. Hive in Unattended Mode
    6. The Hive CLI Integration with the OS Shell
    7. Executing HiveQL Scripts
    8. Comments in Hive Scripts
    9. Variables and Properties in Hive CLI
    10. Setting Properties in CLI
    11. Example of Setting Properties in CLI
    12. Hive Namespaces
    13. Using the SET Command
    14. Setting Properties in the Shell
    15. Setting Properties for the New Shell Session
    16. Setting Alternative Hive Execution Engines
    17. The Beeline Shell
    18. Connecting to the Hive Server in Beeline
    19. Beeline Command Switches
    20. Beeline Internal Commands
    21. Summary
  13. Hive Data Definition Language
    1. Hive Data Definition Language
    2. Creating Databases in Hive
    3. Using Databases
    4. Creating Tables in Hive
    5. Supported Data Type Categories
    6. Common Numeric Types
    7. String and Date / Time Types
    8. Miscellaneous Types
    9. Example of the CREATE TABLE Statement
    10. Working with Complex Types
    11. Table Partitioning
    12. Table Partitioning
    13. Table Partitioning on Multiple Columns
    14. Viewing Table Partitions
    15. Row Format
    16. Data Serializers / Deserializers
    17. File Format Storage
    18. File Compression
    19. More on File Formats
    20. The ORC Data Format
    21. Converting Text to ORC Data Format
    22. The EXTERNAL DDL Parameter
    23. Example of Using EXTERNAL
    24. Creating an Empty Table
    25. Dropping a Table
    26. Table / Partition(s) Truncation
    27. Alter Table/Partition/Column
    28. Create View Statement
    29. Why Use Views?
    30. Restricting Amount of Viewable Data
    31. Examples of Restricting Amount of Viewable Data
    32. Creating and Dropping Indexes
    33. Describing Data
    34. Summary
  14. Hive Data Manipulation Language
    1. Hive Data Manipulation Language (DML)
    2. Using the LOAD DATA statement
    3. Example of Loading Data into a Hive Table
    4. Loading Data with the INSERT Statement
    5. Appending and Replacing Data with the INSERT Statement
    6. Examples of Using the INSERT Statement
    7. Multi Table Inserts
    8. Multi Table Inserts Syntax
    9. Multi Table Inserts Example
    10. Summary
  15. Hive Select Statement
    1. HiveQL
    2. The SELECT Statement Syntax
    3. The WHERE Clause
    4. Examples of the WHERE Clause
    5. Partition-based Queries
    6. Example of an Efficient SELECT Statement
    7. The DISTINCT Clause
    8. Supported Numeric Operators
    9. Built-in Mathematical Functions
    10. Built-in Aggregate Functions
    11. Built-in Statistical Functions
    12. Other Useful Built-in Functions
    13. The GROUP BY Clause
    14. The HAVING Clause
    15. The LIMIT Clause
    16. The ORDER BY Clause
    17. The JOIN Clause
    18. The CASE … Clause
    19. Example of CASE … Clause
    20. Summary
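HiveQL's SELECT, GROUP BY, HAVING, and ORDER BY clauses follow standard SQL semantics. As an illustrative sketch using Python's built-in sqlite3 engine (the `orders` table and its columns are made up for the example; this is not Hive itself, just the shared SQL behavior):

```python
import sqlite3

# Demonstrating SELECT / GROUP BY / HAVING / ORDER BY semantics with
# sqlite3. The table and data are hypothetical.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

# Total spend per customer, keeping only customers who spent more than 8:
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 8
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('alice', 17.5)]
```

The key Hive-specific differences covered in the module are around execution, not syntax: queries compile to MapReduce (or Tez/Spark) jobs, and partition-based WHERE clauses can prune entire directories of data before any job runs.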
  16. Apache Sqoop
    1. What is Sqoop?
    2. Apache Sqoop Logo
    3. Sqoop Import / Export
    4. Sqoop Help
    5. Examples of Using Sqoop Commands
    6. Data Import Example
    7. Fine-tuning Data Import
    8. Controlling the Number of Import Processes
    9. Data Splitting
    10. Helping Sqoop Out
    11. Example of Executing Sqoop Load in Parallel
    12. A Word of Caution: Avoid Complex Free-Form Queries
    13. Using Direct Export from Databases
    14. Example of Using Direct Export from MySQL
    15. More on Direct Mode Import
    16. Changing Data Types
    17. Example of Default Types Overriding
    18. File Formats
    19. The Apache Avro Serialization System
    20. Binary vs Text
    21. More on the SequenceFile Binary Format
    22. Generating the Java Table Record Source Code
    23. Data Export from HDFS
    24. Export Tool Common Arguments
    25. Data Export Control Arguments
    26. Data Export Example
    27. Using a Staging Table
    28. INSERT and UPDATE Statements
    29. INSERT Operations
    30. UPDATE Operations
    31. Example of the Update Operation
    32. Failed Exports
    33. Sqoop2
    34. Sqoop2 Architecture
    35. Summary
  17. Apache HBase
    1. What is HBase?
    2. HBase Design
    3. HBase Features
    4. HBase High Availability
    5. The Write-Ahead Log (WAL) and MemStore
    6. HBase vs RDBMS
    7. HBase vs Apache Cassandra
    8. Not Good Use Cases for HBase
    9. Interfacing with HBase
    10. HBase Thrift And REST Gateway
    11. HBase Table Design
    12. Column Families
    13. A Cell's Value Versioning
    14. Timestamps
    15. Accessing Cells
    16. HBase Table Design Digest
    17. Table Horizontal Partitioning with Regions
    18. HBase Compaction
    19. Loading Data in HBase
    20. Column Families Notes
    21. Rowkey Notes
    22. HBase Shell
    23. HBase Shell Command Groups
    24. Creating and Populating a Table in HBase Shell
    25. Getting a Cell's Value
    26. Counting Rows in an HBase Table
    27. Summary
  18. Apache HBase Java API
    1. HBase Java Client
    2. HBase Scanners
    3. Using ResultScanner Efficiently
    4. The Scan Class
    5. The KeyValue Class
    6. The Result Class
    7. Getting Versions of Cell Values Example
    8. The Cell Interface
    9. HBase Java Client Example
    10. Scanning the Table Rows
    11. Dropping a Table
    12. The Bytes Utility Class
    13. Summary
  19. Introduction to Functional Programming
    1. What is Functional Programming (FP)?
    2. Terminology: Higher-Order Functions
    3. Terminology: Lambda vs Closure
    4. A Short List of Languages that Support FP
    5. FP with Java
    6. FP With JavaScript
    7. Imperative Programming in JavaScript
    8. The JavaScript map (FP) Example
    9. The JavaScript reduce (FP) Example
    10. Using reduce to Flatten an Array of Arrays (FP) Example
    11. The JavaScript filter (FP) Example
    12. Common Higher-Order Functions in Python
    13. Common Higher-Order Functions in Scala
    14. Elements of FP in R
    15. Summary
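The higher-order functions this module surveys across languages look like the following in Python (mirroring the JavaScript map, reduce, filter, and flatten examples in the outline):

```python
from functools import reduce

# The common higher-order functions in Python: map, filter, reduce.

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda x: x * x, numbers))          # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, numbers))    # [2, 4]
total = reduce(lambda acc, x: acc + x, numbers, 0)     # 15

# Flattening a list of lists with reduce, mirroring the JavaScript
# "flatten an array of arrays" example:
flat = reduce(lambda acc, xs: acc + xs, [[1, 2], [3], [4, 5]], [])
```

These are exactly the operations Spark generalizes to distributed datasets, which is why this module precedes the Spark introduction.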
  20. Introduction to Apache Spark
    1. What is Apache Spark?
    2. A Short History of Spark
    3. Where to Get Spark?
    4. The Spark Platform
    5. Spark Logo
    6. Common Spark Use Cases
    7. Languages Supported by Spark
    8. Running Spark on a Cluster
    9. The Driver Process
    10. Spark Applications
    11. Spark Shell
    12. The spark-submit Tool
    13. The spark-submit Tool Configuration
    14. The Executor and Worker Processes
    15. The Spark Application Architecture
    16. Interfaces with Data Storage Systems
    17. Limitations of Hadoop's MapReduce
    18. Spark vs MapReduce
    19. Spark as an Alternative to Apache Tez
    20. The Resilient Distributed Dataset (RDD)
    21. Spark Streaming (Micro-batching)
    22. Spark SQL
    23. Example of Spark SQL
    24. Spark Machine Learning Library
    25. GraphX
    26. Spark vs R
    27. Summary
  21. The Spark Shell
    1. The Spark Shell
    2. The Spark Shell UI
    3. Spark Shell Options
    4. Getting Help
    5. The Spark Context (sc) and SQL Context (sqlContext)
    6. The Shell Spark Context
    7. Loading Files
    8. Saving Files
    9. Basic Spark ETL Operations
    10. Summary
  22. Spark RDDs
    1. The Resilient Distributed Dataset (RDD)
    2. Ways to Create an RDD
    3. Custom RDDs
    4. Supported Data Types
    5. RDD Operations
    6. RDDs are Immutable
    7. Spark Actions
    8. RDD Transformations
    9. Other RDD Operations
    10. Chaining RDD Operations
    11. RDD Lineage
    12. The Big Picture
    13. What May Go Wrong
    14. Checkpointing RDDs
    15. Local Checkpointing
    16. Parallelized Collections
    17. More on parallelize() Method
    18. The Pair RDD
    19. Where do I use Pair RDDs?
    20. Example of Creating a Pair RDD with Map
    21. Example of Creating a Pair RDD with keyBy
    22. Miscellaneous Pair RDD Operations
    23. RDD Caching
    24. RDD Persistence
    25. The Tachyon Storage
    26. Summary
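The RDD transformations and pair-RDD operations this module covers can be modeled locally to show their semantics. The sketch below is plain Python, not PySpark: the function names mimic Spark's map, filter, and reduceByKey, but a real Spark job would apply them lazily to partitioned data across a cluster.

```python
from collections import defaultdict

# Local model of a few RDD operations (map, filter, and the pair-RDD
# reduceByKey) to illustrate semantics. Not actual Spark code.

def rdd_map(data, fn):
    return [fn(x) for x in data]

def rdd_filter(data, pred):
    return [x for x in data if pred(x)]

def reduce_by_key(pairs, fn):
    """Group pair-RDD elements by key, then fold each group's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: _fold(vs, fn) for k, vs in groups.items()}

def _fold(values, fn):
    acc = values[0]
    for v in values[1:]:
        acc = fn(acc, v)
    return acc

words = ["spark", "rdd", "spark"]
pairs = rdd_map(words, lambda w: (w, 1))        # a pair RDD: [(word, 1), ...]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts["spark"])  # 2
```

The important differences in real Spark are covered in the module: RDDs are immutable and lazily evaluated, transformations only record lineage, and nothing executes until an action (such as collect or count) is called.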
  23. Parallel Data Processing with Spark
    1. Running Spark on a Cluster
    2. Spark Stand-alone Option
    3. The High-Level Execution Flow in Stand-alone Spark Cluster
    4. Data Partitioning
    5. Data Partitioning Diagram
    6. Single Local File System RDD Partitioning
    7. Multiple File RDD Partitioning
    8. Special Cases for Small-sized Files
    9. Parallel Data Processing of Partitions
    10. Spark Application, Jobs, and Tasks
    11. Stages and Shuffles
    12. The "Big Picture"
    13. Summary
  24. Introduction to Spark SQL
    1. What is Spark SQL?
    2. Uniform Data Access with Spark SQL
    3. Hive Integration
    4. Hive Interface
    5. Integration with BI Tools
    6. Spark SQL is No Longer an Experimental Developer API!
    7. What is a DataFrame?
    8. The SQLContext Object
    9. The SQLContext API
    10. Changes Between Spark SQL 1.3 and 1.4
    11. Example of Spark SQL (Scala Example)
    12. Example of Working with a JSON File
    13. Example of Working with a Parquet File
    14. Using JDBC Sources
    15. JDBC Connection Example
    16. Performance & Scalability of Spark SQL
    17. Summary
Class Materials

Each student in our Live Online and our Onsite classes receives a comprehensive set of materials, including course notes and all the class examples.

Class Prerequisites

Experience in the following is required for this Hadoop class:

  • General knowledge of programming in Java and SQL, as well as experience working in Unix environments (e.g., running shell commands).

Training for your Team

Length: 5 Days
  • Private Class for your Team
  • Online or On-location
  • Customizable
  • Expert Instructors

What people say about our training

The best short, intensive course, on or offline. Excellent instructor.
Althea Thompson
Universal American
The instructor's teaching style made the material interesting to learn. The layout of the training manual made it easy to understand various concepts.
Angela Toppins
Genesco Inc.
The instructor's demeanor was very approachable, and he was not afraid to take questions that might have been off-topic from the curriculum.
Jim McCloskey
Southeastern Orthopedic Center
The instructor was great. Wish she could teach all the classes at NOAA.
Janet Sharp

No cancellation for low enrollment

Certified Microsoft Partner

Registered Education Provider (R.E.P.)

GSA schedule pricing




Satisfaction guarantee and retake option


Students rated our trainers 9.30 out of 10 based on 30,402 reviews

Contact Us or call 1-877-932-8228