Data Science and Big Data Analytics Training

This Data Science and Big Data Analytics training class teaches the theoretical and technical aspects of Data Science and Business Analytics. The course covers the fundamental and advanced concepts and methods of deriving business insights from "big" and/or "small" data. This course is appropriate for Data Scientists, Software Developers, IT Architects, and Technical Managers.


Public Classes: Delivered live online via WebEx and guaranteed to run. Join from anywhere!

Private Classes: Delivered at your offices, or any other location of your choice.

Class Goals

  1. Apply data science and business analytics.
  2. Learn to develop algorithms, techniques, and common analytical methods.
  3. Obtain introductory knowledge of machine learning.
  4. Visualize and report processed results.
  5. Learn the R programming language.
  6. Learn data analysis with R.
  7. Learn the elements of functional programming.
  8. Obtain introductory knowledge of Apache Spark.
  9. Learn Spark SQL.
  10. Learn ETL with Spark.
  11. Learn the MLlib machine learning library.
  12. Learn graph processing with GraphX.

Class Outline

  1. Data Science Algorithms and Analytical Methods
    1. Supervised vs Unsupervised Machine Learning
    2. Supervised Machine Learning Algorithms
    3. Unsupervised Machine Learning Algorithms
    4. Choose the Right Algorithm
    5. Life-cycles of Machine Learning Development
    6. Classifying with k-Nearest Neighbors (SL)
    7. k-Nearest Neighbors Algorithm
    8. k-Nearest Neighbors Algorithm (Cont'd)
    9. The Error Rate
    10. Decision Trees (SL)
    11. Using Decision Trees
    12. Random Forests
    13. Naive Bayes Classifier (SL)
    14. Classification of Documents with Naive Bayes
    15. Unsupervised Learning Type: Clustering
    16. K-Means Clustering (UL)
    17. K-Means Clustering in a Nutshell
    18. K-Means Clustering in a Nutshell (Cont'd)
    19. Regression Analysis
    20. Types of Regression
    21. Simple Linear Regression Model
    22. Linear Regression Illustration
    23. Least-Squares Method (LSM)
    24. LSM Assumptions
    25. Fitting Linear Regression Models in R
    26. Example of Using R's lm() Function
    27. Example of Using lm() with a Data Frame
    28. Regression Models in Excel
    29. Logistic Regression
    30. Regression vs Classification
    31. Time-Series Analysis
    32. Decomposing Time-Series
    33. Decomposing Time-Series (Cont'd)
    34. Summary
  2. Getting Started with R
    1. Introduction
    2. Positioning of R in the Data Science Arena
    3. R Integrated Development Environments
    4. Running R
    5. Running RStudio
    6. Ending the Current R Session
    7. Getting Help
    8. Getting System Information
    9. General Notes on R Commands and Statements
    10. R Data Structures
    11. R Objects and Workspace
    12. Assignment Operators
    13. Assignment Example
    14. Arithmetic Operators
    15. Logical Operators
    16. System Date and Time
    17. Operations
    18. User-defined Functions
    19. User-defined Function Example
    20. R Code Example
    21. Type Conversion (Coercion)
    22. Control Statements
    23. Conditional Execution
    24. Repetitive Execution
    25. Repetitive Execution (Cont'd)
    26. Built-in Functions
    27. Reading Data from Files into Vectors
    28. Example of Reading Data from a File
    29. Writing Data to a File
    30. Example of Writing Data to a File
    31. Logical Vectors
    32. Character Vectors
    33. Matrix Data Structure
    34. Creating Matrices
    35. Working with Data Frames
    36. Matrices vs Data Frames
    37. A Data Frame Sample
    38. Accessing Data Cells
    39. Getting Info About a Data Frame
    40. Selecting Columns in Data Frames
    41. Selecting Rows in Data Frames
    42. Getting a Subset of a Data Frame
    43. Sorting (ordering) Data in Data Frames by Attribute(s)
    44. Applying Functions to Matrices and Data Frames
    45. Using the apply() Function
    46. Example of Using apply()
    47. Executing External R commands
    48. Loading External Scripts in RStudio
    49. Listing Objects in Workspace
    50. Removing Objects in Workspace
    51. Saving Your Workspace in R
    52. Saving Your Workspace in RStudio
    53. Saving Your Workspace in R GUI
    54. Loading Your Workspace
    55. Loading Your Workspace (Cont'd)
    56. Hands-on Exercises
    57. Getting and Setting the Working Directory
    58. Getting the List of Files in a Directory
    59. Diverting Output to a File
    60. Batch (Unattended) Processing
    61. Importing Data into R
    62. Exporting Data from R
    63. Hands-on Exercise
    64. Standard R Packages
    65. Extending R
    66. Extending R in R GUI
    67. Extending R in RStudio
    68. CRAN Page
    69. Summary
  3. Text Mining
    1. What is Text Mining?
    2. The Common Text Mining Tasks
    3. What is Natural Language Processing (NLP)?
    4. Some of the NLP Use Cases
    5. Machine Learning in Text Mining and NLP
    6. Machine Learning in NLP
    7. TF-IDF
    8. The Feature Hashing Trick
    9. Stemming
    10. Example of Stemming
    11. Stop Words
    12. Popular Text Mining and NLP Libraries and Packages
    13. Summary
  4. Introduction to Functional Programming
    1. What is Functional Programming (FP)?
    2. Terminology: Higher-Order Functions
    3. Terminology: Lambda vs Closure
    4. A Short List of Languages that Support FP
    5. FP with Java
    6. FP with JavaScript
    7. Imperative Programming in JavaScript
    8. The JavaScript map (FP) Example
    9. The JavaScript reduce (FP) Example
    10. Using reduce to Flatten an Array of Arrays (FP) Example
    11. The JavaScript filter (FP) Example
    12. Common Higher-Order Functions in Python
    13. Common Higher-Order Functions in Scala
    14. Elements of FP in R
    15. Summary
  5. What is NoSQL?
    1. Limitations of Relational Databases
    2. Limitations of Relational Databases (Cont'd)
    3. Defining NoSQL
    4. What are NoSQL (Not Only SQL) Databases?
    5. The Past and Present of the NoSQL World
    6. NoSQL Database Properties
    7. NoSQL Benefits
    8. NoSQL Benefits (Cont'd)
    9. NoSQL Database Storage Types
    10. NoSQL Database Storage Types (Cont'd)
    11. The CAP Theorem
    12. NoSQL Systems CAP Triangle
    13. Mechanisms to Guarantee a Single CAP Property
    14. Limitations of NoSQL Databases
    15. Big Data Sharding
    16. Sharding Example
    17. Quiz
    18. Quiz Answers
    19. Summary
  6. MapReduce Overview
    1. The Client-Server Processing Pattern
    2. Distributed Computing Challenges
    3. MapReduce Defined
    4. Google's MapReduce
    5. MapReduce Phases
    6. The Map Phase
    7. The Reduce Phase
    8. MapReduce Word Count Job
    9. MapReduce Shared-Nothing Architecture
    10. Similarity with SQL Aggregation Operations
    11. Example of Map & Reduce Operations using JavaScript
    12. Example of Map & Reduce Operations using JavaScript (Cont'd)
    13. Problems Suitable for Solving with MapReduce
    14. Typical MapReduce Jobs
    15. Fault-tolerance of MapReduce
    16. Distributed Computing Economics
    17. MapReduce Systems
    18. Summary
  7. Hadoop Overview
    1. Apache Hadoop
    2. Apache Hadoop Logo
    3. Typical Hadoop Applications
    4. Hadoop Clusters
    5. Hadoop Design Principles
    6. Hadoop Versions
    7. Hadoop's Main Components
    8. Hadoop Simple Definition
    9. Side-by-Side Comparison: Hadoop 1 and Hadoop 2
    10. Hadoop-based Systems for Data Analysis
    11. Other Hadoop Ecosystem Projects
    12. Hadoop Caveats
    13. Hadoop Distributions
    14. Cloudera Distribution of Hadoop (CDH)
    15. Cloudera Distributions
    16. Hortonworks Data Platform (HDP)
    17. MapR
    18. Summary
  8. Hadoop Distributed File System Overview
    1. Hadoop Distributed File System (HDFS)
    2. HDFS Considerations
    3. HDFS High Availability
    4. Storing Raw Data in HDFS
    5. HDFS Security
    6. HDFS Rack-awareness
    7. Data Blocks
    8. Data Block Replication Example
    9. HDFS NameNode Directory Diagram
    10. File Metadata Records (Conceptual View)
    11. NameNode Meta Information Size
    12. HDFS Balancing
    13. Accessing HDFS
    14. Examples of HDFS Commands
    15. Other Supported File Systems
    16. WebHDFS
    17. Examples of WebHDFS Calls
    18. HDFS Daemon Web UI Ports
    19. Viewing Replication Factor and Block Size in NameNode Web UI
    20. HDFS Write Operation
    21. HDFS Read Operation
    22. Read Operation Sequence Diagram
    23. Communication inside HDFS
    24. Summary
  9. MapReduce with Hadoop
    1. Hadoop's MapReduce
    2. MapReduce 1 and MapReduce 2
    3. Why Do I Need a Discussion of the Old MapReduce?
    4. MapReduce v1 ("Classic MapReduce")
    5. JobTracker and TaskTracker (the "Classic MapReduce")
    6. YARN (MapReduce v2)
    7. YARN vs MR1
    8. YARN As Data Operating System
    9. MapReduce Programming Options
    10. Hadoop's Streaming MapReduce
    11. Python Word Count Mapper Program Example
    12. Python Word Count Reducer Program Example
    13. Setting up Java Classpath for Streaming Support
    14. Streaming Use Cases
    15. The Streaming API vs Java MapReduce API
    16. Amazon Elastic MapReduce
    17. Amazon Elastic MapReduce (Cont'd)
    18. Apache Tez
    19. Summary
  10. Apache Pig Scripting Platform
    1. What is Pig?
    2. Pig Latin
    3. Apache Pig Logo
    4. Pig Execution Modes
    5. Local Execution Mode
    6. MapReduce Execution Mode
    7. Running Pig
    8. Running Pig in Batch Mode
    9. What is Grunt?
    10. Pig Latin Statements
    11. Pig Programs
    12. Pig Latin Script Example
    13. SQL Equivalent
    14. Differences between Pig and SQL
    15. Statement Processing in Pig
    16. Comments in Pig
    17. Supported Simple Data Types
    18. Supported Complex Data Types
    19. Arrays
    20. Defining Relation's Schema
    21. Not Matching the Defined Schema
    22. The bytearray Generic Type
    23. Using Field Delimiters
    24. Loading Data with TextLoader()
    25. Referencing Fields in Relations
    26. Summary
  11. Apache Pig Relational and Eval Operators
    1. Pig Relational Operators
    2. Example of Using the JOIN Operator
    3. Example of Using the JOIN Operator (Cont'd)
    4. Example of Using the Order By Operator
    5. Caveats of Using Relational Operators
    6. Pig Eval Functions
    7. Caveats of Using Eval Functions (Operators)
    8. Example of Using Single-column Eval Operations
    9. Example of Using Eval Operators For Global Operations
    10. Summary
  12. Hive
    1. What is Hive?
    2. Apache Hive Logo
    3. Hive's Value Proposition
    4. Who uses Hive?
    5. What Hive Does Not Have
    6. Hive's Main Sub-Systems
    7. Hive Features
    8. The "Classic" Hive Architecture
    9. The New Hive Architecture
    10. HiveQL
    11. Where are the Hive Tables Located?
    12. Hive Command-line Interface (CLI)
    13. The Beeline Command Shell
    14. Summary
  13. Hive Command-line Interface
    1. Hive Command-line Interface (CLI)
    2. The Hive Interactive Shell
    3. Running Host OS Commands from the Hive Shell
    4. Interfacing with HDFS from the Hive Shell
    5. The Hive in Unattended Mode
    6. The Hive CLI Integration with the OS Shell
    7. Executing HiveQL Scripts
    8. Comments in Hive Scripts
    9. Variables and Properties in Hive CLI
    10. Setting Properties in CLI
    11. Example of Setting Properties in CLI
    12. Hive Namespaces
    13. Using the SET Command
    14. Setting Properties in the Shell
    15. Setting Properties for the New Shell Session
    16. Setting Alternative Hive Execution Engines
    17. The Beeline Shell
    18. Connecting to the Hive Server in Beeline
    19. Beeline Command Switches
    20. Beeline Internal Commands
    21. Summary
  14. Hive Data Definition Language
    1. Hive Data Definition Language
    2. Creating Databases in Hive
    3. Using Databases
    4. Creating Tables in Hive
    5. Supported Data Type Categories
    6. Common Numeric Types
    7. String and Date / Time Types
    8. Miscellaneous Types
    9. Example of the CREATE TABLE Statement
    10. Working with Complex Types
    11. Working with Complex Types (Cont'd)
    12. Table Partitioning
    13. Table Partitioning (Cont'd)
    14. Table Partitioning on Multiple Columns
    15. Viewing Table Partitions
    16. Row Format
    17. Data Serializers / Deserializers
    18. File Format Storage
    19. File Compression
    20. More on File Formats
    21. The ORC Data Format
    22. Converting Text to ORC Data Format
    23. The EXTERNAL DDL Parameter
    24. Example of Using EXTERNAL
    25. Creating an Empty Table
    26. Dropping a Table
    27. Table / Partition(s) Truncation
    28. Alter Table/Partition/Column
    30. Create View Statement
    31. Why Use Views?
    32. Restricting Amount of Viewable Data
    33. Examples of Restricting Amount of Viewable Data
    34. Creating and Dropping Indexes
    35. Describing Data
    36. Summary
  15. Apache Sqoop
    1. What is Sqoop?
    2. Apache Sqoop Logo
    3. Sqoop Import / Export
    4. Sqoop Help
    5. Examples of Using Sqoop Commands
    6. Data Import Example
    7. Fine-tuning Data Import
    8. Controlling the Number of Import Processes
    9. Data Splitting
    10. Helping Sqoop Out
    11. Example of Executing Sqoop Load in Parallel
    12. A Word of Caution: Avoid Complex Free-Form Queries
    13. Using Direct Export from Databases
    14. Example of Using Direct Export from MySQL
    15. More on Direct Mode Import
    16. Data Export from HDFS
    17. Export Tool Common Arguments
    18. Data Export Control Arguments
    19. Data Export Example
    20. INSERT and UPDATE Statements
    21. INSERT Operations
    22. UPDATE Operations
    23. Example of the Update Operation
    24. Failed Exports
    25. Sqoop2
    26. Summary
  16. Introduction to Apache Spark
    1. What is Apache Spark
    2. A Short History of Spark
    3. Where to Get Spark?
    4. The Spark Platform
    5. Spark Logo
    6. Common Spark Use Cases
    7. Languages Supported by Spark
    8. Running Spark on a Cluster
    9. The Driver Process
    10. Spark Applications
    11. Spark Shell
    12. The spark-submit Tool
    13. The spark-submit Tool Configuration
    14. The Executor and Worker Processes
    15. The Spark Application Architecture
    16. Interfaces with Data Storage Systems
    17. Limitations of Hadoop's MapReduce
    18. Spark vs MapReduce
    19. Spark as an Alternative to Apache Tez
    20. The Resilient Distributed Dataset (RDD)
    21. Spark Streaming (Micro-batching)
    22. Spark SQL
    23. Example of Spark SQL
    24. Spark Machine Learning Library
    25. GraphX
    26. Spark vs R
    27. Summary
  17. The Spark Shell
    1. The Spark Shell
    2. The Spark Shell (Cont'd)
    3. The Spark Shell UI
    4. Spark Shell Options
    5. Getting Help
    6. The Spark Context (sc) and SQL Context (sqlContext)
    7. The Shell Spark Context
    8. Loading Files
    9. Saving Files
    10. Basic Spark ETL Operations
    11. Summary
  18. Spark RDDs
    1. The Resilient Distributed Dataset (RDD)
    2. Ways to Create an RDD
    3. Custom RDDs
    4. Supported Data Types
    5. RDD Operations
    6. RDDs are Immutable
    7. Spark Actions
    8. RDD Transformations
    9. RDD Transformations (Cont'd)
    10. Other RDD Operations
    11. Chaining RDD Operations
    12. RDD Lineage
    13. The Big Picture
    14. What May Go Wrong
    15. Checkpointing RDDs
    16. Local Checkpointing
    17. Parallelized Collections
    18. More on parallelize() Method
    19. The Pair RDD
    20. Where do I use Pair RDDs?
    21. Example of Creating a Pair RDD with Map
    22. Example of Creating a Pair RDD with keyBy
    23. Miscellaneous Pair RDD Operations
    24. RDD Caching
    25. RDD Persistence
    26. The Tachyon Storage
    27. Summary
  19. Parallel Data Processing with Spark
    1. Running Spark on a Cluster
    2. Spark Stand-alone Option
    3. The High-Level Execution Flow in Stand-alone Spark Cluster
    4. Data Partitioning
    5. Data Partitioning Diagram
    6. Single Local File System RDD Partitioning
    7. Multiple File RDD Partitioning
    8. Special Cases for Small-sized Files
    9. Parallel Data Processing of Partitions
    10. Parallel Data Processing of Partitions (Cont'd)
    11. Spark Application, Jobs, and Tasks
    12. Stages and Shuffles
    13. The "Big Picture"
    14. Summary
  20. Shared Variables in Spark
    1. Shared Variables in Spark
    2. Broadcast Variables
    3. Creating and Using Broadcast Variables
    4. Example of Using Broadcast Variables
    5. Accumulators
    6. Creating and Using Accumulators
    7. Example of Using Accumulators
    8. Custom Accumulators
    9. Summary
  21. Introduction to Spark SQL
    1. What is Spark SQL?
    2. What is Spark SQL? (Cont'd)
    3. Uniform Data Access with Spark SQL
    4. Hive Integration
    5. Hive Interface
    6. Integration with BI Tools
    7. Spark SQL is No Longer an Experimental Developer API!
    8. What is a DataFrame?
    9. The SQLContext Object
    10. The SQLContext API
    11. Changes from Spark SQL 1.3 to 1.4
    12. Example of Spark SQL (Scala Example)
    13. Example of Working with a JSON File
    14. Example of Working with a Parquet File
    15. Using JDBC Sources
    16. JDBC Connection Example
    17. Performance & Scalability of Spark SQL
    18. Summary
  22. Graph Processing with GraphX
    1. What is GraphX?
    2. Supported Languages
    3. Vertices and Edges
    4. Graph Terminology
    5. Example of Property Graph
    6. The GraphX API
    7. The GraphX Views
    8. The Triplet View
    9. Graph Algorithms
    10. Graphs and RDDs
    11. Constructing Graphs
    12. Graph Operators
    13. Example of Using GraphX Operators
    14. GraphX Performance Optimization
    15. The PageRank Algorithm
    16. GraphX Support for PageRank
    17. Summary
  23. The Spark Machine Learning Library
    1. What is MLlib?
    2. Supported Languages
    3. MLlib Packages
    4. Dense and Sparse Vectors
    5. Labeled Point
    6. Python Example of Using the LabeledPoint Class
    7. LIBSVM Format
    8. An Example of a LIBSVM File
    9. Loading LIBSVM Files
    10. Local Matrices
    11. Example of Creating Matrices in MLlib
    12. Distributed Matrices
    13. Example of Using a Distributed Matrix
    14. Classification and Regression Algorithms
    15. Clustering
    16. Summary
  24. Machine Learning with BigML
    1. What is BigML?
    2. How BigML Service Works
    3. Data Files
    4. Data Sets
    5. Data Sets Example
    6. Models
    7. Predictions
    8. The Prediction UI Form
    9. Text Analysis in BigML
    10. REST API
    11. Summary
Class Materials

Each student in our Live Online and our Onsite classes receives a comprehensive set of materials, including course notes and all the class examples.

Class Prerequisites

Experience in the following is required for this Data Science and Big Data Analytics class:

  • General knowledge of statistics and programming.

Training for Yourself

$3,125.00 or 5 vouchers

Training for your Team

Length: 5 Days
  • Private Class for your Team
  • Online or On-location
  • Customizable
  • Expert Instructors

What people say about our training

The instructor was very helpful with all my questions and very patient. We were able to use some great graphics for animation.
Clara Mizenko
An excellent class. Provides current information and practical techniques.
Steve Branson
California Dept of Health Care Services - ITSD
This class provided a terrific overview of MS Project. The instructor had a wealth of experience. I would definitely recommend this instructor, this class, and Webucator for anyone looking to learn more about their Project toolkit.
Yreka Sisson
Jack Henry & Associates
Business Writing provided by Webucator was a great recap of the do's and don'ts. Definitely recommend the class.
Alex Irizarry
Catapult Systems

No cancelation for low enrollment

Certified Microsoft Partner

Registered Education Provider (R.E.P.)

GSA schedule pricing


Satisfaction guarantee and retake option


Students rated our trainers 9.30 out of 10 based on 30,188 reviews

Contact Us or call 1-877-932-8228