This intensive Practical Machine Learning with Apache Spark training class introduces the audience to the core aspects of scalable data processing using Python on the Apache Spark platform.

The audience for this class is data scientists, business analysts, software developers, and IT architects.

**Public Classes:** Delivered live online via WebEx and guaranteed to run. Join from anywhere!

**Private Classes:** Delivered at your offices, or any other location of your choice.

- Python essentials
- Capabilities of the Apache Spark platform and its machine learning module
- Terminology, concepts, and algorithms used in machine learning

- Defining Data Science
- Data Science, Machine Learning, AI?
- The Data-Related Roles
- Data Science Ecosystem
- Business Analytics vs. Data Science
- Who is a Data Scientist?
- The Break-Down of Data Science Project Activities
- Data Scientists at Work
- The Data Engineer Role
- What is Data Wrangling (Munging)?
- Examples of Data Science Projects
- Data Science Gotchas
- Summary

- Machine Learning Life-cycle Phases
- Data Analytics Pipeline
- Data Discovery Phase
- Data Harvesting Phase
- Data Priming Phase
- Data Cleansing
- Feature Engineering
- Data Logistics and Data Governance
- Exploratory Data Analysis
- Model Planning Phase
- Model Building Phase
- Communicating the Results
- Production Roll-out
- Summary

- Quick Introduction to Python Programming
- Module Overview
- Some Basic Facts about Python
- Dynamic Typing Examples
- Code Blocks and Indentation
- Importing Modules
- Lists and Tuples
- Dictionaries
- List Comprehension
- What is Functional Programming (FP)?
- Terminology: Higher-Order Functions
- A Short List of Languages that Support FP
- Lambda
- Common Higher-Order Functions in Python 3
- Summary
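
As a rough taste of the Python topics listed above (list comprehension, lambda, and higher-order functions), here is a minimal sketch. The `readings` list and all variable names are hypothetical and are not taken from the course materials.

```python
from functools import reduce

# Hypothetical sample data, not taken from the course materials.
readings = [3, 8, 15, 4, 23]

# List comprehension: build a new list by squaring every element.
squares = [x * x for x in readings]

# map() and filter() are higher-order functions: each takes another
# function (here a lambda) as its first argument.
doubled = list(map(lambda x: x * 2, readings))
large = list(filter(lambda x: x > 5, readings))

# reduce() folds the list into a single value (here, the sum).
total = reduce(lambda acc, x: acc + x, readings, 0)

print(squares, doubled, large, total)
```

In Python 3, `map()` and `filter()` return lazy iterators, which is why the sketch wraps them in `list()`, and `reduce()` has to be imported from `functools`.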

- Introduction to Apache Spark
- What is Apache Spark
- Where to Get Spark?
- The Spark Platform
- Spark Logo
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Driver Process
- Spark Applications
- Spark Shell
- The spark-submit Tool
- The spark-submit Tool Configuration
- The Executor and Worker Processes
- The Spark Application Architecture
- Interfaces with Data Storage Systems
- Limitations of Hadoop's MapReduce
- Spark vs MapReduce
- Spark as an Alternative to Apache Tez
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark SQL
- Spark Machine Learning Library
- GraphX
- Summary

- The Spark Shell
- The Spark Shell
- The Spark v.2+ Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- The Spark Context (sc) and Spark Session (spark)
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files
- Saving Files
- Summary

- Quick Intro to Jupyter Notebooks
- Python Dev Tools and REPLs
- IPython
- Jupyter
- Jupyter Operation Modes
- Basic Edit Mode Shortcuts
- Basic Command Mode Shortcuts
- Summary

- Data Visualization in Python using matplotlib
- Data Visualization
- What is matplotlib?
- Getting Started with matplotlib
- The matplotlib.pyplot.plot() Function
- The matplotlib.pyplot.scatter() Function
- Labels and Titles
- Styles
- The matplotlib.pyplot.bar() Function
- The matplotlib.pyplot.hist() Function
- The matplotlib.pyplot.pie() Function
- The Figure Object
- The matplotlib.pyplot.subplot() Function
- Selecting a Grid Cell
- Saving Figures to a File
- Summary

- Data Science and ML Algorithms with PySpark
- In-Class Discussion
- Types of Machine Learning
- Supervised vs Unsupervised Machine Learning
- Supervised Machine Learning Algorithms
- Classification (Supervised ML) Examples
- Unsupervised Machine Learning Algorithms
- Clustering (Unsupervised ML) Examples
- Choosing the Right Algorithm
- Terminology: Observations, Features, and Targets
- Representing Observations
- Terminology: Labels
- Terminology: Continuous and Categorical Features
- Continuous Features
- Categorical Features
- Common Distance Metrics
- The Euclidean Distance
- What is a Model
- Model Evaluation
- The Classification Error Rate
- Data Split for Training and Test Data Sets
- Data Splitting in PySpark
- Hold-Out Data
- Cross-Validation Technique
- Spark ML Overview
- DataFrame-based API is the Primary Spark ML API
- Estimators, Models, and Predictors
- Descriptive Statistics
- Data Visualization and EDA
- Correlations
- Hands-on Exercise
- Feature Engineering
- Scaling of the Features
- Feature Blending (Creating Synthetic Features)
- Hands-on Exercise
- The 'One-Hot' Encoding Scheme
- Example of 'One-Hot' Encoding Scheme
- Bias-Variance (Underfitting vs Overfitting) Trade-off
- The Modeling Error Factors
- One Way to Visualize Bias and Variance
- Underfitting vs Overfitting Visualization
- Balancing Off the Bias-Variance Ratio
- Linear Model Regularization
- ML Model Tuning Visually
- Linear Model Regularization in Spark
- Regularization, Take Two
- Dimensionality Reduction
- PCA and isomap
- The Advantages of Dimensionality Reduction
- Spark Dense and Sparse Vectors
- Labeled Point
- Python Example of Using the LabeledPoint Class
- The LIBSVM format
- LIBSVM in PySpark
- Example of Reading a File In LIBSVM Format
- Life-cycles of Machine Learning Development
- Regression Analysis
- Regression vs Correlation
- Regression vs Classification
- Simple Linear Regression Model
- Linear Regression Illustration
- Least-Squares Method (LSM)
- Gradient Descent Optimization
- Locally Weighted Linear Regression
- Regression Models in Excel
- Multiple Regression Analysis
- Evaluating Regression Model Accuracy
- The R² Model Score
- The MSE Model Score
- Hands-on Exercise
- Linear Logistic (Logit) Regression
- Interpreting Logistic Regression Results
- Hands-on Exercise
- Naive Bayes Classifier (SL)
- Naive Bayesian Probabilistic Model in a Nutshell
- Bayes Formula
- Classification of Documents with Naive Bayes
- Hands-on Exercise
- Decision Trees
- Decision Tree Terminology
- Properties of Decision Trees
- Decision Tree Classification in the Context of Information Theory
- The Simplified Decision Tree Algorithm
- Using Decision Trees
- Random Forests
- Hands-On Exercise
- Support Vector Machines (SVMs)
- Hands-On Exercise
- Unsupervised Learning Type: Clustering
- k-Means Clustering (UL)
- k-Means Clustering in a Nutshell
- k-Means Characteristics
- Global vs Local Minimum Explained
- Hands-On Exercise
- Time-Series Analysis
- Decomposing Time-Series
- A Better Algorithm or More Data?
- Summary
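
To make a few of the terms in this module concrete (the Euclidean distance, splitting data into training and test sets, and the classification error rate), here is a minimal sketch. It uses plain Python rather than PySpark so that it runs without a Spark installation. The helper names `euclidean_distance`, `train_test_split`, and `error_rate` are hypothetical and are not part of Spark or of the course materials.

```python
import math
import random


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(predicted, actual):
    """Fraction of observations whose predicted label is wrong."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Hypothetical observations: ten rows with a single numeric feature each.
rows = [[float(i)] for i in range(10)]
train, test = train_test_split(rows)

print(euclidean_distance((0, 0), (3, 4)))
print(len(train), len(test))
print(error_rate([1, 0, 1, 1], [1, 1, 1, 0]))
```

The split is exact here because the sketch counts rows after shuffling. A distributed split that samples each row independently would typically produce only approximately the requested proportions.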

Each student in our Live Online and our Onsite classes receives a comprehensive set of materials, including course notes and all the class examples.

Experience in the following *is required* for this Spark class:

- General knowledge of statistics and programming.

