Data Engineering Bootcamp (DSC113)

Course Length: 5 days

Delivery Methods: Available as private class only

Course Overview

This five-day Data Engineering Bootcamp is supplemented by hands-on labs that help attendees reinforce the material they have learned.

Course Benefits

Learn about the world of data engineering and how to build and maintain the data infrastructure that holds your enterprise's advanced analytics capabilities together.

Course Outline

Introduction – The Big Data Landscape and Key Components
1. The Big Data Ecosystem at CVS
2. YARN, Spark, Spark Streaming, Kafka
3. Containers: Docker/Kubernetes
4. Monitoring and Logging: Prometheus
Data Engineering Defined
1. Data is King
2. Translating Data into Business Insights
3. What is Data Engineering
4. The Data-Related Roles
5. The Data Science Skill Sets
6. The Data Engineer Role
7. An Example of a Data Product
8. Data Schema for Data Exchange Interoperability
9. The Data Exchange Interoperability Options
10. Big Data and NoSQL
11. Data Physics
12. The Traditional Client-Server Processing Pattern
13. Data Locality (Distributed Computing Economics)
14. The CAP Theorem
15. Mechanisms to Guarantee a Single CAP Property
16. The CAP Triangle
17. Eventual Consistency
Data Processing Phases
1. Typical Data Processing Pipeline
2. Data Discovery Phase
3. Data Harvesting Phase
4. Data Priming Phase
5. Data Logistics and Data Governance
6. Exploratory Data Analysis
7. Model Planning Phase
8. Model Building Phase
9. Communicating the Results
10. Production Roll-out
Core Data Engineering Tasks
1. Data acquisition in Python
2. Database and Web interfaces
3. Ensuring data quality
4. Repairing and normalizing data
5. Python features for computing descriptive statistics
6. Processing data at scale
Functional Programming Primer
1. What is Functional Programming
2. Benefits of Functional Programming
3. Functions as Data
4. Using the map Function
5. Using the filter Function
6. Lambda expressions
7. list.sort() Using a Lambda Expression
8. Difference Between Simple Loops and map/filter Type Functions
9. Additional Functions
10. Summary
Introduction to PySpark
1. What is Apache Spark
2. Spark use cases
3. Architectural overview
PySpark Shell
1. What is the PySpark Shell
2. Starting and using the shell
3. Spark context
4. PySpark Shell vs Spark Shell
Resilient Distributed Dataset
1. What are Resilient Distributed Datasets (RDDs)
2. Creating RDDs
3. Transformations and operations
Parallel Processing
1. Spark cluster
2. Data partitioning
3. Applications, jobs and tasks
Shared Variables
1. What are shared variables
2. Broadcast variables
3. Accumulators
Spark SQL
1. What is Spark SQL
2. Uniform data
3. Hive
4. SQL Context object
The Spark Machine Learning Library
1. What is MLlib?
2. Supported Languages
3. MLlib Packages
4. Dense and Sparse Vectors
5. Labeled Point
6. Python Example of Using the LabeledPoint Class
7. LIBSVM format
8. An Example of a LIBSVM File
9. Loading LIBSVM Files
10. Local Matrices
11. Example of Creating Matrices in MLlib
12. Distributed Matrices
13. Example of Using a Distributed Matrix
14. Classification and Regression Algorithms
15. Clustering
16. Summary
Streaming – Kafka and Spark
1. Installing Apache Kafka
2. Configuration Files
3. Starting Kafka
4. Using Kafka Command Line Client Tools
5. Setting up a Multi-Broker Cluster
6. Using Multi-Broker Cluster
7. Kafka Connect
8. Kafka Connect – Configuration Files
9. Using Kafka Connect to Import/Export Data
10. Building Data Pipelines
11. Considerations When Building Data Pipelines
12. Timeliness
13. Reliability
14. High and Varying Throughput
15. Data Formats
16. Transformations
17. Security
18. Failure Handling
19. Coupling and Agility
20. Ad-hoc Pipelines
21. Loss of Metadata
22. Extreme Processing
23. Kafka Connect Versus Producer and Consumer
24. Spark Streaming Features
25. How It Works
26. Basic Data Stream Sources
27. Advanced Data Stream Sources
28. The DStream Object
Infrastructure Optimization
1. Monitoring Distributed Systems: Retrieving performance statistics from cluster members, aggregating output, consolidating application logs.
2. Operations Strategy: What approaches can be used to find errors and bugs in distributed applications, and devise solutions for them?
3. Case Study/Demonstration
4. Lab: Explore log aggregation in Splunk
Making Big Data Secure
1. What is required to secure Big Data infrastructure?
2. How can centralized security management software, such as Kerberos and LDAP, be configured as part of a broader security architecture?
3. What special considerations are there for applications and users who need to access protected resources?
4. How are permissions and roles managed so that Big Data processing resources, such as Spark applications running on top of YARN or Kubernetes, can access data stored in HDFS or in an object store such as Amazon S3?
5. Lab: Configuring Secure Access to Big Data Resources
How does DevOps work in a data context?
1. Infrastructure: Version Control (git, GitHub), Automation (Jenkins), Processing (Spark, Hadoop, YARN), Data Management (Kafka)
2. Process Differences: DataOps is more than DevOps and data
3. Lifecycle and Differences
4. Incorporating Complex Data Infrastructure into Continuous Integration/Deployment
5. Standardization of runtime environment using containers
6. Accounting for Infrastructure Differences within IaC configuration
7. Incorporating orchestration to handle supporting component deployment and management
8. Statistical Process Control (SPC) to ensure pipeline and model repeatability
9. Case Study/Demonstration
Tooling
1. GitHub: Source Forge
2. Docker and Jenkins: Continuous Integration
3. Spinnaker: Continuous Deployment
4. Lab: Continuous Integration of a Kafka Based Application Using Jenkins

Class Materials

Each student will receive a comprehensive set of materials, including course notes and all the class examples.

Class Prerequisites

Experience in the following is required for this Data Engineering Bootcamp class:

Some understanding of data science.

Live Private Class

Private Class for your Team
Live training
Online or On-location
Customizable
Expert Instructors
