Data Engineering Bootcamp (DSC113)
Course Length: 5 days
Delivery Methods:
Available as private class only
Course Overview
This five-day Data Engineering Bootcamp training course is supplemented by hands-on labs that reinforce the material covered in class.
Course Benefits
Learn about the world of data engineering and how to build and maintain the data infrastructure that holds your enterprise's advanced analytics capabilities together.
Course Outline
- Introduction – The Big Data Landscape and Key Components
- The Big Data Ecosystem at CVS
- YARN, Spark, Spark Streaming, Kafka
- Containers: Docker/Kubernetes
- Monitoring and Logging: Prometheus
- Data Engineering Defined
- Data is King
- Translating Data into Business Insights
- What is Data Engineering
- The Data-Related Roles
- The Data Science Skill Sets
- The Data Engineer Role
- An Example of a Data Product
- Data Schema for Data Exchange Interoperability
- The Data Exchange Interoperability Options
- Big Data and NoSQL
- Data Physics
- The Traditional Client – Server Processing Pattern
- Data Locality (Distributed Computing Economics)
- The CAP Theorem
- Mechanisms to Guarantee a Single CAP Property
- The CAP Triangle
- Eventual Consistency
- Data Processing Phases
- Typical Data Processing Pipeline
- Data Discovery Phase
- Data Harvesting Phase
- Data Priming Phase
- Data Logistics and Data Governance
- Exploratory Data Analysis
- Model Planning Phase
- Model Building Phase
- Communicating the Results
- Production Roll-out
- Core Data Engineering Tasks
- Data acquisition in Python
- Database and Web interfaces
- Ensuring data quality
- Repairing and normalizing data
- Computing descriptive statistics in Python
- Processing data at scale
- Functional Programming Primer
- What is Functional Programming
- Benefits of Functional Programming
- Functions as Data
- Using Map Function
- Using Filter Function
- Lambda expressions
- List.sort() Using Lambda Expression
- Difference Between Simple Loops and map/filter Type Functions
- Additional Functions
- Summary
- Introduction to PySpark
- What is Apache Spark
- Spark use cases
- Architectural overview
- PySpark Shell
- What is the PySpark Shell
- Starting and using the shell
- Spark context
- PySpark Shell vs Spark Shell
- Resilient Distributed Dataset
- What are Resilient Distributed Datasets (RDDs)
- Creating RDDs
- Transformations and actions
- Parallel Processing
- Spark cluster
- Data partitioning
- Applications, jobs and tasks
- Shared Variables
- What are shared variables
- Broadcast variables
- Accumulators
- Spark SQL
- What is Spark SQL
- Uniform data
- Hive
- SQL Context object
- The Spark Machine Learning Library
- What is MLlib?
- Supported Languages
- MLlib Packages
- Dense and Sparse Vectors
- Labeled Point
- Python Example of Using the LabeledPoint Class
- LIBSVM format
- An Example of a LIBSVM File
- Loading LIBSVM Files
- Local Matrices
- Example of Creating Matrices in MLlib
- Distributed Matrices
- Example of Using a Distributed Matrix
- Classification and Regression Algorithms
- Clustering
- Summary
- Streaming – Kafka and Spark
- Installing Apache Kafka
- Configuration Files
- Starting Kafka
- Using Kafka Command Line Client Tools
- Setting up a Multi-Broker Cluster
- Using Multi-Broker Cluster
- Kafka Connect
- Kafka Connect – Configuration Files
- Using Kafka Connect to Import/Export Data
- Building Data Pipelines
- Considerations When Building Data Pipelines
- Timeliness
- Reliability
- High and Varying Throughput
- Data Formats
- Transformations
- Security
- Failure Handling
- Coupling and Agility
- Ad-hoc Pipelines
- Loss of Metadata
- Extreme Processing
- Kafka Connect Versus Producer and Consumer
- Spark Streaming Features
- How It Works
- Basic Data Stream Sources
- Advanced Data Stream Sources
- The DStream Object
- Infrastructure Optimization
- Monitoring Distributed Systems: Retrieving performance statistics from cluster members, aggregating output, consolidating application logs.
- Operations Strategy: What approaches can be used to find errors and bugs in distributed applications, and devise solutions for them?
- Case Study/Demonstration
- Lab: Explore log aggregation in Splunk
- Making Big Data Secure
- What is required to secure Big Data infrastructure?
- How can centralized security management software, such as Kerberos and LDAP, be configured as part of a broader security architecture?
- What special considerations are there for applications and users who need to access protected resources?
- How are permissions and roles managed so that Big Data processing resources, such as Spark applications running on top of YARN or Kubernetes, can access data stored in HDFS or in an object store such as Amazon S3?
- Lab: Configuring Secure Access to Big Data Resources
- How does DevOps work in a data context?
- Infrastructure: Version Control (Git, GitHub), Automation (Jenkins), Processing (Spark, Hadoop, YARN), Data Management (Kafka)
- Process Differences: DataOps is more than DevOps and data
- Lifecycle and Differences
- Incorporating Complex Data Infrastructure into Continuous Integration/Deployment
- Standardization of runtime environment using containers
- Accounting for Infrastructure Differences within IaC configuration
- Incorporating orchestration to handle supporting component deployment and management
- Statistical Process Control (SPC) to ensure pipeline and model repeatability
- Case Study/Demonstration
- Tooling
- GitHub: Source Forge
- Docker and Jenkins: Continuous Integration
- Spinnaker: Continuous Deployment
- Lab: Continuous Integration of a Kafka-Based Application Using Jenkins
Class Materials
Each student will receive a comprehensive set of materials, including course notes and all the class examples.
Class Prerequisites
Experience in the following is required for this Data Engineering Bootcamp class:
- Some understanding of data science.
Live Private Class
- Private Class for your Team
- Live training
- Online or On-location
- Customizable
- Expert Instructors