What is Data Engineering, Use Cases, and Applications?
Data Engineer or Data Scientist?
Data Engineering Problems
Tools of a Data Engineer
Working with Different Databases
Processing Tasks, Scheduling Tools, and Different Cloud Providers
Why Cloud Computing, Use Cases, and Applications?
Different Cloud Services
Introduction to HDFS and Apache Spark
Spark Basics
Working with RDDs in Spark
Aggregating Data with Pair RDDs
Writing and Deploying Spark Applications
Parallel Processing
Spark RDD Persistence
Integrating Apache Flume and Apache Kafka
Spark Streaming
Improving Spark Performance
Spark SQL and Data Frames
Scheduling and Partitioning
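The pair-RDD topics above center on key-based aggregation. As a plain-Python analogy of what Spark's `reduceByKey` does (this is a sketch of the concept, not actual PySpark code, which would run on a distributed cluster):

```python
def reduce_by_key(pairs, fn):
    """Group (key, value) pairs and fold the values per key,
    analogous to Spark's reduceByKey on a pair RDD."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc

sales = [("apples", 3), ("pears", 1), ("apples", 2)]
totals = reduce_by_key(sales, lambda a, b: a + b)
# totals == {"apples": 5, "pears": 1}
```

In Spark the same aggregation runs per partition first and then across partitions, which is why the combining function must be associative.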
Understand the difference between SQL and NoSQL. Create relational and NoSQL data models based on business reporting requirements. Work with ETL tools to push the data into the models.
Work with MS SQL and Cassandra to create databases, using ETL tools for data extraction, transformation, and loading into the models.
Project 1: Data Modeling using Relational Databases
Project 2: Data Modeling using Apache Cassandra
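As a minimal sketch of the idea behind Project 1 (a relational model populated by a small ETL step), using Python's built-in sqlite3; the schema, table names, and sample rows are illustrative, not part of the course materials:

```python
import sqlite3

# Illustrative star-style schema: a fact table referencing a dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount REAL
    );
""")

# "Extract" from a source, "transform" (cast the types), "load" into the model.
source_rows = [("1", "Ada", "1001", "19.99"), ("2", "Grace", "1002", "5.00")]
for cust_id, name, order_id, amount in source_rows:
    conn.execute("INSERT OR IGNORE INTO dim_customer VALUES (?, ?)",
                 (int(cust_id), name))
    conn.execute("INSERT INTO fact_orders VALUES (?, ?, ?)",
                 (int(order_id), int(cust_id), float(amount)))

total = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
# total == 24.99
```

Project 2's Cassandra models follow a different rule: tables are designed per query, denormalized around the partition key, rather than normalized as above.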
Master the skills of building a highly scalable data warehouse on AWS. Work with Redshift, pulling data from RDS and other AWS services through an ETL pipeline and loading it into the data warehouse.
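A common pattern for the Redshift pipeline described above is to stage extracts in S3 and bulk-load them with Redshift's COPY command, which is far faster than row-by-row INSERTs. A small sketch that only builds the statement (the cluster connection itself, e.g. via psycopg2, is omitted; the bucket and IAM role names are placeholders):

```python
def redshift_copy_sql(table, s3_path, iam_role, fmt="CSV"):
    """Build a Redshift COPY statement that bulk-loads staged S3 data."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS {fmt};"
    )

sql = redshift_copy_sql(
    "fact_orders",
    "s3://my-bucket/orders/",
    "arn:aws:iam::123456789012:role/RedshiftLoad",
)
```

Executing the statement against the cluster would then load every object under the S3 prefix in parallel across the Redshift slices.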
What is ETL, Use Cases, and Applications?
Why Do We Need ETL Tools?
Working with Different Data Sources—Relational Databases, NoSQL, HDFS, Stream Data, CSV Files, TXT Files, JSON or XML Files, and Fixed File Formats
Transformation of Data
Loading Data into a Data Model or File System
Using SQL for Data Transformation
Optimizing ETL Processes
Understanding ETL Architecture for Tracking Data Flow and Data Pipelines
Understanding Data Quality Checks
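The extract, transform, load, and quality-check topics above fit together in a single pipeline. A minimal sketch using only the standard library, with hypothetical field names (`customer`, `amount`) and an in-memory dict standing in for the target data model:

```python
import csv
import io

def extract(csv_text):
    """Extract: parse rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: cast types; quarantine rows that fail the quality check."""
    clean, rejected = [], []
    for row in rows:
        try:
            row["amount"] = float(row["amount"])
            clean.append(row)
        except (ValueError, KeyError):
            rejected.append(row)  # kept for later inspection, not silently dropped
    return clean, rejected

def load(rows):
    """Load: aggregate into the target model (a dict keyed by customer here)."""
    table = {}
    for row in rows:
        table[row["customer"]] = table.get(row["customer"], 0.0) + row["amount"]
    return table

raw = "customer,amount\nada,10.5\ngrace,oops\nada,4.5\n"
clean, rejected = transform(extract(raw))
warehouse = load(clean)
# warehouse == {"ada": 15.0}, with one rejected row
```

Real ETL tools add scheduling, lineage tracking, and retries around this same shape.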
AWS Data Storage Services—S3, S3 Glacier, Amazon DynamoDB
AWS Processing Services—AWS EMR, EMR Cluster, Hadoop, Hue with EMR, Spark with EMR, AWS Lambda, HCatalog, Glue, and Glue Lab
AWS Data Analysis Services—Amazon Redshift, Tuning Query Performance, Amazon ML, Amazon Athena, Amazon Elasticsearch, and ES Domain
Learn to schedule, automate, and monitor ETL pipelines with Apache Airflow, Luigi, and Cron.
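Airflow and Luigi express pipelines as DAGs of scheduled tasks; the core idea (run a job on an interval and record each outcome for monitoring) can be sketched in plain Python, without either framework:

```python
import time

def run_on_schedule(job, interval_s, iterations):
    """Run `job` every `interval_s` seconds, recording success/failure per run."""
    results = []
    for _ in range(iterations):
        start = time.monotonic()
        try:
            job()
            results.append("success")
        except Exception as exc:
            # Monitoring: record the failure instead of letting it kill the loop.
            results.append(f"failed: {exc}")
        time.sleep(max(0.0, interval_s - (time.monotonic() - start)))
    return results

runs = run_on_schedule(lambda: None, interval_s=0.01, iterations=3)
# runs == ["success", "success", "success"]
```

What the real schedulers add on top is exactly what this sketch lacks: dependency ordering between tasks, backfills, and alerting on the recorded failures.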
Learn and master how to implement data quality checks and processes for running ETL in a production environment.
Understand and create a strong process and architecture to avoid ETL failures due to data quality issues. Learn how to handle ETL failures in a production environment.
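One common way to make data-quality failures explicit rather than silent is a quality gate that halts the load when a threshold is breached, combined with bounded retries for transient failures. A sketch, with the field name and threshold purely illustrative:

```python
def quality_gate(rows, max_null_fraction=0.1):
    """Abort the load if too many rows are missing a required field."""
    bad = sum(1 for r in rows if r.get("amount") is None)
    if rows and bad / len(rows) > max_null_fraction:
        raise ValueError(f"{bad}/{len(rows)} rows failed the quality check")
    return rows

def run_with_retries(step, data, attempts=3):
    """Re-run a failing ETL step a bounded number of times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return step(data)
        except ValueError:
            if attempt == attempts:
                raise  # surface the failure to the operator after the last attempt
```

Retries only help with transient faults; a genuine data-quality breach will fail every attempt, which is the desired behavior in production: stop the pipeline before bad data reaches the warehouse.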
Use Docker to convert your applications and data pipelines into container-based applications
Orchestrate containers to deliver scalable and reliable performance using Kubernetes
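A containerized pipeline typically starts from a Dockerfile along these lines (the base image tag and entry script name are placeholders); Kubernetes then schedules replicas of the resulting image:

```dockerfile
# Small Python base image keeps the container lightweight.
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so this layer is cached between code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline code and set its entry point.
COPY . .
CMD ["python", "etl_pipeline.py"]
```

Building and pushing this image gives Kubernetes a unit it can restart, scale out, and roll back, which is where the reliability and scalability in the line above come from.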
Implement the concepts learnt in the program: create a highly scalable data warehouse architecture that loads data from different sources, and use a NoSQL database to serve the query results requested by the analytics team. Deploy your data processing solution on an AWS cluster.
Non-Relational Data Stores and Azure Data Lake Storage
Data Lake and Azure Cosmos DB
Relational Data Stores
Why Azure SQL?
Azure Batch
Azure Data Factory
Azure Databricks
Azure Stream Analytics
Monitoring & Security
Learn the basic statistics required for Data Science
Master Data Science Algorithms
Learn linear regression and work on recommender problems and collaborative filtering
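Simple linear regression, the first algorithm named above, has a closed-form least-squares solution that fits in a few lines of plain Python (the sample points below are made up to give an exact fit):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b via the closed-form solution."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # Intercept: line passes through the mean point.
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points lie on y = 2x + 1
# a == 2.0, b == 1.0
```

Collaborative filtering generalizes the same idea of fitting parameters to observed values, except the "features" (user and item factors) are themselves learned rather than given.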