Data Engineering using Databricks on AWS and Azure

By
Udemy

Acquire a foundational understanding of Databricks components such as Spark, Delta Lake, and cloudFiles (Auto Loader) to build data engineering pipelines.

Mode

Online

Fees

₹549 (original price ₹3,499)

Quick Facts

Medium of instruction: English
Mode of learning: Self-study
Mode of delivery: Video and text based

Course overview

Organizations generally use Databricks as a data and AI platform to improve the functionality and performance of their ETL pipelines, and to develop, test, and deploy the machine learning and analytics applications that drive business outcomes. The Data Engineering using Databricks on AWS and Azure certification course, offered by Udemy, was created by Asasri Manthena, Marketing & Sales Manager at ITVersity, Inc., Durga Viswanatha Raju Gadiraju, CEO at ITVersity and CTO at Analytiqs, Inc., and Asasri Gadiraju.

The Data Engineering using Databricks on AWS and Azure online course contains 18.5 hours of comprehensive video lectures along with 31 articles and 55 downloadable resources, and is intended for students who want to learn data engineering using Databricks. Through the training, students cover topics such as the Azure CLI, the Databricks CLI, application development and its life cycle, Spark Structured Streaming, incremental file processing, and more.

The highlights

  • Certificate of completion
  • Self-paced course
  • 18.5 hours of pre-recorded video content
  • 31 articles 
  • 55 downloadable resources

Program offerings

  • Online course
  • Learning resources
  • 30-day money-back guarantee
  • Unlimited access
  • Accessible on mobile devices and TV

Course and certificate fees

Fees information
₹549 (original price ₹3,499)

Certificate availability

Yes

Certificate providing authority

Udemy

What you will learn

  • Knowledge of AWS technology
  • Web application development skills

After completing the Data Engineering using Databricks on AWS and Azure online certification, students will have a thorough understanding of Azure and AWS for data engineering operations using Databricks. They will explore the functionalities of the Azure CLI and Databricks CLI, and will acquire knowledge of Delta Lake, Spark SQL, CRUD operations, and PySpark for developing applications. Students will also learn strategies for incremental file processing, Spark Structured Streaming, and the application development life cycle.

The syllabus

Introduction to Data Engineering using Databricks

  • Overview of the course - Data Engineering using Databricks
  • Where are the resources that are used for this course?

Getting Started with Databricks

  • Signing Up For Databricks Community Edition
  • Create Azure Databricks Service
  • Signup For Databricks Full Trial
  • Overview Of Databricks UI
  • Upload Data in Files into Databricks
  • Create Cluster in Databricks Platform
  • Managing File System Using Notebooks

Getting Started with Databricks on Azure

  • Getting Started with Databricks on Azure - Introduction
  • Signup for the Azure Account
  • Login and Increase Quotas for regional vCPUs in Azure
  • Create Azure Databricks Workspace
  • Launching Azure Databricks Workspace or Cluster
  • Quick Walkthrough of Azure Databricks UI
  • Create Azure Databricks Single Node Cluster
  • Upload Data using Azure Databricks UI
  • Overview of Creating Notebook and Validating Files
  • Develop Spark Application using Azure Databricks Notebook
  • Validate Spark Jobs using Azure Databricks Notebook
  • Export and Import of Azure Databricks Notebooks
  • Terminating Azure Databricks Cluster and Deleting Configuration
  • Delete Azure Databricks Workspace by deleting Resource Group

Azure Essentials for Databricks - Azure CLI

  • Azure Essentials for Databricks - Azure CLI
  • Azure CLI using Azure Portal Cloud Shell
  • Getting Started with Azure CLI on Mac
  • Getting Started with Azure CLI on Windows
  • Warming up with Azure CLI - Overview
  • Create Resource Group using Azure CLI
  • Create ADLS Storage Account within Resource Group
  • Add Container as part of Storage Account
  • Overview of Uploading the data into ADLS File System or Container
  • Setup Data Set locally to upload into ADLS File System or Container
  • Upload local directory into Azure ADLS File System or Container
  • Delete Azure ADLS Storage Account using Azure CLI
  • Delete Azure Resource Group using Azure CLI

Mount ADLS on to Azure Databricks

  • Mount ADLS on to Azure Databricks - Introduction
  • Ensure Azure Databricks Workspace
  • Setup Databricks CLI on Mac or Windows using Python Virtual Environment
  • Configure Databricks CLI for new Azure Databricks Workspace
  • Register an Azure Active Directory Application
  • Create Databricks Secret for AD Application Client Secret
  • Create ADLS Storage Account
  • Assign IAM Role on Storage Account to Azure AD Application
  • Setup Retail DB Dataset
  • Create ADLS Container or File System and Upload Data
  • Start Databricks Cluster to mount ADLS
  • Mount ADLS Storage Account on to Azure Databricks
  • Validate ADLS Mount Point on Azure Databricks Clusters
  • Unmount the mount point from Databricks
  • Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks
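
For readers who want a preview, the mount itself reduces to a single dbutils call in a notebook. A minimal sketch, assuming the AD application's client secret sits in a hypothetical Databricks secret scope named adls-scope, with placeholder values for the storage account, container, tenant, and client IDs:

    # Runs in an Azure Databricks notebook, where dbutils is predefined.
    # All names below (scope, key, IDs, account, container) are placeholders.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="adls-scope", key="client-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # Mount the ADLS container on DBFS, validate it, and unmount when done.
    dbutils.fs.mount(
        source="abfss://retail-db@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/retail_db",
        extra_configs=configs,
    )
    print(dbutils.fs.ls("/mnt/retail_db"))
    dbutils.fs.unmount("/mnt/retail_db")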

Setup Local Development Environment

  • Setup Single Node Databricks Cluster
  • Install Databricks Connect
  • Configure Databricks Connect
  • Integrating Pycharm with Databricks Connect
  • Code - Integrating Pycharm with Databricks Connect
  • Integrate Databricks Cluster with Glue Catalog
  • Setup s3 Bucket and Grant Permissions
  • Mounting s3 Buckets into Databricks Clusters
  • Using dbutils from IDEs such as Pycharm
  • Code - Using dbutils from IDEs such as Pycharm
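
As a preview of the pattern these lectures build, here is a minimal sketch of driving a remote cluster from an IDE with legacy Databricks Connect, assuming databricks-connect configure has already been run; the mount paths are placeholders:

    from pyspark.sql import SparkSession
    # DBUtils shim shipped with the Databricks Connect pyspark distribution.
    from pyspark.dbutils import DBUtils

    # With Databricks Connect configured, the builder returns a session
    # that executes against the remote Databricks cluster.
    spark = SparkSession.builder.getOrCreate()
    dbutils = DBUtils(spark)

    # List a mounted bucket and read data from the IDE (placeholder paths).
    print(dbutils.fs.ls("/mnt/"))
    df = spark.read.json("/mnt/itv-github/landing/")
    df.printSchema()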

Using Databricks CLI

  • Introduction
  • Install and Configure Databricks CLI
  • Interacting with File System using CLI
  • Getting Cluster Details using CLI

Spark Application Development Life Cycle

  • Setup Virtual Environment and Install Pyspark
  • [Commands] - Setup Virtual Environment and Install Pyspark
  • Getting Started with Pycharm
  • [Code and Instructions] - Getting Started with Pycharm
  • Passing Run Time Arguments
  • Accessing OS Environment Variables
  • Getting Started with Spark
  • Create Function for Spark Session
  • [Code and Instructions] - Create Function for Spark Session
  • Setup Sample Data
  • Read data from files
  • [Code and Instructions] - Read data from files
  • Process data using Spark APIs
  • [Code and Instructions] - Process data using Spark APIs
  • Write data to files
  • [Code and Instructions] - Write data to files
  • Validating Writing Data to Files
  • Productionizing the Code
  • [Code and Instructions] - Productionizing the code
  • Setting up Data for Production Validation
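
This section converges on a small, environment-driven PySpark application. The sketch below shows the overall shape under assumed names (ENVIRON as the environment variable; source and target directories passed as run-time arguments):

    import os
    import sys

    from pyspark.sql import SparkSession


    def get_spark_session(env, app_name):
        # Local mode for development; on a cluster the builder picks up
        # the existing configuration instead.
        if env == "DEV":
            return (SparkSession.builder
                    .master("local[*]")
                    .appName(app_name)
                    .getOrCreate())
        return SparkSession.builder.appName(app_name).getOrCreate()


    def main():
        env = os.environ.get("ENVIRON", "DEV")       # OS environment variable
        src_dir, tgt_dir = sys.argv[1], sys.argv[2]  # run-time arguments

        spark = get_spark_session(env, "File Format Converter")
        df = spark.read.csv(src_dir, header=True, inferSchema=True)
        df.coalesce(1).write.mode("overwrite").parquet(tgt_dir)


    if __name__ == "__main__":
        main()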

Databricks Jobs and Clusters

  • Introduction to Jobs and Clusters
  • Creating Pools in Databricks Platform
  • Create Cluster on Azure Databricks
  • Request to Increase CPU Quota on Azure
  • Creating Job on Databricks
  • Submitting Jobs using Job Cluster
  • Create Pool in Databricks
  • Running Job using Interactive Cluster Attached to Pool
  • Running Job Using Job Cluster Attached to Pool
  • Exercise - Submit the application as job using interactive cluster

Deploy and Run on Databricks

  • Prepare PyCharm for Databricks
  • Prepare Data Sets
  • Move files to ghactivity
  • Refactor Code for Databricks
  • Validating Data using Databricks
  • Setup Data Set for Production Deployment
  • Access File Metadata using dbutils
  • Build Deployable bundle for Databricks
  • Running Jobs using Databricks Web UI
  • Get Job and Run Details using Databricks CLI
  • Submitting Databricks Jobs using CLI
  • Setup and Validate Databricks Client Library
  • Resetting the Job using Jobs API
  • Run Databricks Job programmatically using Python
  • Detailed Validation of Data
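
As a preview, running and inspecting a job programmatically reduces to two calls against the Jobs 2.1 REST API. A hedged sketch using the requests library, with placeholder workspace URL, token, and job id:

    import requests

    HOST = "https://<workspace-instance>.cloud.databricks.com"  # placeholder
    TOKEN = "<personal-access-token>"                           # placeholder
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    # Trigger an existing job by id (Jobs API 2.1).
    run = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers=HEADERS,
        json={"job_id": 42},  # placeholder job id
    ).json()

    # Fetch the run's current state.
    status = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run["run_id"]},
    ).json()
    print(status["state"])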

Deploy Jobs using Notebooks

  • Modularizing Notebooks
  • Running Job using Notebook
  • Refactor application as Databricks Notebooks
  • Run Notebook using Development Cluster

Deep Dive into Delta Lake using Data Frames

  • Introduction to Delta Lake using Data Frames
  • Creating Data Frames for Delta Lake
  • Writing Data Frame using Delta Format
  • Updating Existing Data using Delta Format
  • Delete Existing Data using Delta Format
  • Merge or Upsert Data using Delta Format
  • Deleting using Merge in Delta Lake
  • Point-in-time Snapshot Recovery using Delta Logs
  • Deleting unnecessary Delta Files using Vacuum
  • Compaction of Delta Lake Files
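
The operations in this section map onto Delta Lake's DeltaTable API. A compact sketch, assuming a cluster with Delta Lake available and an illustrative path /mnt/delta/orders:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Assumes a Databricks runtime, or a session with delta-spark configured.
    spark = SparkSession.builder.getOrCreate()

    # Write a data frame in Delta format (illustrative path).
    df = spark.createDataFrame(
        [(1, "PENDING"), (2, "COMPLETE")], ["order_id", "order_status"]
    )
    df.write.format("delta").mode("overwrite").save("/mnt/delta/orders")

    dt = DeltaTable.forPath(spark, "/mnt/delta/orders")

    # Update and delete in place; set-values are SQL expressions.
    dt.update("order_id = 1", {"order_status": "'CLOSED'"})
    dt.delete("order_status = 'COMPLETE'")

    # Merge (upsert) new rows into the table.
    updates = spark.createDataFrame(
        [(1, "SHIPPED"), (3, "PENDING")], ["order_id", "order_status"]
    )
    (dt.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

    # Time travel to the first snapshot, then clean up stale files.
    spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/orders").show()
    dt.vacuum(168)  # keep 7 days of history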

Deep Dive into Delta Lake using Spark SQL

  • Introduction to Delta Lake using SQL
  • Creating Data Frames for Delta Lake
  • Create Delta Lake Table
  • Insert Data to Delta Lake Table
  • Update Data in Delta Lake Table
  • Delete Data from Delta Lake Table
  • Merge or Upsert Data into Delta Lake Table
  • Using Merge Function over Delta Lake Table
  • Point-in-time Snapshot Recovery using Delta Lake Table
  • Vacuuming Delta Lake Tables
  • Compaction of Delta Lake Tables
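
The same operations expressed through Spark SQL; the sketch runs the statements via spark.sql so it works from a notebook or script, with illustrative table and column names:

    from pyspark.sql import SparkSession

    # Assumes a Databricks runtime, or a session with delta-spark configured.
    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id INT,
            order_status STRING
        ) USING DELTA
    """)
    spark.sql("INSERT INTO orders VALUES (1, 'PENDING'), (2, 'COMPLETE')")
    spark.sql("UPDATE orders SET order_status = 'CLOSED' WHERE order_id = 1")
    spark.sql("DELETE FROM orders WHERE order_status = 'COMPLETE'")

    # Stage some updates, then upsert them into the Delta table.
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW orders_updates AS
        SELECT * FROM VALUES (1, 'SHIPPED'), (3, 'PENDING')
            AS t(order_id, order_status)
    """)
    spark.sql("""
        MERGE INTO orders AS t
        USING orders_updates AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # Point-in-time query and housekeeping.
    spark.sql("SELECT * FROM orders VERSION AS OF 0").show()
    spark.sql("VACUUM orders RETAIN 168 HOURS")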

Accessing Databricks Cluster Terminal via Web as well as SSH

  • Enable Web Terminal in Databricks Admin Console
  • Launch Web Terminal for Databricks Cluster
  • Setup SSH for the Databricks Cluster Driver Node
  • Validate SSH Connectivity to the Databricks Driver Node on AWS
  • Limitations of SSH and comparison with Web Terminal

Installing Software on Databricks Clusters using init scripts

  • Setup gen_logs on Databricks Cluster
  • [Commands] Setup gen_logs on Databricks Cluster
  • Overview of Init Scripts for Databricks Clusters
  • Create Script to install software from git on Databricks Cluster
  • [Commands] Create Script to install software from git on Databricks Cluster
  • Copy init script to dbfs location
  • [Commands] Copy init script to dbfs location
  • Create Databricks Standalone Cluster with init script

Quick Recap of Spark Structured Streaming

  • Validate Netcat on Databricks Driver Node
  • Push log messages to Netcat Webserver on Databricks Driver Node
  • Reading Web Server logs using Spark Structured Streaming
  • Writing Streaming Data to Files
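
The recap reduces to a socket-source pipeline: log messages pushed to Netcat on the driver node are read as a stream and written to files. A minimal sketch with placeholder host, port, and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Log Streaming").getOrCreate()

    # Read log messages from a Netcat server on the driver node;
    # the socket source yields a single string column named `value`.
    logs = (spark.readStream
            .format("socket")
            .option("host", "localhost")  # placeholder host
            .option("port", 9999)         # placeholder port
            .load())

    # Write the stream to files, tracking progress in a checkpoint.
    query = (logs.writeStream
             .format("csv")
             .option("path", "/mnt/streaming/logs")              # placeholder
             .option("checkpointLocation", "/mnt/streaming/ckpt")
             .start())
    query.awaitTermination()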

Incremental Loads using Spark Structured Streaming

  • Overview of Spark Structured Streaming
  • Steps for Incremental Data Processing
  • Configure Cluster with Instance Profile
  • Upload GHArchive Files to s3
  • Read JSON Data using Spark Structured Streaming
  • Write using Delta file format using Trigger Once
  • Analyze GHArchive Data in Delta files using Spark
  • Add New GHActivity JSON files
  • Load Data Incrementally to Target Table
  • Validate Incremental Load
  • Internals of Spark Structured Streaming File Processing
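
The incremental pattern here is a file-source stream with a one-shot trigger: each run processes only the files the checkpoint has not yet seen. A sketch with an illustrative schema and placeholder s3 paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # File sources need an explicit schema; this one is illustrative.
    schema = "id STRING, type STRING, created_at TIMESTAMP"

    incoming = (spark.readStream
                .schema(schema)
                .json("s3://itv-github/landing/"))  # placeholder bucket

    # Trigger once: process every new file since the last checkpoint, then stop.
    (incoming.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://itv-github/ckpt/")  # placeholder
        .trigger(once=True)
        .start("s3://itv-github/bronze/ghactivity/"))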

Incremental Loads using Cloud Files

  • Overview of Auto Loader cloudFiles
  • Upload GHArchive Files to s3
  • Write Data using Auto Loader cloudFiles
  • Add New GHActivity JSON files
  • Load Data Incrementally to Target Table
  • Add New GHActivity JSON files
  • Overview of Handling S3 Events using AWS Services
  • Configure IAM Role for cloudFiles file notifications
  • Incremental Load using cloudFiles File Notifications
  • Review AWS Services for cloudFiles Event Notifications
  • Review Metadata Generated for cloudFiles Checkpointing
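
Auto Loader swaps the plain file source for the cloudFiles format and can either list the directory or consume S3 event notifications. A sketch of the directory-listing mode on a Databricks cluster, with placeholder paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    incoming = (spark.readStream
                .format("cloudFiles")                 # Databricks Auto Loader
                .option("cloudFiles.format", "json")
                # For S3 event notifications instead of directory listing:
                # .option("cloudFiles.useNotifications", "true")
                .schema("id STRING, type STRING, created_at TIMESTAMP")
                .load("s3://itv-github/landing/"))    # placeholder bucket

    (incoming.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://itv-github/cf-ckpt/")
        .trigger(once=True)
        .start("s3://itv-github/bronze/ghactivity/"))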

Overview of Databricks SQL Clusters

  • Overview of Databricks SQL Platform - Introduction
  • Run First Query using SQL Editor of Databricks SQL
  • Overview of Dashboards using Databricks SQL
  • Overview of Databricks SQL Data Explorer to review Metastore Database and Tables
  • Use Databricks SQL Editor to develop scripts or queries
  • Review Metadata of Tables using Databricks SQL Platform
  • Overview of loading data into retail_db tables
  • Configure Databricks CLI to push data into Databricks Platform
  • Copy JSON Data into DBFS using Databricks CLI
  • Analyze JSON Data using Spark APIs
  • Analyze Delta Table Schemas using Spark APIs
  • Load Data from Spark Data Frames into Delta Tables
  • Run Adhoc Queries using Databricks SQL Editor to validate data
  • Overview of External Tables using Databricks SQL
  • Using COPY Command to Copy Data into Delta Tables
  • Manage Databricks SQL Endpoints
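
As a preview of the COPY command lecture, loading files into a Delta table reduces to a single idempotent statement. A sketch issued through spark.sql, with a placeholder table name and DBFS path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # COPY INTO is idempotent: files already loaded are skipped on re-runs.
    spark.sql("""
        COPY INTO retail_db.orders
        FROM 'dbfs:/public/retail_db_json/orders'  -- placeholder path
        FILEFORMAT = JSON
    """)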

Instructors

Mr Durga Viswanatha Raju Gadiraju
Technology Adviser
Freelancer
