Data Engineering using Databricks on AWS and Azure

By
Udemy

Acquire a foundational understanding of Databricks components such as Spark, Delta Lake, and cloudFiles (Auto Loader) to build data engineering pipelines.

Mode

Online

Fees

₹549 (original price ₹3,499)

Quick Facts

Medium of instruction: English
Mode of learning: Self-study
Mode of delivery: Video and text based

Course overview

Organizations generally use Databricks as a data and AI platform to improve the functionality and performance of their ETL pipelines, and to develop, test, and deploy the machine learning and analytics applications that drive business outcomes. The Data Engineering using Databricks on AWS and Azure certification course, offered by Udemy, was created by Asasri Manthena, Marketing & Sales Manager at ITVersity, Inc., Durga Viswanatha Raju Gadiraju, CEO at ITVersity and CTO at Analytiqs, Inc., and Asasri Gadiraju.

The Data Engineering using Databricks on AWS and Azure online course contains 18.5 hours of comprehensive video lectures along with 31 articles and 55 downloadable resources, and is intended for students who want to learn data engineering using Databricks. Through the training, students cover topics such as the Azure CLI, the Databricks CLI, application development and its life cycle, Spark Structured Streaming, incremental file processing, and more.

The highlights

  • Certificate of completion
  • Self-paced course
  • 18.5 hours of pre-recorded video content
  • 31 articles 
  • 55 downloadable resources

Program offerings

  • Online course
  • Learning resources
  • 30-day money-back guarantee
  • Unlimited access
  • Accessible on mobile devices and TV

Course and certificate fees

Fees information
₹549 (original price ₹3,499)

Certificate availability

Yes

Certificate providing authority

Udemy

What you will learn

  • Knowledge of AWS technology
  • Web application development skills

After completing the Data Engineering using Databricks on AWS and Azure online certification, students will have a thorough understanding of Azure and AWS for data engineering operations using Databricks. They will explore the functionalities of the Azure CLI and Databricks CLI, and will acquire knowledge of Delta Lake, Spark SQL, CRUD operations, and PySpark for developing applications. Students will also learn strategies for incremental file processing, Spark Structured Streaming, and the application development life cycle.

The syllabus

Introduction to Data Engineering using Databricks

  • Overview of the course - Data Engineering using Databricks
  • Where are the resources that are used for this course?

Getting Started with Databricks

  • Signing Up For Databricks Community Edition
  • Create Azure Databricks Service
  • Signup For Databricks Full Trial
  • Overview Of Databricks UI
  • Upload Data in Files into Databricks
  • Create Cluster in Databricks Platform
  • Managing File System Using Notebooks

Getting Started with Databricks on Azure

  • Getting Started with Databricks on Azure - Introduction
  • Signup for the Azure Account
  • Login and Increase Quotas for regional vCPUs in Azure
  • Create Azure Databricks Workspace
  • Launching Azure Databricks Workspace or Cluster
  • Quick Walkthrough of Azure Databricks UI
  • Create Azure Databricks Single Node Cluster
  • Upload Data using Azure Databricks UI
  • Overview of Creating Notebook and Validating Files
  • Develop Spark Application using Azure Databricks Notebook
  • Validate Spark Jobs using Azure Databricks Notebook
  • Export and Import of Azure Databricks Notebooks
  • Terminating Azure Databricks Cluster and Deleting Configuration
  • Delete Azure Databricks Workspace by deleting Resource Group

Azure Essentials for Databricks - Azure CLI

  • Azure Essentials for Databricks - Azure CLI
  • Azure CLI using Azure Portal Cloud Shell
  • Getting Started with Azure CLI on Mac
  • Getting Started with Azure CLI on Windows
  • Warming up with Azure CLI - Overview
  • Create Resource Group using Azure CLI
  • Create ADLS Storage Account within Resource Group
  • Add Container as part of Storage Account
  • Overview of Uploading the data into ADLS File System or Container
  • Setup Data Set locally to upload into ADLS File System or Container
  • Upload local directory into Azure ADLS File System or Container
  • Delete Azure ADLS Storage Account using Azure CLI
  • Delete Azure Resource Group using Azure CLI

Mount ADLS on to Azure Databricks

  • Mount ADLS on to Azure Databricks - Introduction
  • Ensure Azure Databricks Workspace
  • Setup Databricks CLI on Mac or Windows using Python Virtual Environment
  • Configure Databricks CLI for new Azure Databricks Workspace
  • Register an Azure Active Directory Application
  • Create Databricks Secret for AD Application Client Secret
  • Create ADLS Storage Account
  • Assign IAM Role on Storage Account to Azure AD Application
  • Setup Retail DB Dataset
  • Create ADLS Container or File System and Upload Data
  • Start Databricks Cluster to mount ADLS
  • Mount ADLS Storage Account on to Azure Databricks
  • Validate ADLS Mount Point on Azure Databricks Clusters
  • Unmount the mount point from Databricks
  • Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks
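
For readers who want a preview, the mount itself reduces to a single dbutils call in a notebook. A minimal sketch, assuming the AD application's client secret sits in a hypothetical Databricks secret scope named adls-scope, with placeholder values for the storage account, container, tenant, and client IDs:

    # Runs in an Azure Databricks notebook, where dbutils is predefined.
    # All names below (scope, key, IDs, account, container) are placeholders.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="adls-scope", key="client-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # Mount the ADLS container on DBFS, validate it, and unmount when done.
    dbutils.fs.mount(
        source="abfss://retail-db@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/retail_db",
        extra_configs=configs,
    )
    print(dbutils.fs.ls("/mnt/retail_db"))
    dbutils.fs.unmount("/mnt/retail_db")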

Setup Local Development Environment

  • Setup Single Node Databricks Cluster
  • Install Databricks Connect
  • Configure Databricks Connect
  • Integrating Pycharm with Databricks Connect
  • Code - Integrating Pycharm with Databricks Connect
  • Integrate Databricks Cluster with Glue Catalog
  • Setup s3 Bucket and Grant Permissions
  • Mounting s3 Buckets into Databricks Clusters
  • Using dbutils from IDEs such as Pycharm
  • Code - Using dbutils from IDEs such as Pycharm
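
As a preview of the pattern these lectures build, here is a minimal sketch of driving a remote cluster from an IDE with legacy Databricks Connect, assuming databricks-connect configure has already been run; the mount paths are placeholders:

    from pyspark.sql import SparkSession
    # DBUtils shim shipped with the Databricks Connect pyspark distribution.
    from pyspark.dbutils import DBUtils

    # With Databricks Connect configured, the builder returns a session
    # that executes against the remote Databricks cluster.
    spark = SparkSession.builder.getOrCreate()
    dbutils = DBUtils(spark)

    # List a mounted bucket and read data from the IDE (placeholder paths).
    print(dbutils.fs.ls("/mnt/"))
    df = spark.read.json("/mnt/itv-github/landing/")
    df.printSchema()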

Using Databricks CLI

  • Introduction
  • Install and Configure Databricks CLI
  • Interacting with File System using CLI
  • Getting Cluster Details using CLI

Spark Application Development Life Cycle

  • Setup Virtual Environment and Install Pyspark
  • [Commands] - Setup Virtual Environment and Install Pyspark
  • Getting Started with Pycharm
  • [Code and Instructions] - Getting Started with Pycharm
  • Passing Run Time Arguments
  • Accessing OS Environment Variables
  • Getting Started with Spark
  • Create Function for Spark Session
  • [Code and Instructions] - Create Function for Spark Session
  • Setup Sample Data
  • Read data from files
  • [Code and Instructions] - Read data from files
  • Process data using Spark APIs
  • [Code and Instructions] - Process data using Spark APIs
  • Write data to files
  • [Code and Instructions] - Write data to files
  • Validating Writing Data to Files
  • Productionizing the Code
  • [Code and Instructions] - Productionizing the code
  • Setting up Data for Production Validation
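
This section converges on a small, environment-driven PySpark application. The sketch below shows the overall shape under assumed names (ENVIRON as the environment variable; source and target directories passed as run-time arguments):

    import os
    import sys

    from pyspark.sql import SparkSession


    def get_spark_session(env, app_name):
        # Local mode for development; on a cluster the builder picks up
        # the existing configuration instead.
        if env == "DEV":
            return (SparkSession.builder
                    .master("local[*]")
                    .appName(app_name)
                    .getOrCreate())
        return SparkSession.builder.appName(app_name).getOrCreate()


    def main():
        env = os.environ.get("ENVIRON", "DEV")       # OS environment variable
        src_dir, tgt_dir = sys.argv[1], sys.argv[2]  # run-time arguments

        spark = get_spark_session(env, "File Format Converter")
        df = spark.read.csv(src_dir, header=True, inferSchema=True)
        df.coalesce(1).write.mode("overwrite").parquet(tgt_dir)


    if __name__ == "__main__":
        main()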

Databricks Jobs and Clusters

  • Introduction to Jobs and Clusters
  • Creating Pools in Databricks Platform
  • Create Cluster on Azure Databricks
  • Request to Increase CPU Quota on Azure
  • Creating Job on Databricks
  • Submitting Jobs using Job Cluster
  • Create Pool in Databricks
  • Running Job using Interactive Cluster Attached to Pool
  • Running Job Using Job Cluster Attached to Pool
  • Exercise - Submit the application as job using interactive cluster

Deploy and Run on Databricks

  • Prepare PyCharm for Databricks
  • Prepare Data Sets
  • Move files to ghactivity
  • Refactor Code for Databricks
  • Validating Data using Databricks
  • Setup Data Set for Production Deployment
  • Access File Metadata using dbutils
  • Build Deployable bundle for Databricks
  • Running Jobs using Databricks Web UI
  • Get Job and Run Details using Databricks CLI
  • Submitting Databricks Jobs using CLI
  • Setup and Validate Databricks Client Library
  • Resetting the Job using Jobs API
  • Run Databricks Job programmatically using Python
  • Detailed Validation of Data
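
As a preview, running and inspecting a job programmatically reduces to two calls against the Jobs 2.1 REST API. A hedged sketch using the requests library, with placeholder workspace URL, token, and job id:

    import requests

    HOST = "https://<workspace-instance>.cloud.databricks.com"  # placeholder
    TOKEN = "<personal-access-token>"                           # placeholder
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    # Trigger an existing job by id (Jobs API 2.1).
    run = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers=HEADERS,
        json={"job_id": 42},  # placeholder job id
    ).json()

    # Fetch the run's current state.
    status = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run["run_id"]},
    ).json()
    print(status["state"])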

Deploy Jobs using Notebooks

  • Modularizing Notebooks
  • Running Job using Notebook
  • Refactor application as Databricks Notebooks
  • Run Notebook using Development Cluster

Deep Dive into Delta Lake using Data Frames

  • Introduction to Delta Lake using Data Frames
  • Creating Data Frames for Delta Lake
  • Writing Data Frame using Delta Format
  • Updating Existing Data using Delta Format
  • Delete Existing Data using Delta Format
  • Merge or Upsert Data using Delta Format
  • Deleting using Merge in Delta Lake
  • Point-in-time Snapshot Recovery using Delta Logs
  • Deleting unnecessary Delta Files using Vacuum
  • Compaction of Delta Lake Files
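
The operations in this section map onto Delta Lake's DeltaTable API. A compact sketch, assuming a cluster with Delta Lake available and an illustrative path /mnt/delta/orders:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Assumes a Databricks runtime, or a session with delta-spark configured.
    spark = SparkSession.builder.getOrCreate()

    # Write a data frame in Delta format (illustrative path).
    df = spark.createDataFrame(
        [(1, "PENDING"), (2, "COMPLETE")], ["order_id", "order_status"]
    )
    df.write.format("delta").mode("overwrite").save("/mnt/delta/orders")

    dt = DeltaTable.forPath(spark, "/mnt/delta/orders")

    # Update and delete in place; set-values are SQL expressions.
    dt.update("order_id = 1", {"order_status": "'CLOSED'"})
    dt.delete("order_status = 'COMPLETE'")

    # Merge (upsert) new rows into the table.
    updates = spark.createDataFrame(
        [(1, "SHIPPED"), (3, "PENDING")], ["order_id", "order_status"]
    )
    (dt.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

    # Time travel to the first snapshot, then clean up stale files.
    spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/orders").show()
    dt.vacuum(168)  # keep 7 days of history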

Deep Dive into Delta Lake using Spark SQL

  • Introduction to Delta Lake using SQL
  • Creating Data Frames for Delta Lake
  • Create Delta Lake Table
  • Insert Data to Delta Lake Table
  • Update Data in Delta Lake Table
  • Delete Data from Delta Lake Table
  • Merge or Upsert Data into Delta Lake Table
  • Using Merge Function over Delta Lake Table
  • Point-in-time Snapshot Recovery using Delta Lake Table
  • Vacuuming Delta Lake Tables
  • Compaction of Delta Lake Tables
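
The same operations expressed through Spark SQL; the sketch runs the statements via spark.sql so it works from a notebook or script, with illustrative table and column names:

    from pyspark.sql import SparkSession

    # Assumes a Databricks runtime, or a session with delta-spark configured.
    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id INT,
            order_status STRING
        ) USING DELTA
    """)
    spark.sql("INSERT INTO orders VALUES (1, 'PENDING'), (2, 'COMPLETE')")
    spark.sql("UPDATE orders SET order_status = 'CLOSED' WHERE order_id = 1")
    spark.sql("DELETE FROM orders WHERE order_status = 'COMPLETE'")

    # Stage some updates, then upsert them into the Delta table.
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW orders_updates AS
        SELECT * FROM VALUES (1, 'SHIPPED'), (3, 'PENDING')
            AS t(order_id, order_status)
    """)
    spark.sql("""
        MERGE INTO orders AS t
        USING orders_updates AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # Point-in-time query and housekeeping.
    spark.sql("SELECT * FROM orders VERSION AS OF 0").show()
    spark.sql("VACUUM orders RETAIN 168 HOURS")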

Accessing Databricks Cluster Terminal via Web as well as SSH

  • Enable Web Terminal in Databricks Admin Console
  • Launch Web Terminal for Databricks Cluster
  • Setup SSH for the Databricks Cluster Driver Node
  • Validate SSH Connectivity to the Databricks Driver Node on AWS
  • Limitations of SSH and comparison with Web Terminal

Installing Software on Databricks Clusters using init scripts

  • Setup gen_logs on Databricks Cluster
  • [Commands] Setup gen_logs on Databricks Cluster
  • Overview of Init Scripts for Databricks Clusters
  • Create Script to install software from git on Databricks Cluster
  • [Commands] Create Script to install software from git on Databricks Cluster
  • Copy init script to dbfs location
  • [Commands] Copy init script to dbfs location
  • Create Databricks Standalone Cluster with init script

Quick Recap of Spark Structured Streaming

  • Validate Netcat on Databricks Driver Node
  • Push log messages to Netcat Webserver on Databricks Driver Node
  • Reading Web Server logs using Spark Structured Streaming
  • Writing Streaming Data to Files
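
The recap reduces to a socket-source pipeline: log messages pushed to Netcat on the driver node are read as a stream and written to files. A minimal sketch with placeholder host, port, and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Log Streaming").getOrCreate()

    # Read log messages from a Netcat server on the driver node;
    # the socket source yields a single string column named `value`.
    logs = (spark.readStream
            .format("socket")
            .option("host", "localhost")  # placeholder host
            .option("port", 9999)         # placeholder port
            .load())

    # Write the stream to files, tracking progress in a checkpoint.
    query = (logs.writeStream
             .format("csv")
             .option("path", "/mnt/streaming/logs")              # placeholder
             .option("checkpointLocation", "/mnt/streaming/ckpt")
             .start())
    query.awaitTermination()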

Incremental Loads using Spark Structured Streaming

  • Overview of Spark Structured Streaming
  • Steps for Incremental Data Processing
  • Configure Cluster with Instance Profile
  • Upload GHArchive Files to s3
  • Read JSON Data using Spark Structured Streaming
  • Write using Delta file format using Trigger Once
  • Analyze GHArchive Data in Delta files using Spark
  • Add New GHActivity JSON files
  • Load Data Incrementally to Target Table
  • Validate Incremental Load
  • Internals of Spark Structured Streaming File Processing
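
The incremental pattern here is a file-source stream with a one-shot trigger: each run processes only the files the checkpoint has not yet seen. A sketch with an illustrative schema and placeholder s3 paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # File sources need an explicit schema; this one is illustrative.
    schema = "id STRING, type STRING, created_at TIMESTAMP"

    incoming = (spark.readStream
                .schema(schema)
                .json("s3://itv-github/landing/"))  # placeholder bucket

    # Trigger once: process every new file since the last checkpoint, then stop.
    (incoming.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://itv-github/ckpt/")  # placeholder
        .trigger(once=True)
        .start("s3://itv-github/bronze/ghactivity/"))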

Incremental Loads using Cloud Files

  • Overview of Auto Loader cloudFiles
  • Upload GHArchive Files to s3
  • Write Data using Auto Loader cloudFiles
  • Add New GHActivity JSON files
  • Load Data Incrementally to Target Table
  • Add New GHActivity JSON files
  • Overview of Handling S3 Events using AWS Services
  • Configure IAM Role for cloudFiles file notifications
  • Incremental Load using cloudFiles File Notifications
  • Review AWS Services for cloudFiles Event Notifications
  • Review Metadata Generated for cloudFiles Checkpointing
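
Auto Loader swaps the plain file source for the cloudFiles format and can either list the directory or consume S3 event notifications. A sketch of the directory-listing mode on a Databricks cluster, with placeholder paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    incoming = (spark.readStream
                .format("cloudFiles")                 # Databricks Auto Loader
                .option("cloudFiles.format", "json")
                # For S3 event notifications instead of directory listing:
                # .option("cloudFiles.useNotifications", "true")
                .schema("id STRING, type STRING, created_at TIMESTAMP")
                .load("s3://itv-github/landing/"))    # placeholder bucket

    (incoming.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://itv-github/cf-ckpt/")
        .trigger(once=True)
        .start("s3://itv-github/bronze/ghactivity/"))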

Overview of Databricks SQL Clusters

  • Overview of Databricks SQL Platform - Introduction
  • Run First Query using SQL Editor of Databricks SQL
  • Overview of Dashboards using Databricks SQL
  • Overview of Databricks SQL Data Explorer to review Metastore Database and Tables
  • Use Databricks SQL Editor to develop scripts or queries
  • Review Metadata of Tables using Databricks SQL Platform
  • Overview of loading data into retail_db tables
  • Configure Databricks CLI to push data into Databricks Platform
  • Copy JSON Data into DBFS using Databricks CLI
  • Analyze JSON Data using Spark APIs
  • Analyze Delta Table Schemas using Spark APIs
  • Load Data from Spark Data Frames into Delta Tables
  • Run Adhoc Queries using Databricks SQL Editor to validate data
  • Overview of External Tables using Databricks SQL
  • Using COPY Command to Copy Data into Delta Tables
  • Manage Databricks SQL Endpoints
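
As a preview of the COPY command lecture, loading files into a Delta table reduces to a single idempotent statement. A sketch issued through spark.sql, with a placeholder table name and DBFS path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # COPY INTO is idempotent: files already loaded are skipped on re-runs.
    spark.sql("""
        COPY INTO retail_db.orders
        FROM 'dbfs:/public/retail_db_json/orders'  -- placeholder path
        FILEFORMAT = JSON
    """)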

Instructors

Mr Durga Viswanatha Raju Gadiraju
Technology Adviser
Freelancer
