Data Engineering Essentials Hands-on - SQL, Python and Spark

BY
Udemy

Familiarize yourself with Data Engineering Skills by joining the online course by Udemy.

Mode

Online

Fees

₹ 449  ₹ 3,099

Quick Facts

Medium of instruction: English
Mode of learning: Self study
Mode of delivery: Video and text based

Course overview

The Data Engineering Essentials using SQL, Python, and PySpark course is intended to give students an overview of data engineering concepts and to help them develop the ability to build data engineering applications on GCP. The curriculum, prepared by Durga Viswanatha Raju Gadiraju, a Technology Advisor, provides a broad overview of data engineering skills such as SQL, Python, PySpark, programming constructs, collections, Pandas, database programming, and more.

The Data Engineering Essentials using SQL, Python, and PySpark certification, a short certification by Udemy, covers Database Essentials, Programming Essentials using Python, Data Engineering using Spark SQL, and more. Interested candidates are advised to have a CS or IT degree, or prior IT experience, to make the most of the course.

The highlights

  • Online course
  • Downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of completion
  • English videos
  • 30-Day Money-Back Guarantee

Program offerings

  • 61 hours on-demand video
  • 15 articles
  • 4 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of completion
  • English videos
  • Exercises
  • Practice tests

Course and certificate fees

Fees information: ₹ 449  ₹ 3,099
Certificate availability: Yes
Certificate providing authority: Udemy

Who it is for

What you will learn

Knowledge of Python, knowledge of SQL

After completing the Data Engineering Essentials using SQL, Python, and PySpark online certification, students will have learned the Spark application development life cycle, SQL, Python, PyCharm for Python application development, how to Dockerize an application, and more.

The syllabus

Introduction about the course

  • Introduction about course
  • Desired Audience
  • Pre-requisites
  • [Must Watch] 30 Day Money Back Guarantee - Feedback and Rating
  • Training Approach
  • Overview of Environment for Hands on Practice
  • How to access data sets used in this course?

Setup Environment to learn Python, SQL, Hadoop, Spark using Docker on Windows 11

  • Setup Environment using Docker on Windows 11 - Introduction
  • Understanding System Configuration of Windows 11 PC
  • Steps to setup Docker Desktop on Windows 11
  • Enable WSL2 on Windows 11 by installing Ubuntu VM using WSL
  • Install Linux Kernel Update Package on Windows 11 for Docker Desktop
  • Download and Install Docker Desktop on Windows 11
  • Validating git using WSL Ubuntu on Windows 11
  • Clone Data Engineering Essentials Material on Windows 11
  • Start Python and SQL Containers using docker-compose command on Windows 11
  • Download and Install Pycharm on Windows 11
  • Setup Pycharm Project for Data Engineering
  • Review Docker Compose File for Data Engineering Essentials Material
  • Review important Docker Compose Commands to manage services
  • Access Jupyter Based Environment to learn Python and SQL
  • Getting Jupyter Lab Token to login into Jupyter Lab

Setup Environment to learn Python, SQL, Hadoop, Spark using Docker on Windows 10

  • Understanding System Configuration
  • Setup Docker Desktop on Windows
  • Validate Docker on Windows using Command Line leveraging PowerShell
  • Review Docker Desktop Resource Configurations
  • Clone GitHub Repository on Windows
  • Setup Pycharm Project for Data Engineering Essentials
  • Update Git Global Settings related to Line Endings
  • Review Services Docker Compose
  • Start Python and SQL Environment using Docker Compose
  • Review resource utilization after setting up Python and SQL Environment
  • Access Jupyter Based Environment to learn Python
  • Getting Jupyter Lab Token to login into Jupyter Lab

Setup Environment to learn Python, SQL, Hadoop and Spark using Docker on Mac

  • Setup Environment using Mac
  • Setup Docker Desktop on Mac
  • Validate Docker Setup on Mac
  • Review Memory and CPU Settings of Docker Desktop for Mac
  • Configure Docker Desktop for Data Engineering Essentials Environment
  • Clone GitHub Repository for Data Engineering Essentials
  • Setup as Pycharm Project to review the files using IDE
  • Review Docker Compose file for Python and SQL Lab
  • Start Python and SQL Environment using Docker Compose
  • Review resource utilization after setting up Python and SQL Environment
  • Access Jupyter Based Environment to learn Python
  • Getting Jupyter Lab Token to login into Jupyter Lab

Setting up Environment to learn Python, SQL as well as Spark using AWS Cloud9

  • Getting Started with Cloud9
  • Creating Cloud9 Environment
  • Warming up with Cloud9 IDE
  • Details about material to setup postgres database using docker
  • Overview of EC2 related to Cloud9
  • Opening ports for Cloud9 Instance
  • Associating Elastic IPs to Cloud9 Instance
  • Increase EBS Volume Size of Cloud9 Instance
  • Setup Docker Compose on AWS Cloud9 Instance
  • Clone GitHub Repository
  • Setup Python and SQL Environment using Docker Compose
  • Update Inbound Rules of AWS EC2 Security Group
  • Login into the Jupyter based environment

Networking Concepts for Beginners - ip addresses and port numbers

  • Enable telnet on Windows
  • Different IP Address Types
  • Port Numbers associated with Applications or Services
  • Reverting port for SSH to default port number
  • Setup Apache2 on Ubuntu
  • Overview of localhost
  • Overview of Private IP Address associated with a server
  • Overview of Public IP Address associated with a server
  • Setup Web Application and access using local ip
  • Setup Web Application and access using private ip
  • Disable Access to Web Application using Public ip
  • Install sshuttle on Mac using brew
  • Access Web Application using Private IP using SSH as proxy

Database Essentials - Getting Started

  • Setup SMS Database using Postgres
  • Connecting to Postgresql Database
  • Using psql to interact with Postgresql Database using CLI
  • Data Loading Utilities in Postgresql

Database Essentials - Database Operations

  • Database Operations - Overview
  • Database CRUD Operations
  • Creating Table in Postgres Database
  • Inserting Data into Postgres Database Table
  • Updating Data in Postgres Database Table
  • Deleting Data in Postgres Database Table
  • Overview of Database Transactions
  • Exercise - DML or CRUD Operations using Postgresql

Database Essentials - Writing Basic SQL Queries

  • Standard Transformations
  • Overview of Data Model
  • Define Problem Statement
  • Preparing Database Tables using Postgres
  • Selecting or Projecting Data from Postgres Database Tables using SQL
  • Filtering Data from Postgres Database Tables using SQL
  • Joining Postgres Database Tables using SQL - Inner
  • Joining Postgres Database Tables using SQL - Outer
  • Performing Aggregations using SQL on Postgres Database Tables
  • Sorting Data in Postgres Tables using SQL
  • Solution - Daily Product Revenue using SQL on Postgres Database Tables
  • Exercises - Writing Basic SQL Queries on Postgres Database Tables

Database Essentials - Creating Tables and Indexes

  • DDL - Data Definition Language
  • Overview of Data Types used while creating Postgres Database Tables
  • Adding or Modifying Columns using Alter in Postgres Database Tables
  • Different Type of Constraints used on Database Tables
  • Managing Constraints on Postgres Database Tables
  • Indexes on Postgres Database Tables
  • Indexes for Constraints on Postgres Database Tables
  • Overview of Sequences used on Postgres Database Tables
  • Truncating Postgres Database Tables
  • Dropping Postgres Database Tables
  • Exercises and Solutions - Managing Database Objects using Postgresql

Database Essentials - Partitioning Tables and Indexes

  • Overview of Partitioning of Postgres Database Tables
  • List Partitioning of Database Tables
  • Managing Partitions of Postgres Database Tables - List
  • Manipulating Data in Postgres Database Partitioned Tables
  • Range Partitioning of Postgres Database Tables
  • Managing Partitions of Postgres Database Tables - Range
  • Repartitioning of Postgres Database Tables - Range
  • Hash Partitioning of Postgres Database Tables
  • Managing Partitions of Postgres Database Tables - Hash
  • Usage Scenarios of Database Partitioned Tables
  • Sub Partitioning of Postgres Database Tables
  • Exercise - Partitioned Tables of Postgres Database Tables

Database Essentials - Predefined Functions

  • Overview of SQL Functions in Postgres
  • String Manipulation Functions in SQL using Postgres
  • Case Conversion and Length using Functions in SQL using Postgres
  • Extracting Data - Using substr and split_part Functions in SQL using Postgres
  • Using position or strpos Functions in SQL using Postgres
  • Trimming and Padding Functions in SQL using Postgres
  • Reverse and Concatenate Multiple Strings using Functions in SQL using Postgres
  • String Replacement using Functions in SQL using Postgres
  • Date Manipulation Functions using SQL in Postgres
  • Getting Current Date or Timestamp using Functions in SQL using Postgres
  • Date Arithmetic using Functions in SQL using Postgres
  • Beginning Date or Time using date_trunc Function in SQL using Postgres
  • Using to_char and to_date Functions in SQL using Postgres
  • Extracting Information using extract Function in SQL using Postgres
  • Dealing with Unix Timestamp or epoch using Functions in SQL using Postgres
  • Overview of Numeric Functions using SQL in Postgres
  • Data Type Conversion using Functions in SQL using Postgres
  • Handling NULL Values using SQL in Postgres
  • Using CASE and WHEN as part of SQL in Postgres

Database Essentials - Writing Advanced SQL Queries

  • Overview of Database Views using Postgres Database
  • Overview of Named Queries using SQL in Postgres
  • Overview of Sub Queries using SQL in Postgres
  • CTAS - Create Table As Select using Postgres
  • Advanced DML Operations on Postgres Database Tables
  • Merging or Upserting Data into Postgres Database Tables
  • Pivoting Rows into Columns using SQL in Postgres
  • Overview of Analytic Functions using SQL in Postgres
  • Analytic Functions - Aggregations using SQL in Postgres
  • Cumulative or Moving Aggregations using SQL in Postgres
  • Analytic Functions using SQL in Postgres - Windowing
  • Analytic Functions using SQL in Postgres - Ranking
  • Analytic Functions using SQL in Postgres - Filtering
  • Ranking and Filtering using SQL in Postgres - Recap
  • Exercises - Writing Advanced Queries

Programming Essentials using Python - Perform Database Operations

  • Introduction - Perform Database Operations
  • Overview of SQL
  • Create Database and Users Table
  • DDL - Data Definition Language
  • DML - Data Manipulation Language
  • DQL - Data Query Language
  • CRUD Operations - DML and DQL
  • TCL - Transaction Control Language
  • Example - Data Engineering
  • Example - Web Application
  • Exercise - Database Operations

Programming Essentials using Python - Getting Started with Python

  • Installing Python on Windows
  • Overview of Anaconda
  • Python CLI and Jupyter Notebook
  • Overview of Jupyter Lab
  • Using IDEs - Pycharm
  • Using Visual Studio Code
  • Using ITVersity Labs
  • Leveraging Google Colab

Programming Essentials using Python - Basic Programming Constructs

  • Basic Programming Constructs using Python - Introduction
  • Getting Help using help function in Python
  • Python Variables and Objects
  • Python Data Types - Commonly Used
  • Operators in Python
  • Tasks - Data Types and Operators using Python
  • Developing Conditionals using Python
  • All about for loops in Python
  • Running os commands in Python
  • Exercises - Basic Programming Constructs using Python
  • Dynamic Arithmetic Operations using eval and exec in Python

Programming Essentials using Python - Predefined Functions

  • Predefined Functions in Python - Introduction
  • Overview of Predefined Functions in Python
  • Numeric Functions in Python
  • Overview of Strings in Python
  • String Manipulation Functions in Python
  • Formatting Strings in Python
  • Print and Input Functions in Python
  • Date Manipulation Functions in Python
  • Exercises - Predefined Functions in Python

Programming Essentials using Python - User Defined Functions

  • Developing User Defined Functions in Python - Introduction
  • Defining Functions in Python
  • Doc Strings in Python
  • Returning Variables from Python Functions
  • Passing Function Parameters and Arguments to Python Functions
  • Varying Arguments in Python
  • Keyword Arguments in Python
  • Recap of User Defined Functions in Python
  • Passing Functions as Arguments to Python Functions
  • Lambda or Anonymous Functions in Python
  • Usage of Lambda Functions in Python Functions
  • Exercise - User Defined Functions in Python

Programming Essentials using Python - Overview of Collections - list and set

  • Overview of Collections in Python - list and set - Introduction
  • Overview of list and set in Python
  • Common Operations on Python Collections
  • Accessing elements from Python list
  • Adding elements to Python list
  • Updating and Deleting elements from Python list
  • Other or Miscellaneous Python list operations
  • Adding and Deleting elements using Python set
  • Typical Python set operations
  • Validating Python sets
  • Usage of Python list and set
  • Exercises - Basic Operations on Python list and set
  • Python List of Delimited Strings
  • Sorting data in Python lists and tuples
  • Sorting list of Delimited Strings using Python
  • Exercises - Sorting lists and sets in Python

Programming Essentials using Python - Overview of Collections - dict and tuple

  • Manipulating Collections using loops in Python - Introduction
  • Overview of Python dict and tuple
  • Common Operations on dict and tuple using Python
  • Accessing Elements from Python tuples
  • Accessing Elements from Python dict
  • Manipulating Python dict
  • Common Examples of Python dict
  • Representing Tables or Excel Sheets as Python List of Tuples
  • Representing Tables or Excel Sheets as Python List of dicts
  • Process Python dict values
  • Processing Python dict items
  • Sorting Python dict items
  • Exercises - Overview of Python Collections - dict and set

Programming Essentials using Python - Manipulating Collections using loops

  • Manipulating Collections using loops in Python - Introduction
  • Reading Files into Python Collections
  • Overview of Standard Transformations
  • Row Level Transformations using Python loops
  • Getting Unique Elements using Python loops
  • Filtering Data using Python loops and conditionals
  • Preparing Data Sets
  • Quick recap of Python dict operations
  • Performing Total Aggregations using Python loops
  • Overview of Grouped Aggregations using Python loops
  • Get Order Count by Status using Python loops
  • Get Revenue Details per Order using Python loops
  • Get Order Count by Month using Python loops
  • Joining Data Sets using Python loops
  • Manipulate Collections using Comprehensions in Python
  • List Comprehensions using Python
  • Set Comprehensions using Python
  • Dict Comprehensions in Python
  • Limitations of using loops to process data sets
  • Exercises - Manipulating Collections using Python loops

Programming Essentials using Python - Development of Map Reduce APIs

  • Develop myFilter Function using Python loops and conditionals
  • Validate myFilter using Python loops and conditionals
  • Develop myMap Function using Python loops
  • Validate myMap Function using Python loops
  • Develop myReduce Function using Python loops
  • Validate myReduce Function using Python loops
  • Develop myReduceByKey Function using Python loops
  • Validate myReduceByKey Function using Python loops
  • Develop myJoin Function using Python loops
  • Validate myJoin Function using Python loops
  • Exercises - Development of Map Reduce APIs using Python loops and Conditionals
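
As an illustration of what this section builds, here is a minimal sketch of loop-based versions of these functions. The function names come from the lecture titles above; the signatures, the sample orders data, and the omission of myJoin are assumptions made for this example, not the course's actual code.

```python
def myFilter(collection, predicate):
    """Return the elements for which predicate(element) is true."""
    result = []
    for element in collection:
        if predicate(element):
            result.append(element)
    return result


def myMap(collection, func):
    """Return a new list with func applied to every element."""
    result = []
    for element in collection:
        result.append(func(element))
    return result


def myReduce(collection, func):
    """Combine all elements into one value, starting from the first element."""
    iterator = iter(collection)
    try:
        total = next(iterator)
    except StopIteration:
        raise ValueError("myReduce needs at least one element")
    for element in iterator:
        total = func(total, element)
    return total


def myReduceByKey(pairs, func):
    """Combine the values of (key, value) pairs separately for each key."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


# Hypothetical sample data: (order_id, status, amount)
orders = [
    (1, "COMPLETE", 100.0),
    (2, "PENDING", 50.0),
    (3, "COMPLETE", 25.5),
    (4, "CLOSED", 10.0),
]

complete_orders = myFilter(orders, lambda order: order[1] == "COMPLETE")
complete_amounts = myMap(complete_orders, lambda order: order[2])
complete_revenue = myReduce(complete_amounts, lambda a, b: a + b)

status_pairs = myMap(orders, lambda order: (order[1], 1))
count_by_status = myReduceByKey(status_pairs, lambda a, b: a + b)

print(complete_revenue)
print(count_by_status)
```

Passing the predicate or transformation in as a function keeps each helper generic, which is the same shape as the built-in filter and map and functools.reduce covered in the next section.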

Programming Essentials using Python - Understanding Map Reduce Libraries

  • Preparing Data Sets
  • Filtering Data using Python filter
  • Projecting data using Python map
  • Row Level Transformations using Python map
  • Aggregations using Python reduce
  • Get Revenue for a given product id using Python Map Reduce
  • Get total items sold and revenue for a product using Python Map reduce
  • Get total commission amount using Python Map Reduce
  • Overview of itertools
  • Cumulative Operations using Python itertools
  • Using Python itertools starmap
  • Overview of Python itertools groupby
  • Get order count by status using Python itertools groupby
  • Get revenue per order using Python itertools groupby
  • Limitations of Python Map Reduce Libraries
  • Exercises - Understanding Python Map Reduce Libraries
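
The topics in this section can be sketched with the standard library alone. The example below uses filter, map, functools.reduce, and itertools.groupby on small in-memory data sets; the tuples and their column layout are hypothetical stand-ins, not the course's retail data.

```python
from functools import reduce
from itertools import groupby

# Hypothetical sample data.
# orders: (order_id, status); order_items: (order_id, product_id, subtotal)
orders = [
    (1, "COMPLETE"),
    (2, "PENDING"),
    (3, "COMPLETE"),
    (4, "CLOSED"),
]
order_items = [
    (1, 101, 200.0),
    (1, 102, 50.0),
    (2, 101, 200.0),
    (3, 103, 25.0),
]

# Revenue for one product: filter the rows, project the subtotal, then reduce.
items_for_101 = filter(lambda item: item[1] == 101, order_items)
subtotals = map(lambda item: item[2], items_for_101)
revenue_for_101 = reduce(lambda total, value: total + value, subtotals, 0.0)

# groupby only groups adjacent rows, so the data is sorted by the key first.
orders_by_status = sorted(orders, key=lambda order: order[1])
count_by_status = {
    status: len(list(group))
    for status, group in groupby(orders_by_status, key=lambda order: order[1])
}

items_by_order = sorted(order_items, key=lambda item: item[0])
revenue_per_order = {
    order_id: sum(item[2] for item in group)
    for order_id, group in groupby(items_by_order, key=lambda item: item[0])
}

print(revenue_for_101)
print(count_by_status)
print(revenue_per_order)
```

Forgetting the sort before groupby is a common mistake: unsorted input yields one group per run of equal keys rather than one group per key.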

Programming Essentials using Python - Basics of File IO using Python

  • Basics of File IO using Python - Introduction
  • Overview of File IO using Python
  • Understand concepts behind Folders and Files
  • Getting File Paths and File Names
  • Overview of Retail Data
  • Read text file into string using Python File I/O
  • Write string to text file using Python File I/O
  • Overview of modes to write into files using Python File I/O
  • Overview of Delimited Strings
  • Read csv into list of strings using Python File I/O
  • Writing Strings to file in Append Mode using Python File I/O
  • Managing Files and Folders using Python File I/O

Programming Essentials using Python - Delimited Files and Collections

  • Understanding Delimited Files and Collections
  • Overview of Delimited Text Files
  • Recap of basic file IO using Python
  • Read Delimited files into list of tuples using Python File I/O
  • Write Delimited Strings into files using Python File I/O
  • Overview of Python CSV Module to process files
  • Read Delimited data into list using Python CSV APIs
  • Writing iterables to files using Python CSV APIs
  • Advantages of using APIs in Python CSV module
  • Apply Schema on lists from files using Python

Programming Essentials using Python - Overview of Pandas Libraries

  • Overview of Python Pandas Libraries
  • Understanding Python Pandas Data Structures
  • Overview of Python Series
  • Creating Python Data Frames from lists
  • Basic Operations on Python Data Frames
  • Reading Data from CSV Files to Python Pandas Data Frames
  • Projecting and Filtering using Python Pandas Data Frame APIs
  • Performing Total Aggregations using Python Pandas Data Frame APIs
  • Performing Grouped Aggregations using Python Pandas Data Frame APIs
  • Writing Python Pandas Data Frames to Files
  • Joining Data in Python Pandas Data Frames using join

Programming Essentials using Python - Database Programming - CRUD Operations

  • Database Operations using Python - CRUD Operations - Introduction
  • Overview of Database Programming using Python
  • Recap of RDBMS Concepts
  • Setup Database Client Libraries for Python Applications
  • Develop Function to get Database Connection using Python
  • Create Database Table in Postgres using Python
  • Inserting Data into Table in Postgres using Python
  • Updating Existing Table Data in Postgres using Python
  • Deleting Data From Table in Postgres using Python
  • Querying Data From Table in Postgres using Python
  • Recap - CRUD Operations using Python

Programming Essentials using Python - Database Programming - Batch Operations

  • Database Programming using Python - Batch Operations - Introduction
  • Recap of Insert using Python
  • Preparing Database to perform batch operations using Python
  • Reading Data From File using Python File I/O
  • Batch Loading of Data into Database Table using Python
  • Best Practices for Batch Loading into Database Table using Python

Programming Essentials using Python - Processing JSON Data

  • Processing JSON Data - Introduction
  • Process JSON using Python Pandas
  • JSON Data Types
  • Create JSON String
  • Process JSON String
  • Single JSON Document in Files
  • Multiple JSON Documents in files
  • Process JSON using Pandas
  • Different JSON Formats supported by Python Pandas
  • Common Use Cases for JSON
  • Write to JSON files using Python json module
  • Write to JSON files using Python Pandas

Programming Essentials using Python - Processing REST Payloads

  • Overview of REST APIs
  • Using curl command
  • Overview of Postman
  • Getting Started with Python requests module
  • Convert REST Payload to Python Objects
  • Process REST Payload using Python Collection Operations
  • Process REST Payload using Python Pandas

Understanding Python Virtual Environments

  • Introduction to Python Virtual Environments
  • Validating Python Versions
  • Create Python Virtual Environment for Web Application
  • Reviewing dependencies installed in Python Virtual Environment
  • Installing Dependencies for Web Application using Python pip
  • Getting Details about installed packages using Python pip
  • Uninstall Packages using Python pip
  • Cleanup Python Virtual Environment
  • Recreate and Activate Python Virtual Environment for Web Application
  • Define requirements file for Python Web Application
  • Install Dependencies using requirements file for Python Web Application
  • Create Virtual Environment for Data Engineering Application using Python
  • Install Dependencies for Data Engineering Application using Python
  • Install Dependencies for Data Engineering Application using Python 3.6
  • Validate Python and Package Compatibility and Install Python 3.6
  • Conclusion about understanding Python Virtual Environments

Overview of Pycharm for Python Application Development

  • Introduction to Pycharm for Python Application Development
  • Installation of Pycharm on Windows for Python Application Development
  • Installation of Pycharm on Mac for Python Application Development
  • Setup Python Getting Started Project using Pycharm
  • Setup Python Getting Started Project using Pycharm on Mac
  • Setup de-demo Python project using Pycharm
  • Accessing Settings in Pycharm and Changing Font Size
  • Accessing Settings in Pycharm and Changing Font Size on Mac
  • Install Python Packages using Pycharm
  • Overview of Pycharm Integrated Terminal
  • Overview of Pycharm Integrated Terminal on Mac
  • Overview of Run Time Arguments for Python Applications
  • Passing Run Time Arguments to Python Applications using Pycharm

Data Copier - Getting Started

  • Introduction to Getting Started for Data copier using Python
  • Problem Statement - Data Copier using Python
  • Create Working Directory for the Python Project
  • Setup Docker on Windows 10 Pro
  • Quick Overview of Docker
  • Prepare Dataset
  • Create Postgres Container
  • Setup Postgres Database for development
  • Overview of Postgres Database Commands
  • Setup Python Project using Pycharm
  • Managing Python Dependencies for the project
  • Create GitHub Project

Data Copier - Reading Data using Pandas

  • Reading Data using Python Pandas - Introduction
  • Overview of Retail Data
  • Adding Python Pandas to the project
  • Reading JSON Data using Python Pandas
  • Previewing Data using Python Pandas
  • Reading Data in Chunks using Python Pandas
  • Dynamically read files using Python os module

Data Copier - Database Programming using Pandas

  • Database Programming using Python Pandas - Introduction
  • Validate Postgres Setup using Docker
  • Add required dependencies for database programming using Python pandas
  • Create users table in retail_db Database
  • Populating Sample Data into users table
  • Reading data from table using Python Pandas
  • Truncate users Postgres Database Table
  • Writing Python Pandas Dataframe to table
  • Validating users data in Postgres Database Table
  • Drop users Postgres Database Table

Data Copier - Loading Data from files to tables

  • Loading Data from files to tables - Introduction
  • Populating Departments data into table
  • Validate departments table
  • Populating orders table in chunks using Python Pandas
  • Validate orders table in Postgres Database
  • Validate orders table using pandas

Data Copier - Modularizing the application

  • Overview of Python main function
  • Overview of Python Environment Variables
  • Using Python os module for Environment Variables
  • Passing Environment Variables to Python Applications using Pycharm
  • Read logic using Python Pandas
  • Validate read logic developed using Python Pandas
  • Write logic using Python Pandas
  • Validate write logic developed using Python Pandas
  • Integrate read and write logic using Python
  • Validate Integration logic developed using Python
  • Develop logic to load multiple tables using Python
  • Validate Python logic for table list as run time argument
  • Push Python Application Changes to remote git repository
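
One possible shape for the modularized entry point described in this section is sketched below: settings come from environment variables and the table list arrives as a run-time argument. The variable names DB_HOST, DB_PORT, and DB_NAME, the defaults, and the function names are assumptions for illustration only; the actual read and write logic is replaced by a print statement.

```python
import os
import sys


def get_db_config(env):
    """Build connection settings from environment variables (names are hypothetical)."""
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "name": env.get("DB_NAME", "retail_db"),
    }


def parse_table_list(argv):
    """Read a comma-separated table list from the first run-time argument."""
    if len(argv) < 2:
        return []
    return [table.strip() for table in argv[1].split(",") if table.strip()]


def main(argv=None, env=None):
    argv = sys.argv if argv is None else argv
    env = os.environ if env is None else env
    config = get_db_config(env)
    tables = parse_table_list(argv)
    for table in tables:
        # Placeholder for the read and write steps of the real application.
        print(f"Would copy {table} into {config['name']} on {config['host']}:{config['port']}")
    return config, tables


if __name__ == "__main__":
    main()
```

Run as, for example, `python app.py orders,departments` with the environment variables set in the shell or in the PyCharm run configuration. Accepting argv and env as parameters also makes main easy to validate without touching the real environment.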

Data Copier - Dockerizing the application

  • Dockerizing the application - Introduction
  • Prepare Database for validation
  • Pull and validate appropriate python image
  • Create and attach network to database docker container
  • Quick recap about Docker containers
  • Review Python based Data Copier Application
  • Deploying Python application and installing dependencies in the docker container
  • Copy source data files into container
  • Add Python Data Copier container to custom network
  • Installing OS libraries as part of Docker container
  • Validate Network Connectivity between Docker Containers
  • Running Application from the Docker Container
  • Delete Docker Container

Data Copier - Using custom Docker Image

  • Using Custom Docker Image - Introduction
  • Getting started with docker custom image
  • Install OS Modules in custom docker image
  • Copying Python Source Code to Docker Custom Image
  • Adding dependencies to the custom image
  • Understanding docker custom image build process
  • Mounting Data Folders on to Docker Container
  • Passing Environment Variables to Docker Container
  • Add Python Data Copier Container to custom network
  • Run Python application using Docker

Data Copier - Deploy and Validate Application on Remote Server

  • Deploy and Validate Python Application on Remote Server - Introduction
  • Push Application Changes to GitHub Repository
  • Requirements to deploy application on Virtual Machine
  • Clone Application on remote machine
  • Setup Data Set for Validation
  • Setup Network and Database Folder for Database using Docker
  • Setup Docker Container for the Database
  • Setup Database and Tables as part of Docker based Database Server
  • Building Custom Docker Image for application
  • Run and Validate Dockerized Application

Setup Single Node Hadoop and Spark Cluster or Lab using Docker

  • Setup Single Node Hadoop and Spark Cluster or Lab using Docker
  • Pre-requisites to setup Hadoop and Spark Lab
  • Configure Docker Desktop
  • Update Hadoop and Spark Content
  • Clone GitHub Repository to setup and learn Hadoop and Spark
  • Cleaning up Docker Containers used for Python and SQL Practice
  • Review Hadoop and Spark Lab details in Docker Compose File
  • Pull Docker Image for Single Node Hadoop and Spark
  • Start Docker Containers related to Hadoop and Spark
  • Overview of reviewing Hadoop and Spark Lab setup using Docker
  • Connecting to Terminal of Spark and Hadoop Containers
  • Review HDFS and YARN on Single Node Hadoop and Spark Cluster
  • Review and Validate Hive on Single Node Hadoop and Spark Cluster
  • Validate Spark 2 using Pyspark and Spark SQL on Single Node Lab
  • Validate Spark 3 using Pyspark and Spark SQL on Single Node Lab
  • Validate Hive Metastore used as part of Single Node Hadoop and Spark Cluster
  • Access Hadoop and Spark Material using Jupyter lab environment
  • Managing Single Node Hadoop and Spark Cluster using Docker

Introduction to Hadoop eco system - Overview of HDFS

  • Getting help or usage
  • Listing HDFS Files
  • Managing HDFS Directories
  • Copying files from local to HDFS
  • Copying files from HDFS to local
  • Getting Files Metadata
  • Previewing Data in HDFS Files
  • HDFS Block Size
  • HDFS Replication Factor
  • Getting HDFS Storage Usage
  • Using HDFS Stat Commands
  • HDFS File Permissions
  • Overriding Properties

Data Engineering using Spark SQL - Getting Started

  • Getting Started - Overview
  • Overview of Spark Documentation
  • Launching and using Spark SQL CLI
  • Overview of Spark SQL Properties
  • Running OS Commands using Spark SQL
  • Understanding Warehouse Directory
  • Managing Spark Metastore Databases
  • Managing Spark Metastore Tables
  • Retrieve Metadata of Tables
  • Role of Spark Metastore or Hive Metastore
  • Exercise - Getting Started with Spark SQL

Data Engineering using Spark SQL - Basic Transformations

  • Basic Transformations - Introduction
  • Spark SQL - Overview
  • Define Problem Statement
  • Prepare Tables
  • Projecting Data
  • Filtering Data
  • Joining Tables - Inner
  • Joining Tables - Outer
  • Aggregation Data
  • Sorting Data
  • Conclusion - Final Solution

Data Engineering using Spark SQL - Managing Tables - Basic DDL and DML

  • Introduction
  • Create Spark Metastore Tables
  • Overview of Data Types
  • Adding Comments
  • Loading Data Into Tables - Local
  • Loading Data Into Tables - HDFS
  • Loading Data - Append and Overwrite
  • Creating External Tables
  • Managed Tables vs External Tables
  • Overview of File Formats
  • Drop Tables and Databases
  • Truncating Tables
  • Exercise - Managed Tables

Data Engineering using Spark SQL - Managing Tables - DML and Partitioning

  • Introduction - Managing Tables - DML and Partitioning
  • Introduction to Partitioning
  • Creating Tables using Parquet
  • Load vs Insert
  • Inserting Data using Stage Table
  • Creating Partitioned Tables
  • Adding Partitions to Tables
  • Loading Data into Partitioned Tables
  • Inserting Data into Partitions
  • Using Dynamic Partition Mode
  • Exercise - Partitioned Tables

Data Engineering using Spark SQL - Overview of Spark SQL Functions

  • Introduction - Overview of Spark SQL Functions
  • Overview of Functions
  • Validating Functions
  • String Manipulation Functions
  • Date Manipulation Functions
  • Overview of Numeric Functions
  • Data Type Conversion
  • Dealing with Nulls
  • Using CASE and WHEN
  • Query Example - Word Count

Data Engineering using Spark SQL - Windowing Functions

  • Introduction - Windowing Functions
  • Prepare HR Database
  • Overview of Windowing Functions
  • Aggregations using Windowing Functions
  • Using LEAD or LAG
  • Getting first and last values
  • Ranking using Windowing Functions
  • Order of execution of SQL
  • Overview of Subqueries
  • Filtering Windowing Function Results

Apache Spark using Python - Data Processing Overview

  • Starting Spark Context - pyspark
  • Overview of Spark Read APIs
  • Understanding airlines data
  • Inferring Schema
  • Previewing Airlines Data
  • Overview of Data Frame APIs
  • Overview of Functions
  • Overview of Spark Write APIs

Apache Spark using Python - Processing Column Data

  • Overview of Predefined Functions in Spark
  • Create Dummy Data Frame
  • Categories of Functions
  • Special Functions - col and lit
  • Common String Manipulation Functions
  • Extracting Strings using substring
  • Extracting Strings using split
  • Padding Characters around Strings
  • Trimming Characters from Strings
  • Date and Time Manipulation Functions
  • Date and Time Arithmetic
  • Using Date and Time Trunc Functions
  • Date and Time Extract Functions
  • Using to_date and to_timestamp
  • Using date_format Function
  • Dealing with Unix Timestamp
  • Dealing with Nulls
  • Using CASE and WHEN

Apache Spark using Python - Basic Transformations

  • Overview of Basic Transformations
  • Data Frames for basic transformations
  • Basic Filtering of Data
  • Filtering Example using dates
  • Boolean Operators
  • Using IN Operator or isin Function
  • Using LIKE Operator or like Function
  • Using BETWEEN Operator
  • Dealing with Nulls while Filtering
  • Total Aggregations
  • Aggregate data using groupBy
  • Aggregate data using rollup
  • Aggregate data using cube
  • Overview of Sorting Data Frames
  • Solution - Problem 1 - Get Total Aggregations
  • Solution - Problem 2 - Get Total Aggregations By FlightDate

Apache Spark using Python - Joining Data Sets

  • Prepare Datasets for Joins
  • Analyze Datasets for Joins
  • Problem Statements for Joins
  • Overview of Joins
  • Using Inner Joins
  • Left or Right Outer Join
  • Solution - Get Flight Count Per US Airport
  • Solution - Get Flight Count Per US State
  • Solution - Get Dormant US Airports
  • Solution - Get Origins without master data
  • Solution - Get Count of Flights without master data
  • Solution - Get Count of Flights per Airport without master data
  • Solution - Get Daily Revenue
  • Solution - Get Daily Revenue rolled up till Yearly

Apache Spark using Python - Spark Metastore

  • Overview of Spark Metastore
  • Exploring Spark Catalog
  • Creating Metastore Tables using catalog
  • Inferring Schema for Tables
  • Define Schema for Tables using StructType
  • Inserting into Existing Tables
  • Read and Process data from Metastore Tables
  • Create Partitioned Tables
  • Saving as Partitioned Table
  • Creating Temporary Views
  • Using Spark SQL

Apache Spark - Development Life Cycle using Python

  • Setup Virtual Environment and Install PySpark
  • [Commands] - Setup Virtual Environment and Install PySpark
  • Getting Started with PyCharm
  • [Code and Instructions] - Getting Started with PyCharm
  • Passing Run Time Arguments
  • Accessing OS Environment Variables
  • Getting Started with Spark
  • Create Function for Spark Session
  • [Code and Instructions] - Create Function for Spark Session
  • Setup Sample Data
  • Read Data from Files
  • [Code and Instructions] - Read data from files
  • Process Data using Spark APIs
  • [Code and Instructions] - Process data using Spark APIs
  • Write Data to Files
  • [Code and Instructions] - Write data to files
  • Validating Writing Data to Files
  • Productionizing the Code
  • [Code and Instructions] - Productionizing the code
  • Setting up Data for Production Validation
  • Running Application using YARN
  • Detailed Validation of the Application

Spark Application Execution Life Cycle and Spark UI

  • Deploying and Monitoring Spark Applications - Introduction
  • Overview of Types of Spark Cluster Managers
  • Setup EMR Cluster with Hadoop and Spark
  • Overall Capacity of Big Data Cluster with Hadoop and Spark
  • Understanding YARN Capacity of an Enterprise Cluster
  • Overview of Hadoop HDFS and YARN Setup on Multi-node Cluster
  • Overview of Spark Setup on top of Hadoop
  • Setup Data Set for Word Count application
  • [Instructions and Commands] Setup Data Set for Word Count Application
  • Develop Word Count Application
  • [Code] Develop Word Count Application
  • Review Deployment Process of Spark Application
  • Overview of Spark Submit Command
  • Switching between Python Versions to run Spark Apps or launch Pyspark CLI
  • Switching between Pyspark Versions to run Spark Apps or launch Pyspark CLI
  • Review Spark Configuration Properties at Run Time
  • Develop Shell Script to run Spark Application
  • [Code] Develop Shell Script to run Spark Application
  • Run Spark Application and review default executors
  • Overview of Spark History Server UI

Setup SSH Proxy to access Spark Application logs

  • Setup SSH Proxy to access Spark Application logs - Introduction
  • Overview of Private and Public IPs of servers in the cluster
  • Overview of SSH Proxy
  • Setup sshuttle on Mac or Linux
  • Proxy using sshuttle on Mac or Linux
  • Accessing Spark Application logs via SSH Proxy using sshuttle on Mac or Linux
  • Side effects of using SSH Proxy to access Spark Application Logs
  • Steps to setup SSH Proxy on Windows to access Spark Application Logs
  • Setup PuTTY and PuTTYgen on Windows
  • Quick Tour of PuTTY on Windows
  • Configure Passwordless Login using PuTTYGen Keys on Windows
  • Run Spark Application on Gateway Node using PuTTY
  • Configure Tunnel to Gateway Node using PuTTY on Windows for SSH Proxy
  • Setup Proxy on Windows and validate using Microsoft Edge browser
  • Understanding Proxying Network Traffic overcoming Windows Caveats
  • Update Hosts file for worker nodes using private IPs
  • Access Spark Application logs using SSH Proxy
  • Overview of performing tasks related to Spark Applications using Mac

Deployment Modes of Spark Applications

  • Deployment Modes of Spark Applications - Introduction
  • Default Execution Master Type for Spark Applications
  • Launch Pyspark using local mode
  • Running Spark Applications using Local Mode
  • Overview of Spark CLI Commands such as Pyspark
  • Accessing Local Files using Spark CLI or Spark Applications
  • Overview of submitting Spark application using client deployment mode
  • Overview of submitting Spark application using cluster deployment mode
  • Review the default logging while submitting Spark Applications
  • Changing Spark Application Log Level using custom log4j properties
  • Submit Spark Application using client mode with log level info
  • Submit Spark Application using cluster mode with log level info
  • Submit Spark Applications using SPARK_CONF_DIR with custom properties files
  • Submit Spark Applications using Properties File

Instructors

Mr Durga Viswanatha Raju Gadiraju

Technology Adviser
Freelancer
