Complete PySpark Developer Course (Spark with Python)

BY
Udemy

Master the PySpark methodologies and procedures while learning how to build HDFS clusters.

Mode

Online

Fees

₹449 ₹1,499

Quick Facts

Medium of instruction: English
Mode of learning: Self study
Mode of delivery: Video and text based

Course overview

An Apache PySpark developer is responsible for creating Spark jobs for data aggregation and transformation, writing test scripts for Spark helper and transformation methods, writing Scaladoc-style documentation for all code, and designing data processing pipelines. The Complete PySpark Developer Course (Spark with Python) online certification was created by Sibaram Kumar, a data engineer and instructor, and is offered by Udemy.

The Complete PySpark Developer Course (Spark with Python) online classes are intended for candidates who want a comprehensive program covering the core functionality of PySpark on the way to becoming data engineers or data scientists. The training covers Spark RDD, HDFS, Spark SQL, SparkSession, DataFrames, DataTypes, and ETL, as well as strategies for effective big data processing.

The highlights

  • Certificate of completion
  • Self-paced course
  • 30.5 hours of pre-recorded video content
  • 8 articles 
  • 73 downloadable resources

Program offerings

  • Online course
  • Learning resources
  • 30-day money-back guarantee
  • Unlimited access
  • Accessible on mobile devices and TV

Course and certificate fees

Fees information: ₹449 ₹1,499

Certificate availability: Yes

Certificate providing authority: Udemy

What you will learn

  • Knowledge of Python
  • Knowledge of Apache Spark

After completing the Complete PySpark Developer Course (Spark with Python) certification, candidates will understand the core concepts of PySpark, along with the fundamentals of Python and Spark needed to become certified PySpark developers. In this PySpark course, candidates will explore the fundamentals of the Catalyst optimizer, the Volcano iterator model, the DAG scheduler, the task scheduler, the Tungsten execution engine, JVM processes, HDFS, and YARN, and will learn the methodologies involved with Spark RDDs, Spark clusters, and Spark SQL. Candidates will also learn about ETL using DataFrames, including the loading, transformation, and extraction APIs.

The syllabus

Introduction To Spark

  • Why Spark Was Developed?
  • What is Spark and its Features?
  • Spark Main Components

Resources

  • Download the PySpark Slides.
  • Download the data files

Single Node Cluster Installation (Spark 2.x/3.x, Hive, HDFS, PostgreSQL, Docker)

  • Introduction and Installation Flow Chart
  • Resources
  • Register Free at Google Cloud (GCP) and Launch an Ubuntu based Virtual Machine
  • Set Up Python and Java
  • Set up Secure Connect to Localhost
  • Set up Hadoop tar, HDFS, YARN and manage Cluster Services
  • Set Up Docker, PostgreSQL, Hive Part 1
  • Set up Docker, PostgreSQL, Hive, Metastore Part 2
  • Set up Spark 2.x and Spark 3.x Part 1
  • Set up Spark 2.x and Spark 3.x Part 2
  • Set up Web UI and ports for Cluster and Application History
  • Manage the Cluster - Start & Stop the Cluster

Spark Installation/Set Up Standalone (Windows)

  • Resources
  • Minimum Supported Versions/Prerequisites
  • Java Installation
  • Python Installation
  • Spark Installation
  • WinUtils Set up
  • PyCharm Installation
  • PyCharm Basics
  • PyCharm Run Time Arguments
  • PyCharm Integrate Python and PySpark
  • How to debug Python Applications using PyCharm

Spark Installation/Set Up Standalone (Unix)

  • Introduction to Spark Installation on Linux
  • Resources
  • Install Ubuntu Unix Terminal and Basic Unix Commands
  • Install Java
  • Install Spark (with Scala)
  • Install PySpark
  • Get data into Unix System from Github

HDFS Course

  • What is HDFS and Why to use HDFS
  • Resources
  • HDFS Components and Metadata
  • Data Blocks and Replication
  • Rack Awareness
  • HDFS Read Mechanism Architecture
  • Exercise - HDFS CLI Help Commands
  • Exercise - Bring Data from GitHub to Local to HDFS
  • Exercise - Listing and Sorting Files and Directories in HDFS
  • Exercise - Create or Remove Directories in HDFS
  • Exercise - Copy Data from HDFS to Local
  • Exercise - Copy data from Local to HDFS
  • Exercise - Preview Data in HDFS
  • Exercise - Knowing Statistics in HDFS
  • Exercise - Knowing Storage in HDFS File System
  • Exercise - Metadata in HDFS
  • Exercise - Managing File Permissions in HDFS
  • Exercise - Update Properties in HDFS
  • Note

Python Crash Course

  • Introduction and Installation
  • Main Features of Python
  • Python Basics
  • Python Variables
  • print(), dir() and help()
  • Python Operators
  • Modules
  • Python DataTypes - Numeric Types
  • Python DataTypes - String
  • Python DataTypes - List Part 1
  • Python DataTypes - List Part 2 and sorted() Function
  • Python DataTypes - Tuple
  • Python DataTypes - Set
  • Python DataTypes - Dictionary
  • Date and Time
  • Conditional Statements (if ..else)
  • For Loop
  • While Loop
  • User Defined Functions
  • Lambda Functions
  • Map Function
  • Filter Function
  • Reduce Function
  • File Handling
  • OOPs Basics Part 1
  • OOPs Basics Part 2
  • OOPs Basics - Exercise
  • OOPs Basics - Class Attributes
  • Python Special Variable - __name__
  • Work with Environment Variables
  • Exception Handling in Python
  • How to Traceback Exceptions in Python

SparkSession

  • Introduction to SparkSession
  • Spark Object and Spark Submit Part 1
  • Spark Object and Spark Submit Part 2
  • Spark Object and Spark Submit Part 3
  • Spark Object and Spark Submit Part 4
  • Spark Object and Spark Submit Part 5

RDD Fundamentals

  • What is RDD?
  • RDD Features Part 1
  • RDD Features Part 2
  • When to use RDDs?
  • RDD Properties and Problems

Create RDD

  • How to Create an RDD Part 1
  • How to Create an RDD Part 2

RDD Operations

  • Transformations - Low Level Part 1
  • Transformations - Low Level Part 2
  • Transformations - Join Types
  • Actions - Total Aggregations
  • Shuffle and Combiner
  • Transformations - Key Aggregations Part 1
  • Transformations - Key Aggregations Part 2
  • Transformations - Key Aggregations Part 3
  • Transformations - Key Aggregations Part 4
  • Transformations - Key Aggregations Part 5
  • Transformations - Sorting
  • Transformations - Ranking Part 1
  • Transformations - Ranking Part 2
  • Transformations - Ranking Part 3
  • Transformations - Set
  • Transformations - Sampling
  • What is Partition Part 1
  • What is Partition Part 2
  • Repartition
  • Repartition and Sort
  • Coalesce
  • Repartition Vs Coalesce
  • Extraction
  • Final Note

Spark Cluster Execution Architecture

  • Full Architecture
  • YARN As Spark Cluster Manager
  • JVMs across Cluster
  • Commonly Used Terms
  • Narrow and Wide Transformations
  • DAG Scheduler Part 1
  • DAG Scheduler Part 2
  • DAG Scheduler Part 3
  • Task Scheduler

RDD Persistence

  • RDD Persistence Part 1
  • RDD Persistence Part 2

Shared Variables

  • Broadcast Shared Variable
  • Accumulator Shared Variable

Spark SQL

  • Spark SQL Architecture Part 1
  • Catalyst and Tungsten Engine Overview
  • Abstractions of User Programs, Query Plans
  • Transform
  • Tungsten Execution Engine & Volcano Iterator Model
  • Benchmark - Spark 1.6 Vs Spark 2.0
  • Understanding Execution Plan
  • Put it All Together - Full Architecture

DataFrame Fundamentals

  • Introduction to Dataframe
  • DataFrame Features - Distributed
  • DataFrame Features - Lazy Evaluation
  • DataFrame Features - Immutability
  • DataFrame Features - Other Features
  • Organization

SparkSession Functionalities

  • Introduction to SparkSession
  • Spark Object and Spark Submit Part 1
  • Spark Object and Spark Submit Part 2
  • Spark Object and Spark Submit Part 3
  • Spark Object and Spark Submit Part 4
  • Spark Object and Spark Submit Part 5
  • Version and Range
  • Create Data Frame
  • sql
  • table
  • Spark Context
  • conf
  • read-csv
  • read-text
  • read-orc and parquet
  • read-avro
  • read-hive
  • read-jdbc
  • udf
  • New Session and Stop
  • Catalog

Spark DataTypes

  • DataTypes Part 1
  • DataTypes Part 2
  • DataTypes Part 3

DataFrame Rows

  • DataFrame Rows

DataFrame Columns

  • DataFrame Column Part 1
  • DataFrame Column Part 2
  • DataFrame Column Part 3

DataFrame ETL (Transformations)

  • Introduction to Transformations and Extractions
  • DataFrame APIs Introduction
  • DataFrame APIs Selection
  • DataFrame APIs Filter or Where
  • DataFrame APIs Sorting
  • DataFrame APIs Set
  • DataFrame APIs Join Part 1
  • DataFrame APIs Join Part 2
  • DataFrame APIs Aggregation
  • DataFrame APIs GroupBy
  • DataFrame APIs Window Part 1
  • DataFrame APIs Window Part 2
  • DataFrame APIs Sampling Functions
  • DataFrame APIs Other Aggregate Functions
  • DataFrame Built-in Functions Introduction
  • DataFrame Built-in Functions_New Column Functions
  • DataFrame Built-in Functions_String Functions Part 1
  • DataFrame Built-in Functions_String Functions Part 2
  • DataFrame Built-in Functions_String Functions Part 3
  • DataFrame Built-in Functions_RegExp Functions
  • DataFrame Built-in Functions_Date Functions
  • DataFrame Built-in Functions_Null Functions
  • DataFrame Built-in Functions_Collection Functions Part 1
  • DataFrame Built-in Functions_Collection Functions Part 2
  • DataFrame Built-in Functions_Collection Functions Part 3
  • DataFrame Built-in Functions_na Functions
  • DataFrame Built-in Functions_Math and Statistics Functions
  • DataFrame Built-in Functions_Explode and Flatten Function
  • DataFrame Built-in Functions_Formatting Functions
  • DataFrame Built-in Functions_Json Functions
  • DataFrame Built-in Functions_Json Functions Part 2

DataFrame ETL (Extractions)

  • Need for Repartition and Coalesce
  • How to Repartition a DataFrame
  • How to Coalesce a DataFrame
  • Repartition Vs Coalesce Method of a DataFrame
  • DataFrame Extraction Introduction
  • DataFrame Extractions - csv
  • DataFrame Extractions APIs - text
  • DataFrame Extractions - parquet
  • DataFrame Extractions - orc and json
  • DataFrame Extractions - hive
  • DataFrame Extractions - jdbc

Performance & Optimization

  • Join Strategies_01_Broadcast Join
  • Join Strategies_02_Shuffle Hash Join
  • Join Strategies_03_Shuffle Sort Merge Join
  • Join Strategies_04_Cartesian Product Join
  • Join Strategies_05_Broadcast Nested Loop Join
  • Join Strategies_06_Prioritize different join strategies
  • Driver Configurations
  • Executor Configurations Part 1
  • Executor Configurations Part 2
  • Configurations in spark-submit
  • Parallelism Configurations
  • Memory Management

Bonus Section

  • Bonus Section

Instructors

Mr Sibaram Kumar
Data Engineer
Freelancer
