PySpark - Python Spark Hadoop coding framework & testing

BY
Udemy

Learn how to use PySpark's capabilities to build big data analytics pipelines and analyze data at scale.

Mode: Online

Fees: ₹549 (discounted from ₹2,499)

Quick Facts

Medium of instruction: English
Mode of learning: Self-study
Mode of delivery: Video and text based

Course overview

Apache Spark is an open-source distributed computing framework with a collection of libraries for real-time, large-scale data processing, and PySpark is its Python API. Learning PySpark helps individuals build more configurable, scalable pipelines and analyses. The Hands-On PySpark for Big Data Analysis online certification was developed by Packt Publishing and is offered through Udemy, an education platform that provides programs to help participants advance their technical knowledge.

The Hands-On PySpark for Big Data Analysis online course is a short-term program comprising 3.5 hours of learning material and 26 downloadable resources. It is intended for participants who want to learn methods for analyzing large datasets and building big data platforms for machine learning models and business intelligence applications. The training covers data wrangling, data analysis, data cleaning, and structured data operations, and explains the functionality of Spark notebooks, Spark SQL, and resilient distributed datasets.

The highlights

  • Certificate of completion
  • Self-paced course
  • 3.5 hours of pre-recorded video content
  • 26 downloadable resources

Program offerings

  • Online course
  • Learning resources
  • 30-day money-back guarantee
  • Unlimited access
  • Accessible on mobile devices and TV

Course and certificate fees

Fees information: ₹549 (discounted from ₹2,499)

Certificate availability: Yes

Certificate providing authority: Udemy

What you will learn

  • Knowledge of Python
  • Knowledge of big data

After completing the Hands-On PySpark for Big Data Analysis certification course, participants will understand PySpark's functionality for big data analytics. They will explore data patterns with Spark SQL to improve business intelligence and increase productivity, learn the concepts behind data wrangling, data cleaning, and data analysis of big data, and acquire techniques for structured data operations. Participants will also learn strategies for working with Spark notebooks, MLlib, and resilient distributed datasets.

The syllabus

Introduction

  • Introduction
  • What is Big Data Spark?

Setting up Hadoop Spark development environment

  • Environment setup steps
  • Installing Python
  • Installing PyCharm
  • Creating a project in the main Python environment
  • Installing JDK
  • Installing Spark 3 & Hadoop
  • Running PySpark in the Console
  • PyCharm PySpark Hello DataFrame
  • PyCharm Hadoop Spark programming
  • Special instructions for Mac users
  • Quick tips - winutils permission
  • Python basics

Creating a PySpark coding framework

  • Structuring code with classes and methods
  • How Spark works?
  • Creating and reusing SparkSession
  • Spark DataFrame
  • Separating out Ingestion, Transformation and Persistence code
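
The separation of ingestion, transformation, and persistence that this module describes can be sketched as follows. Class and method names here are illustrative, not taken from the course; the pyspark import is deferred inside the helper so the structure (and the pure transformation logic) can be exercised without a Spark installation.

```python
# Sketch of an Ingestion / Transformation / Persistence split with a
# reusable SparkSession. Names are illustrative assumptions.

_spark = None  # module-level cache so every component reuses one SparkSession


def get_spark(app_name="pipeline"):
    """Create the SparkSession once and hand back the same instance afterwards."""
    global _spark
    if _spark is None:
        from pyspark.sql import SparkSession  # deferred: needs a Spark install
        _spark = SparkSession.builder.appName(app_name).getOrCreate()
    return _spark


class Ingest:
    def ingest(self):
        # e.g. get_spark().read.csv(...) or a Hive table read
        raise NotImplementedError


class Transform:
    def transform(self, df):
        # DataFrame-in, DataFrame-out logic keeps this stage unit-testable
        raise NotImplementedError


class Persist:
    def persist(self, df):
        # e.g. df.write.jdbc(...) toward PostgreSQL
        raise NotImplementedError


class Pipeline:
    """Wires the three stages together; each stage can be swapped or tested alone."""

    def __init__(self, ingest, transform, persist):
        self.ingest, self.transform, self.persist = ingest, transform, persist

    def run(self):
        df = self.ingest.ingest()
        df = self.transform.transform(df)
        self.persist.persist(df)
```

Because each stage is an object with one method, a unit test can replace any stage with a stub, which is exactly what makes this structure easier to test than one monolithic script.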

Logging and Error Handling

  • Python Logging
  • Managing log level through a configuration file
  • Having custom logger for each Python class
  • Error Handling with try except and raise
  • Logging using log4p and log4python packages
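
A minimal sketch of two of the patterns above, using only the standard-library logging module: a custom logger per Python class, and error handling with try/except and re-raise. The class and message names are made up for illustration; the course additionally reads the log level from a configuration file, which is hardcoded here for brevity.

```python
# Per-class loggers plus try/except/raise, using Python's stdlib logging.
import logging

# In the course the level comes from a config file; hardcoded here (assumption).
logging.basicConfig(level=logging.INFO,
                    format="%(name)s %(levelname)s %(message)s")


class Ingestion:
    def __init__(self):
        # one logger per class, named after the class itself
        self.logger = logging.getLogger(self.__class__.__name__)

    def ingest(self, path):
        self.logger.info("Ingesting %s", path)
        try:
            if not path:
                raise ValueError("empty input path")
            return path
        except ValueError:
            # logger.exception records the traceback at ERROR level
            self.logger.exception("Ingestion failed")
            raise  # re-raise so the caller decides how to recover
```

Naming each logger after its class makes log lines self-locating, and re-raising after logging keeps the error visible to the pipeline driver instead of silently swallowing it.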

Creating a Data Pipeline with Hadoop Spark and PostgreSQL

  • Ingesting data from Hive
  • Transforming ingested data
  • Installing PostgreSQL
  • Spark PostgreSQL interaction with Psycopg2 adapter
  • Spark PostgreSQL interaction with JDBC driver
  • Persisting transformed data in PostgreSQL
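
The JDBC persistence step above can be sketched as below. The URL, table name, and credentials are placeholders, not values from the course, and running it for real requires a PostgreSQL server plus the PostgreSQL JDBC driver jar on the Spark classpath.

```python
# Hedged sketch of persisting a transformed Spark DataFrame to PostgreSQL
# via the JDBC data source. All connection details are illustrative.

JDBC_URL = "jdbc:postgresql://localhost:5432/mydb"  # assumed local server


def persist_to_postgres(df, table, user, password):
    """Write a Spark DataFrame to a PostgreSQL table over JDBC.

    Assumes the driver jar is supplied at launch, e.g.
    spark-submit --jars <postgresql-driver>.jar pipeline.py
    """
    (df.write
       .format("jdbc")
       .option("url", JDBC_URL)
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .mode("overwrite")  # replace the table; use "append" to add rows
       .save())
```

Keeping the write behind one function mirrors the course's persistence layer: the rest of the pipeline never needs to know whether the sink is PostgreSQL, Hive, or files.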

Reading configuration from properties file

  • Organizing code further
  • Reading configuration from a property file
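
Reading settings from a properties-style file can be done with the standard-library configparser, as sketched here. The section and key names are invented for the example (configparser requires section headers, so an INI-style layout is used), and the file is written to a temporary path purely to keep the snippet self-contained.

```python
# Sketch of loading pipeline settings from a properties/INI file.
# Section and key names below are assumptions, not from the course.
import configparser
import os
import tempfile

PROPS = """
[LOGGING]
level = INFO

[DATABASE]
url = jdbc:postgresql://localhost:5432/mydb
"""


def load_config(path):
    """Parse the given properties file and return the ConfigParser object."""
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg


# Write a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as f:
    f.write(PROPS)
    path = f.name

cfg = load_config(path)
print(cfg["LOGGING"]["level"])  # INFO
os.unlink(path)
```

Moving values like log level and database URL out of the code means environments (dev, test, prod) can differ only in their property files, not in the source.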

Unit testing PySpark application

  • Python unittest framework
  • Unit testing PySpark transformation logic
  • Unit testing an error
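
The unittest pattern covered in this module, including asserting that an error is raised, can be sketched as follows. The course tests Spark DataFrame transformations; here the transform is a plain Python function operating on lists of dicts so the example runs without a Spark installation, and the function name and rule are invented.

```python
# Sketch of Python's unittest framework applied to transformation logic
# and to an expected error. filter_adults is an illustrative stand-in.
import unittest


def filter_adults(rows, min_age=18):
    """Illustrative transform: keep rows whose 'age' meets the threshold."""
    if min_age < 0:
        raise ValueError("min_age must be non-negative")
    return [r for r in rows if r["age"] >= min_age]


class TestTransform(unittest.TestCase):
    def test_filters_minors(self):
        rows = [{"age": 15}, {"age": 30}]
        self.assertEqual(filter_adults(rows), [{"age": 30}])

    def test_rejects_negative_threshold(self):
        # "unit testing an error": assert the exception is actually raised
        with self.assertRaises(ValueError):
            filter_adults([], min_age=-1)


# Run the suite with an explicit runner so the interpreter stays alive.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTransform)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With a real PySpark transform, the same test class would build a local SparkSession in setUp and compare collected DataFrame rows instead of plain lists; the assertion pattern is unchanged.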

spark-submit

  • PySpark spark-submit
  • Thank you

Appendix - PySpark on Colab and DataFrame deep dive

  • Running Python Spark 3 on Google Colab
  • Spark SQL and DataFrame deep dive on Colab

Appendix - Big Data Hadoop Hive for beginners

  • Big Data concepts
  • Hadoop concepts
  • Hadoop Distributed File System (HDFS)
  • Understanding Google Cloud (GCP) Dataproc
  • Signing up for a Google Cloud free trial
  • Storing a file in HDFS
  • MapReduce and YARN
  • Hive
  • Querying HDFS data using Hive
  • Deleting the Cluster
  • Analyzing a billion records with Hive
