Data Engineering using Kafka and Spark Structured Streaming

By Udemy

Learn about the components of data engineering and how to construct streaming pipelines using Kafka and Spark Structured Streaming.

Mode

Online

Fees

₹549 (discounted from ₹2,899)

Quick Facts

Particulars                Details
Medium of instruction      English
Mode of learning           Self-study
Mode of delivery           Video- and text-based

Course overview

Spark Structured Streaming is a stream processing engine built on Spark SQL that processes data incrementally and continuously updates the final result as new streaming data arrives. The Data Engineering using Kafka and Spark Structured Streaming online certification was created by Durga Viswanatha Raju Gadiraju, CEO at ITVersity and CTO at Analytiqs, Inc, and is offered by Udemy for candidates who want to master the concepts and strategies involved in data engineering using Apache Kafka and Spark Structured Streaming.

Data Engineering using Kafka and Spark Structured Streaming online classes incorporate more than 9.5 hours of detailed learning resources along with 3 articles that help candidates learn strategies for building streaming pipelines by integrating Kafka and Spark Structured Streaming. The online training walks candidates through various data engineering topics such as data processing, incremental data processing, and data ingestion, as well as the features of tools such as Hadoop, GCP, Hive, YARN, and HDFS.

The highlights

  • Certificate of completion
  • Self-paced course
  • 9.5 hours of pre-recorded video content
  • 3 articles 
  • Learning resources

Program offerings

  • Online course
  • Learning resources
  • 30-day money-back guarantee
  • Unlimited access
  • Accessible on mobile devices and TV

Course and certificate fees

Fees information
₹549 (discounted from ₹2,899)
Certificate availability

Yes

Certificate providing authority

Udemy

What you will learn

Knowledge of big data, Kafka, and Apache Spark

After completing the Data Engineering using Kafka and Spark Structured Streaming certification course, candidates will be introduced to the concepts involved in using Apache Kafka, Apache Spark, and Spark Structured Streaming for data engineering operations. In this data engineering certification, candidates will learn the functionalities of various data engineering tools such as YARN, HDFS, Hadoop, Hive, and GCP, along with file sources and file sinks, and will acquire an understanding of the techniques for working with a Kafka cluster and Kafka Connect. Candidates will also learn strategies associated with data ingestion, data processing, and incremental data processing, and will develop the skills to build streaming pipelines.

The syllabus

Introduction

  • Introduction to Data Engineering using Kafka and Spark Structured Streaming
  • Important Note for first-time Data Engineering Customers
  • Important Note for Data Engineering Essentials (Python and Spark) Customers
  • How to get 30 days of complimentary lab access?
  • How to access the material used for this course?

Getting Started with Kafka

  • Overview of Kafka
  • Managing Topics using Kafka CLI
  • Produce and Consume Messages using CLI
  • Validate Generation of Web Server Logs
  • Create a Web Server using nc
  • Produce retail logs to Kafka Topic
  • Consume retail logs from Kafka Topic
  • Clean up Kafka CLI Sessions to produce and consume messages
  • Define Kafka Connect to produce
  • Validate Kafka Connect to produce
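The CLI exercises above generate retail web-server log lines and pipe them into a Kafka topic with the console producer. A minimal Python sketch of such a log generator (the field layout, department names, and byte count are illustrative, not the course's exact gen_logs output):

```python
import random
from datetime import datetime, timezone

# Hypothetical department catalog; the course's log generator uses its own data.
DEPARTMENTS = ["apparel", "footwear", "fitness", "golf", "outdoors"]

def retail_log_line(ip: str, department: str, when: datetime) -> str:
    """Format one Apache-style access-log line for a department page view."""
    ts = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    return f'{ip} - - [{ts}] "GET /department/{department}/products HTTP/1.1" 200 1345'

line = retail_log_line("192.168.0.10", random.choice(DEPARTMENTS),
                       datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc))
print(line)
```

In the CLI sessions covered in this section, each emitted line would then be piped into a console producer for the retail topic, and read back with the matching console consumer.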

Data Ingestion using Kafka Connect

  • Overview of Kafka Connect
  • Define Kafka Connect to Produce Messages
  • Validate Kafka Connect to produce messages
  • Cleanup Kafka Connect to produce messages
  • Write Data to HDFS using Kafka Connect
  • Setup HDFS 3 Sink Connector Plugin
  • Overview of Kafka Consumer Groups
  • Configure HDFS 3 Sink Properties
  • Run and Validate HDFS 3 Sink
  • Cleanup Kafka Connect to consume messages
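The HDFS 3 Sink lectures configure a Confluent connector that drains a Kafka topic into files in HDFS. A hedged sketch of the key properties as a Python dict (property names follow the Confluent HDFS 3 Sink connector; the topic name, HDFS URL, and flush size are placeholders for a single-node lab cluster, not the course's exact configuration):

```python
# Illustrative HDFS 3 Sink connector properties; values are placeholders.
hdfs3_sink_config = {
    "name": "retail-hdfs-sink",
    "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
    "tasks.max": "1",
    "topics": "retail",                   # topic(s) to drain into HDFS
    "hdfs.url": "hdfs://localhost:9000",  # NameNode URL of the lab cluster
    "flush.size": "1000",                 # records written per output file
}

# Kafka Connect standalone mode expects these as a .properties file:
properties_text = "\n".join(f"{k}={v}" for k, v in hdfs3_sink_config.items())
print(properties_text)
```

Once the connector runs, validating the sink amounts to listing the target HDFS directory and confirming new files appear as messages accumulate.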

Overview of Spark Structured Streaming

  • Understanding Streaming Context
  • Validate Log Data for Streaming
  • Push log messages to Netcat Webserver
  • Overview of built-in Input Sources
  • Reading Web Server logs using Spark Structured Streaming
  • Overview of Output Modes
  • Using append as Output Mode
  • Using complete as Output Mode
  • Using update as Output Mode
  • Overview of Triggers in Spark Structured Streaming
  • Overview of built-in Output Sinks
  • Previewing the Streaming Data
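The three output modes differ in what reaches the sink after each micro-batch. A toy pure-Python simulation of a running word count makes the difference concrete (no Spark required; the batch contents are made up):

```python
from collections import Counter

# Two made-up micro-batches of words arriving on the stream.
batches = [["spark", "kafka"], ["kafka"]]

totals = Counter()
for i, batch in enumerate(batches, start=1):
    before = dict(totals)
    totals.update(batch)
    complete = dict(totals)  # complete mode: the whole result table, every trigger
    update = {w: c for w, c in totals.items() if before.get(w) != c}  # update: changed rows only
    print(f"batch {i}: complete={complete} update={update}")

# Append mode would be rejected for this query: aggregated rows keep changing,
# and append only emits rows that will never change again (e.g. plain projections,
# or aggregations bounded by an event-time watermark).
```

After the second batch, complete mode re-emits both counts while update mode emits only the kafka row, which is why update is usually the cheaper choice for unbounded aggregations.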

Kafka and Spark Structured Streaming Integration

  • Create Kafka Topic
  • Read Data from Kafka Topic
  • Preview data using console
  • Preview data using memory
  • Transform Data using Spark APIs
  • Write Data to HDFS using Spark
  • Validate Data in HDFS using Spark
  • Write Data to HDFS using Spark using Header
  • Cleanup Kafka Connect and Files in HDFS
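When Spark reads from a Kafka topic, each record's key and value arrive as raw bytes, so the first transformation is typically a cast to string followed by parsing out fields. A plain-Python sketch of that decode-and-split step (the log format is illustrative):

```python
# Simulated Kafka record values (bytes), as a Kafka source would deliver them.
raw_values = [
    b'192.168.0.10 - - [15/Jan/2024:12:00:00 +0000] '
    b'"GET /department/golf/products HTTP/1.1" 200 1345',
]

def to_row(value: bytes) -> dict:
    """Equivalent of CAST(value AS STRING) plus extracting a couple of fields."""
    line = value.decode("utf-8")
    ip = line.split(" ")[0]
    endpoint = line.split('"')[1].split(" ")[1]  # request path between the quotes
    return {"ip": ip, "endpoint": endpoint}

rows = [to_row(v) for v in raw_values]
print(rows)
```

In the actual course pipeline, the same cast-and-parse logic is expressed with Spark DataFrame APIs before the stream is written out to HDFS.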

Incremental Loads using Spark Structured Streaming

  • Overview of Spark Structured Streaming Triggers
  • Steps for Incremental Data Processing
  • Create Working Directory in HDFS
  • Logic to Upload GHArchive Files
  • Upload GHArchive Files to HDFS
  • Add new GHActivity JSON Files
  • Read JSON Data using Spark Structured Streaming
  • Write in Parquet File Format
  • Analyze GHArchive Data in Parquet files using Spark
  • Add New GHActivity JSON files
  • Load Data Incrementally to Target Table
  • Validate Incremental Load
  • Add New GHActivity JSON files
  • Using maxFilesPerTrigger and latestFirst
  • Validate Incremental Load
  • Add New GHActivity JSON files
  • Incremental Load using Archival Process
  • Validate Incremental Load

Setting up environment using AWS Cloud9

  • Getting Started with Cloud9
  • Cleaning Cloud9 Environment
  • Warming up with Cloud9 IDE
  • Overview of EC2 related to Cloud9
  • Opening ports for Cloud9 Instance
  • Associating Elastic IPs to Cloud9 Instance
  • Increase EBS Volume Size of Cloud9 Instance
  • Setup Jupyter Lab on Cloud9
  • [Commands] Setup Jupyter Lab on Cloud9

Setting up Environment - Overview of GCP and Provision Ubuntu VM

  • Signing up for GCP
  • Overview of GCP Web Console
  • Overview of GCP Pricing
  • Provision Ubuntu VM from GCP
  • Setup Docker
  • Validating Python
  • Setup Jupyter Lab
  • Setup Jupyter Lab locally on Mac

Setup Single Node Hadoop Cluster

  • Introduction to Single Node Hadoop Cluster
  • Material related to setting up the environment
  • Setup Prerequisites
  • Setup Passwordless Login
  • Download and Install Hadoop
  • Configure Hadoop HDFS
  • Start and Validate HDFS
  • Configure Hadoop YARN
  • Start and Validate YARN
  • Managing Single Node Hadoop

Setup Hive and Spark

  • Setup Data Sets for Practice
  • Download and Install Hive
  • Setup Database for Hive Metastore
  • Configure and Setup Hive Metastore
  • Launch and Validate Hive
  • Scripts to Manage Single Node Cluster
  • Download and Install Spark 2
  • Configure Spark 2
  • Validate Spark 2 using CLIs
  • Validate Jupyter Lab Setup
  • Integrate Spark 2 with Jupyter Lab
  • Download and Install Spark 3
  • Configure Spark 3
  • Validate Spark 3 using CLIs
  • Integrate Spark 3 with Jupyter Lab

Setup Single Node Kafka Cluster

  • Download and Install Kafka
  • Configure and Start Zookeeper
  • Configure and Start Kafka Broker
  • Scripts to manage single node cluster
  • Overview of Kafka CLI
  • Setup Retail log Generator
  • Redirecting logs to Kafka

Instructors

Mr Durga Viswanatha Raju Gadiraju
Technology Adviser
Freelancer
