  • Online Classes: 12 Sessions
  • CloudxLab™: 90 days - 24x7, Global
  • Project: 2 wks
Online Instructor-Led Classes
Full Access To CloudxLab™ - 90 Days, 24x7
Training By Industry Experts
Real-Time Project
Earn a Certificate In Big Data with Hadoop & Apache Spark
Scala Basics
Compatible with the following certificates: CCP Data Engineer, CCA Spark and Hadoop Developer, HDP Certified Developer, and HDP Certified Developer: Spark
Upcoming Classes
  • Live Sessions + Self Paced
  • 3 Hours / Session
  • Total 18 Sessions
  • Trainer: Sandeep Giri
  • Price includes taxes (only 3 seats left)

As data has grown, so have the rate at which it must be processed and the complexity of the demands made of it. Traditional tools can no longer store and process data at this scale - a single computer does not suffice because of its I/O, CPU, and RAM limitations. This is where a new generation of tools that run across multiple computers is required.

This is a very hands-on course that takes you from the basics to an advanced level in Big Data analysis and stream processing using Apache Spark. Apache Spark is among the fastest and most efficient of the distributed computing tools. We will start with the basics of Big Data, understand the architecture of Apache Spark, and solve problems on Spark using Python. We will cover Spark SQL and the basics of Machine Learning, solve a few classic problems using MLlib, and do stream processing using Spark Streaming.

At the end of this course, you will be able to handle all three dimensions of Big Data processing:

  1. Volume (Spark SQL)
  2. Velocity (Spark Streaming)
  3. Variety (MLlib)

Why learn Big Data with Hadoop & Apache Spark?

Almost every organisation has enormous amounts of data that must be analysed to grow the business, increase sales, or improve customer service. Data analysis is therefore in high demand across every industry.

Another way to measure the demand for Data Analysts is to look at the number of jobs being posted around the world for these technologies.


How are the classes conducted?

Our classes are conducted live online by our instructors via webinar or Hangouts. These are not pre-recorded classes. The instructor delivers the class using presentations, collaborative drawing tools, and screen shares. All attendees are usually muted during the class; however, they can ask questions in the webinar or Hangouts chat window. The instructor answers questions immediately after explaining a concept, and also asks questions during the sessions to ensure maximum student engagement.

Every class is recorded, complete with screen and audio, and uploaded to the Learning Management System, which our attendees can access for life.

At the end of each session, assignments are provided, which attendees submit in the LMS (Learning Management System). The assignments are continuously reviewed by our instructors and teaching assistants. If we conclude that an attendee needs extra help, we schedule additional one-on-one sessions with that attendee.

What makes Big Data with Hadoop & Apache Spark course unique?

  • Scala Basics
  • Interactive Classes: More Questions, Fewer Lectures
  • Simple explanations of complex topics by industry experts
  • Hands-on workshops and real-time projects
  • Quizzes & Assignments
  • Course certificate on completion
  • A real-time project involving Big Data with Hadoop & Apache Spark
  • Lifetime access to course content
  • CloudxLab™ - access to cloud infrastructure for learners who don't wish to install Apache Spark on their own computers

What are the prerequisites for the Big Data with Hadoop & Apache Spark course?

To get the most out of this course, you should have knowledge of the following:

  1. Basics of SQL. You should know the basics of SQL and databases.
  2. Basic programming know-how. We will provide video classes covering the basics of Python. What is expected of the attendee is the ability to create a directory and see what's inside a file from the command line, and an understanding of 'loops' in any programming language.
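If you want to check yourself against the second prerequisite, the following plain-Python sketch is roughly the level assumed: a loop, creating a directory, and looking inside a file. The directory and file names here are made up purely for illustration.

```python
import os

# Create a directory (the name "demo" is just an example).
os.makedirs("demo", exist_ok=True)

# Write a small file inside it.
with open(os.path.join("demo", "notes.txt"), "w") as f:
    f.write("hello\nworld\n")

# A simple loop: list what is inside the directory and print each entry.
for name in os.listdir("demo"):
    print(name)

# Read the file back line by line.
with open(os.path.join("demo", "notes.txt")) as f:
    for line in f:
        print(line.strip())
```

If each line of this snippet makes sense to you, you have enough programming background for the course.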

In addition, the attendee should have the following hardware infrastructure:

  • A good internet connection. A speed of 2 Mbps is good enough.
  • Access to a computer. Since it is an online course, you will have to install the webinar or Hangouts software on your computer.
  • Nice to have: a power backup for both your router and your computer.
  • Nice to have: good-quality headphones.

What kind of project / real-time experience will I get?

After all sessions are over, we ask each student for their project preference. We form teams of 3-4 members and assign a project to each team based on their interests. A project usually runs for three weeks. If a team has its own idea it wants to work on, we screen the idea and the team can work on it; otherwise we assign a project from the industry. Since it is not possible to provide real industry data, we provide anonymised data for projects. We continuously support and guide the teams during projects through regularly scheduled meetings, and we also provide individual assistance.

The projects assigned can also be based on public datasets; many datasets are freely available online.

A few examples of projects are as follows:

  • Understanding the trends and patterns in Bitcoin transaction graphs through qualitative analysis. Bitcoin is a virtual currency, and coins are mined based on transaction logs. Bitcoin transaction logs grow almost every millisecond, so processing them is a real challenge.
  • Understanding the correlation between the temperature of various cities and the stock market.
  • Processing Apache logs for ERRORs and preparing web analytics based on Apache weblogs:
    • Which services are slow?
    • Which services have a high number of users?
    • What is the failure rate of each service?
  • Preparing recommendations based on the Apache logs.
  • Using social media to compare a brand's marketing campaigns. The comparison is done using sentiment analysis.
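To make the Apache log project above more concrete, here is a minimal plain-Python sketch of the per-service failure-rate analysis it describes. The log lines are invented samples in Apache's Common Log Format; in the actual project, the same logic would run at scale on Spark rather than on a single machine.

```python
from collections import defaultdict

# A few invented sample lines in Apache Common Log Format.
log_lines = [
    '10.0.0.1 - - [01/Jan/2016:10:00:00 +0000] "GET /search HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2016:10:00:01 +0000] "GET /search HTTP/1.1" 500 512',
    '10.0.0.3 - - [01/Jan/2016:10:00:02 +0000] "GET /checkout HTTP/1.1" 200 2048',
    '10.0.0.4 - - [01/Jan/2016:10:00:03 +0000] "GET /checkout HTTP/1.1" 503 128',
    '10.0.0.5 - - [01/Jan/2016:10:00:04 +0000] "GET /checkout HTTP/1.1" 502 128',
]

def parse(line):
    """Extract (service, status) from a Common Log Format line."""
    parts = line.split()
    path = parts[6]         # e.g. "/search"
    status = int(parts[8])  # e.g. 200
    return path, status

# Count total and failed (HTTP 5xx) requests per service.
totals = defaultdict(int)
failures = defaultdict(int)
for line in log_lines:
    service, status = parse(line)
    totals[service] += 1
    if status >= 500:
        failures[service] += 1

failure_rate = {s: failures[s] / totals[s] for s in totals}
print(failure_rate)  # /search fails 1 of 2 requests; /checkout fails 2 of 3
```

In the course project, the same counting becomes a pair of map and reduce steps over the real log files, so the analysis scales to logs far too large for one machine.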

Feedback from our alumni

Savita Singh

Director Engineering, Target Technology Services
LinkedIn Profile
Ratings: (5.0/5.0)
Review: Joined the Hadoop class from KnowBigData 5 weeks back and it's been a motivating experience. Last I coded was 20 years back, but thanks to the instructor-led training I am executing Pig Latin and Hive commands to solve data problems, and I look forward to soon being able to complete small projects all by myself. Sandeep has been a great instructor - very, very patient, always ready to put in extra time to clarify doubts and work at your pace and schedule.


Senior systems analyst
Quora Profile
Ratings: (5.0/5.0)

Review: I took the 'Big Data and Hadoop' course, which covers all the concepts (HDFS, MapReduce, Pig, Hive, HBase, Spark, Flume, Sqoop, etc.). We are given a cluster to work on, so clustering concepts are also covered, and the types of serialisation available are covered in decent depth. The instructor, Sandeep, has a lot of knowledge and practical experience. The classes are very cool and you will never feel bored; they are highly interactive and learning is fun. Not even one question has gone unanswered. His teaching style is really good. I highly recommend this course if you plan to step into the world of Big Data. Wishing you all the best for your endeavours.

Alan Shaw

Quora Profile
Ratings: (5.0/5.0)

Review: This course was just what I was looking for, a way to deepen my knowledge on big data in general and Hadoop in particular. There were a lot of topics covered, and although I also tried to read along in Hadoop: The Definitive Guide, I found that the course often went even deeper into the details than the book. Also, the trainer was very patient and conscientious to answer everyone’s questions so that we were able to relate what we were learning to our immediate needs. Perhaps the best feature of the training was the access to an online lab environment where everything discussed could be tried hands-on, and not just for the assignments. Overall I recommend this course.

Ranjit Sahu

Associate Manager Technology at Thomson Reuters
LinkedIn Profile
Ratings: (5.0/5.0)
Review: Just completed the course 'Big Data & Hadoop' from KnowBigData. This is one of the best online courses I have ever taken. The instructor is amazing and his knowledge of the subject is excellent. The best thing about the course is that it is not one of those textbook courses full of theories; rather, it is based on practical problems and how we can apply Big Data concepts to solve them.

Dr. Makhan Virdi

Researcher, NASA - DAAC
LinkedIn Profile
Ratings: (5.0/5.0)

Review: Big Data with Apache Spark: This is not a typical (online) classroom course. It is not just a series of videos with a one-way flow of information. Instead, it is a highly interactive setting where the instructor shares insightful details when any question or doubt is raised during the lecture. Sandeep passionately teaches complicated concepts in easy-to-understand language, supported with good analogies and effective examples. The course is well structured, covering the concepts of Big Data in width and depth.

Parveen Kumar

VP - Engineering at CommonFloor
LinkedIn Profile
Ratings: (5.0/5.0)
Review: KnowBigData's 'Big Data & Hadoop' is one of the best courses I have attended online. Not only does the instructor know the concepts extremely well, he is also very passionate about explaining difficult concepts in a simple way. The quizzes after some sessions are also very useful for revising the learnings.

See more reviews on our Facebook page.

Big Data with Hadoop & Apache Spark Introduction Video

Big Data with Hadoop & Apache Spark Course Curriculum

  1. Understanding Big Data
  2. Problems with Traditional Large-Scale Systems
  3. Hadoop
  4. Data Storage and Ingest
  5. Data Processing
  6. Data Analysis and Exploration
  7. Other Ecosystem Tools
  1. Introduction to CloudxLab
  2. Distributed Processing on a Cluster
  3. Storage: HDFS Architecture
  4. Storage: Using HDFS
  5. Resource Management: YARN Architecture
  6. Resource Management: Working with YARN
  7. Introduction to the Hands-On Exercises
  8. Various Spark modes on YARN
  1. Understanding Basics of MapReduce
  2. Simple example of MapReduce
  3. Cleaning data with Pig / Pig Latin
  4. Advanced operations with Pig Latin
  1. Sqoop Overview
  2. Basic imports and exports of relational data with Sqoop
  3. Limiting Results
  4. Improving Sqoop’s Performance
  5. Importing data in real time using Flume
  1. Introduction to Impala and Hive
  2. Why Use Impala and Hive?
  3. Querying Data With Impala and Hive
  4. Comparing Hive and Impala to Traditional Databases
  5. Data Storage Overview
  6. Creating Databases and Tables
  7. Loading Data into Tables
  8. HCatalog
  9. Impala Metadata Caching
  1. What and why of NoSQL
  2. Understanding the architecture of HBase
  1. Selecting a File Format
  2. Hadoop Tool Support for File Formats
  3. Avro Schemas
  4. Using Avro with Impala, Hive, and Sqoop
  5. Avro Schema Evolution
  6. Compression
  7. Conclusion
  8. Partitioning Overview
  9. Partitioning in Impala and Hive
  1. What is Apache Flume?
  2. Basic Flume Architecture
  3. Flume Sources
  4. Flume Sinks
  5. Flume Channels
  6. Flume Configuration
  1. What is Apache Spark?
  2. Using the Spark Shell
  3. RDDs (Resilient Distributed Datasets)
  4. Introduction to Scala
  5. Functional Programming in Spark
  1. Creating RDDs
  2. Other General RDD Operations
  1. Spark Applications vs. Spark Shell
  2. Creating the SparkContext
  3. Building a Spark Application (Scala and Java)
  4. Running a Spark Application
  5. The Spark Application Web UI
  6. Configuring Spark Properties
  7. Logging
  8. Review: Spark on a Cluster
  9. RDD Partitions
  10. Partitioning of File-Based RDDs
  11. HDFS and Data Locality
  12. Executing Parallel Operations
  13. Stages and Tasks
  1. RDD Lineage
  2. RDD Persistence Overview
  3. Distributed Persistence
  1. Common Spark Use Cases
  2. Understanding Spark Streaming
  3. Iterative Algorithms in Spark
  4. Graph Processing and Analysis
  5. Machine Learning
  6. Example: k-means
  1. Understanding Oozie workflow
  2. Creating workflows using the UI and XML
  1. Spark SQL and the SQL Context
  2. Creating DataFrames
  3. Transforming and Querying DataFrames
  4. Saving DataFrames
  5. DataFrames and RDDs
  6. Comparing Spark SQL, Impala, and Hive-on-Spark
  1. Churning the logs of NASA Kennedy Space Center WWW server
  2. Preparing real-time analytics of orders in an e-commerce company
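As a taste of the 'Understanding Basics of MapReduce' sessions in the curriculum, the classic word-count computation that the course runs on Hadoop and Spark can be sketched in plain Python, with the map, shuffle, and reduce phases written out explicitly. The sample documents are invented for illustration.

```python
from collections import defaultdict

documents = ["big data with hadoop", "big data with spark"]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'with': 2, 'hadoop': 1, 'spark': 1}
```

On a real cluster, the map and reduce phases run in parallel on different machines and the shuffle moves the (word, 1) pairs between them, but the logic is exactly this.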

What Certificate do we provide?

Based on your performance in Quizzes, Assignments and Projects, we provide the certificate in the following forms:

1. Hard Copy

We send a hard copy of the certificate to your address.

2. Digitally Signed Copy

We provide a PDF of the certificate that is digitally signed.

3. Share Your Success

Share your course record with employers and educational institutions through a secure permanent URL.

4. LinkedIn Recommendation & Endorsements

We will provide a LinkedIn Recommendation based on your performance. Also, we will endorse you with tags such as Hadoop, Big Data.

5. Verifiable Certificate

We provide an online form to validate certificates. This helps recruiters verify any certificate issued by us.

About the Team

Sandeep Giri

Founder & Chief Instructor

Past: Founder @ tBits Global; D. E. Shaw

Education: Indian Institute of Technology, Roorkee

For the last 12 years, Sandeep has been building products and churning large amounts of data for various product firms. He has all-round experience in software development and big data analysis.

Apart from digging data and technologies, Sandeep enjoys conducting interviews and explaining difficult concepts in simple ways.


Big Data with Apache Spark - Frequently Asked Questions

Can I take this course without knowing Java?

Yes. Java is not required for understanding this course; we will cover Scala for programming in Spark. You are good to go if you meet the following criteria:

  1. Basics of SQL. You should know the basics of SQL and databases. If you know about filters in SQL, you can expect to understand the course.
  2. Basic programming know-how. If you understand 'loops' in any programming language, and you are able to create a directory and see what's inside a file from the command line, you will grasp the concepts of this course even if you have not touched programming in the last 10 years. In addition, we will provide video classes on the basics of Python.

Do you offer classroom training?

No. We stopped classroom training a while back, when we realized that students attending the online instructor-led classes were performing better on the assignments than students in our offline classrooms. Moreover, students ask more questions in online sessions than in classroom sessions.

Also, it is very difficult to find really good training locally in every city, so it is better to have really good online training than ordinary classroom sessions.

To check whether the online sessions would work for you, please attend one of our demo sessions. We are confident you will like the instructor-led online training.

How do I do the practicals?

There are two ways to do the practicals:
  1. Using our CloudxLab. To give our candidates a real experience of big data computing, we provide a cluster of computers with all the big data technologies running on them, since most big data technologies make sense only when run on multiple machines. You only need an SSH client (PuTTY on Windows) to connect to our cluster. Whether you are at home or in the office, and whether you are using a laptop or a tablet, you will be able to use Spark. See more details about CloudxLab here.
  2. Using Virtual Machines. The second, traditional way to experiment with Spark is to install a Virtual Machine, and we will assist you in setting it up. However, most of our students are so happy with CloudxLab that they rarely install a Virtual Machine.

What is the class schedule?

Our classes are held every weekend, on Saturdays and Sundays, either in the morning or in the evening. So there are two classes of 3 hours each: one on Saturday and one on Sunday.

In addition to the 6 hours of weekend classes, you will need to devote around 4-6 hours every week to completing assignments.

What if I miss a class?

If you are not able to attend a particular class, you can watch the recording of that class, or you can attend the same class in another running batch.

What if I cannot continue the course?

Sometimes, for various reasons, people find it difficult to continue a course. If that happens, you can continue in another session in the future, or you can request a refund. Here are the guidelines for requesting a refund.

Do I get lifetime access to the course material?

Yes, the course material is available to our students for life. You will have access to the content in the LMS forever.

Do you provide certification?

Yes, we provide our own certification. At the end of the course, you will work on a real-time project: you will receive a problem statement along with a dataset to work on in our CloudxLab. Once you successfully complete the project (reviewed by an expert), you will be awarded a certificate with performance-based grading. If your project is not approved in the first attempt, you can take extra assistance to understand the concepts better and reattempt the project free of cost.

What are the career prospects?

Big Data with Apache Spark is one of the hottest career options available today for software engineers. There are thousands of jobs for Data Analysts in the U.S. alone, and the demand for Data Analysts far exceeds the supply.

What software is available on CloudxLab?

CloudxLab has all the software required for the course, plus some additional components such as Git and R. If you need a particular piece of software installed on the cluster that is not already there, please let us know.
