Big Data Blog

Quick Hands On with Apache Spark


In this tutorial, we will cover the basics of running computations on Apache Spark. We will first look at what an RDD is, and then at the operations available on RDDs: transformations and actions. The overall goal is to show how to get started with Apache Spark using Python.

Interview Questions on Apache Spark [Part 2]

This is the second set of Apache Spark Interview Questions. It includes coding problems on Apache Spark.

Q1: Say I have a huge list of numbers in an RDD (say myrdd), and I wrote the following code to compute the average:

def myAvg(x, y):
    return (x + y) / 2.0

avg = myrdd.reduce(myAvg)

What is wrong with it, and how would you correct it?
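
One way to see the problem in Q1: reduce expects an associative and commutative function, and a pairwise average is not associative, so the result depends on how the elements are grouped across partitions. The sketch below is a minimal illustration that uses functools.reduce on a plain Python list as a stand-in for myrdd.reduce. The names my_avg, sum_count and correct_average, the sample numbers, and the two-way split are assumptions made for illustration only.

```python
from functools import reduce


def my_avg(x, y):
    # Same logic as myAvg in the question: average of two values.
    return (x + y) / 2.0


def sum_count(a, b):
    # Combine two (sum, count) pairs. This operation is associative
    # and commutative, so the grouping of elements does not matter.
    return (a[0] + b[0], a[1] + b[1])


def correct_average(numbers):
    # Map every number to (value, 1), reduce to (total, count), then divide.
    total, count = reduce(sum_count, ((x, 1) for x in numbers))
    return total / count


numbers = [1, 2, 3, 4, 5, 6]  # hypothetical sample data

# Reducing everything left to right, as if there were a single partition.
single_partition = reduce(my_avg, numbers)

# Reducing two hypothetical partitions separately, then combining them.
two_partitions = my_avg(reduce(my_avg, numbers[:3]), reduce(my_avg, numbers[3:]))

print(single_partition)          # 5.03125
print(two_partitions)            # 3.75
print(correct_average(numbers))  # 3.5
```

On a real RDD, the same idea could be written as myrdd.map(lambda x: (x, 1)).reduce(sum_count) followed by a division, or more simply as myrdd.mean().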

6 Reasons Why Big Data Career Is A Smart Choice

Unsure whether to take up a career in Big Data? Planning to invest your time in getting certified and building expertise in related frameworks like Hadoop and Spark, but worried that you might be making a huge mistake? Spend a few minutes reading this blog and you will get six reasons why choosing a career in big data is a smart choice.

Machine Learning with Mahout - Tutorial

Mahout is essentially a scalable machine learning library built on top of Hadoop, and it is written in Java. It started with Hadoop, but it is now compatible with a lot of other similar systems and can even run independently of Hadoop. Literally, Mahout means keeper/driver of elephants.

What is Machine Learning?

Machine learning basically means feeding a lot of data to computers and letting them figure out the patterns. It is a branch of artificial intelligence that deals with computers learning automatically from data.

Apache Flume - Tutorial

There are many cases where data is flowing from one location to another and you have to handle the entire flow of that data. You can use Sqoop to handle data flowing to or from a relational database, but when it comes to fast-moving unstructured data, Apache Flume is needed.

One use case for Flume is moving logs into HDFS. The logs may come from your web servers: when you run a high-traffic website, you have more than one server running your web application and generating logs, and you want to move these logs to HDFS frequently. Or you may be running crawlers that download data from Twitter and other sources, and you want to move this data to HDFS or Hive as and when it arrives.

Introduction to Apache Flume in 30 minutes

What is Apache Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store.

Flume supports a large variety of sources, including:

  • tail (like Unix tail -f)
  • syslog
  • log4j, which allows Java applications to write logs to HDFS via Flume

Flume Nodes

Flume nodes can be arranged in arbitrary topologies. Typically there is a node running on each source machine, with tiers of aggregating nodes that the data flows through on its way to HDFS.