What is big data?
The term big data refers to data sets that are too large and complex for traditional data processing tools to handle efficiently. The popular Vs of big data are described below.
Volume refers to the sheer amount of data. Data is growing rapidly because most people now own multiple devices, each of which generates a lot of data. With the emergence of the Internet of Things, far more data is generated each day than in the days when people used the internet mainly for search. Generating this much data creates a storage problem, which is why volume is a challenge. We now talk in terabytes (TB) and petabytes (PB), compared with megabytes (MB) and gigabytes (GB) ten years ago.
Velocity is the speed at which data is created. Digitization is the main cause of the increase in the rate of data creation. As a small example, the world population grows each year, so the number of devices people use grows too, and every person generates a lot of data every day. Think of Facebook, YouTube and Google. Besides these, sensors installed in offices, homes and factories keep track of the data points they were built to measure. So imagine how much data is generated each second.
Variety refers to the different forms data takes. Data can be categorized into three types: structured, semi-structured and unstructured. In the past most data was structured, but the pattern has changed as the share of unstructured data has increased. There is still debate about which type is more critical for business decisions; however, if you have a lot of unstructured data, why not benefit from it? Combined with structured data it gives more insight, which increases the chances of making the best business decisions. In one line, semi-structured data (JSON or XML, for example) can be converted to structured data with the help of a tool before processing.
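To make the semi-structured case concrete, here is a minimal Python sketch of turning JSON records into rows with a fixed set of columns. The event records, field names and column list are all made up for illustration; a real pipeline would use whatever schema and tooling fits its data.

```python
import json

# Hypothetical semi-structured records: the fields vary from record to record.
RAW_EVENTS = [
    '{"user": "alice", "action": "login", "device": {"type": "phone"}}',
    '{"user": "bob", "action": "search", "query": "hadoop"}',
]

# Fixed column set for the structured output (an assumption for this sketch).
COLUMNS = ["user", "action", "device_type", "query"]


def to_row(raw_event):
    """Flatten one JSON event into a dict with exactly the COLUMNS keys."""
    event = json.loads(raw_event)
    device = event.get("device") or {}
    return {
        "user": event.get("user"),
        "action": event.get("action"),
        "device_type": device.get("type"),  # nested field pulled up a level
        "query": event.get("query"),        # missing fields become None
    }


rows = [to_row(raw) for raw in RAW_EVENTS]
for row in rows:
    print([row[column] for column in COLUMNS])
```

Every row ends up with the same columns, and fields a record does not carry are filled with None, which is what lets the result be loaded into a table.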
Big data refers to all of the above problems, which can be summarized as storing and processing large, complex data sets generated at high speed.
Why use Hadoop?
Hadoop is named after a toy elephant that belonged to the son of its co-creator Doug Cutting, which is why an elephant appears beside the logo. Hadoop consists of HDFS for storage and MapReduce for processing, which runs either in its classic form (MR1) or on YARN (MR2). Although Hadoop is one solution to the big data problem, some people use Hadoop and big data as if they were the same thing.
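To give a feel for the MapReduce half of Hadoop, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using the classic word count example. It is plain Python and does not use the Hadoop API; the function names and input lines are invented for illustration, and on a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; on a cluster these lines would come from file blocks.
lines = ["big data needs big storage", "hadoop stores big data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```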
Hadoop can ingest different types of data from different sources through a range of data ingestion tools. The beauty of Hadoop is that it integrates with many other tools, and the Hadoop ecosystem offers tools for each layer of the big data stack. Hadoop is designed for commodity hardware, so it does not require specialized machines; in other words, it is relatively cheap to run. It also lets the user scale out, meaning nodes can be added to an existing cluster without much effort. Hadoop uses a distributed file system, so a large data set is divided into smaller blocks and stored across multiple nodes, which allows storage and processing to happen in parallel and makes them faster. It also replicates each block to prevent data loss if a node fails.
Hadoop is also a good platform for pre-processing raw data. Real-world data is noisy and inconsistent, so it has to be cleaned and put into a format that machine learning algorithms can consume. Hadoop is open source and backed by a strong community. And not to forget, leading companies such as Facebook and Twitter have adopted Hadoop, and Hadoop itself grew out of ideas Google published in its Google File System and MapReduce papers.
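The idea of dividing data into blocks and keeping replicas on several nodes can be sketched in a few lines of Python. This is a toy model, not HDFS code: the node names are hypothetical, the block size is shrunk to 4 bytes so the output is readable (HDFS defaults to 128 MB blocks in recent versions), and replicas are placed round-robin, whereas real HDFS uses a rack-aware placement policy. The replication factor of 3 does match the HDFS default.

```python
BLOCK_SIZE = 4     # bytes per block; tiny on purpose, for illustration only
REPLICATION = 3    # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical node names


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Return the blocks that still have a replica on a surviving node."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


blocks = split_into_blocks(b"hello big data")
placement = place_replicas(len(blocks))
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)

# Losing one node does not lose data, because every block lives on 3 nodes.
print(readable_blocks(placement, failed_node="node1"))
```

Because each block is stored on three different nodes, taking any single node out of the cluster still leaves every block readable, and adding a node simply gives the placement step one more place to put blocks.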