Introduction:
- As we already know that the Hadoop is an Apache open-source framework written in java environment, so it is the open source and being widely considered.
- It allows distributed processing of large datasets across clusters of computers using simple programming models.
- The Hadoop architecture is basically used to designed in such a manner so that it can scale up from single server to thousands of machines, each offering local computation and storage.
- Now if we need to understand what exactly the Hadoop is, then we need to have first understand the issues related to the Big Data and the traditional processing system as it is being considered as a major component and area of Hadoop.
As the technology is going to be Advanced day by day ahead, so we need to understand the importance of Hadoop, and its application strategy using which it can be able to provide the solution to the problems associated with Big Data. Here I am also going to discuss about the CERN case study to highlight the benefits of using Hadoop.
Problems with Traditional Approach:
- In traditional approach, the main issue was handling the heterogeneity of data i.e., structured, semi-structured and unstructured.
- In this approach, the user interacts with the application, which in turn handles the part of data storage and analysis.
- It is mainly suffering with the problem for storing the colossal amount of data.
- It also has the problem to store heterogeneous data.
- In traditional processing the accessing and processing speed is also having the major problem specially when the data size is very large.
Limitation/ problems lies with Traditional processing:
- This approach works fine with those applications that process less voluminous data that can be accommodated by standard database servers.
- It is limited up to the limit of the processor that is processing the data.
- But when it comes to dealing with huge amounts of scalable data, it is a hectic task to process such data through a single database bottleneck.
Now if we are going to consider the Big Data then it is being considered as a best solution over the traditional approach. Few major parts are discussed as below.
- The Big Data is emerging technology which is being used by most of the organization.
- It is basically the collection of large datasets that cannot be processed using traditional computing techniques.
- It is not a single technique or a tool, rather it has become a complete subject, which involves various tools, techniques and frameworks.
- Now the Organizations are examining large data sets to uncover all hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
Evolution of Hadoop:
- The evolution of Hadoop was get started from 1999 when it was first identified by Apache Software Foundation who launch it as a non-profit platform.
- Later on, In 2003, Doug Cutting launches project Nutch to handle billions of searches and indexing millions of web pages.
- In Oct 2003 – Google releases its first research papers with GFS (Google File System) where it describe about the Hadoop and its relevant functionality feature.
- In Dec 2004, Google releases papers with MapReduce which is being considered as a major component of Hadoop system.
- In 2005, Nutch used GFS and MapReduce to perform operations in the HDFS system which is a major component for Hadoop environment.
- In 2006, Yahoo created Hadoop based on GFS and MapReduce with Doug Cutting and team. You would be surprised if I would tell you that, in 2007 Yahoo started using Hadoop on a 1000 node cluster.
Later in Jan 2008, Yahoo released Hadoop as an open source project to Apache Software Foundation. In Jul 2008, Apache tested a 4000 node cluster with Hadoop successfully. In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours to handle billions of searches and indexing millions of web pages. Moving ahead in Dec 2011, Apache Hadoop released version 1.0. Later in Aug 2013, Version 2.0.6 was available.
What is Hadoop?
Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that, you can process it parallelly. There are basically two components in Hadoop:
The first one is HDFS for storage (Hadoop distributed File System), that allows you to store data of various formats across a cluster. The second one is YARN, for resource management in Hadoop. It allows parallel processing over the data, i.e. stored across HDFS.
As we have already discussed earlier that the Hadoop is an open-source software framework like Java script and is mainly used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache v2 license.
Hadoop was developed, based on the paper written by Google on the MapReduce system and it applies concepts of functional programming. Hadoop is written in the Java programming language and ranks among the highest-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella.
Hadoop-as-a-Solution:
Let’s understand how Hadoop provides a solution to the Big Data problems that we have discussed so far.
The major challenges associated with big data are as follows −
- Capturing data
- Curation
- Storage
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
How to store huge amount of data:
- The Hadoop is mainly have a HDFS system which is provides a distributed way to store Big Data.
- Here the data is stored in blocks in DataNodes and you specify the size of each block.
- For example if you have 512 MB of data and you have configured HDFS such that it will create 128 MB of data blocks.
- The HDFS will divide data into 4 blocks as 512/128=4 and stores it across different DataNodes.
- While storing these data blocks into DataNodes, data blocks are replicated on different DataNodes to provide fault tolerance.
How to store a variety of data:
- The HDFS in Hadoop can capable enough to store all kinds of data whether it is structured, semi-structured or unstructured.
- HDFS, there is no pre-dumping schema validation.
- It also follows write once and read many models.
- Due to this, you can just write any kind of data once and you can read it multiple times for finding insights.
How to process the data faster:
In this case the Hadoop is allowed to move the processing unit to data instead of moving data to the processing unit.
So, what does it mean by moving the computation unit to data?
It means that instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where data is stored so as that each node can process a part of data in parallel. Finally, all of the intermediary output produced by each node is merged together and the final response is sent back to the client.
Where is Hadoop used:
As we have already discussed earlier that the Hadoop is framework hence it is mostly used for:
- Designing the Search engine as it has the ability to process a huge data. For designing the search in Yahoo, Amazon, Zvents it is mostly used.
- It is also used for designing the Log processing environment like Facebook, Yahoo does have.
- For making the Data Warehouse based application layer like the Facebook, AOL have.
- For Video and Image Analysis based application. As it requires the high processing.
When not to use Hadoop:
Following are some cases where it is being not recommended by the expert to use the Hadoop:
- Low Latency data access : Quick access to small parts of data
- Multiple data modification : Hadoop is a better fit only if we are primarily concerned about reading data and not modifying data.
- Lots of small files : Hadoop is suitable for scenarios, where we have few but large files.
- After knowing the best suitable use-cases, let us move on and look at a case study where Hadoop has done wonders.
Hadoop-CERN Case Study
- Now with respect to the discuss scenario the CERN is basically used when we need the data for scaling up in terms of amount and complexity.
- One of the important task is to serve these scalable requirements.
- Hadoop is also used for cluster setup. By using Hadoop, they limited their cost in hardware and complexity in maintenance.
They integrated Oracle & Hadoop and they got advantages of integrating. Oracle, optimized their Online Transactional System & Hadoop provided them scalable distributed data processing platform. They designed a hybrid system, and first they moved data from Oracle to Hadoop.
The main Hadoop components they are using at the CERN-IT Hadoop service:
You can learn about each of these tool in Hadoop ecosystem blog.
Techniques for integrating Oracle and Hadoop:
Export data from Oracle to HDFS
Sqoop was good enough for most cases and they also adopted some of the other possible options like custom ingestion, Oracle DataPump, streaming etc.
Query Hadoop from Oracle
They accessed tables in Hadoop engines using DB links in Oracle. That also build hybrid views by transparently combining data in Oracle and Hadoop.
Use Hadoop frameworks to process data in Oracle DBs.
Scope @ NareshIT:
- At Naresh IT you will get a good Experienced faculty who will guide you, mentor you and nurture you to achieve your dream goal.
- Here you will get a good hand on practice in terms of practical industry-oriented environment which will definitely help you a lot to shape your future.
- During the designing process of application, we will let you know about the other aspect of the application too.
- Our Expert trainer will let you know about every in’s and out’s about the problem scenario.
Achieving your dream goal is our motto. Our excellent team is working restlessly for our students to click their target. So, believe on us and our advice, and we assured you about your sure success.