Basics of Hadoop
Today, the amount of data produced by mankind is growing enormously every year. All the data produced from the beginning of time until 2003 amounted to roughly 5 billion gigabytes; piled up in the form of disks, it could fill an entire football field. The same amount was created every two days in 2011, and every ten minutes in 2013, and this rate is still growing. Although all this information is meaningful and can be useful when processed, much of it is neglected.
This information, viewed as a collection of large datasets that cannot be processed using traditional computing techniques, is known as Big Data.
Big Data involves huge volume, high velocity, and a wide variety of data, which generally falls into three types (illustrated in the sketch after this list):
- Structured data: relational data.
- Semi-structured data: XML data.
- Unstructured data: Word documents, PDFs, plain text, media logs.
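To make the distinction concrete, here is a small hypothetical illustration of the same order record in each of the three forms; the record and its values are invented for this example:

```java
// Hypothetical illustration: the same order record in each of the three forms.
// None of these values come from the post; they are invented for the example.
public class DataShapes {
    public static void main(String[] args) {
        // Structured: fits a fixed relational schema, e.g.
        //   orders(order_id INT, customer VARCHAR, amount DECIMAL)
        int orderId = 42;
        String customer = "Asha";
        double amount = 99.50;

        // Semi-structured: XML tags describe the fields, but the shape may vary.
        String xml = "<order id=\"42\"><customer>Asha</customer>"
                   + "<amount>99.50</amount></order>";

        // Unstructured: free text; the same facts must be extracted, not read.
        String note = "Asha placed order 42 today and paid 99.50.";

        System.out.println(orderId + ", " + customer + ", " + amount);
        System.out.println(xml);
        System.out.println(note);
    }
}
```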
Handling and analyzing such big data requires certain technologies, which in turn enable more concrete decision-making, resulting in greater operational efficiency, cost reductions, and reduced risk for the business.
BIG DATA TECHNOLOGIES
Looking at the technologies that handle big data, we can examine the following two classes of technology:
Operational Big Data
This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
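As a rough sketch of what an operational workload looks like, the following uses the MongoDB Java driver (mongodb-driver-sync) to capture an event and read it back interactively. The connection URI, database, collection, and field names are assumptions made for this example:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

// A minimal sketch of an operational workload: capture an event as it
// happens, then read it back interactively. The URI, database, collection,
// and field names are illustrative assumptions.
public class OperationalSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                client.getDatabase("shop").getCollection("events");

            // Capture: store the raw event with whatever fields it carries.
            events.insertOne(new Document("user", "u42")
                    .append("action", "checkout")
                    .append("amount", 99.50));

            // Interactive read: fetch this user's events right back.
            for (Document d : events.find(Filters.eq("user", "u42"))) {
                System.out.println(d.toJson());
            }
        }
    }
}
```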
NoSQL Big Data systems are designed
to take advantage of new cloud computing architectures that have emerged over
the past decade to allow massive computations to be run inexpensively and
efficiently. This makes operational big data workloads much easier to manage,
cheaper, and faster to implement.
Some NoSQL systems can provide
insights into patterns and trends based on real-time data with minimal coding
and without the need for data scientists and additional infrastructure.
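For instance, a short aggregation pipeline, again a sketch with assumed collection and field names, can surface a simple trend directly from live data:

```java
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.sort;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.model.Sorts;
import java.util.Arrays;

// Sketch: count events per action and rank them, directly on live data.
// The collection and field names ("events", "action") are illustrative.
public class TrendSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            client.getDatabase("shop").getCollection("events")
                  .aggregate(Arrays.asList(
                      group("$action", sum("count", 1)),   // tally per action
                      sort(Sorts.descending("count"))))    // most frequent first
                  .forEach(d -> System.out.println(d.toJson()));
        }
    }
}
```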
Analytical Big Data
This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
MapReduce provides a new method of analyzing data that is complementary to the capabilities of SQL, and systems built on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
Even with these technologies in hand, processing such data presents major challenges:
- Capturing data
- Curation
- Storage
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
To deal with the above challenges, organizations normally rely on enterprise servers.
BIG DATA SOLUTIONS
Traditional Enterprise Approach
In this approach, an enterprise has a single computer to store and process big data. For storage, programmers use the database vendor of their choice, such as Oracle or IBM. The user interacts with the application, which in turn handles data storage and analysis.
This approach works fine for applications that process modest volumes of data, volumes that standard database servers can accommodate, or up to the limit of the processor handling the data. But when it comes to huge amounts of scalable data, funneling everything through a single database server becomes a bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns them to many computers, and collects the results from them; when integrated, these results form the final dataset.
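The idea can be sketched on a single machine in plain Java. The class below is an illustration, not Google's implementation, but it shows the divide, process, and collect steps:

```java
import java.util.*;

// A minimal sketch of the MapReduce idea (no cluster): each "map" call
// processes one shard of the input independently, and "reduce" merges the
// partial results into the final dataset. On a real cluster the map calls
// would run on many machines in parallel.
public class MapReduceSketch {

    // Map phase: count words in one shard of text.
    static Map<String, Integer> map(String shard) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : shard.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Reduce phase: merge the partial counts from every shard.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> shards = List.of("big data is big", "data keeps growing");
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String shard : shards) partials.add(map(shard)); // one per machine
        System.out.println(reduce(partials)); // e.g. {big=2, data=2, is=1, ...}
    }
}
```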
Hadoop
Using the solution provided by Google, Doug Cutting and his team developed an open-source project called Hadoop. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
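The canonical first Hadoop program is WordCount, the MapReduce analogue of a SQL GROUP BY count. The job below follows the standard Hadoop MapReduce API; the input and output paths are supplied as command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its split of the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same two classes run unchanged whether the cluster has one node or thousands; Hadoop takes care of splitting the input among mappers and shuffling the intermediate (word, count) pairs to the reducers.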