Big Data – Hadoop


Hadoop is an open-source software framework for data-intensive distributed applications, developed within the Apache open-source community. It consists of a distributed file system, HDFS (the Hadoop Distributed File System), and an approach to distributed processing called MapReduce. It is written in Java and runs on Linux/Unix platforms.

The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. It also provides a distributed file system that stores data on the compute nodes themselves, delivering very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are handled automatically by the framework. This enables applications to work with thousands of independent computers and petabytes of data. The Apache Hadoop platform is commonly considered to consist of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS), plus a number of related projects including Apache Hive, Apache HBase, Apache Pig, Apache ZooKeeper, etc.

The real magic of Hadoop is its ability to move the processing or computing logic to where the data resides. Traditional systems take the opposite approach: they focus on a scaled-up single server, move the data to that central processing unit, and process it there.

That centralized model does not work at the volume, velocity, and variety of data that present-day industry wants to mine for business intelligence. Hence Hadoop, with its powerful fault-tolerant and reliable file system and highly optimized distributed computing model, is one of the leaders in the Big Data world.

At the heart of Hadoop are its storage system and its distributed computing model:

HDFS
The Hadoop Distributed File System is a program-level abstraction on top of the host OS file system. It is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster.
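To make block splitting and distribution concrete, here is a minimal Python sketch. It is a toy model, not the real HDFS API: the 4-byte block size and the round-robin placement are illustrative only (HDFS uses 64/128 MB blocks and rack-aware placement).

```python
# Toy sketch of HDFS-style block splitting and placement.
# BLOCK_SIZE is artificially tiny for the demo; HDFS defaults are 64/128 MB.
BLOCK_SIZE = 4

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def distribute(blocks, nodes):
    """Assign blocks to nodes round-robin (real HDFS placement is smarter)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

blocks = split_into_blocks(b"hello world!")
print(blocks)  # [b'hell', b'o wo', b'rld!']
print(distribute(blocks, ["node1", "node2"]))
```

The point to notice is that once a file is just a list of blocks, any node holding a block can process it locally, which is what enables Hadoop's "move the computation to the data" model described above.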

MapReduce
MapReduce is a programming model for processing large datasets using distributed computing on clusters of computers. MapReduce consists of two phases: dividing the data across a large number of separate processing units (called Map), and then combining the results produced by these individual processes into a unified result set called Reduce.
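The two phases can be sketched in a few lines of Python. This is an in-process simulation of the model (word count, the classic example), not Hadoop's actual Java API; the function names and the shuffle step shown here are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data", "big hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In a real cluster each call to `map_phase` and `reduce_phase` would run as an independent task on some node, which is why a failed task can simply be re-executed elsewhere.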

NameNode
This is also called the Head Node or Master Node of the cluster. It maintains the HDFS metadata: the directory tree of the file system and the mapping of blocks to DataNodes.

Secondary NameNode
This is an optional node you can add to your cluster. Despite its name, it is not an automatic failover node: if configured, it keeps a periodic snapshot of the NameNode metadata, which can be used to restore the NameNode if it goes down.

DataNode
These are the systems across the cluster that store the actual HDFS data blocks. The data blocks are replicated on multiple nodes to provide fault tolerance and high availability.
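A tiny Python sketch of replica placement makes the idea concrete. The node names and the simple "next N distinct nodes" strategy are assumptions for illustration; real HDFS uses a default replication factor of 3 with rack-aware placement.

```python
# Toy sketch of HDFS-style block replication (HDFS default RF is 3).
REPLICATION_FACTOR = 3

def place_replicas(block_id, nodes, rf=REPLICATION_FACTOR):
    """Pick rf distinct nodes for a block; real HDFS is rack-aware."""
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(0, nodes))  # ['dn1', 'dn2', 'dn3']
print(place_replicas(3, nodes))  # ['dn4', 'dn1', 'dn2']
```

Because every block lives on several DataNodes, losing any single node never loses data, and readers can be served from whichever replica is closest.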

JobTracker
This is a service running on the master node, which manages MapReduce jobs and distributes their individual tasks to the TaskTrackers.

TaskTracker
This is a service running on the DataNodes, which instantiates and monitors individual Map and Reduce tasks that are submitted.

Hive
Hive is a supporting project for the main Apache Hadoop project and is an abstraction on top of MapReduce, which allows users to query the data without developing MapReduce applications. It provides the user with a SQL-like query language called Hive Query Language (HQL) for fetching data from the Hive store.
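As a flavor of what HQL looks like, here is a small query. The table and column names are hypothetical, shown only to illustrate the SQL-like syntax; under the covers Hive compiles such a query into MapReduce jobs.

```sql
-- Hypothetical web_logs table; names are illustrative only.
SELECT page, COUNT(*) AS hits
FROM web_logs
WHERE log_date = '2013-01-01'
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
```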

Pig
Pig is an alternative abstraction on top of MapReduce, which uses a dataflow scripting language called Pig Latin. It is favored by programmers who already have scripting skills.

Flume
Flume is another open-source project on top of Hadoop, which provides a mechanism for ingesting data into HDFS as the data is generated.

Sqoop
Sqoop provides a way to import and export data to and from relational database tables (for example, SQL Server) and HDFS.

Oozie
Oozie allows the creation of workflows of MapReduce jobs. It will feel familiar to developers who have worked with Windows Workflow Foundation or Windows Communication Foundation based solutions.

HBase
HBase is the Hadoop database, a NoSQL database. It is another abstraction on top of Hadoop, which provides a near real-time query mechanism for HDFS data.

HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column orientation gives extreme flexibility in storing data, and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data and large indices, and must retain the flexibility to scale out quickly. HBase is now used for many workloads internally at Facebook and at many other companies using this big data technology.
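The wide-row, column-family data model can be sketched as a dictionary of dictionaries. This is a toy in-memory model, not the HBase client API; the class, table, and column names here are invented for illustration.

```python
# Toy sketch of HBase's data model: rows keyed by a row key, with values
# stored under (column family, qualifier) pairs.
class WideRowTable:
    def __init__(self):
        self.rows = {}  # row_key -> {(family, qualifier): value}

    def put(self, row_key, family, qualifier, value):
        """Write one cell. HBase guarantees atomicity at the row level."""
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def get(self, row_key, family, qualifier):
        """Read one cell, or None if it does not exist (rows are sparse)."""
        return self.rows.get(row_key, {}).get((family, qualifier))

table = WideRowTable()
table.put("user42", "profile", "name", "Alice")
table.put("user42", "activity", "last_login", "2013-01-01")
print(table.get("user42", "profile", "name"))  # Alice
```

Note how a row only stores the cells it actually has; that sparseness is what lets a single HBase table hold billions of indexed values without a fixed schema.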

Mahout
Mahout is a machine-learning library that contains algorithms for clustering and classification.

Good to know Apache Cassandra Terms:
CQL – Cassandra Query Language
RP – Random Partitioner
OPP – Order Preserving Partitioner
BOP – Byte Ordered Partitioner
RF – Replication Factor
CF – Column Family
JSON – JavaScript Object Notation
BSON – Binary JSON
TTL – Time To Live
HDFS – Hadoop Distributed File System
CFS – Cassandra File System
UUID – Universally Unique Identifier
DSE – Datastax Enterprise
AMI – Amazon Machine Image
OOM – Out Of Memory
SSTables – Sorted String Table
SEDA – Staged Event-Driven Architecture
CRUD – Create Read Update Delete

Big Data Architecture :
Big Data architecture is premised on a skill set for developing reliable, scalable, completely automated data pipelines. That skill set requires profound knowledge of every layer in the stack, beginning with cluster design and spanning everything from Hadoop tuning to setting up the top chain responsible for processing the data.

The main detail here is that data pipelines take raw data and convert it into insight (or value). Along the way, the Big Data engineer has to make decisions about what happens to the data, how it is stored in the cluster, how access is granted internally, what tools to use to process the data, and eventually the manner of providing access to the outside world. That outside access could be through BI or other analytic tools, while the processing itself is likely done with tools such as Impala or Apache Spark. The people who design and/or implement such architectures I refer to as Big Data engineers.

Facebook has started to use Apache Hadoop technologies to serve real-time workloads. Worth understanding are: why Facebook decided to use Hadoop, the workloads it runs on real-time Hadoop, the enhancements it made to Hadoop to support those workloads, and the processes and methodologies it adopted to deploy them successfully.

Here is a link to the complete paper (Hadoop_dwarehouse_2010) for those interested in the details of why Facebook chose Hadoop. For a complete reference, please visit the Apache website.

NoSQL Databases

acid-state
Aerospike
AlchemyDB
allegro-C
AllegroGraph
Amazon SimpleDB
ArangoDB
Azure Table Storage
BangDB
BaseX
Berkeley DB
Bigdata
BigTable
BrightstarDB
Btrieve
Cassandra
CDB (Constant Database)
Chordless
Cloudant
Cloudata
Cloudera
Clusterpoint Server
CodernityDB
Couchbase Server
CouchDB
Datomic
db4o
densodb
DEX
djondb
DynamoDB
Dynomite
EJDB
Elliptics
EMC Documentum xDB
ESENT 
Event Store
Execom IOG
eXist
eXtremeDB
EyeDB
Faircom C-Tree
FatDB
FileDB
FlockDB
FoundationDB
FramerD
Freebase
Gemfire
GenieDB
GigaSpaces
Globals
GraphBase
GT.M
Hadoop
HamsterDB
Hazelcast
HBase
Hibari
HPCC
HSS Database
HyperDex
HyperGraphDB
Hypertable
IBM Lotus/Domino
Voldemort
Yserial
ZODB
Infinite Graph
VertexDB
VMware vFabric GemFire
InfinityDB
InfoGrid
Intersystems Cache
ISIS Family
Jackrabbit
JasDB
jBASE
KAI
KirbyBase
LevelDB
LightCloud
LSM
Magma
MarkLogic Server
Maxtable
MemcacheDB
Meronymy
Mnesia
MongoDB
Morantex
NDatabase
NEO
Neo4J
nessDB
Ninja Database Pro
ObjectDB
Objectivity
OpenInsight
OpenLDAP
OpenLink Virtuoso
OpenQM
Oracle NoSQL Database
OrientDB
Perst
PicoLisp
Pincaster
Prevayler
Qizx
Queplix
RaptorDB
rasdaman
RavenDB
RDM Embedded
Reality
Recutils
Redis
RethinkDB
Riak
Scalaris
Scalien
SchemaFreeDB
SciDB
SDB
Sedna
siaqodb
SisoDB
Sones GraphDB
Starcounter
Sterling
Stratosphere
STSdb
Tarantool/Box
Terrastore
ThruDB
TIBCO Active Spaces
Tokutek
Tokyo Cabinet / Tyrant
Trinity
U2 (UniVerse, UniData)
VaultDB
VelocityDB
Versant
Infinispan

Let me know if you have any further questions; your comments will be a learning point.

Mehboob

Microsoft Certified Solutions Associate (MCSA)
