Sunday, November 16, 2014

Apache Hadoop Basics

Big Data: Hadoop
What is Big Data?
Big Data is a term used to describe a collection of data in any form (usually unstructured data, such as Facebook chat) that exceeds the capacity of conventional database systems, so the data is processed with a parallel processing architecture such as Apache Hadoop or Greenplum.
Big Data includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.

Which challenges faced by traditional data processing applications are addressed by Hadoop?
Capture
Curation
Analysis
Transfer
Search
Visualization
Application data (such as Facebook activity like chat), rather than business transaction data, is the main source of this complex and unstructured data.
Example scenarios that generate so-called Big Data
Where Big Data implementations appear in daily life:
Social networks: Facebook (500 TB of data)
Stock exchanges: trade data generated in terabytes on any exchange
Web: search quality (analytics)
Telecommunication: preventing customer churn, i.e. keeping a user/client from moving away, by analysing user behaviour over the huge data sets available for analysis
Banking: threat analysis and fraud detection
What made Hadoop evolve even though the ETL compute grid already existed?
An ETL grid stores the (raw) data and that storage can be scaled vertically, but retrieving the data from storage (reads) remained the bottleneck.
Hence, Hadoop = Storage + Compute Grid
Hadoop = Scalable Storage + Distributed Programming
Which data is defined as Big Data?
The choice of processing architecture is determined by the following Vs:
Volume, Variety and Velocity, where volume defines the amount of data, variety the type of data (image files, text files, logs etc.) and velocity the speed at which the data is generated and must be processed.

Distributed File System: a DFS can be defined as the distribution of data across multiple servers instead of having a single server with high vertical scalability.
A DFS is implemented to overcome the I/O (read/write) bottleneck, not the storage capacity limit, since the same capacity could be achieved by vertically scaling a single server instead of using multiple servers.

DFS Consolidation: although the data resides on separate physical machines, it appears under a common directory structure at the logical level.
Physical machines at physically different locations appear under the same file system (see the sketch below).
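As a small illustration of this logical view, here is a minimal sketch that lists a directory through the HDFS Java API. The NameNode address hdfs://namenode:8020 and the path /user/demo are assumptions made for the example, not values from this post; the files listed may have their blocks spread over many DataNodes, yet they all appear under one directory tree.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // One logical namespace, served by the NameNode, even though the
            // file blocks live on many different physical DataNodes.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // assumed address
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {         // assumed path
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }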
Hadoop:
Hadoop is a framework, written in Java, used to capture and analyse large amounts of data in a short span of time on a cluster of commodity hardware, using the MapReduce programming model with streaming data access. Hadoop stores the files to be processed in its distributed file system, HDFS (Hadoop Distributed File System).
Hadoop Core Components:
HDFS (Hadoop Distributed File System): used for storage
MapReduce Engine: used for processing, i.e. to retrieve and analyse the data stored above
The Hadoop architecture can be described as an HDFS cluster of nodes (machines):
Name Node (admin node), running the Job Tracker
Data Nodes (slave nodes), running the Task Trackers
Main features of HDFS:
Highly fault tolerant: achieved through data replication across multiple data nodes (commodity machines). Although replication causes data redundancy, it is still pursued because the data nodes run on commodity hardware (see the replication sketch after this list).
High throughput: achieved by distributing the data across multiple nodes and reading it in parallel, which reduces the time taken to read the data.
Suitable for applications with large data sets rather than many small data sets distributed across nodes.
Built on commodity hardware, with no need for high-end storage devices, and hence not costly.
Streaming access to file system data, i.e. write once, read many times.
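To make the replication point concrete, here is a minimal sketch of how the replication factor can be controlled from a Java client. The property name dfs.replication is the standard HDFS setting, but the path /user/demo/data.txt is only an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Replication factor used for new files written by this client
            conf.setInt("dfs.replication", 3);
            FileSystem fs = FileSystem.get(conf);

            // Replication can also be changed per file after it has been written;
            // the NameNode then schedules extra (or removes surplus) block copies on the DataNodes.
            fs.setReplication(new Path("/user/demo/data.txt"), (short) 4); // assumed path
        }
    }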
Scenarios where Hadoop cannot be used:
Low-latency data access
Lots of small files: every file has metadata associated with it, so many small files are not easy to maintain
Multiple writers and arbitrary file modification
Daemons/services named Job Tracker and Task Tracker run, respectively, on the name node/high-availability machine (which maintains and manages the blocks present on the data nodes and stores the metadata in RAM) and on the data nodes (which provide the actual storage and process the read/write requests from clients).
The Name Node is a single point of failure. There is a Secondary Name Node, but it is NOT a backup node: it does not take over as name node in case of failure; it simply copies the metadata from the name node and writes it to disk.
Data nodes are grouped into racks, the nodes communicate over SSH, and the default replication factor is 3.
Block size in HDFS: 64 MB, and it is configurable (see the sketch below).
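As a sketch of that configurability (assuming the standard Java client; the path below is illustrative): the block size can be set for new files and read back for existing ones. Older releases use the property name dfs.block.size, while newer ones use dfs.blocksize and default to 128 MB.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // 64 MB expressed in bytes, applied to files created by this client
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);

            // The block size actually used by an existing file can be read back:
            long blockSize = fs.getFileStatus(new Path("/user/demo/data.txt")).getBlockSize(); // assumed path
            System.out.println("Block size: " + blockSize + " bytes");
        }
    }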
How does the Job Tracker work?
The client copies the input files to the DFS
The client submits a job to the Job Tracker
The Job Tracker initialises the job queue
The Job Tracker reads the job files from the DFS, which describe the input data to be processed
The input is split according to the configured parameters
Map tasks are created, one per input split (the number of reduce tasks is set by configuration)
The map and reduce tasks are sent to the data nodes
Each map processes its data locally on the data node and generates intermediate key-value pairs; the reducer then turns this intermediate data into the final output
The logic that produces the output of the map (and of the reduce) is written by the developer (see the word-count sketch after this list)
The input file is split into input splits placed on the data nodes, so each split has a corresponding map task
HeartBeat: determines whether the data nodes are alive and listening to the name node
The Job Tracker uses the job queue to assign tasks to the Task Tracker on each data node
job.xml and job.jar: used to configure and package the job (job.xml carries the job configuration, job.jar carries the developer's map and reduce classes)
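To show what the developer-supplied map and reduce logic looks like, here is the standard word-count example written against the Hadoop MapReduce Java API (a minimal sketch, not code from this post; input and output paths are passed as arguments). The map emits intermediate (word, 1) pairs from its local split, and the reduce sums them into the final output.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: runs locally on each input split and emits intermediate (word, 1) pairs
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // intermediate key-value pair
                }
            }
        }

        // Reduce: receives all intermediate values for a key and produces the final output
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input already copied to the DFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in the DFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // submits the job for scheduling
        }
    }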
How is Hadoop's DFS different from a traditional DFS?
Hadoop DFS implements data locality: the data is not copied from the slave/data nodes to the master/name node for processing; it is processed locally at each data node and only the results are sent to the
master. The data can therefore be processed in small chunks at the data nodes, which means less network overhead.
How does a data write work in Hadoop?
The NameNode determines which data nodes are to be written to. The flow is as follows:
The user sends a request to the NameNode to add data
The NameNode returns to the client the information about where to write the data (which nodes), determined from its metadata
The client (NOT the name node) writes to the data nodes. The write is a non-posted, asynchronous write: the client receives an acknowledgement packet confirming that the write completed properly. The write happens in a pipelined manner, i.e. it flows on from one data node to the next on the assumption that the data is being written (a client-side sketch follows below).
What happens if a write fails?
The NameNode assigns a replacement data node and the write is retried.
Because the write is pipelined, there is only one acknowledgement.
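A minimal sketch of the client side of such a write, using the HDFS Java API; the path /user/demo/sample.txt is an assumption for the example. The client gets the target DataNodes from the NameNode and then streams the bytes to them itself; closing the stream waits for the pipeline acknowledgement.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt"); // assumed path
            // create() asks the NameNode where to write; the data then flows
            // from the client through the DataNode pipeline, not through the NameNode.
            try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
                out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            } // close() returns once the pipeline has acknowledged the write
        }
    }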
How does Hadoop read data?
Blocks are read from the data nodes that hold them, and different blocks can be read in parallel, much as the write is pipelined across nodes. If a node fails to serve a block, for example because of a network failure, the client retrieves it from another replica, so the data is still read correctly and consistently.
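And the matching client-side read, again a minimal sketch with an assumed path. open() asks the NameNode for the block locations; the stream then reads each block from a DataNode holding it and falls back to another replica if that node cannot serve the block.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt"); // assumed path
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }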
