Saturday, December 6, 2014

Install Java8 in Ubuntu


$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
If it throws an error like:
dpkg: error processing /var/cache/apt/archives/oracle-java7-installer_7u25-0~webupd8~1_all.deb (--unpack):
 trying to overwrite '/usr/share/applications/JB-java.desktop', which is also in package oracle-java6-installer 6u37-0~eugenesan~precise1
Errors were encountered while processing:
 /var/cache/apt/archives/oracle-java8-installer_7u25-0~webupd8~1_all.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
If this happens, one common fix is to force dpkg to overwrite the conflicting file left by the older Java installer and then let apt repair the installation (the exact .deb filename is whatever apt downloaded for you); afterwards, verify the install:
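$ sudo dpkg -i --force-overwrite /var/cache/apt/archives/oracle-java8-installer_*.deb
$ sudo apt-get -f install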
$ java -version

Tuesday, November 18, 2014

Spring Context Hierarchy and Spring MVC Flow

What does ApplicationContext mean?

ApplicationContext:
It is the central interface that provides configuration for an application.



ApplicationContext extends multiple interfaces, including:

1).ListableBeanFactory

2). ResourceLoader: an interface for loading resources (classpath or filesystem resources).
The ResourceLoaderAware interface needs to be implemented by any object that wants to be notified of the ResourceLoader it runs in.

Resource res = applicationContext.getResource("http://urlOfResource");

For example, say the class:

public class SampleService implements ResourceLoaderAware {

    private ResourceLoader resourceLoader;

    // Callback: Spring injects its ResourceLoader here
    public void setResourceLoader(ResourceLoader resourceLoader) {
        this.resourceLoader = resourceLoader;
    }

    public Resource getResource(String location) {
        return resourceLoader.getResource(location);
    }
}


In the bean configuration file:

<beans>
    <bean id="sampleService" class="com.sample.SampleService"/>
</beans>

In the main class the ResourceLoader can be accessed as:

SampleService sample = (SampleService) applicationContext.getBean("sampleService");

Resource res = sample.getResource("http://urlOfResource");
/*
Do your work here
*/

ResourceLoaderAware is implemented by the object, and the ResourceLoader is dependency-injected into the bean using setter injection.
ResourcePatternResolver is a strategy interface for resolving a location pattern into Resource objects.

3). MessageSource

Strategy interface for resolving messages, with support for the parameterization and internationalization of such messages.

It has two implementations:
  1. ResourceBundleMessageSource built on top of java.util.ResourceBundle
  2. ReloadableResourceBundleMessageSource, able to reload message definitions without a JVM restart



Any object that implements the MessageSourceAware interface gets notified of the MessageSource. The MessageSource can also be passed in as a normal bean reference, since it is defined in the applicationContext as a bean named messageSource.

public class SampleService implements MessageSourceAware {

    private MessageSource messageSource;

    public void setMessageSource(MessageSource messageSource) {
        this.messageSource = messageSource;
    }

    public String getNameInEnglish() {
        return messageSource.getMessage("text", null, Locale.US);
    }
}

In the main class:

SampleService sample = (SampleService) applicationContext.getBean("sampleService");

4). ApplicationEventPublisher:
A super-interface of ApplicationContext that encapsulates the event publication functionality.

ApplicationEventPublisherAware is the interface to be implemented by any object that wants to be notified of the ApplicationEventPublisher, and ApplicationEvent is to be extended by all application events.

ApplicationEvent extends EventObject, which is the root class from which all event state objects are derived.
Aware is a marker interface.


public class MyCustomEventListener implements ApplicationListener {

    public void onApplicationEvent(ApplicationEvent event) {
        // react to the published event here
    }
}

public class MyCustomEvent extends ApplicationEvent {

    public MyCustomEvent(Object source) {
        super(source);
    }
}


public class MyCustomEventPublish implements ApplicationEventPublisherAware {

    private ApplicationEventPublisher publisher;

    public void setApplicationEventPublisher(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    public void publish() {
        this.publisher.publishEvent(new MyCustomEvent(this));
    }
}
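Tying the event pieces together, a minimal usage sketch could look like the following (it assumes both MyCustomEventListener and MyCustomEventPublish are registered as beans; the context file name and bean name are hypothetical):

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

ApplicationContext context = new ClassPathXmlApplicationContext("applicationContext.xml");

MyCustomEventPublish eventPublisher =
        (MyCustomEventPublish) context.getBean("myCustomEventPublish");

// publishEvent() delivers MyCustomEvent to every registered ApplicationListener,
// so MyCustomEventListener.onApplicationEvent() is invoked.
eventPublisher.publish();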
BeanFactory:
The BeanFactory, as its name implies, is the IoC container in Spring.
It provides a way to configure bean definitions and their dependencies,
and it is the container that actually instantiates, configures, and
manages the beans.
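A minimal BeanFactory bootstrap sketch (assuming a beans.xml on the classpath declaring the sampleService bean shown earlier) could look like:

import org.springframework.beans.factory.support.DefaultListableBeanFactory;
import org.springframework.beans.factory.xml.XmlBeanDefinitionReader;
import org.springframework.core.io.ClassPathResource;

// The reader parses the bean definitions; the factory instantiates and manages the beans.
DefaultListableBeanFactory factory = new DefaultListableBeanFactory();
new XmlBeanDefinitionReader(factory).loadBeanDefinitions(new ClassPathResource("beans.xml"));

// Beans are created lazily, on the first getBean() call.
SampleService sampleService = factory.getBean("sampleService", SampleService.class);

In contrast, an ApplicationContext eagerly pre-instantiates singleton beans and adds the MessageSource, event publication and resource-loading features described above.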


Spring MVC Flow:

The DispatcherServlet is the central dispatcher for HTTP request controllers/handlers: it receives each web request and dispatches it to the mapped handler for processing, providing convenient request mapping along the way.
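As a concrete (hypothetical) example of a handler, an annotated controller like the one below is picked up by the DispatcherServlet through its handler mappings, and the returned logical view name is then resolved by a ViewResolver:

import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class SampleController {

    // The DispatcherServlet dispatches GET /hello to this handler method.
    @RequestMapping("/hello")
    public String hello(Model model) {
        model.addAttribute("message", "Hello from Spring MVC");
        return "hello";   // logical view name, e.g. resolved to /WEB-INF/views/hello.jsp
    }
}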







Sunday, November 16, 2014

Apache Hadoop Basics

Big Data: Hadoop
What is Big Data?
Big data is a term used to describe a collection of data in any form (usually unstructured data, such as chat on Facebook) that exceeds the capacity of conventional database systems, so the data processing is done with parallel-processing architectures such as Apache Hadoop, Greenplum, etc.
Big Data includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.

What challenges faced by traditional data-processing applications are addressed by Hadoop?
Capture
Curation
Analysis
Transfer
Search
Visualization
Application data (e.g. Facebook activities such as chat), as opposed to classic business transaction data, is what generates this complex, unstructured data.
Examples of scenarios generating so-called Big Data, i.e. where Big Data implementations are used in daily life:
Facebook (around 500 TB of data), Stock Exchanges (trade data generated in terabytes on any exchange)
Web: search quality (analytics)
Telecommunication: preventing customer churn, i.e. analyzing user behaviour over huge data sets to keep users/clients from moving away
Banking: threat analysis and fraud detection
What made Hadoop evolve even though the ETL compute grid already existed?
ETL stores the (raw) data, and that storage can be scaled vertically, but retrieving the data from storage (reads) was still the bottleneck.
Hence, Hadoop = Storage + Compute Grid
Hadoop = Scalable Storage + Distributed Programming
Which data is defined as big data?
The choice of processing architecture is determined by the following Vs:
Velocity, Variety and Volume, where volume is the amount of data, variety is the type of data (image files, text files, logs, etc.) and velocity is the rate at which the data is generated and has to be processed.

Distributed File System (DFS): DFS can be defined as the distribution of data across multiple servers instead of having a single server with high vertical scalability.
DFS is implemented to overcome the I/O (read/write) bottleneck, not the storage capacity, since the same capacity can be achieved by vertically scaling a single server instead of having multiple servers.

DFS consolidation: although the data resides on separate physical machines, it appears under a common directory structure at the logical level.
Physical machines at physically different locations appear under the same file system.
Hadoop:
Hadoop is a framework, written in Java, used to capture and analyze large amounts of data in a short span of time on a cluster of commodity hardware, using a programming model (Map-Reduce) with streaming data access. Hadoop stores the files to be processed in a distributed file system, HDFS (Hadoop Distributed File System).
Hadoop core components:
HDFS (Hadoop Distributed File System), used for storage
MapReduce engine, used for processing, i.e. to retrieve and analyze the data stored above
The Hadoop architecture can be described as an HDFS cluster of nodes (machines):
Name Node (admin node), running the Job Tracker
Data Nodes (slave nodes), running the Task Trackers
Main features of HDFS:
Highly fault tolerant: achieved using data replication across multiple data nodes (commodity machines). Replication causes data redundancy, but it is still pursued because the data node hardware is commodity hardware.
High throughput: achieved by distributing data across multiple nodes and reading it in parallel, which cuts the time taken to read the data.
Suitable for applications with large data sets rather than many small data sets distributed across nodes.
Built on commodity hardware, with no need for high-end storage devices, hence not costly.
Streaming access to file system data, i.e. write once, read multiple times.
Scenarios where Hadoop can't be used:
low-latency data access
lots of small files: every file has metadata associated with it, so this is not easy to maintain
multiple writers and arbitrary file modification
There are daemons/services named Job Tracker and Task Tracker running, respectively, on the name node (a high-availability machine that maintains and manages the blocks present on the data nodes and stores the metadata in RAM) and on the data nodes (which provide the actual storage and process read/write requests from clients).
The Name Node is a single point of failure; there is a secondary name node, but it is NOT a backup node, since it does not take over as name node in case of failure; it simply copies the metadata from the name node and writes it to disk.
Data nodes are stored in a group called a rack; the nodes communicate through SSH, and the default minimum replication factor is 3.
Block size in HDFS: 64 MB, and it is configurable.
How does the Job Tracker work?
Input files are copied to the DFS:
The client copies the input files to the DFS
The client submits a job to the Job Tracker
The Job Tracker initializes the job queue
The Job Tracker reads the job files from the DFS, which define the blocks/splits to be processed
The input is split as per the configured parameters
Map tasks are created, one per split, along with the configured reduce tasks
Maps and reduces are sent to the data nodes
A map processes the data locally on its data node and generates intermediate key-value pairs, which are in turn used to generate the final output data by running the reducer over this intermediate data
The logic that turns the map output into the final result is the developer's logic (see the WordCount sketch after this list)
The input file is split into input splits placed on the data nodes, hence each split has a corresponding map program
HeartBeat: determines whether the data nodes are listening to the name node
The Job Tracker uses the job queue to assign tasks to the Task Tracker on each data node
job.xml and job.jar: used to ship the job configuration and the MapReduce code, respectively, to the task trackers
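To make the map/reduce developer logic concrete, here is a minimal WordCount sketch (the classic example, not something from the original notes) assuming the Hadoop 2.x-style Java API; the mapper emits intermediate (word, 1) pairs from each input split and the reducer sums them into the final output:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map runs locally on each input split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reduce receives all intermediate values for a key and produces the final count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input already copied to HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}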
How is Hadoop DFS different from a traditional DFS?
Hadoop DFS implements data locality: the data is not copied from the slave/data nodes to the master/name node for processing, but is instead processed locally at each data node and only then sent to the
master. Data can therefore be processed in small chunks at the data nodes, with less network overhead.
How does a data write work in Hadoop?
The NameNode determines which data nodes are to be written to. The flow is as follows:
The user sends a request to add data to the NameNode
The NameNode returns to the client the information about where to write the data (which nodes), determined using the metadata
The client (NOT the name node) writes to the data node; the write is NON-POSTED, i.e. an asynchronous write in which the client receives an acknowledgement packet ensuring the write was done properly. The write happens in a pipelined manner: it keeps proceeding along the data nodes on the assumption that the data is being written.
What happens if a write fails?
The NameNode designates another data node to be written to.
The write is pipelined, hence there is only one acknowledgement.
How does Hadoop read data?
Reads are done from the replica nodes in parallel (unlike the pipelined write), so that the data is retrieved properly and consistently even if one node fails to return it, for example due to a network failure.
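A small client-side read sketch using the HDFS FileSystem API (host, port and path are hypothetical); the client asks the NameNode for block locations and then streams the blocks from the DataNodes:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // FileSystem.get() talks to the NameNode named in the URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FSDataInputStream in = fs.open(new Path("/data/input.txt"));
        try {
            // The actual bytes are streamed from the DataNodes holding the blocks.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}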

Apache Spark Basics

A general-purpose cluster computing system
Supports high-level APIs in Java, Scala and Python

Supports high-level tools such as: Spark SQL for SQL and structured data processing,

MLlib for machine learning, Spark Streaming, and GraphX
Spark is distributed in builds for many HDFS versions

A Spark deployment is addressed by a master URL, which can be any of the following:

local: runs Spark locally with one worker thread
local[K]: runs Spark locally with K worker threads

local[*]: runs Spark locally with as many worker threads as there are logical cores on the machine

spark://HOST:PORT: connects to a Spark standalone cluster (a standalone cluster is started manually, by starting a master and workers by hand); default port: 7077

mesos://HOST:PORT: connects to a Mesos cluster (Apache Mesos is a cluster manager that simplifies the complexity of running applications on a shared pool of servers), possibly via ZooKeeper

yarn-client: connects to a YARN cluster in client mode
yarn-cluster: connects to a YARN cluster in cluster mode

Spark configuration:
Spark has three locations used to configure it:

Spark properties: set using a SparkConf object, or dynamically on the command line, e.g. by defining the --master flag (see the sketch after this list)

Environment variables: for per-machine settings, through the conf/spark-env.sh script on each node

Logging: using log4j properties
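A minimal sketch of the first option, setting Spark properties programmatically with a SparkConf (the app name and master are placeholders; the same values could instead be passed on the command line):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Properties set here are overridden only by flags passed on the command line.
SparkConf conf = new SparkConf()
        .setAppName("SampleApp")
        .setMaster("local[*]");   // any master URL from the list above

JavaSparkContext sc = new JavaSparkContext(conf);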

Spark Cluster:

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (called the driver program).

The SparkContext can connect to several cluster managers, such as Mesos or YARN, which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for the application.


Launching on a cluster:
Spark can run either:

1. by itself (standalone deploy mode), or

2. over several existing cluster managers.

Deployment options include: Amazon EC2, standalone deploy mode, Apache Mesos, and Hadoop YARN.
Spark programming basics:

Each Spark application has a driver program that runs the user's main function and executes parallel operations on the cluster.

The major abstractions in Spark are:
RDDs (resilient distributed datasets)

Shared Variables

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.

An RDD can also be persisted in memory so that it can be reused efficiently across parallel operations. A second abstraction in Spark is shared variables, which can be used in parallel operations: by default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.

Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel

Ways to create an RDD (both shown in the sketch below):
1. parallelizing an existing collection in your driver program

2. referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat
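Both ways, sketched with the Java API (reusing the JavaSparkContext sc from the configuration sketch above; the HDFS path is hypothetical):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// 1. Parallelize an existing collection in the driver program.
JavaRDD<String> fromCollection = sc.parallelize(Arrays.asList("spark", "hadoop", "cassandra"));

// 2. Reference a dataset in external storage (here HDFS) via a Hadoop InputFormat.
JavaRDD<String> fromFile = sc.textFile("hdfs:///data/input.txt");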

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access
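A small sketch of a lazy transformation followed by an action, with an explicit persist (again assuming the sc and Java 8 lambdas from the earlier sketches):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

// Transformation: nothing is computed yet, Spark only records the lineage.
JavaRDD<Integer> lengths = lines.map(s -> s.length());

// Keep the transformed RDD in memory for reuse across later actions.
lengths.persist(StorageLevel.MEMORY_ONLY());

// Action: triggers the actual computation and returns a value to the driver.
int totalCharacters = lengths.reduce((a, b) -> a + b);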

Working with Key-Value pairs:

A few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key (see the sketch below).
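For example, a distributed aggregation by key (a word count over a small in-memory collection, assuming the same sc):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "hadoop", "spark"));

// Turn each word into a (word, 1) pair, then shuffle and sum the counts per key.
JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);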

RDD Persistence:

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
Spark has different storage levels, to be chosen based on the trade-off between memory usage and CPU efficiency:

for RDDs, the default storage level (MEMORY_ONLY) is the best CPU option; MEMORY_ONLY_SER plus a fast serialization library (either Java serialization or Kryo
serialization) makes objects more space-efficient while keeping access reasonably fast.

Data removal:
Spark removes old data partitions in an LRU (least recently used) manner.

Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
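A short sketch of both kinds of shared variable (Spark 1.x Java API assumed, reusing sc): the broadcast value is cached on every node, while the accumulator is only ever added to by the tasks and read back on the driver:

import java.util.Arrays;
import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(5, 12, 20));

Broadcast<Integer> threshold = sc.broadcast(10);     // read-only, cached on all nodes
Accumulator<Integer> matches = sc.accumulator(0);    // tasks may only add to it

numbers.foreach(n -> {
    if (n > threshold.value()) {
        matches.add(1);
    }
});

// Only the driver can read the accumulated value.
System.out.println("values above threshold: " + matches.value());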

Cassandra Architecture in 5 Pages

Cassandra Architecture:
Cassandra holds its legacy from BigTable (by Google) for the wide-row/memtable model and from Dynamo (by Amazon) for the architecture.
Cassandra has a peer-to-peer architecture (i.e. no master-slave replication as in MongoDB), hence no single point of failure; each instance of Cassandra is called a node.
The data resides in nodes (as partitions) contained in racks (a logical set of nodes), which reside in data centers (a logical set of racks).
Default data center: DC1; default rack: RC1
A Cassandra cluster is defined as the full set of nodes that map to a single complete token ring, in which each node is primarily responsible for the replicas (the distribution of data across the cluster) that reside in a particular section/segment of the token ring.
Each node stores data in an atomic unit called a partition, which can be considered analogous to a row containing columns of data.
Request coordination: each read/write request to the cluster is handled by a particular node called the coordinator (it acts as a proxy between the client application and the nodes/replicas that own the data), chosen by the client library. The coordinator determines where in the cluster to copy/replicate the data (based on the RF, replication factor, and RS, replication strategy). The read/write operation is confirmed to the client through the coordinator as per the CL (consistency level).
Any node can be the coordinator, any node can talk to any other node, and by default the pattern followed is round-robin.
Every write on a node gets timestamped.
Consistency Level (CL) → defines how many replicas must respond for a read/write to be signalled as successful; it is configurable with the following available levels:
ANY
ONE (default): checks the replica node closest to the coordinator
ALL
QUORUM: a majority of the replica nodes, i.e. (replica nodes)/2 + 1
LOCAL_ONE
LOCAL_QUORUM
EACH_QUORUM
The CL is set per request (a driver-level sketch follows this list).
The CL semantics differ for writes and reads; in the case of a read, it is the most recent/current data, merged by the coordinator.
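For example, with the DataStax Java driver (an assumption; the notes don't name a client library, and the contact point, keyspace and table are hypothetical), the consistency level can be set per statement:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("demo");

// The coordinator waits for QUORUM ((replicas/2) + 1) replicas before answering this read.
Statement select = new SimpleStatement("SELECT * FROM users WHERE id = 1")
        .setConsistencyLevel(ConsistencyLevel.QUORUM);

ResultSet rows = session.execute(select);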
Tunable consistency:
Types of consistency:
a) Immediate consistency:
Returns the most recent data. ALL guarantees recent data, since the replies are simply merged and compared at the coordinator, but it has high latency precisely because of that compare-and-merge at the coordinator. Immediate consistency holds if (nodes_written + nodes_read) > RF (replication factor).
Consistency levels can be customized as:
High RC (read consistency) → ALL for reads
High WC (write consistency) → ALL for writes
Balanced consistency → both WC and RC as QUORUM
Clock synchronization across nodes is a must, as each column includes a timestamp and the most recent data is returned in one response to the client.
b) Eventual consistency: may return stale data; ONE has a high chance of stale data but low latency, as it gets the data from the node nearest to the coordinator.
ANY is the lowest while ALL is the highest consistency.
RF (Replication Factor) → how many copies (replicas) of the data are written.
Replication: storing copies of data on multiple nodes.
It is defined at the time the keyspace is created, along with the replica placement strategy; the total number of replicas across the cluster equals the replication factor.
All replicas are equivalent, i.e. each is a copy of each row.
The replication factor must not exceed the number of nodes, else writes are not allowed.
RS (Replication Strategy):
SimpleStrategy (for a single data-center setup), typically with RF >= 3:
places the first replica on the node determined by the partitioner; additional replicas are placed clockwise around the ring.
NetworkTopologyStrategy (for multi-data-center setups): prefers the replicas to be on nodes in different racks and different data centers, so as to survive network or power failures, and the replication factor is set per data center (see the keyspace sketch below).
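The RF and RS are declared when the keyspace is created; a sketch (again via the Java driver Session from above, keyspace name hypothetical):

// SimpleStrategy + RF 3 for a single data center; for multiple data centers use
// 'NetworkTopologyStrategy' with a replication factor per DC.
session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
        + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");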
Partitioning:
It is the process wherein a hashing function is applied to the partition key of a record being inserted to generate a token. In other words, the partition key is hashed into a token (a large integer, e.g. 128-bit for the RandomPartitioner) calculated by the partitioner (which runs on each Cassandra node) and used by the coordinator to figure out the node holding that token in its primary range of the token ring; the first replica is placed on that node.
In Cassandra, the total amount of data managed by the cluster is represented as a ring. The ring is divided into ranges equal to the number of nodes, with each node being responsible for one or more ranges of the data.
Partitioners are of 3 types:
Murmur3Partitioner (default)
RandomPartitioner:
tokens assign an equal portion of the data to each node
reads/writes are also evenly distributed, and load balancing is simple because each part of the hash range receives an equal number of rows on average
ByteOrderedPartitioner: keeps rows ordered by their key bytes, which allows ordered range scans but risks uneven load (hot spots)
Consistent hashing:
Each node owns the highest token of a particular segment of the token ring and is responsible for the partitions whose partition keys fall in the preceding range; this is called its primary range and holds the first replica, while the placement of further replicas is determined by the RF and RS.
Before a node can join the ring, it must be assigned a token. The token value determines the node's position in the ring and its range of data. Column-family data is partitioned across the nodes based on the row key.
To determine the node where the first replica of a row will live, the ring is walked clockwise until it locates the node with a token value greater than that of the row key. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive).
With the nodes sorted in token order, the last node is considered the predecessor of the first node; hence the ring representation.
The node with the lowest token also accepts row keys less than the lowest token and greater than the highest token.
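A toy illustration (plain Java, not Cassandra code) of walking the ring clockwise: the owner of a key is the first node whose token is greater than or equal to the key's token, wrapping around to the lowest token at the end of the ring:

import java.util.Map;
import java.util.TreeMap;

// token -> node; the token values here are arbitrary placeholders.
TreeMap<Long, String> ring = new TreeMap<>();
ring.put(-4611686018427387904L, "node1");
ring.put(0L, "node2");
ring.put(4611686018427387904L, "node3");

long keyToken = 42L;   // token computed from the partition key by the partitioner

// Walk clockwise: first node with token >= keyToken, else wrap to the lowest token.
Map.Entry<Long, String> owner = ring.ceilingEntry(keyToken);
String primaryReplica = (owner != null) ? owner.getValue() : ring.firstEntry().getValue();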
Node communication: based on gossip (each node gossips its token, which was generated at the start of the cluster) and the snitch system, including a dynamic snitch provision for nodes that are performing poorly. The snitch, defined in cassandra.yaml, defines how the nodes are grouped together within the overall network topology.
It maps IPs to racks and data centers.
The gossip protocol is used to make it feasible for nodes to communicate with each other and hence to get the latest state of each node.
As a practice, it is good to have the same seed nodes on all nodes in a data center.
Types of snitch:
SimpleSnitch: for a single data-center cluster
RackInferringSnitch: rack and data center are inferred from octets of the node's IP address
PropertyFileSnitch: determines the location of nodes by rack and data center from a properties file
EC2Snitch: for clusters on Amazon EC2
Dynamic snitch: on by default; monitors read latency and routes requests away from poorly performing nodes
Each node joining the cluster determines the cluster topology as follows:
It communicates with the seed nodes using the gossip protocol (version controlled, runs every second to exchange state information).
All of this is done through configuration in cassandra.yaml.
Prerequisites:
cluster_name
seed node: the IP address of the node that the new node will contact first after joining the cluster
listen_address: the IP address to communicate through
The schema in Cassandra is the KEYSPACE.
Overhead for data persisted to disk:
1. Column overhead: 15 bytes per column
2. Row overhead: 23 bytes per row
3. Primary key index of row keys
4. Replication overhead
Avoid using super columns, as they have performance issues associated with them; rather, use composite columns. Cassandra was designed to avoid using a load balancer, as high-level clients such as pycassa implement load balancing directly.
2 types of column family:
static column family
dynamic column family: for custom data types
The types of columns in a column family are:
Standard: one primary key
Composite: for managing wide rows
Expiring: gets deleted during compaction; has an optional expiration date called TTL (time to live)
Counter: e.g. to store the number of times a page is viewed
Super: not preferred compared with composite columns, since it reads the entire super column and its sub-columns, hence a performance issue
Repair options:
Read repair: happens at read time and is done automatically.
nodetool repair: repairs full nodes, run during downtime.
During read operations, the full data is requested from one replica node and a digest query is sent to all other replicas; the digest query returns a hash describing the current data state, and after the merge at the coordinator the replicas are updated automatically.
nodetool repair is used to repair a node, i.e. make all data on the node consistent with the most current replicas in the cluster.
Hinted handoff: a recovery mechanism for writes targeting offline nodes. It is a provision to convey updates to a node in the cluster so as to bring the data back into consistency even if the node was down (and to avoid schema disagreement). The hint is stored by the coordinator in the system.hints table if the target node is down or fails to acknowledge, and it is configurable in cassandra.yaml.
….................. READ and WRITE operations to come