Sunday, November 16, 2014

Apache Spark Basics

General-purpose cluster computing system.
Supports high-level APIs in Java, Scala and Python.

Supports higher-level tools such as Spark SQL for SQL and structured data processing,

MLlib for machine learning, Spark Streaming, and GraphX.
Spark can be built against many HDFS versions.

A Spark application is given a master URL, which can be any of the following (a short sketch of setting it in code follows this list):

local: runs Spark locally with one worker thread

local[K]: runs Spark locally with K worker threads

local[*]: runs Spark locally with as many worker threads as there are logical cores on the machine

spark://HOST:PORT: connects to a Spark standalone cluster (a cluster started manually, by launching a master and workers by hand); default port: 7077

mesos://HOST:PORT: connects to a Mesos cluster (Apache Mesos is a cluster manager that simplifies running applications on a shared pool of servers), optionally via ZooKeeper

yarn-client: connects to a YARN cluster in client mode

yarn-cluster: connects to a YARN cluster in cluster mode
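
A minimal sketch in Scala of passing one of these master URLs when creating the SparkContext; the application name and the choice of local[*] are illustrative placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Choose the master URL for this application; "local[*]" uses every logical core.
    // Any of the URLs above (e.g. "local", "local[4]", "spark://host:7077") could be used instead.
    val conf = new SparkConf()
      .setAppName("MasterUrlSketch")   // placeholder application name
      .setMaster("local[*]")
    val sc = new SparkContext(conf)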

Spark configuration:
Spark has three locations used to configure the system:

Spark properties: set using a SparkConf object, or dynamically on the command line, e.g. via the --master flag (see the sketch after this list)

Environment variables: for per-machine settings, via the conf/spark-env.sh script on each node.

Logging: using log4j properties.
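
A hedged sketch of the first of these locations, again in Scala; the application name, the master, and the spark.executor.memory value are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark properties set programmatically on a SparkConf (the same properties
    // can alternatively be supplied on the command line, e.g. via --master or --conf).
    val conf = new SparkConf()
      .setAppName("ConfigSketch")              // placeholder application name
      .setMaster("local[2]")                   // could instead come from --master
      .set("spark.executor.memory", "1g")      // an ordinary Spark property
    val sc = new SparkContext(conf)

    // Environment variables (e.g. SPARK_LOCAL_IP) go in conf/spark-env.sh on each node,
    // and logging is controlled through conf/log4j.properties.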

Spark Cluster:

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (called the driver program).

The SparkContext can connect to several types of cluster managers, such as Mesos or YARN, which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for the application.


Launching on Cluster:
Spark can run either:

1. by itself, or

2. over several existing cluster managers.

Launch options include Amazon EC2, the standalone deploy mode, Apache Mesos and Hadoop YARN.
Spark Programming Basics:

Each Spark application has a driver program that runs the user's main function and executes parallel operations on the cluster.

The major abstractions in Spark are:
RDD

Shared Variables

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.

An RDD can also be persisted in memory so it can be reused efficiently across parallel operations.

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.

Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel

Ways to create an RDD (illustrated after this list):
1. parallelizing an existing collection in your driver program

2. referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat
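
A small sketch of both creation routes, assuming sc is a SparkContext created as in the earlier sketches and that the file path is only a placeholder:

    // 1. Parallelize an existing Scala collection in the driver program.
    val numbers = sc.parallelize(1 to 1000)

    // 2. Reference a dataset in external storage; the path below is a placeholder
    //    (local files, HDFS, S3, or any Hadoop InputFormat source would work).
    val lines = sc.textFile("hdfs:///path/to/data.txt")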

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program
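
For example (assuming an existing SparkContext sc and a placeholder file path), nothing is computed until the final action runs:

    val lines   = sc.textFile("data.txt")   // placeholder path; nothing is read yet
    val lengths = lines.map(_.length)       // transformation: only recorded, not computed
    val total   = lengths.reduce(_ + _)     // action: triggers reading the file and computing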

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access
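
A minimal caching sketch under the same assumptions; the second action reuses the cached partitions instead of re-reading the file:

    val lengths = sc.textFile("data.txt").map(_.length)   // placeholder path
    lengths.cache()                   // shorthand for persist() at the default MEMORY_ONLY level
    val total = lengths.reduce(_ + _) // first action: computes and caches the partitions
    val count = lengths.count()       // later action: served from the cache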

Working with Key-Value pairs:

A few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.
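
A word-count-style sketch of a pair-RDD shuffle (again assuming sc and a placeholder path); reduceByKey aggregates all values that share a key:

    import org.apache.spark.SparkContext._      // pair-RDD implicits (needed on older Spark versions)

    val lines  = sc.textFile("data.txt")        // placeholder path
    val pairs  = lines.map(line => (line, 1))   // an RDD of key-value pairs
    val counts = pairs.reduceByKey(_ + _)       // shuffle operation: aggregates values per key
    counts.collect().foreach(println)           // bring the (line, count) pairs back to the driver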

RDD Persistence:

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
Spark has different storage levels; the choice among them is a trade-off between memory usage and CPU efficiency.

For an RDD, the default storage level (MEMORY_ONLY) is the most CPU-efficient option; if the data does not fit, MEMORY_ONLY_SER plus a fast serialization library (either Java serialization or Kryo serialization) makes the objects more space-efficient while keeping access reasonably fast.
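
A sketch of requesting serialized in-memory storage together with Kryo; the application name and file path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("StorageLevelSketch")   // placeholder application name
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val data = sc.textFile("data.txt")           // placeholder path
    data.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized, in-memory only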

Data Removal:
Spark drops old data partitions in an LRU (least recently used) manner.
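
Cached data can also be dropped explicitly rather than waiting for LRU eviction; a tiny sketch under the same assumptions:

    val cached = sc.textFile("data.txt").cache()   // placeholder path
    cached.count()        // materializes and caches the partitions
    cached.unpersist()    // removes the cached blocks immediately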

Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
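
A short sketch of both, using the Spark 1.x-style API and an existing SparkContext sc; the map contents and keys are made up for illustration:

    import org.apache.spark.SparkContext._   // default AccumulatorParam instances on older Spark versions

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only value cached on every node
    val misses = sc.accumulator(0)                       // tasks may only add to it

    sc.parallelize(Seq("a", "b", "z")).foreach { key =>
      if (!lookup.value.contains(key)) misses += 1       // read the broadcast, add to the accumulator
    }
    println(misses.value)                                // only the driver reads the total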
