The organization of this Spark series is not ideal: only now am I getting to the basic architecture of a Spark cluster. The reason is that the earlier posts were mostly about trying out the Spark API in the Spark Shell and forming a rough first impression of RDDs. That's fine: following the principle of going from coarse to fine, I'll take it step by step and reorganize all the Spark posts at the end so that they have a proper structure. For now I'm just recording the learning process.
Spark Cluster Overview
The diagram above is explained as follows:
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.
There are several useful things to note about this architecture:
- Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
- Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
- Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
The last point emphasizes that the machine running the Spark Driver and the Spark cluster should be in the same network environment, because the SparkContext instance in the Driver has to send tasks to the Executors on the various Worker Nodes and receive execution results back from them. In practice, in enterprise production environments, the machine hosting the Driver is usually well provisioned, especially in terms of CPU power.
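To make the architecture above concrete, here is a minimal sketch of a driver program (the object name, master URL, and HDFS paths are illustrative assumptions, not from the original post). Creating the SparkContext is what connects to the cluster manager, acquires executors, and ships the application code to them:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal driver program: the process running main() IS the driver.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountDriver")
      .setMaster("spark://master:7077") // standalone cluster manager (illustrative URL)
    val sc = new SparkContext(conf)     // connects to the cluster manager, acquires executors

    // Transformations are lazy; the saveAsTextFile action below is what
    // triggers the tasks that SparkContext sends to the executors.
    val counts = sc.textFile("hdfs://master:9000/input/words.txt") // illustrative path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://master:9000/output/wordcount")   // illustrative path
    sc.stop()
  }
}
```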
Spark Terminology
| Term | Meaning |
| --- | --- |
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
| Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these will be added at runtime. |
| Driver program | The process running the main() function of the application and creating the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). |
| Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. |
| Worker node | Any node that can run application code in the cluster. |
| Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. |
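The deploy-mode distinction in the table shows up in how the application is submitted. A sketch of the two `spark-submit` invocations (the jar name, class name, and master URL are illustrative assumptions):

```shell
# "client" mode: the driver runs in the spark-submit process itself,
# outside the cluster, on the submitting machine.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode client \
  --class com.example.WordCountDriver \
  wordcount.jar

# "cluster" mode: the cluster manager launches the driver on one of the
# cluster's own nodes, so the submitting machine can disconnect.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --class com.example.WordCountDriver \
  wordcount.jar
```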
- Application

An Application is a Spark user program that creates a SparkContext instance, and it includes the driver program. spark-shell is itself an Application, because it creates a SparkContext object, named sc, at startup.
- Job

A Job corresponds to a Spark action: each action, such as count or saveAsTextFile, spawns one Job instance, and that Job consists of a parallel computation made up of multiple tasks.
- Cluster Manager

An external service for managing cluster resources. Spark currently has three main cluster resource managers: Standalone, YARN, and Mesos. Spark's built-in Standalone mode meets the resource-management needs of most pure Spark environments; YARN or Mesos is generally only worth considering when multiple computation frameworks share the same cluster.
- Worker Node

A node in the cluster that can run application code, comparable to a slave node in Hadoop.
- Executor

A worker process launched for an application on a Worker Node; it is responsible for running the application's tasks and for keeping data in memory or on disk. Note that (in standalone mode's default configuration) each application has at most one Executor on a given Worker Node, and that Executor processes the application's tasks concurrently using multiple threads.
- Task

A unit of work that the Driver sends to an executor. Usually one task processes one split of the data, and a split is typically the size of one block.
- Stage

A Job is split into many tasks, and each group of tasks is called a Stage, much like the map and reduce tasks in MapReduce. Stage boundaries are determined as follows: a Stage usually begins with reading external data or fetching shuffled data, and a Stage usually ends when a shuffle occurs (for example, a reduceByKey operation) or when the whole Job finishes, e.g. when the data is written out to a storage system such as HDFS.
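A small spark-shell session can illustrate how actions spawn jobs and how a shuffle splits a job into stages (the HDFS paths are illustrative assumptions; the actual job and stage counts can be confirmed in the driver's logs or the web UI):

```scala
// In spark-shell, sc is already created.
val lines = sc.textFile("hdfs://master:9000/input/words.txt") // illustrative path

// reduceByKey requires a shuffle, so any job containing it is split
// into (at least) two stages at that boundary.
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

counts.count()          // action #1 -> spawns one job (map-side stage, reduce-side stage)
counts.saveAsTextFile(  // action #2 -> spawns a second job
  "hdfs://master:9000/output/wc") // illustrative path
```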
I'll fill in the ScalaDoc for Task, Executor, and Stage later.