1. yarn-client
In this mode, the Spark driver runs on the client machine and then asks YARN to launch executors that run the tasks. The driver and YARN are thus separate: the driver program acts as a client of the YARN cluster, a classic client/server setup.
2. yarn-cluster
In this mode, the Spark driver is started first inside the YARN cluster as the ApplicationMaster; the ApplicationMaster then requests resources from the ResourceManager to launch executors that run the tasks. In other words, with this deployment the driver program runs on the YARN cluster itself.
To deploy a Spark application on YARN, submit it with Spark's bin/spark-submit. Unlike Standalone or Mesos, there is no need to supply a URL as the value of the master parameter, because Spark can read the cluster information from the Hadoop configuration files; simply passing yarn-cluster or yarn-client as the master is enough. Since Spark has to locate those Hadoop (or YARN-specific) configuration files, the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR must be set.
So, on top of the earlier configuration, add one more entry to conf/spark-env.sh, and add the same line to /etc/profile:
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
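One way to apply this, a minimal sketch assuming Spark lives under $SPARK_HOME and $HADOOP_HOME is already exported:

```shell
# Append YARN_CONF_DIR to Spark's per-installation env file...
echo 'export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> $SPARK_HOME/conf/spark-env.sh
# ...and to the system-wide profile, then reload the current shell
echo 'export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> /etc/profile
source /etc/profile
# Sanity check: the directory should contain yarn-site.xml
ls $YARN_CONF_DIR/yarn-site.xml
```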
yarn-client deployment
1. Submit command:
./spark-submit --name SparkWordCount --class spark.examples.SparkWordCount --master yarn-client --executor-memory 512M --total-executor-cores 1 SparkWordCount.jar README.md
For comparison, here is the earlier submission in which Spark manages the compute resources itself:
./spark-submit --name SparkWordCount --class spark.examples.SparkWordCount --master spark://hadoop.master:7077 --executor-memory 512M --total-executor-cores 1 SparkWordCount.jar README.md
2. Notes:
2.1 With yarn-client, the driver runs on the client, so its status can be viewed through the web UI, by default at http://hadoop.master:4040, while YARN's own UI is at http://hadoop.master:8088.
2.2 Submitting a job produces a log that makes the whole process look rather involved:
[hadoop@hadoop bin]$ sh submitSparkApplicationYarnClient.sh // submit the job in yarn-client mode
Delete the HDFS output directory // delete the HDFS output directory left over from the previous run
15/01/10 07:27:49 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/SortedWordCountRDDInSparkApplication
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/10 07:27:52 INFO spark.SecurityManager: Changing view acls to: hadoop
15/01/10 07:27:52 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/01/10 07:27:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/01/10 07:27:53 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/01/10 07:27:53 INFO Remoting: Starting remoting
15/01/10 07:27:54 INFO util.Utils: Successfully started service 'sparkDriver' on port 35401.
15/01/10 07:27:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@localhost:35401]
15/01/10 07:27:54 INFO spark.SparkEnv: Registering MapOutputTracker
15/01/10 07:27:54 INFO spark.SparkEnv: Registering BlockManagerMaster
15/01/10 07:27:54 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20150110072754-dcdf
15/01/10 07:27:54 INFO storage.MemoryStore: MemoryStore started with capacity 267.3 MB
15/01/10 07:27:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/10 07:27:56 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-8f55f6ec-399b-4371-9ab4-d648047381c5
15/01/10 07:27:56 INFO spark.HttpServer: Starting HTTP Server
15/01/10 07:27:56 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/10 07:27:57 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52196
15/01/10 07:27:57 INFO util.Utils: Successfully started service 'HTTP file server' on port 52196.
15/01/10 07:27:57 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/10 07:27:58 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/01/10 07:27:58 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/01/10 07:27:58 INFO ui.SparkUI: Started SparkUI at http://localhost:4040
15/01/10 07:27:58 INFO spark.SparkContext: Added JAR file:/home/hadoop/software/spark-1.2.0-bin-hadoop2.4/bin/SparkWordCount.jar at http://localhost:52196/jars/SparkWordCount.jar with timestamp 1420892878400
////// At this point the Spark client-side work is done and the job is handed over to YARN //////
15/01/10 07:28:00 INFO client.RMProxy: Connecting to ResourceManager at hadoop.master/192.168.26.136:8032 // connect to the ResourceManager
15/01/10 07:28:02 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers // request resources
15/01/10 07:28:02 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/01/10 07:28:02 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead // allocate one resource unit, the AM container
15/01/10 07:28:02 INFO yarn.Client: Setting up container launch context for our AM // set up the container
15/01/10 07:28:02 INFO yarn.Client: Preparing resources for our AM container
15/01/10 07:28:03 INFO yarn.Client: Uploading resource file:/home/hadoop/software/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar -> hdfs://hadoop.master:9000/user/hadoop/.sparkStaging/application_1420859110621_0002/spark-assembly-1.2.0-hadoop2.4.0.jar
//// spark-assembly-1.2.0-hadoop2.4.0.jar is uploaded to HDFS so the AM container can fetch it; pointing spark.yarn.jar at a copy already on HDFS avoids re-uploading it on every submit
15/01/10 07:28:22 INFO yarn.Client: Setting up the launch environment for our AM container
15/01/10 07:28:22 INFO spark.SecurityManager: Changing view acls to: hadoop
15/01/10 07:28:22 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/01/10 07:28:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/01/10 07:28:22 INFO yarn.Client: Submitting application 2 to ResourceManager
//// job submitted to the ResourceManager
15/01/10 07:28:22 INFO impl.YarnClientImpl: Submitted application application_1420859110621_0002
15/01/10 07:28:23 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:23 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1420892902791
final status: UNDEFINED
tracking URL: http://hadoop.master:8088/proxy/application_1420859110621_0002/
user: hadoop
/// While waiting for YARN to schedule the ApplicationMaster, the client polls and logs the application report roughly once per second, hence the run of identical ACCEPTED lines below
15/01/10 07:28:24 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:26 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:27 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:28 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:29 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:30 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:31 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:32 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:34 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:35 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:36 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:37 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:38 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:39 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:40 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:41 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:42 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:43 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:44 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:45 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:46 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:47 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:48 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
/// Control returns to the Spark runtime: the ApplicationMaster registers itself with the driver
15/01/10 07:28:48 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@hadoop.master:43444/user/YarnAM#-519598456]
15/01/10 07:28:48 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> hadoop.master, PROXY_URI_BASES -> http://hadoop.master:8088/proxy/application_1420859110621_0002), /proxy/application_1420859110621_0002
15/01/10 07:28:48 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
// Back to the progress reports; the state has now moved from ACCEPTED to RUNNING
15/01/10 07:28:49 INFO yarn.Client: Application report for application_1420859110621_0002 (state: RUNNING)
15/01/10 07:28:49 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: hadoop.master
ApplicationMaster RPC port: 0
queue: default
start time: 1420892902791
final status: UNDEFINED
tracking URL: http://hadoop.master:8088/proxy/application_1420859110621_0002/
user: hadoop
15/01/10 07:28:49 INFO cluster.YarnClientSchedulerBackend: Application application_1420859110621_0002 has started running.
15/01/10 07:28:51 INFO netty.NettyBlockTransferService: Server created on 45652
15/01/10 07:28:51 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/01/10 07:28:51 INFO storage.BlockManagerMasterActor: Registering block manager localhost:45652 with 267.3 MB RAM, BlockManagerId(<driver>, localhost, 45652)
15/01/10 07:28:51 INFO storage.BlockManagerMaster: Registered BlockManager
15/01/10 07:28:51 INFO scheduler.EventLoggingListener: Logging events to hdfs://hadoop.master:9000/user/hadoop/sparkevt/application_1420859110621_0002
15/01/10 07:28:51 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
15/01/10 07:28:52 INFO storage.MemoryStore: ensureFreeSpace(216263) called with curMem=0, maxMem=280248975
15/01/10 07:28:52 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 211.2 KB, free 267.1 MB)
15/01/10 07:28:52 INFO storage.MemoryStore: ensureFreeSpace(31667) called with curMem=216263, maxMem=280248975
15/01/10 07:28:52 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 30.9 KB, free 267.0 MB)
15/01/10 07:28:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45652 (size: 30.9 KB, free: 267.2 MB)
15/01/10 07:28:52 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/01/10 07:28:52 INFO spark.SparkContext: Created broadcast 0 from textFile at SparkWordCount.scala:41
15/01/10 07:28:52 INFO mapred.FileInputFormat: Total input paths to process : 1
15/01/10 07:28:53 INFO spark.SparkContext: Starting job: sortByKey at SparkWordCount.scala:44
15/01/10 07:28:53 INFO scheduler.DAGScheduler: Registering RDD 3 (map at SparkWordCount.scala:44)
15/01/10 07:28:53 INFO scheduler.DAGScheduler: Got job 0 (sortByKey at SparkWordCount.scala:44) with 2 output partitions (allowLocal=false)
15/01/10 07:28:53 INFO scheduler.DAGScheduler: Final stage: Stage 1(sortByKey at SparkWordCount.scala:44)
15/01/10 07:28:53 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 0)
15/01/10 07:28:53 INFO scheduler.DAGScheduler: Missing parents: List(Stage 0)
15/01/10 07:28:53 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:44), which has no missing parents
15/01/10 07:28:53 INFO storage.MemoryStore: ensureFreeSpace(3528) called with curMem=247930, maxMem=280248975
15/01/10 07:28:53 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.4 KB, free 267.0 MB)
15/01/10 07:28:53 INFO storage.MemoryStore: ensureFreeSpace(2498) called with curMem=251458, maxMem=280248975
15/01/10 07:28:53 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 267.0 MB)
15/01/10 07:28:53 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:45652 (size: 2.4 KB, free: 267.2 MB)
15/01/10 07:28:53 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/01/10 07:28:53 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/01/10 07:28:53 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:44)
15/01/10 07:28:53 INFO cluster.YarnClientClusterScheduler: Adding task set 0.0 with 2 tasks
15/01/10 07:28:53 INFO util.RackResolver: Resolved hadoop.master to /default-rack
15/01/10 07:29:06 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@hadoop.master:34326/user/Executor#-1519914394] with ID 1
15/01/10 07:29:06 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hadoop.master, NODE_LOCAL, 1356 bytes)
15/01/10 07:29:06 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@hadoop.master:59070/user/Executor#763574095] with ID 2
15/01/10 07:29:06 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, hadoop.master, NODE_LOCAL, 1356 bytes)
15/01/10 07:29:09 INFO storage.BlockManagerMasterActor: Registering block manager hadoop.master:44394 with 267.3 MB RAM, BlockManagerId(1, hadoop.master, 44394)
15/01/10 07:29:09 INFO storage.BlockManagerMasterActor: Registering block manager hadoop.master:52439 with 267.3 MB RAM, BlockManagerId(2, hadoop.master, 52439)
15/01/10 07:29:11 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop.master:52439 (size: 2.4 KB, free: 267.3 MB)
15/01/10 07:29:11 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop.master:44394 (size: 2.4 KB, free: 267.3 MB)
15/01/10 07:29:15 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop.master:44394 (size: 30.9 KB, free: 267.2 MB)
15/01/10 07:29:15 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop.master:52439 (size: 30.9 KB, free: 267.2 MB)
15/01/10 07:29:34 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 27521 ms on hadoop.master (1/2)
15/01/10 07:29:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 27381 ms on hadoop.master (2/2)
15/01/10 07:29:34 INFO scheduler.DAGScheduler: Stage 0 (map at SparkWordCount.scala:44) finished in 40.486 s
15/01/10 07:29:34 INFO scheduler.DAGScheduler: looking for newly runnable stages
15/01/10 07:29:34 INFO scheduler.DAGScheduler: running: Set()
15/01/10 07:29:34 INFO scheduler.DAGScheduler: waiting: Set(Stage 1)
15/01/10 07:29:34 INFO scheduler.DAGScheduler: failed: Set()
15/01/10 07:29:34 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/01/10 07:29:34 INFO scheduler.DAGScheduler: Missing parents for Stage 1: List()
15/01/10 07:29:34 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[7] at sortByKey at SparkWordCount.scala:44), which is now runnable
15/01/10 07:29:34 INFO storage.MemoryStore: ensureFreeSpace(3072) called with curMem=253956, maxMem=280248975
15/01/10 07:29:34 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.0 KB, free 267.0 MB)
15/01/10 07:29:34 INFO storage.MemoryStore: ensureFreeSpace(2122) called with curMem=257028, maxMem=280248975
15/01/10 07:29:34 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 267.0 MB)
15/01/10 07:29:34 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:45652 (size: 2.1 KB, free: 267.2 MB)
15/01/10 07:29:34 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/01/10 07:29:34 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/01/10 07:29:34 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[7] at sortByKey at SparkWordCount.scala:44)
15/01/10 07:29:34 INFO cluster.YarnClientClusterScheduler: Adding task set 1.0 with 2 tasks
15/01/10 07:29:34 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, hadoop.master, PROCESS_LOCAL, 1112 bytes)
15/01/10 07:29:34 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, hadoop.master, PROCESS_LOCAL, 1112 bytes)
15/01/10 07:29:34 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on hadoop.master:52439 (size: 2.1 KB, free: 267.2 MB)
15/01/10 07:29:35 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on hadoop.master:44394 (size: 2.1 KB, free: 267.2 MB)
15/01/10 07:29:35 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sparkExecutor@hadoop.master:59070
15/01/10 07:29:35 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 158 bytes
15/01/10 07:29:35 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sparkExecutor@hadoop.master:34326
15/01/10 07:29:37 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 3297 ms on hadoop.master (1/2)
15/01/10 07:29:37 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 3303 ms on hadoop.master (2/2)
15/01/10 07:29:37 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/01/10 07:29:37 INFO scheduler.DAGScheduler: Stage 1 (sortByKey at SparkWordCount.scala:44) finished in 3.307 s
/// Job 0 is complete
15/01/10 07:29:37 INFO scheduler.DAGScheduler: Job 0 finished: sortByKey at SparkWordCount.scala:44, took 44.720124 s
15/01/10 07:29:38 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/01/10 07:29:38 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/01/10 07:29:38 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/01/10 07:29:38 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/01/10 07:29:38 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/01/10 07:29:38 INFO spark.SparkContext: Starting job: saveAsTextFile at SparkWordCount.scala:44
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Registering RDD 5 (map at SparkWordCount.scala:44)
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Got job 1 (saveAsTextFile at SparkWordCount.scala:44) with 2 output partitions (allowLocal=false)
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Final stage: Stage 4(saveAsTextFile at SparkWordCount.scala:44)
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 3)
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Missing parents: List(Stage 3)
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Submitting Stage 3 (MappedRDD[5] at map at SparkWordCount.scala:44), which has no missing parents
15/01/10 07:29:38 INFO storage.MemoryStore: ensureFreeSpace(2992) called with curMem=259150, maxMem=280248975
15/01/10 07:29:38 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.9 KB, free 267.0 MB)
15/01/10 07:29:38 INFO storage.MemoryStore: ensureFreeSpace(2168) called with curMem=262142, maxMem=280248975
15/01/10 07:29:38 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.1 KB, free 267.0 MB)
15/01/10 07:29:38 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:45652 (size: 2.1 KB, free: 267.2 MB)
15/01/10 07:29:38 INFO storage.BlockManagerMaster: Updated info of block broadcast_3_piece0
15/01/10 07:29:38 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 3 (MappedRDD[5] at map at SparkWordCount.scala:44)
15/01/10 07:29:38 INFO cluster.YarnClientClusterScheduler: Adding task set 3.0 with 2 tasks
15/01/10 07:29:38 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 4, hadoop.master, PROCESS_LOCAL, 1101 bytes)
15/01/10 07:29:38 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 3.0 (TID 5, hadoop.master, PROCESS_LOCAL, 1101 bytes)
15/01/10 07:29:38 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on hadoop.master:52439 (size: 2.1 KB, free: 267.2 MB)
15/01/10 07:29:38 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on hadoop.master:44394 (size: 2.1 KB, free: 267.2 MB)
15/01/10 07:29:38 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 4) in 441 ms on hadoop.master (1/2)
15/01/10 07:29:38 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 5) in 470 ms on hadoop.master (2/2)
15/01/10 07:29:38 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/01/10 07:29:38 INFO scheduler.DAGScheduler: Stage 3 (map at SparkWordCount.scala:44) finished in 0.474 s
15/01/10 07:29:38 INFO scheduler.DAGScheduler: looking for newly runnable stages
15/01/10 07:29:38 INFO scheduler.DAGScheduler: running: Set()
15/01/10 07:29:38 INFO scheduler.DAGScheduler: waiting: Set(Stage 4)
15/01/10 07:29:38 INFO scheduler.DAGScheduler: failed: Set()
15/01/10 07:29:39 INFO scheduler.DAGScheduler: Missing parents for Stage 4: List()
15/01/10 07:29:39 INFO scheduler.DAGScheduler: Submitting Stage 4 (MappedRDD[10] at saveAsTextFile at SparkWordCount.scala:44), which is now runnable
15/01/10 07:29:39 INFO storage.MemoryStore: ensureFreeSpace(113152) called with curMem=264310, maxMem=280248975
15/01/10 07:29:39 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 110.5 KB, free 266.9 MB)
15/01/10 07:29:39 INFO storage.MemoryStore: ensureFreeSpace(68432) called with curMem=377462, maxMem=280248975
15/01/10 07:29:39 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 66.8 KB, free 266.8 MB)
15/01/10 07:29:39 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:45652 (size: 66.8 KB, free: 267.2 MB)
15/01/10 07:29:39 INFO storage.BlockManagerMaster: Updated info of block broadcast_4_piece0
15/01/10 07:29:39 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:838
15/01/10 07:29:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 4 (MappedRDD[10] at saveAsTextFile at SparkWordCount.scala:44)
15/01/10 07:29:39 INFO cluster.YarnClientClusterScheduler: Adding task set 4.0 with 2 tasks
15/01/10 07:29:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 6, hadoop.master, PROCESS_LOCAL, 1112 bytes)
15/01/10 07:29:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 7, hadoop.master, PROCESS_LOCAL, 1112 bytes)
15/01/10 07:29:39 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on hadoop.master:52439 (size: 66.8 KB, free: 267.2 MB)
15/01/10 07:29:39 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on hadoop.master:44394 (size: 66.8 KB, free: 267.2 MB)
15/01/10 07:29:40 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to sparkExecutor@hadoop.master:34326
15/01/10 07:29:40 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 158 bytes
15/01/10 07:29:40 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to sparkExecutor@hadoop.master:59070
15/01/10 07:29:42 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 4.0 (TID 7) in 3184 ms on hadoop.master (1/2)
15/01/10 07:29:42 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 6) in 3213 ms on hadoop.master (2/2)
15/01/10 07:29:42 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
15/01/10 07:29:42 INFO scheduler.DAGScheduler: Stage 4 (saveAsTextFile at SparkWordCount.scala:44) finished in 3.215 s
//// Job 1 is complete; at this point all jobs have finished
15/01/10 07:29:42 INFO scheduler.DAGScheduler: Job 1 finished: saveAsTextFile at SparkWordCount.scala:44, took 3.969958 s
//// The block below is the driver tearing down its web UI: each Jetty servlet context handler is stopped
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/json,null}
15/01/10 07:29:42 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null}
//// The Spark web UI has stopped; from here on the application can only be viewed through the History Server
15/01/10 07:29:42 INFO ui.SparkUI: Stopped Spark web UI at http://localhost:4040
15/01/10 07:29:42 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/01/10 07:29:42 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
15/01/10 07:29:43 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
15/01/10 07:29:43 INFO cluster.YarnClientSchedulerBackend: Stopped
15/01/10 07:29:44 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/01/10 07:29:44 INFO storage.MemoryStore: MemoryStore cleared
15/01/10 07:29:44 INFO storage.BlockManager: BlockManager stopped
15/01/10 07:29:44 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
/// The teardown sequence above releases resources in order:
// 1. executors stop 2. the MemoryStore is cleared 3. the BlockManager stops 4. the BlockManagerMaster stops
15/01/10 07:29:44 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/01/10 07:29:44 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
//// The SparkContext is shut down and the whole job is finished
15/01/10 07:29:44 INFO spark.SparkContext: Successfully stopped SparkContext
Visit http://hadoop.master:8088 to check on the job. As the screenshot shows, the Spark WordCount application does appear in Hadoop's UI, with the application type listed as SPARK.
Visiting http://hadoop.master:4040 now fails, as expected: Spark shuts this service down once the application completes, which shows that the web UI is bound to the application rather than to Spark itself.
Visit http://hadoop.master:18080 to browse the finished application in the Spark History Server.
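The 18080 UI is served by the Spark History Server, which replays the event logs the driver wrote (visible above as "Logging events to hdfs://hadoop.master:9000/user/hadoop/sparkevt/..."). A minimal sketch of the configuration behind this, assuming the same HDFS paths as in the log:

```shell
# conf/spark-defaults.conf: have every application write event logs to HDFS
echo 'spark.eventLog.enabled true' >> $SPARK_HOME/conf/spark-defaults.conf
echo 'spark.eventLog.dir hdfs://hadoop.master:9000/user/hadoop/sparkevt' >> $SPARK_HOME/conf/spark-defaults.conf
# Point the History Server at the same directory, then start it (serves port 18080)
echo 'export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://hadoop.master:9000/user/hadoop/sparkevt"' >> $SPARK_HOME/conf/spark-env.sh
$SPARK_HOME/sbin/start-history-server.sh
```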
yarn-cluster deployment
1. Submit command
./spark-submit --name SparkWordCount --class spark.examples.SparkWordCount --master yarn-cluster --executor-memory 512M --total-executor-cores 1 SparkWordCount.jar README.md
2. Job log
The log produced in yarn-cluster mode is surprising for the opposite reason: unlike the verbose yarn-client output, it is very simple. It boils down to the application being accepted (ACCEPTED), running (RUNNING), and finally finishing (FINISHED), and nothing else; in particular, none of Spark's own log output appears on the client...
[hadoop@hadoop bin]$ sh submitSparkApplicationYarnCluster.sh //// submit the Spark application in yarn-cluster mode
Delete the HDFS output directory /// delete the HDFS output directory created by the previous run
15/01/10 07:56:30 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/SortedWordCountRDDInSparkApplication
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/10 07:56:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/// submit the application
/// request resources
15/01/10 07:56:36 INFO client.RMProxy: Connecting to ResourceManager at hadoop.master/192.168.26.136:8032
15/01/10 07:56:38 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
15/01/10 07:56:38 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/01/10 07:56:38 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/01/10 07:56:38 INFO yarn.Client: Setting up container launch context for our AM
15/01/10 07:56:38 INFO yarn.Client: Preparing resources for our AM container
//// spark-assembly-1.2.0-hadoop2.4.0.jar is uploaded to HDFS once again
15/01/10 07:56:39 INFO yarn.Client: Uploading resource file:/home/hadoop/software/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar -> hdfs://hadoop.master:9000/user/hadoop/.sparkStaging/application_1420859110621_0003/spark-assembly-1.2.0-hadoop2.4.0.jar
//// the application's own jar is uploaded as well
15/01/10 07:56:49 INFO yarn.Client: Uploading resource file:/home/hadoop/software/spark-1.2.0-bin-hadoop2.4/bin/SparkWordCount.jar -> hdfs://hadoop.master:9000/user/hadoop/.sparkStaging/application_1420859110621_0003/SparkWordCount.jar
15/01/10 07:56:49 INFO yarn.Client: Setting up the launch environment for our AM container
15/01/10 07:56:49 INFO spark.SecurityManager: Changing view acls to: hadoop
15/01/10 07:56:49 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/01/10 07:56:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
/// the application is submitted to the ResourceManager
15/01/10 07:56:49 INFO yarn.Client: Submitting application 3 to ResourceManager
15/01/10 07:56:49 INFO impl.YarnClientImpl: Submitted application application_1420859110621_0003
15/01/10 07:56:50 INFO yarn.Client: Application report for application_1420859110621_0003 (state: ACCEPTED)
15/01/10 07:56:50 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1420894609440
final status: UNDEFINED
tracking URL: http://hadoop.master:8088/proxy/application_1420859110621_0003/
user: hadoop
//// application accepted
15/01/10 07:56:51 INFO yarn.Client: Application report for application_1420859110621_0003 (state: ACCEPTED)
... (the identical ACCEPTED report repeats once per second while the application waits for resources) ...
15/01/10 07:57:05 INFO yarn.Client: Application report for application_1420859110621_0003 (state: ACCEPTED)
15/01/10 07:57:06 INFO yarn.Client: Application report for application_1420859110621_0003 (state: RUNNING)
15/01/10 07:57:06 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: hadoop.master
ApplicationMaster RPC port: 0
queue: default
start time: 1420894609440
final status: UNDEFINED
tracking URL: http://hadoop.master:8088/proxy/application_1420859110621_0003/
user: hadoop
//// the application starts running
15/01/10 07:57:07 INFO yarn.Client: Application report for application_1420859110621_0003 (state: RUNNING)
... (the identical RUNNING report repeats once per second while the job executes) ...
15/01/10 07:57:41 INFO yarn.Client: Application report for application_1420859110621_0003 (state: RUNNING)
/// the application has finished
15/01/10 07:57:42 INFO yarn.Client: Application report for application_1420859110621_0003 (state: FINISHED)
15/01/10 07:57:42 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: hadoop.master
ApplicationMaster RPC port: 0
queue: default
start time: 1420894609440
final status: SUCCEEDED
tracking URL: http://hadoop.master:8088/proxy/application_1420859110621_0003/
user: hadoop
3. Checking application status
3.1 In yarn-cluster mode the driver runs inside YARN, so to reach the driver's web UI you must click the Tracking UI link for the job in YARN. Clicking History in the Tracking UI should open the application's history page, but the link is unreachable because Hadoop's history server has not been started; start it with sbin/mr-jobhistory-daemon.sh under the Hadoop directory.
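The fix above can be sketched as the following commands (a sketch assuming $HADOOP_HOME points at the Hadoop installation, as configured earlier via YARN_CONF_DIR; run on the node that should host the history server):

```shell
# Start the MapReduce job history server so the "History" link
# in the YARN Tracking UI resolves instead of failing
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

# Verify the daemon is running (JobHistoryServer should appear)
jps
```

The history server's web UI is served on port 19888 by default, which is where the Tracking UI's History link redirects.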
3.2 In yarn-cluster mode the driver runs inside YARN, so the program's output cannot be shown on the client; it is best to save the results to HDFS. The client terminal only shows the progress of the YARN job.
3.3 Visit http://hadoop.master:8088 to check the outcome of the job.
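Putting 3.1–3.3 together, a yarn-cluster run looks like this (a sketch reusing the class, jar, and HDFS output directory from the earlier yarn-client example; the part-* file names assume the job writes with saveAsTextFile):

```shell
# Submit with the driver running inside YARN instead of on the client
./spark-submit --name SparkWordCount \
  --class spark.examples.SparkWordCount \
  --master yarn-cluster \
  --executor-memory 512M \
  SparkWordCount.jar README.md

# The driver's output is not printed on the client, so read the
# result back from the HDFS output directory the job wrote to
hadoop fs -cat /user/hadoop/SortedWordCountRDDInSparkApplication/part-*
```

Note that unlike yarn-client mode, the terminal above only shows yarn.Client application reports; the driver's own logs are available through the YARN web UI at http://hadoop.master:8088.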
Reference: http://blog.csdn.net/book_mmicky/article/details/25714287