[Spark #99] Spark Streaming: source-code analysis of the data flow within a batch interval

 

 

The analysis below uses SocketInputDStream as the example:

Spark Streaming reads data from the socket in SocketReceiver's receive method. Setting error handling aside (the Receiver has a reconnection mechanism via its restart method; by default, after the Receiver dies it waits two seconds and then re-establishes the socket connection), each item that is read is stored by calling the store method. Tracing how that data flows raises the following questions:

1. Where does the data end up being stored?

2. In what structure is it stored?

3. When is the data read back?

4. How is the data of one batch interval turned into an RDD?
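
Before diving into the receiver internals, it helps to see the driver side of this example. The following is only a minimal sketch for context; the host, port and 10-second batch interval are placeholder values. socketTextStream returns a ReceiverInputDStream[String] whose concrete class is SocketInputDStream, backed by the SocketReceiver analyzed next.

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  // Minimal driver program that produces a SocketInputDStream (illustrative values).
  object SocketWordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(10)) // batch interval: 10 seconds

      // Creates a SocketInputDStream; its SocketReceiver connects to localhost:9999
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
      lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

      ssc.start()
      ssc.awaitTermination()
    }
  }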

 

1. SocketReceiver#receive

  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      logInfo("Stopped receiving")
      restart("Retrying connecting to " + host + ":" + port)
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        restart("Error receiving data", t)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }

 

2. SocketReceiver#receive=>SocketReceiver#store

 

  /**
   * Store a single item of received data to Spark's memory.
   * These single items will be aggregated together into data blocks before
   * being pushed into Spark's memory.
   */
  def store(dataItem: T) {
    executor.pushSingle(dataItem)
  }

 

Storing data is one of the responsibilities of the receiver's executor (the ReceiverSupervisor). store simply calls pushSingle on that executor; "single" here means one read, and dataItem is the object obtained by that read. The same path applies to any custom Receiver, as sketched below.
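
As a hypothetical illustration (the class and values below are made up, not from the Spark source), a user-defined receiver reaches the same store entry point and therefore the same pushSingle path:

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  // Hypothetical custom receiver, shown only to illustrate that store() is the
  // generic entry point into the supervisor (executor) for every Receiver.
  class ConstantReceiver(value: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

    def onStart(): Unit = {
      new Thread("Constant Receiver") {
        override def run(): Unit = {
          while (!isStopped) {
            store(value) // ends up in ReceiverSupervisorImpl.pushSingle, as traced below
            Thread.sleep(50)
          }
        }
      }.start()
    }

    def onStop(): Unit = { } // the isStopped check above lets the thread exit
  }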

 

 

3. SocketReceiver#store=>executor.pushSingle(ReceiverSupervisorImpl.pushSingle)

 

  /** Push a single record of received data into block generator. */
  def pushSingle(data: Any) {
    blockGenerator.addData(data)
  }

 

The data is appended to blockGenerator, a field of type BlockGenerator. As the name suggests, it is a block generator: every block interval (200 ms by default: private val blockInterval = conf.getLong("spark.streaming.blockInterval", 200)) Spark Streaming packages the data received so far into a block and later writes that block to the BlockManager. Let's keep following this path.
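
Because the code above reads the interval with conf.getLong, the value is plain milliseconds and can be tuned on the SparkConf. A quick sketch (the 100 ms value is only illustrative):

  import org.apache.spark.SparkConf

  // Illustrative tuning: cut a block every 100 ms instead of the 200 ms default.
  // The receiver reconnect delay mentioned earlier also appears to be configurable,
  // via spark.streaming.receiverRestartDelay (milliseconds, default 2000).
  val conf = new SparkConf()
    .setAppName("BlockIntervalTuning")
    .set("spark.streaming.blockInterval", "100")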

 

 

4. executor.pushSingle=>BlockGenerator.addData

  /**
   * Push a single data item into the buffer. All received data items
   * will be periodically pushed into BlockManager.
   */
  def addData (data: Any): Unit = synchronized {
    waitToPush()          // blocks to throttle the push rate
    currentBuffer += data // append the data item to currentBuffer
  }

 

 Once the data lands in currentBuffer, the trail seems to go cold. In fact, two threads started inside BlockGenerator (blockIntervalTimer and blockPushingThread) keep processing currentBuffer behind the scenes.

 

blockIntervalTimer invokes updateCurrentBuffer every 200 ms by default; that function packages the ArrayBuffer-typed currentBuffer into one small Block. (A simplified model of the timer follows the snippet below.)

  private val blockInterval = conf.getLong("spark.streaming.blockInterval", 200)
  private val blockIntervalTimer =
    new RecurringTimer(clock, blockInterval, updateCurrentBuffer, "BlockGenerator")
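
RecurringTimer itself is not shown here; conceptually it just runs the supplied callback every blockInterval milliseconds on a dedicated thread. A simplified stand-in (a sketch, not Spark's actual RecurringTimer) looks like this:

  // Simplified stand-in for RecurringTimer: calls `callback` every `period` ms
  // on its own thread until stopped.
  class SimpleRecurringTimer(period: Long, callback: Long => Unit, name: String) {
    @volatile private var stopped = false

    private val thread = new Thread(name) {
      override def run(): Unit = {
        var nextTime = System.currentTimeMillis() + period
        while (!stopped) {
          val sleepMs = nextTime - System.currentTimeMillis()
          if (sleepMs > 0) Thread.sleep(sleepMs)
          callback(nextTime) // here: updateCurrentBuffer(nextTime)
          nextTime += period
        }
      }
    }

    def start(): Unit = thread.start()
    def stop(): Unit = { stopped = true }
  }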
 

 

 blockPushingThread loops in keepPushingBlocks, writing each Block created by blockIntervalTimer into the BlockManager:

private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
 

 

 The two threads above hand blocks to each other through an ArrayBlockingQueue; the two relevant queue operations are illustrated right after the snippet:

  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
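
The bounded size matters: put() blocks the timer thread once blockQueueSize un-pushed blocks have piled up (a form of back-pressure), while poll with a timeout lets the pushing thread wake up regularly to re-check its stopped flag. A tiny illustration (values are illustrative):

  import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

  object QueueSemantics extends App {
    val queue = new ArrayBlockingQueue[String](10)
    queue.put("block-0")                              // blocks the caller if the queue is full
    val next = queue.poll(100, TimeUnit.MILLISECONDS) // returns null if nothing arrives within 100 ms
    println(next)
  }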
 

 

 

5. BlockGenerator#updateCurrentBuffer

updateCurrentBuffer is executed by the blockIntervalTimer thread.

 

  /** Change the buffer to which single records are added to. */
  private def updateCurrentBuffer(time: Long): Unit = synchronized {
    try {
      val newBlockBuffer = currentBuffer
      currentBuffer = new ArrayBuffer[Any] // swapping currentBuffer like this is thread-safe: it is a @volatile field and this method is synchronized
      if (newBlockBuffer.size > 0) {
        val blockId = StreamBlockId(receiverId, time - blockInterval) // build the StreamBlockId
        val newBlock = new Block(blockId, newBlockBuffer) // wrap the buffer into a Block
        listener.onGenerateBlock(blockId) // notify the listener passed to BlockGenerator's constructor; onGenerateBlock is an empty implementation
        blocksForPushing.put(newBlock)  // enqueue on the blocking queue, to be taken by blockPushingThread
        logDebug("Last element in " + blockId + " is " + newBlockBuffer.last)
      }
    } catch {
      case ie: InterruptedException =>
        logInfo("Block updating timer thread was interrupted")
      case e: Exception =>
        reportError("Error in block updating thread", e)
    }
  }
 

 

6. BlockGenerator#keepPushingBlocks

keepPushingBlocks is executed by the blockPushingThread.

 

 

  /** Keep pushing blocks to the BlockManager. */
  private def keepPushingBlocks() {
    logInfo("Started block pushing thread")
    try {
      while(!stopped) {
        // poll takes at most one block from the blocking queue; with this overload it waits up to 100 ms if the queue is empty, so the loop can regularly re-check the stopped flag
        Option(blocksForPushing.poll(100, TimeUnit.MILLISECONDS)) match {
          case Some(block) => pushBlock(block)
          case None =>
        }
      }
      // Push out the blocks that are still left
      logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
      while (!blocksForPushing.isEmpty) {
        logDebug("Getting block ")
        val block = blocksForPushing.take()
        pushBlock(block)
        logInfo("Blocks left to push " + blocksForPushing.size())
      }
      logInfo("Stopped block pushing thread")
    } catch {
      case ie: InterruptedException =>
        logInfo("Block pushing thread was interrupted")
      case e: Exception =>
        reportError("Error in block pushing thread", e)
    }
  }
 

 

7. BlockGenerator#pushBlock

This method pushes one Block at a time; it does not drain all Blocks from the queue and push them in a single batch.

 

 

  private def pushBlock(block: Block) {
    listener.onPushBlock(block.id, block.buffer)
    logInfo("Pushed block " + block.id)
  }
 

 

8. BlockGeneratorListener#onPushBlock

pushBlock notifies the listener following the Observer pattern. The listener is passed into BlockGenerator's constructor (it is in fact an anonymous inner class instance created in ReceiverSupervisorImpl when the BlockGenerator is built):

 

 

  /** Divides received data records into data blocks for pushing in BlockManager. */
  private val blockGenerator = new BlockGenerator(new BlockGeneratorListener {
    def onAddData(data: Any, metadata: Any): Unit = { }

    def onGenerateBlock(blockId: StreamBlockId): Unit = { }

    def onError(message: String, throwable: Throwable) {
      reportError(message, throwable)
    }

    def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
      pushArrayBuffer(arrayBuffer, None, Some(blockId))
    }
  }, streamId, env.conf)
 

 

9. ReceiverSupervisorImpl#pushArrayBuffer

 

  /** Store an ArrayBuffer of received data as a data block into Spark's memory. */
  def pushArrayBuffer(
      arrayBuffer: ArrayBuffer[_],
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
  }
 

 

10. ReceiverSupervisorImpl#pushAndReportBlock

 

 

  /** Store block and report it to driver */
  def pushAndReportBlock(
      receivedBlock: ReceivedBlock,
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val numRecords = receivedBlock match {
      case ArrayBufferBlock(arrayBuffer) => arrayBuffer.size
      case _ => -1
    }

    val time = System.currentTimeMillis
    val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
    logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")

    val blockInfo = ReceivedBlockInfo(streamId, numRecords, blockStoreResult)
    val future = trackerActor.ask(AddBlock(blockInfo))(askTimeout)
    Await.result(future, askTimeout)
    logDebug(s"Reported block $blockId")
  }
 

 

pushAndReportBlock does two things: it stores the block, and it reports to the tracker that a new block has been added.

 

10.1 receivedBlockHandler.storeBlock(BlockManagerBasedBlockHandler#storeBlock)

 

 

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel, tellMaster = true)
      case IteratorBlock(iterator) =>
        blockManager.putIterator(blockId, iterator, storageLevel, tellMaster = true)
      case ByteBufferBlock(byteBuffer) =>
        blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
      case o =>
        throw new SparkException(
          s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
    }
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId)
  }
 

 

Here blockManager is a field of type BlockManager, defined in the org.apache.spark.storage package; the actual write goes through putIterator or putBytes. The internals belong to Spark's storage subsystem and are not covered here. The important point is that this is where the data is written into the BlockManager.
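
The blocks written here are keyed by StreamBlockId, whose name is what later shows up in the BlockManager (for example in the Storage tab of the web UI). A quick illustration with made-up values:

  import org.apache.spark.storage.StreamBlockId

  // StreamBlockId(streamId, uniqueId) renders as "input-<streamId>-<uniqueId>".
  val id = StreamBlockId(0, 1422222222000L) // values are made up for illustration
  println(id.name) // input-0-1422222222000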

 

10.2 ReceiverTracker#AddBlock

The following statements report the block that was just written to the BlockManager to trackerActor; this is a synchronous call between processes (the ask pattern):

    val blockInfo = ReceivedBlockInfo(streamId, numRecords, blockStoreResult)
    val future = trackerActor.ask(AddBlock(blockInfo))(askTimeout)
    Await.result(future, askTimeout)

 

The entity behind trackerActor is the ReceiverTracker. The AddBlock message triggers ReceiverTracker.addBlock, which in turn calls ReceivedBlockTracker.addBlock:

 

 

  /** Add new blocks for the given stream */
  private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
    receivedBlockTracker.addBlock(receivedBlockInfo)
  }
 

 

11. ReceivedBlockTracker#addBlock

 

 

  /** Add received block. This event will get written to the write ahead log (if enabled). */
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = synchronized {
    try {
      writeToLog(BlockAdditionEvent(receivedBlockInfo)) // write to the write ahead log (if enabled)
      getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo // getReceivedBlockQueue looks up the stream's queue in a Map keyed by streamId
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
      true
    } catch {
      case e: Exception =>
        logError(s"Error adding block $receivedBlockInfo", e)
        false
    }
  }
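
For reference, the lookup behind getReceivedBlockQueue is per-stream bookkeeping roughly like the following sketch (an approximation of the idea, not the verbatim ReceivedBlockTracker source):

  import scala.collection.mutable

  // One queue of block-info records per streamId, created lazily on first use.
  // A String stands in for ReceivedBlockInfo to keep the sketch self-contained.
  class BlockQueueSketch {
    type ReceivedBlockQueue = mutable.Queue[String]

    private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]

    private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue =
      streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, mutable.Queue.empty[String])

    def addBlock(streamId: Int, blockInfo: String): Unit = synchronized {
      getReceivedBlockQueue(streamId) += blockInfo
    }
  }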
 

 

 

 

 
