
[Spark #30] Hash Based Shuffle, Part 1: Shuffle Write + NoConsolidationFiles

 

In Spark 1.2, the default shuffle implementation changed from hash based to sort based. How do the two actually behave during a shuffle, what problems does hash based shuffle have, and what problems does sort based shuffle have?

Let's first walk through the source code of hash based shuffle, and then step back to look at the bigger picture. After all, reading code is seeing the trees but not the forest; once we have seen the trees, we can look at what the forest is like.

 

1. Hash Shuffle Overall Architecture

[Figure: overall architecture diagram of hash based shuffle]

2. Example Program

 

package spark.examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._

object SparkWordCountHashShuffle {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "E:\\devsoftware\\hadoop-2.5.2\\hadoop-2.5.2");
    val conf = new SparkConf()
    conf.setAppName("SparkWordCount")
    conf.setMaster("local[3]")
    //Hash based Shuffle;
    conf.set("spark.shuffle.manager", "hash");
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("file:///D:/word.in.3",4); // produce at least 4 partitions
    val rdd1 = rdd.flatMap(_.split(" "))
    val rdd2 = rdd1.map((_, 1))
    val rdd3 = rdd2.reduceByKey(_ + _, 3); // the 3 reduce partitions correspond to 3 ResultTasks
    rdd3.saveAsTextFile("file:///D:/wordout" + System.currentTimeMillis());
    sc.stop
  }
}

Calling rdd3.toDebugString gives the following RDD lineage (inside saveAsTextFile, rdd3 is actually transformed further after the ShuffledRDD, which we ignore here; the ShuffledRDD is the RDD produced by the shuffle):

 

(3) ShuffledRDD[4] at reduceByKey at SparkWordCountHashShuffle.scala:18 []
 +-(5) MappedRDD[3] at map at SparkWordCountHashShuffle.scala:17 []
    |  FlatMappedRDD[2] at flatMap at SparkWordCountHashShuffle.scala:16 []
    |  file:///D:/word.in.3 MappedRDD[1] at textFile at SparkWordCountHashShuffle.scala:15 []
    |  file:///D:/word.in.3 HadoopRDD[0] at textFile at SparkWordCountHashShuffle.scala:15 []

 

 

 

The shuffle write happens in ShuffleMapTask, and the shuffle read happens in ResultTask. A ResultTask learns where the ShuffleMapTasks wrote their data from MapOutputTrackerMaster; therefore, when a ShuffleMapTask finishes, it updates MapOutputTrackerMaster with the location of its shuffle output, and the ResultTask then reads that information from MapOutputTrackerMaster to fetch the data written by the ShuffleMapTasks.

 

3. Hash Shuffle Write

3.1 ShuffleMapTask's runTask method

 

  override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    val ser = SparkEnv.get.closureSerializer.newInstance()
    // Deserialize taskBinary to get rdd and dep. rdd is the last RDD before the shuffle (MappedRDD[3] in the wordcount example);
    // dep is the ShuffleDependency.
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

    metrics = Some(context.taskMetrics)
    var writer: ShuffleWriter[Any, Any] = null
    try {
      // Get the shuffle manager; here it is HashShuffleManager
      val manager = SparkEnv.get.shuffleManager
      // Get a HashShuffleWriter from dep.shuffleHandle and partitionId.
      // A ShuffleWriter is tied to one partition of the RDD, so M ShuffleMapTasks (for M partitions) produce M writers.
      // What dep.shuffleHandle is will be analyzed below.
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      
      // Call HashShuffleWriter.write with this partition's data (an Iterator over the records of the given partition of the RDD)
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      
      // What does stop do? Calling get on its return value yields a MapStatus; what MapStatus contains is analyzed later.
      return writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          if (writer != null) {
            writer.stop(success = false)
          }
        } catch {
          case e: Exception =>
            log.debug("Could not stop writer", e)
        }
        throw e
    }
  }

3.2 Deserializing taskBinary

 

    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

 

The question is what rdd and dep are. rdd is the last RDD of the ShuffleMapStage; dep is of type ShuffleDependency, indicating that the stage depending on this one does so through a shuffle dependency.

rdd and dep are serialized in DAGScheduler's submitMissingTasks; the relevant code fragment is:

 

 var taskBinary: Broadcast[Array[Byte]] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      val taskBinaryBytes: Array[Byte] =
        if (stage.isShuffleMap) { // rdd comes from stage.rdd and dep from stage.shuffleDep.get; this stage is the ShuffleMapStage
          closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
        } else {
          closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
        }
      taskBinary = sc.broadcast(taskBinaryBytes) // broadcast from the driver to the executors
    } catch {
      // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString)
        runningStages -= stage
        return
      case NonFatal(e) =>
        abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
        runningStages -= stage
        return
    }

 

 3.3 dep.shuffleHandle

 

dep is a ShuffleDependency object; dep.shuffleHandle is declared as ShuffleHandle, and its actual type is BaseShuffleHandle. shuffleHandle is a member of ShuffleDependency and is assigned when the ShuffleDependency is instantiated. The assignment calls HashShuffleManager's registerShuffle method, which takes three arguments: the shuffleId, the number of partitions of the last RDD of the ShuffleMapStage (MappedRDD[3] here), and the ShuffleDependency object itself.

 

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.size, this)

_rdd is a member of ShuffleDependency; it is the RDD passed in when the ShuffledRDD is constructed. Below is ShuffledRDD's getDependencies method: prev is the RDD that the ShuffledRDD depends on, i.e. the _rdd here.

So registerShuffle records the number of partitions of the RDD that the ShuffledRDD depends on.

 

  override def getDependencies: Seq[Dependency[_]] = {
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }

 

3.4 HashShuffleManager's registerShuffle method

 

  /* Register a shuffle with the manager and obtain a handle for it to pass to tasks. */
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int, // note that this is the number of partitions of the map-side RDD
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }

 

3.4.2 About BaseShuffleHandle

 

/**
 * A basic ShuffleHandle implementation that just captures registerShuffle's parameters.
 */
private[spark] class BaseShuffleHandle[K, V, C](
    shuffleId: Int,
    val numMaps: Int,
    val dependency: ShuffleDependency[K, V, C])
  extends ShuffleHandle(shuffleId)

 

/**
 * An opaque handle to a shuffle, used by a ShuffleManager to pass information about it to tasks.
 *
 * @param shuffleId ID of the shuffle
 */
private[spark] abstract class ShuffleHandle(val shuffleId: Int) extends Serializable {}

BaseShuffleHandle is essentially like a case class; note that it is Serializable. As its class comment says, it simply captures the parameters passed to registerShuffle.

 

3.5 manager.getWriter (the manager here is HashShuffleManager)

  /** Get a writer for a given partition. Called on executors by map tasks. */
  override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
      : ShuffleWriter[K, V] = {
    new HashShuffleWriter(
      shuffleBlockManager, handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
  }

getWriter returns a HashShuffleWriter. It is created for a single mapper partition (mapId is the index of that partition) and carries the BaseShuffleHandle (which in turn carries the shuffleId, the total number of mapper partitions, and the ShuffleDependency). The constructor of HashShuffleWriter also takes a shuffleBlockManager object. Since getWriter is defined in HashShuffleManager, shuffleBlockManager is a member of HashShuffleManager, defined as follows. In other words, for hash shuffle the shuffle block manager is a FileShuffleBlockManager; this class defines the on-disk files that a ShuffleMapTask writes to during a hash shuffle, which we will look at shortly.
  override def shuffleBlockManager: FileShuffleBlockManager = {
    fileShuffleBlockManager
  }
3.6 Once the HashShuffleWriter has been instantiated, its write method is called (note that HashShuffleWriter's actual storage backend is FileShuffleBlockManager).

The call to writer.write performs the actual data write:

writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])

The argument to write is the data of one partition as an Iterator; the partition here identifies one of the mapper partitions (its index within the mapper RDD).

/** Write a bunch of records to this task's output */

  
  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    // As seen above, the ShuffleDependency was constructed as:
    //   List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
    // Depending on how dep.aggregator and dep.mapSideCombine are defined, decide whether to combine the partition data by key on the map side
    val iter = 
     if (dep.aggregator.isDefined) {
        if (dep.mapSideCombine) { // aggregator and mapSideCombine are both defined: combine values by key on the map side (for wordcount this is the _ + _ function)
           dep.aggregator.get.combineValuesByKey(records, context) // combines internally using a hash map
        } else { // aggregator is defined but map-side combine is not requested
          records
        }
     }

     else if (dep.aggregator.isEmpty && dep.mapSideCombine) { // mapSideCombine is set but no aggregator is defined: throw
       throw new IllegalStateException("Aggregator is empty for map-side combine")
     } 
     else { // no map-side combine; return the records unchanged
      records
    }

    // Iterate over iter. Does each partition produce a single output file? No: the output file is chosen per key (there are partitioner.numPartitions output files for this map task)
    for (elem <- iter) {
      // Compute the bucketId from the element's key; the key question is whether dep.partitioner is the partitioner of the last RDD before the shuffle or of the first RDD after the shuffle
      val bucketId = dep.partitioner.getPartition(elem._1) // bucketId derived from the key
      // Look up the writer for this bucketId; different bucketIds map to different writers, so each (key, value) pair is written to the file of its bucket
      // writers is the writer array of the shuffle group, indexed by bucketId
      shuffle.writers(bucketId).write(elem)
    }
  }

 

3.7 Aggregator.combineValuesByKey

Aggregator.combineValuesByKey (the map-side combine) is one of the more involved steps. Depending on whether it may spill to disk, it does the combine either with an AppendOnlyMap or with an ExternalAppendOnlyMap; the result is the same in both cases, namely an iterable collection of combined records (this is fairly long, so it is expanded on later).
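To make the idea concrete, here is a minimal, self-contained sketch of what a map-side combine by key does conceptually. It uses a plain mutable.HashMap in place of Spark's AppendOnlyMap/ExternalAppendOnlyMap and ignores spilling entirely; the method name combineValuesByKeySketch is mine, not Spark's.

import scala.collection.mutable

// Simplified stand-in for Aggregator.combineValuesByKey: fold each value into a
// per-key combiner held in a hash map, then hand back an iterator over the result.
// For wordcount, createCombiner is (v: Int) => v and mergeValue is _ + _.
def combineValuesByKeySketch[K, V, C](
    records: Iterator[(K, V)],
    createCombiner: V => C,
    mergeValue: (C, V) => C): Iterator[(K, C)] = {
  val combiners = mutable.HashMap.empty[K, C]
  for ((k, v) <- records) {
    combiners(k) = combiners.get(k) match {
      case Some(c) => mergeValue(c, v)
      case None    => createCombiner(v)
    }
  }
  combiners.iterator
}

// Example: the map side of wordcount turns ("a",1),("b",1),("a",1) into ("a",2),("b",1)
val combined = combineValuesByKeySketch[String, Int, Int](
  Iterator(("a", 1), ("b", 1), ("a", 1)), v => v, _ + _)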

 

3.8 For each element, call dep.partitioner.getPartition(elem._1) to get the bucketId

Is dep.partitioner here the partitioner defined for the last RDD before the shuffle (MappedRDD[3]) or the partitioner defined for the first RDD after the shuffle (ShuffledRDD)?

The partitioner is passed into ShuffleDependency as a constructor argument, and its comment says it is used to partition the shuffle output. Debugging confirms that it is the partitioner defined by the ShuffledRDD, i.e. the first RDD after the shuffle.

In this example, dep.partitioner turns out to be a HashPartitioner with 3 partitions.

It is therefore not hard to see that dep.partitioner.getPartition(elem._1) places elem according to the ShuffledRDD's partitioning scheme, so bucketId is the index of a ShuffledRDD partition.

 

3.8.2 HashPartitioner's getPartition method:

  def getPartition(key: Any): Int = key match {
    case null => 0
    // compute the key's hash modulo numPartitions, using Utils.nonNegativeMod to keep it non-negative
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

 

3.8.3 The Utils.nonNegativeMod method

 

 /* Calculates 'x' modulo 'mod', takes to consideration sign of x,
  * i.e. if 'x' is negative, than 'x' % 'mod' is negative too
  * so function return (x % mod) + mod in that case.
  */
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }
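As a concrete illustration, this is how a key lands in one of the 3 buckets of the wordcount job's HashPartitioner (the helper below just repeats the nonNegativeMod logic shown above; the sample keys are arbitrary):

// numPartitions = 3, matching reduceByKey(_ + _, 3) in the example above
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val numPartitions = 3
val bucketOfHello = nonNegativeMod("hello".hashCode, numPartitions) // some value in 0..2
val bucketOfSpark = nonNegativeMod("spark".hashCode, numPartitions) // some value in 0..2
// Every occurrence of the same key hashes to the same bucket, so all of its
// (key, value) pairs end up in the same shuffle file and the same ResultTask.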

 

3.9 Call shuffle.writers(bucketId) to obtain the writer for the target bucket (the source from which the ResultTask of the corresponding ShuffledRDD partition will later fetch its data), then call write to write the element

 

3.10 First, the definition of the shuffle field in HashShuffleWriter

 

  // shuffleBlockManager is of type FileShuffleBlockManager
  private val shuffle = shuffleBlockManager.forMapTask(dep.shuffleId, mapId, numOutputSplits, ser,
    writeMetrics)

 

3.11 The shuffleBlockManager.forMapTask method

 

/**
   * Get a ShuffleWriterGroup for the given map task, which will register it as complete
   * when the writers are closed successfully
   */
  // mapId is the map-side partitionId; numBuckets is the number of ResultTasks, i.e. the number of ShuffledRDD partitions
  // For each mapId, forMapTask creates numBuckets (= number of reducers) files
  def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
      writeMetrics: ShuffleWriteMetrics) = {
    new ShuffleWriterGroup {
      shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
      private val shuffleState = shuffleStates(shuffleId)
      private var fileGroup: ShuffleFileGroup = null

      val writers: Array[BlockObjectWriter] = if (consolidateShuffleFiles) { // with consolidateShuffleFiles enabled, shuffle outputs are consolidated into file groups
        fileGroup = getUnusedFileGroup()
        Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
          val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
          blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializer, bufferSize,
            writeMetrics)
        }
      } else {
        Array.tabulate[BlockObjectWriter](numBuckets) { bucketId => // build an array of numBuckets BlockObjectWriters
          val blockId = ShuffleBlockId(shuffleId, mapId, bucketId) // a ShuffleBlockId built from shuffleId, mapId and the reducer's partitionId
          // blockManager is of type org.apache.spark.storage.BlockManager, the "manager running on every node
          // (driver and executors) which provides interfaces for putting and retrieving blocks both locally and
          // remotely into various stores (memory, disk, and off-heap)".
          // diskBlockManager is of type DiskBlockManager, which "creates and maintains the logical mapping between
          // logical blocks and physical on-disk locations. By default, one block is mapped to one file with a name
          // given by its BlockId."
          val blockFile = blockManager.diskBlockManager.getFile(blockId) // use the above three pieces of information to obtain a file; there are M*N such files in total
          // Because of previous failures, the shuffle file may already exist on this machine.
          // If so, remove it.
          if (blockFile.exists) {
            if (blockFile.delete()) {
              logInfo(s"Removed existing shuffle file $blockFile")
            } else {
              logWarning(s"Failed to remove existing shuffle file $blockFile")
            }
          }
          // Build a BlockObjectWriter from blockId and blockFile (somewhat redundant, since blockFile already encodes the blockId)
          // bufferSize comes from the spark.shuffle.file.buffer.kb setting in SparkConf (in KB, default 32), the buffer used for writing the file
          blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
        }
      }

Since forMapTask returns an object of type ShuffleWriterGroup, the shuffle field is a ShuffleWriterGroup, and a ShuffleWriterGroup has a writers member.

 

 3.11.1 ShuffleBlockId

This class is like a JavaBean: it has a single name, used to identify the ShuffleBlockId; the reduceId in it is the bucketId passed in above:

name = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId

 

3.11.2   blockManager.diskBlockManager.getFile(blockId)

This returns a File for the given blockId. Note that at this point the file has not been created yet; if a file with this name already exists, it is deleted first.

It calls DiskBlockManager's getFile method:

def getFile(blockId: BlockId): File = getFile(blockId.name)

which delegates to the overloaded getFile(filename):

 

  def getFile(filename: String): File = {
    // Figure out which local directory it hashes to, and which subdirectory in that
    // hash the file name, e.g. "shuffle_0_0_0"
    val hash = Utils.nonNegativeHash(filename)
    // localDirs holds the local directories configured for map output data (spark.local.dir); several can be given, comma-separated.
    // dirId is an index into localDirs, i.e. localDirs(dirId) is the chosen directory.
    // In the wordcount example spark.local.dir is not set, so the java.io.tmpdir directory is used, localDirs has length 1,
    // and dirId is therefore 0.
    val dirId = hash % localDirs.length
    // subDirsPerLocalDir comes from the spark.diskStore.subDirectories setting in SparkConf, default 64.
    // With localDirs.length == 1, subDirId = hash % subDirsPerLocalDir, a number in the range 0..63.
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

    // Create the subdirectory if it doesn't already exist
    // subDirs is a two-dimensional array:
    //   private val subDirs = Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))
    // fill takes n and a fill value, so for every localDir there are subDirsPerLocalDir subdirectory slots.

    // Look up the subdirectory File for dirId and subDirId; on first use it is still null (verified by debugging)
    var subDir = subDirs(dirId)(subDirId)
    
    // the subdirectory does not exist yet
    if (subDir == null) {
      subDir = subDirs(dirId).synchronized { // subDirs(dirId) is a one-dimensional array
        val old = subDirs(dirId)(subDirId) // double-checked locking
        if (old != null) {
          old
        } else {
          val newDir = new File(localDirs(dirId), "%02x".format(subDirId)) // format subDirId as two hex digits
          newDir.mkdir()
          subDirs(dirId)(subDirId) = newDir // store it in the 2-D array
          newDir // becomes the value of subDir
        }
      }
    }
    // Combine the directory and the file name; the file itself is still not created here (File.createNewFile is not called).
    // subDir is e.g. ${java.io.tmpdir}/spark-local-20150219132253-c917/0c (a two-hex-digit subdirectory name)
    new File(subDir, filename)
  }

 

localDirs:

  /* Create one local directory for each path mentioned in spark.local.dir; then, inside this
   * directory, create multiple subdirectories that we will hash files into, in order to avoid
   * having really large inodes at the top level. */
  //Gets or creates the directories listed in spark.local.dir or SPARK_LOCAL_DIRS,
  private[spark] val localDirs: Array[File] = createLocalDirs(conf)
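If you want the map output to land somewhere other than java.io.tmpdir, the directories can be set explicitly via spark.local.dir (a small sketch; the paths below are made up):

import org.apache.spark.SparkConf

// Two comma-separated local directories; shuffle files are hashed across them.
val conf = new SparkConf()
  .set("spark.local.dir", "D:/spark-tmp1,D:/spark-tmp2")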

 

Annoyingly, a local breakpoint cannot step into this code because the source no longer matches the class files, so let's first record the map output actually produced by the wordcount ShuffleMapTasks and work the meaning of the code out backwards from that:

C:\Users\hadoop\AppData\Local\Temp\spark-local-20150219132253-c917>tree /f
Folder PATH listing
Volume serial number is 4E9D-390C
C:.
├─0c
│      shuffle_0_0_0
│
├─0d
│      shuffle_0_0_1
│
├─0e
│      shuffle_0_0_2
│      shuffle_0_2_0
│
├─0f
│      shuffle_0_2_1
│      shuffle_0_3_0
│
├─10
│      shuffle_0_2_2
│      shuffle_0_3_1
│
├─11
│      shuffle_0_3_2
│
├─12
└─13

 

This verifies that localDirs is C:\Users\hadoop\AppData\Local\Temp\spark-local-20150219132253-c917, and the 0c, 0d ... 13 underneath it are the hex-named subdirectories; each local directory has at most 64 of them.
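The directory placement seen above can be reproduced with a small standalone sketch of the same hashing logic (the nonNegativeHash helper below only mimics what Spark's Utils.nonNegativeHash does with the file name; treat the exact hash, and therefore the chosen subdirectory, as an assumption):

// Standalone sketch of DiskBlockManager.getFile's directory selection,
// assuming one local dir and the default 64 subdirectories per local dir.
def nonNegativeHash(s: String): Int = {
  val h = s.hashCode
  if (h == Int.MinValue) 0 else math.abs(h) // avoid abs(Int.MinValue) overflow
}

val localDirsLength = 1       // spark.local.dir not set => a single temp dir
val subDirsPerLocalDir = 64   // default of spark.diskStore.subDirectories

val filename = "shuffle_0_0_0"
val hash = nonNegativeHash(filename)
val dirId = hash % localDirsLength                           // 0, since there is only one local dir
val subDirId = (hash / localDirsLength) % subDirsPerLocalDir // a value in 0..63
val subDirName = "%02x".format(subDirId)                     // e.g. "0c"
println(s"$filename -> localDirs($dirId)/$subDirName/$filename")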

 

3.11.3 Once blockFile has been obtained, the following call returns the writer, whose type is BlockObjectWriter:

blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)

getDiskWriter is implemented as follows:

 

  def getDiskWriter(
      blockId: BlockId,
      file: File,
      serializer: Serializer,
      bufferSize: Int,
      writeMetrics: ShuffleWriteMetrics): BlockObjectWriter = {
    val compressStream: OutputStream => OutputStream = wrapForCompression(blockId, _)
    val syncWrites = conf.getBoolean("spark.shuffle.sync", false)
    new DiskBlockObjectWriter(blockId, file, serializer, bufferSize, compressStream, syncWrites,
      writeMetrics)
  }

So it returns a DiskBlockObjectWriter, with a compression stream wrapped around the output and the given serializer.

This completes the analysis of FileShuffleBlockManager.forMapTask.

 

3.12 shuffle.writers(bucketId) returns the DiskBlockObjectWriter created by FileShuffleBlockManager.forMapTask; its write method is then called:

 

  override def write(value: Any) {
    if (!initialized) {
      open()
    }

    objOut.writeObject(value) // serialize the record into the output stream

    if (writesSinceMetricsUpdate == 32) {
      writesSinceMetricsUpdate = 0
      updateBytesWritten()
    } else {
      writesSinceMetricsUpdate += 1
    }
  }

 

3.13 Once all the data of this RDD partition has been written, control returns to ShuffleMapTask.runTask for the final step:

return writer.stop(success = true).get

Two things happen here: first, the writers opened above are closed (R files were opened during the write); second, the written output must be reported so that MapOutputTrackerMaster can record it.

 /** Close this writer, passing along whether the map completed */
  override def stop(initiallySuccess: Boolean): Option[MapStatus] = {
    var success = initiallySuccess
    try {
      if (stopping) {
        return None
      }
      stopping = true
      if (success) {
        try {
          Some(commitWritesAndBuildStatus())  // build the MapStatus return value, wrapped in Some
        } catch {
          case e: Exception =>
            success = false
            revertWrites()
            throw e
        }
      } else {
        revertWrites()
        None
      }
    } finally {
      // Release the writers back to the shuffle block manager.
      if (shuffle != null && shuffle.writers != null) { // commitWritesAndBuildStatus already closed all of the open writers; why release them again here?
        try {
          shuffle.releaseWriters(success)
        } catch {
          case e: Exception => logError("Failed to release shuffle writers", e)
        }
      }
    }
  }

 

3.13.1 commitWritesAndBuildStatus

 

  private def commitWritesAndBuildStatus(): MapStatus = {
    // Commit the writes. Get the size of each bucket block (total block size).
    // each writer has written some data
    val sizes: Array[Long] = shuffle.writers.map { writer: BlockObjectWriter =>
      writer.commitAndClose() // commit and close the writer
      writer.fileSegment().length // how is the segment length computed? It is the number of bytes this writer wrote
    }
    // sizes is an array: this map task has now produced its output for every reducer (each mapper produces one file per reducer)
    MapStatus(blockManager.shuffleServerId, sizes)
  }

 

3.13.2 DiskBlockObjectWriter's fileSegment() method

  override def fileSegment(): FileSegment = {
    // Three arguments: the file, initialPosition (where this writer's content starts in the file),
    // and finalPosition - initialPosition (the length of the segment). Without consolidation, each segment is a complete map output file.
    new FileSegment(file, initialPosition, finalPosition - initialPosition)
  }

 

3.14 commitWritesAndBuildStatus above returns a MapStatus object, but at this point the task has not yet registered the location of its shuffle output with MapOutputTrackerMaster.

Because the Spark source and the binary package are out of sync, the code cannot be stepped through any further here, so we stop at this point and move on to analyzing the hash based shuffle read.

 

 

The above walked through the source of the hash based shuffle write. One part was not covered: the map-side combine, i.e. Aggregator.combineValuesByKey, which will be written up separately.

Other notes (not covered in the analysis above)

 

The relationship between the requested number of partitions and the actual number of partitions

    conf.set("spark.shuffle.manager", "hash");

1. A textFile call that specifies a minimum number of partitions:

    val rdd = sc.textFile("file:///D:/word.in.3",4); // 4 is the minimum number of partitions

2. The following code is from HadoopRDD.scala. With minPartitions = 4, inputFormat.getSplits returns 5 input splits, so the actual number of partitions is 5:

  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }

 

The relationship between the number of ResultTasks and the number of map partitions:

1. If the number of ResultTasks is not specified, it defaults to the number of map partitions; if a count is given, that many ResultTask instances are created.

package spark.examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "E:\\devsoftware\\hadoop-2.5.2\\hadoop-2.5.2");
    val conf = new SparkConf()
    conf.setAppName("SparkWordCount")
    conf.setMaster("local")
    //Hash based Shuffle;
    conf.set("spark.shuffle.manager", "hash");
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("file:///D:/word.in.3",4); // 4 is the minimum number of partitions
    println(rdd.toDebugString)
    val rdd1 = rdd.flatMap(_.split(" "))
    println("rdd1:" + rdd1.toDebugString)
    val rdd2 = rdd1.map((_, 1))
    println("rdd2:" + rdd2.toDebugString)
    val rdd3 = rdd2.reduceByKey(_ + _, 3); // 3 is the number of ReduceTasks; if omitted, it defaults to the number of map partitions
    println("rdd3:" + rdd3.toDebugString)
    rdd3.saveAsTextFile("file:///D:/wordout" + System.currentTimeMillis());
    sc.stop
  }
}
 

The relationship between the number of map output files produced by hash based shuffle, the number of map partitions, and the number of ReduceTasks

1. The intermediate map output is stored under java.io.tmpdir by default, or under the configured directory if one is specified.

2. If an RDD has N partitions, N ShuffleMapTasks are created.

3. If there is 1 ResultTask, one result file (part-00000) is produced; if there are R ReduceTasks (i.e. ResultTasks), R result files are produced.

4. With M map partitions and N ReduceTasks, how many map output files are produced? M*N. For example:

/tmp/0c/shuffle_0_0_0

/tmp/0d/shuffle_0_0_1

/tmp/0d/shuffle_0_0_2

/tmp/0e/shuffle_0_2_0

/tmp/0f/shuffle_0_2_1

/tmp/0f/shuffle_0_3_0
 

The meaning of the three numbers after "shuffle" in the file names:

  • shuffleId
  • mapId (the partition ID on the map side)
  • reduceId, indicating which ReduceTask will process this bucket; the maximum value is 2 because there are 3 ReduceTasks in total
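For example, a file name such as shuffle_0_2_1 decomposes as follows (the parsing helper is just an illustration, not Spark code):

// Decompose a hash shuffle file name of the form shuffle_<shuffleId>_<mapId>_<reduceId>
def parseShuffleFileName(name: String): (Int, Int, Int) = {
  val Array(_, shuffleId, mapId, reduceId) = name.split("_")
  (shuffleId.toInt, mapId.toInt, reduceId.toInt)
}

// shuffle_0_2_1: shuffle 0, written by map partition 2, to be fetched by ReduceTask 1
val (shuffleId, mapId, reduceId) = parseShuffleFileName("shuffle_0_2_1")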

 

Parallelism

Parallelism means how many cores execute the ReduceTasks, i.e. how many run concurrently. (Besides the local[4] style master URL, there is also a configuration parameter for setting the parallelism; see the sketch after the setMaster excerpt below.) Setting the parallelism does not change the number of shuffle files above.

 

  /**
   * The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
   * run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
   */
  def setMaster(master: String): SparkConf = {
    set("spark.master", master)
  }
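For reference, here is a minimal sketch of the two knobs mentioned above: the local[N] master URL and the spark.default.parallelism setting. spark.default.parallelism only supplies a default partition count for shuffles when no explicit count is given, so with reduceByKey(_ + _, 3) the explicit 3 still wins; this is a sketch, not code from the original post.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ParallelismSketch")
  .setMaster("local[4]")                  // run locally with 4 cores
  .set("spark.default.parallelism", "6")  // default partition count for shuffles without an explicit count

// With reduceByKey(_ + _, 3) the explicit 3 is used, so the map side still
// produces M * 3 shuffle files no matter how many cores run the tasks.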

 

 

The spark.shuffle.consolidateFiles option

Example source code:
package spark.examples.shuffle

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SparkHashShuffleConsolidationFile {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "E:\\devsoftware\\hadoop-2.5.2\\hadoop-2.5.2");
    val conf = new SparkConf()
    conf.setAppName("SparkWordCount")
    conf.setMaster("local[3]")
    //Hash based Shuffle;
    conf.set("spark.shuffle.manager", "hash");

    // enable shuffle file consolidation
    conf.set("spark.shuffle.consolidateFiles", "true");

    val sc = new SparkContext(conf)
    // 10 or more partitions, one Map Task per partition
    // read a 1 MB file
    val rdd = sc.textFile("file:///D:/server.log", 10);
    val rdd1 = rdd.flatMap(_.split(" "))
    val rdd2 = rdd1.map((_, 1))
    // 6 reducers
    val rdd3 = rdd2.reduceByKey(_ + _, 6);
    rdd3.saveAsTextFile("file:///D:/wordcount" + System.currentTimeMillis());

    println(rdd3.toDebugString)
    sc.stop
  }
}
 
The Map Tasks produced 13 directories with the following contents:
C:.
├─00
│      merged_shuffle_0_5_2
│
├─01
│      merged_shuffle_0_4_2
│      merged_shuffle_0_5_1
│
├─02
│      merged_shuffle_0_3_2
│      merged_shuffle_0_4_1
│      merged_shuffle_0_5_0
│
├─03
│      merged_shuffle_0_2_2
│      merged_shuffle_0_3_1
│      merged_shuffle_0_4_0
│
├─04
│      merged_shuffle_0_1_2
│      merged_shuffle_0_2_1
│      merged_shuffle_0_3_0
│
├─05
│      merged_shuffle_0_0_2
│      merged_shuffle_0_1_1
│      merged_shuffle_0_2_0
│
├─06
│      merged_shuffle_0_0_1
│      merged_shuffle_0_1_0
│
├─07
│      merged_shuffle_0_0_0
│
├─0c
├─0d
├─0e
├─11
└─13

1. The result shows 6 mappers, 3 reducers and 18 files in total; why?

2. Every file name carries a "merged" prefix; what does that mean?

Increasing the size of the input file gives the same result. Why are there only 6 mappers, and why only 3 reducers (is the 3 related to the parallelism)?
 
 

 

 

 

 

 

 

 
