updateStateByKey与mapwithstate怎么实现

159次阅读

共计 2598 个字符，预计需要花费 7 分钟才能阅读完成。

这篇文章主要讲解了“updateStateByKey 与 mapwithstate 怎么实现”，文中的讲解内容简单清晰，易于学习与理解，下面请大家跟着丸趣 TV 小编的思路慢慢深入，一起来研究和学习“updateStateByKey 与 mapwithstate 怎么实现”吧！

updateStateByKey 与 mapwithstate 这两个方法在 Dstream 中是找不到的，他们是通过隐式转换来进行实现的

由此可以看到，最终是通过 PairDStreamFunctions 来实现这两个方法的。

updateStateByKey

newUpdateFunc 方法是在原有基础上如何进行更新的方法

defaultPartitioner() 获得默认的分区数

如下代码出现了一个非常关键的地方

new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)

StateDStream 继承自 Dstream。

stateDStream 自会持久化到内存中

里面有一个很总要的方法：如果存在 parent RDD 就将执行 computeUsingPreviousRDD 方法

在该方法中，有一处性能瓶颈的代码

每次进行更新的时候都会将原有的 parentRDD 进行 cogroup，这样程序不断的运行这样会导致越来越慢！尽量少用改方法！

Mapwithstate

mapWithState 方法的返回值是 MapWithStateDStream，我们来看看它的实现类

MapWithStateDStreamImpl

最终返回 InternalMapWithStateDStream

跟 updateStateByKey 一样是持久化在了内存中

persist(StorageLevel.MEMORY_ONLY)

接下来看看每个继承自 Dstream 的最重要的方法 compute：

最终操作的是 RDD：MapWithStateRDD

RDD 中的 partition 被 MapWithStateRDDRecord 代表

MapWithStateRDDRecord 有伴生对象：中的方法，该方法是对 state 进行更新操作，不像 updateStateByKey 每次都会进 cogroup 的操作，而是在原有的基础上进行更新，效率得到了提高！

def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
dataIterator: Iterator[(K, V)],
mappingFunction: (Time, K, Option[V], State[S]) = Option[E],
batchTime: Time,
timeoutThresholdTime: Option[Long],
removeTimedoutData: Boolean
): MapWithStateRDDRecord[K, S, E] = {
// Create a new state map by cloning the previous one (if it exists) or by creating an empty one
val newStateMap = prevRecord.map {_.stateMap.copy() }. getOrElse {new EmptyStateMap[K, S]()}

val mappedData = new ArrayBuffer[E]
val wrappedState = new StateImpl[S]()

// Call the mapping function on each record in the data iterator, and accordingly
// update the states touched, and collect the data returned by the mapping function
dataIterator.foreach {case (key, value) =
wrappedState.wrap(newStateMap.get(key))
val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
if (wrappedState.isRemoved) {
newStateMap.remove(key)
} else if (wrappedState.isUpdated
|| (wrappedState.exists timeoutThresholdTime.isDefined)) {
newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
}
mappedData ++= returned
}

// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData timeoutThresholdTime.isDefined) {
newStateMap.getByTime(timeoutThresholdTime.get).foreach {case (key, state, _) =
wrappedState.wrapTimingOutState(state)
val returned = mappingFunction(batchTime, key, None, wrappedState)
mappedData ++= returned
newStateMap.remove(key)
}
}

MapWithStateRDDRecord(newStateMap, mappedData)
}
}

感谢各位的阅读，以上就是“updateStateByKey 与 mapwithstate 怎么实现”的内容了，经过本文的学习后，相信大家对 updateStateByKey 与 mapwithstate 怎么实现这一问题有了更深刻的体会，具体使用情况还需要大家实践验证。这里是丸趣 TV，丸趣 TV 小编将为大家推送更多相关知识点的文章，欢迎关注！

正文完