How to find the memory usage of a PySpark dataframe?
For a Python dataframe, the info() function reports memory usage. Is there any equivalent in PySpark? Thanks.
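For reference, this is the pandas side the question is comparing against — a minimal sketch (the example frame is made up for illustration):

```python
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({"id": range(1000), "label": ["row"] * 1000})

# info() prints per-column dtypes plus a memory-usage summary;
# deep=True also counts the Python objects behind object columns
df.info(memory_usage="deep")

# memory_usage() returns the per-column byte counts as a Series
total_bytes = df.memory_usage(deep=True).sum()
print(total_bytes)
```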
Try using the following trick:
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

# Helper function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It will convert each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# df is the DataFrame whose size you want to estimate.
# First you have to convert it to a JavaRDD:
JavaObj = _to_java_object_rdd(df.rdd)

# Now we can run the estimator (sc is your SparkContext):
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
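The estimator returns a plain byte count. A small helper (hypothetical, not part of Spark) can make that number readable:

```python
def human_bytes(n):
    # Convert a raw byte count into a human-readable string
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"

print(human_bytes(123456789))  # -> 117.7 MB
```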
Thanks @MaxU ... is there a simpler way to do this? I don't understand most of this program. – Neo
http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/ – MaxU
@MaxU what is the unit of the memory usage reported by this program? – Neo
[Bytes](https://spark.apache.org/docs/2.0.0/api/java/index.html?2.0//apache/spark/util/SizeEstimator.html) – MaxU