2017-09-14

For a pandas DataFrame, the info() function reports memory usage. Is there any equivalent in PySpark? How can I find a PySpark DataFrame's memory usage? Thanks.
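
For context, this is what the pandas call being referenced looks like (a minimal sketch; the example data is invented purely for illustration):

import pandas as pd

# Toy data, just to show what info() reports
pdf = pd.DataFrame({"id": range(1000), "name": ["x"] * 1000})

# info() prints the schema plus an approximate memory footprint;
# memory_usage="deep" also counts the payload of object/string columns
pdf.info(memory_usage="deep")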


http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/ – MaxU


@MaxU What unit is the memory usage in, in that program? – Neo


[Bytes](https://spark.apache.org/docs/2.0.0/api/java/index.html?org/apache/spark/util/SizeEstimator.html) – MaxU

Answer


Try the following trick:

from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

# df is the DataFrame whose size you want to estimate
# sc is the active SparkContext (e.g. sc = df.rdd.context)

# Helper function to convert the Python objects in an RDD to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Each Python object is converted into a Java object by Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# First convert the DataFrame to an RDD of Java objects
java_obj = _to_java_object_rdd(df.rdd)

# Now run the estimator; the result is in bytes
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
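
SizeEstimator.estimate returns a size in bytes. As a small follow-up sketch (assuming df and sc are defined as above), you can print the result in a friendlier unit:

size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
print("Estimated size: %.1f MiB" % (size_bytes / 1024.0 ** 2))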

Thanks @MaxU. Is there an easier way to do this? I don't understand most of this program. – Neo
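
A possibly simpler route, offered as a sketch rather than as part of the answer above: cache the DataFrame and let Spark report the size it actually uses. This goes through getRDDStorageInfo, a developer API on the underlying Scala SparkContext, so treat the exact accessors as an assumption about your Spark version:

df.cache()
df.count()  # force an action so the cache is actually populated

# getRDDStorageInfo() lists the RDDs Spark currently holds in storage
for info in sc._jsc.sc().getRDDStorageInfo():
    print(info.name(), info.memSize(), "bytes in memory")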