2017-07-02 42 views
0

我有下面的火花數據幀,我試圖按列值拆分它,並返回一個包含x行數的新數據幀爲每列值按列值拆分火花數據幀,並在結果中獲得每列值x的行數

假設這是數據幀我有:

from pyspark import *; 
from pyspark.sql import *; 
from pyspark.sql.functions import udf 
from pyspark.sql.types import StringType, StructType, StructField, IntegerType, DoubleType 
import math; 

sc = SparkContext.getOrCreate(); 
spark = SparkSession.builder.master('local').getOrCreate(); 


schema = StructType([ 
    StructField("INDEX", IntegerType(), True), 
    StructField("SYMBOL", StringType(), True), 
    StructField("DATETIMETS", StringType(), True), 
    StructField("PRICE", DoubleType(), True), 
    StructField("SIZE", IntegerType(), True), 
]) 

df = spark\ 
    .createDataFrame(
     data=[(0,'A','2002-12-01 9:30:20',19.75,30200), 
      (1,'A','2002-12-02 9:31:20',29.75,30200),    
      (2,'A','2004-12-03 10:36:20',3.0,30200), 
      (3,'A','2006-12-06 22:41:20',24.0,30200), 
      (4,'A','2006-12-08 22:42:20',60.0,30200), 
      (5,'B','2002-12-09 9:30:20',15.75,30200), 
      (6,'B','2002-12-12 9:31:20',49.75,30200),    
      (7,'C','2004-11-02 10:36:20',6.0,30200), 
      (8,'C','2007-12-02 22:41:20',50.0,30200), 
      (9,'D','2008-12-02 22:42:20',60.0,30200), 
      (10,'E','2052-12-02 9:30:20',14.75,30200), 
      (11,'A','2062-12-02 9:31:20',12.75,30200),    
      (12,'A','2007-12-02 11:36:20',5.0,30200), 
      (13,'A','2008-12-02 22:41:20',40.0,30200), 
      (14,'A','2008-12-02 22:42:20',50.0,30200)], 
     schema=schema); 

說我想在每個符號最多兩行,即創建一個具有以下數據的新數據幀。

Resulting dataframe

有沒有辦法比,雖然循環使用「其中」條款的符號每個數據集做這個其他?

回答

1

這裏是一種選擇從每個SYMBOL服用前兩行:

df.rdd.groupBy(lambda r: r['SYMBOL']).flatMap(lambda x: list(x[1])[:2]).toDF().show() 

+-----+------+-------------------+-----+-----+ 
|INDEX|SYMBOL|   DATETIMETS|PRICE| SIZE| 
+-----+------+-------------------+-----+-----+ 
| 0|  A| 2002-12-01 9:30:20|19.75|30200| 
| 1|  A| 2002-12-02 9:31:20|29.75|30200| 
| 10|  E| 2052-12-02 9:30:20|14.75|30200| 
| 9|  D|2008-12-02 22:42:20| 60.0|30200| 
| 7|  C|2004-11-02 10:36:20| 6.0|30200| 
| 8|  C|2007-12-02 22:41:20| 50.0|30200| 
| 5|  B| 2002-12-09 9:30:20|15.75|30200| 
| 6|  B| 2002-12-12 9:31:20|49.75|30200| 
+-----+------+-------------------+-----+-----+