I want to run PySpark on an Amazon EMR instance to read data from DynamoDB, and I would like to know how to set the number of splits and workers in my code. How do I set the number of splits and reducers in PySpark?
I followed the instructions in the following two documents to arrive at the code below, which currently connects to DynamoDB and reads data: connecting to dynamoDB from pyspark and the Pyspark documentation.
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "Table1",
    "dynamodb.endpoint": "https://dynamodb.us-east-1.amazonaws.com",
    "dynamodb.regionid": "us-east-1",
    "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    "mapred.output.format.class": "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat",
}

orders = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)
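With this input format, the number of input splits is driven by the Hadoop job configuration passed to `hadoopRDD`, not by Spark-side settings. A hedged sketch of the same conf dict extended with split-related keys; the two extra key names are assumptions based on the emr-dynamodb-connector and should be verified against the connector version installed on the EMR cluster:

```python
# Sketch only: key names for throughput/split tuning are assumptions taken
# from the emr-dynamodb-connector; verify them for your connector version.
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "Table1",
    "dynamodb.endpoint": "https://dynamodb.us-east-1.amazonaws.com",
    "dynamodb.regionid": "us-east-1",
    "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    "mapred.output.format.class": "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat",
    # Fraction of the table's provisioned read capacity the job may consume
    # (assumption: connector derives split/task sizing partly from this).
    "dynamodb.throughput.read.percent": "0.5",
    # Upper bound on concurrent map tasks reading the table (assumption).
    "dynamodb.max.map.tasks": "8",
}
```

Since the connector computes splits from the table's provisioned throughput, raising `spark.default.parallelism` alone would not be expected to change the split count.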
I tried changing the instance count and the SparkConf parallelism values, but I don't know how they affect the SparkContext variable:
SparkConf().set('spark.executor.instances','4')
SparkConf().set('spark.default.parallelism', '128')
to set the splits and reducers, but it doesn't seem to change anything.
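One likely reason the settings above have no effect: each `SparkConf().set(...)` call creates a fresh SparkConf object that is thrown away and never handed to the SparkContext, and these properties must be in place when the context is created. A minimal sketch of building the conf first and then creating (or reusing) the context with it:

```python
from pyspark import SparkConf, SparkContext

# Build the conf before the context exists; settings applied to a SparkConf
# that is never passed to a SparkContext have no effect.
spark_conf = (
    SparkConf()
    .set("spark.executor.instances", "4")
    .set("spark.default.parallelism", "128")
)

# getOrCreate only applies this conf if no SparkContext is already running.
sc = SparkContext.getOrCreate(conf=spark_conf)
```

Note that if a SparkContext already exists (as it often does in a notebook or `pyspark` shell on EMR), `getOrCreate` returns the existing context and the new conf is ignored; the settings would then need to go in `spark-submit --conf` arguments or `spark-defaults.conf` instead.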