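One way to do this is to write a wrapper script that takes a range of task indices and spawns each task as a separate script.
From your snippet, it looks like you want to run 2 instances of the script on each compute node, for a total of 8. So, in your job script, you could do: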
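# Launch one wrapper per compute node; each wrapper handles 2 task indices.
for ((i=0; i<8; i+=2)); do
    aprun -n 1 ./wrapper.sh "$i" 2 &
done
# Block until every backgrounded aprun has finished.
wait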
Then in the wrapper you could do this (where $j is passed along to give each instance a unique index):
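# $1 = first task index, $2 = number of tasks this wrapper runs
end=$(($1 + $2))
for ((j=$1; j<end; j+=1)); do
    ./examplebashscript.sh "$j" &
done
# Keep the wrapper alive until all of its tasks exit.
wait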
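If you want to check the index arithmetic before submitting anything, here is a small dry-run sketch. It launches nothing and never calls aprun; it only prints which indices each wrapper call would cover. The function names wrapper_indices and plan are made up for this illustration and are not part of the scripts above.

```shell
#!/bin/bash
# Dry run of the index split used by the job script and wrapper.
# Nothing is launched here; wrapper_indices and plan are hypothetical helpers.

# Print the task indices one "./wrapper.sh START COUNT" call would cover.
wrapper_indices() {
    local start=$1 count=$2 j
    local -a idx=()
    for ((j = start; j < start + count; j += 1)); do
        idx+=("$j")
    done
    echo "${idx[*]}"
}

# Mirror the job-script loop: step through TOTAL tasks, PER_NODE at a time.
plan() {
    local total=$1 per_node=$2 i
    for ((i = 0; i < total; i += per_node)); do
        printf 'wrapper %d %d -> ' "$i" "$per_node"
        wrapper_indices "$i" "$per_node"
    done
}

plan 8 2
```

With 8 tasks and 2 per node this prints four lines, one per aprun launch, so you can confirm that no index is skipped or handed out twice.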
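You can also set the following environment variables to see where the individual processes and threads are placed. You need to set these in your shell (or job script) before running "aprun":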
export MPICH_CPUMASK_DISPLAY=1
export MPICH_RANK_REORDER_DISPLAY=1
For example, running:
aprun -n 24 ./examplebashscript.sh
(which is shorthand for):
aprun -n 24 -N 24 -S 12 -d 1 ./examplebashscript.sh
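will give output like the following on STDERR (note that this is on an XC30 with two 12-core Intel Ivy Bridge processors per compute node, so because of hyperthreading the mask shows 48 logical cores per node):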
[PE_0]: MPI rank order: Using default aprun rank ordering.
[PE_0]: rank 0 is on nid02749
[PE_0]: rank 1 is on nid02749
[PE_0]: rank 2 is on nid02749
[PE_0]: rank 3 is on nid02749
[PE_0]: rank 4 is on nid02749
[PE_0]: rank 5 is on nid02749
[PE_0]: rank 6 is on nid02749
[PE_0]: rank 7 is on nid02749
[PE_0]: rank 8 is on nid02749
[PE_0]: rank 9 is on nid02749
[PE_0]: rank 10 is on nid02749
[PE_0]: rank 11 is on nid02749
[PE_0]: rank 12 is on nid02749
[PE_0]: rank 13 is on nid02749
[PE_0]: rank 14 is on nid02749
[PE_0]: rank 15 is on nid02749
[PE_0]: rank 16 is on nid02749
[PE_0]: rank 17 is on nid02749
[PE_0]: rank 18 is on nid02749
[PE_0]: rank 19 is on nid02749
[PE_0]: rank 20 is on nid02749
[PE_0]: rank 21 is on nid02749
[PE_0]: rank 22 is on nid02749
[PE_0]: rank 23 is on nid02749
[PE_23]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000100000000000000000000000
[PE_22]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000010000000000000000000000
[PE_21]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000001000000000000000000000
[PE_0]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000001
[PE_20]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000100000000000000000000
[PE_9]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000001000000000
[PE_11]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000100000000000
[PE_10]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000010000000000
[PE_8]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000100000000
[PE_1]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000010
[PE_2]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000100
[PE_18]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000001000000000000000000
[PE_7]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000010000000
[PE_15]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000001000000000000000
[PE_3]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000001000
[PE_6]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000001000000
[PE_16]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000010000000000000000
[PE_14]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000100000000000000
[PE_13]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000010000000000000
[PE_12]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000001000000000000
[PE_4]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000010000
[PE_5]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000100000
[PE_17]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000100000000000000000
[PE_19]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000010000000000000000000
You may be able to capture this output in some way and use it.
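As a rough sketch of one way to capture it: redirect aprun's STDERR to a file with the shell's 2> redirection, then turn each cpumask line into a "PE node CPU-list" triple. The function name parse_cpumask and the file name placement.log are made up for illustration. The sketch also assumes that the rightmost character of the mask is CPU 0, which is consistent with the PE_0 line above but is not confirmed by anything quoted here.

```shell
#!/bin/bash
# Hypothetical helper: summarise "[PE_n]: cpumask set to ..." lines.
# ASSUMPTION: the rightmost mask character corresponds to CPU 0.
parse_cpumask() {
    awk '/cpumask set to/ {
        pe = $1;   gsub(/[^0-9]/, "", pe)    # "[PE_23]:"  -> "23"
        node = $8; sub(/,$/, "", node)       # "nid02749," -> "nid02749"
        mask = $NF
        n = length(mask); cpus = ""
        # Walk the mask from right to left, collecting the set bits.
        for (k = n; k >= 1; k--)
            if (substr(mask, k, 1) == "1")
                cpus = cpus (cpus == "" ? "" : ",") (n - k)
        print pe, node, cpus
    }' "$@"
}

# In a real job the log would come from something like:
#   aprun -n 24 ./examplebashscript.sh 2> placement.log
#   parse_cpumask placement.log | sort -n
# Here two lines from the output above stand in for the log.
parse_cpumask <<'EOF' | sort -n
[PE_23]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000100000000000000000000000
[PE_0]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000001
EOF
```

Lines that are not cpumask lines (the "rank N is on ..." lines, for instance) are ignored, so the whole STDERR capture can be fed in unchanged.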
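I'm not at all familiar with 'aprun', and you're right: from a quick look, the documentation isn't very good. One thing I would try, though, is dumping the environment to a file with 'env' and checking whether this information is passed in through environment variables. You could use something like 'env > $(hostname)-$$.env' to write a file named after the hostname and PID of the running process, which will hopefully give you a different file for each invocation. – 2015-03-13 18:48:04
I just tried that, and unfortunately I don't see anything close to what I need. There are some SLURM variables (such as SLURM_NNODES and SLURM_JOBID), but they are identical across all the jobs. So I still need someone to shed some light on how to run unique jobs with aprun. – user4668442 2015-03-13 19:14:19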