CUDA pinned memory flushing from the device

CUDA 5, device capability 3.5, VS 2012, 64-bit Win 2012 Server.

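There is no shared memory access between threads; each thread is independent.

I am using pinned memory with zero-copy. On the host, I can read the pinned memory written by the device only after I issue cudaDeviceSynchronize on the host.

I want to be able to:

Flush into the pinned memory as soon as the device has updated it.

Not block the device thread (possibly by copying asynchronously).

I tried calling __threadfence_system and __threadfence after each device write, but that did not flush.

Below is a complete CUDA sample that demonstrates my problem: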

#include <conio.h> 
#include <cstdio> 
#include "cuda.h" 
#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 

__global__ void Kernel(volatile float* hResult) 
{ 
    int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    printf("Kernel %u: Before Writing in Kernel\n", tid); 
    hResult[tid] = tid + 1; 
    __threadfence_system(); 
    // expecting that the data is getting flushed to host here! 
    printf("Kernel %u: After Writing in Kernel\n", tid); 
    // time waster for-loop (sleep) 
    for (int timeWater = 0; timeWater < 100000000; timeWater++); 
} 

void main() 
{ 
    size_t blocks = 2; 
    volatile float* hResult; 
    cudaHostAlloc((void**)&hResult,blocks*sizeof(float),cudaHostAllocMapped); 
    Kernel<<<1,blocks>>>(hResult); 
    int filledElementsCounter = 0; 
    // naive thread implementation that can be implemented using 
    // another host thread 
    while (filledElementsCounter < blocks) 
    { 
        // blocks until the value changes, this moves sequentially 
        // while threads have no order (fine for this sample). 
        while(hResult[filledElementsCounter] == 0); 
        printf("%f\n", hResult[filledElementsCounter]); 
        filledElementsCounter++; 
    } 
    cudaFreeHost((void *)hResult); 
    system("pause"); 
}

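As it stands, this sample waits indefinitely, because nothing is read from the device unless I issue cudaDeviceSynchronize. The sample below works, but it is not what I want, since it defeats the purpose of asynchronous copying: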

void main() 
{ 
    size_t blocks = 2; 
    volatile float* hResult; 
    cudaHostAlloc((void**)&hResult, blocks*sizeof(float), cudaHostAllocMapped); 
    Kernel<<<1,blocks>>>(hResult); 
    cudaError_t error = cudaDeviceSynchronize(); 
    if (error != cudaSuccess) { throw; } 
    for(int i = 0; i < blocks; i++) 
    { 
        printf("%f\n", hResult[i]); 
    } 
    cudaFreeHost((void *)hResult); 
    system("pause"); 
}


2013-05-07 Adam

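Did you solve this problem? Did you try using dynamic parallelism to write the data to the CPU host memory, by calling 'cudaMemcpyAsync(uva_host_ptr, device_ptr, size);' inside the kernel function, as shown in the following link: http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA-Programming- Model.pdf – Alex 2013-10-13 21:34:50

You cannot pass a host pointer directly to the kernel. If you allocate host memory using cudaHostAlloc with the cudaHostAllocMapped flag, you first have to retrieve the device pointer of the mapped host memory before you can use it in the kernel. Use cudaHostGetDevicePointer to get the device pointer of mapped host memory.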

float* hResult, *dResult; 
cudaHostAlloc((void**)&hResult, blocks*sizeof(float), cudaHostAllocMapped); 
// the third argument (flags) is required and must be 0 
cudaHostGetDevicePointer((void**)&dResult, hResult, 0); 
Kernel<<<1,blocks>>>(dResult);
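Whether the raw host pointer happens to work in a kernel depends on unified virtual addressing: on devices that report it, a buffer from cudaHostAlloc has the same address on host and device. The following is only a sketch under assumptions: the helper name mappedDevicePointer is made up for illustration and is not part of the CUDA API. It reads the canMapHostMemory and unifiedAddressing fields of cudaDeviceProp and then asks the runtime for the device pointer, which is valid in either case.

```cuda
#include <cstdio>
#include "cuda_runtime.h"

// Hypothetical helper (not a CUDA API function): returns a pointer that a
// kernel may dereference for a buffer allocated with cudaHostAllocMapped,
// or NULL if the device cannot map host memory or a call fails.
float* mappedDevicePointer(float* hostPtr, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return NULL;

    if (!prop.canMapHostMemory) {
        printf("Device %d cannot map host memory\n", device);
        return NULL;
    }

    // Under unified virtual addressing the host and device pointers of a
    // cudaHostAlloc buffer are identical; without it they may differ.
    printf("unifiedAddressing = %d\n", prop.unifiedAddressing);

    float* devPtr = NULL;
    // The flags argument must be 0.
    if (cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0) != cudaSuccess)
        return NULL;
    return devPtr;
}
```

A caller would pass the pointer returned by cudaHostAlloc and launch the kernel with the result only if it is not NULL.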


2013-05-07 11:16:04 sgarizvi

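When you say "you cannot pass", do you mean that as a fix for my flushing problem, or do you mean it in general? Because when I replace my while loop with 'cudaDeviceSynchronize', I can access the data in hResult without using 'cudaMemcpy'. I still cannot see how your suggested solution addresses the flushing problem. Do I keep issuing 'cudaMemcpyAsync' on dResult until I find something in it? – Adam 2013-05-07 11:25:23

Actually, I was pointing out a general mistake that would cause undefined behavior. The flushing problem may be caused by the 'printf' statements in the kernel, because 'printf' in a kernel dumps its output after the kernel has finished executing. – sgarizvi 2013-05-07 11:29:57

I have added another sample to the question, one that works but is synchronous. Are you telling me that the second sample has undefined behavior? It works, even if I remove the kernel 'printf'. – Adam 2013-05-07 11:47:13

Calling __threadfence_system() ensures that the write is visible to the system before the thread proceeds, but your CPU will be caching the h_result variable, so you are just spinning on the old value in an infinite loop. Try marking h_result as volatile.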
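The polling pattern described above can be sketched as follows. This is an illustration under assumptions, not the asker's code: the names publishKernel and pollMappedResults are hypothetical, the buffer is zeroed explicitly because cudaHostAlloc does not clear memory, and a timeout is added so that the host does not hang if the writes never become visible while the kernel runs (which is what the asker reports on his system).

```cuda
#include <cstdio>
#include <ctime>
#include "cuda_runtime.h"

// Each thread publishes one value into mapped host memory, fences, and then
// busy-waits so that the kernel is still running while the host polls.
__global__ void publishKernel(volatile float* result, long long spinCycles)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    result[tid] = tid + 1.0f;
    __threadfence_system();  // order the write system-wide before continuing
    long long start = clock64();
    while (clock64() - start < spinCycles) { }
}

// Hypothetical helper: launches the kernel and polls the mapped buffer.
// Returns how many slots were observed non-zero before the timeout,
// or -1 on a CUDA error.
int pollMappedResults(int threads, double timeoutSeconds)
{
    volatile float* hResult = NULL;  // volatile: re-read memory on every poll
    if (cudaHostAlloc((void**)&hResult, threads * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1;
    for (int i = 0; i < threads; ++i)
        hResult[i] = 0.0f;           // cudaHostAlloc does not zero the buffer

    float* dResult = NULL;
    cudaError_t err = cudaHostGetDevicePointer((void**)&dResult, (void*)hResult, 0);
    if (err == cudaSuccess) {
        publishKernel<<<1, threads>>>(dResult, 200000000LL);
        err = cudaGetLastError();    // catches launch failures
    }
    if (err != cudaSuccess) {
        cudaFreeHost((void*)hResult);
        return -1;
    }

    int filled = 0;
    std::clock_t start = std::clock();  // processor time; close to wall time in a busy loop
    while (filled < threads) {
        if (hResult[filled] != 0.0f) {
            printf("slot %d = %f\n", filled, hResult[filled]);
            ++filled;
        } else if (double(std::clock() - start) / CLOCKS_PER_SEC > timeoutSeconds) {
            break;                      // give up instead of spinning forever
        }
    }

    cudaDeviceSynchronize();            // let the kernel finish before freeing
    cudaFreeHost((void*)hResult);
    return filled;
}
```

Calling pollMappedResults(2, 5.0) mirrors the two-thread sample in the question. A return value smaller than the thread count means the values did not show up on the host within the timeout.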


2013-05-07 11:58:35 Tom

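I have updated the sample above and added __threadfence_system() and volatile, since adding volatile is a good idea. However, I am still blocked and cannot read anything. – Adam 2013-05-07 12:33:37

I played around with your code on CentOS 6.2 with CUDA 5.5 and a Tesla M2090 and can conclude this:

The fact that it does not work on your system must be a driver issue, and I suggest that you get the TCC driver.

I have attached my code, which runs fine and does what you want. The values appear on the host side before the kernel ends. As you can see, I added some compute code to prevent the for loop from being removed by compiler optimizations. I also added a stream and a callback that is executed after all work in the stream has finished. The program prints 1 and 2, then does nothing for a long time until stream finished... is printed to the console.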

#include <iostream> 
#include "cuda.h" 
#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 

#define SEC_CUDA_CALL(val)   checkCall ((val), #val, __FILE__, __LINE__) 

bool checkCall(cudaError_t result, char const* const func, const char *const file, int const line) 
{ 
    if (result != cudaSuccess) 
    { 
      std::cout << "CUDA (runtime api) error: " << func << " failed! " << cudaGetErrorString(result) << " (" << result << ") " << file << ":" << line << std::endl; 
    } 
    return result != cudaSuccess; 
} 

class Callback 
{ 
public: 
    static void CUDART_CB dispatch(cudaStream_t stream, cudaError_t status, void *userData); 

private: 
    void call(); 
}; 

void CUDART_CB Callback::dispatch(cudaStream_t stream, cudaError_t status, void *userData) 
{ 
    Callback* cb = (Callback*) userData; 
    cb->call(); 
} 

void Callback::call() 
{ 
    std::cout << "stream finished..." << std::endl; 
} 



__global__ void Kernel(volatile float* hResult) 
{ 
    int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    hResult[tid] = tid + 1; 
    __threadfence_system(); 
    float A = 0; 
    for (int timeWater = 0; timeWater < 100000000; timeWater++) 
    { 
        A = sin(cos(log(hResult[0] * hResult[1]))) + A; 
        A = sqrt(A); 
    } 
} 

int main(int argc, char* argv[]) 
{ 
    size_t blocks = 2; 
    volatile float* hResult; 
    SEC_CUDA_CALL(cudaHostAlloc((void**)&hResult,blocks*sizeof(float),cudaHostAllocMapped)); 

    cudaStream_t stream; 
    SEC_CUDA_CALL(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking)); 
    Callback obj; 
    Kernel<<<1,blocks,0,stream>>>(hResult); // third launch parameter is the dynamic shared memory size in bytes 
    SEC_CUDA_CALL(cudaStreamAddCallback(stream, Callback::dispatch, &obj, 0)); 

    int filledElementsCounter = 0; 

    while (filledElementsCounter < blocks) 
    { 
        while(hResult[filledElementsCounter] == 0); 
        std::cout << hResult[filledElementsCounter] << std::endl; 
        filledElementsCounter++; 
    } 

    SEC_CUDA_CALL(cudaStreamDestroy(stream)); 
    SEC_CUDA_CALL(cudaFreeHost((void *)hResult)); 
}

No call returned an error, and cuda-memcheck did not find any problems. This works as intended. You should really try the TCC driver.
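To see which driver model a device is currently using, the runtime exposes a tccDriver field in cudaDeviceProp. A minimal sketch, assuming device index 0 and a helper name (usesTccDriver) invented for this example:

```cuda
#include <cstdio>
#include "cuda_runtime.h"

// Hypothetical helper: returns 1 if the device runs under the TCC driver,
// 0 if it does not, and -1 if the device properties could not be queried.
int usesTccDriver(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    printf("%s: tccDriver = %d\n", prop.name, prop.tccDriver);
    return prop.tccDriver ? 1 : 0;
}
```

Calling usesTccDriver(0) before running the polling sample shows whether the suggestion above has already been applied on a given machine.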


2013-10-15 12:27:10

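Thanks! But perhaps you meant 'Kernel<<<1, threads' instead of 'Kernel<<<1, blocks'? I can download the TCC driver from http://www.nvidia.com/object/software-for-tesla-products.html. But can I use it with an nVidia Quadro Mobile? Or what do I have to use to solve this problem on an nVidia **Quadro (Kepler GK107/GK106)** GPU? – Alex 2013-10-15 14:43:28

The code is copied from the original question, but yes, the second parameter is the number of threads. I do not know what to expect from the TCC driver, but I think it also works for Quadros. Take a look here: http://stackoverflow.com/questions/19098650/does-the-cuda-tcc-driver-work-with-geforce-cards-on-windows – 2013-10-15 20:25:49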
