Search Results

Search found 244 results on 10 pages for 'cuda'.

Page 3/10 | < Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >

  • NVIDIA CUDA SDK Examples Compilation Unsupported Architecture 'computer_20'

    - by Andrew Bolster
    On compilation of the CUDA SDK, I'm getting a nvcc fatal : Unsupported gpu architecture 'compute_20' My toolkit is 2.3 and on a shared system (i.e cant really upgrade) and the driver version is also 2.3, running on 4 Tesla C1060s If it helps, the problem is being called in radixsort. It appears that a few people online have had this problem but i havent found anywhere that actually gives a solution.

    Read the article

  • Counting FLOPS/GFLOPS in program - CUDA

    - by msx
    Already finished my application which multiplies CRS matrix and vector (SpMV) and the only thing to do now is to count FLOPS my application did. In my opinion it's really hard to estimate number of floating point operation in case of sparse matrix - vector multiplication, because the number of multiplies in one row is really "jumpy" or fluent. I only tried to measure time using "cudaprof" ( available in ./CUDA/bin directory) - it works fine. Any sugestions and instruction pastes appreciated !

    Read the article

  • CUDA SDK compilation error

    - by ZeroDivide
    I am in the process of setting up a CUDA workstation. Platform specs: Intel Core 2 Duo Nvidia GTX 280 Fedora 10 GCC version 4.3.2 I have installed the developer driver, toolkit, and the SDK. When I try to compile the SDK example code I get the following errors: make[1]: * [obj/i386/release/cutil.cpp.o] Error 1 make: * [lib/libcutil.so] Error 2 I think this means that I am missing a library file but I'm not sure.

    Read the article

  • thread management in nbody code of cuda-sdk

    - by xnov
    When I read the nbody code in Cuda-SDK, I went through some lines in the code and I found that it is a little bit different than their paper in GPUGems3 "Fast N-Body Simulation with CUDA". My questions are: First, why the blockIdx.x is still involved in loading memory from global to share memory as written in the following code? for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++) { sharedPos[threadIdx.x+blockDim.x*threadIdx.y] = multithreadBodies ? positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : //this line positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x]; //this line __syncthreads(); // This is the "tile_calculation" function from the GPUG3 article. acc = gravitation(bodyPos, acc); __syncthreads(); } isn't it supposed to be like this according to paper? I wonder why sharedPos[threadIdx.x+blockDim.x*threadIdx.y] = multithreadBodies ? positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : positions[WRAP(tile, gridDim.x) * p + threadIdx.x]; Second, in the multiple threads per body why the threadIdx.x is still involved? Isn't it supposed to be a fix value or not involving at all because the sum only due to threadIdx.y if (multithreadBodies) { SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; //this line SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; //this line SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; //this line __syncthreads(); // Save the result in global memory for the integration step if (threadIdx.y == 0) { for (int i = 1; i < blockDim.y; i++) { acc.x += SX_SUM(threadIdx.x,i).x; //this line acc.y += SX_SUM(threadIdx.x,i).y; //this line acc.z += SX_SUM(threadIdx.x,i).z; //this line } } } Can anyone explain this to me? Is it some kind of optimization for faster code?

    Read the article

  • CUDA memory transfer issue

    - by Vaibhav Sundriyal
    I am trying to execute a code which first transfers data from CPU to GPU memory and vice-versa. In spite of increasing the volume of data, the data transfer time remains the same as if no data transfer is actually taking place. I am posting the code. #include <stdio.h> /* Core input/output operations */ #include <stdlib.h> /* Conversions, random numbers, memory allocation, etc. */ #include <math.h> /* Common mathematical functions */ #include <time.h> /* Converting between various date/time formats */ #include <cuda.h> /* CUDA related stuff */ #include <sys/time.h> __global__ void device_volume(float *x_d,float *y_d) { int index = blockIdx.x * blockDim.x + threadIdx.x; } int main(void) { float *x_h,*y_h,*x_d,*y_d,*z_h,*z_d; long long size=9999999; long long nbytes=size*sizeof(float); timeval t1,t2; double et; x_h=(float*)malloc(nbytes); y_h=(float*)malloc(nbytes); z_h=(float*)malloc(nbytes); cudaMalloc((void **)&x_d,size*sizeof(float)); cudaMalloc((void **)&y_d,size*sizeof(float)); cudaMalloc((void **)&z_d,size*sizeof(float)); gettimeofday(&t1,NULL); cudaMemcpy(x_d, x_h, nbytes, cudaMemcpyHostToDevice); cudaMemcpy(y_d, y_h, nbytes, cudaMemcpyHostToDevice); cudaMemcpy(z_d, z_h, nbytes, cudaMemcpyHostToDevice); gettimeofday(&t2,NULL); et = (t2.tv_sec - t1.tv_sec) * 1000.0; // sec to ms et += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms printf("\n %ld\t\t%f\t\t",nbytes,et); et=0.0; //printf("%f %d\n",seconds,CLOCKS_PER_SEC); // launch a kernel with a single thread to greet from the device //device_volume<<<1,1>>>(x_d,y_d); gettimeofday(&t1,NULL); cudaMemcpy(x_h, x_d, nbytes, cudaMemcpyDeviceToHost); cudaMemcpy(y_h, y_d, nbytes, cudaMemcpyDeviceToHost); cudaMemcpy(z_h, z_d, nbytes, cudaMemcpyDeviceToHost); gettimeofday(&t2,NULL); et = (t2.tv_sec - t1.tv_sec) * 1000.0; // sec to ms et += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms printf("%f\n",et); cudaFree(x_d); cudaFree(y_d); cudaFree(z_d); return 0; } Can anybody help me with this issue? Thanks

    Read the article

  • CUDA: cudaMemcpy only works in emulation mode.

    - by Jason
    I am just starting to learn how to use CUDA. I am trying to run some simple example code: float *ah, *bh, *ad, *bd; ah = (float *)malloc(sizeof(float)*4); bh = (float *)malloc(sizeof(float)*4); cudaMalloc((void **) &ad, sizeof(float)*4); cudaMalloc((void **) &bd, sizeof(float)*4); ... initialize ah ... /* copy array on device */ cudaMemcpy(ad,ah,sizeof(float)*N,cudaMemcpyHostToDevice); cudaMemcpy(bd,ad,sizeof(float)*N,cudaMemcpyDeviceToDevice); cudaMemcpy(bh,bd,sizeof(float)*N,cudaMemcpyDeviceToHost); When I run in emulation mode (nvcc -deviceemu) it runs fine (and actually copies the array). But when I run it in regular mode, it runs w/o error, but never copies the data. It's as if the cudaMemcpy lines are just ignored. What am I doing wrong? Thank you very much, Jason

    Read the article

  • Read vector into CUDA shared memory

    - by Ben
    I am new to CUDA and programming GPUs. I need each thread in my block to use a vector of length ndim. So I thought I might do something like this: extern __shared__ float* smem[]; ... if (threadIddx.x == 0) { for (int d=0; d<ndim; ++d) { smem[d] = vector[d]; } } __syncthreads(); ... This works fine. However, I seems wasteful that a single thread should do all loading, so I changed the code to if (threadIdx.x < ndim) { smem[threadIdx.x] = vector[threadIdx.x]; } __syncthreads(); which does not work. Why? It gives different results than the above code even when ndim << blockDim.x.

    Read the article

  • Optimize code performance when odd/even threads are doing different things in CUDA

    - by Orion Nebula
    Hi all! I have two large vectors, I am trying to do some sort of element multiplication, where an even-numbered element in the first vector is multiplied by the next odd-numbered element in the second vector .... and where the odd-numbered element in the first vector is multiplied by the preceding even-numbered element in the second vector Ex. vector 1 is V1(1) V1(2) V1(3) V1(4) vector 2 is V2(1) V2(2) V2(3) V2(4) V1(1) * V2(2) V1(3) * V2(4) V1(2) * V2(1) V1(4) * V2(3) I have written a Cuda code to do this: (Pds has the elements of the first vector in shared memory, Nds the second Vector) //instead of using %2 .. i check for the first bit to decide if number is odd/even -- faster if ((tx & 0x0001) == 0x0000) Nds[tx+1] = Pds[tx] * Nds[tx+1]; else Nds[tx-1] = Pds[tx] * Nds[tx-1]; __syncthreads(); Is there anyway to further accelerate this code or avoid divergence ? Thanks

    Read the article

  • CUDA threads output different value

    - by kar
    HAi, I wrote a cuda program , i have given the kernel function below. The device memory is allocated through CUDAMalloc(); the value of *md is 10; __global__ void add(int *md) { int x,oper=2; x=threadIdx.x; * md = *md*oper; if(x==1) { *md = *md*0; } if(x==2) { *md = *md*10; } if(x==3) { *md = *md+1; } if(x==4) { *md = *md-1; } } executed the above code add<<<1,5>>(*md) , add<<<1,4>>>(*md) for <<<1,5>>> the output is 19 for <<<1,4>>> the output is 21 1) I have doubt that cudaMalloc() will allocate in device main memory? 2) Why the last thread alone is executed always in the above program? Thank you

    Read the article

  • Optimize code perfromance when odd/even threads are doing different things in CUDA

    - by Ashraf
    Hi all! I have two large vectors, I am trying to do some sort of element multiplication, where an even-numbered element in the first vector is multiplied by the next odd-numbered element in the second vector .... and where the odd-numbered element in the first vector is multiplied by the preceding even-numbered element in the second vector Ex. vector 1 is V1(1) V1(2) V1(3) V1(4) vector 2 is V2(1) V2(2) V2(3) V2(4) V1(1) * V2(2) V1(3) * V2(4) V1(2) * V2(1) V1(4) * V2(3) I have written a Cuda code to do this: (Pds has the elements of the first vector in shared memory, Nds the second Vector) //instead of using %2 .. i check for the first bit to decide if number is odd/even -- faster if ((tx & 0x0001) == 0x0000) Nds[tx+1] = Pds[tx] * Nds[tx+1]; else Nds[tx-1] = Pds[tx] * Nds[tx-1]; __syncthreads(); Is there anyway to further accelerate this code or avoid divergence .. Thanks

    Read the article

  • Using device variable by multiple threads on CUDA

    - by ashagi
    I am playing around with cuda. At the moment I have a problem. I am testing a large array for particular responses, and when I get the response, I have to copy the data onto another array. For example, my test array of 5 elements looks like this: [ ][ ][v1][ ][ ][v2] Result must look like this: [v1][v2] The problem is how do I calculate the address of the second array to store the result? All elements of the first array are checked in parallel. I am thinking to declare a device variable int addr = 0. Every time I find a response, I will increment the addr. But I am not sure about that because it means that addr may be accessed by multiple threads at the same time. Will that cause problems? Or will the thread wait until another thread finishes using that variable?

    Read the article

  • help me understand cuda

    - by scatman
    i am having some troubles understanding threads in NVIDIA gpu architecture with cuda. please could anybody clarify these info: an 8800 gpu has 16 SMs with 8 SPs each. so we have 128 SPs. i was viewing stanford's video presentation and it was saying that every SP is capable of running 96 threads cuncurrently. does this mean that it (SP) can run 96/32=3 warps concurrently? moreover, since every SP can run 96 threads and we have 8 SPs in every SM. does this mean that every SM can run 96*8=768 threads concurrently?? but if every SM can run a single Block at a time, and the maximum number of threads in a block is 512, so what is the purpose of running 768 threads concurrently and have a max of 512 threads? a more general question is:how are blocks,threads,and warps distributed to SMs and SPs? i read that every SM gets a single block to execute at a time and threads in a block is divided into warps (32 threads), and SPs execute warps.

    Read the article

  • Transferring data from 2d Dynamic array in C to CUDA and back

    - by Soumya
    I have a dynamically declared 2D array in my C program, the contents of which I want to transfer to a CUDA kernel for further processing. Once processed, I want to populate the dynamically declared 2D array in my C code with the CUDA processed data. I am able to do this with static 2D C arrays but not with dynamically declared C arrays. Any inputs would be welcome! I mean the dynamic array of dynamic arrays. The test code that I have written is as below. #include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> #include <conio.h> #include <math.h> #include <stdlib.h> const int nItt = 10; const int nP = 5; __device__ int d_nItt = 10; __device__ int d_nP = 5; __global__ void arr_chk(float *d_x_k, float *d_w_k, int row_num) { int index = (blockIdx.x * blockDim.x) + threadIdx.x; int index1 = (row_num * d_nP) + index; if ( (index1 >= row_num * d_nP) && (index1 < ((row_num +1)*d_nP))) //Modifying only one row data pertaining to one particular iteration { d_x_k[index1] = row_num * d_nP; d_w_k[index1] = index; } } float **mat_create2(int r, int c) { float **dynamicArray; dynamicArray = (float **) malloc (sizeof (float)*r); for(int i=0; i<r; i++) { dynamicArray[i] = (float *) malloc (sizeof (float)*c); for(int j= 0; j<c;j++) { dynamicArray[i][j] = 0; } } return dynamicArray; } /* Freeing memory - here only number of rows are passed*/ void cleanup2d(float **mat_arr, int x) { int i; for(i=0; i<x; i++) { free(mat_arr[i]); } free(mat_arr); } int main() { //float w_k[nItt][nP]; //Static array declaration - works! //float x_k[nItt][nP]; // if I uncomment this dynamic declaration and comment the static one, it does not work..... float **w_k = mat_create2(nItt,nP); float **x_k = mat_create2(nItt,nP); float *d_w_k, *d_x_k; // Device variables for w_k and x_k int nblocks, blocksize, nthreads; for(int i=0;i<nItt;i++) { for(int j=0;j<nP;j++) { x_k[i][j] = (nP*i); w_k[i][j] = j; } } for(int i=0;i<nItt;i++) { for(int j=0;j<nP;j++) { printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]); printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]); } } int size1 = nItt * nP * sizeof(float); printf("\nThe array size in memory bytes is: %d\n",size1); cudaMalloc( (void**)&d_x_k, size1 ); cudaMalloc( (void**)&d_w_k, size1 ); if((nP*nItt)<32) { blocksize = nP*nItt; nblocks = 1; } else { blocksize = 32; // Defines the number of threads running per block. Taken equal to warp size nthreads = blocksize; nblocks = ceil(float(nP*nItt) / nthreads); // Calculated total number of blocks thus required } for(int i = 0; i< nItt; i++) { cudaMemcpy( d_x_k, x_k, size1,cudaMemcpyHostToDevice ); //copy of x_k to device cudaMemcpy( d_w_k, w_k, size1,cudaMemcpyHostToDevice ); //copy of w_k to device arr_chk<<<nblocks, blocksize>>>(d_x_k,d_w_k,i); cudaMemcpy( x_k, d_x_k, size1, cudaMemcpyDeviceToHost ); cudaMemcpy( w_k, d_w_k, size1, cudaMemcpyDeviceToHost ); } printf("\nVerification after return from gpu\n"); for(int i = 0; i<nItt; i++) { for(int j=0;j<nP;j++) { printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]); printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]); } } cudaFree( d_x_k ); cudaFree( d_w_k ); cleanup2d(x_k,nItt); cleanup2d(w_k,nItt); getch(); return 0;

    Read the article

  • CUDA Global Memory, Where is it?

    - by gamerx
    I understand that in CUDA's memory hierachy, we have things like shared memory, texture memory, constant memory, registers and of course the global memory which we allocate using cudaMalloc(). I've been searching through whatever documentations I can find but I have yet to come across any that explicitly explains what is the global memory. I believe that the global memory allocated is on the GDDR of graphics card itself and not the RAM that is shared with the CPU since one of the documentations did state that the pointer cannot be dereferenced by the host side. Am I right?

    Read the article

  • CUDA Kernel Not Updating Global Variable

    - by Taher Khokhawala
    I am facing the following problem in a CUDA kernel. There is an array "cu_fx" in global memory. Each thread has a unique identifier jj and a local loop variable ii and a local float variable temp. Following code is not working. It is not at all changing cu_fx[jj]. At the end of loop cu_fx[jj] remains 0. ii = 0; cu_fx[jj] = 0; while(ii < l) { if(cu_y[ii] > 0) cu_fx[jj] += (cu_mu[ii]*cu_Kernel[(jj-start_row)*Kernel_w + ii]); else cu_fx[jj] -= (cu_mu[ii]*cu_Kernel[(jj-start_row)*Kernel_w + ii]); ii++; } But when I rewrite it using a temporary variable temp, it works fine. ii = 0; temp = 0; while(ii < l) { if(cu_y[ii] > 0) temp += (cu_mu[ii]*cu_Kernel[(jj-start_row)*Kernel_w + ii]); else temp -= (cu_mu[ii]*cu_Kernel[(jj-start_row)*Kernel_w + ii]); ii++; } cu_fx[jj] = temp; Can somebody please help with this problem. Thanking in advance.

    Read the article

  • Calling handwritten CUDA kernel with thrust

    - by macs
    Hi, since i needed to sort large arrays of numbers with CUDA, i came along with using thrust. So far, so good...but what when i want to call a "handwritten" kernel, having a thrust::host_vector containing the data? My approach was (backcopy is missing): int CUDA_CountAndAdd_Kernel(thrust::host_vector<float> *samples, thrust::host_vector<int> *counts, int n) { thrust::device_ptr<float> dSamples = thrust::device_malloc<float>(n); thrust::copy(samples->begin(), samples->end(), dSamples); thrust::device_ptr<int> dCounts = thrust::device_malloc<int>(n); thrust::copy(counts->begin(), counts->end(), dCounts); float *dSamples_raw = thrust::raw_pointer_cast(dSamples); int *dCounts_raw = thrust::raw_pointer_cast(dCounts); CUDA_CountAndAdd_Kernel<<<1, n>>>(dSamples_raw, dCounts_raw); thrust::device_free(dCounts); thrust::device_free(dSamples); } The kernel looks like: __global__ void CUDA_CountAndAdd_Kernel_Device(float *samples, int *counts) But compilation fails with: error: argument of type "float **" is incompatible with parameter of type "thrust::host_vector *" Huh?! I thought i was giving float and int raw-pointers? Or am i missing something?

    Read the article

  • how to make a CUDA Histogram kernel?

    - by kitw
    Hi all, I am writing a CUDA kernel for Histogram on a picture, but I had no idea how to return a array from the kernel, and the array will change when other thread read it. Any possible solution for it? __global__ void Hist( TColor *dst, //input image int imageW, int imageH, int*data ){ const int ix = blockDim.x * blockIdx.x + threadIdx.x; const int iy = blockDim.y * blockIdx.y + threadIdx.y; if(ix < imageW && iy < imageH) { int pixel = get_red(dst[imageW * (iy) + (ix)]); //this assign specific RED value of image to pixel data[pixel] ++; // ?? problem statement ... } } @para d_dst: input image TColor is equals to float4. @para data: the array for histogram size [255] extern "C" void cuda_Hist(TColor *d_dst, int imageW, int imageH,int* data) { dim3 threads(BLOCKDIM_X, BLOCKDIM_Y); dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y)); Hist<<<grid, threads>>>(d_dst, imageW, imageH, data); }

    Read the article

  • Beginner error in CUDA

    - by Dimitri
    Hi folks, first all i want to wish you a merry christmas. I am writing a small program in CUDA and i have the following errors : contraste.cu(167): error: calling a host function from a __device__/__global__ function is not allowed I don't understand why. Can you please help me and show me my errors. It seems that my program is correct. Here is a the bunch of code causing the problems : __global__ void kernel_contraste(float power, unsigned char tab_in[], unsigned char tab_out[], int nbl, int nbc) { int x = threadIdx.x; printf("I am the thread %d\n", x); } Part of my main program : unsigned char *dimg, *dimg_res; ..... cudaMalloc((void **)dimg, h * w * sizeof(char)); cudaMemcpy(dimg, r.data, h*w*sizeof(char), cudaMemcpyHostToDevice); cudaMalloc((void **)dimg_res, h*w*sizeof(char)); dim3 nbThreadparBloc(256); dim3 numblocs(1); kernel_contraste<<<numblocs, nbThreadparBloc >>>(puissance, dimg, dimg_res, h, w); cudaThreadSynchronize(); ..... cudaFree(dimg); cudaFree(dimg_res); The line 167 is the line where i call the printf in function kernel_contraste. For information, this program takes an image as an input( a sun Rasterfile ) and a power then it calculates the contraste of that image. Thanks !!

    Read the article

  • Optimizing Vector elements swaps using CUDA

    - by Orion Nebula
    Hi all, Since I am new to cuda .. I need your kind help I have this long vector, for each group of 24 elements, I need to do the following: for the first 12 elements, the even numbered elements are multiplied by -1, for the second 12 elements, the odd numbered elements are multiplied by -1 then the following swap takes place: Graph: because I don't yet have enough points, I couldn't post the image so here it is: http://www.freeimagehosting.net/image.php?e4b88fb666.png I have written this piece of code, and wonder if you could help me further optimize it to solve for divergence or bank conflicts .. //subvector is a multiple of 24, Mds and Nds are shared memory _shared_ double Mds[subVector]; _shared_ double Nds[subVector]; int tx = threadIdx.x; int tx_mod = tx ^ 0x0001; int basex = __umul24(blockDim.x, blockIdx.x); Mds[tx] = M.elements[basex + tx]; __syncthreads(); // flip the signs if (tx < (tx/24)*24 + 12) { //if < 12 and even if ((tx & 0x0001)==0) Mds[tx] = -Mds[tx]; } else if (tx < (tx/24)*24 + 24) { //if >12 and < 24 and odd if ((tx & 0x0001)==1) Mds[tx] = -Mds[tx]; } __syncthreads(); if (tx < (tx/24)*24 + 6) { //for the first 6 elements .. swap with last six in the 24elements group (see graph) Nds[tx] = Mds[tx_mod + 18]; Mds [tx_mod + 18] = Mds [tx]; Mds[tx] = Nds[tx]; } else if (tx < (tx/24)*24 + 12) { // for the second 6 elements .. swp with next adjacent group (see graph) Nds[tx] = Mds[tx_mod + 6]; Mds [tx_mod + 6] = Mds [tx]; Mds[tx] = Nds[tx]; } __syncthreads(); Thanks in advance ..

    Read the article

  • Upgrade from ubuntu 9.10 to 11.10

    - by Chinnu
    Our project definition is to develop CUDA programs. Our workstation has CUDA 3.1 installed in Ubuntu 9.10. We need to program in CUDA 5.0 which can be installed only on ubuntu 11.10 or 12.04. We tried upgrading but were faced with many problems as 9.10 is no longer supported. So we chose to proceed with a clean installation. Since we have a shared workstation, we need to back up the settings. We decided to use clonezilla for cloning the system. Booting from the LiveCD showed an unexpected error. Another option was to install 11.10 in an external HDD by partitioning it, but Gparted could not be installed and terminated with the error "installArchives() failed" which we couldn't solve even after modifying the sources.list. We are stuck either ways. Have no idea how to proceed and we have a deadline to submit our CUDA program. Any suggestion is welcome.

    Read the article

  • Allocate constant memory

    - by Vlad
    I'm trying to set my simulation params in constant memory but without luck (CUDA.NET). cudaMemcpyToSymbol function returns cudaErrorInvalidSymbol. The first parameter in cudaMemcpyToSymbol is string... Is it symbol name? actualy I don't understand how it could be resolved. Any help appreciated. //init, load .cubin float[] arr = new float[1]; arr[0] = 0.0f; int size = Marshal.SizeOf(arr[0]) * arr.Length; IntPtr ptr = Marshal.AllocHGlobal(size); Marshal.Copy(arr, 0, ptr, arr.Length); var error = CUDARuntime.cudaMemcpyToSymbol("param", ptr, 4, 0, cudaMemcpyKind.cudaMemcpyHostToDevice); my .cu file contain __constant__ float param;

    Read the article

  • CUDA: Memory copy to GPU 1 is slower in multi-GPU

    - by zenna
    My company has a setup of two GTX 295, so a total of 4 GPUs in a server, and we have several servers. We GPU 1 specifically was slow, in comparison to GPU 0, 2 and 3 so I wrote a little speed test to help find the cause of the problem. //#include <stdio.h> //#include <stdlib.h> //#include <cuda_runtime.h> #include <iostream> #include <fstream> #include <sstream> #include <string> #include <cutil.h> __global__ void test_kernel(float *d_data) { int tid = blockDim.x*blockIdx.x + threadIdx.x; for (int i=0;i<10000;++i) { d_data[tid] = float(i*2.2); d_data[tid] += 3.3; } } int main(int argc, char* argv[]) { int deviceCount; cudaGetDeviceCount(&deviceCount); int device = 0; //SELECT GPU HERE cudaSetDevice(device); cudaEvent_t start, stop; unsigned int num_vals = 200000000; float *h_data = new float[num_vals]; for (int i=0;i<num_vals;++i) { h_data[i] = float(i); } float *d_data = NULL; float malloc_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaMalloc((void**)&d_data, sizeof(float)*num_vals); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float mem_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float kernel_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); test_kernel<<<1000,256>>>(d_data); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); printf("cudaMalloc took %f ms\n",malloc_timer); printf("Copy to the GPU took %f ms\n",mem_timer); printf("Test Kernel took %f ms\n",kernel_timer); cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost); delete[] h_data; return 0; } The results are GPU0 cudaMalloc took 0.908640 ms Copy to the GPU took 296.058777 ms Test Kernel took 326.721283 ms GPU1 cudaMalloc took 0.913568 ms Copy to the GPU took[b] 663.182251 ms[/b] Test Kernel took 326.710785 ms GPU2 cudaMalloc took 0.925600 ms Copy to the GPU took 296.915039 ms Test Kernel took 327.127930 ms GPU3 cudaMalloc took 0.920416 ms Copy to the GPU took 296.968384 ms Test Kernel took 327.038696 ms As you can see, the cudaMemcpy to the GPU is well double the amount of time for GPU1. This is consistent between all our servers, it is always GPU1 that is slow. Any ideas why this may be? All servers are running windows XP.

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >