Search Results

Search found 244 results on 10 pages for 'cuda'.


  • how can a __global__ function RETURN a value or BREAK out like C/C++ does

    - by user1684726
    Recently I've been doing string comparison on CUDA, and I wonder how a __global__ function can return a value when it finds the exact string I'm looking for. I need the __global__ function, which runs a great number of threads, to search a very large string pool for a certain string simultaneously, and I hope that once the exact string is found, the function can stop all the threads, return to the main function, and tell me "he did it"! By the way, I'm using CUDA C. How could I achieve that? Waiting for help.
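
    A __global__ function has a void return type, so a result is normally written to memory that the host reads back after the launch. The sketch below emulates that pattern on the host in plain C++, with each loop iteration standing in for one GPU thread. The names find_first_match and kNotFound are hypothetical, and the std::atomic slot is only an assumed stand-in for an int in device global memory updated with an atomic compare-and-swap.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel meaning "no thread has found the target yet".
constexpr int kNotFound = -1;

// Host-side emulation: each loop iteration plays the role of one GPU thread.
int find_first_match(const std::vector<std::string>& pool, const std::string& target) {
    std::atomic<int> result{kNotFound};  // stands in for an int in device global memory
    for (std::size_t tid = 0; tid < pool.size(); ++tid) {
        // Each "thread" checks the shared slot first and stops working
        // once a hit has already been published.
        if (result.load() != kNotFound) break;
        if (pool[tid] == target) {
            int expected = kNotFound;
            // Only the first writer succeeds, similar in spirit to a device-side compare-and-swap.
            result.compare_exchange_strong(expected, static_cast<int>(tid));
        }
    }
    return result.load();  // the host would copy this slot back after the kernel finishes
}
```

    On a real GPU the threads run concurrently, so the published index is whichever match wins the race, not necessarily the lowest one, and threads that are already past the check still run to completion.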


  • Why can't nvcc find my Visual C++ installation?

    - by Jack Lloyd
    I'm running Windows 7 Pro x64 on a Core i5 with an NVIDIA 3100m, which is CUDA compatible. I've tried installing both the 32-bit and 64-bit CUDA toolkits from NVIDIA; unfortunately, with either of them I cannot compile anything. nvcc says "cannot find a supported cl version. Only MSVC 8.0 and MSVC 9.0 are supported". I have the x86 and x86-64 compilers installed via the Windows 7 SDK (compiler version 15.00.30729.01 for both architectures). Both compilers are operating correctly; I've built and tested C and C++ code using them. I've tried running nvcc from command shells set up for both 32-bit and 64-bit compilation, and using the -ccbin command line option to nvcc to point it at the Visual C++ install directory. What is the right way of handling this setup? Is there some way I can make nvcc more verbose about what is going on? The -v flag isn't terribly helpful. Ideally there would be some way to make it show what it is finding versus what it expects to find. Will this work better if I install Visual C++ Express instead? Or is only a commercial version of VC++ supported for use with CUDA?


  • Run python script on server over ssh session in the background persistently

    - by Stefan R. Falk
    I got an account from my professor for our university's CUDA server for running some tests. I am connecting via ssh from a terminal. The thing is, when I close the terminal the server also seems to kill the running script. When I reconnect, it has stopped. No, it is not possible that the script had already terminated, since those test runs should take a few hours even on those machines. Can anybody help me here? OS: Linux cuda01 3.13-1-amd64 #1 SMP Debian 3.13.7-1 (2014-03-25) x86_64 GNU/Linux


  • How do I use compiler intrinsic __fmul_?

    - by Eric Thoma
    I am writing a massively parallel GPU application and have been optimizing it by hand. I got a 20% performance increase with __fdividef(x, y), and according to the CUDA C Programming Guide (section C.2.1), using similar functions for multiplication and addition is also beneficial. The function is listed as "__fmul_[rn,rz,ru,rd](x,y)", whereas __fdividef(x,y) was listed without any arguments in brackets. I was wondering, what are those brackets? If I run the simple code int t = __fmul_(5,4); I get a compiler error saying that __fmul_ is undefined. I have the CUDA runtime included, so I don't think it is a setup issue; rather, it has something to do with those square brackets. How do I correctly use this function? Thank you.


  • Exclusive compute mode with OpenCL+NVidia

    - by lokli
    Hi, I have a question about exclusive compute mode with NVIDIA + OpenCL. I can set up exclusive compute mode (page 74 of the CUDA Programming Guide 3.0) with nvidia-smi on an NVIDIA GPU. That means only one program can compute on the GPU, and the CUDA runtime then schedules the app automatically. But I have a problem with OpenCL programs in this case: if one application is running on a GPU that is set to exclusive compute mode and a second OpenCL program calls clGetDeviceInfo(..., CL_DEVICE_AVAILABLE, ...) for the same GPU, the result is CL_TRUE. After that, if the second OpenCL app tries to create a context on this device, both apps crash. How can I find an available GPU with OpenCL? Thanks.


  • Not able to kill bad kernel running on NVIDIA GPU

    - by arvindkgs
    Hi, I am in a real fix. Please help. It's urgent. I have a host process that spawns multiple host (CPU) threads. These threads in turn call CUDA kernels. The CUDA kernels are written by external users, so some of them may be bad kernels that enter an infinite loop. To guard against this I have added a time-out of 2 minutes that kills the corresponding CPU thread. Will killing the CPU thread also kill the kernel running on the GPU? As far as I have tested, it doesn't. How can I also kill all the threads currently running on the GPU? Thanks, Arvind


  • Ikoula launches a new dedicated server: the Green GPU offers "192 CUDA Parallel Processor Cores" to graphic design professionals

    French hosting provider Ikoula is offering for rent a new dedicated server that includes a professional graphics card, or GPU. The Nvidia Quadro 2000D is the card chosen for the launch of this new dedicated server offering. The Quadro 2000D is built on the core of Nvidia's Fermi technology and offers 192 CUDA Parallel Processor Cores, all accompanied by...


  • NVIDIA releases version 5 of CUDA, simplifying programming on the most pervasive parallel computing platform

    NVIDIA today released the pre-production version of NVIDIA CUDA 5, a new version of the most pervasive parallel computing platform and programming model, designed to accelerate scientific and engineering applications on graphics processors. It can be downloaded free of charge from the Developer Zone of the NVIDIA website. With more than 1.5 million downloads and support for more than 180 engineering, scientific, and commercial applications...


  • malloc works, cudaHostAlloc segfaults?

    - by Mikhail
    I am new to CUDA and I want to use cudaHostAlloc. I was able to isolate my problem to the following code. Using malloc for host allocation works; using cudaHostAlloc results in a segfault, possibly because the allocated area is invalid? When I dump the pointer in both cases it is not null, so cudaHostAlloc returns something...

    Works:

        in_h = (int*) malloc(length*sizeof(int)); //works
        for (int i = 0;i<length;i++) in_h[i]=2;

    Doesn't work:

        cudaHostAlloc((void**)&in_h,length*sizeof(int),cudaHostAllocDefault);
        for (int i = 0;i<length;i++) in_h[i]=2; //segfaults

    Standalone code:

        #include <stdio.h>
        void checkDevice() {
            cudaDeviceProp info;
            int deviceName;
            cudaGetDevice(&deviceName);
            cudaGetDeviceProperties(&info,deviceName);
            if (!info.deviceOverlap) {
                printf("Compute device can't use streams and should be discared.");
                exit(EXIT_FAILURE);
            }
        }
        int main() {
            checkDevice();
            int *in_h;
            const int length = 10000;
            cudaHostAlloc((void**)&in_h,length*sizeof(int),cudaHostAllocDefault);
            printf("segfault comming %d\n",in_h);
            for (int i = 0;i<length;i++) {
                in_h[i]=2;
            }
            free(in_h);
            return EXIT_SUCCESS;
        }

    Invocation:

        [id129]$ nvcc fun.cu
        [id129]$ ./a.out
        segfault comming 327641824
        Segmentation fault (core dumped)

    Details: the program is run in interactive mode on a cluster. I was told that invoking the program from the compute node pushes it to the cluster. I have not had any trouble with other home-made toy CUDA codes.


  • Setting pixel values in Nvidia NPP ImageCPU objects?

    - by solvingPuzzles
    In the Nvidia Performance Primitives (NPP) image processing examples in the CUDA SDK distribution, images are typically stored on the CPU as ImageCPU objects and on the GPU as ImageNPP objects. boxFilterNPP.cpp is an example from the CUDA SDK that uses these ImageCPU and ImageNPP objects. When using a filter (convolution) function like nppiFilter, it makes sense to define a filter as an ImageCPU object. However, I see no clear way of setting the values of an ImageCPU object.

        npp::ImageCPU_32f_C1 hostKernel(3,3); //allocate space for 3x3 convolution kernel
        //want to set hostKernel to [-1 0 1; -1 0 1; -1 0 1]
        hostKernel[0][0] = -1; //this doesn't compile
        hostKernel(0,0) = -1; //this doesn't compile
        hostKernel.at(0,0) = -1; //this doesn't compile

    How can I manually put values into an ImageCPU object?

    Notes: I didn't actually use nppiFilter in the code snippet; I'm just mentioning nppiFilter as a motivating example for writing values into an ImageCPU object. The boxFilterNPP.cpp example doesn't involve writing directly to an ImageCPU object, because nppiFilterBox is a special case of nppiFilter that uses a built-in gaussian smoothing filter (probably something like [1 1 1; 1 1 1; 1 1 1]).


  • Is CUDA, cuBLAS or cuBLAS-XT the right place to start with for machine learning?

    - by Stefan R. Falk
    I am not sure if this is the right forum to post this question, but it surely is not a question for Stack Overflow. I am working on my bachelor thesis, for which I am implementing a so-called Echo State Network, which is basically an artificial neural network that has a large reservoir of randomly initialized neurons and just a few input and output neurons, but I think we can skip that. The thing is, there is a Python library called Theano which I am using for this implementation. It encapsulates the CUDA API and offers a quite "comfortable" way to access the power of an NVIDIA graphics card. Since CUDA 6.0 there is a sub-library called cuBLAS (Basic Linear Algebra Subroutines) for linear algebra operations, and also cuBLAS-XT, an extension which allows running calculations on multiple graphics cards. My question at this point is whether it would make sense to start using cuBLAS and/or cuBLAS-XT right now, since the API is quite complex, or rather to wait for libraries that will build on them (as Theano does on basic CUDA). If you think this is the wrong place for this question, please tell me which one is. Thank you.


  • 3d convolution in c++

    - by alboot
    Hello, I'm looking for some source code implementing 3d convolution. Ideally, I need C++ code or CUDA code. I'd appreciate if anybody can point me to a nice and fast implementation :-) Cheers


  • Why won't OpenCV compile in NVCC?

    - by zenna
    Hi there, I am trying to integrate CUDA and OpenCV in a project. The problem is that OpenCV won't compile when NVCC is used, while a normal C++ project compiles just fine. This seems odd to me, as I thought NVCC passed all host code to the C/C++ compiler, in this case the Visual Studio compiler. The errors I get are:

        c:\opencv2.0\include\opencv\cxoperations.hpp(1137): error: no operator "=" matches these operands
            operand types are: const cv::Range = cv::Range
        c:\opencv2.0\include\opencv\cxoperations.hpp(2469): error: more than one instance of overloaded function "std::abs" matches the argument list:
            function "abs(long double)"
            function "abs(float)"
            function "abs(double)"
            function "abs(long)"
            function "abs(int)"
            argument types are: (ptrdiff_t)

    So my questions are: why is there a difference, considering that the same compiler is (or should be) being used, and how can I remedy this?


  • What are the default values for arch and code options when using nvcc?

    - by Auron
    When compiling your CUDA code, you have to select the architecture for which your code is generated. nvcc provides two parameters to specify this architecture: arch specifies the virtual architecture, which can be compute_10, compute_11, etc.; code specifies the real architecture, which can be sm_10, sm_11, etc. So a command like this:

        nvcc x.cu -arch=compute_13 -code=sm_13

    will generate 'cubin' code for devices with compute capability 1.3. Please correct me if I'm wrong. What I would like to know is: what are the default values for these two parameters? Which architecture does nvcc use by default when no value for arch or code is specified?


  • Untrusted GPGPU code (OpenCL etc) - is it safe? What risks?

    - by Grzegorz Wierzowiecki
    There are many approaches when it comes to running untrusted code on a typical CPU: sandboxes, fake roots, virtualization... What about untrusted code for GPGPU (OpenCL, CUDA, or already compiled code)? Assuming that memory on the graphics card is cleared before running such third-party untrusted code, are there any security risks? What kind of risks? Is there any way to prevent them (possibly sandboxing on GPGPU or another technique)? P.S. I am more interested in security at the GPU binary code level than in high-level GPGPU programming language security (but those solutions are welcome as well). What I mean is that references to GPU opcodes (a.k.a. machine code) are welcome.


  • What's a good algorithm for searching arrays N and M, in order to find elements in N that also exist

    - by GenTiradentes
    I have two arrays, N and M. They are both arbitrarily sized, though N is usually smaller than M. I want to find out which elements in N also exist in M, in the fastest way possible. To give you an example of one possible instance of the program, N is an array 12 units in size, and M is an array 1,000 units in size. (There may not be any matches.) The more parallel the solution, the better. I used to use a hash map for this, but it's not quite as efficient as I'd like it to be. Typing this out, I just thought of running a binary search of M on sizeof(N) independent threads (using CUDA). I'll see how this works, though other suggestions are welcome.
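
    The binary-search idea mentioned at the end can be sketched on the host in plain C++. This is only an illustration under assumptions: the elements are taken to be int, M is sorted once up front, and the function name elements_in_both is hypothetical. Each lookup is independent of the others, which is what would let one thread handle one element of N.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the elements of n that also occur in m, in n's original order.
// m is taken by value so it can be sorted without changing the caller's copy.
std::vector<int> elements_in_both(const std::vector<int>& n, std::vector<int> m) {
    std::sort(m.begin(), m.end());  // done once, O(|M| log |M|)
    std::vector<int> hits;
    for (int value : n) {
        // Each lookup is O(log |M|) and independent of the others,
        // so this loop body is the part that could run as one thread per element.
        if (std::binary_search(m.begin(), m.end(), value)) {
            hits.push_back(value);
        }
    }
    return hits;
}
```

    For example, elements_in_both({5, 1, 9}, {9, 8, 7, 5}) yields {5, 9}.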


  • What does this structure actually do?

    - by LGTrader
    I found this structure code in a Julia Set example from a book on CUDA. I'm a newbie C programmer and cannot get my head around what it's doing, nor have I found the right thing to read on the web to clear it up. Here's the structure:

        struct cuComplex {
            float r;
            float i;
            cuComplex( float a, float b ) : r(a), i(b) {}
            float magnitude2( void ) { return r * r + i * i; }
            cuComplex operator*(const cuComplex& a) {
                return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
            }
            cuComplex operator+(const cuComplex& a) {
                return cuComplex(r+a.r, i+a.i);
            }
        };

    and it's called very simply like this:

        cuComplex c(-0.8, 0.156);
        cuComplex a(jx, jy);
        int i = 0;
        for (i=0; i<200; i++) {
            a = a * a + c;
            if (a.magnitude2() > 1000) return 0;
        }
        return 1;

    So, what did the code do? It defined something of structure type 'cuComplex' holding the real and imaginary parts of a number (-0.8 and 0.156). What is getting returned (or placed in the structure)? How do I work through the logic of the operator stuff in the struct to understand what is actually calculated and held there? I think that it's probably doing recursive calls back into the structure:

        float magnitude2 (void) { return r * r + i * i; }

    probably calls the '*' operator for r and again for i, and then the results of those two operations call the '+' operator? Is this correct, and what gets returned at each step? Just plain confused. Thanks!
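
    A host-only C++ version of the same struct may help to trace what is calculated. It is a sketch rather than the book's code: the struct is renamed Complex (a hypothetical name) and the loop is wrapped in a function called julia. Inside magnitude2, r and i are plain floats, so r * r + i * i uses the built-in float operators; the overloaded operator* and operator+ are only selected when both operands are Complex objects, as in a * a + c.

```cpp
#include <cassert>

struct Complex {
    float r;  // real part
    float i;  // imaginary part
    Complex(float a, float b) : r(a), i(b) {}
    // Squared magnitude: ordinary float arithmetic, no call into the operators below.
    float magnitude2() const { return r * r + i * i; }
    // Complex product: (r + i*j)(a.r + a.i*j) = (r*a.r - i*a.i) + (i*a.r + r*a.i)*j
    Complex operator*(const Complex& a) const {
        return Complex(r * a.r - i * a.i, i * a.r + r * a.i);
    }
    // Complex sum: add real parts and imaginary parts separately.
    Complex operator+(const Complex& a) const {
        return Complex(r + a.r, i + a.i);
    }
};

// Returns 0 as soon as the iterate escapes (squared magnitude above 1000),
// or 1 if it stays bounded for all 200 iterations of a = a*a + c.
int julia(float jx, float jy) {
    Complex c(-0.8f, 0.156f);
    Complex a(jx, jy);
    for (int k = 0; k < 200; ++k) {
        a = a * a + c;  // operator* runs first, then operator+; each returns a new Complex
        if (a.magnitude2() > 1000) return 0;
    }
    return 1;
}
```

    As a worked step, Complex(1, 2) * Complex(3, 4) gives real part 1*3 - 2*4 = -5 and imaginary part 2*3 + 1*4 = 10.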


  • My kernel only works in block (0,0)

    - by ZeroDivide
    I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid. This is my invocation code:

        dim3 dimBlock(4,4,1);
        dim3 dimGrid(4,4,1);
        //Launch the kernel;
        MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

    This is my kernel function:

        __global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
        {
            const int tx = threadIdx.x;
            const int ty = threadIdx.y;
            const int bx = blockIdx.x;
            const int by = blockIdx.y;
            const int row = (by * blockDim.y + ty);
            const int col = (bx * blockDim.x + tx);
            //Pvalue stores the Pd element that is computed by the thread
            int Pvalue = 0;
            for (int k = 0; k < Width; k++) {
                Pvalue += Md[row * Width + k] * Nd[k * Width + col];
            }
            __syncthreads();
            //Write the matrix to device memory each thread writes one element
            Pd[row * Width + col] = Pvalue;
        }

    I think the problem may have something to do with memory, but I'm a bit lost. What should I do to make this code work across several blocks?
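
    The excerpt does not show the value of Width or the host-side copies, so the cause cannot be pinned down from it. One thing that can be checked on the host is whether the grid covers the whole matrix: a 4x4 grid of 4x4 blocks reaches exactly 16x16 elements. The sketch below is a plain C++ emulation in which loops replace blockIdx and threadIdx; blocks_for and matmul_emulated are hypothetical names, and the bounds guard is an assumption about what a version for arbitrary Width would need.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * block_dim >= width (ceiling division).
int blocks_for(int width, int block_dim) {
    assert(width > 0 && block_dim > 0);
    return (width + block_dim - 1) / block_dim;
}

// Host emulation of the launch: the four outer loops replace blockIdx and threadIdx.
std::vector<int> matmul_emulated(const std::vector<int>& M, const std::vector<int>& N,
                                 int width, int block_dim) {
    std::vector<int> P(width * width, 0);
    const int grid_dim = blocks_for(width, block_dim);
    for (int by = 0; by < grid_dim; ++by)
        for (int bx = 0; bx < grid_dim; ++bx)
            for (int ty = 0; ty < block_dim; ++ty)
                for (int tx = 0; tx < block_dim; ++tx) {
                    const int row = by * block_dim + ty;
                    const int col = bx * block_dim + tx;
                    // Guard for partial blocks at the right and bottom edges.
                    if (row >= width || col >= width) continue;
                    int value = 0;
                    for (int k = 0; k < width; ++k)
                        value += M[row * width + k] * N[k * width + col];
                    P[row * width + col] = value;
                }
    return P;
}
```

    With block_dim 4, blocks_for(16, 4) is 4, matching the quoted launch, while a width of 17 would already need 5 blocks per dimension plus the guard.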


  • trouble calculating offset index into 3D array

    - by Derek
    Hello, I am writing a CUDA kernel to create a 3x3 covariance matrix for each location in the rows*cols main matrix. So that 3D matrix is rows*cols*9 in size, which I allocated in a single malloc accordingly. I need to access this through a single index value; the 9 values of the 3x3 covariance matrix get their values set according to the appropriate row r and column c from some other 2D arrays. In other words, I need to calculate the appropriate index to access the 9 elements of the 3x3 covariance matrix, as well as the row and column offset of the 2D matrices that are inputs to the value, as well as the appropriate index for the storage array. I have tried to simplify it down to the following:

        //I am calling this kernel with 1D blocks that are 512 cols x 1 row. TILE_WIDTH=512
        int bx = blockIdx.x;
        int by = blockIdx.y;
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        int r = by + ty;
        int c = bx*TILE_WIDTH + tx;
        int offset = r*cols+c;
        int ndx = r*cols*rows + c*cols;
        if((r < rows) && (c < cols)){ //this IF statement is trying to avoid the case where a thread block goes beyond my original array..not sure if correct
            d_cov[ndx + 0] = otherArray[offset];
            d_cov[ndx + 1] = otherArray[offset];
            d_cov[ndx + 2] = otherArray[offset];
            d_cov[ndx + 3] = otherArray[offset];
            d_cov[ndx + 4] = otherArray[offset];
            d_cov[ndx + 5] = otherArray[offset];
            d_cov[ndx + 6] = otherArray[offset];
            d_cov[ndx + 7] = otherArray[offset];
            d_cov[ndx + 8] = otherArray[offset];
        }

    When I check this array against the values calculated on the CPU, which loops over i=rows, j=cols, k=1..9, the results do not match up. In other words, d_cov[i*rows*cols + j*cols + k] != correctAnswer[i][j][k]. Can anyone give me any tips on how to solve this problem? Is it an indexing problem, or some other logic error?
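
    For a rows x cols x 9 array stored row-major in one allocation, the usual flat offset is (r*cols + c)*9 + k, which gives each (r, c) location its own run of 9 consecutive slots. The sketch below assumes that layout, which the excerpt implies but does not state outright; cov_index is a hypothetical helper name. By contrast, the quoted r*cols*rows + c*cols spaces neighbouring columns cols apart rather than 9 apart, so the 9-element runs can overlap or leave gaps.

```cpp
#include <cassert>
#include <cstddef>

// Flat offset of element k (0..8) of the 3x3 block stored for location (r, c)
// in a rows x cols x 9 array laid out row-major. The row count is not needed.
std::size_t cov_index(std::size_t r, std::size_t c, std::size_t k, std::size_t cols) {
    assert(k < 9);
    return (r * cols + c) * 9 + k;
}
```

    With cols = 4, location (0, 1) starts at offset 9 and location (1, 0) at offset 36; in a 3 x 4 matrix the last element, (2, 3, 8), lands at offset 107 of 108.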


  • The best way to predict performance without actually porting the code?

    - by ardiyu07
    I believe there are people who have had the same experience as me: having to give an (estimated) performance report for porting a program from sequential to parallel on some designated multicore hardware, with very little time given. For instance, if a 10K LoC sequential program runs on an Intel i7-3770K (not vectorized) in 100 ms, how long would it take to run if one parallelizes the code for a Tesla C2075 with NVIDIA CUDA, given that all kinds of parallelizing optimization techniques are applied? (But you're only given 2-4 days to report the performance, and assume that you don't know the algorithm at all. Or perhaps it would be safer to just assume that it's impossible to finish the job in that time.) Therefore, I'm wondering, what would most likely be the fastest way to produce such a performance report? Is it safe to calculate solely from the hardware's capabilities, such as peak GFLOPS and memory bandwidth? Is there a mathematical way to calculate it? If there is, please demonstrate your method with the corresponding problem description and algorithm, and also the target hardware's specifications. Or perhaps a tool already exists to (roughly) estimate the result of porting code? (Please don't answer: 'killing yourself is the fastest way.')
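
    A calculation from peak GFLOPS and memory bandwidth alone can at best give a lower bound on run time, in the style of a roofline estimate: a kernel cannot finish faster than its arithmetic at peak throughput or its memory traffic at peak bandwidth, whichever is slower. The sketch below shows only that bound. The function name roofline_lower_bound and all its inputs are hypothetical; the operation and byte counts would have to come from analysing the actual algorithm, which the question says is unknown.

```cpp
#include <algorithm>
#include <cassert>

// Lower bound on kernel time in seconds, given the total floating-point operations,
// the total bytes moved to and from memory, and the hardware's peak rates.
double roofline_lower_bound(double flops, double bytes,
                            double peak_flops_per_s, double peak_bytes_per_s) {
    assert(peak_flops_per_s > 0 && peak_bytes_per_s > 0);
    const double compute_time = flops / peak_flops_per_s;  // time if purely compute-bound
    const double memory_time = bytes / peak_bytes_per_s;   // time if purely bandwidth-bound
    return std::max(compute_time, memory_time);
}
```

    Real code rarely reaches either peak, and serial sections, transfers between host and device, and launch overhead all add to the bound, so the result is a floor rather than a prediction.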


  • How does cudaMalloc work?

    - by kitw
    I am trying to modify the imageDenoising class in the CUDA SDK. I need to repeat the filter many times in order to measure the time, but my code doesn't work properly.

        //start
        __global__ void F1D(TColor *image,int imageW,int imageH, TColor *buffer)
        {
            const int ix = blockDim.x * blockIdx.x + threadIdx.x;
            const int iy = blockDim.y * blockIdx.y + threadIdx.y;
            if(iy != 0 && iy < imageH-1 && ix < imageW) {
                float4 fresult = get_color(image[imageW * iy + ix]);
                float4 fresult4 = get_color(image[imageW * (iy+1) + ix]);
                float4 fresult5 = get_color(image[imageW * (iy-1) + ix]);
                float4 fresult7;
                fresult7.x = fresult.x*0.5+fresult4.x*.25+fresult5.x*.25;
                fresult7.y = fresult.y*0.5+fresult4.y*.25+fresult5.y*.25;
                fresult7.z = fresult.z*0.5+fresult4.z*.25+fresult5.z*.25;
                buffer[imageW * iy + ix] = make_color(fresult7.x,fresult7.y,fresult7.z,0);
            }
            image[imageW * iy + ix] = buffer[imageW * iy + ix]; //should be use cudaMemcpy, But it fails
        }

        //extern
        extern "C" void cuda_F1D(TColor *dst, int imageW, int imageH)
        {
            dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
            dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
            Copy<<<grid, threads>>>(dst, imageW, imageH);
            size_t size = imageW*imageH*sizeof(TColor);
            TColor *host =(TColor*) malloc(size);
            TColor *dst2;
            //TColor *dst3;
            //TColor *d = new TColor(imageW*imageH*sizeof(TColor));
            dim3 threads2(imageW,1);
            dim3 grid2(iDivUp(imageW, imageW), iDivUp(imageH, 1));
            for(int i = 0;i<1;i++) {
                cudaMalloc( (void **)&dst2, size);
                cudaMemcpy(dst2, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);
                //cudaMalloc( (void **)&dst3, imageW*imageH*sizeof(TColor));
                //cudaMemcpy(dst3, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);
                F1D<<<grid2, threads2>>>(dst, imageW, imageH,dst2);
                //cudaMemcpy(dst, dst3, imageW*imageH*sizeof(TColor),cudaMemcpyDeviceToHost);
                cudaFree(dst2);
            }
        }

    This code runs, but it cannot keep the image array synchronised, which leads to many synchronisation problems.
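
    One likely contributor to the synchronisation trouble is that the kernel reads neighbouring rows from image while other threads of the same launch write their results back into image. A common way around this is to read only from one buffer, write only to another, and swap the two between passes. The sketch below shows that idea on the host in plain C++, using a single float channel instead of TColor; filter_pass and filter_repeated are hypothetical names, and the border rows are simply copied, which is an assumption about the intended behaviour.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One pass of the vertical 0.25 / 0.5 / 0.25 filter: reads only src, writes only dst.
void filter_pass(const std::vector<float>& src, std::vector<float>& dst, int w, int h) {
    assert(src.size() == dst.size());
    for (int iy = 0; iy < h; ++iy)
        for (int ix = 0; ix < w; ++ix) {
            const int p = w * iy + ix;
            if (iy == 0 || iy == h - 1) {  // border rows have no upper or lower neighbour
                dst[p] = src[p];
                continue;
            }
            dst[p] = 0.5f * src[p] + 0.25f * src[p + w] + 0.25f * src[p - w];
        }
}

// Repeats the filter; swapping the buffers replaces the copy-back inside the kernel.
std::vector<float> filter_repeated(std::vector<float> image, int w, int h, int passes) {
    std::vector<float> scratch(image.size());
    for (int i = 0; i < passes; ++i) {
        filter_pass(image, scratch, w, h);
        std::swap(image, scratch);  // the output of this pass is the input of the next
    }
    return image;
}
```

    On the device the same structure would correspond to two buffers allocated once outside the loop and one kernel launch per pass, with the pointers exchanged between launches.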


  • CUDA error message : unspecified launch failure

    - by user1297065
    I received the error message "unspecified launch failure" in the following part.

        off_t *matches_position;
        ......
        cudaMalloc ( (void **) &mat_position, sizeof(off_t)*10);
        ......
        cudaMemcpy (mat_position, matches_position, sizeof(off_t)*10, cudaMemcpyHostToDevice );
        ......
        err=cudaMemcpy (matches_position, mat_position, sizeof(off_t)*10, cudaMemcpyDeviceToHost );
        if(err!=cudaSuccess) {
            printf("\n3 %s\n", cudaGetErrorString(err));
        }

    Do you know why this error message is reported?


  • Can I legally publish my Fortran 90 wrappers to nVidias CUFFT library (from CUDA SDK)?

    - by Jakub Narebski
    From a legal standpoint (licensing issues), can I, in agreement with the license, publish Fortran 90 wrappers (bindings) to the CUFFT library from the nVidia CUDA Toolkit under some open source license (either CC0, i.e. public domain, or some kind of permissive license like BSD)? nVidia provides only C bindings with their CUDA SDK. The header files contain the following text:

        /*
         * Copyright 1993-2011 NVIDIA Corporation. All rights reserved.
         *
         * NOTICE TO LICENSEE:
         *
         * This source code and/or documentation ("Licensed Deliverables") are
         * subject to NVIDIA intellectual property rights under U.S. and
         * international Copyright laws.
         *
         * These Licensed Deliverables contained herein is PROPRIETARY and
         * CONFIDENTIAL to NVIDIA and is being provided under the terms and
         * conditions of a form of NVIDIA software license agreement by and
         * between NVIDIA and Licensee ("License Agreement") or electronically
         * accepted by Licensee. Notwithstanding any terms or conditions to
         * the contrary in the License Agreement, reproduction or disclosure
         * of the Licensed Deliverables to any third party without the express
         * written consent of NVIDIA is prohibited.

    The License.txt file includes the following fragment:

        Source Code: Developer shall have the right to modify and create derivative works with the Source Code. Developer shall own any derivative works ("Derivatives") it creates to the Source Code, provided that Developer uses the Materials in accordance with the terms and conditions of this Agreement. Developer may distribute the Derivatives, provided that all NVIDIA copyright notices and trademarks are propagated and used properly and the Derivatives include the following statement: "This software contains source code provided by NVIDIA Corporation."

