zenna - Developer IT

CUDA: How to reuse kernels in multiple files (for unit testing)

- by zenna

How can I go about reusing the same kernel without getting fatal linking errors due to defining the symbol multiple times In Visual Studio I get "fatal error LNK1169: one or more multiply defined symbols found" My current structure is as follows: Interface.h has an extern interface to a C function: myCfunction() (ala the C++ integration SDK example) Kernel.cu contains the actual __global__ kernels and is NOT included in the build: __global__ my_kernel() Wrapper.cu inlcudes Kernel.cu and Interface.h and calls my_kernel<<<...>>> This all works fine. But if I add another C function in another file which also includes Kernel.cu and uses those kernels, I get the errors. So how can I reuse the kernels in Kernel.cu among many C functions in different files. The purpose of this by the way is unit testing, and integrating my kernels with CPP unit, if there is no way to reuse kernels (there must be!) then other suggestions for unit testing kernels within my existing CPP unit framework would be appreciate. Thanks Zenna

Read the article

Why won't OpenCV compile in NVCC?

- by zenna

Hi there I am trying to integrate CUDA and openCV in a project. Problem is openCV won't compile when NVCC is used, while a normal c++ project compiles just fine. This seems odd to me, as I thought NVCC passed all host code to the c/c++ compiler, in this case the visual studio compiler. The errors I get are? c:\opencv2.0\include\opencv\cxoperations.hpp(1137): error: no operator "=" matches these operands operand types are: const cv::Range = cv::Range c:\opencv2.0\include\opencv\cxoperations.hpp(2469): error: more than one instance of overloaded function "std::abs" matches the argument list: function "abs(long double)" function "abs(float)" function "abs(double)" function "abs(long)" function "abs(int)" argument types are: (ptrdiff_t) So my question is why the difference considering the same compiler (should be) is being used and secondly how I could remedy this.

Read the article

CUDA: Memory copy to GPU 1 is slower in multi-GPU

- by zenna

My company has a setup of two GTX 295, so a total of 4 GPUs in a server, and we have several servers. We GPU 1 specifically was slow, in comparison to GPU 0, 2 and 3 so I wrote a little speed test to help find the cause of the problem. //#include <stdio.h> //#include <stdlib.h> //#include <cuda_runtime.h> #include <iostream> #include <fstream> #include <sstream> #include <string> #include <cutil.h> __global__ void test_kernel(float *d_data) { int tid = blockDim.x*blockIdx.x + threadIdx.x; for (int i=0;i<10000;++i) { d_data[tid] = float(i*2.2); d_data[tid] += 3.3; } } int main(int argc, char* argv[]) { int deviceCount; cudaGetDeviceCount(&deviceCount); int device = 0; //SELECT GPU HERE cudaSetDevice(device); cudaEvent_t start, stop; unsigned int num_vals = 200000000; float *h_data = new float[num_vals]; for (int i=0;i<num_vals;++i) { h_data[i] = float(i); } float *d_data = NULL; float malloc_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaMalloc((void**)&d_data, sizeof(float)*num_vals); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float mem_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float kernel_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); test_kernel<<<1000,256>>>(d_data); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); printf("cudaMalloc took %f ms\n",malloc_timer); printf("Copy to the GPU took %f ms\n",mem_timer); printf("Test Kernel took %f ms\n",kernel_timer); cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost); delete[] h_data; return 0; } The results are GPU0 cudaMalloc took 0.908640 ms Copy to the GPU took 296.058777 ms Test Kernel took 326.721283 ms GPU1 cudaMalloc took 0.913568 ms Copy to the GPU took[b] 663.182251 ms[/b] Test Kernel took 326.710785 ms GPU2 cudaMalloc took 0.925600 ms Copy to the GPU took 296.915039 ms Test Kernel took 327.127930 ms GPU3 cudaMalloc took 0.920416 ms Copy to the GPU took 296.968384 ms Test Kernel took 327.038696 ms As you can see, the cudaMemcpy to the GPU is well double the amount of time for GPU1. This is consistent between all our servers, it is always GPU1 that is slow. Any ideas why this may be? All servers are running windows XP.

Read the article

Kohana 3 ORM: How to get data from pivot table? and all other tables for that matter

- by zenna

I am trying to use ORM to access data stored, in three mysql tables 'users', 'items', and a pivot table for the many-many relationship: 'user_item' I followed the guidance from http://stackoverflow.com/questions/1946357/kohana-3-orm-read-additional-columns-in-pivot-tables and tried $user = ORM::factory('user',1); $user->items->find_all(); $user_item = ORM::factory('user_item', array('user_id' => $user, 'item_id' => $user->items)); if ($user_item->loaded()) { foreach ($user_item as $pivot) { print_r($pivot); } } But I get the SQL error: "Unknown column 'user_item.id' in 'order clause' [ SELECT user_item.* FROM user_item WHERE user_id = '1' AND item_id = '' ORDER BY user_item.id ASC LIMIT 1 ]" Which is clearly erroneous because Kohana is trying to order the elements by a column which doesn't exist: user_item.id. This id doesnt exist because the primary keys of this pivot table are the foreign keys of the two other tables, 'users' and 'items'. Trying to use: $user_item = ORM::factory('user_item', array('user_id' => $user, 'item_id' => $user->items)) ->order_by('item_id', 'ASC'); Makes no difference, as it seems the order_by() or any sql queries are ignored if the second argument of the factory is given. Another obvious error with that query is that the item_id = '', when it should contain all the elements. So my question is how can I get access to the data stored in the pivot table, and actually how can I get access to the all items held by a particular user as I even had problems with that? Thanks

Read the article

CUDA, more threads for same work = Longer run time despite better occupancy, Why?

- by zenna

I encountered a strange problem where increasing my occupancy by increasing the number of threads reduced performance. I created the following program to illustrate the problem: #include <stdio.h> #include <stdlib.h> #include <cuda_runtime.h> __global__ void less_threads(float * d_out) { int num_inliers; for (int j=0;j<800;++j) { //Do 12 computations num_inliers += threadIdx.x*1; num_inliers += threadIdx.x*2; num_inliers += threadIdx.x*3; num_inliers += threadIdx.x*4; num_inliers += threadIdx.x*5; num_inliers += threadIdx.x*6; num_inliers += threadIdx.x*7; num_inliers += threadIdx.x*8; num_inliers += threadIdx.x*9; num_inliers += threadIdx.x*10; num_inliers += threadIdx.x*11; num_inliers += threadIdx.x*12; } if (threadIdx.x == -1) d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers; } __global__ void more_threads(float *d_out) { int num_inliers; for (int j=0;j<800;++j) { // Do 4 computations num_inliers += threadIdx.x*1; num_inliers += threadIdx.x*2; num_inliers += threadIdx.x*3; num_inliers += threadIdx.x*4; } if (threadIdx.x == -1) d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers; } int main(int argc, char* argv[]) { float *d_out = NULL; cudaMalloc((void**)&d_out,sizeof(float)*25000); more_threads<<<780,128>>>(d_out); less_threads<<<780,32>>>(d_out); return 0; } Note both kernels should do the same amount of work in total, the (if threadIdx.x == -1 is a trick to stop the compiler optimising everything out and leaving an empty kernel). The work should be the same as more_threads is using 4 times as many threads but with each thread doing 4 times less work. Significant results form the profiler results are as followsL: more_threads: GPU runtime = 1474 us,reg per thread = 6,occupancy=1,branch=83746,divergent_branch = 26,instructions = 584065,gst request=1084552 less_threads: GPU runtime = 921 us,reg per thread = 14,occupancy=0.25,branch=20956,divergent_branch = 26,instructions = 312663,gst request=677381 As I said previously, the run time of the kernel using more threads is longer, this could be due to the increased number of instructions. Why are there more instructions? Why is there any branching, let alone divergent branching, considering there is no conditional code? Why are there any gst requests when there is no global memory access? What is going on here! Thanks

Read the article

why including www in a jquery load causes it to fail?

- by zenna

Could someone enlighten me as to why including www in a ajax request causes it to fail. i.e. This works: $('#mydiv').load('http://mydomain.com/getitems'); But this doesn't (returns nothing) $('#mydiv').load('http://www.mydomain.com/getitems'); Note that www.mydomain.com/getitems is a valid domain, in the sense that if I point my web browser to it I am able to load the page.

Developer IT