Originally posted on: http://geekswithblogs.net/freestylecoding/archive/2014/08/20/processing-kinect-v2-color-streams-in-parallel.aspx
Processing Kinect v2 Color Streams in Parallel
I've really been enjoying being a part of the Kinect for Windows Developer's Preview. The new hardware has some really impressive capabilities. However, with great power comes great system specs. Unfortunately, my little laptop that could is not 100% up to the task; I've had to get a little creative.
The most disappointing thing I've run into is that I can't always cleanly display the color camera stream in managed code. I managed to strip the code down to what I believe is the bare minimum:
using( ColorFrame _ColorFrame = e.FrameReference.AcquireFrame() ) {
    if( null == _ColorFrame ) return;
    BitmapToDisplay.Lock();
    _ColorFrame.CopyConvertedFrameDataToIntPtr(
        BitmapToDisplay.BackBuffer,
        Convert.ToUInt32( BitmapToDisplay.BackBufferStride * BitmapToDisplay.PixelHeight ),
        ColorImageFormat.Bgra );
    BitmapToDisplay.AddDirtyRect(
        new Int32Rect(
            0,
            0,
            _ColorFrame.FrameDescription.Width,
            _ColorFrame.FrameDescription.Height ) );
    BitmapToDisplay.Unlock();
}
With this snippet, I'm placing the converted Bgra32 color stream directly on the BackBuffer of the WriteableBitmap. This gives me pretty smooth playback, but I still get the occasional freeze for half a second.
After a bit of profiling, I discovered there were a few problems. The first problem is the size of the buffer along with the conversion on the buffer. At this time, the raw image format of the data from the Kinect is Yuy2. This is great for direct video processing. It would be ideal if I had a WriteableVideo object in WPF. However, this is not the case.
Further digging led me to the real problem. It appears that the SDK is converting the input serially. Let's think about this for a second. The color camera is a 1080p camera. As we should all know, this gives us a native resolution of 1920 x 1080. This produces 2,073,600 pixels. Yuy2 uses 4 bytes per 2 pixels, for a buffer size of 4,147,200 bytes. Bgra32 uses 4 bytes per pixel, for a buffer size of 8,294,400 bytes. The SDK appears to be doing this on one thread.
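The arithmetic above is easy to check in a few lines. This is only a sketch: the ColorBufferSizes class and its two helper methods are names I made up for illustration, not part of the Kinect SDK.

```csharp
using System;

static class ColorBufferSizes {
    // Hypothetical helper: Yuy2 packs every 2 pixels into 4 bytes.
    public static int Yuy2Bytes( int p_Width, int p_Height ) {
        return ( p_Width * p_Height / 2 ) * 4;
    }

    // Hypothetical helper: Bgra32 spends 4 bytes on every pixel.
    public static int Bgra32Bytes( int p_Width, int p_Height ) {
        return p_Width * p_Height * 4;
    }

    static void Main() {
        Console.WriteLine( 1920 * 1080 );                // 2073600 pixels
        Console.WriteLine( Yuy2Bytes( 1920, 1080 ) );    // 4147200 bytes in
        Console.WriteLine( Bgra32Bytes( 1920, 1080 ) );  // 8294400 bytes out
    }
}
```

At 30 frames per second, that is roughly 124 MB read and 249 MB written every second, before anything reaches the screen.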
I started wondering if I could do the conversion better myself. I mean, I have 8 cores in my system. Why can't I use them all?
The first problem is converting a Yuy2 frame into a Bgra32 frame. It is NOT trivial. I spent a day researching just how to do this. In the end, I didn't even produce the best algorithm possible, but it did work.
After I managed to get that to work, I knew my next step was to get the conversion operation off the UI Thread. This was a simple process of throwing the work into a Task. Of course, this meant I had to marshal the final write to the WriteableBitmap back to the UI thread.
Finally, I needed to break the operation into independent units of work so I could run it safely in parallel. This was, mercifully, not quite as hard as I thought it would be. I had my loop hand each iteration an index to a pair of pixels. From there, I had to tell the loop to do everything for this pair of pixels. If you're wondering why I did it for pairs of pixels, look back above at the specification for the Yuy2 format. I won't go into full detail on why each 4 bytes contains 2 pixels of information, but the short version is that each pixel gets its own Y value while the pair shares a single U and V.
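To make the "pair of pixels" idea concrete, here is a minimal sketch that converts a single Yuy2 pair on its own, with no Kinect or WPF involved. It assumes the Y0, U, Y1, V byte order and reuses the integer coefficients from the MSDN page referenced in the full listing. The Yuy2Sketch class, ConvertPair, and WritePixel are names I invented for this example.

```csharp
using System;

static class Yuy2Sketch {
    static byte ClipToByte( int p_Value ) {
        return (byte)Math.Max( (int)byte.MinValue, Math.Min( (int)byte.MaxValue, p_Value ) );
    }

    // Converts the pixel pair at p_PairIndex from Yuy2 (Y0 U Y1 V) to two Bgra32 pixels.
    public static void ConvertPair( byte[] p_Input, byte[] p_Output, int p_PairIndex ) {
        int _Source = p_PairIndex * 4;  // 4 input bytes per pair
        int _Target = p_PairIndex * 8;  // 8 output bytes per pair
        int _Y0 = p_Input[_Source + 0] - 16;
        int _U = p_Input[_Source + 1] - 128;
        int _Y1 = p_Input[_Source + 2] - 16;
        int _V = p_Input[_Source + 3] - 128;
        // Both pixels share the same U and V; only Y differs.
        WritePixel( p_Output, _Target, _Y0, _U, _V );
        WritePixel( p_Output, _Target + 4, _Y1, _U, _V );
    }

    static void WritePixel( byte[] p_Output, int p_Offset, int p_Y, int p_U, int p_V ) {
        p_Output[p_Offset + 0] = ClipToByte( ( 298 * p_Y + 516 * p_U + 128 ) >> 8 );               // B
        p_Output[p_Offset + 1] = ClipToByte( ( 298 * p_Y - 100 * p_U - 208 * p_V + 128 ) >> 8 );   // G
        p_Output[p_Offset + 2] = ClipToByte( ( 298 * p_Y + 409 * p_V + 128 ) >> 8 );               // R
        p_Output[p_Offset + 3] = 0xFF;                                                             // A
    }

    static void Main() {
        // One pair: a white pixel (Y=235) and a black pixel (Y=16), neutral chroma.
        byte[] _Input = { 235, 128, 16, 128 };
        byte[] _Output = new byte[8];
        ConvertPair( _Input, _Output, 0 );
        Console.WriteLine( string.Join( ",", _Output ) );  // 255,255,255,255,0,0,0,255
    }
}
```

Because each call reads 4 bytes and writes 8 bytes that no other pair touches, any number of pairs can be converted at the same time without locking.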
The first working attempt at this algorithm successfully turned my poor laptop into a space heater. I very quickly brought all 8 cores up to about 97% usage and kept them there. That's when I remembered that obscure option in the Task Parallel Library where you could limit the amount of parallelism used. After a little trial and error, I discovered 4 parallel tasks was enough for most cases. This yielded the following code:
private byte ClipToByte( int p_ValueToClip ) {
    return Convert.ToByte( ( p_ValueToClip < byte.MinValue ) ? byte.MinValue : ( ( p_ValueToClip > byte.MaxValue ) ? byte.MaxValue : p_ValueToClip ) );
}
private void ColorFrameArrived( object sender, ColorFrameArrivedEventArgs e ) {
    if( null == e.FrameReference ) return;
    // If you do not dispose of the frame, you never get another one...
    using( ColorFrame _ColorFrame = e.FrameReference.AcquireFrame() ) {
        if( null == _ColorFrame ) return;
        byte[] _InputImage = new byte[_ColorFrame.FrameDescription.LengthInPixels * _ColorFrame.FrameDescription.BytesPerPixel];
        byte[] _OutputImage = new byte[BitmapToDisplay.BackBufferStride * BitmapToDisplay.PixelHeight];
        _ColorFrame.CopyRawFrameDataToArray( _InputImage );
        Task.Factory.StartNew( () => {
            ParallelOptions _ParallelOptions = new ParallelOptions();
            _ParallelOptions.MaxDegreeOfParallelism = 4;
            Parallel.For( 0, Sensor.ColorFrameSource.FrameDescription.LengthInPixels / 2, _ParallelOptions, ( _Index ) => {
                // See http://msdn.microsoft.com/en-us/library/windows/desktop/dd206750(v=vs.85).aspx
                int _Y0 = _InputImage[( _Index << 2 ) + 0] - 16;
                int _U = _InputImage[( _Index << 2 ) + 1] - 128;
                int _Y1 = _InputImage[( _Index << 2 ) + 2] - 16;
                int _V = _InputImage[( _Index << 2 ) + 3] - 128;
                byte _R = ClipToByte( ( 298 * _Y0 + 409 * _V + 128 ) >> 8 );
                byte _G = ClipToByte( ( 298 * _Y0 - 100 * _U - 208 * _V + 128 ) >> 8 );
                byte _B = ClipToByte( ( 298 * _Y0 + 516 * _U + 128 ) >> 8 );
                _OutputImage[( _Index << 3 ) + 0] = _B;
                _OutputImage[( _Index << 3 ) + 1] = _G;
                _OutputImage[( _Index << 3 ) + 2] = _R;
                _OutputImage[( _Index << 3 ) + 3] = 0xFF; // A
                _R = ClipToByte( ( 298 * _Y1 + 409 * _V + 128 ) >> 8 );
                _G = ClipToByte( ( 298 * _Y1 - 100 * _U - 208 * _V + 128 ) >> 8 );
                _B = ClipToByte( ( 298 * _Y1 + 516 * _U + 128 ) >> 8 );
                _OutputImage[( _Index << 3 ) + 4] = _B;
                _OutputImage[( _Index << 3 ) + 5] = _G;
                _OutputImage[( _Index << 3 ) + 6] = _R;
                _OutputImage[( _Index << 3 ) + 7] = 0xFF;
            } );
            Application.Current.Dispatcher.Invoke( () => {
                BitmapToDisplay.WritePixels(
                    new Int32Rect( 0, 0, Sensor.ColorFrameSource.FrameDescription.Width, Sensor.ColorFrameSource.FrameDescription.Height ),
                    _OutputImage,
                    BitmapToDisplay.BackBufferStride,
                    0 );
            } );
        } );
    }
}
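The MaxDegreeOfParallelism setting in the listing above can also be seen in isolation. The sketch below is not from my original code: ParallelCapSketch and ObservePeakConcurrency are hypothetical names, and the spin wait is just stand-in work. It counts how many iterations are running at once and reports the highest number it saw.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ParallelCapSketch {
    // Runs a capped Parallel.For and returns the most iterations seen running at once.
    public static int ObservePeakConcurrency( int p_Iterations, int p_MaxDegree ) {
        int _Current = 0;
        int _Peak = 0;
        ParallelOptions _ParallelOptions = new ParallelOptions();
        _ParallelOptions.MaxDegreeOfParallelism = p_MaxDegree;
        Parallel.For( 0, p_Iterations, _ParallelOptions, ( _Index ) => {
            int _Now = Interlocked.Increment( ref _Current );
            // Record a new peak with a compare-and-swap loop.
            int _Seen;
            while( _Now > ( _Seen = Volatile.Read( ref _Peak ) ) ) {
                if( Interlocked.CompareExchange( ref _Peak, _Now, _Seen ) == _Seen ) break;
            }
            Thread.SpinWait( 10000 );  // stand-in for the per-pair conversion work
            Interlocked.Decrement( ref _Current );
        } );
        return _Peak;
    }

    static void Main() {
        int _Peak = ObservePeakConcurrency( 1000, 4 );
        Console.WriteLine( _Peak <= 4 );  // True
    }
}
```

The cap is an upper bound, not a target. The scheduler is free to use fewer workers than the limit, which is why the check is "at most 4" rather than "exactly 4".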
The parallel version seemed to yield the results I wanted, but there was still the occasional stutter. This led to what I realized was the second problem. There is a race condition between the UI Thread and me locking the WriteableBitmap so I can write the next frame. Again, I'm writing approximately 8MB to the back buffer.
Then, I started thinking I could cheat. The Kinect is running at 30 frames per second. WPF normally renders at 60 frames per second. This made me not feel bad about exploiting the rendering pass. I moved the bulk of the code from the FrameArrived handler into CompositionTarget.Rendering, which WPF raises once per rendered frame. Once I was in there, I polled for a frame, and rendered it if it existed. Since, in theory, I'm only stalling the render loop every other hit, I decided I was ok with this for cases where silky smooth video performance REALLY mattered. This code looked like this:
private byte ClipToByte( int p_ValueToClip ) {
    return Convert.ToByte( ( p_ValueToClip < byte.MinValue ) ? byte.MinValue : ( ( p_ValueToClip > byte.MaxValue ) ? byte.MaxValue : p_ValueToClip ) );
}
void CompositionTarget_Rendering( object sender, EventArgs e ) {
    using( ColorFrame _ColorFrame = FrameReader.AcquireLatestFrame() ) {
        if( null == _ColorFrame )
            return;
        byte[] _InputImage = new byte[_ColorFrame.FrameDescription.LengthInPixels * _ColorFrame.FrameDescription.BytesPerPixel];
        byte[] _OutputImage = new byte[BitmapToDisplay.BackBufferStride * BitmapToDisplay.PixelHeight];
        _ColorFrame.CopyRawFrameDataToArray( _InputImage );
        ParallelOptions _ParallelOptions = new ParallelOptions();
        _ParallelOptions.MaxDegreeOfParallelism = 4;
        Parallel.For( 0, Sensor.ColorFrameSource.FrameDescription.LengthInPixels / 2, _ParallelOptions, ( _Index ) => {
            // See http://msdn.microsoft.com/en-us/library/windows/desktop/dd206750(v=vs.85).aspx
            int _Y0 = _InputImage[( _Index << 2 ) + 0] - 16;
            int _U = _InputImage[( _Index << 2 ) + 1] - 128;
            int _Y1 = _InputImage[( _Index << 2 ) + 2] - 16;
            int _V = _InputImage[( _Index << 2 ) + 3] - 128;
            byte _R = ClipToByte( ( 298 * _Y0 + 409 * _V + 128 ) >> 8 );
            byte _G = ClipToByte( ( 298 * _Y0 - 100 * _U - 208 * _V + 128 ) >> 8 );
            byte _B = ClipToByte( ( 298 * _Y0 + 516 * _U + 128 ) >> 8 );
            _OutputImage[( _Index << 3 ) + 0] = _B;
            _OutputImage[( _Index << 3 ) + 1] = _G;
            _OutputImage[( _Index << 3 ) + 2] = _R;
            _OutputImage[( _Index << 3 ) + 3] = 0xFF; // A
            _R = ClipToByte( ( 298 * _Y1 + 409 * _V + 128 ) >> 8 );
            _G = ClipToByte( ( 298 * _Y1 - 100 * _U - 208 * _V + 128 ) >> 8 );
            _B = ClipToByte( ( 298 * _Y1 + 516 * _U + 128 ) >> 8 );
            _OutputImage[( _Index << 3 ) + 4] = _B;
            _OutputImage[( _Index << 3 ) + 5] = _G;
            _OutputImage[( _Index << 3 ) + 6] = _R;
            _OutputImage[( _Index << 3 ) + 7] = 0xFF;
        } );
        BitmapToDisplay.WritePixels(
            new Int32Rect( 0, 0, Sensor.ColorFrameSource.FrameDescription.Width, Sensor.ColorFrameSource.FrameDescription.Height ),
            _OutputImage,
            BitmapToDisplay.BackBufferStride,
            0 );
    }
}