smallpt CPU optimizations

September 19th, 2010 § 0 comments § permalink

Introduction

I wrote a multithreaded SIMD CPU optimized version of smallpt (http://www.kevinbeason.com/smallpt/) in order gain more experience and improve my CPU optimization skills.

Implementation FPS Samples/Sec
Single-Threaded Unoptimized (double) 0.49 313.6K
Single-Threaded Optimized (float) 0.88 563.2K
Multi-Threaded Optimized (float) 2.8-3.3 1792.0K-2112.0K

Reorganizing the data in SOA fashion, I was able to speed up the Ray-Sphere intersection code 2.82x times. Also used SIMD to speed up the vector normalization code a little bit. Because of the incoherent nature of randomly sampled secondary rays, the radiance calculation code didn’t seem easily or effectively vectorizable, so I did not try using SIMD there. Without multithreading, I get a speedup of around 1.79x. Using the Nulstein work-stealing task scheduler, with multithreading enabled, I get a speedup of around 5.71x-6.73x.

CodePlex link for my program’s source code and binary are provided at the end of the article.

A Picture


Renderered for around 45 minutes using the multithreaded fully optimized version.

My Setup

CPU: Intel Q6600 Core2Quad 2.4 ghz (not-overclocked.) 4 CPU cores.
OS: Vista 64-bit SP2

Profiler

I used AMD CodeAnalyst. (http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/)

Going from double to float

Implementation FPS Samples/Sec
double 0.49 313.6K
float 0.6 384.0K

Replacing double with float resulted in a 1.22x speedup. This is even with implicit float->double conversion occurring for the float version, but not the double version.

Float Precision Problems

Going from double to float introduced precision errors into the smallpt render (http://www.butamanrenderer.com/2010/08/08/smallpt-float-precision-problems/). I tried offsetting the ray center in the ray direction using a small epsilon value in order to avoid self-intersections, but I couldn’t find a magic epsilon to get rid of all artifacts. This was a very annoying problem because sometimes the error would take a bit of converging before becoming apparent. In the end, I gave up my search for the magic epsilon and modified the scene in smiliar fashion to SmallptGPU (http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html), decreasing the light’s radius and bringing it down inside the room. In addition, I also decreased the radius of the sphere walls. This got rid of the float precision problems for me.

Multithreading

For multithreading, I used the Nulstein work-stealing task scheduler (http://www.butamanrenderer.com/2010/08/16/nulstein-a-work-stealing-task-scheduler/). The original smallpt code uses OpenMP for multithreading, but Visual Studio 2008 Standard Edition doesn’t support it out of the box and using OpenMP with it requires a bit of work (http://stackoverflow.com/questions/865686/openmp-in-visual-studio-2005-standard), so I didn’t try it.

Implementation FPS Samples/Sec
Single-Threaded Optimized 0.88 563.2K
Multi-Threaded Optimized 2.8-3.3 1792.0K-2112.0K

With multithreading, the fps fluctuates a bit, and I get a speedup of around 3.18x-3.75x on a 4-core CPU.

SOA Ray-Sphere SIMD

Implementation FPS Samples/Sec
No SIMD SOA 0.67 428.8K
SIMD SOA 0.88 563.2K

Using AMD CodeAnalyst, I took a profile of my smallpt implementation, and saw Ray-Sphere intersection taking up around 55% of total processing time. A good candidate for some optimization! Effective SIMD Ray-Sphere intersection optimizations usually involve rearranging the data in SOA (structure of arrays) format so that 4-rays vs 1-sphere or 1-ray vs 4-sphere intersection tests can be performed at once. A 1-ray versus 4-sphere SOA modification was easier to implement with smallpt, so I went with that. Note that though smallpt’s speedup was around 1.313x

(time(ms) spent total) *
(Ray-Sphere intersection % of total processing)
 = (time(ms) spent for Ray-Sphere intersection)
( 1000 / 0.49 ) * 0.55 = 1122ms
( 1000 / 0.88 ) * 0.35 = 397ms
1122ms / 397ms = 2.82x

meaning the speedup of Ray-Sphere intersection itself was 2.82x, so a pretty good speedup.

SIMD Optimized Sqrt

Implementation FPS Samples/Sec
sqrtf (SSE scalar sqrtss) 0.88 563.2K
SSE scalar rsqrtss 0.90 576.0K
SSE scalar rsqrtss with one NR step 0.87 556.8K

After reading http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/ and http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/, I decided to try rsqrtss( x )*x=sqrt(x) for optimization. Since I have SSE2 enabled, the default c runtime sqrtf function gets compiled into ssqrts. I tried rsqrtss( x )*x and rsqrtss( x )*x with one step of Newton-Raphson iteration. rsqrtss( x )*x is a tiny bit faster, but its 11-bit precision estimate is not sufficient and I get artifacts in the image. rsqrtss( x )*x with one step of Newton-Raphson iteration has no artifacts, but for my code, is a tiny bit slower than the c runtime sqrtf function.

SIMD Optimized Vector Normalize

Implementation FPS Samples/Sec
Non-SIMD Vector Normalize 0.83 531.2K
SIMD Vector Normalize 0.88 563.2K

I used code from http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/ to try a faster vector normalize using rsqrtss( x )*x=sqrt(x) with one step of Newton-Rhaphson and I got 1.06x speedup on smallpt itself. I did not check how much the normalize function itself is becoming optimized, but it may be significant.

Tonemapping

http://www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/
I use the DX11 compute shader Rheinhard global operator tonemapper I wrote for Yakiimo3D.

Source Code & Binary

I use DirectX11 for rendering, so you’ll need a DirectX11 capable video card in order to run the program.
http://butamanrenderer.codeplex.com/releases/view/52569

The “Reference” and “Optimized” directories respectively contains the un-optimized and optimized smallpt implementations. Use “Optimized/SmallPtDefines.h” to toggle on and off SOA Ray-Sphere intersection and other optimizations.

Nulstein: A Work Stealing Task Scheduler

August 16th, 2010 § 0 comments § permalink

http://www.gamasutra.com/view/feature/4287/sponsored_feature_doityourself_.php
Nulstein Gamasutra article.

http://software.intel.com/en-us/articles/do-it-yourself-game-task-scheduling/
Nulstein Intel article. Seems to be the same as the Gamasutra article. Different reader comments.

Nulstein is a work-stealing task scheduler written by an ex-game dev Intel developer. In the article, the author says he wanted to write a Intel TBB-like, but more compact task scheduler that would fit inside a 64K demo. Nulstein seems to be a minimalist, but clearly written and fully functional work-stealing task scheduler, and since I don’t have any experience writing a multithreaded task scheduler, I’m going to use it in code and try to learn from it.


I integrated the Nulstein task scheduler into my DX11 smallpt code and made smallpt multithreaded. On my Intel Core2 Quad Q6600 2.4ghz with 4 CPU cores, the fps went from around 0.46 fps to around 1.76 fps. The sampling rate went from around 294400 samples/sec to around 1125561 samples/sec. Though I haven’t measured how much work stealing is helping out, the x3.82 speed-up I’m getting for going from 1 CPU core to 4 CPU cores seems pretty good.

https://www.cmpevents.com/GDCE10/a.asp?option=C&V=11&SessID=11257
The author of Nulstein is giving a talk titled “UFO Invasion: DX11 and Multicore to the Rescue” at GDC Europe 2010. The talk seems to be a follow-up of the above article. The talk summary says multicore CPU programming (Nulstein again) and DX11 deferred context multithreaded draw calls will be covered with the talk’s full source code to be made available. I am so far liking Nulstein and am looking forward to this talk’s material being made public.

smallpt float precision problems

August 8th, 2010 § 0 comments § permalink



Argh, simply searching and replacing double with float results in self-intersection precision errors. The ompf smallpt thread mentions this problem http://ompf.org/forum/viewtopic.php?f=8&t=1079&start=0&sid=df3f2a3cc54c52b1accc4e6baf9e1c73.

smallpt

August 8th, 2010 § 0 comments § permalink

I’m going to begin my renderer studies by writing a SIMD, multithreaded implementation of smallpt (http://kevinbeason.com/smallpt/).

I looked at the source code of smallpt a couple of months ago after the OpenCL GPGPU implementation of it, SmallptGPU (http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html), came out. I was learning DirectCompute and thought it would be good for me to write a DirectCompute version of smallpt, but instead of doing that I ended up writing a DirectCompute Buddhabrot renderer (http://www.yakiimo3d.com/2010/03/29/dx11-directcompute-buddhabrot-nebulabrot/).

Using SmallPtGPU as reference, I had a CPU version of smallpt working inside a DX11 StructuredBuffer render framework, and I’m picking up where I left off and using that source code for my smallpt studies. With my current CPU version of smallpt, on my Core2 Quad Q6600 2.4ghz, I get around 0.46 fps while rendering a 400×400 image with 2×2 supersampling, for 400*400*0.46*4=294400 samples/sec.

Where Am I?

You are currently browsing the PathTracing category at Butaman Renderer.