Introduction
I wrote a multithreaded SIMD CPU optimized version of smallpt (http://www.kevinbeason.com/smallpt/) in order gain more experience and improve my CPU optimization skills.
| Implementation | FPS | Samples/Sec |
|---|---|---|
| Single-Threaded Unoptimized (double) | 0.49 | 313.6K |
| Single-Threaded Optimized (float) | 0.88 | 563.2K |
| Multi-Threaded Optimized (float) | 2.8-3.3 | 1792.0K-2112.0K |
Reorganizing the data in SOA fashion, I was able to speed up the Ray-Sphere intersection code 2.82x times. Also used SIMD to speed up the vector normalization code a little bit. Because of the incoherent nature of randomly sampled secondary rays, the radiance calculation code didn’t seem easily or effectively vectorizable, so I did not try using SIMD there. Without multithreading, I get a speedup of around 1.79x. Using the Nulstein work-stealing task scheduler, with multithreading enabled, I get a speedup of around 5.71x-6.73x.
CodePlex link for my program’s source code and binary are provided at the end of the article.
A Picture

Renderered for around 45 minutes using the multithreaded fully optimized version.
My Setup
CPU: Intel Q6600 Core2Quad 2.4 ghz (not-overclocked.) 4 CPU cores.
OS: Vista 64-bit SP2
Profiler
I used AMD CodeAnalyst. (http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/)
Going from double to float
| Implementation | FPS | Samples/Sec |
|---|---|---|
| double | 0.49 | 313.6K |
| float | 0.6 | 384.0K |
Replacing double with float resulted in a 1.22x speedup. This is even with implicit float->double conversion occurring for the float version, but not the double version.
Float Precision Problems
Going from double to float introduced precision errors into the smallpt render (http://www.butamanrenderer.com/2010/08/08/smallpt-float-precision-problems/). I tried offsetting the ray center in the ray direction using a small epsilon value in order to avoid self-intersections, but I couldn’t find a magic epsilon to get rid of all artifacts. This was a very annoying problem because sometimes the error would take a bit of converging before becoming apparent. In the end, I gave up my search for the magic epsilon and modified the scene in smiliar fashion to SmallptGPU (http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html), decreasing the light’s radius and bringing it down inside the room. In addition, I also decreased the radius of the sphere walls. This got rid of the float precision problems for me.
Multithreading
For multithreading, I used the Nulstein work-stealing task scheduler (http://www.butamanrenderer.com/2010/08/16/nulstein-a-work-stealing-task-scheduler/). The original smallpt code uses OpenMP for multithreading, but Visual Studio 2008 Standard Edition doesn’t support it out of the box and using OpenMP with it requires a bit of work (http://stackoverflow.com/questions/865686/openmp-in-visual-studio-2005-standard), so I didn’t try it.
| Implementation | FPS | Samples/Sec |
|---|---|---|
| Single-Threaded Optimized | 0.88 | 563.2K |
| Multi-Threaded Optimized | 2.8-3.3 | 1792.0K-2112.0K |
With multithreading, the fps fluctuates a bit, and I get a speedup of around 3.18x-3.75x on a 4-core CPU.
SOA Ray-Sphere SIMD
| Implementation | FPS | Samples/Sec |
|---|---|---|
| No SIMD SOA | 0.67 | 428.8K |
| SIMD SOA | 0.88 | 563.2K |
Using AMD CodeAnalyst, I took a profile of my smallpt implementation, and saw Ray-Sphere intersection taking up around 55% of total processing time. A good candidate for some optimization! Effective SIMD Ray-Sphere intersection optimizations usually involve rearranging the data in SOA (structure of arrays) format so that 4-rays vs 1-sphere or 1-ray vs 4-sphere intersection tests can be performed at once. A 1-ray versus 4-sphere SOA modification was easier to implement with smallpt, so I went with that. Note that though smallpt’s speedup was around 1.313x
(time(ms) spent total) * (Ray-Sphere intersection % of total processing) = (time(ms) spent for Ray-Sphere intersection) ( 1000 / 0.49 ) * 0.55 = 1122ms ( 1000 / 0.88 ) * 0.35 = 397ms 1122ms / 397ms = 2.82x
meaning the speedup of Ray-Sphere intersection itself was 2.82x, so a pretty good speedup.
SIMD Optimized Sqrt
| Implementation | FPS | Samples/Sec |
|---|---|---|
| sqrtf (SSE scalar sqrtss) | 0.88 | 563.2K |
| SSE scalar rsqrtss | 0.90 | 576.0K |
| SSE scalar rsqrtss with one NR step | 0.87 | 556.8K |
After reading http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/ and http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/, I decided to try rsqrtss( x )*x=sqrt(x) for optimization. Since I have SSE2 enabled, the default c runtime sqrtf function gets compiled into ssqrts. I tried rsqrtss( x )*x and rsqrtss( x )*x with one step of Newton-Raphson iteration. rsqrtss( x )*x is a tiny bit faster, but its 11-bit precision estimate is not sufficient and I get artifacts in the image. rsqrtss( x )*x with one step of Newton-Raphson iteration has no artifacts, but for my code, is a tiny bit slower than the c runtime sqrtf function.
SIMD Optimized Vector Normalize
| Implementation | FPS | Samples/Sec |
|---|---|---|
| Non-SIMD Vector Normalize | 0.83 | 531.2K |
| SIMD Vector Normalize | 0.88 | 563.2K |
I used code from http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/ to try a faster vector normalize using rsqrtss( x )*x=sqrt(x) with one step of Newton-Rhaphson and I got 1.06x speedup on smallpt itself. I did not check how much the normalize function itself is becoming optimized, but it may be significant.
Tonemapping
http://www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/
I use the DX11 compute shader Rheinhard global operator tonemapper I wrote for Yakiimo3D.
Source Code & Binary
I use DirectX11 for rendering, so you’ll need a DirectX11 capable video card in order to run the program.
http://butamanrenderer.codeplex.com/releases/view/52569
The “Reference” and “Optimized” directories respectively contains the un-optimized and optimized smallpt implementations. Use “Optimized/SmallPtDefines.h” to toggle on and off SOA Ray-Sphere intersection and other optimizations.



