smallpt CPU optimizations

September 19th, 2010 § 0 comments § permalink

Introduction

I wrote a multithreaded SIMD CPU optimized version of smallpt (http://www.kevinbeason.com/smallpt/) in order gain more experience and improve my CPU optimization skills.

Implementation FPS Samples/Sec
Single-Threaded Unoptimized (double) 0.49 313.6K
Single-Threaded Optimized (float) 0.88 563.2K
Multi-Threaded Optimized (float) 2.8-3.3 1792.0K-2112.0K

Reorganizing the data in SOA fashion, I was able to speed up the Ray-Sphere intersection code 2.82x times. Also used SIMD to speed up the vector normalization code a little bit. Because of the incoherent nature of randomly sampled secondary rays, the radiance calculation code didn’t seem easily or effectively vectorizable, so I did not try using SIMD there. Without multithreading, I get a speedup of around 1.79x. Using the Nulstein work-stealing task scheduler, with multithreading enabled, I get a speedup of around 5.71x-6.73x.

CodePlex link for my program’s source code and binary are provided at the end of the article.

A Picture


Renderered for around 45 minutes using the multithreaded fully optimized version.

My Setup

CPU: Intel Q6600 Core2Quad 2.4 ghz (not-overclocked.) 4 CPU cores.
OS: Vista 64-bit SP2

Profiler

I used AMD CodeAnalyst. (http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/)

Going from double to float

Implementation FPS Samples/Sec
double 0.49 313.6K
float 0.6 384.0K

Replacing double with float resulted in a 1.22x speedup. This is even with implicit float->double conversion occurring for the float version, but not the double version.

Float Precision Problems

Going from double to float introduced precision errors into the smallpt render (http://www.butamanrenderer.com/2010/08/08/smallpt-float-precision-problems/). I tried offsetting the ray center in the ray direction using a small epsilon value in order to avoid self-intersections, but I couldn’t find a magic epsilon to get rid of all artifacts. This was a very annoying problem because sometimes the error would take a bit of converging before becoming apparent. In the end, I gave up my search for the magic epsilon and modified the scene in smiliar fashion to SmallptGPU (http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html), decreasing the light’s radius and bringing it down inside the room. In addition, I also decreased the radius of the sphere walls. This got rid of the float precision problems for me.

Multithreading

For multithreading, I used the Nulstein work-stealing task scheduler (http://www.butamanrenderer.com/2010/08/16/nulstein-a-work-stealing-task-scheduler/). The original smallpt code uses OpenMP for multithreading, but Visual Studio 2008 Standard Edition doesn’t support it out of the box and using OpenMP with it requires a bit of work (http://stackoverflow.com/questions/865686/openmp-in-visual-studio-2005-standard), so I didn’t try it.

Implementation FPS Samples/Sec
Single-Threaded Optimized 0.88 563.2K
Multi-Threaded Optimized 2.8-3.3 1792.0K-2112.0K

With multithreading, the fps fluctuates a bit, and I get a speedup of around 3.18x-3.75x on a 4-core CPU.

SOA Ray-Sphere SIMD

Implementation FPS Samples/Sec
No SIMD SOA 0.67 428.8K
SIMD SOA 0.88 563.2K

Using AMD CodeAnalyst, I took a profile of my smallpt implementation, and saw Ray-Sphere intersection taking up around 55% of total processing time. A good candidate for some optimization! Effective SIMD Ray-Sphere intersection optimizations usually involve rearranging the data in SOA (structure of arrays) format so that 4-rays vs 1-sphere or 1-ray vs 4-sphere intersection tests can be performed at once. A 1-ray versus 4-sphere SOA modification was easier to implement with smallpt, so I went with that. Note that though smallpt’s speedup was around 1.313x

(time(ms) spent total) *
(Ray-Sphere intersection % of total processing)
 = (time(ms) spent for Ray-Sphere intersection)
( 1000 / 0.49 ) * 0.55 = 1122ms
( 1000 / 0.88 ) * 0.35 = 397ms
1122ms / 397ms = 2.82x

meaning the speedup of Ray-Sphere intersection itself was 2.82x, so a pretty good speedup.

SIMD Optimized Sqrt

Implementation FPS Samples/Sec
sqrtf (SSE scalar sqrtss) 0.88 563.2K
SSE scalar rsqrtss 0.90 576.0K
SSE scalar rsqrtss with one NR step 0.87 556.8K

After reading http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/ and http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/, I decided to try rsqrtss( x )*x=sqrt(x) for optimization. Since I have SSE2 enabled, the default c runtime sqrtf function gets compiled into ssqrts. I tried rsqrtss( x )*x and rsqrtss( x )*x with one step of Newton-Raphson iteration. rsqrtss( x )*x is a tiny bit faster, but its 11-bit precision estimate is not sufficient and I get artifacts in the image. rsqrtss( x )*x with one step of Newton-Raphson iteration has no artifacts, but for my code, is a tiny bit slower than the c runtime sqrtf function.

SIMD Optimized Vector Normalize

Implementation FPS Samples/Sec
Non-SIMD Vector Normalize 0.83 531.2K
SIMD Vector Normalize 0.88 563.2K

I used code from http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/ to try a faster vector normalize using rsqrtss( x )*x=sqrt(x) with one step of Newton-Rhaphson and I got 1.06x speedup on smallpt itself. I did not check how much the normalize function itself is becoming optimized, but it may be significant.

Tonemapping

http://www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/
I use the DX11 compute shader Rheinhard global operator tonemapper I wrote for Yakiimo3D.

Source Code & Binary

I use DirectX11 for rendering, so you’ll need a DirectX11 capable video card in order to run the program.
http://butamanrenderer.codeplex.com/releases/view/52569

The “Reference” and “Optimized” directories respectively contains the un-optimized and optimized smallpt implementations. Use “Optimized/SmallPtDefines.h” to toggle on and off SOA Ray-Sphere intersection and other optimizations.

Where Am I?

You are currently browsing the Demo category at Butaman Renderer.