smallpt CPU optimizations

September 19th, 2010

Introduction

I wrote a multithreaded, SIMD-optimized CPU version of smallpt (http://www.kevinbeason.com/smallpt/) in order to gain more experience and improve my CPU optimization skills.

Implementation | FPS | Samples/Sec
Single-Threaded Unoptimized (double) | 0.49 | 313.6K
Single-Threaded Optimized (float) | 0.88 | 563.2K
Multi-Threaded Optimized (float) | 2.8-3.3 | 1792.0K-2112.0K

By reorganizing the data in SOA (structure of arrays) fashion, I was able to speed up the Ray-Sphere intersection code by 2.82x. I also used SIMD to speed up the vector normalization code a little. Because of the incoherent nature of randomly sampled secondary rays, the radiance calculation code didn’t seem easy or effective to vectorize, so I did not try using SIMD there. Without multithreading, I get an overall speedup of around 1.79x. With multithreading enabled, using the Nulstein work-stealing task scheduler, I get a speedup of around 5.71x-6.73x.

A CodePlex link to my program’s source code and binary is provided at the end of the article.

A Picture


Rendered for around 45 minutes using the multithreaded, fully optimized version.

My Setup

CPU: Intel Q6600 Core2Quad 2.4 GHz (not overclocked), 4 CPU cores.
OS: Vista 64-bit SP2

Profiler

I used AMD CodeAnalyst. (http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/)

Going from double to float

Implementation | FPS | Samples/Sec
double | 0.49 | 313.6K
float | 0.6 | 384.0K

Replacing double with float resulted in a 1.22x speedup. This is despite the float version still incurring implicit float->double conversions, which the double version does not have.
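The post does not say where the implicit float->double conversions came from, so the following is only a general C++ illustration of how they typically sneak into float code. The helper names are hypothetical and not taken from the smallpt source.

```cpp
#include <cmath>
#include <type_traits>

// Hypothetical helpers for illustration only (not from the post's source code).

// The literal 0.5 is a double, so x is promoted to double, multiplied,
// and the result is narrowed back to float on return.
inline float scaleWithDoubleLiteral(float x) { return x * 0.5; }

// The literal 0.5f is a float, so the multiply stays in single precision.
inline float scaleWithFloatLiteral(float x) { return x * 0.5f; }

// The type of the expression shows where the promotion happens.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal is evaluated as double");
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal stays float");

// std::sqrt has a float overload, so passing a float keeps the result in float.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) returns float");
```

Both helpers return the same value for simple inputs. The difference is only in the extra convert instructions the first one may generate.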

Float Precision Problems

Going from double to float introduced precision errors into the smallpt render (http://www.butamanrenderer.com/2010/08/08/smallpt-float-precision-problems/). I tried offsetting the ray origin along the ray direction by a small epsilon value in order to avoid self-intersections, but I couldn’t find a magic epsilon that got rid of all artifacts. This was a very annoying problem because sometimes the error would only become apparent after the image had converged for a while. In the end, I gave up my search for the magic epsilon and modified the scene in a similar fashion to SmallptGPU (http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html), decreasing the light’s radius and bringing it down inside the room. In addition, I also decreased the radius of the sphere walls. This got rid of the float precision problems for me.
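A rough way to see why a single epsilon is hard to find, assuming wall-sphere coordinates on the order of 1e5 as in the original smallpt scene: at that magnitude, neighbouring float values are much further apart than a typical 1e-4 epsilon. The helper names below are hypothetical and only illustrate the effect.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable value above it
// (the local resolution of the type at that magnitude).
inline float spacingAbove(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
inline double spacingAbove(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Returns true when adding eps to position does not change the stored float,
// i.e. the offset is smaller than the type can resolve at that magnitude.
inline bool offsetIsLost(float position, float eps) {
    float moved = position + eps;
    return moved == position;
}
```

Near 1e5, the float spacing is 2^-7 (about 0.0078), while the double spacing is around 1.5e-11. Shrinking the scene's large radii, as described above, moves the arithmetic into a range where float has more resolution.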

Multithreading

For multithreading, I used the Nulstein work-stealing task scheduler (http://www.butamanrenderer.com/2010/08/16/nulstein-a-work-stealing-task-scheduler/). The original smallpt code uses OpenMP for multithreading, but Visual Studio 2008 Standard Edition doesn’t support it out of the box and using OpenMP with it requires a bit of work (http://stackoverflow.com/questions/865686/openmp-in-visual-studio-2005-standard), so I didn’t try it.

Implementation | FPS | Samples/Sec
Single-Threaded Optimized | 0.88 | 563.2K
Multi-Threaded Optimized | 2.8-3.3 | 1792.0K-2112.0K

With multithreading, the FPS fluctuates a bit, and I get a speedup of around 3.18x-3.75x over the single-threaded optimized version on a 4-core CPU.

SOA Ray-Sphere SIMD

Implementation | FPS | Samples/Sec
No SIMD SOA | 0.67 | 428.8K
SIMD SOA | 0.88 | 563.2K

Using AMD CodeAnalyst, I took a profile of my smallpt implementation and saw Ray-Sphere intersection taking up around 55% of total processing time. A good candidate for some optimization! Effective SIMD Ray-Sphere intersection optimizations usually involve rearranging the data in SOA (structure of arrays) format so that 4-rays vs 1-sphere or 1-ray vs 4-sphere intersection tests can be performed at once. A 1-ray versus 4-sphere SOA modification was easier to implement with smallpt, so I went with that. Note that although smallpt’s overall speedup was only around 1.313x, the time spent in Ray-Sphere intersection itself dropped by much more. It can be estimated from the frame time and the profiled share of Ray-Sphere intersection:

(time(ms) spent total) *
(Ray-Sphere intersection % of total processing)
 = (time(ms) spent for Ray-Sphere intersection)
( 1000 / 0.49 ) * 0.55 = 1122ms
( 1000 / 0.88 ) * 0.35 = 397ms
1122ms / 397ms = 2.82x

This means the speedup of Ray-Sphere intersection itself was around 2.82x, which is a pretty good speedup.
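As a minimal sketch of what a 1-ray versus 4-sphere SOA test can look like with SSE intrinsics: this is not the code from the CodePlex release. The `Spheres4` and `Ray` layouts and the `intersect4` name are assumptions made for illustration, and the math follows the usual smallpt-style sphere test with a normalized ray direction.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Four spheres in SOA form: one array per component (hypothetical layout).
struct Spheres4 {
    float cx[4], cy[4], cz[4];  // sphere centers
    float rad[4];               // sphere radii
};

// Ray origin and direction; the direction is assumed to be normalized.
struct Ray { float ox, oy, oz, dx, dy, dz; };

// Tests one ray against four spheres at once.
// Writes the hit distance for each sphere into tOut, or 0 for a miss.
inline void intersect4(const Spheres4& s, const Ray& r, float tOut[4]) {
    const __m128 zero = _mm_setzero_ps();
    const __m128 eps  = _mm_set1_ps(1e-4f);

    // op = center - origin, for all four spheres
    __m128 opx = _mm_sub_ps(_mm_loadu_ps(s.cx), _mm_set1_ps(r.ox));
    __m128 opy = _mm_sub_ps(_mm_loadu_ps(s.cy), _mm_set1_ps(r.oy));
    __m128 opz = _mm_sub_ps(_mm_loadu_ps(s.cz), _mm_set1_ps(r.oz));

    // b = dot(op, dir)
    __m128 b = _mm_add_ps(_mm_add_ps(_mm_mul_ps(opx, _mm_set1_ps(r.dx)),
                                     _mm_mul_ps(opy, _mm_set1_ps(r.dy))),
                          _mm_mul_ps(opz, _mm_set1_ps(r.dz)));

    // det = b*b - dot(op, op) + rad*rad
    __m128 opop = _mm_add_ps(_mm_add_ps(_mm_mul_ps(opx, opx), _mm_mul_ps(opy, opy)),
                             _mm_mul_ps(opz, opz));
    __m128 rad  = _mm_loadu_ps(s.rad);
    __m128 det  = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(b, b), opop), _mm_mul_ps(rad, rad));

    // Lanes with det < 0 miss; clamp before sqrt so those lanes stay finite.
    __m128 hit = _mm_cmpge_ps(det, zero);
    __m128 sq  = _mm_sqrt_ps(_mm_max_ps(det, zero));

    // Near and far roots; take the near one if it is in front of the ray,
    // otherwise the far one, otherwise 0.
    __m128 t0 = _mm_sub_ps(b, sq);
    __m128 t1 = _mm_add_ps(b, sq);
    __m128 m0 = _mm_cmpgt_ps(t0, eps);
    __m128 m1 = _mm_cmpgt_ps(t1, eps);
    __m128 t  = _mm_or_ps(_mm_and_ps(m0, t0),
                          _mm_andnot_ps(m0, _mm_and_ps(m1, t1)));

    _mm_storeu_ps(tOut, _mm_and_ps(hit, t));
}
```

The caller would then scan the four results for the smallest non-zero distance. The branches of the scalar test become masks, so all four spheres cost roughly the same as one.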

SIMD Optimized Sqrt

Implementation | FPS | Samples/Sec
sqrtf (SSE scalar sqrtss) | 0.88 | 563.2K
SSE scalar rsqrtss | 0.90 | 576.0K
SSE scalar rsqrtss with one NR step | 0.87 | 556.8K

After reading http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/ and http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/, I decided to try rsqrtss( x )*x=sqrt(x) as an optimization. Since I have SSE2 enabled, the default C runtime sqrtf function gets compiled into sqrtss. I tried rsqrtss( x )*x, both on its own and with one step of Newton-Raphson iteration. Plain rsqrtss( x )*x is a tiny bit faster, but its 11-bit precision estimate is not sufficient and I get artifacts in the image. rsqrtss( x )*x with one step of Newton-Raphson iteration has no artifacts but, for my code, is a tiny bit slower than the C runtime sqrtf function.
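The two variants compared above can be sketched with SSE intrinsics roughly as follows. The function names are hypothetical, and this is an illustration of the technique rather than the code used for the measurements.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Raw hardware estimate of 1/sqrt(x) via rsqrtss (low precision).
inline float rsqrtEstimate(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

// sqrt(x) approximated as x * rsqrt(x), with no refinement.
inline float sqrtFast(float x) {
    return x * rsqrtEstimate(x);
}

// Same idea, with one Newton-Raphson step applied to the reciprocal
// square root first: y' = y * (1.5 - 0.5 * x * y * y).
inline float sqrtFastNR(float x) {
    float y = rsqrtEstimate(x);
    y = y * (1.5f - 0.5f * x * y * y);
    return x * y;
}
```

Note that both functions return NaN for x = 0, because the estimate is infinite there and 0 * infinity is NaN. A caller that can see zero needs to handle it separately.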

SIMD Optimized Vector Normalize

Implementation | FPS | Samples/Sec
Non-SIMD Vector Normalize | 0.83 | 531.2K
SIMD Vector Normalize | 0.88 | 563.2K

I used code from http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/ to try a faster vector normalize using rsqrtss with one step of Newton-Raphson, and I got a 1.06x speedup on smallpt as a whole. I did not measure how much faster the normalize function itself became, but the gain may be significant.
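For illustration, a normalize along these lines might look like the sketch below. It is not the code from the linked article or from my source release. `Vec3f` and `normalizeFast` are hypothetical names, and the vector is assumed to be non-zero.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct Vec3f { float x, y, z; };

// Normalizes v using rsqrtss plus one Newton-Raphson step.
// Assumes v is not the zero vector.
inline Vec3f normalizeFast(const Vec3f& v) {
    __m128 p  = _mm_setr_ps(v.x, v.y, v.z, 0.0f);
    __m128 sq = _mm_mul_ps(p, p);  // (x*x, y*y, z*z, 0)

    // Horizontal add: lane 0 ends up holding x*x + y*y + z*z.
    __m128 shuf = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(sq, shuf);
    shuf        = _mm_movehl_ps(shuf, sums);
    __m128 len2 = _mm_add_ss(sums, shuf);

    // Estimate 1/sqrt(len2), then refine: y' = y * (1.5 - 0.5 * len2 * y * y)
    __m128 y   = _mm_rsqrt_ss(len2);
    __m128 hx  = _mm_mul_ss(_mm_set_ss(0.5f), len2);
    __m128 inv = _mm_mul_ss(y, _mm_sub_ss(_mm_set_ss(1.5f),
                                          _mm_mul_ss(hx, _mm_mul_ss(y, y))));

    // Broadcast the scalar result to all lanes and scale the vector.
    inv = _mm_shuffle_ps(inv, inv, _MM_SHUFFLE(0, 0, 0, 0));
    float out[4];
    _mm_storeu_ps(out, _mm_mul_ps(p, inv));
    return Vec3f{ out[0], out[1], out[2] };
}
```

The reciprocal square root is used directly here, so the division by the length becomes a multiply.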

Tonemapping

http://www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/
I use the DX11 compute shader Reinhard global operator tonemapper I wrote for Yakiimo3D.

Source Code & Binary

I use DirectX11 for rendering, so you’ll need a DirectX11 capable video card in order to run the program.
http://butamanrenderer.codeplex.com/releases/view/52569

The “Reference” and “Optimized” directories contain the unoptimized and optimized smallpt implementations, respectively. Use “Optimized/SmallPtDefines.h” to toggle SOA Ray-Sphere intersection and the other optimizations on and off.

Using AMD CodeAnalyst on an Intel CPU

August 30th, 2010

http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx
The AMD CodeAnalyst home page.


CodeAnalyst data view. The inspected data in the above picture is the ray-sphere intersection function. CodeAnalyst lets you see source code and inlined assembly, with statistics on how many times each source line and each assembly instruction was sampled.


CodeAnalyst graph view. Provides a quick overview of which functions are costly.

I should note that I’ve modified smallpt a little bit since last week, so the above profile capture is now on a different exe. The main structure of smallpt remains unchanged, so you can see that the ray-sphere intersection is still where a lot of time is being spent.

Last week, I tried Very Sleepy to profile my smallpt implementation. I knew about AMD CodeAnalyst but didn’t try it because I own an Intel Core2Quad Q6600 2.4 GHz and thought it might be a hassle to use. However, information on the Internet, such as this AMD forum post, seems to indicate that I can use CodeAnalyst’s timer-based profiling even on Intel CPUs without any problems. Event-based profiling and instruction-based profiling need AMD hardware, so things like cache miss information are still unavailable to me.

The timer-based profiling in CodeAnalyst uses EIP register sampling like Very Sleepy, so as far as I can tell, I don’t get any additional information. However, two things I like more about CodeAnalyst than Very Sleepy are the Visual Studio integration and the fact that I can see sampling information on individual assembly instructions. One thing I like less is that I haven’t figured out how to perform sampling per thread. Also, Very Sleepy is open source, and I like that.

I think I will use CodeAnalyst as my main profiler (I really like the per assembly instruction sampling info) and switch to Very Sleepy when I find the need. I’m very happy that AMD is letting me use CodeAnalyst on an Intel CPU and will definitely remember CodeAnalyst when I purchase my next CPU.

My setup:
Intel Core2Quad Q6600 2.4 GHz
CodeAnalyst 2.97 (Installer was named CodeAnalyst_Public_2.97.803.0531.exe)
Vista 64-bit SP2

Very Sleepy Profiling Again

August 22nd, 2010


// DXUT.cpp DXUTRender3DEnvironment11()
if( DXUTIsRenderingPaused() || !DXUTIsActive() || GetDXUTState().GetRenderingOccluded() )
{
    // Window is minimized/paused/occluded/or not exclusive so yield CPU time to other processes
    //Sleep( 50 );
}

For now, I just commented out the DXUT Sleep yield call and re-profiled my multithreaded smallpt application (the EIP register was sampled around 30,000 times). Ray-Sphere intersection is now around 42% of the total CPU time.

Very Sleepy Profiler

August 21st, 2010

http://www.codersnotes.com/sleepy/
Very Sleepy’s home page. GPL-licensed source code is available.

http://sleepy.sourceforge.net/
Sleepy’s home page. Very Sleepy is based on the source code of Sleepy. Apparently Sleepy was written by Nick Chapman, author of the Indigo Renderer.

I learned about Very Sleepy a little while back on Twitter and people seemed pretty positive about it. I would have liked to use VTune, but it’s not free and I haven’t tried AMD CodeAnalyst yet because my CPU is Intel (though I should give it a try.)



Here’s what my multithreaded smallpt’s Very Sleepy profiling result looks like. Very Sleepy supports profiling multiple threads. Every thread of my multithreaded smallpt should be doing the same work (calculating the path’s radiance), and the above is the profiling result for one of my threads. From the profile, you can see that the intersect function is taking up around 36% of total CPU time, with the ray-sphere intersection calculation eating up around 31% of total CPU time. The ZwDelayExecution eating up 9% of total CPU time is probably DXUT calling Sleep because the app has lost focus (I forgot to comment the call out…)

Very Sleepy samples the EIP register to measure which parts of the code the program is spending its time in. It doesn’t give as much info as other, more complex profilers, but it’s quick, easy to use, and seems like a pretty handy tool. I find it nice that it’s open source too.

EDIT
http://www.butamanrenderer.com/2010/08/22/very-sleepy-profiling-again/
Woke up the next day and re-profiled with the DXUT Sleep call removed.
