<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Butaman Renderer</title>
	<atom:link href="http://www.butamanrenderer.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.butamanrenderer.com</link>
	<description>A hobby renderer...</description>
	<lastBuildDate>Mon, 20 Sep 2010 08:22:07 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>smallpt CPU optimizations</title>
		<link>http://www.butamanrenderer.com/2010/09/19/smallpt-cpu-optimizations/</link>
		<comments>http://www.butamanrenderer.com/2010/09/19/smallpt-cpu-optimizations/#comments</comments>
		<pubDate>Sun, 19 Sep 2010 09:24:11 +0000</pubDate>
		<dc:creator>yakiimo02</dc:creator>
				<category><![CDATA[Demo]]></category>
		<category><![CDATA[Multithreading]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[PathTracing]]></category>

		<guid isPermaLink="false">http://www.butamanrenderer.com/?p=43</guid>
		<description><![CDATA[Introduction I wrote a multithreaded SIMD CPU optimized version of smallpt (http://www.kevinbeason.com/smallpt/) in order to gain more experience and improve my CPU optimization skills. Implementation FPS Samples/Sec Single-Threaded Unoptimized (double) 0.49 313.6K Single-Threaded Optimized (float) 0.88 563.2K Multi-Threaded Optimized (float) 2.8-3.3 1792.0K-2112.0K Reorganizing the data in SOA fashion, I was able to speed up the Ray-Sphere [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>I wrote a multithreaded SIMD CPU optimized version of smallpt (<a href="http://www.kevinbeason.com/smallpt/" onclick="pageTracker._trackPageview('/outgoing/www.kevinbeason.com/smallpt/?referer=');">http://www.kevinbeason.com/smallpt/</a>) in order to gain more experience and improve my CPU optimization skills.</p>
<table border="1" cellpadding="5" >
<tr>
<th>Implementation</th>
<th>FPS</th>
<th>Samples/Sec</th>
</tr>
<tr>
<td>Single-Threaded Unoptimized (double)</td>
<td>0.49</td>
<td>313.6K</td>
</tr>
<tr>
<td>Single-Threaded Optimized (float)</td>
<td>0.88</td>
<td>563.2K</td>
</tr>
<tr>
<td>Multi-Threaded Optimized (float)</td>
<td>2.8-3.3</td>
<td>1792.0K-2112.0K</td>
</tr>
</table>
<p>
Reorganizing the data in SOA fashion, I was able to speed up the Ray-Sphere intersection code by 2.82x. I also used SIMD to speed up the vector normalization code a little bit. Because of the incoherent nature of randomly sampled secondary rays, the radiance calculation code didn&#8217;t seem easily or effectively vectorizable, so I did not try using SIMD there. Without multithreading, I get a speedup of around 1.79x over the single-threaded unoptimized version. Using the Nulstein work-stealing task scheduler, with multithreading enabled, I get a speedup of around 5.71x-6.73x.<br />
<br />
A CodePlex link to my program&#8217;s source code and binary is provided at the end of the article.</p>
<h2>A Picture</h2>
<p><a href="http://www.butamanrenderer.com/wp-content/uploads/2010/09/smallpt_ss.png"><img src="http://www.butamanrenderer.com/wp-content/uploads/2010/09/smallpt_ss.png" alt="" title="smallpt_ss" width="400" height="400" class="alignnone size-full wp-image-45" /></a><br />
Rendered for around 45 minutes using the multithreaded fully optimized version.</p>
<h2>My Setup</h2>
<p>CPU: Intel Q6600 Core2Quad 2.4 GHz (not overclocked), 4 CPU cores.<br />
OS: Vista 64-bit SP2</p>
<h2>Profiler</h2>
<p>I used AMD CodeAnalyst. (<a href="http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/">http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/</a>)</p>
<h2>Going from double to float</h2>
<table border="1" cellpadding="5" >
<tr>
<th>Implementation</th>
<th>FPS</th>
<th>Samples/Sec</th>
</tr>
<tr>
<td>double</td>
<td>0.49</td>
<td>313.6K</td>
</tr>
<tr>
<td>float</td>
<td>0.6</td>
<td>384.0K</td>
</tr>
</table>
<p>Replacing double with float resulted in a 1.22x speedup. This is despite implicit float->double conversions still occurring in the float version, which the double version does not need.</p>
<h2>Float Precision Problems</h2>
<p>Going from double to float introduced precision errors into the smallpt render (<a href="http://www.butamanrenderer.com/2010/08/08/smallpt-float-precision-problems/">http://www.butamanrenderer.com/2010/08/08/smallpt-float-precision-problems/</a>). I tried offsetting the ray origin along the ray direction by a small epsilon value in order to avoid self-intersections, but I couldn&#8217;t find a magic epsilon that got rid of all artifacts. This was a very annoying problem because sometimes the error would only become apparent after the image had converged for a while. In the end, I gave up my search for the magic epsilon and modified the scene in a similar fashion to SmallptGPU (<a href="http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html" onclick="pageTracker._trackPageview('/outgoing/davibu.interfree.it/opencl/smallptgpu/smallptGPU.html?referer=');">http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html</a>), decreasing the light&#8217;s radius and bringing it down inside the room. In addition, I also decreased the radius of the sphere walls. This got rid of the float precision problems for me.</p>
<h2>Multithreading</h2>
<p>For multithreading, I used the Nulstein work-stealing task scheduler (<a href="http://www.butamanrenderer.com/2010/08/16/nulstein-a-work-stealing-task-scheduler/">http://www.butamanrenderer.com/2010/08/16/nulstein-a-work-stealing-task-scheduler/</a>). The original smallpt code uses OpenMP for multithreading, but Visual Studio 2008 Standard Edition doesn&#8217;t support it out of the box and using OpenMP with it requires a bit of work (<a href="http://stackoverflow.com/questions/865686/openmp-in-visual-studio-2005-standard" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/865686/openmp-in-visual-studio-2005-standard?referer=');">http://stackoverflow.com/questions/865686/openmp-in-visual-studio-2005-standard</a>), so I didn&#8217;t try it.</p>
<table border="1" cellpadding="5" >
<tr>
<th>Implementation</th>
<th>FPS</th>
<th>Samples/Sec</th>
</tr>
<tr>
<td>Single-Threaded Optimized</td>
<td>0.88</td>
<td>563.2K</td>
</tr>
<tr>
<td>Multi-Threaded Optimized</td>
<td>2.8-3.3</td>
<td>1792.0K-2112.0K</td>
</tr>
</table>
<p>With multithreading, the fps fluctuates a bit, and I get a speedup of around 3.18x-3.75x on a 4-core CPU. </p>
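<p>The general idea of spreading image rows over worker threads can be sketched with standard C++ threads. This is not Nulstein&#8217;s API and it is not work stealing: the sketch simply hands out rows from a shared atomic counter, which is a simpler form of dynamic load balancing. It assumes a C++11 compiler, and render_rows_parallel and RowFn are hypothetical names.</p>

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Callback that renders one image row (hypothetical signature).
using RowFn = std::function<void(int row)>;

// Render `height` rows using `num_threads` workers. Each worker grabs the
// next unrendered row from a shared atomic counter until none are left.
// NOTE: this is plain dynamic scheduling, not a work-stealing scheduler.
void render_rows_parallel(int height, int num_threads, const RowFn& render_row) {
    std::atomic<int> next_row(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&]() {
            for (;;) {
                const int row = next_row.fetch_add(1);
                if (row >= height) break;  // all rows have been handed out
                render_row(row);
            }
        });
    }
    // Wait for every worker to finish before returning.
    for (std::thread& w : workers) w.join();
}
```

<p>Each row is processed by exactly one worker, so the callback can write to its own row of the frame buffer without locking. On a 4-core CPU the ideal speedup would be 4x, so the measured 3.18x-3.75x corresponds to roughly 80-94% parallel efficiency.</p>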
<h2>SOA Ray-Sphere SIMD</h2>
<table border="1" cellpadding="5" >
<tr>
<th>Implementation</th>
<th>FPS</th>
<th>Samples/Sec</th>
</tr>
<tr>
<td>No SIMD SOA</td>
<td>0.67</td>
<td>428.8K</td>
</tr>
<tr>
<td>SIMD SOA</td>
<td>0.88</td>
<td>563.2K</td>
</tr>
</table>
<p>Using AMD CodeAnalyst, I took a profile of my smallpt implementation and saw Ray-Sphere intersection taking up around 55% of total processing time. A good candidate for some optimization! Effective SIMD Ray-Sphere intersection optimizations usually involve rearranging the data in SOA (structure of arrays) format so that 4-rays vs 1-sphere or 1-ray vs 4-sphere intersection tests can be performed at once. A 1-ray versus 4-sphere SOA modification was easier to implement with smallpt, so I went with that. Note that although smallpt&#8217;s overall speedup from SIMD SOA was only around 1.313x (0.67 to 0.88 FPS), the time spent in Ray-Sphere intersection itself shrank by much more:</p>
<pre>
(time(ms) spent total) *
(Ray-Sphere intersection % of total processing)
 = (time(ms) spent for Ray-Sphere intersection)
( 1000 / 0.49 ) * 0.55 = 1122ms
( 1000 / 0.88 ) * 0.35 = 397ms
1122ms / 397ms = 2.82x
</pre>
<p>Here 0.55 and 0.35 are the fractions of total time spent in Ray-Sphere intersection for the unoptimized 0.49 FPS version and the optimized 0.88 FPS version. So the speedup of Ray-Sphere intersection itself was 2.82x, a pretty good speedup.</p>
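<p>A 1-ray versus 4-sphere SOA test can be sketched with SSE intrinsics as follows. This is an illustration of the technique, not the code from the release: Spheres4, make_spheres4, intersect4 and the epsilon value are hypothetical, and the sketch only builds on x86 compilers that provide xmmintrin.h. It follows the same quadratic solve as smallpt&#8217;s scalar intersect function, with one sphere per SSE lane.</p>

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Four spheres stored as structure of arrays: one SSE register per component.
struct Spheres4 {
    __m128 cx, cy, cz;  // sphere centers
    __m128 rad2;        // squared radii
};

// Pack four spheres given as plain arrays into SOA form.
inline Spheres4 make_spheres4(const float cx[4], const float cy[4],
                              const float cz[4], const float rad[4]) {
    Spheres4 s;
    s.cx = _mm_loadu_ps(cx);
    s.cy = _mm_loadu_ps(cy);
    s.cz = _mm_loadu_ps(cz);
    const __m128 r = _mm_loadu_ps(rad);
    s.rad2 = _mm_mul_ps(r, r);
    return s;
}

// Intersect one ray (origin o, normalized direction d) with four spheres.
// Returns four hit distances; a lane is 0 when that sphere is missed.
inline __m128 intersect4(const Spheres4& s,
                         float ox, float oy, float oz,
                         float dx, float dy, float dz) {
    const __m128 zero = _mm_setzero_ps();
    const __m128 eps = _mm_set1_ps(1e-4f);  // illustrative epsilon

    // op = center - origin, computed for all four spheres at once.
    const __m128 opx = _mm_sub_ps(s.cx, _mm_set1_ps(ox));
    const __m128 opy = _mm_sub_ps(s.cy, _mm_set1_ps(oy));
    const __m128 opz = _mm_sub_ps(s.cz, _mm_set1_ps(oz));

    // b = dot(op, d)
    const __m128 b = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(opx, _mm_set1_ps(dx)),
                   _mm_mul_ps(opy, _mm_set1_ps(dy))),
        _mm_mul_ps(opz, _mm_set1_ps(dz)));

    // det = b*b - dot(op, op) + rad^2
    const __m128 opop = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(opx, opx), _mm_mul_ps(opy, opy)),
        _mm_mul_ps(opz, opz));
    const __m128 det = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(b, b), opop), s.rad2);

    // Lanes with det >= 0 have real roots; clamp before the square root.
    const __m128 hit = _mm_cmpge_ps(det, zero);
    const __m128 root = _mm_sqrt_ps(_mm_max_ps(det, zero));

    // Prefer the near root if it is beyond eps, otherwise try the far root.
    const __m128 t0 = _mm_sub_ps(b, root);
    const __m128 t1 = _mm_add_ps(b, root);
    const __m128 use0 = _mm_cmpgt_ps(t0, eps);
    const __m128 use1 = _mm_cmpgt_ps(t1, eps);
    const __m128 t = _mm_or_ps(_mm_and_ps(use0, t0),
                               _mm_andnot_ps(use0, _mm_and_ps(use1, t1)));
    return _mm_and_ps(hit, t);
}
```

<p>A full renderer would still have to pick the smallest non-zero distance across the lanes (and across all groups of four spheres) to find the nearest hit; that step is left out of the sketch.</p>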
<h2>SIMD Optimized Sqrt</h2>
<table border="1" cellpadding="5" >
<tr>
<th>Implementation</th>
<th>FPS</th>
<th>Samples/Sec</th>
</tr>
<tr>
<td>sqrtf (SSE scalar sqrtss)</td>
<td>0.88</td>
<td>563.2K</td>
</tr>
<tr>
<td>SSE scalar rsqrtss</td>
<td>0.90</td>
<td>576.0K</td>
</tr>
<tr>
<td>SSE scalar rsqrtss with one NR step</td>
<td>0.87</td>
<td>556.8K</td>
</tr>
</table>
<p>After reading <a href="http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/" onclick="pageTracker._trackPageview('/outgoing/assemblyrequired.crashworks.org/2009/10/16/timing-square-root/?referer=');">http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/</a> and <a href="http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/" onclick="pageTracker._trackPageview('/outgoing/assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/?referer=');">http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/</a>, I decided to try rsqrtss( x )*x=sqrt(x) as an optimization. Since I have SSE2 enabled, the default C runtime sqrtf function gets compiled into sqrtss. I tried rsqrtss( x )*x and rsqrtss( x )*x with one step of Newton-Raphson iteration. rsqrtss( x )*x is a tiny bit faster, but its 11-bit precision estimate is not sufficient and I get artifacts in the image. rsqrtss( x )*x with one step of Newton-Raphson iteration has no artifacts, but for my code, it is a tiny bit slower than the C runtime sqrtf function.</p>
<h2>SIMD Optimized Vector Normalize</h2>
<table border="1" cellpadding="5" >
<tr>
<th>Implementation</th>
<th>FPS</th>
<th>Samples/Sec</th>
</tr>
<tr>
<td>Non-SIMD Vector Normalize</td>
<td>0.83</td>
<td>531.2K</td>
</tr>
<tr>
<td>SIMD Vector Normalize</td>
<td>0.88</td>
<td>563.2K</td>
</tr>
</table>
<p>I used code from <a href="http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/" onclick="pageTracker._trackPageview('/outgoing/assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/?referer=');">http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/</a> to try a faster vector normalize using rsqrtss( x )*x=sqrt(x) with one step of Newton-Raphson, and I got a 1.06x speedup on smallpt itself. I did not check how much faster the normalize function itself became, but it may be significant.</p>
<h2>Tonemapping</h2>
<p><a href="http://www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/" onclick="pageTracker._trackPageview('/outgoing/www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/?referer=');">http://www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/</a><br />
I use the DX11 compute shader Reinhard global operator tonemapper I wrote for Yakiimo3D.</p>
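<p>For readers unfamiliar with it, the Reinhard global operator can be sketched on the CPU as below. This is a plain C++ illustration of the published formula, not the DX11 compute shader from Yakiimo3D; log_average, reinhard and the parameter names are hypothetical. The scene luminance is first scaled by key / log-average luminance and then compressed with L * (1 + L / white^2) / (1 + L).</p>

```cpp
#include <cmath>
#include <vector>

// Log-average luminance of the image: exp(mean(log(delta + L))).
// delta avoids log(0) for black pixels. Assumes lum is not empty.
inline float log_average(const std::vector<float>& lum, float delta = 1e-4f) {
    double sum = 0.0;
    for (float l : lum) sum += std::log(delta + l);
    return static_cast<float>(std::exp(sum / lum.size()));
}

// Reinhard global operator for a single luminance value.
//   key   : target middle-grey value (0.18 is a common choice)
//   white : smallest scaled luminance that maps to pure white
inline float reinhard(float lum, float log_avg, float key, float white) {
    const float l = key / log_avg * lum;  // scale to the chosen key
    return l * (1.0f + l / (white * white)) / (1.0f + l);
}
```

<p>With a very large white value the operator reduces to L / (1 + L), and a scaled luminance equal to white maps exactly to 1.</p>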
<h2>Source Code &#038; Binary</h2>
<p>I use DirectX11 for rendering, so you&#8217;ll need a DirectX11 capable video card in order to run the program.<br />
<a href="http://butamanrenderer.codeplex.com/releases/view/52569" onclick="pageTracker._trackPageview('/outgoing/butamanrenderer.codeplex.com/releases/view/52569?referer=');">http://butamanrenderer.codeplex.com/releases/view/52569</a><br />
<br />
The &#8220;Reference&#8221; and &#8220;Optimized&#8221; directories contain the unoptimized and optimized smallpt implementations, respectively. Use &#8220;Optimized/SmallPtDefines.h&#8221; to toggle on and off SOA Ray-Sphere intersection and other optimizations.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.butamanrenderer.com/2010/09/19/smallpt-cpu-optimizations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using AMD CodeAnalyst on an Intel CPU</title>
		<link>http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/</link>
		<comments>http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/#comments</comments>
		<pubDate>Sun, 29 Aug 2010 15:26:38 +0000</pubDate>
		<dc:creator>yakiimo02</dc:creator>
				<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://www.butamanrenderer.com/?p=37</guid>
		<description><![CDATA[http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx The AMD CodeAnalyst home page. CodeAnalyst data view. The inspected data in the above picture is the ray-sphere intersection function. CodeAnalyst lets you see source code and inlined assembly with statistics on how many times each source line and each assembly instruction was sampled. CodeAnalyst graph view. Provides a quick overview of which functions are [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx?referer=');">http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx</a><br />
The AMD CodeAnalyst home page.<br />
<br />
<a href="http://www.butamanrenderer.com/wp-content/uploads/2010/08/codeanalyst_data.png"><img src="http://www.butamanrenderer.com/wp-content/uploads/2010/08/codeanalyst_data-1024x672.png" alt="" title="codeanalyst_data" width="470" height="308" class="alignnone size-large wp-image-38" /></a><br />
CodeAnalyst data view. The inspected data in the above picture is the ray-sphere intersection function. CodeAnalyst lets you see source code and inlined assembly with statistics on how many times each source line and each assembly instruction was sampled.<br />
<br />
<a href="http://www.butamanrenderer.com/wp-content/uploads/2010/08/codeanalyst_graph.png"><img src="http://www.butamanrenderer.com/wp-content/uploads/2010/08/codeanalyst_graph-1024x672.png" alt="" title="codeanalyst_graph" width="470" height="308" class="alignnone size-large wp-image-39" /></a><br />
CodeAnalyst graph view. Provides a quick overview of which functions are costly.<br />
<br />
I should note that I&#8217;ve modified smallpt a little bit since last week, so the above profile capture is now on a different exe. The main structure of smallpt remains unchanged, so you can see that the ray-sphere intersection is still where a lot of time is being spent.<br />
<br />
Last week, I tried Very Sleepy to profile my smallpt implementation. I knew about AMD CodeAnalyst but didn&#8217;t try it because I own an Intel Core2Quad Q6600 2.4 GHz and thought it might be a hassle to use. However, information on the Internet such as <a href="http://forums.amd.com/devforum/messageview.cfm?catid=218&#038;threadid=92275" onclick="pageTracker._trackPageview('/outgoing/forums.amd.com/devforum/messageview.cfm?catid=218_038_threadid=92275&amp;referer=');">this AMD forum post</a> seems to indicate that I can use CodeAnalyst&#8217;s timer-based profiling even on Intel CPUs without any problems. Event-based profiling and instruction-based profiling need AMD hardware, so stuff like cache miss information is still unavailable to me.<br />
<br />
The timer-based profiling in CodeAnalyst uses EIP register sampling like Very Sleepy, so as far as I could tell, I don&#8217;t get any additional information. However, two things I like more about CodeAnalyst than Very Sleepy are the Visual Studio integration and the fact that I can see sampling information on individual assembly instructions. One thing I like less than in Very Sleepy is that I haven&#8217;t figured out how to perform sampling per thread. Also, Very Sleepy is open source and I like that.<br />
<br />
I think I will use CodeAnalyst as my main profiler (I really like the per assembly instruction sampling info) and switch to Very Sleepy when I find the need. I&#8217;m very happy that AMD is letting me use CodeAnalyst on an Intel CPU and will definitely remember CodeAnalyst when I purchase my next CPU.<br />
<br />
My setup:<br />
Intel Core2Quad Q6600 2.4 GHz<br />
CodeAnalyst 2.97 (Installer was named CodeAnalyst_Public_2.97.803.0531.exe)<br />
Vista 64-bit SP2</p>
]]></content:encoded>
			<wfw:commentRss>http://www.butamanrenderer.com/2010/08/30/using-amd-codeanalyst-on-an-intel-cpu/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Luxology Video On CPU vs GPU</title>
		<link>http://www.butamanrenderer.com/2010/08/28/luxology-video-on-cpu-vs-gpu/</link>
		<comments>http://www.butamanrenderer.com/2010/08/28/luxology-video-on-cpu-vs-gpu/#comments</comments>
		<pubDate>Sat, 28 Aug 2010 06:01:16 +0000</pubDate>
		<dc:creator>yakiimo02</dc:creator>
				<category><![CDATA[GPGPU]]></category>

		<guid isPermaLink="false">http://www.butamanrenderer.com/?p=34</guid>
		<description><![CDATA[http://www.luxology.com/tv/training/view.aspx?id=536 14 minute video on the Luxology site. http://www.vizworld.com/2010/08/luxology-cpu-gpu-ray-tracing/ Learned about it from Vizworld. The comparison is between Modo on a 12 Core CPU and Octane on 2 GPUs. I think there is a definite bias towards presenting their own product in the best light, but it&#8217;s an interesting comparison. The Luxrender wiki mentions it [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.luxology.com/tv/training/view.aspx?id=536" onclick="pageTracker._trackPageview('/outgoing/www.luxology.com/tv/training/view.aspx?id=536&amp;referer=');">http://www.luxology.com/tv/training/view.aspx?id=536</a><br />
14 minute video on the Luxology site.<br />
<br />
<a href="http://www.vizworld.com/2010/08/luxology-cpu-gpu-ray-tracing/" onclick="pageTracker._trackPageview('/outgoing/www.vizworld.com/2010/08/luxology-cpu-gpu-ray-tracing/?referer=');">http://www.vizworld.com/2010/08/luxology-cpu-gpu-ray-tracing/</a><br />
Learned about it from Vizworld.<br />
<br />
The comparison is between Modo on a 12 Core CPU and Octane on 2 GPUs. I think there is a definite bias towards presenting their own product in the best light, but it&#8217;s an interesting comparison. The Luxrender wiki mentions it as well (<a href="http://www.luxrender.net/wiki/index.php?title=Luxrender_and_OpenCL#CPU_.2B_GPU_.2B_Network_Rendering" onclick="pageTracker._trackPageview('/outgoing/www.luxrender.net/wiki/index.php?title=Luxrender_and_OpenCL_CPU_.2B_GPU_.2B_Network_Rendering&amp;referer=');">http://www.luxrender.net/wiki/index.php?title=Luxrender_and_OpenCL#CPU_.2B_GPU_.2B_Network_Rendering</a>), and I do agree that a heterogeneous solution that utilizes both CPU and GPU looks like the best solution.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.butamanrenderer.com/2010/08/28/luxology-video-on-cpu-vs-gpu/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Very Sleepy Profiling Again</title>
		<link>http://www.butamanrenderer.com/2010/08/22/very-sleepy-profiling-again/</link>
		<comments>http://www.butamanrenderer.com/2010/08/22/very-sleepy-profiling-again/#comments</comments>
		<pubDate>Sun, 22 Aug 2010 05:29:02 +0000</pubDate>
		<dc:creator>yakiimo02</dc:creator>
				<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://www.butamanrenderer.com/?p=30</guid>
		<description><![CDATA[// DXUT.cpp DXUTRender3DEnvironment11() &#160;if&#40; DXUTIsRenderingPaused&#40;&#41; &#124;&#124; !DXUTIsActive&#40;&#41; &#124;&#124; GetDXUTState&#40;&#41;.GetRenderingOccluded&#40;&#41; &#41; &#160;&#123; &#160; &#160; &#160;// Window is minimized/paused/occluded/or not exclusive so yield CPU time to other processes &#160; &#160; &#160;//Sleep( 50 ); &#125; For now just commented out the DXUT Sleep yield call and re-profiled my multithreaded smallpt application (EIP register sampled around 30,000 times.) Ray-Sphere [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.butamanrenderer.com/wp-content/uploads/2010/08/verysleepy21.png"><img src="http://www.butamanrenderer.com/wp-content/uploads/2010/08/verysleepy21-1024x672.png" alt="" title="verysleepy2" width="470" height="308" class="alignnone size-large wp-image-31" /></a><br />
</p>
<div class="codecolorer-container cpp default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="cpp codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666;">// DXUT.cpp DXUTRender3DEnvironment11()</span><br />
&nbsp;<span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span> DXUTIsRenderingPaused<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">||</span> <span style="color: #000040;">!</span>DXUTIsActive<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">||</span> GetDXUTState<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>.<span style="color: #007788;">GetRenderingOccluded</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><br />
&nbsp;<span style="color: #008000;">&#123;</span><br />
&nbsp; &nbsp; &nbsp;<span style="color: #666666;">// Window is minimized/paused/occluded/or not exclusive so yield CPU time to other processes</span><br />
&nbsp; &nbsp; &nbsp;<span style="color: #666666;">//Sleep( 50 );</span><br />
<span style="color: #008000;">&#125;</span></div></div>
<p>
For now, I just commented out the DXUT Sleep yield call and re-profiled my multithreaded smallpt application (the EIP register was sampled around 30,000 times). Ray-Sphere intersection is now around 42% of the total CPU time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.butamanrenderer.com/2010/08/22/very-sleepy-profiling-again/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Very Sleepy Profiler</title>
		<link>http://www.butamanrenderer.com/2010/08/21/very-sleepy-profiler/</link>
		<comments>http://www.butamanrenderer.com/2010/08/21/very-sleepy-profiler/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 14:07:37 +0000</pubDate>
		<dc:creator>yakiimo02</dc:creator>
				<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://www.butamanrenderer.com/?p=20</guid>
		<description><![CDATA[http://www.codersnotes.com/sleepy/ Very Sleepy&#8217;s home page. GPLed source code is available. http://sleepy.sourceforge.net/ Sleepy&#8217;s home page. Very Sleepy is based on the source code of Sleepy. Apparently Sleepy was written by Nick Chapman, author of the Indigo Renderer. I learned about Very Sleepy a little while back on Twitter and people seemed pretty positive about it. I would have [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.codersnotes.com/sleepy/" onclick="pageTracker._trackPageview('/outgoing/www.codersnotes.com/sleepy/?referer=');">http://www.codersnotes.com/sleepy/</a><br />
Very Sleepy&#8217;s home page. GPLed source code is available.<br />
<br />
<a href="http://sleepy.sourceforge.net/" onclick="pageTracker._trackPageview('/outgoing/sleepy.sourceforge.net/?referer=');">http://sleepy.sourceforge.net/</a><br />
Sleepy&#8217;s home page. Very Sleepy is based on the source code of Sleepy. Apparently Sleepy was written by Nick Chapman, author of the Indigo Renderer.<br />
<br />
I learned about Very Sleepy a little while back on Twitter and people seemed pretty positive about it. I would have liked to use VTune, but it&#8217;s not free and I haven&#8217;t tried AMD CodeAnalyst yet because my CPU is Intel (though I should give it a try.)<br />
<br />
<a href="http://www.butamanrenderer.com/wp-content/uploads/2010/08/verysleepy.png"><img src="http://www.butamanrenderer.com/wp-content/uploads/2010/08/verysleepy-1024x672.png" alt="" title="verysleepy" width="470" height="308" class="alignnone size-large wp-image-27" /></a><br />
<br />
Here&#8217;s what my multithreaded smallpt&#8217;s Very Sleepy profiling result looks like. Very Sleepy supports profiling multiple threads. Every thread of my multithreaded smallpt should be doing the same work (calculating the path&#8217;s radiance), and the above is the profiling result on one of my threads. From the profile, you can see that the intersect function is taking up around 36% of total CPU time, with the ray-sphere intersection calculation eating up around 31% of total CPU time. The ZwDelayExecution eating up 9% of total CPU time is probably DXUT calling Sleep because the app has lost focus (I forgot to comment the call out&#8230;)<br />
<br />
Very Sleepy samples the EIP register to measure which parts of the code the program is spending its time in. It doesn&#8217;t give as much info as other more complex profilers, but it&#8217;s quick, easy to use and seems like a pretty handy tool. I find it nice that it&#8217;s open source too.<br />
<br />
EDIT<br />
<a href="http://www.butamanrenderer.com/2010/08/22/very-sleepy-profiling-again/">http://www.butamanrenderer.com/2010/08/22/very-sleepy-profiling-again/</a><br />
Woke up the next day and re-profiled with the DXUT Sleep call removed.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.butamanrenderer.com/2010/08/21/very-sleepy-profiler/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

