Thursday, December 13, 2018

Reverse engineering the rendering of The Witcher 3, part 7a - average luminance (histogram/distribution)

Welcome,

Calculating the average luminance of the current frame can be found in virtually any modern video game. This value is often used later for eye adaptation and tonemapping. A simple approach is to compute luma into, let's say, a 512x512 texture, generate its mip chain and use the last mip. This usually works, but is quite limiting. More sophisticated solutions use compute shaders to perform, for instance, a parallel reduction.
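
As a quick illustration of that simple approach, here is a minimal sketch of the luma pass (my own code, not from the game; the names t_HDRColor and s_Linear are made up): render luma into a small target, generate its full mip chain on the API side and read the 1x1 mip later.

   Texture2D t_HDRColor : register(t0);  
   SamplerState s_Linear : register(s0);  
  
   // Writes per-pixel luma; rendered into e.g. a 512x512 target whose mip chain  
   // is generated afterwards - the 1x1 mip then approximates the average luma.  
   float LumaPS( float4 pos : SV_Position, float2 uv : TEXCOORD0 ) : SV_Target0  
   {  
      float3 color = t_HDRColor.Sample( s_Linear, uv ).rgb;  
      return dot( color, float3(0.2126, 0.7152, 0.0722) );   // Rec. 709 luma weights  
   }  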

Let's see how CD Projekt Red approached this problem in The Witcher 3. I've already investigated its tonemapping and eye adaptation before (links in the first paragraph) and average luminance is the only piece of the puzzle missing so far.

To start with, calculating average luminance in The Witcher 3 consists of two passes. For clarity, I decided not to combine them into one post, so today I will focus on the first one - "distribution of luminance" (calculating a histogram of brightness). For the second part, click here to read it.

Finding these two passes shouldn't be too difficult in your favourite frame analyzer. They are two consecutive Dispatch calls, just before eye adaptation:



Let's see the inputs for this pass. There are two textures needed:
1) HDR color buffer, downscaled to 1/4 x 1/4 (for example, from 1920x1080 to 480x270),
2) Fullscreen depth buffer

The HDR color buffer at 1/4 x 1/4 resolution. Notice the nice trick that this buffer is part of a larger one - reusing buffers is definitely a good thing.

Fullscreen depth buffer
Why downscale the color buffer? I guess it's probably all about performance :)

In terms of output for this pass, there is a structured buffer: 256 elements, 4 bytes each.
The shaders have no debug info here, so let's assume it's just a buffer of unsigned ints.

Important: The first step of calculating average luminance is calling ClearUnorderedAccessViewUint to zero all elements of the structured buffer.

Let's see the assembly of the compute shader (this is the first compute shader in the series!):

 cs_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb0[3], immediateIndexed  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_resource_texture2d (float,float,float,float) t1  
    dcl_uav_structured u0, 4  
    dcl_input vThreadGroupID.x  
    dcl_input vThreadIDInGroup.x  
    dcl_temps 6  
    dcl_tgsm_structured g0, 4, 256  
    dcl_thread_group 64, 1, 1  
   0: store_structured g0.x, vThreadIDInGroup.x, l(0), l(0)  
   1: iadd r0.xyz, vThreadIDInGroup.xxxx, l(64, 128, 192, 0)  
   2: store_structured g0.x, r0.x, l(0), l(0)  
   3: store_structured g0.x, r0.y, l(0), l(0)  
   4: store_structured g0.x, r0.z, l(0), l(0)  
   5: sync_g_t  
   6: ftoi r1.x, cb0[2].z  
   7: mov r2.y, vThreadGroupID.x  
   8: mov r2.zw, l(0, 0, 0, 0)  
   9: mov r3.zw, l(0, 0, 0, 0)  
  10: mov r4.yw, l(0, 0, 0, 0)  
  11: mov r1.y, l(0)  
  12: loop  
  13:  utof r1.z, r1.y  
  14:  ge r1.z, r1.z, cb0[0].x  
  15:  breakc_nz r1.z  
  16:  iadd r2.x, r1.y, vThreadIDInGroup.x  
  17:  utof r1.z, r2.x  
  18:  lt r1.z, r1.z, cb0[0].x  
  19:  if_nz r1.z  
  20:   ld_indexable(texture2d)(float,float,float,float) r5.xyz, r2.xyzw, t0.xyzw  
  21:   dp3 r1.z, r5.xyzx, l(0.212600, 0.715200, 0.072200, 0.000000)  
  22:   imul null, r3.xy, r1.xxxx, r2.xyxx  
  23:   ld_indexable(texture2d)(float,float,float,float) r1.w, r3.xyzw, t1.yzwx  
  24:   eq r1.w, r1.w, cb0[2].w  
  25:   and r1.w, r1.w, cb0[2].y  
  26:   add r2.x, -r1.z, cb0[2].x  
  27:   mad r1.z, r1.w, r2.x, r1.z  
  28:   add r1.z, r1.z, l(1.000000)  
  29:   log r1.z, r1.z  
  30:   mul r1.z, r1.z, l(88.722839)  
  31:   ftou r1.z, r1.z  
  32:   umin r4.x, r1.z, l(255)  
  33:   atomic_iadd g0, r4.xyxx, l(1)  
  34:  endif  
  35:  iadd r1.y, r1.y, l(64)  
  36: endloop  
  37: sync_g_t  
  38: ld_structured r1.x, vThreadIDInGroup.x, l(0), g0.xxxx  
  39: mov r4.z, vThreadIDInGroup.x  
  40: atomic_iadd u0, r4.zwzz, r1.x  
  41: ld_structured r1.x, r0.x, l(0), g0.xxxx  
  42: mov r0.w, l(0)  
  43: atomic_iadd u0, r0.xwxx, r1.x  
  44: ld_structured r0.x, r0.y, l(0), g0.xxxx  
  45: atomic_iadd u0, r0.ywyy, r0.x  
  46: ld_structured r0.x, r0.z, l(0), g0.xxxx  
  47: atomic_iadd u0, r0.zwzz, r0.x  
  48: ret  

And the constant buffer:


We already know that the first input is the downscaled HDR color buffer. For Full HD, its resolution is 480x270. Take a look at the Dispatch call.
Dispatch(270, 1, 1) - that means we run 270 thread groups. Simply speaking, we dispatch one thread group per row of the color buffer.

Each thread group operates on one row of the HDR color buffer
Now that we have this context, let's try to figure out what this shader does.
Each thread group has 64 threads in the X direction (dcl_thread_group 64, 1, 1) and also some shared memory: 256 elements, 4 bytes each (dcl_tgsm_structured g0, 4, 256).

Note that in the shader we use SV_GroupThreadID (vThreadIDInGroup.x) [0-63] and SV_GroupID (vThreadGroupID.x) [0-269].
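
Before walking through the steps, here is a hedged sketch of the declarations that the HLSL snippets below assume. The names (texture0, texture1, g_buffer, shared_data, LUMA_RGB, cb0_v0, cb0_v2, threadID, groupID, the entry point) are mine, chosen to match the snippets; the register slots, group size and shared memory size come from the assembly above, and the constant buffer meanings are deduced later in the post.

   cbuffer cb0 : register(b0)  
   {  
      float4 cb0_v0;   // .x - width of the downscaled color buffer (480 for 1080p)  
      float4 cb0_v1;  
      float4 cb0_v2;   // .x - 'sky' luma, .y - sky blend factor, .z - depth scale (4), .w - far plane depth (0.0)  
   };  
  
   Texture2D texture0 : register(t0);                  // HDR color buffer, 1/4 x 1/4 resolution  
   Texture2D texture1 : register(t1);                  // fullscreen depth buffer  
   RWStructuredBuffer<uint> g_buffer : register(u0);   // 256-bin histogram (the output)  
  
   static const float3 LUMA_RGB = float3(0.2126, 0.7152, 0.0722);  
  
   groupshared uint shared_data[256];                  // dcl_tgsm_structured g0, 4, 256  
  
   [numthreads(64, 1, 1)]                              // dcl_thread_group 64, 1, 1  
   void LuminanceHistogramCS( uint3 GTid : SV_GroupThreadID, uint3 Gid : SV_GroupID )  
   {  
      const uint threadID = GTid.x;   // 0..63  
      const uint groupID  = Gid.x;    // 0..269 - one thread group per row  
      // ...the body is assembled step by step below...  
   }  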

1) We start by setting all elements of shared memory to zero. Since we have 256 elements in shared memory and 64 threads per group, we can do it nicely with a simple loop:

   // The first step is to set whole shared data to zero.  
   // Because each thread group has 64 threads, each one can zero 4 elements using a simple offset.  
   [unroll] for (uint idx=0; idx < 4; idx++)  
   {  
     const uint offset = threadID + idx*64;  
     shared_data[ offset ] = 0;  
   }  

2) After that, we issue a barrier with GroupMemoryBarrierWithGroupSync (sync_g_t). We do it to make sure all threads have zeroed their elements of groupshared memory before going to the next stage.
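
In HLSL this step is just a single intrinsic call:

    // Make sure the zeroing above is visible to all threads in the group.  
    GroupMemoryBarrierWithGroupSync();  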

3) Now we perform a loop which we can roughly write like this:
  // cb0_v0.x is width of downscaled color buffer. For 1920x1080, it's 1920/4 = 480;  
   float ViewportSizeX = cb0_v0.x;  
   [loop] for ( uint PositionX = 0; PositionX < ViewportSizeX; PositionX += 64 )  
   {  
      ...  

This is a simple 'for' loop with an increment of 64 (have you already figured out why? ;) ).

The next step is to calculate the position of the pixel to load.
Let's think about it.

In terms of "Y" coordinate - we can use SV_GroupID.x, because we dispatched 270 thread groups.
In terms of "X" well... we can take advantage of current thread in the group! Let's try it.

Because we have 64 threads per group, this approach covers all pixels.
Consider thread group (0, 0, 0):
- Thread (0, 0, 0) will process pixels (0, 0), (64, 0), (128, 0), (192, 0), (256, 0), (320, 0), (384, 0), (448, 0),
- Thread (1, 0, 0) will process pixels (1, 0), (65, 0), (129, 0), (193, 0), (257, 0), (321, 0), (385, 0), (449, 0),
...
- Thread (63, 0, 0) will process pixels (63, 0), (127, 0), (191, 0), (255, 0), (319, 0), (383, 0), (447, 0).
This way, all pixels will be processed.

We also want to make sure that we won't load a pixel outside the color buffer:
  // We move along X axis, pixel by pixel. Y is GroupID.  
     uint CurrentPixelPositionX = PositionX + threadID;  
     uint CurrentPixelPositionY = groupID;  
     if ( CurrentPixelPositionX < ViewportSizeX )  
     {  
        // HDR Color buffer.  
        // Calculate screen space position of HDR color buffer, load it and calculate luma.  
        uint2 colorPos = uint2(CurrentPixelPositionX, CurrentPixelPositionY);  
        float3 color = texture0.Load( int3(colorPos, 0) ).rgb;  
        float luma = dot(color, LUMA_RGB);  

See? Pretty simple :)
I've also calculated the luma (line 21 of the assembly).

Okay, we have calculated the luma of the color pixel - feels good. The next step is to load (no sampling!) the corresponding depth value.
But we have a problem here, because we attached a full-resolution depth buffer. How to deal with it?
It's surprisingly simple: just multiply colorPos by some constant (cb0_v2.z). We downscaled the HDR color buffer by 4, so this value is 4!
     const int iDepthTextureScale = (int) cb0_v2.z;  
     uint2 depthPos = iDepthTextureScale * colorPos;  
     float depth = texture1.Load( int3(depthPos, 0) ).x;  


So far so good! But... then we come to assembly lines 24-25:
  24:   eq r1.w, r1.w, cb0[2].w  
  25:   and r1.w, r1.w, cb0[2].y  

Well. At first we have a floating-point equality comparison whose result goes to r1.w, and right after that we have... what? A bitwise AND?? Seriously? On a floating-point value? What the heck???

The 'eq+and' problem
Let me just say this was the most difficult part of this shader for me to figure out. I even tried some crazy asint/asfloat combinations...
What about a slightly different approach? Let's just do a simple float-to-float comparison in HLSL:

 float DummyPS() : SV_Target0  
 {  
   float test = (cb0_v0.x == cb0_v0.y);  
   return test;  
 }  

And output assembly:
   0: eq r0.x, cb0[0].y, cb0[0].x  
   1: and o0.x, r0.x, l(0x3f800000)  
   2: ret   

Interesting, isn't it? I didn't expect an 'and' here.
0x3f800000 is simply 1.0f... which makes sense: eq writes an all-ones bitmask (0xFFFFFFFF) when the comparison passes and 0x00000000 when it fails, so ANDing the mask with the bit pattern of 1.0f yields either 1.0 or 0.0.
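
Written out explicitly, the same selection could look like the sketch below (my own HLSL, not the game's code; the function name is made up):

   float SelectIfEqual( float a, float b, float valueIfEqual )  
   {  
      // 'eq' produces an all-ones mask when true, all-zeros when false...  
      const uint mask = (a == b) ? 0xFFFFFFFFu : 0u;  
      // ...and 'and' keeps either the full bit pattern of the value, or nothing.  
      return asfloat( mask & asuint(valueIfEqual) );  
   }  
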
What if you could 'replace' 1.0 with some other value? Like this:
 float DummyPS() : SV_Target0  
 {  
   float test = (cb0_v0.x == cb0_v0.y) ? cb0_v0.z : 0.0;  
   return test;  
 }  

And result:
   0: eq r0.x, cb0[0].y, cb0[0].x  
   1: and o0.x, r0.x, cb0[0].z  
   2: ret   

Haha! It works :) Just some magic from the HLSL compiler. As a side note, if you replace the 0.0 with something different, you'll simply get a movc.
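
For completeness, here's a hedged sketch of that case (mine, not from the game): with a non-zero 'else' operand the and-trick no longer applies, so the compiler selects between the two operands instead.

  float DummyPS() : SV_Target0  
  {  
    // A non-zero 'else' value - the compiler can't use the 'and' trick anymore,  
    // so it emits a conditional select (movc) after the 'eq'.  
    float test = (cb0_v0.x == cb0_v0.y) ? cb0_v0.z : 0.5;  
    return test;  
  }  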


Going back to our compute shader, the next step is to check if the depth value is equal to cb0_v2.w. It's always set to 0.0 - simply speaking, we check if the pixel lies on the far plane (sky). If it does, we assign some value to this factor, around 0.5 (I checked a few frames).

The calculated coefficient is used to interpolate between the color luma and a 'sky' luma (cb0_v2.x, often around 0.0). I guess this is to give more control over how much the sky contributes to the average luminance, usually by decreasing its importance. Very smart idea.
    // We check if pixel lies on far plane (sky). If yes, we can specify how it will be  
    // mixed with our values.  
    float value = (depth == cb0_v2.w) ? cb0_v2.y : 0.0;  
         
    // If 'value' is 0.0, this lerp will simply give us 'luma'. However, if 'value' is different  
    // (often around ~0.50), calculated luma can have less importance. (cb0_v2.x is usually close to 0.0).  
    float lumaOk = lerp( luma, cb0_v2.x, value );  
   

Now that we have lumaOk, the next step is to calculate its natural logarithm so the values distribute nicely. But wait - let's say that lumaOk is 0.0. We know that log(0) is undefined, so we add 1.0, because log(1) = 0.0.

After that, we scale the calculated logarithm by 128 to distribute it nicely across the 256 cells. Very smart!
And this is exactly where 88.722839 comes from - it's 128 * natural logarithm of 2.
That's just the way HLSL calculates logarithms: in HLSL assembly there is only one instruction for logarithms, log, and it's base-2. Since ln(x) = log2(x) * ln(2), computing 128 * ln(x) becomes log2(x) * 128 * ln(2) ≈ log2(x) * 88.722839.
       // Let's assume that lumaOk is 0.0.  
       // log(0) is undefined  
       // log(1) = 0.  
       // calculate natural logarithm of luma  
       lumaOk = log(lumaOk + 1.0);  
         
       // Scale logarithm of luma by 128  
       lumaOk *= 128;  


Finally, we calculate the index of the cell from the logarithmically distributed luminance and add 1 to the corresponding cell in shared memory.
       // Calculate proper index. Uint and since we have 256 elements in array,  
       // make sure it will not get out of bounds.  
       uint uLuma = (uint) lumaOk;  
       uLuma = min(uLuma, 255);  
   
       // Add '1' to corresponding luma value.  
       InterlockedAdd( shared_data[uLuma], 1 );  

The next step is, again, to set a barrier to make sure all pixels in the row have been processed.
The last one is to add the values from shared memory to the structured buffer, in the same way, with a simple loop:
   // Wait until all pixels in this row have been processed  
   GroupMemoryBarrierWithGroupSync();  
   
   // Add calculated values to structured buffer.  
   [unroll] for (uint idx = 0; idx < 4; idx++)  
   {  
     const uint offset = threadID + idx*64;  
   
     uint data = shared_data[offset];  
     InterlockedAdd( g_buffer[offset], data );  
   }  

After all 64 threads in the thread group have filled the shared data, each thread adds 4 values to the output buffer.

Now, a word about the output buffer. Let's think about it: the sum of all its values is equal to the total number of pixels (for 480x270 that's 129 600). So we now know how many pixels have each particular luminance.

If you're a bit rusty with compute shaders (like me), this might not be intuitive at first, so go through the post a few times, take pen & paper and try to understand the concepts behind this technique.

That's all! :) That's how The Witcher 3 calculates its histogram of luminance. I certainly learned a lot while writing this post. Congratulations to the people at CD Projekt Red!

If you are interested in the full HLSL shader, it's here. My ambition is always to get assembly as close as possible to the original game's, and I'm more than happy that I've managed to do it again! :)

I hope you enjoyed this post.
Thanks for reading!
