Sunday, 17 March 2019

Reverse engineering the rendering of The Witcher 3, part 12 - stupid sky tricks

Welcome,

This part of the series will be slightly different from the previous ones. Today I'd like to show you some aspects of the sky shaders from The Witcher 3.

Why some "stupid tricks" instead of the full shader? Well, there are a few reasons. First of all, the sky shader in The Witcher 3 is quite a complex beast. The pixel shader from the 2015 version has 267 lines of assembly, while the one from the "Blood & Wine" DLC has 385.
Moreover, they take quite a lot of inputs, which doesn't really help when trying to reverse engineer complete (and readable!) HLSL code.

Therefore, I decided to show you some tricks from these shaders only. If I find anything new, this post will be updated.

The differences between the 2015 version of the game and the B&W (2016) add-on are quite notable. These include, for instance, a different calculation of stars and their blinking, and a different approach to rendering the Sun. The Blood & Wine shader also calculates the Milky Way during the night.

I'll start with some basics and switch to stupid tricks later.

Basics

Like most modern video games, The Witcher 3 uses a skydome to represent the sky. Take a look at the hemisphere used for this in The Witcher 3 (2015). On a side note, the bounding box of this mesh ranges from [0,0,0] to [1,1,1] (Z is the up axis) and the mesh has smoothly distributed UVs. We'll use them later.


The idea behind a skydome is similar to a skybox (the mesh being used is the only difference). In the vertex shader we translate the skydome with respect to the observer (usually by the camera position), which gives the illusion that the sky is really far away - we'll never get there.

If you have been following the series for a while, you know that The Witcher 3 uses reversed depth - that is, the far plane is represented by 0.0f and the near plane by 1.0f. To make sure that the output of the skydome lies completely on the far plane, we set MinDepth to the same value as MaxDepth in the viewport parameters:


To learn how the MinDepth and MaxDepth fields are used during the viewport transform, see the viewport documentation at docs.microsoft.com.
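
For reference, this is the depth part of the Direct3D viewport transform. The value d is an assumption here (the capture itself is not reproduced in the text); with reversed depth, where the far plane is 0.0, it would be 0.0:

```latex
z_{w} = \text{MinDepth} + z_{ndc}\,(\text{MaxDepth} - \text{MinDepth})
\qquad\Longrightarrow\qquad
\text{MinDepth} = \text{MaxDepth} = d \;\;\Rightarrow\;\; z_{w} = d
```

In other words, once both fields hold the same value, every skydome fragment ends up at that depth no matter what the vertex shader writes to SV_Position.z.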

Vertex Shader

Let's start with the vertex shader. In The Witcher 3 (2015) the assembly of the VS is as follows:
 vs_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb1[4], immediateIndexed  
    dcl_constantbuffer cb2[6], immediateIndexed  
    dcl_input v0.xyz  
    dcl_input v1.xy  
    dcl_output o0.xy  
    dcl_output o1.xyz  
    dcl_output_siv o2.xyzw, position  
    dcl_temps 2  
   0: mov o0.xy, v1.xyxx  
   1: mad r0.xyz, v0.xyzx, cb2[4].xyzx, cb2[5].xyzx  
   2: mov r0.w, l(1.000000)  
   3: dp4 o1.x, r0.xyzw, cb2[0].xyzw  
   4: dp4 o1.y, r0.xyzw, cb2[1].xyzw  
   5: dp4 o1.z, r0.xyzw, cb2[2].xyzw  
   6: mul r1.xyzw, cb1[0].yyyy, cb2[1].xyzw  
   7: mad r1.xyzw, cb2[0].xyzw, cb1[0].xxxx, r1.xyzw  
   8: mad r1.xyzw, cb2[2].xyzw, cb1[0].zzzz, r1.xyzw  
   9: mad r1.xyzw, cb1[0].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r1.xyzw  
  10: dp4 o2.x, r0.xyzw, r1.xyzw  
  11: mul r1.xyzw, cb1[1].yyyy, cb2[1].xyzw  
  12: mad r1.xyzw, cb2[0].xyzw, cb1[1].xxxx, r1.xyzw  
  13: mad r1.xyzw, cb2[2].xyzw, cb1[1].zzzz, r1.xyzw  
  14: mad r1.xyzw, cb1[1].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r1.xyzw  
  15: dp4 o2.y, r0.xyzw, r1.xyzw  
  16: mul r1.xyzw, cb1[2].yyyy, cb2[1].xyzw  
  17: mad r1.xyzw, cb2[0].xyzw, cb1[2].xxxx, r1.xyzw  
  18: mad r1.xyzw, cb2[2].xyzw, cb1[2].zzzz, r1.xyzw  
  19: mad r1.xyzw, cb1[2].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r1.xyzw  
  20: dp4 o2.z, r0.xyzw, r1.xyzw  
  21: mul r1.xyzw, cb1[3].yyyy, cb2[1].xyzw  
  22: mad r1.xyzw, cb2[0].xyzw, cb1[3].xxxx, r1.xyzw  
  23: mad r1.xyzw, cb2[2].xyzw, cb1[3].zzzz, r1.xyzw  
  24: mad r1.xyzw, cb1[3].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r1.xyzw  
  25: dp4 o2.w, r0.xyzw, r1.xyzw  
  26: ret  

In this scenario the VS outputs only texcoords and the world-space position. In Blood & Wine it also outputs a normalized normal vector. I'll stay with the 2015 version as it's simpler.

Take a look at constant buffer marked as cb2:


Here we have the world matrix (uniform scaling by 100 and translation by the camera position). Nothing fancy. cb2_v4 and cb2_v5 are scale/bias factors which transform vertex positions from the [0-1] range to the [-1;1] one. Along the Z axis (up), however, these coefficients 'squeeze' the mesh.


We have already seen a similar VS in previous parts of the series. The general algorithm is: pass the texcoords through, calculate Position with the scale/bias factors, calculate PositionW in world space, and then calculate the final clip-space position by multiplying the matWorld and matViewProj matrices together and using their product to transform Position into the final SV_Position.

So, the HLSL for this vertex shader would be something like this:
 struct InputStruct {  
      float3 param0 : POSITION;  
      float2 param1 : TEXCOORD;  
      float3 param2 : NORMAL;  
      float4 param3 : TANGENT;  
 };  
   
 struct OutputStruct {  
      float2 param0 : TEXCOORD0;  
      float3 param1 : TEXCOORD1;  
      float4 param2 : SV_Position;  
 };  
   
 OutputStruct EditedShaderVS(in InputStruct IN)  
 {  
      OutputStruct OUT = (OutputStruct)0;  
        
      // Simple texcoords passing  
      OUT.param0 = IN.param1;  
        
        
      // * Manually construct world and viewProj matrices from float4s:  
      row_major matrix matWorld = matrix(cb2_v0, cb2_v1, cb2_v2, float4(0,0,0,1) );  
      matrix matViewProj = matrix(cb1_v0, cb1_v1, cb1_v2, cb1_v3);  
   
      // * Some optional fun with worldMatrix  
      // a) Scale  
      //matWorld._11 = matWorld._22 = matWorld._33 = 0.225f;  
   
      // b) Translate  
      // X Y Z  
      //matWorld._14 = 520.0997;  
      //matWorld._24 = 74.4226;  
      //matWorld._34 = 113.9;  
   
      // Local space - note the scale+bias here!  
      //float3 meshScale = float3(2.0, 2.0, 2.0);  
      //float3 meshBias = float3(-1.0, -1.0, -0.4);  
      float3 meshScale = cb2_v4.xyz;  
      float3 meshBias = cb2_v5.xyz;  
   
      float3 Position = IN.param0 * meshScale + meshBias;  
        
      // World space  
      float4 PositionW = mul(float4(Position, 1.0), transpose(matWorld) );  
      OUT.param1 = PositionW.xyz;  
   
      // Clip space - original approach from The Witcher 3  
      matrix matWorldViewProj = mul(matViewProj, matWorld);  
      OUT.param2 = mul( float4(Position, 1.0), transpose(matWorldViewProj) );  
        
      return OUT;  
 }  

Comparison of my shader (left) and the original one (right):

The great thing about RenderDoc is that it lets you inject your own shader in place of the original one, and your changes affect the pipeline until the very end of the frame. As you can see in the HLSL code, I left some options to change the scaling and translation of the final geometry. You can play with them and achieve some funny results:

Hail to the skydome!

Optimizing the vertex shader

Do you see a problem with the original vertex shader? The per-vertex matrix-matrix multiplication is completely redundant! I found it in at least a few vertex shaders (for instance, in distant rain shafts). We could optimize it by multiplying PositionW by matViewProj directly!

So, we can replace this HLSL code:
      // Clip space - original approach from The Witcher 3  
      matrix matWorldViewProj = mul(matViewProj, matWorld);  
      OUT.param2 = mul( float4(Position, 1.0), transpose(matWorldViewProj) );  

with this one:
      // Clip space - optimized version  
      OUT.param2 = mul( matViewProj, PositionW );  

The optimized version produces the following assembly:
    vs_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer CB1[4], immediateIndexed  
    dcl_constantbuffer CB2[6], immediateIndexed  
    dcl_input v0.xyz  
    dcl_input v1.xy  
    dcl_output o0.xy  
    dcl_output o1.xyz  
    dcl_output_siv o2.xyzw, position  
    dcl_temps 2  
   0: mov o0.xy, v1.xyxx  
   1: mad r0.xyz, v0.xyzx, cb2[4].xyzx, cb2[5].xyzx  
   2: mov r0.w, l(1.000000)  
   3: dp4 r1.x, r0.xyzw, cb2[0].xyzw  
   4: dp4 r1.y, r0.xyzw, cb2[1].xyzw  
   5: dp4 r1.z, r0.xyzw, cb2[2].xyzw  
   6: mov o1.xyz, r1.xyzx  
   7: mov r1.w, l(1.000000)  
   8: dp4 o2.x, cb1[0].xyzw, r1.xyzw  
   9: dp4 o2.y, cb1[1].xyzw, r1.xyzw  
  10: dp4 o2.z, cb1[2].xyzw, r1.xyzw  
  11: dp4 o2.w, cb1[3].xyzw, r1.xyzw  
  12: ret

As you can see, we reduced the number of instructions from 26 to 12 - that's quite a change. I don't know how widespread this problem is in the game, but c'mon CD Projekt Red, maybe a patch or something? :)

I'm not kidding here. You can inject my optimized shader in place of the original one in RenderDoc and see for yourself that this optimization changes nothing in terms of visuals. Honestly, I don't know why CD Projekt Red decided to do a per-vertex matrix-matrix multiplication...
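
Why the two versions agree: matrix multiplication is associative, so building the combined matrix first and transforming the vertex afterwards gives the same clip-space position as transforming the vertex step by step. The notation below is mine, not the shader's: M_W stands for matWorld, M_VP for matViewProj and p for the scaled and biased vertex position with w = 1.

```latex
\underbrace{(M_{VP}\, M_{W})\, p}_{\text{original: matrix product rebuilt per vertex}}
\;=\;
\underbrace{M_{VP}\,(M_{W}\, p)}_{\text{optimized: } M_{W}\,p \text{ is PositionW, already computed}}
```

The results can still differ in the last floating-point bits because the operations happen in a different order, which is consistent with seeing no visual change.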

The Sun

In The Witcher 3 (2015), atmospheric scattering and the Sun are calculated in two separate draw calls:

The Witcher 3 (2015) - before

The Witcher 3 (2015) - with sky

The Witcher 3 (2015) - with sky + the Sun
Rendering the Sun in the 2015 version is pretty similar to rendering the Moon in terms of geometry and blend/depth states.


On the other hand, in Blood & Wine the sky with the Sun is rendered in one pass:

The Witcher 3: Blood & Wine (2016) - before sky

The Witcher 3: Blood & Wine (2016) - with sky and the Sun

No matter how you want to render the Sun, at some point you will need the (normalized) direction of sunlight. The most intuitive way to obtain this vector is to use spherical coordinates. Basically you need only two values representing two angles (in radians!): phi and theta. Since we want a unit vector we can set r = 1, so for a Cartesian coordinate system with Z as the up axis (theta is measured from the Z axis) we can write HLSL code like this:
 float3 vSunDir;  
 vSunDir.x = sin(fTheta)*cos(fPhi);  
 vSunDir.y = sin(fTheta)*sin(fPhi);  
 vSunDir.z = cos(fTheta);  
 vSunDir = normalize(vSunDir);  

Normally you calculate the sunlight direction in your application and then pass it in a constant buffer for further use.
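
As a small sketch, the same conversion can be wrapped in a helper. The function name and the angle convention (Z up, theta measured from +Z, phi measured in the XY plane from +X) are assumptions made for this example, not something taken from the game:

```hlsl
// Hypothetical helper (not from the game): unit direction from spherical angles.
// Assumes a Z-up system, theta = polar angle from +Z, phi = azimuth in the XY
// plane measured from +X. Both angles are in radians.
float3 SunDirectionFromAngles(float fTheta, float fPhi)
{
    float sinTheta, cosTheta, sinPhi, cosPhi;
    sincos(fTheta, sinTheta, cosTheta);
    sincos(fPhi, sinPhi, cosPhi);

    // |v|^2 = sin^2(theta) * (cos^2(phi) + sin^2(phi)) + cos^2(theta) = 1,
    // so the result is already unit length up to rounding error.
    return float3(sinTheta * cosPhi, sinTheta * sinPhi, cosTheta);
}
```

Because the vector is unit length by construction, a normalize() on top of it only guards against rounding error.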

Once we have the sunlight direction, we can dive into the assembly of the pixel shader from Blood & Wine...
  ...   
  100: add r1.xyw, -r0.xyxz, cb12[0].xyxz  
  101: dp3 r2.x, r1.xywx, r1.xywx  
  102: rsq r2.x, r2.x  
  103: mul r1.xyw, r1.xyxw, r2.xxxx  
  104: mov_sat r2.xy, cb12[205].yxyy  
  105: dp3 r2.z, -r1.xywx, -r1.xywx  
  106: rsq r2.z, r2.z  
  107: mul r1.xyw, -r1.xyxw, r2.zzzz  
  ...  

Okay. To start, cb12[0].xyz is the camera position, while r0.xyz stores the vertex position (it's an output from the vertex shader). Therefore, line 100 calculates the worldToCamera vector and lines 101-103 normalize it. But take a look at lines 105-107. We could write them as normalize( -worldToCamera ), which means we calculate the normalized cameraToWorld vector.

  120: dp3_sat r1.x, cb12[203].yzwy, r1.xywx  

Then we calculate the dot product of the cameraToWorld and sunDirection vectors! Remember that they have to be normalized. We also saturate the whole expression to clamp it to the [0-1] range.

Cool! We have this dot product in r1.x. Let's find the next use of it...
  152: log r1.x, r1.x  
  153: mul r1.x, r1.x, cb12[203].x  
  154: exp r1.x, r1.x  
  155: mul r1.x, r2.y, r1.x  


The "log, mul, exp" triple is, simply speaking, exponentiation. As you can see, we raise our cosine (the dot product of normalized vectors) to some power. You may ask, why? This way we can produce a gradient which mimics the Sun. (Line 155 affects the opacity of this gradient, so you can, for instance, set it to zero to hide the Sun completely.) See some examples:

exponent = 54

exponent = 2400
Having this gradient, we use it to interpolate between skyColor and sunColor! To make sure there are no artifacts, we had to saturate in line 120.

Please note that this trick can also be used to mimic the corona phenomenon for the Moon (with lower values of the exponent). For this you will need a moonDirection vector, which can easily be calculated with spherical coordinates.
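
To connect the "log, mul, exp" triple with HLSL: the log and exp instructions in the disassembly are base-2 operations, so the three instructions correspond to the expansion below. The function name is made up for this sketch:

```hlsl
// Roughly what the compiler emits for pow(x, y):
//   log r, x     -> log2(x)
//   mul r, r, y  -> y * log2(x)
//   exp r, r     -> 2^(y * log2(x))
float PowViaLogExp(float x, float y)
{
    return exp2(y * log2(x));
}
```

This also shows why the saturate matters: log2 of a negative number is NaN, so a negative cosine (looking away from the Sun) would produce garbage, while a cosine clamped to 0 gives log2(0) = -inf and exp2(-inf) = 0 for any positive exponent.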

Final HLSL can look similar to the following snippet:
 float3 vCamToWorld = normalize( PosW - CameraPos );  
   
 float cosTheta = saturate( dot(vSunDir, vCamToWorld) );  
 float sunGradient = pow( cosTheta, sunExponent );  
   
 float3 color = lerp( skyColor, sunColor, sunGradient );  
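
The snippet above leaves out the opacity factor from line 155. A slightly fuller sketch could look like this; the function and parameter names are mine, and the mapping to constants in the comments is read off the assembly fragments shown earlier:

```hlsl
// Sketch only - names are hypothetical, not the game's.
//   vSunDir     ~ cb12[203].yzw (line 120)
//   sunExponent ~ cb12[203].x   (line 153)
//   sunOpacity  ~ cb12[205].x   (saturated in line 104, applied in line 155)
float3 ApplySun(float3 skyColor, float3 sunColor,
                float3 vSunDir, float3 vCamToWorld,
                float sunExponent, float sunOpacity)
{
    float cosTheta = saturate(dot(vSunDir, vCamToWorld));

    // Power curve: the higher the exponent, the smaller and sharper the disc
    float sunGradient = pow(cosTheta, sunExponent);

    // Opacity of 0 hides the Sun completely
    sunGradient *= saturate(sunOpacity);

    return lerp(skyColor, sunColor, sunGradient);
}
```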

Moving stars

If you made a time-lapse of a clear night sky in The Witcher 3, you would notice that the stars are not static - they move slightly across the sky over time! I noticed this quite by accident and wanted to see how it was done.

Let's start with the fact that stars in The Witcher 3 are represented by a 1024x1024x6 cubemap. If you think about it, it's a very handy solution, as a direction vector can be used directly to sample the cubemap.

Consider the following piece of assembly:
  159: add r1.xyz, -v1.xyzx, cb1[8].xyzx  
  160: dp3 r0.w, r1.xyzx, r1.xyzx  
  161: rsq r0.w, r0.w  
  162: mul r1.xyz, r0.wwww, r1.xyzx  
  163: mul r2.xyz, cb12[204].zwyz, l(0.000000, 0.000000, 1.000000, 0.000000)  
  164: mad r2.xyz, cb12[204].yzwy, l(0.000000, 1.000000, 0.000000, 0.000000), -r2.xyzx  
  165: mul r4.xyz, r2.xyzx, cb12[204].zwyz  
  166: mad r4.xyz, r2.zxyz, cb12[204].wyzw, -r4.xyzx  
  167: dp3 r4.x, r1.xyzx, r4.xyzx  
  168: dp2 r4.y, r1.xyxx, r2.yzyy  
  169: dp3 r4.z, r1.xyzx, cb12[204].yzwy  
  170: dp3 r0.w, r4.xyzx, r4.xyzx  
  171: rsq r0.w, r0.w  
  172: mul r2.xyz, r0.wwww, r4.xyzx  
  173: sample_indexable(texturecube)(float,float,float,float) r4.xyz, r2.xyzx, t0.xyzw, s0  

To calculate the final sampling vector (line 173), we start by calculating the normalized worldToCamera vector (lines 159-162).

Then we calculate two cross products (163-164, 165-166) with moonDirection and later perform three dot products to get the final sampling vector. HLSL:

 float3 vWorldToCamera = normalize( g_CameraPos.xyz - Input.PositionW.xyz );  
 float3 vMoonDirection = cb12_v204.yzw;  
   
 float3 vStarsSamplingDir = cross( vMoonDirection, float3(0, 0, 1) );  
 float3 vStarsSamplingDir2 = cross( vStarsSamplingDir, vMoonDirection );  
   
 float dirX = dot( vWorldToCamera, vStarsSamplingDir2 );  
 float dirY = dot( vWorldToCamera, vStarsSamplingDir );  
 float dirZ = dot( vWorldToCamera, vMoonDirection);  
 float3 dirXYZ = normalize( float3(dirX, dirY, dirZ) );  
   
 float3 starsColor = texNightStars.Sample( samplerAnisoWrap, dirXYZ ).rgb;  

Note to self: This is really well thought out and I definitely have to investigate it in more detail.
Note to readers: If you know more about this operation, let me know!
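
One possible reading, offered as an interpretation rather than a confirmed explanation: the two cross products build a coordinate frame whose third axis is the Moon direction, and the three dot products express the view vector in that frame. As the Moon travels across the sky the frame turns with it, so the cubemap lookup turns too. The function name below is hypothetical:

```hlsl
// Same math as the snippet above, written as a change of basis.
float3 StarsSamplingDir(float3 vWorldToCamera, float3 vMoonDirection)
{
    float3 axisY = cross(vMoonDirection, float3(0, 0, 1)); // perpendicular to Moon direction and world up
    float3 axisX = cross(axisY, vMoonDirection);            // completes the frame
    float3 axisZ = vMoonDirection;

    // float3x3(a, b, c) fills rows, so mul(M, v) = (dot(a, v), dot(b, v), dot(c, v))
    float3x3 toMoonFrame = float3x3(axisX, axisY, axisZ);

    // axisX and axisY are shorter than 1 unless the Moon sits on the horizon,
    // which is why the result has to be normalized before sampling.
    return normalize(mul(toMoonFrame, vWorldToCamera));
}
```

Note that the frame degenerates when the Moon direction is parallel to (0, 0, 1), because the first cross product becomes a zero vector.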

Blinking stars

Another nice trick I wanted to investigate in more detail is the blinking of stars. If you walk around, let's say, the outskirts of Novigrad under a clear sky, you can notice that the stars are blinking.

I was curious how this was implemented. The difference between the 2015 version and Blood & Wine is quite big. For simplicity I'll stay with the 2015 version.

So we start just after sampling starsColor in the previous section:
  174: mul r0.w, v0.x, l(100.000000)  
  175: round_ni r1.w, r0.w  
  176: mad r2.w, v0.y, l(50.000000), cb0[0].x  
  177: round_ni r4.w, r2.w  
  178: bfrev r4.w, r4.w  
  179: iadd r5.x, r1.w, r4.w  
  180: ishr r5.y, r5.x, l(13)  
  181: xor r5.x, r5.x, r5.y  
  182: imul null, r5.y, r5.x, r5.x  
  183: imad r5.y, r5.y, l(0x0000ec4d), l(0.0000000000000000000000000000000000001)  
  184: imad r5.x, r5.x, r5.y, l(146956042240.000000)  
  185: and r5.x, r5.x, l(0x7fffffff)  
  186: itof r5.x, r5.x  
  187: mad r5.y, v0.x, l(100.000000), l(-1.000000)  
  188: round_ni r5.y, r5.y  
  189: iadd r4.w, r4.w, r5.y  
  190: ishr r5.z, r4.w, l(13)  
  191: xor r4.w, r4.w, r5.z  
  192: imul null, r5.z, r4.w, r4.w  
  193: imad r5.z, r5.z, l(0x0000ec4d), l(0.0000000000000000000000000000000000001)  
  194: imad r4.w, r4.w, r5.z, l(146956042240.000000)  
  195: and r4.w, r4.w, l(0x7fffffff)  
  196: itof r4.w, r4.w  
  197: add r5.z, r2.w, l(-1.000000)  
  198: round_ni r5.z, r5.z  
  199: bfrev r5.z, r5.z  
  200: iadd r1.w, r1.w, r5.z  
  201: ishr r5.w, r1.w, l(13)  
  202: xor r1.w, r1.w, r5.w  
  203: imul null, r5.w, r1.w, r1.w  
  204: imad r5.w, r5.w, l(0x0000ec4d), l(0.0000000000000000000000000000000000001)  
  205: imad r1.w, r1.w, r5.w, l(146956042240.000000)  
  206: and r1.w, r1.w, l(0x7fffffff)  
  207: itof r1.w, r1.w  
  208: mul r1.w, r1.w, l(0.000000001)  
  209: iadd r5.y, r5.z, r5.y  
  210: ishr r5.z, r5.y, l(13)  
  211: xor r5.y, r5.y, r5.z  
  212: imul null, r5.z, r5.y, r5.y  
  213: imad r5.z, r5.z, l(0x0000ec4d), l(0.0000000000000000000000000000000000001)  
  214: imad r5.y, r5.y, r5.z, l(146956042240.000000)  
  215: and r5.y, r5.y, l(0x7fffffff)  
  216: itof r5.y, r5.y  
  217: frc r0.w, r0.w  
  218: add r0.w, -r0.w, l(1.000000)  
  219: mul r5.z, r0.w, r0.w  
  220: mul r0.w, r0.w, r5.z  
  221: mul r5.xz, r5.xxzx, l(0.000000001, 0.000000, 3.000000, 0.000000)  
  222: mad r0.w, r0.w, l(-2.000000), r5.z  
  223: frc r2.w, r2.w  
  224: add r2.w, -r2.w, l(1.000000)  
  225: mul r5.z, r2.w, r2.w  
  226: mul r2.w, r2.w, r5.z  
  227: mul r5.z, r5.z, l(3.000000)  
  228: mad r2.w, r2.w, l(-2.000000), r5.z  
  229: mad r4.w, r4.w, l(0.000000001), -r5.x  
  230: mad r4.w, r0.w, r4.w, r5.x  
  231: mad r5.x, r5.y, l(0.000000001), -r1.w  
  232: mad r0.w, r0.w, r5.x, r1.w  
  233: add r0.w, -r4.w, r0.w  
  234: mad r0.w, r2.w, r0.w, r4.w  
  235: mad r2.xyz, r0.wwww, l(0.000500, 0.000500, 0.000500, 0.000000), r2.xyzx  
  236: sample_indexable(texturecube)(float,float,float,float) r2.xyz, r2.xyzx, t0.xyzw, s0  
  237: log r4.xyz, r4.xyzx  
  238: mul r4.xyz, r4.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000)  
  239: exp r4.xyz, r4.xyzx  
  240: log r2.xyz, r2.xyzx  
  241: mul r2.xyz, r2.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000)  
  242: exp r2.xyz, r2.xyzx  
  243: mul r2.xyz, r2.xyzx, r4.xyzx  

Huh. Let's take a look at the very end of this quite big piece of assembly.

Once we have sampled starsColor in line 173, we calculate some offset value. This offset is used to perturb the first sampling direction (r2.xyz, line 235). Then we sample the stars cubemap again, perform gamma correction on these two values (237-242) and multiply them together (243).

Simple, isn't it? Well, not really. Think about this offset for a while. It must be different across the whole skydome - stars blinking the same way would look very unrealistic.

To make sure that the offset is as diverse as possible, we take advantage of the UVs wrapped across the skydome (v0.xy) and the elapsed time from the constant buffer (cb0[0].x).

If you are unfamiliar with this intimidating ishr/xor/and thing, take a look at the lightning effect (part 11 of this series) to learn more about integer noise.

So as you can see, integer noise is called here four times, but it's used differently than in the lightning shader. To make the results even more random, the input integer for the noise is a sum (iadd) of two values, one of which has its bits reversed (reversebits intrinsic; bfrev instruction).

Okay, take it easy. Let's start from the beginning.
We have 4 "iterations" of integer noise. I analyzed the assembly, and the calculation of all 4 iterations looks like this:
 // * Inputs - UV and elapsed time in seconds  
 float2 starsUV;  
 starsUV.x = 100.0 * Input.TextureUV.x;       
 starsUV.y = 50.0  * Input.TextureUV.y + g_fTime;  
             
 // * Iteration 1  
 int iStars1_A = reversebits( asint( floor(starsUV.y) ) );  
 int iStars1_B = asint( floor(starsUV.x) );            
   
 float fStarsNoise1 = integerNoise( iStars1_A + iStars1_B );  
             
   
 // * Iteration 2  
 int iStars2_A = reversebits( asint( floor(starsUV.y) ) );  
 int iStars2_B = asint( floor( starsUV.x - 1.0 ) );       
   
 float fStarsNoise2 = integerNoise( iStars2_A + iStars2_B );  
        
   
 // * Iteration 3  
 int iStars3_A = reversebits( asint( floor( starsUV.y - 1.0 ) ) );  
 int iStars3_B = asint( floor(starsUV.x) );  
   
 float fStarsNoise3 = integerNoise( iStars3_A + iStars3_B );  
             
   
 // * Iteration 4  
 int iStars4_A = reversebits( asint( floor( starsUV.y - 1.0 ) ) );  
 int iStars4_B = asint( floor( starsUV.x - 1.0 ) );  
   
 float fStarsNoise4 = integerNoise( iStars4_A + iStars4_B );  

The final outputs of all these 4 iterations are (follow itof instructions to find them):

Iteration 1 - r5.x,
Iteration 2 - r4.w,
Iteration 3 - r1.w,
Iteration 4 - r5.y

After the last itof (line 216) we have:
  217: frc r0.w, r0.w   
  218: add r0.w, -r0.w, l(1.000000)   
  219: mul r5.z, r0.w, r0.w   
  220: mul r0.w, r0.w, r5.z   
  221: mul r5.xz, r5.xxzx, l(0.000000001, 0.000000, 3.000000, 0.000000)   
  222: mad r0.w, r0.w, l(-2.000000), r5.z   
  223: frc r2.w, r2.w   
  224: add r2.w, -r2.w, l(1.000000)   
  225: mul r5.z, r2.w, r2.w   
  226: mul r2.w, r2.w, r5.z   
  227: mul r5.z, r5.z, l(3.000000)   
  228: mad r2.w, r2.w, l(-2.000000), r5.z   

These lines calculate S-curve weights based on the fractional part of the UVs, just like in the case of lightnings. So:

  float s_curve( float x )   
  {   
    float x2 = x * x;   
    float x3 = x2 * x;   
      
    // -2x^3 + 3x^2   
    return -2.0*x3 + 3.0*x2;   
  }  
   
 ...  
 
 // lines 217-222
 float weightX = 1.0 - frac( starsUV.x );  
 weightX = s_curve( weightX );  
   
 // lines 223-228
 float weightY = 1.0 - frac( starsUV.y );  
 weightY = s_curve( weightY );  

As you can expect, these factors serve to interpolate the noise smoothly and generate the final offset for the sampling coordinates:
  229: mad r4.w, r4.w, l(0.000000001), -r5.x   
  230: mad r4.w, r0.w, r4.w, r5.x   
  float noise0 = lerp( fStarsNoise1, fStarsNoise2, weightX );  
   
  231: mad r5.x, r5.y, l(0.000000001), -r1.w   
  232: mad r0.w, r0.w, r5.x, r1.w   
  float noise1 = lerp( fStarsNoise3, fStarsNoise4, weightX );  
   
  233: add r0.w, -r4.w, r0.w   
  234: mad r0.w, r2.w, r0.w, r4.w   
  float offset = lerp( noise0, noise1, weightY );            
   
  235: mad r2.xyz, r0.wwww, l(0.000500, 0.000500, 0.000500, 0.000000), r2.xyzx   
  236: sample_indexable(texturecube)(float,float,float,float) r2.xyz, r2.xyzx, t0.xyzw, s0   
  float3 starsPerturbedDir = dirXYZ + offset * 0.0005;  
    
  float3 starsColorDisturbed = texNightStars.Sample( samplerAnisoWrap, starsPerturbedDir ).rgb;


Once we have starsColorDisturbed, the hardest part is over. Phew!

The next step is to perform gamma correction on both starsColor and starsColorDisturbed and multiply them:
  starsColor = pow( starsColor, 2.2 );  
  starsColorDisturbed = pow( starsColorDisturbed, 2.2 );  
   
  float3 starsFinal = starsColor * starsColorDisturbed;  
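
Putting the pieces of this section together, the whole blinking step could be sketched as a single function. BlinkingStars and starsCornerNoise are names invented for this sketch; integerNoise and s_curve follow the versions shown in this series, and the t0/s0 registers come from the sample instructions in the listing:

```hlsl
TextureCube  texNightStars    : register(t0);
SamplerState samplerAnisoWrap : register(s0);

float integerNoise(int n)
{
    n = (n >> 13) ^ n;
    int nn = (n * (n * n * 60493 + 19990303) + 1376312589) & 0x7fffffff;
    return (float)nn / 1073741824.0;
}

float s_curve(float x)
{
    return x * x * (3.0 - 2.0 * x);   // same polynomial as -2x^3 + 3x^2
}

// Noise for one corner of the grid cell: the floored floats are reinterpreted
// as integers (no conversion) and the bits of the Y term are reversed.
float starsCornerNoise(float2 cell)
{
    int a = asint(reversebits(asuint(floor(cell.y))));
    int b = asint(floor(cell.x));
    return integerNoise(a + b);
}

// Pixel shader only (Sample needs derivatives).
float3 BlinkingStars(float3 dirXYZ, float2 uv, float fTime)
{
    float2 starsUV = float2(100.0 * uv.x, 50.0 * uv.y + fTime);

    float n1 = starsCornerNoise(starsUV);                     // iteration 1
    float n2 = starsCornerNoise(starsUV - float2(1.0, 0.0));  // iteration 2
    float n3 = starsCornerNoise(starsUV - float2(0.0, 1.0));  // iteration 3
    float n4 = starsCornerNoise(starsUV - float2(1.0, 1.0));  // iteration 4

    float weightX = s_curve(1.0 - frac(starsUV.x));
    float weightY = s_curve(1.0 - frac(starsUV.y));

    // Bilinear blend of the four corner values
    float offset = lerp(lerp(n1, n2, weightX), lerp(n3, n4, weightX), weightY);

    float3 starsColor          = texNightStars.Sample(samplerAnisoWrap, dirXYZ).rgb;
    float3 starsColorDisturbed = texNightStars.Sample(samplerAnisoWrap, dirXYZ + offset * 0.0005).rgb;

    return pow(starsColor, 2.2) * pow(starsColorDisturbed, 2.2);
}
```

Since time is added only to the Y coordinate, the noise pattern scrolls along one UV axis, and each star is dimmed by whatever the slightly shifted second lookup happens to hit.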

Stars - the final touches

We have starsFinal in r1.xyz. Here is what happens at the end of star processing:
  256: log r1.xyz, r1.xyzx  
  257: mul r1.xyz, r1.xyzx, l(2.500000, 2.500000, 2.500000, 0.000000)  
  258: exp r1.xyz, r1.xyzx  
  259: min r1.xyz, r1.xyzx, l(1.000000, 1.000000, 1.000000, 0.000000)  
  260: add r0.w, -cb0[9].w, l(1.000000)  
  261: mul r1.xyz, r0.wwww, r1.xyzx  
  262: mul r1.xyz, r1.xyzx, l(10.000000, 10.000000, 10.000000, 0.000000)  

This is much, much easier compared to the blinking and moving stars.
We start by raising starsFinal to the power of 2.5 - this allows controlling the density of stars. Pretty clever. Then we make sure the maximum color of stars is float3(1, 1, 1).

cb0[9].w is used to control the general visibility of stars. So in daytime expect it to be set to 1.0 (which results in multiplying by zero) and to 0.0 at night.

At the end we boost the visibility of stars by a factor of 10. And that's it! :)
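
In HLSL these final touches could be sketched as follows. The function and parameter names are hypothetical; the constants are taken from lines 256-262 of the listing:

```hlsl
// starsHide ~ cb0[9].w: 1.0 hides the stars (day), 0.0 shows them (night)
float3 FinalizeStars(float3 starsFinal, float starsHide)
{
    float3 stars = pow(starsFinal, 2.5);        // lines 256-258: thin out faint stars
    stars = min(stars, float3(1.0, 1.0, 1.0));  // line 259: clamp to white
    stars *= (1.0 - starsHide);                 // lines 260-261: day/night visibility
    return stars * 10.0;                        // line 262: brightness boost
}
```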

Summary

In this post I presented some cool tricks I found while investigating the sky shader from The Witcher 3. I hope you enjoyed it. Thanks for reading!

Take care,
M.

Monday, 4 March 2019

Reverse engineering the rendering of The Witcher 3, part 11 - lightnings

Welcome back!

In the 11th part of the series we will take a look at how lightnings are rendered in The Witcher 3: Wild Hunt.

Lightnings are rendered slightly after the distant rain shafts effect, but still in the forward pass. You can see them in action in the following video:



They last for a very short time, so it's best to play this video at 0.25 speed.
You can see that they are not static images; their intensity changes slightly over time.

There are many similarities here to distant rain shafts in terms of rendering nuances, like the same blending (additive blending) and depth (test enabled, no depth write) states.

Scene without lightning

Scene with lightning
In terms of geometry, lightnings in The Witcher 3 are tree-like meshes. This particular lightning is represented by the following one:

It comes with UV coordinates and normal vectors. They will be useful in the vertex shader stage.

Vertex Shader

Let's take a look at the assembly of the vertex shader:

 vs_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb1[9], immediateIndexed  
    dcl_constantbuffer cb2[6], immediateIndexed  
    dcl_input v0.xyz  
    dcl_input v1.xy  
    dcl_input v2.xyz  
    dcl_input v4.xyzw  
    dcl_input v5.xyzw  
    dcl_input v6.xyzw  
    dcl_input v7.xyzw  
    dcl_output o0.xy  
    dcl_output o1.xyzw  
    dcl_output_siv o2.xyzw, position  
    dcl_temps 3  
   0: mov o0.xy, v1.xyxx  
   1: mov o1.xyzw, v7.xyzw  
   2: mul r0.xyzw, v5.xyzw, cb1[0].yyyy  
   3: mad r0.xyzw, v4.xyzw, cb1[0].xxxx, r0.xyzw  
   4: mad r0.xyzw, v6.xyzw, cb1[0].zzzz, r0.xyzw  
   5: mad r0.xyzw, cb1[0].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
   6: mov r1.w, l(1.000000)  
   7: mad r1.xyz, v0.xyzx, cb2[4].xyzx, cb2[5].xyzx  
   8: dp4 r2.x, r1.xyzw, v4.xyzw  
   9: dp4 r2.y, r1.xyzw, v5.xyzw  
  10: dp4 r2.z, r1.xyzw, v6.xyzw  
  11: add r2.xyz, r2.xyzx, -cb1[8].xyzx  
  12: dp3 r1.w, r2.xyzx, r2.xyzx  
  13: rsq r1.w, r1.w  
  14: div r1.w, l(1.000000, 1.000000, 1.000000, 1.000000), r1.w  
  15: mul r1.w, r1.w, l(0.000001)  
  16: mad r2.xyz, v2.xyzx, l(2.000000, 2.000000, 2.000000, 0.000000), l(-1.000000, -1.000000, -1.000000, 0.000000)  
  17: mad r1.xyz, r2.xyzx, r1.wwww, r1.xyzx  
  18: mov r1.w, l(1.000000)  
  19: dp4 o2.x, r1.xyzw, r0.xyzw  
  20: mul r0.xyzw, v5.xyzw, cb1[1].yyyy  
  21: mad r0.xyzw, v4.xyzw, cb1[1].xxxx, r0.xyzw  
  22: mad r0.xyzw, v6.xyzw, cb1[1].zzzz, r0.xyzw  
  23: mad r0.xyzw, cb1[1].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
  24: dp4 o2.y, r1.xyzw, r0.xyzw  
  25: mul r0.xyzw, v5.xyzw, cb1[2].yyyy  
  26: mad r0.xyzw, v4.xyzw, cb1[2].xxxx, r0.xyzw  
  27: mad r0.xyzw, v6.xyzw, cb1[2].zzzz, r0.xyzw  
  28: mad r0.xyzw, cb1[2].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
  29: dp4 o2.z, r1.xyzw, r0.xyzw  
  30: mul r0.xyzw, v5.xyzw, cb1[3].yyyy  
  31: mad r0.xyzw, v4.xyzw, cb1[3].xxxx, r0.xyzw  
  32: mad r0.xyzw, v6.xyzw, cb1[3].zzzz, r0.xyzw  
  33: mad r0.xyzw, cb1[3].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
  34: dp4 o2.w, r1.xyzw, r0.xyzw  
  35: ret  

There are many similarities here to the vertex shader for distant rain shafts, so I won't repeat myself. The one major difference I want to show is in lines 11-18:

  11: add r2.xyz, r2.xyzx, -cb1[8].xyzx  
  12: dp3 r1.w, r2.xyzx, r2.xyzx  
  13: rsq r1.w, r1.w  
  14: div r1.w, l(1.000000, 1.000000, 1.000000, 1.000000), r1.w  
  15: mul r1.w, r1.w, l(0.000001)  
  16: mad r2.xyz, v2.xyzx, l(2.000000, 2.000000, 2.000000, 0.000000), l(-1.000000, -1.000000, -1.000000, 0.000000)  
  17: mad r1.xyz, r2.xyzx, r1.wwww, r1.xyzx  
  18: mov r1.w, l(1.000000)  
  19: dp4 o2.x, r1.xyzw, r0.xyzw  

To start, cb1[8].xyz is the camera position while r2.xyz is the world-space position, so line 11 calculates the vector from the camera to the world position. Then, lines 12-15 compute length( worldPos - cameraPos ) * 0.000001.

v2.xyz is the normal vector from the input geometry. Line 16 unpacks it from the [0-1] range to the [-1;1] one.
Then the final position is calculated. Note that in line 17 the offset is added to r1.xyz, that is, to the scaled and biased position from line 7, before the instance world matrix is applied:

finalPos = pos + length( worldPos - cameraPos ) * 0.000001 * normalVector


HLSL snippet for this operation would be something like this:

      ...  
      // final world-space position  
      float3 vNormal = Input.NormalW * 2.0 - 1.0;  
      float lencameratoworld = length( PositionL - g_cameraPos.xyz) * 0.000001;  
   
      PositionL += vNormal*lencameratoworld;  
   
      // SV_Position   
      float4x4 matModelViewProjection = mul(g_viewProjMatrix, matInstanceWorld );   
      Output.PositionH = mul( float4(PositionL, 1.0), transpose(matModelViewProjection) );      
   
      return Output;   

Such an operation causes a slight "explosion" of the mesh (in the direction of the normal vector). I did some simple experiments and replaced 0.000001 with a few different values; see the results:

0.000002

0.000005

0.00001

0.000025

Pixel Shader

Ok, we're done with the vertex shader, time to see the assembly of the pixel shader!
 ps_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb0[1], immediateIndexed  
    dcl_constantbuffer cb2[3], immediateIndexed  
    dcl_constantbuffer cb4[5], immediateIndexed  
    dcl_input_ps linear v0.x  
    dcl_input_ps linear v1.w  
    dcl_output o0.xyzw  
    dcl_temps 1  
   0: mad r0.x, cb0[0].x, cb4[4].x, v0.x  
   1: add r0.y, r0.x, l(-1.000000)  
   2: round_ni r0.y, r0.y  
   3: ishr r0.z, r0.y, l(13)  
   4: xor r0.y, r0.y, r0.z  
   5: imul null, r0.z, r0.y, r0.y  
   6: imad r0.z, r0.z, l(0x0000ec4d), l(0.0000000000000000000000000000000000001)  
   7: imad r0.y, r0.y, r0.z, l(146956042240.000000)  
   8: and r0.y, r0.y, l(0x7fffffff)  
   9: round_ni r0.z, r0.x  
  10: frc r0.x, r0.x  
  11: add r0.x, -r0.x, l(1.000000)  
  12: ishr r0.w, r0.z, l(13)  
  13: xor r0.z, r0.z, r0.w  
  14: imul null, r0.w, r0.z, r0.z  
  15: imad r0.w, r0.w, l(0x0000ec4d), l(0.0000000000000000000000000000000000001)  
  16: imad r0.z, r0.z, r0.w, l(146956042240.000000)  
  17: and r0.z, r0.z, l(0x7fffffff)  
  18: itof r0.yz, r0.yyzy  
  19: mul r0.z, r0.z, l(0.000000001)  
  20: mad r0.y, r0.y, l(0.000000001), -r0.z  
  21: mul r0.w, r0.x, r0.x  
  22: mul r0.x, r0.x, r0.w  
  23: mul r0.w, r0.w, l(3.000000)  
  24: mad r0.x, r0.x, l(-2.000000), r0.w  
  25: mad r0.x, r0.x, r0.y, r0.z  
  26: add r0.y, -cb4[2].x, cb4[3].x  
  27: mad_sat r0.x, r0.x, r0.y, cb4[2].x  
  28: mul r0.x, r0.x, v1.w  
  29: mul r0.yzw, cb4[0].xxxx, cb4[1].xxyz  
  30: mul r0.xyzw, r0.xyzw, cb2[2].wxyz  
  31: mul o0.xyz, r0.xxxx, r0.yzwy  
  32: mov o0.w, r0.x  
  33: ret  

The good thing: this code is not that long.
The bad thing:
   3: ishr r0.z, r0.y, l(13)  
   4: xor r0.y, r0.y, r0.z  
   5: imul null, r0.z, r0.y, r0.y  
   6: imad r0.z, r0.z, l(0x0000ec4d), l(0.0000000000000000000000000000000000001)  
   7: imad r0.y, r0.y, r0.z, l(146956042240.000000)  
   8: and r0.y, r0.y, l(0x7fffffff)  
...what the heck is this???

To be honest, it's not the first time I've seen that piece of cra... assembly in shaders of The Witcher 3. But when I found it for the first time, I was like: "oh crap, wtf is this?".

Indeed, you can find something like this in a few shaders of TW3. I won't go now through many adventures I've had with this one, but let me just say the answer is integer noise:

 // For more details see: http://libnoise.sourceforge.net/noisegen/  
 float integerNoise( int n )  
 {  
      n = (n >> 13) ^ n;  
      int nn = (n * (n * n * 60493 + 19990303) + 1376312589) & 0x7fffffff;  
      return ((float)nn / 1073741824.0);  
 }  

Phew. As you can see, it's invoked twice in the pixel shader. The libnoise website gives some useful tips on how to implement smooth noise properly. I'll get back to this in a minute.

Take a look at line 0; we perform animation here based on the following formula:

animation = elapsedTime * animationSpeed + TextureUV.x

These values, after being floored (round_ni instruction), are the subsequent inputs to the integer noise. Generally, we calculate the value of the noise for two integers and then calculate the final, interpolated value between them (see libnoise's website for details).

Okay, this is integer noise while all previously mentioned values (also floored values) are floats!
Please notice that there are no ftoi instructions here. My guess is that programmers from CD Projekt Red used here asint HLSL intrinsic function which performs "reinterpret_cast" of floating-point values and treat it like integer pattern.
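To make the difference between the two concrete, here is a small C++ sketch. The helper names asInt and toInt are made up for this illustration: the first one copies the bits, like asint, the second one converts the number, like ftoi would.

```cpp
#include <cstdint>
#include <cstring>

// Bit-for-bit reinterpretation, like HLSL's asint(): no numeric conversion happens.
std::int32_t asInt(float f)
{
    static_assert(sizeof(std::int32_t) == sizeof(float), "float must be 32-bit");
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return bits;
}

// Numeric conversion, which is what an ftoi instruction would do.
std::int32_t toInt(float f)
{
    return static_cast<std::int32_t>(f);
}
```

For example, a floored value of 3.0f becomes 3 with toInt, but 0x40400000 (1077936128) with asInt. For a noise function that only needs "some integer that changes with the input", the second one is good enough.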

The interpolation weight for the two values is calculated in lines 10-11:

interpolationWeight = 1.0 - frac( animation );

Such an approach lets us interpolate between the values over time.
To make the noise smooth, this interpolant is passed to the SCurve function:
 float s_curve( float x )  
 {  
   float x2 = x * x;  
   float x3 = x2 * x;  
     
   // -2x^3 + 3x^2  
   return -2.0*x3 + 3.0*x2;  
 }

Smoothstep function [libnoise.sourceforge.net]

This function is known as "smoothstep". But as you can see from the assembly, this is not the smoothstep intrinsic from HLSL. The intrinsic performs some clamps to make sure the values are correct. Since we know that our interpolationWeight will always be in the [0-1] range, we can safely omit these checks.
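A quick C++ sketch of the difference, with hypothetical helper names: sCurve is the bare cubic, smoothStep adds the remap and clamp that the HLSL intrinsic is documented to perform.

```cpp
#include <algorithm>

// Cubic S-curve without any clamping: -2x^3 + 3x^2.
float sCurve(float x)
{
    return x * x * (3.0f - 2.0f * x);
}

// Remap x from [edge0, edge1] to [0, 1], clamp, then apply the same cubic.
float smoothStep(float edge0, float edge1, float x)
{
    const float t = std::min(std::max((x - edge0) / (edge1 - edge0), 0.0f), 1.0f);
    return sCurve(t);
}
```

Inside [0-1] both give the same result. Outside of it only the clamped version stays sane, e.g. sCurve(2.0f) returns -4 while smoothStep(0, 1, 2.0f) returns 1.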

Calculating the final value involves a few multiplications. Please note that the final output alpha changes depending on the noise value. This is handy, because it affects the opacity of the rendered lightning - just like in real life.

The final pixel shader:
 cbuffer cbPerFrame : register (b0)  
 {  
   float4 cb0_v0;  
   float4 cb0_v1;  
   float4 cb0_v2;  
   float4 cb0_v3;  
 }  
   
 cbuffer cbPerFrame : register (b2)  
 {  
   float4 cb2_v0;  
   float4 cb2_v1;  
   float4 cb2_v2;  
   float4 cb2_v3;  
 }  
   
 cbuffer cbPerFrame : register (b4)  
 {  
   float4 cb4_v0;  
   float4 cb4_v1;  
   float4 cb4_v2;  
   float4 cb4_v3;  
   float4 cb4_v4;  
 }  
   
 struct VS_OUTPUT  
 {  
   float2 Texcoords : Texcoord0;  
   float4 InstanceLODParams : INSTANCE_LOD_PARAMS;  
   float4 PositionH : SV_Position;  
 };  
   
 // Shaders in TW3 use integer noise.  
 // For more details see: http://libnoise.sourceforge.net/noisegen/  
 float integerNoise( int n )  
 {  
   n = (n >> 13) ^ n;  
   int nn = (n * (n * n * 60493 + 19990303) + 1376312589) & 0x7fffffff;  
   return ((float)nn / 1073741824.0);  
 }  
   
 float s_curve( float x )  
 {  
   float x2 = x * x;  
   float x3 = x2 * x;  
   
   // -2x^3 + 3x^2  
   return -2.0*x3 + 3.0*x2;  
 }  
   
 float4 Lightning_TW3_PS( in VS_OUTPUT Input ) : SV_Target
 {  
   // * Inputs  
   float elapsedTime = cb0_v0.x;  
   float animationSpeed = cb4_v4.x;  
   
   float minAmount = cb4_v2.x;  
   float maxAmount = cb4_v3.x;  
   
   float colorMultiplier = cb4_v0.x;  
   float3 colorFilter = cb4_v1.xyz;  
   float3 lightningColorRGB = cb2_v2.rgb;  
   
   
   // Animation using time and X texcoord  
   float animation = elapsedTime * animationSpeed + Input.Texcoords.x;  
   
   // Input parameters for Integer Noise.  
   // They are floored; please note that asint is used here.  
   // That might be an optimization to avoid "ftoi" instructions.  
   int intX0 = asint( floor(animation) );  
   int intX1 = asint( floor(animation-1.0) );  
   
   float n0 = integerNoise( intX0 );  
   float n1 = integerNoise( intX1 );    
   
   // We interpolate "backwards" here.  
   float weight = 1.0 - frac(animation);  
   
   // Following the instructions from libnoise, we perform  
   // smooth interpolation here with cubic s-curve function.  
   float noise = lerp( n0, n1, s_curve(weight) );  
   
   // Make sure we are in [0.0 - 1.0] range.  
   float lightningAmount = saturate( lerp(minAmount, maxAmount, noise) );  
   lightningAmount *= Input.InstanceLODParams.w;    // 1.0  
   lightningAmount *= cb2_v2.w;             // 1.0  
   
   // Calculate final lightning color   
   float3 lightningColor = colorMultiplier * colorFilter;  
   lightningColor *= lightningColorRGB;  
   
   float3 finalLightningColor = lightningColor * lightningAmount;  
   return float4( finalLightningColor, lightningAmount );  
 }  

Summary

In this post I described how lightning is rendered in The Witcher 3.
I'm more than happy that output assembly from my shader is the same as the original one!

On the left - my shader, on the right - original assembly


I hope you enjoyed it! Thanks for reading.

Feel free to comment and take care,
M.

Sunday, January 20, 2019

Reverse engineering the rendering of The Witcher 3, part 10 - distant rain shafts

Welcome to the 10th part of the series! Woohooo! :)

See previous parts here.

This time we will take a look at a cool atmospheric effect I really like - distant rain/light shafts near the horizon. The easiest way to encounter them in the game is to visit the Skellige Islands.



Personally, I love such atmospheric phenomena and I was really curious how graphics programmers from CD Projekt Red implemented this one. Let's find out!

Here are two screenshots before and after applying rain shafts:

Before rain shafts

After rain shafts

Geometry

Our first stop is geometry. The idea is to use a small cylinder:
Cylinder in local space
In terms of local-space position, it's pretty small - positions are in the ( 0.0 - 1.0 ) range.

The input layout for this draw call looks like this...

What is important to us here: Texcoords and Instance_Transform.

Texcoords are wrapped pretty simply: U of both upper and lower base is in [0.02777 - 1.02734] range. V is equal to 1.0 on lower base and equal to 0.0 on upper one. As you can see, it's pretty simple to even generate this mesh procedurally.

As we have this small cylinder in local space, we multiply it by world matrix which is provided by per-instance INSTANCE_TRANSFORM input element. Let's check values of this matrix:



It looks quite intimidating, doesn't it? Don't worry, let's do some decomposing and see what this matrix hides!
 // Requires <DirectXMath.h> and "using namespace DirectX;"  
 XMMATRIX mat( -227.7472f,  159.8043f,  374.0736f, -116.4951f,  
               -194.7577f, -173.3836f, -494.4982f,  238.6908f,  
               -14.16466f, -185.4743f,  784.564f,   -1.45565f,  
                0.0f, 0.0f, 0.0f, 1.0f );  
   
 // Translation is stored in the 4th column here, so transpose before decomposing  
 mat = XMMatrixTranspose( mat );  
   
 XMVECTOR vScale;  
 XMVECTOR vRotateQuat;  
 XMVECTOR vTranslation;  
 XMMatrixDecompose( &vScale, &vRotateQuat, &vTranslation, mat );  
   
 // Rotation matrix...  
 XMMATRIX matRotate = XMMatrixRotationQuaternion( vRotateQuat );  

Results are really interesting:

vRotateQuat: (0.0924987569, -0.314900011, 0.883411944, -0.334462732)
vScale: (299.999969, 300.000000, 1000.00012)
vTranslation: (-116.495102, 238.690796, -1.45564997)

It's important to know camera position at this particular frame: ( -116.5338, 234.8695, 2.09 )

As you can see, we scale the cylinder to make it pretty big in world space (in TW3 the Z axis is the up axis), translate it with respect to the camera position and rotate it.
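If you want to double-check the scale and translation without DirectXMath, a hand-rolled C++ sketch is enough. It assumes what the decomposition above suggests: translation sits in the 4th column, so the scaled basis vectors are the columns of the upper-left 3x3 part and the scale is simply the length of each column. The names (kInstance, extractScale, extractTranslation) are made up for this example.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// First three rows of the captured INSTANCE_TRANSFORM matrix, as listed above.
const float kInstance[3][4] = {
    { -227.7472f,  159.8043f,  374.0736f, -116.4951f },
    { -194.7577f, -173.3836f, -494.4982f,  238.6908f },
    { -14.16466f, -185.4743f,  784.564f,   -1.45565f },
};

// Length of column c of the upper-left 3x3 part.
float columnLength(const float m[3][4], int c)
{
    return std::sqrt(m[0][c] * m[0][c] + m[1][c] * m[1][c] + m[2][c] * m[2][c]);
}

// Scale along each local axis = length of the corresponding column.
Vec3 extractScale(const float m[3][4])
{
    return { columnLength(m, 0), columnLength(m, 1), columnLength(m, 2) };
}

// Translation = 4th column.
Vec3 extractTranslation(const float m[3][4])
{
    return { m[0][3], m[1][3], m[2][3] };
}
```

This gives roughly (300, 300, 1000) for the scale, in line with vScale above. Rotation is left out of the sketch.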
Here is the cylinder after vertex shader transform:

Cylinder after transforming by vertex shader. See how it's placed relative to view frustum


Vertex Shader

Input geometry and vertex shader are strictly dependent on each other.
Let's take a closer look at assembly of vertex shader:

 vs_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb1[7], immediateIndexed  
    dcl_constantbuffer cb2[6], immediateIndexed  
    dcl_input v0.xyz  
    dcl_input v1.xy  
    dcl_input v4.xyzw  
    dcl_input v5.xyzw  
    dcl_input v6.xyzw  
    dcl_input v7.xyzw  
    dcl_output o0.xyz  
    dcl_output o1.xyzw  
    dcl_output_siv o2.xyzw, position  
    dcl_temps 2  
   0: mov o0.xy, v1.xyxx  
   1: mul r0.xyzw, v5.xyzw, cb1[6].yyyy  
   2: mad r0.xyzw, v4.xyzw, cb1[6].xxxx, r0.xyzw  
   3: mad r0.xyzw, v6.xyzw, cb1[6].zzzz, r0.xyzw  
   4: mad r0.xyzw, cb1[6].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
   5: mad r1.xyz, v0.xyzx, cb2[4].xyzx, cb2[5].xyzx  
   6: mov r1.w, l(1.000000)  
   7: dp4 o0.z, r1.xyzw, r0.xyzw  
   8: mov o1.xyzw, v7.xyzw  
   9: mul r0.xyzw, v5.xyzw, cb1[0].yyyy  
  10: mad r0.xyzw, v4.xyzw, cb1[0].xxxx, r0.xyzw  
  11: mad r0.xyzw, v6.xyzw, cb1[0].zzzz, r0.xyzw  
  12: mad r0.xyzw, cb1[0].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
  13: dp4 o2.x, r1.xyzw, r0.xyzw  
  14: mul r0.xyzw, v5.xyzw, cb1[1].yyyy  
  15: mad r0.xyzw, v4.xyzw, cb1[1].xxxx, r0.xyzw  
  16: mad r0.xyzw, v6.xyzw, cb1[1].zzzz, r0.xyzw  
  17: mad r0.xyzw, cb1[1].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
  18: dp4 o2.y, r1.xyzw, r0.xyzw  
  19: mul r0.xyzw, v5.xyzw, cb1[2].yyyy  
  20: mad r0.xyzw, v4.xyzw, cb1[2].xxxx, r0.xyzw  
  21: mad r0.xyzw, v6.xyzw, cb1[2].zzzz, r0.xyzw  
  22: mad r0.xyzw, cb1[2].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
  23: dp4 o2.z, r1.xyzw, r0.xyzw  
  24: mul r0.xyzw, v5.xyzw, cb1[3].yyyy  
  25: mad r0.xyzw, v4.xyzw, cb1[3].xxxx, r0.xyzw  
  26: mad r0.xyzw, v6.xyzw, cb1[3].zzzz, r0.xyzw  
  27: mad r0.xyzw, cb1[3].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
  28: dp4 o2.w, r1.xyzw, r0.xyzw  
  29: ret  

Together with simple passing of Texcoords (line 0) and Instance_LOD_Params (line 8), two more things are needed for output: SV_Position (obviously) and Height (.z component) of world position.

Remember that local space is in [0-1] range? Well, the vertex shader uses scale&bias to adjust local position just before applying world matrix. Smart!
In this case, we have scale = float3(4, 4, 2) and bias = float3(-2, -2, -1).
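As a tiny illustration, here is the scale&bias step as a C++ sketch (Float3 and scaleAndBias are names made up for this example). With the values above, the [0-1] local-space box is remapped to [-2, 2] in X and Y and to [-1, 1] in Z.

```cpp
struct Float3 { float x, y, z; };

// position * scale + bias, component-wise (a single mad in the assembly, line 5).
Float3 scaleAndBias(Float3 p, Float3 scale, Float3 bias)
{
    return { p.x * scale.x + bias.x,
             p.y * scale.y + bias.y,
             p.z * scale.z + bias.z };
}
```

So the corner (0, 0, 0) lands at (-2, -2, -1), the corner (1, 1, 1) at (2, 2, 1), and the center of the box ends up at the origin, which makes the instance rotation and scale behave nicely.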

The pattern you can notice between lines 9 and 28 is a multiplication of two row-major matrices.
Just take a look at the final vertex shader in HLSL :)
 cbuffer cbPerFrame : register (b1)  
 {  
   row_major float4x4 g_viewProjMatrix;  
   row_major float4x4 g_rainShaftsViewProjMatrix;  
 }  
   
 cbuffer cbPerObject : register (b2)  
 {  
   float4x4 g_mtxWorld;  
   float4 g_modelScale;  
   float4 g_modelBias;  
 }  
   
 struct VS_INPUT  
 {  
   float3 PositionW : POSITION;  
   float2 Texcoord : TEXCOORD;  
   float3 NormalW : NORMAL;  
   float3 TangentW : TANGENT;  
   float4 InstanceTransform0 : INSTANCE_TRANSFORM0;  
   float4 InstanceTransform1 : INSTANCE_TRANSFORM1;  
   float4 InstanceTransform2 : INSTANCE_TRANSFORM2;  
   float4 InstanceLODParams  : INSTANCE_LOD_PARAMS;  
 };  
   
 struct VS_OUTPUT  
 {  
   float3 TexcoordAndZ : Texcoord0;  
   
   float4 LODParams : LODParams;  
   float4 PositionH : SV_Position;  
 };  
   
 VS_OUTPUT RainShaftsVS( VS_INPUT Input )  
 {  
   VS_OUTPUT Output = (VS_OUTPUT)0;  
   
   // simple data passing  
   Output.TexcoordAndZ.xy = Input.Texcoord;  
   Output.LODParams = Input.InstanceLODParams;  
   
   // world space  
   float3 meshScale = g_modelScale.xyz;  // float3( 4, 4, 2 );
   float3 meshBias =  g_modelBias.xyz;   // float3( -2, -2, -1 );
   float3 PositionL = Input.PositionW * meshScale + meshBias;  
   
   // Manually build instanceWorld matrix from float4s:  
   float4x4 matInstanceWorld = float4x4(Input.InstanceTransform0, Input.InstanceTransform1,  
   Input.InstanceTransform2 , float4(0, 0, 0, 1) );  
   
   // World-space Height (.z)  
   float4x4 matWorldInstanceLod = mul( g_rainShaftsViewProjMatrix, matInstanceWorld );  
   Output.TexcoordAndZ.z = mul( float4(PositionL, 1.0), transpose(matWorldInstanceLod) ).z;  
   
   // SV_Position  
   float4x4 matModelViewProjection = mul(g_viewProjMatrix, matInstanceWorld );  
   Output.PositionH = mul( float4(PositionL, 1.0), transpose(matModelViewProjection) );       
 
   return Output;  
 }   
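The repeating mul/mad/mad/mad + dp4 pattern from the assembly can be mimicked on the CPU. Below is a C++ sketch of my reading of it (the types and function names are made up for this example): each block first builds one row of the combined matrix as a weighted sum of the instance matrix rows, and then a single dp4 with the position produces one output component.

```cpp
#include <array>

using Float4   = std::array<float, 4>;
using Float4x4 = std::array<Float4, 4>;   // four rows

float dot4(const Float4& a, const Float4& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Row i of the product A * B: a weighted sum of the rows of B,
// with weights taken from row i of A (the mul + 3x mad sequence).
Float4 rowOfProduct(const Float4x4& a, const Float4x4& b, int i)
{
    Float4 row{};
    for (int k = 0; k < 4; ++k)
        for (int c = 0; c < 4; ++c)
            row[c] += a[i][k] * b[k][c];
    return row;
}

// Transform a point: one dp4 per output component.
Float4 transformPoint(const Float4x4& viewProj, const Float4x4& world, const Float4& p)
{
    Float4 out{};
    for (int i = 0; i < 4; ++i)
        out[i] = dot4(p, rowOfProduct(viewProj, world, i));
    return out;
}
```

For the world-space height the shader needs only one such row (cb1[6]), which is why that part takes just five instructions.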


Comparison between my VS (left) and original one (right):


The differences do not affect calculations ;) I injected my VS into frame and everything is alright!


Pixel Shader

Finally....! To start, I'll show you the inputs.
There are two textures involved: a noise texture and the depth buffer:



Values from constant buffers:





And pixel shader assembly:
 ps_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb0[8], immediateIndexed  
    dcl_constantbuffer cb2[3], immediateIndexed  
    dcl_constantbuffer cb12[23], immediateIndexed  
    dcl_constantbuffer cb4[8], immediateIndexed  
    dcl_sampler s0, mode_default  
    dcl_sampler s15, mode_default  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_resource_texture2d (float,float,float,float) t15  
    dcl_input_ps linear v0.xyz  
    dcl_input_ps linear v1.w  
    dcl_input_ps_siv v2.xy, position  
    dcl_output o0.xyzw  
    dcl_temps 1  
   0: mul r0.xy, cb0[0].xxxx, cb4[5].xyxx  
   1: mad r0.xy, v0.xyxx, cb4[4].xyxx, r0.xyxx  
   2: sample_indexable(texture2d)(float,float,float,float) r0.x, r0.xyxx, t0.xyzw, s0  
   3: add r0.y, -cb4[2].x, cb4[3].x  
   4: mad_sat r0.x, r0.x, r0.y, cb4[2].x  
   5: mul r0.x, r0.x, v0.y  
   6: mul r0.x, r0.x, v1.w  
   7: mul r0.x, r0.x, cb4[1].x  
   8: mul r0.yz, v2.xxyx, cb0[1].zzwz  
   9: sample_l(texture2d)(float,float,float,float) r0.y, r0.yzyy, t15.yxzw, s15, l(0)  
  10: mad r0.y, r0.y, cb12[22].x, cb12[22].y  
  11: mad r0.y, r0.y, cb12[21].x, cb12[21].y  
  12: max r0.y, r0.y, l(0.000100)  
  13: div r0.y, l(1.000000, 1.000000, 1.000000, 1.000000), r0.y  
  14: add r0.y, r0.y, -v0.z  
  15: mul_sat r0.y, r0.y, cb4[6].x  
  16: mul_sat r0.x, r0.y, r0.x  
  17: mad r0.y, cb0[7].y, r0.x, -r0.x  
  18: mad r0.x, cb4[7].x, r0.y, r0.x  
  19: mul r0.xyz, r0.xxxx, cb4[0].xyzx  
  20: log r0.xyz, r0.xyzx  
  21: mul r0.xyz, r0.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000)  
  22: exp r0.xyz, r0.xyzx  
  23: mul r0.xyz, r0.xyzx, cb2[2].xyzx  
  24: mul o0.xyz, r0.xyzx, cb2[2].wwww  
  25: mov o0.w, l(0)  
  26: ret  

Phew! Quite a lot of stuff, but actually it's not that bad.


So, what happens here? At first we calculate animated UVs using elapsed time from cbuffer (cb0[0].x) and some scale/offsets. These texcoords are used to sample from noise texture (line 2).

Once we have the noise value from the texture, we interpolate between min/max values (usually 0 and 1).
Then we perform some multiplications, for instance by the V texture coordinate (remember that V goes from 1 on the lower base to 0 on the upper one?) - line 5.
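Lines 3-7 of the assembly can be summed up with the following C++ sketch. The helper names are made up, and the last two factors are named after what they seem to be (they were both 1.0 in the captured frame), so treat the parameter names as guesses.

```cpp
#include <algorithm>

float hlslSaturate(float x) { return std::min(std::max(x, 0.0f), 1.0f); }
float hlslLerp(float a, float b, float t) { return a + t * (b - a); }

// noise: value sampled from the noise texture
// v:     V texture coordinate (1 on the lower base of the cylinder, 0 on the upper one)
float intensityMask(float noise, float v, float minValue, float maxValue,
                    float lodParam, float intensityScale)
{
    const float intensity = hlslSaturate(hlslLerp(minValue, maxValue, noise));
    return intensity * v * lodParam * intensityScale;
}
```

Because of the multiplication by V, the mask always fades to zero towards the top of the cylinder, no matter what the noise texture contains.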

This way we calculated "intensity mask" - it looks like this:



Notice that distant objects (lighthouse, mountains...) are gone. This happens because the cylinder passes depth test - the cylinder is not on far plane and is drawn in front of aforementioned objects:
depth test
We want the rain shafts to appear farther away (not necessarily on the far plane, though). To achieve that, we compute another mask, the "far objects mask".

So we compute it with the following formula:
farObjectsMask = saturate( (FrustumDepth - CylinderWorldSpaceHeight) * 0.001 );

(0.001 comes from cbuffer)

which gives us desired mask:


( I explained a bit how frustum depth is extracted from depth buffer in my post about sharpen )

Personally, I think this could be done more cheaply: skip calculating the world-space height in the VS and simply multiply the frustum depth by a smaller number, like 0.0004.


Then we multiply both masks, which yields the final one:


Once we have this final mask (line 16), there is another interpolation which pretty much does nothing (at least in the tested scenario). Then we multiply the final mask by the shafts color (line 19), perform gamma correction (lines 20-22) and do the final multiplications (lines 23-24).
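Two of these steps are easier to recognize when written out. The C++ sketch below (function names are made up) shows that the two mad instructions in lines 17-18 form a lerp, and that the log/mul/exp triple in lines 20-22 is how pow(x, 2.2) ends up in the assembly.

```cpp
#include <cmath>

// Lines 17-18: lerp(mask, c * mask, t), written as two mads.
float effectAmount(float mask, float c, float t)
{
    const float delta = c * mask - mask;   // mad r0.y, cb0[7].y, r0.x, -r0.x
    return t * delta + mask;               // mad r0.x, cb4[7].x, r0.y, r0.x
}

// Lines 20-22: pow(x, e) expressed as exp2(e * log2(x)).
float powViaLogExp(float x, float e)
{
    return std::exp2(e * std::log2(x));
}
```

With c = 1.0 the first function returns the mask unchanged for any t, which matches the "does nothing" observation.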

At the end we return color with zero alpha. This is because blending is enabled in this pass:

FinalColor = SourceColor * 1.0 + (1.0 - SourceAlpha) * DestColor.

If you are a bit rusty on how blending works, here is a quick explanation:
SourceColor is the RGB output from the pixel shader while DestColor is the current RGB color of the pixel in the render target. Because SourceAlpha is always 0.0, the aforementioned equation simplifies to:

FinalColor = SourceColor + DestColor.

Simply speaking, we perform additive blending here. If this pixel shader returns (0, 0, 0) the color will remain the same.
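The blend equation is simple enough to check by hand. A minimal C++ sketch (Color and blend are names made up for this example):

```cpp
struct Color { float r, g, b; };

// FinalColor = SourceColor * 1.0 + (1.0 - SourceAlpha) * DestColor
Color blend(Color src, float srcAlpha, Color dst)
{
    const float k = 1.0f - srcAlpha;
    return { src.r + k * dst.r,
             src.g + k * dst.g,
             src.b + k * dst.b };
}
```

With srcAlpha = 0.0 this is a plain sum of both colors. With srcAlpha = 1.0 the destination would be fully replaced, which is not what we want for light shafts.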

Here is final HLSL - I think after this description it will be much easier to follow:
 struct VS_OUTPUT  
 {  
   float3 TexcoordAndWorldspaceHeight : Texcoord0;  
   float4 LODParams : LODParams;    // float4(1,1,1,1)  
   float4 PositionH : SV_Position;  
 };  
   
 float getFrustumDepth( in float depth )  
 {  
   // from [1-0] to [0-1]  
   float d = depth * cb12_v22.x + cb12_v22.y;  
   
   // special coefficients  
   d = d * cb12_v21.x + cb12_v21.y;  
   
   // return frustum depth  
   return 1.0 / max(d, 1e-4);  
 }  
   
 float4 EditedShaderPS( in VS_OUTPUT Input ) : SV_Target0  
 {  
   // * Input from Vertex Shader  
   float2 InputUV = Input.TexcoordAndWorldspaceHeight.xy;  
   float WorldHeight = Input.TexcoordAndWorldspaceHeight.z;  
   float LODParam = Input.LODParams.w;  
   
   // * Inputs  
   float elapsedTime = cb0_v0.x;  
   float2 uvAnimation = cb4_v5.xy;  
   float2 uvScale = cb4_v4.xy;    
   float minValue = cb4_v2.x; // 0.0  
   float maxValue = cb4_v3.x; // 1.0  
   float3 shaftsColor = cb4_v0.rgb;  // RGB( 147, 162, 173 )  
   
   float3 finalColorFilter = cb2_v2.rgb; // float3( 1.175, 1.296, 1.342 );  
   float finalEffectIntensity = cb2_v2.w;  
   
   float2 invViewportSize = cb0_v1.zw;  
   
   float depthScale = cb4_v6.x;  // 0.001  
   
   // sample noise  
   float2 uvOffsets = elapsedTime * uvAnimation;  
   float2 uv = InputUV * uvScale + uvOffsets;    
   float disturb = texture0.Sample( sampler0, uv ).x;  
   
   // * Intensity mask  
   float intensity = saturate( lerp(minValue, maxValue, disturb) );  
   intensity *= InputUV.y;   // transition from (0, 1)  
   intensity *= LODParam;   // usually 1.0  
   intensity *= cb4_v1.x;   // 1.0    
   
   // Sample depth  
   float2 ScreenUV = Input.PositionH.xy * invViewportSize;  
   float hardwareDepth = texture15.SampleLevel( sampler15, ScreenUV, 0 ).x;  
   float frustumDepth = getFrustumDepth( hardwareDepth );  
   
   
   // * Calculate mask covering distant objects behind cylinder.  
   
   // Seems that the input really is world-space height (.z component, see vertex shader)  
   float depth = frustumDepth - WorldHeight;  
   float distantObjectsMask = saturate( depth * depthScale );  
   
   // * calculate final mask  
   float finalEffectMask = saturate( intensity * distantObjectsMask );  
   
   // cb0_v7.y and cb4_v7.x are set to 1.0 so I didn't bother with naming them :)  
   float paramX = finalEffectMask;  
   float paramY = cb0_v7.y * finalEffectMask;  
   float effectAmount = lerp(paramX, paramY, cb4_v7.x);  
   
   // color of shafts comes from constant buffer  
   float3 effectColor = effectAmount * shaftsColor;  
   
   // gamma correction  
   effectColor = pow(effectColor, 2.2);  
   
   // final multiplications  
   effectColor *= finalColorFilter;  
   effectColor *= finalEffectIntensity;  
   
   // return with zero alpha 'cause the blending used here is:  
   // SourceColor * 1.0 + (1.0 - SrcAlpha) * DestColor  
   return float4( effectColor, 0.0 );  
 }   

I'm happy to say that my PS produces the same assembly as the original ;)

I hope you enjoyed it.
Thanks for reading! :)

M.

Friday, December 28, 2018

Reverse engineering the rendering of The Witcher 3, part 9 - GBuffer

Welcome,

This is the ninth part of my series about rendering in The Witcher 3. Click here for full index.

In this part I will show some details about geometry buffer (gbuffer) in The Witcher 3.

I assume here that you know the basics of deferred shading.
Quick recap: the idea is to, well, defer rendering by not calculating all final lighting and shading immediately, but instead separating the calculations into two stages.
In the first one (geometry pass) we fill the GBuffer with data about surfaces (position, normals, specular color etc.) and in the second one (lighting pass) we combine everything and calculate lighting.

Deferred shading is a hugely popular approach because it allows lighting to be calculated in one full-screen pass with techniques like tile-based deferred shading, which greatly improves performance.

Simply speaking, the GBuffer is a collection of textures with properties of geometry. It's very important to design its layout carefully. For a real-life example, see for instance The Rendering Technologies of Crysis 3.

After this brief introduction let's take a look at example frame from The Witcher 3: Blood & Wine:
One of many inns in Toussaint

The main GBuffer consists of three fullscreen render targets with DXGI_FORMAT_R8G8B8A8_UNORM format
and DXGI_FORMAT_D24_UNORM_S8_UINT depth+stencil buffer.

Here are screenshots of them:
Render Target 0 - RGB channels, surface color

Render Target 0 - A channel. I have no idea what it is, really.

Render Target 1 - RGB channels. We have normal vectors in [0-1] range here.

Render Target 1 - A channel. Looks like reflectance!

Render Target 2 - RGB channels. Looks like specular color!
A channel is black in this scene (but it is used later)

Depth buffer. Note that reversed depth is used here

Stencil buffer to mark certain type of pixels (like skin, vegetation etc)
This is not whole GBuffer. Lighting pass also uses reflection probes and other buffers but this is not the subject of this post.

Before I start the "main" part of this post, some general observations first:


General observations


1) The only buffer to clear is depth/stencil.

If you analyze aforementioned textures in any good frame analyzer you may be a little surprised, because there is no "Clear" call on them with exception of Depth/Stencil.

So in reality RenderTarget1 looks like this (notice "blurred" pixels on far plane):

This is simple and nice optimization. 
Takeaway: ClearRenderTargetView calls are not free, so use them only when really necessary.


2) Reversed depth rocks

Many articles have already been written about the precision of a floating-point depth buffer. The Witcher 3 uses reversed-z, which is a natural choice for an open-world game with long draw distances.

For DirectX the switch shouldn't be difficult:

a) Clear the depth buffer with "0" instead of "1".
In the traditional approach we clear the depth buffer to the far value of "1". After reversing depth, the new "far" value is zero, so we need to change that.

b) Flip near and far clip values when calculating projection matrix

c) Change depth test from "Less" to "Greater".

For OpenGL there is a bit more work (see mentioned articles) but it is really worth the effort.


3) Do not store world position

It is that simple. Reconstruct world position from depth in lighting pass.


Pixel Shader

What I want to show in this post is the pixel shader which feeds the GBuffer with surface data.
So by now we know that we store at least color, normals and specular.
Of course it's not as simple as you may think.

The problem with this pixel shader is that it comes in many variants. They differ in number of textures consumed and number of parameters used from constant buffer (probably constant buffer which describes material).

I decided to use this nice barrel for the analysis:
Our heroic barrel!
And please give warm welcome to textures used:

So we have albedo, normal map and specular color. Pretty common scenario.

Before we start, a few words about geometry inputs:
The geometry comes with position, texcoords, normal and tangent buffers.
The vertex shader outputs at least texcoords and normalized tangent/normal/bitangent vectors, multiplied earlier by the world matrix. For more complicated materials (like ones with two diffuse or normal maps) the vertex shader can output other data, but I wanted to show the simple case here.


Pixel Shader as assembly:
 ps_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb4[3], immediateIndexed  
    dcl_sampler s0, mode_default  
    dcl_sampler s13, mode_default  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_resource_texture2d (float,float,float,float) t1  
    dcl_resource_texture2d (float,float,float,float) t2  
    dcl_resource_texture2d (float,float,float,float) t13  
    dcl_input_ps linear v0.zw  
    dcl_input_ps linear v1.xyzw  
    dcl_input_ps linear v2.xyz  
    dcl_input_ps linear v3.xyz  
    dcl_input_ps_sgv v4.x, isfrontface  
    dcl_output o0.xyzw  
    dcl_output o1.xyzw  
    dcl_output o2.xyzw  
    dcl_temps 3  
   0: sample_indexable(texture2d)(float,float,float,float) r0.xyzw, v1.xyxx, t1.xyzw, s0  
   1: sample_indexable(texture2d)(float,float,float,float) r1.xyz, v1.xyxx, t0.xyzw, s0  
   2: add r1.w, r1.y, r1.x  
   3: add r1.w, r1.z, r1.w  
   4: mul r2.x, r1.w, l(0.333300)  
   5: add r2.y, l(-1.000000), cb4[1].x  
   6: mul r2.y, r2.y, l(0.500000)  
   7: mov_sat r2.z, r2.y  
   8: mad r1.w, r1.w, l(-0.666600), l(1.000000)  
   9: mad r1.w, r2.z, r1.w, r2.x  
  10: mul r2.xzw, r1.xxyz, cb4[0].xxyz  
  11: mul_sat r2.xzw, r2.xxzw, l(1.500000, 0.000000, 1.500000, 1.500000)  
  12: mul_sat r1.w, abs(r2.y), r1.w  
  13: add r2.xyz, -r1.xyzx, r2.xzwx  
  14: mad r1.xyz, r1.wwww, r2.xyzx, r1.xyzx  
  15: max r1.w, r1.z, r1.y  
  16: max r1.w, r1.w, r1.x  
  17: lt r1.w, l(0.220000), r1.w  
  18: movc r1.w, r1.w, l(-0.300000), l(-0.150000)  
  19: mad r1.w, v0.z, r1.w, l(1.000000)  
  20: mul o0.xyz, r1.wwww, r1.xyzx  
  21: add r0.xyz, r0.xyzx, l(-0.500000, -0.500000, -0.500000, 0.000000)  
  22: add r0.xyz, r0.xyzx, r0.xyzx  
  23: mov r1.x, v0.w  
  24: mov r1.yz, v1.zzwz  
  25: mul r1.xyz, r0.yyyy, r1.xyzx  
  26: mad r1.xyz, v3.xyzx, r0.xxxx, r1.xyzx  
  27: mad r0.xyz, v2.xyzx, r0.zzzz, r1.xyzx  
  28: uge r1.x, l(0), v4.x  
  29: if_nz r1.x  
  30:  dp3 r1.x, v2.xyzx, r0.xyzx  
  31:  mul r1.xyz, r1.xxxx, v2.xyzx  
  32:  mad r0.xyz, -r1.xyzx, l(2.000000, 2.000000, 2.000000, 0.000000), r0.xyzx  
  33: endif  
  34: sample_indexable(texture2d)(float,float,float,float) r1.xyz, v1.xyxx, t2.xyzw, s0  
  35: max r1.w, r1.z, r1.y  
  36: max r1.w, r1.w, r1.x  
  37: lt r1.w, l(0.200000), r1.w  
  38: movc r2.xyz, r1.wwww, r1.xyzx, l(0.120000, 0.120000, 0.120000, 0.000000)  
  39: add r2.xyz, -r1.xyzx, r2.xyzx  
  40: mad o2.xyz, v0.zzzz, r2.xyzx, r1.xyzx  
  41: lt r1.x, r0.w, l(0.330000)  
  42: mul r1.y, r0.w, l(0.950000)  
  43: movc r1.x, r1.x, r1.y, l(0.330000)  
  44: add r1.x, -r0.w, r1.x  
  45: mad o1.w, v0.z, r1.x, r0.w  
  46: dp3 r0.w, r0.xyzx, r0.xyzx  
  47: rsq r0.w, r0.w  
  48: mul r0.xyz, r0.wwww, r0.xyzx  
  49: max r0.w, abs(r0.y), abs(r0.x)  
  50: max r0.w, r0.w, abs(r0.z)  
  51: lt r1.xy, abs(r0.zyzz), r0.wwww  
  52: movc r1.yz, r1.yyyy, abs(r0.zzyz), abs(r0.zzxz)  
  53: movc r1.xy, r1.xxxx, r1.yzyy, abs(r0.yxyy)  
  54: lt r1.z, r1.y, r1.x  
  55: movc r1.xy, r1.zzzz, r1.xyxx, r1.yxyy  
  56: div r1.z, r1.y, r1.x  
  57: div r0.xyz, r0.xyzx, r0.wwww  
  58: sample_l(texture2d)(float,float,float,float) r0.w, r1.xzxx, t13.yzwx, s13, l(0)  
  59: mul r0.xyz, r0.wwww, r0.xyzx  
  60: mad o1.xyz, r0.xyzx, l(0.500000, 0.500000, 0.500000, 0.000000), l(0.500000, 0.500000, 0.500000, 0.000000)  
  61: mov o0.w, cb4[2].x  
  62: mov o2.w, l(0)  
  63: ret  

The shader has a few stages. I will describe each main part of this shader separately.
But first, as always - a screenshot with values from the constant buffer:

Albedo

We start with the hard stuff. It's not as simple as "OutputColor.rgb = Texture.Sample(uv).rgb".
After we sample RGB from the color texture (line 1), the following instructions (lines 2-14) are something which I called a "desaturation filter". Let me show you the HLSL code:

 float3 albedoColorFilter( in float3 color, in float desaturationFactor, in float3 desaturationValue )  
 {  
   float sumColorComponents = color.r + color.g + color.b;  
    
   float averageColorComponentValue = 0.3333 * sumColorComponents;  
   float oneMinusAverageColorComponentValue = 1.0 - averageColorComponentValue;  
     
   float factor = 0.5 * (desaturationFactor - 1.0);  
     
   float avgColorComponent = lerp(averageColorComponentValue, oneMinusAverageColorComponentValue, saturate(factor));  
   float3 desaturatedColor = saturate(color * desaturationValue * 1.5);  
    
   float mask = saturate( avgColorComponent * abs(factor) );  
   
   float3 finalColor = lerp( color, desaturatedColor, mask );  
   return finalColor;  
 }  

For the majority of objects, this code does nothing but return the original color from the texture. This is achieved by proper "material cbuffer" values. cb4_v1.x is set to 1.0, which results in a mask equal to 0.0, so the lerp returns the input color.
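To play with the filter on the CPU, here is a straightforward C++ port of the HLSL function above. It is just a sketch: Float3 and the small helper functions are made up for this example.

```cpp
#include <algorithm>
#include <cmath>

struct Float3 { float r, g, b; };

float saturate1(float x) { return std::min(std::max(x, 0.0f), 1.0f); }
float lerp1(float a, float b, float t) { return a + t * (b - a); }

Float3 albedoColorFilter(Float3 color, float desaturationFactor, Float3 desaturationValue)
{
    const float sum = color.r + color.g + color.b;
    const float avg = 0.3333f * sum;

    const float factor = 0.5f * (desaturationFactor - 1.0f);
    const float avgComponent = lerp1(avg, 1.0f - avg, saturate1(factor));

    const Float3 desaturated = { saturate1(color.r * desaturationValue.r * 1.5f),
                                 saturate1(color.g * desaturationValue.g * 1.5f),
                                 saturate1(color.b * desaturationValue.b * 1.5f) };

    // factor == 0 (desaturationFactor == 1) gives mask == 0, i.e. no effect.
    const float mask = saturate1(avgComponent * std::fabs(factor));

    return { lerp1(color.r, desaturated.r, mask),
             lerp1(color.g, desaturated.g, mask),
             lerp1(color.b, desaturated.b, mask) };
}
```

Calling it with desaturationFactor = 1.0 returns exactly the input color, while larger factors push the result towards color * desaturationValue * 1.5.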

However, there are some exceptions. The highest value of desaturationFactor I found was 4.0 (never below 1.0) and desaturationValue depends on the material. It can be something like (0.2, 0.3, 0.4); there are no strict rules. Of course I couldn't resist implementing this in my own DX11 framework and here are the results, all with desaturationValue equal to float3( 0.25, 0.3, 0.45 ):

desaturationFactor = 1.0 (no effect)

desaturationFactor = 2.0

desaturationFactor = 3.0

desaturationFactor = 4.0
I'm sure it's just applying material parameters but it's not the end of the albedo part.
Lines 15-20 perform final touches:
  15: max r1.w, r1.z, r1.y   
  16: max r1.w, r1.w, r1.x   
  17: lt r1.w, l(0.220000), r1.w   
  18: movc r1.w, r1.w, l(-0.300000), l(-0.150000)   
  19: mad r1.w, v0.z, r1.w, l(1.000000)   
  20: mul o0.xyz, r1.wwww, r1.xyzx   

v0.z is output from Vertex Shader and it's equal to zero. Remember it, because v0.z will be used later a couple of times.

It seems to be some factor and all this code looks like darkening albedo a little bit, but since v0.z is equal to 0, the color is untouched. HLSL:

   /* ALBEDO */  
   // optional desaturation (?) filter  
   float3 albedoColor = albedoColorFilter( colorTex, cb4_v1.x, cb4_v0.rgb );  
   float albedoMaxComponent = getMaxComponent( albedoColor );  
     
   // I really have no idea what this is  
   // In most of cases Vertex Shader outputs "paramZ" as 0  
   float paramZ = Input.out0.z;  // note, mostly 0  
   
   // Note that 0.70 and 0.85 are not present in the output assembly  
   // Because I wanted to use lerp here I had to adjust them manually.  
   float param = (albedoMaxComponent > 0.22) ? 0.70 : 0.85;  
   float mulParam = lerp(1, param, paramZ);  
   
   // Output  
   pout.RT0.rgb = albedoColor * mulParam;  
   pout.RT0.a = cb4_v2.x;  

Regarding RT0.a, as you can see, it comes from the material's constant buffer, but since the shader has no debug information, it's hard to say exactly what this is. Maybe translucency?

We are done with the first render target!

Normals

We start by unpacking normal map, then we perform normal mapping as usual:
   /* NORMALS */   
   float3 sampledNormal = ((normalTex.xyz - 0.5) * 2);  
   
   // Data to construct TBN matrix  
   float3 Tangent = Input.TangentW.xyz;  
   float3 Normal = Input.NormalW.xyz;  
   float3 Bitangent;  
   Bitangent.x = Input.out0.w;  
   Bitangent.yz = Input.out1.zw;  
   
   // remove this saturate in real scenario, this is a hack to make sure normal-tbn multiplication  
   // will have 'mad' instructions in assembly instead a bunch of 'mov's
   Bitangent = saturate(Bitangent);  
     
   float3x3 TBN = float3x3(Tangent, Bitangent, Normal);  
   float3 normal = mul( sampledNormal, TBN );  

Nothing really surprising so far.

Take a look at lines 28-33:
  28: uge r1.x, l(0), v4.x   
  29: if_nz r1.x   
  30: dp3 r1.x, v2.xyzx, r0.xyzx   
  31: mul r1.xyz, r1.xxxx, v2.xyzx   
  32: mad r0.xyz, -r1.xyzx, l(2.000000, 2.000000, 2.000000, 0.000000), r0.xyzx   
  33: endif   

We can roughly write it this way:
   [branch] if (bIsFrontFace <= 0)  
   {  
      float cosTheta = dot(Input.NormalW, normal);  
      float3 invNormal = cosTheta * Input.NormalW;  
      normal = normal - 2*invNormal;  
   }  

I'm not sure if this is a proper way of writing this. If you know what type of mathematical operation this is - let me know.
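For what it's worth, the expression normal - 2 * dot(N, normal) * N has the same form as a reflection: it mirrors the vector across the plane whose normal is N, and HLSL documents its reflect(i, n) intrinsic with the same formula. Whether the original source used that intrinsic is only a guess. A small C++ sketch with made-up names:

```cpp
struct Float3 { float x, y, z; };

float dot3(Float3 a, Float3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// v - 2 * dot(n, v) * n: mirrors v across the plane with unit normal n.
Float3 reflectAcross(Float3 v, Float3 n)
{
    const float d = 2.0f * dot3(n, v);
    return { v.x - d * n.x, v.y - d * n.y, v.z - d * n.z };
}
```

The component along N changes sign while the rest stays as it was, so a normal pointing "into" the surface on a back face ends up pointing out of it, with the normal-map detail preserved.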

We see that the pixel shader uses SV_IsFrontFace.
What's that? Documentation (I wanted to write 'msdn' but..) comes to the rescue:

"Specifies whether a triangle is front facing. For lines and points, IsFrontFace has the value true. The exception is lines drawn out of triangles (wireframe mode), which sets IsFrontFace the same way as rasterizing the triangle in solid mode. Can be written to by the geometry shader, and read by the pixel shader."

I also wanted to check it for myself. Indeed, the effect is visible in wireframe mode only. I believe the purpose of this piece of code is to properly calculate normals (therefore, lighting) in wireframe mode.
Here is a comparison: first the final scene color in wireframe with this trick off/on, then the gbuffer normals [0-1] texture with this trick off/on:

Scene color without the trick

Scene color with the trick
Normals [0-1] without the trick

Normals [0-1] with the trick
Have you noticed that the format of every render target of the GBuffer is R8G8B8A8_UNORM? That means we have 256 possible values per component. Is it enough for storing normals?

Storing high quality normals with a reasonable amount of bytes in the GBuffer is a known problem, but fortunately there is a lot of material to learn from.

Probably some of you already know what technique is used here. I'd like to point out that in the whole geometry pass there is one additional texture attached to slot #13...:


Ha! The Witcher 3 uses a technique known as "Best Fit Normals". I will not go into details here (refer to the presentation). It was invented around 2009-2010 by Crytek and since CryEngine is open source, BFN is open source too.

BFN causes the "grainy" look of the normals texture.
After scaling the normal with the best fit, we encode it from the [-1, 1] range to the [0, 1] range.
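
For the curious, this is roughly what the encoding step looks like when transcribed to Python from lines 56-70 of the assembly listing further down. It is a sketch under assumptions, not the game's code: `fit_lookup` is a hypothetical stand-in for sampling the texture in slot 13 (its contents are not reproduced here), and the zero guard exists only so the CPU version does not divide by zero.

```python
import math


def best_fit_encode(n, fit_lookup):
    """Return the lookup coordinates and the [0, 1] encoded normal.

    fit_lookup(u, v) is a placeholder for the best-fit texture fetch.
    """
    length = math.sqrt(sum(c * c for c in n))
    n = tuple(c / length for c in n)
    ax, ay, az = (abs(c) for c in n)
    max_abs = max(ax, ay, az)

    # Pick the two components that are not the largest one.
    if az < max_abs:
        a, b = (ay, az) if ay < max_abs else (ax, az)
    else:
        a, b = ax, ay
    hi, lo = max(a, b), min(a, b)

    # Guard added for this CPU sketch only: axis-aligned normals give 0/0.
    u, v = hi, (lo / hi if hi > 0.0 else 0.0)

    # Push the vector onto the unit cube, then apply the fitted scale.
    scale = fit_lookup(u, v) / max_abs
    return (u, v), tuple(c * scale * 0.5 + 0.5 for c in n)


# With a dummy lookup that always returns 1.0 the fit is skipped entirely.
uv, encoded = best_fit_encode((1.0, 2.0, 2.0), lambda u, v: 1.0)
print(uv, encoded)
```

The point of the lookup is that for each direction there is a vector length whose quantized 8-bit representation points closest to the original direction; the texture stores that length.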

Specular 

We start from line 34, by sampling specular texture:
  34: sample_indexable(texture2d)(float,float,float,float) r1.xyz, v1.xyxx, t2.xyzw, s0   
  35: max r1.w, r1.z, r1.y   
  36: max r1.w, r1.w, r1.x   
  37: lt r1.w, l(0.200000), r1.w   
  38: movc r2.xyz, r1.wwww, r1.xyzx, l(0.120000, 0.120000, 0.120000, 0.000000)   
  39: add r2.xyz, -r1.xyzx, r2.xyzx   
  40: mad o2.xyz, v0.zzzz, r2.xyzx, r1.xyzx   

As you can see, there is a similar "darkening" filter as with albedo:
Find the component with the max value, then calculate the "darker" color and interpolate it with the original specular color using a parameter from the vertex shader... which is set to 0, so we output the color from the texture.

HLSL:
   /* SPECULAR */  
   float3 specularTex = texture2.Sample( samplerAnisoWrap, Texcoords ).rgb;  
   
   // Similar algorithm as in Albedo. Calculate max component, compare this with  
   // some threshold and calculate "minimum" value if needed.  
   // Because in the scene I analyzed paramZ was set to zero, value from texture will be  
   // the final result.  
   float specularMaxComponent = getMaxComponent( specularTex );  
   float3 specB = (specularMaxComponent > 0.2) ? specularTex : float3(0.12, 0.12, 0.12);  
   float3 finalSpec = lerp(specularTex, specB, paramZ);  
   pout.RT2.xyz = finalSpec;  
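
To see what the filter would do if the parameter were not zero, here is a small Python sketch of the same logic (function names and the sample color are mine). The interpolation is written as `a + (b - a) * t`, which matches the `add` + `mad` pair in the assembly.

```python
def lerp(a, b, t):
    return a + (b - a) * t


def filter_specular(spec, param_z, threshold=0.2, floor=0.12):
    """Blend towards a grey floor when the brightest channel is too dark."""
    target = spec if max(spec) > threshold else (floor,) * 3
    return tuple(lerp(s, t, param_z) for s, t in zip(spec, target))


dark = (0.05, 0.10, 0.08)
print(filter_specular(dark, 0.0))  # param_z is 0: texture value passes through
print(filter_specular(dark, 1.0))  # param_z is 1: replaced by the grey floor
```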

Reflectivity

I have no idea if this name is proper for this parameter since I don't know how it affects the lighting pass. The thing is that the alpha channel of the input normal map carries additional data:
Alpha channel of "normal map" texture. (c) CD Projekt Red
Assembly:
  41: lt r1.x, r0.w, l(0.330000)   
  42: mul r1.y, r0.w, l(0.950000)   
  43: movc r1.x, r1.x, r1.y, l(0.330000)   
  44: add r1.x, -r0.w, r1.x   
  45: mad o1.w, v0.z, r1.x, r0.w   

Say hello to our old friend, 'v0.z'! This is similar to both albedo and specular:
   /* REFLECTIVITY */  
   float reflectivity = normalTex.a;  
   float reflectivity2 = (reflectivity < 0.33) ? (reflectivity * 0.95) : 0.33;  
     
   float finalReflectivity = lerp(reflectivity, reflectivity2, paramZ);  
   pout.RT1.a = finalReflectivity;  
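
A tiny Python sketch of the remapping (the function name and sample inputs are mine) shows its shape: when the parameter is fully on, values below 0.33 are scaled by 0.95 and everything else is capped at 0.33.

```python
def remap_reflectivity(r, param_z):
    """Reflectivity filter: scale small values, cap the rest at 0.33."""
    limited = r * 0.95 if r < 0.33 else 0.33
    return r + (limited - r) * param_z


for r in (0.1, 0.33, 0.8):
    print(r, remap_reflectivity(r, 0.0), remap_reflectivity(r, 1.0))
```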

Nice! This is the end of analyzing the first variant of the pixel shader.

In terms of result, here is a comparison of my shader (left) with the original one (right):
These differences do not affect calculations so my job is done here ;)



Pixel Shader - "Albedo + Normals" variant

I decided to show you one more variant - now with albedo & normal maps only - without specular texture. The assembly is a bit longer:
 ps_5_0  
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb4[8], immediateIndexed  
    dcl_sampler s0, mode_default  
    dcl_sampler s13, mode_default  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_resource_texture2d (float,float,float,float) t1  
    dcl_resource_texture2d (float,float,float,float) t13  
    dcl_input_ps linear v0.zw  
    dcl_input_ps linear v1.xyzw  
    dcl_input_ps linear v2.xyz  
    dcl_input_ps linear v3.xyz  
    dcl_input_ps_sgv v4.x, isfrontface  
    dcl_output o0.xyzw  
    dcl_output o1.xyzw  
    dcl_output o2.xyzw  
    dcl_temps 4  
   0: mul r0.x, v0.z, cb4[0].x  
   1: sample_indexable(texture2d)(float,float,float,float) r1.xyzw, v1.xyxx, t1.xyzw, s0  
   2: sample_indexable(texture2d)(float,float,float,float) r0.yzw, v1.xyxx, t0.wxyz, s0  
   3: add r2.x, r0.z, r0.y  
   4: add r2.x, r0.w, r2.x  
   5: add r2.z, l(-1.000000), cb4[2].x  
   6: mul r2.yz, r2.xxzx, l(0.000000, 0.333300, 0.500000, 0.000000)  
   7: mov_sat r2.w, r2.z  
   8: mad r2.x, r2.x, l(-0.666600), l(1.000000)  
   9: mad r2.x, r2.w, r2.x, r2.y  
  10: mul r3.xyz, r0.yzwy, cb4[1].xyzx  
  11: mul_sat r3.xyz, r3.xyzx, l(1.500000, 1.500000, 1.500000, 0.000000)  
  12: mul_sat r2.x, abs(r2.z), r2.x  
  13: add r2.yzw, -r0.yyzw, r3.xxyz  
  14: mad r0.yzw, r2.xxxx, r2.yyzw, r0.yyzw  
  15: max r2.x, r0.w, r0.z  
  16: max r2.x, r0.y, r2.x  
  17: lt r2.x, l(0.220000), r2.x  
  18: movc r2.x, r2.x, l(-0.300000), l(-0.150000)  
  19: mad r0.x, r0.x, r2.x, l(1.000000)  
  20: mul o0.xyz, r0.xxxx, r0.yzwy  
  21: add r0.xyz, r1.xyzx, l(-0.500000, -0.500000, -0.500000, 0.000000)  
  22: add r0.xyz, r0.xyzx, r0.xyzx  
  23: mov r1.x, v0.w  
  24: mov r1.yz, v1.zzwz  
  25: mul r1.xyz, r0.yyyy, r1.xyzx  
  26: mad r0.xyw, v3.xyxz, r0.xxxx, r1.xyxz  
  27: mad r0.xyz, v2.xyzx, r0.zzzz, r0.xywx  
  28: uge r0.w, l(0), v4.x  
  29: if_nz r0.w  
  30:  dp3 r0.w, v2.xyzx, r0.xyzx  
  31:  mul r1.xyz, r0.wwww, v2.xyzx  
  32:  mad r0.xyz, -r1.xyzx, l(2.000000, 2.000000, 2.000000, 0.000000), r0.xyzx  
  33: endif  
  34: add r0.w, -r1.w, l(1.000000)  
  35: log r1.xyz, cb4[3].xyzx  
  36: mul r1.xyz, r1.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000)  
  37: exp r1.xyz, r1.xyzx  
  38: mad r0.w, r0.w, cb4[4].x, cb4[5].x  
  39: mul_sat r1.xyz, r0.wwww, r1.xyzx  
  40: log r1.xyz, r1.xyzx  
  41: mul r1.xyz, r1.xyzx, l(0.454545, 0.454545, 0.454545, 0.000000)  
  42: exp r1.xyz, r1.xyzx  
  43: max r0.w, r1.z, r1.y  
  44: max r0.w, r0.w, r1.x  
  45: lt r0.w, l(0.200000), r0.w  
  46: movc r2.xyz, r0.wwww, r1.xyzx, l(0.120000, 0.120000, 0.120000, 0.000000)  
  47: add r2.xyz, -r1.xyzx, r2.xyzx  
  48: mad o2.xyz, v0.zzzz, r2.xyzx, r1.xyzx  
  49: lt r0.w, r1.w, l(0.330000)  
  50: mul r1.x, r1.w, l(0.950000)  
  51: movc r0.w, r0.w, r1.x, l(0.330000)  
  52: add r0.w, -r1.w, r0.w  
  53: mad o1.w, v0.z, r0.w, r1.w  
  54: lt r0.w, l(0), cb4[7].x  
  55: and o2.w, r0.w, l(0.064706)  
  56: dp3 r0.w, r0.xyzx, r0.xyzx  
  57: rsq r0.w, r0.w  
  58: mul r0.xyz, r0.wwww, r0.xyzx  
  59: max r0.w, abs(r0.y), abs(r0.x)  
  60: max r0.w, r0.w, abs(r0.z)  
  61: lt r1.xy, abs(r0.zyzz), r0.wwww  
  62: movc r1.yz, r1.yyyy, abs(r0.zzyz), abs(r0.zzxz)  
  63: movc r1.xy, r1.xxxx, r1.yzyy, abs(r0.yxyy)  
  64: lt r1.z, r1.y, r1.x  
  65: movc r1.xy, r1.zzzz, r1.xyxx, r1.yxyy  
  66: div r1.z, r1.y, r1.x  
  67: div r0.xyz, r0.xyzx, r0.wwww  
  68: sample_l(texture2d)(float,float,float,float) r0.w, r1.xzxx, t13.yzwx, s13, l(0)  
  69: mul r0.xyz, r0.wwww, r0.xyzx  
  70: mad o1.xyz, r0.xyzx, l(0.500000, 0.500000, 0.500000, 0.000000), l(0.500000, 0.500000, 0.500000, 0.000000)  
  71: mov o0.w, cb4[6].x  
  72: ret

The differences between this variant and previous one are:

a) lines 0, 19: the interpolation parameter v0.z is multiplied by cb4[0].x from the constant buffer (line 0), but this product is used only to interpolate albedo at line 19. For other output data, the 'usual' v0.z is used.


b) lines 54-55: o2.w is now set under condition that ( cb4[7].x > 0.0 )

We already know this "someComparison - and" pattern from calculating the luminance histogram in TW3. We can write it as:
 pout.RT2.w = (cb4_v7.x > 0.0) ? (16.5/255.0) : 0.0;  
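
Why does a bitwise `and` act like a ternary operator? The comparison writes an integer mask (all bits set or all bits clear), and AND-ing that mask with the raw bits of a float yields either the float itself or 0.0. The Python sketch below emulates this with `struct`; the helper names are mine.

```python
import struct


def float_bits(x):
    """Raw 32-bit pattern of a float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_float(b):
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def lt_mask(a, b):
    """Like the 'lt' instruction: all bits set if a < b, else zero."""
    return 0xFFFFFFFF if a < b else 0


def select_or_zero(mask, value):
    """Like 'and dst, mask, l(value)': value if the mask is set, else 0.0."""
    return bits_float(mask & float_bits(value))


print(select_or_zero(lt_mask(0.0, 1.0), 0.064706))  # condition true
print(select_or_zero(lt_mask(0.0, 0.0), 0.064706))  # condition false
```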


c) lines 34-42: completely different calculation of specular.

There is no specular texture. Let's see assembly responsible for that:
  34: add r0.w, -r1.w, l(1.000000)   
  35: log r1.xyz, cb4[3].xyzx   
  36: mul r1.xyz, r1.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000)   
  37: exp r1.xyz, r1.xyzx   
  38: mad r0.w, r0.w, cb4[4].x, cb4[5].x   
  39: mul_sat r1.xyz, r0.wwww, r1.xyzx   
  40: log r1.xyz, r1.xyzx   
  41: mul r1.xyz, r1.xyzx, l(0.454545, 0.454545, 0.454545, 0.000000)   
  42: exp r1.xyz, r1.xyzx   

Note that (1 - reflectivity) is used here. Luckily, this is quite simple in HLSL:
   float oneMinusReflectivity = 1.0 - normalTex.a;  
   float3 specularTex = pow(cb4_v3.rgb, 2.2);  
   oneMinusReflectivity = oneMinusReflectivity * cb4_v4.x + cb4_v5.x;  
   specularTex = saturate(specularTex * oneMinusReflectivity);  
   specularTex = pow(specularTex, 1.0/2.2);  
   
   // proceed as in the first variant...  
   float specularMaxComponent = getMaxComponent( specularTex ); 
   ... 
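
The `log` / `mul` / `exp` triplets in the assembly are how `pow` gets compiled: both instructions work in base 2, so `pow(x, p)` becomes `exp2(p * log2(x))`. Here is a Python sketch of the whole calculation; the function names and input values are made up for illustration, and `color`, `scale` and `bias` merely play the roles of cb4[3].rgb, cb4[4].x and cb4[5].x.

```python
import math


def pow_via_log_exp(x, p):
    """pow(x, p) written as exp2(p * log2(x)); requires x > 0 here."""
    return 2.0 ** (p * math.log2(x))


def emulate_specular(color, reflectivity, scale, bias):
    """Scale a constant color in linear space, then go back to gamma space."""
    factor = (1.0 - reflectivity) * scale + bias
    out = []
    for c in color:
        linear = pow_via_log_exp(c, 2.2)
        linear = min(max(linear * factor, 0.0), 1.0)  # saturate
        out.append(linear ** (1.0 / 2.2))
    return tuple(out)


print(emulate_specular((0.5, 0.4, 0.3), reflectivity=0.25, scale=1.0, bias=0.0))
```

Unlike the GPU, Python's `math.log2` raises an error for zero, so this sketch only accepts strictly positive color components.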

On a side note, in this variant we have slightly larger constant buffer with material data. These extra values are used to emulate specular color here.

The rest of the shader is the same as in the previous variant.

72 lines of assembly is a little too much for WinMerge to display at once, so just believe me that it's almost the same assembly as in the original. Or you can grab my HLSLexplorer and see it for yourself! ;)


Summary

...and if you've come this far, maybe you're willing to come a little further.

Nothing that seems simple really is in real life, and feeding the gbuffer in The Witcher 3 is no exception. I've just shown you the simplest variants of the pixel shaders responsible for it and some observations which apply to deferred shading in general.

For the most patient ones (or vice versa) the two variants of pixel shaders @ pastebin:





Feel free to comment.

I hope you enjoyed it.
Thanks for reading!