Monday, July 27, 2020

Reverse engineering the rendering of The Witcher 3, part 20 - light shafts

This post is a part of the series "Reverse engineering the rendering of The Witcher 3".

In this 20th (wow) post I am going to cover how the "Light Shafts" effect (aka crepuscular rays, God rays, sunbeams...) from "The Witcher 3: Wild Hunt" works. I focus not only on presenting the idea of the algorithm, but also on providing game-specific implementation details, code snippets and a Shadertoy the reader can experiment with.

First, a brief showcase of the effect:

Before applying the effect
After applying the effect
In the fourth edition of "Real-Time Rendering" there is a high-level overview of the effect:
"First, a high-contrast correction curve is applied to the input image to isolate the unoccluded parts of the sun. Next, radial blurs, centered on the sun, are applied to the image (...) in a series, each operating on the output of the previous one (...). The final image of the flare is combined additively with the original scene rendering."

As it states, there are 4 steps involved:
- Isolating the unoccluded parts of the Sun
- Applying a high-contrast correction curve to extract the brightest pixels from the first step
- Performing a series of radial blurs
- Additive blending with the main scene color

I am going to cover all of them, providing extra details and example code where needed.

The description from the book gives a brief idea of how the effect works. In practice, it's a bit more complex. For example, in The Witcher 3's implementation the light shafts are tightly intertwined with bloom - you cannot have the light shafts working if bloom is turned off. My guess is that this is because, at some later point, the light shafts are combined with the (not yet fully blurred) bloom, so both effects can share the blur passes.

1. Isolating the unoccluded parts of the Sun

The first step is simple - extracting sky pixels. It starts with a fullscreen floating-point LDR buffer which has already been processed with tonemapping (and later possibly with: depth of field, drunk effect and motion blur). The scene depth buffer is used as well.

Input 1 - fullscreen LDR color buffer
Input 2 - fullscreen depth buffer
Let's take a look at the pixel shader assembly:
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb12[23], immediateIndexed  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_resource_texture2d (float,float,float,float) t1  
    dcl_input_ps_siv v0.xy, position  
    dcl_output o0.xyzw  
    dcl_temps 2  
   0: ftoi r0.xy, v0.xyxx  
   1: mov r0.zw, l(0, 0, 0, 0)  
   2: ld_indexable(texture2d)(float,float,float,float) r1.x, r0.xyww, t1.xyzw  
   3: ld_indexable(texture2d)(float,float,float,float) r0.xyz, r0.xyzw, t0.xyzw  
   4: mad r0.w, r1.x, cb12[22].x, cb12[22].y  
   5: ge r0.w, r0.w, l(1.0000)  
   6: and r0.w, r0.w, l(1.0000)  
   7: mul o0.xyz, r0.wwww, r0.xyzx  
   8: mov o0.w, l(1.0000)  
   9: ret  

This is a typical fullscreen pass. In this simple shader the input textures are the R11G11B10_FLOAT color texture (t0) and the depth buffer (t1). The shader outputs sky pixels only - the ones which pass the (depth == 1.0) test.

The Witcher 3 uses reversed depth, so the far plane - where the sky lies - is stored as 0.0, which shows up as black in the depth texture. The flip (1.0 - depth) performed at line 4 brings sky pixels back to 1.0.
The output is a fullscreen R11G11B10_FLOAT texture.

Output. Note that sky pixels are the only ones left.

This shader can be written like so in HLSL:
 Texture2D TexColor : register (t0);  
 Texture2D TexDepth : register (t1);  
 float4 LightShafts_SkyOnlyPS( in float4 Position : SV_Position ) : SV_Target  
 {  
   int3 pos = int3(Position.xy, 0);  

   float depth = TexDepth.Load(pos).r;  
   float3 color = TexColor.Load(pos).rgb;  

   // Perform the '1.0 - depth' flip.  
   // g_depthScale = -1.0  
   // g_depthBias  =  1.0  
   depth = depth * g_depthScale + g_depthBias;  

   // Keep sky pixels only  
   bool isSky = (depth >= 1.0);  
   color *= isSky;  

   return float4(color, 1.0);  
 }  

2. Applying high-contrast correction curve

Having the sky pixels only, it's time to extract the brightest of them.
Input - fullscreen texture with sky pixels only
Output - the brightest pixels have been exposed
Note that while the output texture is in fact fullscreen (1920x1080), the result is rendered at half-res (960x540) into the upper left region of the texture.

The shader used for extracting the brightest pixels for light shafts is the same as for the bloom - usually called a threshold pass. Here is its assembly:
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb3[10], immediateIndexed  
    dcl_sampler s0, mode_default  
    dcl_resource_texture2d (float,float,float,float) t0  
    dcl_input_ps_siv v0.xy, position  
    dcl_output o0.xyzw  
    dcl_temps 4  
   0: ftoi r0.xy, v0.xyxx  
   1: ftoi r0.zw, cb3[4].zzzw  
   2: iadd r0.xy, -r0.zwzz, r0.xyxx  
   3: itof r0.xy, r0.xyxx  
   4: add r0.xy, r0.xyxx, l(0.5000, 0.5000, 0.0000, 0.0000)  
   5: div r0.xy, r0.xyxx, cb3[3].xyxx  
   6: round_z r0.zw, cb3[4].xxxy  
   7: mad r0.xy, r0.xyxx, cb3[1].xyxx, r0.zwzz  
   8: div r0.xy, r0.xyxx, cb3[0].xyxx  
   9: div r1.xyzw, l(-1.0000, -1.0000, 1.0000, -1.0000), cb3[0].xyxy  
  10: add r1.xyzw, r0.xyxy, r1.xyzw  
  11: add r2.xyzw, cb3[5].xyzw, l(0.5000, 0.5000, 0.5000, 0.5000)  
  12: div r2.xyzw, r2.xyzw, cb3[0].xyxy  
  13: max r1.xyzw, r1.xyzw, r2.xyxy  
  14: min r1.xyzw, r2.zwzw, r1.xyzw  
  15: sample_l(texture2d)(float,float,float,float) r3.xyz, r1.xyxx, t0.xyzw, s0, l(0)  
  16: sample_l(texture2d)(float,float,float,float) r1.xyz, r1.zwzz, t0.xyzw, s0, l(0)  
  17: add r1.xyz, r1.xyzx, r3.xyzx  
  18: div r0.zw, l(0.0000, 0.0000, -1.0000, 1.0000), cb3[0].xxxy  
  19: add r0.zw, r0.zzzw, r0.xxxy  
  20: max r0.zw, r2.xxxy, r0.zzzw  
  21: min r0.zw, r2.zzzw, r0.zzzw  
  22: sample_l(texture2d)(float,float,float,float) r3.xyz, r0.zwzz, t0.xyzw, s0, l(0)  
  23: add r1.xyz, r1.xyzx, r3.xyzx  
  24: div r0.zw, l(1.0000, 1.0000, 1.0000, 1.0000), cb3[0].xxxy  
  25: add r0.xy, r0.zwzz, r0.xyxx  
  26: max r0.xy, r2.xyxx, r0.xyxx  
  27: min r0.xy, r2.zwzz, r0.xyxx  
  28: sample_l(texture2d)(float,float,float,float) r0.xyz, r0.xyxx, t0.xyzw, s0, l(0)  
  29: add r0.xyz, r0.xyzx, r1.xyzx  
  30: mul r0.xyz, r0.xyzx, l(0.2500, 0.2500, 0.2500, 0.0000)  
  31: dp3 r0.w, cb3[9].xyzx, r0.xyzx  
  32: add r1.x, r0.w, -cb3[8].x  
  33: max r0.w, r0.w, l(0.0001)  
  34: max r1.x, r1.x, l(0)  
  35: mul_sat r1.y, r1.x, cb3[8].y  
  36: mul r1.x, r1.y, r1.x  
  37: min r1.x, r1.x, cb3[7].x  
  38: div r0.w, r1.x, r0.w  
  39: mul r0.xyz, r0.wwww, r0.xyzx  
  40: mul o0.xyz, r0.xyzx, cb3[6].xxxx  
  41: mov o0.w, l(0)  
  42: ret  

Now I am going to explain how this shader works. I focus on the general idea here - the shader from the game allows using an arbitrary region of the input texture; I will omit these nuances for simplicity's sake.

First, let's find the pixel's location in [0-1] texture space.
 float2 textureUV = Input.SV_Position.xy;  
 textureUV += float2( 0.5, 0.5 );  
 textureUV /= outputTextureSize.xy;   // outputTextureSize = (960, 540)  

Then, the shader determines the minimum and maximum possible sampling coordinates (in this case the input texture has a size of 1920x1080):
 // Minimum (0,0) and Maximum (1919,1079) input pixel  
 float2 outputTextureMinPixel = cb3_v5.xy;  
 float2 outputTextureMaxPixel = cb3_v5.zw;  
 // Calculate min/max sampling coordinates  
 float2 minSamplingUV = outputTextureMinPixel + float2(0.5, 0.5);  
 float2 maxSamplingUV = outputTextureMaxPixel + float2(0.5, 0.5);  
 minSamplingUV /= inputTextureSize.xy; // inputTextureSize = (1920, 1080)  
 maxSamplingUV /= inputTextureSize.xy;  

Next, for each processed pixel, the shader samples 4 pixels in its neighborhood with respect to the just calculated min/max texture coordinates. After that an average color is calculated:
 static const float2 g_offsets[4] =  
 {  
  float2( -1, -1 ),  
  float2(  1, -1 ),  
  float2( -1,  1 ),  
  float2(  1,  1 )  
 };  

 float3 color = 0;  
 [unroll]  
 for (int i = 0; i < 4; i++)  
 {  
   float2 offset = g_offsets[i] / inputTextureSize.xy; // (1920, 1080)  

   float2 samplingUv = textureUV + offset;  
   samplingUv = clamp( samplingUv, minSamplingUV, maxSamplingUV );  

   color += TexColor.SampleLevel( samplerLinearClamp, samplingUv, 0 ).rgb;  
 }  

 // calculate average color  
 color /= 4.0;  

The last (and the most important) bit is to determine the bloom amount.

This part starts with calculating the brightness of the just-calculated average color (note that up to this point all calculations in the shader were in LDR). It's possible to use various operators here, such as taking the maximum component of the RGB color. The Witcher 3 uses a set of RGB weights to determine brightness (assembly line 31):

brightness = dot(color.rgb, weights.rgb);

Using such an operator allows for some flexibility; for instance, for underwater scenes artists might want bloom driven only by the green and blue channels. Also, operating in LDR keeps the whole RGB range within [0-1], which is more intuitive when tweaking the weights.

Once brightness is obtained, the next parameter in the bloom calculations is the threshold. It is used to cut off less bright pixels from further consideration:

contribution = max( 0.0, brightness - threshold );

The following assembly snippet covers calculation of bloom amount for a pixel:
  31: dp3 r0.w, cb3[9].xyzx, r0.xyzx  
  32: add r1.x, r0.w, -cb3[8].x  
  33: max r0.w, r0.w, l(0.0001)  
  34: max r1.x, r1.x, l(0)  
  35: mul_sat r1.y, r1.x, cb3[8].y  
  36: mul r1.x, r1.y, r1.x  
  37: min r1.x, r1.x, cb3[7].x  
  38: div r0.w, r1.x, r0.w  
  39: mul r0.xyz, r0.wwww, r0.xyzx  
  40: mul o0.xyz, r0.xyzx, cb3[6].xxxx  

In general, calculating the bloom amount in The Witcher 3 can be thought of as a function of four variables (reconstructed from the assembly above; the saturate clamp on the contribution term is omitted for clarity):

fbloom(b, t, nmax, nscale) = min( nscale * max(b - t, 0)^2, nmax ) / b

where:

fbloom - amount of bloom for a pixel,
b - brightness,
t - threshold,
nmax - maximum allowed value for the numerator,
nscale - scale factor for the contribution parabola.

So far, I haven't mentioned the last two parameters, but they are quite simple. nscale is a scale factor for the previously introduced contribution² term, whereas nmax is the maximum allowed numerator value.

To illustrate the function, let's assume that we don't care about nmax and pick 0.15 for t and 0.8 for nscale. The equation then simplifies to the following function of one variable:

f(b) = 0.8 * max(b - 0.15, 0)^2 / b

The graph of this function looks like this:

To understand why "max(0)" is needed, here is a graph of the same function with max(0, b - 0.15) replaced by just b - 0.15:

The function has its zero at 0.15, which is the threshold value. With max, the threshold serves as a cutoff, so everything below it is zeroed. Without max, the function's value shoots up as b approaches zero, due to the 1/b factor.

Once bloomAmount is obtained, it's multiplied with the average color calculated earlier. However, there is one more multiplication at the very end - by a "final boost" value (assembly line 40):
 float bloomAmount;  
 // ... calculate bloom amount ... //     
 color.rgb *= bloomAmount;  

 // Perform final boost
 color.rgb *= finalBoost;

 return float4(color.rgb, 0.0);

For reference, here are the values from the constant buffer for this particular scene. The ones of interest are:

cb3_v9.rgb - bloom weights,
cb3_v8.x - brightness threshold,
cb3_v8.y - numerator scale,
cb3_v7.x - maximum numerator value,
cb3_v6.x - final boost (100 in every frame I tested).

With that in mind, everything related to the bloom threshold could be written in HLSL like so:
 // Inputs:  
 const float3 bloomWeights = params9.rgb;  
 const float bloomThreshold = params8.x;  
 const float bloomScale = params8.y;  
 const float bloomMaxNumerator = params7.x;  
 const float finalBoost = params6.x;  

 float brightness = dot( bloomWeights, color );  
 float contribution = max( brightness - bloomThreshold, 0.0 );  

 // Calculate the amount of bloom.  
 // The Witcher 3 uses a hyperbola:  
 // f(x) = ( s*(x-t)^2 ) / x  
 // where:  
 // x - input brightness  
 // f(x) - amount of bloom for given brightness  
 // s - bloom scale  
 // t - bloom threshold  
 contribution *= saturate( contribution * bloomScale );  
 contribution = min( contribution, bloomMaxNumerator ); // the numerator can't be too high  

 const float bloomAmount = contribution / max( brightness, 0.0001 ); // avoid too small a denominator  

 // Apply the bloom amount to color  
 color *= bloomAmount;  

 // Perform final boost (usually multiplying by 100)  
 color *= finalBoost;  

 return float4(color, 0);  

Here I present a few debug screenshots I obtained by changing selected bloom parameters:

Weights = float3(0.0, 0.0, 1.0)

Weights = float3(0.0, 0.0, 0.5)

Threshold = 1.0

Threshold = 1.5

3. Radial blurs

The crucial part of the light shafts in The Witcher 3 is implemented with two radial blur passes. Before I get to them, here is the input texture to the first one, obtained by further downsampling the result of the just-covered second step:

The first radial blur pass renders at 1/16th of the scene resolution (480x270). For the considered frame it gives the following result, with a characteristic ring pattern:

The output from the first pass is an input for the second pass which produces:

To gently introduce the reader to radial blur and where the rings come from, here is a Shadertoy with a basic implementation of it: here.

The reader is encouraged to take a look and experiment with it. There are a few defines which control the blur, and each one is described.

Why is the multi-pass approach used? The answer is simple - performance (obviously). Using more than one pass reduces the complexity of the algorithm significantly, and the provided Shadertoy demonstrates it. In the multi-pass scenario three passes are used, with at most 8 texture fetches each - 8*3 = 24 samples in total. To achieve the same effect in just one pass, the shader would need to sample the texture 8^3 = 512 times per pixel.
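The arithmetic behind this trade-off can be sketched as follows (my own illustration, not game code):

```cpp
// For N taps per pass and P chained passes, a multi-pass radial blur costs
// N*P samples per pixel, while a single pass with the same effective reach
// would need N^P samples per pixel.
int multiPassSamples(int taps, int passes)
{
    return taps * passes;
}

int singlePassSamples(int taps, int passes)
{
    int result = 1;
    for (int i = 0; i < passes; ++i)
        result *= taps;
    return result;
}
```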

As for the implementation in The Witcher 3, the first radial filter takes 16 sparsely placed samples that are subsequently refined in the second pass (which uses the same number of taps, but spaced more closely, to fill the gaps). This gives a fairly long blur distance while keeping the cost (number of samples) reasonable. Usually the spacing in these sorts of filters is chosen so that the last pass fills all the gaps and avoids Moiré patterns.

Both passes use the same shader - which is almost 300 lines long, so I put it on pastebin for convenience. The shader assembly may seem very intimidating at first, but it's actually not that bad. After a brief look it's quite easy to distinguish two separate unrolled loops (16 steps each) which account for ~75% of the whole assembly.

The idea of using radial blur for God rays is not new - it was used in the demoscene in the past millennium [4]. A similar approach was apparently also used in The Witcher 2 [2]. Crysis used this technique as well, but the blur was performed on the depth buffer [1].

Let's start the analysis of the shader.

3.1. Finding required positions

The first step is to determine current pixel position and both min and max possible sampling texture coordinates - it's quite similar to the bloom threshold shader.

Here is a part of assembly responsible for it:
   0: round_z r0.xyzw, cb3[3].zwxy  
   1: add r0.xy, -r0.xyxx, v0.xyxx  
   2: add r1.xy, r0.zwzz, l(0.500000, 0.500000, 0.000000, 0.000000)  
   3: add r1.zw, r0.zzzw, cb3[2].xxxy  
   4: add r1.zw, r1.zzzw, l(0.000000, 0.000000, -0.500000, -0.500000)  
   5: div r1.xyzw, r1.xyzw, cb3[0].xyxy  
   6: round_z r0.xy, r0.xyxx  
   7: add r0.xy, r0.zwzz, r0.xyxx  
   8: add r0.xy, r0.xyxx, l(0.500000, 0.500000, 0.000000, 0.000000)  
   9: div r0.xy, r0.xyxx, cb3[0].xyxx  

In HLSL this can be written like so:
 const float2 inputRegionSize =  cb3_v2.xy; // (480, 270)  
 const float2 inputTextureSize = cb3_v0.xy; // (960, 540)  
 // Determine min and max possible position to sample from input texture  
 float2 minPosition = float2(0.5, 0.5);   
 float2 maxPosition = inputRegionSize - float2(0.5, 0.5);  

 minPosition /= inputTextureSize;  
 maxPosition /= inputTextureSize;  
 // Determine position of currently processed pixel
 float2 pixelPosition = Input.SV_Position.xy;
 pixelPosition += float2(0.5, 0.5);  
 pixelPosition /= inputTextureSize;  

Next, let's find the center of radial blur (in case of The Witcher 3, the Sun) in texture space:
 const float2 sunScreenPos = cb3_v4.xy; // usually [0-1] range  
 float2 radialBlurCenter = lerp(minPosition, maxPosition, sunScreenPos); // (0.0-0.5) range!  

It's just a linear interpolation between the just-obtained min and max sampling positions. The question remains: how to find sunScreenPos on the CPU side? To answer that, let's consider a surface and the usual vectors used for lighting it:
Image credit: "Real-Time Rendering, 4th Edition" - provided under Fair Use

l is a normalized light vector pointing towards the Sun. It can be 'transformed' into a point by multiplying it by a large value, like the far plane distance, which yields a 'position' of the Sun in world space. That position can then be projected to screen space.
Here is a pseudocode for it:
 // Obtain world position of the Sun from light vector towards the Sun  
 Vec3 sunPos_W = lightVectorTowardsSun.Normalize() * camera->GetFarPlane();  
 // Position of the Sun in clip space  
 // (TransformPoint is assumed to also perform the perspective divide by w)  
 Vec3 sunPos_C = camera->GetMatrixViewProj().TransformPoint(sunPos_W);  
 // viewport transform  
 // https://docs.microsoft.com/en-us/windows/win32/direct3d11/d3d10-graphics-programming-guide-rasterizer-stage-getting-started  
 Vec2 sunPos2D;  
 sunPos2D.x = (1.f + sunPos_C.x) * (float) GetBackbufferWidth() / 2.f;  
 sunPos2D.y = (1.f - sunPos_C.y) * (float) GetBackbufferHeight() / 2.f;  
 sunPos2D.x /= (float) GetBackbufferWidth();  
 sunPos2D.y /= (float) GetBackbufferHeight();  

 // submit to shader

Since this effect comes only from the Sun / the Moon, there is room for optimization: by checking the dot product of the light vector and the camera's 'look' vector, processing of the light shafts can be skipped when the camera is facing away from the Sun.
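A possible CPU-side check could look like this - a sketch with hypothetical names, not the game's actual code:

```cpp
struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical helper: render the light shafts passes only when the camera
// faces towards the Sun (both vectors are assumed to be normalized).
bool ShouldRenderLightShafts(const Vec3& cameraLook, const Vec3& lightVectorTowardsSun)
{
    return Dot(cameraLook, lightVectorTowardsSun) > 0.0f;
}
```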

3.2. Checking if pixel qualifies for radial filter

The next chunk of assembly determines whether the radial filter should be applied at all (if not, black is returned immediately):
  12: div r3.x, cb3[0].x, cb3[0].y  
  13: add r2.zw, r0.xxxy, -r2.xxxy  
  14: mov r3.y, l(1.0000)  
  15: mul r2.zw, r2.zzzw, r3.xxxy  
  16: dp2 r0.z, r2.zwzz, r2.zwzz  
  17: sqrt r0.z, r0.z  
  18: div r0.z, r0.z, r0.w  
  19: div_sat r0.z, r0.z, cb3[5].x  
  20: lt r2.z, r0.z, l(1.0000)  
  21: if_nz r2.z  
  284: else  
  285:  mov r0.xyz, l(0, 0, 0, 0)  
  286: endif  
  287: mov o0.xyz, r0.xyzx  
  288: mov o0.w, l(1.0000)  
  289: ret  

This step starts with calculating the length from the radial blur center to the currently processed pixel (lines 16-17).

Here is a visualization of such calculated length:

There is a problem with length estimated this way, though - the blur should use a circular pattern, but at this point the 'circle' is stretched into an ellipse. To fix it, the vector is 'squished' by the aspect ratio of the texture (line 15). Doing so fixes the problem:
Regardless of the aspect ratio, taking it into account gives a proper circle, so it's easy to find pixels which are outside the desired radius. In this shader, the length < 1.0 condition is the main factor which decides whether a pixel qualifies for the radial filter or not.

Before the condition check there are two more things to look at, though.

First, the estimation of length happens in the area from minPosition to maxPosition, roughly from (0, 0) to (0.5, 0.5). The length is then remapped as if it had been calculated in the (0, 0) to (1, 1) range - see line 18. This is done by dividing by (maxPosition.y - minPosition.y), which is pretty much a multiplication by two.

Second, let's introduce a parameter which scales the step length - stepScaleFactor - by which the length is divided (line 19).

See the following screenshots, which show how stepScaleFactor affects the output of the first radial filter. The red area indicates pixels which fail the length < 1.0 condition.

stepScaleFactor = 1.0

stepScaleFactor = 1.5

stepScaleFactor = 1.75

stepScaleFactor = 2.0

Example HLSL snippet:
 const float stepScaleFactor = cb3_v5.x;  

 float2 squishParams = float2(inputTextureSize.x / inputTextureSize.y, 1.0);  
 float2 sunToPixel = pixelPosition - radialBlurCenter;  
 sunToPixel *= squishParams;  
 float dist = length( sunToPixel ); 
 float y_diff = maxPosition.y - minPosition.y; // about 0.5  
 dist /= y_diff;   // Rescale as if length was calculated in [0-1] range (basically, mult by 2)  
 dist = saturate( dist / stepScaleFactor );  
 float3 finalColor = 0.0.xxx;  
 if (dist < 1.0)

3.3. Finding step vector and number of steps

In general, the third step can be described like this: calculate step vectors and the number of steps for both passes, then select one set based on the index of the current pass.

It starts just as in the Shadertoy:
 if (dist < 1.0)  
   float2 pixelToSun = normalize( radialBlurCenter - pixelPosition );  
   float maxDist = length( radialBlurCenter - pixelPosition );  
   // Step vectors for the first stage (step_0) and the second one (step_1)  
   float2 step_0 = pixelToSun;  
   float2 step_1 = pixelToSun;  

Having a normalized pixel -> center vector, a single step is calculated next. As mentioned earlier, The Witcher 3 uses 16 samples. For the second pass, the taps are placed more closely to fill the gaps left by the first one:
 // Length of a single step 
 const float stepLength_0 = 1.0 / 16.0;
 const float stepLength_1 = (1.0 / 16.0) * (1.0 / 16.0);

 step_0 *= stepLength_0;  
 step_1 *= stepLength_1; 

Now the step vectors are adjusted with currently used parameters:

First, stepScaleFactor is taken into account - the larger the factor, the longer a single step is.

Second, because pixelToSun was normalized earlier, the step has to be scaled back to the (0.0 - 0.5) range, since only the upper left quadrant of the input texture will be sampled.

 // Scale step vectors  
 step_0 *= stepScaleFactor;  
 step_1 *= stepScaleFactor;  
 // The step vectors have been normalized earlier (pixelToSun),   
 // so scale them down so they match resolution   
 // of input texture region  
 step_0 *= y_diff;  
 step_1 *= y_diff;  

Another important thing to be aware of is the aspect ratio.

First, the earlier-normalized pixelToSun vector is stretched horizontally, which yields a vector with length >= 1. Then the step vectors are divided by this new length:
 // Take aspect ratio into account  
 // Note - pixelToSun is at this point normalized, so its length = 1  
 pixelToSun *= squishParams;  // squishParams = (~1.77777777, 1)
 // Adjust step vectors  
 step_0 /= length(pixelToSun);  
 step_1 /= length(pixelToSun);  

Here is a debug screenshot which shows the result of the first radial filter without taking aspect ratio into account while determining step vectors:

And here is with taking aspect ratio into account:

Once the step vectors are obtained, finding the number of steps for the blur is easy:
 int CalculateRadialBlurSteps( float fMaxDist, float fStepLength )  
 {  
    float fNumSteps = fMaxDist / fStepLength;  
    fNumSteps += 1.0;  // add 1 for tap #0 (just sampling the input uv)  

    return (int)fNumSteps;  
 }  

 // * Calculate number of steps ("rings")  
 int numSteps_0 = CalculateRadialBlurSteps( maxDist, length(step_0) );  
 int numSteps_1 = CalculateRadialBlurSteps( maxDist, length(step_1) );  
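As a quick sanity check, here is the same function in C++ (my mirror of the HLSL, not game code): with maxDist = 0.25 and a step length of 1/16, the blur takes 0.25 / 0.0625 + 1 = 5 taps.

```cpp
// C++ mirror of the HLSL CalculateRadialBlurSteps (sketch).
int CalculateRadialBlurSteps(float fMaxDist, float fStepLength)
{
    float fNumSteps = fMaxDist / fStepLength; // how many full steps fit in the distance
    fNumSteps += 1.0f;                        // add 1 for tap #0 (the input uv itself)
    return (int)fNumSteps;
}
```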

The index of the current radial blur pass is provided as a constant buffer value (0 for the first pass, 1 for the second one). To select the proper step vector and number of steps, a simple conditional is performed:
 // Select radial blur params for this pass  
 const int iRadialBlurPass = (int)cb3_v4.z;  
 float2 dir;  
 int numSteps;  
 if (iRadialBlurPass == 0)  
    dir = step_0;  
    numSteps = numSteps_0;  
 else if (iRadialBlurPass == 1)  
    dir = step_1;  
    numSteps = numSteps_1;  

So far, I have assumed that the radial center is within the texture, but this is not always the case. The Sun can be off-screen while some remnants of the shafts are still visible.

Consider a situation where the radial blur center is outside of the [0-1] range:

In the (rough) example above there are 8 steps in total. Note that the last two lie outside the input texture region, so there is no point in sampling them - there is no data there. To reduce the number of texture fetches, the calculations can be clamped with respect to minPosition and maxPosition, and this is exactly what the shader does.

The idea is to find proper distances to texture edges depending on direction of just selected step vector:

To obtain toMin:
toMin = pixelPosition - minPosition

To obtain toMax:
toMax = maxPosition - pixelPosition

To determine which distances will be used for clamping, the following test is performed: stepVector > (0, 0), which means 'towards maxPosition'. Note that this test is done separately for the X and Y axes, so the final result can be, as in the example above, (toMaxX, toMinY).

At the very end the result is divided by abs(stepVector), which yields the maximum number of steps for the X and Y axes separately. The lesser one is taken.
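To make the clamp concrete, here is a scalar C++ sketch (values and names are mine): a pixel at (0.25, 0.25) with step vector (0.0625, -0.125) moves towards maxPosition on X and towards minPosition on Y, so it is clamped against maxPosition.x and minPosition.y.

```cpp
#include <algorithm>
#include <cmath>

// Scalar sketch of the screen-space step clamp (names/values are mine).
int MaxStepsScreenSpace(float px, float py,      // pixel position
                        float sx, float sy,      // step vector
                        float minX, float minY,  // min sampling coordinate
                        float maxX, float maxY)  // max sampling coordinate
{
    // Pick the distance to the edge each axis is heading towards
    float dx = (sx > 0.0f) ? (maxX - px) : (px - minX);
    float dy = (sy > 0.0f) ? (maxY - py) : (py - minY);

    // Number of steps that fit before leaving the region (+1 for tap #0)
    int nx = (int)(dx / std::fabs(sx) + 1.0f);
    int ny = (int)(dy / std::fabs(sy) + 1.0f);
    return std::min(nx, ny);
}
```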

This clamping is performed by the following assembly snippet:
  48:  lt r3.xy, l(0, 0, 0, 0), r2.xyxx  
  49:  add r1.zw, -r0.xxxy, r1.zzzw  
  50:  add r1.xy, -r1.xyxx, r0.xyxx  
  51:  movc r1.xy, r3.xyxx, r1.zwzz, r1.xyxx  
  52:  div r1.xy, r1.xyxx, abs(r2.xyxx)  
  53:  add r1.xy, r1.xyxx, l(1.000000, 1.000000, 0.000000, 0.000000)  
  54:  ftoi r1.xy, r1.xyxx  
  55:  imin r1.x, r1.x, r2.z  
  56:  imin r1.x, r1.y, r1.x  

To provide some context, the values in the registers at line 48 are:
r2.xy - the just-selected dir,
r0.xy - pixelPosition,
r1.xy - minPosition,
r1.zw - maxPosition.

The equivalent HLSL can be written similarly to this:
 int CalculateMaxRadialSteps_ScreenSpace( float2 vStepVector, float2 pixelPosition, float2 minTexCoord, float2 maxTexCoord )  
 {  
   // Determine the minimum required number of steps to sample.  
   // This is used for cases when the radial blur center is outside of the screen.  
   // Think of it as clamping the number of steps using texture space.  

   // The step vector is "pixel to center" - find its direction in texture space  
   float2 direction = (vStepVector > float2(0, 0));  

   // Distances from the current pixel position to the min and max texture coordinates.  
   // Note all of them are always positive - think of them as distances, not as vectors.  
   float2 distancesToMax = maxTexCoord - pixelPosition;  
   float2 distancesToMin = pixelPosition - minTexCoord;  

   // Select the proper distances depending on the direction of the step vector.  
   // Note that we can select, for instance, X from 'distancesToMax' and Y from 'distancesToMin'.  
   float2 selectedDistances = direction ? distancesToMax : distancesToMin;  

   // Calculate the number of steps in texture space  
   float2 fResults = selectedDistances / abs(vStepVector);  

   // Add "1" for the first sample  
   fResults += float2(1.0, 1.0);  

   int2 iResults = (int2)fResults;  
   return min(iResults.x, iResults.y);  
 }  

To obtain the final number of taps another min is performed:
 // Determine the minimum required number of steps  
 // This is used for cases when radial blur center is outside of screen  
 int step_UV = CalculateMaxRadialSteps_ScreenSpace(dir, pixelPosition, minPosition, maxPosition);  
 // Final number of steps  
 numSteps = min( numSteps, step_UV );  

3.4. Sampling texture and final touches

This is (really) the last step of the radial blur. Again, I start by describing the big picture, then I explain the details.

At this point the shader has: a step vector, the final number of taps and the index of the radial blur pass.

Let's take a look at assembly which covers the general flow of this step:
 57:  if_nz r0.w  
     ... sampling loop for the second pass ...  
 108: else  
     ... sampling loop for the first pass  ...  
 273: endif  
 274: mul r1.xyz, r1.yzwy, l(0.062500, 0.062500, 0.062500, 0.000000)  
 275: add r0.x, -r0.z, l(1.000000)  
 276: log r0.x, r0.x  
 277: mul r0.x, r0.x, cb3[5].y  
 278: exp r0.x, r0.x  
 279: mul r0.y, r0.z, r0.z  
 280: mad r0.y, r0.y, cb3[5].z, l(1.000000)  
 281: div r0.x, r0.x, r0.y  
 282: mul r0.xyz, r0.xxxx, r1.xyzx  
 283: movc r0.xyz, r0.wwww, r0.xyzx, r1.xyzx  

Depending on the index of the pass, a different loop is executed which puts calculated color into r1.xyz registers.

Lines 274-283 are common to both passes, so I will start by describing them.

Regardless of the current pass, an average color is calculated at line 274 - it's just a division by 16. For the first radial blur pass, this color is returned immediately, since r0.w is zero (line 283). For the second pass, however, an extra fadeout/falloff factor is calculated and multiplied with the color.

The falloff factor, which lies in the [0-1] range, decreases the brightness of pixels the further they are from the radial center. It's calculated from dist - and this is probably why the dist < 1.0 test was performed earlier. There are two parameters which control how fast the falloff from the center is: exponent and attenuation (that's what I named them).

It's quite simple to write in HLSL. Again, please keep in mind that the color*falloffFactor multiplication occurs only at the end of the second pass:
 const float FinalFalloffExponent = cb3_v5.y;  
 const float FinalAttenuation = cb3_v5.z;  

 float numerator = pow(1.0 - dist, FinalFalloffExponent);  
 float denominator = FinalAttenuation*dist*dist + 1;  

 float falloffFactor = numerator / denominator;  
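The same curve in C++, for experimentation (my sketch; the HLSL above is the reference):

```cpp
#include <cmath>

// Sketch of the falloff curve: 1 at the radial center, 0 at dist = 1.
float FalloffFactor(float dist, float exponent, float attenuation)
{
    float numerator = std::pow(1.0f - dist, exponent);
    float denominator = attenuation * dist * dist + 1.0f;
    return numerator / denominator;
}
```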

This is how falloffFactor looks in most cases:

And this is how the final frame would look if the falloff wasn't used:

The first loop
All that's left to cover are the sampling loops. I will start with the first pass. Here's a snippet from it:
  150:   ilt r5.xyzw, l(4, 5, 6, 7), r1.xxxx  
  151:   and r5.xyzw, r5.xyzw, l(1.000000, 1.000000, 1.000000, 1.000000)  
  152:   sample_l(texture2d)(float,float,float,float) r4.xyz, r4.zwzz, t0.xyzw, s0, l(0)  
  153:   dp3 r2.z, l(0.212600, 0.715200, 0.072200, 0.000000), r4.xyzx  
  154:   add r2.w, r2.z, -cb3[7].x  
  155:   max r2.zw, r2.zzzw, l(0.000000, 0.000000, 0.000100, 0.000000)  
  156:   mul_sat r3.w, r2.w, cb3[7].y  
  157:   mul r2.w, r2.w, r3.w  
  158:   div r2.z, r2.w, r2.z  
  159:   mul r4.xyz, r2.zzzz, r4.xyzx  
  160:   mad r3.xyz, r5.xxxx, r4.xyzx, r3.xyzx  

It's important to remember that this is an unrolled loop, so it always executes 16 times. To find out whether the current step is used, the shader checks if the desired number of taps (r1.x) is greater than the current step number.

Obtaining sample position is the same as in the Shadertoy:
 if (iRadialBlurPass == 0) 
    for (int step = 0; step < 16; step++)  
        bool doStep = (numSteps > step);  
        float2 samplingUV = pixelPosition + dir*step;  
        float3 color = texture0.SampleLevel( samplerLinearClamp, samplingUV, 0 ).rgb;  

The interesting aspect of the first loop is that it uses a luminance threshold to darken less bright pixels. The idea is the same as for the bloom threshold described earlier; it just uses standard RGB luminance weights, and different cutoff and scale factors are provided:
   static const float3 LUMINANCE_RGB = float3(0.2126, 0.7152, 0.0722);

   const float threshold = cb3_v7.x;  
   const float scale = cb3_v7.y;  
   float pixelBrightness = dot(LUMINANCE_RGB, color);  
   float contribution = max(pixelBrightness - threshold, 0.0);  
   contribution *= saturate(contribution * scale);  
   float amount = contribution / max(pixelBrightness, 0.0001);  
   color *= amount;         
   // Add (or not) sample  
   finalColor_radialBlur += doStep*color;  
 } // if (iRadialBlurPass == 0)  

At the end of the loop a sample is added or discarded, depending on doStep boolean.

And here is a quick demonstration of how the luminance threshold affects the first blur.

From this frame (threshold = ~0.77):

Threshold = 0

Threshold = 1.25

Threshold = 2

The second loop
The loop for the second pass is much simpler - again, it's just like in the Shadertoy provided:
 if (iRadialBlurPass == 1)  
 {  
    for (int step = 0; step < 16; step++)  
    {  
       bool doStep = (numSteps > step);  
       float2 samplingUV = pixelPosition + dir*step;  
       float3 color = doStep*texture0.SampleLevel( samplerLinearClamp, samplingUV, 0 ).rgb;  
       finalColor_radialBlur += color;  
    }  
    const float3 secondPassColor = cb3_v6.rgb;  
    finalColor_radialBlur *= secondPassColor;  
 }  

After all 16 taps in the second pass are obtained, the accumulated result is multiplied by a special secondPassColor value from a constant buffer, which serves as a tint for the shafts - it's artist-driven.
For instance, during sunset secondPassColor will represent a more reddish or orange color. For this particular scene, it's:

float3(740, 452, 379)

Playing with this value allows one, for instance, to change the light shafts' color to a more blueish one:

As for the second pass, there are a few things worth noting. First, extra operations are performed - the accumulated result is multiplied by secondPassColor and then, after dividing by 16, the falloff parameter is taken into account.
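Those post-accumulation steps can be sketched like this (a standalone sketch; how exactly the falloff parameter enters isn't shown above, so here it's assumed to be a plain scalar multiplier):

```python
def finish_second_pass(accumulated, tint, falloff):
    """Post-accumulation steps of the second radial blur pass (sketch):
    tint by the artist-driven color, average the 16 taps, apply falloff."""
    return tuple(a * t * falloff / 16.0 for a, t in zip(accumulated, tint))

# Example with the tint from this frame; 'accumulated' is a made-up 16-tap sum.
shafts = finish_second_pass((16.0, 8.0, 4.0), tint=(740.0, 452.0, 379.0), falloff=1.0)
```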

Second, the pass uses a blend state with the Maximum operator for color. The actual render target bound contains the downscaled full scene color (sky + objects), produced using the bloom threshold shader I explained earlier.

Render target before the second radial blur pass:

Render target after the second radial blur pass (note: the output from the radial blur shader has been processed with a simple Reinhard tonemapping operator so brighter HDR scene pixels can actually be noticed):

4. Blending

The final blending just uses an additive blend state with the original scene rendering, so I think no extra explanation is necessary here except for a few notes:

- The fullscreen position needs to be remapped so it matches the upper-left quadrant of the half-res texture,
- The light shafts + bloom are modulated at this point with a lens dirt texture - which not everyone likes.
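For the first note, a minimal sketch of the UV remap (assuming the usual top-left UV origin, so halving the UV lands in the upper-left quadrant of the half-res texture):

```python
def fullscreen_to_quadrant_uv(uv):
    """Remap a fullscreen UV so it samples the upper-left quadrant
    of the half-res texture holding the shafts + bloom result."""
    return (uv[0] * 0.5, uv[1] * 0.5)

# The fullscreen bottom-right corner maps to the center of the half-res texture.
corner = fullscreen_to_quadrant_uv((1.0, 1.0))
```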


Four steps - some of them trivial, some of them more complex (looking at you, radial blur) - and this is the end of this journey! Hopefully the light shafts from The Witcher 3 are less mysterious right now.

Because the effect is completely screen-based, its obvious weakness is the lack of light shafts when not looking at the Sun. Nowadays it is probably more desirable to look into volumetric solutions, like, for instance, the one from Red Dead Redemption 2 [3].

Below, there is a short list of references which I think the reader may find useful.

I have covered a lot of ground here. Nevertheless, I do hope you had as much fun as I did.


I would like to thank Michał Iwanicki and Eric Haines for their valuable suggestions and feedback.


[1] Tiago Sousa, "Crysis Next Gen Effects". GDC 2008. slides

[2] Bartłomiej Wroński, "Volumetric fog: Unified compute shader-based solution to atmospheric scattering". Advances in Real-Time Rendering in Games course, SIGGRAPH 2014. slides course page

[3] Fabian Bauer, "Creating the Atmospheric World of Red Dead Redemption 2: A Complete and Integrated Solution". Advances in Real-Time Rendering in Games course, SIGGRAPH 2019. slides course page

[4] Kenny Mitchell, "Volumetric Light Scattering as a Post-Process". GPU Pro 3, chapter 13. read

[5] Masaki Kawase, "Frame Buffer Postprocessing Effects in DOUBLE-S.T.E.A.L (Wreckless)". GDC 2003 (note: this one was not referenced in the post but I still find it worth a check)

poniedziałek, 25 maja 2020

Reverse engineering the rendering of The Witcher 3, part 19 - portals

This post is a part of the series "Reverse engineering the rendering of The Witcher 3".

If you have played The Witcher 3 for long enough, you know that Geralt is not a huge fan of portals. Let's find out if they are really that scary.

There are two types of portals in the game:
Blue portal
Fire portal

I will explain how the fire one is built. It's mostly because its code is simpler compared to the blue one :)

Here is how the fire portal looks in the game:

The most important part is of course the fire rotating towards the centre, but there is more to it than meets the eye. More about that later.

The plan for today is pretty standard: I will describe geometry first, the vertex and the pixel shaders later. Quite a few screenshots and videos incoming.

In terms of general rendering details, the portals are drawn in the forward pass with blending enabled - a pretty widespread approach in the game; check the shooting stars article for more info.

Let's get going.

1. Geometry

Here's what the portal mesh looks like:
Local space - Front view

Local space - Side view

The mesh resembles Gabriel's Horn. The vertex shader squeezes it along one axis; here's the same mesh afterwards as seen from the side (in world space):
The portal mesh after vertex shader (side view)

Besides position, each vertex has extra data associated with it. The relevant channels are (at this point I'll just show visualizations from RenderDoc; they will be described in more detail later):

Texcoords (float2):

Tangent (float3):

Color (float3):

All of them will be used later, but already at this point there is too much data for an .obj file, so exporting this mesh can be problematic. What I did was export every channel to a separate .csv file; my C++ application then loads all the .csv files and assembles the mesh at runtime from the loaded data.
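A sketch of that merge step (shown in Python rather than C++ for brevity; the channel names and CSV layout here are made up for illustration):

```python
import csv
import io

def assemble_vertices(channel_csvs):
    """Zip per-channel CSV dumps (e.g. position, texcoord, tangent, color)
    back into per-vertex records; row i of every file describes vertex i."""
    columns = {}
    for name, text in channel_csvs.items():
        rows = list(csv.reader(io.StringIO(text)))
        columns[name] = [tuple(float(v) for v in row) for row in rows]
    counts = {len(col) for col in columns.values()}
    assert len(counts) == 1, "channel CSVs must have the same vertex count"
    n = counts.pop()
    return [{name: columns[name][i] for name in columns} for i in range(n)]

# Tiny two-vertex example with made-up data:
verts = assemble_vertices({
    "position": "0,0,0\n1,0,0",
    "texcoord": "0,0\n1,0",
})
```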

2. Vertex shader

The vertex shader is not particularly interesting, let's have a quick look at the relevant fragment anyway:
    dcl_globalFlags refactoringAllowed  
    dcl_constantbuffer cb1[7], immediateIndexed  
    dcl_constantbuffer cb2[6], immediateIndexed  
    dcl_input v0.xyz  
    dcl_input v1.xy  
    dcl_input v3.xyz  
    dcl_input v4.xyzw  
    dcl_input v6.xyzw  
    dcl_input v7.xyzw  
    dcl_input v8.xyzw  
    dcl_output o0.xyz  
    dcl_output o1.xyzw  
    dcl_output o2.xyz  
    dcl_output o3.xyz  
    dcl_output_siv o4.xyzw, position  
    dcl_temps 3  
   0: mov o0.xy, v1.xyxx  
   1: mul r0.xyzw, v7.xyzw, cb1[6].yyyy  
   2: mad r0.xyzw, v6.xyzw, cb1[6].xxxx, r0.xyzw  
   3: mad r0.xyzw, v8.xyzw, cb1[6].zzzz, r0.xyzw  
   4: mad r0.xyzw, cb1[6].wwww, l(0.000000, 0.000000, 0.000000, 1.000000), r0.xyzw  
   5: mad r1.xyz, v0.xyzx, cb2[4].xyzx, cb2[5].xyzx  
   6: mov r1.w, l(1.000000)  
   7: dp4 o0.z, r1.xyzw, r0.xyzw  
   8: mov o1.xyzw, v4.xyzw  
   9: dp4 o2.x, r1.xyzw, v6.xyzw  
  10: dp4 o2.y, r1.xyzw, v7.xyzw  
  11: dp4 o2.z, r1.xyzw, v8.xyzw  
  12: mad r0.xyz, v3.xyzx, l(2.000000, 2.000000, 2.000000, 0.000000), l(-1.000000, -1.000000, -1.000000, 0.000000)  
  13: dp3 r2.x, r0.xyzx, v6.xyzx  
  14: dp3 r2.y, r0.xyzx, v7.xyzx  
  15: dp3 r2.z, r0.xyzx, v8.xyzx  
  16: dp3 r0.x, r2.xyzx, r2.xyzx  
  17: rsq r0.x, r0.x  
  18: mul o3.xyz, r0.xxxx, r2.xyzx  

The vertex shader looks pretty similar to the other ones we've seen in this series.
After a quick analysis and comparing with input layout, the output struct can be written like so:
 struct VS_OUTPUT  
 {  
      float3 TexcoordAndViewSpaceDepth : TEXCOORD0;  
      float3 Color : TEXCOORD1;  
      float3 WorldSpacePosition : TEXCOORD2;  
      float3 Tangent : TEXCOORD3;  
      float4 PositionH : SV_Position;  
 };  

One thing I wanted to point out is how the shader retrieves the view-space depth (o0.z): it's just the .w component of SV_Position.

There is a thread on gamedev.net which explains it in a bit more detail.
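This is easy to verify numerically: for a standard perspective projection (a hand-built D3D-style left-handed matrix below, row-vector convention - an illustration, not the game's actual matrix), clip-space w is exactly the view-space depth:

```python
import math

def perspective(fov_y, aspect, zn, zf):
    """D3D-style left-handed perspective matrix (row-vector convention):
    the last column copies view-space z into clip-space w."""
    ys = 1.0 / math.tan(fov_y * 0.5)
    xs = ys / aspect
    return [
        [xs,  0.0, 0.0,                   0.0],
        [0.0, ys,  0.0,                   0.0],
        [0.0, 0.0, zf / (zf - zn),        1.0],
        [0.0, 0.0, -zn * zf / (zf - zn),  0.0],
    ]

def transform(v, m):
    """Row-vector times matrix: v * M."""
    return [sum(v[i] * m[i][j] for i in range(4)) for j in range(4)]

proj = perspective(math.radians(60.0), 16.0 / 9.0, 0.1, 1000.0)
clip = transform([3.0, -2.0, 42.0, 1.0], proj)  # view-space position with z = 42
# clip[3] (what ends up in SV_Position.w) equals the view-space depth
```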

3. Pixel shader

Here is an example scene just before drawing a portal...:

...and after:

Also, there is a useful "Clear Before Draw" overlay option in the RenderDoc texture viewer, so we can see the drawn portal precisely:

The first observation is that the actual fire layer is drawn only in the central area of the mesh.

The pixel shader is 186 lines long; I put it here for convenience and reference. As usual, I will be showing relevant assembly fragments while explaining things.

It's also worth noticing that 100 of the 186 lines are related to fog calculations.

To start, there are 4 textures attached as input: fire (t0), noise/smoke (t1), scene color (t6) and scene depth (t15):

Fire texture
Noise/smoke texture
Scene color
Scene depth
There is also a dedicated constant buffer with 14 params which control the effect:

While the inputs position, tangent and texcoords are quite simple concepts, let's take a closer look at the "Color" channel. After a few experiments, it seems this is not a color per se but rather three different masks which the shader uses to distinguish between individual layers and to decide where to apply certain effects:

Color.r - heat haze mask. As the name implies, it's used for heat haze effect (more about it later):

Color.g - inner mask. Used mostly for the fire effect

Color.b - back mask. Used to determine where the "back" of the portal is.

For such effects I think it's better to describe the particular layers individually instead of analyzing the assembly from the very start to the very end, like I used to do a long time ago.

So, here we go:

3.1. Fire layer

First, let's investigate the most important bit: a fire layer. Here is a video of it:

The basic idea behind such an effect is to take the static texcoords from the per-vertex data and animate them using the elapsed time variable from the constant buffer. Having such animated texcoords, we sample a texture (fire in this case) with a wrap/repeat sampler.

Interestingly, in this particular effect only the .r channel of the fire texture is actually sampled. To make the effect more convincing, two layers of fire are obtained this way and then they are modulated together.

Alright, alright... let's see some code finally!

We start with making the texcoords more dynamic as they reach the center of the mesh:
   const float2 texcoords = Input.TextureUV;  
   const float uvSquash = cb4_v4.x; // 2.50  
   const float y_cutoff = 0.2;  
   const float y_offset = pow(texcoords.y - y_cutoff, uvSquash);  

Here is the same, but in assembly:
  21: add r1.z, v0.y, l(-0.200000)  
  22: log r1.z, r1.z  
  23: mul r1.z, r1.z, cb4[4].x  
  24: exp r1.z, r1.z  

Then, the shader obtains texcoords for the first fire layer and samples the fire texture:
   const float elapsedTimeSeconds = cb0_v0.x;  
   const float uvScaleGlobal1 = cb4_v2.x; // 1.00  
   const float uvScale1 = cb4_v3.x;    // 0.15  

   // Sample fire1 - the first fire layer  
   float fire1; // r1.w  
     float2 fire1Uv;  
     fire1Uv.x = texcoords.x;  
     fire1Uv.y = uvScale1 * elapsedTimeSeconds + y_offset;  
     const float scaleGlobal = floor(uvScaleGlobal1); // 1.0
     fire1Uv *= scaleGlobal;  
     fire1 = texFire.Sample(samplerLinearWrap, fire1Uv).x;  

The corresponding assembly snippet is:
  25: round_ni r1.w, cb4[2].x  
  26: mad r2.y, cb4[3].x, cb0[0].x, r1.z  
  27: mov r2.x, v0.x  
  28: mul r2.xy, r1.wwww, r2.xyxx  
  29: sample_indexable(texture2d)(float,float,float,float) r1.w, r2.xyxx, t0.yzwx, s0  

Here's what the first layer looks like for elapsedTimeSeconds = 50.0:

And to show what y_cutoff actually does, here is the same scene but y_cutoff = 0.5:

This way we have obtained the first layer. Now, the shader obtains the second one:
   const float uvScale2 = cb4_v6.x;       // 0.06  
   const float uvScaleGlobal2 = cb4_v7.x; // 1.00  
   // Sample fire2 - the second fire layer  
   float fire2; // r1.z  
     float2 fire2Uv;  
     fire2Uv.x = texcoords.x - uvScale2 * elapsedTimeSeconds;  
     fire2Uv.y = uvScale2 * elapsedTimeSeconds + y_offset;  
     const float fire2_scale = floor(uvScaleGlobal2);  
     fire2Uv *= fire2_scale;  
     fire2 = texFire.Sample(samplerLinearWrap, fire2Uv).x;  

and the assembly snippet responsible for it:
  144: mad r2.x, -cb0[0].x, cb4[6].x, v0.x  
  145: mad r2.y, cb0[0].x, cb4[6].x, r1.z  
  146: round_ni r1.z, cb4[7].x  
  147: mul r2.xy, r1.zzzz, r2.xyxx  
  148: sample_indexable(texture2d)(float,float,float,float) r1.z, r2.xyxx, t0.yzxw, s0  

So, as you can see, the only difference is the UVs: now X is animated as well.

The second layer looks like this:

Once we have the two layers of inner fire, it's time to modulate them. This is a bit more complicated than a simple multiplication though, as the inner mask is involved:
   const float innerMask = Input.Color.y;  
   const float portalInnerColorSqueeze = cb4_v8.x; // 3.00  
   const float portalInnerColorBoost = cb4_v9.x; // 188.00  
   // Calculate inner fire influence  
   float inner_influence;  // r1.z
     // innerMask and "-1.0" are used here to control where the inner part of a portal is.  
     inner_influence = fire1 * fire2 + innerMask;  
     inner_influence = saturate(inner_influence - 1.0);  
     // Exponentation to hide less luminous elements of inner portal  
     inner_influence = pow(inner_influence, portalInnerColorSqueeze);  
     // Boost the intensity  
     inner_influence *= portalInnerColorBoost;  

And corresponding assembly:
  149: mad r1.z, r1.w, r1.z, v1.y  
  150: add_sat r1.z, r1.z, l(-1.000000)  
  151: log r1.z, r1.z  
  152: mul r1.z, r1.z, cb4[8].x  
  153: exp r1.z, r1.z  
  154: mul r1.z, r1.z, cb4[9].x  

Once we have inner_influence, which is nothing more than a mask for the inner fire, all we have to do is multiply the mask by the inner fire color:

   // Calculate portal color  
   const float3 colorPortalInner = cb4_v5.rgb; // (1.00, 0.60, 0.21961)  
   const float3 portal_inner_final = pow(colorPortalInner, 2.2) * inner_influence;  

the assembly:
  155: log r2.xyz, cb4[5].xyzx  
  156: mul r2.xyz, r2.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000)  
  157: exp r2.xyz, r2.xyzx  
  170: mad r2.xyz, r2.xyzx, r1.zzzz, r3.xyzx  

Here is a video which shows the particular layers of the inner fire in action. The order: the first layer, the second layer, the inner influence, and the final inner color:

3.2. Glow

Once we have the inner fire, let's take a look at the second layer: glow. Here is a video which shows the inner fire only, then the glow only, and then their sum - the final fire effect:

Here's how the shader calculates the glow. Similarly to the inner fire, first a mask is generated and then multiplied by the glow color from the constant buffer.
   const float portalOuterGlowAttenuation = cb4_v10.x; // 0.30  
   const float portalOuterColorBoost = cb4_v11.x; // 1.50
   const float3 colorPortalOuterGlow = cb4_v12.rgb; // (1.00, 0.61961, 0.30196)  
   // Calculate outer portal glow  
   float outer_glow_influence;  
     float outer_mask = (1.0 - backMask) * innerMask;  
     const float perturbParam = fire1*fire1;  
     float outer_mask_perturb = lerp( 1.0 - portalOuterGlowAttenuation, 1.0, perturbParam );  
     outer_mask *= outer_mask_perturb;  
     outer_glow_influence = outer_mask * portalOuterColorBoost;  
   // the final glow color  
   const float3 portal_outer_final = pow(colorPortalOuterGlow, 2.2) * outer_glow_influence; 
   // and the portal color, the sum of fire and glow
   float3 portal_final = portal_inner_final + portal_outer_final;

Here's how the outer_mask looks:

 (1.0 - backMask) * innerMask

The glow is not a constant color. To make it more interesting, it uses the animated first fire layer (squared), so wobbles going towards the centre can be noticed:

And the assembly responsible for the glow:
  158: add r2.w, -v1.z, l(1.000000)  
  159: mul r2.w, r2.w, v1.y  
  160: mul r1.w, r1.w, r1.w  
  161: add r3.x, l(1.000000), -cb4[10].x  
  162: add r3.y, -r3.x, l(1.000000)  
  163: mad r1.w, r1.w, r3.y, r3.x  
  164: mul r1.w, r1.w, r2.w  
  165: mul r1.w, r1.w, cb4[11].x  
  166: log r3.xyz, cb4[12].xyzx  
  167: mul r3.xyz, r3.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000)  
  168: exp r3.xyz, r3.xyzx  
  169: mul r3.xyz, r1.wwww, r3.xyzx  
  170: mad r2.xyz, r2.xyzx, r1.zzzz, r3.xyzx  
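As a sanity check that lines 161-163 really implement the lerp from the HLSL above, here is a quick numeric comparison in Python:

```python
def lerp(a, b, t):
    """HLSL-style lerp."""
    return a + (b - a) * t

def mask_perturb_asm(fire1_sq, atten):
    """What the assembly computes (lines 161-163): mad(t, a, 1 - a)."""
    return fire1_sq * atten + (1.0 - atten)

def mask_perturb_hlsl(fire1_sq, atten):
    """The HLSL form used in the article: lerp(1 - atten, 1, fire1^2)."""
    return lerp(1.0 - atten, 1.0, fire1_sq)

# Compare both forms over a range of fire1^2 values, with atten = 0.30
# (the portalOuterGlowAttenuation value from this frame).
samples = [(t / 10.0, 0.3) for t in range(11)]
match = all(abs(mask_perturb_asm(t, a) - mask_perturb_hlsl(t, a)) < 1e-12
            for t, a in samples)
```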

3.3. Heat haze

When I started analyzing how the portal shader actually works, I was wondering why exactly it needs the scene color without the portal as one of the input textures. My main point was: "hey, we are using blending here, so it's enough to return a pixel with zero alpha to keep the background color".

The shader has a subtle yet nice heat haze effect - heat and energy come out of the portal, so the background gets distorted.

The idea is to offset the pixel texcoords and sample the background color texture with the new coordinates - an operation which is impossible with simple blending.

Here is a video which demonstrates how this works - the order: the full effect first, then the heat haze as in the shader, and at the end I'm multiplying the offset by 10 to exaggerate the effect.

Let's see how the offset is actually calculated.
   const float ViewSpaceDepth = Input.ViewSpaceDepth;  
   const float3 Tangent = Input.Tangent;  
   const float backgroundDistortionStrength = cb4_v1.x; // 0.40  

   // Fades smoothly from the outer edges to the back of a portal
   const float heatHazeMask = Input.Color.x;
   // The heat haze effect is view dependent thanks to tangent vectors in view space.  
   float2 heatHazeOffset = mul( normalize(Tangent), (float3x3)g_mtxView ).xy;  
   heatHazeOffset *= float2(-1, 1);  
   // Fade the effect as camera is further from a portal  
   const float heatHazeDistanceFade = backgroundDistortionStrength / ViewSpaceDepth;  
   heatHazeOffset *= heatHazeDistanceFade;  
   heatHazeOffset *= heatHazeMask;  
   // this is what animates the heat haze effect  
   heatHazeOffset *= pow(fire1, 0.2);  
   // Actually I don't know what's this :)  
   // It was 1.0 usually so I won't bother discussing this.  
   heatHazeOffset *= vsDepth2;  

The relevant assembly is a bit scattered throughout the code, here it is:
  11: dp3 r1.x, v3.xyzx, v3.xyzx  
  12: rsq r1.x, r1.x  
  13: mul r1.xyz, r1.xxxx, v3.xyzx  
  14: mul r1.yw, r1.yyyy, cb12[2].xxxy  
  15: mad r1.xy, cb12[1].xyxx, r1.xxxx, r1.ywyy  
  16: mad r1.xy, cb12[3].xyxx, r1.zzzz, r1.xyxx  
  17: mul r1.xy, r1.xyxx, l(-1.000000, 1.000000, 0.000000, 0.000000)  
  18: div r1.z, cb4[1].x, v0.z  
  19: mul r1.xy, r1.zzzz, r1.xyxx  
  20: mul r1.xy, r1.xyxx, v1.xxxx  
  33: mul r1.xy, r1.xyxx, r2.xxxx  
  34: mul r1.xy, r0.zzzz, r1.xyxx  

Once we have the offset calculated, let's use it!
   const float2 backgroundSceneMaxUv = cb0_v2.zw; // (1.0, 1.0)  
   const float2 invViewportSize = cb0_v1.zw; // (1.0 / 1920.0, 1.0 / 1080.0 )
   // Obtain background scene color - we need to obtain it from texture  
   // for distortion effect  
   float3 sceneColor;  
     const float2 sceneUv_0 = pixelUv + backgroundSceneMaxUv*heatHazeOffset;  
     const float2 sceneUv_1 = backgroundSceneMaxUv - 0.5*invViewportSize;  
     const float2 sceneUv = min(sceneUv_0, sceneUv_1);  
     sceneColor = texScene.SampleLevel(sampler6, sceneUv, 0).rgb;  

  175: mad r0.xy, cb0[2].zwzz, r1.xyxx, r0.xyxx  
  176: mad r1.xy, -cb0[1].zwzz, l(0.500000, 0.500000, 0.000000, 0.000000), cb0[2].zwzz  
  177: min r0.xy, r0.xyxx, r1.xyxx  
  178: sample_l(texture2d)(float,float,float,float) r1.xyz, r0.xyxx, t6.xyzw, s6, l(0)  

So, in the end we have sceneColor.
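The min() against `backgroundSceneMaxUv - 0.5*invViewportSize` just keeps the distorted UV from escaping the background texture. A small Python sketch of the same clamp, with the viewport size from the constants above:

```python
def clamp_scene_uv(uv, offset, max_uv=(1.0, 1.0),
                   inv_viewport=(1.0 / 1920.0, 1.0 / 1080.0)):
    """Apply the heat haze offset, then clamp so the UV never goes
    past the last texel of the background texture."""
    limit = (max_uv[0] - 0.5 * inv_viewport[0], max_uv[1] - 0.5 * inv_viewport[1])
    moved = (uv[0] + max_uv[0] * offset[0], uv[1] + max_uv[1] * offset[1])
    return (min(moved[0], limit[0]), min(moved[1], limit[1]))

# A big offset near the right edge gets clamped to just inside the texture.
uv = clamp_scene_uv((0.999, 0.5), (0.05, 0.0))
```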

3.4. "Destination" color

By "destination" color I refer to the central part of the portal:

Unfortunately, this is all black. And the reason for that is fog.

I have already explored the fog solution more or less in part 15 of the series. In the portal shader, the fog calculations occupy lines [35-135] of the source assembly.

 struct FogResult  
 {  
   float4 paramsFog;  
   float4 paramsAerial;  
 };  

 FogResult fog;  
 {  
   const float3 CameraPosition = cb12_v0.xyz;  
   const float fogStart = cb12_v22.z; // near plane  
   fog = CalculateFog( WSPosition, CameraPosition, fogStart, false );  
 }  
 const float3 destination_color = fog.paramsFog.a * fog.paramsFog.rgb;  

So this is what brings us the final scene:

The thing is, in this frame the camera is so close to the portal that the estimated destination_color is equal to zero, so the black center of the portal is actually fog! (Or, technically, lack of fog.)

Since we are allowed to inject shaders into the game via RenderDoc, let's try to manually offset the camera:
  const float3 CameraPosition = cb12_v0.xyz + float3(100, 100, 0);  

And here's the result:


So, while it makes very little sense to use fog calculations in this particular scenario, in theory there is nothing that stops us from using, for instance, a landscape from another world as the destination_color (maybe an extra pair of texcoords would be needed, but still, this is perfectly doable).

Using fog could be helpful in the case of a huge portal which the player can see from a great distance.

3.5. Mixing (heat hazed) scene color with destination

I was wondering where to put this section - in "destination color" or maybe in "putting all this together" - but I decided to make a new subsection instead :)

At this point we have sceneColor, described in 3.3, which already contains the heat haze effect, and we also have destination_color from 3.4.

They are interpolated with lerp:
  178: sample_l(texture2d)(float,float,float,float) r1.xyz, r0.xyxx, t6.xyzw, s6, l(0)  
  179: mad r3.xyz, r4.wwww, r4.xyzx, -r1.xyzx  
  180: mad r0.xyw, r0.wwww, r3.xyxz, r1.xyxz  

What is the value that interpolates them (r0.w)?
This is where the noise/smoke texture is actually used.

It's used to produce, as I called it, "portal destination mask".

And a video (first the full effect, then the destination mask, then the interpolated heat hazed scene color with destination color):

Take a look at this HLSL snippet:
   // Determines the back part of a portal  
   const float backMask = Input.Color.z;  
   const float ViewSpaceDepth = Input.TexcoordAndViewSpaceDepth.z;  
   const float viewSpaceDepthScale = cb4_v0.x; // 0.50    
   // Load depth from texture  
   float hardwareDepth = texDepth.SampleLevel(sampler15, pixelUv, 0).x;  
   float linearDepth = getDepth(hardwareDepth);  
   // cb4_v0.x = 0.5  
   float vsDepthScale = saturate( (linearDepth - ViewSpaceDepth) * viewSpaceDepthScale );  
   float vsDepth1 = 2*vsDepthScale;
   // Calculate 'portal destination' mask - maybe we would like to see a glimpse of where a portal leads,  
   // like a landscape from another planet - the shader allows for it.  
   float portal_destination_mask;  
     const float region_mask = dot(backMask.xx, vsDepth1.xx);  
     const float2 _UVScale = float2(4.0, 1.0);  
     const float2 _TimeScale = float2(0.0, 0.2);  
     const float2 _UV = texcoords * _UVScale + elapsedTime * _TimeScale;  
     portal_destination_mask = texNoise.Sample(sampler0, _UV).x;  
     portal_destination_mask = saturate(portal_destination_mask + region_mask - 1.0);  
     portal_destination_mask *= portal_destination_mask; // line 143, r0.w  

The portal destination mask is obtained mostly the same way as the fire - using animated texture coordinates. It uses the "region_mask" variable to adjust where the effect is placed.

To obtain region_mask, another variable called vsDepth1 is used. I will describe it a bit more in the next section. It only has a marginal effect on the destination mask, though.

The corresponding assembly for the destination mask is:
  137: dp2 r0.w, v1.zzzz, r0.zzzz  
  138: mul r2.xy, cb0[0].xxxx, l(0.000000, 0.200000, 0.000000, 0.000000)  
  139: mad r2.xy, v0.xyxx, l(4.000000, 1.000000, 0.000000, 0.000000), r2.xyxx  
  140: sample_indexable(texture2d)(float,float,float,float) r2.x, r2.xyxx, t1.xyzw, s0  
  141: add r0.w, r0.w, r2.x  
  142: add_sat r0.w, r0.w, l(-1.000000)  
  143: mul r0.w, r0.w, r0.w  
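A small aside on line 137: dp2 with both operands swizzled to a single component simply doubles the product, so region_mask equals 2 * backMask * vsDepth1. A tiny Python check:

```python
def dp2_broadcast(a, b):
    """dp2 with both operands swizzled to .xx: dot(float2(a, a), float2(b, b))."""
    return a * b + a * b

# So region_mask = dot(backMask.xx, vsDepth1.xx) is really 2 * backMask * vsDepth1:
region_mask = dp2_broadcast(0.5, 0.5)
```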

3.6. Putting all this together

Phew, we are almost done.

Let's obtain the portal color first:
 // Calculate portal color  
 float3 portal_final;  
   const float3 portal_inner_color = pow(colorPortalInner, 2.2) * inner_influence;  
   const float3 portal_outer_color = pow(colorPortalOuterGlow, 2.2) * outer_glow_influence;  
   portal_final = portal_inner_color + portal_outer_color;  
   portal_final *= vsDepth1; // fade the effect to avoid harsh artifacts due to depth test  
   portal_final *= portalFinalColorFilter; // this was (1,1,1) - so not relevant  

The only aspect I'd like to discuss here is vsDepth1.

Here is what this mask looks like:

In the previous subsection I showed how this is obtained - basically from a "linear depth buffer" - and it is used to attenuate the portal's color so there is no harsh cutoff due to the depth test.

Consider the final scene again, with and without the multiplication by vsDepth1.
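For reference, a tiny Python sketch of the vsDepth1 computation - the same math as the HLSL in 3.5, with the 0.5 scale from cb4_v0.x. It's essentially the soft-particles trick: fade by the distance between the portal pixel and the scene geometry behind it.

```python
def soft_depth_fade(scene_depth, pixel_depth, scale=0.5):
    """vsDepth1 from the shader: saturate((sceneDepth - pixelDepth) * scale) * 2.
    Fades the portal out where it nearly intersects scene geometry."""
    return 2.0 * max(0.0, min(1.0, (scene_depth - pixel_depth) * scale))

# At an intersection the portal contribution vanishes; with 2+ units of
# clearance behind it, the mask reaches its full value of 2.
fade_at_contact = soft_depth_fade(10.0, 10.0)
fade_with_room = soft_depth_fade(13.0, 10.0)
```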

Once we have portal_final, obtaining the final color is easy:
   const float finalPortalAmount = cb2_v0.x; // 0.99443  
   const float3 finalColorFilter = cb2_v2.rgb; // (1.0, 1.0, 1.0)  
   const float finalOpacityFilter = cb2_v2.a; // 1.0  
   // Alpha component for blending  
   float opacity = saturate( lerp(cb2_v0.x, 1, cb4_v13.x) );  
   // Calculate the final color  
   float3 finalColor;  
     // Mix the scene color (with heat haze effect) with the 'destination color'.  
     // In this particular example fog is used as destination (which is black where camera is nearby)  
     // but in theory there is nothing which stops us from putting here a landscape from another world.  
     const float3 destination_color = fog.paramsFog.a * fog.paramsFog.rgb;      
     finalColor = lerp( sceneColor, destination_color, portal_destination_mask );  
     // Add the portal color  
     finalColor += portal_final * finalPortalAmount;  
     // Final filter  
     finalColor *= finalColorFilter;  
   opacity *= finalOpacityFilter;  
   return float4(finalColor * opacity, opacity);  

So this is it. There is an extra finalPortalAmount variable which decides how much of the fire you actually see. I haven't tested it in much detail, but I imagine it's used when the portal appears and disappears - for a brief moment you don't see the fire, but everything else instead: the glow, the destination color, etc.

4. Summary

The final HLSL shader is here if you are interested. I had to reorder a few lines to get the same assembly as the original, but it doesn't disrupt the general flow. The shader is RenderDoc-ready - all cbuffers are there etc. - so you can inject it and experiment on your own.

Hope you enjoyed it - thanks for reading!