The last post described how to achieve the desired quality in shadow map sampling. Now it is time to make it faster. If you don't know what PCF stands for, please read the first section of PCF Shadow Acne first.
The source code at the end of the last post showed a 3x3 PCF implementation. For each of the nine pixels a bilinear lookup based on 4 values is computed. This also means that many of the underlying values are fetched more than once. The area covered by sampling is 4x4=16 pixels, whereas the implementation takes 9*4=36 samples through the Gather command. That is 20 too many!
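The numbers above can be reproduced with a small CPU-side count. This is only a sketch in Python, not shader code; the function name `pcf_fetch_counts` is made up for illustration, and it assumes each bilinear lookup reads a 2x2 block of texels.

```python
def pcf_fetch_counts(n):
    """Return (total, unique, redundant) texel fetches for an n x n PCF
    built from n*n bilinear lookups, each reading a 2x2 texel block."""
    fetched = [
        (i + di, j + dj)
        for j in range(n) for i in range(n)   # the n*n bilinear lookups
        for dj in (0, 1) for di in (0, 1)     # the 2x2 texels of one lookup
    ]
    total = len(fetched)
    unique = len(set(fetched))
    return total, unique, total - unique


print(pcf_fetch_counts(3))  # (36, 16, 20)
```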
What happens with a pixel sampled twice? Let's consider a 2x1 filter size. The 3 underlying samples (comparison results from the shadow map) are called A, B and C. The fractional components of the sample position are x and y.
The two linear lookups will now give:

    A*(1-x) + B*x
    B*(1-x) + C*x

which are summed and normalized (average):

    0.5 * (A*(1-x) + B*x + B*(1-x) + C*x)
    = 0.5 * (A*(1-x) + B + C*x)

It appears that B was sampled twice but it would be sufficient to fetch it just one time to be able to compute the same result.
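A quick numeric check of this simplification, written as a Python sketch rather than shader code. The function names are hypothetical; A, B, C and x are the symbols from the text.

```python
import random


def two_lookups(a, b, c, x):
    """Average of the two overlapping linear lookups (B is read twice)."""
    left = a * (1 - x) + b * x
    right = b * (1 - x) + c * x
    return 0.5 * (left + right)


def merged(a, b, c, x):
    """Same result with every sample read only once."""
    return 0.5 * (a * (1 - x) + b + c * x)


random.seed(0)
for _ in range(1000):
    # Comparison results are 0 or 1, the fractional position lies in [0, 1).
    a, b, c = (random.choice((0.0, 1.0)) for _ in range(3))
    x = random.random()
    assert abs(two_lookups(a, b, c, x) - merged(a, b, c, x)) < 1e-12
```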
For 2D there are just a few more factors. The table shows the resulting weight of each pixel in the 4x4 area sampled by a 3x3 PCF.
(1-x)*(1-y) | (1-y) | (1-y) | x*(1-y) |
(1-x) | 1 | 1 | x |
(1-x) | 1 | 1 | x |
(1-x)*y | y | y | x*y |
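The table can be derived by accumulating the bilinear weights of all nine lookups. The following Python sketch does that on the CPU and compares the result with the table; the function names are made up for illustration.

```python
def accumulated_weights(x, y):
    """Sum the bilinear weights of the 3x3 lookups over the 4x4 texel area.
    The result is indexed as w[row][column]."""
    w = [[0.0] * 4 for _ in range(4)]
    wx = (1 - x, x)   # horizontal weights of one bilinear lookup
    wy = (1 - y, y)   # vertical weights of one bilinear lookup
    for j in range(3):
        for i in range(3):
            for dj in (0, 1):
                for di in (0, 1):
                    w[j + dj][i + di] += wx[di] * wy[dj]
    return w


def table_weights(x, y):
    """The weights as written in the table: product of a row and a column factor."""
    col = (1 - x, 1.0, 1.0, x)
    row = (1 - y, 1.0, 1.0, y)
    return [[r * c for c in col] for r in row]


acc = accumulated_weights(0.3, 0.7)
tab = table_weights(0.3, 0.7)
assert all(abs(acc[r][c] - tab[r][c]) < 1e-12 for r in range(4) for c in range(4))
```

The weights always sum to 9, which is why the final result is divided by 9.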
I chose the 3x3 PCF or 4x4 pixel area because this allows loading all texels with 4 Gather commands. The results just have to be multiplied with the respective factors and summed up. The new code has 5 fewer Gather commands and produces the very same output.

    float computePCFShadow3( Texture2D _ShadowMap, float3 _vLightSpaceCoord )
    {
        // To texture space
        _vLightSpaceCoord = _vLightSpaceCoord*float3(0.5,-0.5, 1.0)+float3(0.5, 0.5, -0.0005);

        // perform PCF filtering on a 3 x 3 texel neighborhood
        float fSum = 0.0f;
        float fZScaled = _vLightSpaceCoord.z * Z_CMP_CORRECTION_SCALE;
        float4 vFrac;
        vFrac.xy = frac(_vLightSpaceCoord.xy*c_fShadowMapResolution-0.5);
        vFrac.zw = 1-vFrac.xy;
        float4 vDepth;
        // Sample
        // Interpolation    Z-Comparison    "Correction" of hard z-test edges.
        vDepth = _ShadowMap.Gather( g_PointSampler, _vLightSpaceCoord.xy, int2(-1,-1) );
        fSum += dot( float4(vFrac.z, 1, vFrac.w, vFrac.z*vFrac.w),
                     (vDepth > _vLightSpaceCoord.z) * saturate(vDepth*Z_CMP_CORRECTION_SCALE - fZScaled) );
        vDepth = _ShadowMap.Gather( g_PointSampler, _vLightSpaceCoord.xy, int2( 1,-1) );
        fSum += dot( float4(1, vFrac.x, vFrac.x*vFrac.w, vFrac.w),
                     (vDepth > _vLightSpaceCoord.z) * saturate(vDepth*Z_CMP_CORRECTION_SCALE - fZScaled) );
        vDepth = _ShadowMap.Gather( g_PointSampler, _vLightSpaceCoord.xy, int2(-1, 1) );
        fSum += dot( float4(vFrac.z*vFrac.y, vFrac.y, 1, vFrac.z),
                     (vDepth > _vLightSpaceCoord.z) * saturate(vDepth*Z_CMP_CORRECTION_SCALE - fZScaled) );
        vDepth = _ShadowMap.Gather( g_PointSampler, _vLightSpaceCoord.xy, int2( 1, 1) );
        fSum += dot( float4(vFrac.y, vFrac.x*vFrac.y, vFrac.x, 1),
                     (vDepth > _vLightSpaceCoord.z) * saturate(vDepth*Z_CMP_CORRECTION_SCALE - fZScaled) );

        return fSum/9.0;
    }

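The four weight vectors in the shader depend on the component order that Gather returns. The Python sketch below emulates this on the CPU and checks the four-Gather sum against nine plain bilinear lookups. It assumes the Direct3D order, where x, y, z, w hold the lower-left, lower-right, upper-right and upper-left texel of the 2x2 block (with v growing downwards). All helper names are hypothetical, and `tex` stands for a 4x4 grid of already compared values indexed as `tex[row][column]`.

```python
import random


def gather(tex, u, v):
    """2x2 block whose upper-left texel is (u, v), in assumed Gather order x, y, z, w."""
    return (tex[v + 1][u], tex[v + 1][u + 1], tex[v][u + 1], tex[v][u])


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def pcf_four_gathers(tex, x, y):
    nx, ny = 1 - x, 1 - y   # mirrors vFrac.zw = 1 - vFrac.xy
    s = dot((nx, 1, ny, nx * ny), gather(tex, 0, 0))   # offset (-1,-1)
    s += dot((1, x, x * ny, ny), gather(tex, 2, 0))    # offset ( 1,-1)
    s += dot((nx * y, y, 1, nx), gather(tex, 0, 2))    # offset (-1, 1)
    s += dot((y, x * y, x, 1), gather(tex, 2, 2))      # offset ( 1, 1)
    return s / 9.0


def pcf_nine_bilinear(tex, x, y):
    """Reference: nine overlapping bilinear lookups, as in the previous version."""
    s = 0.0
    for j in range(3):
        for i in range(3):
            top = tex[j][i] * (1 - x) + tex[j][i + 1] * x
            bottom = tex[j + 1][i] * (1 - x) + tex[j + 1][i + 1] * x
            s += top * (1 - y) + bottom * y
    return s / 9.0


random.seed(1)
for _ in range(1000):
    tex = [[random.random() for _ in range(4)] for _ in range(4)]
    x, y = random.random(), random.random()
    assert abs(pcf_four_gathers(tex, x, y) - pcf_nine_bilinear(tex, x, y)) < 1e-12
```

Swapping any two components of a weight vector makes the check fail for most inputs, so a test like this is a cheap way to catch ordering mistakes before they show up as subtle filtering artifacts.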
The code can be compared to the last one. It is a little bit messier because of the manual loop unrolling, or let's say because computing the correct factors inside a loop would be much more complicated than just writing them down.
Another change is that the correction factor which was introduced to get rid of the artifacts is now called Z_CMP_CORRECTION_SCALE and is not the plain literal 2048.0 anymore.
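To see what this factor does to a single sample, here is a Python sketch of the per-texel term from the shader. It assumes the constant still has the value of the old literal, 2048.0; the names `saturate` and `soft_compare` are only helpers for this illustration.

```python
Z_CMP_CORRECTION_SCALE = 2048.0  # assumption: same value as the former literal


def saturate(v):
    """Clamp to [0, 1], like the HLSL intrinsic of the same name."""
    return min(max(v, 0.0), 1.0)


def soft_compare(depth, z, scale=Z_CMP_CORRECTION_SCALE):
    """Hard z-test multiplied by a ramp that fades in over 1/scale depth units."""
    hard = 1.0 if depth > z else 0.0
    return hard * saturate(depth * scale - z * scale)


print(soft_compare(0.4, 0.5))             # 0.0, occluder in front: shadowed
print(soft_compare(0.5 + 1 / 4096, 0.5))  # 0.5, halfway up the ramp
print(soft_compare(0.6, 0.5))             # 1.0, clearly lit
```

A larger scale makes the ramp steeper and the test closer to the hard comparison.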