Category Archives: Math

Numerical Recipes for Rendering

Rounding

Rounding is the root of all evil. Floats have the nice property that they always have (more or less) the same relative precision. That means, it doesn't matter if a number is small or huge, as long as it is not too small (underflow) or too large (overflow). Still, the number of digits is limited. Lets assume we have 3 digits precision in decimal system, than adding 100e1 and 1 gives 100e1. So the 1 had no effect at all, since the true result 1001 was rounded to three digits. In most cases the rounding during addition is not fatal, because the error is relatively small (all 3 stored digits are correct). Addition and subtraction may only have an error in the last digit, i.e. they will not be far away from the true result. If rounding to the closer number the error is at most 0.5 ulp (units is the last place).

The point where rounding makes things bad is if errors accumulate. In the worst case catastrophic cancellation can happen, where a subtraction of similar numbers reveals a previous error. An example from [Go91]: Again we have decimal numbers with 3 digits: a=1.22, c=2.28, b=3.34 and want to compute b*b-4*a*c. The exact result is 0.0292. In limited precision it is b*b=11.2 (0.5 ulp), 4*a*c=11.1 (0.5 ulp) and finally 0.1 (70 ulp !).

1-x² and x²-y²

Lets start with the smallest term. 1-x*x is very useful in trigonometric expressions, since sin^2 + cos^2 = 1. For example: this can be used for the compression of normal vectors in polar coordinates. So, what is the problem of this expression?
It is catastrophic cancellation, because x is squared and then subtracted, which is the most basic example where this kind of error occurs.
It is possible to write the expression using the third binomial formula (1-x)*(1+x), which is more robust. This form has a bigger error in general, but as the graph shows, its error does not increase if both numbers are very similar and is very small otherwise. However, a maximal error of 0.012% in the first form is not that dramatic for most applications.
Note, that the same trick applies to x*x-y*y which can be reformulated to (x-y)*(x+y)

Quadratic Equation

There are many algorithms which require the roots of a quadratic polynomial, e.g. a ray-sphere intersection. The basic expression we want to solve is ax² + bx + c = 0 for which the known solution is

If the term below the root gets wrong (like in the example above) this is not that bad, because if that happens the result is small compared to b. Now, the problem occurs again if b and the square root are similar and get subtracted. However, this is only the case for one of the two results, because the other one adds the root term. The graph shows that if b is 1000 times larger than a and c the error becomes already larger than 10% (for 32bit floats). For a factor over 6000, rounding toggles between 100% and some smaller error for odd and even numbers.
Again, it is possible to use the third binomial formula and multiply the term with the denominator using the inverted sign [X05]:

Note, that the

changed and we get the subtraction for the two results reversed. A stable implementation needs to check which of the two formulas is better for which x and than use one of the two different terms.

// Returns two solutions with x1 >= x2 if there are solutions
void solveSquarePoly(float a, float b, float c, float& x1, float& x2)
{
    float discriminant = b*b - 4.0f*a*c;
    // ... handle discriminant < 0 ...
    float dsqrt = sqrt(discriminant);
    if(b > 0) {
        x1 = (-b - dsqrt)/(2.0f*a);
        x2 = -2.0f*c/(b + dsqrt);
    } else {
        x1 = -2.0f*c/(b - dsqrt);
        x2 = (-b + dsqrt)/(2.0f*a);
    }
}

// Returns two solutions with x1 >= x2 if there are solutions

void solveSquarePoly(float a, float b, float c, float& x1, float& x2)

{

float discriminant = b*b - 4.0f*a*c;

// ... handle discriminant < 0 ...

float dsqrt = sqrt(discriminant);

if(b > 0) {

x1 = (-b - dsqrt)/(2.0f*a);

x2 = -2.0f*c/(b + dsqrt);

} else {

x1 = -2.0f*c/(b - dsqrt);

x2 = (-b + dsqrt)/(2.0f*a);

}

ln(1+x) and exp(1+x,n)

For small |x|, 1+x discards many digits, so both expressions will be inaccurate. The later can be useful in numerical integrations (e.g. simulation of movement), when there is a grow or decrease over time. It can be expressed using the first one: exp(1+x,n) = exp(n*ln(1+x)) which leaves us with the problem of ln(1+x) to be inaccurate. Well, if you look at the graph it is not thaaat important. x must be really close to zero, before something bad happens.

The following code snipped solves this issue. Effectively, the function is multiplied with x/x, where the denominator is replaced by (1+x)-1, which is an estimation of the error made by the summation. [HP82]

float log1p(float x)
{
    float sum = 1+x;
    if(sum == 1) return 1;
    else return x*ln(sum)/(sum-1);
}

float log1p(float x)

{

float sum = 1+x;

if(sum == 1) return 1;

else return x*ln(sum)/(sum-1);

}

You can find a numerical stable log1p method in the math.h (at least in my cpp standard library). Surprisingly, it is slower by a factor of 7 and calls modf, acosf, fdtest and log internally. Seems to be some series expansion inside.

Sum and Variance of Many Numbers

Especially path tracing and other Monte Carlo light transport simulations rely on the law of large numbers. I.e. they converge to the correct result for sufficient many samples. Well, they should, but due to floating point they never will.

First, I experimented with plausible series of 10,000 and 100,000 samples of uniform and exponential distributions. All had shown a systematic underestimation with up to 0.001% error. This is only a problem, if many larger numbers with small variance are added (e.g. for being a uniform sample) which I did not test, because this is not really likely for rendering applications. Still, I will add a few solutions for completeness. See [Wiki1] and [He05] for more alternatives. The first well known solution is the Kahan summation. It tracks the lost digits in a second register to achieve a constant error bound (trivial summation has an error in the number of samples). However, adding doubles naively is already more precise than Kahan summation with two floats, so it makes only sense if we want to have even more precision than double [Wiki1] or cannot use double (e.g. on GPU).

class PreciseSum
{
    double sum, c;
public:
    PreciseSum() : sum(0), c(0) {}

    void add(float x)
    {
        x -= c; // "Carry" of the lost digits
        volatile double tsum = sum + x; // volatile to forbid optimization
        c = (tsum - sum) - x; // 0 if nothing was rounded
        sum = tsum;
    }

    float sum() const { return sum; }
};

class PreciseSum

{

double sum, c;

public:

PreciseSum() : sum(0), c(0) {}

void add(float x)

{

x -= c; // "Carry" of the lost digits

volatile double tsum = sum + x; // volatile to forbid optimization

c = (tsum - sum) - x; // 0 if nothing was rounded

sum = tsum;

}

float sum() const { return sum; }

};

Another option is to recursively compute partial sums. This is quite nice, because computing prefix sums (e.g. for summed area tables) with a parallel algorithm is already more accurate than a linear algorithm.

Beyond the sum of samples we could also be interested in there variance. This time errors can get worse, because if the mean is close to the (with a small variance) we subtract two similar numbers.

Another problem with the naive formulation is, that we need to keep all samples to compute

once we know

. Both problems are solved by Welford's method explained on [Co08]. I like it, because we don't need to store more than three values which we do anyway (mean, variance and number of samples).

class PreciseVariance
{
    int m_n;
    float m_s, m_m;
public:
    PreciseVariance() : m_n(0), m_s(0), m_m(0) {}

    void addSample(float x)
    {
        if(m_n++ == 0) m_m = x;
        else {
            float oldM = m_m;
            m_m = oldM + (x - oldM) / m_n;
            m_s = m_s + (x - oldM) * (x - m_m);
        }
    }

    int n() const { return m_n; }
    float mean() const { return m_m; }
    float variance() const { return m_n>1 ? m_s/(m_n-1) : 0.0; }
};

class PreciseVariance

{

int m_n;

float m_s, m_m;

public:

PreciseVariance() : m_n(0), m_s(0), m_m(0) {}

void addSample(float x)

{

if(m_n++ == 0) m_m = x;

else {

float oldM = m_m;

m_m = oldM + (x - oldM) / m_n;

m_s = m_s + (x - oldM) * (x - m_m);

}

int n() const { return m_n; }

float mean() const { return m_m; }

float variance() const { return m_n>1 ? m_s/(m_n-1) : 0.0; }

};

Compared to the often used sum of squares method, which stores x, x*x and n, Welford's algorithm performs much better. In the experiment I used n samples uniformly distributed in [0.5, 1.5] which should be comparable stable (see summation above).
Unfortunately, for more than ~45000 samples the variance in the naive sum of squares method gets negative. Even in the absolute value the error gets larger than 1000% (logarithmic scale!). Besides that catastrophe, all other values have acceptable errors in the 5th to 7th decimal place.

Final Thoughts

I tested a set of recipes for numerical stable computations to see which of them are really necessary in the context of rendering and simulation. Before, I encountered problems in a ray tracing application using ellipsoids leading to geometry misses in my bounding volume hierarchy. So, of course I think using the more stable solution of the quadratic polynomial is a good idea.
Using the third binomial formula is not really necessary, but since it barely changes instruction count/performance it can be used everywhere without thinking.
In the cases of logarithmic functions and summing numbers, errors happen in extremal cases mainly. As long as we do not run into these, which is unlikely in rendering, nothing bad happens with the naive approaches.
However, when using the sums of numbers in some subtraction, as for the computation of variance, I strongly recommend to use a better method. Especially, for unbiased Monte Carlo integration simply adding floats is not enough.

[Co08] Accurately Computing Running Variance, John D. Cook, 2008
http://www.johndcook.com/blog/standard_deviation/
[Co14] Avoiding Overflow, Underflow, and Loss of Precision, John D. Cook, 2014
http://www.codeproject.com/Articles/25294/Avoiding-Overflow-Underflow-and-Loss-of-Precision
[Go91] What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg, 1991
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
[HP82] HP-15C Advanced Functions Handbook, Hewlett-Packard Company, 1982
[He05] Binary floating point summation accurate to full precision (Python recipe), Raymond Hettinger, 2005
http://code.activestate.com/recipes/393090/
[Wiki1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm
[X05] 2005, http://people.csail.mit.edu/bkph/articles/Quadratics.pdf

Performance Optimal Vector Swizzling in C++

3 Replies

In a shader we can type vec3 v0 = v1.xxy * 2 and any other combination of x, y, z and w depending on the length of the vector. The resulting vector must not have the same same size (in the example v1 could be a vec2) because components can be copied through. This is called swizzling and is really comfortable.
Vectors are everywhere in game projects not only in the shaders. How can we get the same behavior in C++? Can we get it without losing performance? I wanted to understand if and how this can be done. There are two solutions available: The glm library from G-Truc and the CxxSwizzle library. Anyway, I did not test the two libraries for their performance but if you wanna have swizzling you might take one of them instead of the header file I had written. The advantage is that they have implemented more functions so far. But, I did not found explanations about how to solve the problem so I will try to fill that gap.

Before we can start here are the problems to face:

(1) Access the elements in arbitrary order and count: v0.xxy + v1.xzy
(2) Write to a swizzled vector v1.yxwz = v0; where doubled elements are explicit forbidden
(3) No Memory overhead: a vec3 should have the size of 3 times its base type
(4) No Computational overhead: a solution with multiple lines containing equivalent scalar operations should not be faster

First there are two different possibilities to a achieve the syntax v1.yxwz = v0; without brackets: macros and unions. You could also have a nested type but then the expression would not return any address and it is impossible to calculate things on the data of v without its address. In case of macros you can hide functions like yxwz() which do something you want. The problems with functions is that they get complicated on the left-hand-side where we want them to return references to swizzlings. The example (2) should fill the vector v1 in a swizzled order and not compute things on some copy of v1. You might be able to solve that with template meta programming or explicit proxy objects. These are objects of another type containing a reference to the original type. Operators on them will always access the original elements in some type-dependent way. However Returning proxies might be to complicated for a compiler to be optimized away. Further I do not like to have macros like x to pollute all my namespaces!

The union Solution

In a union all members work on the same space. If each member has a different type and if there are operators for each we can do everything we want.

union
{
    float m_data[3];
    Txxx xxx;
    Txxy xxy;
    ...
};

union

{

float m_data[3];

Txxx xxx;

Txxy xxy;

...

};

The types must be trivially copyable, otherwise it would not be possible to put them into a union. It is possible but not feasible to write so many types, so we want the compiler to make this job: using templates.

The Swizzle-Proxy Template

template<typename VectorType, typename Data, int A, int B>
class SwizzleProxy2
{
public:
    template<class VectorType2, typename Data2, int A2, int B2>
    SwizzleProxy2& operator += (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs)
    {
        ((Data*)this)[A] += ((const Data2*)&_rhs)[A2];
        ((Data*)this)[B] += ((const Data2*)&_rhs)[B2];
        return *this;
    }
};

template<typename VectorType, typename Data, int A, int B>

class SwizzleProxy2

{

public:

template<class VectorType2, typename Data2, int A2, int B2>

SwizzleProxy2& operator += (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs)

{

((Data*)this)[A] += ((const Data2*)&_rhs)[A2];

((Data*)this)[B] += ((const Data2*)&_rhs)[B2];

return *this;

}

};

The above class shows the basic idea of how to implement the operators for swizzling with exactly two elements: xx, xy, wx, ... . The template arguments A and B can be any index of elements in an underling real vector. For the swizzle wx A is 3 and B is 0 accessing two elements of a vec4.

Notice: the class itself does not have own members! Instantiating it would cause lots of access violations. Together with the union above the this pointer becomes a pointer to m_data . That is why we can cast it so ugly without fear.
Unfortunately when compiling the compiler must create a new operator for each combination of swizzle types. This increases compile times heavily which cannot be avoided.

So far we can use the class the following way:

v1.xw += v2.yy;
v1.xx += v2.zx; // Bad O_o

1 2	v1.xw += v2.yy; v1.xx += v2.zx; // Bad O_o

The second line would also compile but behave wired. It would add v2.z and v2.x to v1.x successively. To avoid that we can cause the compiler to fail by the following trick:

static const bool IsWritable = (A != B);
typedef typename std::conditional<IsWritable, SwizzleProxy2, struct OperationNotAvailable>::type WriteableThisType;

WriteableThisType& operator += (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs)

static const bool IsWritable = (A != B);

typedef typename std::conditional<IsWritable, SwizzleProxy2, struct OperationNotAvailable>::type WriteableThisType;

WriteableThisType& operator += (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs)

Depending on how the indices are chosen the return type is either SwizzleProxy2 as before or struct OperationNotAvailable which is nowhere defined. In the second case the compiler cannot create the function and will give you an error message which will contain "OperationNotAvailable" at some point.

To implement all the different operators for all SwizzleProxyX class I tried to create a template based collection of common operator implementations. The problem was that the compiler failed to optimize everything so we need to do that ourself for each of the (four) proxy templates. So the old CommonVectorOperator class currently contains the array access operator [] only. To still reduce the work a little bit I used macros for code generation. The macro is undefined at the end of the operator section such that from outside there are no unnecessary symbols. Just have a look into the code of the complete SwizzleProxy2 class.

/// \brief Type for swizzled access to two element vectors.
/// \details All the vector operators are defined on the swizzle types. The
///		derived final vector classes are only unions of swizzle vectors.
template<typename VectorType, typename Data, int A, int B>
class SwizzleProxy2: public CommonVectorOperators<Data>
{
public:
    /// \brief To be a write able proxy all indices must be different
    static const bool IsWritable = (A != B);

    /// \brief Use this type if an function should be created only if the
    ///		current swizzle is write able.
    typedef typename std::conditional<IsWritable, SwizzleProxy2, struct OperationNotAvailable>::type WriteableThisType;

    /// \brief Use a locally defined macro to reduce the vector
    ///		implementation overhead.
#   define CREATE_ASSIGMENT_OPERATOR(Op)                        \
    template<class VectorType2, typename Data2, int A2, int B2> \
    WriteableThisType& operator Op (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs) \
    {                                                           \
        ((Data*)this)[A] Op ((const Data2*)&_rhs)[A2];          \
        ((Data*)this)[B] Op ((const Data2*)&_rhs)[B2];          \
        return *this;                                           \
    }                                                           \
                                                                \
    /* Scalar operation */                                      \
    WriteableThisType& operator Op (const Data _rhs)            \
    {                                                           \
        ((Data*)this)[A] Op _rhs;                               \
        ((Data*)this)[B] Op _rhs;                               \
        return *this;                                           \
    }

    CREATE_ASSIGMENT_OPERATOR( = )

    CREATE_ASSIGMENT_OPERATOR( += )
    CREATE_ASSIGMENT_OPERATOR( -= )
    CREATE_ASSIGMENT_OPERATOR( *= )
    CREATE_ASSIGMENT_OPERATOR( /= )

    // The following operators are only defined for integer types.
    CREATE_ASSIGMENT_OPERATOR( |= )
    CREATE_ASSIGMENT_OPERATOR( &= )
    CREATE_ASSIGMENT_OPERATOR( ^= )
    CREATE_ASSIGMENT_OPERATOR( %= )
    CREATE_ASSIGMENT_OPERATOR( <<= )
    CREATE_ASSIGMENT_OPERATOR( >>= )

#   undef CREATE_ASSIGMENT_OPERATOR

#   define CREATE_ARITHMETIC_OPERATOR(Op)                                \
    template<class VectorType2, typename Data2, int A2, int B2>          \
    VectorType operator Op (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs) const \
    {                                                                    \
        VectorType result;                                               \
        result[0] = ((const Data*)this)[A] Op ((const Data2*)&_rhs)[A2]; \
        result[1] = ((const Data*)this)[B] Op ((const Data2*)&_rhs)[B2]; \
        return result;                                                   \
    }                                                                    \
                                                                         \
    VectorType operator Op (const Data _rhs) const                       \
    {                                                                    \
        VectorType result;                                               \
        result[0] = ((const Data*)this)[A] Op _rhs;                      \
        result[1] = ((const Data*)this)[B] Op _rhs;                      \
        return result;                                                   \
    }                                                                    \
                                                                         \
    friend VectorType operator Op (const Data _lhs, const SwizzleProxy2& _rhs) \
    {                                                                    \
        VectorType result;                                               \
        result[0] = _lhs Op ((const Data*)&_rhs)[A];                     \
        result[1] = _lhs Op ((const Data*)&_rhs)[B];                     \
        return result;                                                   \
    }

    CREATE_ARITHMETIC_OPERATOR(+)
    CREATE_ARITHMETIC_OPERATOR(-)
    CREATE_ARITHMETIC_OPERATOR(*)
    CREATE_ARITHMETIC_OPERATOR(/)

    // Integer only operators
    CREATE_ARITHMETIC_OPERATOR(|)
    CREATE_ARITHMETIC_OPERATOR(&)
    CREATE_ARITHMETIC_OPERATOR(^)
    CREATE_ARITHMETIC_OPERATOR(%)
    CREATE_ARITHMETIC_OPERATOR(<<)
    CREATE_ARITHMETIC_OPERATOR(>>)

#   undef CREATE_ARITHMETIC_OPERATOR
};

/// \brief Type for swizzled access to two element vectors.

/// \details All the vector operators are defined on the swizzle types. The

/// derived final vector classes are only unions of swizzle vectors.

template<typename VectorType, typename Data, int A, int B>

class SwizzleProxy2: public CommonVectorOperators<Data>

{

public:

/// \brief To be a write able proxy all indices must be different

static const bool IsWritable = (A != B);

/// \brief Use this type if an function should be created only if the

/// current swizzle is write able.

typedef typename std::conditional<IsWritable, SwizzleProxy2, struct OperationNotAvailable>::type WriteableThisType;

/// \brief Use a locally defined macro to reduce the vector

/// implementation overhead.

# define CREATE_ASSIGMENT_OPERATOR(Op) \

template<class VectorType2, typename Data2, int A2, int B2> \

WriteableThisType& operator Op (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs) \

{ \

((Data*)this)[A] Op ((const Data2*)&_rhs)[A2]; \

((Data*)this)[B] Op ((const Data2*)&_rhs)[B2]; \

return *this; \

} \

/* Scalar operation */ \

WriteableThisType& operator Op (const Data _rhs) \

{ \

((Data*)this)[A] Op _rhs; \

((Data*)this)[B] Op _rhs; \

return *this; \

}

CREATE_ASSIGMENT_OPERATOR( = )

CREATE_ASSIGMENT_OPERATOR( += )

CREATE_ASSIGMENT_OPERATOR( -= )

CREATE_ASSIGMENT_OPERATOR( *= )

CREATE_ASSIGMENT_OPERATOR( /= )

// The following operators are only defined for integer types.

CREATE_ASSIGMENT_OPERATOR( |= )

CREATE_ASSIGMENT_OPERATOR( &= )

CREATE_ASSIGMENT_OPERATOR( ^= )

CREATE_ASSIGMENT_OPERATOR( %= )

CREATE_ASSIGMENT_OPERATOR( <<= )

CREATE_ASSIGMENT_OPERATOR( >>= )

# undef CREATE_ASSIGMENT_OPERATOR

# define CREATE_ARITHMETIC_OPERATOR(Op) \

template<class VectorType2, typename Data2, int A2, int B2> \

VectorType operator Op (const SwizzleProxy2<VectorType2, Data2, A2, B2>& _rhs) const \

{ \

VectorType result; \

result[0] = ((const Data*)this)[A] Op ((const Data2*)&_rhs)[A2]; \

result[1] = ((const Data*)this)[B] Op ((const Data2*)&_rhs)[B2]; \

return result; \

} \

VectorType operator Op (const Data _rhs) const \

{ \

VectorType result; \

result[0] = ((const Data*)this)[A] Op _rhs; \

result[1] = ((const Data*)this)[B] Op _rhs; \

return result; \

} \

friend VectorType operator Op (const Data _lhs, const SwizzleProxy2& _rhs) \

{ \

VectorType result; \

result[0] = _lhs Op ((const Data*)&_rhs)[A]; \

result[1] = _lhs Op ((const Data*)&_rhs)[B]; \

return result; \

}

CREATE_ARITHMETIC_OPERATOR(+)

CREATE_ARITHMETIC_OPERATOR(-)

CREATE_ARITHMETIC_OPERATOR(*)

CREATE_ARITHMETIC_OPERATOR(/)

// Integer only operators

CREATE_ARITHMETIC_OPERATOR(|)

CREATE_ARITHMETIC_OPERATOR(&)

CREATE_ARITHMETIC_OPERATOR(^)

CREATE_ARITHMETIC_OPERATOR(%)

CREATE_ARITHMETIC_OPERATOR(<<)

CREATE_ARITHMETIC_OPERATOR(>>)

# undef CREATE_ARITHMETIC_OPERATOR

};

Remark: The scalar-vector operators are implemented as friend . This is a trick in C++ to avoid having such functions in the global namespace. The compiler can still find the function by ADL (argument dependent lookup). For each different template argument setup of the proxy class there is exactly one such operator.

You might have noticed that the template takes a VectorType argument. This is required in the implementation of the non-assigning operators as a simple +. These must return a new copy which is only possible of the real vector type is known.

The Final Vector Class

/// \brief Implementation of a 2D vector class with swizzling.
template<typename Data>
struct Vec2_Base: public detail::SwizzleProxy2<Vec2_Base<Data>,Data,0,1>
{
	typedef Data DataType;

	/// \brief The data with a multitude of access functions
	union {
		DataType m_data[2];

		detail::SwizzleProxy1<Vec1_Base<Data>,Data,0> x, r;
		detail::SwizzleProxy1<Vec1_Base<Data>,Data,1> y, g;
		detail::SwizzleProxy2<Vec2_Base<Data>,Data,0,0> xx, rr;
		detail::SwizzleProxy2<Vec2_Base<Data>,Data,0,1> xy, rg;
		detail::SwizzleProxy2<Vec2_Base<Data>,Data,1,0> yx, gr;
		detail::SwizzleProxy2<Vec2_Base<Data>,Data,1,1> yy, gg;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,0,0> xxx, rrr;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,0,1> xxy, rrg;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,1,0> xyx, rgr;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,1,1> xyy, rgg;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,0,0> yxx, grr;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,0,1> yxy, grg;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,1,0> yyx, ggr;
		detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,1,1> yyy, ggg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,0,0> xxxx, rrrr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,0,1> xxxy, rrrg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,1,0> xxyx, rrgr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,1,1> xxyy, rrgg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,0,0> xyxx, rgrr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,0,1> xyxy, rgrg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,1,0> xyyx, rggr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,1,1> xyyy, rggg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,0,0> yxxx, grrr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,0,1> yxxy, grrg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,1,0> yxyx, grgr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,1,1> yxyy, grgg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,0,0> yyxx, ggrr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,0,1> yyxy, ggrg;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,1,0> yyyx, gggr;
		detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,1,1> yyyy, gggg;
	};

	/// \brief Fast default construction without initialization
	Vec2_Base() {}
	/// \brief Construction from scalar
	explicit Vec2_Base(Data _x)		{ m_data[0] = _x; m_data[1] = _x; }
	/// \brief Construction from two elements
	Vec2_Base(Data _x, Data _y)		{ m_data[0] = _x; m_data[1] = _y; }
	/// \brief creation from swizzle type
	template<typename VectorType, typename Data2, int A, int B>
	Vec2_Base( const detail::SwizzleProxy2<VectorType,Data2,A,B>& _v ) { m_data[0] = _v[A]; m_data[1] = _v[B]; }
	// Standard copy and assignment operator are defined as well
};

/// \brief Implementation of a 2D vector class with swizzling.

template<typename Data>

struct Vec2_Base: public detail::SwizzleProxy2<Vec2_Base<Data>,Data,0,1>

{

typedef Data DataType;

/// \brief The data with a multitude of access functions

union {

DataType m_data[2];

detail::SwizzleProxy1<Vec1_Base<Data>,Data,0> x, r;

detail::SwizzleProxy1<Vec1_Base<Data>,Data,1> y, g;

detail::SwizzleProxy2<Vec2_Base<Data>,Data,0,0> xx, rr;

detail::SwizzleProxy2<Vec2_Base<Data>,Data,0,1> xy, rg;

detail::SwizzleProxy2<Vec2_Base<Data>,Data,1,0> yx, gr;

detail::SwizzleProxy2<Vec2_Base<Data>,Data,1,1> yy, gg;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,0,0> xxx, rrr;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,0,1> xxy, rrg;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,1,0> xyx, rgr;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,0,1,1> xyy, rgg;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,0,0> yxx, grr;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,0,1> yxy, grg;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,1,0> yyx, ggr;

detail::SwizzleProxy3<Vec3_Base<Data>,Data,1,1,1> yyy, ggg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,0,0> xxxx, rrrr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,0,1> xxxy, rrrg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,1,0> xxyx, rrgr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,0,1,1> xxyy, rrgg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,0,0> xyxx, rgrr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,0,1> xyxy, rgrg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,1,0> xyyx, rggr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,0,1,1,1> xyyy, rggg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,0,0> yxxx, grrr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,0,1> yxxy, grrg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,1,0> yxyx, grgr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,0,1,1> yxyy, grgg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,0,0> yyxx, ggrr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,0,1> yyxy, ggrg;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,1,0> yyyx, gggr;

detail::SwizzleProxy4<Vec4_Base<Data>,Data,1,1,1,1> yyyy, gggg;

};

/// \brief Fast default construction without initialization

Vec2_Base() {}

/// \brief Construction from scalar

explicit Vec2_Base(Data _x) { m_data[0] = _x; m_data[1] = _x; }

/// \brief Construction from two elements

Vec2_Base(Data _x, Data _y) { m_data[0] = _x; m_data[1] = _y; }

/// \brief creation from swizzle type

template<typename VectorType, typename Data2, int A, int B>

Vec2_Base( const detail::SwizzleProxy2<VectorType,Data2,A,B>& _v ) { m_data[0] = _v[A]; m_data[1] = _v[B]; }

// Standard copy and assignment operator are defined as well

};

If the final class would not inherit from the proxy class operations on normal vectors would not succeed. Instead it would be necessary to write additional operators which take vector-swizzle, vecot-vector and swizzle-vector arguments but fortunately inheritance is much easier.

Then the union is filled with all access patterns up to vec4. As you can see these are 30 for a vec2. For a vec4 itself this number grows to 340 because there are four instead of two indices for each element.

Before the last constructor we would not be able to use all the nice swizzling stuff fluently. Calling move(position.zyx) would fail because .zyx is not a vector (assuming move would like to have a vector). The implicit cast generated through this constructor is rounding off the whole implementation.

Full Header: swizzle.7z
Currently the implementation lacks functions like normalization... They might follow later.

Drawing Beams

Spanning the Beam with a Geometry Shader

The easy idea where everything begun: Create a billboard in the geometry shader.

For an input of two vertices from a line compute a vector which is perpendicular to the line and the view direction. Add this new vector as offsets to the ends and emit a thin quad with two corners at on end and two at the other end. I did the offsetting directly in projection space to avoid the requirement for more matrices and transformations. This is not the common approach and might be confusing.

You can see the implementation in the code in the next section. I added a scaling * l1.w to the offset vector to make the beam zoom invariant. This is because I will use it to render selections, ... of objects and don't want the handles to vanish at far distances. Later I discovered that this causes problem two and I removed it.

The next step was to smooth the quad borders in pixel shader. Therefore I created "texture coordinates" and used them to compute the distance of a fragment to the beam border.

void main()
{
	// Fading along the line direction only close to the ends
	float fadeLong = max(0, abs(gs_TexCoord.x) * c_fFadeScale - c_fFadeOffset);
	fadeLong = max(0, 1 - fadeLong * fadeLong);
	// Maximal smooth fading about beem width
	float fadeShort = max(0, 1 - gs_TexCoord.y * gs_TexCoord.y);
	ps_FragColor = gs_Color * fadeShort * fadeLong;
}

void main()

{

// Fading along the line direction only close to the ends

float fadeLong = max(0, abs(gs_TexCoord.x) * c_fFadeScale - c_fFadeOffset);

fadeLong = max(0, 1 - fadeLong * fadeLong);

// Maximal smooth fading about beem width

float fadeShort = max(0, 1 - gs_TexCoord.y * gs_TexCoord.y);

ps_FragColor = gs_Color * fadeShort * fadeLong;

}

Vanishing Triangles through Projective Space

Everything worked fine and than I enlarged my grid...

It is a matter of fact that if a vertex of a triangle is behind the camera the full triangle is culled. This is because in perspective space a vertex behind the camera has undefined coordinates - the camera itself is a singularity. It has nothing to do with the clipping algorithm, which would work well if a vertex is between near plane and camera. To avoid the negative coordinates you need to do clipping of the triangle yourself before perspective division.

The clipping must do two things: Move the vertex and interpolate its data. The lines 24-34 are doing this. Since I would like to clip a ray it becomes easier than with triangles. I project the end points along the ray direction to the nearplane. For a triangle do that on each edge. Important: the direction is the full 4D vector in homogeneous space. To figure that out I required most of the time.

#version 330

#define OUT_VERTS 4

in vec4 vs_Color[2];
in vec3 vs_Position[2];

out vec4 gs_Color;
// x/w, y/w, 1/w for perspective correction
noperspective out vec3 gs_TexCoord;

layout(lines) in;
layout(triangle_strip, max_vertices = OUT_VERTS) out;

layout(std140) uniform ... // Shortened this listing here

void main(void)
{
	// Transform to projection space - the rest is done directly in screen space
	vec4 l1 = vec4(vs_out_Position[0], 1) * c_mWorldViewProjection;
	vec4 l2 = vec4(vs_out_Position[1], 1) * c_mWorldViewProjection;

	// Do a manual clipping.
	vec4 direction = normalize(l2 - l1);
	// Compute original distance for texture coordinate reprojection.
	float len = length(l2.xyz - l1.xyz) * 0.5;
	// Reproject end points to the near-thresholdplane
	if( l1.z < c_fNearPlane )
		l1 -= direction * (l1.z-c_fNearPlane) / direction.z;
	float tex1 = 1.0 - length(l2.xyz - l1.xyz) / len;
	if( l2.z < c_fNearPlane )
		l2 -= direction * (l2.z-c_fNearPlane) / direction.z;
	float tex2 = length(l2.xyz - l1.xyz) / len - 1.0;

	// Compute a vector perpendicular vector to create a beam
	vec2 dir = normalize(l2.xy / l2.w - l1.xy / l1.w);
	// Cross product with view direction
	vec4 perpendicular = vec4(-dir.y * 0.2, dir.x * 0.2, 0, 0);

	gs_Color = vs_out_Color[0];
	gl_Position = l1 + perpendicular * l1.w;
	gs_TexCoord = vec3(tex1 / l1.w, -1.0, 1.0 / l1.w);
	EmitVertex();

	gl_Position = l1 - perpendicular * l1.w;
	gs_TexCoord = vec3(tex1 / l1.w, 1.0, 1.0 / l1.w);
	EmitVertex();

	gs_Color = vs_out_Color[1];
	gl_Position = l2 + perpendicular * l2.w;
	gs_TexCoord = vec3(tex2 / l2.w, -1.0, 1.0 / l2.w);
	EmitVertex();

	gl_Position = l2 - perpendicular * l2.w;
	gs_TexCoord = vec3(tex2 / l2.w, 1.0, 1.0 / l2.w);
	EmitVertex();
	EndPrimitive();
}

#version 330

#define OUT_VERTS 4

in vec4 vs_Color[2];

in vec3 vs_Position[2];

out vec4 gs_Color;

// x/w, y/w, 1/w for perspective correction

noperspective out vec3 gs_TexCoord;

layout(lines) in;

layout(triangle_strip, max_vertices = OUT_VERTS) out;

layout(std140) uniform ... // Shortened this listing here

void main(void)

{

// Transform to projection space - the rest is done directly in screen space

vec4 l1 = vec4(vs_out_Position[0], 1) * c_mWorldViewProjection;

vec4 l2 = vec4(vs_out_Position[1], 1) * c_mWorldViewProjection;

// Do a manual clipping.

vec4 direction = normalize(l2 - l1);

// Compute original distance for texture coordinate reprojection.

float len = length(l2.xyz - l1.xyz) * 0.5;

// Reproject end points to the near-thresholdplane

if( l1.z < c_fNearPlane )

l1 -= direction * (l1.z-c_fNearPlane) / direction.z;

float tex1 = 1.0 - length(l2.xyz - l1.xyz) / len;

if( l2.z < c_fNearPlane )

l2 -= direction * (l2.z-c_fNearPlane) / direction.z;

float tex2 = length(l2.xyz - l1.xyz) / len - 1.0;

// Compute a vector perpendicular vector to create a beam

vec2 dir = normalize(l2.xy / l2.w - l1.xy / l1.w);

// Cross product with view direction

vec4 perpendicular = vec4(-dir.y * 0.2, dir.x * 0.2, 0, 0);

gs_Color = vs_out_Color[0];

gl_Position = l1 + perpendicular * l1.w;

gs_TexCoord = vec3(tex1 / l1.w, -1.0, 1.0 / l1.w);

EmitVertex();

gl_Position = l1 - perpendicular * l1.w;

gs_TexCoord = vec3(tex1 / l1.w, 1.0, 1.0 / l1.w);

EmitVertex();

gs_Color = vs_out_Color[1];

gl_Position = l2 + perpendicular * l2.w;

gs_TexCoord = vec3(tex2 / l2.w, -1.0, 1.0 / l2.w);

EmitVertex();

gl_Position = l2 - perpendicular * l2.w;

gs_TexCoord = vec3(tex2 / l2.w, 1.0, 1.0 / l2.w);

EmitVertex();

EndPrimitive();

}

Perspective Interpolation

Without perspective correction

For any data interpolated over a triangle in screen space artifacts appear. If you search for the heading you will find pictures illustrating the problem. The first picture on the right also shows what happens when doing non-corrected interpolation. To implement that use the qualifier: noperspective out vec3 gs_TexCoord; . The standard behavior or the qualifier smooth will let the hardware doing the correction. But then I got image two which was worse. Another problem was that in certain angles the lines faded out if not using corrected coordinates. So I tried to combine them.

To do perspective correct interpolation manually one can interpolate (u/w) , (v/w) and (1/w) instead and compute u = (u/w)/(1/w) in the pixel shader. This is what happens to the coordinates in the listing. I only correct the "long" direction of the ray and let the short uncorrected. The solution created stable lines but with the artifacts from picture one.

With perspective correction

During writing this post I discovered that my zoom invariant scaling caused this problem. I do not recommend using it. If you need it rather set c_fLineWidth to a value which depends on the object distance. I removed the scaling and the semi-correction after writing this.
Lesson learned: Write down what you did to see what you can improve.

Developerblog – Johannes Jendersie

Content Generators & Demos

Category Archives: Math

Numerical Recipes for Rendering

Rounding

1-x² and x²-y²

Quadratic Equation

ln(1+x) and exp(1+x,n)

Sum and Variance of Many Numbers

Final Thoughts

Performance Optimal Vector Swizzling in C++

The union Solution

The Swizzle-Proxy Template

The Final Vector Class

Drawing Beams

Spanning the Beam with a Geometry Shader

Vanishing Triangles through Projective Space

Perspective Interpolation