Transform Hierarchies ("Scene-graph")

A few days ago I was confronted with a publication of BitSquid on practical data orientated design, and one of the highlights for myself was the part about transform hierarchies. If you are not familiar with the article in question I recommend reading the article.
To summarize, their implementation separates the transform hierarchy into an isolated problem space and represents it in three key components: local poses, world poses, and relationship-ids. The local poses contain relative matrices to their respective parent. The world pose accumulates all matrices that are inherited. The relationship-id just contains an index stating whose world-pose to inherit from.

Like any pragmatic programmer should be, I was quite skeptical about the practicality of the solution for a few reasons. First one being that the solution proposed to unconditionally multiply the entire transformation hierarchy. Secondly, they conveniently forget to elaborate on the fact that in order to correctly transform the hierarchy, it needs to be sorted topologically, which unmodified is an N² problem.

Rather than speculating I decided to benchmark a derivative of their solution using ‘ideal test-lab’ conditions. At the time of writing I don’t have a representative scene so I presumed a fairly moderate number 6000 nodes (worst case scenario) divided over 3 hierarchy levels leaving 2000 nodes per level. Below are the results:

Implementation description	Profiled in ms	Frame Time in % ^{in retrospect to 60 fps.}
naive top. search	181ms	1091.18%
b-tree indexing top. search	5.311ms	31.87%
b-tree indexing top. search + inline apple b-tree index	3.813ms	22.43%
b-tree indexing top. search + inline apple b-tree index + custom quick-sort	1.819ms	10.91%
Actual matrix Transformations	1.337 ms	8.02

Note: that the profiled time includes any intermediate steps that were required i.e. sorting arrays etc. The recorded profile statistics are from an debug build.

Further optimizations are addressed by using a release build, which reduces to total time for sorting the topology to approximately 0.836 ms time and transforming the matrices to approximately 0.367 ms. Combined these tasks take an approximated 7.22% frame-time.

#include <System/Config.h>
#include <System/Math/Vector3.h>
#include <System/Math/Vector4.h>
#include <System/Math/Matrix4x4.h>
#include <System/Math/Primitives.h>
#include <System.Runtime\Services\Services.h>
#include <System/Console.h>
#include <System\IntrinsicX.h>
#include "SpatialDatabase.h"
using namespace aurora;
using namespace aurora::services;


#include <System\Math\Matrix4x4.h>
#include <stdio.h>
#include <stdlib.h>

namespace aurora
{
    class TransformHierarchyContainer
    {
    public:
        TransformHierarchyContainer( int size )
        {
            m_capacity = size;
            initialize();
        }

        void initialize()
        {       
            m_link = (aurora::int32*)malloc( sizeof(aurora::int32) * m_capacity  );
            m_forward_mapping = (aurora::int16*)malloc( sizeof(aurora::int16) * m_capacity  );
            m_reverse_mapping = (aurora::int16*)malloc( sizeof(aurora::int16) * m_capacity  );
            m_worldPose = (aurora::Matrix4x4*)malloc( sizeof(aurora::Matrix4x4) * m_capacity  );
            m_localPose = (aurora::Matrix4x4*)malloc( sizeof(aurora::Matrix4x4) * m_capacity  );                   
            m_dirty = (bool*)malloc( sizeof(bool) * m_capacity  );
            reset();
        }

        void reset()
        {
            m_partition = 0;
            for( int i = 0; i < m_capacity; i++ ) m_forward_mapping[i] = m_reverse_mapping[i] = i, m_dirty[i] = false;           
        }

        int  used() const
        {
            return m_partition;
        }

        uint32 allocate()
        {
            if( m_partition < m_capacity )
            {
                int index = m_forward_mapping[m_partition++];
                m_link[index] = -1; m_localPose[index] = m_worldPose[index] = Matrix4x4::Identity();
                m_dirty[index] = true;
                return index;
            }

            return 0x0;
        }

        void  free(uint32 reference)
        {
                       
            //Find the datastore indices, these corelated directly to the indices of the datastore
            int32 dataStoreParitionIndex  = m_partition - 1;
            int32 dataStoreCurrentIndex      = reference;
            int32 indexStoreCurrentIndex  = m_reverse_mapping[ dataStoreCurrentIndex ];
            int32 indexStoreParitionIndex = dataStoreParitionIndex;

               
            //Swap the two indices effectivly we move the last occupied element below the parition 
            //index registering it as free and we move the element from that location on the 'freed'
            //location.
            int16 A = m_forward_mapping[ indexStoreCurrentIndex  ];
            int16 B = m_forward_mapping[ indexStoreParitionIndex ];           
                      m_forward_mapping[ indexStoreCurrentIndex  ] = B;
                      m_forward_mapping[ indexStoreParitionIndex ] = A;


            //Swap our two values from the reverse lookup table. The reverse lookup table is an implicit
            //table lea
            int16 C = m_reverse_mapping[ A ];
            int16 D = m_reverse_mapping[ B ];           
            m_reverse_mapping[ A ] = D;
            m_reverse_mapping[ B ] = C;               
            m_partition--;
        }


        void  attach( int32 parentNode, int32 nodeId )
        {
            m_link[ nodeId ] = parentNode;
        }

        void  deattach( int32 parentNode, int32 nodeId)
        {   
            if( m_link[ nodeId ] == parentNode )
            m_link[ nodeId ] = -1;
        }

    public:
        aurora::int32 m_partition, m_capacity;
        aurora::int16* __restrict m_forward_mapping;
        aurora::int16* __restrict m_reverse_mapping;               
        aurora::Matrix4x4* __restrict m_worldPose;
        aurora::Matrix4x4* __restrict m_localPose;
        aurora::int32* __restrict m_link;
        bool* __restrict m_dirty; 
    };

}


TransformHierarchyContainer            m_Transforms(8192);
aurora::uint32                        m_Topological[8192];
aurora::int32                        m_Active;
aurora::int32                        m_LinksDirty;

NOALIAS inline void* HierarchyHalfPointIntervalSearch( register const void* __restrict key, const void *base0, size_t nmemb, register size_t size) 
{
    register const char *base = reinterpret_cast<const char*>(base0);
    register size_t lim;
    register int cmp;
    register const char* __restrict p;
    for (lim = nmemb; lim != 0; lim >>= 1) {
        p = base + (lim >> 1) * size;
        cmp = *reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p);               
        if (cmp == 0)
        {
            while( (*reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p)) == 0 && p > base0 ) p -= size;
            if( (*reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p)) != 0 ) p+= size;
            return ((void *)p);
        }

        if (cmp > 0) {    /* key > p: move right */
            base = (char *)p + size;
            lim--;
        }       
    }
    return (NULL);
}
NOALIAS inline void  HierarchyQuickSort(int32* __restrict values, const int& leftarg, const int& rightarg)
{
  if (leftarg < rightarg) 
  {    
    register int32 pivotvalue = values[(leftarg << 1 ) + 0];
    register int left = leftarg - 1;
    register int right = rightarg + 1;

    for(;;) {
      while (values[(--right << 1)] > pivotvalue);
      while (values[(++left << 1)] < pivotvalue);
      if (left >= right) break;
      uint64 temp = *reinterpret_cast<uint64*>( &values[(right << 1 )] );
      *reinterpret_cast<uint64*>( &values[(right << 1 ) ] ) = *reinterpret_cast<uint64*>( &values[(left << 1 ) ] );
      *reinterpret_cast<uint64*>( &values[(left << 1 ) ] ) = *reinterpret_cast<uint64*>( &temp );
    }

    int pivot = right;
    HierarchyQuickSort(values, leftarg, pivot);
    HierarchyQuickSort(values, pivot + 1, rightarg);
  }
}
NOALIAS inline void  HierarchySearchIndex( register int max, aurora::int32* __restrict parentIndices, aurora::int32* __restrict sortFeedback )
{   
    for( int i = 0; i < max; i++ ) {
        sortFeedback[ (i << 1) + 0 ] = parentIndices[i] ;
        sortFeedback[ (i << 1) + 1 ] = i;
    }

    HierarchyQuickSort( sortFeedback, 0, max - 1 );
}

NOALIAS inline void  HierarchyTransform(register int p, aurora::uint32* __restrict topologicalSortOrder, aurora::int32* __restrict m_link, aurora::Matrix4x4* __restrict m_localPose, aurora::Matrix4x4* __restrict m_worldPose)
{
    const int mask[] = { 0x00000000, 0xFFFFFFFF };
    for( int i = 0; i < p; i++ )
    {                       
        int n2 = i + (2 & mask[ (i + 2) < p ]);
        int n1 = i + (1 & mask[ (i + 1) < p ]);
        _mm_prefetch( (char*)&m_localPose[ topologicalSortOrder[n2] + 1 ], _MM_HINT_T0 );
        _mm_prefetch( (char*)&m_localPose[ topologicalSortOrder[n1] + 0 ], _MM_HINT_T0 );

        int l = topologicalSortOrder[i];
        if( m_link[l] == -1 )
            m_worldPose[l] = m_localPose[ l ];
        else
            m_worldPose[l] = m_worldPose[m_link[l]] * m_localPose[ l ];
    }
}   
NOALIAS inline int   HierarchyTopologySortForced(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{   
    _mm_prefetch( (char*)sortFeedback, _MM_HINT_T0 );
    _mm_prefetch( (char*)output, _MM_HINT_T0 );           
    register aurora::int32* __restrict upper = (aurora::int32*)ScratchPadService::Local().allocate( count * sizeof(aurora::int32) * 4);
    register aurora::int32* __restrict lower = upper;       
    register aurora::int32* __restrict upperFeedback = (sortFeedback + (count << 1));

    register int32 count2 = count, lvalue = 0, isdirty = false, partiton = 0;
    while( sortFeedback[0] == -1 ) *upper++ = *++sortFeedback, *sortFeedback++, count2--;       
    while( lower != upper ){       
        output[ partiton++ ] = lvalue = *lower++, isdirty = dirty[ lvalue ], dirty[ lvalue ] = false;           
        int32* loff = (int32*)HierarchyHalfPointIntervalSearch( &lvalue, sortFeedback, count2, sizeof(uint32) * 2);                                                
        int32* uoff = loff > 0 ? (loff + 2) : 0x0;
        while( loff != uoff ){
            dirty[ loff[ 1 ] ] |= isdirty;
            *upper++ = loff[ 1 ]; loff += 2;
            uoff += (((uoff + 2) <= upperFeedback) && (loff[ 0 ] == lvalue)) * 2;
        }                   
    }

    return partiton;
}

NOALIAS inline int   HierarchyTopologySort(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{
    _mm_prefetch( (char*)sortFeedback, _MM_HINT_T0 );
    _mm_prefetch( (char*)output, _MM_HINT_T0 );       
    register aurora::int32* __restrict upper = (aurora::int32*)ScratchPadService::Local().allocate( count * sizeof(aurora::int32) * 4);
    register aurora::int32* __restrict lower = upper;       
    register aurora::int32* __restrict upperFeedback = (sortFeedback + (count << 1));

    
    register int32 count2 = count, lvalue = 0, isdirty = false, partiton = 0;
    while( sortFeedback[0] == -1 ) *upper++ = *++sortFeedback, *sortFeedback++, count2--;       
    while( lower != upper ){               
        output[ partiton++ ] = lvalue = *lower++, isdirty = dirty[ lvalue ], dirty[ lvalue ] = false, partiton -= (isdirty ^ 1);           
        int32* loff = (int32*)HierarchyHalfPointIntervalSearch( &lvalue, sortFeedback, count2, sizeof(uint32) * 2);                                                
        int32* uoff = loff > 0 ? (loff + 2) : 0x0;
        while( loff != uoff ){
            int v = (((uoff) <= upperFeedback) && (loff[ 0 ] == lvalue));
            if( v > 0 ) {               
                dirty[ loff[ 1 ] ] |= isdirty;
                *upper++ = loff[ 1 ]; loff += 2; uoff += 2;
            }   
            else
            {
                break;
            }
        }   
    }

    return partiton;
}

NOALIAS inline void  HierarchyTopologyPrint(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{
    for( int i = 0; i < count; i++ ) {
        printf("%d %d\r\n", sortFeedback[1], sortFeedback[0]);
        sortFeedback += 2;
    }
}


void TransformCreate( aurora::uint32& handle, const aurora::Matrix4x4& matrix )
{
    handle = m_Transforms.allocate();
    m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ] = matrix;
    m_LinksDirty = true;
}

void TransformDelete( aurora::uint32 handle )
{
    m_Transforms.free( handle );
    m_LinksDirty = true;
}

void TransformReadLocal( aurora::uint32 handle, aurora::Matrix4x4& matrix )
{
    matrix = m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ];
}

void TransformReadGlobal( aurora::uint32 handle, aurora::Matrix4x4& matrix )
{
    if( m_Transforms.m_dirty[ m_Transforms.m_forward_mapping[ handle] ] ) {
        TransformUpdate();
    }

    matrix = m_Transforms.m_worldPose[ m_Transforms.m_forward_mapping[ handle] ];   
}

void TransformWrite( aurora::uint32 handle, const aurora::Matrix4x4& matrix )
{
    m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ] = matrix;
    m_Transforms.m_dirty[ m_Transforms.m_forward_mapping[ handle] ] = true;
    m_LinksDirty = true;
}


void TransformUpdate()
{
    if( m_LinksDirty )
    {
        aurora::int32*    sortFeedback = (aurora::int32*)ScratchPadService::Local().allocate(  m_Transforms.used() * sizeof(aurora::uint32) * 2 );
        aurora::uint32* topologyFeedback = (aurora::uint32*)ScratchPadService::Local().allocate(m_Transforms.used() * sizeof(aurora::uint32));       
        memset( sortFeedback, 0, m_Transforms.used() * sizeof(aurora::uint32) * 2);
        memset( topologyFeedback, 0, m_Transforms.used() * sizeof(aurora::uint32));   
        HierarchySearchIndex( m_Transforms.m_partition, m_Transforms.m_link, sortFeedback);
        m_Active = HierarchyTopologySort( m_Transforms.m_partition, topologyFeedback, sortFeedback, m_Transforms.m_dirty);   
        memcpy(m_Topological, topologyFeedback, m_Active * sizeof(aurora::uint32));
        ScratchPadService::Local().free( topologyFeedback );
        ScratchPadService::Local().free( sortFeedback );   
        m_LinksDirty = false;
    }   
    
    HierarchyTransform( m_Active, m_Topological, m_Transforms.m_link, m_Transforms.m_localPose, m_Transforms.m_worldPose );   
}

Travel, Food & Development

Search This Blog

Transform Hierarchies ("Scene-graph")

Labels

Comments

Post a Comment

Popular posts from this blog

Material & shader management

Exploring as a Digital Nomad

The 8 Best U.S. Cities to Visit for a Quick Vacation