Skip to main content

Transform Hierarchies ("Scene-graph")

A few days ago I was confronted with a publication of BitSquid on practical data orientated design, and one of the highlights for myself was the part about transform hierarchies. If you are not familiar with the article in question I recommend reading the article.
To summarize, their implementation separates the transform hierarchy into an isolated problem space and represents it in three key components: local poses, world poses, and relationship-ids. The local poses contain relative matrices to their respective parent. The world pose accumulates all matrices that are inherited. The relationship-id just contains an index stating whose world-pose to inherit from.

Like any pragmatic programmer should be, I was quite skeptical about the practicality of the solution for a few reasons. First one being that the solution proposed to unconditionally multiply the entire transformation hierarchy. Secondly, they conveniently forget to elaborate on the fact that in order to correctly transform the hierarchy, it needs to be sorted topologically, which unmodified is an N² problem.

Rather than speculating I decided to benchmark a derivative of their solution using ‘ideal test-lab’ conditions. At the time of writing I don’t have a representative scene so I presumed a fairly moderate number 6000 nodes (worst case scenario) divided over 3 hierarchy levels leaving 2000 nodes per level. Below are the results:

Implementation descriptionProfiled in msFrame Time in % in retrospect to 60 fps.
naive top. search181ms1091.18%
b-tree indexing top. search5.311ms31.87%
b-tree indexing top. search + inline apple b-tree index3.813ms22.43%
b-tree indexing top. search + inline apple b-tree index + custom quick-sort1.819ms10.91%
Actual matrix Transformations1.337 ms8.02

Note: that the profiled time includes any intermediate steps that were required i.e. sorting arrays etc. The recorded profile statistics are from an debug build.

Further optimizations are addressed by using a release build, which reduces to total time for sorting the topology to approximately 0.836 ms time and transforming the matrices to approximately 0.367 ms.  Combined these tasks take an approximated 7.22% frame-time.

#include <System/Config.h>
#include <System/Math/Vector3.h>
#include <System/Math/Vector4.h>
#include <System/Math/Matrix4x4.h>
#include <System/Math/Primitives.h>
#include <System.Runtime\Services\Services.h>
#include <System/Console.h>
#include <System\IntrinsicX.h>
#include "SpatialDatabase.h"
using namespace aurora;
using namespace aurora::services;


#include <System\Math\Matrix4x4.h>
#include <stdio.h>
#include <stdlib.h>

namespace aurora
{
    class TransformHierarchyContainer
    {
    public:
        TransformHierarchyContainer( int size )
        {
            m_capacity = size;
            initialize();
        }

        void initialize()
        {      
            m_link = (aurora::int32*)malloc( sizeof(aurora::int32) * m_capacity  );
            m_forward_mapping = (aurora::int16*)malloc( sizeof(aurora::int16) * m_capacity  );
            m_reverse_mapping = (aurora::int16*)malloc( sizeof(aurora::int16) * m_capacity  );
            m_worldPose = (aurora::Matrix4x4*)malloc( sizeof(aurora::Matrix4x4) * m_capacity  );
            m_localPose = (aurora::Matrix4x4*)malloc( sizeof(aurora::Matrix4x4) * m_capacity  );                  
            m_dirty = (bool*)malloc( sizeof(bool) * m_capacity  );
            reset();
        }

        void reset()
        {
            m_partition = 0;
            for( int i = 0; i < m_capacity; i++ ) m_forward_mapping[i] = m_reverse_mapping[i] = i, m_dirty[i] = false;          
        }

        int  used() const
        {
            return m_partition;
        }

        uint32 allocate()
        {
            if( m_partition < m_capacity )
            {
                int index = m_forward_mapping[m_partition++];
                m_link[index] = -1; m_localPose[index] = m_worldPose[index] = Matrix4x4::Identity();
                m_dirty[index] = true;
                return index;
            }

            return 0x0;
        }

        void  free(uint32 reference)
        {
                      
            //Find the datastore indices, these corelated directly to the indices of the datastore
            int32 dataStoreParitionIndex  = m_partition - 1;
            int32 dataStoreCurrentIndex      = reference;
            int32 indexStoreCurrentIndex  = m_reverse_mapping[ dataStoreCurrentIndex ];
            int32 indexStoreParitionIndex = dataStoreParitionIndex;

              
            //Swap the two indices effectivly we move the last occupied element below the parition
            //index registering it as free and we move the element from that location on the 'freed'
            //location.
            int16 A = m_forward_mapping[ indexStoreCurrentIndex  ];
            int16 B = m_forward_mapping[ indexStoreParitionIndex ];          
                      m_forward_mapping[ indexStoreCurrentIndex  ] = B;
                      m_forward_mapping[ indexStoreParitionIndex ] = A;


            //Swap our two values from the reverse lookup table. The reverse lookup table is an implicit
            //table lea
            int16 C = m_reverse_mapping[ A ];
            int16 D = m_reverse_mapping[ B ];          
            m_reverse_mapping[ A ] = D;
            m_reverse_mapping[ B ] = C;              
            m_partition--;
        }


        void  attach( int32 parentNode, int32 nodeId )
        {
            m_link[ nodeId ] = parentNode;
        }

        void  deattach( int32 parentNode, int32 nodeId)
        {  
            if( m_link[ nodeId ] == parentNode )
            m_link[ nodeId ] = -1;
        }

    public:
        aurora::int32 m_partition, m_capacity;
        aurora::int16* __restrict m_forward_mapping;
        aurora::int16* __restrict m_reverse_mapping;              
        aurora::Matrix4x4* __restrict m_worldPose;
        aurora::Matrix4x4* __restrict m_localPose;
        aurora::int32* __restrict m_link;
        bool* __restrict m_dirty;
    };

}


TransformHierarchyContainer            m_Transforms(8192);
aurora::uint32                        m_Topological[8192];
aurora::int32                        m_Active;
aurora::int32                        m_LinksDirty;

NOALIAS inline void* HierarchyHalfPointIntervalSearch( register const void* __restrict key, const void *base0, size_t nmemb, register size_t size)
{
    register const char *base = reinterpret_cast<const char*>(base0);
    register size_t lim;
    register int cmp;
    register const char* __restrict p;
    for (lim = nmemb; lim != 0; lim >>= 1) {
        p = base + (lim >> 1) * size;
        cmp = *reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p);              
        if (cmp == 0)
        {
            while( (*reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p)) == 0 && p > base0 ) p -= size;
            if( (*reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p)) != 0 ) p+= size;
            return ((void *)p);
        }

        if (cmp > 0) {    /* key > p: move right */
            base = (char *)p + size;
            lim--;
        }      
    }
    return (NULL);
}
NOALIAS inline void  HierarchyQuickSort(int32* __restrict values, const int& leftarg, const int& rightarg)
{
  if (leftarg < rightarg)
  {   
    register int32 pivotvalue = values[(leftarg << 1 ) + 0];
    register int left = leftarg - 1;
    register int right = rightarg + 1;

    for(;;) {
      while (values[(--right << 1)] > pivotvalue);
      while (values[(++left << 1)] < pivotvalue);
      if (left >= right) break;
      uint64 temp = *reinterpret_cast<uint64*>( &values[(right << 1 )] );
      *reinterpret_cast<uint64*>( &values[(right << 1 ) ] ) = *reinterpret_cast<uint64*>( &values[(left << 1 ) ] );
      *reinterpret_cast<uint64*>( &values[(left << 1 ) ] ) = *reinterpret_cast<uint64*>( &temp );
    }

    int pivot = right;
    HierarchyQuickSort(values, leftarg, pivot);
    HierarchyQuickSort(values, pivot + 1, rightarg);
  }
}
NOALIAS inline void  HierarchySearchIndex( register int max, aurora::int32* __restrict parentIndices, aurora::int32* __restrict sortFeedback )
{  
    for( int i = 0; i < max; i++ ) {
        sortFeedback[ (i << 1) + 0 ] = parentIndices[i] ;
        sortFeedback[ (i << 1) + 1 ] = i;
    }

    HierarchyQuickSort( sortFeedback, 0, max - 1 );
}

NOALIAS inline void  HierarchyTransform(register int p, aurora::uint32* __restrict topologicalSortOrder, aurora::int32* __restrict m_link, aurora::Matrix4x4* __restrict m_localPose, aurora::Matrix4x4* __restrict m_worldPose)
{
    const int mask[] = { 0x00000000, 0xFFFFFFFF };
    for( int i = 0; i < p; i++ )
    {                      
        int n2 = i + (2 & mask[ (i + 2) < p ]);
        int n1 = i + (1 & mask[ (i + 1) < p ]);
        _mm_prefetch( (char*)&m_localPose[ topologicalSortOrder[n2] + 1 ], _MM_HINT_T0 );
        _mm_prefetch( (char*)&m_localPose[ topologicalSortOrder[n1] + 0 ], _MM_HINT_T0 );

        int l = topologicalSortOrder[i];
        if( m_link[l] == -1 )
            m_worldPose[l] = m_localPose[ l ];
        else
            m_worldPose[l] = m_worldPose[m_link[l]] * m_localPose[ l ];
    }
}  
NOALIAS inline int   HierarchyTopologySortForced(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{  
    _mm_prefetch( (char*)sortFeedback, _MM_HINT_T0 );
    _mm_prefetch( (char*)output, _MM_HINT_T0 );          
    register aurora::int32* __restrict upper = (aurora::int32*)ScratchPadService::Local().allocate( count * sizeof(aurora::int32) * 4);
    register aurora::int32* __restrict lower = upper;      
    register aurora::int32* __restrict upperFeedback = (sortFeedback + (count << 1));

    register int32 count2 = count, lvalue = 0, isdirty = false, partiton = 0;
    while( sortFeedback[0] == -1 ) *upper++ = *++sortFeedback, *sortFeedback++, count2--;      
    while( lower != upper ){      
        output[ partiton++ ] = lvalue = *lower++, isdirty = dirty[ lvalue ], dirty[ lvalue ] = false;          
        int32* loff = (int32*)HierarchyHalfPointIntervalSearch( &lvalue, sortFeedback, count2, sizeof(uint32) * 2);                                               
        int32* uoff = loff > 0 ? (loff + 2) : 0x0;
        while( loff != uoff ){
            dirty[ loff[ 1 ] ] |= isdirty;
            *upper++ = loff[ 1 ]; loff += 2;
            uoff += (((uoff + 2) <= upperFeedback) && (loff[ 0 ] == lvalue)) * 2;
        }                  
    }

    return partiton;
}

NOALIAS inline int   HierarchyTopologySort(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{
    _mm_prefetch( (char*)sortFeedback, _MM_HINT_T0 );
    _mm_prefetch( (char*)output, _MM_HINT_T0 );      
    register aurora::int32* __restrict upper = (aurora::int32*)ScratchPadService::Local().allocate( count * sizeof(aurora::int32) * 4);
    register aurora::int32* __restrict lower = upper;      
    register aurora::int32* __restrict upperFeedback = (sortFeedback + (count << 1));

   
    register int32 count2 = count, lvalue = 0, isdirty = false, partiton = 0;
    while( sortFeedback[0] == -1 ) *upper++ = *++sortFeedback, *sortFeedback++, count2--;      
    while( lower != upper ){              
        output[ partiton++ ] = lvalue = *lower++, isdirty = dirty[ lvalue ], dirty[ lvalue ] = false, partiton -= (isdirty ^ 1);          
        int32* loff = (int32*)HierarchyHalfPointIntervalSearch( &lvalue, sortFeedback, count2, sizeof(uint32) * 2);                                               
        int32* uoff = loff > 0 ? (loff + 2) : 0x0;
        while( loff != uoff ){
            int v = (((uoff) <= upperFeedback) && (loff[ 0 ] == lvalue));
            if( v > 0 ) {              
                dirty[ loff[ 1 ] ] |= isdirty;
                *upper++ = loff[ 1 ]; loff += 2; uoff += 2;
            }  
            else
            {
                break;
            }
        }  
    }

    return partiton;
}

NOALIAS inline void  HierarchyTopologyPrint(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{
    for( int i = 0; i < count; i++ ) {
        printf("%d %d\r\n", sortFeedback[1], sortFeedback[0]);
        sortFeedback += 2;
    }
}


void TransformCreate( aurora::uint32& handle, const aurora::Matrix4x4& matrix )
{
    handle = m_Transforms.allocate();
    m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ] = matrix;
    m_LinksDirty = true;
}

void TransformDelete( aurora::uint32 handle )
{
    m_Transforms.free( handle );
    m_LinksDirty = true;
}

void TransformReadLocal( aurora::uint32 handle, aurora::Matrix4x4& matrix )
{
    matrix = m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ];
}

void TransformReadGlobal( aurora::uint32 handle, aurora::Matrix4x4& matrix )
{
    if( m_Transforms.m_dirty[ m_Transforms.m_forward_mapping[ handle] ] ) {
        TransformUpdate();
    }

    matrix = m_Transforms.m_worldPose[ m_Transforms.m_forward_mapping[ handle] ];  
}

void TransformWrite( aurora::uint32 handle, const aurora::Matrix4x4& matrix )
{
    m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ] = matrix;
    m_Transforms.m_dirty[ m_Transforms.m_forward_mapping[ handle] ] = true;
    m_LinksDirty = true;
}


void TransformUpdate()
{
    if( m_LinksDirty )
    {
        aurora::int32*    sortFeedback = (aurora::int32*)ScratchPadService::Local().allocate(  m_Transforms.used() * sizeof(aurora::uint32) * 2 );
        aurora::uint32* topologyFeedback = (aurora::uint32*)ScratchPadService::Local().allocate(m_Transforms.used() * sizeof(aurora::uint32));      
        memset( sortFeedback, 0, m_Transforms.used() * sizeof(aurora::uint32) * 2);
        memset( topologyFeedback, 0, m_Transforms.used() * sizeof(aurora::uint32));  
        HierarchySearchIndex( m_Transforms.m_partition, m_Transforms.m_link, sortFeedback);
        m_Active = HierarchyTopologySort( m_Transforms.m_partition, topologyFeedback, sortFeedback, m_Transforms.m_dirty);  
        memcpy(m_Topological, topologyFeedback, m_Active * sizeof(aurora::uint32));
        ScratchPadService::Local().free( topologyFeedback );
        ScratchPadService::Local().free( sortFeedback );  
        m_LinksDirty = false;
    }  
   
    HierarchyTransform( m_Active, m_Topological, m_Transforms.m_link, m_Transforms.m_localPose, m_Transforms.m_worldPose );  
}

Comments

Popular posts from this blog

Material & shader management

In the upcoming changes in my editor I implemented the material system inspired on  Frostbite engine of DICE, binaries are download-able on the project page. Also I've implemented an conversion tool and file-format for future mesh formats using Assimp.

Asian food culture

When you think about Asian foods of course you might be thinking about those famous dishes that have made it into the western society like Sushi, Nasi or Bami.

The 8 Best U.S. Cities to Visit for a Quick Vacation

The best thing about visiting a new city is experiencing the thrill of adventure. From delicious food to rich history, there’s always something new to do. Whether you live close to these cities or you’re planning on making a trip to the USA, here's 8 of the best U.S. cities to visit on your next vacation (in no particular order): 1. Portland, Oregon As Oregon’s largest city, Portland has steadily been on the rise as a hotspot for food and beer connoisseurs. It’s nestled between the Columbia and Willamette Rivers with a stunning view of snowy Mount Hood which only adds to the thriving artistic culture. Portland is also home to beautiful parks, bridges and bike paths, making this city a top choice for outdoor adventurists. If you’re looking for more breathtaking escapades, Portland is nearby to a few national forests including Mount Hood National Forest and Gifford Pinchot National Forest. 2. Nashville, Tennessee Nashville rightfully owns...