A few days ago I was confronted with a publication of BitSquid on practical data orientated design, and one of the highlights for myself was the part about transform hierarchies. If you are not familiar with the article in question I recommend reading the article.
To summarize, their implementation separates the transform hierarchy into an isolated problem space and represents it in three key components: local poses, world poses, and relationship-ids. The local poses contain relative matrices to their respective parent. The world pose accumulates all matrices that are inherited. The relationship-id just contains an index stating whose world-pose to inherit from.
Like any pragmatic programmer should be, I was quite skeptical about the practicality of the solution for a few reasons. First one being that the solution proposed to unconditionally multiply the entire transformation hierarchy. Secondly, they conveniently forget to elaborate on the fact that in order to correctly transform the hierarchy, it needs to be sorted topologically, which unmodified is an N² problem.
Rather than speculating I decided to benchmark a derivative of their solution using ‘ideal test-lab’ conditions. At the time of writing I don’t have a representative scene so I presumed a fairly moderate number 6000 nodes (worst case scenario) divided over 3 hierarchy levels leaving 2000 nodes per level. Below are the results:
Note: that the profiled time includes any intermediate steps that were required i.e. sorting arrays etc. The recorded profile statistics are from an debug build.
Further optimizations are addressed by using a release build, which reduces to total time for sorting the topology to approximately 0.836 ms time and transforming the matrices to approximately 0.367 ms. Combined these tasks take an approximated 7.22% frame-time.
To summarize, their implementation separates the transform hierarchy into an isolated problem space and represents it in three key components: local poses, world poses, and relationship-ids. The local poses contain relative matrices to their respective parent. The world pose accumulates all matrices that are inherited. The relationship-id just contains an index stating whose world-pose to inherit from.
Like any pragmatic programmer should be, I was quite skeptical about the practicality of the solution for a few reasons. First one being that the solution proposed to unconditionally multiply the entire transformation hierarchy. Secondly, they conveniently forget to elaborate on the fact that in order to correctly transform the hierarchy, it needs to be sorted topologically, which unmodified is an N² problem.
Rather than speculating I decided to benchmark a derivative of their solution using ‘ideal test-lab’ conditions. At the time of writing I don’t have a representative scene so I presumed a fairly moderate number 6000 nodes (worst case scenario) divided over 3 hierarchy levels leaving 2000 nodes per level. Below are the results:
Implementation description | Profiled in ms | Frame Time in % in retrospect to 60 fps. |
naive top. search | 181ms | 1091.18% |
b-tree indexing top. search | 5.311ms | 31.87% |
b-tree indexing top. search + inline apple b-tree index | 3.813ms | 22.43% |
b-tree indexing top. search + inline apple b-tree index + custom quick-sort | 1.819ms | 10.91% |
Actual matrix Transformations | 1.337 ms | 8.02 |
Note: that the profiled time includes any intermediate steps that were required i.e. sorting arrays etc. The recorded profile statistics are from an debug build.
Further optimizations are addressed by using a release build, which reduces to total time for sorting the topology to approximately 0.836 ms time and transforming the matrices to approximately 0.367 ms. Combined these tasks take an approximated 7.22% frame-time.
#include <System/Config.h>
#include <System/Math/Vector3.h>
#include <System/Math/Vector4.h>
#include <System/Math/Matrix4x4.h>
#include <System/Math/Primitives.h>
#include <System.Runtime\Services\Services.h>
#include <System/Console.h>
#include <System\IntrinsicX.h>
#include "SpatialDatabase.h"
using namespace aurora;
using namespace aurora::services;
#include <System\Math\Matrix4x4.h>
#include <stdio.h>
#include <stdlib.h>
namespace aurora
{
class TransformHierarchyContainer
{
public:
TransformHierarchyContainer( int size )
{
m_capacity = size;
initialize();
}
void initialize()
{
m_link = (aurora::int32*)malloc( sizeof(aurora::int32) * m_capacity );
m_forward_mapping = (aurora::int16*)malloc( sizeof(aurora::int16) * m_capacity );
m_reverse_mapping = (aurora::int16*)malloc( sizeof(aurora::int16) * m_capacity );
m_worldPose = (aurora::Matrix4x4*)malloc( sizeof(aurora::Matrix4x4) * m_capacity );
m_localPose = (aurora::Matrix4x4*)malloc( sizeof(aurora::Matrix4x4) * m_capacity );
m_dirty = (bool*)malloc( sizeof(bool) * m_capacity );
reset();
}
void reset()
{
m_partition = 0;
for( int i = 0; i < m_capacity; i++ ) m_forward_mapping[i] = m_reverse_mapping[i] = i, m_dirty[i] = false;
}
int used() const
{
return m_partition;
}
uint32 allocate()
{
if( m_partition < m_capacity )
{
int index = m_forward_mapping[m_partition++];
m_link[index] = -1; m_localPose[index] = m_worldPose[index] = Matrix4x4::Identity();
m_dirty[index] = true;
return index;
}
return 0x0;
}
void free(uint32 reference)
{
//Find the datastore indices, these corelated directly to the indices of the datastore
int32 dataStoreParitionIndex = m_partition - 1;
int32 dataStoreCurrentIndex = reference;
int32 indexStoreCurrentIndex = m_reverse_mapping[ dataStoreCurrentIndex ];
int32 indexStoreParitionIndex = dataStoreParitionIndex;
//Swap the two indices effectivly we move the last occupied element below the parition
//index registering it as free and we move the element from that location on the 'freed'
//location.
int16 A = m_forward_mapping[ indexStoreCurrentIndex ];
int16 B = m_forward_mapping[ indexStoreParitionIndex ];
m_forward_mapping[ indexStoreCurrentIndex ] = B;
m_forward_mapping[ indexStoreParitionIndex ] = A;
//Swap our two values from the reverse lookup table. The reverse lookup table is an implicit
//table lea
int16 C = m_reverse_mapping[ A ];
int16 D = m_reverse_mapping[ B ];
m_reverse_mapping[ A ] = D;
m_reverse_mapping[ B ] = C;
m_partition--;
}
void attach( int32 parentNode, int32 nodeId )
{
m_link[ nodeId ] = parentNode;
}
void deattach( int32 parentNode, int32 nodeId)
{
if( m_link[ nodeId ] == parentNode )
m_link[ nodeId ] = -1;
}
public:
aurora::int32 m_partition, m_capacity;
aurora::int16* __restrict m_forward_mapping;
aurora::int16* __restrict m_reverse_mapping;
aurora::Matrix4x4* __restrict m_worldPose;
aurora::Matrix4x4* __restrict m_localPose;
aurora::int32* __restrict m_link;
bool* __restrict m_dirty;
};
}
TransformHierarchyContainer m_Transforms(8192);
aurora::uint32 m_Topological[8192];
aurora::int32 m_Active;
aurora::int32 m_LinksDirty;
NOALIAS inline void* HierarchyHalfPointIntervalSearch( register const void* __restrict key, const void *base0, size_t nmemb, register size_t size)
{
register const char *base = reinterpret_cast<const char*>(base0);
register size_t lim;
register int cmp;
register const char* __restrict p;
for (lim = nmemb; lim != 0; lim >>= 1) {
p = base + (lim >> 1) * size;
cmp = *reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p);
if (cmp == 0)
{
while( (*reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p)) == 0 && p > base0 ) p -= size;
if( (*reinterpret_cast<const int*>(key) - *reinterpret_cast<const int*>(p)) != 0 ) p+= size;
return ((void *)p);
}
if (cmp > 0) { /* key > p: move right */
base = (char *)p + size;
lim--;
}
}
return (NULL);
}
NOALIAS inline void HierarchyQuickSort(int32* __restrict values, const int& leftarg, const int& rightarg)
{
if (leftarg < rightarg)
{
register int32 pivotvalue = values[(leftarg << 1 ) + 0];
register int left = leftarg - 1;
register int right = rightarg + 1;
for(;;) {
while (values[(--right << 1)] > pivotvalue);
while (values[(++left << 1)] < pivotvalue);
if (left >= right) break;
uint64 temp = *reinterpret_cast<uint64*>( &values[(right << 1 )] );
*reinterpret_cast<uint64*>( &values[(right << 1 ) ] ) = *reinterpret_cast<uint64*>( &values[(left << 1 ) ] );
*reinterpret_cast<uint64*>( &values[(left << 1 ) ] ) = *reinterpret_cast<uint64*>( &temp );
}
int pivot = right;
HierarchyQuickSort(values, leftarg, pivot);
HierarchyQuickSort(values, pivot + 1, rightarg);
}
}
NOALIAS inline void HierarchySearchIndex( register int max, aurora::int32* __restrict parentIndices, aurora::int32* __restrict sortFeedback )
{
for( int i = 0; i < max; i++ ) {
sortFeedback[ (i << 1) + 0 ] = parentIndices[i] ;
sortFeedback[ (i << 1) + 1 ] = i;
}
HierarchyQuickSort( sortFeedback, 0, max - 1 );
}
NOALIAS inline void HierarchyTransform(register int p, aurora::uint32* __restrict topologicalSortOrder, aurora::int32* __restrict m_link, aurora::Matrix4x4* __restrict m_localPose, aurora::Matrix4x4* __restrict m_worldPose)
{
const int mask[] = { 0x00000000, 0xFFFFFFFF };
for( int i = 0; i < p; i++ )
{
int n2 = i + (2 & mask[ (i + 2) < p ]);
int n1 = i + (1 & mask[ (i + 1) < p ]);
_mm_prefetch( (char*)&m_localPose[ topologicalSortOrder[n2] + 1 ], _MM_HINT_T0 );
_mm_prefetch( (char*)&m_localPose[ topologicalSortOrder[n1] + 0 ], _MM_HINT_T0 );
int l = topologicalSortOrder[i];
if( m_link[l] == -1 )
m_worldPose[l] = m_localPose[ l ];
else
m_worldPose[l] = m_worldPose[m_link[l]] * m_localPose[ l ];
}
}
NOALIAS inline int HierarchyTopologySortForced(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{
_mm_prefetch( (char*)sortFeedback, _MM_HINT_T0 );
_mm_prefetch( (char*)output, _MM_HINT_T0 );
register aurora::int32* __restrict upper = (aurora::int32*)ScratchPadService::Local().allocate( count * sizeof(aurora::int32) * 4);
register aurora::int32* __restrict lower = upper;
register aurora::int32* __restrict upperFeedback = (sortFeedback + (count << 1));
register int32 count2 = count, lvalue = 0, isdirty = false, partiton = 0;
while( sortFeedback[0] == -1 ) *upper++ = *++sortFeedback, *sortFeedback++, count2--;
while( lower != upper ){
output[ partiton++ ] = lvalue = *lower++, isdirty = dirty[ lvalue ], dirty[ lvalue ] = false;
int32* loff = (int32*)HierarchyHalfPointIntervalSearch( &lvalue, sortFeedback, count2, sizeof(uint32) * 2);
int32* uoff = loff > 0 ? (loff + 2) : 0x0;
while( loff != uoff ){
dirty[ loff[ 1 ] ] |= isdirty;
*upper++ = loff[ 1 ]; loff += 2;
uoff += (((uoff + 2) <= upperFeedback) && (loff[ 0 ] == lvalue)) * 2;
}
}
return partiton;
}
NOALIAS inline int HierarchyTopologySort(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{
_mm_prefetch( (char*)sortFeedback, _MM_HINT_T0 );
_mm_prefetch( (char*)output, _MM_HINT_T0 );
register aurora::int32* __restrict upper = (aurora::int32*)ScratchPadService::Local().allocate( count * sizeof(aurora::int32) * 4);
register aurora::int32* __restrict lower = upper;
register aurora::int32* __restrict upperFeedback = (sortFeedback + (count << 1));
register int32 count2 = count, lvalue = 0, isdirty = false, partiton = 0;
while( sortFeedback[0] == -1 ) *upper++ = *++sortFeedback, *sortFeedback++, count2--;
while( lower != upper ){
output[ partiton++ ] = lvalue = *lower++, isdirty = dirty[ lvalue ], dirty[ lvalue ] = false, partiton -= (isdirty ^ 1);
int32* loff = (int32*)HierarchyHalfPointIntervalSearch( &lvalue, sortFeedback, count2, sizeof(uint32) * 2);
int32* uoff = loff > 0 ? (loff + 2) : 0x0;
while( loff != uoff ){
int v = (((uoff) <= upperFeedback) && (loff[ 0 ] == lvalue));
if( v > 0 ) {
dirty[ loff[ 1 ] ] |= isdirty;
*upper++ = loff[ 1 ]; loff += 2; uoff += 2;
}
else
{
break;
}
}
}
return partiton;
}
NOALIAS inline void HierarchyTopologyPrint(register aurora::int32 count, aurora::uint32* __restrict output, aurora::int32* __restrict sortFeedback, bool* __restrict dirty)
{
for( int i = 0; i < count; i++ ) {
printf("%d %d\r\n", sortFeedback[1], sortFeedback[0]);
sortFeedback += 2;
}
}
void TransformCreate( aurora::uint32& handle, const aurora::Matrix4x4& matrix )
{
handle = m_Transforms.allocate();
m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ] = matrix;
m_LinksDirty = true;
}
void TransformDelete( aurora::uint32 handle )
{
m_Transforms.free( handle );
m_LinksDirty = true;
}
void TransformReadLocal( aurora::uint32 handle, aurora::Matrix4x4& matrix )
{
matrix = m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ];
}
void TransformReadGlobal( aurora::uint32 handle, aurora::Matrix4x4& matrix )
{
if( m_Transforms.m_dirty[ m_Transforms.m_forward_mapping[ handle] ] ) {
TransformUpdate();
}
matrix = m_Transforms.m_worldPose[ m_Transforms.m_forward_mapping[ handle] ];
}
void TransformWrite( aurora::uint32 handle, const aurora::Matrix4x4& matrix )
{
m_Transforms.m_localPose[ m_Transforms.m_forward_mapping[ handle] ] = matrix;
m_Transforms.m_dirty[ m_Transforms.m_forward_mapping[ handle] ] = true;
m_LinksDirty = true;
}
void TransformUpdate()
{
if( m_LinksDirty )
{
aurora::int32* sortFeedback = (aurora::int32*)ScratchPadService::Local().allocate( m_Transforms.used() * sizeof(aurora::uint32) * 2 );
aurora::uint32* topologyFeedback = (aurora::uint32*)ScratchPadService::Local().allocate(m_Transforms.used() * sizeof(aurora::uint32));
memset( sortFeedback, 0, m_Transforms.used() * sizeof(aurora::uint32) * 2);
memset( topologyFeedback, 0, m_Transforms.used() * sizeof(aurora::uint32));
HierarchySearchIndex( m_Transforms.m_partition, m_Transforms.m_link, sortFeedback);
m_Active = HierarchyTopologySort( m_Transforms.m_partition, topologyFeedback, sortFeedback, m_Transforms.m_dirty);
memcpy(m_Topological, topologyFeedback, m_Active * sizeof(aurora::uint32));
ScratchPadService::Local().free( topologyFeedback );
ScratchPadService::Local().free( sortFeedback );
m_LinksDirty = false;
}
HierarchyTransform( m_Active, m_Topological, m_Transforms.m_link, m_Transforms.m_localPose, m_Transforms.m_worldPose );
}
Comments
Post a Comment