GAL (C++ library)

Please take a look over here: GAL github repository

The README summarizes some of the main motivations of the library and some of the advantages I’m shooting for.

Perhaps a bit more regarding implementation detail:

  1. The library fully separates the data dependencies of an expression’s evaluation from the expression itself. This allows the expression to be reduced at compile time so that representations can remain sparse, and it eliminates spurious operations that compilers generally cannot eliminate on their own due to the nature of floating-point arithmetic.
  2. The library supports a general (p, q, r) metric signature, although most of my testing thus far has been in P(R^*_2) and P(R^*_3).
  3. The computational backend can be swapped out, which I intend to exploit in the future to support generation of shader routines for GPU acceleration (or even HDL support, but SPIR-V is the first target).
  4. The library supports field elements of arbitrary order, so, for example, it supports automatic differentiation using the dual (order 2) unit as well as other more exotic number systems (a minimal sketch of the dual-number idea follows below).
  5. I haven’t gotten to it yet, but I intend to support constraints on the polynomial generators to accelerate the computation even more. Currently, polynomial reduction does not require division, but the monomials are totally ordered and amenable to standard computational algebraic techniques.
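
To expand on the dual (order 2) unit in point 4: here is a minimal, standalone sketch of how a unit ε with ε² = 0 yields forward-mode automatic differentiation. This only illustrates the underlying idea and is not GAL’s actual number type or API.

```cpp
#include <cstdio>

// A dual number a + a' * eps with eps^2 == 0; the eps coefficient carries the derivative.
struct dual {
    double value;
    double deriv;
};

constexpr dual operator+(dual a, dual b) { return {a.value + b.value, a.deriv + b.deriv}; }

constexpr dual operator*(dual a, dual b) {
    // (a + a' eps) * (b + b' eps) = a*b + (a*b' + a'*b) eps, since the eps^2 term vanishes
    return {a.value * b.value, a.value * b.deriv + a.deriv * b.value};
}

int main() {
    dual x{3.0, 1.0};   // seed dx/dx = 1
    dual f = x * x + x; // f(x) = x^2 + x
    std::printf("f(3) = %g, f'(3) = %g\n", f.value, f.deriv); // prints 12 and 7
}
```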

The library itself has been in development for ~1.5 weeks so please don’t expect too much, but I figured I’d post it here for early feedback/comments/suggestions. Time permitting, I intend to pare down my renderer at home to be suitable as a demo/viewing application. This would run natively in C++, most likely via Vulkan for platform compatibility.

Thanks all! Really digging the community here so far.

5 Likes

Hi @ninepoints !

Welcome to the forum! Having seen some of the machine code this library produces early on (some of the best I’ve seen), I am very excited to see where this will go, especially w.r.t. GPU-friendly code generation. It is really great to see these new libraries, and I encourage others who are working on their own libraries to share their work as well.

I will get together with @hadfield.hugo, @alex, @LeoDorst, Joan Lasenby, and Hildebrandt, and formalize a recent meeting we had in Cambridge as a forum post, in an effort to standardize the operators, the Witt basis, etc., and avoid these things differing between libraries.

Keep us posted !

E.

2 Likes

One interesting weakness I’m finding in the generalized motor product code is that it is difficult to make icache-friendly, and I’m currently brainstorming ways to address this.

In a 4x4 matrix multiplication, there are easy loop sequences that a compiler can see and autovectorize where possible. Furthermore, the actual size of the code is small, so after a jump, it’s very likely to fit entirely in the icache with no thrashing.
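
For reference, the kind of loop I mean is nothing more exotic than the textbook triple loop below (an illustrative snippet, not code from the library); the trip counts and access pattern are fully visible to the compiler, so it can unroll and vectorize as it sees fit.

```cpp
// Row-major 4x4 multiply, out = a * b: simple enough for compilers to autovectorize.
void mat4_mul(const float a[4][4], const float b[4][4], float out[4][4]) {
    for (int i = 0; i != 4; ++i) {
        for (int j = 0; j != 4; ++j) {
            float acc = 0.0f;
            for (int k = 0; k != 4; ++k)
                acc += a[i][k] * b[k][j];
            out[i][j] = acc;
        }
    }
}
```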

In contrast, a versor product is very difficult for a compiler to optimize, and each individual multiply and add is very likely to generate its own instruction. Thus, on most platforms it’s quite likely that even if the number of operations decreases, the actual performance will be worse than with the matrix formulation.

Of course, we can always turn the versor product into a matrix if necessary, but I intend to search for a better, more general approach. After all, if achieving competitive performance dictates a “special” implementation for each operation, we end up losing much of the advantage of using GA in the first place.

This kind of optimization will be most crucial when using GPUs for acceleration. Larger shader programs increase the cost/latency of pipeline switching, which is a concern for any realtime renderer managing a lot of materials/draw state. Furthermore, per-core icaches can be more restrictive, which may limit ultimate chip occupancy at full load.

To be continued…

edit:

Putting some numbers to this, I can get a single 4x4-by-4x4 multiply down to 16 SSE muls, 12 additions, and 4 broadcasts per row. The loop is actually unrolled in this case, and we have 67 instructions (including the initial loads). My optimized code so far for the same operation in PGA is also autovectorized by the compiler but results in 117 instructions including loads, so there is some work to be done.
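
For the curious, one standard way to hit those counts is the row-broadcast kernel sketched below (the textbook SSE formulation, not necessarily the exact code the compiler produced in my tests): each output row takes 4 broadcasts, 4 multiplies, and 3 adds, i.e. 16 multiplies and 12 additions over the whole matrix.

```cpp
#include <xmmintrin.h>

// out = a * b for row-major 4x4 matrices, one SSE register per output row.
void mat4_mul_sse(const float* a, const float* b, float* out) {
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i != 4; ++i) {
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(out + 4 * i, r);
    }
}
```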

On second thought, I may not be comparing apples to apples here. A general dual quaternion represented as a matrix multiplication ends up being an 8x8 when done naively, and often work needs to be done to actually perform the conversion (this has to happen every frame, assuming smooth interpolation is needed).
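
Spelling out the accounting behind that 8x8 (my own reading of the claim above): a dual quaternion splits as

$$
q = q_r + \varepsilon q_d, \qquad \varepsilon^2 = 0,
$$

so composing with another dual quaternion $p = p_r + \varepsilon p_d$ gives

$$
q\,p = q_r p_r + \varepsilon\,(q_r p_d + q_d p_r)
\;\Longleftrightarrow\;
\begin{pmatrix} L(q_r) & 0 \\ L(q_d) & L(q_r) \end{pmatrix}
\begin{pmatrix} p_r \\ p_d \end{pmatrix},
$$

where $L(\cdot)$ is the usual 4x4 left-multiplication matrix of a quaternion, i.e. an 8x8 acting on the eight real components of $p$ when done naively.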

Thus I may be closer to striking distance than I thought…

Hi @ninepoints!

Your approach looks really promising.
I hope we can see more soon.

Thanks for sharing this with us :slight_smile: .

1 Like

I’m working on this way too late but it’s fun :slight_smile:

Here’s my update so far:

  1. I’ve now tested CGA more rigorously (previously I was not handling the change of basis properly when computing geometric products).
  2. The IK test from the ga-benchmark library has been at least partially implemented, and I can verify that my results match with good precision.
  3. To prevent overflow during expression generation without losing too much precision, I decompose irreducible ratios using lower and upper bounds of the mediant. Based on parity (possibly slightly biased, but not by much), I choose the upper or lower bound so as not to introduce systematic rounding. In addition, I bound the error to 1e-7, which is about the best precision floating point can offer in the single-digit range, so no precision is “lost” beyond what floating point would have lost anyway. In most cases, reduction over the rationals will produce better numerical stability (a rough sketch of the mediant idea follows after this list).
  4. The engine interface is slightly improved, but I have plans to improve its ergonomics further. The main additional burden on the user thus far is deciding how much computation to pack into an expression and when to actually reify results. Reifying an entire chain of results is obviously optimal but sacrifices compile time, so providing this flexibility is important. Doing this automatically would be ideal, but I will defer that to later (C++2a lacks the required functionality without a lot of macro nonsense).
  5. Some operators were standardized to resemble the behavior of other implementations, and the symmetric inner product is there too via the | operator. The regressive product is now &, as might be expected.
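
To make point 3 a bit more concrete, here is a generic sketch of a mediant-based rational approximation bounded to 1e-7 error. This is the textbook Stern-Brocot-style search written from scratch for illustration (it assumes a positive target), not GAL’s actual reduction code.

```cpp
#include <cstdint>
#include <cstdio>
#include <cmath>

struct rational {
    std::int64_t num;
    std::int64_t den;
};

// Refine lower and upper rational bounds by their mediant,
// (num_lo + num_hi) / (den_lo + den_hi), until the result is within max_error.
rational approximate(double target, double max_error = 1e-7) {
    rational lo{0, 1}; // 0
    rational hi{1, 0}; // +infinity
    for (;;) {
        const rational mid{lo.num + hi.num, lo.den + hi.den};
        const double value = static_cast<double>(mid.num) / static_cast<double>(mid.den);
        if (std::fabs(value - target) <= max_error) return mid;
        (value < target ? lo : hi) = mid;
    }
}

int main() {
    const rational r = approximate(0.30103); // e.g. an approximation of log10(2)
    std::printf("%lld / %lld\n", static_cast<long long>(r.num), static_cast<long long>(r.den));
}
```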

No benchmark as of yet but I’m working hard to be able to actually test it.

edit: join -> regressive product, since its function of either joining or meeting depends on the dual or undual point of view (thanks @enki)

1 Like

It’s been a while since I’ve updated, but the library has already undergone not 1, not 2, but 3 rewrites in order to circumvent a swath of problems I was having getting the C++ compiler to behave. Using the “SpecializedAlgorithmInverseKinematics.hpp” benchmark from ga-benchmark, I can achieve a 2x speedup over versor (2.5x with ffast-math and FMA enabled for both libraries), due largely to the library’s ability to both represent data as sparsely as possible and permit more aggressive CSE by the compiler. For clarity, all compiler flags were identical across the runs.

The hardest thing was actually keeping compile times down and memory usage low. The benchmark still takes 50s on my machine to compile, but this is because there is no cheating going on (the full exponential is evaluated, etc.).

2 Likes

Oh, and word to the wise: there are still a number of untested areas of the library (mainly CGA3 and EGA are tested right now), so let me know if you intend to use the library and I’ll ensure things are working correctly.

https://www.jeremyong.com/gal/benchmarks/

I’ve published benchmarks at the site above for review. I’ve also documented a number of things I’ve done to make the code faster here in case anyone finds it helpful.

Now that it has a site, I won’t be spamming this forum anymore with project details unless I think I’ve come across something others might find useful for their own work as well.

Cheers,
ninepoints

2 Likes

Not spam, I have been reading these updates.

3 Likes

Agreed! I took a look at the library too.

1 Like

It’s been a while since the last update, but I wanted to give a brief update on progress that remains unreleased (this thread is now my blog lol).

First, progress has been slow due to work obligations and catching some sort of cold. That said, I’m (hopefully) close to an update that brings a round of upgrades to the library. In no particular order:

  1. There is a new abstraction layer that performs expression simplification at the geometric algebra level (not at the multivector level). So, for example, a silly expression like (p + 2 * p + 1 - 1 - 3*p) ^ (3 * p) (for any entity p) is trivially reduced to 0 before any term expansion of p occurs. This required a new way of encoding expression trees, for which I created a custom constexpr-friendly flat representation (similar to Reverse Polish Notation, but with extensions to handle polyadic operations; a rough sketch follows after this list).
  2. A number of things were made more convenient in actually expressing computation. In a gal compute context, you will be able to write things like 1_e23 or 1.5_e0. This is finished, but my plan is to use continued fractions where possible to represent the double/float literal exactly as a rational compile-time constant. A pseudoscalar looks like 1_ps and its inverse like 1_ips.
  3. The expression compiler, in addition to doing a number of reductions/simplifications, will also enable creating temporary results on the stack. The lack of this feature was preventing me from streamlining the usage of transcendental functions. It will also allow me to improve compile times in situations where multivector term expansion would have been prohibitively expensive.
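
As a rough illustration of the flat expression encoding mentioned in point 1 (the node kinds, fields, and layout below are guesses made for exposition, not GAL’s actual representation):

```cpp
#include <array>

// Each node either names an indeterminate or applies an operation to the N
// results immediately preceding it, postfix (RPN) style. Storing an explicit
// arity is what makes polyadic sums/products representable in a flat array.
enum class op : unsigned char { term, sum, product };

struct node {
    op kind;
    unsigned char arity; // operands consumed (0 for a term)
    int id;              // which indeterminate, for term nodes (-1 otherwise)
};

// (a + b + c) * d encoded as a constexpr-friendly flat array:
constexpr std::array<node, 6> expr{{
    {op::term, 0, 0},     // a
    {op::term, 0, 1},     // b
    {op::term, 0, 2},     // c
    {op::sum, 3, -1},     // a + b + c  (a single polyadic sum node)
    {op::term, 0, 3},     // d
    {op::product, 2, -1}, // (a + b + c) * d
}};
```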

In short, when the release (hopefully) goes out this weekend, I expect speed, usability, functionality, and compile times all to improve somewhat dramatically, if I can deliver what I hope. Not to over-promise, of course: this all comes with the caveat that I’m a mere mortal :slight_smile: and not immune to error.

4 Likes

The Common Subexpression Elimination is a cool feature.

Making the computation explicit is interesting as well. I have found myself doing something similar by using symbolic math to define an expression, and then replacing the symbols with runtime values (I have an eval() function for this).

I encourage you to change code like entity<algebra, double, 0b11, 0b101> my_multivector{3.2, 1.2}; to something like 3.2*e01+1.2*e02. What do you gain by specifying the algebra here?

I actually do support that syntax, but in the context of a lambda where the metric is specified and the values in your example are passed as scalar indeterminates. The entity<algebra, T, E...> signature is really an implementation detail, and actually an important one, since it lets me import entities from one algebra to another. It’s “glue”, if you will. For example, you can define a point in VGA, perform operations with the \mathbb{R}_{3, 0, 0} metric, and then use the same point in the context of a computation in \mathbf{P}(\mathbb{R}^*_{3, 0, 1}); the type system automatically performs the conversion for you behind the scenes.
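
To illustrate the “glue” idea generically (a conceptual sketch with made-up types; GAL’s actual entity machinery, blade naming, and sign conventions are not shown): the same point carries different blade coefficients in different algebras, and a conversion layer maps one representation to the other so that downstream expressions can be written against the target metric.

```cpp
#include <array>

// A Euclidean (VGA, R_{3,0,0}) point: coefficients on e1, e2, e3.
struct vga_point {
    double x, y, z;
};

// A projectivized (P(R*_{3,0,1})) point: trivector coefficients plus a
// homogeneous weight (blade naming and sign conventions intentionally omitted).
struct pga_point {
    std::array<double, 4> blades;
};

// The "glue": embed the Euclidean point with unit homogeneous weight, so the
// same datum can participate in expressions evaluated under the PGA metric.
constexpr pga_point to_pga(const vga_point& p) {
    return {{{p.x, p.y, p.z, 1.0}}};
}
```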

1 Like