Consider the Nimrod Programming Language

Mid-April Update: Thanks for the thoughts everyone! The post quickly grew to >10,000 views and reached #10 on Hacker News for a while. I continue to enjoy the language for a multitude of applications from basic scripting to ‘hard work’ tasks.

I love to play around with new programming languages. Even though I spend most of my time in industry-tested standards for their respective applications (e.g. Java, C, C++, Python, JavaScript, …), I think there are a lot of good reasons–especially these days–to learn and experiment with new languages. The impact of modern language development isn’t limited to a cottage industry of computer scientists and programmers. Take the growing Scala language as an example: Twitter transitioned its service from a framework primarily written in Ruby to Scala, both to scale the service and to maintain a programming model it desired. I also believe we are finally beginning to see languages that are elegant, expressive, and, importantly, fast. For example, these days, a company using two industry standards like Python and C++ might do ‘heavy lifting’ in C++ and write a lot of ‘high level’ wrapper code and scripts in a language like Python. Why not use just one language? Why is Python ‘slow’ for some tasks and C++ ‘unpleasant’ for scripting tasks? A good language should be expressive enough to elegantly capture domain-specific tasks while allowing the programmer to make the things that need to be fast, fast.

Why the competition may not quite fit the bill

I could just list out great Nimrod features and say: ‘consider it’, but I don’t think that these features are very useful without some explanation of why these features provide an overall better experience than other compelling languages. When it comes to picking a programming language that attempts a speed-elegance unification, there are a lot of choices. The five on the ‘short list’ that I discuss in this post are:

  • Nimrod
  • C++11
  • Scala
  • Rust
  • Julia

There are other options that I could put on this list like Haskell or Go, and I have my reasons for picking the 5 above, but I don’t want to discuss them right now.  What I would like to do is convince you that Nimrod is a particularly nice language to consider since the design decisions they made, to me, result in an elegant, expressive, and fast language (though I  understand people have different syntactic preferences).   These are my initial thoughts after nearly three weeks of coding a lot in Nimrod. I am writing this because I think the language needs to get more attention than it has, and it deserves to be taken seriously as a worthy competitor to the other four mentioned above.

Of course this is not a super-detailed comparison, but overall, I hope I provide some reasons you may want to consider Nimrod over these other very nice languages. They all have stuff to offer over Nimrod and vice-versa, and there are also a number of overlapping features. Ideally I would like to have a highly expressive, fast language that is what I call “K&R-memorable”, which basically means that it is approximately as easy to understand as C (all you do is read K&R and you’re good).

C++11 has really brought C++ a long way. Coding with it results in a lot less boilerplate, and it did a reasonable job of incorporating higher-order functions and handy value-semantics-preserving features such as move semantics. However, there’s still a lot of boilerplate (e.g. it’s 2014 and I’m still writing header files separate from source because compilation with header-only files is too slow?), and now I need to implement more operators for classes to preserve value semantics (don’t forget to implement the move assignment operator!). So C++11 is nice and incorporates some modern features, especially because it works with all other C code, but it’s much too complex and, I think, far less elegant than the other alternatives.

Scala and Rust are both very interesting languages (in general, simpler to understand than the totality of C++11). I have had a good deal of experience with Scala and have played with Rust for a couple of minor tasks. Both languages implement traits. To me, traits are a far more elegant way of adding similar functionality to different objects than multiple inheritance. But my experience with Scala has shown me that while it is easy to use libraries, it is harder to design them in the midst of a complex graph of how objects and traits are related to one another. I spent a lot of time engineering the types to be just right, which is great, but it was also frustrating, and I felt that the safety I desire at compile time would be more easily achieved without such a complex system. I will discuss some design decisions made by Nimrod below that I think result in less time spent on type nitpicking and more time spent on getting the ‘job done right’ with reasonable safety features. Rust provides more built-in memory safety, which is great, but understanding how it all works and the ‘conversations’ with the compiler after coding can be frustrating. Believe it or not, sometimes hard guarantees with types/traits and memory are not worth the pain when programming (i.e. the programmer effort required, and the mental-model and syntactic complexity). I think this is precisely why the adoption of a slow, dynamically duck-typed language like Python has been so successful: it’s ‘easy’ to program in. I think Nimrod is a happier medium.

Julia’s motivation comes from two places. The language resembles typical scientific programming syntax (à la Matlab and Pylab) that executes fast when compiled, and it offers extensive and intuitive metaprogramming capabilities since it is homoiconic like Lisp. (And the scientist in me really likes the IJulia notebook feature that they have apparently worked quickly to develop.) I will show some examples below of how Nimrod offers a powerful and elegant metaprogramming environment without necessarily being homoiconic. My only real concern with Julia is lower-level systems programming. Forced garbage collection can be a game-changer here, and I’m not sure its choice of being ‘largely’ dynamically typed is a good one in this setting either. Providing a library developer some level of type annotation and type-class restriction can be useful for engineering purposes and more helpful when dealing with compile-time errors. I work in the area of computational biology and I am left wondering: is Julia the right language to build the fastest read aligners, gene expression estimators, etc.? These tools are often written in C/C++, so Julia code would have to beat that! A similar sentiment applies to Scala: its dependence on the JVM has actually resulted in very poor performance in even a simple multicore application, in my experience.

Quick start with Nimrod

OK, so you should read the tutorial and eventually the manual on the web site to get a quick start and get to know the language better, but I’ll tell you how I started using it: as a scripting language. I know this isn’t the best for ‘performance’ testing, but any language that has this ‘unification’ quality should be equally good at scripting as it is for high-performance applications. Here is a simple example:

import os

proc shell(cmd: string) =
    if os.execShellCmd(cmd) != 0:
        raise newException(EOS, cmd & " returned non-zero error code")

proc fexists(fname: string) : bool =
    try:
        close(Open(fname))
    except EIO: return false
    return true

const fromScratch = false

shell "clear"

if fromScratch:
    echo "Removing cached files and log"
    shell "rm -rf nimcache log.txt"
    echo "All output in log.txt"
    echo "Compiling ..."
    shell "g++ -fPIC -O3 -shared -std=c++11 -o bla.so bla.cpp >> log.txt 2>&1"

# More of the pipeline, e.g.

if not fexists("blah.txt"): createBlah()
else: useBlah()

This, to me, is a very clean way to do basic shell scripting while having the power of a full programming language.

Nimrod avoids ‘over-objectifying’

OK, so that was relatively straightforward. Here is a simple example of how to create a matrix type (partially inspired by a Stack Overflow post):


type Matrix[T] = object
    nrows, ncols: int
    data: seq[T]

proc index(A: Matrix, r,c: int): int {.inline.} =
    if r<0 or r>A.nrows-1 or c<0 or c>A.ncols-1:
        raise newException(EInvalidIndex, "matrix index out of range")
    result = r*A.ncols+c

proc alloc(A: var Matrix, nrows,ncols: int) {.inline.} =
    ## Allocate space for a m x n matrix
    A.nrows = nrows
    A.ncols = ncols
    newSeq(A.data, nrows*ncols)

proc `[]`(A: Matrix, r,c: int): Matrix.T =
    ## Return the element at A[r,c]
    result = A.data[A.index(r,c)]

proc `[]=`(A: var Matrix, r,c: int, val: Matrix.T) =
    ## Sets A[r,c] to val
    A.data[A.index(r,c)] = val

iterator elements(A: Matrix): tuple[i:int, j:int, x:Matrix.T] =
    ## Iterates through matrix elements row-wise
    for i in 0 .. <A.nrows:
        for j in 0 .. <A.ncols:
            yield (i,j,A[i,j])

proc `$`(A: Matrix) : string =
    ## String representation of matrix
    result = ""
    for i in 0 .. <A.nrows:
        for j in 0 .. <A.ncols:
            result.add($A[i,j] & " ")
        result.add("\n")

The first thing to notice is that a matrix is an object type that contains data and its number of rows and columns. All the methods take a matrix as the first argument. This matrix is generic on any type Matrix.T. An alternative syntax where ‘[T]’ comes after a procedure name may also be used. Nimrod uses a uniform call syntax that implies these two calls are equivalent:

A.alloc(nr, nc)
alloc(A, nr, nc)

Notice that ‘elements’ is an iterator. This is a very efficient iterator called an ‘inline’ iterator. You can read more about this in the tutorial and manual. The `$` operator before a variable is the standard ‘to string’ operator. This allows you to do:

echo A

and a matrix will be printed out.

The uniform call syntax is a simple way to support a lot of ‘call-chaining’-like behavior commonly seen in object-functional programming, and it avoids forcing methods to be in objects. As an example, say I have a BigInt and a little int and I want to be able to support addition of them. In Nimrod, you simply write a procedure that overloads the ‘+’ operator and it works (otherwise you get a compile-time error). In a language like Scala, you define what’s called an ‘implicit conversion’ to do this for you. The added idea of an implicit conversion on objects, and having to define them so explicitly, seems more complex than just overloading the operator. Note that there are other cases where you would like to use implicit conversions, and Nimrod provides this capability. Calls to procedures in Nimrod can also be ‘pass by reference’, in C++ terms:

proc test(x: var int) =
    x = 5

var x = 3
echo x
test(x)
echo x

results in:

3
5

The compiler will choose the appropriate procedure to call at compile time based on the argument types. Nimrod also supports multiple dispatch.
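Coming back to the BigInt example: the analogous move, sketched in Python terms with a toy BigInt class I’m inventing purely for illustration, is to overload the operator directly, with no separate implicit-conversion machinery:

```python
class BigInt:
    """Toy stand-in for an arbitrary-precision integer (illustrative only)."""
    def __init__(self, value):
        self.digits = str(value)

    def __add__(self, other):
        # Overload '+' to accept both BigInts and plain ints
        if isinstance(other, BigInt):
            return BigInt(int(self.digits) + int(other.digits))
        if isinstance(other, int):
            return BigInt(int(self.digits) + other)
        return NotImplemented

    __radd__ = __add__  # so `3 + BigInt(4)` works too

print((BigInt(40) + 2).digits)  # "42"
```

The difference is that Python reports a missing overload with a TypeError at run time, whereas Nimrod rejects it at compile time.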

Nimrod has an intuitive type system

As mentioned above, traits are a nice way of defining components of functionality tied to an object: the compiler will error out if certain traits are required but missing, for example. I also mentioned that this can lead to complexities in library design and engineering (which may be good or bad depending on your perspective and the outcome).

One feature of Nimrod that’s appealing is that it offers the programmer type classes — the ability to group types into a single type (e.g. define float and int to be of type number) — and distinct types — the ability to create two different types corresponding to the same underlying data type (e.g. dollars and euros are both ints in their tutorial example). Similar to type classes, Nimrod also allows constraints on generic types, and support for additional constraints is in the works. So the compiler will provide an error message if a procedure is called on a type outside the type class it’s defined on, or if a desired procedure is missing. Traits appear to be a formalism that could be useful, but they might result in a lot of added complexity given the capabilities already provided by type classes and distinct types. Nimrod also supports an effects system, which allows for additional compile-time safety checks.

You will want to metaprogram in Nimrod

Nimrod makes it easy to extend the language and the abstract syntax tree to generate the code you want. Say I wanted to do an openMP-like parallel for using Nimrod’s threads over a shared sequence of data. The thread code looks a lot like this:

template parallelFor[T](data:openarray[T], i:expr, numProcs:int, body : stmt) : stmt {.immediate.} =
    let numOpsPerThread = data.len div numProcs
    proc fe(j:int) {.thread.} =
        for q in 0 .. numOpsPerThread-1:
            let i = j*numOpsPerThread+q
            body # do something with data[i]
    var thr: array[0..numProcs-1, TThread[int]]
    for j in 0 .. numProcs-1:
        createThread(thr[j], fe, j)
    joinThreads(thr)

But using the template as defined above, I can just do:

parallelFor(sequence, i, numProcs):
    # do something with sequence[i]

Note: this is just a toy example showing how to do some handy metaprogramming using a lower-level version of threads. The developers are working on a much better way of handling threads and parallelism at a higher level. The tutorial defines a debug statement (a variation of which I use a lot) that shows how you can actually modify the abstract syntax tree.
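For comparison, the chunking pattern the template expresses can be sketched in Python; parallel_for, worker, and square_into_out are names I’m making up here, and, like the template above, this sketch ignores any remainder when the data doesn’t divide evenly among workers:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(data, num_procs, body):
    # Split the indices of `data` into num_procs contiguous chunks
    ops_per_worker = len(data) // num_procs

    def worker(j):
        for q in range(ops_per_worker):
            body(j * ops_per_worker + q)

    # Leaving the with-block joins the worker threads
    with ThreadPoolExecutor(max_workers=num_procs) as pool:
        for j in range(num_procs):
            pool.submit(worker, j)

data = list(range(8))
out = [0] * len(data)

def square_into_out(i):
    out[i] = data[i] * data[i]

parallel_for(data, 2, square_into_out)
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49]
```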

C code and Memory Allocation

The typical compilation scheme I use in Nimrod is to compile to C code. The compiler produces a nimcache/ directory with all the C and object code for me to inspect, to see how efficient it is compared to what I would write in C. It’s fairly straightforward to link into C code if you must.

In the examples I tried, the C code Nimrod generates is often indistinguishable in speed from hand-crafted C code. Nimrod is much more pleasurable to program in than C, and the compile-time and run-time error messages are far better than C’s.

Also, I’d like to note that Nimrod allows for manually allocated memory and low-level operations to provide the developer ‘C-like’ control.  In most cases the standard libraries using the GC are appropriate, but in some cases you may want to manage your own data on the heap and Nimrod allows for this.

Young language, helpful community

The Nimrod language is young and has a handful of developers working on making it to a 1.0 release. The Nimrod community has been very helpful to me and I think it has a lot of potential.

I’m writing this post based on my experiences so far. I would really appreciate any feedback if I’m wrong or misrepresented a language I discussed. The post will be modified accordingly with acknowledgement.

Thanks to Rob Patro and the Nimrod community for useful discussions.


Scala is pretty awesome

After hearing about it a lot, I have started using Scala recently and am finding it really fun and powerful to use.  Scalala is a nice analog to numpy.  The documentation needs to get better, but I think it’s coming along.

I just found a good introductory video by the creator of the language, and if you’re curious whether this is a language you’re interested in playing with, I think it is a good place to start:

http://www.youtube.com/watch?v=zqFryHC018k

Speed up your Python: Unladen vs. Shedskin vs. PyPy vs. Cython vs. C

Lately I’ve found that prototyping code in a higher-level language like Python is more enjoyable, readable, and time-efficient than directly coding in C (against my usual instinct when coding something that I think needs to go fast).  Mostly this is because of the simplicity in syntax of the higher level language as well as the fact that I’m not caught up in mundane aspects of making my code more optimized/efficient. That being said, it is still desirable to make portions of the code run efficiently and so creating a C/ctypes module is called for.

I recently created an application (I won’t go into the details now) that had a portion of it that could be significantly sped up if compiled as a C module.  This spawned a whole exploration into speeding up my Python code (ideally while making minimal modifications to it).

I created a C module directly, used the Shedskin compiler to compile my Python code into C++, and tried the JIT solutions PyPy and Unladen Swallow.  The time results for running the first few iteration for this application were surprising to me:

cpython:   59.174s
shedskin:  1m18.428s
c-stl:     12.515s
pypy:      10.316s
unladen:   44.050s
cython:    39.824s

While this is not an exhaustive test, PyPy consistently beats a handwritten module using C++ and STL! Moreover, PyPy required little modification to my source (itertools had some issues) [1]. I’m surprised that Unladen and Shedskin took so long (all code was compiled at O3 optimization and run on multiple systems to make sure the performance numbers were relatively consistent).

Apparently out-of-the-box solutions these days can offer nearly a 10x improvement over default Python for a particular app, and I wonder what aspects of PyPy’s system account for this large performance improvement (their JIT implementation?).

[1] Unladen required no modifications to my program to run properly, while Shedskin required quite a few to get going. Of course, creating a C-based version took a moment :-).

Update 1: Thanks for the comments below.  I added Cython, re-ran the analysis, and emailed off the source to those who were interested.

Update 2: The main meat of the code is a nested for loop that does string slicing and comparisons, and it turns out the slicing and comparisons were the bottleneck for Shedskin. The new numbers are below with a faster matching function for all tests (note that this kind of addition requires what I call ‘code twiddling’, where we find ourselves fiddling with a very straightforward, readable set of statements to gain efficiency).

cpython:       59.593s
shedskin 0.6:   8.602s
shedskin 0.7:   3.332s
c-stl:          1.423s
pypy:           8.947s
unladen:       29.163s
cython:        26.486s (3.5s after adding a few types)

So C comes out the winner here, but Shedskin and Cython are quite competitive. PyPy’s JIT performance is impressive and I’ve been scrolling through some of the blog entries on their website to learn more about why this could be. Thanks to Mark (Shedskin) and Maciej (PyPy) for their comments in general, and to Mark for profiling the various Shedskin versions himself and providing a matching function. It would be interesting to see if the developers of Unladen and Cython have some suggestions for improvement.

I also think it’s important not to look at this comparison as a ‘bake-off’ to see which one is better.  PyPy is doing some very different things than Shedskin, for example.  Which one you use at this point will likely be highly dependent on the application and your urge to create more optimized code.  I think in general hand-writing C code and code-twiddling it will almost always get faster results, but this comes at the cost of time and headache.  In the meanwhile, the folks behind these tools are making it more feasible to take our Python code and optimize it right out of the box.

Update 3: I also added (per request below :-)) just a few basic ‘cdef’s and types to my Cython version.  It does a lot better, getting about 3.5s on average per run!

A couple of recent Python tidbits

Five years ago the two most practical and used computer languages for me were C and Perl.

While I still use C for the nitty-gritty stuff that needs to be fast, I’m finding that a lot of stuff can get done, and get done very fast, in a scripting language. Over a year ago I learned the wonders of Ruby (which, to me, is basically a superior replacement for Perl from an ‘it’s fun to code in’ perspective, and is easy to transition to from Perl).

But overwhelmingly, I’ve found myself enjoying and using Python.  The biggest selling point from my perspective is the high quality scientific computing and plotting support (which in many cases has replaced my use of R-project for these types of things).

Here are three little tidbits that I’ve recently found handy and take virtually no effort to begin using:

(1) First, to speed up the things that need the speed, calling C functions from Python is super-handy. I really like the ctypes module because in many cases, as long as your C functions take the default types as inputs, you can simply expose your function via a dynamic library with no real extra effort.
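A minimal sketch of that workflow, assuming a Unix system: load a C library, declare a function’s signature so ctypes performs the right conversions, and call it. Here I load the C runtime already linked into the interpreter and call strlen; for your own code you would pass the path of your compiled shared library to CDLL instead.

```python
import ctypes

# Load the C library already linked into this process (Unix-specific;
# for your own code, pass the path of your compiled .so here instead).
libc = ctypes.CDLL(None)

# Declare strlen's signature so ctypes converts arguments correctly
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```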

(2) List comprehensions are a fun and very useful feature in Python.  But many times, you don’t want to create the whole list before iterating through it.  Here’s where generator expressions come in.  Instead of brackets around your comprehension, just put parens, and suddenly you can iterate over these elements without creating the entire list in advance (especially useful with a module like itertools).
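A quick sketch of the difference:

```python
import itertools

squares_list = [n * n for n in range(10)]  # builds the full list immediately
squares_gen = (n * n for n in range(10))   # lazy: each value produced on demand

# Generator expressions compose nicely with itertools: take the first
# three squares greater than 10 without materializing anything.
first_three = list(itertools.islice((s for s in squares_gen if s > 10), 3))
print(first_three)  # [16, 25, 36]
```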

(3) Finally, one thing that had bugged me a little in Python, especially since I’m used to C, is that there is no real scanf equivalent [1]. What I end up doing is parsing out my variables from a string and, on separate lines, explicitly setting their type. This just takes too many lines (and I could often do it in fewer lines in C)! After some thought, a lab mate and I converged on:

>>> types = [float, str, int]

>>> fields = ['3.14159', 'Hello', '10']

>>> pi, hi, ten = [f(x) for f, x in zip(types, fields)]

[1] Note that while there is a suggestion in the Python docs on how to do this, it just suggests how to extract different types with regular expressions, not concisely convert them.
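Applied to a raw line of text (the line below is a made-up example record), the same one-liner plus split() gets pretty close to a scanf:

```python
types = [float, str, int]
line = "3.14159 Hello 10"  # e.g. a whitespace-delimited record from a file
pi, hi, ten = [f(x) for f, x in zip(types, line.split())]
print(pi, hi, ten)  # 3.14159 Hello 10
```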

Flex and Bison Fun

So I’ve been writing parsers lately with Flex and Bison.

Some notes of interest:

  • The “info” pages on both are really helpful and I’ve learned most of what I needed from them (as opposed to Googling around or finding a book)
  • The Regular Expression Explorer has been quite handy
  • While learning, I wrote a bunch of toy parsers, and I realized that I was copying “template code” a lot. This seemed unnecessary. A tool that simply takes a BNF-like input and outputs parsers in different languages seems pretty cool, but I haven’t played with it too much yet.

Fun.

How can I trust you, code?

I’ve been programming a bit lately in C++ with Boost for two reasons: (1) I find the template metaprogramming aspects intriguing, and (2) the generic programming aspects are super-convenient. Yet, when I’m really concerned with knowing precisely what’s going on to avoid wasting CPU cycles and memory, I go back to C. But then of course C is doing all sorts of things for me as well (just see the difference in speed in your code with the -O3 optimization flag).

One trend I’ve noticed after reading a lot of code is that bit-twiddlers will often over-optimize. I’m more and more convinced that this not only affects the readability of the code, but it may also be indicative of a lack of trust in the compiler’s optimization. The more control you have, the better you feel, right?

On the other end of things, some people trust objects and functions as black boxes and couldn’t care less what’s going on underneath. This post is not for them.

I hate to pull the big CS “language” and “compiler” words out of my pocket again [1], but how can I trust you, language and compiler? When do I know I’m over-twiddling and it’s better to let the compiler do the work for me? While there are guidelines, I think what it comes down to is a decent understanding of the compiler you’re dealing with and then basic experimentation.

Most of my questions about whether certain optimizations are actually occurring are typically answered by doing timing tests, creating little “experiment” code bits to test a simple hypothesis, or looking directly at the assembly code (e.g. the “-S” flag in gcc).
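As one example of such an experiment (Python and timeit here, though the same idea applies to C plus -S): test the hypothesis that hoisting a loop-invariant call out of a loop matters, rather than assuming the runtime does it for you.

```python
import timeit

setup = "xs = list(range(1000))"

# len(xs) evaluated on every loop-condition check...
per_iter = timeit.timeit("i = 0\nwhile i < len(xs):\n    i += 1",
                         setup=setup, number=1000)
# ...versus hoisted out of the loop once
hoisted = timeit.timeit("n = len(xs)\ni = 0\nwhile i < n:\n    i += 1",
                        setup=setup, number=1000)
print("per-iteration:", per_iter, "hoisted:", hoisted)
```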

This did get me thinking, though: with so many different compilers and architectures–even with standards and benchmarks–a lot of these little questions aren’t answered. This process of creating precise tests and getting the answers seems to be the best approach [2].

—–
[1] But I do believe these topics have some of the most interesting research opportunities from a CS perspective. As I foray into scientific computing and its applications to the sciences, I hope I will have the chance to tackle a CSee problem or two along the way.
[2] Wouldn’t it be nice if there was a nice repository of contributed tests for a variety of architectures, compilers, and languages, and that these tests can be tagged with all this information? In the spirit of my last post, I say let’s start a massive conglomerate blog where people can post these tests. And then there could be another one where people just bitch about the goodness and badness of the tests. It’ll be like a somewhat moderated coding newsgroup for today’s age. Oh how I’ve blurred the line between honest desire and sarcasm here.

Compile-time optimization of data structures?

Or: “it’s all fun and games until you realize you’re a monkey.” Bob brings up a big point about the expressiveness of a language and its relation to concise code. In a similar carpal-tunnel vein (but certainly a bit less meaningful in scope), I find it difficult to find decent work on compile-time optimization of data structures based on suggestions for space and time efficiencies and how they are intended to be used.

In other words, if you enjoyed and/or revisited an undergraduate course on data structures, based on any given problem at hand, the chances are good that you can concoct some monstrosity of a hybrid data structure that keeps efficiency and your use of the code in mind.

The handwavy question I’m posing is, what aspects of concocting this crazy structure are actually due to clever thinking about the problem at hand vs. “this data structure is good for this, and this one for that” kind of algorithmic work? The end result of the algorithmic work is a tired hand and wordy, unexpressive code. I have this sneaking suspicion that a lot of the work is indeed algorithmic and it’s worth exploring what can be automated at compile-time based on potentially simple rules and patterns.

Perhaps a good example to keep things somewhat concrete is the notion of a set, essentially an unordered collection of unique elements. It has some basic binary operations like union and intersection and there are a variety of things one imagine doing with a set: query for membership of an element, add an element to it, etc [1]. For instance, in a graph, every node might have an associated set of nodes it’s connected to [2], or it may alternatively be represented as a set of edges and a set of nodes where each edge is a set of two nodes. Now the last thing I’m advocating is to always use some mathematical structure to express what objects you’re dealing with in your program, but it does turn out that a “set object” can be quite handy. Here is some simple Python code that basically takes as an input a set of files and returns the number of unique files where each element of a file is a line (keep in mind there can be duplicate lines as well as entire duplicate files):

import sys
from sets import Set, ImmutableSet

uniqpars = Set()
for parfile in sys.stdin:
    parset = Set()
    pf = open(parfile.rstrip())
    for line in pf:
        parset.add(line)
    pf.close()
    # Mutable Sets aren't hashable; freeze before adding to the set of sets
    uniqpars.add(ImmutableSet(parset))
print len(uniqpars)

The fact that the example above uses a set in a traditional object-oriented way [3] is somewhat beside the point. What sometimes matters are simple questions like how fast are membership and addition operations or are we doing a lot of unions or how much space does my set actually take? There are many ways to implement a set depending on how you’re using it. A few examples:

Super-fast lookup, but potential memory or false positive concerns:

  • A directly-mapped array or bitset if you have enough space to store your entire possible set of members.
  • A valueless hash table if you have a decent amount of memory around, a potentially super-large set of possible members and want to store only the members you see. These also have super-quick lookup and insertion/deletion times.
  • A Bloom filter if memory is a concern, you have a super-large set of possible members, don’t mind not storing the members, and are willing to accept a small probability of false positives.

The structures above are great for fast operations given the set sizes don’t fluctuate that much and you have the memory or are willing to deal with false positives.

  • Binary search trees, especially the self-balancing kind, are great for sets whose sizes fluctuate a lot [4] or when many union operations are being done, for example (they also preserve an ordering of the elements, which can be useful). If maximum or minimum element values need to be accessed quickly, binary heaps are often a good way to implement those kinds of sets.
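To make one of these concrete, here is a minimal Bloom filter sketch in Python; deriving the k bit positions by slicing a single SHA-256 digest is just one illustrative choice, not the canonical construction.

```python
import hashlib

class BloomFilter:
    """Minimal sketch: k hash-derived bit positions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Take k 4-byte slices of one digest as the k "hash functions"
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # Never a false negative; a small chance of a false positive
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print("alice" in bf)  # True
```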

Obviously, there are a variety of ways to implement this kind of object. Object-oriented languages like C++ and Java have ways to separate the notion of an object from its implementation [5], but I reiterate, the object-orientedness is not necessarily the point. It’s simply a convenient way to organize and express the fact that you’re using a set object with commonly-used functions.

What seems to be deserving of more attention is if there are some handy structures with well-defined operations like a set, it’s probably worth using those structures and optimizing for what we need to based on some helpful directives at compile-time [6]. In other words, suppose I’m using some objects with well-defined operations like a set with each other. Suppose I also have a way to express how I’m using my information with these objects. Finally, suppose I can use techniques that some resourceful computer scientists have come up with to piece together these objects in a relatively optimal way based on how I plan to use the structures. And presto! We could potentially do a lot of work from the get-go to save the programmer some implementation details yet still wind up with some pretty efficient code.

OK, so there are a lot of assumptions there, but I imagine one can begin by citing more examples of commonly-used objects, their uses, and common implementations, both by themselves and as hybrids with other data structures. Perhaps some interesting rules and techniques could come about.

And all that just reminded me of some Ph.D. work a friend is doing at the moment. Maybe this topic will be revisited.

——
[1] Note that I’m just focusing on data structures and basic operations as well as their optimization with other data structures and ignoring the more language-oriented aspects.
[2] Often implemented as an adjacency list
[3] That is, the way it can check equality of sets on insertion is the fact that each member is some object that has some defined way of saying it’s “equal” to another object. In the example above, there is a set of lines and a “set of sets of those lines”. With object-oriented techniques the “equals” function works for both even though the member types are different (i.e. a “line”, and a “set of lines”).
[4] As a brief motivator an allocator can be associated with a data structure to deal with this issue (though it can get hairy at times). A good review of allocation techniques as well as some implementation motivation can be found at this starting point.
[5] Though C++ is super-guilty of not using a multiple-inheritance trick to do this in practice.
[6] I don’t think this is too far-fetched given we’ve done quite a lot of work in linear algebra and higher-level matrix-oriented languages tend to do a good job with running optimized routines based on a few directives.