It was the mean() of times, it was the rms() of times.
I continue to poke and prod at Julia, learning more of both the good parts and the bad parts. Julia is very powerful, but it is extremely architecturally immature. Kind of like cars built by Tesla: you get some innovations, but also rookie mistakes and poor quality in things like paint, where there's really no excuse. One of the downsides of this is that you really don't want to install Julia on Windows, because installing modules and compiling will be slower than mud. Why? Files. Lots and lots of files. Hundreds of thousands of files once you install just a few packages like Plots and ODE. If you're going to use it on Windows, then hopefully you're on Windows 10 and can install WSL 2 (Windows Subsystem for Linux), and then install Julia on a native Linux filesystem. If not, Windows Defender is going to want to scan every one of those 400+ thousand files, and you might as well take the afternoon off and let it do its thing.
But that's not what this is about. This is about performance.
Once you get all the bad module and dependency stuff behind you, there's some interesting speed available for getting work done. Even with this little experiment, the data answers a few questions and raises some more. But let's set up the problem first.
In Jane Herriman's video (mentioned in the last post), she pretends to be introducing us to Julia, but instead leads us on a whirlwind tour through all kinds of interesting work, including benchmarking some functions: built-in, hand-written, and calls into external C libraries. Because a few bits were clipped off the edge of her screenshots, I tried to re-create her work from that section, but with my own twist. She was benchmarking mean(), or taking the average. I decided to do rms(), the root mean square. This is probably because I'm both a programming nerd and a power electronics nerd, and rms is one of those useful things you do from time to time to figure out how much delivered energy you're getting from a changing voltage. The name basically describes the operations involved, but the TeX version would be:
$\sqrt{ \frac{1}{N} \sum_{i=1}^N x_i^2}$
(I looked into putting a pretty figure here, but Chrome doesn't support most of MathML yet, and I didn't figure there would be that many people using Firefox circa 2020.)
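As a quick sanity check of the formula (not part of the benchmark itself), the RMS of a unit-amplitude sine wave should come out to 1/sqrt(2), about 0.707, and a couple of lines of Julia confirm it:

# Hypothetical sanity check: the RMS of a unit-amplitude sine wave
# sampled over one full cycle should be 1/sqrt(2) ≈ 0.7071.
N = 10_000
A = sin.(2π .* (0:N-1) ./ N)      # one cycle, N samples
sqrt(sum(A .^ 2) / length(A))     # ≈ 0.7071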
Julia doesn't have a built-in for rms(), but the forums have a number of suggestions:
sqrt(mean(A .^ 2.))
sqrt(sum(x->x^2, A)/length(A))
norm(A)/sqrt(length(A))
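For reference, here's roughly how these can be timed with BenchmarkTools. This is a sketch of my own, not the notebook code; A is just a hypothetical million-element Float64 test vector, and @btime's $ interpolation keeps the global variable from skewing the timing.

using Statistics      # mean()
using LinearAlgebra   # norm()
using BenchmarkTools

A = rand(Float64, 1_000_000)

@btime sqrt(mean($A .^ 2))                      # temporary array, then mean
@btime sqrt(sum(x -> x^2, $A) / length($A))     # squares consumed one by one
@btime norm($A) / sqrt(length($A))              # 2-norm divided by sqrt(N)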
The first version is kind of slow (3.5 times slower than the baseline). The second one saves having to generate a temporary copy of the entire array, instead squaring each term one by one as it's consumed by sum(); it comes in only four percent slower than the baseline. I expected great things from the third one, as the norm() operation is basically the root of the sum of the squares, but it was actually slower than the second version, at thirteen percent over the baseline. So what was the baseline? Well, I wrote it out:
function rms(A)
    s = 0                       # running sum of squares
    @simd for e in A            # tell the compiler the iterations are independent
        s += e * e
    end
    sqrt(s / length(A))
end
That little bit of @simd magic dust before the for is required to get this kind of speed: it specifies that none of the iterations depend on any other, and lets the compiler go crazy reordering and vectorizing the loop. I actually wrote this in C as well, using two different styles (traditional and performant), and while there was a small variation in the timings (one was a smidge faster, and the other a sliver slower), the Julia version and the C versions were practically identical for the clang build. The gcc build didn't do as well. But the whole ordeal with compile flags and such is a story for another time.
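The C side-by-side doesn't fit here, but a rough way to see what @simd buys on the Julia side is to benchmark the loop with and without the annotation. This is my own sketch, assuming BenchmarkTools; rms_simd and rms_nosimd are hypothetical names, not from the notebook.

using BenchmarkTools

function rms_simd(A)
    s = 0.0
    @simd for e in A            # annotated loop, free to vectorize
        s += e * e
    end
    sqrt(s / length(A))
end

function rms_nosimd(A)
    s = 0.0
    for e in A                  # plain loop for comparison
        s += e * e
    end
    sqrt(s / length(A))
end

A = rand(Float64, 1_000_000)
@btime rms_simd($A)
@btime rms_nosimd($A)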
Reference:
benchmark_rms.jl Pluto notebook.