Technical Difficulties from on Top of the Mountain

2020-11-14

It was the mean() of times, it was the rms() of times.

I continue to poke and prod at Julia, learning more of both the good parts and the bad parts. Julia is very powerful, but it is extremely architecturally immature. Kind of like cars built by Tesla: you get some innovations, but also rookie mistakes and poor quality in things like paint where there's really no excuse. One of the downsides of this is that you really don't want to install Julia on Windows, because installation of modules, and compilation will be slower than mud. Why? Files. Lots and lots of files. Hundreds of thousands of files once you install just a few packages like Plots and ODE. If you're going to want to use it on a Windows, then hopefully you're using Windows 10 and you can install WSL 2 (windows subsystem linux), and then install Julia on a native linux filesystem. If not, Windows Defender is going to want to scan every one of those 400+ thousand files, and you might as well take the afternoon off and let it do its thing.

But that's not what this is about. This is about performance.

Once you get all the bad module and dependency stuff behind you, there's some interesting speed available for getting work done. But even with this experiment, the data answers a few questions, and raises some more. But let's setup the problem first.

In Jane Herriman's video (mentioned in the last post), she pretends to be introducing us to Julia, but she instead leads us on a whirlwind tour through all kinds of interesting work. Including benchmarking some functions: built-in, hand written, and external C libraries. Because a few bits were clipped off the edge of her screen shots, I tried to re-create her work from that section, but with my own twist. She was benchmarking mean() or taking the average. I decided to do rms() or the root mean square. This is probably because I'm both a programming nerd, and a power electronics nerd, and rms is one of those useful things you do from time to time to figure out how much delivered energy you're getting from changing voltage. The name basically describes the operations involved, but TeX version would be:
$\sqrt{ \frac{1}{N} \sum_{i=1}^N x_i^2}$
(I looked into putting a pretty figure here, but Chrome doesn't support most of MathML yet, and I didn't figure there would be that many people using firefox circa 2020.)

Julia doesn't have a built-in for rms(), but the forums have a number of suggestions:


sqrt(mean(A .^ 2.))
sqrt(sum(x->x^2, A)/length(A))
norm(A)/sqrt(length(A))

The first version is kind of slow (3.5 times slower than the baseline). The second one saves having to generate a temporary copy of the entire array, instead squaring each term one by one as its consumed by sum(), it comes in only four percent slower than the baseline. The third one I expected great things from, as the norm() operation is basically the root of the sum of the squares, but it was actually slower than the other version at thirteen percent over the baseline. So what was the baseline? Well, I wrote it out:


function rms(A)
  s= 0
  @simd for e in A
    s += e * e
  end
  sqrt( s / length(A) )
end

That little bit of magic dust before the for is required, probably to specify that none of the iterations have any dependency on any other, and let the compiler go crazy. I actually wrote this in C as well, using two different styles (tranditional and performant), and while there was a small variation in the timings (one was a smidge faster, and the other a sliver slower), the julia version and the C versions were practically identical for the clang version. The gcc version didn't do as well. But the whole ordeal with compile flags and such is a story for another time.

Reference: benchmark_rms.jl Pluto notebook.

Labels: benchmark, julia, performance

¶ 8:00 PM 0 comments

2020-11-06

Julia ups and downs.

Still struggling with Julia, which is both a criticism and how it should be. A criticism because certain things don't make sense, or work poorly, and that's an area where the language could improve. And how it should be, because learning a sufficiently interesting language requires understanding new idioms, different syntax, and large libraries of expertise.

Since I'm doing this all on a whim, I'm still mostly using Steve Brunton's classes as the exercises. I've down shifted to the Begining Scientific Computing series which is kind of review, so I can zip through the lectures faster. Unfortunately the videos are not at all organized, and its somewhat of a puzzle to work out the order. I'm doing my best to document what I think is the progression in the comments of each of my julia files here:

As I've been going along, one thing that has raised it head, is the form of data that plot() like to make time series, vs the form the linear algebra solvers use. The best I've come up with to pull one row of data out of a vector of vectors is [e[1] for e in vu], but I fear this is a copy of the data which makes me sad.

To brush up on my basics, I found this interesting introduction, which wasn't so much an introduction as a tour through a lot of interesting topics like benchmarking, multiple plots in one pane, inline C code, and even simd. Its also just fun watching someone get excited by the ternary operator.

using BenchmarkTools
using Random
using Statistics
using LinearAlgebra
using Plots

A = 2 * rand(10^7)
T_bench= @benchmark sqrt(mean(A .^ 2.))
T_bench2= @benchmark sqrt(sum(x->x*x, A)/ length( A ))
T_bench3= @benchmark norm(A) / sqrt(length(A))
histogram( T_bench.times )

Inlining C code, though the video cut the right edge off and I had to guess at what was missing,

using Libdl
C_code = """
#include <stddef.h>
#include <math.h>     
double c_rms(size_t n, double * X) {
  double s= 0.0 ;
  for ( size_t i= n ; ( i -- ) ; X ++ ) { s += ( *X * *X ) ; }
  return sqrt( s / n ) ;
}
double c_rmse(size_t n, double * X) {
  double s= 0.0 ;
  for ( size_t i= 0 ; ( i < n ) ; i ++ ) { s += X[i] * X[i] ; }
  return sqrt( s / n ) ;
}
"""
const Clib = tempname()
open( `gcc -fPIC -O3 -msse3 -xc -shared -ffast-math -o $(Clib * "." * Libdl.dlext) -`, "w" ) do f
  print(f, C_code)
end  
c_rms( X::Array{Float64}) = ccall((:c_rms, C_lib), Float64, (Csize_t, Ptr{Float64},), length(X), X )
c_rmse( X::Array{Float64}) = ccall((:c_rmse, C_lib), Float64, (Csize_t, Ptr{Float64},), length(X), X )
c_rms( A )

And finally, some parallel coding in Julia,

function rms(A)
  s = zero(eltype(A))  # generic versiion
  @simd for e in A
    s += e * e
  end
  sqrt( s / length(A) )
end

To try these pluto notebooks out without having to have Julia running locally, there's a Binder transform here, but I think I may eventually setup a pluto instance on my server.

Labels: julia

¶ 8:00 PM 0 comments