The first version (testsort1.cpp), had some warts. I used inline macros for some boilerplate expansion, inheritance for the different sorters, and I used a self organizing global singleton list to collect the tasks to run. Plus the version I turned in had an error in one of the three versions of merge sort I wrote. Not that surprising that they didn't call back.
But I decided to press on, and see if I could do better. For one thing, the timing I was getting from shell sort (one of my favorites) was not that great; and I was watching a lot of industry experts in online conferences railing against inheritance (see Sean Parent, an Adobe principal scientist, talk about how Inheritance Is The Base Class of Evil). So my second version (testsort2.cpp), used initializer lists, but needed this really ugly adapter to deal with having two different signatures for the sort functors depending on whether they operated on iterators or raw pointers. I also had a little fight with formatting floating point numbers with iostreams, but that's a whole other story.
Finally, after running over the type erasure/pimpl pattern several times in the Going Native 2013 lectures, I decided to give it a try. Its a really convoluted strategy. Basically, instead of having an abstract base class and a descendant with its own data and extra stuff that implements the base interface, you create a concrete object that doesn't do anything, but has a safe pointer to a virtual base class, which is implemented by a descendant adapter class which is templated to dispatch all its calls to a helper object which it owns. Hopefully the compiler is cleaning up most of this mess.
Anyways, I implemented all of this in version three (testsort3.cpp) which I think is pretty good. I also templated some of the sort routines, though I should use the traits of containers to make the interface a little nicer. One interesting thing though were the results.
I ran GCC and Clang on a 1U server first, and realized these days that -O0 is pretty horrible, especially on the iterator based code:
Compiler | n | Bubble Sort | Selection | Insertion Sort | |||
---|---|---|---|---|---|---|---|
Iter | Raw | Iter | Raw | Iter | Raw | ||
gcc -O0 | 10000 | 2515.88 | 836.98 | 711.25 | 156.54 | 304.98 | 60.87 |
gcc -O2 | 10000 | 191.08 | 182.4 | 173.08 | 173.07 | 22.29 | 22.17 |
gcc -O3 | 10000 | 197.23 | 213.65 | 87.49 | 87.45 | 22.35 | 22.2 |
Clang -O0 | 10000 | 1952.49 | 686.21 | 675.17 | 158.1 | 287.81 | 79.07 |
Clang -O2 | 10000 | 215.3 | 210.12 | 63.93 | 63.92 | 22.15 | 22.08 |
Clang -O3 | 10000 | 212.86 | 213.72 | 63.67 | 63.67 | 22.49 | 22.51 |
Thankfully with some optimization the difference between iterators and raw pointers is down to noise, though it also seems these days that -O3 vs -O2 is pretty much a wash.
It was also amazing how wide the spread was between the worst and best sorts:
Compiler | n | Bubble | Selection | Insertion | Shell | Merge | buf-Merge | qsort() | std::sort |
---|---|---|---|---|---|---|---|---|---|
gcc -O2 | 50000 | 5226.38 | 4353.58 | 576.49 | 4.44 | 7.67 | 5.15 | 6.75 | 3.3 |
gcc -O3 | 50000 | 5347.93 | 2198.94 | 580.26 | 4.84 | 7.61 | 3.97 | 6.88 | 3.36 |
clang -O2 | 50000 | 5361.56 | 1587.31 | 578.19 | 4.77 | 7.36 | 4.81 | 6.97 | 3.12 |
clang -O3 | 50000 | 5345.02 | 1587.3 | 584.16 | 4.59 | 8.13 | 4.83 | 6.88 | 3.18 |
Everybody knows that bubble sort is just the worst sort ever, but I was surprised that it was 1000X worse than the cluster of the fast ones. Even selection and insertion sort are way out there, being 300X and 100X times the best times. But I was also curious how well the fast ones would fare as the set sizes continued to grow.
So I pushed on into bigger datasets with the fast ones (sorry, I wasn't going to time n=1000000 for bubble sort with it growing at O(n^2)). Letting it run overnight, I let the fast ones push the bounds of the server's physical memory:
Compiler | n | Shell | Merge | buf-Merge | qsort() | std::sort |
---|---|---|---|---|---|---|
gcc -O3 | 1000000 | 121.52 | 174.83 | 98.2 | 176.69 | 83.99 |
gcc -O3 | 10000000 | 1519.97 | 1950.48 | 1124.86 | 1984.32 | 967.28 |
gcc -O3 | 100000000 | 18622.18 | 21343.25 | 12492.76 | 22463.4 | 10886.02 |
gcc -O3 | 1000000000 | 67849.88 | 171505.77 | 204308.42 | 204446.49 | 47208.55 |
clang -O3 | 1000000 | 124.86 | 176.53 | 119.42 | 174.18 | 79.76 |
clang -O3 | 10000000 | 1564.63 | 1982.01 | 1392.44 | 2010.94 | 926.81 |
clang -O3 | 100000000 | 19084.54 | 21694.59 | 15671.13 | 22861.94 | 10585.51 |
clang -O3 | 1000000000 | 73168.25 | 1040334.6 | 144162.16 | 187819.62 | 45389.01 |
Here we see performance between clang and gcc varying a bit more (though not in a consistent fashion). Strangely enough as the set size gets huge, qsort() takes the biggest hit under gcc, while merge sort goes off the rails under clang.
Clearly std::sort is the winner hands down, though shell sort keeps right up with it until the end. This is important to keep in mind as shell sort's implementation is compact (the key loop is 8 lines long), and its non-recursive. Two very useful properties in embedded applications. It also remains mysterious, as academia can still not explain why it works so well. That's probably one of the reasons I like it so much.
Ok, and now for some pretty pictures:
And a comparison between the timings of gcc and clang for shell & std::sort. I think the units are in ns/N.
Not as fancy as using Google Visualization charts live like in this benchmark of std::containers, but it will have to do for now.
Feb '04
Oops I dropped by satellite.
New Jets create excitement in the air.
The audience is not listening.
Mar '04
Neat chemicals you don't want to mess with.
The Lack of Practise Effect
Apr '04
Scramjets take to the air
Doing dangerous things in the fire.
The Real Way to get a job
May '04
Checking out cool tools (with the kids)
A master geek (Ink Tank flashback)
How to play with your kids