Tuomas and I started on a quest to find out just how much performance we could squeeze out of the N800 by compiling the whole thing with VFP support (in essence, using the hardware floating point unit on the device).
You might be surprised to learn that most, if not all, of the software on the N800 is compiled without support for the FPU. I guess the main motivation is library size, and Tommi Komulainen mentioned on IRC that enabling VFP in GTK+ and Pango increased the boot time without real benefits, so it was quickly reverted.
The reasons for this boot time growth most likely include the loss of thumb instruction support that VFP implies, which means growth in library size (I'll get back to that below), as well as the fact that thumb code and VFP code do not mix well at all.
Let's look at these hypotheses a bit closer:
Library size and memory consumption
Tuomas already had figures on the rootfs size growth as well as per-library notes in his blog, and that much is obvious. Usually the number one reason to use thumb code in the first place is to reduce code size. I don't consider the extra flash usage a problem (after all, you can now get 8 GB of space with the two SD slots...), but unfortunately the growth carries over to RAM usage as well.
We didn't think about this too much while running the tests, but I tried a quick 'cat /proc/meminfo' at the end of the boot cycle and got the following results:
| Memory field | Thumb | VFP |
|---|---|---|
| MemFree | 70344 kB | 63660 kB |
| Cached | 26740 kB | 30648 kB |
| Mapped | 11844 kB | 14228 kB |
So you get an initial penalty of almost 7 MB (6684 kB less MemFree) for wanting more speed. This will most likely increase when more libraries are loaded, although most of the common stuff should already be in memory by the time the desktop is up. My gut feeling is that this will become a problem only if you tend to run multiple apps or keep many browser windows open, but of course I have nothing concrete to offer as proof...
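To make the arithmetic behind that penalty explicit, here is a minimal Python sketch that subtracts the thumb column from the VFP column of the table above. The numbers are copied from the table; the names `meminfo_kb` and `vfp_minus_thumb` are made up for this illustration.

```python
# Values copied from the /proc/meminfo table above, in kB.
meminfo_kb = {
    "MemFree": {"thumb": 70344, "vfp": 63660},
    "Cached": {"thumb": 26740, "vfp": 30648},
    "Mapped": {"thumb": 11844, "vfp": 14228},
}


def vfp_minus_thumb(table):
    """Return the VFP-minus-thumb difference for every field, in kB."""
    return {field: row["vfp"] - row["thumb"] for field, row in table.items()}


deltas_kb = vfp_minus_thumb(meminfo_kb)

for field, delta in deltas_kb.items():
    # A negative MemFree delta means less free memory on the VFP system.
    print(f"{field:8s} {delta:+6d} kB")
```

MemFree drops by 6684 kB, while Cached and Mapped grow by 3908 kB and 2384 kB respectively, which fits the idea that the bigger libraries are what eat the memory.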
The boot speed is also discussed by Tuomas, but here are a few notes about that as well:
- Tuomas boots from the SD card, which is much faster than the internal flash (whose jffs2 filesystem also compresses files on the fly), so he gets hit by only 2 s (not the 5 s that Tommi reported)
- I booted from flash and got the 5 s penalty. Then I disabled the 'sleep 9' line in /etc/osso-af-init/real-af-base-apps and it dropped to 2 s for me too... I'm not 100% sure that maemo-af-desktop has finished starting by that point, but it doesn't matter much, since you can't do anything with it for 5 s after it's shown anyway (the menu opens slowly the first time and the connectivity statusbar applet seems to block until it's connected to my WLAN...). I wonder if that 9 s hack (presumably done for the 770) is really needed and/or beneficial any more?-)
Thumbing the VFP down
In short, mixing thumb code and VFP code is not a good idea. This is due to the nature of thumb instructions: they are 16 bits wide versus the 32 bits of "normal" ARM instructions, and the VFP instructions are not available in thumb code, so VFP-enabled code has to be built as normal ARM code. When the processor executes thumb code, it needs to be in a different mode than when executing normal code. If the processor has to execute the two types of code in sequence, it needs to switch back and forth between the modes, which is relatively slow. Now, picture a scenario where Glib is thumb code and GTK+ is VFP. Every single time any GObject traffic happens, like say signals, the processor jumps from mode to mode like crazy: it executes GObject code in thumb, then signal handlers in VFP, which are more than likely to use stuff like g_strdup() etc., which again means going back to thumb... You can see it's a lost cause to find any performance there.
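As a rough way to picture the ping-pong, here is a toy Python model that counts how often the instruction set changes along a call trace. Everything in it is hypothetical: the trace, the library-to-mode mappings and the function name `count_mode_switches` are invented for illustration, and it says nothing about what a single switch actually costs on the hardware.

```python
def count_mode_switches(call_trace, mode_of):
    """Count transitions between instruction sets along a call trace."""
    modes = [mode_of[lib] for lib in call_trace]
    return sum(1 for prev, cur in zip(modes, modes[1:]) if prev != cur)


# Made-up trace of which library is executing during one signal emission:
# GObject code, a GTK+ handler, g_strdup() in Glib, back to the handler, ...
trace = ["glib", "gtk", "glib", "gtk", "glib", "libc", "glib"]

# Hypothetical builds: only GTK+ rebuilt as ARM+VFP versus everything rebuilt.
mixed_build = {"glib": "thumb", "gtk": "arm", "libc": "thumb"}
pure_build = {"glib": "arm", "gtk": "arm", "libc": "arm"}

print("mixed:", count_mode_switches(trace, mixed_build))
print("pure: ", count_mode_switches(trace, pure_build))
```

In this made-up trace the mixed build switches four times for one signal emission, while the pure build never switches at all.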
The effect of mixing is pretty well illustrated by the following table, which shows the gtkperf timings step by step when working towards a fully VFP system. For the test, I took the debs that Tuomas had built and divided them into chunks, then installed one chunk at a time, rebooting and running gtkperf after each.
| Component switched to VFP | Total time for 'gtkperf -a -c 500' (s) |
|---|---|
| Totally thumb system | 931.57 |
| Hildon libraries, Sapwood | 920.84 |
| GTK+ | 909.27 |
| Pango | 902.66 |
| Glib | 879.35 |
| Fontconfig, Freetype | 882.61 |
| The X11 libraries | 881.30 |
| The Xomap X11 server | 798.45 |
| Libc | 742.58 |
Gtkperf is not the most robust test in the world, and we observed big variance between runs. In particular, the thumb-only number is one of the bigger ones I got, but it was never much under 900 s and the 930ish figure kept repeating for me at that time... Thus the numbers above should not be taken at face value, but as an indication of a trend. And the trend is very clear in these numbers: the difference between the thumb system and the VFP system is almost 190 s, so that cannot possibly be a fluke ;)
The most notable drops in the time come at the end, when we start to get a "pure" VFP system. To me this is a clear indication that any benefit gained from VFP in the higher libs is swallowed by the constant mode switching (and probably by other, more complex issues like caches). When you don't need to switch modes anymore, it's all pure benefit from hardware floats.
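To see where the time actually goes, here is a small Python sketch that computes how much each step shaved off the previous total. The timings are copied from the table above; the names `steps`, `step_savings` and `last_two_share` are made up for this illustration, and given the run-to-run variance the small deltas should not be read too closely.

```python
# (component switched to VFP, total gtkperf time in seconds), from the table.
steps = [
    ("Totally thumb system", 931.57),
    ("Hildon libraries, Sapwood", 920.84),
    ("GTK+", 909.27),
    ("Pango", 902.66),
    ("Glib", 879.35),
    ("Fontconfig, Freetype", 882.61),
    ("The X11 libraries", 881.30),
    ("The Xomap X11 server", 798.45),
    ("Libc", 742.58),
]


def step_savings(rows):
    """Seconds saved by each step relative to the step before it."""
    return [
        (name, round(prev - cur, 2))
        for (_, prev), (name, cur) in zip(rows, rows[1:])
    ]


savings = step_savings(steps)
total_saved = steps[0][1] - steps[-1][1]
# Share of the total saving that comes from the last two steps.
last_two_share = sum(saved for _, saved in savings[-2:]) / total_saved

for name, saved in savings:
    print(f"{name:28s} {saved:+7.2f} s")
print(f"total saved: {total_saved:.2f} s, last two steps: {last_two_share:.0%}")
```

The X server and libc steps alone account for roughly three quarters of the total saving, while the Fontconfig/Freetype step even shows a small negative saving, which is within the noise we saw between runs.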
So what about Cairo, my very best favourite graphics library? Yes indeed, it too gets a hefty boost from using hardware floats. The results split pretty clearly into two sets: the radials (2x-7x boost) and the rest (<2x boost or nothing). The mosaic_tesselate_curves test also gets a good boost (almost 2.5x), so it's not only radial patterns that benefit.
In conclusion, I'm not shy about wasting my flash and RAM on speed, so I'll definitely be running a VFP-enabled system on my N800 in the future :)