(Based on a talk that got canceled on me at the last second. [slides])
This is the story of how I took websocket message processing from 23s to 285ms.
I was working on creating a web-app version of a desktop game I made ages ago and wanted to add game-state syncing between clients. It seemed like a good chance to explore websockets, since I hadn't done much with them before, and they looked like they might be a nice fit. The app happened to be written in CHICKEN Scheme, and CHICKEN didn't have a websockets library. I took a quick look at the websockets spec, and it didn't seem too difficult to implement, so I started work on a CHICKEN websockets library.
Indeed it proved relatively easy to implement, but I had a problem: message processing was horrifically slow. The test case I had chosen was to serially process and echo back 1,000 16MB text messages. My first version took 23 seconds. It didn't actually matter for the web app I was working on, but who can live with such an indignity? Plus I was going to publish it, and that would just make me look bad to anyone who ran the same benchmark. So I set out to change that.
(Note that this piece is not a "language shootout" or some other attempt to prove that one thing is faster than another. It is just a set of observations on optimizing a somewhat arbitrarily chosen benchmark for the fun of it.)
I chose this test case to make measuring differences easy, and because sending and receiving text messages seemed like a relatively common thing to do with websockets. There are of course a million other things that could be optimized. For the sake of this piece I'll use the 16MB test, but I also tracked smaller, more typically sized messages, and the results scaled linearly, so the large size is representative.
The first version took 23 seconds, while other implementations took between 400ms and 2.5s. Very embarrassing. My first goal was just to get it to around 1s, which seemed reasonable enough for something I didn't really need in the first place. By the time I reached that point, though, I was too addicted and resolved to try to beat the other implementations, or at least be among the fastest.
I started out being lazy: instead of profiling, I just took a couple of guesses at what was causing the slowness. The few quick things I tried that seemed like they would be slow didn't actually make much of a difference. I did, though, observe that the heap would balloon from 15MB to 1.5GB! After running the benchmark it would drop back down to around 40MB, so it seemed obvious that too much garbage was being generated — but where? So I went unlazy and pulled out the profiler.
As expected, a significant amount of the processing time — about 50% — was spent in the GC. It was easy to see from the profiler output that the garbage was mostly coming from the message being copied multiple times. After removing the unnecessary copying, the processing time was down to 2.5s. While that was a fantastic improvement, it was still too slow. Improving it further took more work.
The next suspect in the profiler output seemed to be coming from a dependency, or even the CHICKEN language runtime itself. At this point I had only instrumented the websockets library, not the libraries it depended on or the language runtime. So the next step was to instrument everything else. The next round of profiling produced an interesting result: there appeared to be a bottleneck in the language runtime itself.
It took a fair amount of effort to track down, but it turned out the method used for copying the return message to the output stream was copying it one character at a time. After patching the runtime to copy the whole message at once, the processing time dropped from 2.5s to 1s.
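The actual patch was to the CHICKEN runtime, which isn't shown here, but the difference is easy to illustrate in C. This is a hypothetical sketch, not the runtime's real code: the function names and the use of stdio streams are assumptions purely for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Slow path: one library call per byte. On a 16MB payload that is
 * ~16 million calls' worth of per-call overhead. */
size_t write_bytewise(const char *buf, size_t len, FILE *out) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (fputc(buf[i], out) != EOF)
            n++;
    return n;
}

/* Fast path: hand the whole buffer to the stream in a single call
 * and let the stream layer do one bulk copy. */
size_t write_bulk(const char *buf, size_t len, FILE *out) {
    return fwrite(buf, 1, len, out);
}
```

Both write the same bytes; the bulk version just pays the call and bookkeeping overhead once instead of once per character.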
The next bottleneck I identified and attacked was the unmasking algorithm. Unmasking is required by the websockets spec. It is a pseudo-security mechanism that requires XOR-ing the entire message against a 4-byte masking key.
One of my favorite features of CHICKEN is that it allows directly embedding C code, which is quite nice when you really want to squeeze out some extra performance. Optimizing the Scheme code itself didn't improve things much, so I naturally turned to C. That allowed me to use some low-level trickery that brought the processing time from 1s down to 700ms.
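The post doesn't show the embedded C, but the general shape of this kind of trickery is well known. Here's a sketch — not the library's actual code — of websocket unmasking (RFC 6455 masks the payload with a rotating 4-byte key) done the obvious way, and then eight bytes at a time:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Straightforward unmasking: one XOR per byte. */
void unmask_bytewise(uint8_t *buf, size_t len, const uint8_t mask[4]) {
    for (size_t i = 0; i < len; i++)
        buf[i] ^= mask[i % 4];
}

/* Lower-level version: widen the 4-byte key to 64 bits and XOR eight
 * bytes per iteration, with a bytewise fallback for the tail.  The
 * memcpy calls sidestep alignment issues and compile to plain loads
 * and stores on mainstream compilers. */
void unmask_wordwise(uint8_t *buf, size_t len, const uint8_t mask[4]) {
    uint8_t key8[8] = { mask[0], mask[1], mask[2], mask[3],
                        mask[0], mask[1], mask[2], mask[3] };
    uint64_t wide;
    memcpy(&wide, key8, 8);

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, buf + i, 8);
        chunk ^= wide;
        memcpy(buf + i, &chunk, 8);
    }
    for (; i < len; i++)   /* i is a multiple of 8, so i % 4 == 0 here */
        buf[i] ^= mask[i % 4];
}
```

Since XOR with the same key is its own inverse, unmasking a buffer twice restores the original, which makes the two versions easy to check against each other.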
The next bottleneck was UTF-8 validation; the websockets spec requires text messages to be valid UTF-8. My original version used a parser combinator for validation, but it wasn't a speed demon. I searched the internets for a faster way to validate UTF-8. There are a number of implementations out there, and some websockets libraries had their own or derived versions. The most common one, which happened to be compatibly open-source licensed, was used by a number of implementations and indeed was significantly faster than the parser combinator. But while testing it I discovered it didn't actually validate correctly! For some inputs it would claim they were valid when they were clearly not. I would have none of that. After trying to fix it, I decided to go back to the correctly validating parser combinator implementation and try to improve that instead. It turned out not to be too difficult, and that brought the processing time from 700ms to 400ms.
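The parser combinator itself isn't shown in the post. As a reference point for what "validating correctly" means, here is a sketch of a plain byte-at-a-time UTF-8 validator in C — an assumption of mine, not the library's code — that rejects the classic trouble cases (overlong encodings, surrogates, out-of-range code points) that a sloppy validator tends to let through:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Validate a buffer as well-formed UTF-8 per RFC 3629. */
bool utf8_valid(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t n;        /* number of continuation bytes */
        uint32_t cp, min;
        if (b < 0x80) { i++; continue; }                       /* ASCII */
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                       /* stray continuation, F8+ */
        if (i + n >= len) return false;          /* truncated sequence */
        for (size_t j = 1; j <= n; j++) {
            uint8_t c = s[i + j];
            if ((c & 0xC0) != 0x80) return false;   /* bad continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                 /* overlong encoding */
        if (cp > 0x10FFFF) return false;            /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogates */
        i += n + 1;
    }
    return true;
}
```

The overlong and surrogate checks are exactly the ones that distinguish a correct validator from one that merely checks the bit patterns of lead and continuation bytes.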
So now I had reached roughly the same performance as the faster implementations, but I couldn't help trying to do even better. I still kept thinking there must be a faster way to validate UTF-8. I played around with things a bit and then realized that if you only want to check whether a message is plain ASCII, it can be done very quickly and efficiently, since you only have to verify that each byte value is less than 128 and move on to the next. I modified the UTF-8 validation function to first attempt to validate the message as plain ASCII; if it came across a non-ASCII character, it would bail out and drop back to the parser combinator implementation. With that part written in C it was practically free and generated no garbage. This brought the test case from 400ms down to 285ms! Sure, it is cheating a bit, but if you're lucky enough to have all or mostly plain-ASCII messages then you will have even faster message processing. Worst-case performance is roughly on par with the faster implementations. So yay, it isn't the slowest anymore!
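The ASCII fast path can be sketched in a few lines of C. This is my own illustration, not the library's code: a buffer is pure ASCII exactly when no byte has its high bit set, so you can OR eight bytes at a time into an accumulator and test one mask at the end.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Return true iff every byte in buf is < 128 (plain ASCII).
 * A caller would fall back to full UTF-8 validation on false. */
bool all_ascii(const uint8_t *buf, size_t len) {
    uint64_t acc = 0;
    size_t i = 0;
    /* OR eight bytes per iteration into the accumulator. */
    for (; i + 8 <= len; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, buf + i, 8);
        acc |= chunk;
    }
    /* Fold in any tail bytes one at a time. */
    uint8_t tail = 0;
    for (; i < len; i++)
        tail |= buf[i];
    /* ASCII iff no high bit is set in any byte lane. */
    return ((acc | tail) & UINT64_C(0x8080808080808080)) == 0;
}
```

No branching per byte, no allocation, and the answer for a pure-ASCII message doubles as a successful UTF-8 validation, since ASCII is a subset of UTF-8.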
The moral of the story is mostly to not be lazy and to start with profiling. Without it, it would be much more difficult to figure out where the bottlenecks are, and to know which bottlenecks are the most significant and should be attacked first.
And lastly, feel free to cheat when you can.
It is entirely likely that other implementations are slower than they could be because they have extra features, or accept slower message processing in exchange for better parallelism, for instance; if they removed those, they might well be faster than this implementation.
I didn't test every implementation, just some of the more popular ones. It's very possible there are faster ones.
The benchmarks are older now, around a year. It is entirely possible implementations have gotten faster since then.
There are still a fair number of areas I've identified for improvement but it's good enough for now.
In the final version image size settled around 40MB.
One of my initial thoughts was to switch some things from purely functional to imperative, since intuitively you expect that generating garbage will hurt performance. It turns out that actually slowed things down in one case. On further investigation, the CHICKEN compiler is better able to optimize some cases when you don't modify state, and here the functional version even generated less garbage. Other compilers are known to do the same in certain situations. So don't guess. Test and profile.
And again, the competition was more for developing targets and providing motivation, nothing more really.