Processing websocket messages quickly

thintz.com/slides/websockets

Details

  • CHICKEN Scheme (a Lisp)
  • Wanted websockets; CHICKEN didn't have them
  • Written completely from scratch

websockets summary

Full-duplex communication over a single TCP connection

websocket.send('hi');

websocket.onmessage = function(evt) {
  console.log('server said: ' + evt.data);
};

> server said: hello!

The Test:

Processing a 16 MB text message

  • processing messages is common
  • a large size makes comparisons easier
  • only one of many possible tests

Goal

Be faster than the most popular implementations

  • ~380 ms

23 seconds

version 1

First thing

Instrument everything and profile

  • library itself
  • dependencies
  • runtime

1st conclusions

  • High memory allocation
    • Inherently slow
    • Very high GC pressure

The process image would grow to over 1 GB!

Action

  • Remove unnecessary copying of the message
  • Remove incidental garbage creation

(Both brought down GC pressure significantly)

2.5 seconds

Bottleneck in runtime

write-u8vector

  • internally copied some strings one character at a time!
  • improved by copying the entire string at once
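The difference can be sketched in C. This is an illustrative sketch, not the actual CHICKEN runtime code: the slow path copies one byte per loop iteration, while the fix hands the whole buffer to `memcpy`, which copies in bulk.

```c
#include <assert.h>   /* for the demo assertions */
#include <stdint.h>
#include <string.h>

/* Sketch of the slow path: copying a buffer one byte at a time,
 * paying loop overhead on every single byte. */
static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Sketch of the fix: copy the entire buffer in one call; memcpy
 * typically moves word-sized (or SIMD-sized) chunks internally. */
static void copy_bulk(uint8_t *dst, const uint8_t *src, size_t len) {
    memcpy(dst, src, len);
}
```

Both produce identical results; only the per-byte overhead differs.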

1 second

Bottleneck in unmasking

(a websockets pseudo-security mechanism)

  • Requires xor-ing the entire message

Action

  • Converted from pure Scheme code to C

    (CHICKEN allows embedding C directly)

  • Which allowed using some low-level trickery
  • And eliminating some more garbage generation
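A minimal C sketch of unmasking, per RFC 6455: each payload byte is xor-ed with one of the four mask-key bytes, cycling through the key. The word-at-a-time loop is one example of the kind of low-level trickery C permits; the actual CHICKEN code may differ.

```c
#include <assert.h>   /* for the demo assertions */
#include <stdint.h>
#include <string.h>

/* Unmask (or mask: xor is its own inverse) a WebSocket payload in
 * place, given the 4-byte mask key from the frame header. */
static void unmask(uint8_t *payload, size_t len, const uint8_t key[4]) {
    size_t i = 0;

    /* Low-level trick: build an 8-byte repetition of the key and
     * xor one 64-bit word per iteration instead of one byte.
     * (memcpy avoids unaligned-access issues.) */
    uint64_t wide;
    uint8_t rep[8] = { key[0], key[1], key[2], key[3],
                       key[0], key[1], key[2], key[3] };
    memcpy(&wide, rep, 8);
    for (; i + 8 <= len; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, payload + i, 8);
        chunk ^= wide;
        memcpy(payload + i, &chunk, 8);
    }

    /* Tail: handle the remaining (< 8) bytes one at a time. */
    for (; i < len; i++)
        payload[i] ^= key[i % 4];
}
```

Working on the buffer in place also means no copies and no garbage.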

700 ms

UTF-8 Validation

  • Using a parser combinator algorithm
    • Relatively slow
  • Most websockets implementations use a fast, freely available C implementation

    But it was broken!

UTF-8 Validation Improvements

  • Improved parser combinator algorithm
  • Still slower than some other implementations
  • But remained "correct"

400 ms

Pretty much reached the goal, but can we do better?

Assumption optimization (cheating?)

  • Assumption: most messages are plain ASCII
  • Realization:

    • Plain ASCII can be checked in C effectively for free
    • > If a byte is greater than 127, bail out to the parser combinator algorithm; otherwise continue to the next byte

    • No memory allocation, no garbage, O(n) running time
    • Worst case is still good (~400 ms)
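The fast path above fits in a few lines of C. A sketch (assumed shape, not the actual library code): any byte > 127 means the message is not pure ASCII, and the caller falls back to the full parser combinator validator.

```c
#include <assert.h>   /* for the demo assertions */
#include <stddef.h>
#include <stdint.h>

/* Fast path for UTF-8 validation: a message whose bytes are all
 * <= 127 is plain ASCII, and plain ASCII is always valid UTF-8.
 * Returns 1 if the whole buffer is ASCII; returns 0 on the first
 * byte > 127, signalling the caller to run the slow validator.
 * No allocation, no garbage, single O(n) scan. */
static int all_ascii(const uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (buf[i] > 127)
            return 0;   /* bail to the parser combinator algorithm */
    return 1;
}
```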

285 ms!

Summary:

  • It helps to instrument and profile everything including language runtime
  • Watch out for unnecessary memory allocations
  • Utilize low-level languages/libraries when possible (most fast implementations do)
  • Consider likely "real world" usage as well as academic/theoretical
  • Cheat when possible :)

More info: