Saturday, June 27, 2015

Don't REST ! Revisiting RPC performance and where Fowler might have been wrong ..

[Edit: Title is a click bait of course, Fowler is aware of the async/sync issues, recently posted http://martinfowler.com/articles/microservice-trade-offs.html with clarifying section regarding async. ]

Hello my dear citizens of planet earth ...

There are many good reasons to decompose large software systems into decoupled message passing components (team size + decoupling, partial + continuous software delivery, high availability, flexible scaling + deployment architecture, ...).

With distributed applications, there comes the need for ordered point to point message passing. This is different to client/server relations, where many clients send requests at low rate and the server can choose to scale using multiple threads processing requests concurrently.
Remote Messaging performance is to distributed systems what method invocation performance is for non-distributed monolithic applications.
(guess what is one of the most optimized areas in the JVM: method invocation)

[Edit: with "REST", I also refer to HTTP based webservice style API, this somewhat imprecise]

Revisiting high level remoting Abstractions

There were various attempts at building high-level, location transparent abstractions (e.g. corba, distributed objects), however in general those idea's have not received broad acceptance.

This article of Martin Fowler sums up common sense pretty well:

http://martinfowler.com/articles/distributed-objects-microservices.html

Though not explicitely written, the article implies synchronous remote calls, where a sender blocks and waits for a remote result to arrive thereby including cost of a full network roundtrip for each remote call performed.

With asynchronous remote calls, many of the complaints do not hold true anymore. When using asynchronous message passing, the granularity of remote calls is not significant anymore.

"course grained" processing

remote.getAgeAndBirth().then( (age,birth) -> .. );

is not significantly faster than 2 "fine grained" calls

all( remote.getAge(), remote.getBirth() ) 
   .then( resultArray -> ... ); 

as both variants include network round trip latency only once.

On the other hand with synchronous remote calls, every single remote method call has a penalty of one network round trip, and only then Fowlers arguments hold.

Another element changing the picture is the availability of "Spores", a snippet of code which can be passed over the network and executed at receiver side e.g.

remote.doWithPerson( "Heinz", heinz -> {
    // executed remotely, stream data back to callee
    stream( heinz.salaries().sum() / 12 ); finish();
}).then( averageSalary -> .. );

Spore's are implementable effectively with availability of VM's and JIT compilation.

Actor Systems and similar asynchronous message passing approaches have gained popularity in recent years. Main motivation was to ease concurrency and the insight that multithreading with shared data does not scale well and is hard to master in an industrial grade software development environment.

As large servers in essence are "distributed systems in a box", those approaches apply also for distributed systems.

Following I'll test performance of remote invocations of some frameworks. I'd like to proof that established frameworks are far from what is technically possible and want to show that popular technology choices such as REST are fundamentally inept to form the foundation of large and fine grained distributed applications.


Test Participants

Disclaimer: As tested products are of medium to high complexity, there is danger of misconfiguration or test errors, so if anybody has a (verfied) improvement to one of the testcases, just drop me a comment or file an issue to the github repository containing the tests:

https://github.com/RuedigerMoeller/remoting-benchmarks.

I verified by googling forums etc. that numbers are roughly in line with what others have observed.

Features I expect from a distributed application framework:
  • Ideally fully location transparent. At least there should be a concise way (e.g. annotations, generators) to do marshalling half-automated. 
  • It is capable to map responses to their appropriate request callbacks automatically (via callbacks or futures/promises or whatever).
  • its asynchronous

products tested (Disclaimer: I am the author of kontraktor):
  • Akka 2.11
    Akka provides a high level programming interface, marshalling and networking details are mostly invisible to application code (full location transparency).
  • Vert.x 3.1
    provides a weaker level of abstraction compared to actor systems, e.g. there are no remote references. Vert.x has a symbolic notion of network communication (event bus, endpoints).
    As it's "polyglot", marshalling and message encoding need some manual support.
    Vert.x is kind of a platform and addresses many practical aspects of distributed applications such as application deployment, integration of popular technology stacks, monitoring, etc.
  • REST (RestExpress)
    As Http 1.1 based REST is limited by latency (synchronous protocol), I just choosed this more or less randomly.
  • Kontraktor 3, distributed actor system on Java 8. I believe it hits a sweet spot regarding performance, ease of use and mind model complexity. Kontraktor provides a concise, mostly location transparent high level programming model (Promises, Streams, Spores) supporting many transports (tcp, http long poll, websockets).

Libraries skipped:
  • finagle - requires me to clone and build their fork of thrift 0.5 first. Then I'd have to define thrift messages, then generate, then finally run it. 
  • parallel universe - at time of writing the actor remoting was not in a testable state ("Galaxy" is alpha), examples are without build files, the gradle build did not work. Once i managed to build, the programs where expecting configuration files which I could not find. Maybe worth a revisit (accepting pull requests as well :) ).

The Test

I took a standard remoting example:
The "Ask" testcase: 
The sender sends a message of two numbers, the remote receiver answers with the sum of those 2 numbers. The remoting layer has to track and match requests and responses as there can be tens of thousand "in-flight". 
The "Tell" testcase: 
Sender sends fire-and forget. No reply is sent from the receiver. 

Results

Attention: Don't miss notes below charts.

Platform: Linux Centos 7 dual socket 20 real cores @2.5 GHZ, 64GB ram. As the tests are ordered point to point, none of the tests made use of more than 4 cores.

tell Sum
(msg/second)
ask Sum (msg/second)
Kontraktor Idiomatic1.900.000860.000
Kontraktor Sum-Object1.450.000795.000
Vert.x 3 200.000200.000
AKKA (Kryo)120.00065.000
AKKA70.00064.500
RestExpress15.00015.000
Rest >15 connections48.00048.000

let me chart that for you ..



Remarks:

  • Kontraktor 3 outperforms by a huge margin. I verified the test is correct and all messages are transmitted (if in doubt just clone the git repo and reproduce). 
  • Vert.x 3 seems to have a built-in rate limiting. I saw peaks of 400k messages/second however it averaged at 200k (hints for improvement welcome). In addition, the first connecting sender only gets 15k/second throughput. If I stop and reconnect, throughput is as charted.
    I tested the very first Vert.x 3 final release. For marshalling fast-serialization (FST) was used (also used in kontraktor). Will update as Vert.x 3 matures
  • Akka. I spent quite some time on improving the performance with mediocre results. As Kryo is roughly of same speed as fst serialization, I'd expect at least 50% of kontraktor's performance.

    Edit:
    Further analysis shows, Akka is hit by poor serialization performance. It has an option to use Protobuf for encoding, which might improve results (but why kryo did not help then ?).

    Implications of using protobuf:
    * need each message be defined in a .proto file, need generator to be run
    * frequently additional datatransfer is done like "app data => generated messages => app data"
    * no reference sharing support, no cyclic object graphs can be transmitted
    * no implicit compression by serialization's reference sharing.
    * unsure wether the ask() test profits as it did not profit from Kryo as well
    * Kryo performance is in the same ballpark as protobuf but did not help that much either.

    Smell: I had several people contacting me aiming to improve Akka results. They disappear somehow.

    Once I find time I might add a protobuf test. Its a pretty small test program, so if there was an easy fix, it should not be a huge effort to provide it. The git repo linked above contains a maven buildable ready to use project.
  • REST. Poor throughput is not caused by RestExpress (which I found quite straight forward to use) but by Http1.1's dependence on latency. If one moves a server to other hardware (e.g. different subnet, cloud), throughput of a service can change drastically due to different latency. This might change with Http 2.
    Good news is: You can <use> </any> <chatty> { encoding: for messages }, as it won't make a big difference for point to point REST performance.
    Only when opening many connections (>20) concurrently, throughput increases. This messes up transaction/message ordering, so can only be used for idempotent operations (a species mostly known from white papers and conference slides, rarely seen in the wild).


Misc Observations


Backpressure

Sending millions of messages as fast as possible can be tricky to implement in a non-blocking environment. A naive send loop
  • might block the processing thread 
  • build up a large outbound queue as put is faster than take+sending. 
  • can prevent incoming Callbacks from being enqueued + executed (=Deadlock or OOM).
Of course this is a synthetic test case, however similar situations exist e.g. when streaming large query results or sending large blob's to other node's (e.g. init with reference data).

None of the libraries (except rest) did that out of the box:
  • Kontraktor requires a manual increase of queue sizes over default (default is 32k) in order to not deadlock in the "ask" test. In addition its required to programatically adopt send rate by using the backpressure signal raised by the TCP stack (network send blocks). This can be done non-blocking, "offer()" style.
  • For VertX i used a periodic task sending a burst of 500 to 1000 messages. Unfortunately the optimal number of messages per burst depends on hardware performance, so the test might need adoption when run on e.g. a Laptop.
  • For Akka I send 1 million messages each 30 seconds in order to avoid implementation of application level flow control. It just queued up messages and degrades to like 50 msg/s when used naively (big loop).
  • REST was not problematic here (synchronous Http1.1 anyway). Degraded by default.



Why is kontraktor remoting that faster ?
  • premature optimization 
  • adaptive batching works wonders, especially when applied to reference sharing serialization.
  • small performance compromises stack up, reduce them bottom up 
Kontraktor actually is far from optimal. It still generates and serializes a "MethodCall() { int targetId, [..], String methodName, Object args[]}" for each message remoted. It does not use Externalizable or other ways of bypassing generic (fast-)serialization.
Throughputs beyond 10 million remote method invocations/sec have been proved possible at cost of a certain fragility + complexity (unique id's and distributed systems ...) + manual marshalling optimizations.


Conclusion

  • As scepticism regarding distributed object abstractions is mostly performance related, high performance asynchronous remote invocation is a game changer
  • Popular libraries have room for improvement in this area 
  • Don't use REST/Http for inter-system connect, (Micro-) Service oriented architectures. Point to point performance is horrible. It has its applications in the area of (WAN) web services / platform neutral, easily accessible API's and client/server patterns.
  • Asynchronous programming is different, requires different/new solution patterns (at source code level). It is unavoidable to learn use of asynchronous messaging primitives.
    "Pseudo Synchronous" approaches (e.g. fibers) are good in order to better scale multithreading, but do not work out for distributed systems.
  • lack of craftsmanship can kill visions.