Friday, November 20, 2015

Clarification and Fix for Serialization Exploit

There have been reports of a Java serialization based exploit (e.g. here), enabling an attacker to execute arbitrary code. As usual, there is a lot of half-baked reasoning about this issue, so I'd like to present a fix for my fast-serialization library as well as tweaks to block deserialization of certain classes for stock JDK serialization.

In my view the issue is not located in Java serialization itself, but in insecure programming in some common open source libraries.

Attributing this exploit to Java serialization is wrong; the exploit requires door-opening non-JDK code to be present on the server.

Unfortunately some libraries accidentally do this; however, it's very unlikely this is a widespread (mis-)programming pattern.


How can a class at the server side open the door?


  • the class has to define readObject() 
  • the readObject() method has to be implemented in a way that allows code sent from remote to be executed

I struggle to imagine a sane use case where one would do something along the lines of

"readObject(ObjectInputStream in) { Runtime.exec(in.readString()); }"

or

"readObject(ObjectInputStream in) { define_and_execute_class(in.readBytes()); }"

Anyway, it happened. Again: something like the code above must be present on the class path at the server side; it cannot be injected from outside.



Fixing this for fast-serialization


  • fast-serialization 2.43 introduces a delegate interface which enables blacklisting/whitelisting of classes, packages or whole package trees. By narrowing down the classes allowed for (de)serialization, one can ensure only expected objects are deserialized.



  • if you cannot upgrade to fst 2.43 for backward compatibility reasons, at least register a custom serializer for the problematic classes (see exploit details). As registered custom serializers are processed by fast-serialization before falling back to JDK serialization emulation, throwing an exception in the read or instantiate method of the custom serializer will effectively block the exploit (see the sketch below).
  • There is no significant performance impact, as this is executed on initialization only.
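For illustration, a rough sketch of that custom-serializer workaround. I'm quoting the FST 2.x API from memory here, so treat the exact class names and method signatures as assumptions and verify against the fst javadoc; InvokerTransformer is the commons-collections class abused by the published exploit.

import java.io.IOException;
import org.nustaq.serialization.*;

public class FstExploitBlock {
    public static FSTConfiguration createSafeConfiguration() {
        FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();
        conf.registerSerializer(
            org.apache.commons.collections.functors.InvokerTransformer.class,
            new FSTBasicObjectSerializer() {
                @Override
                public void writeObject(FSTObjectOutput out, Object toWrite,
                        FSTClazzInfo clzInfo, FSTClazzInfo.FSTFieldInfo referencedBy,
                        int streamPosition) throws IOException {
                    // never relevant here, we only care about blocking reads
                }
                @Override
                public Object instantiate(Class objectClass, FSTObjectInput in,
                        FSTClazzInfo serializationInfo, FSTClazzInfo.FSTFieldInfo referencee,
                        int streamPosition) {
                    // refusing to instantiate blocks the exploit payload
                    throw new SecurityException("deserialization blocked: " + objectClass.getName());
                }
            },
            true /* also apply to subclasses */ );
        return conf;
    }
}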



Hotfixing stock JDK Serialization implementation

(Verified for JDK 1.8 only)

Use a subclass of ObjectInputStream instead of the original implementation and put in blacklisting/whitelisting by overriding the resolveClass method of ObjectInputStream:
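A minimal sketch of such a subclass (whitelist variant; the entries in the set are made-up examples, your application has to list the classes it actually expects):

import java.io.*;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// whitelist-based ObjectInputStream: resolveClass is invoked for each
// incoming class descriptor, so unexpected classes never even get loaded
public class SafeObjectInputStream extends ObjectInputStream {

    // HashSet => constant-time lookup, negligible overhead per descriptor
    private static final Set<String> WHITELIST = new HashSet<>(Arrays.asList(
        "java.lang.Integer",   // example entries, replace with
        "com.myapp.MyDTO"      // the classes your app expects
    ));

    public SafeObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        if (!WHITELIST.contains(desc.getName()))
            throw new InvalidClassException(desc.getName(), "blocked by whitelist");
        return super.resolveClass(desc);
    }
}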



(maybe also check resolveProxyClass; I'm not sure whether this can be exploited as well [probably not]).
This will have a performance impact, so it's crucial to have an efficient implementation of blacklisting/whitelisting here (e.g. a constant-time HashSet lookup as in the sketch above).




Friday, July 31, 2015

Polymer WebComponents Client served by Java Actors

Front-end development productivity has been stalling for a long time. I'd even say that, regarding productivity, there has been no substantial improvement since the days of Smalltalk (Bubble Event Model, MVC done right, component based).

Weirdly enough, for a long time things got worse: document-centric web technology made classic and proven component-centric design patterns hard to implement.

For reasons unknown to me (fill in some major raging), it took some 15 years until a decent and well-thought-out web component standard (proposal) arose: WebComponents. It's kind of an umbrella of several standards, most importantly:
  • Html Templates - the ability to have markup snippets which are not "seen" or processed by the browser but can be "instantiated" programmatically later on.
  • Html Imports - finally the ability to include a bundle of css, html and javascript, implicitly resolving dependencies by not loading an import twice. Styles do not leak out to the main document.
  • Shadow Dom
It would not be a real web technology without some serious drawbacks: Html Imports are handled by the browser, so you'll get many http requests when loading a web component app that uses html imports to resolve dependencies.
As the Html standard is strictly separated from the Http protocol, simple solutions such as server side imports (to reduce number of http requests) can never be part of an official standard.

Anyway, with Http 2's pipelining and multiplexing features, html imports won't be an issue in some 3 to 5 years (..); until then some workarounds and shims are required.

Github repo for this blog "Buzzword Cloud": kontraktor-polymer-example

Live Demo: http://46.4.83.116/polymer/ (just sign in with anything except user 'admin'). Requires WebSocket connectivity and a modern browser (uses webcomponents-lite.js, so no IE). Curious whether it survives ;-).

Currently Chrome and Opera implement WebComponents; however, there is a well-working polyfill for Mozilla + Safari and a last-resort polyfill for IE.
Using a server-side html import shim (as provided by kontraktor's Http4k), web component based apps run on Android 4.x and iOS Safari.

(Polymer) WebComponents in short



a web component with dependencies (link rel='import'), embedded style, a template and code


Dependency Management with Imports ( "<link rel='import' ..>" )

The browser does not load an imported document twice and evaluates html imports linearly. This way dependencies are resolved in the correct order implicitly. An imported html snippet may contain (nearly) arbitrary html such as scripts, css, ..; most often it will contain the definition of a web component.

Encapsulation

Styles and templates defined inside an imported html component do not "leak" to the containing document.
Web components support data binding (one/two way). Typically a Web component coordinates its direct children only (encapsulation). Template elements can be easily accessed with "this.$.elemId".

Application structure

An application also consists of components. Starting from the main-app component, one subdivides an app hierarchically into smaller subcomponents, which has a nice self-structuring effect, as one creates reusable visual components along the way.

The main index.html typically refers to the main app component only (kr-login is an overlay component handling authentication with kontraktor webapp server):



That's nice and pretty straightforward .. but let's have a look at what my simple sample application's footprint looks like:

  Web Reality:


hm .. you might imagine how such an app will load on a mobile device, as the number of concurrent connections typically is limited to 2..6 and a request latency of 200 to 500 ms isn't uncommon. As bandwidth increases continuously but latency roughly stays the same, reducing the number of requests pays off for many apps, even at the cost of increased initial bandwidth.
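A quick back-of-the-envelope example (numbers assumed for illustration): an app loading 60 resources over 6 parallel connections at 250 ms per round trip needs about 10 sequential round trips, i.e. roughly 2.5 seconds spent waiting on latency alone, no matter how fat the pipe is.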

In order to reduce the number of requests and bandwidth usage, the JavaScript community maintains a load of tools for minifying, aggregating and preprocessing resources. When using Java at the server side, one ends up having two build systems: Gradle or Maven for your Java stuff, and node.js based JavaScript tools like bower, grunt, vulcanize etc. In addition, a lot of temporary (redundant) artifacts are created this way.

As the Java web server landscape mostly sticks to server-centric "Look-'ma-we-can-do-ze-web™" applications, it's hardly possible to make use of modern JavaScript frameworks with Java at the server side, especially as REAL JAVA DEVELOPERS DO NOT SIMPLY INSTALL NODE.JS (though it has a Windows installer now ;) ..). Nashorn unfortunately isn't there yet; currently it fails to replace node.js due to missing APIs or detail incompatibilities.





I think it's time for an unobtrusive pointer to my kontraktor actor library, which abstracts away most of the messy stuff, allowing for direct communication of JavaScript and Java actors via http or websockets.



Even when serving single page applications, there is stuff only a web server can implement effectively:

  • dynamically inline html imports 
  • dynamically minify scripts (even inline javascript)

In essence, inlining, minification and compression are temporary wire-level optimizations; there's no need to add them to the build and clutter up your project. Kontraktor's Http4k optimizes served content dynamically if run in production mode.
The same application with (kontraktor-)Http4k in production mode:


Even better: as imports are removed during server-side inlining, the web app runs under iOS Safari and the Android 4.4.2 default browser.

Actor based Server side

Whether it's a TCP server socket accepting client(-socket)s or a client browser connecting to a web server, it's structurally the same pattern:

1. server listens for authentication/connection requests.
2. server authenticates/handles a newly connecting client and instantiates/opens a client connection (webapp terminology: session).

What's different when comparing a TCP server and a WebApp Http server is
  • underlying transport protocol
  • message encoding
As kontraktor maps actors to a concrete transport topology at runtime, a web app server application does not look special:


a generic server using actors
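In code the pattern boils down to something like the following sketch (kontraktor-style, simplified; MySession and isValid are made-up placeholders, not the example repo's actual code):

import org.nustaq.kontraktor.*;

// 1. listen for authentication/connection requests
// 2. authenticate and hand out a per-client session actor
public class MyServerApp extends Actor<MyServerApp> {

    public IPromise<MySession> login(String user, String pwd) {
        if (!isValid(user, pwd))
            return new Promise<>(null, "authentication failed");
        MySession session = Actors.AsActor(MySession.class); // one actor per client
        return new Promise<>(session);
    }

    private boolean isValid(String user, String pwd) {
        return user != null && !"admin".equals(user); // app-specific check
    }
}

// per-client session actor ( = webapp session), holds per-user state
class MySession extends Actor<MySession> {
    public void tell(String msg) { /* handle a client message */ }
}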

The decision on transport and encoding is made by publishing the server actor. The underlying framework (kontraktor) will then map the high-level behaviour description to a concrete transport and encoding. E.g. for websocket + http long-poll transports it would look like:
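Roughly along these lines; the publisher class names and constructor arguments below are written from memory and should be treated as assumptions, the kontraktor docs/repo have the authoritative version:

// sketch: publish one server actor over two web transports
public class ServerMain {
    public static void main(String[] args) {
        MyServerApp app = Actors.AsActor(MyServerApp.class);
        new WebSocketPublisher(app, "localhost", "/ws", 8080).publish();  // websocket
        new HttpPublisher(app, "localhost", "/api", 8080).publish();      // http long poll
    }
}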


On the client side, js4k.js implements the API required to talk to Java actors (using a reduced tell/ask-style API).

So with a different transport mapping, a webapp backend might be used as an in-process or tcp-connectable service.

So far so good; however, a webapp needs html-import inlining, minification and file serving ...
At this point there is an end to the abstraction game, so kontraktor simply leverages the flexibility of RedHat's Undertow by providing a "resource path" FileResource handler.


The resource path file handler parses .html files and inlines + minifies (depending on settings) html imports, css links and script tags. Of course, for development this should be turned off.
The resource path works like Java's classpath, which means a referenced file is looked up starting with the first path entry, advancing along the resource path until a match is found. This can be nice in order to quickly switch versions of the components used and to keep your libs organized (no copying required).

As html is fully parsed by Http4k (JSoup parser ftw), it's recommended to keep your stuff syntactically correct. In addition, keep in mind that actors require a non-blocking + async implementation of server-side logic, so have your blocking database calls "outsourced" to kontraktor's thread pool like:
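Something along these lines (a sketch; 'exec' is the thread-pool helper as I recall it, treat the exact name and signature as an assumption):

import java.util.List;
import org.nustaq.kontraktor.*;

public class PersonService extends Actor<PersonService> {
    // offload a blocking JDBC call to a worker pool; the promise resolves
    // back on the actor thread, which stays responsive in the meantime
    public IPromise<List<String>> queryNames(String pattern) {
        return exec( () -> blockingJdbcQuery(pattern) );
    }

    private List<String> blockingJdbcQuery(String pattern) {
        // plain blocking JDBC/ORM code would live here (placeholder)
        return java.util.Collections.emptyList();
    }
}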



Conclusion:
  • JavaScript frameworks + web standards keep improving. A rich landscape of libraries and ready-to-use components has grown.
  • They increasingly come with node.js based tooling.
  • JVM based non-blocking servers scale better and have a much bigger pool of server-side software components.
  • kontraktor http4k + js4k.js help close the gap and simplify development by optimizing webapp file structure and size dynamically and by abstracting (not annotating!) away irrelevant details and enterprisey ceremony.

Saturday, June 27, 2015

Don't REST! Revisiting RPC performance and where Fowler might have been wrong ..

[Edit: The title is clickbait, of course; Fowler is aware of the async/sync issues and recently posted http://martinfowler.com/articles/microservice-trade-offs.html with a clarifying section regarding async.]

Hello my dear citizens of planet earth ...

There are many good reasons to decompose large software systems into decoupled message passing components (team size + decoupling, partial + continuous software delivery, high availability, flexible scaling + deployment architecture, ...).

With distributed applications comes the need for ordered point-to-point message passing. This is different from client/server relations, where many clients send requests at a low rate and the server can choose to scale using multiple threads processing requests concurrently.
Remote messaging performance is to distributed systems what method invocation performance is to non-distributed monolithic applications.
(guess what is one of the most optimized areas in the JVM: method invocation)

[Edit: with "REST", I also refer to HTTP based webservice style API, this somewhat imprecise]

Revisiting high level remoting Abstractions

There were various attempts at building high-level, location-transparent abstractions (e.g. CORBA, distributed objects); however, in general those ideas have not received broad acceptance.

This article by Martin Fowler sums up common sense pretty well:

http://martinfowler.com/articles/distributed-objects-microservices.html

Though not explicitly written, the article implies synchronous remote calls, where a sender blocks and waits for a remote result to arrive, thereby paying the cost of a full network round trip for each remote call performed.

With asynchronous remote calls, many of the complaints no longer hold. When using asynchronous message passing, the granularity of remote calls stops being significant.

"course grained" processing

remote.getAgeAndBirth().then( (age,birth) -> .. );

is not significantly faster than two "fine-grained" calls

all( remote.getAge(), remote.getBirth() ) 
   .then( resultArray -> ... ); 

as both variants include network round trip latency only once.

On the other hand, with synchronous remote calls every single remote method call pays a penalty of one network round trip; only then do Fowler's arguments hold.

Another element changing the picture is the availability of "Spores": snippets of code which can be passed over the network and executed at the receiver side, e.g.

remote.doWithPerson( "Heinz", heinz -> {
    // executed remotely, streams data back to the caller
    stream( heinz.salaries().sum() / 12 ); finish();
}).then( averageSalary -> .. );

Spores can be implemented efficiently thanks to the availability of VMs and JIT compilation.

Actor systems and similar asynchronous message passing approaches have gained popularity in recent years. The main motivations were easing concurrency and the insight that multithreading with shared data does not scale well and is hard to master in an industrial-grade software development environment.

As large servers are in essence "distributed systems in a box", those approaches also apply to distributed systems.

In the following I'll test the remote invocation performance of some frameworks. I'd like to prove that established frameworks are far from what is technically possible, and to show that popular technology choices such as REST are fundamentally unsuited to form the foundation of large, fine-grained distributed applications.


Test Participants

Disclaimer: As the tested products are of medium to high complexity, there is a danger of misconfiguration or test errors, so if anybody has a (verified) improvement to one of the testcases, just drop me a comment or file an issue to the github repository containing the tests:

https://github.com/RuedigerMoeller/remoting-benchmarks.

I verified by googling forums etc. that numbers are roughly in line with what others have observed.

Features I expect from a distributed application framework:
  • Ideally fully location-transparent. At least there should be a concise way (e.g. annotations, generators) to do marshalling half-automated. 
  • It is capable of mapping responses to their appropriate request callbacks automatically (via callbacks, futures/promises or whatever).
  • It's asynchronous.

products tested (Disclaimer: I am the author of kontraktor):
  • Akka 2.11
    Akka provides a high level programming interface, marshalling and networking details are mostly invisible to application code (full location transparency).
  • Vert.x 3.1
    provides a weaker level of abstraction compared to actor systems, e.g. there are no remote references. Vert.x has a symbolic notion of network communication (event bus, endpoints).
    As it's "polyglot", marshalling and message encoding need some manual support.
    Vert.x is kind of a platform and addresses many practical aspects of distributed applications such as application deployment, integration of popular technology stacks, monitoring, etc.
  • REST (RestExpress)
    As Http 1.1 based REST is limited by latency (synchronous protocol), I chose this one more or less randomly.
  • Kontraktor 3, distributed actor system on Java 8. I believe it hits a sweet spot regarding performance, ease of use and mental-model complexity. Kontraktor provides a concise, mostly location-transparent high-level programming model (Promises, Streams, Spores) supporting many transports (tcp, http long poll, websockets).

Libraries skipped:
  • finagle - requires me to clone and build their fork of thrift 0.5 first. Then I'd have to define thrift messages, then generate, then finally run it. 
  • parallel universe - at the time of writing, the actor remoting was not in a testable state ("Galaxy" is alpha); the examples come without build files, and the gradle build did not work. Once I managed to build, the programs were expecting configuration files which I could not find. Maybe worth a revisit (accepting pull requests as well :) ).

The Test

I took a standard remoting example:
The "Ask" testcase: 
The sender sends a message of two numbers; the remote receiver answers with the sum of those two numbers. The remoting layer has to track and match requests and responses, as there can be tens of thousands "in-flight". 
The "Tell" testcase: 
The sender sends fire-and-forget. No reply is sent from the receiver. (A sketch of the receiver side follows below.)
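For illustration, the receiver side of both testcases might look like this as a kontraktor actor (a sketch, not the literal benchmark code; see the linked repo for that):

import org.nustaq.kontraktor.*;

public class SumService extends Actor<SumService> {

    // "ask": the framework matches the returned promise to the request,
    // even with tens of thousands of calls in flight
    public IPromise<Integer> askSum(int a, int b) {
        return new Promise<>(a + b);
    }

    // "tell": fire-and-forget, no response message is generated
    public void tellSum(int a, int b) {
        // receiver-side work would go here
    }
}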

Results

Attention: Don't miss notes below charts.

Platform: Linux CentOS 7, dual socket, 20 real cores @ 2.5 GHz, 64 GB RAM. As the tests are ordered point-to-point, none of the tests made use of more than 4 cores.

                        tell Sum (msg/second)   ask Sum (msg/second)
Kontraktor Idiomatic    1.900.000               860.000
Kontraktor Sum-Object   1.450.000               795.000
Vert.x 3                200.000                 200.000
AKKA (Kryo)             120.000                 65.000
AKKA                    70.000                  64.500
RestExpress             15.000                  15.000
REST >15 connections    48.000                  48.000

let me chart that for you ..



Remarks:

  • Kontraktor 3 outperforms by a huge margin. I verified the test is correct and all messages are transmitted (if in doubt just clone the git repo and reproduce). 
  • Vert.x 3 seems to have built-in rate limiting. I saw peaks of 400k messages/second; however, it averaged at 200k (hints for improvement welcome). In addition, the first connecting sender only gets 15k/second throughput; if I stop and reconnect, throughput is as charted.
    I tested the very first Vert.x 3 final release. For marshalling, fast-serialization (FST) was used (as in kontraktor). Will update as Vert.x 3 matures.
  • Akka. I spent quite some time on improving the performance, with mediocre results. As Kryo is roughly the same speed as FST serialization, I'd expect at least 50% of kontraktor's performance.

    Edit:
    Further analysis shows Akka is hit by poor serialization performance. It has an option to use protobuf for encoding, which might improve results (but then why did Kryo not help?).

    Implications of using protobuf:
    * each message needs to be defined in a .proto file, and a generator has to be run
    * frequently additional data transformation is done, like "app data => generated messages => app data"
    * no reference sharing support, so no cyclic object graphs can be transmitted
    * no implicit compression from serialization's reference sharing
    * unsure whether the ask() test would profit, as it did not profit from Kryo either
    * Kryo performance is in the same ballpark as protobuf, yet it did not help that much either

    Smell: I had several people contacting me aiming to improve the Akka results. Somehow they all disappeared.

    Once I find time I might add a protobuf test. It's a pretty small test program, so if there was an easy fix, it should not be a huge effort to provide it. The git repo linked above contains a Maven-buildable, ready-to-use project.
  • REST. Poor throughput is not caused by RestExpress (which I found quite straightforward to use) but by Http 1.1's dependence on latency. If one moves a server to other hardware (e.g. different subnet, cloud), the throughput of a service can change drastically due to different latency. This might change with Http 2.
    The good news is: you can <use> </any> <chatty> { encoding: for messages }, as it won't make a big difference for point-to-point REST performance.
    Only when opening many connections (>20) concurrently does throughput increase. This messes up transaction/message ordering, so it can only be used for idempotent operations (a species mostly known from white papers and conference slides, rarely seen in the wild).


Misc Observations


Backpressure

Sending millions of messages as fast as possible can be tricky to implement in a non-blocking environment. A naive send loop
  • might block the processing thread 
  • builds up a large outbound queue, as putting is faster than taking + sending 
  • can prevent incoming callbacks from being enqueued + executed (= deadlock or OOM).
Of course this is a synthetic test case; however, similar situations exist, e.g. when streaming large query results or sending large blobs to other nodes (e.g. initialization with reference data).

None of the libraries (except REST) handled this out of the box:
  • Kontraktor requires a manual increase of queue sizes over the default (32k) in order not to deadlock in the "ask" test. In addition it's required to programmatically adapt the send rate using the backpressure signal raised by the TCP stack (network send blocks). This can be done non-blocking, "offer()" style (see the sketch after this list).
  • For Vert.x I used a periodic task sending a burst of 500 to 1000 messages. Unfortunately the optimal number of messages per burst depends on hardware performance, so the test might need adaptation when run on e.g. a laptop.
  • For Akka I sent 1 million messages every 30 seconds in order to avoid implementing application-level flow control. It just queues up messages and degrades to something like 50 msg/s when used naively (big loop).
  • REST was not problematic here (synchronous Http 1.1 anyway). Degraded by default.
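To make the "offer()" style concrete, here is a generic sketch (Channel is a made-up stand-in for a transport that reports backpressure; self() re-enqueues the loop as an actor message, so the actor thread is never blocked):

import org.nustaq.kontraktor.*;

public class Sender extends Actor<Sender> {

    interface Channel { boolean offer(Object msg); } // made-up transport handle

    Channel channel;

    public void sendLoop(int next, int max) {
        // send as long as the transport accepts messages
        while (next < max && channel.offer(new int[]{ next, next }))
            next++;
        if (next < max)
            self().sendLoop(next, max); // backpressure: retry later, stay responsive
    }
}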



Why is kontraktor remoting that much faster?
  • premature optimization 
  • adaptive batching works wonders, especially when applied to reference sharing serialization.
  • small performance compromises stack up, reduce them bottom up 
Kontraktor actually is far from optimal. It still generates and serializes a "MethodCall() { int targetId, [..], String methodName, Object args[] }" object for each remoted message. It does not use Externalizable or other ways of bypassing generic (fast-)serialization.
Throughputs beyond 10 million remote method invocations/sec have been proven possible, at the cost of a certain fragility + complexity (unique ids and distributed systems ...) + manual marshalling optimizations.


Conclusion

  • As scepticism regarding distributed object abstractions is mostly performance related, high-performance asynchronous remote invocation is a game changer
  • Popular libraries have room for improvement in this area 
  • Don't use REST/Http for inter-system connections in (micro-)service-oriented architectures; point-to-point performance is horrible. It has its applications in the area of (WAN) web services, platform-neutral, easily accessible APIs and client/server patterns.
  • Asynchronous programming is different and requires different/new solution patterns (at the source code level). It is unavoidable to learn the use of asynchronous messaging primitives.
    "Pseudo-synchronous" approaches (e.g. fibers) are good for scaling multithreading better, but do not work out for distributed systems.
  • lack of craftsmanship can kill visions.