Friday, April 24, 2009

FILE_PING: new discovery protocol based on shared storage

I've just created a first version of FILE_PING in 2.6 and 2.8. This is a new discovery protocol which uses a shared directory into which all nodes of a cluster write their addresses.

New nodes can read the contents of that directory and then send their discovery requests to all nodes found in the dir.

When a node leaves, it'll remove its address from the directory again.

When would someone use FILE_PING, e.g. over TCPGOSSIP and GossipRouter ?

When IP multicasting is not enabled, or cannot be used for other reasons, we have to resort to either TCPPING , which lists nodes statically in the config, or TCPGOSSIP, which retrieves initial membership information from external process(es), the GossipRouter(s).

The latter solution is a bit cumbersome since an additional process has to be maintained.

FILE_PING is a simple solution to replace GossipRouter, so we don't have to maintain that external process.

However, note that performance will most likely not be better: a shared directory e.g. on NFS or SMB requires a round trip for a read or write, too. So if we have 10 nodes which wrote their information to file, then we have to make 10 round trips via SMB to fetch that information, compared to 1 round trip to the GossipRouter(s) !

So FILE_PING is an option for developers who prefer to take the perf hit (maybe in the order of a few additional milliseconds per discovery phase) over having to maintain an external GossipRouter process.

FILE_PING is part of 2.6.10, which will be released early next week, or it can be downloaded from CVS (2.6. branch) or here. In the latter case, place the into the src/org/jgroups/protocols directory and execute the 'jar' target in the build.xml Ant script of your JGroups src distro.

Wednesday, April 01, 2009

Those damn edge cases !

While JGroups is over 10 years old and very mature, sometimes I still run into cases that aren't handled. While the average user won't run into edge cases because we test the normal cases very well, if you do run into one, in the best case, you have 'undefined' behavior (whatever that means !), in the worst case, you're hosed.

Here's one.

The other week I was at a (european) army, for a week of JGroups consulting. They have a system which runs JGroups nodes over flappy links, radio and satellite networks. Sometimes, a link between 2 nodes A and B can even turn asymmetric, meaning A can send to B, but B not to A !

It turns out that they have a lot of partitions (e.g. when a satellite link goes down), followed by subsequent remerging when the link is restored. Sometimes, members would not be able to communicate with each other after the merge.

This was caused by an edge case in UNICAST which doesn't handle overlapping partitions.

A non-overlapping partition is a partition where a cluster of {A,B,C,D} falls apart into 2 (or more) subclusters of {A,B} and {C,D}, or {A}, {B}, {C}, {D}. The latter case can easily be reproduced when you kill a switch connecting the 4 nodes.

An overlapping partition is when the cluster falls apart into subclusters that overlap, e.g. {A,B,C} and {C,D}. This can happen with asymmetrical links (which never happens with a regular switch !), or FD and many nodes being killed at the same time and a merge occuring before all dead nodes have been removed from the cluster.

If this sounds obsure, it actually is !

But anyway, here's what happens at the UNICAST level.

Warning: rough road ahead...

UNICAST keeps state for each connection. E.g. if A sends a unicast message to B, A maintains the last sequence number (seqno) sent to B (e.g. #25) and B maintains the highest seqno received from A (#25). The same holds for message from B to A, let's say B's last message to A was #7.

Now we have a network partition, which creates a new view {A,B} at A and {B} at B. So, in other words, B unilaterally excluded A from its view, but A didn't exclude B. The reason is that A can communicate with B, but B cannot communicate with A.

Now, you might ask, wait a minute ! If A can communicate with B, why can't B communicate with A ?

This doesn't happen with switches, but here we're talking about separate up and down links over radios, and if a radio up-link goes down, that just means we cannot send, but still receive (through the down-link) !

Let's now look at what happens:

When B receives the new view {B}, it removes the entry for A from its connection table. It therefore loses the memory that its last message to A was #7.

On the other side, A doesn't remove its connection entry for B, which is still at #25.

When the partition heals and a merge ensues, A sends a message to B. The message's seqno is #25, the next message to B will be #26 and so on.

On the receiver side, B creates a new connection table entry for A with seqno #1. When A#25 and A#26 are received, they're stored in the table, but not passed up to the application because we expect messages #1-#24 from A first.

This is terrible because A will never send messages #1-#24 ! Because B will simply store all messages from A, it will run out of memory at some point, unless there's another view excluding A !

Doom also lurks in the reverse direction: when B sends #1 to A, A expects #7 and therefore discards messages #1-#6 from B !

This is bad and caused me to enhance the design for UNICAST. The new design includes connection IDs, so we'll reset an old connection when a new connection ID is received, and receivers asking senders for the initial seqnos if they have no entry for a given sender.

This change will not affect current users, running systems which are connected via switches/VLANs etc. But it will remove a showstopper for running JGroups in rough environments like the one described above.

The design for the new enhanced UNICAST protocol can be found at [1].