Friday, August 13, 2010

Daisychaining in the clouds

I've been working on a new protocol, DAISYCHAIN [1], which is based on research out of EPFL [2].

The idea behind it is that broadcasting a message is inefficient in clusters where IP multicasting is not available. For example, if we only have TCP available (as is the case in most clouds today), then we have to send a broadcast (or group) message N-1 times. If we want to broadcast M to a cluster of 10, we send the same message 9 times.

Example: if we have {A,B,C,D,E,F}, and A broadcasts M, then it sends it to B, then to C, then to D etc.

If we have a 1 GB switch, and M is 1 GB, then sending a broadcast to 9 members takes 9 seconds, even if we parallelize the sending of M. This is due to the fact that the link to the switch only sustains 1 GB/sec. (Note that I'm conveniently ignoring the fact that the switch will start dropping packets if it is overloaded, causing TCP to retransmit and slowing things down.)
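To make the N-1 pattern concrete, here is a minimal sketch of what a TCP-only broadcast boils down to: the sender pushes the same payload over N-1 separate unicast connections, all of which have to squeeze through its single link to the switch. This is plain, illustrative Java, not JGroups transport code; the Cluster and sendUnicast names are made up.

import java.util.Arrays;
import java.util.List;

// Toy model of a TCP-only "broadcast": N-1 unicasts from the sender.
// Illustrative only; these names are not JGroups API.
class Cluster {
    final List<String> members;

    Cluster(List<String> members) {
        this.members = members;
    }

    // A broadcast from 'sender' degenerates into N-1 unicast sends, all of
    // which leave through the sender's single link to the switch.
    void broadcast(String sender, byte[] payload) {
        for (String dest : members) {
            if (!dest.equals(sender))
                sendUnicast(sender, dest, payload); // one full copy per member
        }
    }

    void sendUnicast(String from, String to, byte[] payload) {
        System.out.printf("%s --> %s (%d bytes)%n", from, to, payload.length);
    }

    public static void main(String[] args) {
        Cluster c = new Cluster(Arrays.asList("A", "B", "C", "D", "E", "F"));
        c.broadcast("A", new byte[1024]); // prints 5 sends: A --> B, A --> C, ...
    }
}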

Let's introduce the concept of a round. A round is the time it takes to send or receive a message. In the above example, a round takes 1 second if we send 1 GB messages.

In the existing N-1 approach, it takes X * (N-1) rounds to send X messages to a cluster of N nodes. So to broadcast 10 messages to a cluster of 10, it takes 90 rounds.


Enter DAISYCHAIN.

The idea is that, instead of sending a message to N-1 members, we only send it to our neighbor, which forwards it to its neighbor, and so on. For example, in {A,B,C,D,E}, D would broadcast a message by forwarding it to E, E forwards it to A, A to B, B to C and C to D. We use a time-to-live field, which gets decremented on every forward, and a message gets discarded when the time-to-live is 0.
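Here's a minimal sketch of that forwarding rule, assuming the logical ring is simply the membership order of the current view. DaisyForwarder, Msg, nextNeighbor, forward and deliver are hypothetical names; this is not the actual DAISYCHAIN code.

import java.util.List;

// Illustrative sketch of daisy chain forwarding along a logical ring.
class DaisyForwarder {
    final List<String> view;  // current membership, e.g. ["A","B","C","D","E"]
    final String self;

    DaisyForwarder(List<String> view, String self) {
        this.view = view;
        this.self = self;
    }

    // The original sender starts a broadcast by handing the message to its
    // neighbor with a time-to-live equal to the cluster size.
    void broadcast(Msg msg) {
        msg.ttl = view.size();
        forward(nextNeighbor(), msg);       // e.g. D --> E
    }

    // On reception: deliver locally, decrement the ttl, and keep forwarding
    // around the ring until the ttl hits 0 (the message has come full circle).
    void onReceive(Msg msg) {
        deliver(msg);
        if (--msg.ttl > 0)
            forward(nextNeighbor(), msg);
        // ttl == 0: discard, every member has seen the message
    }

    String nextNeighbor() {
        int idx = view.indexOf(self);
        return view.get((idx + 1) % view.size()); // wrap around: last member -> first
    }

    void deliver(Msg msg)                  { /* hand the message up to the application */ }
    void forward(String neighbor, Msg msg) { /* unicast to the neighbor over TCP */ }

    static class Msg {
        int ttl;
        byte[] payload;
    }
}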

The advantage is that, instead of taxing the link between a member and the switch to send N-1 messages, we distribute the traffic more evenly across the links between the nodes and the switch. Let's take a look at an example where A broadcasts messages m1 and m2 in cluster {A,B,C,D}; '-->' means sending:

Traditional N-1 approach

Round 1: A(m1) --> B
Round 2: A(m1) --> C
Round 3: A(m1) --> D
Round 4: A(m2) --> B
Round 5: A(m2) --> C
Round 6: A(m2) --> D

It takes 6 rounds to broadcast m1 and m2 to the cluster.


Daisychaining approach

Round 1: A(m1) --> B
Round 2: A(m2) --> B || B(m1) --> C
Round 3: B(m2) --> C || C(m1) --> D
Round 4: C(m2) --> D

In round 1, A sends m1 to B.
In round 2, A sends m2 to B, but B also forwards m1 (received in round 1) to C.
In round 3, A is done. B forwards m2 to C and C forwards m1 to D (in parallel, denoted by '||').
In round 4, C forwards m2 to D.
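Just to double-check the schedule, here is a tiny throwaway simulation (not part of JGroups) that prints the per-round transfers for A broadcasting two messages in {A,B,C,D}; as in the listing above, only the N-1 hops needed to reach the other members are counted.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Throwaway simulation of the daisy chain round schedule shown above.
public class DaisyRounds {
    public static void main(String[] args) {
        List<String> members = Arrays.asList("A", "B", "C", "D");
        String sender = "A";
        int numMsgs = 2;                       // m1 and m2
        int n = members.size();

        // In round r, message m is on hop (r - m + 1) of its trip around the
        // ring, as long as that hop is within 1..N-1.
        int totalRounds = numMsgs + (n - 2);
        for (int round = 1; round <= totalRounds; round++) {
            List<String> transfers = new ArrayList<>();
            for (int m = numMsgs; m >= 1; m--) {
                int hop = round - m + 1;
                if (hop >= 1 && hop <= n - 1) {
                    int from = (members.indexOf(sender) + hop - 1) % n;
                    int to   = (from + 1) % n;
                    transfers.add(members.get(from) + "(m" + m + ") --> " + members.get(to));
                }
            }
            System.out.println("Round " + round + ": " + String.join(" || ", transfers));
        }
    }
}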

Switch usage

Let's take a look at this in terms of switch usage: in the N-1 approach, A can only send 125MB/sec, no matter how many members there are in the cluster, so it is constrained by the link capacity to the switch. (Note that A can also receive 125MB/sec in parallel with today's full duplex links).

So the link between A and the switch gets hot.

In the daisychaining approach, link usage is more even: if we look for example at round 2, A sending to B and B sending to C use 2 different links, so there are no constraints regarding the capacity of a single link. The same goes for B sending to C and C sending to D.

In terms of rounds, the daisy chaining approach uses X + (N-2) rounds, so for a cluster size of 10 and broadcasting 10 messages, it requires only 18 rounds, compared to 90 for the N-1 approach!
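The arithmetic is easy to verify for a few cluster sizes (plain Java, nothing protocol-specific); the gap widens as the cluster grows:

public class RoundsComparison {
    public static void main(String[] args) {
        int x = 10;                                 // messages to broadcast
        for (int n : new int[] {4, 10, 50, 100}) {  // cluster sizes
            int traditional = x * (n - 1);          // N-1 approach
            int daisy       = x + (n - 2);          // daisy chaining
            System.out.printf("N=%3d: N-1 = %4d rounds, daisychain = %3d rounds%n",
                              n, traditional, daisy);
        }
    }
}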


Performance

I ran a quick performance test this morning, with 4 nodes connected to a 1 GB switch, and every node sending 1 million 8K messages, for a total of 32 GB received by every node. The config used was tcp.xml.

The N-1 approach yielded a throughput of 73 MB/node/sec, and the daisy chaining approach 107 MB/node/sec!

The change needed to switch from N-1 to daisy chaining was to place DAISYCHAIN directly on top of TCP.
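For reference, the application code stays the same; only the stack configuration changes. A rough sketch of what the test driver boils down to, using 2.x-era JGroups API, where tcp-daisychain.xml stands for a hypothetical copy of tcp.xml with DAISYCHAIN inserted directly above TCP, and perf-cluster is just a made-up cluster name:

import org.jgroups.JChannel;
import org.jgroups.Message;

// Sketch only: the sending code is unchanged, the daisy chaining happens
// entirely inside the (modified) protocol stack.
public class DaisyPerfSketch {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel("tcp-daisychain.xml"); // hypothetical config name
        ch.connect("perf-cluster");                       // made-up cluster name
        byte[] payload = new byte[8 * 1024];              // 8K messages, as in the test
        for (int i = 0; i < 1000000; i++)
            ch.send(new Message(null, null, payload));    // null destination = whole cluster
        ch.close();
    }
}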

DAISYCHAIN is still largely experimental, but the numbers above show that it has potential to improve performance in TCP based clusters.


[1] https://jira.jboss.org/browse/JGRP-1021
[2] infoscience.epfl.ch/record/149218/files/paper.pdf

8 comments:

  1. Are there any concerns about failures shortly after a member drops out of the group, e.g. if D sends a response back for M2 but the daisychain path is broken due to a node leaving? Will there be a loop in the Daisychain protocol that detects this and picks a new path?

  2. Hey Scott,

    the logical ring is determined on every view change, so for {A,B,C,D}, A knows its neighbor is B, B knows its neighbor is C, and so on...

    Initially, when a new member is about to join, and hasn't yet received a view, it simply bypasses DAISYCHAIN and uses the default transport mechanism to disseminate a message, e.g. N-1 in the case of TCP.

    In terms of retransmissions, I've been thinking about a scheme which would ask the predecessor for a missing message, so B would ask A, because that's the one which sent us the message in the first place.

    However, this is not (yet) in place: retransmission of a message M goes directly to the member which sent M, or - if NAKACK.use_mcast_xmit_req is true - is broadcast around the logical ring.

  3. Wow, very nice. Would love to run some perf tests with DAISYCHAIN on Infinispan!

  4. Hi Manik,

    this may not affect Infinispan's DIST mode much, as DAISYCHAIN handles only multicasts... What else uses multicasts in Infinispan (besides replication)? Invalidation?
    DAISYCHAIN really shines on high data volumes. One aspect I didn't mention is that latency gets a bit higher, in favor of higher throughput...

  5. Anonymous:

    Bela,
    Can you please explain this bit of math: "in the N-1 approach, A can only send 125MB/sec, no matter how many members there are in the cluster". It's a 1GB switch, right?

    Thanks,
    Scott D.

    Correct; 1 Gbit / 8 ~= 125 MBytes. I should have written 1 Gb to denote gigabits versus MB to denote megabytes...

  7. Bela, sounds interesting!
    One more thing which could speed up large clusters is some kind of "federations" - I mean we could have geographically-distributed subgroups of nodes, where each subgroup could use UDP inside itself, so we send broadcast messages only once between subgroup "leaders" (using TCP) and then the leaders rebroadcast those messages inside their groups using UDP.
    What do you think about this?

  8. Yes, the RELAY protocol's supposed to do that. Interesting especially for data center replication, e.g. between NYC and SFO.

    Coincidentally, we're having an online demo this WED by Mike Jensen about exploring how to expand JGroups to scale to large groups, details below:

    When: Aug 18th 2010 at 10:00am Eastern (2:00pm GMT)
    Register here: https://cc.readytalk.com/r/gw6zy67joqts
