Friday, August 29, 2014

JGroups 3.5.0.Final released

I'm happy to announce that JGroups 3.5.0.Final has been released !

It's quite a big release with 137 issues resolved.

Downloads at the usual place (SourceForge) or via Maven:

<dependency>
    <groupId>org.jgroups</groupId>
    <artifactId>jgroups</artifactId>
    <version>3.5.0.Final</version>
</dependency>

The release notes are here.

Enjoy !

Bela Ban

Wednesday, July 30, 2014

New record for a large JDG cluster: 500 nodes

Today I ran a 500 node JDG (JBoss Data Grid) cluster on Google Compute Engine, topping the previous record of 256 nodes. The test used is IspnPerfTest [2] and it does the following:
  • The data grid has keys [1..20000], values are 1KB byte buffers
  • The cache is DIST-SYNC with 2 owners for every key
  • Every node invokes 20'000 requests on random keys
  • A request is a read (with an 80% chance) or a write (with a 20% chance)
    • A read returns a 1KB buffer
    • A write takes a 1KB buffer and sends it to the primary node and the backup node
  • The test invoker collects the results from all 500 nodes and computes an average, which it prints to stdout (a sketch of the request mix is shown below).
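For illustration, here's a minimal sketch of that read/write mix. This is not the actual IspnPerfTest code [2]; it merely assumes an Infinispan Cache<Integer,byte[]> backed by the DIST-SYNC configuration with 2 owners described above:

    // A sketch only, not IspnPerfTest [2]: the 80/20 read/write mix on random keys.
    protected void invokeRequests(org.infinispan.Cache<Integer,byte[]> cache) {
        final int NUM_KEYS=20000, NUM_REQUESTS=20000, SIZE=1024; // 1KB values
        java.util.Random random=new java.util.Random();
        byte[] buf=new byte[SIZE];
        for(int i=0; i < NUM_REQUESTS; i++) {
            int key=1 + random.nextInt(NUM_KEYS);     // random key in [1..20000]
            if(random.nextInt(100) < 80)
                cache.get(key);                       // 80%: read, returns a 1KB buffer
            else
                cache.put(key, buf);                  // 20%: write, goes to primary and backup owner
        }
    }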
But see for yourself; the YouTube video is here: [1].

In terms of cluster size, I could probably have gone higher: it was pretty effortless to go to 500 instances... any takers ? :-)


[1] http://youtu.be/Cz3CDr31EDU
[2] https://github.com/belaban/IspnPerfTest

Wednesday, July 23, 2014

Running a JGroups cluster in Google Compute Engine: part 3 (1000 nodes)

A quick update on my GCE perf tests:

I successfully ran a 1000 node UPerf cluster on Google Compute Engine, the YouTube video is at http://youtu.be/Imn1M7EUTGE. The perf tests are started at around 14'13".

I also managed to get 2286 nodes running; however, I didn't record it... :-)

Monday, April 07, 2014

Running JGroups on Google Compute Engine


I recently started looking into Google Compute Engine (GCE) [1], an IAAS service similar to Amazon's EC2. I had been looking into EC2 a few years ago and created S3_PING back then, to simplify running JGroups clusters on EC2.

As GCE seems to be taking off, I wanted to provide a simple way to run JGroups clusters on GCE, too. Of course, this also means it's simple to run Infinispan/JDG and WildFly (JBoss EAP) clusters on GCE.

So the first step was creating a discovery protocol named GOOGLE_PING [2].

Turns out this was surprisingly easy; only 27 lines of code, as I could more or less reuse S3_PING. Google provides an S3 compatibility mode which only requires a change of the cloud storage host name !
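To illustrate the idea (this is not the actual GOOGLE_PING source, which is at [2]), and under the assumption that S3_PING exposes the cloud storage host as a configurable property, the whole protocol boils down to something like:

    // A sketch only -- see [2] for the real GOOGLE_PING. Assumption: S3_PING exposes the
    // cloud storage host as a property/field named "host".
    public class GOOGLE_PING extends S3_PING {
        public GOOGLE_PING() {
            host="storage.googleapis.com"; // Google Cloud Storage's S3-interoperability endpoint
        }
    }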

Next, I wanted to run a number of UPerf instances on GCE and measure performance.

UPerf mimics Infinispan's partial replication, in which every node picks a backup for its data: every node invokes 20'000 synchronous RPCs, 80% of those are READS and 20% WRITES. Random destinations are picked for every request and when all members are done, the times to invoke the 20'000 requests are collected from all members, averaged and printed to the console.

The nice thing about UPerf is that parameters (such as the payload of each request, the number of invoker threads, the number of RPCs etc) can be changed dynamically; this is done in the form of a CONFIG RPC. All members apply the changes and therefore don't need to be restarted. (New members acquire the entire configuration from the coordinator).

I made 2 videos showing how this is done. Part 1 [3] shows how to set up and configure JGroups to run on a 2 node cluster, and part 2 [4] shows a 100 node cluster and measures performance of UPerf. I hope to add a part 3 later, when I have a quota to run 1000 cores...

The first part can be seen here: [3]

The second part is here: [4]

Enjoy !

[1] https://cloud.google.com/products/compute-engine/

[2] https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/GOOGLE_PING.java

[3] http://youtu.be/xq7JxeIQTrU

[4]  http://youtu.be/fokCUvB6UNM

Friday, January 24, 2014

JGroups: status and outlook

For our clustering team meeting in Palma, I wrote a presentation on the status and future of JGroups. As there isn't anything confidential in it, I thought I might as well share it.
Feedback to the users mailing list, please.
Cheers,

https://github.com/belaban/workshop/blob/master/slides/status2014.adoc

Sunday, October 06, 2013

JGroups 3.4.0.Final released

I'm happy to announce that JGroups 3.4.0.Final has been released !

The major features, bug fixes and optimizations are:
  • View creation and coordinator selection is now pluggable
  • Cross site replication can have more than 1 site master
  • Fork channels: light weight channels
  • Reduction of memory and wire size for views, merge views, digests and various headers
  • Various optimizations for large clusters: the largest JGroups cluster is now at 1'538 nodes !
  • The license is now Apache License 2.0
3.4.0.Final can be downloaded from SourceForge [1] or Maven (central). The complete list of issues is at [2]. Below is a summary of the changes.
Enjoy !

[1] https://sourceforge.net/projects/javagroups/files/JGroups/3.4.0.Final
[2] https://issues.jboss.org/browse/JGRP/fixforversion/12320869


Changes:
****************************************************************************************************
Note that the license was changed from LGPL 2.1 to AL 2.0: https://issues.jboss.org/browse/JGRP-1643
****************************************************************************************************


New features
============


Pluggable policy for picking coordinator
----------------------------------------
[https://issues.jboss.org/browse/JGRP-592]

View and merge-view creation is now pluggable; this means that an application can determine which member is
the coordinator.
Documentation: http://www.jgroups.org/manual/html/user-advanced.html#MembershipChangePolicy.


RELAY2: allow for more than one site master
-------------------------------------------
[https://issues.jboss.org/browse/JGRP-1649]

If we have a lot of traffic between sites, having more than 1 site master increases performance and reduces stress
on the single site master


Fork channels: private light-weight channels
--------------------------------------------
[https://issues.jboss.org/browse/JGRP-1613]

This allows multiple light-weight channels to be created over the same (base) channel. The fork channels are
private to the app which creates them and the app can also add protocols over the default stack. These protocols are
also private to the app.

Doc: http://www.jgroups.org/manual/html/user-advanced.html#ForkChannel
Blog: http://belaban.blogspot.ch/2013/08/how-to-hijack-jgroups-channel-inside.html



Kerberos based authentication
-----------------------------
[https://issues.jboss.org/browse/JGRP-1657]

New AUTH plugin contributed by Martin Swales. Experimental, needs more work


Probe now works with TCP too
----------------------------
[https://issues.jboss.org/browse/JGRP-1568]

If multicasting is not enabled, probe.sh can be started as follows:

probe.sh -addr 192.168.1.5 -port 12345

where 192.168.1.5:12345 is the physical address:port of a node.
Probe will ask that node for the addresses of all other members and then send the request to all members.



Optimizations
=============

UNICAST3: ack messages sooner
-----------------------------
[https://issues.jboss.org/browse/JGRP-1664]

Messages used to get acked after delivery rather than on reception. This was changed so that long-running
application code does not delay the ack, which could lead to unneeded retransmissions by the sender.


Compress Digest and MutableDigest
---------------------------------
[https://issues.jboss.org/browse/JGRP-1317]
[https://issues.jboss.org/browse/JGRP-1354]
[https://issues.jboss.org/browse/JGRP-1391]
[https://issues.jboss.org/browse/JGRP-1690]

- In some cases, when a digest and a view are the same, the members
  field of the digest points to the members field of the view,
  resulting in reduced memory use.
- When a view and digest are the same, we marshal the members only
  once (in JOIN-RSP, MERGE-RSP and INSTALL-MERGE-VIEW).
- We don't send the digest with a VIEW to *existing members*; the full
  view and digest is only sent to the joiner. This means that new
  views are smaller, which is useful in large clusters.
- JIRA:  https://issues.jboss.org/browse/JGRP-1317
- View and MergeView now use arrays rather than lists to store
  membership and subgroups
- Make sure digest matches view when returning JOIN-RSP or installing
  MergeView (https://issues.jboss.org/browse/JGRP-1690)
- More efficient marshalling of GMS$GmsHeader: when view and digest
  are present, we only marshal the members once


Large clusters:
---------------
- https://issues.jboss.org/browse/JGRP-1700: STABLE uses a bitset rather than a list for STABLE msgs, reducing
  memory consumption
- https://issues.jboss.org/browse/JGRP-1704: don't print the full list of members
- https://issues.jboss.org/browse/JGRP-1705: suppression of fake merge-views
- https://issues.jboss.org/browse/JGRP-1710: move contents of GMS headers into message body (otherwise packet at
  transport gets too big)
- https://issues.jboss.org/browse/JGRP-1713: ditto for VIEW-RSP in MERGE3
- https://issues.jboss.org/browse/JGRP-1714: move large data in headers to message body



Bug fixes
=========

FRAG/FRAG2: incorrect ordering with message batches
---------------------------------------------------
[https://issues.jboss.org/browse/JGRP-1648]

Reassembled messages would get reinserted into the batch at the end instead of at their original position


RSVP: incorrect ordering with message batches
---------------------------------------------
[https://issues.jboss.org/browse/JGRP-1641]

RSVP-tagged messages in a batch would get delivered immediately, instead of waiting for their turn


Memory leak in STABLE
---------------------
[https://issues.jboss.org/browse/JGRP-1638]

Occurred when send_stable_msg_to_coord_only was enabled.


NAKACK2/UNICAST3: problems with flow control
--------------------------------------------
[https://issues.jboss.org/browse/JGRP-1665]

If an incoming message sent out other messages before returning, it could block forever as new credits would not be
processed. Caused by a regression (removal of ignore_sync_thread) in FlowControl.



AUTH: nodes without AUTH protocol can join cluster
--------------------------------------------------
[https://issues.jboss.org/browse/JGRP-1661]

If a member didn't ship an AuthHeader, the message would get passed up instead of rejected.


LockService issues
------------------
[https://issues.jboss.org/browse/JGRP-1634]
[https://issues.jboss.org/browse/JGRP-1639]
[https://issues.jboss.org/browse/JGRP-1660]

Bug fix for concurrent tryLock() blocks and various optimizations.


Logical name cache is cleared, affecting other channels
-------------------------------------------------------
[https://issues.jboss.org/browse/JGRP-1654]

If we have multiple channels in the same JVM, a new view in one channel triggers removal of the entries
of all other channels' caches




Manual
======

The manual is at http://www.jgroups.org/manual-3.x/html/index.html.

Monday, September 30, 2013

New record for a large JGroups cluster: 1538 nodes

The largest JGroups cluster so far has been 536 nodes [1], but since last week we have a new record: 1538 nodes !

The specs for this cluster are:
  • 4 X Intel Xeon CPU E3-1220 V2 @ 3.10GHz 4GB: 8 boxes of 13 members = 104
  • 1 X Intel Celeron M CPU 440 @ 1.86GHz 1GB: 8 boxes of 3 members = 24
  • 1 X Intel Celeron M CPU 440 @ 1.86GHz 2GB: 141 boxes of 10 members = 1410
    • Total = 1538 members
  • JGroups 3.4.0.Beta2 (custom build with JGRP-1705 [2] included)
  • 120 MB of heap for each node
The time to form a 1538 node cluster was around 115 seconds and it took a total of 11 views (due to view bundling being enabled).

Note that the physical memory listed for the hardware above is shared by all the nodes on that box, e.g. in the last HW config of 141 boxes, 10 nodes share the 2GB of physical memory; each member has 120 MB max of heap available.

The picture below shows the management GUI. The last line shows member number 1538, which took roughly 11 seconds to join (due to view bundling); the total time for the cluster to form was ca. 115 seconds.

[1] http://belaban.blogspot.ch/2011/04/largest-jgroups-cluster-ever-536-nodes.html
[2] https://issues.jboss.org/browse/JGRP-1705

Wednesday, August 21, 2013

How to hijack a JGroups channel inside Infinispan / JBoss ... (and get away with it !)

Have you ever used an Infinispan or JBoss (WildFly) cluster ?
If no --> skip this article.

Have you ever had the need to send messages around the cluster, without resorting to RpcManager offered by Infinispan or HAPartition provided by WildFly ?
If no --> skip this article.

Have you ever wanted to hijack the JGroups channel used by Infinispan and WildFly and use it for your own purposes ?
If no --> skip this article.

I recently added light-weight channels [1] to JGroups. They provide the ability to send and receive messages over an existing channel, yet those messages are private to the light-weight channel. And the hijacked channel doesn't see those private messages either.

This is good when an application wants to reuse an existing channel, and doesn't want to create/configure/maintain its own channel, which would increase resource consumption.

Also, applications can create many (hundreds) of light-weight channels, as they don't use a lot of resources and are easy and quick to create and destroy.

Even better: if an application would like to add capabilities to an existing stack, e.g. atomic counters, a distributed execution service or distributed locking, then it can define the protocol(s) it wants to add and the private channel will add these. Again, the hijacked channel is unaffected by this.

There's documentation on this at [1], so let me show you how to hijack a channel inside of an Infinispan application. The application can be downloaded from [3].

The (original but edited) code that creates an Infinispan cache looks like this:

    protected void start() throws Exception {
        mgr=new DefaultCacheManager("infinispan.xml");
        cache=mgr.getCache("clusteredCache");

        eventLoop();
    }


It creates a CacheManager, then gets a Cache instance from it. That's it. Now we want to grab the JGroups channel and hijack it to run a Draw instance on it (HijackTest:53, [4]):


    protected void start() throws Exception {
        mgr=new DefaultCacheManager("infinispan.xml");
        cache=mgr.getCache("clusteredCache");

        // grab the JGroups channel that Infinispan uses
        Transport tp=cache.getAdvancedCache().getRpcManager().getTransport();
        Channel main_ch=((JGroupsTransport)tp).getChannel();

        // create a fork channel on top of it and run Draw over it
        ForkChannel fork_ch=new ForkChannel(main_ch,
                                            "hijack-stack",
                                            "lead-hijacker",
                                            true,
                                            ProtocolStack.ABOVE,
                                            FRAG2.class);
        Draw draw=new Draw((JChannel)fork_ch);
        fork_ch.connect("ignored");
        draw.go();

        eventLoop();
    }


The code between the cache creation and eventLoop() was inserted. So what does it do ? It grabs the JGroups channel from the Infinispan cache, creates a light-weight (fork) channel and runs Draw [5] over it. Draw is a replicated whiteboard GUI application. To explain this step-by-step:
  • First the Transport instance is retrieved from the cache, then the JGroups channel from it
  • Next a ForkChannel is created. It is passed the hijacked JGroups channel, the name of the newly created fork stack ("hijack-stack") and the name of the fork channel ("lead-hijacker"). Then we state that we want to dynamically insert a FORK protocol if it doesn't exist, above FRAG2. FORK [2] is needed to create fork channels
  • Next, we create an instance of Draw (shipped with JGroups) on the newly created fork channel
  • Then we connect the fork channel and call go() on Draw which starts up the GUI
The result is shown below:

The hijacked application is a text-based Infinispan perf test (shown in the 2 shell windows). The hijacking Draw windows are shown above. Pressing and moving the mouse around will send multicasts of coordinates/color around the cluster over the fork channel, but the hijacked channel will not notice anything.

Seems like we got away with hijacking the JGroups channel after all ! :-)



[1] http://www.jgroups.org/manual/html/user-advanced.html#ForkChannel

[2] https://issues.jboss.org/browse/JGRP-1613

[3] https://github.com/belaban/HijackTest

[4] https://github.com/belaban/HijackTest/blob/master/src/main/java/org/dubious/HijackTest.java#L53

[5] https://github.com/belaban/JGroups/blob/master/src/org/jgroups/demos/Draw.java


Wednesday, July 31, 2013

BYOC: Bring Your Own Coordinator

I've finally resolved one of the older issues in JGroups, created 6 years ago: pluggable policy for picking the coordinator [1].

This adds the ability to let application code create the new View on a join, leave or crash-leave, or the MergeView on a merge.

Why is this important ?

There are 2 scenarios that benefit from this:
  1. The new default merge policy always picks a previous coordinator if available, so that moving the coordinatorship around is minimized. Because singleton services (services which are instantiated on only one node in the cluster, usually the coordinator) need to be shut down and restarted every time the coordinatorship moves, minimizing this means less work for singletons.
  2. A node with a coordinator role has more work to do, e.g. join/leave/merge handling, plus it may be running some singleton services. Some systems will therefore prefer bulkier machines (faster CPU, more memory) to run the coordinator. So far this hasn't been possible, as JGroups treated every node equally. This changes now: the pluggable policy allows for pinning the coordinatorship to a number of machines (a subset of the cluster nodes); a sketch of such a policy is shown below.
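To give an idea of what such a policy looks like, here's a minimal sketch. It assumes the MembershipChangePolicy interface described in the documentation [2] (check the exact signatures against your JGroups version); the preference logic itself is hypothetical and application-specific:

    // A sketch only: pins the coordinatorship to a preferred node when one is present.
    // moveBulkyNodeToFront() is a hypothetical, application-specific helper.
    public class PreferredCoordPolicy implements MembershipChangePolicy {

        // regular view change (join / leave / crash-leave)
        public List<Address> getNewMembership(Collection<Address> current_members, Collection<Address> joiners,
                                              Collection<Address> leavers, Collection<Address> suspects) {
            List<Address> members=new ArrayList<Address>(current_members);
            members.addAll(joiners);
            members.removeAll(leavers);
            members.removeAll(suspects);
            moveBulkyNodeToFront(members); // the member at index 0 becomes the coordinator
            return members;
        }

        // merge: flatten the subviews, then apply the same preference
        public List<Address> getNewMembership(Collection<Collection<Address>> subviews) {
            List<Address> members=new ArrayList<Address>();
            for(Collection<Address> subview: subviews)
                for(Address addr: subview)
                    if(!members.contains(addr))
                        members.add(addr);
            moveBulkyNodeToFront(members);
            return members;
        }

        protected void moveBulkyNodeToFront(List<Address> members) {
            // application-specific, e.g. compare logical names against a configured
            // list of "bulky" machines and move the first match to index 0
        }
    }

How such a policy gets registered with GMS is described in [2].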
The documentation is at [2]. For questions and feedback, please use the mailing list. 
Cheers,

[1] https://issues.jboss.org/browse/JGRP-592

[2] http://www.jgroups.org/manual/html/user-advanced.html#MembershipChangePolicy

Monday, July 29, 2013

First JGroups release under Apache License 2.0

I'm happy to announce that I just released 3.4.0.Alpha1, which is the first release under the Apache License 2.0 !

In order to change the license, I had to track down all contributors who still have code in the code base and get their permissions [1], which was time consuming.

Given that JGroups came into existence in 1998, and even though a lot of code has since been added/removed/replaced, there are still quite a number of people who have contributions in the current master.

Finding some of those folks turned out to be hard. I ran into a few cases of abandoned email addresses, and sometimes googling for "firstname lastname" led me to people with the same name but no connection to JGroups whatsoever...

Anyway, the change is done now and I wanted to release a first alpha under AL 2.0 as soon as possible, so other projects can start using JGroups with the new license.

Although 'alpha' suggests bad code quality, the opposite is actually true: this release is very stable, at least as stable as the latest 3.3.x. The major changes are listed below.
Enjoy !

RELAY2: allow for more than one site master
-------------------------------------------
[https://issues.jboss.org/browse/JGRP-1649]

If we have a lot of traffic between sites, having more than 1 site master 
increases performance and reduces stress on the single site master


Kerberos based authentication
-----------------------------
[https://issues.jboss.org/browse/JGRP-1657]

New AUTH plugin contributed by Martin Swales. Experimental, needs more work


Probe now works with TCP too
----------------------------
[https://issues.jboss.org/browse/JGRP-1568]

If multicasting is not enabled, probe.sh can be started as follows:

probe.sh -addr 192.168.1.5 -port 12345

where 192.168.1.5:12345 is the physical address:port of a node.
Probe will ask that node for the addresses of all other members and
then send the request to all members.



Optimizations
=============

UNICAST3: ack messages sooner
-----------------------------
[https://issues.jboss.org/browse/JGRP-1664]

Messages used to get acked after delivery rather than on reception. This was changed
so that long-running application code does not delay the ack, which
could lead to unneeded retransmissions by the sender.


Bug fixes
=========

FRAG/FRAG2: incorrect ordering with message batches
---------------------------------------------------
[https://issues.jboss.org/browse/JGRP-1648]

Reassembled messages would get reinserted into the batch at the end instead of 
at their original position


RSVP: incorrect ordering with message batches
---------------------------------------------
[https://issues.jboss.org/browse/JGRP-1641]

RSVP-tagged messages in a batch would get delivered immediately,
instead of waiting for their turn


Memory leak in STABLE
---------------------
[https://issues.jboss.org/browse/JGRP-1638]

Occurred when send_stable_msg_to_coord_only was enabled.


NAKACK2/UNICAST3: problems with flow control
--------------------------------------------
[https://issues.jboss.org/browse/JGRP-1665]

If an incoming message sent out other messages before returning, it could 
block forever as new credits would not be processed.
Caused by a regression (removal of ignore_sync_thread) in FlowControl.



AUTH: nodes without AUTH protocol can join cluster
--------------------------------------------------
[https://issues.jboss.org/browse/JGRP-1661]

If a member didn't ship an AuthHeader, the message would get passed up
instead of rejected.


LockService issues
------------------
[https://issues.jboss.org/browse/JGRP-1634]
[https://issues.jboss.org/browse/JGRP-1639]
[https://issues.jboss.org/browse/JGRP-1660]

Bug fix for concurrent tryLock() blocks and various optimizations.


Logical name cache is cleared, affecting other channels
-------------------------------------------------------
[https://issues.jboss.org/browse/JGRP-1654]

If we have multiple channels in the same JVM, a new view in one channel
triggers removal of the entries of all other channels' caches



Manual
======

The manual is at http://www.jgroups.org/manual-3.x/html/index.html.

The complete list of features and bug fixes can be found at 
http://jira.jboss.com/jira/browse/JGRP.


Bela Ban, Kreuzlingen, Switzerland
Aug 2013


[1] https://issues.jboss.org/browse/JGRP-1643




Tuesday, May 28, 2013

JGroups to investigate adopting Apache License 2.0

We've started to investigate whether JGroups can change its license from LGPL 2.1 [1] to Apache License (AL) 2.0 [2]. If possible, the next JGroups release will be licensed under AL 2.0.

The reason for the switch is that AL 2.0 is more permissive than LGPL 2.1 and can be integrated with other projects / products more easily. Also, AL 2.0 is one of the most frequently used licenses in the open source world [3], and you can't go wrong with that, can you ? :-)

I've received quite a number of requests to change the license over the years, from the open source community and from companies trying to integrate JGroups into their products.

This change should increase acceptance of JGroups with other open source projects or commercial products.

So what would the impact of this change be ?
  • For projects / products using a current (<= 3.3) JGroups release, nothing would change (they're still licensed under LGPL 2.1)
  • For projects / products upgrading to the AL licensed release (once it's out), nothing would change; the AL 2.0 license is even more permissive
Let me repeat: nothing would change, except that LGPL skeptics would now be able to use JGroups.

Note that Infinispan is looking into making the same change, see [4] for details. As Infinispan consumes JGroups, it is important for JGroups to have the same license as Infinispan, or else integration would be a nightmare; it's hard (if not impossible) for an AL project to consume an LGPL project.

Changing the license would be done in the next JGroups release. Whether this will be 3.4 or whether a license change warrants going directly to a 4.0 is still undecided.

We're currently working with Red Hat's legal department and the community to see whether this switch is possible and what needs to be done to make it happen.

Opinions ? Questions ? Feedback ? Please post to this blog, or the JGroups mailing list.

Cheers,

[1] http://opensource.org/licenses/LGPL-2.1
[2] http://opensource.org/licenses/Apache-2.0
[3] http://osrc.blackducksoftware.com/data/licenses
[4] http://infinispan.blogspot.co.uk/2013/05/infinispan-to-adopt-apache-software.html



Tuesday, May 07, 2013

JGroups 3.3.0.Final released

I'm happy to announce that JGroups 3.3.0.Final was released today.

It contains quite a few optimizations and new features, the most prominent ones being:
  • Message batching: messages received as bundles by the transport are passed up as batches. Compared to passing individual messages up the stack, the advantage is that we have to acquire resources (such as locks) only once per batch instead of once per message. This reduces the number of lock acquisitions and should lead to less context switching and better performance in contended scenarios.
  • Asynchronous Invocation API: This allows the recipient of a message in MessageDispatcher or RpcDispatcher to make the delivering thread return immediately (making it available for other requests) and to send the response later. The advantage is that it is the application which now decides how to deliver messages (e.g. sequentially or in parallel), and not JGroups. The documentation is here: http://www.jgroups.org/manual-3.x/html/user-building-blocks.html#AsyncInvocation (a short sketch is shown below this list).
  • UNICAST3: this is a new reliable unicast protocol, combining the advantages of UNICAST (immediate delivery) and UNICAST2 (speed). It is the default in 3.3.
  • New internal thread pool: to be used for internal JGroups messages only. This way, important internal JGroups messages are not impeded by the delivery of application messages ahead of them in delivery order. While this pool is not supposed to be used by application messages, it will help for example to decrease unneeded blocking (by credit messages getting queued up behind application messages), or reduce false suspicions (due to heartbeats getting handled too late, or even getting dropped).
  • RELAY2 improvements: RELAY2 is the protocol which relays traffic between geographically separate sites. This will be the topic of my talk at Red Hat Summit in June.
  • New timer implementation: a better, simpler and faster implementation; the default in 3.3.
  • New message bundler: this new bundler handles sending of individual messages and message batches equally well. Default in 3.3.
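To make the asynchronous invocation API (second bullet above) more concrete, here's a minimal sketch of a handler that returns the delivering thread immediately and sends the reply later from its own thread pool. The interface names follow the documentation linked above; this is not code shipped with JGroups:

    // A sketch only: an AsyncRequestHandler that processes requests in its own pool and
    // replies later, freeing the JGroups delivery thread immediately.
    public class AsyncHandlerExample implements AsyncRequestHandler {
        protected final java.util.concurrent.ExecutorService pool=
            java.util.concurrent.Executors.newCachedThreadPool();

        // called when asynchronous dispatching is enabled
        public void handle(final Message request, final Response response) throws Exception {
            pool.execute(new Runnable() {
                public void run() {
                    Object result="processed " + request.getObject();
                    response.send(result, false); // false: the reply is not an exception
                }
            });
        }

        // synchronous variant inherited from RequestHandler (used when async dispatching is off)
        public Object handle(Message request) throws Exception {
            return "processed " + request.getObject();
        }
    }

The handler is then passed to a MessageDispatcher or RpcDispatcher with asynchronous dispatching enabled; how to enable it is described in the documentation linked above.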
A more detailed version of these release notes can be found at [1]. The documentation can be found at [2]. The JIRA is at [3].

Please post questions and feedback as usual on the mailing list.


[1] https://github.com/belaban/JGroups/blob/3.3/doc/ReleaseNotes-3.3.0.txt

[2]  http://www.jgroups.org/manual-3.x/html/index.html

[3] https://issues.jboss.org/browse/JGRP

Monday, April 22, 2013

JBossWorld 2013 in Boston is around the corner

FYI,

I'm going to have a talk at JBossWorld / Red Hat Summit 2013 in Boston on June 13: http://www.redhat.com/summit/sessions/index.html#54.

This will be about a feature in JBoss Data Grid (JDG) which provides geographic failover between sites. I'm going to run a couple of JDG clusters (Boston, London and San Francisco), and initially route all clients to London. Then, as London shuts down at the end of their day, I'll route all clients over to Boston, and finally to San Francisco.

I have a demo that can be run by anyone with internet access, in their browser. Users can punch in some data, and will see their clients fail over between sites seamlessly, without data loss.

Hope to see / meet some of you in Boston !
Cheers,

Tuesday, February 05, 2013

Performance of message batching


A quick heads up on performance of message batching [1] [2]: I ran MPerf and UPerf and got the results as shown at the bottom of [1].

The tests were run on a 4 node cluster; with cluster sizes of 6 and 8, I ran 2 processes on the same physical box.

MPerf shows slightly better performance for 2 and 4 nodes, but significantly (10%) better performance when running more than 1 process on the same box (6 and 8 nodes). I think the reason is that, under contention, message batching's property of acquiring fewer locks reduces lock contention.

UnicastTestRPC shows exactly the same perf for the old (no message batching) and the new code (with message batching). The main reason here is that we use synchronous RPCs and one sender, which doesn't take advantage of message batching at all, as no message bundles are sent across the wire.

UPerf shows a significantly better perf for 4 nodes (11%) and 8 nodes (16% better). I guess the reason here is that we do make use of message batching as we have multiple sender threads and higher contention than in the previous test.

This is not the end of the line, as I haven't implemented message batching in the protocols above NAKACK2 and UNICAST2 yet: currently, messages are sent up in batches from the transport to NAKACK2 (multicast messages) or UNICAST2 (unicast messages), but from there on, they're sent up individually.
This will change in [3], but because this is a lot of work and will affect many classes, I thought I'd split the work into two parts.

The first part has been merged with master (3.3) and it would be interesting to get feedback from people trying this out !

Cheers,

[1] https://issues.jboss.org/browse/JGRP-1564

[2] http://belaban.blogspot.ch/2013/01/buy-one-get-many-for-free-message.html

[3] https://issues.jboss.org/browse/JGRP-1581

Wednesday, January 30, 2013

Buy one, get many for free: message batching in JGroups

Just a quick heads-up of what's going on in JGroups 3.3 with message batching.

Currently, when the transport receives a message bundle (say 20 messages), it passes the bundle to the regular thread pool (OOB messages are never bundled). The thread which handles the bundle grabs each message, adds it to the retransmission table, and then removes as many messages as possible and passes them up one-by-one.

So, although we have a message bundle, we don't process it as a bundle, but rather add each message individually and then pass them up one-by-one.

Message batching [1] changes this. It reads a message bundle directly into 2 MessageBatch instances: one for OOB messages and one for regular messages. (This already shows that OOB messages are now bundled, too, but more on this later). The OOB MessageBatch is passed to the OOB thread pool for handling, the regular batch to the regular pool.

A message batch is nothing more than a list of messages.

A message batch is handled by only 1 thread: the thread passes the entire batch up the stack. Each protocol can remove messages it consumes (e.g. FD_ALL), change messages in-place (e.g. COMPRESS or ENCRYPT), remove and add messages (e.g. FRAG2), remove messages and pass up a new batch (NAKACK2, UNICAST2) or even do nothing (SIZE, STATS).

The advantage is that a protocol can now handle many messages at once, amortizing (e.g.) lock acquisition costs. For example, NAKACK2 adds all 20 messages to the retransmission table at once, thereby acquiring the table lock only once rather than 20 times. This means we incur the cost of 1 lock acquisition instead of 20. It goes without saying that this also reduces lock contention, at least in this particular case, even if the lock is held slightly longer than before.
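As an illustration, here's a minimal sketch of how a protocol consumes messages from a batch. It assumes a protocol extending org.jgroups.stack.Protocol in 3.3; it's not code from any shipped protocol:

    // A sketch only: a protocol that consumes its own messages from a batch and passes
    // the remaining messages up as one batch.
    public class ExampleBatchAwareProtocol extends Protocol {

        public void up(MessageBatch batch) {
            for(Message msg: batch) {
                if(msg.getHeader(id) != null) { // one of our headers: consume the message ...
                    batch.remove(msg);          // ... and drop it from the batch
                    handle(msg);
                }
            }
            if(!batch.isEmpty())
                up_prot.up(batch);              // remaining messages travel up as a single batch
        }

        protected void handle(Message msg) {
            // protocol-specific processing of consumed messages
        }
    }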

I'll present some performance numbers soon, but so far preliminary performance tests look promising !

Message bundling queues messages and sends them across the wire as a list, but stops at the receiver's transport; message batching takes this idea further and passes that bundle up all the way to the channel. (Note that this will require another receive() callback in the Receiver, but this will be transparent by default).

Message batching will allow other cool things to happen, e.g.
  • OOB messages will be bundled too now. If no bundling is desired, tag a message as DONT_BUNDLE.
  • We can simplify the message bundler (on the sender side), see [2]. As a result, I might even be able to remove all 4 existing message bundlers. As you know, I like removing stuff, and making code easier to read !
  • RPC responses can be bundled [3]
  • UNICAST2 can now ack the 'last message' [4]
Cheers,


[1] https://issues.jboss.org/browse/JGRP-1564

[2] https://issues.jboss.org/browse/JGRP-1540

[3] https://issues.jboss.org/browse/JGRP-1566

[4] https://issues.jboss.org/browse/JGRP-1548

Wednesday, January 02, 2013

SUPERVISOR: detecting faults and fixing them automatically

I've added a new protocol SUPERVISOR [1] to master, which can periodically check for certain conditions and correct them if necessary. This will be in the next release (3.3) of JGroups.

You can think of SUPERVISOR as an automated rule-based system admin.

SUPERVISOR was born out of a discussion on the mailing list [2] where a bug in FD caused the failure detection task in FD to be stopped, so members would not get suspected and excluded anymore. This is bad if the suspected member was the coordinator itself, as new members would not be able to join anymore !

Of course, fixing the bug [3] was the first priority, but I felt that it would be good to also have a second line of defense that detected problems in a running stack. Even if a rule doesn't fix the problem, it can still be used to detect it and alert the system admin, so that the issue can be fixed manually.

The documentation for SUPERVISOR is here: [4].


[1] https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/rules/SUPERVISOR.java

[2] https://sourceforge.net/mailarchive/message.php?msg_id=30218296

[3] https://issues.jboss.org/browse/JGRP-1559

[4] http://www.jgroups.org/manual-3.x/html/user-advanced.html#Supervisor



Cross site replication: demo on YouTube

FYI,

I've recently published a video on cross-site replication [1] on Youtube: [2].

The video shows how to set up and configure cross-site replication in Infinispan, although the focus of the video is on running the performance test [3].

Cheers, and a belated happy new year to everyone !


[1] https://docs.jboss.org/author/display/ISPN/Cross+site+replication

[2] https://www.youtube.com/watch?v=owOs430vLZo

[3] https://github.com/belaban/IspnPerfTest

Friday, November 16, 2012

Persisting discovery responses with TCPPING

I've added a nifty little feature to JGroups which helps people who use TCPPING but can't list all of the cluster nodes in the static list.

So far I've always said that if someone needs dynamic discovery, they should use a dynamic discovery protocol such as PING / MPING (require IP multicasting), TCPGOSSIP (requires an external GossipRouter process), FILE_PING (requires a shared file system), S3_PING / AS_PING / SWIFT_PING / RACKSPACE_PING (require running in the respective cloud) or JDBC_PING (requires a database).

I always said that TCPPING is for static clusters, i.e. clusters where the membership is finite and always known beforehand.

However, there are cases where it makes sense to add a little dynamicity to TCPPING, and this is what PDC (Persistent Discovery Cache) does.

PDC is a new protocol that should be placed somewhere between the transport and the discovery protocol, e.g.

    <TCP />

    <PDC cache_dir="/tmp/jgroups"  />

    <TCPPING timeout="2000" num_initial_members="20"
            initial_hosts="192.168.1.5[7000]"
            port_range="0" return_entire_cache="true"
            use_disk_cache="true" />

Here, PDC is placed above TCP and below TCPPING. Note that we need to set use_disk_cache to true in the discovery protocol for it to use the persistent cache.

What PDC does is actually very simple: it intercepts discovery responses and persists them to disk. Whenever a discovery request is received, it also intercepts that request and adds its own results from disk to the response set.

Let's take a look at a use case (with TCPPING) that PDC solves:
  • The membership is {A,B,C}
  • TCPPING.initial_hosts="A"
  • A is started, the cluster is A|0={A}
  • B is started, the cluster is A|1={A,B}
  • C is started, the cluster is A|2={A,B,C}
  • A is killed, the cluster is B|3={B,C}
  • C leaves, the cluster is B|4={B}
  • C joins again
    • Without PDC, it doesn't get a response from A (which is the only node listed in TCPPING.initial_hosts), and forms a singleton cluster C|0={C}
    • With PDC, C discovers A and B and asks both of them for an initial discovery. B replies and therefore the new view is B|5={B,C}
The directory in which PDC stores its information is configured with PDC.cache_dir. If multiple cluster nodes are running on the same physical box, they can share that directory.

Feedback appreciated on the mailing list !
Cheers,
Bela

Friday, October 19, 2012

JGroups 3.2.0.Final released

I've released JGroups 3.2.0.Final, the most important features are:
  • RELAY2 
  • Internationalized logging
    •  The most important user-facing warnings and error messages (e.g. configuration errors) have been internationalized.
    • Error/warning translations are in jg-messages.properties. If someone wants to translate these into a different language, e.g. French, just copy jg-messages.properties into jg-messages_fr.properties and translate the messages. The new file now only needs to be added to the classpath, no changes to JGroups !
  • Reduction of error/warn messages
    • Sometimes there are a lot of recurring warnings or error messages, e.g. warnings about messages received from different clusters, or warnings about messages from members with different JGroups versions.
    • These can now be suppressed for a certain time, e.g. we can configure that there's only *one* warning every 60 seconds about messages from different clusters.
    • [https://issues.jboss.org/browse/JGRP-1518]

A full list of features and bug fixes is here.

The manual can be found at http://www.jgroups.org/manual-3.x/html/index.html.

Questions and feedback as usual on the mailing lists.

Enjoy !

Bela Ban
Kreuzlingen, Oct 2012

Friday, July 06, 2012

JGroups 3.0.11 and 3.1.0 released

I'm happy to announce that I've released JGroups versions 3.0.11 and 3.1.0 !

3.0.11 is the 3.0.x branch which is used by the newly released EAP 6 / JBoss 7.x application server. It consists mainly of bug fixes (and one or two performance enhancements) backported from the 3.1 branch.

The 3.1.0 release has 90+ issues which were resolved (some of them backported to 3.0.x).

Here's a short list of the major issues resolved in 3.1.0, for details consult [2]:

  • NAKACK, UNICAST and NAKACK2 now use a new internal data structure for message delivery and retransmission, which reduces the memory needed by JGroups
  • MERGE3: a new merge protocol for large clusters
  • RSVP: blocks the sender until a given message has been received by all members of the target set
  • A new Total Order Anycast (TOA) protocol needed by the next version of Infinispan to deliver messages to a cluster subset in total order
  • New discovery protocols for mod-cluster (not yet completely done), Rackspace and OpenStack
  • MPerf / UPerf: dynamic multicast and unicast performance tests
  • Concurrent joins to a non-existing cluster are faster, and there's less chance of a merge happening (optimization)
  •  TCP: socket creation doesn't block sending of regular messages (optimization)
Both JGroups 3.0.11 and 3.1.0 can be downloaded from [1].  The updated documentation can be found at [3].

As usual, use the mailing lists or fora for questions.

Enjoy !


[1] https://sourceforge.net/projects/javagroups/files/JGroups/

[2] https://github.com/belaban/JGroups/blob/master/doc/ReleaseNotes-3.1.0.txt

[3] http://www.jgroups.org/manual-3.x/html/index.html

Sunday, April 01, 2012

JBoss World 2012

I'm going to be speaking at JBossWorld 2012 (June 29th) on session clustering in EAP 6 (JBoss 7.1.x):
http://www.redhat.com/summit/sessions/best-of.html#18

The talk is a remake of the 2008 talk held by Brian Stansberry and me, and will show how clustering performance has increased between JBoss 4 and 7. However, this is not all; I'll also cover, among other things:
  • Configuration of an EAP 6 cluster
  • Use of EAP 6 domains to start and stop JBoss instances in a cluster, to deploy applications across the entire cluster, and to disseminate configuration changes
  • Pros and cons of replication and distribution, and its effect on scalability and performance
  • Configuration and tuning of Infinispan and JGroups to achieve optimal performance
  • Setup of mod-cluster to dynamically add and remove JBoss instances and applications
  • Performance difference between EAP 5 and 6
I'll be in Boston Tuesday until Friday and hope to meet many users of JGroups/Infinispan/JBoss clustering, get feedback and experience reports on the good, bad and ugly, and in general have many good discussions !

Friday, February 10, 2012

JGroups 3.1.0.Alpha2 released

I'm happy to announce the release of JGroups 3.1.0.Alpha2 !

Don't be put off by the Alpha2 suffix; as a matter of fact, this release is very stable, and I might just go ahead and promote it to "Final" within a short time !

At the time of writing this, I still have a few issues open in 3.1, but because I think the current feature set is great, I might push them into a 3.2.

So what features and enhancements did 3.1 add ? In a nutshell:

  • A new protocol NAKACK2: this is a successor to NAKACK (which will get phased out over the next couple of releases). The 2 biggest changes are:
    • A new memory efficient data structure (Table) is used to store messages to be retransmitted. It can grow and shrink dynamically, and replaces NakReceiverWindow.
    • There is no Retransmitter associated with each table, and we don't create an entry *per* missing sequence number (seqno) or seqno range. Instead, we have *one* single retransmission task, which periodically (xmit_interval ms) scans through *all* tables, identifies gaps and triggers retransmission for missing messages. This is a significant code simplification and brings memory consumption down when we have missing messages.
  • Changes to UNICAST2 and UNICAST: in both cases, we switch from NakReceiverWindow  / AckSenderWindow / AckReceiverWindow to Table and instead of a retransmitter per member, we now have *one* retransmitter task for *all* members.
  • The changes in NAKACK2, UNICAST2 and UNICAST have several benefits:
    • Code simplification: having only one data structure (Table) instead of several ones (NakReceiverWindow, AckSenderWindow, AckReceiverWindow), plus removing all Retransmitter implementations leads to simpler code.
    • Code reduction: several classes can be removed, making the code base simpler to understand, and reducing complexity
    • Better maintainability: Table is now an important core data structure, and improvements to it will affect many parts of JGroups
    • Smaller memory footprint: especially for larger clusters, having less per-member data (e.g. retransmission tasks) should lead to better scalability in large clusters (e.g. 1000 nodes).
    • Smooth transition: we'll leave NAKACK (and therefore NakReceiverWindow and Retransmitter) in JGroups for some releases. NAKACK / NakReceiverWindow have served JGroups well for over a decade, and are battle-tested. When there is an issue with NAKACK2 / Table in  production, we can always fall back to NAKACK. I intend to phase out NAKACK after some releases and a good amount of time spent in production around the world, to be sure NAKACK2 works well
  • MERGE3: merging is frequent in large clusters. MERGE3 handles merging in large clusters better by
    • preventing (or reducing the chances of) concurrent merges
    • reducing traffic caused by merging
    • disseminating {UUID/physical address/logical name} information, so every node has this information, reducing the number of times we need to ask for it explicitly.
    • MERGE3 was written with UDP as transport in mind (which is the transport recommended for large clusters anyway), but it also works with TCP. 
  • Synchronous messages: they  block the sender until the receiver or receivers have ack'ed its delivery. This allows for 'partial flushing' in the sense that all messages sent by a member P prior to M will get delivered at all receivers before delivering M.
    This is related to FLUSH, but less costly and can be done per message. For example, if a unit of work is done, a sender could send an RSVP tagged message M and would be sure that - after the send() returns - all receivers have delivered M.
    To send an RSVP-marked message, Message.setFlag(Message.Flag.RSVP) has to be used.
    A new protocol (RSVP) needs to be added to the stack. See the documentation (link below) for details.
  • A new rackspace-based discovery protocol
  • Concurrent joining to a non-existing cluster is faster
  • Elimination (or reduction of) "no physical address for X; dropping message" warnings
  • Elimination of global JGroups ThreadGroup leaks
  • Elimination of socket leaks with TCPPING
The full list of changes is at [1], the manual can be found at [2] and 3.1 can be downloaded from [3].

Feedback is appreciated on the mailing lists, enjoy !


[1] https://github.com/belaban/JGroups/blob/master/doc/ReleaseNotes-3.1.0.txt
[2] http://www.jgroups.org/manual-3.x/html/index.html
[3] https://sourceforge.net/projects/javagroups/files/JGroups/3.1.0.Alpha2/



Tuesday, December 06, 2011

Repondez s'il vous plait !

No, this isn't a post in French (my school French would be too rusty for this !); this is about a new protocol in JGroups, called RSVP :-)

As the name possibly suggests, this feature allows for messages to get ack'ed by receivers before a message send returns. In other words, when A broadcasts a message M to {A,B,C,D}, then JChannel.send() will only return once A itself, B, C and D have acknowledged that they delivered M to the application.

This differs from the default behavior of JGroups which always sends messages asynchronously, and guarantees that all non-faulty members will eventually receive the message. If we tag a message as RSVP, then we basically have 2 properties:
  1. The message send will only return when we've received all acks from the current members. Members leaving or crashing during the wait are treated as if they sent an ack. The send() method can also throw a (runtime) TimeoutException if a timeout was defined (in RSVP) and encountered.
  2. If A sent (asynchronous) messages #1-10, and tagged #10 as RSVP, then - when send() returns successfully - A is guaranteed that all members received A's message #10 and all messages prior to #10, that's #1-9.
This can be used for example when completing a unit of work, and needing to know that all current cluster members received all of the messages sent up to now by a given cluster member.
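For example (a minimal sketch; it assumes a channel whose stack includes the RSVP protocol):

    // A sketch only: tag a message as RSVP, so send() blocks until all current members ack it.
    protected void notifyUnitOfWorkDone(JChannel ch) throws Exception {
        Message msg=new Message(null, "unit of work done"); // null destination: send to all members
        msg.setFlag(Message.Flag.RSVP);
        ch.send(msg); // returns only when all current members have delivered this message (and,
                      // by property 2 above, all messages sent before it); may throw a
                      // TimeoutException if a timeout is configured in RSVP and exceeded
    }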

This is similar to FLUSH, but less strict in that it is a per-sender flush, there is no reconciliation phase, and it doesn't stop the world.

An alternative is to use a blocking RPC. However, I wanted to add the capability of synchronous messages directly into the base channel.

Note that this also solves another problem: if A sends messages #1-5, but some members drop #5, and A doesn't send more messages for some time, then A#5 won't get delivered at some members for quite a while (until stability (STABLE) kicks in).

RSVP will be available in JGroups 3.1. If you want to try it out, get the code from master [2]. The documentation is at [1], section 3.8.8.2.

For questions, I suggest one of the mailing lists.
Cheers,

[1] http://www.jgroups.org/manual-3.x/html/user-channel.html#SendingMessages

[2] https://github.com/belaban/JGroups


Thursday, November 17, 2011

JGroups 3.0.0.Final released

I'm happy to announce that JGroups 3.0.0.Final is here !

While the original intent was to make only API changes (some of them queued for years), this release also contains several optimizations, most of them related to running JGroups in larger clusters.

For instance, the size of several messages has been reduced, and some protocol rounds have been eliminated, making JGroups more memory efficient and less chatty.

For the last couple of weeks, I've been working on making merging of 100-300 cluster nodes faster and making sure a merge never blocks. To this end, I've written a unit test, which creates N singleton nodes (= nodes which only see themselves in the cluster), then make them see each other and wait until a cluster of N has formed.

The test itself was a real challenge because I was hitting the max heap size pretty soon. For example, with 300 members, I had to increase the heap size to at least 900 MB to make the test complete. This indicates that a JGroups member needs roughly 3 MB of heap at most. Of course, I had to use shared thread pools and timers, and do a fair amount of (memory) tuning on some of the protocols, to accommodate 300 members all running in the same JVM.

Running in such a memory constrained environment led to some more optimizations, which will benefit users, even if they're not running 300 members inside the same JVM ! :-)

One of them is that UNICAST / UNICAST2 maintain a structure for every member they talk to. So if member A sends a unicast to each and every member of a cluster of 300, it'll have 300 connections open.

The change is to close connections that have been idle for a given (configurable) time, and re-establish them when needed.

Further optimizations will be made in 3.1.

The release notes for 3.0.0.Final are here: https://github.com/belaban/JGroups/blob/master/doc/ReleaseNotes-3.0.0.txt

JGroups 3.0.0.Final can be downloaded here: https://sourceforge.net/projects/javagroups/files/JGroups/3.0.0.Final

As usual, if you have questions, use one of the mailing lists for questions.

Enjoy !


Monday, September 12, 2011

Publish-subscribe with JGroups

I've added a new demo program (org.jgroups.demos.PubSub), which shows how to use JGroups channels to do publish-subscribe.

Pub-sub is a pattern where instances subscribe to topics and receive only messages posted to those topics. For example, in a stock feed application, an instance could subscribe to topics "rht", "aapl" and "msft". Stock quote publishers could post to these topics to update a quote, and subscribers would get notified of the updates.

The simplest way to do this in JGroups is for each instance to join a cluster; publishers send topic posts as multicasts, and subscribers discard messages for topics to which they haven't subscribed.

The problem with this is that a lot of multicasts will make it all the way up to the application, only to be discarded there if the topic doesn't match. This means that a message is received by the transport protocols (by all instances in the cluster), passed up through all the protocols, and then handed over to the application. If the application discards the message, then all the work of fragmenting, retransmitting, ordering, flow-controlling, de-fragmenting, uncompressing and so on is unnecessary, resulting in wasted CPU cycles, lock acquisitions, cache and memory accesses, context switching and bandwidth.

A solution to this could be to do topic filtering at the publisher's side: a publisher maintains a hashmap of subscribers and topics they've subscribed to and sends updates only to instances which have a current subscription.

This has two drawbacks though: first the publishers have additional work maintaining those subscriptions, and the subscribers need to multicast subscribe or unsubscribe requests. In addition, new publishers need to somehow get the current subscriptions from an existing cluster member (via state transfer).

Secondly, to send updates only to instances with a subscription, we'd have to resort to unicasts: if 10 instances of a 100 instance cluster are subscribed to "rht", an update message to "rht" would entail sending 10 unicast messages rather than 1 multicast message. This generates more traffic than needed, especially when the cluster size increases.

Another solution, and that's the one chosen by PubSub, is to send all updates as multicast messages, but discard them as soon as possible at the receivers when there isn't a match. Instead of having to traverse the entire JGroups stack, a message that doesn't match is discarded directly by the transport, which is the first protocol that receives a message.

This is done by using a shared transport and creating a separate channel for each subscription: whenever a new topic is subscribed to, PubSub creates a new channel and joins a cluster whose name is the topic name. This is not overly costly, as the transport protocol - which contains almost all the resources of a stack, such as the thread pools, timers and sockets -  is only created once.

The first channel to join a cluster will create the shared transport. Subsequent channels will only link to the existing shared transport, but won't initialize it. Using reference counting, the last channel to leave the cluster will de-allocate the resources used by the shared transport and destroy it.

Every channel on top of the same shared transport will join a different cluster, named after the topic. PubSub maintains a hashmap of topic names as keys and channels as values. A "subscribe rht" operation simply creates a new channel (if there isn't one for topic "rht" yet), adds a listener, joins cluster "rht" and adds the topic/channel pair to the hashmap. An "unsubscribe rht" grabs the channel for "rht", closes it and removes it from the hashmap.
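Here's a minimal sketch of that topic/channel map. This is not the actual org.jgroups.demos.PubSub code; it assumes a configuration file whose transport is marked as shared (e.g. via singleton_name):

    // A sketch only: one channel per topic, all sharing the same transport.
    protected final Map<String,JChannel> topics=new HashMap<String,JChannel>();

    protected void subscribe(final String topic) throws Exception {
        if(topics.containsKey(topic))
            return;
        JChannel ch=new JChannel("shared-transport.xml"); // hypothetical config with a shared transport
        ch.setReceiver(new ReceiverAdapter() {
            public void receive(Message msg) {
                System.out.println("[" + topic + "] " + msg.getObject());
            }
        });
        ch.connect(topic);      // the cluster name is the topic name
        topics.put(topic, ch);
    }

    protected void unsubscribe(String topic) {
        JChannel ch=topics.remove(topic);
        if(ch != null)
            ch.close();
    }

    protected void publish(String topic, java.io.Serializable data) throws Exception {
        JChannel ch=topics.get(topic);
        if(ch != null)
            ch.send(new Message(null, data)); // multicast to the topic's cluster
    }

PubSub's actual implementation differs in details, but the idea is the same: the cluster name doubles as the topic name, and the shared transport does the early filtering.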

When a publisher posts an update for "rht", it essentially sends a multicast to the "rht" cluster.

The important point is that, when an update for "rht" is received by a shared transport, JGroups tries to find the channel which joined cluster "rht" and passes the message up to that channel (through its protocol stack), or discards it if there isn't a channel which joined cluster "rht".

For example, if we have 3 channels A, B and C over the same shared transport TP, and A joined cluster "rht", B joined "aapl" and C joined "msft", then when a message for "ibm" arrives, it will be discarded by TP as there is no cluster "ibm" present. When a message for "rht" arrives, it will be passed up the stack for "rht" to channel A.

As a non-matching message will be discarded at the transport level, and not the application level, we save the costs of passing the message up the stack, through all the protocols and delivering it to the application.

Note that PubSub uses the properties of IP multicasting, so the stack used by it should have UDP as shared transport. If TCP is used, then there are no benefits to the approach outlined above.

Wednesday, September 07, 2011

Speaking at the OpenBlend conference on Sept 15

FYI,

I'll be speaking at the OpenBlend conference in Ljubljana on Sept 15.

My talk will be about how to persist data without using a disk, by spreading it over a grid with a customizable degree of redundancy. Kind of the NoSQL stuff everybody and their grandmothers are talking about these days...

I'm excited to visit Ljubljana, as I've never been there before and I like seeing new towns.

The other reason, of course, is to beat Ales Justin's a**s in tennis :-)

If you happen to be in town, come and join us ! I mean not for tennis, but for the conference, or for a beer in the evening !

Cheers,
Bela

Thursday, September 01, 2011

Optimizations for large clusters

I've been working on making JGroups more efficient on large clusters. 'Large' is between 100 and 2000 nodes.

My focus has been on making the memory footprint smaller, and to reduce the wire size of certain types of messages.


Here are some of the optimizations that I implemented.

Discovery

Discovery is needed by a new member to find the coordinator when joining. It broadcasts a discovery request, and everybody in the cluster replies with a discovery response.

There were 2 problems with this: first, a cluster of 1000 nodes meant that a new joiner received 1000 messages at the same time, possibly clogging up network queues and causing messages to get dropped.

This was solved by staggering the sending of responses (stagger_timeout).

The second problem was that every discovery response included the current view. In a cluster of 1000, this meant that 1000 responses each contained a view of 1000 members !

The solution was to send back only the coordinator's address, as that's all a new member needs in order to send it a JOIN request. So instead of sending back 1000 addresses with every discovery response, we now send back just 1 address.


Digest

A digest used to contain the lowest, highest delivered and highest received sequence numbers (seqnos) for every member. Digests are sent back to a new joiner in a JOIN response, and they are also broadcast periodically by STABLE to purge messages that have been delivered by everyone.

The wire size would be 2 longs for every address (UUID), and 3 longs for the 3 seqnos. That's roughly 1000 * 5 * 8 = 40'000 bytes for a cluster of 1000 members. Bear in mind that that's the size of one digest; in a cluster of 1000, everyone broadcasts such a digest periodically (STABLE) !

The first optimization was to remove the 'low' seqno; I had to change some code in the retransmitters to allow for that, but - hey - who wouldn't do that to save 8 bytes / STABLE message ? :-)

This reduced the wire (and memory !) size of a 1000-member digest by another 8'000 bytes, down to 32'000 (from 40'000).

Having only highest delivered (HD) and highest received (HR) seqnos allowed for another optimization: HR is always >= HD, and the difference between HR and HD is usually small.

So the next optimization was to send HR as a delta to HD. So instead of sending 322649 | 322650, we'd send 322649 | 1.

The central optimization underlying this is that seqnos seldom need 8 bytes: a seqno starts at 1 and increases monotonically. If a member sends 5 million messages, the seqno can still be encoded in 4 bytes (saving 4 bytes per seqno). If a member is restarted, the seqno starts again at 1 and can thus be encoded in 1 byte.

So now I could encode an HD/HR pair by sending a byte containing the number of bytes needed for the HD part in the higher 4 bits and the number of bytes needed for the delta in the lower 4 bits. The HD and the delta would then follow. Example: to encode HD=2000000 | HR=2000500, we'd generate the bytes:

| 50 | -128 | -124 | 30 | -12 | 1 |

  • 50 encodes a length of 3 for HD and 2 for the delta HR-HD (500)
  • -128, -124 and 30 encode 2'000'000 in 3 bytes
  • -12 and 1 encode the delta (500)

So instead of using 16 bytes for the above sequence, we use only 6 bytes !
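
Here's a small sketch of this encoding (simplified, and not the literal JGroups code), which reproduces the byte sequence above for HD=2'000'000 and HR=2'000'500:

// Sketch only: packs an HD/HR pair as [length byte][HD bytes][delta bytes],
// least significant byte first. Not the actual JGroups implementation.
static byte[] encodeSeqnoPair(long hd, long hr) {
    long delta = hr - hd;
    int hdLen = bytesNeeded(hd), deltaLen = bytesNeeded(delta);
    byte[] out = new byte[1 + hdLen + deltaLen];
    out[0] = (byte)((hdLen << 4) | deltaLen);   // 3 and 2 -> 0x32 == 50
    int idx = 1;
    for(int i = 0; i < hdLen; i++)
        out[idx++] = (byte)(hd >> (i * 8));     // 2'000'000 -> -128, -124, 30
    for(int i = 0; i < deltaLen; i++)
        out[idx++] = (byte)(delta >> (i * 8));  // 500 -> -12, 1
    return out;
}

static int bytesNeeded(long val) {              // number of bytes needed to represent val (>= 1)
    int n = 1;
    while((val >>>= 8) != 0)
        n++;
    return n;
}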

If we assume that we can encode 2 seqnos on average in 6 bytes, the wire size of a digest is now 1000 * (16 (UUID) + 6) = 22'000, that's down from 40'000 in a 1000 member cluster. In other words, we're saving almost 50% of the wire size of a digest !

Of course, we can encode not only seqno sequences, but also other longs, which is exactly what we did in another optimization. Examples of where this makes sense are:
  • Seqnos in NakackHeaders: every multicast message has such a header, so the savings here are significant
  • Range: this is used for retransmission requests, and is also a seqno sequence
  • RequestCorrelator IDs: used for every RPC
  • Fragmentation IDs (FRAG and FRAG2)
  • UNICAST and UNICAST2: seqnos and ranges
  • ViewId
An example of where this doesn't make sense is UUIDs: they are generated such that the bits are spread out over all 8 bytes of each long, so encoding them would turn 8 bytes into 9, which doesn't help.


JoinRsp

A JoinRsp used to contain the list of members twice: once in the view and once in the digest. This duplication was eliminated, and now we send the member list only once. This cut the wire size of a JoinRsp roughly in half.



Further optimizations planned for 3.1 include delta views and better compressed STABLE messages:



Delta views

If we have a view of 1000 members, we always send the full address list with every view change. This is not necessary, as everybody has access to the previous view.

So, for example, when P, Q and R join and X and Y leave in V22, we can simply send a delta view V22={V21+P+Q+R-X-Y}. This means: take the current view V21, remove members X and Y, and add members P, Q and R to the tail of the list to generate the new view V22.

So, instead of sending a list of 1000 members, we simply send 5 members, and everybody creates the new view locally, based on the current view and the delta information.
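
A sketch of how a member could apply such a delta locally (illustrative only, not JGroups code; member addresses are plain Strings here for brevity):

// Compute the new view from the current view plus the joiners/leavers delta
static List<String> applyDeltaView(List<String> currentView,
                                   Collection<String> joiners,
                                   Collection<String> leavers) {
    List<String> newView = new ArrayList<String>(currentView); // start from V21
    newView.removeAll(leavers);                                // remove X and Y
    newView.addAll(joiners);                                   // append P, Q and R to the tail
    return newView;                                            // membership of the new view V22
}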


Compressed STABLE messages

A STABLE message contains a digest with a list of all members, plus the HD and HR seqnos for each member. Since STABLE messages are exchanged between members of the same cluster, they all have the same view; a member with a different view would drop the STABLE message anyway.

Hence, we can drop the View and instead send the ViewId, which is 1 address and a long. Everyone knows that the digest seqnos will be in order of the current view, e.g. seqno pair 1 belongs to the first member of the current view, seqno pair 2 to the second member and so on.

So instead of sending a list of 1000 members for a STABLE message, we only send 1 address.
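
As a sketch, the payload of such a compressed STABLE message boils down to something like this (class and field names are illustrative, not the actual JGroups classes):

// The member list is gone; seqno pairs are interpreted positionally, in view order
class CompressedStableDigest {
    Address viewCreator;  // together with viewId, this identifies the view (the ViewId)
    long    viewId;       // monotonically increasing view ID
    long[]  seqnos;       // [hd0, delta0, hd1, delta1, ...]: pair i belongs to member i of the view
}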

This will reduce the wire size of a 1000-member digest sent by STABLE from roughly 40'000 bytes to ca. 6'000 bytes !



Download 3.0.0.CR1

The optimizations (excluding delta views and compressed STABLE messages) are available in JGroups 3.0.0.CR1, which can be downloaded from [1].

Enjoy (and feedback appreciated, on the mailing lists...) !

[1] https://sourceforge.net/projects/javagroups/files/JGroups/3.0.0.CR1

Tuesday, July 26, 2011

It's time for a change: JGroups 3.0

I'm happy to announce that I just released a first beta of JGroups 3.0 !

It's been a long time since I released version 2.0 (Feb 2002): more than 9 years and 77 2.x releases !

We deferred most API changes to 3.x, so that 2.x releases could keep adding features, bug fixes and optimizations while always remaining (API) backwards compatible with previous 2.x releases.

However, now it was time to take that step and make all the changes we've accumulated over the years.

The bad thing is that 3.x will require code changes if you port your 2.x app to it... however I anticipate that those changes will be trivial. Please ask questions regarding porting on the JGroups mailing list (or forums), and also post suggestions for improvements !

The good thing is that I was able to remove a lot of code (ca. 25'000 lines compared to 2.12.1) and simplify JGroups significantly.

Just one example: the getState(OutputStream) callback in 2.x didn't have an exception in its signature, so an implementation would typically look like this:

public void getState(OutputStream output) {
    try {
        marshalStateToStream(output);
    }
    catch(Exception ex) {
         log.error(ex);
    }
}

In 3.x, getState() is allowed to throw an exception, so the code looks like this now:

public void getState(OutputStream output) throws Exception {
    marshalStateToStream(output);
}

First of all, we don't need to catch (and swallow !) the exception. Secondly, a possible exception will now actually be passed to the state requester, so that we know *why* a state transfer failed when we call JChannel.getState().
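
On the requester side, that could look like this (a sketch: the null target means "coordinator", and the 10 second timeout is arbitrary):

try {
    channel.getState(null, 10000); // fetch the state from the coordinator, 10s timeout
}
catch(Exception ex) {
    // in 3.x, the root cause of a failed state transfer actually reaches the caller
    System.err.println("state transfer failed: " + ex);
}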

There are many small (or bigger) changes like this, which I hope will make using JGroups simpler. A list of all API changes can be found at [2].

The stability of 3 beta1 is about the same as 2.12.1 (very high), because there were mainly API changes, and only a few bug fixes or optimizations.

I've also created a new 3.x specific set of documentation (manual, tutorial, javadocs), for example see the 3.x manual at [3].

JGroups 3 beta1 can be downloaded from [1]. Please try it out and send me your feedback (mailing lists preferred) !

Enjoy !



[1] https://sourceforge.net/projects/javagroups/files/JGroups/3.0.0.Beta1

[2] https://github.com/belaban/JGroups/blob/master/doc/API_Changes.txt

[3] http://www.jgroups.org/manual-3.x/html/index.html

Friday, April 29, 2011

Largest JGroups cluster ever: 536 nodes !

I just returned from a trip to a customer who's working on creating a large-scale JGroups cluster. The largest cluster I've ever created is 32 nodes, because I don't have access to a larger lab...

I've heard of a customer who's running a 420 node cluster, but I haven't seen it with my own eyes.

However, this record was surpassed on Thursday April 28 2011: we managed to run a 536 node cluster !

The setup was 130 Celeron-based blades with 1GB of memory, each running 4 JVMs with 96MB of heap, plus 4 embedded devices, each also running 4 JVMs. Each blade had two 1Gb NICs set up with IP bonding. Note that the 4 processes per blade are competing for CPU time and network IO, so with more blades or more physical memory available, I'm convinced we could go to 1000+ nodes !

The configuration used was udp-largecluster.xml (with some modifications), recently created and shipped with JGroups 2.12.

We started the processes in batches of 130, then waited for 20 seconds, then launched the second batch and so on. The reason we staggered the startup was to reduce the number of merges, which would have increased the startup time.

We ran this a couple of times (plus 50+ times overnight): the cluster always formed fine, and most of the time we didn't have any merges at all.

It took around 150-200 seconds (including the 5 sleeps of 20 seconds each) to start the cluster; in the picture at the bottom we see a run that took 176 seconds.

Changes to JGroups

This large scale setup revealed that certain protocols need slight modifications to optimally support large clusters, a few of these changes are:
  • Discovery: the current view is sent back with every discovery response. This is not normally an issue, but with a 500+ member view, the size of a discovery response becomes huge. We'll fix this by returning only the coordinator's address and not the view. For discovery requests triggered by MERGE2, we'll return the ViewId instead of the entire view.
  • We're thinking about canonicalizing UUIDs into short IDs, so nodes will be assigned unique (short) IDs instead of UUIDs. This reduces the in-memory size from 17 bytes (UUID) to 2 bytes (short).
  • STABLE messages: here, we return an array of members plus a digest (containing 3 longs) for *each* member. This also generates large messages (11K for 260 nodes).
  • The fix in general for these problems is to reduce the data sent, e.g. by compressing the view, or not sending it at all, if possible. For digests, we can also reduce the data sent by sending only diffs, by sending only 1 long and using shorts for diffs, by using bitsets representing offsets to a previously sent value, and so on. 
Ideas are abundant, we now need to see which one is the most efficient.

For now, 536 nodes is an excellent number and - remember - we got to this number *without* the changes discussed above ! I'm convinced we can easily go higher, e.g. to 1000 nodes, without any changes. However, to reach 2000 nodes, the above changes will probably be required.

Anyway, I'm very happy to see this new record !

If anyone has created an even larger cluster, I'd be very interested in hearing about it !
Cheers, and happy clustering,



Friday, April 01, 2011

JBossWorld 2011 around the corner

Wanted to let you know that I've got 2 talks at JBW (Boston, May 3-6).

The first talk [1] is about geographic failover of JBoss clusters. I'll show 2 clusters, one in NYC, the other one in ZRH. Both are completely independent and don't know about each other. However, they're bridged with a JGroups RELAY and therefore appear as if they were one big virtual cluster.

This can be used for geographic failover, but it could also be used for example to extend a private cloud with an external, public cloud without having to use a hardware VPN device.

As always with my talks, this will be demoed, so you know this isn't just vaporware !

The second talk [2] discusses 5 different ways of running a JBoss cluster on EC2. I'll show 2 demos, one of which works only on EC2, the other works on all clouds.

This will be a fun week, followed by a week of biking in the Bay Area ! YEAH !!

Hope to see and meet many of you in Boston !
Cheers,


[1] http://www.redhat.com/summit/sessions/best-of.html#66

[2] http://www.redhat.com/summit/sessions/jboss.html#43