Every time you come across anything more than a rudimentary system that has some moderately serious performance needs, someone somewhere on the team considers using asynchronous processing to help reduce (perceived) response time. That person needs to be identified and quickly locked in a padded room . . . . Just kidding!
Often that person is me and often I end up writing the code and relearning why doing things in an asynchronous way (that also meets a certain "near guarantee" SLA including DR and HA needs) is very very hard.
So it was with some chagrin that I was tasked with coding some infrastructure components to implement a Task Queue for my current team. The goal was to satisfy a need to improve response time of the application when it was doing some back-end tasks that required 100ms or more. The tasks themselves aren't time critical but they are important e.g. replicating data to a remote site.
Fortunately this time I'm not writing financial systems software - there you typically need to guarantee that although a task is asynchronous that it will be done within some SLA (e.g. 2-3 seconds). So at least I didn't have that worry. That said, for most apps asynchronous doesn't mean performing the task "whenever" - it means a little bit later than now. In addition you still have to consider that your solution is Highly Available and has built in Disaster (or just plain VM crash) Recovery.
Anyway the default solutions in Java for Asynchronous processing are
1) Threads (and java.util.concurrent - which is awesome)
2) JMS (Java Message Service)
Why Threads aren't the solution
Java.util.concurrent is great - a really great step up from Threads and writing your own infrastructure to start, stop and generally manage and handle threads and thread pools. If you haven't dug into this package yet do so now. It's a life saver.
The downside to Threads / Java.util.concurrent is that when you launch a job to be executed in the same JVM and same VM slide then
1) If your JVM or your VM die without some persistence mechanism your job is lost forever.
2) You never have just one asynchronous job and you'd like to be able to balance the load
of asynchronous jobs across your entire farm of virtual machines and their JVMs.
JMS in a Nutshell
So that's where people start to invoke JMS (Java Message Service). JMS proposes two
core models of asynchronous communication.
1) Queues - which are a point-to-point communication system. VM #3 writes a message to a queue and some other thread in some other VM reads that message from the queue. Only one writer - one reader.
2) Topics - which are a "broadcast" communication system. VM #3 broadcasts a message to a topic and multiple threads on multiple VMs (each with potentially different processing goals) read the thread.
Here come the problems . . . .
The key problem with Queues is ensuring that although there is only one reader that the job, once it is read from the queue, actually gets executed.
The key problem with Topics is often ensuring that although there are multiple readers - every individual type of processing (e.g. store a record in a database table) happens only once - otherwise you are wasting resources.
There are other problems too. JMS doesn't have any built-in specification for Disaster recovery or High availability. Therefore each JMS provider (and there are many e.g. MQSeries, Progress Sonic, Active MQ) provides their own mechanisms.
The first other problem is ensuring that if a message is written (to a topic or queue) and one of the JMS processes (a broker) crashes or there's a network hiccup that the message is not lost. For that to occur you need a persistence mechanism. The persistence mechanisms either use the local filesystem of the broker or a relational DB. Either way you are just pushing your DR problem a bit further away from your JMS broker to now worrying about the local disk or your database. IMO it's not exactly a solution to replace one DR nightmare with another one. In addition to doing this another MAJOR downside to persistence is that it typically reduces throughput by an order of magnitude or more. See for example this link for Active MQ alone but this link comparing ActiveMQ and JBoss MQ and JBoss Messaging..
Guaranteed Messaging - who's guarantee?
The other problem is ensuring that once a message is read by a thread that it executes the task successfully. That guarantee is impossible - VMs crash, exceptions are thrown etc. Ideally if the thread can't handle the task you would want to put it back on the queue for someone else to attempt to handle it. This is hard with JMS although there is a way to "trick" JMS into supporting this that involves using message acknowledgement. That is you do not use the AUTO_ACKNOWLEDGE default - only acknowledging receipt of the message to the broker after the message has been successfully process. However please check how the Broker does this as there is some evidence that doing this just blocks the broker - giving you more guaranteed message handling at a huge cost in scalability.
Another problem with this approach is that you need to handle the possibility of messages being handled more than once. The reason for this is that the thread that originally picked up the message may have processed it successfully but failed JUST before it was going to acknowledge the message. Thus hopefully your message processing is either idempotent or can handle duplicate processing of messages without too much trouble.
The other side of the coin for persistence is high availability. Although you want a broker storing data in a file system or to database for persistence, you need some number of other brokers (at least one) up and running and ready to take over. Ideally they are on a different VM, a different blade and perhaps in a different data center. None of which is easy.
Just a side note real quick - one thing I really liked about Amazon's Simple Queue Service (SQS) is how it handles persistence. Unlike JMS which has simple "Send Message" and "Read Message" semantics. SQS has "Send Message" and "Read Message" and "Delete Message". Unlike in JMS, reading a message in SQS does not remove the message from the queue. It "locks" it temporarily and if the message consumer does not call to SQS to "delete" the message after a certain time (30 seconds is the default), SQS itself puts the message back on the queue and assumes the original consumer failed to process it. So why not use SQS? Well frankly all my processing is local and I'd rather not take the Boston -> Virginia network hit (on the send and the receive). But it would make sense if my app was deployed entirely inside an EC2 instance.
Another problem: message ordering
Oh wait here's another Gem for you to wonder. Say your app has some basic CRUD features - Create, Read, Update, Delete. Now you start to put CRUD related activities on the queue. Queues do not implicitly guarantee ordering. So a thread may begin processing an Update event BEFORE the related Create event was finished. Awesome eh!!! So what do you do in this case?
Sometimes you can detect that failure (you can't update something that wasn't created) and just put the message back on the queue or retry a small number of times before failing. Others make the choice to have a single reader for these critical cases to ensure proper ordering.
The downside to THAT of course is
1) Scalability sucks
2) How are you going to handle Disaster recovery / failover.
For #2 I've seen folks configure 1 thread in N separate JVMs across a linux VM farm.
Only one thread is truly active at a time - the others are on "hot standby" using some form of ping / heartbeat functionality to detect if the "one true" thread dies and then some mechanism for become the active thread and telling all the other threads. MAN was that complicated and buggy code.
Oh just one more problem - queue backups!
Even after solving all these problems
- High Availability
- Messages handled more than once
- Messages arrive out of order.
You get the problem of what if your asynchronous task processing slows down and your thread pool becomes maxed out and your queue starts to back up? In one case we have to replicate data from the Europe to Asia. Although it's over a high speed network, the latency is at LEAST 250 ms. And it could be worse. You can't just add threads - you might max out your JVM's memory. I don't live in an EC2 world (yet) where I can just spin up new Linux VMs - and
even if I did the cost might be prohibitive at some point.
I could start to persist some of the queue to somewhere - disk / DB but again I'm going to have to write code to handle all of this plus testing is going to be a bear and the edge cases are killer.
And have a thread read from that storage area at a later point and retry.
Asynchronous - don't unless you have to!
So here's the problem with asynchronous solutions - they are powerful but introduce HUGE complexities in testing, scalability, disaster recovery, failover and high availability.
Even if you get the code right the first time - you better automate all your scenarios to ensure any future "fix" doesn't regress on your edge case handling.
In general my advice is avoid asynchronous solutions unless ABSOLUTELY necessary. And when you do watch for the "tail wagging the dog" - when you design and architect solutions to handle asynchronous processing cases that start to overwhelm your normal design/architecture. Beware introducing unnecessary complexity - but with asynchronous processing much of this complexity is sadly necessary.
If you are reading this and are aware of any good solutions to these asynchronous problems please let me know. Thanks!
Some good references I use when designing asynchronous solutions (to make sure I'm not reinventing the wheel)