OK, "scalable" is a pretty subjective word (and by scalable I guess I mean "scale up" not out) but say I have this incoming message stream (say JMS) of many tens of thousands of small messages every second (stock ticks for example) and have a single thread (for argument's sake) and just want to read as many of them as I can.
Which server platform in its default configuration will handle the highest number per second?
In my experience the Sun server (Solaris) platform is the most scalable, Linux (on x86) second and Windows (x86) third, but with the recent advances in hardware how is that holding up?
I don't trust any vendor benchmarks (something about foxes and hen houses) or reputedly independent analysts who just happened to get a lot of money from the winning vendor.
2) Which Database does Google use to store data?
If it's a commercial RDBMS then how can they cram so much data volume as a whole into it, plus it's coming in like a fire-hose. In my experience a modern RDBMS can handle at most thousands of transactions per second. But Google has to be many orders of magnitude beyond that. We can talk grids, sharding etc. up the ying-yang but at some point it all has to come back together in some data store (or does it?).
3) When will Sun go belly-up?
Don't get me wrong, Sun have done a lot of great things - Java, Solaris, J2EE (OK skip that one), but they've never really made any money off of Java (like, say, IBM did), and they still focus on hardware solutions over a decade after Lou Gerstner of IBM saw the money in services, never mind packaged software. I guess they're the Xerox PARC of the 1990s / early 2000s - lots of great seminal ideas and too far ahead of their time - but just can't make any real money.
4) Feature-for-feature how does Oracle / MS SQL Server / DB2 stack up against MySQL / Derby / PostgreSQL etc.?
5) How come no-one bought Sybase?
There are really only three big players left - Oracle, IBM and Microsoft. A few years ago IBM bought Informix leaving Sybase as the outlier? Sybase still has quite a lot of deployments especially in the finance industry. But almost every such company is switching away since Sybase really hasn't had a good regular release schedule for the past few years.
If someone would just come up with a great tool to automate conversion from Sybase-to-XYZ (with minimal hand-coding for extended SPs etc.) they'd make a mint!
6) What will replace the RDBMS?
I don't know about you but there seems to be something in the air: the object-relational impedance mismatch seems to be hitting a threshold. Is the end of the RDBMS in our future (say 20 years)? Will it be the ODBMS? I doubt it.
7) How come everyone talks about coordinating DB transactions - XA and 2PC and the like but I've *never* needed it in practice?
Maybe I just lack experience - but not ever in over a decade have I seen an unequivocal requirement to implement transaction coordination.
8) How come software vendors have such terrible software and no accountability?
I've been working with this JMS implementation for the past year (not IBM, nor Tibco - rhymes with "chronic achoo"). It crashes all the time and can't handle heavy load gracefully! I mean it's a *MESSAGING* server - it's *GOT* to be able to handle a heavy load.
9) When will I be able to do more than three things at once with Windows desktop?
Sometimes I click on a URL in Firefox, while waiting to open iTunes and while waiting on that I try to open my email and my whole Windows UI starts to freeze up - the mouse won't even move. OK Unix / Linux aren't very user friendly but the good old ampersand (&) for background tasks does fabulously well in the command line world keeping your interface quite responsive.
Now we're talking just three things at once - relative to the 1980s I've got a frigging supercomputer on my desk - Microsoft come on already! This has been true regardless of how much CPU / memory I have or which Windows version (currently XP Professional).
10) Why is Google apparently trying to buy its own spectrum?
Intriguing eh?
And two bonus questions . . .
11) What is the air-speed velocity of an unladen swallow?
"What? I don't know that! Auuuuuuuugh!". The answer is out there - actually it's here.
12) Why do we Programming / Computer Geeks love Monty Python?
Now that's one for the ages!
16 comments:
Frank - Maybe I can help with questions 2 and 6.
2) Google doesn't store all their data in a database. They store it in a specialized file system they developed themselves. It's well documented on the web, but you can find a short description here
6) What's going to replace databases for transactional and other (near) real-time uses are in memory data grids (IMDGs). GigaSpaces is one example, but there are others out there.
Regarding 11: An African or European swallow?
2) http://labs.google.com/papers/bigtable.html
5) Scriptella ETL started as an in-house project to migrate a database from Sybase to HSQLDB and then to H2.
Sun makes plenty of money on hardware and services (much like IBM, another large company that gives away a ton of valuable stuff away for free.) One of the purposes of Java is to keep Sun in the spotlight to sell said hardware and services and so far, it has worked beautifully.
MySQL and Postgres work in somewhat different spheres so I don't think either one is in danger of being "shaken out". MySQL is for relatively simple databases but is relatively fast. Postgres is for applications that need a database with all of the Oracle-style bells and whistles and then some. Overkill for most but very necessary for certain applications. I haven't used Derby yet, but I am intrigued. Like I said, Postgres compares very favorably, feature-wise, with Oracle and DB2 (even the "special" versions that are built for handling multimedia data). MySQL can certainly be used for richer applications but really shines when it comes to basic data types and speed. It most resembles a basic DB2.
1) For Java/J2EE what's the most scalable server hardware platform?
Well if you only use a single thread, it won't scale. At all. Throw more hardware at it, and performance won't improve.
7) How come everyone talks about coordinating DB transactions - XA and 2PC and the like but I've *never* needed it in practice?
In your JMS example: If you are for example writing the message to a DB: How do you guarantee that your message is processed once and only once?
With XA you get this for free, otherwise you have to hope the message has a unique ID. If not you are in trouble.
So if somebody can provide you with a bug-free high performance implementation of XA it might save you a lot of thinking and time.
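The unique-ID fallback mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not JMS or XA code: the class name `IdempotentConsumer`, the `handle` method and the in-memory `Set` (standing in for a unique-key table in the database) are all hypothetical. In a real system the ID record and the payload write would have to commit in the same local database transaction for the guarantee to hold.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: tolerate redelivered messages by remembering their unique IDs.
public class IdempotentConsumer {
    // Assumption: stands in for a table with a unique constraint on the message ID.
    private final Set<String> processedIds = new HashSet<>();
    // Assumption: stands in for the table the message payloads are written to.
    private final List<String> stored = new ArrayList<>();

    /** Returns true if the message was stored, false if it was a duplicate delivery. */
    public boolean handle(String messageId, String payload) {
        // Set.add returns false when the ID has already been seen.
        if (!processedIds.add(messageId)) {
            return false;
        }
        stored.add(payload);
        return true;
    }

    public int storedCount() {
        return stored.size();
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        System.out.println(consumer.handle("msg-1", "IBM 105.20")); // first delivery
        System.out.println(consumer.handle("msg-1", "IBM 105.20")); // redelivery after a crash
        System.out.println(consumer.storedCount());
    }
}
```

If the messages carry no unique ID, there is nothing to key the check on, which is the "in trouble" case described above.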
In practice most companies don't care about transactions. If it works most of the time, that's good enough. Why do you think most production Oracle DBs run in "Read Committed" transaction isolation mode which doesn't even give you real ACID transactions?
In regards to scaling with a single thread - I want to discount any OS / Multi-threading unpredictability where possible to focus on the hardware.
Server hardware *HAS* to matter - OK after a certain point you might max out your NIC card but you can't tell me a 1 GHz P4 vs. a 2.4 GHz P4 isn't going to have a different throughput rate with one thread?
As for Sun making lots of money on hardware - Sun made a loss in 2006
http://finance.google.com/finance?q=SUNW
and barely made a profit ($67 million) in the Quarter ending April '07. Compare with IBM whose net income in that quarter was $1.8 Billion.
If scraping a profit is "working beautifully" for Sun I'd hate to see it start to lag.
-Frank
1) Scaling has a lot more to do with multi-threading, properly partitioning data, caching and careful data modeling. If you are really going to scale you're going to need lots of machines and being able to evenly spread the workload is more of an application layer problem than a hardware problem.
To "The narrator"
The question was about hardware - which hardware is best in terms of (my limited-for-the-sake of argument) definition of scalability.
Threads, data partitioning, I/O, disk access etc. are software / application concerns. I can influence those - I just want to know what's the best hardware platform for my needs.
Also, not all problems lend themselves well to scale out. Database servers for example are traditionally "heavy iron" machines. Today, yes with sharding , grid computing etc. things have changed a bit . . .
But even if I did want to scale out I still want the best bang for my buck - the issue still remains how can I find that out?
The answer I believe is you can't find that out without doing your own testing.
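A minimal sketch of what such a home-grown test might look like, assuming synthetic in-memory messages rather than a real JMS feed: it times one thread building and "reading" small tick-like messages and reports messages per second. The class and method names are hypothetical, and a serious comparison would need a real message source, repeated runs and a proper benchmarking harness.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: rough single-thread throughput measurement with synthetic messages.
public class SingleThreadThroughput {
    // Written after each run so the JIT cannot discard the work as dead code.
    static volatile long sink;

    /** Builds and "reads" count small messages on the calling thread; returns the total bytes seen. */
    public static long consume(int count) {
        long totalBytes = 0;
        for (int i = 0; i < count; i++) {
            byte[] message = ("TICK:" + i).getBytes(StandardCharsets.US_ASCII);
            totalBytes += message.length;
        }
        return totalBytes;
    }

    /** Times one run of consume(count) and converts it to messages per second. */
    public static double messagesPerSecond(int count) {
        long start = System.nanoTime();
        sink = consume(count);
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return count * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        consume(200_000); // warm-up so the JIT has compiled the loop before timing
        System.out.printf("%.0f messages/sec%n", messagesPerSecond(1_000_000));
    }
}
```

Running the same class unchanged on each candidate machine gives at least a like-for-like number for the single-thread case.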
-Frank
I agree with geva, in memory clustered solutions are what will save developers from the db bloat (both in size and load) of application state living in database land and the impedance mismatch that comes from trying to implement such a solution.
I will disagree that Gigaspaces is the right solution, but that is my right, and quite obviously my bias, because I work for Terracotta.
The problem I think with Gigaspaces, or any clustered cache for that matter, is that it assumes that a clustering problem is a cache problem. Well that comes from the fact that the state-of-the art design paradigm in today's world is that all of the application data must live in the database, so the only natural way to speed that up is to put it in cache. And the only way to make that work in scale-out mode is to have it be a clustered-cache. For that model, Gigaspaces may be fine, but you've only played musical chairs with the problem.
What happens is that when you try to apply that model to true object-oriented paradigms, you find that you have to insert the cache paradigm into every object, which is very nearly the same impedance mismatch that existed when trying to shove object graphs into and out of the relational world (euphemistically referred to as "marshalling and unmarshalling")
You're missing the big picture on scalability. Sure there's a difference between a 1GHz P3 and a 2.4 GHz P4, but it's what, a 4x difference? There's really no way to get more than an order of magnitude scaling factor in single threaded mode, and that's only if you're comparing to really old machines. Intel processors have been stuck in the 2-3 GHz range for YEARS now, but what they're doing is adding more cores to the chips and making them hyperthreaded... That's not increasing your straightline single-threaded speed, but you can now have a lot more threads operating at once.
In your last comment, you ask where you can get the most bang for your buck... that's a whole different question, but is much easier to answer. The market has shown that x86 processors are the best value hands down. Whether it's AMD or Intel almost doesn't matter, but the x86 processor with a free operating system like Linux rules the roost for low-cost computing power.
@taylor: Well, I don't work for any of the vendors, but you seem to have a very misinformed view of the grid players out there. Gigaspaces and Oracle Coherence go way beyond just implementing a distributed cache. You can submit work to them to be transported across the grid and executed at each node with the data that's stored locally on that node. It basically handles work partitioning for you, and makes sure you're not shipping data around the network as well. Terracotta's distributed master / worker examples are very weak by comparison, so if I were you I would just not bring it up.
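The "ship the work to the data" idea described above can be illustrated with a toy, single-JVM sketch. This is not the GigaSpaces or Coherence API; `PartitionedGrid`, `nodeFor` and `execute` are hypothetical names, and each "node" is simply a single-threaded executor that owns one partition of the data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch: route each task to the worker that owns the key's partition.
public class PartitionedGrid {
    private final List<Map<String, Double>> partitions = new ArrayList<>();
    private final List<ExecutorService> workers = new ArrayList<>();

    public PartitionedGrid(int nodes) {
        for (int i = 0; i < nodes; i++) {
            partitions.add(new HashMap<>());
            // One thread per partition, so each map is only ever touched by its owner.
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    /** Maps a key to the index of the node that owns it. */
    public int nodeFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Runs the given work on the owning node, against that node's local data. */
    public <R> R execute(String key, Function<Map<String, Double>, R> work) throws Exception {
        int node = nodeFor(key);
        Callable<R> task = () -> work.apply(partitions.get(node));
        return workers.get(node).submit(task).get();
    }

    public void put(String key, double value) throws Exception {
        execute(key, data -> data.put(key, value));
    }

    public void shutdown() {
        workers.forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) throws Exception {
        PartitionedGrid grid = new PartitionedGrid(4);
        grid.put("IBM", 105.20);
        grid.put("SUNW", 5.10);
        // The computation travels to the partition; only the result comes back.
        Double doubled = grid.execute("IBM", data -> data.get("IBM") * 2);
        System.out.println(doubled);
        grid.shutdown();
    }
}
```

In a real grid the partitions would live on separate machines and the task would be serialized across the network, but the routing decision is the same: hash the key, find the owner, run the work there.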
Jason - point well made on the dual / multi-cores - the GHz hasn't gone up just the capability to multi-thread (ideally on Linux/Unix)
-Frank
The Database vendors have a clause that forbids publishing any comparison result without their prior approval.
About three-four years ago one chap published a Postgres-Oracle comparison (a quite serious one) beating the crap out of them, and three days later lawyers removed the article.
No surprise there is only hearsay on this subject.
@jason:
Are you saying that because you've used those products, or just read the marketecture?
Just curious because when it comes to marketing, I don't disagree with you - they definitely market grid better than we do, but then again, we're not trying to be a grid vendor.
Q: "which hardware is best in terms of (my limited-for-the-sake of argument) definition of scalability."
A: a Mainframe! check out Linux for Z on a zSeries machine!
=)
Google File System is famous, but for many applications Google uses MySQL. If you look around you will find many monitoring tools, patches to make MySQL servers work as a grid, talks, and so on. I'm sure they use it for AdWords, and probably for many other things.