This blog post highlights a software company and technology that I view as potentially useful to organizations investing in business intelligence (BI) and analytics in the next few years. Note that, in my opinion, this company and solution are not yet typically “top of mind” when we talk about BI today.
The Importance of the DataRush Software Technology to BI
The basic idea of DataRush, as I understand it, is to superimpose a “parallel dataflow” model on top of typical data management code, in order to improve the performance (and therefore scalability) of the data-processing operations used by typical large-scale applications. Right now, your processing in general and your BI querying in particular are typically done in one of two ways. The first is “query optimization” within a “database engine” that takes one stream of “basic” instructions and parallelizes it by figuring out (more or less) how to run each step in parallel on separate chunks of data. The second is programmer code that attempts a wide array of strategies for speeding things up further, ranging from “delayed consistency” (in cases where lots of updates are also happening) to optimization for the special case of unstructured data (e.g., files consisting of videos or pictures). “Parallel dataflow” instead requires that particular types of queries and updates be separated into multiple streams depending on the type of operation. This is done up front: a programmer specifies a dataflow “model” that applies across all applications with the same types of operation.
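To make the idea of separate streams a little more concrete, here is a minimal sketch of a dataflow pipeline in Python. It is not DataRush's API, which I have not reproduced here; the operator names (`source`, `keep_if`, `transform`, `run_pipeline`) and the `amount` field are hypothetical. Each operator runs in its own thread and passes records downstream through a queue, so the stages can overlap in time.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of a stream


def source(rows, out_q):
    """Emit every input row onto the first stream."""
    for row in rows:
        out_q.put(row)
    out_q.put(_DONE)


def keep_if(pred, in_q, out_q):
    """Pass along only the rows for which pred(row) is true."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        if pred(item):
            out_q.put(item)


def transform(fn, in_q, out_q):
    """Apply fn to every row and emit the result."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        out_q.put(fn(item))


def run_pipeline(rows):
    """Wire source -> filter -> transform, then sum the final stream."""
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=source, args=(rows, q1)),
        threading.Thread(target=keep_if, args=(lambda r: r["amount"] > 0, q1, q2)),
        threading.Thread(target=transform, args=(lambda r: r["amount"] * 2, q2, q3)),
    ]
    for stage in stages:
        stage.start()
    total = 0
    while True:
        item = q3.get()
        if item is _DONE:
            break
        total += item
    for stage in stages:
        stage.join()
    return total
```

The point of the sketch is the shape, not the speed: the graph of operators is declared once, up front, and the same graph could be reused by any application with the same types of operation. In CPython, threads illustrate the structure but will not by themselves deliver multicore speedups for CPU-bound work.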
There is good reason to believe, as I do, that this approach can yield major, ongoing performance improvements in a wide variety of BI areas. In the first place, the approach should deliver performance improvements over and beyond existing engines and special-case solutions, and not force you into supporting yet another alternate technology path. The idea of dataflow is not new, but for various historical reasons this variant has not been the primary focus of today’s database engines, and so the job of retrofitting to support “parallel dataflow” is nowhere near completion in most database engines. That means that, potentially, using “parallel dataflow” on top of these engines can squeeze out additional parallelism, due to the increased number and sophistication of the streams, especially on massively parallel architectures such as today’s multicore-chip server farms.
At the same time, the increasing importance of unstructured and semi-structured data has created something of a “green field” in processing this data, especially in areas such as health care’s handling of CAT scans, vendors streaming video over the Web, and everyone querying social-media Big Data. Where existing data-processing techniques are not set in concrete, “parallel dataflow” is very likely to yield outsized performance gains when applied, because it operates at a greater level of abstraction than most database engines and special-case file handlers like Hadoop/MapReduce, and so can be customized more effectively to new data transaction mixes and data types.
There is always a caveat in dealing with “new” software technologies that are really an evolution of techniques whose time has come. In this case, the caveat concerns the fact that, as noted, programmers or system designers need to specify the dataflows, rather than the database engine, and this dataflow “model” is not a general case for all data processing. That, in turn, means that at least some programmers need to understand dataflows on an ongoing basis.
It is my guess that this is a task that users of “parallel dataflow” and DataRush should embrace. There is a direct analogy here between agile development and DataRush-based development. The usefulness of agile development lies not only in the immediate speedup of application development, but also in the way that agile development methodologies embed end-user knowledge in the development organization, with all sorts of positive follow-on effects on the organization as a whole. In the same way, setting up dataflows for a particular application leads typically to a new way of thinking about applications as dataflows, and that improves the quality and often the performance of every application that the organization handles, whether it is optimizable by “parallel dataflow” or not.
In other words, in my opinion, developers’ knowledge of data-driven programming is increasingly inadequate in many cases. Automating this programming in the database engine and user interface can only do so much to make up for the lack. It is more than worth the pain of additional ongoing dataflow programming to reintroduce the skill of programming based on a data “model” to today’s generation of developers.
The Relevance of Pervasive Software to BI
Let me state my conclusion up front: I view investment in Pervasive Software’s DataRush technology as every bit as safe as investment in an IBM or Oracle product. Why do I say this?
Let’s start with Pervasive Software’s “DNA.” Originally, more than 15 years ago, I ran across Pervasive Software as a spin-off of Novell’s Windows database of the 1980s. Over time, as databases almost always do, the solution that has become Pervasive PSQL has provided a stable source of ongoing revenue. More importantly, it has centered Pervasive Software from the very start in Windows, PC-server, and distributed database technologies servicing the SMB/large-enterprise-department market. In other words, Pervasive has demonstrated over 15 years of ups and downs that it is nowhere near failure, and that it knows the world even of the Windows/PC-server side of the Global 10,000 quite well.
At the same time, having followed the SMB/departmental market (and especially the database side) for more than 15 years, I am struck by the degree to which, now, software technologies move “bottom-up” from that market to the large enterprise market. Software as a Service, the cloud, and now some of the latest capabilities in self-service and agile BI are all taking their cue from SMB-style operations and technologies. Thus, in the Big Data market in particular and in data management in general, Pervasive is one leading-edge vendor well in tune with an overall movement of SMB-style open-source and other solutions centered around the cloud and Web data. I therefore see the risks of Pervasive Software DataRush vendor lock-in and technology irrelevance over the next few years as minimal. And, of course, participation in the cloud open-source “movement” means crowd-sourced support as effective as IT’s existing open-source software product support.
Aren’t there any risks? Well, yes, in my opinion, there are the product risks of any technology, i.e., that technology will evolve to the point where “parallel dataflow” or its equivalent is better integrated into another company’s product. However, if that happens, dollars to doughnuts there will be a straightforward path from a DataRush dataflow model to that product’s data-processing engine – because the open-source market, at the very least, will provide it.
Potential Uses of DataRush for IT
The obvious immediate uses of DataRush in IT are, as Pervasive Software has pointed out, in Big Data querying and pharmaceutical-company grid searches. In the case of Big Data, DataRush front-ending Hadoop for both public and hybrid clouds is an interesting way to both reduce the number of instances of “eventual consistency” turning into “never consistent” and to increase the depth of analytics by allowing a greater amount of Big Data to be processed in a given length of time, either on-site at the social-media sites or in-house as part of handling the “fire hose” of just-arrived Big Data from the public cloud.
However, I don’t view these as the most important use cases for IT to keep an eye on. Ideally, IT could infuse the entire Windows/PC-server part of its enterprise architecture with “parallel dataflow” smarts, for a semi-automatic ongoing data-processing performance boost. Failing that, IT should target the Windows/small-server information handling in which increased depth of analytics of near-real-time data is of most importance – e.g., agile BI in general.
These suggestions come with the usual caveats. This technology is more likely than most to require initial experimentation by internal R&D types, and some programmer training, as well. Finding the initial project with the best immediate value-add is probably not going to be as straightforward as in some other cases, as the exact performance benefit of this technology for any kind of database architecture is apparently not yet fully predictable. Effectively, these caveats say: if you don’t have the IT depth or spare cash to experiment, just point the technology at a nagging BI problem and odds are very good that it’ll pay off – but it may not be a home run the first time out.
The Bottom Line for IT Buyers
Really, Pervasive DataRush is one among several performance-enhancing approaches that offer potential additional analytical power in the next few years, and so if IT passes this one up and opts for another, they may well keep pace with the majority of their peers. However, in an environment that most CEOs seem to agree is unusually uncertain, out-performing the majority, and the extreme IT smarts needed to do so, are becoming necessary more often. At the least, therefore, IT buyers in medium-sized and large organizations should keep Pervasive DataRush ready to insert in appropriate short lists over the next two years. Preferably, they should also start the due diligence now.
The key to getting the maximum out of DataRush, I think, will be to do some hard thinking about how one’s BI and data-processing applications “group” into dataflow types. Pervasive Software, I am sure, can help, but you also need to customize for the particular characteristics of your industry and business. Doing that near the beginning will make extension of DataRush’s performance benefits to all kinds of existing applications far quicker, and thus will deliver far wider-spread analytical depth to your BI.
How will a solution like DataRush impact the organization’s bottom line? The same as any increase in the depth of real-time analysis – and right now that means that, over time, it will improve the bottom line substantially. For that reason, at the very least, Pervasive Software’s DataRush is an Other BI solution that is worth the IT buyer’s attention.
One of the more interesting features of vendors’ recent marketing push to sell BI and analytics is the emphasis on the notion of Big Data, often associated with NoSQL, Google MapReduce, and Apache Hadoop – without a clear explanation of what these are, and where they are useful. It is as if we were back in the days of “checklist marketing”, where the aim of a vendor like IBM or Oracle was to convince you that if competitors’ products didn’t support a long list of features, those competitors would not provide you with the cradle-to-grave support you needed to survive computing’s fast-moving technology. As it turned out, many of those features were unnecessary in the short run, and a waste of money in the long run; remember rules-based AI? Or so-called standard UNIX? The technology in those features was later to be used quite effectively in other, more valuable pieces of software, but the value-add of the feature itself turned out to be illusory.
As it turns out, we are not back in those days, and Big Data via Hadoop and NoSQL does indeed have a part to play in scaling Web data. However, I find that IT buyer misunderstandings of these concepts may indeed lead to much wasted money, not to mention serious downtime. These misunderstandings stem from a common source: marketing’s failure to explain how Big Data relates to the relational databases that have fueled almost all data analysis and data-management scaling for the last 25 years. It resembles the scene in The Wizard of Oz where a small man, trying to sell himself as a powerful wizard by manipulating stage machines from behind a curtain, becomes so wrapped up in the production that when someone notes “There’s a man behind the curtain” the man shouts “Pay no attention to the man behind the curtain!” In this case, marketers are shouting so loudly about the virtues of Big Data related to new data management tools and “NoSQL” that they fail to note the extent to which relational technology is complementary to, necessary to, or simply the basis of, the new features.
So here is my understanding of the present state of the art in Big Data, and the ways in which IT buyers should and should not seek to use it as an extension of their present (relational) BI and information management capabilities. As it turns out, when we understand both the relational technology behind the curtain and the ways it has been extended, we can do a much better job of applying Big Data to long-term IT tasks.
NoSQL or NoREL?
The best way to understand the place of Hadoop in the computing universe is to view the history of data processing as a constant battle between parallelism and concurrency. Think of the database as a data store plus a protective layer of software that is constantly being bombarded by transactions – and often, another transaction on a piece of data arrives before the first is finished. To handle all the transactions, databases have two choices at each stage in computation: parallelism, in which two transactions are literally being processed at the same time, and concurrency, in which a processor switches between the two rapidly in the middle of the transaction. Pure parallelism is obviously faster; but to avoid inconsistencies in the results of the transaction, you often need coordinating software, and that coordinating software is hard to operate in parallel, because it involves frequent communication between the parallel “threads” of the two transactions.
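As a small illustration of why that coordinating software matters, here is a sketch in Python in which several “transactions” update the same record. The `Account` class and `run_transactions` helper are hypothetical names of my own, not part of any database product; the lock plays the role of the coordinating layer, forcing the read-modify-write steps to happen one at a time.

```python
import threading


class Account:
    """A single record protected by a lock (the coordinating software)."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def deposit(self, amount):
        # Without the lock, two threads could both read the same old
        # balance and one of the two updates would be lost.
        with self._lock:
            current = self.balance
            self.balance = current + amount


def run_transactions(account, n_threads, deposits_per_thread):
    """Fire n_threads workers, each making the given number of deposits of 1."""

    def worker():
        for _ in range(deposits_per_thread):
            account.deposit(1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance
```

The result is consistent, but only because every deposit waits its turn at the lock. That waiting is exactly the cost described above: the more often parallel threads must coordinate, the less parallelism is left.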
At a global level (like that of the Internet) the choice now translates into a choice between “distributed” and “scale-up” single-system processing. As it happens, back in graduate school I did a calculation of the relative performance merits of tree networks of microcomputers versus machines with a fixed number of parallel processors, which provides some general rules. There are two key factors that are relevant here: “data locality” and “number of connections used” – which means that you can get away with parallelism if, say, you can operate on a small chunk of the overall data store on each node, and if you don’t have to coordinate too many nodes at one time.
Enter the problems of cost and scalability. The server farms that grew like Topsy during Web 1.0 had hundreds and thousands of PC-like servers that were set up to handle transactions in parallel. This had obvious cost advantages, since PCs were far cheaper; but data locality was a problem in trying to scale, since even when data was partitioned correctly in the beginning between clusters of PCs, over time data copies and data links proliferated, requiring more and more coordination. Meanwhile, in the High Performance Computing (HPC) area, grids of PC-type small machines operating in parallel found that scaling required all sorts of caching and coordination “tricks”, even when, by choosing the transaction type carefully, the user could minimize the need for coordination.
For certain problems, however, relational databases designed for “scale-up” systems and structured data did even less well. For indexing and serving massive amounts of “rich-text” (text plus graphics, audio, and video) data like Facebook pages, for streaming media, and of course for HPC, a relational database would insist on careful consistency between data copies in a distributed configuration, and so could not squeeze the last ounce of parallelism out of these transaction streams. And so, to squeeze costs to a minimum, and to maximize the parallelism of these types of transactions, Google, the open source movement, and various others turned to MapReduce, Hadoop, and various other non-relational approaches.
These efforts combined open-source software, typically related to Apache, large amounts of small or PC-type servers, and a loosening of consistency constraints on the distributed transactions – an approach called eventual consistency. The basic idea was to minimize coordination by identifying types of transactions where it didn’t matter if some users got “old” rather than the latest data, or it didn’t matter if some users got an answer but others didn’t. As a communication from Pervasive Software about an upcoming conference shows, a study of one implementation finds 60 instances of unexpected unavailability “interruptions” in 500 days – certainly not up to the standards of the typical business-critical operational database, but also not an overriding concern to users.
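A toy model may help show what “old rather than the latest data” means in practice. The sketch below is my own simplification in Python, not how Hadoop or any particular NoSQL store is implemented; the class name `EventuallyConsistentStore` and its methods are hypothetical. A write lands on one replica immediately and reaches the others only when `propagate` runs later.

```python
class EventuallyConsistentStore:
    """Toy replicated key-value store with lazy propagation of writes."""

    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # queued (replica_index, key, value) updates

    def write(self, key, value):
        # The write is acknowledged as soon as replica 0 has it.
        self.replicas[0][key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica):
        # A reader may hit any replica, including a stale one.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Background synchronization: apply queued updates in order.
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending.clear()
```

A short usage example, under the same assumptions:

```python
store = EventuallyConsistentStore(n_replicas=2)
store.write("page_views", 10)
store.read("page_views", replica=0)  # the new value
store.read("page_views", replica=1)  # None: this replica has not caught up
store.propagate()
store.read("page_views", replica=1)  # now the new value
```

No coordination is needed at write time, which is where the parallelism comes from; the price is the window in which different users see different answers.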
The eventual consistency part of this overall effort has sometimes been called NoSQL. However, Wikipedia notes that in fact it might correctly be called NoREL, meaning “for situations where relational is not appropriate.” In other words, Hadoop and the like by no means exclude all relational technology, and many of them concede that relational “scale-up” databases are more appropriate in some cases even within the broad overall category of Big Data (i.e., rich-text Web data and HPC data). And, indeed, some implementations provide extended-SQL or SQL-like interfaces to these non-relational databases.
Where Are the Boundaries?
The most popular “spearhead” of Big Data, right now, appears to be Hadoop. It pairs an open-source implementation of MapReduce with a distributed file system for data-intensive applications (Hadoop Common supplies the shared utilities, the Hadoop Distributed File System [HDFS] spreads file data across a cluster of machines, and the MapReduce layer divides nodes into a master coordinator and worker task executors), and therefore allows parallel scaling of transactions against rich-text data such as some social-media data. It operates by dividing a “task” into “sub-tasks” that it hands out redundantly to back-end servers, which all operate in parallel (conceptually, at least) on a common data store.
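The task and sub-task pattern can be sketched in a few lines of Python. This is a single-machine illustration of the MapReduce idea, not Hadoop's actual API; the function names (`map_chunk`, `reduce_counts`, `word_count`) are hypothetical, and a thread pool stands in for the back-end servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Sub-task: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Recombine the partial results from all sub-tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, n_chunks=4):
    """Split the task into n_chunks sub-tasks, run them, and merge."""
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)
```

Each sub-task touches only its own chunk, so the map step needs no coordination at all; the only shared step is the final merge. A real cluster adds the parts this sketch leaves out, such as moving data between machines and re-running sub-tasks that fail.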
As it turns out, there are also limits even on Hadoop’s eventual-consistency type of parallelism. In particular, it now appears that the metadata that supports recombination of the results of “sub-tasks” must itself be “federated” across multiple nodes, for both availability and scalability purposes. And Pervasive Software notes that its own investigations show that using multiple-core “scale-up” nodes for the sub-tasks improves performance compared to proliferating yet more distributed single-processor PC servers. In other words, the most scalable system, even in Big Data territory, is one that combines strict and eventual consistency, parallelism and concurrency, distributed and scale-up single-system architectures, and NoSQL and relational technology.
Solutions like Hadoop are effectively out there “in the cloud” and therefore outside the enterprise’s data centers. Thus, there are fixed and probably permanent physical and organizational boundaries between IT’s data stores and those serviced by Hadoop. Moreover, it should be apparent from the above that existing BI and analytics systems will not suddenly convert to Hadoop files and access mechanisms, nor will “mini-Hadoops” suddenly spring up inside the corporate firewall and create havoc with enterprise data governance. The use cases are too different.
The remaining boundaries – the ones that should matter to IT buyers – are those between existing relational BI and analytics databases and data stores and Hadoop’s file system and files. And here is where “eventual consistency” really matters. The enterprise cannot treat this data as just another BI data source. It differs fundamentally in that the enterprise can be far less sure that the data is up to date – or even available at all times. So scheduled reporting or business-critical computing based on this data is much more difficult to pull off.
On the other hand, this is data that would otherwise be unavailable – and because of the low-cost approach to building the solution, should be exceptionally low-cost to access. However, pointing the raw data at existing BI tools is like pointing a fire hose at your mouth. The savvy IT organization needs to have plans in place to filter the data before it begins to access it.
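What might such a filter look like? Here is a minimal sketch in Python, assuming each incoming record is a dictionary with hypothetical `topic` and `followers` fields; the function names and thresholds are mine, chosen purely for illustration. The filter is a generator, so records are examined one at a time and the full fire hose never has to sit in memory.

```python
from collections import defaultdict


def filter_records(records, wanted_topics, min_followers):
    """Yield only the records worth handing to the BI tool."""
    for rec in records:
        on_topic = rec.get("topic") in wanted_topics
        influential = rec.get("followers", 0) >= min_followers
        if on_topic and influential:
            yield rec


def summarize(records):
    """Reduce the filtered stream to per-topic counts for loading."""
    counts = defaultdict(int)
    for rec in records:
        counts[rec["topic"]] += 1
    return dict(counts)
```

In this sketch the BI tool would receive only the small summary, not the raw stream. Which fields to filter on is a business decision, and that decision is the planning the paragraph above calls for.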
The Long-Run Bottom Line
The impression given by marketers is that Hadoop and its ilk are required for Big Data, where Big Data is more broadly defined as most Web-based semi-structured and unstructured data. If that is your impression, I believe it to be untrue. Instead, handling Big Data is likely to require a careful mix of relational and non-relational, data-center and extra-enterprise BI, with relational in-enterprise BI taking the lead role. And as the limits to parallel scalability of Hadoop and the like become more evident, the use of SQL-like interfaces and relational databases within Big Data use cases will become more frequent, not less.
Therefore, I believe that Hadoop and its brand of Big Data will always remain a useful but not business-critical adjunct to an overall BI and information management strategy. Instead, users should anticipate that it will take its place alongside relational access to other types of Big Data, and that the key to IT success in Big Data BI will be in intermixing the two in the proper proportions, and with the proper security mechanisms. Hadoop, MapReduce, NoSQL, and Big Data, they’re all useful – but only if you pay attention to the relational technology behind the curtain.
On Monday, Pentaho, an open source BI vendor, announced Pentaho BI 4.0, the new release of its “agile BI” tool. To understand the power and usefulness of Pentaho, you should understand the fundamental ways in which the markets that we loosely call SMB have changed over the last 10 years.
First, a review. Until the early 1990s, it was a truism that computer companies in the long run would need to sell to central IT at large enterprises, eventually – else the urge of CIOs to standardize on one software and hardware vendor would favor larger players with existing toeholds in central IT. This was particularly true in databases, where Oracle sought to recreate the “nobody ever got fired for buying IBM” hardware mentality of the 1970s in software stacks. It was not until the mid-1990s that companies such as Progress Software and Sybase (with its iAnywhere line) showed that databases delivering near-lights-out administration could survive the Oracle onslaught. Moreover, companies like Microsoft showed that software aimed at the SMB could over time accumulate and force its way into central IT – not only Windows, Word, and Excel, but also SQL Server.
As companies such as IBM discovered with the bursting of the Internet bubble, this “SMB” market was surprisingly large. Even better, it was counter-cyclical: when large enterprises whose IT was a major part of corporate spend cut IT budgets dramatically, SMBs kept right on paying the yearly license fees for the apps on which they ran, apps that in turn hid the brand of the underlying database or app server. Above all, it was not driven by brand or standards-based spending, nor even solely by economies of scale in cost.
In fact, the SMB buyer was and is distinctly and permanently different from the large-enterprise IT buyer. Concern for costs may be heightened, yes; but so is the need for simplified user interfaces and administration that a non-techie can handle. A database like Pervasive could be run by the executive at a car dealership, who would simply press a button to run backup on his or her way out on the weekend, or not even that. The ability to fine-tune for maximum performance is far less important than the avoidance of constant parameter tuning. The ability to cut hardware costs by placing apps in a central location matters much less than having desktop storage to work on when the server goes down.
But in the early 2000s, just as larger vendors were beginning to wake up to the potential of this SMB market, a new breed of SMB emerged. This Web-focused SMB was and is tech-savvy, because using the Web more effectively is how it makes its money. Therefore, the old approach of Microsoft and Sybase when they were wannabes – provide crude APIs and let the customer do the rest – was exactly what this SMB wanted. And, again, this SMB was not just the smaller-sized firm, but also the skunk works and innovation center of the larger enterprise.
It is this new type of SMB that is the sweet spot of open source software in general, and open source BI in particular. Open source has created a massive “movement” of external programmers who have moved steadily up the software stack from Linux to BI, and in the process created new kludges that turn out to be surprisingly scalable: MapReduce, Hadoop, NoSQL, and Pentaho being only the latest examples. The new SMB is a heavy user of open source software in general, because the new open source software costs nothing, fits the skills and Web needs of the SMB, and allows immediate implementation of crude solutions plus scalability supplied by the evolution of the software itself. Within a very few years, many users, rightly or wrongly, were swearing that MySQL was outscaling Oracle.
Translating Pentaho BI 4.0
The new features in Pentaho BI can be simply put, because the details simply show that they deliver what they promise:
- Simple, powerful interactive reporting – which apparently tends to be used more for ad-hoc reporting than for traditional enterprise reporting, but can do either;
- A more “usable” and customizable user interface with the usual Web “sizzle”;
- Data discovery “exploration” enhancements such as new charts for better data visualization.
These sit atop a BI tool that distinguishes itself by “data integration” that handles an exceptional number of input data warehouses and data stores for inhaling to a temporary “data mart” for each use case.
With these features, Pentaho BI, I believe, is valuable especially to the new type of SMB. For the content-free buzzword “agile BI”, read “it lets your techies attach quickly to your existing databases as well as Big Data out there on the Web, and then makes it easy for you to figure out how to dig deeper as a technically-minded user who is not a data-mining expert.” Above all, Pentaho has the usual open source model, so it’s making its money by services and support – allowing the new SMB to decide exactly how much to spend. Note also Pentaho’s alliance not merely with the usual cloud open source suspects like Red Hat but also with database vendors with strong BI-performance technology such as Vertica.
The BI Bottom Line
No BI vendor is guaranteed a leadership position in cloud BI these days – the field is moving that fast. However, Pentaho is clearly well suited to the new SMB, and also understands the importance of user interfaces, simplicity for the administrator, ad hoc querying and reporting, and rapid implementation to both new and old SMBs.
Pentaho therefore deserves a closer look by new-SMB IT buyers, either as a cloud supplement to existing BI or as the core of low-cost, fast-growing Web-focused BI. And, remember, these have their counterparts in large enterprises – so those should take a look as well. Sooner than I expected, open source BI is proving its worth.
And so, another founder of the computing industry as we know it today officially bites the dust. A few days ago, Attachmate announced that it was acquiring Novell – and the biggest of the PC LAN companies will be no more.
I have more fond memories of Novell than I do of Progress, Sun, or any of the other companies that have seen their luster fade over the last decade. Maybe it was the original facility in Provo, with its Star Trek curving corridors and Moby Jack as haute cuisine, just down the mountain from Robert Redford’s Sundance. Maybe it was the way that when they sent us a copy of NetWare, they wrapped it in peanuts instead of bubble wrap, giving us a great wiring party. Or maybe it was Ray Noorda himself, with his nicknames (Pearly Gates and Ballmer the Embalmer) and his insights (I give him credit for the notion of coopetition).
But if Novell were just quirky memories, it wouldn’t be worth the reminiscence. I firmly believe that Novell, more than any other company, ushered out the era of IBM and the Seven Dwarves, and ushered in the world of the PC and the Internet.
Everyone has his or her version of those days. I was at Prime at the time, and there was a hot competition going on between IBM at the high end and DEC, Wang, Data General, and Prime at the “low end”. Even with the advent of the PC, it looked as if IBM or DEC might dominate the new form factor; Compaq was not a real competitor until the late 1980s.
And then along came the PC LAN companies: 3Com, Banyan, Novell. While IBM and the rest focused on high-end sales, and Sun and Apollo locked up workstations, the minicomputer makers’ low ends were being stealthily undercut by PC LANs, and especially by the likes of Novell. The logic was simple: the local dentist, realtor, or retailer bought a PC for personal use, brought it to the business, and then realized that it was child’s play – and less than $1K – to buy LAN software to hook the PCs in the office together. It meant incredibly cheap scalability, and when I was at Prime it gutted the low end of our business, squeezing the mini makers from above (IBM) and below (Novell).
There was never a time when Novell could breathe easily. At first, there were Banyan and 3Com; later, the mini makers tried their hand at PC LANs; then came the Microsoft partnership with IBM to push OS/2 LAN Manager; and at last, in the early 1990s, Microsoft took dead aim at Novell and finally managed to knock them off their perch. However, until the end, NetWare had two simple ideas to differentiate it, well executed by the “Magic 8” (the programmers doing fundamental NetWare design, including above all Drew Major): the idea that to every client PC, the NetWare file system should look like just another drive, and the idea that frequently accessed files should be stored in main memory on the server PC, so that, as Novell boasted, you could get a file faster from NetWare than you could from your own PC’s hard drive.
Until the mid 1990s, analysts embarrassed themselves by predicting rapid loss of market share to the latest competitor. Every year, surveys showed that purchasing managers were planning to replace their NetWare with LAN Server, with LAN Manager, with VINES; and at the end of the year, the surveys would show that NetWare had increased its hold, with market share in the high 70s. Why? Because what drove the market was purchases below the purchasing manager’s radar screen (less than the $10K that departments were supposed to report upstairs). One DEC employee told me an illustrative story: while DEC was trying to mandate in-house purchase of its PC LAN software, the techies at DEC were expanding their use of NetWare by leaps and bounds, avoiding official notice by “tunneling” NetWare communications as part of the regular DEC network. The powers that be finally noticed what was going on because the tunneled communications became the bulk of all communications across the DEC network.
In the early 1990s, Microsoft finally figured out what to do about this. Shortly after casting off OS/2 and LAN Manager, Microsoft developed its own, even more basic, PC LAN software that at first simply allowed sharing across a couple of “peer” PCs. Using this as a beachhead, Microsoft steadily developed Windows’ LAN capabilities, entirely wrapped in the Windows PC OS, so that it cost practically nothing to buy both the PC and the LAN. This placed Novell in an untenable position, because what was now driving the market was applications developed on top of the PC and LAN OS, and NetWare had never paid sufficient attention to LAN application development; it was easy for Microsoft to turn Windows apps into Windows plus LAN apps, while it was very hard for Novell to do so.
Nevertheless, Novell’s core market made do with third-party Windows apps that could also run on NetWare, until the final phase of the tragedy: Windows 2000. You see, PC LANs always had the limitation that they were local. The only way that PC LAN OSs could overcome the limitations of geography was to provide real-time updates to resource and user data stored in multiple, geographically separate “directories”: in effect, to carry out scalable multi-copy updates on data. Banyan had a pretty good solution for this, but Microsoft created an even better one in Windows 2000, well before Novell’s solution; and after that, as the world shifted its attention to the Internet, Novell was not even near anyone’s short list for distributed computing.
Over the last decade, Novell has not lacked good solutions: its own directory product, administrative and security software, virtualization software, and most recently what I view as a very nice approach to porting Windows apps to Linux and mainframes. Still, a succession of CEOs failed to turn around the company, and, in the ultimate irony, Attachmate, with strengths and a long history itself in remote PC software, has decided to take on Novell’s assets.
I think that the best summing up of Novell’s ultimate strategic mistake was the remark of one of its CEOs shortly after assuming command: “Everyone thinks about Microsoft as the biggest fish in the ocean. It is the ocean.” In other words, Novell would have done better by treating Microsoft as the vendor of the environment that Novell had to support, and aiming to service that market, rather than trying to out-feature Microsoft. But everyone else made that mistake; why should Novell have been any different?
We are left not only with Novell’s past contributions to computing, but also with the contributions of its alumni. Some fostered the SMB market with products like the Pervasive database; some were drivers of the UNIX standards push and later the TP monitors that led to today’s app servers. One created the Burton Group, a more technically-oriented analyst firm that permanently improved the quality of the analyst industry.
And we are also left with an enthusiasm that could not be contained by traditional firms, and that moved on to UNIX, to the Web, to open source. The one time, in the late 1980s, I went to Novell’s user group meeting, it was clearly a bit different. After one of the presentations, a LAN servicer rose to ask a question. “So-and-so, LANs Are My Life”, he identified himself. That was the name of his firm: LANs Are My Life, Inc. It’s not a bad epitaph for a computer company: we made a product so good that for some people – not just Novell employees – it was our life. Rest in peace, Novell.
There are certain vendors of infrastructure software who deliver long-term value-add to their customers by walking the narrow line between the innovative and the proprietary exceptionally well. Over its long history, InterSystems has parlayed an innovative database that could be fully integrated into existing data centers into an innovative middleware suite that could be fully integrated into existing data architectures, and then into innovative health care applications that could be fully integrated into existing health care systems and networks, delivering value-add at every step. Now, InterSystems has announced a new generation of its database/development platform, Caché 2010, with Caché Database Mirroring and Caché eXtreme for Java. Surprise, surprise: the new features are innovative, integrated out of the box with existing IT strategies and systems, and very useful.
InterSystems has long been known as the vendor of Caché, a “post-relational” object database that has proven its E-business prowess in real-world business-critical situations such as health care applications. Caché combines object, multidimensional, and SQL technologies to handle content-heavy OLTP, decision support, and “mixed” transaction streams effectively. More recently, InterSystems has also become known as the supplier of Ensemble, a Caché-based integration and development platform that allows access to a wide array of data types, plus data transmission from application to application, especially as part of business process integration. InterSystems has a position of strength in the health care market, with widespread Caché use by best-practices hospitals and labs.
Caché Database Mirroring
Due to a recent TCO study, I have become aware of just how expensive maintaining two or three redundant data centers for full global active/active rapid-recovery can be. As I understand it, Caché reduces costs by increasing the flexibility of replication of Caché data. Specifically, Caché Database Mirroring allows “warm” (not completely up to date) mirroring in certain circumstances, and “logical” (which some might call “virtual”) replication that does not have to be to a physically separate or remote system. The resulting decrease in load on both ends of a mirroring process, as well as the automation of Caché Database Mirroring deployment and operation, lowers contention for shared resources by the replication process and allows use of inexpensive PC servers and the like instead of expensive, dedicated Storage Area Network software and systems.
Caché eXtreme for Java
As CEP use increases, it has become clear that “contextual” data able to be accessed in “near real time” is needed to scale these solutions. While Caché users have found it particularly effective in accessing the object-type and XML data that CEP engines typically process, due to its object support and strong performance, the lingua franca of such engines is often Java, for better or worse. Caché eXtreme for Java provides direct access to Caché operations and data stores from Java, enabling this large class of developers to rapidly develop more scalable CEP applications.
Where similar infrastructure software companies have faltered or been acquired in the recent deep recession, InterSystems appears to be continuing to strike out in new directions. Some of that may come from the relative resilience of the health care market that was once its historical strength. However, it seems clear that much of its success comes from continuing to deliver “innovation with a difference” that fits with customer environments and also adds immediately useful features improving the customer’s cost effectiveness and flexibility.
Also notable is that these improvements involve both new and old products. InterSystems has been smart not to treat Caché like a cash cow, as the market’s focus switched to Internet middleware these last few years – other vendors seem to have fallen into that trap, and may be paying the price.
The new announcements, as ever, make InterSystems worth the IT buyer’s close attention, especially in such areas as CEP and development.
I was reading a Business Intelligence (BI) white paper feed – I can’t remember which – when I happened across one whose title was, more or less, “Data Visualization: Is This the Way To Attract the Common End User?” And I thought, boy, here we go again.
You see, the idea that just a little better user interface will finally get Joe and Jane (no, not you, Mr. Clabby) to use databases dates back at least 26 years. I know, because I had an argument with my boss at CCA, Dan Ries iirc (a very smart fellow), about it. He was sure that with a fill-out-the-form approach, any line employee could do his or her own ad-hoc queries and reporting. Based on my own experiences as a naïve end user, I felt we were very far from being able to give the average end user an interface that he or she would be able or motivated to use. Here we are, 26-plus years later, and all through those years, someone would pipe up and say, in the immortal words of Bullwinkle, “This time for sure!” And every time, it hasn’t happened.
I divide the blame for this equally between vendor marketing and IT buying. Database and BI vendors, first and foremost, look to extend the ability of specific targets within the business to gain insights. That requires ever more sophisticated statistical and relationship-identifying tools. The vendor looking to design a “common-person” user interface retrofits the interface to these tools. In other words, the vendor acts like it is selling to a business-expert, not a consumer, market.
Meanwhile, IT buyers looking to justify the expense of BI try to extend its use to upper-level executives and business processes, not demand that it extend the interface approach of popular consumer apps to using data, or that it give the line supervisor who uses it at home a leg up at work. And yet, that is precisely how Word, Excel, maybe PowerPoint, and Google search wound up being far more frequently used than SQL or OLAP.
I have been saying things like this for the last 26 years, and somehow, the problem never gets solved. At this point, I am convinced that no one is really listening. So, for my own amusement, I give you three ideas – ideas proven in the real world, but never implemented in a vendor product – that if I were a user I would really like, and that I think would come as close as anything can to achieving “BI for the masses.”
Idea Number 1: Google Exploratory Data Analysis
I’m reading through someone’s blog when they mention “graphical analysis.” What the hey? There’s a pointer to another blog, where they make a lot of unproven assertions about graphical analysis. Time for Google: a search on graphical analysis results in a lot of extraneous stuff, some of it very interesting, plus Wikipedia and a vendor who is selling this stuff. Wikipedia is off-topic, but carefully reading the article shows that there are a couple of articles that might be on point. One of them gives me some of the social-networking theory behind graphical analysis, but not the products or the market. Back to Google, forward to a couple of analyst market figures. They sound iffy, so I go to a vendor site and get their financials to cross-check. Not much in there, but enough that I can guesstimate. Back to Google, change the search to “graphical BI.” Bingo, another vendor with much more market information and ways to cross-check the first vendor’s claims. Which products have been left out? An analyst report lists the two vendors, but in a different market, and also lists their competitors. Let’s take a sample competitor: what’s their response to “graphical analysis” or graphical BI? Nothing, but they seem to feel that statistical analysis is their best competitive weapon. Does statistical analysis cover graphical analysis? The names SAS and SPSS keep coming up in my Google searches. It doesn’t seem as if their user manuals even mention the word “graph”. What are the potential use cases? Computation of shortest path. Well, only if you’re driving somewhere. Still, if it’s made easy for me … Is this really easier than Mapquest? Let’s try a multi-step trip. Oog. It definitely could be easier than Mapquest. Can I try out this product? All right, I’ve got the free trial version loaded, let’s try the multi-step trip. You know, this could do better for a sales trip than my company’s path optimization stuff, because I can tweak it for my personal needs. 
Combine with Google Maps, stir … wouldn’t it be nice if there was a Wikimaps, so that people could warn us about all these little construction obstructions and missing signs? Anyway, I’ve just given myself an extra half-hour on the trip to spend on one more call, without having to clear it.
Two points about this. First, Google is superb at free-association exploratory analysis of documents. You search for something, you alter the search because of facts you’ve found, you use the results to find other useful facts about it, you change the topic of the search to cross-check, you dig down into specific examples to verify, you even go completely off-topic and then come back. The result is far richer, far more useful to the “common end user” and his or her organization, and far more fun than just doing a query on graphical data in the company data warehouse.
Second, Google is lousy at exploratory data analysis, because it is “data dumb”: It can find metadata and individual pieces of data, but it can’t detect patterns in the data, so you have to do it yourself. If you are searching for “graphical analysis” across vendor web sites, Google can’t figure out that it would be nice to know that 9 of 10 vendors in the market don’t mention “graph” on their web sites, or that no vendors offer free trial downloads.
The answer to this seems straightforward enough: add “guess-type” data analysis capabilities to Google. And, by the way, if you’re at work, make the first port of call your company’s data-warehouse data store, full of data you can’t get anywhere else. You’re looking for the low-priced product for graphical analysis? Hmm, your company offers three types through a deal with the vendor, but none is the low-cost one. I wonder what effect that has had on sales? Your company did a recent price cut; sure enough, it hasn’t had a big effect. Except in China: does that have to do with the recent exchange rate manipulations, and the fact that you sell via a Chinese firm instead of on your own? It might indeed, since Google tells you the manipulations started 3 weeks ago, just when the price cut happened.
You get the idea? Note that the search/analysis engine guessed that you wanted your company’s data called out, and that you wanted sales broken down by geography and in a monthly time series. Moreover, this is exploratory data analysis, which means that you get to see both the summary report/statistics and individual pieces of raw data – to see if your theories about what’s going on make sense.
In Google exploratory data analysis, the search engine and your exploration drive the data analysis; the tools available don’t. It’s a fundamental mind shift, and one that explains why Excel became popular and in-house on-demand reporting schemes didn’t, or why Google search was accepted and SQL wasn’t. One’s about the features; the other’s about the consumer’s needs.
Oh, by the way, once this takes off, you can start using information about user searches to drive adding really useful data to the data warehouse.
Idea Number 2: The Do The Right Thing Key
Back in 1986, I loved the idea behind the Spike Lee movie title so much that I designed an email system around it. Here’s how it works:
You know how when you are doing a “replace all” in Word, you have to specify an exact character string, and then Word mindlessly replaces all occurrences, even if some should be capitalized and some not, or even if you just want whole words to be changed and not character strings within words? Well, think about it. If you type a whole word, 90% of the time you want only words to be replaced, and capitals to be added at the start of sentences. If you type a string that is only part of a word, 90% of the time you want all occurrences of that string replaced, and capitals when and only when that string occurs at the start of a sentence. So take that Word “replace” window, and add a Do the Right Thing key (really, a point and click option) at the end. If it’s not right, the user can just Undo and take the long route.
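To make the heuristic concrete, here is a minimal Python sketch of what such a key might do behind the scenes. The function name and the rules for detecting a “whole word” and a sentence start are my own assumptions for illustration; this is not how Word works, and a real product would need smarter guessing.

```python
import re

# Hypothetical helper: an illustrative guess, not a feature of any real product.
# Matches when the text before a hit is empty or ends with ., ! or ? plus spaces.
_SENTENCE_START = re.compile(r"(?:^|[.!?]\s+)$")


def do_the_right_thing_replace(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, guessing at word boundaries and capitals."""
    word_pattern = rf"\b{re.escape(old)}\b"
    # Assumption: if `old` appears somewhere as a standalone word, the user
    # typed a whole word and wants only whole-word matches replaced.
    typed_whole_word = re.search(word_pattern, text, flags=re.IGNORECASE) is not None
    pattern = word_pattern if typed_whole_word else re.escape(old)

    def swap(match: re.Match) -> str:
        # Capitalize only when the match opens the text or follows ., ! or ?
        at_start = _SENTENCE_START.search(text[: match.start()]) is not None
        return new[:1].upper() + new[1:] if at_start else new

    return re.sub(pattern, swap, text, flags=re.IGNORECASE)


print(do_the_right_thing_replace(
    "Cat food is here. The cat sat. Cat owners concatenate.", "cat", "dog"))
```

In this sample, “cat” occurs as a standalone word, so the “cat” inside “concatenate” is left alone, and the replacement is capitalized only where it opens a sentence. If the guess is wrong, Undo is still there.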
The Do The Right Thing key is a macro; but it’s a smart macro. You don’t need to create it, and it makes some reasonable guesses about what you want to do, rather than you having to specify what it should do exactly. I found when I designed my email system that every menu, and every submenu or screen, would benefit from having a Do The Right Thing key. It’s that powerful an idea.
How does that apply to BI? Suppose you are trying to track down a sudden drop in sales one week in North America. You could dive down, layer by layer, until you found that stores in Manitoba all saw a big drop that week. Or, you could press the Break in the Pattern key, which would round up all breaks in patterns of sales, and dig down not only to Manitoba but also to big offsetting changes in sales in Vancouver and Toronto, with appropriate highlighting. 9 times out of ten, that will be the right information, and the other time, you’ll find out some other information that may prove to be just as valuable. Now do the same type of thing for every querying or reporting screen …
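A Break in the Pattern key could start from something as simple as the sketch below. It assumes weekly sales per region are already in memory as plain lists, and it uses an arbitrary rule (the latest week differs from the earlier average by more than three standard deviations); the function name, the threshold, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def breaks_in_pattern(
    sales_by_region: Dict[str, List[float]], threshold: float = 3.0
) -> List[Tuple[str, float]]:
    """Return (region, change vs. earlier average) for regions that broke pattern."""
    breaks = []
    for region, series in sales_by_region.items():
        history, latest = series[:-1], series[-1]
        if len(history) < 2:
            continue  # too little history to say what the pattern is
        change = latest - mean(history)
        # Assumed rule: flag the week if it sits more than `threshold`
        # standard deviations away from the earlier weeks' average.
        if abs(change) > threshold * stdev(history):
            breaks.append((region, change))
    # Biggest surprises first, whether drops or offsetting gains.
    return sorted(breaks, key=lambda item: abs(item[1]), reverse=True)


weekly_sales = {  # made-up numbers for illustration
    "Manitoba": [100, 102, 98, 101, 60],
    "Vancouver": [200, 198, 202, 199, 225],
    "Toronto": [300, 301, 299, 300, 315],
    "Halifax": [150, 151, 149, 150, 150],
}
for region, change in breaks_in_pattern(weekly_sales):
    print(f"{region}: {change:+.2f}")
```

The point of the sketch is the ordering of work: the tool rounds up every break first, so the user sees Manitoba's drop alongside the offsetting gains instead of drilling down to one of them by hand.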
The idea behind the Do The Right Thing key is actually very similar to that behind Google Exploratory Data Analysis. In both cases, you are really considering what the end user would probably want to do first, and only then finding a BI tool that will do that. The Do The Right Thing key is a bit more buttoned-up: you’re probably carrying out a task that the business wants you to do. Still, it’s way better than “do it this way or else.”
Idea Number 3: Build Your Own Data Store
Back in the days before Microsoft Access, there was a funny little database company called FileMaker. It had the odd idea that people who wanted to create their own contact lists, their own lists of the stocks they owned and their values, their own grades or assets and expenses, should be able to do so, in just the format they wanted. As Oracle steadily cut away at other competitors in the top end of the database market, FileMaker kept gaining individual customers who would bring FileMaker into their local offices and use it for little projects. To this day, it is still pretty much unique in its ability to let users quickly whip up small-sized, custom data stores to drive, say, class registrations at a college.
To my mind, FileMaker never quite took the idea far enough. You see, FileMaker was competing against folks like Borland in the days when the cutting edge was allowing two-way links between, let’s say, students and teachers (a student has multiple teachers, and teachers have multiple students). But what people really want, often, is “serial hierarchy”. You start out with a list of all your teachers; the student is the top level, the teachers and class location/time/topic the next level. But you next want to see if there’s an alternate class; now the topic is the top level, the time at the next level, the students (you, and whether the class is full) at a third level. If the number of data items is too small to require aggregation, statistics, etc., you can eyeball the raw data to get your answers. And you don’t need to learn a new application (Outlook, Microsoft Money, Excel) for each new personal database need.
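Here is a rough Python sketch of what “serial hierarchy” might mean in practice: the same flat list of records is regrouped on demand under whatever ordering of levels you care about at the moment. The function name, the field names, and the sample records are hypothetical, and the sketch assumes at least two levels are requested.

```python
from typing import Dict, List


def serial_hierarchy(records: List[Dict[str, str]], levels: List[str]) -> dict:
    """Regroup flat records into a nested view, top level first.

    Assumes `levels` names at least two fields; the last one is the leaf.
    """
    *branches, leaf = levels
    tree: dict = {}
    for record in records:
        node = tree
        for depth, level in enumerate(branches):
            is_last_branch = depth == len(branches) - 1
            # The lowest branch holds a list of leaf values; higher ones nest.
            node = node.setdefault(record[level], [] if is_last_branch else {})
        if record[leaf] not in node:
            node.append(record[leaf])
    return tree


classes = [  # made-up registration data
    {"student": "Ana", "teacher": "Ito", "topic": "Math", "time": "9am"},
    {"student": "Ana", "teacher": "Roy", "topic": "Art", "time": "1pm"},
    {"student": "Ben", "teacher": "Ito", "topic": "Math", "time": "9am"},
    {"student": "Cal", "teacher": "Lee", "topic": "Math", "time": "2pm"},
]

print(serial_hierarchy(classes, ["student", "teacher"]))
print(serial_hierarchy(classes, ["topic", "time", "student"]))
```

The first call answers “who are my teachers?”; the second answers “when else is this topic offered, and who is in it?” from the same data, with no new schema and no new application.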
The reason this fits BI is that, often, the next step after getting your personal answers is to merge them with company data. You’ve figured out your budget, now do “what if”: does this fit with the company budget? You’ve identified your own sales targets, so how do these match up against those supplied by the company? You download company data into your own personal workspace, and use your own simple analysis tools to see how your plans mesh with the company’s. You only get as complex a user interface as you need.
Conclusions
I hope you enjoyed these ideas, because, dollars to doughnuts, they’ll never happen. It’s been 25 years, and the crippled desktop/folder metaphor and its slightly less crippled cousin, the document/link browser metaphor, still dominate user interfaces. It’s been fifteen years, and only now is Composite Software’s Robert Eve getting marketing traction by pointing out that trying to put all the company’s data in a data warehouse is a fool’s errand. It’s been almost 35 years, and still no one seems to have noticed that seeing a full page of a document you are composing on a screen makes your writing better. At least, after 20 years, Google Gmail finally showed that it was a good idea to group a message and its replies. What a revelation!
No, what users should really be wary of is vendors who claim they do indeed do any of the ideas listed above. This is a bit like vendors claiming that requirements management software is an agile development tool. No; it’s a retrofitted, slightly less sclerotic tool instead of something designed from the ground up to serve the developer, not the process.
But if you dig down, and the vendor really does walk the walk, grab the BI tool. And then let me know the millennium has finally arrived. Preferably not after another 26 years.
Two weeks ago, Merv Adrian’s blog was filled with analysis of the recent TDWI conference, which had as a theme “Agile Business Intelligence.” Merv’s initial reaction was the same as mine: what does BI have to do with the agile development movement?
 In the title, ABI is short for Agile Business Intelligence, and WTF, as every fan of the TV show Battlestar Galactica knows, is short for What The Frak, while WTFN stands for Why The Frak Not.
My confusion deepened as I tracked down the BI companies that he cited: It appeared that only one, Endeca, was marketing its solution as “agile BI” (Wherescape simply notes that its data-warehouse-building solution is increasing its built-in support for agile development practices). Endeca’s definition of agile BI appears from its web site to boil down to: BI is agile if it speeds ad-hoc querying, because that allows changes in pre-decision analysis that lead to better and quicker business decisions. It isn’t intuitively obvious that such a definition corresponds to development agility as defined by the Agile Manifesto or to the various definitions of business agility that have recently surfaced.
Definitions really matter in this case, because, as I have argued in previous articles, improved agility (using the correct definition) has a permanent positive impact on the top line, the bottom line, and business risk. Data from my Aberdeen Group 2009 survey of local and global firms of a range of sizes and verticals suggests that increased agility decreases costs in the long term, on average, by at least 10% below their previous trend line, increases revenues by at least a similar 10% above trend, and decreases the risk of negative surprises by at least 5%. And, according to the same study, the only business/IT process being tried that clearly increased agility and produced such effects was agile development as defined by the Manifesto (“hybrid” open-source development and collaborative development may also improve agility, to a much smaller extent).
On further reflection, I have decided that agile BI is indeed a step forward in overall business agility. However, it is a very small step. It is quite possible for a smart organization to take what’s out there, combine it in new ways, and make some significant gains in business agility. But it’s not easy to do, and right now, they won’t get much help from any single vendor.
Key Points in the Definition of Agility
I define agility as the ability of an organization to handle events or implement strategies that change the functioning of key organizational processes. Agility can be further categorized as proactive and reactive; anticipated and unanticipated; internally or externally caused; new-product, operational, and disaster. That is, improved agility is improvement in one or all of these areas.
Initial data suggest that improvements in new-product development (proactive, unanticipated, externally caused) have the greatest impact, since they have spill-over effects on the other categories (anticipated, internally-caused, operational, and disaster). However, improvements in operational and disaster agility can also deliver significant bottom-line long-term increases. Improved agility can be measured and detected from its effects on organizational speed, effectiveness, and “follow-on” metrics (TCO, ROI, customer satisfaction, business risk).
The implications for Agile BI are:
- Unless improved BI agility improves new-product development, its business impact is smaller.
- Increased speed (faster reporting of results) without increased effectiveness (i.e., a more agile business decision-making process) has minimal impact on overall agility.
- Improvements to “reactive” decision-making deliver good immediate results, but have less long-term impact than improvements to “proactive” decision-making that anticipates rather than reacting to key environmental changes.
In summary, agile BI that is part of an overall agile decision-making and new-product-strategy-driving business process, and that emphasizes proactive search for extra-organizational data sources, should produce much better long-term bottom-line results than today’s reactive BI that depends on relatively static and intra-organizational data sources.
The Fundamental Limit to Today’s Agile Decision-Making via BI
Question: Where do the greatest threats to the success of the organization lie, in its internal business processes or in external changes to its environment and markets? Answer: In most cases, external. Question: Which does better at allowing the business person to react fast to, and even anticipate, external changes – internally gathered data alone, or internal data plus external data that appears ahead of or gives context to internal data? Answer: Typically, external. Question: What percentage of BI data is external data imported immediately, directly to the data store? Answer: Usually, less than 0.1 %. Question: What is the average time for the average organization from when a significant new data source shows up on the Web to when it begins to be imported into internal databases, much less BI? Answer: more than half a year.
The fundamental limit to the agility and effectiveness of BI therefore lies not in any inability to speed up analysis, but in the fact that today’s BI and the business processes associated with it are designed to focus on internal data. Increasingly, your customers are moving to the Web; your regulatory environment is moving to the Web; mobile devices are streaming data across the Web; new communications media like Facebook and Twitter are popping up; and businesses are capturing a very small fraction of this data, primarily from sources (long-time customers) that are changing the least. As a result, the time lost from deducing a shift in customer behavior from weekly or monthly per-store buying instead of social-network movement from one fad to another dwarfs the time saved when BI detects the per-store shift in a day instead of a weekend; and a correct reaction to the shift is far less likely without external contextual data.
This is an area where agile new product development is far ahead of BI. Where is the BI equivalent of reaching out to external open-source and collaborative communities? Of holding “idea jams” across organizations? Of features/information as a Web collaboration between an external user and a code/query creator? Of “spiraling in on” a solution? Of measuring effectiveness by “time to customer value” instead of “time to complete” or “time to decide”?
A simple but major improvement in handling external data in BI is pretty much doable today. It might involve integrating RSS feeds as pop-ups and Google searches as complements to existing BI querying. But if a major BI vendor features this capability on the front page of its Web site, I have yet to find that vendor.
Action Items
In the long run, therefore, users should expect that agile BI that delivers major bottom-line results will probably involve:
- Much greater use of external data to achieve more proactive decision-making.
- Major changes to business processes involving BI to make them more agile.
- Constant fine-tuning of the querying that BI offers, customized to the needs of the business, rather than feature addition and decision-process change gated by the next BI vendor release.
- Integration with New Product Development, so that customer insights based on historical context can supplement agile development’s right-now interaction with its Web communities.
Here are a few suggestions:
- Look at a product like the joint Composite Software/Kapow Technologies Composite Application Data Services for Web Content to semi-automatically inhale new Web-based external data.
- Look for major BI vendors that “walk the walk” in agile development, such as IBM with its in-house-used Jazz development environment, as a good indicator that the vendor’s BI services arm is up to the job of helping improve the agility of BI-related business processes; but be sure to check that the BI solution is also being developed that way.
- Look for BI vendor support for ad-hoc querying (as noted above, kudos to Endeca in this area), as this will likely make it easier to fine-tune querying constantly.
- Look for a BI vendor that can offer, in its own product line or via a third party, agile NPD software that includes collaborative tools to pass data between BI and the NPD project. Note that in most if not all cases you will still need to implement the actual BI-to-NPD link for your organization, and that if your organization does not do agile NPD you won’t get the full benefit of this. Also note that agile plus lean NPD, where the emphasis is on lean, does not qualify.
- Above all, change your metrics for agile BI success, from “increased speed” to “time to value”.
Today’s agile BI as touted by BI vendors is a very small, very delayed piece of a very good idea. Rather than patting them and yourself on the back for being five years behind development-tool vendors and three years behind NPD software vendors, why don’t you get moving on more ambitious stuff with real business impact? If not, WTFN?
 Answers based on Aberdeen Group data usefulness study, used by permission of Aberdeen Group.
 I am in disagreement with other commentators on this matter. I believe that lean cost-focused just-in-time processes work against agility as much as they work for it, since if product specs change there is less resource “slack” to accommodate the change.
Recently, I received a blurb from a company named 1010data, claiming that its personnel had been doing columnar databases for more than 30 years. As someone who was integrally involved at a technical level in the big era of database theory development (1975-1988), when everything from relational to versioning to distributed to inverted-list technology (the precursor to much of today’s columnar technology) first saw the light, I was initially somewhat skeptical. This wariness was heightened by marketing claims that its data-warehousing performance was better not only than relational databases but also than competitors’ columnar databases, even though 1010data does little in the way of indexing; and this performance improvement applied not only to ad-hoc queries with little discernible pattern, but also to many repetitive queries for which index-style optimization was apparently the logical thing to do.
1010data’s marketing is not clear as to why this should be so; but after talking to them, and reading their technical white paper, I have come up with a theory as to why it might be so. The theory goes like this: 1010data is not living in the same universe.
That sounds drastic. What I mean by this is, while the great mass of database theory and practice went one way, 1010data went another, back in the 1980s, and by now, in many cases, they really aren’t talking the same language. So what follows is an attempt to recast 1010data’s technology in terms familiar to me. Here’s the way I see it:
Since the 1980s, people have been wrestling with the problem of read and write locks on data. The idea is that if you decide to update a datum while another person is attempting to read it, each of you will see a different value, or the other person can’t predict which value he/she will see. To avoid this, the updater can block all other access via a write lock – which in turn slows down the other person drastically; or the “query from hell” can block updaters via a read lock on all data. In a data warehouse, updates are held and then rushed through at certain times (end of day/week) in order to avoid locking problems. Columnar databases also sometimes provide what is called “versioning”, in which previous values of a datum are kept around, so that the updater can operate on one value while the reader can operate on another.
1010data provides a data warehouse/business intelligence solution as a remote service – the “database as a service” variant of SaaS/public cloud. However, 1010data’s solution does not start by worrying about locking. Instead, it worries about how to provide each end user with a consistent “slice of time” database of his/her own. It appears to do this as follows: all data is divided up into what they call “master” tables (as in “master data management” of customer and supplier records), which are smaller, and time-associated/time-series “transactional” tables, which are the really large tables.
Master tables are more rarely changed, and therefore a full copy of the table after each update (really, a “burst” of updates) can be stored on disk, and loaded into main memory if needed by an end user, with little storage and processing overhead. This isn’t feasible for the transactional tables; but 1010data sees old versions of these as integral parts of the time series, not as superseded data; so the actual amount of “excess” data “appended” to a table, if maximum session length for an end user is a day, is actually small in all realistic circumstances. As a result, two versions of a transactional table include a pointer to a common ancestor plus a small “append”. That is, the storage overhead of additional versioning data is actually small compared to some other columnar technologies, and not that much more than row-oriented relational databases.
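To make the "common ancestor plus a small append" idea concrete, here is a minimal Python sketch. It is my own illustration under stated assumptions, not 1010data's implementation: the class names `AppendOnlyTable` and `Snapshot` are hypothetical, and the "table" is just an in-memory list. Each version records only how many rows it can see, so every version shares the same underlying storage.

```python
class AppendOnlyTable:
    """Hypothetical transactional table: rows are only ever appended."""

    def __init__(self):
        self._rows = []  # one shared, append-only row store

    def append_burst(self, rows):
        """Apply a burst of updates as appends and return the new version."""
        self._rows.extend(rows)
        return Snapshot(self, len(self._rows))


class Snapshot:
    """A consistent 'slice of time': the shared store plus a row count."""

    def __init__(self, table, length):
        self._table = table
        self._length = length  # rows appended later are invisible here

    def rows(self):
        # Old rows are never modified, so no lock is needed to read them.
        return self._table._rows[:self._length]


table = AppendOnlyTable()
morning = table.append_burst([("09:00", "widget", 10)])
noon = table.append_burst([("12:00", "widget", 12)])

print(len(morning.rows()))  # the morning user still sees one row
print(len(noon.rows()))     # the noon user sees both rows
```

The extra cost of a new version in this sketch is one integer plus the appended rows, which is the sense in which the versioning overhead stays small.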
Now the other shoe drops, because, in my very rough approximation, versioning entire tables instead of particular bits of data allows you to keep those bits of data pretty much sequential on disk – hence the lack of need for indexing. It is as if each burst of updates comes with an online reorganization that restores the sequentiality of the resulting table version, so that reads during queries can potentially almost eliminate seek time. The storage overhead means that more data must be loaded from disk; but that’s more than compensated for by eliminating the need to jerk from one end of the disk to the other in order to inhale all needed data.
So here’s my take: 1010data’s claim to better performance, as well as to competitive scalability, is credible. Since we live in a universe in which indexing to minimize disk seek time plus minimizing added storage to minimize disk accesses in the first place allows us to push against the limits of locking constraints, we are properly appreciative of the ability of columnar technology to provide additional storage savings and bit-mapped indexing to store more data in memory. Since 1010data lives in a universe in which locking never happens and data is stored pretty sequentially, it can happily forget indexes and squander a little disk storage and still perform better.
1010data Loves Sushi
At this point, I could say that I have summarized 1010data’s technical value-add, and move on to considering best uses. However, to do that would be to ignore another way that 1010data does not operate in the same universe: it loves raw data. It would prefer to operate on data before any detection of errors and inconsistencies, as it views these problems as important data in their own right.
As a strong proponent of improving the quality of data provided to the end user, I might be expected to disagree strongly. However, as a proponent of “data usefulness”, I feel that the potential drawbacks of 1010data’s approach are counterbalanced by some significant advantages in the real world.
In the first place, 1010data is not doctrinaire about ETL (Extract, Transform, Load) technology. Rather, 1010data allows you to apply ETL at system implementation time or simply start with an existing “sanitized” data warehouse (although it is philosophically opposed to these approaches), or apply transforms online, at the time of a query. It’s nice that skipping the transform step when you start up the data warehouse will speed implementation. It’s also nice that you can have the choice of going raw or staying baked.
In the second place, data quality is not the only place where the usefulness of data can be decreased. Another key consideration is the ability of a wide array of end users to employ the warehoused data to perform more in-depth analysis. 1010data offers a user interface using the Excel spreadsheet metaphor and supporting column/time-oriented analysis (as well as an Excel add-in), thus providing better rolling/ad-hoc time-series analysis to a wider class of business users familiar with Excel. Of course, someone else may come along and develop such a flexible interface, although 1010data would seem to have a lead as of now; but in the meanwhile, the wider scope and additional analytic capabilities of 1010data appear to compensate for any problems with operating on incorrect data – and especially when there are 1010data features to ensure that analyses take into account possible incorrectness.
To me, some of the continuing advantages of 1010data’s approach depend fundamentally on the idea that users of large transactional tables require ad-hoc historical analysis. To put it another way, if users really don’t need to keep historical data around for more than an hour in their databases, and require frequent updates/additions for “real-time analysis” (or online transaction processing), then tables will require frequent reorganizing and will include a lot of storage-wasting historical data, so that 1010data’s performance advantages will decrease or vanish.
However, there will always be ad-hoc, in-depth queryers, and these are pretty likely to be interested in historical analysis. So while 1010data may or may not be the be-all, end-all data-warehousing database for all verticals forever, it is very likely to offer distinct advantages for particular end users, and therefore should always be a valuable complement to a data warehouse that handles vanilla querying on a “no such thing as yesterday” basis.
Not being in the mainstream of database technology does not mean irrelevance; not being in the same database universe can mean that you solve the same problems better. It appears that taking the road less travelled has allowed 1010data to come up with a new and very possibly improved solution to data warehousing, just as inverted list resurfaced in the last few years to provide new and better technology in columnar databases. And it is not improbable that 1010data can continue to maintain any performance and ad-hoc analysis advantages in the next few years.
Of course, proof of these assertions in the real world is an ongoing process. I would recommend that BI/data warehousing users in large enterprises in all verticals kick the tires of 1010data – as noted, testbed implementation is pretty swift – and then performance test it and take a crack at the really tough analyst wish lists. To misquote Santayana, those who do not analyze history are condemned to repeat it – and that’s not good for the bottom line.
What, to my mind, was the biggest news out of EMC World? The much-touted Private Cloud? Don’t think so. The message that, as one presenter put it, “Tape Sucks”? Sorry. FAST Cache, VPLEX, performance boosts, cost cuts? Not this time. No, what really caught my attention was a throw-away slide showing that almost a majority of EMC customers have already adopted some form of deduplication technology, and that in the next couple of years, probably a majority of all business storage users will have done so.
Why do I think this is a big deal? Because technology related to deduplication holds the potential of delivering benefits greater than cloud; and user adoption of deduplication indicates that computer markets are ready to implement that technology. Let me explain.
First of all, let’s understand what “dedupe”, as EMC calls it, and a related technology, compression, mean to me. In its initial, technical sense, deduplication means removing duplicates in data. Technically, compression means removing "noise" -- in the information-theory sense of removing bits that aren’t necessary to convey the information. Thus, for example, removing all but one occurrence of the word “the” in storing a document would be deduplication; using the soundex algorithm to represent “the” in two bytes would be compression. However, today popular “compression” products often use technical-deduplication as well; for example, columnar databases compress the data via such techniques as bit-mapped indexing, and also de-duplicate column values in a table. Likewise, data deduplication products may apply compression techniques to shrink the storage size of data blocks that have already been deduplicated. So when we refer to “dedupe”, it often includes compression, and when we refer to compressed data, it often has been deduplicated as well. To try to avoid confusion, I refer to “dedupe” and “compress” to mean the products, and deduplication and compression to mean the technical terms.
When I state that there is an upcoming “dedupe revolution”, I really mean that deduplication and compression combined can promise a new way to improve not only backup/restore speed, but also transaction processing performance. Because, up to now, “dedupe” tools have been applied across SANs (storage area networks), while “compress” tools are per-database, “dedupe” products simply offer a quicker path than “compress” tools to achieving these benefits globally, across an enterprise.
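The distinction between the two technical terms, and the way products combine them, can be sketched in a few lines of Python. This is an illustrative toy under my own assumptions, not how any vendor's product works: the function names `dedupe_blocks` and `restore_blocks` are hypothetical, blocks are whole byte strings, SHA-256 content hashes identify duplicates, and the standard-library `zlib` stands in for the compression step.

```python
import hashlib
import zlib


def dedupe_blocks(blocks):
    """Store each distinct block once (deduplication), compressed (compression)."""
    store = {}  # content hash -> compressed block
    refs = []   # one hash per original block, in original order
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:          # deduplication: skip repeats
            store[digest] = zlib.compress(block)  # compression: shrink what remains
        refs.append(digest)
    return store, refs


def restore_blocks(store, refs):
    """Rebuild the original sequence of blocks from the reduced form."""
    return [zlib.decompress(store[digest]) for digest in refs]


blocks = [b"the " * 100, b"a one-off block", b"the " * 100]
store, refs = dedupe_blocks(blocks)

raw_size = sum(len(b) for b in blocks)
stored_size = sum(len(c) for c in store.values())
print(len(store), "distinct blocks kept out of", len(blocks))
print(stored_size, "bytes stored versus", raw_size, "raw")
```

Real products work on fixed- or variable-size chunks across a whole storage network rather than on three in-memory strings, but the two steps are the same in spirit.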
These upcoming “dedupe” products are perhaps best compared to a sponge compressor that squeezes all excess “water” out of all parts of a data set. That means not only removing duplicate files or data records, but also removing identical elements within the data, such as all frames from a loading-dock video camera that show nothing going on. Moreover, it means compressing the data that remains, such as documents and emails whose verbiage can be encoded in a more compact form. When you consider that semi-structured or unstructured data such as video, audio, graphics, and documents makes up 80-90% of corporate data, and that the most “soggy” data types such as video use up the most space, you can see why some organizations are reporting up to 97% storage-space savings (70-80%, more conservatively) where “dedupe” is applied. And that doesn’t include some of the advances in structured-data storage, such as the columnar databases referred to above that “dedupe” columns within tables.
So, what good is all this space saving? Note the fact that the storage capacity that users demand has been growing by 50-60% a year, consistently, for at least the last decade. Today’s “dedupe” may not be appropriate for all storage; but where it is, it is equivalent to setting back the clock 4-6 years. Canceling four years of storage acquisition is certainly a cost-saver. Likewise, backing up and restoring “deduped” data involves a lot less to be sent over a network (and the acts of deduplicating and reduplicating during this process add back only a fraction of the time saved), so backup windows and overhead shrink, and recovery is faster. Still, those are not the real reasons that “dedupe” has major long-term potential.
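The "setting back the clock" arithmetic can be checked with a short calculation. The sketch below is mine, and it assumes steady compound growth, with a one-time space saving treated as undoing some number of years of that growth; the function name `years_set_back` is hypothetical.

```python
import math


def years_set_back(space_saving, annual_growth):
    """Years of compound growth undone by a one-time space saving.

    space_saving  -- fraction of space removed, e.g. 0.8 for 80%
    annual_growth -- yearly growth in demanded capacity, e.g. 0.5 for 50%
    """
    shrink_factor = 1.0 / (1.0 - space_saving)  # 80% saving -> data is 5x smaller
    return math.log(shrink_factor) / math.log(1.0 + annual_growth)


for saving in (0.8, 0.9):
    print(saving, round(years_set_back(saving, 0.5), 1))
```

At 50% annual growth, an 80% saving works out to about 4 years and a 90% saving to about 5.7 years, which is roughly the 4-6 year range cited above. Faster growth shortens the figure somewhat.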
No, the real long-run reason that storage space saving matters is that it speeds retrieval from disk/tape/memory, storage to disk/tape/memory, and even processing of a given piece of data. Here, the recent history of “compress” tools is instructive. Until a few years ago, the effort of compressing and uncompressing tended to mean that compressed data actually took longer to retrieve, process, and re-store; but, as relational and columnar database users have found out, certain types of “compress” tools allow you to improve performance – sometimes by an order of magnitude. For example, recently, vendors such as IBM are reporting that relational databases such as DB2 benefit performance-wise from using “compressed” data. Columnar databases are showing that it is possible to operate on data-warehouse data in “compressed” form, except when it actually must be shown to the user, and thereby get major performance improvements.
So what is my vision of the future of “dedupe”? What sort of architecture are we talking about, 3-5 years from now? One in which the storage tiers below fast disk (and, someday, all the tiers, all the way to main memory) have “dedupe”-type technology added to them. In this context, it was significant that EMC chose at EMC World to trumpet “dedupe” as a replacement for Virtual Tape Libraries (VTL). Remember, VTL typically allows read/query access to older, rarely accessed data within a minute; so, clearly, deduped data on disk can be reduped and accessed at least as fast. Moreover, as databases and applications begin to develop the ability to operate on “deduped” data without the need for “redupe”, the average performance of a “deduped” tier will inevitably catch up with and surpass that of one which has no deduplication or compression technology.
Let’s be clear about the source of this performance speedup. Let us say that all data is deduplicated and compressed, taking up 1/5 as much space, and all operations can be carried out on “deduped” data instead of its “raw” equivalents. Then retrieval from any tier will be 5 times as fast and 5 times as much data can be stored in the next higher tier for even more performance gains. Processing this smaller data will take ½ to 1/5 as much time. Adding all three together and ignoring the costs of “dedupe”/“redupe”, a 50% speedup of an update and an 80% performance speedup of a large query seems conservative. Because the system will only need “dedupe”/“redupe” rarely, “dedupe” when the data is first stored and “redupe” whenever the data is displayed to a user in a report or query response, and because the task could be offloaded to specialty “dedupe”/“redupe” processors, on average “dedupe”/“redupe” should add only minimal performance overhead to the system, and should subtract less than 10% from the performance speedup cited above. So, conservatively, I estimate the performance speedup from this future “dedupe” at 40-70%.
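One rough way to sanity-check this estimate is a back-of-the-envelope model in Python. Everything in it beyond the 1/5 space figure, the ½ to 1/5 processing figure, and the 10% overhead allowance is my own assumption, in particular the split of elapsed time between moving data and processing it (`io_share`); the function name is hypothetical.

```python
def fraction_of_time_saved(io_share, space_ratio, cpu_factor, overhead):
    """Estimate the fraction of elapsed time saved by working on reduced data.

    io_share    -- assumed share of original elapsed time spent moving data
    space_ratio -- reduced size / raw size (1/5 in the text)
    cpu_factor  -- processing time on reduced data / on raw data (1/2 to 1/5)
    overhead    -- allowance for "dedupe"/"redupe", as a fraction of original time
    """
    new_time = io_share * space_ratio + (1.0 - io_share) * cpu_factor
    return 1.0 - new_time - overhead


# Assumed 50/50 split between data movement and processing.
low = fraction_of_time_saved(0.5, 0.2, 0.5, 0.10)
high = fraction_of_time_saved(0.5, 0.2, 0.2, 0.10)
print(round(low, 2), round(high, 2))
```

Under that assumed 50/50 split the model gives savings of about 55% to 70%, which falls inside the 40-70% range estimated above. A workload that is mostly processing with little data movement would land nearer the low end.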
What effect is this going to have on IT, assuming that “the dedupe revolution” begins to arrive 1-2 years from now? First, it will mean that, 3-5 years out, the majority of storage, rather than a minority replacing some legacy backup, archive, or active-passive disaster recovery storage, will benefit from turning the clock back 4-6 years via “dedupe.” Practically, performance will improve dramatically and storage space per data item will shrink drastically, even as the amount of information stored continues its rapid climb – and not just in data access, but also in networking. Data handled in deduped form everywhere within a system also has interesting effects on security: the compression within “dedupe” is a sort of quick-and-dirty encryption that can make data pilferage by those who are not expert in “redupe” pretty difficult. Storage costs per bit of information stored will take a sharp drop; storage upgrades can be deferred and server and network upgrades slowed. When you add up all of those benefits, from my point of view, “the dedupe revolution” in many cases does potentially more for IT than the incremental benefits often cited for cloud.
Moreover, implementing “dedupe” is simply a matter of a software upgrade to any tier: memory, SSD, disk, or tape. So, getting to “the dedupe revolution” potentially requires much less IT effort than getting to cloud.
One more startling effect of dedupe: you can throw many of your comparative TCO studies right out the window. If I can use “dedupe” to store the same amount of data on 20% as much disk as my competitor, with 2-3 times the performance, the TCO winner will not be the one with the best processing efficiency or the greatest administrative ease of use, but the one with the best-squeezing “dedupe” technology.
What has been holding us back in the last couple of years from starting on the path to dedupe Nirvana, I believe, is customers’ wariness of a new technology. The EMC World slide definitively establishes that this concern is going away, and that there’s a huge market out there. Now, the ball is in the vendors’ court. This means that all vendors, not just EMC, will be challenged to merge storage “dedupe” and database “compress” technology to improve the average data “dry/wet” ratio, and “dedupify” applications and/or I/O to ensure more processing of data in its “deduped” state. (Whether EMC’s Data Domain acquisition complements its Avamar technology completely or not, the acquisition adds the ability to apply storage-style “dedupe” to a wider range of use cases; so EMC is clearly in the hunt). Likewise, IT will be challenged to identify new tiers/functions for “dedupe,” and implement the new “dedupe” technology as it arrives, as quickly as possible. Gentlemen, start your engines, and may the driest data win!
Oracle’s announcement of its “hybrid columnar compression” option for its Exadata product last summer clearly relates to the renewed attention paid to columnar databases over the last year by columnists such as Merv Adrian and Curt Monash. This example of a “hybrid” between columnar and row-oriented technology makes life yet more complicated for the IT buyer of data warehousing and business intelligence (BI) solutions. However, there does seem to be some agreement between the advocates of columnar and row-oriented that sheds some light on the appropriate places for each – and for hybrid.
Daniel Abadi in an excellent post summarizes the spectrum. If I understand his taxonomy correctly, row-oriented excels in “random reads” where a single record with more than 2-3 fields is accessed (or for single inserts and updates); columnar excels for queries across a large number of records whose columns (or “fields”) contain some commonality that makes for effective compression. The hybrids attempt to achieve 80% (my figure, plucked from the air) of the performance advantages of both.
To follow this thought a bit further, Daniel divides the hybrid technologies into block-oriented or “PAX”, fractured mirror, and fine-grained. The PAX approach stores in column format for particular disk blocks; the fractured mirror operates in a real-world disaster recovery environment and treats one copy of the data store as row-oriented, the other as column-oriented, sending transactions to either as appropriate; the fine-grained hybrid is similar to PAX, but compresses the columns of particular fields of particular tables, not a whole disk block. Oracle appears to be an example of the PAX approach, while Vertica has some features that appear to implement the fine-grained approach.
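For readers who think better in code, here is a small Python sketch of the layouts in question. It is a toy illustration under my own assumptions, not how Oracle, Vertica, or any other product lays out data: the function names are hypothetical, a "disk block" is simply a fixed number of rows, and run-length encoding stands in for whatever compression a real engine applies.

```python
from itertools import groupby

# Toy table: (order_id, state, amount)
records = [(1, "NY", 100), (2, "NY", 250), (3, "CA", 75), (4, "CA", 300)]


def row_layout(rows):
    """Row-oriented: all fields of a record sit next to each other."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Columnar: all values of one field sit next to each other."""
    return [list(column) for column in zip(*rows)]


def pax_layout(rows, rows_per_block):
    """PAX-style hybrid: split rows into blocks, store each block column-wise."""
    return [column_layout(rows[i:i + rows_per_block])
            for i in range(0, len(rows), rows_per_block)]


def run_length_encode(values):
    """Compress runs of repeated values, which columns tend to contain."""
    return [(value, len(list(run))) for value, run in groupby(values)]


print(row_layout(records))
print(column_layout(records))
print(pax_layout(records, 2))
print(run_length_encode(column_layout(records)[1]))
```

Fetching one whole record touches a single contiguous run in the row layout but one position in every column in the columnar layout, while a scan of the state field alone reads one short, highly compressible list in the columnar layout. That is the trade-off the taxonomy above describes, with the block-oriented layout sitting in between.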
I would argue that the future belongs more to pure columnar or fractured mirror than to row-oriented or the other two flavors of hybrid. Here is my reasoning: data stores continue to scale in size by 50% a year; the proportion of storage devoted to main memory and solid-state devices (SSDs) is likewise going up. The disadvantage of columnar in “random reads” is therefore decreasing, as databases are increasingly effective at ensuring that the data accessed is already in fast-response storage. In other words, it’s not just the number of I/Os, it’s what you are inputting from.
There is another factor: as database size goes up, the disadvantages of row-oriented queries increase. As database size increases, the number of “random reads” does not necessarily increase, but the amount of data that must be accessed in the average query does necessarily increase. Compression applied across all columns and indexes increases its advantage over selective compression and no compression at all in this case, because there is less data to upload. And the “query from hell” that scans all of a petabyte data warehouse is not only the extreme case of this increasing advantage; scaling the system’s ability to handle such queries is often a prime concern of IT.
I would also argue that the same trends make the pure or fractured-mirror columnar database of increasing interest for operational data stores that combine online transaction processing with lots of updates and continuous querying and reporting. For updates, the competition between columnar and row-oriented is getting closer, as many of these updates involve only 1-3 fields/columns of a row, while updates are most likely to affect the most recent data and therefore the data increasingly likely to be in main-memory cache. For inserts/deletes, updating in-memory indexes immediately along with “write to disk later” means that the need of columnar for multiple I/Os need not be a major disadvantage in many cases. And for online backup, the compressed data of the columnar approach wins hands down.
My takeaway with regard to Oracle Hybrid Columnar Compression, therefore, is that over time your mileage may vary. I would not be surprised if, someday, Oracle moved beyond “disk-block hybrid” to a fractured mirror approach, and that such an approach took over many of the high-end tasks for which vanilla row-oriented Oracle Database 11g r2 is now the database of choice.
Full disclosure: Dave is a friend, and a long-time colleague. He has just written an excellent book on Data Protection; hence the following musings.
As I was reading (a rapid first scan), I tried to pin down why I liked the book so much. It certainly wasn’t the editing, since I helped with that. The topic is reasonably well covered, albeit piecemeal, by vendors, industry associations, and bloggers. And while I have always enjoyed Dave’s stories and jokes, the topic does not lend itself to elaborate stylistic flourishes.
After thinking about it some more, I came to the conclusion that it’s Dave’s methodology that I value. Imho, Dave in each chapter will lay out a comprehensive and innovative classification of the topic at hand – data governance, information lifecycle management, data security – and then use that classification to bring new insight into a well-covered topic. The reason I like this approach is that it allows you to use the classification as a springboard, to come to your own conclusions, to extend the classification and apply it in other areas. In short, I found myself continually translating classifications from the narrow world of storage to the broad world of “information”, and being enlightened thereby.
One area in particular that called forth this type of analysis was the topic of cloud computing and storage. If data protection, more or less, involves considerations of compliance, operational/disaster recovery, and security, how do these translate to a cloud external to the enterprise? And what is the role of IT in data protection when both physical and logical information are now outside of IT’s direct control?
But this is merely a small part of the overall question of the future of IT, if external clouds take over large chunks of enterprise software/hardware. If the cloud can do it all cheaper, because of economies of scale, what justification is there for IT to exist any longer? Or will IT become “meta-IT”, applying enterprise-specific risk management, data protection, compliance, and security to their own logical part of a remote, multi-tenant physical infrastructure?
I would suggest another way of slicing things. It is reasonable to think of a business, and hence underlying IT, as cost centers, which benefit from commodity solutions provided externally, and competitive-advantage or profit centers, for which being like everything else is actually counter-productive. In an ideal world, where the cloud can always underprice commodity hardware and software, IT’s value-add lies where things are not yet commodities. In other words, in the long run, IT should be the “cache”, the leading edge, the driver of the computing side of competitive advantage.
What does that mean, practically? It means that the weight of IT should shift much more towards software and product development and initial use. IT’s product-related and innovative-process-related software and the systems to test and deploy them are IT’s purview; the rest should be in the cloud. But this does not make IT less important; on the contrary, it makes IT more important, because not only does IT focus on competitive advantage when things are going well, it also focuses on agile solutions that pay off in cost savings by more rapid adaptation when things are going poorly. JIT inventory management is a competitive advantage when orders are rising; but also a cost saver when orders are falling.
I realize that this future is not likely to arrive any time soon. The problem is that in today’s IT, maintenance costs crowd out new-software spending, so that the CEO is convinced that IT is not competent to handle software development. But let’s face it, no one else is, either. Anyone following NPD (new product development) over the last few years realizes that software is an increasing component in an increasing number of industries. Outsourcing competitive-advantage software development is therefore increasingly like outsourcing R&D – it simply doesn’t work unless the key overall direction is in-house. Whether or not IT does infrastructure governance in the long run, it is necessarily the best candidate to do NPD software-development governance.
So I do believe that IT has a future; but quite a different one from its present. As you can see, I have wandered far afield from Data Protection, thanks to Dave Hill’s thought-provoking book. The savvy reader of this tome will, I have no doubt, be able to come up with other, equally fascinating thoughts.
There is a wonderful short story by Jorge Luis Borges ("Pierre Menard, Author of the Quixote") that, I believe, captures the open source effort to come to terms with Windows – which in some quarters is viewed as the antithesis of the philosophy of open source. In this short story, a critic analyzes Don Quixote as written by someone four hundred years later – someone who has attempted to live his life so as to be able to write the exact same words as in the original Don Quixote. The critic’s point is that even though the author is using the same words, today they mean something completely different.
In much the same way, open source has attempted to mimic Windows on “Unix-like” environments (various flavors of Unix and Linux) without triggering Microsoft’s protection of its prize operating system. To do this, they have set up efforts such as Wine and ReactOS (to provide the APIs of Windows from Win2K onwards) and Mono (to provide the .NET APIs). These efforts attempt to support the same APIs as Microsoft’s, but with no knowledge of how Microsoft created them. This is not really reverse engineering, as the aim of reverse engineering is usually to figure out how functionality was achieved. These efforts don’t care how the functionality was achieved – they just want to provide the same collection of words (the APIs and functionality).
But while the APIs are the same, the meaning of the effort has changed in the twenty-odd years since people began asking how to make moving programs from Wintel to another platform (and vice versa) as easy as possible. Then, every platform had difficulties with porting, migration, and source or binary compatibility. Now, Wintel and the mainframe, among the primary installed bases, are the platforms that are most difficult to move to or from. Moreover, the Web, or any network, as a distinct platform did not exist; today, the Web is increasingly a place in which every app and most middleware must find a way to run. So imitating Windows is no longer so much about moving Windows applications to cheaper or better platforms; it is about reducing the main remaining barrier to being able to move any app or software from any platform to any other, and into “clouds” that may hide the underlying hardware, but will still suffer when apps are platform-specific.
Now, “moving” apps and “easy” are very vague terms. My own hierarchy of ease of movement from place to place begins with real-time portability. That is, a “virtual machine” on any platform can run the app, without significant effects on app performance, robustness, and usability (i.e., the user interface allows you to do the same things). Real-time portability means the best performance for the app via load balancing and dynamic repartitioning. Java apps are pretty much there today. However, apps in other programming languages are not so lucky, nor are legacy apps.
The next step down from real-time portability is binary compatibility. The app may not work very well when moved in real time from one platform to another, but it will work, without needing changes or recompilation. That’s why forward and backward compatibility matter: they allow the same app to work on earlier or later versions of a platform. As time goes on, binary compatibility gets closer and closer to real-time portability, as platforms adapt to be able to handle similar workloads. Windows Server may not scale as well as the mainframe, but they both can handle the large majority of Unix-like workloads. It is surprising how few platforms have full binary compatibility with all the other platforms; it isn’t just Windows to the mainframe but also compatibility between different versions of Unix and Linux. So we are a ways away from binary compatibility, as well.
The next step down is source-code compatibility. This means that in order to run on another platform, you can use the same source code, but it must be recompiled. In other words, source-code but not binary compatibility seems to rule out real-time movement of apps between platforms. However, it does allow applications to generate a version for each platform, and then interoperate/load balance between those versions; so we can crudely approximate real-time portability in the real world. Now we are talking about a large proportion of apps on Unix-like environments (although not all), but Windows and mainframe apps are typically not source-code compatible with the other two environments. Still, this explains why users can move Linux apps onto the mainframe with relative ease.
There’s yet another step down: partial compatibility. This seems to come in two flavors: higher-level compatibility (that is, source-code compatibility if the app is written to a higher-level middleware interface such as .NET) and “80-20” compatibility (that is, 80% of apps are source-code incompatible in only a few, easily modified places; the other 20% are the nasty problems). Together, these two cases comprise a large proportion of all apps; and it may be comforting to think that legacy apps will sunset themselves so that eventually higher-level compatibility will become de facto source-code compatibility. However, the remaining cases include many important Windows apps and most mission- and business-critical mainframe apps. To most large enterprises, partial compatibility is not an answer. And so we come to the final step down: pure incompatibility, only cured by a massive porting/rewrite effort that has become much easier but is still not feasible for most such legacy apps.
Why does all this matter? Because we are closer to Nirvana than we realize. If we can imitate enough of Windows on Linux, we can move most Windows apps to scale-up servers when needed (Unix/Linux or mainframe). So we will have achieved source-code compatibility from Windows to Linux, Java real-time portability from Linux to Windows, source-code compatibility for most Windows apps from Windows to Linux on the mainframe, and Linux source-code compatibility and Java real-time portability from Linux to the mainframe and back. It would be nice to have portability from z/OS apps to Linux and Windows platforms; but neither large enterprises nor cloud vendors really need this – the mainframe has that strong a TCO/ROI and energy-savings story for large-scale and numerous (say, more than 20 apps) situations.
So, in an irony that Borges might appreciate, open-source efforts may indeed allow lower costs and greater openness for Windows apps; but not because open source free software will crowd out Windows. Rather, a decent approximation of cross-platform portability with lower per-app costs will be achieved because these efforts allow users to leverage Windows apps on other platforms, where the old proprietary vendors could never figure out how to do it. The meaning of the effort may be different than it would have been 15 years ago; but the result will be far more valuable. Or, as Borges’ critic might say, the new meaning speaks far more to people today than the old. Sometimes, Don Quixote tilting at windmills is a useful thing.
A recent TechTarget posting by the SearchSOA editor picks up on the musings of Miko Matsumura of Software AG, suggesting that because most new apps in the cloud can use data in main memory, there’s no need for the enterprise-database SQL API; rather, developers should access their data via Java. OK, that’s a short summary of a more nuanced argument. But the conclusion is pretty blunt: “SQL is toast.”
I have no great love for relational databases – as I’ve argued for many years, “relational” technology is actually marketing hype about data management that mostly is not relational at all. That is, the data isn’t stored as relational theory would suggest. The one truly relational thing about relational technology is SQL: the ability to perform operations on data in an elegant, high-level, somewhat English-like mini-language.
What’s this Java alternative that Miko’s talking about? Well, Java is an object-oriented programming (OOP) language. By “object”, OOP means a collection of code and the data on which it operates. Thus, an object-oriented database is effectively chunks of data, each stored with the code to access it.
So this is not really about Larry Ellison/Oracle deciding the future, or the “network or developer [rather] than the underlying technology”, as Miko puts it. It’s a fundamental question: which is better, treating data as a database to be accessed by objects, or as data within objects?
Over the last fifteen years, we have seen the pluses and minuses of “data in the object”. One plus is that there is no object-relational mismatch, in which you have to fire off a SQL statement to some remote, un-Java-like database like Oracle or DB2 whenever you need to get something done. The object-relational mismatch has been estimated to add 50% to development times, mostly because developers who know Java rarely know SQL.
Then there are the minuses, the reasons why people find themselves retrofitting SQL invocations to existing Java code. First of all, object-oriented programs in most cases don’t perform well in data-related transactions. Data stored separately in each object instance uses a lot of extra space, and the operations on it are not optimized. Second, in many cases, operations and the data are not standardized across object classes or applications, wasting lots of developer time. Third, OOP languages such as Java are low-level, and specifically low-level with regard to data manipulation. As a result, programming transactions on vanilla Java takes much longer than programming on one of the older 4GLs (like, say, the language that Blue Phoenix uses for some of its code migration).
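To make the "low-level with regard to data manipulation" point concrete, here is a minimal sketch that computes the same per-customer total twice: once with a hand-coded loop over objects, once with a declarative SQL statement. It is written in Python with the standard sqlite3 module rather than in Java, purely for brevity; the Order class, the sample data, and both function names are hypothetical and invented for this illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Order:
    """Hypothetical object holding its own data."""
    customer: str
    amount: float


# Invented sample data, for illustration only.
ORDERS = [Order("acme", 120.0), Order("globex", 75.5), Order("acme", 30.0)]


def totals_by_hand(orders):
    """Data-in-the-object style: the developer writes the grouping logic."""
    totals = {}
    for order in orders:
        totals[order.customer] = totals.get(order.customer, 0.0) + order.amount
    return totals


def totals_by_sql(orders):
    """Data-oriented style: state what is wanted and let the engine do it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(o.customer, o.amount) for o in orders],
    )
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
    conn.close()
    return dict(rows)


print(totals_by_hand(ORDERS))
print(totals_by_sql(ORDERS))
```

For one grouping the hand-coded loop is short enough. The argument in the text is about what happens as filters, joins, and further aggregates pile up: each one is more loop code in the first style, and one more clause in the second.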
So what effect would storing all your data in main memory have on Java data-access operations? Well, the performance hit would still be there – but would be less obvious, because of the overall improvement in access speed. In other words, it might take twice as long as SQL access, but since we might typically be talking about 1000 bytes to operate on, we still see 2 microseconds instead of 1, which is a small part of response time over a network. Of course, for massive queries involving terabytes, the performance hit will still be quite noticeable.
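The arithmetic behind that claim can be sketched in a few lines. The 1-microsecond and 2-microsecond figures come from the paragraph above; the 50 ms network round trip and the assumption that access cost scales linearly with data size are mine, chosen only to make the proportions visible.

```python
# Figures from the text: about 1 microsecond per 1000 bytes via SQL,
# and about twice that via hand-coded object access.
SQL_SECONDS_PER_KB = 1e-6
OBJECT_SECONDS_PER_KB = 2e-6

# Assumption for illustration only: a 50 ms network round trip.
NETWORK_ROUND_TRIP_S = 0.050


def access_time(num_bytes, seconds_per_kb):
    """Access time in seconds, assuming cost scales linearly with data size."""
    return (num_bytes / 1000) * seconds_per_kb


def penalty_share(num_bytes):
    """Fraction of total response time due to the extra object-access cost."""
    extra = (access_time(num_bytes, OBJECT_SECONDS_PER_KB)
             - access_time(num_bytes, SQL_SECONDS_PER_KB))
    total = NETWORK_ROUND_TRIP_S + access_time(num_bytes, OBJECT_SECONDS_PER_KB)
    return extra / total


for label, size in [("1000 bytes", 1_000), ("1 terabyte", 10**12)]:
    print(f"{label}: penalty is {penalty_share(size):.4%} of response time")
```

Under these assumptions the penalty is a tiny fraction of a percent of response time for the 1000-byte case, and close to half of it for the terabyte case, where the network round trip no longer hides anything.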
What will not go away immediately is the ongoing waste of development time. It’s not an obvious waste of time, because the developer either doesn’t know about 4GL alternatives or is comparing Java-data programming to all the time it takes to figure out relational operations and SQL. But it’s one of the main reasons that adopting Java actually caused a decrease in programmer productivity compared to structured programming, according to some user feedback I once collected, 15 years ago.
More fundamentally, I have to ask if the future of programming is going to be purely object-oriented or data-oriented. The rapid increase in networking speed of the Internet doesn’t make data processing speed ignorable; on the contrary, it makes it all the more important as a bottleneck. And putting all the data in main memory doesn’t solve the problem; it just makes the problem kick in at larger amounts of data – i.e., for more important applications. And then there’s all this sensor data beginning to flow across the Web …
So maybe SQL is toast. If what replaces it is something that Java can invoke that is high-level, optimizes transactions and data storage, and allows easy access to existing databases – in other words, something data-oriented, something like SQL – then I’m happy. If it’s something like storing data as objects and providing minimal, low-level APIs to manipulate that data – then we will be back to the same stupid over-application of Java that croaked development time and scalability 15 years ago.
I was listening in on a discussion of a recent TPC-H benchmark by Sun (hardware) and its ParAccel columnar/in-memory-technology database (cf. recent blog posts by Merv Adrian and Curt Monash), when a benchmarker dropped an interesting comment. It seems that ParAccel used 900-odd TB of storage to store 30 TB of data, not because of inefficient storage or to “game” the benchmark, but because disks are now so large that in order to gain the performance benefits of streaming from multiple spindles into main memory, ParAccel had to use that amount of storage to allow parallel data streaming from disks to main memory. Thus, if I understand what the benchmarker said, in order to maximize performance, ParAccel had to use 900-odd 1-terabyte disks simultaneously.
What I find interesting about that comment is the indication that queuing theory still means something when it comes to database performance. According to what I was taught back in 1979, I/Os pile up in a queue when the number of requests is greater than the number of disks, and so at peak load, 20 500-MB disks can deliver a lot better performance than 10 1-GB disks – although they tend to cost a bit more. The last time I looked, at list price 15 TB of 750-GB SATA drives cost $34,560, or 25% more than 15 TB of 1-TB SATA drives.
The commenter then went on to note that, in his opinion, solid-state disk would soon make this kind of maneuver passé. I think what he’s getting at is that solid-state disk should be able to provide parallel streaming from within the “disk array”, without the need to go to multiple “drives”. This is because solid-state disk is main memory imitating disk: that is, the usual parallel stream of data from memory to processor is constrained to look like a sequential stream of data from disk to main memory. But since this is all a pretence, there is no reason that you can’t have multiple disk-memory “streams” in the same SSD, effectively splitting it into 2, 3, or more “virtual disks” (in the virtual-memory sense). It’s just that SSDs were so small in the old days, there didn’t seem to be any reason to bother.
To me, the fact that someone would consider using 900 TB of storage to achieve better performance for 30 TB of data is an indication that (a) the TPC-H benchmark is too small to reflect some of the user data-processing needs of today, and (b) memory size is reaching the point at which many of these needs can be met just with main memory. A storage study I have been doing recently suggests that even midsized firms now have total storage needs in excess of 30 TB, and in the case of medium-sized hospitals (with video-camera and MRI/CAT scan data) 700 TB or more.
To slice it finer: structured-data database sizes may be growing, but not as fast as memory sizes, so many of these (old-style OLTP) can now be done via main memory and (as a stopgap for old-style programs) SSD. Unstructured/mixed databases, as in the hospital example, still require regular disk, but now take up so much storage that it is still possible to apply queuing theory to them by streaming I/O in parallel from data striped on 100s of disks. Data warehouses fall somewhere in between: mostly structured, but still potentially too big for memory/SSD. But data warehouses don’t exist in a vacuum: the data warehouse is typically physically in the same location as unstructured/mixed data stores. By combining data warehouse and unstructured-data storage and striping across disks, you can improve performance and still use up most of your disk storage – so queuing theory still pays off.
How about the next three years? Well, we know storage size is continuing to grow, perhaps at 40-50%, despite the recession, as regulations about email and video data retention continue to push the unstructured-data “pig” through the enterprise’s data-processing “python.” We also know that Moore’s Law may be beginning to break down, so that memory size may be on a slower growth curve. And we know that the need for real-time analysis is forcing data warehouses to extend their scope to updatable data and constant incremental OLTP feeds, and to relinquish a bit of their attempt to store all key data (instead, allowing in-situ querying across the data warehouse and OLTP).
So if I had to guess, I would say that queuing theory will continue to matter in data warehousing, and that fact should be reflected in any new or improved benchmark. However, SSDs will indeed begin to impact some high-end data-warehousing databases, and performance-tuning via striping will become less important in those circumstances – that also should be reflected in benchmarks. However, it is plain that in such a time of transition, benchmarks such as TPC-H cannot fully and immediately reflect each shift in the boundary between SSD and disk. Caveat emptor: users should begin to make finer-grained decisions about which applications belong with what kind of storage tiering.
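The queuing argument in this post (more, smaller disks beat fewer, larger ones at peak load) can be illustrated with a back-of-the-envelope model. This is a sketch under assumptions of my own, not anything taken from the benchmark: each disk is treated as an independent M/M/1 queue, requests are spread evenly across disks, and every disk serves the same number of I/Os per second regardless of its capacity. The load and service figures are invented.

```python
def mean_response_time(total_arrival_rate, disks, service_rate_per_disk):
    """Mean time in system (seconds) for a request at one disk.

    Assumes each disk is an independent M/M/1 queue and that requests are
    spread evenly, so each disk sees total_arrival_rate / disks per second.
    """
    per_disk_arrivals = total_arrival_rate / disks
    if per_disk_arrivals >= service_rate_per_disk:
        # The queue grows without bound once arrivals reach service capacity.
        return float("inf")
    return 1.0 / (service_rate_per_disk - per_disk_arrivals)


# Invented figures: 900 I/O requests per second at peak, 100 I/Os per second
# per disk whatever its size.
PEAK_LOAD = 900
SERVICE_RATE = 100

for disks in (10, 20):
    seconds = mean_response_time(PEAK_LOAD, disks, SERVICE_RATE)
    print(f"{disks} disks: {seconds * 1000:.1f} ms mean response time")
```

With these numbers, the same total capacity on 20 half-size disks responds several times faster than on 10 full-size disks, because each spindle runs at 45% utilization instead of 90%. That is the effect that spreading 30 TB across hundreds of drives buys.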
Yesterday, I participated in Microsoft’s grand experiment in a “virtual summit”, by installing Microsoft LiveCam on my PC at home and then doing three briefings by videoconferencing (two user briefings lacked video, and the keynote required audio via phone). The success rate wasn’t high; in two of the three briefings, we never did succeed in getting both sides to view video, and in one briefing, the audio kept fading in and out. From some of the comments on Twitter, many of my fellow analysts were unimpressed by their experiences.
However, in the one briefing that worked, I found there was a different “feel” to the briefing. Trying to isolate the source of that “feel” – after all, I’ve seen jerky 15-fps videos on my PC before, and video presentations with audio interaction – I realized that there was one aspect to it that was unique: not only did I (and the other side) see each other; we also saw ourselves. And that’s one possibility of videoconferencing that I’ve never seen commented on (although see http://www.editlib.org/p/28537).
The vendor-analyst interaction, after all, is an alternation of statements meant to convince: the vendor, about the value of the solution; the analyst, about the value of the analysis. Each of those speaker statements is “set up” immediately previously by the speaker acting as listener. Or, to put it very broadly, in this type of interaction a good listener makes a good convincer.
So the key value of a videoconference of this type is instant feedback about how one is coming across as both listener and speaker. With peripheral vision the speaker can adjust his or her style so he/she appears more convincing to himself/herself; and the listener can adjust his or her style so as to emphasize interest in the points that he/she will use as a springboard to convince in his/her next turn as speaker. This is something I’ve found to work in violin practice as well: watching oneself allows the player to move quickly to playing with the technique and expression that one is aiming to employ.
So, by all means, criticize the way the system works intermittently and isn’t flexible enough to handle all “virtual summit” situations, the difficulties in getting it to work, and the lack of face-to-face richer information-passing. But I have to tell you, if all of the summit had been like that one brief 20 minutes where everything worked and both sides could see the way they came across, I would actually prefer that to face-to-face meetings.
“O wad some Pow’r the giftie gie us,” said my ancestors’ countryman, Scotsman Robbie Burns, “To see oursels as ithers see us.” The implication, most have assumed, is that we would be ashamed of our behavior. But with something like Microsoft’s LiveCam, I think the implication is that we would immediately change our behavior so we liked what we saw; and would be the better for our narcissism.
It seems as if I’m doing a lot of memorializing these days – first Sun, now Joseph Alsop, CEO of Progress Software since its founding 28 years ago. It’s strange to think that Progress started up shortly before Sun, but took an entirely different direction: SMBs (small-to-medium-sized businesses) instead of large enterprises, software instead of hardware. So many database software companies since that time that targeted large enterprises have been marginalized, destroyed, crowded out, or acquired by IBM, CA (acting, in Larry Ellison’s pithy phrase, as “the ecosystem’s needed scavenger”), and Oracle.
Let’s see, there’s IDMS, DATACOM-DB, Model 204, and ADABAS from the mainframe generation (although Cincom with TOTAL continues to prosper), and Ingres, Informix, and Sybase from the Unix-centered vendors. By contrast, Progress, FileMaker, iAnywhere (within Sybase), and InterSystems (if you view hospital consortiums as typically medium-scale) have lasted and have done reasonably well. Of all of those SMB-focused database and development-tool companies, judged in terms of revenues, Progress (at least until recently) has been the most successful. For that, Joe Alsop certainly deserves credit.
But you don’t last that long, even in the SMB “niche”, unless you keep establishing clear and valuable differentiation in customers’ minds. Looking back over my 16 years of covering Progress and Joe, I see three points at which Progress made a key change of strategy that turned out to be right and valuable to customers.
First, in the early ’90s, they focused on high-level database-focused programming tools on top of their database. This was not an easy thing to do; some of the pioneers, like Forte (acquired by Sun) and PowerBuilder (acquired by Sybase), had superb technology that was difficult to adapt to new architectures like the Web and low-level languages like Java. But SMBs and SMB ISVs continue to testify to me that applications developed on Progress deliver SMB TCO and ROI superior to the Big Guys.
Second, they found the SMB ISV market before most if not all other ISVs. I still remember a remarkable series of ads shown in one of their industry analyst days featuring a small shop whose owner, moving as slow as molasses, managed to sell one product to one customer during the day – by instantly looking up price and inventory and placing the order using a Progress-ISV-supplied customized application. That was an extreme; but it captured Progress’ understanding that the way to SMBs’ hearts was no longer just directly or through VARs, but also through a growing cadre of highly regional and niche-focused SMB ISVs. By the time SaaS arrived and folks realized that SMB ISVs were particularly successful at it, Progress was in a perfect position to profit.
Third, they home-grew and took a leadership position in ESBs (Enterprise Service Buses). It has been a truism that SMBs lag in adoption of technology; but Progress’ ESB showed that SMBs and SMB vendors could take the lead when the product was low-maintenance and easily implemented – as opposed to the application servers large-enterprise vendors had been selling.
As a result of Joe Alsop and Progress, not to mention the mobile innovations of Terry Stepien and Sybase, the SMB market has become a very different place – one that delivers new technology to large enterprises as much as large-enterprise technology now “trickles down” to SMBs. The reason is that what was sauce for the SMB goose was also sauce for the workgroup and department in the large enterprise – if it could be a small enough investment to fly under the radar of corporate standards-enforcers. Slowly, many SMBs have grown into “small large” enterprises, and many workgroups/departments have persuaded divisions, lines of business, and even data centers in large enterprises to see the low-cost and rapid-implementation benefits of an SMB-focused product. Now, big vendors like IBM understand that they win with small and large customers by catering to the needs of regional ISVs instead of the enterprise-app suppliers like SAP and Oracle. Now, Progress does a lot of business with large enterprises, not just SMBs.
Running a company focused on SMB needs is always a high-wire act, with constant pressure on the installed base by large vendors selling “standards” and added features, lack of visibility leading customers to worry about your long-term viability (even after the SMB market did far better in the Internet bust than large-enterprise vendors like Sun!), and constant changes in the technology that bigger folk have greater resources to implement. To win in the long term, you have to be like Isaiah Berlin’s hedgehog – have one big unique idea, and keep coming up with a new one – to counter the large-vendor foxes, who win by amassing lots of smaller ideas. Many entrepreneurs have come up with one big idea in the SMB space; but Joe Alsop is among the few that have managed to identify and foster the next one, and the one after that. And he managed to do it while staying thin.
But perhaps the greatest testimony to Joe Alsop is that I do not have to see his exit from CEO-ship as part of the end of an era. With Sun, with CA as Charles Wang left, with Compuware, the bloom was clearly off the old business-model rose. Progress continues to matter, to innovate, and to be part of an increase in importance of the SMB market. In fact, this is a good opportunity to ask yourself, if you’re an IT shop, whether cloud computing means going to Google, Amazon, IBM, and the like, or the kind of SMB-ISV-focused architecture that Progress is cooking up. Joe Alsop is moving on; the SMB market lives long and prospers!
Yesterday, I had a very interesting conversation with Mike Hoskins of Pervasive about his company’s innovative DataRush product. But this blog post isn’t about DataRush; it’s about the trends in the computer industry that I think DataRush helps reveal. Specifically, it’s about why, despite the fact that disks remain much slower than main memory, most processes, even those involving terabytes of data, are CPU-bound, not I/O-bound.
Mike suggested, iirc, that around 2006 Moore’s Law – in which every 2 years, approximately, the bit capacity of a computer chip doubled, and therefore processor speed correspondingly increased – began to break down. As a result, software written to assume that increasing processor speed would cover all programming sins against performance – e.g., data lockup by security programs when you start up your PC – is now beginning to break down, as the inevitable scaling of demands on the program is not met by scaling of program performance.
However, thinking about the way in which DataRush, or Vertica, achieve higher performance – in the first case by achieving higher parallelism within a process, in the second case by slicing relational data by columns of same-type data instead of rows of different-sized data – suggests to me that more is going on than just “software doesn’t scale any more.” At the very high end of the database market, which I follow, the software munching on massive amounts of data has been unable to keep up with disk I/O for the last 15 years, at least.
Thinking about CPU processing versus I/O, in turn, reminded me of Andrew Tanenbaum, the author of great textbooks on Structured Computer Organization and Computer Networks in the late 1970s and 1980s. Specifically, in one of his later works, he asserted that the speed of networks was growing faster than the speed of processors. Let me restate that as a Law: the speed of data in motion grows faster than the speed of computing on data at rest.
The implications of Tanenbaum’s Law and the death of Moore’s Law are, I believe, that most computing will be, for the foreseeable future, CPU-bound. Think of it in terms of huge query processing that reviews multiple terabytes of data. Data storage grows by 60% a year, and we would anticipate that the time to get a certain percent of that data off the disk to send to main memory would be greater each year, if networking speed was growing as fast as processor speed, and therefore slower than stored data. Instead, even today’s basic SATA drives can deliver multiple gigabytes/second – faster than the clock speeds of today’s microprocessors. To me, this says that disks are shoving the data at processors faster than they can process it. And the death of Moore’s Law just makes things worse.
The implications are that the fundamental barriers to scaling computing are not processor geometry, but the ability to parallelize the two key “at rest” tasks of the processor: storing the data in main memory, and operating on it. In order to catch up to storage growth and network speed growth, we have to throw as many processors as we can at a task in parallel. And that, in turn, suggests that the data-flow architecture needs to be looked at again.
The concept of today’s architecture is multiple processors running multiple processes in parallel, each process operating on a mass of (sometimes shared) data. The idea of the data-flow architecture is to split processes into unitary tasks, and then flow parallel streams of data under processors which carry out each of those tasks. The distinction here is that in one approach, the focus is on parallelizing multi-task processes that the computer carries out on a chunk of data at rest; in the other, the focus is on parallelizing the same task carried out on a stream of data.
Imagine, for instance, that we were trying to find the best salesperson in the company in the last month, with a huge sales database not already prepared for the query. In today’s approach, one process would load the sales records into main memory in chunks, and for each chunk, maintain a running count of sales for every salesperson in the company. Yes, the running count is to some extent parallelized. But the record processing is often not.
Now imagine that multiple processors are assigned the task of looking at each record as it arrives, with each processor keeping a running count for one salesperson. Not only are we speeding up the access to the data uploaded from disk by parallelizing that; we are also speeding up the computation of running counts beyond that of today’s architecture, by having multiple processors performing the count on multiple records at the same time. So the two key bottlenecks involving data at rest – accessing the data, and performing operations on the data – are lessened.
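To make the salesperson example concrete, here is a minimal sketch in Python of the data-flow version. It is my own illustration, not DataRush code: the names `counter_worker` and `best_salesperson`, the record layout (salesperson, amount), and the choice of four workers are all assumptions. Each worker owns a subset of salespeople and keeps their running totals as records stream past.

```python
import queue
import threading
from collections import defaultdict

SENTINEL = object()  # marks the end of a record stream


def counter_worker(inbox, totals):
    """Consume (salesperson, amount) records and keep running totals."""
    while True:
        record = inbox.get()
        if record is SENTINEL:
            break
        salesperson, amount = record
        totals[salesperson] += amount


def best_salesperson(records, num_workers=4):
    """Route each record to the worker that owns its salesperson, then merge.

    Hypothetical helper: records is any iterable of (salesperson, amount).
    """
    inboxes = [queue.Queue() for _ in range(num_workers)]
    partials = [defaultdict(int) for _ in range(num_workers)]
    workers = [
        threading.Thread(target=counter_worker, args=(inboxes[i], partials[i]))
        for i in range(num_workers)
    ]
    for worker in workers:
        worker.start()

    # The "flow": every record goes to exactly one stream, chosen by salesperson.
    for salesperson, amount in records:
        inboxes[hash(salesperson) % num_workers].put((salesperson, amount))
    for inbox in inboxes:
        inbox.put(SENTINEL)
    for worker in workers:
        worker.join()

    # Each salesperson lives in exactly one partial result, so merging is trivial.
    merged = {}
    for part in partials:
        merged.update(part)
    return max(merged.items(), key=lambda item: item[1])


sales = [("Ana", 120), ("Bo", 300), ("Ana", 250), ("Cy", 90), ("Bo", 40)]
print(best_salesperson(sales))  # ('Ana', 370)
```

Note that Python threads only show the shape of the idea; in the standard interpreter they do not run the counting on several cores at once. A real data-flow engine would map each stream to its own core.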
Note also that the immediate response to the death of Moore’s Law is the proliferation of multi-core chips – effectively, 4-8 processors on a chip. So a simple way of imposing a data-flow architecture over today’s approach is to have the job scheduler in a symmetric multiprocessing architecture break down processes into unitary tasks, then fire up multiple cores for each task, operating on shared memory. If I understand Mike Hoskins, this is the gist of DataRush’s approach.
But I would argue that if I am correct, programmers also need to begin to think of their programs as optimizing processing of data flows. One could say that event-driven programming does something similar; but so far, that’s typically a special case, not an all-purpose methodology or tool.
Recently, to my frustration, a careless comment got me embroiled again in the question of whether Java or Ruby or whatever is a high-level language – when I strongly feel that these do poorly (if examples on Wikipedia are representative) at abstracting data-management operations and therefore are far from ideal. Not one of today’s popular dynamic, functional, or object-oriented programming languages, as far as I can tell, thinks about optimizing data flow. Is it time to merge them with LabVIEW or VEE?
So many memories …
I first became really aware of Sun in the late ‘80s, when I was working for Prime. At the time, Sun was one of the two new competitors in the area of engineering workstations – itself a new market. The key area of competition was cross-machine file systems that made multiple workstations look like one system – in other words, you’d invoke a program on one machine, and if it didn’t reside there, the file system would do a remote procedure call (RPC) to the other. Sun’s system was called NFS.
Yes, Sun won that war – but the way it did it was a harbinger of things to come. With more than a little chutzpah, Sun insisted that Unix was the right way to do networked file systems. Now, at the time, there was nothing to indicate that Unix was better (or worse) than any other operating system for doing cross-machine operations. But Sun’s marketing tapped into a powerful Movement in computing. This Movement – geeks, first movers, technology enthusiasts, anti-establishment types – gave Sun a niche where established players like Prime and DEC could not crowd Sun off buyers’ short lists. The Movement was very pro-Unix, and that allowed Sun to establish itself as the engineering-workstation market leader.
Sun’s next marketing message appealed to the Movement very much: it said it was going down-market and attacking Microsoft. In fact, that became a feature of Sun for the next 15 years: Scott McNealy would get up at Sun sales, investor, and analyst events and make cracks about Bill Gates and Microsoft. Of course, when you looked closely at what was happening, that was pretty much hogwash: Sun wasn’t cutting into the PC market, because it couldn’t cut prices that much. Instead, Sun’s pricing was cutting into minicomputer low-end markets. Because PCs and Novell LANs were cutting into those markets far more, the demise of minicomputer vendors is rightly ascribed to PCs. But Sun’s real market growth came from moving up-market.
As everyone remembers, Scott McNealy as the public face of Sun had a real gift for pithy phrases criticizing competitors that really stuck in people’s minds. My favorite is the time in the early 1990s when IBM as Big Blue joined with Apple (corporate color: red) in a consortium to develop a common standard for some key market and crowd others out: Scott derided the result as “purple applesauce.”
But what really saved Sun in the early 90s was not the Movement nor Scott’s gift for credulity-straining claims. First among engineering-workstation vendors, Sun decided to move into servers. This took Sun from techie markets (although not consumer markets) to medium-scale to large-scale corporate IT – not the easiest market to crack. But at the time, lines of business were asserting their independence from central IT by creating their own corporate networks, and Sun was able to position itself against IBM, Unisys, NCR/AT&T, and HP in growing medium-scale businesses and lines of business. While Silicon Graphics (number 2 in workstations) waited too long to move into servers and spent too much time trying to move down-market to compete with Microsoft, Sun grew in servers as the workstation market matured.
I remember talking to the trade press at that time and saying that Sun’s days of 90%/year revenue growth were over. As a large company, you couldn’t grow as fast as a smaller one, and especially not in the server market. I wasn’t wrong; but I certainly didn’t anticipate Sun’s amazing growth rate in the late 90s. It was all due to the Internet boom in Silicon Valley. Every startup wanted “an Oracle on a Sun”. Sun marketing positioned Java as part of the Movement – an object-oriented way of cutting through proprietary barriers to porting applications from one machine to another – and all the anti-Microsoft users flocked to Sun. Never mind the flaws in the language or the dip in programmer productivity as Java encouraged lower-level programming for a highly complex architecture; the image was what mattered.
Sun’s marketing chutzpah reached its height in those years. I remember driving down the Southeast Expressway in Boston one day and seeing a Sun billboard that said “We created the Internet. Let us help you create your Internet.” Well, I was at Computer Corporation of America with full access to the Internet back in the early 1980s when the Internet was being “created”, and I can tell you that BBN was the primary creator of the Internet aside from the government and academia, and Sun was far less visible in Internet newsgroups than most other major computing vendors. Yet when I pointed this out to a Sun marketer, he was honestly surprised. Ah, the Koolaid was sweet in those days.
It is fashionable now to say that Sun’s downfall came because it was late to embrace Linux. It is certainly true that Sun’s refusal to move aggressively to Linux cost it badly, especially because it ran counter to the Movement, and my then colleague Bill Claybrook deserves lots of credit for pushing them early and hard to move to Linux. But I think the real mistake was in not moving from a company focused on hardware to one focused on software during the Internet-boom years. Oh, there were all sorts of excuses – B-school management theory said you should focus on services, and Sun did beef up its middleware – but it was always reacting, always behind, never really focused on differentiation via software.
The mood at analyst meetings during the Internet-bust years was highly defensive: You guys don’t get us, we’ve shown before that we see things no one else sees and we’ll do it again, all our competitors are suffering too. And yet, the signs were there: talking to an Oracle rep, it became clear that Sun was no longer one of their key partners.
I am less critical of Jonathan Schwartz than some other commentators I have read. I think that he was dealt a hand that would lose no matter how he played it. The Internet-boom users had gone forever, leaving a Sun with too high a cost structure to make money from the larger corporations and financial-services markets that remained. In fact, I think that however well he executed, he was right to focus on open source (thereby making peace with the Movement) and software. At my last Sun do in 2007 when I was with Illuminata, the sense of innovation in sync with the Movement was palpable – even if Sun was mostly catching up to what other open-source and Web efforts like Google’s were doing. But underlying the marketers’ bravado at that event was depression at the endless layoffs that were slowly paring back the company. The analysts’ dinner was populated as much by the ghosts of past Sun employees as by those that remained.
Even as a shadow of its former self, I am going to miss Sun. I am going to miss the techie enthusiasm that produced some really good if way over-hyped ideas that continue to help move the industry forward. I am going to miss many of the smart, effective marketers and technologists still there that will probably never again get such a bully pulpit. I am going to miss a competitor that still was part of the ongoing debate about the Next Big Thing, a debate which more often than not has produced the Next Big Thing. I’m not going to miss disentangling marketing claims that sound good but aren’t true, or competitor criticisms that are great sound bites but miss the point, while watching others swallow the Sun line, hook and sinker included; but the days when Sun’s misdeeds in those areas mattered are long past.
Rest in peace, Sun. All you employees and Sun alumni, take a bow.
Never having had the chance to study system dynamics at Sloan School of Management (MIT), I was very happy recently to have the opportunity to read Donella Meadows’ “Thinking in Systems”, an excellent primer on the subject – I recommend it highly. Reading the book sparked some thoughts on how system dynamics and the concept of business and IT agility complement each other – and more importantly, how they challenge each other fundamentally.
Let’s start with the similarities. System dynamics says that most systems grow over time; my concept of business agility would argue that growth is a type of change, and agile businesses should do better at handling that type of change. System dynamics says that people have a lot to say about system functioning, and people resist change; I would argue that business agility includes building organizations in which people expect and know how to handle change, because they know what to do. System dynamics says that to change system behavior, it is better to change the system than replace components (including people); business agility says that business processes if changed can increase the agility of the company, even if the same people are involved.
What about the differences? System dynamics really doesn’t have a good analog for the proactive side of agility. They mention resilience, which is really the ability of a system to react well to a wider range of external changes, they mention “self-organization” as elaborating the complexity of systems, and they talk about a system having a certain amount of room to grow without reaching constraints or limits; but there is an implicit assumption that unexpected or planned change is the exception, not the norm. Likewise, according to system dynamics, tuning the system to handle changes better is in the long run simply delaying the inevitable; a more effective redesign changes the system itself, as profoundly as possible. Agility says that change is the norm, that redesign should be aimed at improving the ability to “proact” and the ability to react, and that increased agility has a value independent of what system is being used.
System dynamics poses challenges to the practice of business agility, as well. It says that how agility is to be improved matters: have we found the right “leverage point” for the improvement, have we understood well enough how people will “game the system”, have we anticipated future scenarios in which the “agilified” process generates new constraints and reaches new limits? To my mind, the key question that system dynamics raises about business agility is, are we measuring it without incorporating the importance of the unmeasurable? Or, to put it in system-dynamics terms, in attempting to capture the business value of increased agility in terms of costs, revenues, and upside and downside risks, are we “emphasizing quantity over quality”?
I think, based on the data on agility improvements I’ve seen so far, that one of the most interesting ideas about business agility is that focusing on agility results in doing better in long-term costs, revenues, and upside/downside risks than a strategy focused on costs, revenues, or risks themselves. If this is true, and if organizations set out to improve agility “for agility’s sake”, I don’t think system dynamics and agility strategies are in disagreement: both want to create a process, an organization, a business built to do the right thing more often (“quality”), not one to improve a cost or revenue metric (“quantity”). Or, as Tom Lehrer the comedian once put it, we are “doing well by doing good”.
So my most important take-away from gaining an admittedly basic understanding of system dynamics is that metrics like AFI (agility from investment, which attempts to measure the long-term effects of a change in agility on costs, revenues, and risks) explain the relative agility effects of various strategies, but should not be used to justify strategies not focused on agility that may improve costs, revenues, and/or risks in the short term, but will actually have a negative effect in the long term. As Yoda in Star Wars might put it: “Build to change or be not agile; there is no accidental agility.”
Recently I’ve been writing down some thoughts about business and IT agility: What they are, how they evidence themselves in the bottom line and in risk (or its proxy, public-company beta), and how to measure them. At the same time, in my study of “data usefulness” (how much potentially useful data actually gets used effectively by the appropriate target), I included a factor called ‘data agility,’ or the ability of the organization to keep up to date with new useful data sources. What I want to do now is consider a larger set of questions: what does agility mean in the context of the organizational process that ideally gathers all potentially useful information in a timely fashion and leverages it effectively, how can we measure it, and what offers the biggest “bang for the agility-investment buck”?
My initial pass at “information-handling agility” is that there are four sources of change that are key: Unexpected changes in the environment, planned changes in the non-informational organization/process (which also should cover expected changes in the environment), unplanned changes in the non-informational organization, and new data sources/types. Therefore, information-handling agility includes the ability to react rapidly and effectively in supplying information about unexpected changes in the environment, the proactively planned but timely supply of information about expected changes in the environment, the ability to react rapidly and effectively by supplying different types of information due to an unexpected internal change, and the ability to proactively seek and effectively use new data sources.
Note that, strictly speaking, this doesn’t cover all cases. For example, it doesn’t cover outside change during the information-handling process – but that’s reasonable, if in most cases that change either doesn’t change the ultimate information use or it’s so important that it’s already handled by special “alerts”, as seems to be the case in real-world businesses. Likewise, the definition of data agility covers only new data sources, not all changes in the data; again, in the real world this seems to be much less of a problem.
To see how this can be measured and what offers the biggest “bang for the buck,” let’s create a “thought experiment”. Let’s take Firm A, a typical firm in my “data usefulness” research, and apply the Agility From Investment (AFI) metric, defined as AFI = (1 + % change [revenues] – % change [development and operational change in costs]) * (1 + %change [upside risk] - % change [downside risk]) - 1. Let’s assume that Firm A invests in decreasing the time it takes to deliver data to the average user from 7 days to 3 ½ days – and ensures that the data can be used as effectively as before. Btw, the different types of agility won’t show up again, but they underlie the analysis.
We can see that in the long term, assuming its competitors don’t match it, the “timeliness” strategy will improve revenues by increasing the agility of new-product development – but only if new-product development is agile itself. If we assume an “average” NPD out there of ½ the firms being “agile enough”, then we have 15% improvement in ROI x ½ = 7 ½ % (the maximum change in long-term revenues). Since we have only improved timeliness by ½, the maximum change becomes 3 ¾ %; the typical data usefulness is about 1/3, taking it down to 1 ¼ %; and information’s share of this takes it below 1%.
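The chain of discounts in that paragraph is easy to lose track of, so here is a small Python sketch of the arithmetic. The function name `discounted_gain` is mine, and I stop at the data-usefulness step because the post gives no number for “information’s share”.

```python
from fractions import Fraction


def discounted_gain(max_gain_pct, factors):
    """Apply each discount factor in turn and return every intermediate value."""
    gain = Fraction(max_gain_pct)
    steps = [gain]
    for factor in factors:
        gain *= factor
        steps.append(gain)
    return steps


steps = discounted_gain(
    15,                    # 15% improvement in ROI
    [
        Fraction(1, 2),    # half of firms have "agile enough" NPD
        Fraction(1, 2),    # timeliness improved by only 1/2
        Fraction(1, 3),    # typical data usefulness is about 1/3
    ],
)
print([float(s) for s in steps])  # [15.0, 7.5, 3.75, 1.25]
```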
Costs are seemingly a different story. Reducing time to deliver information affects not only the per-transaction IT costs of delivery, but also every business process that depends on that information. So it is reasonable to assume a 1% decrease in NPD costs, but also a 5% decrease in operational costs, for an average of 3%. Meanwhile, the increase in upside risk goes through a similar computation as for revenues, yielding less than a 1% increase in that type of risk.
That leaves downside risk. Those risks appear to be primarily failure to get information in time to react effectively to a disaster, and failure to get the right information to act effectively. Because the effect on risk increases as the response time gets closer to zero, it is reasonable to estimate the effect on downside risk at perhaps a 5% decrease; and since only 1/3 of the data is useful, that takes it down below 2%. Putting it all together, AFI = (1 + 1% + 3%) * (1 + 1% + 2%) – 1 = a 7% overall improvement in the corporation’s bottom line and risk – and that’s being optimistic.
Now suppose we invested in doubling the percentage of potentially useful data that is effectively used – i.e., not timeliness but accuracy, consistency, scope, fit with the needs of the user/business, and analyzability. Performing the same computations, I come out with AFI = (1 + 1% + 1.5%) * (1 + 7.5% + 1%) – 1 = an 11% long-term agility improvement.
One more experiment: suppose we invested in immediately identifying key new data sources and pouring them into our information-handling process, rather than waiting ½ year or more. Again, applying the same computations, but with one more assumption (a high degree of rapid change in the sources of key data), AFI = (1 + 2% + 2%) * (1 + 7.5% + 8%) – 1 = a 20% improvement in long-term contribution to the company’s bottom line.
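For anyone who wants to rerun the three thought experiments with their own assumptions, here is a minimal Python sketch of the AFI formula as defined above. Decreases in cost and in downside risk are entered as negative changes, which is why they show up as additions in the worked figures. The function name `afi` and the scenario labels are mine; for the second and third experiments, which figure is revenue and which is cost (and likewise for the two risks) is my reading of the order of terms in the formula, and the product comes out the same either way.

```python
def afi(revenue_change, cost_change, upside_risk_change, downside_risk_change):
    """Agility From Investment; all arguments are fractional changes (0.01 = 1%)."""
    return ((1 + revenue_change - cost_change)
            * (1 + upside_risk_change - downside_risk_change) - 1)


scenarios = {
    # Halving delivery time: +1% revenue, -3% cost, +1% upside, -2% downside.
    "timeliness": afi(0.01, -0.03, 0.01, -0.02),
    # Doubling the share of data used effectively.
    "data usefulness": afi(0.01, -0.015, 0.075, -0.01),
    # Picking up key new data sources immediately.
    "data agility": afi(0.02, -0.02, 0.075, -0.08),
}

for name, value in scenarios.items():
    print(f"{name}: {value:.1%}")
```

Rounded to whole percentages, these come out to the 7%, 11%, and 20% figures quoted in the text.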
Now, mind you, I have carefully shaped my assumptions, so please do not assume that this analysis is exactly what any firm will experience over the long term. There are, however, two take-aways that do not seem to be part of the general consensus today.
First, firms are probably typically underestimating the long-term effects of efforts aimed at improving data usefulness (including timeliness, effectiveness, and data agility). Reasonably enough, they are concerned with immediate decision-making and strategies that affect information-handling tangentially and piecemeal. However, the result, as I have noted, is a “whack-a-mole” game in which no sooner is one information-handling problem tackled than another pops up.
Second, firms are also clearly underestimating the long-term benefits of improving data usefulness compared to improving timeliness, and of improving data agility compared to improving both timeliness and data usefulness. The reason for that appears to be that firms don’t appreciate the value for new-product development of inserting better and new data in the new-product development process, compared to more timely delivery of the same old data.
I offer up one action item: IT organizations should seriously consider adding a Data Agility role. The job would be monitoring all organizational sources of data from the environment – especially the Web – and ensuring that they are appropriately added to the information-handling inputs and process as soon as possible.
My personal experiences as a programmer have led me to anticipate – apparently correctly – that agile development would deliver consistently better results by cost, profit, and agility metrics. What about the down side? Or, to put it another way, what else could users do that agile development hasn’t done?
After I left Prime, I started as a computer industry analyst at The Yankee Group. I will always be grateful to Yankee for giving me the broadest possible canvas on which to paint my visions of what could be – as I used to put it, I covered “everything below the mainframe”. Of course, in the early 90s that was only ½ of the computing field … Anyway, one of the things I wrote was a comprehensive report called “Software Development: Think Again or Fail”. [yes, I know; I was a bit immodest in those days]
The reason I bring this up is that two things mentioned in that report seem to be missing in agile development:
1. High-level tools, languages, and components; and
2. Tasking programmers with keeping track of particular markets.
As far as I can see, agile theory and practice doesn’t give a hoot whether those programmers are using Java, Perl, Python, or Ruby on Rails. I use those examples because they all have been touted as ways to speed up programming in the Java/open-source world, and because only tunnel vision leads people to conclude that they’re anything but dolled-up 3GLs that do very well at handling function-driven programming and only adequately at rapidly generating data-access and user-interface code. Compare that to M204 UL, drag-and-drop VPEs (visual programming environments), and the like, and I am forced to conclude that in some respects, these tools are still less productive than what was available 13 years ago. The point is that, even if agile succeeds in improving the speed of the individual programmer, the other productivity shoe will never drop as long as the first instinct of managers and programmers is to reach for a 3GL.
The second point is that although agile does well with making sure that programmers talk to individual end users, that is different from following the entire software market. Following a market gives context to what the end user wants, and allows the designer to look at where the market appears to be going, rather than where end users have been.
So my caution about agile development is that my experience tells me that so much more can be done. The results are startling and consistent; but they could be more so. Agile development deserves praise; but the worst thing for software development would be to assume that no more fundamental changes in the paradigm need be done.
The more I write about agile software development, Key Agility Indicators, and users seeing an environment of rapid change as their most worrisome business pressure, the more I wonder why agility, or flexibility, is not a standard way of assessing how a business is doing. Here's my argument:
Agility is a different measure and target from costs or revenues or risks. It's really about the ability of the organization to respond to significant changes in its normal functioning or new demands from outside, rapidly and effectively. It's not just costs, because a more agile organization will reap added revenues by beating out its competitors for business and creating new markets. It's not just profits or revenues, because a more agile organization can also be more costly, just as an engine tuned for one speed can perform better at that speed than one tuned to minimize the cost of changing speeds; and bad changes, such as a downturn in the economy, may decrease your revenues no matter how agile you are. It's not just risk, because agility should involve responding well to positive risks and changes as well as negative ones, and often can involve generating changes in the organization without or before any pressures or risks.
That said, we should understand how increased or decreased agility impacts other business measures, just as we should understand how increased costs affect cash flow, profits, and business risks, or increased revenues affect costs (e.g., are we past the point where marginal revenue = marginal cost?), or the likelihood that computer failures will croak the business. My conjecture is that increased agility will always decrease downside risk, but should increase upside risk. Likewise, increased agility that exceeds the competition's rate of agility improvement will always have a positive effect on gross margin over time, whether through more rapid implementation of cost-decreasing measures or an effective competitive edge in implementing new products that increase revenues and margins. And, of course, decreased agility will operate in the opposite direction. However, the profit effects will in many cases be hard to detect, both because of stronger trends from the economy and from unavoidable disasters, and because the rate of change in the environment may vary.
How to measure agility? At the product-development level, the answer seems fairly easy: lob a major change at the process and see how it reacts. Major changes happen all the time, so it's not as if we can't come up with some baseline and some way of telling whether our organization is doing better or worse than before.
At the financial-statement level, the answer isn't as obvious. If I recall correctly, IBM suggested a measure like inventory turnover. Yes, if you speed up production, certainly you can react to an increase in sales better; but what I believe we're really talking about is a derivative effect: for example, a change in the level of sales OVER a change in cost of goods sold, or a percent change in product mix over the percent change in cost of goods sold, or change in financial leverage over change in revenues (a proxy for the ability to introduce better new products faster?).
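To make the first of those "derivative" ratios concrete, here is a minimal sketch in Python. The function name and the figures are made up for illustration, and reading the ratio as a simple difference between two reporting periods is my assumption; it is not an established financial metric.

```python
def change_ratio(num_before, num_after, den_before, den_after):
    """Change in one measure OVER the change in another, between two periods.

    Hypothetical helper for illustration only.
    """
    den_change = den_after - den_before
    if den_change == 0:
        # No change in the denominator measure, so the ratio is undefined.
        raise ValueError("denominator measure did not change")
    return (num_after - num_before) / den_change


# Made-up figures for two consecutive periods.
sales_before, sales_after = 1000.0, 1200.0
cogs_before, cogs_after = 600.0, 700.0

# Change in the level of sales over change in cost of goods sold.
sales_over_cogs = change_ratio(sales_before, sales_after, cogs_before, cogs_after)
print(sales_over_cogs)  # 2.0
```

The same helper would serve for the leverage-over-revenues variant by passing different measures. Whether a higher or lower value signals more agility is exactly the open question here.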
So I wonder if financial analysts shouldn't take a crack at the problem of measuring a firm's agility. It would certainly be interesting to figure out if some earnings surprises could have been partially predicted by a company's relative agility, or lack of it.
At the level of the economy, I guess I don't see an obvious application so far. Measures of frictional unemployment over total employment, I would think, would serve as an interesting take both on how much economic change is going on and to what extent comparative advantage is shifting. But I'm not sure that they would also serve to get at how well and how quickly a nation's economy is responding to these changes. I suppose we could look at companies' gross margin changes over the short term in a particular industry compared to overall industry gross margin changes to guess at each company's relative agility in responding to changes in the industry. However, that's not a good cross-industry yardstick...
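The within-industry comparison can be sketched as follows. This is a rough illustration only: the function names and numbers are invented, and taking gross margin as (revenue - cost of goods sold) / revenue and comparing by simple subtraction are my assumptions about how one might set it up.

```python
def gross_margin(revenue, cogs):
    """Gross margin as a fraction of revenue."""
    return (revenue - cogs) / revenue


def relative_margin_shift(company_before, company_after,
                          industry_before, industry_after):
    """Company's gross margin change minus the industry's, over one period.

    Each argument is a (revenue, cogs) pair. Hypothetical helper.
    """
    company_change = gross_margin(*company_after) - gross_margin(*company_before)
    industry_change = gross_margin(*industry_after) - gross_margin(*industry_before)
    return company_change - industry_change


# Made-up figures: the company's margin rises while the industry's slips.
shift = relative_margin_shift(
    company_before=(100.0, 60.0), company_after=(100.0, 55.0),
    industry_before=(1000.0, 600.0), industry_after=(1000.0, 620.0),
)
print(round(shift, 4))
```

With these numbers the company gains five points of margin while the industry loses two, so the shift comes out at roughly 0.07. A positive value would only hint at better-than-average response to industry change, and it says nothing across industries.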
And finally, is this something where unforeseen factors make any measurement iffy? If what drives long-term success is innovation, which is driven by luck, then you can be very agile and still lose out to a competitor who is moderately agile and comes up with a couple of once-in-a-generation market-defining good new products.
Everyone talks about the weather, Mark Twain said, but nobody does anything about it. Well, everyone's talking about agility, and lots of people are doing something about it; but I don't think anybody really knows how effective their efforts are. Ideas, anyone?