
Grids, data centres and reliability

In my work with the Grid Computing Now! Knowledge Transfer Network, I talk about "virtualisation" and "service-oriented architecture" just as much as about "grid" itself. People sometimes ask what the difference is between these concepts. My first answer is perhaps rather glib: I say that I don't care, as long as the technology gets the job done. Although this is not a straight answer, those of us on the GCN! team believe it is important to put business answers before any notion of technological purity.

But if we turn to the question as stated, I think that as long as a solution includes the key concepts of virtualised resources and dynamic allocation of applications across those resources, then that is enough for me to call the system a grid. But, of course, we can go further.
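
To make that definition concrete, here is a minimal sketch in Python of the two ingredients together. Everything in it is invented for illustration (the Resource and Scheduler names, the crude slot-counting model of capacity); it is not drawn from any real grid middleware.

```python
class Resource:
    """A virtualised compute resource; the physical machine behind it is hidden."""
    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots  # crude stand-in for available capacity

class Scheduler:
    """Dynamically places jobs on whichever resource has spare capacity now."""
    def __init__(self, resources):
        self.resources = list(resources)

    def submit(self, job):
        # Allocation happens at submission time, against the pool as it
        # currently stands, rather than being fixed in advance.
        best = max(self.resources, key=lambda r: r.free_slots)
        if best.free_slots == 0:
            raise RuntimeError("no capacity anywhere in the pool")
        best.free_slots -= 1
        print(f"placing {job!r} on {best.name}")
        return best

# Three virtualised nodes; jobs land wherever there is room right now.
pool = Scheduler([Resource("node-a", 2), Resource("node-b", 1), Resource("node-c", 4)])
pool.submit("render frame 1")
pool.submit("render frame 2")
```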
A recent conversation reminded me of the important point that distributed systems typically have to manage failure. As systems scale to many machines and many sites, some of those components are going to fail some of the time. The systems have to be resilient enough to adapt and recover. They also have to cope with additions to, and deletions from, the set of available resources.
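
In code, that resilience amounts to treating the pool as mutable and a failed job as something to reschedule rather than a fatal error. The sketch below assumes each node object offers an execute(job) method that raises ConnectionError when the node dies; both of those are my inventions for the example, not any particular toolkit's API.

```python
import random

def run_with_failover(job, nodes, attempts=3):
    """Run a job somewhere in the pool, rescheduling it when a node fails.

    `nodes` is a mutable list standing in for the current resource pool;
    other parts of the system may add or remove entries at any time.
    """
    for _ in range(attempts):
        if not nodes:
            raise RuntimeError("no resources left in the pool")
        node = random.choice(nodes)
        try:
            return node.execute(job)   # assumed interface, see above
        except ConnectionError:
            nodes.remove(node)         # drop the failed node and try another
    raise RuntimeError(f"{job!r} still failing after {attempts} attempts")
```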

This is most obvious in cycle-stealing grids, which use the spare capacity of desktop PCs, and in scientific grids, which link many research sites across the world. The interesting question is whether this also applies to the data centre. That seems to depend to some extent on how the system is designed. For example, Google is built specifically around this approach; they have always used lots of generic machines and simply replace resources when they fail. I believe eBay's massive server farms use the same dynamic approach.

This question arose in a conversation I had with Liam Newcombe, an independent consultant. We were supposed to be talking about Green IT (of which more another time), but our discussion wandered to include all sorts of ideas. Liam is working on an open source model of data centre reliability and performance. He believes that reliability is best achieved by explicitly allowing for failure within the software, rather than, for example, attempting to make the hardware itself ultra-reliable.

A key question must be how far up the stack this awareness has to extend. Can we write applications without worrying about this, or does every application have to have some potential adaptability built in? It's a fascinating topic and I look forward to reading the book that Liam is co-authoring, in due course.
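
To illustrate what "adaptability built in" might mean at the top of the stack, here is one hedged sketch: every remote call the application makes is wrapped in retry-with-backoff logic. The decorator and the exception choice are entirely hypothetical; the point is only that if failure awareness reaches this high, every application ends up carrying some version of this wrapper.

```python
import functools
import time

def adaptive(retries=3, backoff=1.0):
    """Wrap a call against the grid so the application survives failures."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = backoff
            for attempt in range(retries):
                try:
                    return fn(*args, **kwargs)
                except ConnectionError:
                    if attempt == retries - 1:
                        raise          # out of patience; surface the failure
                    time.sleep(delay)  # the resource may have moved or recovered
                    delay *= 2
        return wrapper
    return decorate

@adaptive(retries=3)
def fetch_results(dataset):
    ...  # some remote call that may hit a failed or withdrawn resource
```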
