Archive for the ‘IT’ Category

Development and Getting it Right

Lots of thoughts have been ruminating over the last week or so around the kick-off of a new development project and how to get it right. I'm reminded of some quotes from books and key folks in this space (e.g. Steve Jobs) that help shape my thinking around execution:

Real Artists Ship — Steve Jobs

Thrash Early – Seth Godin

Perfect is the Enemy of Good (something I use to describe cloud computing quite often)

Avoid “Just one more thing” — and this seems to get worse the longer you take between releases.

Got some more?


My Keynote from Ethernet Alliance – Next Gen Network and System Design

It’s been a while but wanted to post my presentation from this event —

Great mix of chip designers and network engineers, and a great session on TRILL vs. SPB and the status of both efforts.

The entire afternoon was spent discussing the impact of virtualization on networking. Extreme Networks highlighted the 1-2 additional layers that appear in the network once you account for blade switches and/or virtual machines. My slides highlight the policy shifts and changes happening across these various layers.

Regulating Scale and Velocity — Wall Street vs Clouds Part 1

The "sell-off" and market hiccup that occurred last Thursday formed an analogy to cloud computing for me. Trading is now a distributed process, something we were reminded of that day. Trades occurred, perhaps erroneously, perhaps not, that spanned various exchanges under "rules" apparently set by those who know best but who seemingly find enforcement a struggle (as it often is in distributed systems). Even though trading in some specific securities stopped on some exchanges, it was allowed to continue on others — and the "view" we have into this is a single asking price, apparently an accumulated or averaged set of values.

We learned this single view was not reality, nor was it tied to a single security; it affected many in ways that we are still trying to understand.

IT also looks for a similar singular view. Few actually achieve it, even today, when most workloads still run inside fairly well-known, controlled environments (e.g. data centers) that you own or control. Or perhaps the workloads have ventured out into the "cloud," and some risk is spread across providers and on-premise data centers. Or maybe everything runs in the off-premise cloud, for those who judge the risk (or type of workload, data, etc.) acceptable.

But as the above highlights, changes are afoot. Increasingly, workloads are deployed across on-premise and off-premise clouds. Scale is increasing. Scale? Just like billions of stock trades, there are billions and billions of objects now marshaled to provide a "view" of a service — and to deliver that service to end users.

I probably pick on both IT vendors and IT users equally. I believe all sides can do a better job in building and implementing technologies to help with scale, velocity, and the distributed nature that is network computing.

We need to increase investment in three areas:

1) understanding how to provide IT policy management, definition, and distributed regulation

2) improved service definitions as an extension of workloads/payloads (this could be a subset of #1, but there is a "state-change" when you go from modeled to deployed (e.g. instance immutability), and this is a critical point to understand)

3) creating more reflective architectures — meaning connected autonomous systems that are regulated by #1 and run #2, with concrete, well-defined interfaces and policy enforcement controls (and where events can be triggered both inside and outside the system)
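As a toy sketch of how #1 and #2 might interact, here is a policy gate evaluated at the modeled-to-deployed state change. Every class and field name below is hypothetical, invented for illustration, not any vendor's API:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    state: str          # "modeled" or "deployed"
    region: str
    encrypted: bool

@dataclass
class Policy:
    name: str
    allowed_regions: tuple
    require_encryption: bool

    def evaluate(self, w: Workload) -> list:
        """Return a list of violations; an empty list means compliant."""
        violations = []
        if w.region not in self.allowed_regions:
            violations.append(f"{w.name}: region {w.region} not allowed")
        if self.require_encryption and not w.encrypted:
            violations.append(f"{w.name}: encryption required")
        return violations

def deploy(w: Workload, policies: list) -> bool:
    """Gate the modeled -> deployed state change on every policy."""
    violations = [v for p in policies for v in p.evaluate(w)]
    if violations:
        return False    # the state-change is where enforcement bites
    w.state = "deployed"
    return True
```

The point of the sketch is that policy evaluation happens at the state change itself, which is exactly where a distributed regulator would need to hook in.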

Dealing with these issues is not easy, especially in multi-vendor environments. It requires a strong partnership between vendor and user/customer.

More on this in the coming week, as well as my take on how to actually build solutions that incorporate these areas.

The Future of Cloud Data Center Networking

Jason Corollary: “Impossible” exists only because we haven’t stated or re-factored the problem so it is “possible.”

I have thought for the last few years that in virtualized, higher-density data centers (a PDF link) there has to be a better solution than MAC and IP.

IP is challenged by having location embedded in the address. That's not good for a world that wants to move "cloudy" workloads across compute nodes, PODs, and data centers. One solution I've seen is late binding at the IP layer, but session state becomes tied to the address and cannot migrate without loss. And let's not mention the oversubscription issues in most of these topologies today.

I've also (see my earlier posts) never liked the complexity of managing switches. With SDN we could basically take a switch out of the box, configure a couple of things, and be good to go. Companies are now linking virtualization management with switch management, which addresses some of this complexity, but route configuration in anything beyond a "flat" layer 2 network is still managed by admins.

Today, I enjoyed listening to a talk about PortLand (targeting 100,000 nodes and 1M VMs, with full bandwidth to each node), a UCSD project (with a great iTunes U talk) to provide a self-managing, scale-out layer 2 network design. They build in parallel with some of the work happening on TRILL and elsewhere around these issues. But I like their design a bit better: don't assume a global host address space; assume a multi-rooted tree (see corollary above). It also assumes a level of backwards compatibility — critical for widespread adoption.

They have created a simple protocol that seems to map well to the modular world of the data center future: a Location Discovery Protocol implemented in the rather simple layer 2 switching fabric. There's also a fabric manager, necessary to maintain some level of backwards compatibility and to manage the MAC mappings without changing protocols. Forwarding in the topology is done entirely at this "pseudo" MAC layer. The pseudo MAC is assigned dynamically and hierarchically, addressing the global (layer 2) addressing and switch-memory constraints that some of the other solutions in this space seem to be facing.
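The PortLand paper describes the pseudo MAC (PMAC) as a 48-bit address of the form pod.position.port.vmid. The little encoder below follows that field layout as I understand it, but it is my own illustration, not project code:

```python
# Sketch of a PortLand-style hierarchical pseudo-MAC (PMAC).
# Field widths (pod:16, position:8, port:8, vmid:16 bits) follow the
# paper's description; the code itself is a toy illustration.

def encode_pmac(pod: int, position: int, port: int, vmid: int) -> int:
    """Pack the location fields into a 48-bit pseudo-MAC."""
    assert 0 <= pod < 2**16 and 0 <= position < 2**8
    assert 0 <= port < 2**8 and 0 <= vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac: int):
    """Recover (pod, position, port, vmid) from a PMAC."""
    return (pmac >> 32, (pmac >> 24) & 0xFF, (pmac >> 16) & 0xFF, pmac & 0xFFFF)

def pmac_str(pmac: int) -> str:
    """Render the PMAC as a conventional colon-separated MAC string."""
    return ":".join(f"{(pmac >> s) & 0xFF:02x}" for s in range(40, -1, -8))
```

Because location is recoverable from the address itself, switches can forward hierarchically with tiny tables, which is the memory win described above.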

Looks promising!! I would like to see more work around latency without the assumption of internet-connected services. The PortLand work does address dynamic hierarchy; if more intelligent proximity-based (or P2P) data stores could be used, it might address latency within a POD vs. outside it, since all of that location information is encoded at the PMAC level.

Even love some of the “self-integrity” monitoring by neighboring switches like we thought about for Project OpenSolarisDSC. Wonder how we can help them with their fabric manager?? Hmm…thinking…

What You Don’t Want — the Cloud and Cost to Deliver

Providing cloud computing at scale is about economics. The delivered services have a narrow range of margin over the first 2-3 years, as is to be expected when you outlay cash for infrastructure, design, implementation, etc. Hopefully after year 2 you achieve positive margin and start to pay yourself back. This works well in an environment that meets the cloud's perceived "80/20" rule, where you're meeting 80% of the needs of 80% of the customers out there. The economics and time to market should help offset the remaining 20% of need, and customers can shop elsewhere for the 20% that "doesn't fit here."
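The payback arithmetic in that paragraph can be sketched with assumed numbers; the capex, revenue, and opex figures in the example are invented purely for illustration:

```python
# Illustrative payback arithmetic for a cloud build-out; every number
# used with this function is an assumption, not data from any real
# deployment.

def payback_month(capex: float, monthly_revenue: float, monthly_opex: float):
    """First month in which cumulative margin turns non-negative,
    or None if the service never pays back its capex."""
    monthly_margin = monthly_revenue - monthly_opex
    if monthly_margin <= 0:
        return None
    cumulative, month = -capex, 0
    while cumulative < 0:
        month += 1
        cumulative += monthly_margin
    return month
```

With an assumed $2.4M outlay, $150k/month revenue, and $50k/month operating cost, payback lands at month 24, i.e. "after year 2" as described above.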

With that said, how does one manage the cloud delivery process? In many hosting situations, hosting companies have done customized negotiations for clients, one-offs, and custom designs depending on how large the opportunity is. Many hosting companies are getting into the cloud game, e.g. AT&T, Telstra, and Terremark. How does one balance custom requirements with the cloud? When does it make sense to deviate from the 80/20 rule? Or when does it make sense to sell 10-20% of your capacity to one customer or workload?

This can be a difficult decision. There are many factors. It plays to standardization and service management. It requires careful analysis on who else might want that feature and when.

I theorize that a significant change to "cloud" services to provide a specific new feature for a customer or two will likely change the cost of service for the entire platform in a bad way. Deviations need to be managed and well thought out.

There will be an effect on cost to deliver — can your pricing sustain that? Can you pass that cost on to the new customer without affecting overall delivery margin? Maybe the addition of this new feature or platform will help drive adoption of your cloud, in which case that may certainly offset the overall change in cost. Can you do it without adding additional complexity? The "zen of cloud" would almost state that you must.

Hopefully some of the questions work out! It will hopefully avoid this…


What's an example of a significant change? Let's say your definitions of SMALL and LARGE compute services don't fit a customer's needs; they want something that is medium plus more memory. That's OK, but if you spec'd your PODs (points of delivery) with re-stacking workloads in mind (and you should), you may now find yourself with spare capacity on a node that you might not be able to utilize.
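The stranded-capacity effect is easy to demonstrate with a toy first-fit packer; the node and instance sizes below are invented for illustration:

```python
# Toy first-fit packer showing how one custom instance size can strand
# capacity on a POD node. A 64 GB node holds four standard 16 GB
# "mediums" exactly; one 24 GB custom instance breaks the fit.

NODE_RAM_GB = 64

def pack(instances, node_ram=NODE_RAM_GB):
    """First-fit: return a list of nodes, each a list of (name, ram)."""
    nodes = []
    for name, ram in instances:
        for node in nodes:
            if sum(r for _, r in node) + ram <= node_ram:
                node.append((name, ram))
                break
        else:
            nodes.append([(name, ram)])
    return nodes

def stranded(nodes, node_ram=NODE_RAM_GB):
    """Total unused RAM across all allocated nodes."""
    return sum(node_ram - sum(r for _, r in node) for node in nodes)
```

Four standard mediums pack one node with zero waste; swap one for a 24 GB custom instance and the same workload count spills onto a second node with 56 GB stranded.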

IT, Complexity, and Clouds Redux


I listened to Dr. Paul Borrill's (ex-Sun DE and VP, Veritas CTO) talk at Stanford on iTunes the other day. I loved it. My favorite statement was TIME = CHANGE. It's exactly in line with much of my thinking and observations after 15+ years in IT. I and some other smart people at Sun (now Oracle) put some of this to "code" with DSC (dynamic service containers) a while back. But listening to his talk inspired me to do some more thinking in this space, so I'm reposting this here; it was on my other blog a couple of years ago.

I also read Lori MacVittie’s post on Apathy vs Architecture — it highlights some other aspects of why we are here and why we must work harder to solve the hard problems.

More to come but here’s some background…

I've been doing some thinking lately around the cloud model and how enterprises might adopt it. Enterprises are challenged with a conflict between giving their developers control and choices, and maintaining operational control. Case in point: the ownership of SLAs is often with the operations/administration org, not the developer. The developer in many cases is hoping that most of the "systemic qualities" will appear within the platform and not require lots of development time. An interesting example of improvement in this space is the SHOAL project around GlassFish.

One of my employees is working on some modeling projects: trying to model the data center "as is" vs. deriving the model from a "perfect" state where choices are somewhat removed from the scenario. I mean that the data center is architected in specific ways that allow or disallow some functionality. You see this at very large sites, like Google and Yahoo. They have several major architecture patterns, and many or most services conform to those patterns. You want to deploy? You conform.

This battle is often uphill. The last 20% of a solution is where you spend the most time, convincing others of the design or that "good enough" will trump perfect. But I think we need to get over that — we can't afford not to.

Graffiti is a good example. Handwriting recognition was very hard; companies failed trying to figure it out. Did they constrain the problem (and thus the solution) enough to progress to something that works without a whole bunch of "change"? Jeff got it right: fix the few letters that cause the problem (i vs. L) and constrain the problem. He found a solution. We've gotten a bit more flexible today, but it's still the core thinking in the industry.

What problems can we solve today if we limit the choices, give away a little control, and are able to take technology to the next level?

UPDATE: Forgot the Jason Corollary: “Impossible” exists only because we haven’t stated or re-factored the problem so it is “possible.”

Requirements for a real-time cloud marketplace

There's been lots of press (AWS's Spot Instances, Zimory, here's one from Vinton Cerf) over the last few weeks about the cloud moving towards an interconnected set of compute/data processing infrastructure. One step towards that is certainly some level of interoperability for deploying apps — e.g. what RightScale essentially does today, or libraries like libcloud that provide a consistent interface to many clouds/providers.
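At its core, the "consistent interface to many clouds" idea is an adapter pattern. The sketch below uses made-up provider classes to show the shape of it; it is not the real libcloud or RightScale API:

```python
# Hypothetical multi-cloud adapter sketch. Provider names and method
# signatures are invented for illustration.

class Provider:
    """The one interface every cloud adapter must implement."""
    def create_node(self, size: str) -> str:
        raise NotImplementedError

class AcmeCloud(Provider):
    def create_node(self, size: str) -> str:
        # In a real adapter this would call Acme's provisioning API.
        return f"acme-node-{size}"

class ExampleHost(Provider):
    def create_node(self, size: str) -> str:
        # Likewise, a real adapter would translate to this host's API.
        return f"example-node-{size}"

def deploy_everywhere(providers, size="small"):
    """One call shape, many clouds: the essence of interoperability."""
    return [p.create_node(size) for p in providers]
```

The value is that the deployment logic never changes as providers are added; only new adapters do.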

I sat down about six months back with Lou Springer to discuss the idea of a cloud marketplace, and we came up with a few items that would need to be addressed:

–workload transportability — it goes beyond encapsulation in images; you need a description language for the workload's relationships to data and "data physics"

–network transportability — our DNS-based way of doing things on the net has been pretty broken for a while. One approach is to provide something like an escrow service that handles service delivery addressing. Payload size is also an issue (see above)

–workload rating/pricing — all apps have a time to live. How do I price my workload? What access does it need? What are the constraints on this model?

–capacity management — how do I know what capacity I have in order to provide a price?

–run-time permissions/revocation — at some point I will want to ensure that my stray workloads are not run by providers I no longer want running them.

–provider indemnification — in some cases providers would rather be unaware, to a high degree, of what is running. Sure, network ports and the like are fair game, but some providers may not want to be able to run reports against your workload. Is there a model that allows us to encrypt "everything" and still provide the other values above?
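Several of these requirements could be captured in a single workload descriptor. The sketch below is purely hypothetical; every field and method name is my invention, intended only to show how transportability metadata, pricing, TTL, and revocation might hang together:

```python
from dataclasses import dataclass, field
import time

@dataclass
class MarketplaceWorkload:
    image_ref: str                 # encapsulated image (transportability)
    data_relationships: list      # "data physics": what it must be near
    max_price_per_hour: float     # workload rating/pricing
    ttl_seconds: int              # all apps have a time to live
    revoked: bool = False         # run-time permissions/revocation
    started_at: float = field(default_factory=time.time)

    def runnable(self, provider_price: float, now: float = None) -> bool:
        """A provider may run this workload only if it is not revoked,
        is within its TTL, and the offered price is within the bid."""
        now = time.time() if now is None else now
        if self.revoked:
            return False
        if now - self.started_at > self.ttl_seconds:
            return False
        return provider_price <= self.max_price_per_hour
```

A marketplace clearing engine would evaluate `runnable` against each provider's offered capacity and price before placement.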

I’m sure there’s more. What’s missing from your cloud marketplace?

Part 3- Cloud’s Impact on Data Center Architectures

This is part 3 of my work in progress that I hope to release in more detail as a paper! Find Part 1 and Part 2 here.

IT Architecture and the Business

It is often said that architecture is a study in tradeoffs. Perhaps it's best to look at this architecture problem from the perspective of three different types of customers (or divisions, organizations, projects, etc.). These groups are not holistic in nature — every company has parts of its business in each category.


All of these enterprises provide value to customers. The first provides value through its external connections and "systemness." The second wants to build connections to what it has. The third is perhaps shrinking in market share and needs to rapidly redefine its cost structures. They share similar characteristics at times, but they also differ. The quickly growing often cannot use "off the shelf" technologies because those don't exist at the necessary scale, are too expensive, etc.

The second often looks at legacy data and provides it in new and exciting formats, creating additional business value by enabling others. It may have invested in large-scale databases at centralized facilities. Perhaps it wants to move towards real-time analytics and provide these functions across the world.

The third may more rapidly embrace internal consolidation strategies or public cloud strategies. Optimization for this type of business means quickly reducing complexity and increasing operational optics: determining what is core, what isn't, and devising a strategy to deal with each accordingly. This category is constrained not only by cost but generally by the ability to go beyond a single IT platform layer.

These businesses must all make decisions about where to invest and where not to. How do you leverage what provides competitive advantage vs. something others also have? By increasing connections, by making what you have available to more audiences, or by going "double-down" on what you need to do to stay in business?

Part 2 – Cloud’s Impact to Data Center and Application Architecture


Architecturally speaking, IT is at the precipice of a large shift in complexity. The tools (like virtualization) developed over the course of the last decade are now embedded in most IT platforms. This provides benefits and challenges. The benefits are well known to most; the challenges are being experienced by most. A "medium" data center of a few thousand machines in the 90s has grown to perhaps still only a few thousand machines, but an ever-growing sea of virtual machines.

As horizontally scaled systems evolved, so did the IT process. Some shops had strong admins with the golden rule: don't do things more than once, and if you do, script it. They had a model for building machines, operating systems, and applications.

A Shift of Control

Over the last couple years, there has been a shift of control. This shift has enabled faster time to market, perhaps increasing business value, and reduced the number of people involved in the IT process. The developer now has the opportunity to create a customized encapsulation of their platforms, and they are able to deploy almost anywhere.

The downside is the loss of centralized IT control, and with it an aspect of service levels that is difficult to measure or even characterize. The business has had to rationalize the tradeoffs between time to market and service levels. It needs both to maintain competitive advantage, and it will need the centralized IT and developer forces working together.

The Next Shift

One only has to look at the cloud for examples of the coming shift. This shift affects not only technology but also the organizational and operational aspects of the IT equation. Another critical aspect of this shift is the balance of service level, control, and security. How do I deliver secure IT services in a distributed fashion at the service level I need?

This will all change again when we have global compute capacity just like we have a global network (The Internet) today. Don’t believe me? Check out The Economist and a few startups as well as Amazon.

Part 1 – Cloud’s Impact to Data Center and Application Architecture

First part of a series of three, and preview of a work in progress…


The data center is constantly changing. New technologies, new business demands, and environmental challenges mix with the obsolescence of hardware platforms, facilities, and even operational practices to produce a complex, evolving environment.

There are some clear trends today:

– data is growing

– data centers are getting larger

– data center services are being provided by a growing abstraction of components

– and these services are increasingly delivered in real-time over distributed networks

This paper addresses infrastructure and platform architecture to meet these challenges, applying some well-known, proven patterns to slightly different strategies. A quick read of the table of contents of books like Cal Henderson's highlights many of these patterns:

– Secure distributed execution/processing

– Centralized control of critical functions

– Abstraction and Encapsulation

– Replication and “Sharding”

– Eventual Consistency and Coherency

– Simplification and De-statefulness

These principles are often intermingled. How do I determine which functions are critical amid the complexity of everything else? How do I develop a service I can run anywhere while ensuring performance and coherency?
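As one concrete instance of the "Replication and 'Sharding'" pattern from the list above, here is a toy key-to-shard placement with simple replication. It is illustrative only, not a production placement scheme:

```python
import hashlib

# Toy sharding + replication: hash a key to a primary shard, then
# replicate to the next (copies - 1) shards, wrapping around.

def shard_for(key: str, num_shards: int) -> int:
    """Stable shard assignment via a cryptographic hash of the key."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def replicas_for(key: str, num_shards: int, copies: int = 3):
    """Primary shard plus the next (copies - 1) shards, wrapping."""
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(copies)]
```

The replicas are where "Eventual Consistency and Coherency" enters: each copy can accept reads immediately, while writes propagate among the replica set over time.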

The definition of data centers and DC infrastructure has also changed. A decade or more ago you might have called this "data center architecture," but now it's about architectures for service delivery, regardless of where those services exist.