Is cloud computing a dirty word?

I’m sitting at the airport in Montreal, waiting to head to Las Vegas for the Enterprise Cloud Summit. It’s been three years since we ran the first ECS at Interop, and much has changed. Back then, we spent a lot of time talking about what clouds were—helped by heavyweights from big public clouds like Amazon, Google, Rackspace, and Microsoft.

A funny thing happened on the way to the clouds, though. The following year, incumbent vendors started preaching private clouds, preying on fears of lost control and invaded privacy. Change-resistant IT executives listened to them, and soon after, hybrid clouds were a hot topic.

I’ve no idea what a hybrid cloud is. I can’t go and buy one, or subscribe to one. I can, however, have an application that relies on hardware I own and hardware I rent. I think that’s what they mean. To me, hybrid clouds are a gateway drug to the adoption of public computing—as economies of scale and skill take over, companies will be drawn inexorably into a more on-demand world.

With much of what passes for clouds, I call shenanigans. The co-opting of cloud computing by organizations that don’t want change ignores the promise of cloud computing: the end of the IT monopoly. No surprise, then, that the monopolists are resisting.

In DC a few weeks ago, I joked that cloud computing was IT socialism. Nobody laughed. Inside the beltway, it seems, socialism is a dirty word. But the statement rings true: cloud computing is computing for the 99 percent, and the 1 percent that controls technology today is resisting change. IT conservatives worry about retaining control, when instead they should worry about delivering competitive IT services.

The ultimate purpose of clouds, in the clouds-as-utility model, is to abstract a layer of complexity away. It’s the same as payroll services, or the electrical system: immensely complex, but the business has a simple interface to them. The one percent forgets this at its peril.

That doesn’t mean a company can’t be cognizant of those services. I needn’t understand payroll tax, or home wiring, now that we have ADP or the utility grid. Clouds are about managing and quantifying complexity, abstracting it through simple interfaces. James Urquhart, who got me thinking about this a few months ago, is right—complexity abounds. It does so at a higher layer of abstraction: data centers instead of machines, services instead of ports, and applications instead of subroutines.

Look at it another way. Physics is applied math. Chemistry is applied physics. Biology is applied chemistry. By the time you get up the scientific stack to biology, systems are hopelessly complex, and we have to understand them by observing emergent behaviors and working with patterns. Down at the math level, we can work with the numbers themselves.

At the lower, simpler levels we use deduction, causality, and the burden of proof. At the higher, more complex levels we use induction, correlation, and the strength of probability. This is true in the sciences; it’s also true for complex computing systems. This may be why many of the best IT operators I know are biologists.

What clouds do is allow us to transition from device-level thinking to system-level thinking. Just as biologists, ecologists, and anthropologists work with observation and make up the “wet” sciences that aren’t precise, so cloud architects are doing “chaotic IT.”

This chaos has its rewards. We get a lot more emergent complexity and interestingness from biology than from math, at least from a practical, hands-on perspective.

Now I’m really going to wax rhapsodic and philosophical. Gottfried Leibniz spent a lot of time trying to figure out what the best of all possible worlds was. He concluded that it’s the one of plenitude, where the fewest starting conditions give us the most outcomes. In his words, the best world would “actualize every genuine possibility.”

I suspect Leibniz would have considered today’s connected, abstracted, service-oriented Internet a better world than yesterday’s islands of client-server and mainframe computing. Biology, and cloud computing, are complex. They’re messy. And according to Leibniz, they’re also better, because they allow more possibilities.

It’s interesting to note that Leibniz also used this to argue that the world needs to have evil in it, because this is a “better” world as it has more possibilities. So the next time your cloud app dies a horrible, complex death, remember that Leibniz says it’s for the best.

The move to turnkey computing

At this year’s Cloud Connect, Werner Vogels predicted a future in which everything-as-a-service is the norm. While enterprise IT often equates virtual machines with the cloud, the reality is that virtual machines are only one of dozens of services Amazon offers. Its competitors aren’t far behind: companies like Google offer a horde of APIs, and even more traditional memory/compute/storage providers like Joyent are adding turnkey products for large storage.

In the end, nobody wants to see the sausage being made. Recent announcements by folks like VMware, public provider acquisitions of PaaS products, competing private stacks like OpenStack and Cloud.com, and private cloud tools that run higher up the stack remind us of one thing above all else: herding your boxen is a distraction from the business of building software and deploying applications.

I tried to argue this point at Cloud Connect, in a presentation entitled “The Move to Turnkey Computing.” Here it is on Slideshare, as a PDF with speaker notes.

Cloud provider deep dives at ECS 2011

This year, we’re adding a new element to Interop’s Enterprise Cloud Summit. We’ve asked eight public cloud providers (Amazon, Joyent, Microsoft, Rackspace, Salesforce, NaviSite, Terremark, and GoGrid) and four private cloud stacks (OpenStack, Cloud.com, Red Hat Makara, and Eucalyptus) to answer a set of predefined questions about their offerings.

The questions are designed to let attendees compare offerings—while they’re not strictly apples-to-apples, at least they bound the discussion.

Public clouds (here’s the slide deck)

  • Main elements of your service: What are your main service offerings (i.e., the top three technologies, services, or APIs your subscribers use)?
  • Pricing: How are those services priced? (i.e., what is the unit of measure, what do you charge, and do you have an elastic pricing model based on usage?)
  • Security and certifications: What security or similar certifications do you have (e.g., FIPS, SAS-70, PCI)?
  • Data centers and zones: Where are your data centers or availability zones (e.g., Europe, the US, China)?
  • Customers: What are three companies building things on your platform? (One slide per customer profile is OK here.)
  • SLAs and compensation: What SLAs do you offer (e.g., data recoverability, uptime, latency), and how do you compensate for misses (e.g., service refunds)?
  • Architecture: How is the system architected (i.e., what underlying stacks do you rely on)? A couple of diagrams are OK here.
  • Portability: How can people move things onto and off of your platform (i.e., are there APIs? Portable machine image formats? Private stacks they can run?)

Private clouds (here’s the slide deck)

  • Service library: What cloud services does the stack offer (e.g., virtual machines, code execution, large-object storage, message queueing)?
  • OS, language, and API support: What operating systems, languages, or APIs can the user employ (e.g., Python, any OS)?
  • On what stack is it based? (VMware, Xen, KVM, etc.)
  • Requirements & limitations: What are the underlying hardware requirements or constraints (e.g., pairs of machines, Intel quad-core processors, sub-10-ms latency between nodes)?
  • Portability: Is the stack portable to other stacks, or to public cloud provider environments (e.g., can you move an AMI to Amazon)?
  • What standards or de facto standards does it support?
  • How is it priced (e.g., by core, by user, or open source with a support contract)?
  • Included tools: What management tools do you offer? (A screenshot or two is fine here.)
  • Capacity, performance, availability: What are the capacity, performance, and availability constraints (e.g., scales to a maximum of 20 nodes)?
  • Global distribution: Can it work in a distributed, multi-city mode with additional redundancy?

Putting vendors on stage to talk about their wares can be a disaster, and ECS is a paid event with a focus on content. To make sure this works, we’re doing two things. First, we’re limiting each presenter to ten minutes and the slides we’re providing them. Second, if we don’t get the slides from a provider at least two weeks in advance, we’re either going to replace them (as you can imagine, plenty of folks who didn’t make our list of 12 providers would like the visibility) or present the content ourselves, based on our research.

Hopefully this will keep the signal-to-noise ratio high. Anyway, it’s the first time we’re trying something like this, and it’ll be interesting to see what happens. We’ll consolidate the responses and publish them as a research paper once we get everyone’s answers.

Cloud Connect live: Neal Sample of eBay – cloud bursting for fun and profit

Cloud bursting for fun and profit…own the base, rent the spike.

eBay: 2B page views/day, 6,000 application servers, 9 PB of data.

Cloud bursting: control costs and increase efficiency.

2,000 servers are provisioned at a fixed cost, but only 300–1,600 are used at any given point in the day, leaving a lot of “whitespace,” or excess capacity.

How can we remove the peaks? Time-shifting the workload helps a little, bringing us down to about 800 servers required.

Financial model structure: divide the hourly demand (in average TPS) by the maximum TPS per compute unit (CU) to get the hourly CUs required. Multiply the CUs served by the internal cloud by its hourly cost, add the cost of the hourly CUs burst to the external cloud, and you get the total hourly cost.
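
As a rough sketch, here’s that model in Python. Every number here (TPS per CU, hourly rates, fleet size) is an illustrative assumption, not an eBay figure; the point is to show how the pieces combine, and why bursting can beat peak provisioning even when external capacity costs 4x more.

```python
# Hybrid "own the base, rent the spike" cost model. All figures are
# illustrative assumptions, not eBay's numbers.

MAX_TPS_PER_CU = 100     # assumed throughput of one compute unit (CU)
INTERNAL_RATE = 0.10     # assumed $/CU-hour for owned capacity
EXTERNAL_RATE = 0.40     # assumed $/CU-hour for burst capacity (4x internal)
INTERNAL_CUS = 800       # owned CUs: the "base"

def cus_required(avg_tps):
    """Hourly compute units needed for a given average demand."""
    return -(-avg_tps // MAX_TPS_PER_CU)  # ceiling division

def hybrid_hourly_cost(avg_tps):
    """Owned capacity is a fixed cost; overflow bursts to the external cloud."""
    external = max(0, cus_required(avg_tps) - INTERNAL_CUS)
    return INTERNAL_CUS * INTERNAL_RATE + external * EXTERNAL_RATE

def peak_provisioned_hourly_cost(peak_tps):
    """Cost per hour of owning enough capacity for the peak, spikes and all."""
    return cus_required(peak_tps) * INTERNAL_RATE

# A bursty day: 23 quiet hours, one hour spiking to more than 3x the base.
demand = [60_000] * 23 + [200_000]
print(sum(hybrid_hourly_cost(t) for t in demand))           # 2400.0: rent the spike
print(peak_provisioned_hourly_cost(200_000) * len(demand))  # 4800.0: own the spike
```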

Cost-benefit analysis shows that an external cloud at 4x the cost of the internal cloud still yields economic benefits in dealing with the spikes.

If the external cloud costs half as much as the internal cloud, it makes no sense to run anything internally.

In between is the interesting area where we need to make hard choices.

Cloud bursting is very interesting to eBay, allowing us to cut costs by at least 25%.

Lower the investment in infrastructure to focus on business intelligence.

Cloud Connect live: Willem van Biljon of Nimbula

Cloud architecture is made up of many complex pieces: compute, storage, and network hardware, with a Cloud OS (the IaaS control plane) above them, and PaaS and SaaS layered on top.

Scale, automation, resource management, and permission and policy management are the key challenges.

The hypervisor is NOT the Cloud OS; rather, it’s an essential component of one.

Large enterprises have shown that commodity hardware can lower costs. Design the application for commodity hardware and the savings are dramatic.

Network: topology no longer defines security, which now needs to be automatically and dynamically managed.

Federation: challenges are API, identity, data and application environment. Identity is the key.

Billing: metering, rating, and bill generation. The business is where the rating is.

Storage: Lots of data sits on enterprise platforms, and there’s an opportunity to re-architect with low-cost storage. No one-size-fits-all; the key is balance.

The cloud ecosystem has many components, and many issues per component. Focus on the key issues for each component.

Cloud Connect live: Andy Schroepfer of Rackspace

The cloud is coming and all that matters is cost. Rackspace cloud is for smaller clients, medium clients use hybrid strategies, and they are hoping large companies will use OpenStack.

The market will divide into two sectors: the self-service cloud and the managed cloud, just like hosting versus managed hosting.

Service is what matters.

Pick the right platform for the right application. Rackspace’s strategy is that a hybrid approach of dedicated infrastructure and cloud is best served by a single provider.

SaaS – the “other” cloud. Rackspace hosts a huge number of SaaS platforms. SaaS is a huge agent of change in IT, and reduces IT staff.

OpenStack is a collaboration with NASA to provide an open-source platform that is hypervisor-agnostic and provides storage portability.

Cloud Connect live: Marty Kagan of Cedexis keynote

Cedexis took a different approach to monitoring performance: they inserted objects into all the CDNs and convinced top sites to embed objects in their pages, gathering real-user performance data.

15 billion measurements in January, analyzed by Bitcurrent.

EC2 East outperforms Rackspace, Joyent, Google AppEngine and Azure.

If you deploy on multiple availability zones…do you use static geo-load balancing or base your decision on real-time performance information?

It turns out you get much better international coverage, a 20–30% improvement, by using performance-based load balancing.
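
Here’s a minimal sketch of the idea in Python. The provider names and latencies are illustrative placeholders, not Cedexis measurements; a real system would aggregate millions of real-user data points per client region and refresh them continuously.

```python
# Performance-based load balancing: route each client region to whichever
# provider real users are currently reaching fastest, rather than mapping
# the client to a fixed provider by geography.

# Recent median latencies (ms) from one client region; values are made up.
measurements = {
    "ec2-us-east": 82,
    "rackspace": 97,
    "joyent": 110,
}

def pick_endpoint(latencies):
    """Return the provider with the lowest recently observed latency."""
    return min(latencies, key=latencies.get)

print(pick_endpoint(measurements))  # -> ec2-us-east
```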

Cloud Connect live: Oriol Vinyals – artificial intelligence in RTS games and cloud computing

Oriol introduces StarCraft, one of the most popular real-time strategy games of all time, as a virtual world where you gather resources and build armies to defeat an opponent.

StarCraft is adversarial, long-horizon, partially observable, real-time, and concurrent: a very hard AI problem.

Challenge: long horizon. A game runs 10K to 100K frames, versus chess with only hundreds of moves.

Designed Berkeley Overmind as a hierarchical search.

Real-time decisions: the agent has to output actions every frame, 24+ times per second.
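
A minimal sketch of that kind of frame-budgeted, anytime decision loop, with the caveat that everything here is illustrative rather than taken from the Overmind codebase: the essential pattern is to start with a cheap default action and keep refining it until the frame deadline arrives.

```python
import time

FRAME_BUDGET = 1.0 / 24  # roughly 42 ms per frame at 24 fps

def refine(plan, state):
    # Placeholder for one step of hierarchical search (e.g., strategy,
    # then tactics, then per-unit micro). Here it just deepens a counter.
    return plan + 1

def decide(state):
    """Anytime decision loop: always return *some* action by the deadline."""
    deadline = time.monotonic() + FRAME_BUDGET
    plan = 0  # cheap default action, available immediately
    while time.monotonic() < deadline:
        plan = refine(plan, state)
    return plan

print(decide(state=None))  # prints however many refinements fit in one frame
```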

What we are learning with our AI efforts may be applicable to cloud computing problems…

Cloud Connect live: Scott Baker of Eventbrite

Scott confesses that he’s a network engineer who loves hardware and data centers. He titled his talk “How I learned to stop worrying and love the cloud.” Specific technologies work for specific solutions until needs change or a more efficient, economical, powerful technology comes along.

In 2006, Digg was so busy with the hardware side of the solution that they didn’t have time for application-layer monitoring. Provisioning hardware took a disproportionate amount of time and effort. When they released new features, they threw more servers at the problem. Lather, rinse, repeat. They weren’t able to focus on the application, which was really the important part of what they were doing.

Eventbrite moved from hosted servers to EC2 and Puppet. Single source of truth: a master AMI. They moved from Debian to Ubuntu; DNS/DHCP were major issues. Cloud computing is timesharing, updated. Tips: use RAID 0 with EBS instead of RAID 10, use third parties for e-mail, and practice defense, redundancy, and backups in depth.
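
As a rough illustration of the single-source-of-truth idea, here’s a sketch using the boto library of that era to launch a small fleet from one master AMI. The AMI ID, key pair, and instance type are hypothetical placeholders.

```python
# Launch every app server from the same "blessed" master AMI, so the fleet
# stays identical. All identifiers below are placeholders.

import boto.ec2

MASTER_AMI = "ami-00000000"  # placeholder: the one image of record

conn = boto.ec2.connect_to_region("us-east-1")
reservation = conn.run_instances(
    MASTER_AMI,
    min_count=4, max_count=4,   # a small, uniform app-server fleet
    instance_type="m1.large",   # placeholder size
    key_name="example-key",     # placeholder key pair
)
for instance in reservation.instances:
    print(instance.id)
```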

Cloud Connect live: Derek Chan of DreamWorks keynote

DreamWorks is a CG filmmaking company and a huge consumer of CPU. Cloud computing is a perfect fit: cost-effective, scalable compute at their fingertips.

A film takes 4–5 years, 200+ workstations, 50+ million compute hours, 100 TB of data, 500 million files, a 10 Gbps WAN, and a 40 Gbps LAN. Resource demands are bursty; on-demand capacity is a tremendous benefit.

In 2003 they used HP’s IaaS rendering farms for Shrek 2.

Getting the data to where the compute happens is a big challenge.

In 2010 there were three CG feature releases in 3D, with over 7M compute hours sent to IaaS. For 2011 they are increasing capacity 10x, adding a new IaaS provider, and increasing network bandwidth 3x.

Red Hat, KVM, and MRG for message queue management; WebLogic and JBoss for middleware; and Deltacloud for cloud management.

Cloud computing allowed them to defer data center buildout, reduce costs, and increase flexibility.

They want more multitenancy, more flexibility in ramping up and shedding infrastructure, and better cloud storage.