Interop Unconference: A twist on tech conferences in Vegas

Bitcurrent will be presenting a new kind of conference in Las Vegas this year. It’s called Unconference, and it’s an attempt to make the event more interactive and collaborative. We’re pretty psyched about the event, and looking forward to trying the concept out.

Since 1999, we’ve been organizing conferences for Interop: First on performance, following the publication of a book on the subject; then on web operations, data centers, and so on. This year, we’re running the SaaS and Cloud Computing track, and we’ve got a lineup of experts and panelists.

But even though that conference track has folks from Google, Amazon, Akamai, and a host of startups participating, it’s Unconference that keeps us up at night.


FeedSync, a bit more detail

I wrote a piece for GigaOm today on Microsoft’s new FeedSync, a clever blend of RSS feeds and an update system that lets you keep changing bits of data — like an address or a calendar event — up to date.

In putting the article together, I asked Jonathan Ginter what he thought. He gave me a pretty thoughtful analysis that I didn’t have room to print in its entirety over at GigaOm; here it is.

I’d never heard of FeedSync, so I looked it up, glanced over the spec, etc. It’s an interesting idea and a neat way to leverage RSS to accomplish something different. I expect that it will do what it is designed to do quite well and has the hallmarks of a solution that might really catch on (although you can never really tell with these things). Its major flaw appears to be the inability of a publisher to know for certain whether a given subscriber has actually received all of its updates, which is a characteristic of RSS feeds in general. This forces the subscriber to do some heavy lifting on occasion. I’m not sure how this will scale if the cache being synchronized is very large. I like the fact that it is based on open protocols, unlike other cache sync solutions out there right now (e.g., Coherence).

In FeedSync, the publisher must provide its current cache state as an “initial” feed. All updates on that initial state are provided as an “update” feed. Subscribers to this service start by processing the full “initial” feed and then focus on the “update” feed. However, publishers are allowed to roll updates off the update feed, making them unavailable to any subscribers that have not yet polled that feed. Consequently, subscribers are expected to notice whenever they appear to have missed something on the update feed. When this happens, they must re-process the entire initial feed – followed by the update feed – to bring them back in sync. If the cache is very large, I can only assume that processing the initial feed could be a tremendous burden on the subscriber, especially in the presence of brittle communication. Two-way sync simply forces both sides to take on both roles and increases the burden.

As for how it compares with JMS, they are founded on different business needs. In its simplest reduction, JMS delivers messages from publisher to subscriber whereas FeedSync is trying to synchronize data caches.

JMS is really just another Message-Oriented-Middleware (MOM) framework – i.e., deliver messages from point-to-point in a secure, reliable fashion (MQSeries being another example). One of the basic assumptions for MOM is that the publisher will no longer hold on to the message once it is sent. This is crucial, since a 1-to-many delivery scenario means that MOMs must have features such as internal message storage, built-in retry strategies, etc. They can natively deal with the idea that each subscriber must be dealt with separately for delivery issues without impacting the publisher. Essentially, the publisher is able to hand the message to the MOM and delete it locally without having to worry about complex retry mechanisms. Since the publisher does not keep the information, MOM solutions assume that delivery failure is a potential crisis. Finally, MOM frameworks are built around the idea that once the subscriber has the data, the publisher and the middleware no longer have it.

FeedSync is much more aligned with the notion of distributed data caches. FeedSync exposes the full data cache through its “initial” feed queue, while maintaining an “update” feed queue for changes to the state of the data cache. The subscriber is polling the sender, which is not the case in MOM frameworks, where the publisher and subscriber are often completely decoupled. Data does not disappear as it passes through a FeedSync system, as it does with MOM-oriented solutions where data is moved instead of shared. In FeedSync, delivery failure is the subscriber’s problem. Unlike MOM solutions, FeedSync has built-in protocols to handle issues related to data synchronization – e.g., merging changes, flagging collisions, deleting data items (which it calls a “tombstone”), etc. JMS simply delivers data from point A to point B with none of these built-in semantics.

JMS is similar to web services in a lot of respects – i.e., it carries documents, expects to be asynchronous, etc. Could you replace web services with FeedSync (or the reverse)? Would you want to?

I’m sure my opinions on this would draw a lot of defensive rebuttals, though, from both sides. 😉

I figured his perspective was too good to keep to myself, and it’s a decent clarification of the subtle differences between MOM and cache updates. It’s always nice to be able to ask people smarter than yourself for their opinions, then pass them off as your own.
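To make the catch-up behaviour Jonathan describes a little more concrete, here’s a minimal sketch of a subscriber’s polling loop. The item fields, feed URLs, and JSON shape are all invented for illustration; this shows the initial-feed/update-feed pattern, not the actual FeedSync wire format.

```typescript
// A sketch of a FeedSync-style subscriber, NOT the actual spec: item fields,
// feed URLs, and the JSON shape are invented for illustration.
interface SyncItem {
  id: string;          // sync identifier
  version: number;     // per-item update counter
  deleted?: boolean;   // a "tombstone" marking a deletion
  payload?: unknown;
}

const INITIAL_FEED = "https://publisher.example.com/feed/initial";
const UPDATE_FEED = "https://publisher.example.com/feed/updates";

// Assumed helper: fetch a feed and parse it into items plus the sequence
// number of the oldest entry still present on that feed.
async function fetchFeed(url: string): Promise<{ items: SyncItem[]; oldestSeq: number }> {
  const res = await fetch(url);
  return res.json();
}

const cache = new Map<string, SyncItem>();
let lastSeenSeq: number | null = null;

function apply(item: SyncItem): void {
  const existing = cache.get(item.id);
  if (existing && existing.version >= item.version) return; // we already have this, or newer
  if (item.deleted) cache.delete(item.id);                  // honour the tombstone
  else cache.set(item.id, item);
}

async function poll(): Promise<void> {
  const updates = await fetchFeed(UPDATE_FEED);

  // If the publisher rolled entries off the update feed since our last poll,
  // we may have missed something: fall back to re-reading the whole initial
  // feed, which is the expensive path for a very large cache.
  if (lastSeenSeq === null || updates.oldestSeq > lastSeenSeq + 1) {
    const initial = await fetchFeed(INITIAL_FEED);
    cache.clear();
    initial.items.forEach(apply);
  }

  updates.items.forEach(apply);
  lastSeenSeq = updates.oldestSeq + updates.items.length - 1;
}
```

The expensive branch is the full re-read of the initial feed; with a very large cache and brittle connectivity, that’s exactly the burden Jonathan points out.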

WordPress and theme hacks

After an interesting weekend, I wrote an article for GigaOm about WordPress themes and their vulnerabilities. It got lots of press — even made the front page of Digg! — and several people propelled the story to new levels.

Nice to see the amount of activity on the topic and how much coverage it got. Derek, Paul, and Mark had all rung the warning bell earlier on.

Green Code and the Internet OS

Been writing elsewhere a bit lately, while out at Web2Summit in San Francisco.

Off to New York for Interop and some interesting research on data centers now.

VOIP on the iPhone: Packets versus carriers

The good news is that with the announcement of Skype support on the iPhone via the Safari browser, the openness of Internet technology seems to have leaked into Apple’s device. This breaks up attempts at traditional exclusivity and monopoly, replacing them with openness and innovation. Whether by design or by oversight, the first chink in the armor is the delivery of voice services in a way that keeps money out of carriers’ pockets.

There will be lots of others coming soon.

The details

When Apple released the iPhone, they did it in conjunction with an exclusive carrier. They needed the infrastructure, and they needed someone to build the network side of their Visual Voicemail. So they got AT&T, and in return AT&T got the iPhone.

But this isn’t going to stick. The first blow to iPhone exclusivity came yesterday from Shape Services. By browsing to http://skypeforiphone.com/, an iPhone user can run Skype via the web browser built into the phone.

It works on a normal data connection (but the cost for data services could be awful.)

But here’s the important bit: it works with built-in Wifi, which means all of the money AT&T would make from calls when you’re in a Wifi zone could potentially go away. And with Wifi blanketing the world (fueled by collaborative hotspots and civic initiatives), that’s a lot of lost revenue. As in, “gee, no phone calls from San Francisco this month.”

One wonders whether Apple foresaw this when it announced that apps for the iPhone would be built through the Safari web browser. At first there was a huge backlash against the company for locking the phone itself (and a race to try and break it.) But now, people seem to be coming around to the world of apps in browsers. It’s a real finesse for Apple, since it gets people off the operating system and onto the Internet, levelling the playing field to a degree in their rivalry with Microsoft.

More details after the jump.


The emerging discipline of Web Operations

The Podcast for this is now online.

In traditional IT environments, silos of domain expertise focus on the atomic elements that perform some kind of business function. These include the client, the WAN/network, the data center, the application cluster, and the back-end systems.

A traditional silo view of application responsibilities

The network and data center tiers are often represented by clouds (particularly when drawn by those whose responsibilities do not include them.) This is because, while they are often a mixture of technologies, they are treated as a single logical unit that forwards and processes packets in some way.

A second way of describing the divisions of labor that perform a business function is the “layered” model.

A layered view of roles similar to the OSI model

While the silo model borrows from the network topology, the layered model borrows from the OSI model of networking. Physical “facilities” teams, network connectivity teams, application developers, and other groups all play a role in delivering an application–but they’re seldom interconnected.

Of course, neither of these models transcends the tiers or layers around which organizational responsibilities are usually defined. The result is that the e-business group cares little about the impact of the platform, hardware, and application layers, and the networking group seldom worries about end-user performance as long as the packets are flowing.

But this is changing. Companies have dramatically increased the amount of customer-, partner-, and employee-based interaction they do via computer applications. While the web is the leading platform for such interactions, other initiatives—from thin-client terminals to Flash- or AJAX-based applications—are more and more common.

There’s a significant gap in operational tools to manage web applications. While network, platform, and server teams have traditionally focused on operational tasks, their application and e-business peers have been worried about deployment and design. But very few solutions can work across silos or organizational boundaries.

The result is the emerging discipline of Web Operations, which blends the operational tools of server, network, and platform operations with the customer- and business-process emphasis of marketing and e-business.

Web Operations is the intersection of traditional IT operations and e-business tools

These two domains differ significantly.

| | Operations | e-business |
|---|---|---|
| Measurement of success | Performance and availability of the application or infrastructure | Conversion versus abandonment |
| Clients identified by | Source IP, region, ASN | Referring search engine, user account |
| Unit of measure | Packet, query, hit | Session, user |
| Performance problems from | CPU overload, insufficient network capacity | Large page size, improper cache parameters |
| Availability problems from | Faulty hardware, data corruption | Bad navigational logic, missing content |

It is only by blending both domains that we can tackle some of the most costly and perplexing problems web operators face today. Web operators often need to combine data from operations (such as performance and availability) with more user-centric information (such as customer or subscriber groupings.)
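As a toy illustration of that blend, here’s a sketch that joins per-request measurements (an operations concern) with a customer grouping (an e-business concern) to get per-group response time and availability. The record shapes, segment names, and thresholds are all invented for the example.

```typescript
// Invented record shape: one row per HTTP request, as an ops tool might log it,
// plus a lookup that maps a user account to a business segment.
interface RequestRecord {
  userId: string;
  responseMs: number;
  error: boolean;
}

type SegmentLookup = (userId: string) => string; // e.g. "gold", "trial", "partner"

interface SegmentStats {
  requests: number;
  errors: number;
  totalMs: number;
}

// Aggregate raw request data by business segment.
function statsBySegment(records: RequestRecord[], segmentOf: SegmentLookup): Map<string, SegmentStats> {
  const stats = new Map<string, SegmentStats>();
  for (const r of records) {
    const seg = segmentOf(r.userId);
    const s = stats.get(seg) ?? { requests: 0, errors: 0, totalMs: 0 };
    s.requests += 1;
    s.errors += r.error ? 1 : 0;
    s.totalMs += r.responseMs;
    stats.set(seg, s);
  }
  return stats;
}

// Which groups are above or below an "acceptable" service level?
// (The thresholds here are arbitrary examples.)
function breachedGroups(stats: Map<string, SegmentStats>, maxAvgMs = 2000, minAvailability = 0.995): string[] {
  return Array.from(stats.entries())
    .filter(([, s]) => s.totalMs / s.requests > maxAvgMs || 1 - s.errors / s.requests < minAvailability)
    .map(([seg]) => seg);
}
```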

When I talk with IT teams at many of the e-business companies out there, they all have the same kinds of questions–questions for which there aren’t easy or immediate answers. Here are some of them:

  • How are key web functions performing, and how available are they? I need to provide application performance visibility to executives in my organization, and it takes too much time and effort to create reports from disparate data sources. Existing reports are written for technical staff and aren’t executive-friendly.
  • Which groups are best or worst off? Which groups are above or below an “acceptable” service level? What I mean by “group” will change based on who I am (network, server, platform) and what kind of user model I run (B2B, B2C, intranet).
  • What’s broken on my site? I configured some tests, but they’re stale because the site changes often, and I don’t have a lot of time to manage and configure monitoring applications because I’m often in firefighting mode. “Broken” might mean elements like states, service providers, servers, or application functions; or it might mean user sessions that didn’t achieve some kind of goal.
  • Why aren’t users achieving their goals? When someone doesn’t complete a goal I’d like them to, is it because they didn’t like my offer? Because they couldn’t understand or use the application properly? Because they got “stuck” due to bad programming on my part? Because they hit a hard error? Or simply because they lost their connection for an unrelated reason?
  • How much traffic can I handle? Based on what I’ve seen in the past, how many users can I support before performance becomes unacceptably slow?
  • How are errors affecting my users? Is my slow performance causing me to lose money? Do users switch to my competition? Or to another, more costly form of interaction like a phone call or an in-person visit?
  • Why is my web application slow? I’m looking at a function or time period where the response time of my application is very slow and I don’t know why. I need to get to the root cause of this so I can fix it.
  • How big a problem is this? Is the complaint or incident I’m investigating affecting everyone or just a single user?
  • Is the problem I’m looking at my fault or someone else’s? My customers sometimes experience slow performance that is caused by issues outside of my own network but they blame me for it. I need to be able to help my customers see the light.
  • How do back-end web or database services affect site performance? I’m using web services extensively as part of my application delivery and I don’t have a good way to see how web service calls affect my overall performance.
  • How should I investigate this alert or incident? I don’t always have a high degree of expertise in house to resolve complex performance issues. I need solutions that will help my level 1 technical staff resolve real issues with a concrete workflow.
  • How did the recent change affect the health of my application? I need to quantify how changes to my application are affecting the overall quality of my service delivery. I can’t do this today without extensive effort.
  • Did I meet my service objectives for a particular subscriber group? Which contracts or agreements am I at risk of violating?

There are many important dimensions to consider when trying to understand what a particular web operations team will care about. While there are hundreds of ways to slice up web operations’ needs, I find that these three dimensions matter the most in terms of what kinds of problems a company has and how they’d like to solve those problems.

  • The company’s relationship with its users (B2B, B2C, or intranet)
  • The timeframe in which the team operates (tactical incidents, mid-tier reporting, or long-term planning)
  • The team’s organizational responsibility (network and CDN, server and platform, application development, or user/content owner)

At Interop in May we’ll be running the first WebOps Summit. It will include several companies whose businesses focus on the unique predicaments of WebOps teams, and should be an interesting event.

And Technorati does an interesting job of tracking the buzz for terms like Web Operations…

Web Operations 90-day history

AJAX and networking — feedback

Gave the AJAX and networking presentation today, and got reasonably good feedback. The presentation’s available here. People want more details: more scientific analysis, better traces, a look at the differences with caching enabled, and so on.

What’s clear is that people haven’t really thought about this, and that developers can slide AJAX in quickly without the networking team knowing about it.

The impact of AJAX on web operations

I decided to look into AJAX for a class I’m teaching at Interop in New York this week, and it was an interesting project. I’ve talked to some of the people who popularized it, as well as people who’ve built applications with it, and everyone seems consumed with how great, how flexible, how versatile it is. But few people have looked into the impact it has on networks. (One notable exception is Jep Castelein of Backbase.)

Now, I’m a pretty pragmatic guy; and I can’t code to save my life. But I’ve talked to lots of folks about running large-scale web applications, and there are some fairly basic things that they worry about. Availability, for example. Simply making sure that the site works reasonably quickly for everyone. And maintainability—so that when it does break, we can fix it.

Here’s what I learned.

First of all: What is AJAX? Specifically, it’s a term coined by an expert in usability and application design named Jesse James Garrett. He was referring to a specific combination of client- and server-side technologies that developers use to make cool web sites. Okay, that’s a huge oversimplification. But AJAX is behind things like GMail and Google Maps. And a couple of years ago, people had pretty much decided that web-based mail (Hotmail) and mapping (Mapquest) were technical dead-ends. Now you’ve got e-mail that saves a backup in case you’re cut off, and maps that you can zoom and drag.

Big deal, you say. My PC can do that. Well, that’s not the point. When your web page can do that, it means you can make cool applications that don’t follow the click-and-wait approach we’re used to on the web.

The Internet Geeks (mea culpa) are up in arms about this. Some say it’s the greatest innovation since HTML, freeing us from the tyranny of clumsy forms and inhumane interfaces. Others say it breaks the nice, strong metaphor of pages and documents on the Internet and will result in hundreds of different user interface widgets that are cool to their inventors but incomprehensible to others. Adam Bosworth, formerly Chief Architect at BEA and now over at Google, is on my list of cool people to have dinner with someday, primarily because of some amazing stuff he wrote about the migration from relational databases to message-centric networking. And Alex Bosworth has some excellent, hype-free guidelines on AJAX design.

I say, let the chips fall where they may. AJAX and its ilk—more broadly referred to as client-side scripting or rich client interfaces—are here to stay. Want proof? The underpinnings of this technology, which lets web page scripts talk XML to servers, have been a part of Microsoft Exchange’s Outlook Web Access since 1998. So you probably used it this week if you accessed your company’s e-mail from home.

Let’s dispense with the political fallout and look at what AJAX actually does. In my (admittedly unscientific) research, I looked at some popular sites, and compared them to their non-AJAX counterparts. I then talked to several people who spend their time working on these applications to get a sense of where they’re going. My conclusions (so you’ll be intrigued and keep reading) are:

  • Many more HTTP requests (“hits”) per second
  • A bigger first page, followed by many smaller updates
  • In some applications, one hit per keystroke or mouseclick
  • Pre-fetching 3-4 times as much data as the user actually needs

Yikes!

So why on earth would someone actually use this stuff? The answers are speed and usability. Applications written in AJAX-like approaches “feel” more responsive, and are therefore more engaging. They also tend to do nice things like checking forms and populating drop-down dialogs more intelligently.

To understand why AJAX has these effects, we need to spend a bit of time digging into networking. Two of the most important concepts in network performance are latency and throughput. They’re different, but related, and they both affect how much data you can pump across a network. It’s easy to use a highway analogy to understand this.

  • Latency is how long it takes traffic to get where it’s going: think of the speed limit on the highway. If it’s a 60 mph highway, it takes you an hour to go sixty miles. If you double the speed, you get there in half the time (derr, as Jess would say). It also means (on a highway) that more cars arrive per hour. The problem is, there’s a limit to speed. Just as bad things would happen if cars started driving a thousand miles an hour, networks can only move packets of data so fast. And it turns out there’s a good reason for that: the speed of light. A couple of years ago, I gave a presentation in Vegas on performance, and as part of it I did the math on the network latency from New York to Las Vegas. It’s always going to take at least 13 milliseconds (there’s a back-of-the-envelope version of that calculation after this list). So there.
  • Throughput is how wide the highway is. A two-lane highway can carry twice as many cars, at the same speed, as a one-lane highway. And this is easy to grow—there’s really no limit to how many lanes a highway can have. The problem is, if you’re in one of the cars, you’re still going to take an hour to get where you’re going.
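Here’s that back-of-the-envelope calculation; the distance and propagation speeds are rough assumptions, not measurements.

```typescript
// Rough minimum one-way propagation delay, New York to Las Vegas.
// Assumptions: ~3,600 km great-circle distance; light travels ~300 km/ms
// in a vacuum and roughly 200 km/ms in optical fiber.
const distanceKm = 3600;
const vacuumKmPerMs = 300;
const fiberKmPerMs = 200;

console.log(`Absolute floor (vacuum): ${(distanceKm / vacuumKmPerMs).toFixed(0)} ms`); // ~12 ms
console.log(`More realistic (fiber):  ${(distanceKm / fiberKmPerMs).toFixed(0)} ms`);  // ~18 ms
// Real routes are longer than the great circle, and a round trip doubles it,
// so "at least 13 milliseconds" one-way is if anything generous.
```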

Okay, so we have latency (which we can’t change until the Star Trek days) and throughput (which is easy to fix; there’s already a glut of bandwidth out there and we really aren’t splitting the spectrum much yet anyway.)

Now, consider that latency is tightly related to usability and personal productivity. A number of studies into optimal worker productivity by various organizations (IBM is one; the concept of flow itself was pioneered by Mihaly Csikszentmihalyi) show that people are productive when they’re in a state of concentration called flow state. And flow state takes place when people feel like they’re interacting in real time. Here’s my unscientific table of response times and usability.

| Responsiveness | Feels like | Results in | Good for |
|---|---|---|---|
| Under 1 millisecond | Instantaneous | Total immersion | Video games, virtual reality |
| 1-10 milliseconds | Real-time | Browsing, exploration, affordances | UI devices (buttons, etc.), completing forms |
| 10-100 milliseconds | Natural conversation | Back-and-forth pauses | Voice conversations, moving from area to area |
| 100 milliseconds – 1 second | Brief pauses | Concentration | Instant messenger, authentication, completing a process |
| 1-10 seconds | Longer pauses | Some distraction but continued engagement | Getting search results, starting a video stream |
| 10 seconds or more | Waiting in a queue | Loss of concentration (Alt-Tab to something else) | Batch-based computation, offline downloads |

So here’s the problem. UI designers want their users to be productive, and to feel engaged in the application. But to get to that “real-time” state, they need to get under 10 milliseconds’ latency—which is tough across the Internet. And if they’re ever going to shake the hold that the desktop has on the design of applications, they need to fix this.

Latency, as we’ve seen, is very expensive. Beyond a certain point, no amount of money will buy you less of it. There’s nothing we can do to speed it up. Or is there?

One trick designers can employ is pre-fetching. This is the act of grabbing several things you might need, so that they’re ready when you do need them. In our highway analogy, this is like putting a police car, an ambulance, and a fire truck on the highway as soon as an emergency call comes in. Once you know the nature of the emergency, the latency before the right vehicle arrives goes way down, since it’s already on its way.

“But wait,” you say. “Isn’t that ridiculously expensive, sending three times as many vehicles as you need?” It would be if it weren’t for the fact that latency is fantastically more expensive than throughput. Put another way: Emergency services vehicles are cheap; quick response is precious.

A second issue AJAX tries to tackle is form prepopulation. In this case, the application tries to anticipate what you want (for example, by showing the most popular search terms or a list of known e-mail recipients.) Google Suggest is one such application.

Google suggest example

In fact, Alex Bosworth lists ten places where AJAX makes sense.

| Use case | Impact on the network |
|---|---|
| Form driven interaction | Smaller messages but may trigger additional traffic if autosaved intermittently. |
| Deep hierarchical tree navigation | Lazy loading of data is like pre-fetching; trading additional data for improved responsiveness. |
| Rapid user-to-user communication | Many small messages; people tend to type shorter sentences. |
| Voting, Yes/No boxes, Ratings submissions | More, smaller messages. |
| Filtering and involved data manipulation | Additional prefetching (draggable maps) and more transactions (one message per click-and-drag). |
| Commonly entered text hints/autocompletion | One transaction per keystroke; prefetching. |

(Yes, according to Alex, this is ten entries. But he’s smart, so I’ll agree.)

Okay, so I looked at both these cases. For the first one, I put a sniffer onto a Google search and looked up “Coradiant” (where I work.) Then I did the same with Google Suggest. You can try it if you like.
In the first case, there are two web hits—one for the search page, and one for the results page:

A normal search

Here’s the trace:

Normal trace

In the second, there are as many hits as there are letters (but the hits are much smaller.)

Google Suggest dialog box

Here’s the second trace:

Traces

(notice all of those hits, one per character? Scary…) So we have a lot more back-and-forth HTTP transactions.
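To make the per-keystroke pattern concrete, here’s a minimal sketch of the kind of request a suggest-style UI fires as you type. The /suggest endpoint and the response format are made up; this is not Google’s actual API, just the shape of the traffic.

```typescript
// Fire one small request per keystroke, the way early suggest-style UIs did.
// (Modern implementations usually debounce; this sketch deliberately doesn't,
// to show why the trace above has one hit per character.)
const input = document.getElementById("search") as HTMLInputElement;
const list = document.getElementById("suggestions") as HTMLUListElement;

input.addEventListener("keyup", () => {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", "/suggest?q=" + encodeURIComponent(input.value)); // hypothetical endpoint
  xhr.onreadystatechange = () => {
    if (xhr.readyState !== 4) return;
    if (xhr.status !== 200) return; // quietly drop failures in this sketch
    const suggestions: string[] = JSON.parse(xhr.responseText);
    list.innerHTML = suggestions.map(s => `<li>${s}</li>`).join(""); // no escaping; illustration only
  };
  xhr.send();
});
```

Every keystroke is a fresh HTTP transaction, which is exactly what shows up in the second trace.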

The second thing I looked at was a side-by-side comparison of a “traditional” mapping application (Mapquest) and an AJAX-enabled one (Google Maps.)

For Mapquest, I zoomed in, then recentered three times:

Mapquest screen

I did the same for Google Maps:

Google Maps shot

Here’s what I found.

| Site | MapQuest | Google Maps |
|---|---|---|
| HTTP requests | 97 | 217 |
| MBytes | 0.88 | 3.69 |
| Usability | Pause between clicks | Smooth, continuous |
| Wait | ~1.5 seconds | None, but repainting in background |

And we have a lot more data.

One of the big reasons for the higher bandwidth is the pre-fetching. In a maps application, you’re getting “tiles” of surrounding map in every direction, but the user’s probably only going to drag in one of them. Think about it for a bit: the viewport only shows a couple of the nine or so tiles that get loaded initially, so we end up downloading three to four times as many images as we actually need. This is a worst-case scenario; a lot depends on the size of the window and the tiles, etc. (A rough version of the arithmetic is sketched below.)
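Here’s one way to get to that three-to-four-times figure, assuming the viewport covers roughly a 2x2 block of tiles and the application keeps a one-tile border loaded around it; the exact ratio depends on window and tile sizes.

```typescript
// Rough pre-fetch overhead for a tiled map: the app loads the visible tiles
// plus a border of off-screen tiles so a drag can be painted immediately.
function prefetchRatio(visibleCols: number, visibleRows: number, border = 1): number {
  const loaded = (visibleCols + 2 * border) * (visibleRows + 2 * border);
  const visible = visibleCols * visibleRows;
  return loaded / visible;
}

console.log(prefetchRatio(2, 2)); // 16 / 4 = 4x the data actually on screen
console.log(prefetchRatio(3, 2)); // 20 / 6 ≈ 3.3x
// Somewhere between three and four times as many tiles as the user sees,
// which lines up with the bandwidth numbers in the table above.
```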

Here’s what you see:

A single Google Map screen

And here’s what your browser downloaded (sort of)

Google preload

So most AJAX implementations add more HTTP transactions and pre-fetch more data. They also burden the browser with a lot of processing (though most modern browsers are OK on a fairly current machine.) Browser performance tends to degrade sharply as the amount of data the script has to process grows.

For those of us who run networks, this means we’ll get more hits, more often. Those hits will start with a larger object, then a series of small messages or a series of content updates whose size will vary by application.

Some of my conclusions from all this, then:

  • Provide clear feedback to the user: be smart about preloading data, and handle the XMLHttpRequest object properly so that background failures surface instead of disappearing silently (a minimal error-handling sketch follows this list).
  • Plan for more throughput and latency complaints. Remember that throughput is cheap but response time is expensive; some people think latency matters four times as much as download size.
  • Recognize that the “hypertext” metaphor vanishes, so familiar concepts like back, forward, and bookmarking no longer apply by default.
  • Without feedback, users will complain. Background errors will seem inexplicable, and slow-downs will be difficult to troubleshoot.
  • Ask your developers for the key factors that shape traffic profiles: message size, the user interaction that triggers a transaction, the amount of data being pre-fetched, and the expected network latency of typical users.
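As one hedged example of the “clear feedback” point, here’s a small wrapper that shows a busy indicator and surfaces background failures instead of swallowing them. The element IDs and the URL passed in are placeholders, not part of any real application.

```typescript
// Minimal sketch: wrap XMLHttpRequest so background requests show a spinner,
// report failures, and time out instead of hanging silently.
// "spinner", "error-bar", and the URL are placeholders.
function backgroundRequest(url: string, onData: (body: string) => void): void {
  const spinner = document.getElementById("spinner")!;
  const errorBar = document.getElementById("error-bar")!;

  const xhr = new XMLHttpRequest();
  xhr.open("GET", url);
  xhr.timeout = 10000; // 10 s: past this, the user has mentally moved on anyway

  spinner.style.display = "inline";
  errorBar.textContent = "";

  const done = () => { spinner.style.display = "none"; };

  xhr.onload = () => {
    done();
    if (xhr.status >= 200 && xhr.status < 300) onData(xhr.responseText);
    else errorBar.textContent = `Update failed (HTTP ${xhr.status}); retrying may help.`;
  };
  xhr.onerror = () => { done(); errorBar.textContent = "Lost the connection while updating."; };
  xhr.ontimeout = () => { done(); errorBar.textContent = "The server is responding slowly."; };

  xhr.send();
}
```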

More to follow as I get feedback from Wednesday’s presentation.

Interop this week

I’m spending the week in New York at Interop. Should be interesting: lots of networking folks, three days of conferences, plus some workshops and classes.