The emerging discipline of Web Operations

The Podcast for this is now online.

In traditional IT environments, silos of domain expertise focus on the atomic elements that perform some kind of business function. These include the client, the WAN/network, the data center, the application cluster, and the back-end systems.

A traditional silo view of application responsibilities

The network and data center tiers are often represented by clouds (particularly when drawn by those whose responsibilities do not include them). This is because, while each is often a mixture of technologies, it is treated as a single logical unit that forwards and processes packets in some way.

A second way of describing the divisions of labor that perform a business function is the “layered” model.

A layered view of roles similar to the OSI model

While the tiered model borrows from the network topology, the layered model borrows from the OSI model of networking. Physical “facilities” teams, network connectivity teams, application developers, and other groups all play a role in delivering an application, but they’re seldom interconnected.

Of course, neither of these models transcends the tiers or layers by which organizational responsibilities are usually defined. The result is that the e-business group cares little for the impact of the platform, hardware, and application layers, and the networking group seldom worries about end-user performance as long as the packets are flowing.

But this is changing. Companies have dramatically increased the amount of customer-, partner-, and employee-based interaction they do via computer applications. While the web is the leading platform for such interactions, other initiatives—from thin-client terminals to Flash- or AJAX-based applications—are more and more common.

There’s a significant gap in operational tools to manage web applications. While network, platform, and server teams have traditionally focused on operational tasks, their application and e-business peers have been worried about deployment and design. But very few solutions can work across silos or organizational boundaries.

The result is the emerging discipline of Web Operations, which blends the operational tools of server, network, and platform operations with the customer- and business-process emphasis of marketing and e-business.

Web Operations is the intersection of traditional IT operations and e-business tools

These two domains differ significantly.

Operations

  • Measurement of success: Performance and availability of the application or infrastructure
  • Clients identified by: Source IP, region, ASN
  • Unit of measure: Packet, query, hit
  • Performance problems from: CPU overload, insufficient network capacity
  • Availability problems from: Faulty hardware, data corruption

e-business

  • Measurement of success: Conversion versus abandonment
  • Clients identified by: Referring search engine, user account
  • Unit of measure: Session, user
  • Performance problems from: Large page size, improper cache parameters
  • Availability problems from: Bad navigational logic, missing content

Only a blend of both domains can answer some of the most costly and perplexing problems that web operators face today. Web operators often need to combine data from operations (such as performance and availability) with more user-centric information (such as customer or subscriber groupings).
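To make this concrete, here is a minimal sketch in Python of combining the two kinds of data. Everything in it is an assumption for illustration: the record fields (`source_ip`, `response_ms`, `ok`), the group names, and the idea that clients can be mapped to subscriber groups by IP address. A real deployment would more likely join on a user account or session identifier.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical operations-side data: one record per HTTP request.
requests = [
    {"source_ip": "10.0.0.1", "response_ms": 120, "ok": True},
    {"source_ip": "10.0.0.1", "response_ms": 180, "ok": True},
    {"source_ip": "10.0.0.2", "response_ms": 900, "ok": False},
    {"source_ip": "10.0.0.2", "response_ms": 700, "ok": True},
    {"source_ip": "10.0.0.3", "response_ms": 150, "ok": True},
]

# Hypothetical business-side data: which subscriber group each client belongs to.
subscriber_group = {
    "10.0.0.1": "gold",
    "10.0.0.2": "trial",
    "10.0.0.3": "gold",
}

def summarize_by_group(requests, subscriber_group):
    """Report availability and mean response time per subscriber group."""
    buckets = defaultdict(list)
    for record in requests:
        # Clients with no known grouping are kept, not silently dropped.
        group = subscriber_group.get(record["source_ip"], "unknown")
        buckets[group].append(record)

    summary = {}
    for group, rows in buckets.items():
        summary[group] = {
            "requests": len(rows),
            "availability": sum(1 for r in rows if r["ok"]) / len(rows),
            "mean_response_ms": mean(r["response_ms"] for r in rows),
        }
    return summary

summary = summarize_by_group(requests, subscriber_group)
for group, stats in sorted(summary.items()):
    print(group, stats)
```

Neither data set answers the question alone: the request log knows nothing about who is a paying subscriber, and the subscriber list knows nothing about response times.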

When I talk with IT teams at e-business companies, they all have the same kinds of questions: questions for which there aren’t easy or immediate answers. Here are some of them:

  • What’s the performance and availability of key web functions like? I need to provide application performance visibility to executives in my organization, and it takes too much time and effort to create reports from disparate data sources. Existing reports are written for technical staff and are not executive-friendly.
  • Which groups are best or worst off? Which groups are above or below an “acceptable” service level? What I mean by “group” will change based on who I am (network, server, platform) and what kind of user model I run (B2B, B2C, Intranet).
  • What’s broken on my site? I configured some tests but they’re stale because the site changes often; I don’t have a lot of time to manage and configure monitoring applications because I’m often in firefighting mode. Broken might be “elements” like states, service providers, servers, or application functions; or it might be user sessions that didn’t achieve some kind of goal.
  • Why aren’t users achieving their goals? When someone doesn’t complete a goal I’d like, is it because they didn’t like my offer? Because they couldn’t understand or use the application properly? Because they got “stuck” due to bad programming on my part? Had a hard error? Or simply because they lost their connection for no related reason?
  • How much traffic can I handle? Based on what I’ve seen in the past, how many users can I support before performance becomes unacceptably slow?
  • How are errors affecting my users? Is my slow performance causing me to lose money? Do users switch to my competition? Or to another, more costly form of interaction like a phone call or an in-person visit?
  • Why is my web application slow? I’m looking at a function or time period where the response time of my application is very slow and I don’t know why. I need to get to the root cause of this so I can fix it.
  • How big a problem is this? Is the complaint or incident I’m investigating affecting everyone or just a single user?
  • Is the problem I’m looking at my fault or someone else’s? My customers sometimes experience slow performance that is caused by issues outside of my own network but they blame me for it. I need to be able to help my customers see the light.
  • How do back-end web or database services affect site performance? I’m using web services extensively as part of my application delivery and I don’t have a good way to see how web service calls affect my overall performance.
  • How should I investigate this alert or incident? I don’t always have a high degree of expertise in house to resolve complex performance issues. I need solutions that will help my level 1 technical staff resolve real issues with a concrete workflow.
  • How did the recent change affect the health of my application? I need to quantify how changes to my application are affecting the overall quality of my service delivery. I can’t do this today without extensive effort.
  • Did I meet my service objectives for a particular subscriber group? Which contracts or agreements am I at risk of violating?
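As one illustration of how a question from this list might be approached, the sketch below takes “How big a problem is this?” and computes the share of distinct users who hit at least one error. The session records and their fields (`user`, `error`) are hypothetical, and counting users rather than hits is a choice made here, not something the list prescribes.

```python
def impact_scope(sessions):
    """Return how many distinct users saw an error, out of all users seen."""
    all_users = {s["user"] for s in sessions}
    affected = {s["user"] for s in sessions if s["error"]}
    # Guard against an empty log so the division cannot fail.
    share = len(affected) / len(all_users) if all_users else 0.0
    return {
        "affected_users": sorted(affected),
        "total_users": len(all_users),
        "share": share,
    }

# Hypothetical session log: one user with an error, three without.
sessions = [
    {"user": "alice", "error": True},
    {"user": "alice", "error": False},
    {"user": "bob", "error": False},
    {"user": "carol", "error": False},
    {"user": "dave", "error": False},
]

scope = impact_scope(sessions)
print(scope)
```

Counting by user instead of by hit matters here: a single user retrying a broken page many times would inflate an error-per-hit figure without the problem being any more widespread.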

There are many important dimensions to consider when trying to understand what a particular web operations team will care about. While there are hundreds of ways to slice up web operations’ needs, I find that these three dimensions matter the most in terms of what kinds of problems a company has and how they’d like to solve those problems.

  • The company’s relationship with its users (B2B, B2C, or intranet)
  • The timeframe in which the team operates (tactical incidents, mid-tier reporting, or long-term planning)
  • The team’s organizational responsibility (network and CDN, server and platform, application development, or user/content owner)

At Interop in May we’ll be running the first WebOps Summit. It will include several companies whose businesses focus on the unique predicaments of WebOps teams, and should be an interesting event.

And Technorati does an interesting job of tracking the buzz for terms like Web Operations…

Web Operations 90-day history