Event Management in IT Operations. The Journey of RapidOSS v3

IT infrastructure needs to be managed holistically. It's a given. How do we do that? It's been a painful journey.

First there were the "frameworks" (Tivoli/CA). The idea of building all management tools on a common framework, so that they would all be integrated, share a common user interface, etc., was very appealing. But the execution did not match the promise.

A single framework to manage the entire IT infrastructure turned out to be a pipe dream: too costly, too difficult, and too painful to buy, implement, and maintain. After spending hundreds of thousands (millions?) of dollars, many IT organizations abandoned their framework projects or significantly scaled them down. Most people who worked on framework implementation projects can still feel the bad taste left in their mouths.

The pendulum swung the other way, and most organizations moved to implement the best tools (aka best of breed or point solutions) they could find to manage different technology silos and management disciplines. These were easier to implement for both technical reasons (solve one problem) and organizational reasons (no need for collaboration between departments), made it easier to show some progress, delivered a quicker return on investment (ROI), etc.

Point solutions have been very successful in the market. But they brought their own set of problems, mainly integration. By their nature, these tools manage only parts of the infrastructure and are ignorant of the rest. In a highly interdependent world, it is no surprise that they fell short of meeting the needs.

Users complain the application is slow. What to do? Where to start? The problem may be in the LAN, WAN, application server, database, desktop, etc. What does slow mean anyway? What are the acceptable parameters for this application? How important is this application? What is the impact on business? What changed? Do the monitoring tools say anything? Is it possible to tell which events are relevant to this application?

The answers to all these questions may exist, but they are spread across different management systems. For example, the problem may manifest itself in different ways in the various availability and performance monitoring solutions for network, servers, applications, etc. The information in these tools is not sufficient in isolation, yet analyzed together it can play an essential role in identifying where the problem may be. In short, with the proliferation of the point-solution philosophy, integrating these tools to provide a holistic view of the IT infrastructure has become the major challenge for IT organizations.

Event management systems (aka enterprise consoles, manager of managers) like IBM (Micromuse/Tivoli) Netcool arrived in the nick of time to fill this void. The idea was simple: implement a tool above all others to consolidate all IT-related events into a single location, giving IT Operations teams visibility into what's going on in different silos. This loose integration approach (the only integration between systems is events) is easy to understand and much easier to get up and running. From an organizational standpoint, there is little resistance since (unlike the framework approach) different silos/departments can continue to use their beloved tools, hence minimal disruption to existing systems.

Merely consolidating the events is a big step forward, but there are many more steps to go before event management systems can become effective tools in IT operations. What are these steps? Let's take a look at the common functions in event management:


Consolidation: The first step of event management. Receive/collect events from any source through standard and API-based transports and consolidate them in a single repository.
Normalization: Transform events from different sources into a common event format.
Filtering: Eliminate unnecessary events before they get into the system.
Association: Associate the event with the relevant IT infrastructure component (application, network device, server, etc.). Association is a crucial step, as it is an enabler for higher-level event management functions such as enrichment.
Enrichment: Add information (business unit, location, asset information, configuration changes, tickets, etc.) from other data sources to events, either by adding the data directly into the event record/object or by adding the data needed to access the additional information. Enrichment is an enabler for prioritization and correlation.
Prioritization: Set the priority of events programmatically, based on information available in the event itself or in the objects the event is related to.
Correlation: Correlate related events (matching resolution events with problem events, specifying cause-effect relationships among events from different tools, detecting X events in Y seconds, etc.).
Lifecycle: Allow users to manage the lifecycle of an event to match the operational processes: acknowledge an event, assign events to a user or group, suppress or escalate an event, archive an event, etc.
Presentation: Present the events to the users through a user interface.
Notification: Not all users can be expected to watch the glass at all times for events. Send the information to where the people are: notifications via email, SMS, IM, phone call, etc.
Reporting: The event repository can be a tremendous resource for analyzing and reporting on how IT is performing. Provide capabilities to create reports using active and historical events.
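
To make a few of these functions more concrete, here is a minimal Python sketch of normalization, filtering, association/enrichment, prioritization, and one simple correlation rule (X events in Y seconds). It is only an illustration and not how RapidOSS or Netcool implement them: the Event fields, the raw record keys (node, sev, text, time), the INVENTORY lookup standing in for a CMDB, and the priority formula are all assumptions made up for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Common event format (hypothetical fields chosen for this sketch)."""
    source: str
    host: str
    severity: int            # 0 = clear, 1 = info, ... 5 = critical
    message: str
    timestamp: float
    context: dict = field(default_factory=dict)
    priority: int = 0


# Hypothetical stand-in for a CMDB or asset database, keyed by host name.
INVENTORY = {
    "db01": {"application": "billing", "business_unit": "finance", "tier": 1},
    "web07": {"application": "intranet", "business_unit": "hr", "tier": 3},
}

# Assumed mapping from one tool's severity labels to the common scale.
SEVERITY_MAP = {"clear": 0, "info": 1, "warning": 3, "major": 4, "critical": 5}


def normalize(raw: dict, source: str) -> Event:
    """Normalization: map a source-specific record onto the common format."""
    return Event(
        source=source,
        host=raw["node"].lower(),
        severity=SEVERITY_MAP.get(raw.get("sev", "info").lower(), 1),
        message=raw.get("text", ""),
        timestamp=float(raw["time"]),
    )


def keep(event: Event) -> bool:
    """Filtering: drop purely informational events."""
    return event.severity != 1


def enrich(event: Event) -> Event:
    """Association and enrichment: attach the component record, if known."""
    event.context = dict(INVENTORY.get(event.host, {}))
    return event


def prioritize(event: Event) -> Event:
    """Prioritization: weigh severity by the tier of the affected component."""
    tier = event.context.get("tier", 3)   # unknown hosts get the lowest tier
    event.priority = event.severity * (4 - tier)
    return event


def process(raw_events, source):
    """Run raw records from one source through the steps above."""
    events = (normalize(raw, source) for raw in raw_events)
    return [prioritize(enrich(e)) for e in events if keep(e)]


def burst(events, count, window):
    """Correlation: True if `count` events fall within `window` seconds."""
    recent = deque()
    for ts in sorted(e.timestamp for e in events):
        recent.append(ts)
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= count:
            return True
    return False


if __name__ == "__main__":
    raw = [
        {"node": "DB01", "sev": "critical", "text": "disk full", "time": 100},
        {"node": "web07", "sev": "info", "text": "heartbeat", "time": 101},
        {"node": "web07", "sev": "warning", "text": "slow response", "time": 105},
    ]
    for event in process(raw, "monitor-a"):
        print(event.host, event.priority, event.context.get("application"))
```

Note how prioritization only becomes possible once the event has been associated with a component and enriched with its tier, which is why association and enrichment come first in the chain.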



The market leader in event management, Netcool Omnibus, does well in the collection layer, but poorly in the intelligence and presentation layers, limiting the value provided by the solution. Why is that? Setting aside the commercial concerns, IMHO, there are several underlying technical shortcomings:

  • No embedded or integrated model (CMDB) to provide context for the events

  • In-memory database, limiting the amount of data that can be kept in the system

  • No mechanism to easily integrate (federate) with external data sources

  • Reliance on proprietary (and inferior) presentation technologies instead of open, modern web technologies to create user interfaces

As a result, Netcool Omnibus requires a separate product (Netcool Impact) to add intelligence (enrichment/prioritization/correlation) and others to improve the presentation layer (TBSM, Netcool WebTop, Netcool Reporter, and more). The result is a complex (not to mention very expensive) solution that is hard to implement and even harder to maintain.

The reason for that "not so brief" background is to give some insight into where the motivation to develop RapidOSS came from. We call RapidOSS "an integration, automation and presentation suite for IT operations management". It is our attempt to develop a solution that performs the functions listed above, for better IT operations management. RapidOSS is a complete IT operations management solution, yet instead of rip and replace, it strives to complement existing systems by addressing their shortcomings, increasing the return on the investment already made in them. It can be used as a traditional event management solution that consolidates events; as an IT operations management console to orchestrate all IT management tools; as a portal to provide IT management information to customers and business users; as a way to enhance existing management systems behind the scenes by bringing the power of standard open technologies into the proprietary world; or as a means to help organizations advance in their struggle to better align IT with business objectives.

To deliver on such a broad set of applications, it is designed from the ground up as an open solution, leveraging leading technologies, from web technologies to modeling to dynamic scripting languages, to minimize implementation times and total cost of ownership and to maximize skills reuse.

RapidOSS v3 has been in the making for over a year now, starting with the development of the underlying CMDB. It is exciting for us to see the ideas and concepts we've discussed for many hours take shape and come together. Naturally, the ultimate test is to see whether our creation is of value to others. There is no better reward than that.

Give it a try; I'm certain you'll find it worth your time. If you do, we'd be grateful to hear your thoughts!