by Richard Ferri, Dr. Moon J. Kim, & Dr. Dikran S. Meliksetian
In October 2002, when IBM CEO Sam Palmisano ushered in the new age of on-demand computing to a group of 300 customers, some asked whether the on-demand initiative was just more marketing hype - or a fundamental change in the way customers will view computing in the future. The answer is that on-demand computing represents a fundamental change in how customers will use technology.
Understanding this change requires a thorough grasp of grid computing and how it complements IBM's on-demand strategy. We will explore the nature of grids, the difference between clusters and grids, and how dynamic virtual organizations can use grids to share information and solve problems in an on-demand world. We will explain why protocols are important to grids, and conclude with a snapshot of perhaps the best-known grid, the TeraGrid.
Specifics of On Demand
Systems for an e-business on demand should be empowered by responsive, flexible, focused, and resilient technologies such as autonomic computing technologies. An on-demand autonomic computing approach enables a network of organized, self-managing computing components that provides customers with what they need, when they need it, with less effort.
The dynamics of the market are changing, and that requires a new way of thinking about business processes and the information technology infrastructure that supports them. IBM is entering the next phase of e-business, in which companies move beyond simply integrating their various processes, to a world in which they need to be able to sense and respond to fluctuating market conditions in real time. IBM is leading the way toward the on-demand world, and has the business insight and technological expertise to help customers become on-demand businesses.
An on-demand business is an enterprise whose business processes - integrated end to end across the company and with key partners, suppliers, and customers - can respond with agility and speed to any customer demand, market opportunity, or external threat. An on-demand system is dynamically responsive: it uses variable structures and adapts flexibly to changes in the business environment. This flexibility enables it to reduce risk and to perform at high levels of productivity, cost control, capital efficiency, and financial predictability. It is focused on its core competencies and differentiating tasks and assets. Finally, it is resilient enough to manage changes and threats.
In an on-demand business the computing environment is integrated to allow systems to be seamlessly linked across the enterprise and its entire range of customers, partners, and suppliers. It uses open standards, so different systems can work together and link with devices and applications across organizational and geographic boundaries. It is virtualized to hide the physical resources from the application level and make the best use of technology resources and minimize complexity for users. The grid computing model - which makes the collective power of the computing resources in the grid available to any user - is a good example of a virtualized system model.
The On-Demand Operating Environment
To be successful in the future, businesses need to achieve new levels of productivity by becoming fully integrated - by integrating end to end across people, processes, and information. To accomplish this goal, the business must build on an IT infrastructure specifically designed for that purpose. This infrastructure is called the on-demand operating environment. Specifically, the on-demand operating environment depends on integrating middleware, open standards (so the company can interact with anyone in the world), virtualized systems (to hide the complexity of the infrastructure and use its resources more efficiently), and autonomic capabilities (to manage the growing system complexity).
Software On Demand
As part of an on-demand operating environment, software on demand is designed to address the heterogeneity and complexity of today's systems so that the existing computing power is fully utilized, and data can be shared among partners, customers, and suppliers responsively. Virtualization technology, incorporated into WebSphere, makes it easier for a business to integrate processes and IT solutions that have been traditionally separate. Using WebSphere in a grid environment enables the business to manage applications running on different servers - with differing priorities, usage patterns, and computing profiles - as a single environment.
Storage Solution
In an on-demand environment, with quickly changing requirements for data sharing, businesses can no longer afford the simplistic thinking of a storage device being "owned" by a server. Storage virtualization, the layer of software that isolates the physical storage from the application servers that use it, provides a centrally managed method of sharing storage within an environment. In a virtualized environment, physical disks are pooled and can be mapped into virtual disks and used by all servers. Virtualization also removes the complexity of individually managed physical disks and allows customers to maximize usage of their storage area network.
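The pooling idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the design of any particular storage product: the class names (PhysicalDisk, StoragePool), the disk names, and the first-fit allocation policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalDisk:
    name: str
    capacity_gb: int
    used_gb: int = 0

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb


@dataclass
class StoragePool:
    """Hypothetical pool that hides physical disks behind virtual disks."""
    disks: list
    # Maps a virtual disk name to its (physical disk name, gigabytes) extents.
    virtual_disks: dict = field(default_factory=dict)

    def free_gb(self) -> int:
        return sum(disk.free_gb for disk in self.disks)

    def create_virtual_disk(self, name: str, size_gb: int) -> list:
        """Carve a virtual disk out of the pool, spanning disks if needed."""
        if size_gb <= 0:
            raise ValueError("size must be positive")
        if name in self.virtual_disks:
            raise ValueError(f"virtual disk {name!r} already exists")
        if size_gb > self.free_gb():
            raise ValueError("not enough free capacity in the pool")
        extents = []
        remaining = size_gb
        for disk in self.disks:  # first-fit policy, assumed for illustration
            if remaining == 0:
                break
            take = min(disk.free_gb, remaining)
            if take > 0:
                disk.used_gb += take
                extents.append((disk.name, take))
                remaining -= take
        self.virtual_disks[name] = extents
        return extents


pool = StoragePool([PhysicalDisk("disk-a", 100), PhysicalDisk("disk-b", 200)])
db_volume = pool.create_virtual_disk("db-volume", 150)
print(db_volume)       # the extents backing the virtual disk
print(pool.free_gb())  # capacity still available to other servers
```

The server that mounts "db-volume" sees one 150 GB disk; only the pool knows that it spans two physical devices.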
These new, specific requirements of on-demand operating environments force us to think differently about the very nature of computing. The barriers that inhibit sharing of information and resources within an entity or between entities have to be carefully removed to permit dynamic sharing of resources and information by disparate groups, for disparate lengths of time. This dynamic sharing of resources is exactly what grids are designed to do.
The Role of Grids
Clusters, by definition, are interconnected whole computers that cooperate to solve a problem. Behind the definition, however, clusters are fairly static and homogeneous creations. First, they are static. The number of nodes in a cluster might grow or shrink over time, but the intent of the cluster rarely changes (and if it does, it's usually with a significant amount of re-architecting). For example, a cluster assembled for purposes of weather forecasting or for petroleum exploration will have software installed for those applications only. The individual users of the cluster may change over time, but typically even as users turn over, the applications they run on the cluster remain similar to those run by previous users. The types of data that cluster users manipulate are typically chosen during the creation of the cluster.
Second, clusters are usually homogeneous, both in function and hardware. With the possible exception of the cluster manager, the purpose of the nodes in a cluster is to solve the same kind of problem. Therefore, the nodes of a cluster are usually machines with similar architectures and operating systems. They also have, in general, a uniform set of software libraries and applications. Finally, for ease of administration and access, clusters are generally in one physical location.
Grids, by contrast, can be thought of as clusters without the preconceived limitations. Grids, like clusters, are interconnected whole systems that cooperate in solving problems. A grid, however, can be thought of as a cluster that forms dynamically in response to requirements based on computing power and data sharing (with its associated authorization issues), without regard for the physical location where the computing power is generated, where the data is coming from, or on what kind of platform it is available. Imagine, in essence, having the ability to carve the necessary computing resources out of the universe of computing resources available to solve a specific problem, for whatever duration is required. A grid allows the dynamic, on-demand assembly of the different types of computing resources necessary to solve a particular problem. These computing resources usually belong to separate organizations whose members are interested in the solution of a given problem. This is the thinking that fostered the grid concept. The premise of grids is to share computing resources and data in a dynamic way on an unprecedented scale.
Virtual Organizations and Grids
At the heart of understanding the grid paradigm is the concept of virtual organizations, or VOs. A VO is a set of individuals and/or institutions that are dynamically brought together to solve a problem. The VO might span multiple companies, academic organizations, or government departments that have an interest in solving a common problem. Interactions among members of the VO are governed by a well-defined set of rules. The sharing of resources is highly controlled, "with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs," according to Foster, Kesselman, and Tuecke in their paper, The Anatomy of the Grid: Enabling Scalable Virtual Organizations.
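The three elements of controlled sharing quoted above (what is shared, who may share it, and under which conditions) can be modeled as a small rule check. The sketch below is illustrative only: SharingRule, VirtualOrganization, the member and resource names, and the CPU-hour limit used as the "condition" are all hypothetical and do not come from any grid toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingRule:
    resource: str               # what is shared
    allowed_members: frozenset  # who is allowed to share
    max_cpu_hours: int          # condition under which sharing occurs


class VirtualOrganization:
    """Hypothetical VO that checks every request against published rules."""

    def __init__(self, name, members):
        self.name = name
        self.members = set(members)
        self.rules = {}

    def publish(self, rule):
        """A resource provider states its sharing terms."""
        unknown = rule.allowed_members - self.members
        if unknown:
            raise ValueError(f"not members of {self.name}: {sorted(unknown)}")
        self.rules[rule.resource] = rule

    def may_use(self, member, resource, cpu_hours):
        """A request is granted only if a rule covers it completely."""
        rule = self.rules.get(resource)
        if rule is None:
            return False  # nothing is shared unless a provider says so
        return member in rule.allowed_members and cpu_hours <= rule.max_cpu_hours


vo = VirtualOrganization("car-design", ["design-de", "plant-uk", "sales-us"])
vo.publish(SharingRule("simulation-cluster",
                       frozenset({"design-de", "plant-uk"}),
                       max_cpu_hours=500))

print(vo.may_use("plant-uk", "simulation-cluster", 100))    # member within limit
print(vo.may_use("sales-us", "simulation-cluster", 100))    # member not listed
print(vo.may_use("design-de", "simulation-cluster", 1000))  # over the limit
```

Denying by default when no rule exists mirrors the point that sharing in a VO is explicit rather than assumed.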
For example, with grid computing, the design and manufacture of a new model of automobile can become an increasingly international and collaborative event. Car designers in Germany may want to provide detailed specification information to manufacturing plants in England. Multiple manufacturing sites in England may want to share inventory, supplier, and pricing information. The sales team in the United States may want to share real-time information on how the manufacture of a single automobile is going as an anxious customer awaits delivery of the vehicle he or she ordered. A grid converts the design, manufacturing, and sales teams - along with the customer - into a virtual organization.
Scientists working to solve a problem in medicine might also combine to form a long-term virtual organization. By contrast, a team assembled to mount an emergency response to a disaster might bring together individuals and institutions representing relief response agencies, medical response teams, and law enforcement. Such a dynamic team would be formed in real time in response to an emergency, and would last only for the duration of the relief effort.
Each of these VOs needs to collaborate in response to a problem. VO members are probably from different geographies, may work for different companies or entities, and will provide to and take from the grid different resources. VOs may be well thought out and planned prior to their creation, or they might form in response to a particular one-time, unforeseeable event. Regardless of how a VO is formed, the need to share information and computing resources in real time requires individual systems within a grid to cooperate in well-defined ways by following various protocols.
The Role of Standards in Grids
The paradigms described in the previous sections would be very difficult to realize without established industry standards. If we want to make resources of various types belonging to different organizational entities collaborate in solving a common problem, it is necessary for these resources to communicate with each other using common standards.
The most significant activity in this area is the Open Grid Services Architecture (OGSA) definition. Drawing on prior research and experimentation in both the Web services arena and the grid computing field, OGSA defines an architecture for grid services. A grid service in this context is a Web service tailored to the requirements of a grid environment. This architecture abstracts and virtualizes the hardware and software platform on which the grid services will run. Thus, it provides a mechanism to allow grid nodes to interact with each other through grid services, irrespective of their hardware/software platforms. The core of the OGSA is the Open Grid Services Infrastructure (OGSI), which defines a set of interfaces that every grid service must expose and implement. OGSI is being standardized through the efforts of the Global Grid Forum (GGF), which is a community-initiated forum of researchers and practitioners working on distributed computing and grid technologies. GGF's primary objective is to promote and support the development, deployment, and implementation of grid technologies and applications via the creation and documentation of "best practices" - technical specifications, user experiences, and implementation guidelines.
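The value of a common set of interfaces that every grid service must expose can be shown with a small Python sketch. The interface and method names here (GridService, describe, destroy) and the two example services are hypothetical stand-ins chosen for illustration; they are not the interfaces defined by the OGSI specification.

```python
from abc import ABC, abstractmethod


class GridService(ABC):
    """Hypothetical common interface (illustrative names, not OGSI's)."""

    @abstractmethod
    def describe(self) -> dict:
        """Return self-describing data about the service."""

    @abstractmethod
    def destroy(self) -> None:
        """Release the service instance."""


class JobService(GridService):
    """Hypothetical service that might be hosted on a Linux cluster."""

    def __init__(self) -> None:
        self.alive = True

    def describe(self) -> dict:
        return {"kind": "job", "platform": "linux-cluster", "alive": self.alive}

    def destroy(self) -> None:
        self.alive = False


class ArchiveService(GridService):
    """Hypothetical service that might be hosted on a mainframe."""

    def __init__(self) -> None:
        self.alive = True

    def describe(self) -> dict:
        return {"kind": "archive", "platform": "mainframe", "alive": self.alive}

    def destroy(self) -> None:
        self.alive = False


def inventory(services):
    """Query every service through the shared interface only."""
    return [service.describe()["kind"] for service in services]


services = [JobService(), ArchiveService()]
kinds = inventory(services)
print(kinds)
```

The inventory function never inspects the platform behind a service. A class that omits one of the required methods cannot be instantiated at all, which is the enforcement a mandatory interface set provides.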
The Globus Toolkit v3.0 (GT3) is an implementation of OGSA from the Globus Project. At the time of writing, the release of a production version of GT3 was planned for June 2003. The GT3 core is the reference implementation of OGSI. GT3 base services include the security service, and additional services available (see Figure 1) include a managed job service that allows remote execution of programs, a file transfer service, and an information service that exposes and indexes information about the other services and system information.
WebSphere
Over time, IBM products such as WebSphere will adopt the OGSA (see Figure 2) for middleware products. WebSphere will function as the Web services engine in the OGSA. OGSA will run on WebSphere, and WebSphere with OGSA will be integrated into IBM servers and storage platforms.
TeraGrid
IBM is a key supplier to the TeraGrid, probably the most striking and best-known example of the grid principles in practice. The raw power of the TeraGrid is in itself impressive - 20 teraflops (a teraflop is a trillion floating point operations per second) of computing power, nearly a petabyte (a petabyte is just over a million gigabytes) of data, distributed over five U.S. sites and connected by a 40-gigabit-per-second backbone - the fastest research network in the world (see Figure 3). The five sites include the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign; the San Diego Supercomputer Center at the University of California, San Diego; Argonne National Laboratory in Argonne, IL; the Center for Advanced Computing Research at the California Institute of Technology in Pasadena; and the most recent addition, the Pittsburgh Supercomputing Center.
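To put the figures above in proportion, the following back-of-the-envelope sketch estimates how long a full petabyte would take to cross the backbone. It rests on idealized assumptions that are not in the TeraGrid description: a binary petabyte (2^50 bytes, consistent with "just over a million gigabytes"), a single transfer using the entire 40 gigabits per second, and no protocol overhead.

```python
PETABYTE_BYTES = 2 ** 50                   # assumed binary petabyte
BACKBONE_BITS_PER_SECOND = 40 * 10 ** 9    # 40 gigabits per second


def transfer_hours(num_bytes: int, bits_per_second: int) -> float:
    """Ideal transfer time in hours at full line rate, ignoring overhead."""
    seconds = num_bytes * 8 / bits_per_second
    return seconds / 3600


hours = transfer_hours(PETABYTE_BYTES, BACKBONE_BITS_PER_SECOND)
print(f"{hours:.1f} hours, or about {hours / 24:.1f} days")
```

Even at full line rate, the result is on the order of two and a half days, which suggests why sites that specialize in data management remain useful alongside a fast backbone.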
In true grid spirit, some of the sites specialize in computing power, and others specialize in data management and visualization resources, with each site contributing specific resources to the overall TeraGrid, comprising a whole greater than the individual parts. The peer-to-peer structure of the TeraGrid allows scientists to tap into the collective resources of the grid from their local workstations. Besides the general objective of enhancing collaboration, the TeraGrid project has several high-level objectives:
To provide an unprecedented increase in the computational capabilities available to the open research community
To deploy a distributed system, without centralized control, which allows resource mapping by participants
To create an "enabling cyberinfrastructure" for scientific research so that additional resources can be readily added, and to provide a model for future grids
In practice, the individual members of the TeraGrid follow a set of standards that dictates a certain level of function, but not an implementation. This allows independent control of the individual sites, and the addition of new sites, provided they meet the functional requirements for participation. The software stack comprises both a grid services layer and a set of application services. The grid services layer includes software that does overall grid scheduling, data movement, job scheduling and monitoring, and authentication and resource management. The application services are hosted on top of the basic and core grid services, and include support for batch job management, interactive job management, large-data management, and archival support.
The TeraGrid is the perfect match for the on-demand computing paradigm. It can be broken up into chunks to meet immediate computing needs, and the combined resources of the individual sites make it possible to run applications that otherwise would not be feasible, or that would require a physical presence at the computer. Since the TeraGrid has been designed to conform to the emerging grid standards (such as Globus, the de facto standard), and because the functional expectations for joining the grid have been architected, the TeraGrid is extensible. Other sites meeting the functional architecture will be added in the future. There is a feeling that the TeraGrid is the first major step toward a national cyberinfrastructure, a pervasive network of computers all running grid-standard software. Just like the power grid that the consumer simply plugs into without giving a thought as to where the power is generated, the cyberinfrastructure will provide transparent computing resources that over time may be adopted by the business sector. Already we're seeing advances in real-time weather forecasting and tornado prediction based on the on-demand power of the grid.
Conclusion
In this article, we've discussed the ways IBM has introduced the specifics of on-demand computing in the areas of operating environments, software, and storage. We've discussed the concept of virtual organizations and how they form and dissolve dynamically, with an ever-changing list of computing requirements that are difficult to plan for. We've discussed the concept of grids, how adherence to standards enables grids, and how grids differ from clusters in their ability to morph to meet the demands of virtual organizations. We've talked about the TeraGrid, which is providing scientists with an unprecedented ability to collaborate on a national scale, perhaps on the road to a national cyberinfrastructure. By its very nature, the grid is becoming the vehicle to lead us to a new world of on-demand computing.
References
Foster, I., Kesselman, C., and Tuecke, S. The Anatomy of the Grid: Enabling Scalable Virtual Organizations:
www.globus.org/research/papers/anatomy.pdf
Catlett, C. The TeraGrid: A Primer:
www.teragrid.org/about/TeraGrid-Primer-Sept-02.pdf
Foster, I., Kesselman, C., Nick, J., and Tuecke, S. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration:
www.globus.org/research/papers/ogsa.pdf
Tuecke, S., Czajkowski, K., Foster, I., Frey, J., Graham, S., and Kesselman, C. Grid Service Specification (Globus Project, Draft 29):
www.gridforum.org/ogsi-wg/drafts/draft-ggf-ogsi-gridservice-29_2003-04-05.pdf
Global Grid Forum:
www.ggf.org
Author Bio
Richard Ferri is a senior programmer in IBM's Linux Technology Center, where he writes about and works on open-source cluster projects. He recently joined the Advanced Linux Response Team, working with customers converting to Linux. He holds five patents in the field of clustering, with one pending.
Dr. Moon J. Kim is an IBM senior technical staff member responsible for the development of strategic infrastructures. He has developed system solutions such as the Advanced Web System and the Broadband Interactive System, and was also a key architect and developer of the S/390 family and MPP systems. He is an IBM Master Inventor and has published numerous technical papers.
moonkim@us.ibm.com
Dr. Dikran S. Meliksetian works with the Internet Technology team at IBM and is involved in the design and development of advanced content management applications. He is a senior technical staff member and is engaged in the design and development of the IBM internal grid based on industry standards.
dikran_meliksetian@us.ibm.com