Designing Mission Critical Applications for the Cape Clear ESB Platform

By James Pasley

Enterprises are increasingly putting applications that are critical to their business on the ESB. Such applications require high levels of quality of service. Their non-functional requirements include Reliability, Availability, Scalability and Performance, often collectively referred to as RASP. This paper looks at how these requirements can be addressed by the ESB and makes recommendations for the design and development of applications.

The following definitions of Reliability, Availability, Scalability and Performance should help to clarify what is meant by these terms. It may also be useful to think of these as goals to be achieved in the creation of your applications.

Reliability refers to the ability of the system to perform as designed. On the ESB we are primarily interested in two kinds of reliability:

  • Reliable Messaging: The ability of the system to deliver messages even in the event of failures.
  • Reliable Processing: The ability of the system to process the messages in a reliable manner.

Availability is the ability to continue to handle requests in the event of a failure. This covers both failures of hardware and software. Typically this is addressed through the provision of redundant hardware in order to avoid any single points of failure.

Scalability is the ability to handle increased load through the addition of hardware resources. There are two forms:

  • Vertical scaling (also known as building up) refers to improving the power of a single computer, for example by adding memory or upgrading the CPU.
  • Horizontal scaling (also known as building out) refers to distributing the load across additional computers.

Performance is the ability to handle requests promptly. It is simply the measure of the response time for the system. The related concept of throughput is a combination of performance and the maximum number of concurrent requests the system can handle.
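The relationship between response time, concurrency and throughput can be made concrete with Little's law (concurrency = throughput × response time). The sketch below is illustrative only: the function name and the figures are assumptions, not measurements from any Cape Clear deployment.

```python
def throughput_per_second(max_concurrent_requests: int, response_time_seconds: float) -> float:
    """Steady-state throughput implied by Little's law.

    concurrency = throughput * response_time, so
    throughput  = concurrency / response_time.
    """
    if response_time_seconds <= 0:
        raise ValueError("response time must be positive")
    return max_concurrent_requests / response_time_seconds


# Hypothetical figures: 50 concurrent requests, 0.25 s average response time.
print(throughput_per_second(50, 0.25))  # 200.0
```

Halving the response time or doubling the number of concurrent requests the system can sustain has the same effect on throughput, which is why the two are usually tuned together.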

The Cape Clear ESB Platform

The Cape Clear ESB Platform provides a number of features targeted directly at fulfilling the requirements for RASP. This section provides a brief overview of them.

The main BPEL engine components
The BPEL engine provides a reliable processing environment in which processes are executed. As each instance of a business process is executed, the state associated with it is automatically persisted to the BPEL database. In the event of a server failure, this state can be recovered from the database and execution of the process instance continues as normal. A low-latency recovery strategy minimizes the time taken for this recovery. The delivery of messages to the appropriate instance of a process (known as correlation) is performed using a high-speed correlation engine to ensure that message delivery remains efficient as the number of instances increases. In a clustered environment the BPEL engine implements the server affinity model for process instances. This ensures that each ‘in-flight’ process instance is cached on only one server at any given time. Messages intended for ‘in-flight’ process instances are automatically forwarded to the correct server.
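Correlation and server affinity can be pictured as a lookup from a correlation key to the one server that currently holds the process instance. The following is a toy model, not the Cape Clear implementation: the `Cluster` class, the server names and the correlation keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: each correlation key is owned by exactly one server."""
    servers: list
    owner_by_key: dict = field(default_factory=dict)

    def route(self, correlation_key: str, receiving_server: str) -> str:
        """Return the server that should process a message for this key."""
        if receiving_server not in self.servers:
            raise ValueError(f"unknown server: {receiving_server}")
        # The first message for a key creates the instance where it arrives;
        # later messages are forwarded to that same owner.
        return self.owner_by_key.setdefault(correlation_key, receiving_server)


cluster = Cluster(servers=["node-a", "node-b"])
print(cluster.route("order-1001", "node-a"))  # node-a (instance created here)
print(cluster.route("order-1001", "node-b"))  # node-a (message forwarded)
```

Because the lookup is a hash-table access, its average cost does not grow with the number of in-flight instances, which is the property the paragraph above describes.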

The Message Store ensures that the quality of service offered by transports continues uninterrupted into the BPEL engine, linking the reliable messaging provided in the transport layer with the reliable processing offered by the BPEL engine. In the event of the database becoming unavailable, the Message Store will automatically halt the retrieval of messages from reliable transports in order to ensure that no messages are lost.
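The halting behavior described above amounts to acknowledging a message on the transport only after it has been persisted. A minimal sketch of that idea follows; the `drain` function, the `DatabaseUnavailable` exception and the `persist` callback are assumptions made for illustration, and a `deque` stands in for a reliable transport queue.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Raised by the hypothetical persist() callback when the database is down."""


def drain(transport_queue: deque, persist) -> int:
    """Move messages from the transport into the store, halting on database failure.

    A message is removed from the transport only after persist() succeeds,
    so a failure leaves it, and everything queued behind it, on the transport.
    """
    stored = 0
    while transport_queue:
        message = transport_queue[0]       # peek without removing
        try:
            persist(message)
        except DatabaseUnavailable:
            break                          # halt retrieval; nothing is lost
        transport_queue.popleft()          # acknowledge only after persistence
        stored += 1
    return stored
```

When the database returns, calling `drain` again resumes from the first message that was not persisted.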

Multiple servers within a BPEL engine cluster can collaborate to provide load balancing. The correlation of messages happens within the cluster and ensures that messages can arrive at any server in the cluster and still be processed correctly. Recovery is triggered automatically should one of the servers leave the cluster, which ensures that all process instances remain running within the cluster.

The BPEL Cache minimizes the amount of database access required by the BPEL engine to improve performance. The cache is optimized to reduce the in-memory footprint for ‘in-flight’ process instances. It also implements an eviction strategy which allows process instances to be entirely removed from memory.

The Cape Clear ESB hosts all of the above components and manages all of the BPEL engine's interactions with the outside world. Web service invocations to co-located services are performed in-process for maximum efficiency.

RASP First Steps

Using the features of the Cape Clear ESB Platform as described above within a cluster of servers is the first step in satisfying the requirements of RASP. By doing this you will be isolated from many of the complex programming paradigms you would be exposed to if trying to build RASP applications in other environments. In addition, there are a number of decisions you need to make about the deployment environment. Some of these are described in this section.

Use a reliable transport: Messages should be exchanged over reliable transports, either using WS-RM or one of the many JMS providers. You may already use a reliable queue-based messaging system that can be accessed using the JMS interface, in which case the Cape Clear ESB will be able to use it as a reliable transport layer. Decide on the use of such transports early in the design of the application. It can be tempting to start development of services which communicate using raw HTTP (in other words, where WS-RM is not used). However, this can lead to situations where testing highlights the problems caused by connection failures and may lead to message redelivery logic appearing in application code. This is not the appropriate place in any architecture to deal with low-level, transport-related errors. See Spot the mistakes - Two things you shouldn’t do with BPEL for an example of what can go wrong when the problem of using unreliable transports is addressed too late in the development lifecycle.

Many internet-based applications use raw HTTP as a transport, and there will be places where it is the most appropriate transport to use. The question is therefore which places in the architecture are appropriate for HTTP and which are not. It may be useful to think of the issue in these terms: if a user is interacting with the system through a web browser, HTTP is the most appropriate transport. In this case, any errors that occur may be propagated back to the user, who may take appropriate action such as aborting the transaction or trying again later. However, in many cases, such as an online retailer, there comes a point where the user completes the transaction and leaves. The backend systems which process the transaction must now do so in a reliable manner. It is not acceptable to lose the order once payment has been received. This is the point at which the use of reliable transports and processing must be introduced.

Use a highly available database: The BPEL engine makes use of a database to persist all information. This makes the choice of database and how it is configured an important factor in the overall quality of service delivered by the system. A high availability database is required for use with the BPEL engine and the Message Store.

Eliminate single points of failure: In order to create a system which continues to be available in the event of failure you will need to have at least two of everything: routers, servers, databases, everything. This is a relatively well understood approach and much has been written on how to achieve high availability of web sites using it. Even though the characteristics of handling HTTP requests from web browsers are very different to messaging on an ESB, the principle remains the same.
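The effect of duplicating every tier can be estimated with the standard series/parallel availability formulas. The sketch below assumes that failures are independent, which real deployments only approximate; the function names and the 99% figures are hypothetical.

```python
def tier_availability(component_availability: float, copies: int) -> float:
    """Availability of a tier that works while at least one copy works.

    Assumes independent failures between copies.
    """
    return 1 - (1 - component_availability) ** copies


def system_availability(tiers) -> float:
    """Availability of tiers in series: every tier must be up."""
    result = 1.0
    for availability, copies in tiers:
        result *= tier_availability(availability, copies)
    return result


# Hypothetical figures: router, server and database tiers, each component 99% available.
single = system_availability([(0.99, 1), (0.99, 1), (0.99, 1)])
paired = system_availability([(0.99, 2), (0.99, 2), (0.99, 2)])
one_weak_link = system_availability([(0.99, 2), (0.99, 2), (0.99, 1)])
print(single, paired, one_weak_link)
```

With these figures, leaving a single tier unduplicated keeps the whole system below the availability of that one component, which is why the advice is to duplicate everything.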

Design with overcapacity: A cluster of servers provides both load balancing and failover. You need to consider how these requirements relate to each other. In the event of a failure, the system will fail over to the remaining hardware. For the system to continue running successfully, the remaining hardware needs to be capable of handling the load. In other words, you need to design the system with overcapacity. If your availability requirement states that the system should continue in the event that any one server is removed, then the system should be sized on the assumption that one server is unavailable.
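The sizing rule above can be written as a one-line calculation. This is a rough sketch: the function name, the load units and the figures are assumptions, and it treats capacity as scaling linearly with the number of servers.

```python
import math


def servers_required(peak_load: float, capacity_per_server: float,
                     tolerated_failures: int = 1) -> int:
    """Servers needed so peak load is still met with `tolerated_failures` servers down."""
    needed_for_load = math.ceil(peak_load / capacity_per_server)
    return needed_for_load + tolerated_failures


# Hypothetical figures: 900 requests/s at peak, 400 requests/s per server.
print(servers_required(900, 400))  # 4: three to carry the load, one spare
```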

Use the in-process transport for calls to co-located Web services: A BPEL process often makes calls to local Web services which perform some utility tasks such as data transformation. Use the in-process transport for calls to such services, thus bypassing the overhead of the transport layer. Java services can also make use of the in-process transport. Alternatively, Java services which need to make use of transformations can invoke them directly through Java APIs, rather than using Web service calls.

Plan for scheduled maintenance: The concept of Manageability is related to availability. This is the ability to keep the system running in the event of scheduled maintenance. For a system which is required to be available 24/7, the redundancy built into the system to provide availability can also be used to allow for maintenance. Maintenance should include the proactive replacement of hardware as well as the upgrades and maintenance of software components. Mean time between failures should be considered when choosing when to replace hardware.

Do your investigation first and test continually: Most applications built on an ESB include some interaction with legacy or existing systems. These systems will have to play their part in satisfying the RASP requirements. Evaluation and testing of these existing systems needs to be done during the planning phase of a project in order to identify what remedial action might be necessary. Testing of all aspects of RASP should continue in parallel with development throughout the lifecycle of the project. See Performance Testing in a SOA for advice on the kinds of testing that are necessary.

Know your objectives: It is important to know your RASP requirements from the start. These requirements need to feed into both the software design and the hardware specification. Don’t waste time tuning a system which just can’t handle the load – you will either need to scale the system or modify the application architecture.

Application Design Issues

This section looks at some of the choices you can make when designing applications which will allow them to make full use of the RASP features provided by the Cape Clear ESB Platform.

Identify the right balance of RA and SP: The issues of reliability and availability are related as are those for scalability and performance. Sometimes more reliability and availability may mean less performance and scalability. An appropriate balance needs to be found for each application. Perhaps the most significant decision in relation to this is when to use BPEL.

The reliability provided by the BPEL engine makes it a compelling choice when processing messages, particularly where state needs to be stored. However, some parts of a business process could be idempotent or may not require any state to be managed. Such tasks could be implemented as Java Web services thus avoiding the overhead introduced by database persistence. In many cases, BPEL can be used to model the process and manage the state associated with it, but then make calls out to Java services in order to perform much of the processing. In this way the appropriate balance between the reliability of BPEL and the performance of Java can be achieved.

Localize the effect of unavailability - Use asynchronous messaging: In a synchronous system, unavailability of any one component can cause errors to quickly propagate throughout the system. It may not be possible to eliminate all single points of failure, particularly where legacy applications are used. This is a cause for serious concern, and requires that procedures are put in place to ensure that the time to repair is minimized. The use of asynchronous messaging over a reliable transport can ensure that such incidents are perceived as delays in processing rather than the rejection of requests. Such an approach needs to be carefully validated against the business requirements. For example, a system which requires that all trades are settled within a twenty-four hour period may allow for the brief unavailability of one of its components within that time. In contrast, a service exposed on a web site will find itself subject to the classic eight second rule: if the user does not receive a response within eight seconds, they will go elsewhere.
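The difference between a delay and a rejection can be sketched with a queue placed in front of a component that is sometimes unavailable. The `AsyncEndpoint` class below is hypothetical, and its in-memory `deque` merely stands in for a reliable transport such as a JMS queue; it would not itself survive a server failure.

```python
from collections import deque


class AsyncEndpoint:
    """Accepts requests immediately and processes them when the backend is available."""

    def __init__(self, handler):
        self.queue = deque()      # stand-in for a reliable transport queue
        self.handler = handler
        self.available = True

    def submit(self, request) -> str:
        # The request is accepted even while the backend is down.
        self.queue.append(request)
        return "accepted"

    def process_pending(self) -> list:
        """Process queued requests in arrival order for as long as the backend is up."""
        results = []
        while self.available and self.queue:
            results.append(self.handler(self.queue.popleft()))
        return results
```

A client that calls `submit` while `available` is false sees an acceptance rather than an error; the work is simply completed later, once `process_pending` runs against a recovered backend.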

Localize the effect of any unreliability: The ESB is often placed in the role of exposing existing, sometimes legacy, applications as Web services. These applications, and in particular the transports that they use, may not offer guarantees of reliability. The ESB supports multiple transports in order to connect to these applications. Most of these transports offer no guarantee of reliability, e.g. file drop, SMTP, FTP, HTTP. In the event of an error they either simply report it, leaving the client to deal with recovery (e.g. HTTP), or they provide a best effort level of service: if an error occurs, they reserve the right to simply discard the message (e.g. SMTP). Many of the reliability issues that need to be faced relate to the construction of a reliable architecture which contains such applications.
The characteristics of these transports should not be allowed to propagate throughout your ESB. Services created from these legacy applications need to contain the faults that can occur and take appropriate action. As part of this strategy, the characteristics of the legacy application itself will need to be analyzed. For example, if a fault occurs, you may be unsure if the original message was delivered or not. Is it safe to resubmit the message? If so, then the message may be automatically resent; if not, the appropriate action may be to raise an alarm and await manual intervention.
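The resend-or-alarm decision described above might look like the following sketch. The `deliver` function, its parameters and the returned status strings are all hypothetical; `send` stands for whatever call reaches the legacy application and is assumed to raise `ConnectionError` on a transport fault.

```python
def deliver(message, send, safe_to_resend: bool, raise_alarm, max_attempts: int = 3) -> str:
    """Contain transport faults at the service boundary.

    If the legacy application tolerates duplicates (safe_to_resend), retry up to
    max_attempts times. Otherwise try once only, because a failed call may still
    have been delivered. In both cases, exhausting the attempts raises an alarm
    instead of returning the fault to the caller.
    """
    attempts_allowed = max_attempts if safe_to_resend else 1
    for _ in range(attempts_allowed):
        try:
            send(message)
            return "delivered"
        except ConnectionError:
            continue
    raise_alarm(message)
    return "awaiting manual intervention"
```

A production version would normally wait between attempts and record the failure; both are omitted here to keep the decision logic visible.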

Ensure resources can be managed efficiently - Use asynchronous messages: The use of asynchronous messaging has already been mentioned in relation to the availability of the system. It is also an important factor in how the system will scale. Where a request/response style operation is invoked over HTTP, the connection must remain open while the request is processed. This consumes valuable resources in both the client and the service while this is happening. When the load increases, the time taken to process requests may increase resulting in more and more connections remaining open. This leads to a vicious circle which can significantly limit the scalability of the system. The use of asynchronous messaging avoids the problem of having to keep connections open in this way.

The BPEL Cache can actively manage its own memory consumption by evicting process instances from memory. Process instances can then be re-hydrated from the database, perhaps onto a different server in the cluster, when there is more work for them to do. Request/response style operations over HTTP require that the connection be maintained while the operation is executed. Keeping this connection open requires that the thread of execution which created it remains active. This is true for both the client and the service. This prevents the BPEL engine from evicting process instances which have open connections and can significantly limit the ability of the cache to manage its memory consumption.
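The interaction between eviction and open connections can be illustrated with a small least-recently-used cache in which some entries are pinned. This is a toy model rather than the BPEL Cache itself: the `InstanceCache` class is hypothetical, a plain dictionary stands in for the BPEL database, and least-recently-used is an assumed eviction policy.

```python
from collections import OrderedDict


class InstanceCache:
    """Toy LRU cache of process instances with pinning for open connections."""

    def __init__(self, capacity: int, database: dict):
        self.capacity = capacity
        self.database = database        # stand-in for the BPEL database
        self.in_memory = OrderedDict()  # instance id -> state, least recently used first
        self.pinned = set()             # instances holding an open HTTP connection

    def get(self, instance_id):
        if instance_id not in self.in_memory:
            # Re-hydrate the instance from the database.
            self.in_memory[instance_id] = self.database[instance_id]
        self.in_memory.move_to_end(instance_id)
        state = self.in_memory[instance_id]
        self._evict(keep=instance_id)
        return state

    def _evict(self, keep):
        for candidate in list(self.in_memory):
            if len(self.in_memory) <= self.capacity:
                break
            if candidate == keep or candidate in self.pinned:
                continue  # cannot evict: in use, or connection still open
            self.database[candidate] = self.in_memory.pop(candidate)
```

If every cached instance is pinned, `_evict` finds nothing to remove and the cache grows past its capacity, which is the memory pressure that synchronous request/response operations create.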

Contain faults within the service: Wherever possible, services should contain the faults that occur and not pass them back to the client. Faults of the form "Resource is unavailable, please try again later" should not be part of a SOA. If simply storing the request and retrying later is appropriate, then the service itself should do that rather than push this requirement onto the client. Where it is not possible to automate the handling of faults, consider raising an alarm requesting manual intervention rather than propagating the fault to the client. This ensures that the fault is handled closest to where it occurs. Of course, in some circumstances it is necessary to report faults back to the client. This should be reserved for faults relating to the business logic of the service rather than IT level issues.

Choose an appropriate message granularity: The issue of granularity is fundamental to a successful SOA. Services need to be built using a granularity that matches the common tasks required by the business. This approach should avoid the creation of fine-grained services which require multiple interactions in order to achieve a task. This is also an important aspect of ensuring that the system is scalable. Ideally, a service should receive a single document containing everything it needs in order to complete the task.

Choosing the right granularity is difficult and will depend on a good understanding of the business requirements. For this reason it is difficult to provide specific advice on granularity in a general way. However, focusing on the tasks is a good place to start. Identify the tasks that the business needs to achieve and then design messages which contain sufficient information to complete them without multiple exchanges.

The issue of interdependence between messages is related to the concept of granularity. For example, consider a trading system in which a client requests a price for a particular trade and later executes that trade. The service might store a record of the quoted price, allowing the client to subsequently send a message to execute on the quote. This creates a dependency which might require the establishment of a session or the storing of state. A better solution is to make these two tasks independent of each other. The first interaction allows the client to establish the price. The second interaction takes the form “execute trade at the price X”. In this scenario, the client resends all the details (including the price) as part of the second interaction. This avoids the need for a session or the storage of state. The trade can of course be rejected if the price has changed; from a user’s perspective this is equivalent to a session timeout.
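The self-contained "execute trade at the price X" message might be modelled as follows. The `ExecuteTrade` type, its fields, the `execute` function and the `ACME` symbol are hypothetical; `current_price` stands for whatever lookup the service uses to obtain the live price.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ExecuteTrade:
    """Carries everything the service needs; no quote is stored between calls."""
    symbol: str
    quantity: int
    price: Decimal  # the price the client was quoted, resent by the client


def execute(request: ExecuteTrade, current_price) -> str:
    """Execute only if the resent price still matches the live price."""
    if current_price(request.symbol) != request.price:
        return "rejected: price has changed"
    return f"executed {request.quantity} {request.symbol} at {request.price}"


market = {"ACME": Decimal("10.50")}
print(execute(ExecuteTrade("ACME", 100, Decimal("10.50")), market.__getitem__))
print(execute(ExecuteTrade("ACME", 100, Decimal("10.45")), market.__getitem__))
```

Because the request is complete in itself, any server in a cluster can handle it without a session lookup.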

Calculate the number of messages exchanged in order to process a single transaction: Performance issues cannot always be solved using hardware. Good performance requires good software design. In a SOA, an important consideration for the design of a system is the number of messages exchanged. When reviewing the design of an application in relation to performance, calculate how many messages are exchanged as the result of receiving a single request from the client. This will have an impact on both the performance and scalability of the system.
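One way to carry out this calculation is to write the design down as a call graph and count messages recursively. The sketch below assumes an acyclic graph and, by default, two messages (request and response) per call; the service names are hypothetical.

```python
def messages_per_transaction(call_graph: dict, entry_point: str,
                             messages_per_call: int = 2) -> int:
    """Total messages triggered by one client request, including the client's own call.

    call_graph maps each service to the list of services it invokes; a service
    listed twice is invoked twice. The graph is assumed to contain no cycles.
    """
    def downstream(service: str) -> int:
        return sum(messages_per_call + downstream(callee)
                   for callee in call_graph.get(service, []))

    return messages_per_call + downstream(entry_point)


call_graph = {
    "order-process": ["validate", "price", "price", "ship"],
    "ship": ["notify"],
}
print(messages_per_transaction(call_graph, "order-process"))  # 12
```

Multiplying the result by the expected request rate gives the message rate the ESB must sustain, which is often a more useful sizing figure than the client request rate alone.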

Use features consistently: The need to integrate a number of existing systems may result in situations where different levels of RASP interconnect. For example, consider a client which invokes a BPEL process over raw HTTP. Now imagine that the server goes down after receiving the request, but before returning the response. When the server is rebooted the process instance will be recovered and will continue execution. However, the HTTP connection has been lost and the client has already received a connection error. (The same situation applies for servers in a cluster, except that recovery will happen immediately as the process instances fail over to the remaining servers.) You need to define what the required behavior is for this kind of situation. It might be appropriate to turn off recovery for the BPEL service.

Conclusion – Help us to help you

The combination of the BPEL engine cluster, BPEL Cache and Message Store used in conjunction with a reliable transport and highly available database provides a powerful solution to the RASP requirements. In addition to this, a number of application design issues, which are typically a case of applying the principles of good SOA design, ensure that the maximum benefit can be taken from the Cape Clear ESB Platform.

