Clustering: Solution Overview

Highly Available, Scalable, Multi-tier Solution Architecture

Definitions, terms, and abbreviations

The following definitions, terms or abbreviations are used in this document, with a brief description available:

  • Vertical scalability - potential increase in processing capacity of a machine attainable by hardware upgrades
  • Horizontal scalability - potential increase in processing capacity of a cluster attainable by increasing the number of nodes (machines)
  • Stateful services - services that provide access to persistent information (e.g. account configuration and mailbox) over multiple sessions. Typically refers to a service in the back-end tier. (e.g.. IMAP services for an account)
  • Stateless services - services that do not store persistent information over multiple sessions. Typically refers to the services in the front-end tier (e.g IMAP Proxy)
  • Front-end tier - Subnet, medium security level, provides proxy and routing services
  • Back-end tier - Subnet, high security level, provides e-mail data storage and directory services
  • Front-end node - Machine residing in the front-end network tier
  • Back-end node - Machine residing in the back-end network tier, participating in the high availability cluster
  • Service instance - A running service and storage for a specific set of accounts
  • Cluster node - A machine, capable of running one (typically) or more service instances.

Scalability

Non-distributed email solutions, where account information (configuration and messages) is stored on a single machine allow vertical scalability through hardware upgrades (CPU, RAM, additional hard disks). However, due to limitations in a typical machine (e.g. max 2 CPU, max 4 GB RAM etc) an upper limit is eventually reached where one can no longer upgrade one machine - we shall refer to this as vertical scalability limit.

When the vertical scalability limit is reached, the only solution available is to distribute account information (configuration and mailbox) on more than one machine - we shall refer to this as horizontal scalability. Since information for one account is atomic and cannot be spread across more machines, the solution is to distribute accounts on more than one machine. This way, for a single account, there will be one machine responding to requests (i.e. IMAP, SMTP etc.) for that specific account. Thus, when the overall capacity (in terms of active accounts and traffic) of the messaging solution is reached, adding one more machine to the solution and making sure new accounts are created, provides a capacity upgrade, therefore allowing virtually unlimited horizontal scalability.

The solution architecture described here is oriented towards this horizontal scalability concept and aims to provide practically unlimited expansion abilities of the messaging system. It should be noted, though, that there are both physical and logical (software related) limits that must be considered.

Since each account of the system is serviced by a specific node, a centralized location directory must be available to provide location services. In this case, an LDAP service system will store information about which node is able to service requests for a specific account.

Stateless services

Stateless services include any proxy services and all other content independent services (such as web content services) that do not store or keep track of the information exchange between clients and themselves. These systems do not require storage redundancy and their hardware failures do not affect user data, settings or other information.

Since stateless services do not store information over multiple sessions, we can assume that two different machines are able to serve requests for the same account. This way, horizontal scalability can be achieved by simply adding more machines providing the same service in the exact same configuration.

The only remaining requirement is to ensure that requests to a specific service are distributed evenly throughout the machines providing that specific service (i.e. if the front-end tier contains two machines providing IMAP proxy services, half of the incoming IMAP connections must reach one of the machines and the other half of the connections must reach the other machine). This distribution functionality is provided by a load balancer.

Fault tolerance and stateful services

For stateful services, all requests to an account are sent to one specific back-end node (mandatory), where the back-end service resides at that moment in time. If that back-end node experiences a fault and can no longer respond to requests, none of the other back-end nodes are able to serve requests for the respective account. A mechanism is required to ensure that, in the event of a catastrophic failure on one back-end node, some other node must takeover the task of serving requests for that account, thus providing high-availability.

The Cluster Management software provides this exact functionality. It ensures that if one back-end node (running a stateful service) fails, another cluster node will automatically detect the fault and start the required service, in-place of the failed node, providing minimal downtime of that service. This detection process is achieved through the heartbeat mechanism deployed along with the clustering software.

In the case of stateless services, any of the nodes providing the same service is able to respond to requests for any account. Accordingly, the only requirement is to make sure the request distribution device (load balancer) can detect when one of the front-end nodes no longer responds to requests, in order to cease the distribution of service requests to that specific front-end node. The total request processing capacity is decreased (the entire clustering system will respond slower, since one node can no longer process requests), but all service requests can still be processed.

Multi-tier structure

The solution uses three tiers to provide the required functionality.

The load balancer tier provides services for network layer 4 (transport), TCP connections, and is completely unaware of account information; it only provides distribution of connections to the nodes in the front-end tier.

The front-end tier comprises of nodes running proxy and SMTP routing services. Its task is to ensure that messages and connections are routed to the appropriate node (depending on the account for which a request is being performed) in the back-end tier.

Finally, the back-end tier provides access to persistent data (such as account configuration and mailbox data); each node in the back-end tier is capable of responding to requests for a set of accounts. No node in the back-end tier is capable of serving requests for an account that is hosted on a different node. The simplest way to identify the systems that have to run in the back-end is to check if they require storage access. All systems that require storage access are part of the back-end, while the proxy or front-end nodes will not require access to the storage resource.

High availability

Depending on the service type (stateful or stateless), high availability is achieved differently.

In the case of stateless services (services in the front-end tier), the load distribution system also provides high availability capabilities transparently. The load balancer automatically distributes requests (at TCP level – network layer 4) to all the nodes in the front-end tier, based on the configured algorithm. If one node in the front-end tier fails, the load balancer will automatically detect the failed node and will no longer distribute connections to it. A major advantage over the stateful services high availability mechanism is that, due to its active/active nature, a node failure causes no service downtime. Once a front-end node recovers, it will automatically be re-included in the load balancing mechanism and will start processing requests immediately.

High-availability in the back-end tier is achieved by making sure that in the event of failure of a physical node, the service instance is automatically started on a different cluster node. Currently, there are a number of software packages providing this functionality. Please see the Cluster management software section for a list of supported cluster management software.

Also, the high availability can be implemented for all the other devices and connections from the solution, like redundant network adapters, an additional load balancer. However, these are out of this document scope.

Mailbox node failure recovery

In the event of a failure one of the back-end nodes will be unable for any reason (software or hardware) to serve incoming requests. Heartbeat, the mechanism deployed to identify this failure, will report the issue to all of the other nodes that are part of the cluster and trigger the service migration from the failed node to the hot-standby node. The service migration can last between 10 and 60 seconds, resulting in less than a minute between service recovery. The failure of one such node will affect only a limited section of the entire pool of accounts. The quantity of user accounts affected is equal to those that were stored on the failed back-end node and connected to a Axigen service at the same time.

The hot stand-by node can take over as many failed nodes as it is configured to replace. However, the resources available to this node and how many services it can run with minimal impact on the overall system performance should be taken into consideration. The best practice, recommended in this situation, is to run no more than two services on any hot stand-by node.

In addition, the solution can be designed to have as many hot stand-by nodes as required in order to have zero performance impact in case multiple back-end tier nodes failed. However, most situations require only one hot stand-by node.

No single point of failure

The high availability used at service level provides a full no-single-point-of-failure. However, any faulty hardware component in a node causes that node to be unusable thus diminishing the total processing power (hence the total transaction capacity) the solution provides. There is a mechanism which makes sure that, even in the case of an I/O controller failure, the node can continue to provide the service; this relies on having duplicate I/O controllers on each node and a software method of failover (rerouting I/O traffic from the faulty controller to the healthy one).

The I/O high availability can be used for disk I/O and network I/O fault tolerance, provided that duplicate controllers are available on the nodes. This reduces the occurrence of service downtime in the case of stateful services; if an I/O controller fails, the service would need to be restarted on a different node.

Account provisioning

When a new account is being created, one Axigen back-end node must be selected to hold its related information. The provisioning interface must be implemented to select the back-end node based on one of the following algorithms:

  1. Random: Each new account is created on one of the back-end node, picked randomly; as an enhancement, a weighted-random distribution algorithm may be used to allow creating more accounts on some of the back-end cluster nodes than on others.
  2. Least used: The provisioning interface must be aware of the number of accounts that exist on each back-end node, so that, each time a new account is created, the back-end cluster node that has the least number of accounts is used.
  3. Domain based: Each domain is placed on one of the back-end cluster node; the provisioning interface must have configured a domain/back-end service instance table in order to be able to select a specific back-end node when creating a new account. Also, each domain will have a "home" back-end node assigned.

The first and second distribution algorithms have the advantage of a better spread of the accounts to the back-end service instances. The disadvantage resides in the fact that each domain must be created on all the back-end service instances and that domain-wide settings for each domain must be kept in sync on all the other back-end systems.

The third distribution algorithm simplifies the management of the accounts (the domain is only created on the specific back-end node that will host that domain, changes to domain configuration are performed only on the domain-wise home back-end service instance). Moreover, routing can be performed with one home back-end entry per domain instead of one entry per account.

Groupware, sharing, address book and other common resources cannot be accessed between users from the same domain, if the domain is split across multiple back-end nodes.

For compatibility with other external systems, account synchronization, and also to complete the high availability and clustering functionalities, an LDAP service is required.

Axigen can authenticate out of the box against any external standard LDAP service. More than this, the current version can also synchronize its account base with specific LDAP services, namely Microsoft Active Directory and OpenLDAP. The synchronization can be made either Axigen to LDAP which means that any account created in Axigen will also be pushed and synchronized into the LDAP service, LDAP to Axigen, which means that any account created in the LDAP service (for example by a provisioning interface) will be created in Axigen internal database, or both ways synchronization, which means that the synchronization between the LDAP service and Axigen will be made in both ways: any account created in either of LDAP or Axigen will also be synchronized in the other database. In Axigen, the LDAP connectors can be configured for each domain, or generic LDAP connectors can be configured use place holders, expanded by Axigen into specific data (i.e. %d is replaced with the domain name).

In a clustered environment where multiple back-ends share accounts from the same domain, besides the user authentication and synchronization, SMTP routing is also possible using LDAP queries.