Globule User Manual: 1 Introduction

1 Introduction

Globule is a system that performs Web site replication. This means that WWW documents are copied across multiple machines where clients can access them. A document has one origin server, which is the machine where it is edited and published. The origin server usually belongs to the owner of the document. In addition, a document can have any number of replica servers, which host copies of the original document and deliver them to the clients. Replica servers do not necessarily belong to the owner of the document.

Replicating a Web site has multiple advantages:

The documents remain accessible to clients even if some servers are down (provided that at least one of them is alive)
The performance of the site is optimized, because clients can fetch documents from a server close to them. Shorter server-to-client transfer distances usually mean faster downloads.
Multiple servers are better at handling a flash crowd than a single server. Flash crowds are events where the request load of a given site increases by several order of magnitudes within a few minutes or seconds. This frequently happens, for example, when a page is being linked to by Slashdot.
Playing with multiple servers is fun!

Globule is designed as a third-party module for the Apache version 2.0.x Web server. You will therefore benefit from all great features that made Apache the Number One Server On The Internet. In addition, Globule will provide replication functionalities to this server.

1.1 Globule Features

Globule provides the following features:

Replication: Web sites (or parts thereof) can be replicated across multiple servers, even if the involved servers do not belong to the same persons.
Client redirection: Clients accessing the site are automatically redirected to one of the replicas. This can be done using HTTP or DNS redirection. Moreover, Globule supports several policies which allow one to decide to which replica each client should be redirected.
Fault-tolerance: Each Globule server periodically checks the availability of the machines holding its replicas. In case one replica site is down or misconfigured, it will stop redirecting client requests to this replica until it has recovered. Globule also supports the creation of backups of the origin server so that the site will function correctly even if the origin server is down.
Monitoring: Globule allows administrators to monitor the behavior of their system in three different ways. First, the logs of requests addressed to replicas are transfered back to the origin server to rebuild a ``global access log.'' Second, Globule can attach a cookie with each delivered document containing information on how this request was treated. Lastly it is possible to collect internal statistics on the usage pattern and process them using configurable filters.
Adaptive replication: There are many ways by which a given document can be replicated and document updates can be taken into account. Unlike many other systems, Globule does not consider that there is one policy that is best in all cases [2]. Instead, it supports multiple policies and periodically checks for each document which policy is likely to offer best performance. When the current policy is not optimal, it is automatically replaced with the best one.
Dynamic document replication: Globule can not only replicate static documents but also scripts that are executed at the server to generate content, such as PHP scripts. The scripts themselves are replicated and executed on replica servers. If a PHP script accesses a MySQL database, then Globule can also cache database queries to further optimize performance [3].
Configuration server: The Globe Broker System (GBS) is a web site where Globule users can register, meet each other, and decide to replicate each other's content. Configuration files are automatically generated, which saves most users from the need to read most of this documentation.

In addition to the current features, we are currently conducting research on the following topics. When research is finished, these features will be integrated into Globule.

Latency estimation: We developed a nifty way to estimate the latency between any pair of nodes in the Internet, that is the delay it takes to transfer a bit of information from one machine to the other. Instead of sending gazillions of messages between every pair of machines, our method is based on a very low number of actual measurements [4]. A prototype is running fine, but it will take a while before we can integrate it into the Globule distribution.
Replica placement: Based on latency estimations, we are now capable to analyze the location of clients requesting a Web site, and derive a set of locations where replica servers should ideally be placed [5].
Flash-crowd prediction and pro-active management: The request rate of a web-site can sometimes change drastically within minuts such as when it is being referred in a popular bulletin board such as SlashDot. Flash-crowds, as they are called, often jam web-servers to the point where they become unusable. We are working on techniques to detect the early stages of flash-crowds, and pro-actively replicate the concerned sites such that they can serve every request efficiently [1].

1.2 Terminology

Figure 1: Globule Model

Globule makes a strong distinction between a site and a server. A site is defined as a collection of documents that belong to a given user. A server is a machine connected to a network, which runs an instance of the Globule software. Each server may host one or more sites, that is, be capable of delivering the site's content to its clients. As shown in Figure 1, servers can be classified in four categories according to the role they play with respect to this site.

Each site has one Origin server, which typically belongs to the site owner. The origin contains the authoritative version of all documents of the site. It is also in charge of making all decisions on where, when and how replication of the site should take place.

The origin server is helped delivering documents by any number of replica servers. Replica servers replicate documents locally and deliver them to the clients. A replica server for a site is typically operated by a different user than its origin, so the replica's administrator may impose restrictions on the amount of resources (disk space, bandwidth, etc,) that the hosted site can use on their machine. As a result, each replica server typically contains only a partial copy of its hosted site. Similarly to a caching proxy, when requested for a document not present locally, a replica server fetches the document from its origin before delivering it to the client.

When a replica server needs to fetch a document from the origin, it obviously requires that the origin server is available to deliver the document. Remember, however, that the origin server is typically simply the desktop machine of the site owner. Such machines are often unavailable for a multitude of reasons ranging from a voluntary shutdown to a network failure. To overcome this problem, the origin server can define any number of backup servers. Backups are very similar to replica servers, except that they hold a full copy of the hosted site. If a replica cannot contact the origin server, then it can obtain the requested information from any backup. Backup servers also perform the role of a regular replica server, helping serving the content of a site.

Once documents have been replicated across multiple servers, one needs to automatically redirect client requests to one of the replicas. This is done by one or more redirectors, which know the list of all replica and backup servers for the site, monitor their status, and redirect client requests to one of the currently available servers. Redirectors can use two different mechanisms for redirection: in HTTP redirection a redirector responds to HTTP requests from the clients with a redirection return code 302. Alternatively, in DNS redirection, a redirector implements a specialized DNS server that resolves the site's name into the IP address of the destination replica.

It should be clear that the distinction between origin, replica, backup and redirector servers refers only to the role that a given server takes with respect to any given site. The same server may for example simultaneously act as the origin and one of the redirectors for its owner's site, as a backup for a few selected friend's sites, as a replica for other sites, and as a redirector for yet other sites.

1.3 Known issues/limitations

Globule only works with Apache version 2.0.x. It does not work with Apache version 1.3.x, and there is no plan on changing this.

Apache version 2.1/2.2 is not supported up to and including Globule release 1.3.1. This may change in future releases of Globule.

globule@globule.org
February 27, 2006