Shared Web Hosting Analysis Tool for Service Providers

Ludmila Cherkasova
Hewlett Packard Labs
1501 Page Mill Road
Palo Alto, CA 94303, USA

Mohan DeSouza
University of California
Dept. of Computer Science
Riverside, CA 92521, USA

Jobin James
University of California
Dept. of Computer Science
Riverside, CA 92521, USA

The shared Web hosting market targets small and medium-sized businesses. The most common purpose of a shared hosting web site is marketing, which means that most of its documents are static. In this model, many different sites are hosted on the same hardware: a shared Web hosting service creates a set of virtual servers on the same server, and each virtual server is set up to write its own access log. This setup, however, splits the ``whole picture'' of web server usage into multiple independent pieces, making it difficult for the service provider to understand and analyze the ``aggregate'' traffic characteristics. The situation becomes even more complex when the Web hosting infrastructure is based on a web server farm or cluster, used to create a scalable and highly available solution.

Several web log analysis tools are available (Analog, Webalizer, and WebTrends, to name just a few). They provide detailed data analysis that helps business sites understand their customers and their customers' interests. However, these tools lack the information that is of interest to system administrators: information that provides insight into the system's resource requirements and traffic access patterns.

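The Shared Web Hosting Analysis Tool (WHAT) aims to provide a Web hosting service profile and to characterize the system's usage specifics and trends along three dimensions: service characterization, traffic characterization, and system requirements characterization.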

These characteristics provide insight into the system's resource requirements and traffic access patterns, information that is of special interest to system administrators and service providers.

Service Characterization

While the typical growth for most of the sites is exponential, different sites take different amounts of time to double their traffic. Some sites experience a decrease in traffic rates and actually demonstrate negative growth. User access patterns differ significantly too. For example, some sites have a few very popular documents or products. Accesses to such sites are heavily skewed: 2% of the documents account for 95% of the site's traffic. In order to design an efficient, high-quality Web hosting solution, the specifics of access rates and users' access patterns should be taken into account. Traffic growth or decrease and changes in users' access patterns should be monitored in order to provision for those changes in good time and in the most efficient way.
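As an illustration of how doubling times can differ between sites, the sketch below estimates a doubling time from two traffic measurements under the assumption of purely exponential growth. The function name `doubling_time` and its parameters are hypothetical and are not part of WHAT; a negative result corresponds to a site whose traffic is shrinking (the magnitude is then the halving time).

```python
import math


def doubling_time(rate_start, rate_end, elapsed_days):
    """Estimate the traffic doubling time in days, assuming exponential growth.

    rate_start and rate_end are traffic rates (for example, bytes per day)
    measured elapsed_days apart. A negative result means the traffic is
    decreasing; its magnitude is the halving time.
    """
    if rate_start <= 0 or rate_end <= 0:
        raise ValueError("traffic rates must be positive")
    growth = math.log(rate_end / rate_start)
    if growth == 0:
        return math.inf  # no growth: the traffic never doubles
    return elapsed_days * math.log(2) / growth
```

For example, a site whose traffic quadruples in 30 days has an estimated doubling time of about 15 days under this model.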

WHAT identifies all the different hosted web sites from the given collection of web server access logs. For each hosted web site i, the tool builds a site profile from two characteristics: the access rate AR_i and the working set WS_i.

We normalize both AR_i and WS_i with respect to AR and WS combined over all the sites in order to identify the percentage contribution of each particular site. The access rate AR_i approximates the load that traffic to site i places on a server. The working set WS_i characterizes the memory (RAM) requirements of site i. These parameters provide a high-level characterization of customers (hosted web sites) and their system resource requirements. Site profiles accumulated on a daily (or weekly) basis make it possible to derive growth trends for those sites. The ``combined trend'' helps to evaluate and, more importantly, to predict the overall ``aggregate'' service growth, and to do capacity planning and scaling of the underlying infrastructure accordingly.
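The normalization described above can be sketched as follows. This is a minimal illustration, not WHAT's implementation: it assumes that AR_i is measured as bytes transferred for site i and that WS_i is the combined size of the distinct files accessed on site i. The function name `site_profiles` and the record layout are hypothetical.

```python
from collections import defaultdict


def site_profiles(records):
    """Return {site: (ar_percent, ws_percent)} for the observed period.

    records is an iterable of (site, path, size_bytes) tuples, one per
    successful request. Assumptions: AR_i is the number of bytes
    transferred for site i, and WS_i is the combined size of the
    distinct files accessed on site i.
    """
    bytes_sent = defaultdict(int)    # AR_i before normalization
    file_sizes = defaultdict(dict)   # site -> {path: size} for distinct files

    for site, path, size in records:
        bytes_sent[site] += size
        file_sizes[site][path] = size

    working_set = {s: sum(f.values()) for s, f in file_sizes.items()}
    total_ar = sum(bytes_sent.values())
    total_ws = sum(working_set.values())
    if total_ar == 0 or total_ws == 0:
        return {}

    # Normalize each site against the totals over all sites.
    return {
        s: (100.0 * bytes_sent[s] / total_ar, 100.0 * working_set[s] / total_ws)
        for s in bytes_sent
    }
```

Comparing the two percentages for a site shows whether it contributes mainly load (a high AR_i share) or mainly memory pressure (a high WS_i share).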

Traffic Characterization

WHAT reports the number of successful requests (code 200), conditional_get requests (code 304), and errors (the rest of the codes). The percentage of conditional_get requests often indicates the ``reuse'' factor for the documents on a server: these are documents cached somewhere in the Internet by proxy caches. WHAT provides statistics for the average response file size (averaged across all successful requests with code 200). We also build a characterization of the file size distribution (the average response size for 30/60/90% of all code 200 requests).
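The response-code breakdown and the average response size can be sketched as below. The function name `traffic_summary` and the input layout are hypothetical; the sketch assumes the status code and response size have already been extracted from each access log entry.

```python
from collections import Counter


def traffic_summary(records):
    """Return (shares, average_ok_size) for (status_code, size_bytes) records.

    shares maps "success" (code 200), "conditional_get" (code 304) and
    "error" (all other codes) to their percentage of all requests.
    average_ok_size is the mean response size over code 200 requests.
    """
    counts = Counter()
    ok_bytes = 0
    for status, size in records:
        if status == 200:
            counts["success"] += 1
            ok_bytes += size
        elif status == 304:
            counts["conditional_get"] += 1
        else:
            counts["error"] += 1

    total = sum(counts.values())
    if total == 0:
        return {}, 0.0

    categories = ("success", "conditional_get", "error")
    shares = {k: 100.0 * counts[k] / total for k in categories}
    average_ok_size = ok_bytes / counts["success"] if counts["success"] else 0.0
    return shares, average_ok_size
```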

WHAT reports the percentage of files requested only a few times: the files requested fewer than 2/6/10 times. This is another important traffic characterization; it is closely connected to document reuse and indicates the memory (RAM) efficiency for the analyzed workload. Requests for ``onetimers'' are most likely served from disk. This data helps in understanding whether performance improvements can be achieved by optimizing the caching or replacement strategy.
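A minimal sketch of this statistic follows, assuming the input is simply the sequence of requested file paths taken from the access logs. The function name `rarely_requested_share` is hypothetical.

```python
from collections import Counter


def rarely_requested_share(paths, thresholds=(2, 6, 10)):
    """Return {t: percent of distinct files requested fewer than t times}.

    paths is an iterable with one file path per request. With the
    default thresholds, the entry for 2 is the share of files that
    were requested exactly once.
    """
    hits = Counter(paths)
    distinct = len(hits)
    if distinct == 0:
        return {}
    return {
        t: 100.0 * sum(1 for count in hits.values() if count < t) / distinct
        for t in thresholds
    }
```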

System Requirements Characterization

System requirements are characterized by the combined access rate and working set of all the hosted sites during the observed period of time. WHAT also provides the combined size of the ``onetimers''. A high percentage of ``onetimers'' combined with a small memory size can cause poor site performance.

In order to characterize the ``locality'' of overall traffic to the site, we build a table of all accessed files with their sizes and access frequency information, ordered by decreasing frequency. WHAT provides working sets for 97/98/99% of all code 200 requests. Smaller working sets for 97/98/99% of the requests indicate higher traffic locality: the most frequent requests target a smaller set of documents. This characterization is very important for capacity planning and provisioning goals. In practice, the 97/98/99% working set makes it possible to plan the actual RAM size of the underlying web server hardware. It characterizes the ``actively used'' file subset independently of the overall ``passive'' file set size.
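The frequency-ordered table and the resulting partial working sets can be sketched as follows. This is one plausible reading of the description above, not necessarily the exact procedure used in WHAT: files are ranked by request count, and sizes are accumulated until the chosen percentage of requests is covered. The function name `working_set_for` and the input layout are hypothetical.

```python
def working_set_for(files, percent):
    """Return the combined size of the most popular files that together
    account for at least `percent` percent of all requests.

    files maps each path to a (request_count, size_bytes) pair, counting
    only successful (code 200) requests.
    """
    # Table of accessed files ordered by decreasing access frequency.
    ranked = sorted(files.values(), key=lambda entry: entry[0], reverse=True)
    total_requests = sum(count for count, _ in ranked)

    covered = 0
    size = 0
    for count, file_size in ranked:
        # Stop once the requested share of all requests is covered.
        if covered * 100 >= total_requests * percent:
            break
        covered += count
        size += file_size
    return size
```

In the usage below, a single 10-byte file receives 90 of 100 requests, so the 90% working set is only 10 bytes while the 99% working set must include every file.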

More details on WHAT and its usage can be found in [CDJ99]:

L. Cherkasova, M. DeSouza, J. James: Web Hosting Analysis Tool for Service Providers. HP Laboratories Report No. HPL-1999-150, 1999.