Modeling the Cacheability of HTML Documents

Timo Koskela, Jukka Heikkonen and Kimmo Kaski
Helsinki University of Technology
Laboratory of Computational Engineering
P.O. Box 9400, FIN-02015 HUT, FINLAND


Web caching is a technique that benefits both Web users and service providers [1]. Proxy caches in particular offer very attractive advantages when implemented correctly. However, caching can also increase latency or serve outdated documents to users, and the rapid rate of change in today's Web makes caching less attractive. Optimizing cache performance is therefore a key issue in the effort to minimize traffic and server load while guaranteeing an acceptable service level for users. Many studies have approached this by developing optimal caching policies. A caching policy is the strategy used to replace objects in the cache: when a new document must be brought into a cache that is full (or heavily loaded), which document(s) should be removed? The difficulty lies in determining the value of a document, and whether it should be cached, without dramatically increasing computation and communication overheads.
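The replacement decision can be illustrated with the classic LRU (least recently used) policy, a common baseline heuristic. This is a minimal hypothetical sketch, not the policy of any particular proxy:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU replacement policy: when space is needed,
    evict the least recently used document first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.docs = OrderedDict()  # url -> size, ordered by recency

    def get(self, url):
        """Returns True on a cache hit and marks the document as
        most recently used."""
        if url not in self.docs:
            return False
        self.docs.move_to_end(url)
        return True

    def put(self, url, size):
        """Inserts a document, evicting LRU documents until it fits."""
        if url in self.docs:
            self.used -= self.docs.pop(url)
        while self.docs and self.used + size > self.capacity:
            _, evicted_size = self.docs.popitem(last=False)
            self.used -= evicted_size
        if size <= self.capacity:
            self.docs[url] = size
            self.used += size
```

LRU captures temporal locality but is blind to document size, retrieval cost, and future value, which is exactly the gap the policies discussed below try to close.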

Empirical Approach

Various heuristic policies have been proposed and shown to perform well in simulations with actual proxy access logs. However, these algorithms remain suboptimal, since each optimizes only a particular performance metric. Semi-adaptive policies have also been proposed, which collect statistics on request rates and server responses to allow a more realistic cost function to be optimized [2].
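The cost-aware GreedyDual-Size algorithm of [2] is a representative example of such a policy. The sketch below is an illustrative stdlib-only implementation (the versioned heap is an implementation detail added here, not part of the original description): each document gets a priority H = cost/size + L, where the baseline L rises to the priority of the last victim, so long-idle documents age out.

```python
import heapq

class GreedyDualSize:
    """Cost-aware replacement in the spirit of GreedyDual-Size [2]."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.L = 0.0      # inflation baseline, rises at each eviction
        self.H = {}       # url -> current priority
        self.size = {}    # url -> document size
        self.ver = {}     # url -> version, to recognize stale heap entries
        self.heap = []    # (priority, version, url)

    def _push(self, url, cost, size):
        self.H[url] = self.L + cost / size
        self.ver[url] = self.ver.get(url, 0) + 1
        heapq.heappush(self.heap, (self.H[url], self.ver[url], url))

    def access(self, url, cost, size):
        """Returns True on a cache hit; on a miss, evicts the
        lowest-priority documents until the new one fits."""
        if url in self.H:
            self._push(url, cost, size)   # hit: refresh priority
            return True
        while self.used + size > self.capacity and self.heap:
            h, v, victim = heapq.heappop(self.heap)
            if self.ver.get(victim) != v:
                continue                  # stale heap entry, skip
            self.L = h                    # inflate the baseline
            self.used -= self.size.pop(victim)
            del self.H[victim]
        self.size[url] = size
        self.used += size
        self._push(url, cost, size)
        return False
```

Note that the priority mixes only cost, size, and recency; the document's actual future value never enters the formula, which motivates the model-based approach below.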

If the value of a document could be predicted, cache performance could be optimized with respect to any given cost criterion. This would also make it possible to prefetch documents that change frequently and have a high value. Furthermore, adaptive policies could be used that adjust to changes in usage patterns, content distribution, or network and computation resources.

Model-Based Approach

A novel method of cache performance optimization is proposed, based on building a model that predicts the value of a document from features extracted from the document itself. The features characterize the structure of the document, e.g. its URL, content type, and number of references to other documents. Additional information can also be used, e.g. the expiration and last-modification dates, Cache-Control meta information (if available), the document size, and the relative proportion of text versus images. The extracted features then serve as the source of information for building a model that predicts the value of the document.
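Extraction of such features can be sketched with the Python standard library. The tag vocabulary below is a hypothetical subset of Table 1, and the header names are the standard HTTP ones; the paper's actual extraction pipeline is not specified at this level of detail:

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical subset of the tag vocabulary from Table 1.
TAGS = ["a", "b", "body", "br", "font", "h1", "head", "html",
        "img", "li", "link", "meta", "p", "script", "title", "tr"]

class TagCounter(HTMLParser):
    """Counts occurrences of each HTML start tag in a document."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def path_depth(url):
    """Number of '/' characters in the URL path, excluding the
    leading '/' that denotes the root directory (cf. feature 55)."""
    return max(urlparse(url).path.count("/") - 1, 0)

def extract_features(html_text, headers):
    """Builds a feature vector: tag counts followed by document
    length and binary header flags (cf. features 50-52)."""
    parser = TagCounter()
    parser.feed(html_text)
    features = [parser.counts[t] for t in TAGS]
    features.append(len(html_text))                          # doc length
    features.append(1 if "Last-Modified" in headers else 0)  # flag
    features.append(1 if "Expires" in headers else 0)        # flag
    return features
```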

Case Study

To test the proposed approach, a case study was carried out. An access log containing one day's requests was obtained from the FUNET (Finnish University and Research Network) proxy cache. A Web robot was then used to retrieve each HTML document in the log, and an HTML parser to parse its structure and store the features in a data file. The data was cleaned of empty HTML documents and of features that had the same value throughout the data (usually either missing or zero). The final data set consisted of 51054 HTML documents with 56 features each. Table 1 shows the features in more detail.

Table 1: Features 01-49 contain the number of occurrences of a certain HTML tag in the document. Features 50-56 include additional information about the document, extracted mostly from the Web cache access log. Features 51 (Last modified) and 52 (Expires) were assigned the value 1 if the corresponding header existed in the document headers and 0 otherwise. Feature 55 (Path depth) states how many "/" characters the document URL path contains, leaving out the first "/" character, which denotes the root directory.
01 <a>               02 <address>       03 <area>        04 <b>
05 <base>            06 <big>           07 <blink>       08 <blockquote>
09 <body>            10 <br>            11 <center>      12 <div>
13 <dl>              14 <dt>            15 <dd>          16 <em>
17 <font>            18 <frame>         19 <frameset>    20 <h1>
21 <h2>              22 <h3>            23 <h4>          24 <h5>
25 <h6>              26 <head>          27 <hr>          28 <html>
29 <i>               30 <iframe>        31 <img>         32 <li>
33 <link>            34 <map>           35 <meta>        36 <p>
37 <param>           38 <pre>           39 <script>      40 <select>
41 <small>           42 <strong>        43 <td>          44 <textarea>
45 <th>              46 <title>         47 <tr>          48 <u>
49 <ul>              50 Doc length      51 Last modified 52 Expires
53 Length of content 54 Content type    55 Path depth    56 Domain

To make the modeling more straightforward, the document value was taken to be binary: zero (low value) or one (high value). HTML documents with two or more requests in the access log were labeled as class one, and all others as class zero. Two model classes were used for the classification task: generalized linear models and nonlinear neural networks [3], in this case an MLP (multilayer perceptron) network. Separate portions of the data were used to estimate the model parameters and to test the model's ability to classify documents correctly.
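The labeling rule and a generalized linear classifier can be sketched as follows. This is a hypothetical stdlib-only stand-in (logistic regression trained by stochastic gradient descent); the paper's actual linear model and estimation procedure are not specified at this level of detail:

```python
import math

def label(request_counts):
    """Class 1 (high value) for documents with 2 or more requests
    in the access log, class 0 (low value) otherwise."""
    return {url: int(n >= 2) for url, n in request_counts.items()}

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fits a logistic regression (a generalized linear model)
    with plain stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)   # bias followed by feature weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(class 1)
            err = yi - p
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def predict(w, xi):
    """Returns the predicted class, 0 or 1."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return int(z > 0.0)
```

An MLP would replace the single linear unit with one or more hidden layers, allowing nonlinear decision boundaries over the same feature vectors.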

Table 2: Confusion matrix of the classification task (row percentages; values from the results reported below). Class 0 refers to a low value of the document (do not cache) and class 1 refers to a high value of the document (store in cache).

                       Linear model        MLP model
Class \ Prediction       0       1         0       1
0                       67%     33%       74%     26%
1                       44%     56%       36%     64%

Table 2 shows the classification results. The linear model classifies 67% of class zero and 56% of class one correctly; the MLP model classifies 74% of class zero and 64% of class one correctly. In this case, the nonlinear model distinguishes documents that have a high value from those that have a low value considerably more accurately than the linear model.
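The per-class rates quoted above are row-normalized entries of the confusion matrix. As a small arithmetic illustration (the counts below are hypothetical, chosen only to reproduce the linear model's percentages):

```python
def per_class_accuracy(confusion):
    """Row-normalized correct-classification rates from a 2x2
    confusion matrix [[n00, n01], [n10, n11]], rows = true class."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]
```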


Conclusions

The presented case study suggests that the proposed approach can be valuable in the optimization of a Web cache. The method effectively replaces heuristic algorithms with a model based on real measurement data. Building the model does, however, require collecting a large amount of data, and finding a suitable set of features and a model architecture, as well as estimating the model parameters, can be computationally intensive.

To be applicable in Web caches, the prediction of the document value should be carried out within the caching policy algorithm. The initial parameters of the model can, however, be estimated beforehand, which greatly reduces the computational burden. Since the statistics of the requested documents change continuously, the model parameters should be retuned from time to time, or even continuously as part of the caching algorithm.
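Continuous tuning of this kind can be sketched as a single stochastic-gradient step per observed outcome, here for the logistic model used earlier as an illustrative stand-in (the update rule is a generic online learning step, not one prescribed by the paper):

```python
import math

def online_update(w, x, y, lr=0.01):
    """One online gradient step on a logistic model: after each
    caching decision, the observed outcome y (1 if the document was
    re-requested, 0 otherwise) nudges the weights, so the model
    tracks shifting request statistics without full retraining."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    p = 1.0 / (1.0 + math.exp(-z))   # current predicted P(high value)
    err = y - p
    w[0] += lr * err                 # bias update
    for j, xj in enumerate(x):
        w[j + 1] += lr * err * xj    # per-feature weight update
    return w
```

Each update costs only one pass over the feature vector, so it fits inside the per-request work of the caching algorithm.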


References

  1. Wessels, D. (1995), Intelligent Caching for World-Wide Web Objects, Proceedings of INET'95.
  2. Cao, P., Irani, S. (1997), Cost-Aware WWW Proxy Caching Algorithms, USENIX Symposium on Internet Technologies and Systems.
  3. Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, Macmillan.