Caching Strategies for Data-Intensive Web Sites
We propose a customizable cache system architecture for Web sites whose
content is dynamically extracted from large relational databases. Our solution
improves the response time of those sites by customizing data caching at
the various levels of data elaboration within the Web site, according to
the users' access profiles. The system can cache the results of database
queries, intermediate XML fragments, and HTML files.
Performance problems may arise when a Web site provides access to large
numbers of pages that are dynamically built from a database. In this context,
producing a Web page may require costly interaction with the database system,
mainly for connection and querying. The database cost adds to the already
non-negligible base cost of Web page delivery. One solution for reducing
the client waiting time relies on caching HTML pages. Despite good performance,
this solution has several major drawbacks. It incurs significant space
overhead, propagating updates from the database to the cached data is made
more difficult, and the caching granularity (i.e. a page) is not always
appropriate. For instance, different fragments in a page can have different
update frequencies, and caching at the page level imposes the recomputation
of the entire page, even if some parts of the page did not change.
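The granularity argument can be illustrated with a minimal sketch in Python. It is not taken from Weave: the class `FragmentCache`, its methods, and the fragment names are hypothetical, and the sketch only shows that caching fragments separately lets a page be reassembled after an update by rebuilding the stale fragment alone.

```python
from typing import Callable, Dict, List


class FragmentCache:
    """Hypothetical fragment-level cache: each page fragment is cached on its own."""

    def __init__(self, builders: Dict[str, Callable[[], str]]):
        self.builders = builders          # fragment name -> function that rebuilds it
        self.cached: Dict[str, str] = {}  # fragment name -> last built content
        self.rebuilds: List[str] = []     # log of rebuilt fragments, for illustration

    def invalidate(self, name: str) -> None:
        """Drop one fragment; all other fragments stay cached."""
        self.cached.pop(name, None)

    def render_page(self, names: List[str]) -> str:
        """Assemble a page, rebuilding only the fragments that are missing."""
        parts = []
        for name in names:
            if name not in self.cached:
                self.cached[name] = self.builders[name]()
                self.rebuilds.append(name)
            parts.append(self.cached[name])
        return "".join(parts)
```

With page-level caching, the same update would force both fragments to be recomputed; here only the invalidated one is rebuilt.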
To overcome these problems, we propose a Web site management system,
called Weave, supporting customized cache management
and automatic generation of runtime policies from declarative specification
of Web sites.
In the following, we briefly discuss the key features and the corresponding
components of the system.
The foreseen advantage of a three-level caching architecture, coupled with
runtime policy customization, is that it makes it possible to specialize the
runtime behavior of a Web site according to each user, to the user's context,
or to the portion of the Web site being accessed, and to automatically
balance the load of such a system.
Weave is a data-intensive Web site management system. It relies on a
specification of the Web site through an XML graph data model that
captures the structure and the content of the site independently of its
graphical representation. A site schema then represents an XML view definition
over a relational database. This enables managing the Web site at three
levels: database, XML fragments, and HTML files. The
graphical representation of the site is described using
XSL style sheets.
Two components are then fundamental in the Weave system: the XML
Generator, which applies the definition of the site schema to
the underlying data and produces (fragments of) the XML site graph, and
the HTML Generator, which applies XSL style sheets
to XML fragments, resulting in browsable HTML pages.
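The two-stage pipeline can be sketched as follows. This is an illustrative stand-in, not Weave code: `generate_xml` and `generate_html` are hypothetical names, the `products` element layout is invented, and a plain Python function takes the place of an XSL style sheet, since applying real XSLT would need a library outside the standard library.

```python
import xml.etree.ElementTree as ET
from html import escape


def generate_xml(rows):
    """Stand-in for the XML Generator: turn database rows into an XML fragment."""
    root = ET.Element("products")
    for name, price in rows:
        item = ET.SubElement(root, "product", {"name": name})
        item.text = str(price)
    return root


def generate_html(fragment):
    """Stand-in for the HTML Generator; a real system would apply an XSL style sheet."""
    items = "".join(
        f"<li>{escape(p.get('name'))}: {escape(p.text)}</li>"
        for p in fragment.findall("product")
    )
    return f"<ul>{items}</ul>"
```

Keeping the stages separate is what makes an intermediate XML cache possible: the output of `generate_xml` can be stored and restyled without touching the database again.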
Weave comes with a declarative language, WeaveL, for specifying
the XML site schema. A WeaveL program consists of a set of site class specifications.
Each class specification includes the declaration of the parameters identifying
an instance of the class, the SQL query whose result gives all possible
instances for the above parameters (describing how to produce all instances
of the class), the specification of the data contained in an instance,
and the specification of the hyperlinks from an instance of the respective class.
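The parts of a site class specification listed above can be modeled as a small data structure. This is a Python sketch, not WeaveL syntax, which is not shown here; the class `SiteClass`, its field names, and the example table and class names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SiteClass:
    """Hypothetical in-memory model of one site class specification."""
    name: str
    parameters: List[str]                                  # parameters identifying an instance
    instances_query: str                                   # SQL query giving all parameter values
    content: Dict[str, str] = field(default_factory=dict)  # data item -> SQL producing it
    links: Dict[str, str] = field(default_factory=dict)    # link label -> target site class

    def instance_key(self, **values) -> tuple:
        """Build a key identifying one instance, e.g. for use in a cache."""
        missing = [p for p in self.parameters if p not in values]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        return (self.name,) + tuple(values[p] for p in self.parameters)
```

For example, a hypothetical `ProductPage` class with the single parameter `product_id` would identify the instance for product 7 as `("ProductPage", 7)`.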
Weave proposes a three-level caching architecture composed
of DB, XML, and HTML caches. The DB cache stores, in the DBMS,
the results of parameterized SQL computations in the form of relational
tables, and reuses those results for subsequent requests. This improves
performance of handling database queries, allows for efficient update
propagation, and enables caching of data that are shared among various pages.
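The idea of keeping parameterized query results as tables inside the DBMS can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration under stated assumptions, not the Weave DB cache: the class `DBCache` and its methods are hypothetical, query names are assumed to be safe SQL identifiers, result columns are assumed to have plain names, and invalidation is triggered by hand rather than by update propagation.

```python
import sqlite3


class DBCache:
    """Hypothetical DB cache: materializes query results as tables in the DBMS."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.tables = {}   # (query name, parameters) -> name of the cache table
        self.next_id = 0   # monotonic counter so table names never collide
        self.hits = 0

    def fetch(self, name: str, sql: str, params: tuple):
        key = (name, params)
        if key in self.tables:
            self.hits += 1
        else:
            # Run the real query once and copy its result into a cache table.
            cur = self.conn.execute(sql, params)
            cols = [f'"{d[0]}"' for d in cur.description]
            rows = cur.fetchall()
            table = f"cache_{name}_{self.next_id}"
            self.next_id += 1
            self.conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
            marks = ", ".join("?" for _ in cols)
            self.conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
            self.tables[key] = table
        return self.conn.execute(f"SELECT * FROM {self.tables[key]}").fetchall()

    def invalidate(self, name: str) -> None:
        """Drop every cached result of the named query, e.g. after a base-table update."""
        for key in [k for k in self.tables if k[0] == name]:
            self.conn.execute(f"DROP TABLE {self.tables.pop(key)}")
```

Because the cached result is an ordinary table, several pages that depend on the same query and parameters can share it.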
Compared to the HTML cache, which caches HTML files
on disk, the XML cache stores less data and allows
careful control over the granularity of the cached data, ranging from
the entire page to fragments of the page. Moreover, it reduces
the load generated on the database by the Web server.
The architecture also includes a manager for each individual cache and
a cache scheduler. The cache managers export the same interface and implement
usual cache operations for data retrieval, addition and removal, which
are triggered following events such as HTTP requests and invalidations.
The scheduler is the major component of the system. It has the task of coordinating
the behavior of the other components. It receives HTTP requests
and redirects them to the cache managers where data relating to a given
page may be retrieved.
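A minimal sketch of cache managers sharing one interface, coordinated by a scheduler, might look as follows. The names `CacheManager` and `Scheduler`, the fixed HTML, XML, DB lookup order, and the choice to populate every level on a miss are assumptions of this sketch; in the system described here such decisions would be governed by the runtime policy.

```python
from typing import Callable, Dict, Optional, Tuple


class CacheManager:
    """Hypothetical cache manager; all three levels export this same interface."""

    def __init__(self, level: str):
        self.level = level
        self.store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)

    def put(self, key: str, value: str) -> None:
        self.store[key] = value

    def remove(self, key: str) -> None:
        self.store.pop(key, None)


class Scheduler:
    """Routes a page request to the first cache level that holds its data."""

    def __init__(self, html: CacheManager, xml: CacheManager, db: CacheManager,
                 run_query: Callable[[str], str],
                 to_xml: Callable[[str], str],
                 to_html: Callable[[str], str]):
        self.html, self.xml, self.db = html, xml, db
        self.run_query, self.to_xml, self.to_html = run_query, to_xml, to_html

    def handle_request(self, page: str) -> Tuple[str, str]:
        """Return the page and the level that served it."""
        html = self.html.get(page)
        if html is not None:
            return html, "html"
        source = "xml"
        xml = self.xml.get(page)
        if xml is None:
            source = "db"
            rows = self.db.get(page)
            if rows is None:
                source = "database"            # full miss: query the base tables
                rows = self.run_query(page)
                self.db.put(page, rows)
            xml = self.to_xml(rows)
            self.xml.put(page, xml)
        html = self.to_html(xml)
        self.html.put(page, html)
        return html, source
```

The second element of the returned pair is there only to make the behavior observable; removing a page from the HTML cache makes the next request fall back to the XML cache without running the query again.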
The runtime behavior of a Web site can be controlled by a
runtime policy so as to make optimal usage of the caches according
to behavioral information such as the users' access patterns. A runtime
policy specifies which kind of data to prefetch or to cache (HTML pages, XML fragments,
relational tables, or any combination of those), which particular items
to prefetch or to cache (e.g. particular HTML pages), and which actions to execute under
different events like page requests, data updates or environmental changes.
Finally, we introduce a high-level language, WeaveRPL, for the abstract
specification of the cache system's behavior (the specification is similar for the three
caches). The language is based on
event-condition rules. It makes it possible to explicitly specify the global runtime
policies implemented by an individual cache manager for setting overall
features such as the maximum cache size, and the actions to be carried
out upon a global event such as a cache overflow. Furthermore, it builds
upon the declarative Web site specification and allows the definition
of per-site-class customized caching, that is, how to handle events related
to data retrieval, addition, removal, and staleness.
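The rule-based style of policy specification can be illustrated with a small Python sketch. It does not reproduce WeaveRPL, whose syntax is not shown here: the classes `Rule` and `PolicyEngine`, the event names, and the context dictionary are hypothetical, and each rule simply pairs an event and a condition with the action to run when both match.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    event: str                          # event name, e.g. "overflow" or "stale"
    condition: Callable[[dict], bool]   # evaluated on the event's context
    action: Callable[[dict], None]      # executed when the condition holds


class PolicyEngine:
    """Hypothetical rule engine holding a cache manager's runtime policy."""

    def __init__(self):
        self.rules: List[Rule] = []

    def on(self, event: str, condition: Callable[[dict], bool],
           action: Callable[[dict], None]) -> None:
        self.rules.append(Rule(event, condition, action))

    def fire(self, event: str, ctx: dict) -> int:
        """Run every matching rule and return how many actions were executed."""
        fired = 0
        for rule in self.rules:
            if rule.event == event and rule.condition(ctx):
                rule.action(ctx)
                fired += 1
        return fired
```

A global rule could, for instance, evict the oldest entry on an `"overflow"` event whenever the cache exceeds a maximum size, while a per-site-class rule could remove an entry on a `"stale"` event only when the context names a particular site class.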
 Daniela Florescu, Alon Levy, Dan Suciu, and Khaled Yagoub. Run time management of data
intensive Web sites. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), Edinburgh, UK,