1.0 Introduction
End-user application developers are
increasingly building applications
around Web pages. The user interface development environment of choice
is now a web page development environment. Data warehousing is one of the core
responsibilities of information Technology. In many ways Data Warehousing
fulfils the promise of “getting the
data out” after the OLTP based system “gets the data in”. The web revolution has
certainly not replaced the need for the data warehouse. In fact the web
revolution has raised everyone’s expectations much higher that all sorts of
information will be seamlessly published through web browser interface. The
audience for data warehouse data has grown from internal management to encompass
customers, partners and much larger pool of internal employees. The web’s focus
on the customer experience has made many organization aware of learning about
the customer and giving the customer useful information.
The web revolution has propelled the data warehouse out onto the main stage, because in many situations the data warehouse must be the engine that controls or analysis the web experience. In order to step up to this new responsibility, the data warehouse must adjust. The nature of the data warehouse needs to be somewhat different. (Kimball, p 4) As a result, our data warehouses are becoming data webhouses. The data warehouse is becoming the infrastructure that supports customer relationship management (CRM). And the data warehouse is being asked to make the customer clickstream available for analysis. This rebirth of data warehousing architecture is called the data webhouse. (Kimball, 1999 (July))
2.0 The need for Data Webhouses
Consider
the marketing analysts. Five years ago they were saying, “Soon, we will have databases recording every
customer interaction, no matter how seemingly insignificant….” Then, they would go on
to predict how that holy grail, the complete customer record, would allow us to
understand customer behavior, interests, and needs in greater depth than ever
before. From that understanding would come growth, profit, customer satisfaction
— everything a business executive could want.
Well,
when it comes to the complete customer record, the Internet has let the genie
out of the bottle. The “clickstream”,
that record of every mouse click or keystroke of every visitor to a Web site,
has generated a more complete customer record than has ever existed in any form
of commerce. (Kimball, 2000 (April))
The data webhouse is a web instantiation
of the data warehouse. The webhouse plays a central and a crucial role in the
operations of the web enabled business. The Data Webhouse will,
·
Houses and
publishes click stream data and other behavioral data from the web that drive
an understanding of customer behavior.
·
Act as a
foundation for web enabled decision making. The data webhouse must allow its
users to make decisions about the web, as well as make decisions using the web.
·
Act as a medium that publishes data to the
customers, business partners, and employees appropriately, but at the same time
protects the enterprise’s data against unattended use.
3.0 Data Webhouse Architecture
Web-created demands are
drawing the data warehouse increasingly closer to the front line of operational
reporting and operational response generation, forcing to rethink data
warehouse architecture. Today's dramatically increased pace of business
decision-making requires not only a comprehensive snapshot of the business in
real time, but simultaneously, answers to broad questions about customer
behavior. The data warehouse is taking central stage in the Web revolution, and
it requires restating and adjusting our data warehouse thinking.
We believe Data Webhouse
is a variation of a traditional Data
WareHouse. But the architecture of a
Data Webhouse is different from a data warehouse.
3.1 Features of the Data Webhouse
This “Data Webhouse”
must:
·
Be
designed from the start as a fully distributed system, with many independently
developed nodes contributing to the overall whole. In other words, there is no
center to the data Webhouse.
·
Not
be a client/server system, but a Web-enabled system. This means a top-to-bottom
redesign. A Web-enabled system delivers its results and exposes its interfaces
through remote browsers on the Web.
·
Deal
equally well with textual, numeric, graphic, photographic, audio, and video
data streams because the Web already supports this mix of media.
·
Support
atomic-level behavior data to at least the terabyte level in many data marts,
especially those containing clickstream data. Many behavioral analyses must, by
definition, crawl through the lowest level of data because the analysis
constraints preclude summarizing in advance.
·
Respond
to an end-user request in roughly 10 seconds, regardless of the complexity of
the request
·
Include
the user interface’s effectiveness as a primary design criterion. The only
thing that matters in the data Webhouse is the effective publication of
information on the Web. Delays, confusing dialogs, and the lack of the desired
choices are all direct failures.
(Kimball, 1999, June)
With this data warehouse
evolution, we have managed to make three big technical design factors more
difficult.
o
Challenges
in Data Webhouse Design
·
Timeliness. Business results must
now be available in real time. “As of the previous day” reporting, on the wish
list two years ago, is no longer a sufficient pace. Increasingly more-efficient
delivery pipelines with smaller, just-in-time inventories, along with mass
customization, force us to quickly understand and respond to demand.
·
Data Volumes. The big move to mass
customization means we now capture, analyze, and respond to every transaction
in the business including every gesture a customer makes, both before and after
operational or sales transactions and there seems to be no volume limit. For
instance, the combined Microsoft-related Web sites, analyzed daily as a single
entity, on some busy days have captured more than a billion page events!
·
Response Times. The Web makes fast
response times critical. If something useful doesn't happen within 10 seconds
the user may navigate to another page. Those of us who run big data warehouses
know that many queries will take more than 10 seconds. (Kimball, 2000, p31)
o
Sample
Data WebHouse Architecture
As these
design factors have become more difficult, we find ourselves supporting a
broader continuum of users and requests. To address these issues, we need to
adjust our data warehouse architecture. We can't just make our single database
server increasingly powerful. We can't make it deliver all these complicated
objects and hope to keep up with these escalating requirements.
The following
diagram clearly explain the a sample
data ware house architecture.
Note :-
(Because of
the word limitations I have refrain from explaining the obvious and
concentrated on the distinct deviations and the similarities from the traditional data warehouse design
architecture.)
|
|
(Source :- Kimball, 2000, p 32)
3.4 Deviations from Traditional Data
Warehouse Architecture.
One way to take pressure
off the main database engines is to build a powerful hot response cache (see
above figure) that anticipates as many of the predictable and repeated
information requests as possible. The hot response cache adjoins the
application servers that feed the public Web server and the private firewall
entry point for employees. A series of batch jobs running in the main webhouse
application server creates the cache's data. Once stored in the hot response
cache, the data objects can be fetched on demand through either a public Web
server application or a private firewall application.
The fetched items are
complex file objects, not low-level data elements. The hot response cache is
therefore a file server, not a database. Its file storage hierarchy will
inevitably be a simple kind of lookup structure, but it does not need to
support a complex query access method.
3.4.1 The features of the “hot response cache”.
·
Custom
greetings to Web visitors, consisting of both text and graphics
·
Dynamically
chosen promotion content to Web visitors
·
XML-based
(Extended Markup Language) , structured-form content to business partners (what
we used to call EDI) requesting delivery status, order status, hours' supply in
inventory (we used to measure days' supply, which is becoming obsolete), and
critical-path warnings in the delivery pipeline
·
Low-level
FAQ-like answers to problems and support requests
·
Top-line
reports to management, needing significant integration across time (multi-year
trends), customers, product lines, or geographies all delivered in three
interchangeable formats including page-oriented report, pivot table, and graph,
and frequently accompanied by images
·
Downloadable
precomputed OLAP cubes for exploratory analysis
·
Data-mining
studies, both near-term and long-term, showing the evolution of customer
demographic and behavior clusters, and the effects of decisions about promotion
content and Web site content on business done through the Web
·
Conventional
aggregations that enhance query performance when drilling up through standard
hierarchies in the major dimensions such as customer, product, and time.
(Kimball, 2000, p34)
The hot response cache's
management must help it support the application servers' needs. Ideally, a
batch job will have computed and stored in advance the information object that
the application server needs. All applications need to be aware that the hot
response cache exists and should be able to probe it to see if the answer they
want is already there. The hot response cache has two distinct modes of use;
the nature of the visitor session requesting the data determines which one to
use.
The guaranteed response
time request must produce some kind of answer in response to a page request
that the Web server is handling, usually in less than a second. If the
requested object (such as a custom greeting, a custom cross-selling
proposition, an immediate report, or an answer to a question) has not been
precomputed and hence is not stored, a default response object must be
delivered in its place, all within the guaranteed response time.
The accelerated response
time request hopes to produce a response to the Web visitor's request but will
default to computing the response directly from the underlying data warehouse
if the precomputed object is not found immediately.
The application server
should optionally be able to warn the user that there may be a delay in
providing the response in this case. The Web server needs to be able to alert
the application server if it detects that the user has gone on to another page,
so the application server can halt the data warehouse process.
3.5 Similarities with Traditional Data
Warehouses
Note that this strategy
of seeking a precomputed answer and defaulting if necessary to the base data is
exactly the way conventional aggregates have always worked in the data
warehouse. The data warehouse aggregate navigator has always searched for
aggregates to answer portions of an overall report query. If the navigator
finds the aggregate, it uses it. But if it doesn't find the aggregate, it
gracefully defaults to computing the answer slowly from the base data. Viewed
this way, the hot response cache is a kind of supercharged aggregate navigator. (Kimball, 2000, p35)
As we can see Data
WebHosue is a refinement, not a distinct deviation from the traditional Data
Warehouse.
4.0 Building the Data Warehouse
An interesting part of the emerging data
webhouse is the data mart that stores
and presents the Web activity for later analysis. Fundamentally, we want to analyze all the hits on our Web
site. We want to build a comprehensible
view of the immense stream of clicks arriving at our sites, whether we're dealing with our intranet
users or with customers on our public
Web site. We call this aspect of our data webhouse the “clickstream data mart.”
The clickstream data mart
can tell us a great deal about detailed customer behavior. If we have
information on our customers' every click and gesture through our Web site, we
should be able to answer such questions as:
·
What
parts of our Web site get the most visitors?
·
What
parts of the Web site do we associate most frequently with actual sales?
·
What
parts of the Web site are superfluous or visited infrequently?
·
Which
pages on our Web site seem to be "session killers," where the remote
user stops the session and leaves?
·
What
is the new-visitor click profile on our site?
·
What
is the click profile of an existing customer? A profitable customer? A
complaining customer that all too frequently returns our product?
·
What
is the click profile of a customer about to cancel our service, complain, or
sue us?
·
How
can we induce the customer to register with our site so we learn some useful
information about that customer?
·
How
many visits do unregistered customers typically make with us before they are
willing to register? Before they buy a product or service?
Given this information,
can we imagine building the clickstream data mart using conventional slice and
dice dimensional models? And if we manage to build it, how can we hope to
analyze the clickstream data mart to answer all these questions?
4.2 Building Clickstream Data Mart
To build the clickstream
data mart, let's use a simple, four-step methodology to build the dimensional
model.
·
Define
the source of our data,
·
Choose
the grain of our fact table,
·
Choose
the dimensions appropriate for that grain
·
Choose
the facts appropriate for that grain.
We need to go
after the most granular and detailed data possible describing the clicks on our
Web server. Each Web server will potentially report different details, but at
the lowest level we should be able to obtain a record for every page hit with
the following information: precise date and time of page click; remote client's
(requesting user's) IP address; page requested (with path to page starting at
the server machine); specific control downloaded; and cookie information, if
available.
The most
serious problem, which permeates all analyses of Web clicking behavior, is that
the page hits are often stateless (Does not remember what the user did in the
last page) . Without surrounding context, a page hit may just be a random
isolated event that is difficult to interpret as part of a user session.
Perhaps the user linked to this page from some remote Web site and then left
the page five seconds later without returning. It is difficult to make much
sense out of such an event, so our first goal is to identify and label complete
sessions.
The second
serious problem is whether we can make any sense out of the remote client's IP
address. If the only client identification is the IP address, we cannot learn
much. Most Internet users come through an Internet service provider (ISP) that
assigns IP addresses dynamically. Thus, remote users will have a different
address in a later session than they have at the moment. We can probably track
the individual session reliably, but we can't be sure when the user returns to
the site in a different session.
We can
significantly reduce these problems if our Web server creates cookies on the
requesting user's machine. A cookie is a piece of information that the
requesting user "agrees" to store and may agree to send to your Web
server each time his browser opens one of your pages. A cookie usually does not
contain much information, but it can identify the requesting user's computer
unambiguously. Furthermore, it provides a way to link page hits across a
complete user session. A cookie may contain significant information if the user
has voluntarily registered with your Web server and provided other information,
such as a true human name and a company affiliation. In such a case, the cookie
provides an ID to data you have stored in one of your own databases.
In order to
make the raw clickstream data usable in our data webhouse, we need to collect
and transform the data so it has a session perspective. This process will be a
significant step in the back room. We assume that we have some kind of cookie
mechanism that lets us transform our data source into the following format:
• Precise date and time
of page hit
• Identity of requesting
user (consistent from session to session)
• Session ID
• Page and event
requested.
We now see
that each event an individual user invokes in a special session is the grain of
our clickstream data mart fact table. Each event is an individual record, and
each record is a event on a Web page. Note that the Web server may not notice
events within the user interface of a downloaded Web page unless we have
programmed the Web page to specifically alert the server when the event occurs.
In the back room's extract and transform process, we filter out automatic
events and focus on ones more related to page formatting than to user actions.
These kinds of filtered events include the download of graphic images, such as
GIFs adorning a requested page. So if we have 100,000 user sessions per day on
our Web site, and if each session involves an average of eight meaningful
events, then we will collect 800,000 records per day. (Kimball, 2000, p158)
See the below
diagram for the clickstream data mart's dimensional model. The dimensions are
universal date, universal time, local date, local time, user, page, event, and
session. We split the date from the time-of-day because these two time
components have quite different descriptors. Date relates to calendar,
weekdays, and seasons; and time-of-day relates to the specific spot we are in
within a day. Most data warehouse fact tables that track specific times split
the date from the time-of-day in this way. The date dimension clearly has a
real join to a real dimension table with many textual attributes. The
time-of-day dimension may be rather dull as a dimension unless we have some
specific intervals during the day to which we are willing to assign names.

(Source :- Kimball, 2000, p 160)
We provide two versions
of the date and time -- universal and local -- that let us align clickstream
events in absolute time as well as relative to the user's wall clock. The
analyst's query tool can perform this alignment, but this extra logic imposes
an unreasonable burden on the application. Therefore, we prefer to provide two
hard-wired entry points into each event's date/timestamp.
The user dimension should
contain some useful information about who the user is, other than just a
consistent machine ID. However, this will depend on whether we have coaxed the
user into revealing facts about his or her identity. (Kimball, 2000 (January))
The page dimension is
important because it contains the meaningful context that tells the analyst the
user's Web site location. Each Web page must contain some simple descriptors
identifying the location and type of page. A complete path name is not nearly
as interesting as such basic attributes as "Product Info,"
"Company Info," "Frequently Asked Questions," and
"Order Form." A large Web site should have a hierarchical description
associated with each page that gives progressively more detail about what the
page is. This information needs to be stored in the page dimension and be
maintained consistently as we update and modify the Web site. In other words,
we have to update the production transaction system (the Web server)
responsively to meet the needs of the data webhouse analysts.
Finally, the session
dimension is more than just a tag that groups together all the page events that
constitute a single user's session. This dimension is also where we label the
session and trace its activity. We might characterize sessions as "Searching
for Information," "Random Browsing," "Price and Feature
Shopping," or "Placing an Order." We may be able to create these
tags with simple criteria regarding what the user does during the session, or
we may turn over the session record to a full-blown link analysis data mining
package. In either case, the result is a set of descriptive tags we can put in
the session dimension. The session should also be characterized by what we
currently know about the customer, such as "Recent Large Purchaser,"
"Not Yet a Customer," or "Chronic Product Returner." (Kimball, 2000 (January))
Our clickstream fact
table only contains one fact, and that fact ("time spent") is an
estimate. We try to accurately record the length of time the user spent on the
page after the last click and before moving on. Because page browsing is
basically stateless, we can never be completely sure whether the user has
perhaps minimized the window or clicked on an unrelated site. We can only make
an accurate estimate of the time spent on the page if we have a following event
that is part of the session, but we have to be careful not to interpret long
"times spent" too seriously.
This dimensional design
lets us perform many powerful queries. It is fairly easy to find the most
frequently visited parts of the Web site and identify the most frequent users.
We can also correlate pages and users to our more valuable customers because we
know who places the order on the Web site.
The good news about this
design is that we have successfully established a framework for collecting and analyzing
all the clicks on our Web site. The bad news is that we really haven't shed
much light on whether we are selling products or Web services. That confusion
is quite deeply rooted and is one of the reasons that the Internet revolution
is so interesting and important. (Kimball, 2000, p 162)
5.0 Conclusion
The tremendous rush
toward customer relationship management (CRM), e-business, and business
intelligence has brought many end-user departments into the computer
marketplace as new customers. This demand is almost entirely great news for us
data warehouse and data Webhouse implementers.
(Kimball, 2000, June)
The web and the data
warehouse are drawn together like two
powerful magnets. The web needs the warehouse for many of it’s customer centric
functions and the warehouse is
been transformed by the demands of the web to Data Webhouses. Data Webhouses
will play a major role in the cooperate world in the very near future.
6.0 Reference
·
Kimball. R “Data Webhouse Toolkit” , 2000, John
Wiley & Sons
http://www.intelligententerprise.com/000605/webhouse.shtml