End-user application developers are increasingly building applications around web pages; the user interface development environment of choice is now a web page development environment. Data warehousing is one of the core responsibilities of Information Technology. In many ways data warehousing fulfils the promise of "getting the data out" after the OLTP-based system "gets the data in". The web revolution has certainly not replaced the need for the data warehouse. In fact, the web revolution has raised everyone's expectations that all sorts of information will be seamlessly published through a web browser interface. The audience for data warehouse data has grown from internal management to encompass customers, partners, and a much larger pool of internal employees. The web's focus on the customer experience has made many organizations aware of the importance of learning about the customer and of giving the customer useful information.
The web revolution has propelled the data warehouse onto the main stage, because in many situations the data warehouse must be the engine that controls or analyzes the web experience. To step up to this new responsibility, the data warehouse must adjust; its nature needs to be somewhat different (Kimball, p. 4). As a result, our data warehouses are becoming data webhouses. The data warehouse is becoming the infrastructure that supports customer relationship management (CRM), and it is being asked to make the customer clickstream available for analysis. This rebirth of the data warehousing architecture is called the data webhouse (Kimball, July 1999).
2.0 The Need for Data Webhouses
Consider the marketing analysts. Five years ago they were saying, “Soon, we will have databases recording every customer interaction, no matter how seemingly insignificant….” Then, they would go on to predict how that holy grail, the complete customer record, would allow us to understand customer behavior, interests, and needs in greater depth than ever before. From that understanding would come growth, profit, customer satisfaction — everything a business executive could want.
Well, when it comes to the complete customer record, the Internet has let the genie out of the bottle. The "clickstream", that record of every mouse click or keystroke of every visitor to a Web site, has generated a more complete customer record than has ever existed in any form of commerce. (Kimball, April 2000)
The data webhouse is a web instantiation of the data warehouse. The webhouse plays a central and crucial role in the operations of a web-enabled business. The data webhouse must:
· House and publish clickstream data and other behavioral data from the web that drive an understanding of customer behavior.
· Act as the foundation for web-enabled decision making. The data webhouse must allow its users to make decisions about the web, as well as make decisions using the web.
· Act as a medium that publishes data to customers, business partners, and employees appropriately, while at the same time protecting the enterprise's data against unintended use.
3.0 Data Webhouse Architecture
Web-created demands are drawing the data warehouse ever closer to the front line of operational reporting and operational response generation, forcing us to rethink data warehouse architecture. Today's dramatically increased pace of business decision making requires not only a comprehensive snapshot of the business in real time but, simultaneously, answers to broad questions about customer behavior. The data warehouse is taking center stage in the web revolution, and this requires restating and adjusting our data warehouse thinking.
We believe the data webhouse is a variation of the traditional data warehouse, but its architecture differs from that of a data warehouse.
3.1 Features of the Data Webhouse
This “Data Webhouse” must:
· Be designed from the start as a fully distributed system, with many independently developed nodes contributing to the overall whole. In other words, there is no center to the data Webhouse.
· Not be a client/server system, but a Web-enabled system. This means a top-to-bottom redesign. A Web-enabled system delivers its results and exposes its interfaces through remote browsers on the Web.
· Deal equally well with textual, numeric, graphic, photographic, audio, and video data streams because the Web already supports this mix of media.
· Support atomic-level behavior data to at least the terabyte level in many data marts, especially those containing clickstream data. Many behavioral analyses must, by definition, crawl through the lowest level of data because the analysis constraints preclude summarizing in advance.
· Respond to an end-user request in roughly 10 seconds, regardless of the complexity of the request.
· Include the user interface's effectiveness as a primary design criterion. The only thing that matters in the data webhouse is the effective publication of information on the Web. Delays, confusing dialogs, and the lack of the desired choices are all direct failures. (Kimball, June 1999)
With this data warehouse evolution, we have managed to make three big technical design factors more difficult.
3.2 Challenges in Data Webhouse Design
· Timeliness. Business results must now be available in real time. "As of the previous day" reporting, on the wish list two years ago, is no longer a sufficient pace. Increasingly efficient delivery pipelines with smaller, just-in-time inventories, along with mass customization, force us to understand and respond to demand quickly.
· Data Volumes. The big move to mass customization means we now capture, analyze, and respond to every transaction in the business, including every gesture a customer makes both before and after operational or sales transactions, and there seems to be no volume limit. For instance, the combined Microsoft-related Web sites, analyzed daily as a single entity, have captured more than a billion page events on some busy days.
· Response Times. The Web makes fast response times critical. If something useful doesn't happen within 10 seconds, the user may navigate to another page. Those of us who run big data warehouses know that many queries take more than 10 seconds. (Kimball, 2000, p. 31)
3.3 Sample Data Webhouse Architecture
As these design factors have become more difficult, we find ourselves supporting a broader continuum of users and requests. To address these issues, we need to adjust our data warehouse architecture. We can't just make our single database server increasingly powerful. We can't make it deliver all these complicated objects and hope to keep up with these escalating requirements.
The following diagram illustrates a sample data webhouse architecture. Owing to word limitations, I have refrained from explaining the obvious and have concentrated on the distinct deviations from, and similarities to, the traditional data warehouse design.
(Source: Kimball, 2000, p. 32)
3.4 Deviations from the Traditional Data Warehouse Architecture
One way to take pressure off the main database engines is to build a powerful hot response cache (see above figure) that anticipates as many of the predictable and repeated information requests as possible. The hot response cache adjoins the application servers that feed the public Web server and the private firewall entry point for employees. A series of batch jobs running in the main webhouse application server creates the cache's data. Once stored in the hot response cache, the data objects can be fetched on demand through either a public Web server application or a private firewall application.
The fetched items are complex file objects, not low-level data elements. The hot response cache is therefore a file server, not a database. Its file storage hierarchy will inevitably be a simple kind of lookup structure, but it does not need to support a complex query access method.
3.4.1 Features of the Hot Response Cache
The hot response cache stores and delivers:
· Custom greetings to Web visitors, consisting of both text and graphics
· Dynamically chosen promotion content to Web visitors
· XML-based (Extensible Markup Language), structured-form content to business partners (what we used to call EDI) requesting delivery status, order status, hours' supply in inventory (we used to measure days' supply, which is becoming obsolete), and critical-path warnings in the delivery pipeline
· Low-level FAQ-like answers to problems and support requests
· Top-line reports to management, needing significant integration across time (multi-year trends), customers, product lines, or geographies, all delivered in three interchangeable formats (page-oriented report, pivot table, and graph) and frequently accompanied by images
· Downloadable precomputed OLAP cubes for exploratory analysis
· Data-mining studies, both near-term and long-term, showing the evolution of customer demographic and behavior clusters, and the effects of decisions about promotion content and Web site content on business done through the Web
· Conventional aggregations that enhance query performance when drilling up through standard hierarchies in the major dimensions such as customer, product, and time. (Kimball, 2000, p34)
The hot response cache's management must help it support the application servers' needs. Ideally, a batch job will have computed and stored in advance the information object that the application server needs. All applications need to be aware that the hot response cache exists and should be able to probe it to see if the answer they want is already there. The hot response cache has two distinct modes of use; the nature of the visitor session requesting the data determines which one to use.
The guaranteed-response-time request must produce some kind of answer to a page request that the Web server is handling, usually in less than a second. If the requested object (such as a custom greeting, a custom cross-selling proposition, an immediate report, or an answer to a question) has not been precomputed and hence is not stored, a default response object must be delivered in its place, all within the guaranteed response time.
The accelerated-response-time request hopes to produce a response to the Web visitor's request but defaults to computing the response directly from the underlying data warehouse if the precomputed object is not found immediately.
The application server should optionally be able to warn the user that there may be a delay in providing the response in this case. The Web server needs to be able to alert the application server if it detects that the user has gone on to another page, so the application server can halt the data warehouse process.
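The two modes described above follow a simple lookup-with-fallback pattern. The sketch below illustrates it in Python under stated assumptions: the cache is modeled as an in-memory dictionary (in practice it is a file server holding complex file objects), and all names are illustrative, not part of any real product.

```python
class HotResponseCache:
    """Stores precomputed response objects keyed by request signature.
    Modeled here as a dict; in practice a simple file-storage lookup."""

    def __init__(self):
        self._store = {}

    def put(self, key, obj):
        self._store[key] = obj  # batch jobs populate the cache in advance

    def lookup(self, key):
        return self._store.get(key)


def guaranteed_response(cache, key, default_obj):
    """Guaranteed-response-time mode: return the precomputed object,
    or a default response object; never compute from the warehouse."""
    obj = cache.lookup(key)
    return obj if obj is not None else default_obj


def accelerated_response(cache, key, compute_from_warehouse):
    """Accelerated-response-time mode: prefer the cache, but fall back
    to computing directly from the underlying warehouse (may be slow)."""
    obj = cache.lookup(key)
    if obj is not None:
        return obj
    return compute_from_warehouse(key)
```

In the guaranteed mode a cache miss yields the default object immediately; in the accelerated mode a miss falls through to the warehouse computation, which is the point at which the application server can warn the user of a possible delay.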
3.5 Similarities with Traditional Data Warehouses
Note that this strategy of seeking a precomputed answer and defaulting if necessary to the base data is exactly the way conventional aggregates have always worked in the data warehouse. The data warehouse aggregate navigator has always searched for aggregates to answer portions of an overall report query. If the navigator finds the aggregate, it uses it. But if it doesn't find the aggregate, it gracefully defaults to computing the answer slowly from the base data. Viewed this way, the hot response cache is a kind of supercharged aggregate navigator. (Kimball, 2000, p35)
As we can see, the data webhouse is a refinement of, not a radical departure from, the traditional data warehouse.
4.0 Building the Data Webhouse
An interesting part of the emerging data webhouse is the data mart that stores and presents the Web activity for later analysis. Fundamentally, we want to analyze all the hits on our Web site. We want to build a comprehensible view of the immense stream of clicks arriving at our sites, whether we're dealing with our intranet users or with customers on our public Web site. We call this aspect of our data webhouse the “clickstream data mart.”
The clickstream data mart can tell us a great deal about detailed customer behavior. If we have information on our customers' every click and gesture through our Web site, we should be able to answer such questions as:
· What parts of our Web site get the most visitors?
· What parts of the Web site do we associate most frequently with actual sales?
· What parts of the Web site are superfluous or visited infrequently?
· Which pages on our Web site seem to be "session killers," where the remote user stops the session and leaves?
· What is the new-visitor click profile on our site?
· What is the click profile of an existing customer? A profitable customer? A complaining customer that all too frequently returns our product?
· What is the click profile of a customer about to cancel our service, complain, or sue us?
· How can we induce the customer to register with our site so we learn some useful information about that customer?
· How many visits do unregistered customers typically make before they are willing to register? Before they buy a product or service?
Given this information, can we imagine building the clickstream data mart using conventional slice and dice dimensional models? And if we manage to build it, how can we hope to analyze the clickstream data mart to answer all these questions?
4.2 Building the Clickstream Data Mart
To build the clickstream data mart, let's use a simple four-step methodology to build the dimensional model:
· Define the source of our data.
· Choose the grain of our fact table.
· Choose the dimensions appropriate for that grain.
· Choose the facts appropriate for that grain.
We need to go after the most granular and detailed data possible describing the clicks on our Web server. Each Web server will potentially report different details, but at the lowest level we should be able to obtain a record for every page hit with the following information: precise date and time of page click; remote client's (requesting user's) IP address; page requested (with path to page starting at the server machine); specific control downloaded; and cookie information, if available.
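As a minimal sketch, assuming a hypothetical tab-delimited log layout (real web servers each have their own log formats), extracting the fields listed above from one log line might look like this:

```python
from datetime import datetime

def parse_hit(line):
    """Parse one hypothetical tab-delimited log line into the fields
    described above: timestamp, client IP, page path, control, cookie."""
    ts, ip, page, control, cookie = line.rstrip("\n").split("\t")
    return {
        "timestamp": datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        "client_ip": ip,           # requesting user's IP address
        "page": page,              # path starting at the server machine
        "control": control,        # specific control downloaded
        "cookie": cookie or None,  # cookie information, if available
    }
```

Each parsed record becomes one candidate row for the clickstream extract, before any filtering or sessionization is applied.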
The most serious problem, which permeates all analyses of web clicking behavior, is that page hits are often stateless (the server does not remember what the user did on the previous page). Without surrounding context, a page hit may just be a random isolated event that is difficult to interpret as part of a user session. Perhaps the user linked to this page from some remote Web site and then left the page five seconds later without returning. It is difficult to make much sense of such an event, so our first goal is to identify and label complete sessions.
The second serious problem is whether we can make any sense of the remote client's IP address. If the only client identification is the IP address, we cannot learn much. Most Internet users come through an Internet service provider (ISP) that assigns IP addresses dynamically, so a remote user will have a different address in a later session than in the current one. We can probably track an individual session reliably, but we cannot be sure of recognizing the user when he or she returns to the site in a different session.
We can significantly reduce these problems if our Web server creates cookies on the requesting user's machine. A cookie is a piece of information that the requesting user "agrees" to store and may agree to send to your Web server each time his browser opens one of your pages. A cookie usually does not contain much information, but it can identify the requesting user's computer unambiguously. Furthermore, it provides a way to link page hits across a complete user session. A cookie may contain significant information if the user has voluntarily registered with your Web server and provided other information, such as a true human name and a company affiliation. In such a case, the cookie provides an ID to data you have stored in one of your own databases.
In order to make the raw clickstream data usable in our data webhouse, we need to collect and transform the data so it has a session perspective. This process will be a significant step in the back room. We assume that we have some kind of cookie mechanism that lets us transform our data source into the following format:
· Precise date and time of page hit
· Identity of requesting user (consistent from session to session)
· Session ID
· Page and event requested
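The back-room transform that assigns session IDs can be sketched as follows. The 30-minute inactivity window is a common convention we assume here, not something the source prescribes, and the names are illustrative.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity window

def sessionize(hits):
    """hits: dicts with 'user' (cookie-based identity) and 'timestamp',
    sorted by time. Adds a 'session_id' of the form '<user>-<n>',
    starting a new session when a user's gap exceeds the timeout."""
    last_seen = {}  # user -> timestamp of that user's previous hit
    counters = {}   # user -> running session number
    for hit in hits:
        user, ts = hit["user"], hit["timestamp"]
        if user not in last_seen or ts - last_seen[user] > SESSION_TIMEOUT:
            counters[user] = counters.get(user, 0) + 1
        last_seen[user] = ts
        hit["session_id"] = "%s-%d" % (user, counters[user])
    return hits
```

Every hit then carries the session perspective the webhouse needs, with a user identity that is consistent from session to session.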
We now see that the grain of our clickstream data mart fact table is each event an individual user invokes in a specific session. Each event is an individual record, and each record is an event on a web page. Note that the web server may not notice events within the user interface of a downloaded web page unless we have programmed the page specifically to alert the server when the event occurs. In the back room's extract and transform process, we filter out automatic events that are more related to page formatting than to user actions, such as the download of the graphic images (GIFs, for instance) adorning a requested page. So if we have 100,000 user sessions per day on our Web site, and each session involves an average of eight meaningful events, then we will collect 800,000 records per day. (Kimball, 2000, p. 158)
The diagram below shows the clickstream data mart's dimensional model. The dimensions are universal date, universal time, local date, local time, user, page, event, and session. We split the date from the time of day because these two components have quite different descriptors: date relates to the calendar, weekdays, and seasons, while time of day relates to the specific point within a day. Most data warehouse fact tables that track specific times split the date from the time of day in this way. The date dimension clearly joins to a real dimension table with many textual attributes. The time-of-day dimension may be rather dull as a dimension unless we have some specific intervals during the day to which we are willing to assign names.
(Source: Kimball, 2000, p. 160)
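As a rough sketch of this grain, each fact-table row carries surrogate keys into the eight dimensions plus the single estimated fact; the column names below are illustrative, not Kimball's exact ones:

```python
# Illustrative clickstream fact-table layout: one row per page event.
FACT_COLUMNS = [
    "universal_date_key", "universal_time_key",
    "local_date_key", "local_time_key",
    "user_key", "page_key", "event_key", "session_key",
    "seconds_spent",  # the single, estimated fact
]

def make_fact_row(**keys):
    """Build one fact row; unsupplied columns default to None (for
    example, seconds_spent when no following event lets us estimate it)."""
    return {col: keys.get(col) for col in FACT_COLUMNS}
```

Keeping the row to surrogate keys plus one numeric fact is the standard dimensional-modeling discipline: all descriptive context lives in the dimension tables.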
We provide two versions of the date and time, universal and local, that let us align clickstream events in absolute time as well as relative to the user's wall clock. The analyst's query tool could perform this alignment, but that extra logic would impose an unreasonable burden on the application. Therefore, we prefer to provide two hard-wired entry points into each event's date/timestamp.
The user dimension should contain some useful information about who the user is, other than just a consistent machine ID. However, this will depend on whether we have coaxed the user into revealing facts about his or her identity. (Kimball, January 2000)
The page dimension is important because it contains the meaningful context that tells the analyst the user's Web site location. Each Web page must contain some simple descriptors identifying the location and type of page. A complete path name is not nearly as interesting as such basic attributes as "Product Info," "Company Info," "Frequently Asked Questions," and "Order Form." A large Web site should have a hierarchical description associated with each page that gives progressively more detail about what the page is. This information needs to be stored in the page dimension and be maintained consistently as we update and modify the Web site. In other words, we have to update the production transaction system (the Web server) responsively to meet the needs of the data webhouse analysts.
Finally, the session dimension is more than just a tag that groups together all the page events that constitute a single user's session. This dimension is also where we label the session and trace its activity. We might characterize sessions as "Searching for Information," "Random Browsing," "Price and Feature Shopping," or "Placing an Order." We may be able to create these tags with simple criteria regarding what the user does during the session, or we may turn over the session record to a full-blown link-analysis data mining package. In either case, the result is a set of descriptive tags we can put in the session dimension. The session should also be characterized by what we currently know about the customer, such as "Recent Large Purchaser," "Not Yet a Customer," or "Chronic Product Returner." (Kimball, January 2000)
Our clickstream fact table contains only one fact, and that fact ("time spent") is an estimate. We try to record accurately the length of time the user spent on the page after the last click and before moving on. Because page browsing is basically stateless, we can never be completely sure whether the user has minimized the window or clicked over to an unrelated site. We can only estimate the time spent on the page when a following event is part of the same session, and we have to be careful not to interpret long "times spent" too seriously.
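One way to compute this estimate in the back room, assuming events are already sorted by time, is to take the gap to the next event in the same session and discard implausibly long gaps; the cap value below is our assumption, not the source's:

```python
from datetime import datetime  # timestamps below are datetime objects

MAX_DWELL_SECONDS = 1800  # assumed cap: longer gaps treated as unknown

def estimate_time_spent(events):
    """events: list of (session_id, timestamp) tuples sorted by time.
    Returns dwell-time estimates in seconds, with None where no
    estimate is possible (the session's final event, or a gap
    exceeding the cap)."""
    estimates = []
    for i, (sid, ts) in enumerate(events):
        nxt = next((t for s, t in events[i + 1:] if s == sid), None)
        if nxt is None:
            estimates.append(None)  # last event of this session
        else:
            gap = (nxt - ts).total_seconds()
            estimates.append(gap if gap <= MAX_DWELL_SECONDS else None)
    return estimates
```

The None values make the estimate's limits explicit in the fact table rather than silently recording a misleading number.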
This dimensional design lets us perform many powerful queries. It is fairly easy to find the most frequently visited parts of the Web site and identify the most frequent users. We can also correlate pages and users to our more valuable customers because we know who places the order on the Web site.
The good news about this design is that we have successfully established a framework for collecting and analyzing all the clicks on our Web site. The bad news is that we really haven't shed much light on whether we are selling products or Web services. That confusion is quite deeply rooted and is one of the reasons that the Internet revolution is so interesting and important. (Kimball, 2000, p 162)
The tremendous rush toward customer relationship management (CRM), e-business, and business intelligence has brought many end-user departments into the computer marketplace as new customers. This demand is almost entirely great news for us data warehouse and data webhouse implementers. (Kimball, June 2000)
The web and the data warehouse are drawn together like two powerful magnets. The web needs the warehouse for many of its customer-centric functions, and the warehouse is being transformed by the demands of the web into the data webhouse. Data webhouses will play a major role in the corporate world in the very near future.
5.0 References
· Kimball, R., The Data Webhouse Toolkit, John Wiley & Sons, 2000