DW COMPONENTS
Cooking up a Data Warehouse
Todd Saunders
Abstract
As a data warehousing professional, you know that your environment has many components that must work together and interact just so to provide valuable information to your business. However, for colleagues who are beginning to familiarize themselves with data warehousing, it is not always clear what those components are and how they affect each other. This article will provide an analogy to help you explain key DW components and their interactions to neophytes.
Todd Saunders is chief solutions architect for CONNECT: The Knowledge Network. He has been building systems and organizations for nearly 20 years and has been involved specifically in building and implementing data warehouses, business intelligence solutions, and database marketing systems for the last 12 years. tsaunders@connectknowledge.com
Introduction
When explaining the basic components of a data warehouse environment, the analogy I like to use is that of a restaurant. This is not a new analogy; in a Web search, I found articles from several years ago that use it. However, new techniques and technologies have influenced the way data warehouses are developed and used, so it’s time to update the analogy.
Because we’re all familiar with restaurants, you can use this analogy to explain data warehousing to people (friends or colleagues, perhaps) who are unfamiliar with technology in general or with DW components such as ETL tools or databases. Providing an easy way to visualize what is happening in one of these solutions can go a long way toward effectively communicating how a data warehouse works. It should be easier for someone new to data warehousing to visualize and remember how ingredients are stored in a kitchen by type (for example, frozen, canned, or fresh) than to visualize how data is stored in a database according to subject area. The analogy will provide DW neophytes with a clarifying context about data warehouse solutions, so when they are pulled into conversations regarding data structures, they can discuss them meaningfully.
BUSINESS INTELLIGENCE Journal • vol. 14, no. 2
First, let’s define our terms. A restaurant is a business that prepares and serves food to customers. For this article, I will focus on the food preparation and delivery process of the restaurant, and not so much on issues related to renting a building or designing the décor. In short, we’ll look at the process of procuring raw materials (food) from suppliers, storing the food in the kitchen, and preparing and serving dishes ordered by customers. When we look at the general process flow that occurs in our restaurant, we assume placing orders for the ingredients in our dishes is the first step. When the ingredients arrive, we store them in the appropriate places in our kitchen—the freezer, the refrigerator, or on the shelf. As orders are placed by customers, the appropriate ingredients are retrieved by the chef; measured, mixed, combined, and cooked; and finally delivered to the customer. How is this relevant to data warehousing? The parallel between the two is striking, I believe. Purchasing raw ingredients is analogous to obtaining data from various source systems in your business. The raw ingredients— the data—come from source systems such as your ERP system, sales system, accounting system, or fulfillment system, to name a few. Occasionally, you may procure data from outside sources to get information about prospective customers or your market. Storing the raw ingredients in the appropriate places is the equivalent of storing data in a database (i.e., your data warehouse). As in a restaurant, it is important to put your resources (in this case, your data) in the right place so everyone knows what is stored where and how it can be retrieved. Preparing the dishes ordered by the customers is analogous to building reports and delivering the reports to your business users. 
To summarize, obtaining raw ingredients, storing the ingredients, then retrieving the ingredients to prepare a dish is similar to obtaining data from source systems, storing the data in a database, and then using that data to build reports. Get data in, manage it, and then get information out.
It sounds simple, right? Well, it can be, but the demands of business users and the complexity of businesses usually mean that the data warehousing environment needs more detail and complexity to support the business requirements. Again, our restaurant analogy can help clarify some of these issues.
Technical Resources
Just as a restaurant needs a top chef to prepare the meals, businesses need to have high-quality technical people who can build and manage the data warehouse environment. Just as a poor chef can take good ingredients and still produce a mediocre dish, a poor technical team can have good data supplied from the source systems yet still struggle to produce timely and accurate information. An experienced database administrator with deep knowledge of data warehousing is one of the keys to creating a successful data warehouse environment.
Data Sources
One of the complexities in data warehousing is determining what data to put into the data warehouse. In our restaurant analogy, this equates to figuring out what raw ingredients to order. The key is deciding what is going to be on the menu. As a restaurant owner, you decide what soups, salads, appetizers, main dishes, and desserts you will offer. Each of these dishes requires ingredients, so the complete menu gives you the total list of ingredients. If one of the desserts you offer is a milk shake, you know you will need to have milk, ice cream, and flavoring available in the kitchen. In the business world, knowing what information is required by business users will help you determine what data needs to be available in your data warehouse.
If the business users need a weekly report of accounts receivables by customer, you know that you will need data from the accounting department that details the amounts each customer owes. With that raw data in the data warehouse, a report of the receivables can be developed that displays the information needed by the end user.
Data Granularity
When preparing a meal, you need to know what the ingredients are as well as what amounts to use. In some cases, it may make sense to buy ingredients that are already a complete food item. For example, ice cream could be served as a dessert by itself or included as an ingredient in a more complex dessert. If a consumer wishes to know all the ingredients that are included in a dessert, it may not be enough to know that ice cream is one of the ingredients. The consumer may want to know exactly what ingredients went into the ice cream being served. The fact that they know it is ice cream may be at too high a level of aggregation for their needs.
The data that goes into data warehouses faces the same issues. If a source system can only deliver total sales amounts by product per week, it may not be possible to determine what days of the week the most sales are generated by product. This may or may not be important to the business users, but it is important to find out before the database is designed so that expectations can be set about exactly what level of detail will be available for analysis.
Data Updates
Another complexity is the timing of data refresh. In the previous example, a business user needed a report delivered each week of the receivables by customer, but what if the data warehouse only gets data from the accounting system once a month? In our restaurant, this would be like ordering milk once a month. That first week the cake you make with the milk tastes pretty good, just as it is supposed to. Going into weeks two, three, and four, the milk might not be providing the flavor and texture expected. In other words, the milk starts going bad and causes the end product (the cake) to be bad even if all the other ingredients are fresh and the cake is made and delivered to the customer in a timely manner. If you know that milk is only good for a week, you’ll set up weekly deliveries of fresh milk so that the dishes produced are good. In the same way, reports that are developed on a weekly basis will be meaningful only if the data is no more than a week old. If the data can only be updated monthly, then reports should be produced from that data only once a month.
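The freshness rule above can be written down as a simple check. The sketch below is only an illustration: the function name, the dates, and the day thresholds are assumptions made for this example, not part of any particular tool.

```python
from datetime import date, timedelta

def is_fresh_enough(last_refresh: date, report_date: date, max_age_days: int) -> bool:
    """Return True if the data is no older than the report cadence allows."""
    return (report_date - last_refresh) <= timedelta(days=max_age_days)

# Hypothetical scenario: accounting data was last loaded on June 1,
# and a report is requested on June 29 (28 days later).
last_refresh = date(2009, 6, 1)
report_date = date(2009, 6, 29)

weekly_ok = is_fresh_enough(last_refresh, report_date, max_age_days=7)
monthly_ok = is_fresh_enough(last_refresh, report_date, max_age_days=31)
```

With month-old milk in the refrigerator, the weekly report fails the check while the monthly report passes.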
Data Standardization
Another complexity is standardization and hygiene. Food orders can help in explaining what happens in the standardization process. The key to standardization and hygiene is getting everything to look the way we expect it to. If we have filet mignon on the menu, we need to know exactly what to order and how much of it. We can’t just place an order for “meat.” We need to make sure we are ordering beef, and we need to make sure we are ordering beef tenderloin and not strip steak.
What happens if we are ordering our beef from two different suppliers? One supplier may ship us individual filets. The other may ship us tenderloins that can be carved into six filets each. When we want to know how many filet dinners we’ll be able to serve at a given time, we need to know how many filets and tenderloins are on hand and how they add up to individual filet mignon meals. In business, we need to know the number of items in an order unit. Our business may have one supplier that ships six oil filters per order and another that ships 24 filters per order. In our data warehouse, we need to recognize how many orders have been received from each supplier and
apply the appropriate multiplication to know how many individual oil filters we have received. It is not meaningful to simply say we have received 20 orders, since we wouldn’t necessarily know how many of those orders were from the first supplier and how many were from the second. We may have received anywhere between 120 and 480 filters. The proper standardization of our warehouse data will tell us exactly.
Another form of standardization is recognizing that different terms may mean the same thing. In our example above, we know that one beef tenderloin equals six filets. If we order one pound of cilantro from one supplier and one pound of coriander from a second, we know that we actually have two pounds of cilantro (since cilantro and coriander are the same thing and we choose to call both cilantro). In business, we may have one division that uses the term “customers” and another that uses the term “consumers.” In our data warehouse, if the business rules specify, we can know that a customer is the same as a consumer, even though the different divisions refer to them with different names. Other businesses have both B2B and B2C models where customers and consumers are different, a distinction that needs to be known and tracked.
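The two kinds of standardization described above, unit conversion and synonym resolution, can be sketched in a few lines. The supplier names, the split of the 20 orders, and the helper names are hypothetical and chosen only for illustration; the pack sizes (6 and 24) and the cilantro/coriander rule come from the examples in the text.

```python
# Units per order for each supplier (pack sizes from the oil filter example;
# the supplier names themselves are made up).
UNITS_PER_ORDER = {"supplier_a": 6, "supplier_b": 24}

def total_units(orders_by_supplier):
    """Convert order counts into individual units using each supplier's pack size."""
    return sum(UNITS_PER_ORDER[s] * n for s, n in orders_by_supplier.items())

# Hypothetical split of 20 orders: 12 from the first supplier, 8 from the second.
filters = total_units({"supplier_a": 12, "supplier_b": 8})

# Synonym resolution: different source terms map to one standard name.
CANONICAL_NAME = {"coriander": "cilantro"}

def pounds_on_hand(deliveries):
    """Sum delivered pounds under the standard name for each ingredient."""
    totals = {}
    for name, pounds in deliveries:
        key = CANONICAL_NAME.get(name, name)
        totals[key] = totals.get(key, 0) + pounds
    return totals

herbs = pounds_on_hand([("cilantro", 1), ("coriander", 1)])
```

Knowing only "20 orders" leaves the answer somewhere between 120 and 480 filters; knowing the split per supplier pins it down to a single number.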
Data Storage (Database)
We need to know where to store our ingredients. They need to be organized and kept in specific places based on the type of ingredient, their attributes, and how they are used. In our kitchen, we need to keep frozen foods in the freezer, perishables in the refrigerator, and canned goods on the shelf. We also need to keep like foods together within each of those storage areas to make it easier to get to them. It makes sense to keep the cilantro and coriander together in the same bin and just call it cilantro. That way, it is quick and easy for our chef to go to one place and find what he or she is looking for. On the other hand, it would certainly make life more difficult if sugar and flour were kept in the same container and the chef had to try to separate it out each time a cup of one or the other was needed.
In our business example, we want to keep like data grouped together. It would be very difficult to manage
and access data if we tried to keep all information in one big table. Imagine if for every sales transaction you had to list information about the order (such as item sold, number of units, amount, and date/time) and all the information about the customer (name, address, phone, e-mail, previous purchases, lifetime value, value segment, etc.), as well as store information (including address, current manager, and inventory levels). Each record would be huge. We would have so much redundancy, and eventually so much conflicting information, that our data warehouse would become useless.
It makes more sense to keep information about customers in one area, store attributes in another, and sales transactions in another. Our sales transaction record would have the sales information (item ID, units sold, and amount), with just a store ID and customer ID that can be used to find out more information about the store or customer later if needed. Organizing the data in this manner will help with data management as well as data retrieval, a key attribute of data warehouses: the ability to access and retrieve data (relatively) quickly.
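A minimal sketch of this layout, using plain Python dictionaries in place of database tables, might look like the following. All names and values here are invented for illustration; a real warehouse would use database tables and joins rather than in-memory dictionaries.

```python
# Hypothetical lookup "tables," each keyed by its ID.
customers = {501: {"name": "Pat Example", "value_segment": "high"}}
stores = {12: {"address": "1 Main St", "manager": "Lee Example"}}

# The sales record carries only the facts of the sale plus two IDs.
sales = [
    {"item_id": "A-100", "units": 3, "amount": 29.97,
     "store_id": 12, "customer_id": 501},
]

def sale_with_details(sale):
    """Look up the store and customer rows for one sales record on demand."""
    return {
        **sale,
        "store": stores[sale["store_id"]],
        "customer": customers[sale["customer_id"]],
    }

detail = sale_with_details(sales[0])
```

Because the store manager's name lives in exactly one place, changing it there updates what every sale reports, which avoids the conflicting copies that the one-big-table design invites.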
ETL
In data warehousing, one of the biggest parts of the development effort is the ETL process. ETL (extract, transform, and load) refers to getting (extracting) data from point A (the source system), transforming it (e.g., changing euros to U.S. dollars), and loading it into point B (the correct table within the data warehouse). It is a much
simpler process in our restaurant example than in a real data warehouse system. In our restaurant, the extract consists of placing an order with a vendor. Once the order arrives, we transform it (cut up the tenderloin into filets) and load it to its proper location (freezer, refrigerator, or shelf). As it turns out, data across different business units within a company can require transformation and manipulation to make it compatible with the rest of the data within the warehouse. This is why it typically requires the most effort in a data warehouse development project.
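The three ETL steps can be illustrated with a toy pipeline. This is a sketch under stated assumptions: the exchange rate is a made-up constant, the source rows are invented, and the "warehouse" is just a dictionary of lists standing in for database tables.

```python
from decimal import Decimal

EUR_TO_USD = Decimal("1.40")  # hypothetical rate, for illustration only

def extract(source_rows):
    """Extract: pull the rows out of the source system (here, a plain list)."""
    return list(source_rows)

def transform(rows):
    """Transform: express every amount in U.S. dollars."""
    out = []
    for row in rows:
        amount = row["amount"]
        if row["currency"] == "EUR":
            amount = (amount * EUR_TO_USD).quantize(Decimal("0.01"))
        out.append({"order_id": row["order_id"], "amount_usd": amount})
    return out

def load(rows, warehouse, table):
    """Load: append the rows to the named table in the warehouse."""
    warehouse.setdefault(table, []).extend(rows)

source = [
    {"order_id": 1, "amount": Decimal("100.00"), "currency": "EUR"},
    {"order_id": 2, "amount": Decimal("50.00"), "currency": "USD"},
]
warehouse = {}
load(transform(extract(source)), warehouse, "sales")
```

In a real project, the transform step is where most of the effort goes, because each business unit's data needs its own rules before it is compatible with the rest of the warehouse.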
Matching
Matching is another key component of data warehousing. Going back to the beef example, our restaurant may have several meat suppliers. One supplier sends us “beef tenderloin,” another “filet mignon,” another “filets,” and yet another “beef (filet mignon).” We know these all refer to the same cut of meat, so we decide on one term, filet mignon, and call all of these by that single, standard name. This way we are able to easily track exactly how much filet mignon we have on hand. In business, we may receive sales transactions from the same Home Depot store, but the store name could have several variations: “Home Depot, Ottumwa, IA,” “Home Depot #207, IA,” “Home Depot Store #207, Ottumwa,” or other variations. If we are attempting to track sales across the different Home Depot stores, we need to know that these are all referring to the same store so that we can appropriately attribute the sale. Commercial matching software can be configured to help the data warehouse recognize that all of these refer to the same store and aggregate the information correctly.
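To make the idea concrete, here is a deliberately simple, rule-based sketch that resolves the three store-name variations above to one standard name. It is not how commercial matching software works internally; the function name, the reference table, and the rules are assumptions made for this example.

```python
import re

# Hypothetical reference data: which store number serves which city.
# The Ottumwa/#207 pairing is taken from the example in the text.
STORE_NUMBER_BY_CITY = {"ottumwa": 207}

def canonical_store(raw_name):
    """Map a free-form store name to one standard name, or None if unknown."""
    name = raw_name.lower()
    if "home depot" not in name:
        return None
    # Rule 1: an explicit store number wins.
    match = re.search(r"#(\d+)", name)
    if match:
        return f"Home Depot #{int(match.group(1))}"
    # Rule 2: otherwise fall back to a known city name.
    for city, number in STORE_NUMBER_BY_CITY.items():
        if city in name:
            return f"Home Depot #{number}"
    return None

variants = [
    "Home Depot, Ottumwa, IA",
    "Home Depot #207, IA",
    "Home Depot Store #207, Ottumwa",
]
resolved = {canonical_store(v) for v in variants}
```

All three variants collapse to a single standard name, so sales recorded under any of them can be attributed to the same store.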
Data Marts
Typically, a data mart contains summarized (or aggregated) data relevant to a particular subject area such as marketing or sales. In our kitchen, a data mart would be like food that is partially pre-made to expedite completion of the dish. Picture one of those fast food Chinese restaurants where you can choose rice or noodles, then one or several main courses such as orange chicken, garlic chicken, or beef and broccoli. The rice and noodles have already been cooked and are ready to dish, as are the main courses. This is how the data mart works. You have your raw ingredients (raw data) stored in the kitchen in the various storage areas (freezer, refrigerator, or shelf) just like the data in the tables in the data warehouse. You then partially prepare the food (cook the rice or make the orange chicken), much as you would aggregate the data for the data mart (sum up all the sales by customer or calculate total parts sold per time period). When you want to prepare the final dish (orange chicken on rice), you can quickly scoop the two ingredients together on a plate.
In the case of the data mart, you can simply select the appropriate time period and see how many parts were sold without having to go to the data warehouse and select each and every individual transaction (where some may have been sales transactions, some were order corrections, and some were returns). The data has already been prepared, so you know that when you ask for net parts sold during a time period, the mart has already applied all the necessary logic to the raw data to present you with the right answer.
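The pre-aggregation a data mart performs can be sketched as follows. The transaction rows, the field names, and the sign conventions (returns subtract, corrections carry their own sign) are assumptions made for this illustration; real business rules would come from the business users.

```python
from collections import defaultdict

# Hypothetical detail rows as they might sit in the data warehouse.
transactions = [
    {"period": "2009-01", "type": "sale", "units": 10},
    {"period": "2009-01", "type": "return", "units": 2},
    {"period": "2009-01", "type": "correction", "units": -1},
    {"period": "2009-02", "type": "sale", "units": 7},
]

def build_mart(rows):
    """Aggregate detail rows into net parts sold per time period."""
    mart = defaultdict(int)
    for row in rows:
        if row["type"] == "return":
            mart[row["period"]] -= row["units"]  # returns reduce net sales
        else:
            mart[row["period"]] += row["units"]  # corrections are already signed
    return dict(mart)

net_parts_sold = build_mart(transactions)
```

A user asking for net parts sold in a given period reads one pre-computed number from `net_parts_sold` instead of re-applying the sale, return, and correction logic to every transaction.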
Reporting
Consider the dishes delivered to the customers at their tables. The dishes are analogous to the presentation (i.e., reporting) layer in a data warehousing environment. They are the end product, produced using our raw ingredients as inputs. The dishes are ordered by the customers based on choices from the menu. The menu is not infinite. It has a set selection from which the customers can choose, because
the kitchen cannot possibly stock all the ingredients necessary to produce any dish that a customer might desire. Rather, the kitchen is stocked with the raw ingredients needed to produce any of the items listed on the menu. When a chef receives an order for veal parmesan, he or she knows that the necessary ingredients are available in the kitchen and can find those ingredients and produce the dish in a timely manner.
Similarly, in our data warehouse environment, the end users have been identified and their reporting needs captured. These reports are like the menu items. Just as the menu items require certain raw ingredients, the reports require certain data. Since the report specifications are known before the warehouse is built (if the process works as it should), we can be confident that the needed data is in the warehouse and available for each report.
Standard Reports
The reporting environment often includes a set of standard reports. It is useful in many cases for a business unit to receive the same report every Monday morning showing sales for the previous week, day, or other time period, depending on the business need. The point is that the report arrives at the expected time on a predetermined frequency containing the most recent information. This is like having the same meal prepared and picked up or delivered on a regular schedule. Maybe you like to treat yourself to a favorite meal every Friday for lunch and have it delivered. You talk to the restaurant, let them know how you would like the meal prepared, and ask them to deliver it to your office every Friday at noon. You get the same food every week and it is prepared with fresh ingredients each time.
Configurable Reports
Sometimes when reading a menu, you like a particular dish but would like to exchange one of the ingredients or side dishes. You might ask the server for soup instead of salad, or chips instead of fries. Reporting can operate in a similar fashion. Reporting environments will often provide a list of standard reports that an end user can select to view. However, the end user may want to vary that report slightly. For example, a particular report may show sales by region by year for the last 10 years, but the end user would prefer sales by month over the last 12 months. It is often possible to make reports like this configurable, where the end user can select some of these parameters, such as time period or region, but the basic structure of the report and the supporting data remain the same.
Ad Hoc Reporting
A variation of reporting is the buffet. A buffet has a lot of food, but not infinite options. It presents the food items the restaurant has decided the customers are most likely to want, leaving the customer free to pick and choose exactly which of those items they wish. The business analogy to the buffet is ad hoc reporting. End users can pick and choose which data they would like on their reports, but they have a finite amount of data to choose from. However, end users can choose and combine the data any way they want. The data that has been made available in the warehouse was based on gathering information requirements from the end users and finding the sources of data to put into the warehouse so the end users can access it. The available data should serve most or all of the business needs of the end users. One caution is that the users should have some familiarity with the data and the data structure so they don’t end up selecting data that would be the equivalent of putting mustard on an ice cream sundae.
Dashboards and Scorecards
In some ways, dashboards and scorecards are like standard reports, but they tend to present information in summarized, easy-to-read, graphical formats. For example, you may have an internal Web site for your company that displays several charts and graphs showing
key pieces of current information. There may be a graph showing quarter-to-date sales and how it compares to the goal. There may be another graph showing month-to-date profitability by region presented in a pleasing-to-the-eye, consumable format. When I think about eye-pleasing, consumable items in a restaurant, the dessert cart comes to mind. What is typically shown on a dessert cart is that day’s current selection of desserts in single-serving portions. You can quickly and easily see what is there without having read a menu; you can quickly zero in on the item that is of most interest to you. All of the desserts have been prepared and are ready to be served, just as the results on a dashboard or scorecard have already been calculated. The same information is probably available in one of the other standard reports, just as the desserts are probably listed in the menu. However, the dessert cart, like the dashboard,
presents interesting information in a format that is very quick to see (visualize) and comprehend.
Summary
It is a little surprising how closely the process of producing meals at a restaurant resembles the processes in a data warehousing environment. When people you know are thinking about the components of a data warehouse solution, this restaurant analogy should be a good way to help them keep the components and processes straight and provide a clearer picture of what is going on in your solution. Of course, no analogy is perfect, but this one does a good job of providing an easy-to-understand overview of what could be a whole new environment for those new to the technology.