Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following: a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. Ralph Kimball states that a data warehouse is "a copy of transaction data specifically structured for query and analysis."

A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis. This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data are also commonly treated as part of data warehousing, and many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.

A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed. And data warehouses often focus on a specific activity or entity. Of course, if you want to define every user as a decision maker and all activities as decision-making processes, then my assertion is false.

But in my experience, the overwhelming uses of data warehouses are for quite mundane, non-decision-making purposes rather than as grist for making decisions with wide-ranging effects (so-called "strategic" decisions). In fact, I would assert that most data warehouses are used for post-decision monitoring of the effects of decisions or, as some people might say, for "operational" issues. By the way, this is not saying that using data warehousing in the decision-making process is not a wonderful, potentially high-return effort. But my caution is that though the trade press, vendors, and many industry experts trumpet the role of data warehousing vis-à-vis decision making, in reality we do not now have, nor will we ever have, a clear understanding of decision making.
Data warehousing arises from an organisation's need for reliable, consolidated, unique and integrated reporting and analysis of its data, at different levels of aggregation. The practical reality of most organisations is that their data infrastructure is made up of a collection of heterogeneous systems. For example, an organisation might have one system that handles customer relationships, a system that handles employees, systems that handle sales data or production data, yet another system for finance and budgeting data, etc. In practice, these systems are often poorly integrated or not integrated at all, and simple questions like "How much time did sales person A spend on customer C, how much did we sell to customer C, was customer C happy with the provided service, did customer C pay his bills?" can be very hard to answer, even though the information is available "somewhere" in the different data systems.

Another problem is that ERP systems are designed to support relevant operations. For example, a finance system might keep track of every single stamp bought: when it was ordered, when it was delivered and when it was paid. The system might also offer accounting principles (like double bookkeeping) that further complicate the data model. Such information is great for the person in charge of buying "stamps" or the accountant trying to sort out an irregularity, but the CEO is definitely not interested in such detailed information. The CEO wants to know things like "What's the cost?", "What's the revenue?" and "Did our latest initiative reduce costs?".

Yet another problem might be that the organisation is, internally, in disagreement about which data is correct. For example, the sales department might have one view of its costs, while the finance department has another view of that cost. In such cases the organisation can spend unlimited time discussing who has the correct view of the data.
It is partly the purpose of data warehousing to bridge such problems. It is important to note that in data warehousing the source data systems are considered as given. Suppose the CRM system identifies a person by initials, the employee time management system identifies a person by full name, and the ERP system identifies a person by social security number, and suppose a person can change his name. It is not the task of the data warehousing consultant to conclude that things therefore do not work and that the organization should invest in and implement one or two new systems to handle CRM, ERP, etc. in a more consistent manner. Rather, the data warehousing consultant is charged with making the data appear consistent, integrated and consolidated despite the problems in the underlying source systems. The consultant achieves this by employing different data warehousing techniques, creating one or more new data repositories (i.e. the data warehouse) whose data model(s) support the needed reporting and analysis.
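As a rough illustration of making data "appear consistent" while leaving the source systems untouched, the sketch below joins records from three source systems through a crosswalk table kept in the warehouse. Everything in it is an assumption made for the example: the system extracts, the field names, the sample values and the `consolidate` helper are hypothetical, and a real warehouse would typically do this in its ETL tooling.

```python
from dataclasses import dataclass

# Hypothetical extracts from three source systems, each with its own person key.
crm_hours = {"JS": 12.5}                    # CRM: keyed by initials
time_sheets = {"Jane Smith": 160.0}         # time management: keyed by full name
erp_salaries = {"123-45-6789": 5200.0}      # ERP: keyed by social security number

# Crosswalk maintained in the warehouse: one surrogate key per person,
# mapped to whatever identifier each source system happens to use.
crosswalk = [
    {"person_id": 1, "crm": "JS", "time": "Jane Smith", "erp": "123-45-6789"},
]


@dataclass
class PersonFacts:
    person_id: int
    customer_hours: float
    worked_hours: float
    salary: float


def consolidate(rows):
    """Build one integrated record per person from the three sources."""
    result = []
    for row in rows:
        result.append(PersonFacts(
            person_id=row["person_id"],
            customer_hours=crm_hours.get(row["crm"], 0.0),
            worked_hours=time_sheets.get(row["time"], 0.0),
            salary=erp_salaries.get(row["erp"], 0.0),
        ))
    return result


integrated = consolidate(crosswalk)
```

If a person changes name, only the crosswalk entry for the time management system needs updating; the surrogate `person_id` used for reporting stays the same.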
The concept of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.

Key developments in early years of data warehousing were:

• 1960s: General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
• 1970s: ACNielsen and IRI provide dimensional data marts for retail sales.
• 1983: Teradata introduces a database management system specifically designed for decision support.
• 1988: Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in IBM Systems Journal, where they introduce the term "business data warehouse".
• 1990: Red Brick Systems introduces Red Brick Warehouse, a database management system specifically for data warehousing.
• 1991: Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.
• 1991: Bill Inmon publishes the book Building the Data Warehouse.
• 1995: The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
• 1996: Ralph Kimball publishes the book The Data Warehouse Toolkit.
• 1997: Oracle 8, with support for star queries, is released.
• 1998: Microsoft releases Microsoft Analysis Services (then OLAP Services), heavily utilizing data warehousing schemas.
Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture, but rather there are multiple architectures that exist to support various environments and situations. The worthiness of the architecture can be judged from how the conceptualization aids in the building, maintenance, and usage of the data warehouse. One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:

• Operational database layer: The source data for the data warehouse. An organization's Enterprise Resource Planning systems fall into this layer.
• Data access layer: The interface between the operational and informational access layers. Tools to extract, transform and load data into the warehouse fall into this layer.
• Metadata layer: The data directory. This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
• Informational access layer: The data accessed for reporting and analyzing, and the tools for reporting and analyzing data. Business intelligence tools fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in this article, have to do with this layer.

In recent years, the evolution of data warehousing has reached a new pinnacle with the deployment of decision support capability throughout an organization and even beyond its conventional boundaries to partners and customers. In the early days, data warehousing focused almost entirely on providing strategic decision-making capability to knowledge workers in the corporate ivory tower. End users for the data warehouse were traditionally in areas such as marketing, strategic planning and finance. Access to information dramatically increased the quality of their decision-making. However, developing a superior corporate strategy is only part of what it takes to succeed in today's intensely competitive business environment. A great strategy is nothing without great execution. The emerging generation of data warehouse deployments improves the execution of a business strategy in addition to its development. This evolution imposes an ever-increasing set of service levels upon the data warehouse architect. In this article we discuss the evolution of data warehousing through the five stages that are most common in the maturation of decision support within an organization (see Figure 1).
Information Evolution in Data Warehousing
Data warehousing is a journey. The most successful data warehouse implementations deliver business value on an iterative and continuous basis. Each iteration builds upon its predecessor to increase the business value proposition for information delivery.
Stage 1: Reporting
The initial stage of data warehouse deployment typically focuses on reporting from a single source of truth within an organization. The data warehouse brings huge value simply by integrating disparate sources of information within an organization into a single repository to drive decision-making across functional and/or product boundaries. For the most part, the questions in a reporting environment are known in advance. Thus, database structures can be optimized to deliver good performance even when queries require access to huge amounts of information. The biggest challenge in Stage 1 data warehouse deployment is data integration. The challenges in constructing a repository with consistent, cleansed data cannot be overstated. There can easily be hundreds of data sources in a legacy computing environment, each with a unique domain value standard and underlying implementation technology. The hard work that goes into providing well-integrated information for decision-makers becomes the foundation for all subsequent stages of data warehouse deployment.

Stage 2: Analyzing
In a Stage 2 data warehouse deployment, decision-makers focus less on what happened and more on why it happened. Analysis activities are concerned with drilling down beneath the numbers on a report to slice and dice data at a detailed level. Ad hoc analysis plays a big role in Stage 2 data warehouse implementations. Questions against the database cannot be known in advance. Performance management relies a lot more on advanced optimizer capability in the RDBMS because query structures are not as predictable as they are in a pure reporting environment. Performance is also a lot more important in a Stage 2 data warehouse implementation because the information repository is used much more interactively. Whereas reports are typically scheduled to run on a regular basis with business calendars as a driver for timing, ad hoc analysis is fundamentally a hands-on activity with iterative refinement of questions in an interactive environment. Business users require direct access to the data warehouse via GUI tools without the need for programmer intermediaries. Support for concurrent query execution and large numbers of users against the warehouse is typical of a Stage 2 implementation.

Business users, however, are a very impatient bunch. Performance must provide response times measured in seconds or a small number of minutes for drill-downs in an OLAP (online analytical processing) environment. The database optimizer's ability to determine efficient access paths, using indexing and sophisticated join techniques, plays a critical role in allowing flexible access to information within acceptable response times.

Stage 3: Predicting
As an organization becomes well-entrenched in quantitative decision-making techniques and experiences the value proposition for understanding the “whats” and “whys” of its business dynamics, the next step is to leverage information for predictive purposes. Understanding what will happen next in the business has huge implications for proactively managing the strategy for an organization. Stage 3 data warehousing requires data mining tools for building predictive models using historical detail. The number of end users who will apply the advanced analytics involved in predictive modeling is relatively small. However, the workloads associated with model construction and scoring are intense. Model construction typically involves derivation of hundreds of complex metrics for hundreds of thousands (or more) of observations as the basis for training the predictive algorithms for a specific set of business objectives. Scoring is frequently applied against a larger set (millions) of observations because the full population is scored rather than the smaller training sets used in model construction. Advanced data mining methods often employ complex mathematical functions such as logarithms, exponentiation, trigonometric functions and sophisticated statistical functions to obtain the predictive characteristics desired. Access to detailed data is essential to the predictive power of the algorithms. Tools from vendors such as SAS and Quadstone provide a framework for development of complex models and require direct access to information stored in the relational structures of the data warehouse. The business end users in the data mining space tend to be a relatively small group of very sophisticated analysts with market research or statistical backgrounds. However, beware in your capacity planning! This small quantity of end users can easily consume 50% or more of the machine cycles on the data warehouse platform during peak periods. This heavy resource utilization is due to the
complexity of data access and volume of data handled in a typical data mining environment.
Stage 4: Operationalizing

Operationalization in Stage 4 of the evolution starts to bring us into the realm of active data warehousing. Whereas stages 1 to 3 focus on strategic decision-making within an organization, Stage 4 focuses on tactical decision support. Think of strategic decision support as providing the information necessary to make long-term decisions in the business. Applications of strategic decision support include market segmentation, product (category) management strategies, profitability analysis, forecasting and many others. Tactical decision support is not focused on developing corporate strategy in the ivory tower, but rather on supporting the people in the field who execute it.

Operationalizing typically means providing access to information for immediate decision-making in the field. Two examples are (1) inventory management with just-in-time replenishment and (2) scheduling and routing for package delivery. Many retailers are moving toward vendor-managed inventory, with a retail chain and the manufacturers that supply it working as partners. The goal is to reduce inventory costs through more efficient supply chain management. In order for the partnership to be successful, access to information regarding sales, promotions, inventory-on-hand, etc. must be provided to the vendor at a detailed level. Manufacturing, delivery and so on can then be executed efficiently based on inventory requirements on a per-store and per-SKU level. To be useful, the information must be extremely up-to-date and query response times must be very fast.

In the example of package shipping with less-than-full-load trucking, there are very complex decisions involved in how to schedule trucks and route packages. Trucks generally converge at break bulks, wherein packages get moved from one truck to another so that they ultimately arrive at their desired destination (in a way very analogous to how humans are shuffled around between connecting flights at an airline hub). When packages are on a late-arriving truck, tough decisions need to get made in regard to whether the connecting truck that the late package is scheduled for will wait for the package or leave on time. If it leaves without the package, the service level on that package may be compromised. On the other hand, waiting for the delayed package may cause other packages that are ready to go to miss their service levels.

How long the truck should wait will depend on the service levels of all delayed packages destined for the truck as well as service levels for those packages already on the truck. A package due the next day is obviously going to have more difficulty in meeting its service levels under conditions of delay than one that is not due until many days later. Moreover, the sending and receiving parties associated with the package shipment should also be considered. Higher priority on making service levels should be given to packages associated with profitable customers where the relationship may be at risk if a package is late. Alternative routing options for the late packages, weather conditions and many other factors may also come into play. Making good decisions in this environment amounts to a highly complex optimization problem.
It is clear that a break bulk manager will dramatically increase the quality of his or her scheduling and routing decisions with the assistance of advanced decision support capabilities. However, for these capabilities to be useful, the information to drive decision-making must be extremely up-to-date. This means continuous data acquisition into the data warehouse in order for the decision-making capabilities to be relevant to day-to-day operations. Whereas a strategic decision support environment can use data that is loaded once per month or once per week, this lack of data freshness is unacceptable for tactical decision support. Furthermore, the response time for queries must be measured in a small number of seconds in order to accommodate the realities of decision-making in an operational, field environment.

Stage 5: Active Warehousing
The larger the role an active data warehouse plays in the operational aspects of decision support, the more incentive the business has to automate the decision processes. Both for efficiency reasons and for consistency in decision-making, the business will want to automate decisions when humans do not add significant value. In e-commerce business models there is no choice but to automate decision-making when a customer interacts with a Web site. Interactive customer relationship management (CRM) on a Web site or at an ATM is all about making decisions to optimize the customer relationship through individualized product offers, pricing, content delivery and so on. The very
complex decision-making associated with interactive CRM takes place without humans in a completely automated fashion and must be executed with response times measured in seconds or milliseconds. As technology evolves, more and more decisions become executed with event-driven triggers to initiate fully automated decision processes. For example, the retail industry is on the verge of a technology breakthrough in the form of electronic shelf labels. This technology obsoletes the old-style Mylar labels, which require manual labor to update prices by swapping small plastic numbers on a shelf label. The new electronic labels can implement price changes remotely via computer controls without any manual labor. Integration of the electronic shelf label technology with an active data warehouse facilitates sophisticated price management with as much automation as a business cares to deploy. For seasonal items in stores where inventories are higher than they ought to be, it will be possible to automatically initiate sophisticated mark-down strategies to drive maximum sell-through with minimum margin erosion. Whereas a sophisticated mark-down strategy is prohibitively costly in the world of manual pricing, the use of electronic shelf labels with promotional messaging and dynamic pricing opens a whole new world of possibilities for price management. Moreover, the power of an active data warehouse allows these decisions to be made in an optimal fashion on an item-by-item, store-by-store and second-by-second basis using event triggering and sophisticated decision support capability. In a CRM context, even customer-by-customer decisions are possible with an active data warehouse. Intense competition and technology innovations are motivating these advances in decision support deployment. An active data warehouse delivers information and enables decision support throughout an organization rather than being confined to strategic decision-making processes. 
However, tactical decision support does not replace strategic decision support. Rather, an active data warehouse supports the coexistence of both types of workloads. Notice in Figure 1 that a significant amount of workload in a Stage 5 data warehouse is still focused on strategic thinking. The operationalized and event triggered decision support of stages 4 and 5 provide the execution capability for strategies developed from traditional data warehouse analysis characterized in stages 1 to 3.
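The event-triggered mark-down idea from the electronic shelf label example can be sketched roughly as follows. This is only an illustration under stated assumptions: the `InventoryEvent` fields, the discount rule in `markdown_price` (proportional to excess inventory, capped at 30%) and the `push_label` callback are all hypothetical and are not taken from any real pricing system or shelf label API.

```python
from dataclasses import dataclass


@dataclass
class InventoryEvent:
    store: str
    item: str
    on_hand: int    # units currently in the store
    target: int     # units the plan expected at this point in the season
    price: float    # current shelf price


def markdown_price(event, max_discount=0.30):
    """Hypothetical rule: discount in proportion to excess inventory, capped."""
    if event.target <= 0 or event.on_hand <= event.target:
        return event.price  # no excess (or no usable plan): keep the price
    excess_ratio = (event.on_hand - event.target) / event.target
    discount = min(max_discount, excess_ratio * 0.5)
    return round(event.price * (1 - discount), 2)


def on_inventory_event(event, push_label):
    """Event trigger: compute the price and push it only if it changed."""
    new_price = markdown_price(event)
    if new_price != event.price:
        push_label(event.store, event.item, new_price)
    return new_price


# Usage: collect label updates in a list instead of calling real hardware.
sent = []
on_inventory_event(
    InventoryEvent("S1", "scarf", on_hand=150, target=100, price=20.0),
    lambda store, item, price: sent.append((store, item, price)),
)
```

The decision runs per item and per store whenever an inventory event arrives, with no human in the loop, which is the kind of automation the text describes for Stage 5.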
Normalized versus dimensional approach for storage of data

There are two leading approaches to storing data in a data warehouse: the dimensional approach and the normalized approach.

In a dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are:

1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and
2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to:

1. join data from different sources into meaningful information and then
2. access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree.
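To make the split between facts and dimensions concrete, here is a minimal sketch of a dimensional layout for the sales example, using Python's built-in `sqlite3` module with an in-memory database. The table names (`fact_sales`, `dim_customer`, `dim_product`), the columns and the sample rows are assumptions chosen for illustration, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimensions hold the descriptive context; the fact table holds the numeric
# measures plus a key into each dimension.
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    order_date  TEXT,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    quantity    INTEGER,
    price_paid  REAL
);
""")

conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(10, "Widget"), (11, "Gadget")])
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
    [
        ("2024-01-05", 1, 10, 3, 30.0),
        ("2024-01-06", 1, 11, 1, 25.0),
        ("2024-01-06", 2, 10, 2, 20.0),
    ],
)

# A typical dimensional query: aggregate a fact, grouped by a dimension attribute.
revenue_by_customer = conn.execute("""
    SELECT c.name, SUM(f.price_paid)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The query needs only one join per dimension used, which is part of why dimensional models tend to be easy for users to navigate; a normalized model of the same data would usually spread the customer and order details over more tables.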
Conforming information

Another important decision in designing a data warehouse is which data to conform and how to conform the data. For example, one operational system feeding data into the data warehouse may use "M" and "F" to denote the sex of an employee while another operational system may use "Male" and "Female". Though this is a simple example, much of the work in implementing a data warehouse is devoted to making data with similar meaning consistent when they are stored in the data warehouse. Typically, extract, transform, load tools are used in this work. Master Data Management has the aim of conforming data that could be considered "dimensions".
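The "M"/"F" versus "Male"/"Female" example can be sketched as a small transform step. This is a minimal illustration, assuming two hypothetical source extracts (`hr_rows`, `payroll_rows`) and a hand-written mapping table; real ETL tools offer their own lookup and mapping components for this.

```python
# Hypothetical mapping from source-specific values to one conformed code.
SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def conform_sex(raw):
    """Return the conformed code, or None when the source value is unknown."""
    if raw is None:
        return None
    return SEX_CODES.get(raw.strip().lower())


# Two operational systems that encode the same attribute differently.
hr_rows = [{"employee": "A", "sex": "M"}, {"employee": "B", "sex": "F"}]
payroll_rows = [{"employee": "C", "sex": "Male"}, {"employee": "D", "sex": "Female"}]

# Transform step: every row gets the same encoding before it is loaded.
conformed = [
    {**row, "sex": conform_sex(row["sex"])}
    for row in hr_rows + payroll_rows
]
```

Returning `None` for unrecognized values, rather than guessing, keeps bad source data visible so it can be reviewed instead of silently loaded.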
Top-down versus bottom-up design methodologies

Bottom-up design

Ralph Kimball, a well-known author on data warehousing, is a proponent of an approach to data warehouse design frequently considered as bottom-up. In the so-called bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data. These data marts can eventually be unioned together to create a comprehensive data warehouse. The combination of data marts is managed through the implementation of what Kimball calls "a data warehouse bus architecture". Business value can be returned as quickly as the first data marts can be created. Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse. The most important management task is making sure dimensions among data marts are consistent. In Kimball's words, this means that the dimensions "conform".

Top-down design

Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise. Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities. Inmon states that the data warehouse is:

• Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
• Time-variant: Changes to the data are tracked over time, so that reports can show how the data has changed.
• Non-volatile: Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only, and retained for future reporting.
• Integrated: The data warehouse contains data from most or all of an organization's operational systems and this data is made consistent.

The top-down design methodology generates highly consistent dimensional views of data across data marts since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage of the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the duration of time from the start of the project to the point that end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.
Hybrid design

Over time it has become apparent to proponents of bottom-up and top-down data warehouse design that both methodologies have benefits and risks. Hybrid methodologies have evolved to take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data consistency of top-down design.
Data warehouses versus operational systems

Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow the rules of database normalization, introduced by Codd, in order to ensure data integrity. Five increasingly stringent normal forms are commonly described. Fully normalized database designs (that is, those satisfying all five normal forms) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.

Data warehouses are optimized for speed of data analysis. Frequently data in data warehouses are denormalised via a dimension-based model. Also, to speed data retrieval, data warehouse data are often stored multiple times: in their most granular form and in summarized forms called aggregates. Data warehouse data are gathered from the operational systems and held in the data warehouse even after the data has been purged from the operational systems.

Evolution in organization use

Organizations generally start off with relatively simple use of data warehousing. Over time, more sophisticated use of data warehousing evolves. The following general stages of use of the data warehouse can be distinguished:

• Offline Operational Database: Data warehouses in this initial stage are developed by simply copying the data off an operational system to another server, where the processing load of reporting against the copied data does not impact the operational system's performance.
• Offline Data Warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data is stored in a data structure designed to facilitate reporting.
• Real Time Data Warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order, a delivery or a booking).
• Integrated Data Warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction. The data warehouses then generate transactions that are passed back into the operational systems.

Decision Support/Business Intelligence

The term decision support goes back to the 1970s, when it was coined by some academics associated with the Massachusetts Institute of Technology. Since then, many academic definitions have been offered. A decision support system or tool is one specifically designed to facilitate business end users performing computer-generated analyses of data on their own.

There are very few pure decision support tools. That is, there are very few tools designed specifically for business end users. Most business users who do analyses on their own use tools that IT people also use.

Business intelligence has become the vendors' preferred synonym for decision support. This is because decision support has an academic connotation and, as just mentioned, decision support systems do not necessarily support decisions. On the other hand, business intelligence systems do not necessarily make a business more intelligent. By the way, the consultant-coined term business intelligence goes back to the late 1950s, fell out of use, was revived by a DEC consultant, fell out of use again, and then was revived by the DW/DSS/BI world in the late 1990s. Confusingly, business intelligence is also used as a synonym for competitive intelligence (and is probably a more apt term for that area).
You can be sure that there will be future synonyms for decision support. Industry "experts" and marketeers always are on the prowl for ways of differentiating their expertise and products.
The Case for Data Warehousing In probably 99% of the data warehousing implementations, data warehousing is only one step out of many in the long road toward the ultimate goal of accomplishing these highfalutin objectives.
We cannot say that decision support systems or tools necessarily support the making of decisions.
The basic reasons organizations implement data warehouses are:
What’s in a name? – As far as I know, cognitive researchers do not agree on how decisions are made. Therefore, saying that these tools support making decisions is not a provable statement. Nor, is it, in may opinion, an insightful way of defining these tools. It seems, though, that 99% of the definitions of BI say something about better decisions. My wish is that these defintions would include a cognitive model of how decisions are made and an explanation on how the tools fit into the model.
To perform server/disk bound tasks associated with querying and reporting on server/disks not used by transaction processing systems
These tools do not analyze by themselves – rather they help a person analyze.In other words, the tools facilitate analyses rather than perform analyses. Data warehousing and decision support systems and tools do not necessarily go hand in hand.Many data warehouses are not used as decision support systems. And decision support systems or tools do not necessarily require the use of a data warehouse as a source for data. I assert that, by far, the most used decision support tools are spreadsheets not connected in any automated way with a data warehouse. Actually there is relatively small amount of decision support going on. Analyzing data, no matter what tool is being used, is difficult. Whatever the vendors do, it will remain difficult. But it is an activity, when done well, that can be quite beneficial.
Most firms want to set up transaction processing systems so there is a high probability that transactions will be completed in what is judged to be an acceptable amount of time. Running reports and queries, which can require a much greater range of limited server/disk resources than transaction processing, on the servers/disks used by transaction processing systems can lower the probability that transactions complete in an acceptable amount of time. Or, running queries and reports, with their variable resource requirements, on the servers/disks used by transaction processing systems can make it quite complex to manage servers/disks so there is a high enough probability that acceptable response time can be achieved. Firms therefore may find that the least expensive and/or most organizationally expeditious way to obtain high probability of acceptable transaction processing response time is to implement a data warehousing architecture that uses separate servers/disks for some querying and reporting.
To use data models and/or server technologies that speed up querying and reporting and that are not appropriate for transaction processing
There are ways of modeling data that usually speed up querying and reporting (e.g., a star schema) and may not be appropriate for transaction processing because the modeling technique will slow down and complicate transaction processing. Also, there are server technologies that may speed up query and reporting processing but may slow down transaction processing (e.g., bit-mapped indexing) and server technologies that may speed up transaction processing but slow down query and report processing (e.g., technology for transaction recovery). Do note that whether and by how much a modeling technique or server technology is a help or hindrance to querying/reporting and transaction processing varies across vendors' products and according to the situation in which the technique or technology is used.
To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel
Often a data warehouse can be set up so that simpler queries and reports can be written by less technically knowledgeable personnel. Nevertheless, less technically knowledgeable personnel often "hit a complexity wall" and need IS help. IS, however, may also be able to more quickly write and maintain queries and reports written against data warehouse data. It should be noted, however, that much of the improved IS productivity probably comes from the lack of bureaucracy usually associated with establishing reports and queries in the data warehouse.
To provide a repository of “cleaned up” transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems
The data warehouse provides an opportunity to clean up the data without changing the transaction processing systems. Note, however, that some data warehousing implementations provide a means to capture corrections made to the data warehouse data and feed the corrections back into transaction processing systems. Sometimes it makes more sense to handle corrections this way than to apply changes directly to the transaction processing system.
To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only
For a long time firms that need reports with data from multiple systems have been writing data extracts, then running sort/merge logic to combine the extracted data, and then running reports against the sort/merged data. In many cases this is a perfectly adequate strategy. However, if a company has large amounts of data that need to be sort/merged frequently, if data purged from transaction processing systems need to be reported upon, and most importantly, if the data need to be "cleaned", data warehousing may be appropriate.
To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports “as was” as of a previous point in time
Older data are often purged from transaction processing systems so the expected response time can be better controlled. For querying and reporting, this purged data and the current data may be stored in the data warehouse, where there presumably is less of a need to control expected response time or the expected response time is at a much higher level. As for "as was" reporting, sometimes it is difficult, if not impossible, to generate a report based on some characteristic at a previous point in time. For example, if you want a report of the salaries of employees at grade Level 3 as of the beginning of each month in 1997, you may not be able to do this because you only have a record of current employee grade level. To be able to handle this type of reporting problem, firms may implement data warehouses that handle what is called the "slowly changing dimension" issue.
To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases
The concern here is security. For example, data warehousing may be interesting to firms that want to allow reporting and querying only over the Internet.
Some firms implement data warehousing for all the reasons cited. Some firms implement data warehousing for only one of the reasons cited.
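The "as was" reporting problem described above (salaries of employees at a given grade level as of a past date) can be sketched with a history table that keeps one row per employee per validity period. This is only a minimal illustration of one common way to handle a slowly changing dimension. The table name `employee_grade_history`, its columns, the sample rows, and the far-future end date `9999-12-31` are all assumptions made for the example and do not come from any particular product.

```python
import sqlite3


def build_grade_history(conn):
    """Create and fill a hypothetical history table.

    Each row records the grade and salary an employee had from
    valid_from (inclusive) to valid_to (exclusive), as ISO dates.
    """
    conn.execute(
        """CREATE TABLE employee_grade_history (
               employee_id INTEGER,
               grade_level INTEGER,
               salary      INTEGER,
               valid_from  TEXT,
               valid_to    TEXT
           )"""
    )
    conn.executemany(
        "INSERT INTO employee_grade_history VALUES (?, ?, ?, ?, ?)",
        [
            (1, 2, 40000, "1996-01-01", "1997-03-01"),
            (1, 3, 45000, "1997-03-01", "9999-12-31"),
            (2, 3, 47000, "1996-06-01", "1997-07-01"),
            (2, 4, 52000, "1997-07-01", "9999-12-31"),
        ],
    )


def salaries_at_grade(conn, grade_level, as_of):
    """Return (employee_id, salary) for employees at grade_level on as_of.

    ISO date strings compare correctly as plain text, so no date
    parsing is needed for the range check.
    """
    return conn.execute(
        """SELECT employee_id, salary
           FROM employee_grade_history
           WHERE grade_level = ?
             AND valid_from <= ?
             AND ? < valid_to
           ORDER BY employee_id""",
        (grade_level, as_of, as_of),
    ).fetchall()
```

Calling `salaries_at_grade` once for the first day of each month of 1997 would produce the monthly report from the example. A table that stores only the current grade level cannot answer the same question.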
If you examine the list you may be struck that the need for data warehousing is mainly caused by the limitations of transaction processing systems. These limitations of transaction processing systems are not, however, inherent. That is, the limitations will not be in every implementation of a transaction processing system. Also, the limitations of transaction processing systems will vary in how crippling they are. Finally, a firm that expects to get business intelligence, better decision making, closeness to its customers, and competitive advantage simply by plopping down a data warehouse is in for a surprise. Obtaining these next order benefits requires firms to figure out, usually by trial and error, how to change business practices to best use the data warehouse and then to change their business practices. And that can be harder than implementing a data warehouse.
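One of the reasons listed above is the use of data models, such as a star schema, that speed up querying and reporting. The sketch below shows roughly what that looks like: a central fact table joined to small dimension tables, plus a summarized copy (an aggregate) that reports can read instead of the granular rows. All table names, column names, and sample values are hypothetical. SQLite is used only because it ships with Python, not because it is a typical warehouse platform.

```python
import sqlite3


def build_star(conn):
    """Create a tiny hypothetical star schema: two dimensions, one fact table."""
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (customer_key INTEGER, date_key INTEGER, amount INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?, ?)",
        [(1, "Customer A", "East"), (2, "Customer B", "West")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(19970101, 1997, 1), (19970201, 1997, 2)],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(1, 19970101, 100), (1, 19970201, 150), (2, 19970101, 200), (2, 19970201, 50)],
    )


def build_monthly_aggregate(conn):
    """Store a summarized copy of the facts (an aggregate) by region and month."""
    conn.execute(
        """CREATE TABLE agg_sales_by_region_month AS
           SELECT c.region AS region, d.year AS year, d.month AS month,
                  SUM(f.amount) AS total_amount
           FROM fact_sales f
           JOIN dim_customer c ON c.customer_key = f.customer_key
           JOIN dim_date d ON d.date_key = f.date_key
           GROUP BY c.region, d.year, d.month"""
    )


def region_totals(conn, year, month):
    """Answer a report query from the aggregate instead of the granular facts."""
    return conn.execute(
        """SELECT region, total_amount
           FROM agg_sales_by_region_month
           WHERE year = ? AND month = ?
           ORDER BY region""",
        (year, month),
    ).fetchall()
```

The aggregate trades extra storage and load-time work for cheaper report queries. As the text notes, a transaction processing system would usually avoid that trade-off because every insert would also have to keep the summary current.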
Use of Business Intelligence Tools
The main uses of business intelligence tools are:
To check that "everything" is okay
Nothing will be done with many, perhaps most, of the queries and reports created with business intelligence tools. They are run to confirm a person’s usually not crisply defined but intuitively felt notion of "okayness." A primary function of business intelligence tools is to support non-action.
To confirm the "obvious"
Most of the end users for whom the reports and queries are ultimately produced have a pretty good gut feel for what is going on in their area of concern. Business intelligence tools do not tell these people anything amazing that they don’t already suspect. But the information produced with the tools gives them confidence their gut feel is okay.
To identify the out of the ordinary
Usually the ultimate consumer of the tool’s output has somewhat vague criteria of what is out of the ordinary. The business intelligence tools do double duty in that they help refine the criteria of what is out of the ordinary and identify what fits the refined criteria.
To figure out how something "works"
Most people are not looking for some grand Unified Theory of how firm XYZ works. Rather, they want to understand some small aspect of an operation, such as: Customer A always pays on time, Customer B usually pays late and still takes the early payment discount, etc.
To convey information in a more digestible manner
These tools are often used to convey what a person or persons already know. These knowing people use the tools simply to present information to other people in a way that is more easily read.
To compare information about customers, products, cost/profit centers, financial accounts
Sometimes this is side by side comparisons of a series of measures. Sometimes this is identification of the most, the least, the earliest, the latest, etc.
To compare the same type of information in different time periods
This is simply the usual daily, weekly, monthly, quarterly, yearly comparisons.
To check performance versus formal and informal goals or constraints
That is, measures of what actually occurred are compared with budgets, forecasts, quotas, or some other types of goals.
To grab a little piece of information out of a large volume of information
These tools make picking that virtual needle out of that virtual haystack a lot simpler.
To get around an Information Technology department that does not have the time or the resources to write reports
Often end users use these tools out of impatience with the IT department. Or, the IT department gives the user these tools to relieve the pressure on itself. The end users in these cases often write reports that could hardly be called analyses.
To provide a report "of record"
For all kinds of reasons it is often necessary for people to agree that "these are the numbers." Note they do not have to agree on all the data, just some data whose credibility must be accepted for actions to be taken. Business intelligence tools often are used to produce this "official" information.
To confirm and sometimes to discover trends and relationships
With all respect to the people working hard on data mining, most good businesspeople have an intuitive feeling for the most important trends and relationships between factors that are affecting their business. The business intelligence tools perform the function of confirming their intuition. Yes, the tools also can help discover trends and relationships, but it is difficult (though potentially profitable) to sift out the meaningless and spurious trends.
To help advocate a position
These tools are not just for "objective" presentation of the facts. Often they are cleverly used to help bolster the case for doing (or not doing) something.
To provide data for a what-if analysis or a forecast
That is, the tools are used to feed data into a spreadsheet where the actual what-if analysis or forecast will be done. The tools can do some of the what-if-ing and forecasting themselves, but most business users are more comfortable doing this work in spreadsheets.
Most of these tools are not used as the sole input into making a non-trivial decision. Decisions are made and business intelligence is garnered only with the combination of the output of the business intelligence tools, human judgment and intuition, and the ability to put the information spit out by tools into a context of information that is much wider than any data warehouse, transaction processing system, or knowledge repository can handle.
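Two of the uses listed above, checking performance against goals and identifying the out of the ordinary, come down to simple arithmetic once the measures are in hand. The sketch below illustrates that arithmetic and does not describe how any particular tool works. The function name `variance_report`, the 10% default tolerance, and the sample figures are assumptions chosen for the example, and every goal is assumed to be nonzero.

```python
def variance_report(actuals, goals, tolerance=0.10):
    """Compare actual measures with goals.

    Returns a list of (name, variance, out_of_ordinary) tuples sorted by
    name, where variance is actual minus goal and out_of_ordinary is True
    when the variance exceeds `tolerance` as a fraction of the goal.
    Assumes every goal is nonzero.
    """
    report = []
    for name in sorted(actuals):
        actual = actuals[name]
        goal = goals[name]
        variance = actual - goal
        out_of_ordinary = abs(variance / goal) > tolerance
        report.append((name, variance, out_of_ordinary))
    return report
```

For example, `variance_report({"east": 120, "west": 95}, {"east": 100, "west": 100})` flags only the east region, whose result is 20% above its goal. Picking the tolerance is the "somewhat vague criteria" the text refers to, and in practice it is refined by looking at the output.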
Case against Data Warehousing
Some of the reasons data warehousing efforts may not be appropriate for certain organizations are:
Data warehousing systems, for the most part, store historical data that have been generated in internal transaction processing systems. This is a small part of the universe of data available to manage a business. Sometimes this part has limited value.
That is, sometimes the business end user community does not have a strong interest in old transaction processing system data beyond what is available in basic reports generated in transaction processing systems. This lack of interest often stems from the fact that the markets in which a business competes are in great flux or that the internal structure of the organization is in perpetual transition. If these conditions exist, there may not be a solid historical base to compare current performance with. Also, sometimes there is a lack of interest in looking at this data in any in-depth way because a business is so simple that a data warehouse is overkill.
Data warehousing systems can complicate business processes significantly
Though the interest in business process reengineering seems to have waned, some of the appreciation of how complicated processes can slowly strangle a business has remained. Data warehousing, if unchecked, can foster the "institutionalization" of easily created reports whose reason for being is quickly forgotten while people still toil to process these reports. If your organization does not know how to throw out processes (pardon my calling producing, distributing, and reading a report a "process"), data warehousing can quickly add clutter to the business environment.
If most of your business needs are to report on data in one transaction processing system and/or all the historical data you need are in that system and/or the data in the system are clean and/or your hardware can support reporting against the live system data and/or the structure of the system data is relatively simple and/or your firm does not have much interest in end user ad hoc query/report tools, data warehousing may not be for your business
Whew! You can say that again. Anyway, you may find that the more of these conditions are met, the less value data warehousing may add to your firm. And once you get away from the big "Fortune 500, centralized IS" type shops most of the data warehousing vendors slant their marketing to, these conditions describe the reporting needs of many firms.
Data warehousing can have a learning curve that may be too long for impatient firms
Despite the speed of the data warehousing development effort, it takes time for an organization to figure out how it can change its business practices to get a substantial return on its data warehousing investment. I speculate that rigorous analysis of the return on most of the major data warehousing implementers' investments would find a much longer average payback period than you would surmise from reading the trade press.
Data warehousing can become an exercise in data for the sake of the data
Organizations find that there are unlimited opportunities to add data to their data warehouse. Data warehouses, like most other complex systems, take on a life of their own. Unfortunately, adding data without questioning the business value of the data can lessen the business value of the data warehouse and quickly increase the cost of maintaining the data warehouse.
In certain organizations ad hoc end user query/reporting tools do not "take"
This is of concern to organizations that believe they can get their return on investment by having users write many of their own queries and reports. In some firms there are profound cultural barriers in the business organization to the acceptance of a tool that allows a person to ask questions on his own. Trying to promote the use of such a tool in these organizations is setting yourself up for failure. Or, sometimes these tools do not take because a business is so complicated that only relatively simple reports with little business value can be written by end users.
Many "strategic applications" of data warehousing have a short life span and require the developers to put together a technically inelegant system quickly. Some developers are reluctant to work this way
Again, the importance of the culture should not be underestimated. This time, though, the issue is in the IS organization. If your selling point for the data warehousing project is the ability to do this strategic work (which is probably now being done by your users with large and complex spreadsheets) as opposed to the usual development of canned and semi-canned reports and queries, ask yourself if the IS culture can accept this mode of working. For many organizations this approach to systems work is much harder to accept than most people realize.
There is a limited number of people available who have worked with the full data warehousing system project "life cycle"
Systems of some depth require a considerable amount of time to develop fully. In other words, it takes a long time to gain experience with the usual problems that develop at different phases of a data warehousing effort. You should be wary of a consultant who says he has experience implementing scores of data warehouses in a couple of years. Usually this experience will be with a well-defined part of a data warehousing project that was amenable to outsourcing or with minor projects.
Data warehousing systems can require a great deal of "maintenance" which many organizations cannot or will not support
Despite the best efforts to architect a system so "maintenance" demands are minimized (in quotation marks because it seems often there is never the closure to the initial data warehousing effort that the term "maintenance" implies), many systems by their very nature require a great deal of care and feeding once they are in "production". It is important to note that the more successful a warehouse is with the users, the more maintenance it may require. Organizations that cannot or will not staff to meet these maintenance demands should think twice before they jump into the data warehousing business. By the way, it’s very easy for the users to quickly go sour on a system they were enthusiastic about at roll-out time if the system personnel do not support the maturing of the system.
Sometimes the cost to capture data, clean it up, and deliver it in a format and time frame that is useful for the end users is too much of a cost to bear.
The percentage of time that must be devoted to extracting, cleaning, and loading data has been well discussed in the literature. It should be pointed out that there are some potential "show-stoppers" in these efforts. Loading data from previous years can require the knowledge of transaction processing system developers who have long since moved on. Cleaning data so they are in a form that is acceptable to users from different functional areas may require arbitration skills the typical data warehousing developer may not possess. Finally, data may have to be loaded into a data warehousing system in a processing window that just isn’t big enough. Sometimes compromises are acceptable get-arounds. Often, though, compromises end up substantially compromising the value of the information in the data warehouse.
You may have gotten the impression from reading the trade press that data warehousing is only for large organizations because it requires huge staffs and huge budgets. Well, most of the trade press is dominated by vendors/consultants/publications trying to market to large organizations with huge staffs and huge budgets. Though I have no way to prove this, in terms of numbers, I think most data warehousing efforts are done by small staffs with modest budgets. In fact, smaller organizations are probably much more "into" data warehousing than larger organizations. It is only recently that practical technology for huge organizations that lust for multi-terabyte databases has become available. The technology for more modestly sized data warehouses, on the other hand, has been available for many years.
Finally, you may have seen articles that state that data warehousing failure rates are between 10% and 90%. Though how these failure rates are determined is suspect, there is no denying that data warehousing is risky. Now the fact that these efforts are risky does not bolster the case against data warehousing. Data warehousing has not repealed the positive relationship between risk and expected return in capital projects. However, if your organization does not know how to manage risky projects, then data warehousing may not be for you.
Sample applications
Some of the applications data warehousing can be used for are: • • • •