What is a Data Warehouse?

A data warehouse is a repository of a business organization's historical data. It is a large part of an enterprise data management system, which consists of several servers running on different kinds of platforms and database management systems. As a general practice in such a system, the data warehouse holds static data, while the operational data store holds dynamic data that is frequently updated during the course of business operations.

To illustrate this further, an enterprise data management environment may have many servers and database systems that constitute various data stores, and these servers may run on different platforms and use database management systems from different vendors. Each data store gathers data based on the department it serves or on another special function it is designed to perform. During business operations, these servers send their data to the operational data store, which acts as the unifying area where disparate data from the various data stores is extracted and transformed into a unified structure based on the enterprise data architecture.

The process of unifying disparate data is referred to as ETL, which stands for extract, transform and load. The extract and transform steps are mostly done in the operational data store before the transformed data is "loaded" into the data warehouse. Because in this picture the data warehouse only takes part in the loading step, many people get the impression that the data warehouse is a mere static repository that does little except accept data for storage. In fact, the concept of the data warehouse comes from an analogy with real-life warehouses, where goods are kept until the need arises to retrieve them. In the same way, the operational data store goes to the data warehouse to get data and processes it in the operational data store area.
Hence the term "operational": it refers to the data currently being operated on or manipulated.

But modern data warehouses are no longer as static as they seem. Data warehouses today are managed by software tools that allow the data warehouse itself to track data and perform all sorts of analysis related to the movement of data from the warehouse to the other data stores and back. Many data warehouses employ a technology known as Online Analytical Processing (OLAP), which helps answer various multidimensional analytical queries. Most areas of business, including business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, and financial reporting, use OLAP to retrieve information from the data warehouse so that the company can spot trends and patterns as a basis for corporate decisions.

Many companies specifically offer data warehousing software solutions that come with sophisticated, proprietary, intuitive functions. Many of these vendors even offer integrated solutions that combine data warehousing functions with complex features such as data transformation, management, analytics and delivery components.
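The idea of a multidimensional analytical query can be illustrated with a small roll-up. The following is only a minimal sketch, not how any particular OLAP product works: the fact rows, the dimension names (year, region, product) and the roll_up helper are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions (year, region, product)
# and one measure (amount). Real warehouses hold far more rows.
SALES_FACTS = [
    {"year": 2023, "region": "East", "product": "Widget", "amount": 120.0},
    {"year": 2023, "region": "West", "product": "Widget", "amount": 80.0},
    {"year": 2024, "region": "East", "product": "Gadget", "amount": 200.0},
    {"year": 2024, "region": "East", "product": "Widget", "amount": 50.0},
]


def roll_up(facts, dimensions, measure="amount"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# The same facts viewed at two levels of detail.
by_year = roll_up(SALES_FACTS, ["year"])
by_year_region = roll_up(SALES_FACTS, ["year", "region"])
```

Choosing a different list of dimensions gives a different view of the same facts, which is the essence of slicing data along several dimensions to spot trends and patterns.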
Having an intuitive data warehouse greatly increases the overall performance of the enterprise data management system, because the data warehouse can take on some of the load that would otherwise fall on the operational data stores, which handle very labor-intensive processes from ongoing business operations.

Components of Data Warehousing
Operational systems vs. data warehousing

The fundamental difference between operational systems and data warehousing systems is that operational systems are designed to support transaction processing, whereas data warehousing systems are designed to support online analytical processing (or OLAP, for short). Based on this fundamental difference, data usage patterns associated with operational systems are significantly different from usage patterns associated with data warehousing systems. As a result, data warehousing systems are designed and optimized using methodologies that differ drastically from those of operational systems. The table below summarizes many of the differences between operational systems and data warehousing systems.
A comparison of operational systems and data warehousing systems

| No. | Operational Systems | Data Warehousing Systems |
|-----|---------------------|--------------------------|
| 1 | Operational systems are generally designed to support high-volume transaction processing with minimal backend reporting. | Data warehousing systems are generally designed to support high-volume analytical processing (i.e. OLAP) and subsequent, often elaborate report generation. |
| 2 | Operational systems are generally process-oriented or process-driven, meaning that they are focused on specific business processes or tasks. Example tasks include billing, registration, etc. | Data warehousing systems are generally subject-oriented, organized around business areas that the organization needs information about. Such subject areas are usually populated with data from one or more operational systems. As an example, revenue may be a subject area of a data warehouse that incorporates data from operational systems that contain student tuition data, alumni gift data, financial aid data, etc. |
| 3 | Operational systems are generally concerned with current data. | Data warehousing systems are generally concerned with historical data. |
| 4 | Data within operational systems are generally updated regularly according to need. | Data within a data warehouse is generally non-volatile, meaning that new data may be added regularly, but once loaded, the data is rarely changed, thus preserving an ever-growing history of information. In short, data within a data warehouse is generally read-only. |
| 5 | Operational systems are generally optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are generally optimized to perform fast retrievals of relatively large volumes of data. |
| 6 | Operational systems are generally application-specific, resulting in a multitude of partially or non-integrated systems and redundant data (e.g. billing data is not integrated with payroll data). | Data warehousing systems are generally integrated at a layer above the application layer, avoiding data redundancy problems. |
| 7 | Operational systems generally require a non-trivial level of computing skills amongst the end-user community. | Data warehousing systems generally appeal to an end-user community with a wide range of computing skills, from novice to expert users. |
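The contrast between current, regularly updated operational data and historical, non-volatile warehouse data (rows 3 and 4 of the comparison) can be sketched with Python's built-in sqlite3 module. This is an illustrative sketch only: the table names op_account and dw_balance_history, the columns and the record_balance helper are hypothetical, and a real warehouse would be loaded in batches by an ETL process rather than row by row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table: one row per account, holding only the current value.
conn.execute(
    "CREATE TABLE op_account (account_id INTEGER PRIMARY KEY, balance REAL)"
)
# Warehouse table: one row per account per load date, never rewritten.
conn.execute(
    "CREATE TABLE dw_balance_history (account_id INTEGER, as_of TEXT, balance REAL)"
)


def record_balance(conn, account_id, as_of, balance):
    """Overwrite the operational row and append a warehouse row."""
    # Operational side: the previous balance is replaced and lost.
    conn.execute(
        "INSERT OR REPLACE INTO op_account (account_id, balance) VALUES (?, ?)",
        (account_id, balance),
    )
    # Warehouse side: a new row is added, so history keeps growing.
    conn.execute(
        "INSERT INTO dw_balance_history (account_id, as_of, balance) VALUES (?, ?, ?)",
        (account_id, as_of, balance),
    )


record_balance(conn, 1, "2024-01-31", 100.0)
record_balance(conn, 1, "2024-02-29", 250.0)

current = conn.execute(
    "SELECT balance FROM op_account WHERE account_id = 1"
).fetchall()
history = conn.execute(
    "SELECT as_of, balance FROM dw_balance_history "
    "WHERE account_id = 1 ORDER BY as_of"
).fetchall()
```

After two updates the operational table holds a single current row, while the warehouse table holds both dated rows.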
Benefits

Some of the benefits that a data warehouse provides are as follows:

• A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
• Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
• Information in the data warehouse is under the control of data warehouse users so that, even if the source system data are purged over time, the information in the warehouse can be stored safely for extended periods of time.
• Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
• Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.
• Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
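The trend report mentioned in the last benefit (the items with the most sales in a particular area within a given period) might look like the query below. This is a hedged sketch using Python's built-in sqlite3 module. The sales table, its columns, the sample rows and the top_items helper are all assumptions made for illustration, and the cutoff date is passed in explicitly rather than computed from today's date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table of historical sales.
conn.execute(
    "CREATE TABLE sales (item TEXT, area TEXT, sale_date TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Widget", "North", "2023-03-10", 5),
        ("Widget", "North", "2024-01-15", 7),
        ("Gadget", "North", "2023-11-02", 9),
        ("Gadget", "South", "2024-02-20", 30),   # different area
        ("Widget", "North", "2020-06-01", 100),  # older than the cutoff
    ],
)


def top_items(conn, area, since, limit=3):
    """Return (item, total quantity) for one area since a cutoff date."""
    return conn.execute(
        "SELECT item, SUM(quantity) AS total FROM sales "
        "WHERE area = ? AND sale_date >= ? "
        "GROUP BY item ORDER BY total DESC, item LIMIT ?",
        (area, since, limit),
    ).fetchall()


# ISO-formatted date strings compare correctly as plain text.
report = top_items(conn, "North", "2022-07-01")
```

Because the query runs against the warehouse copy of the data, it does not compete with transaction processing on the operational systems.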
Disadvantages

There are also disadvantages to using a data warehouse. Some of them are:

• Data warehouses are not the optimal environment for unstructured data.
• Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
• Over their life, data warehouses can have high costs.
• Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
• There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems.
ETL Concepts

Extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into the target database. The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system.
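The three steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production ETL tool: the source systems ("billing" and "gifts"), their field names, the target columns and all three helper functions are hypothetical, and the "warehouse" is just an in-memory list.

```python
# Step 1: mapping between (hypothetical) source fields and target columns.
FIELD_MAP = {
    "billing": {"cust": "customer_id", "amt": "amount"},
    "gifts": {"donor_id": "customer_id", "gift_amount": "amount"},
}


def extract_and_map(source_name, rows):
    """Rename each source field to its target column using FIELD_MAP."""
    mapping = FIELD_MAP[source_name]
    return [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in rows
    ]


def cleanse(rows):
    """Step 2: in staging, trim identifiers and drop incomplete rows."""
    staged = []
    for row in rows:
        customer_id = str(row.get("customer_id") or "").strip()
        amount = row.get("amount")
        if not customer_id or amount in (None, ""):
            continue  # skip rows that cannot be loaded reliably
        staged.append({"customer_id": customer_id, "amount": amount})
    return staged


def transform_and_load(rows, target):
    """Step 3: convert to the unified structure and append to the target."""
    for row in rows:
        target.append(
            {
                "customer_id": row["customer_id"].upper(),
                "amount": round(float(row["amount"]), 2),
            }
        )
    return target


billing_rows = [{"cust": " c1 ", "amt": "10.50"}, {"cust": "", "amt": "3"}]
gift_rows = [{"donor_id": "c2", "gift_amount": 25}]

warehouse = []
for name, rows in (("billing", billing_rows), ("gifts", gift_rows)):
    transform_and_load(cleanse(extract_and_map(name, rows)), warehouse)
```

Keeping the mapping in a table of its own, separate from the cleansing and transformation code, means a new source system can be added by describing its fields rather than by rewriting the pipeline.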
Areas where Data Warehousing can be applied

• Credit card churn analysis
• Insurance fraud analysis
• Call record analysis
• Logistics management
• Agriculture
Here are some of the better-known very large data warehouses:

• eBay has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte enterprise data warehouse running on Teradata.
• Facebook has a 2 1/2 petabyte data warehouse running on Hadoop/Hive.
• Walmart has a 2.5 petabyte warehouse, Bank of America has 1.5 petabytes, and Dell has 1 petabyte, all running on Teradata.
• Yahoo, Fox Interactive Media, and TEOCO (which runs outsourced DWs for top US telcos) are all in the hundreds of terabytes range.