Data Warehousing ETL Checklist

INTRODUCTION

ETL (Extract, Transform and Load) is the process by which data from multiple systems is consolidated, typically in a data warehouse, so that executives can obtain a comprehensive picture of business functions, e.g. the relationships between marketing campaigns, sales, production and distribution. It has been estimated that in any data integration project, ETL will consume the majority of the time and resources. The process is complicated and has been the subject of numerous books. Anyone undertaking an ETL project for the first time will have to do some serious research. Here is a high-level checklist of the important topics.
CHECKLIST
□ SCOPE. The conventional wisdom is, "Don't try to boil the ocean." It's more important to deliver business results than it is to have a comprehensive program. In some cases, it may be better for the business as a whole to leave certain data sources untouched.
□ TARGET SYSTEM CONTENT. The content of the target system will drive the whole project, and will of course be determined by business needs. Specifically, target system content will determine which source systems will be involved.

□ DATA SOURCES (SOURCE SYSTEMS). The first step in the process is identifying the systems from which the data will be extracted. There are two broad categories:
  • Internal data. This is data from within your organization. From a technical point of view, sources can range from ERP, CRM and legacy applications to flat files and even Excel spreadsheets. It's important to become very familiar with the data in all internal sources as part of the planning process.

  • External data. Often, when a database (e.g. a data warehouse) is to be used for decision support, its usefulness can be greatly enhanced when the internal data is supplemented with external data such as demographic information on customers.
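The extraction step for heterogeneous internal sources can be sketched as follows. This is a minimal Python illustration, assuming a hypothetical in-memory SQLite table stands in for an ERP system and a CSV string for a flat-file export; real source systems, table names and columns will differ.

```python
# Minimal sketch: consolidate two different internal source types into one
# common row format. The sources here are simulated; names are hypothetical.
import csv
import io
import sqlite3

def extract_erp(conn):
    """Pull customer rows from the (simulated) ERP database."""
    rows = conn.execute("SELECT id, name, city FROM customers")
    return [{"id": r[0], "name": r[1], "city": r[2]} for r in rows]

def extract_flat_file(text):
    """Pull customer rows from a flat-file (CSV) export."""
    return [
        {"id": int(r["id"]), "name": r["name"], "city": r["city"]}
        for r in csv.DictReader(io.StringIO(text))
    ]

# Simulated ERP source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'J. Smith', 'Boston')")

# Simulated flat-file source
flat = "id,name,city\n2,A. Jones,Chicago\n"

# Both sources land in one common row format, ready for transformation.
staged = extract_erp(conn) + extract_flat_file(flat)
print(len(staged))  # 2
```

The point of the sketch is the shape of the pipeline, not the sources: every extractor, whatever its input, emits rows in the same agreed structure.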
□ OWNERSHIP. It's critical to determine who will take responsibility for the data in its new home, and who will be responsible for maintaining the update process (which includes taking responsibility for the data's accuracy).
□ MODE OF ETL. There are three broad approaches.

  • Home grown. Writing code in-house used to be the most common approach to ETL. This approach is often the easiest for small projects, and has the advantage of being able to handle the idiosyncrasies of unusual data formats. On the negative side, home-grown code requires maintenance over time, and often has scalability problems.

  • 3rd party bolt-on. Bolt-on modules for existing systems are often a convenient approach; however, they often mandate data formats that are not very flexible and can cause trouble when it comes to accommodating data from other sources.

  • Packaged systems. Systems from pure-play data integration companies offer flexibility and relative ease-of-use, but may be costly and require more training than the other two solutions.
□ DATA PROFILING. This process provides metrics on the quality of the data in source systems prior to beginning the project and can help predict the difficulties that will be involved in data re-use.
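As a sketch of the kind of metrics profiling produces, the snippet below computes per-column null counts and distinct-value counts over rows represented as Python dicts; the column names are hypothetical.

```python
# Minimal data-profiling sketch: per-column null and distinct counts.
from collections import defaultdict

def profile(rows):
    """Return {column: {"nulls": n, "distinct": m}} for a list of row dicts."""
    nulls = defaultdict(int)
    values = defaultdict(set)
    for row in rows:
        for col, val in row.items():
            if val is None or val == "":
                nulls[col] += 1
            else:
                values[col].add(val)
    return {col: {"nulls": nulls[col], "distinct": len(values[col])}
            for col in set(nulls) | set(values)}

rows = [
    {"name": "J. Smith", "city": "Boston"},
    {"name": "John Smith", "city": ""},
    {"name": "A. Jones", "city": "Boston"},
]
print(profile(rows)["city"])  # {'nulls': 1, 'distinct': 1}
```

High null counts or suspiciously low distinct counts in a source column are early warnings of the cleansing effort to come.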
□ DATA TRANSFORMATION.

  • Cleansing. Source system data typically must be cleansed. The most important aspect of cleansing is usually "de-duping": the removal of multiple records identifying the same item or person, e.g. J. Smith and John Smith, both with the same address. Cleansing also involves removal (or correction) of records with incorrect data, e.g. an address in the name field, and establishing default values.

  • Reformatting. The data must be standardized in terms of nomenclature and format.

  • Enhancement. Data associated with marketing is often enhanced via external sources, which creates a requirement for additional fields beyond those associated with the internal data.

  • Aggregation/Calculation. If there are to be aggregated or calculated fields, it's necessary to determine at what point in the process the aggregation/calculation will take place. This is an issue during the initial population of the new database and in its ongoing maintenance.
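The de-duping step described above can be sketched as follows. The match rule here (last name plus normalized address) is an illustrative assumption; production matching is far more sophisticated, and a real process would merge the surviving records' fields rather than simply keep the first.

```python
# Minimal cleansing sketch: normalize rows to a match key and collapse
# duplicates. The matching rule is deliberately crude and illustrative.
def match_key(row):
    """Build a crude match key: last word of the name plus the address."""
    last = row["name"].split()[-1].lower()
    return (last, row["address"].lower().strip())

def dedupe(rows):
    seen = {}
    for row in rows:
        # Keep the first row seen for each key.
        seen.setdefault(match_key(row), row)
    return list(seen.values())

rows = [
    {"name": "J. Smith", "address": "12 Main St"},
    {"name": "John Smith", "address": "12 Main St "},  # same person
    {"name": "A. Jones", "address": "9 Elm Ave"},
]
print(len(dedupe(rows)))  # 2
```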
□ NOMENCLATURE. Naming conventions can have a disproportionate effect on user satisfaction, and users should by all means be involved in decisions involving names.

  • Field names. One issue is what to name the new fields, e.g. "Sex" vs. "Gender" vs. "Mr./Mrs./Ms." Whenever possible, the field names in the target system should match (or be derived from) the field names in the source systems. It is easier for all involved if the data model column names also match the target system field names.

  • Data names. The same issue exists with data names. The paint the ERP calls "Red" may be called "Hot Crimson" in the marketing database.
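A common way to enforce agreed naming conventions is a mapping table applied during transformation. A minimal sketch, with hypothetical source and target field names:

```python
# Minimal sketch: translate source field names to agreed target names.
# Unmapped fields pass through unchanged. All names are hypothetical.
FIELD_MAP = {
    "sex": "gender",              # ERP column -> target column
    "cust_nm": "customer_name",
}

def rename_fields(row, field_map):
    return {field_map.get(col, col): val for col, val in row.items()}

source_row = {"cust_nm": "J. Smith", "sex": "M", "city": "Boston"}
print(rename_fields(source_row, FIELD_MAP))
# {'customer_name': 'J. Smith', 'gender': 'M', 'city': 'Boston'}
```

Keeping the mapping in one table, rather than scattered through transformation code, makes the naming decisions visible and easy for users to review.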
□ METADATA. There are two types of metadata to be considered.

  • Technical metadata is concerned with data types, lengths, source mappings and other details that are relevant to developers.

  • Business metadata is information that would be potentially useful to end users, such as valid values, mandatory vs. optional indications, and source systems.
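One way to capture both kinds of metadata side by side is a simple per-column record. The structure and values below are illustrative assumptions, not a standard:

```python
# Minimal sketch: one record holding technical and business metadata
# for a single target column. Field names and values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:
    name: str
    # Technical metadata: relevant to developers
    data_type: str
    length: int
    source_mapping: str
    # Business metadata: useful to end users
    description: str
    mandatory: bool
    valid_values: list = field(default_factory=list)

gender = ColumnMetadata(
    name="gender",
    data_type="CHAR",
    length=1,
    source_mapping="ERP.customers.sex",
    description="Customer gender as recorded at signup",
    mandatory=False,
    valid_values=["M", "F", "U"],
)
print(gender.mandatory, gender.valid_values)  # False ['M', 'F', 'U']
```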
□ SECURITY. Security is extremely important with sensitive data, e.g. customer records that include personal data like date of birth or financial information such as credit card numbers.
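As an illustration of one common safeguard, sensitive fields can be masked or reduced during the load so the full values never reach the target system. The field names and rules below are assumptions for the sketch, not a compliance recipe:

```python
# Minimal sketch: mask a card number to its last four digits and reduce a
# date of birth to a birth year before loading. Field names are hypothetical.
def mask_card(number):
    """Keep only the last four digits of a card number."""
    digits = [c for c in number if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])

def protect(row):
    out = dict(row)  # leave the source row untouched
    out["card_number"] = mask_card(out["card_number"])
    out["birth_year"] = out.pop("date_of_birth")[:4]  # ISO date -> year only
    return out

row = {"name": "J. Smith", "card_number": "4111 1111 1111 1234",
       "date_of_birth": "1970-06-15"}
print(protect(row)["card_number"])  # ************1234
```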
□ TIMING. Once the initial data has been loaded into the target system, it's necessary to determine how often it will be refreshed. This depends primarily on business needs. Do managers need to track a number (sales, inventory, hours, etc.) on a quarterly, monthly, weekly or daily basis? Other considerations involve the quantity of data to be transferred and the speed of the process.
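A periodic refresh is often implemented incrementally, transferring only the rows changed since the previous load, which keeps the data volume manageable. A minimal sketch, assuming (hypothetically) that source rows carry a `modified` timestamp:

```python
# Minimal sketch of an incremental refresh: select only source rows
# modified after the previous load. The `modified` field is an assumption.
from datetime import datetime

def incremental_extract(rows, last_load):
    """Return the source rows changed since the last load."""
    return [r for r in rows if r["modified"] > last_load]

rows = [
    {"id": 1, "modified": datetime(2024, 1, 5)},
    {"id": 2, "modified": datetime(2024, 2, 9)},
]
last_load = datetime(2024, 2, 1)
changed = incremental_extract(rows, last_load)
print([r["id"] for r in changed])  # [2]
```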
□ TRAINING. If you choose a 3rd party bolt-on or a packaged system, the developers involved will most likely need training. They may also need training in a database reporting tool and, if it's new, the scheduling system.