The Good SSIS Package
- To be translated into Danish
- Figures from the original must be added
- 1 Introduction
- 2 Architecture – Flow perspective
- 3 Fact flows
- 4 Dimension Flow
- 5 General Techniques
- 6 Data sources and Connections
- 7 Control flow
- 8 Proper use of SQL in the ETL flow
- 9 Pipeline / Data Flow Usage
- 10 Package Configurations
- 11 Logging
- 12 Performance optimization
- 13 Monitoring and Measuring Packages
- 13.1 Perfmon counters
- 13.2 SQL Profiler
1 Introduction
This document describes the Visma MS best practice for creating ETL flows. The document focuses on Integration Services 2008 SP2 and later.
The important points are highlighted with the blue rule font. This helps the reader skim through the document quickly for guidance.
1.1 Structure of the document
In section 2 we introduce our best practice with an architectural overview of a typical Visma MS based ETL flow. This section also includes guidelines for which packages you would typically have in a full warehouse flow. Use this section as your starting point for digging into the rest of the document.
Zooming further into the flow structure, sections 3 and 4 describe techniques tailored towards fact and dimension flows. These sections also tie our architecture to the general Insighters ETL guideline.
Moving up one abstraction level we arrive at some general techniques common to all ETL packages. These techniques provide guidance that can be applied no matter which tool or implementation you apply. You can find these general tips and tricks in section 5.
The rest of the document is specific to Integration Services (SSIS) implementation. It describes the context that we work within and the best practice rules we follow. Do note that users of other technologies may still gain valuable insights by applying similar techniques in their own tools.
We begin our SSIS guidelines with a description of how to connect to data sources (section 6). From here we move on to the control flow (section 7).
Before we can introduce the pipeline we must discuss the proper usage of SQL. This precautionary measure is taken in section 8. With proper utilization of SQL in place we can finally move on to the SSIS pipeline in section 9.
Our next sections deal with the challenges of structured code maintenance and deployment. We introduce this with a discussion about package configurations in section 10. No maintenance argument is of course complete until we properly treat logging – the subject of section 11.
We conclude our guidelines with sections 12 and 13 which deal exclusively with performance tuning. In these concluding sections you will find the guidance to squeeze the last “rows / sec” out of your flow.
2 Architecture – Flow perspective
The overall ETL flow in a Visma MS solution is divided into a number of main phases. These main phases happen in sequence.
Each of the main phases is implemented as its own SSIS package. Sub packages are used inside the phases to further partition and modularize the flow. Sections 3 and 4 contain implementation specifics for each phase.
The table below lists the main phases of an ETL flow:
Phase | Purpose | SSIS Package implementation |
Stage Dimensions | Extracts the dimension data from the source systems and places it in the Extract schema | One package per source system. Optionally one package per source system per dimension table |
Transform/Load Dimensions | Performs transformation on the data from the Extract schema into the EDW | One package per dimension table. If more than one source system loads a dimension table, create one package per source system for that dimension |
Maintain Dimensions | Maintains indexes on dimension tables. Will typically rebuild all indexes with a fill factor of 100, perform materialization of performance-critical tables and other performance optimization tasks | One package |
Stage Facts | Extracts fact data from the source systems and places it in the Extract schema | One package per source system. May be split into more when dealing with large fact tables |
Transform/Load Facts | Performs transformation of fact data from the Extract schema into the EDW | One package per fact table. If more than one source system loads a fact table, use one package per source system for that fact table |
Build Data Mart | Builds the data marts. In a Microsoft solution these packages will typically process cubes | One package per data mart |
Archive Extracts | Archives relevant Extract tables to enable replay of old data | One package per source system (this package archives all relevant tables from that source system) |
Maintain Facts | Maintains the indexes and sometimes statistics on fact tables (optionally rebuilding them). This phase is optional since you may not use indexes on fact tables at all | One package |
Backup | Performs backup of the changed warehouse | Not always implemented as a package. Can be performed by different tools depending on customer needs |
Naming guidelines for the packages are found in the SSIS Naming convention document in the document center.
2.1 Dependency between stages
Some of the stages can be run in parallel, while others have dependencies. The diagram below illustrates the dependencies in a flow.
The illustration below shows a typical flow with two source systems, some dimensions/facts and two data marts:
Figure 1 - Dependencies between SSIS Packages
2.2 Collecting packages into master packages
Depending on the scheduler tool you use, it may be an advantage to create aggregate[1] packages that execute one or more sub packages. Examples of such packages include:
Collect the packages that load a dimension or fact into one big aggregate package
Create a package that loads all data from a specific source system
Automate the entire warehouse load using just one aggregate package
Automate maintenance tasks in one package
Create one aggregate package for weekend loads and another for daily loads
Which strategy is correct depends largely on the structure of your data, the batch window, the need for restarts etc. However, no matter which strategy you choose, you should always follow certain design patterns:
The aggregate packages should not contain any business logic. They may contain only these elements:
Execution of other packages
Start / stop of job logs
All logic should be executable on its own, without participating in an aggregate package
This leaves you free to recombine logic as you see fit in other packages
3 Fact flows
This section contains general guidelines for structuring the flow of data from source system to warehouse. The flow components in the section below are listed in the order they are normally used.
3.1 Overview
The fact flow is initiated by staging the fact data to the Extract schema. Why do we copy data here instead of moving it directly to the pipeline? There are several reasons:
Performing a pure copy of the data minimizes the time we spend disturbing the source system. This is generally good news for the source system DBA
By taking a local copy we can quickly reload from our local database
Archiving, if necessary, becomes easier
In all but the most extreme data sizes the above reasons justify an extra copy of the data while loading.
Once data is staged you generally have to perform some or all of these operations:
Detect delta (i.e. find new rows and update existing ones)
Optionally this delta detection will happen as part of the staging.
Map keys to master data
Assign entity/surrogate keys
Perform business transformation
Delta detection is described in the sections below. For guidance on the other operations refer to section 5.
3.2 Delta detection for fact data
The discussion in this section will focus on delta detection of facts / transactions. A separate kind of delta detection, used for dimension data, is discussed in section 4.2.
The purpose of delta detection is to limit the amount of changes committed to the warehouse. Instead of reloading every table you intelligently try to change only what is strictly necessary.
Delta detection of facts should generally be applied very early in the ETL flow – since it will reduce the number of rows you have to work with in later phases.
You basically have three strategies for delta detection
Perform delta detection in source system
Perform delta detection in ETL flow
Do not perform delta detection.
The following sections will treat these cases.
3.2.1 Performing delta detection at the source system
This delta detection technique, while optimal, is rarely feasible. Implementing source delta detection follows the pattern:
Find candidate columns in the source system that indicate if a row has been changed or added
Question source system experts on how these columns are updated
Test the theory thoroughly
Implement loader that only extracts delta from the source system
1) The purpose of this process is to locate columns that uniquely identify changes in the source system.
Typical candidate columns are date and time columns. Delta detection on date/time will only work if the columns are maintained by either database triggers or strict programming techniques[2].
Another column candidate is keys that perpetually increase; examples include SQL Server IDENTITY columns and Oracle sequences. Note that while new rows can be located quickly by comparing keys, detecting changes in existing rows is not possible.
Some databases provide a feature to automatically timestamp a row. In Oracle the rownumber can sometimes be used and in SQL Server columns of type timestamp/rowversion automatically change whenever a row is changed.
Be aware that deletes are also changes. When you examine your source system, you should also look for ways to detect deleted rows. One option would be for the source to include an “IsDeleted” column which is updated when the row is removed (instead of physically removing the row)
2) When you have located candidate columns for delta detection you must question the source system experts closely about these. If they reply with confidence that these columns are correctly maintained (this is rare) you can move on to the next step. If you detect doubts or have trouble getting information about the model, stop wasting your time and consider moving on to another delta detection method.
3) Always test your candidate delta columns thoroughly. Try changing the source system and see if columns change appropriately.
4) If the last steps were successful it is now a simple matter to implement an ETL loader that fetches only the changed rows from the source system. Remember to congratulate yourself for performing the necessary detective work and achieving the optimal delta solution.
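As a sketch of such a loader using a SQL Server rowversion column (table, column and watermark names here are illustrative assumptions, not part of any standard schema):

```sql
-- Hypothetical sketch: incremental extract driven by a rowversion column.
-- Sales.Orders, RowVer and etl.ExtractWatermark are illustrative names.
DECLARE @LastVersion binary(8);

-- Read the watermark stored by the previous load
SELECT @LastVersion = LastRowVersion
FROM etl.ExtractWatermark
WHERE TableName = 'Sales.Orders';

-- Fetch only rows changed since the last extract
SELECT OrderID, CustomerID, Amount, RowVer
FROM Sales.Orders
WHERE RowVer > @LastVersion;

-- After a successful load, advance the watermark
UPDATE etl.ExtractWatermark
SET LastRowVersion = (SELECT MAX(RowVer) FROM Sales.Orders)
WHERE TableName = 'Sales.Orders';
```

Note that this only catches inserts and updates; deletes must still be handled separately, for example with an “IsDeleted” column as described above.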
3.2.2 Performing delta detection in the ETL flow
This technique follows the pattern:
Fetch all data from source system
Compare this data with a copy from the EDW to detect changes
Perform the rest of ETL flow only on the changes
1) Fetching all data from the source system is often a costly affair. Make sure that you coordinate this with the DBAs of the source system. Consider disabling locking of source tables or even running on a replicated copy of the source.
2 + 3) You should maintain an archive copy of the extracted, unclean table as it looked the last time you saw it. By merge joining this archive with the newly received source data (sorted on the key columns) you can quickly perform delta detection in the ETL flow.
The illustration below demonstrates this technique:
Compare archive and the full source data and perform delta.
An SSIS merge join transformation between Source data and archive will make the delta obvious.
The output is a table containing the delta plus archive
Replace the old archive with the new table
Take the delta from the new table and perform the next ETL steps on it.
Figure 2 – Fact Delta Detection in ETL flow
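In SSIS this comparison is done with a merge join transformation; the equivalent logic can be sketched in SQL (table and column names are illustrative, and only one comparison column is shown for brevity):

```sql
-- Hypothetical sketch: detect the delta by comparing the new extract with
-- the archived copy of the previous extract.
SELECT s.OrderID,
       s.Amount,
       CASE
           WHEN a.OrderID IS NULL    THEN 'INSERT'  -- new row
           WHEN s.Amount <> a.Amount THEN 'UPDATE'  -- changed row
       END AS DeltaType
INTO Extract.Orders_Delta
FROM Extract.Orders AS s
LEFT JOIN Archive.Orders AS a
    ON a.OrderID = s.OrderID
WHERE a.OrderID IS NULL
   OR s.Amount <> a.Amount;
```

In practice every non-key column must take part in the comparison (or be condensed into a hash column) for the delta to be complete.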
3.2.3 Running without delta detection
It is worth reminding ourselves why we do delta detection in the first place: because it reduces the data we need to transform.
Delta detection, especially in the ETL flow, is really only worth the effort if the expected delta is small. Performing delta detection costs both time and machine resources.
You should carefully compare the cost of delta detection with full reloads. As a rule of thumb, if your delta is much larger than 10% of the total data, a full load is often faster than delta detection[3]
4 Dimension Flow
The dimension flow normally deals with much less data than the fact flows. However, the transformations you need to apply are typically much more complicated.
4.1 Overview
The dimension flow typically proceeds in these phases:
Copy source system data to extract schema
Even for the most extreme dimension this should be a fast operation. No need to apply the guidance from the fact loaders here
Clean data types and fix domain problems
Examples
i. Trimming / casing of strings
ii. Quick fix values that are outside the acceptable range – for example setting early dates to a fixed date in the past
iii. Removing leading/trailing zeros
Cast columns to correct types
This is the first step that can produce error rows
Perform business rules for both transformation and cleaning
This may include join operation and other combining of data streams
Check and optionally fix integrity against other tables (apply techniques in section 5.1)
This step may also generate error rows
Land the final data in a “load ready” table
Perform type 1 and type 2 delta detection
Optionally: Archive extracts
Rationale:
First of all: debug friendliness, not performance, is our typical concern on dimension loaders. Hence, we gladly sacrifice some performance if this helps us locate errors in data with more elegance and flexibility.
Copying the source system allows us to quickly inspect what we received and eases the archiving.
Before we apply error detection it is generally a good idea to perform some basic cleaning of the data (step 2). It becomes much easier to write error detection code if we can make some basic assumptions about our data (untrimmed strings are a classic example of hard-to-locate errors)
By the same argument as step 2, the following step casts data into their acceptable types. This is the first step that can produce error rows.
After step 3 we can make some strong assumptions about our data and the column domains. We are now ready to perform steps 4 and 5, which apply integrity checks and business transformations. Again we may produce error rows that may need handling by users.
Before we move on to delta detection we should land our data in a “load ready” table. As developers we can inspect this table and see the final results of the ETL flow before any delta detection is applied. Since the following step is generally the most complicated, having a “checkpoint” here is a good idea.
Step 8 performs the final type 1 and type 2 delta detection. If you are using the meta framework – no problem – just call the built-in procedures for this. If not, refer to Kimball's books and the following section.
4.2 Delta detection on dimensional / entity data
This form of delta detection is used to detect type 1 and type 2 changes in dimensions. Very often, this is the most error prone part of a warehouse load.
By far the easiest technique is this:
Make a full copy of the source system entity table
Clean and transform the entity into a “current source system view of dimension” table
Compare the final, transformed dimension table with the current EDW table
Use a merge between the current view and the EDW dimension table
In this merge, perform type 1 and type 2 delta detection
Step 3 can often be automated – which it indeed has been in the meta framework.
The above technique works for all but the largest[4] dimensions. If dealing with large dimensions, you may decide to combine the fact delta detection techniques with the above ideas.
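If you are not using the meta framework, steps 4 and 5 can be sketched in T-SQL as below. This is a hypothetical sketch: table and column names (Stage.Customer, Dimensions.Customer_T2, DW_Valid_From/DW_Valid_To) are illustrative, with Address standing in for a type 2 attribute and Name for a type 1 attribute.

```sql
-- Expire current rows whose type 2 attributes have changed
UPDATE d
SET DW_Valid_To = GETDATE()
FROM Dimensions.Customer_T2 AS d
JOIN Stage.Customer AS s
    ON s.CustomerNo = d.CustomerNo
WHERE d.DW_Valid_To IS NULL        -- current row
  AND s.Address <> d.Address;      -- type 2 attribute changed

-- Insert a new current row for changed and brand new customers
INSERT INTO Dimensions.Customer_T2
    (CustomerNo, Name, Address, DW_Valid_From, DW_Valid_To)
SELECT s.CustomerNo, s.Name, s.Address, GETDATE(), NULL
FROM Stage.Customer AS s
LEFT JOIN Dimensions.Customer_T2 AS d
    ON d.CustomerNo = s.CustomerNo AND d.DW_Valid_To IS NULL
WHERE d.CustomerNo IS NULL;        -- no current row exists

-- Type 1 attributes are simply overwritten on all rows for the entity
UPDATE d
SET d.Name = s.Name                -- type 1 attribute
FROM Dimensions.Customer_T2 AS d
JOIN Stage.Customer AS s
    ON s.CustomerNo = d.CustomerNo
WHERE d.Name <> s.Name;
```

The order matters: the expire step runs first, so the insert step sees changed customers as having no current row and creates their new version.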
5 General Techniques
This section contains some general techniques that apply to all flow types.
5.1 Assigning new keys
Part of the warehouse load is the assignment of small surrogate or entity keys. You should perform this step early in the ETL flow: it will generally make your columns narrower by replacing character keys with integer values. This has a beneficial effect on memory and storage requirements.
When assigning keys, be especially aware of the following:
How will you handle source system keys that do not map to your EDW keys?
How will you handle referential integrity errors in the source system?
In general, you can apply the following solutions to the problems above:
Create “inferred” members in your tables. These members are rows that “should exist”[5] but have not yet been observed in the source system.
Be aware that inferred members need special treatment if you put them into dimension tables. A dimension table will contain many columns that you cannot populate from the knowledge you have at the time of inferring the member.
If you use inferred members you must supply some default values for the columns of the dimension.
Replace the source system keys with an “unknown member” key.
In Visma MS we generally use a negative value for the surrogate or entity key to easily distinguish these special unknown members.
Throw away rows from source system that do not match the keys in your EDW
Remember to somehow notify your users that data has been thrown away!
Move unmatched rows to an error table and let users inspect this
This is typically combined with one of the above solutions
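The last two solutions can be combined in a short sketch (all table and column names are illustrative, and -1 is the Visma MS convention for the unknown member):

```sql
-- Hypothetical sketch: route unmatched rows to an error table, then map
-- remaining unmatched keys to the unknown member (-1).
INSERT INTO ErrorLog.Sales_UnmatchedKeys (SourceCustomerNo, Amount)
SELECT f.CustomerNo, f.Amount
FROM Extract.Sales AS f
LEFT JOIN Dimensions.Customer AS d
    ON d.CustomerNo = f.CustomerNo
WHERE d.CustomerNo IS NULL;        -- referential integrity error

-- Assign surrogate keys; unmatched rows fall back to the unknown member
SELECT ISNULL(d.DW_EK_Customer, -1) AS DW_EK_Customer,
       f.Amount
FROM Extract.Sales AS f
LEFT JOIN Dimensions.Customer AS d
    ON d.CustomerNo = f.CustomerNo;
```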
5.2 Cleaning data and performing single row transforms
Once keys have been assigned you start transforming data. The first transformations should be single-row transformations that either reduce or maintain the number of rows in the pipeline. Examples include:
Throwing away rows you will not need
Remember to log the rows you throw away to ease debugging.
You should also control which rows are thrown away using some parameter table instead of hardcoding
Renaming and combining columns from source system names to the final EDW names
Aggregation of data
Removing duplicates
Concatenation or splitting of data
Renormalization – i.e. transforming de-normalized data from source systems to 3NF
Applying key transformation. Examples include: mapping between master data keys, deriving keys from columns.
5.3 Applying transformation that “blow up” data
You may also be required to perform transformations that increase the number of rows in the data. Examples of these are:
Filling out “holes” in the data (e.g. a “missing row” should be interpreted as a zero)
Changing YTD values to periods and vice versa
Changing a from/to date in the input data to one row for every day in the interval
Doing currency calculations.
Simulating transactions (e.g. performing a FIFO buffer simulation of inventory movements)
Such transformations can quickly increase the amount of data and should therefore be performed on the smallest and most efficient input. Hence, “blow up” transformations should not be applied until after you have reduced both the size and number of input rows.
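As an example of the from/to date case above, expanding an interval into one row per day is easily done against a calendar table (the table and column names are illustrative assumptions):

```sql
-- Hypothetical sketch: "blow up" a from/to interval to one row per day
-- by joining against a calendar table.
SELECT s.ContractNo,
       c.CalendarDate,
       s.DailyAmount
FROM Stage.Contracts AS s
JOIN Dimensions.Calendar AS c
    ON c.CalendarDate BETWEEN s.ValidFrom AND s.ValidTo;
```

Note how a single contract row fans out to as many rows as there are days in its interval, which is exactly why this step belongs after the reducing transformations.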
6 Data sources and Connections
The following guidelines apply to data sources and connection managers.
6.1 Avoid using Shared Data sources
Integration Services has a feature called shared data sources. While this feature is in principle a good idea it has some flaws that make us avoid it:
It works badly with TFS and other source control systems.
Every time you check out a package, Visual Studio will attempt to check out the shared data source too (temporarily).
This presents an (unnecessary) merge problem if another user has the data source checked out.
The reference to the data source is maintained in the package. Changing the data source will break the reference and force an annoying prompt when opening packages that depend on it
A shared data source does not give package developers the freedom to experiment with connection-specific settings optimal for their package usage (for example using .NET connectors)
We feel that the above concerns and their impact on developer productivity outweigh any benefit the shared data source provides. Our recommendation: do not use shared data sources.
By the same line of reasoning, data source views in Integration Services should not be used either.
6.2 Connecting to SQL Server
There are many different ways to connect to SQL Server. Performance and stability characteristics for each connection type differ. Hence, it is important to choose the right connection type.
The following drivers (in order of preference) are available when connecting to SQL Server:
SQL Native client (aka: Netlib)
OLEDB
ODBC
SQL .NET library
JDBC (For java clients)
Each step down the list adds another translation layer. Such layers are both overhead and sources of bugs.
In a warehouse environment you should always prefer SQL Native Client if available.
6.2.1 Picking the right protocol
Once you have chosen your driver you need to determine which protocol to use. SQL Server supports the following connection protocols (in order of performance):
Shared Memory
VIA
Named Pipes
TCP/IP
Shared memory (the fastest) is available only if you are running the ETL tools on the same server as the database. Shared Memory is somewhat more tolerant of timeouts than the other protocols and generally provides better stability and performance. Choose shared memory whenever possible for best performance and stability.
The VIA protocol is only very rarely used and requires special hardware. This protocol should be avoided.
Named Pipes uses the windows pipe system for connections and authentication. The protocol is not “firewall friendly” and it is generally only good when connecting to the server from inside the corporate network. However, do bear in mind that named pipes is quite a bit faster than TCP/IP.
TCP/IP requires only one port to be open to the server (the default is 1433). This protocol is especially useful in firewalled environments. Be aware that the TCP/IP protocol, especially on long running warehouse queries, is sensitive to bad network conditions and server timeout. Such timeout can occur if the server is running low on resources or if SQL Server is executing inefficient query plans.
The protocol used for connections can be enforced either through the connection string (see knowledge base articles below) or through a SQL Server alias.
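In the connection string the protocol is forced by prefixing the server name in the Data Source part, for example (the server name DWSERVER is illustrative):

```
Data Source=lpc:DWSERVER;...       -- shared memory
Data Source=np:DWSERVER;...        -- named pipes
Data Source=tcp:DWSERVER,1433;...  -- TCP/IP on port 1433
```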
You should always validate that you are using the right connection to SQL Server. The list of all connections and their active protocols can be obtained by querying the system view:
SELECT session_id, net_transport
FROM sys.dm_exec_connections
6.2.2 Using aliases
If you are using either named pipes or TCP/IP we generally recommend that you use aliases when connecting to SQL Server.
An alias is an alternate name for a SQL Server instance. In addition to the name it also provides a specification of the protocol to use when connecting. If a client has an alias defined it can use the alias to connect directly to SQL Server without any additional connection information specified. This provides an abstraction layer that makes it easy to replace server names in connection strings.
An alias is created using the SQL Server configuration manager. On 64-bit machines you must create the alias both in the 32 and the 64 bit environment (this can be done through SQL configuration manager)
Be aware that you cannot use aliases for shared memory connections.
6.2.3 Setting Application Name
To help monitor the execution of SSIS packages you should always set the application name in your connection string to SQL Server.
By setting this application name property to the name of the executing package you are allowing the SQL Server DBA to easily troubleshoot bad SSIS behavior.
The application name is set like this:
Figure 3 – Setting the application name
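Until the figure is added, a sketch of such a connection string follows (server, database and package names are illustrative):

```
Provider=SQLNCLI10;Data Source=DWSERVER;Initial Catalog=EDW;
Integrated Security=SSPI;Application Name=LoadFactSales
```

The DBA can then find the running packages on the server through the program_name column of sys.dm_exec_sessions.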
6.2.4 Relevant MS Articles
The following knowledge base and MSDN articles provide useful guidance about SQL Server connectivity:
6.3 Connecting to Oracle
There are two OLE DB drivers that can be used to extract data from a source system running Oracle in SQL Server Integration Services.
Oracle Provider for OLE DB
Microsoft OLE DB Provider for Oracle
6.3.1 Oracle Provider for OLE DB
The Oracle Provider for OLE DB exists both in a 32 bit and a 64 bit version. The latest version at the time of writing is 10.1.0.4.0 and can be found at:
There is a bug in the current version: columns with the Oracle data type NUMBER fail to be extracted if no precision and scale are set. There are two ways to avoid this:
Make sure all columns of the data type NUMBER have precision and scale set on the Oracle data source.
Install a patch (p5203839_10201_MSWIN-x86-64.zip for the x64 version – there should be a 32 bit patch as well) from the Oracle Technology Network (http://www.oracle.com/technology/) – requires a paid subscription to OTN.
Our experience shows that the driver is stable and faster than the Microsoft OLE DB Provider for Oracle.
6.3.2 SSIS and AlwaysUseDefaultCodePage
You might get the following error, trying to connect to an Oracle data source in SSIS:
“Cannot retrieve the column code page info from the OLE DB Provider. If the component supports the "DefaultCodePage" property, the code page from that property will be used. Change the value of the property if the current string code page values are incorrect. If the component does not support the property, the code page from the component's locale ID will be used."
This error occurs when you are extracting non-unicode string values from Oracle.
The fix to this is simple. Set AlwaysUseDefaultCodePage to true in the properties for the connection.
The DefaultCodePage should be set to 1252, which is correct for the Western alphabet. If Oracle is not using the Western alphabet, you'll need to determine the character set used and set the default code page accordingly. The correct default code page can be found at:
6.3.3 Microsoft OLE DB Provider for Oracle
The Microsoft OLE DB Provider for Oracle only exists in a 32 bit version, and that version is deprecated by Microsoft.
The benefit of using this driver is that there are no problems with NUMBER without precision or scale.
However, it is our experience that the Microsoft OLE DB Provider for Oracle is a lot slower than the Oracle Provider for OLE DB.
6.3.4 Other drivers
There are other methods to connect to an Oracle data source, for example Oracle Instant Client or ODBC, but we have not managed to get either of them to work. If you discover good alternatives please post on the forums.
Also, note that Oracle Instant Client is not officially supported by Microsoft as a method for connecting to Oracle (Product Support will kindly tell you to use the Oracle OLEDB driver instead)
6.4 Connecting to DB2
Need guidelines from subject matter expert
6.5 Connecting to TeraData
To be written if we get a project where this is needed
6.6 Reading text files
Need guidelines here from subject matter expert
6.7 Links on connectivity and drivers
7 Control flow
The control flow of an SSIS package provides the backbone of your package. This section provides guidance on both specific tasks and general guidelines for the control flow.
7.1 Layout of Control Flow
The control flow should be laid out to start at the top of the screen and flow downwards. If parallelism is added to the flow, indicate this by aligning the parallel tasks horizontally.
Use Sequence containers to group operations into “modules”
7.2 Maintaining and Rebuilding Indexes
Indexes do not maintain themselves and will gradually become useless if they are not kept in shape.
In a warehouse, the ETL flow is generally responsible for maintaining indexes. Indexing is a task that the DBA pays close attention to, and it is vital that we as ETL developers behave properly when working with indexes.
There are two different strategies for keeping an index in good shape:
Drop and recreate – This involves completely removing the indexes, performing a big operation, and rebuilding the indexes afterwards
If you plan to modify a large part of a table (more than 20%) in bulk mode, it is good practice to drop the indexes and recreate them after the operation
Failure to do this will result in page splitting[6], which will show up in the DBA's monitoring and negatively impact your flow.
Reorganize – This strategy keeps the indexes on the table while the insert is happening.
This strategy is often used for dimension tables.
It is used when a table changes only slowly for each batch run.
You should periodically maintain the index with reorganization
7.2.1 Correct procedure for Dropping and Recreating Indexes
If you are dropping and recreating indexes on a table the following order of operations is the correct approach:
Drop all non-clustered indexes on the table
Drop the clustered index (if any) on the table
Perform the large, bulk operation
Recreate the clustered index
Recreate the non-clustered indexes
This procedure will ensure the best performance during the bulk operation and keep the index in good shape.
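The procedure can be sketched as follows (index, table and column names are illustrative assumptions):

```sql
-- 1) + 2) Drop non-clustered indexes first, then the clustered index
DROP INDEX IX_Sales_Product ON Facts.Sales;
DROP INDEX CIX_Sales ON Facts.Sales;

-- 3) Perform the large, bulk operation
INSERT INTO Facts.Sales WITH (TABLOCK)
       (DW_EK_Date, DW_EK_Product, Amount)
SELECT DW_EK_Date, DW_EK_Product, Amount
FROM Extract.Sales_Delta;

-- 4) Recreate the clustered index first
CREATE CLUSTERED INDEX CIX_Sales
    ON Facts.Sales (DW_EK_Date)
    WITH (FILLFACTOR = 100);

-- 5) Then recreate the non-clustered indexes
CREATE NONCLUSTERED INDEX IX_Sales_Product
    ON Facts.Sales (DW_EK_Product)
    WITH (FILLFACTOR = 100);
```

Recreating the clustered index before the non-clustered ones matters: building the clustered index rebuilds any existing non-clustered indexes, so the reverse order would build them twice.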
7.2.1.1 Using FILLFACTOR = 100
If you are always dropping and recreating indexes before modifying a table (as is often the case with fact tables) you can gain extra performance and reduce storage needs by using FILLFACTOR = 100. Search for “fill factor” in Books Online for documentation on this property.
Example:
The following statement creates the index CIX_DateFrom with a fill factor of 100:
CREATE UNIQUE CLUSTERED INDEX CIX_DateFrom
ON Dimensions.Product_T2 (DW_EK_Product, DW_Valid_From)
WITH (FILLFACTOR = 100)
7.2.2 Reorganizing indexes
If you decide to keep the indexes on the table while changing it you should make sure that the indexes are periodically maintained.
There are several strategies for reorganizing (consult Books Online) and you should make sure a strategy is implemented on the warehouse.
There are two conditions that you should monitor to check if the indexes are being maintained:
Is there enough free space on pages in index?
Is the index fragmented?
You can use the management view: sys.dm_db_index_physical_stats to check for these conditions.
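A sketch of such a check follows; the 30% fragmentation threshold is an illustrative choice, not a fixed rule:

```sql
-- Check fragmentation and page fullness for indexes in the current database
SELECT OBJECT_NAME(ips.object_id)          AS TableName,
       i.name                              AS IndexName,
       ips.avg_fragmentation_in_percent,   -- is the index fragmented?
       ips.avg_page_space_used_in_percent  -- how full are the pages?
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30;
```

Note that the page space column is only populated in SAMPLED or DETAILED mode; the default LIMITED mode returns NULL for it.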
7.2.3 Using SORT_IN_TEMPDB
SQL Server 2005 supports an index option called SORT_IN_TEMPDB. Depending on the configuration of your warehouse it may be a useful option to consider.
TempDb is sometimes placed on a set of very fast disks that are separate from the data warehouse database. If this is the case, using the SORT_IN_TEMPDB index option will provide you with a performance gain. If TempDb sits on the same disks as your database, this option will not provide any performance gain and may actually degrade performance.