The good SSIS package

1       Introduction

This document describes the Visma MS best practice for creating ETL flows. The document focuses on Integration Services 2008 SP2 and later.

The important points are highlighted with the blue rule font. This helps the reader skim through the document quickly for guidance.

1.1      Structure of the document

In section 2 we introduce our best practice with an architectural overview of a typical Visma MS based ETL flow. This section also includes guidelines for which packages you would typically have in a full warehouse flow. Use this section as your starting point for digging into the rest of the document.

Zooming further into the flow structure, sections 3 and 4 describe techniques tailored towards fact and dimension flows. These sections also tie our architecture to the general Insighters ETL guideline.

Moving up one abstraction level we arrive at some general techniques common to all ETL packages. These techniques provide guidance that can be applied no matter which tool or implementation you apply. You can find these general tips and tricks in section 5.

The rest of the document is specific to Integration Services (SSIS) implementation. It describes the context that we work within and the best practice rules we follow. Do note that users of other technologies may still gain valuable insights by applying similar techniques in their own tools.

We begin our SSIS guidelines with a description of how to connect to data sources (section 6). From here we move on to the control flow (section 7).

Before we can introduce the pipeline we must discuss the proper usage of SQL. This precautionary measure is taken in section 8. With proper utilization of SQL in place we can finally move on to the SSIS pipeline in section 9.

Our next sections deal with the challenges of structured code maintenance and deployment. We introduce this with a discussion about package configurations in section 10.  No maintenance argument is of course complete until we properly treat logging – the subject of section 11.

We conclude our guidelines with sections 12 and 13 which deal exclusively with performance tuning. In these concluding sections you will find the guidance to squeeze the last “rows / sec” out of your flow.

 

2       Architecture – Flow perspective

The overall ETL flow in a Visma MS solution is divided into some main phases. These main phases happen in sequence.

Each of the main phases is implemented as its own SSIS package. Sub packages are used inside the phases to further partition and modularize the flow. Sections 3 and 4 contain implementation specifics for each phase.

The table below lists the main phases of an ETL flow:

Phase: Stage Dimensions
Purpose: Extracts the dimension data from the source systems and places it in the Extract schema.
Implementation: One package per source system. Optionally one package per source system per dimension table.

Phase: Transform/Load Dimensions
Purpose: Performs transformation on the data from the extract schema into the EDW.
Implementation: One package per dimension table. If you have more than one source system loading a dimension table, you create one package per source system for this dimension.

Phase: Maintain Dimensions
Purpose: Maintains indexes on dimension tables. Will typically rebuild all indexes with a fillfactor = 100, perform materialization of performance-critical tables and other performance optimization tasks.
Implementation: One package.

Phase: Stage Facts
Purpose: Extracts fact data from the source systems and places it in the Extract schema.
Implementation: One package per source system. May be split into more if dealing with large fact tables.

Phase: Transform/Load Fact
Purpose: Performs transformation of fact data from the Extract schema into the EDW.
Implementation: One package per fact table. If you have more than one source system for a fact table, use one package per source system for this fact table.

Phase: Build Data Mart
Purpose: Builds the data marts. In a Microsoft solution these packages will typically process cubes.
Implementation: One package per data mart.

Phase: Archive Extracts
Purpose: Archives relevant Extract tables to enable replay of old data.
Implementation: One package per source system (this package will archive all relevant tables from the source system).

Phase: Maintain Fact
Purpose: Maintains the indexes and sometimes statistics on facts (optionally rebuilding them). This phase is optional since you may not use indexes on fact tables at all.
Implementation: One package.

Phase: Backup
Purpose: Performs backup of the changed warehouse.
Implementation: Not always implemented as a package. Can be performed by different tools depending on customer needs.

 

Naming guidelines for the packages are found in the SSIS Naming convention document in the document center.

2.1      Dependency between stages

Some of the stages can be run in parallel, while some have dependencies. The illustration below shows a typical flow with two source systems, a number of dimensions/facts and two data marts:

                                                                   

Figure 1 - Dependencies between SSIS Packages

 

2.2      Collecting packages into master packages

Depending on the scheduler tool you use, it may be an advantage to create aggregate[1] packages that execute one or more sub packages. Examples of such packages include:

  • Collect the packages that load a dimension or fact into one big aggregate package
  • Create a package that loads all data from a specific source system
  • Automate the entire warehouse load using just one aggregate package
  • Automate maintenance tasks in one package
  • Create one aggregate package for weekend loads and another for daily loads

 

Which strategy is correct depends largely on the structure of your data, the batch window, the need for restarts, etc. However, no matter which strategy you choose, you should always follow certain design patterns:

  • The aggregate packages should not contain any business logic. They may contain only these elements:
    • Execution of other packages
    • Start/stop of job logs
  • All logic should be executable on its own, without participating in an aggregate package
    • This leaves you free to recombine logic as you see fit in other packages

 

3       Fact flows

This section contains general guidelines for structuring the flow of data from source system to warehouse. The flow components in the section below are listed in the order they are normally used.

3.1      Overview

The fact flow is initiated by staging the fact data to the Extract schema. Why do we copy data here instead of moving it directly to the pipeline? There are several reasons:

  • Performing a pure copy of the data minimizes the time we spend disturbing the source system. This is generally good news for the source system DBA
  • By taking a local copy we are able to quickly reload from our local database
  • Archiving, if necessary, becomes easier

In all but the most extreme data sizes the above reasons justify an extra copy of the data while loading.

Once data is staged you generally have to perform some or all of these operations:

  • Detect delta (i.e. find new rows and update existing ones)
    • Optionally, this delta detection will happen as part of the staging
  • Map keys to master data
  • Assign entity/surrogate keys
  • Perform business transformations

Delta detection is described in the sections below. For guidance on the other operations refer to section 5.

3.2      Delta detection for fact data

The discussion in this section will focus on delta detection of facts/transactions. A separate kind of delta detection is used for dimension data; it is discussed in section 4.2.

The purpose of delta detection is to limit the amount of changes committed to the warehouse. Instead of reloading every table, you intelligently change only what is strictly necessary.

Delta detection of facts should generally be applied very early in the ETL flow, since it reduces the number of rows you have to work with in later phases.

You basically have three strategies for delta detection:

  1. Perform delta detection in source system

  2. Perform delta detection in ETL flow

  3. Do not perform delta detection.

 

The following sections will treat these cases.

3.2.1     Performing delta detection at the source system

This delta detection technique, while optimal, is rarely feasible. Implementing source delta detection follows the pattern:

  1. Find candidate columns in the source system that indicate if a row has been changed or added
  2. Question source system experts on how these columns are updated
  3. Test the theory thoroughly
  4. Implement a loader that only extracts the delta from the source system

 

1) The purpose of this process is to locate columns that uniquely identify changes in the source system.

Typical candidate columns are date and time columns. Delta detection on date/time will only work if the columns are maintained by either database triggers or strict programming techniques[2].

Other candidate columns are perpetually increasing keys; examples include SQL Server IDENTITY columns and Oracle sequences. Note that while new rows can be located quickly by comparing keys, detecting changes to existing rows is not possible this way.

Some databases provide a feature to automatically timestamp a row. In Oracle the rownumber can sometimes be used and in SQL Server columns of type timestamp/rowversion automatically change whenever a row is changed.

Be aware that deletes are also changes. When you examine your source system, you should also look for ways to detect deleted rows. One option would be for the source to include an “IsDeleted” column which is updated when the row is removed (instead of physically removing the row).

2) When you have located candidate columns for delta detection you must question the source system experts closely about these. If they reply with confidence that these columns are correctly maintained (this is rare) you can move on to the next step. If you detect doubts or have trouble getting information about the model, stop wasting your time and consider moving on to another delta detection method.

3) Always test your candidate delta columns thoroughly. Try changing the source system and see if columns change appropriately.

4) If the last steps were successful it is now a simple matter to implement an ETL loader that only fetches the changed rows from the source system. Remember to congratulate yourself for performing the necessary detective work and achieving the optimal delta solution.
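As a sketch, such a loader reduces to a simple filtered extract. The table and column names below (src.Orders, LastModified) are hypothetical, and the high-water mark must be persisted between runs (for example in a load log table):

```sql
-- Sketch only: src.Orders and LastModified are hypothetical names.
-- @LastLoadTime holds the high-water mark from the previous run.
DECLARE @LastLoadTime datetime = '2010-01-01';

SELECT *
FROM   src.Orders
WHERE  LastModified > @LastLoadTime;
```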

3.2.2     Performing delta detection in the ETL flow

This technique follows the pattern:

  1. Fetch all data from source system

  2. Compare this data with a copy from the EDW      to detect changes

  3. Perform the rest of ETL flow only on the      changes

 

1) Fetching all data from the source system is often a costly affair. Make sure that you coordinate this with the DBAs of the source system. Consider disabling locking of source tables or even running on a replicated copy of the source.

2 + 3) You should maintain an archive copy of the extracted, unclean table as it looked the last time you saw it. By merge joining this archive with the newly received source data (sorted on the key columns) you can quickly perform delta detection in the ETL flow.

The illustration below demonstrates this technique:

  1. Compare the archive and the full source data and perform the delta detection.
    • An SSIS Merge Join transformation between the source data and the archive will make the delta obvious.
    • The output is a table containing the delta plus the archive.
  2. Replace the old archive with the new table.
  3. Take the delta from the new table and perform the next ETL steps on it.

 

Figure 2 – Fact Delta Detection in ETL flow
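The compare step can also be sketched in plain T-SQL. All table and column names below (Extract.Sales, Archive.Sales, SalesID, RowChecksum) are hypothetical:

```sql
-- Sketch only: detect new and changed rows by joining the fresh extract
-- against the archived copy from the previous load. A precomputed
-- per-row checksum (RowChecksum) is assumed for the change comparison.
SELECT e.*
INTO   #Delta
FROM   Extract.Sales AS e
LEFT JOIN Archive.Sales AS a
       ON a.SalesID = e.SalesID
WHERE  a.SalesID IS NULL                 -- new row
   OR  a.RowChecksum <> e.RowChecksum;   -- changed row
```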

 

3.2.3     Running without delta detection

It is worth reminding ourselves why we do delta detection in the first place: because it reduces the data we need to transform.

Delta detection, especially in the ETL flow, is really only worth the effort if the expected delta is small. Performing delta detection costs both time and machine resources.

You should carefully compare the cost of the delta detection with full reloads. As a rule of thumb, if your delta is much larger than 10% of the total data, a full load is often faster than delta detection[3]

 

4       Dimension Flow

The dimension flow normally deals with much less data than the fact flows. However, the transformations you need to apply are typically much more complicated.

4.1      Overview

The dimension flow typically proceeds in these phases:

  1. Copy source system data to the extract schema
    • Even for the most extreme dimension this should be a fast operation. No need to apply the guidance from the fact loaders here.
  2. Clean data types and fix domain problems. Examples:
    • Trimming/casing of strings
    • Quick-fixing values that are outside the acceptable range – for example setting early dates to a fixed date in the past
    • Removing leading/trailing zeros
  3. Cast columns to correct types
    • This is the first step that can produce error rows.
  4. Perform business rules for both transformation and cleaning
    • This may include join operations and other combining of data streams.
  5. Check and optionally fix integrity against other tables (apply the techniques in section 5.1)
    • This step may also generate error rows.
  6. Land the final data in a “load ready” table
  7. Perform type 1 and type 2 delta detection
  8. Optionally: archive extracts

 

Rationale:

First of all: debug-friendliness, not performance, is our typical concern in dimension loaders. Hence, we gladly sacrifice some performance if it helps us locate errors in data with more elegance and flexibility.

Copying the source system allows us to quickly inspect what we received and eases the archiving.

Before we apply error detection it is generally a good idea to perform some basic cleaning of the data (step 2). It becomes much easier to write error detection code if we can make some basic assumptions about our data (untrimmed strings are a classic example of hard-to-locate errors).

Using the same argument as step 2 the following step tries to cast data into their acceptable types. This is the first step that can produce error rows.

After step 3 we can make some strong assumptions about our data and the column domains. We are now ready to perform steps 4 and 5, which apply integrity checks and business transformations. Again we may produce error rows that need handling by users.

Before we move on to delta detection we should land our data in a “load ready” table. As developers we can inspect this table and see the final results of the ETL flow before any delta detection is applied. Since the following step is generally the most complicated, having a “checkpoint” here is a good idea.

The delta detection step (the last step before archiving) performs the final type 1 and type 2 delta detection. If you are using the meta framework – no problem – just call the built-in procedures for this. If not, refer to Kimball's books and the following section.

4.2      Delta detection on dimensional / entity data

This form of delta detection is used to detect type 1 and type 2 changes in dimensions. Very often, this is the most error prone part of a warehouse load.

By far the easiest technique is this:

  1. Make a full copy of the source system entity table
  2. Clean and transform the entity into a “current source system view of dimension” table
  3. Compare the final, transformed dimension table with the current EDW table
    • Use a merge between the current view and the EDW dimension table
    • In this merge, perform type 1 and type 2 delta detection

Step 3 can often be automated – which it indeed has been in the meta framework.
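As a sketch of what such a merge does under the hood, the type 1 and type 2 handling could look like this in T-SQL. All table and column names (Dimensions.Product, Staging.Product_Current and their columns) are hypothetical, and only one attribute of each kind is shown:

```sql
-- Sketch only: all table and column names are hypothetical.

-- Type 1: overwrite attributes that carry no history
UPDATE d
SET    d.ProductName = s.ProductName
FROM   Dimensions.Product AS d
JOIN   Staging.Product_Current AS s ON s.ProductKey = d.ProductKey
WHERE  d.DW_Valid_To IS NULL
  AND  d.ProductName <> s.ProductName;

-- Type 2, part 1: close the current row when a tracked attribute changed
UPDATE d
SET    d.DW_Valid_To = GETDATE()
FROM   Dimensions.Product AS d
JOIN   Staging.Product_Current AS s ON s.ProductKey = d.ProductKey
WHERE  d.DW_Valid_To IS NULL
  AND  d.Category <> s.Category;

-- Type 2, part 2: insert a new open row for new and changed members
INSERT INTO Dimensions.Product
       (ProductKey, ProductName, Category, DW_Valid_From, DW_Valid_To)
SELECT s.ProductKey, s.ProductName, s.Category, GETDATE(), NULL
FROM   Staging.Product_Current AS s
LEFT JOIN Dimensions.Product AS d
       ON d.ProductKey = s.ProductKey AND d.DW_Valid_To IS NULL
WHERE  d.ProductKey IS NULL;
```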

The above technique works for all but the largest[4] dimensions. If dealing with large dimensions, you may decide to combine the technique in section  with the above ideas.

 

5       General Techniques

This section contains some general techniques that apply to all flow types.

5.1      Assigning new keys

Part of the warehouse load is the assignment of small surrogate or entity keys. You should perform this step early in the ETL flow: it will generally make your columns narrower by replacing character keys with integer values. This has a beneficial effect on memory and storage requirements.

When assigning keys, be especially aware of the following:

  • How will you handle source system keys that do not map to your EDW keys?
  • How will you handle referential integrity errors in the source system?

In general, you can apply the following solutions to the problems above:

  • Create “inferred” members in your tables. These members are rows that “should exist”[5] but have not yet been observed in the source system.
    • Be aware that inferred members need special treatment if you put them into dimension tables. A dimension table will contain many columns that you cannot populate from the knowledge you have at the time of inferring the member.
    • If you use inferred members you must supply some default values for the columns of the dimension.
  • Replace the source system keys with an “unknown member” key.
    • In Visma MS we generally use a negative value for the surrogate or entity key to easily distinguish these special unknown members.
  • Throw away rows from the source system that do not match the keys in your EDW.
    • Remember to somehow notify your users that data has been thrown away!
  • Move unmatched rows to an error table and let users inspect this.
    • This is typically combined with one of the above solutions.
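A minimal sketch of the “unknown member” approach, with hypothetical table and column names:

```sql
-- Sketch only: all names are hypothetical. Map source codes to surrogate
-- keys, substituting the negative "unknown member" key where no match exists.
SELECT f.SalesAmount,
       COALESCE(p.DW_EK_Product, -1) AS DW_EK_Product  -- -1 = unknown member
FROM   Extract.Sales AS f
LEFT JOIN Dimensions.Product AS p
       ON p.SourceKey = f.ProductCode;
```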

5.2      Cleaning data and performing single row transforms

Once keys have been assigned you start transforming data. The first transformations should be single-row transformations that either reduce or maintain the number of rows in the pipeline. Examples include:

  • Throwing away rows you will not need
    • Remember to log the rows you throw away to ease debugging.
    • You should also control which rows are thrown away using a parameter table instead of hardcoding.
  • Renaming and combining columns from source system names to the final EDW names
  • Aggregation of data
  • Removing duplicates
  • Concatenation or splitting of data
  • Renormalization – i.e. transforming de-normalized data from source systems to 3NF
  • Applying key transformations. Examples include: mapping between master data keys, deriving keys from columns.

5.3      Applying transformations that “blow up” data

You may also be required to perform transformations that increase the number of rows in the data. Examples of these are:

  • Filling out “holes” in the data (e.g. a “missing row” should be interpreted as a zero)
  • Changing YTD values to periods and vice versa
  • Changing a from/to date in the input data to one row for every day in the interval
  • Doing currency calculations
  • Simulating transactions (e.g. performing a FIFO buffer simulation of inventory movements)

 

Such transformations can quickly increase the amount of data and should therefore be performed on the smallest and most efficient input. Hence, “blow up” transformations should not be applied until after you have reduced both the size and number of input rows.
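As an example of such a transformation, expanding a from/to date interval into one row per day can be sketched as a join against a date/calendar table. All names below (Extract.Subscriptions, Dimensions.Date and their columns) are hypothetical:

```sql
-- Sketch only: expand each subscription interval into one row per day
-- by joining against a calendar table holding one row per date.
SELECT s.SubscriptionID, d.CalendarDate
FROM   Extract.Subscriptions AS s
JOIN   Dimensions.Date AS d
       ON d.CalendarDate BETWEEN s.ValidFrom AND s.ValidTo;
```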

 

6       Data sources and Connections

The following guidelines apply to data sources and connection managers.

6.1      Avoid using Shared Data sources

Integration Services has a feature called shared data sources. While this feature is in principle a good idea, it has some flaws that make us avoid it:

  • It works badly with TFS and other source control systems.
    • Every time you check out a package, Visual Studio will attempt to check out the shared data source too (temporarily).
    • This presents an (unnecessary) merge problem if another user has the data source checked out.
  • The reference to the data source is maintained in the package. Changing the data source will break the reference and force an annoying prompt when opening packages that depend on it.
  • A shared data source does not give package developers the freedom to experiment with connection-specific settings optimal for their package usage (for example using .NET connectors).

We feel that the above concerns and their impact on developer productivity outweigh any benefit the shared data source provides. Our recommendation: do not use shared data sources.

Following the same line of reasoning, data source views in Integration Services should not be used either.

6.2      Connecting to SQL Server

There are many different ways to connect to SQL Server. Performance and stability characteristics for each connection type differ. Hence, it is important to choose the right connection type.

The following drivers (in order of preference) are available when connecting to SQL Server:

  1. SQL Native client (aka: Netlib)

  2. OLEDB

  3. ODBC

  4. SQL .NET library

  5. JDBC (For java clients)

Each step down the list adds another translation layer. Such layers are both overhead and sources of bugs.

In a warehouse environment you should always prefer SQL Native Client if available.

6.2.1     Picking the right protocol

Once you have chosen your driver you need to determine which protocol to use. SQL Server supports the following connection protocols (in order of performance):

  • Shared Memory

  • VIA

  • Named Pipes

  • TCP/IP

 

Shared memory (the fastest) is available only if you are running the ETL tools on the same server as the database. Shared Memory is somewhat more tolerant of timeouts than the other protocols and generally provides better stability and performance. Choose shared memory whenever possible for best performance and stability.

The VIA protocol is only very rarely used and requires special hardware. This protocol should be avoided.

Named Pipes uses the Windows pipe system for connections and authentication. The protocol is not “firewall friendly” and is generally only a good choice when connecting to the server from inside the corporate network. However, do bear in mind that named pipes is quite a bit faster than TCP/IP.

TCP/IP requires only one port to be open to the server (the default is 1433). This protocol is especially useful in firewalled environments. Be aware that the TCP/IP protocol, especially on long running warehouse queries, is sensitive to bad network conditions and server timeout. Such timeout can occur if the server is running low on resources or if SQL Server is executing inefficient query plans.

The protocol used for connections can be enforced either through the connection string (see the knowledge base articles below) or through a SQL Server alias.

You should always validate that you are using the right connection to SQL Server. The list of all connections and their active protocols can be obtained by querying the system view:

SELECT session_id, net_transport

FROM sys.dm_exec_connections
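To enforce a protocol through the connection string, the server name can be prefixed with a protocol specifier. The server name below is hypothetical:

```
Data Source=lpc:MYSERVER;...       (forces shared memory)
Data Source=np:MYSERVER;...        (forces named pipes)
Data Source=tcp:MYSERVER,1433;...  (forces TCP/IP on port 1433)
```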

6.2.2     Using aliases

If you are using either named pipes or TCP/IP we generally recommend that you use aliases when connecting to SQL Server.

An alias is an alternate name for a SQL Server instance. In addition to the name it also provides a specification of the protocol to use when connecting. If a client has an alias defined it can use the alias to connect directly to the SQL Server without any additional connection information specified. This provides an abstraction layer that makes it easy to replace server names in connection strings.

An alias is created using the SQL Server configuration manager. On 64-bit machines you must create the alias both in the 32 and the 64 bit environment (this can be done through SQL configuration manager)

Be aware that you cannot use aliases for shared memory connections.

6.2.3     Setting Application Name

To help monitor the execution of SSIS packages you should always set the application name in your connection string to SQL Server.

By setting this application name property to the name of the executing package you are allowing the SQL Server DBA to easily troubleshoot bad SSIS behavior.

The application name is set like this:

 

Figure 3 – Setting the application name
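As a sketch, an OLE DB connection string with the application name set could look like the following. The server, database and package names are hypothetical:

```
Provider=SQLNCLI10.1;Data Source=MYSERVER;Initial Catalog=EDW;Integrated Security=SSPI;Application Name=LoadFactSales;
```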

 

6.2.4     Relevant MS Articles

The following knowledge base and MSDN articles provide useful guidance about SQL Server connectivity:

 

6.3      Connecting to Oracle

There are two OLE DB drivers that can be used to extract data from a source system running Oracle in SQL Server Integration Services.

  • Oracle Provider for OLE DB

  • Microsoft OLE DB Provider for Oracle

6.3.1     Oracle Provider for OLE DB

The Oracle Provider for OLE DB exists both in a 32 bit and a 64 bit version. The latest version at the time of writing is 10.1.0.4.0 and can be found at:

There is a bug in the current version: columns with the Oracle data type NUMBER fail to be extracted if no precision and scale are set. To avoid this, there are two solutions:

  • Make sure all columns of the data type      NUMBER have precision and scale set on the Oracle data source.

  • Install a patch      (p5203839_10201_MSWIN-x86-64.zip for the x64 version – there should be a      32 bit patch as well) from the Oracle Technology Network (http://www.oracle.com/technology/) – requires a paid      subscription to OTN.

Our experience shows that the driver is stable and faster than the Microsoft OLE DB Provider for Oracle.

6.3.2     SSIS and AlwaysUseDefaultCodePage

You might get the following error, trying to connect to an Oracle data source in SSIS:

“Cannot retrieve the column code page info from the OLE DB Provider. If the component supports the "DefaultCodePage" property, the code page from that property will be used. Change the value of the property if the current string code page values are incorrect. If the component does not support the property, the code page from the component's locale ID will be used."

This error occurs when you are extracting non-unicode string values from Oracle.

The fix to this is simple. Set AlwaysUseDefaultCodePage to true in the properties for the connection.

The DefaultCodePage should be set to 1252, which is correct for the Western alphabet. If Oracle is not using the Western alphabet, you'll need to determine the character set used and set the default code page accordingly. The correct default code page can be found at:

6.3.3     Microsoft OLE DB Provider for Oracle

The Microsoft OLE DB Provider for Oracle only exists in a 32 bit version, and that version is deprecated by Microsoft.

The benefit of using this driver is that there are no problems with NUMBER without precision or scale.

However, it is our experience that the Microsoft OLE DB Provider for Oracle is a lot slower than the Oracle Provider for OLE DB.

6.3.4     Other drivers

There are other methods to connect to an Oracle data source, for example Oracle Instant Client or ODBC, but we have not managed to get either of them to work. If you discover good alternatives please post on the forums.

Also, note that Oracle Instant Client is not officially supported by Microsoft as a method for connecting to Oracle (product support will kindly tell you to use the Oracle OLEDB driver instead).

6.4      Connecting to DB2

Need guidelines from subject matter expert.

6.5      Connecting to TeraData

To be written if we get a project where this is needed

6.6      Reading text files

Need guidelines here from subject matter expert

6.7      Links on connectivity and drivers

 

 

 

7       Control flow

The control flow of an SSIS package provides the backbone of your package. This section provides guidance on both specific tasks and general guidelines for the control flow.

7.1      Layout of Control Flow

The control flow should be laid out to start at the top of the screen and flow downwards. If parallelism is added to the flow, indicate this by aligning the parallel tasks horizontally.

Use Sequence containers to group operations into “modules”.

7.2      Maintaining and Rebuilding Indexes

Indexes do not maintain themselves and will gradually become useless if they are not kept in shape.

In a warehouse, the ETL flow is generally responsible for maintaining indexes. Indexing is an area that the DBA watches closely, and it is vital that we as ETL developers behave properly when working with indexes.

There are two different strategies for keeping an index in good shape:

  • Drop and recreate – This involves completely removing the indexes, performing a big operation, and rebuilding the indexes afterwards.
    • If you plan to modify a large part of a table (more than 20%) in bulk mode, it is good practice to drop the indexes and recreate them after the operation.
    • Failure to do this will result in page splitting[6], which will show up in the DBA's monitoring and negatively impact your flow.
  • Reorganize – This strategy keeps the indexes on the table while the insert is happening.
    • This strategy is often used for dimension tables.
    • It is used when a table changes only slowly for each batch run.
    • You should periodically maintain the index with reorganization.

7.2.1     Correct procedure for Dropping and Recreating Indexes

If you are dropping and recreating indexes on a table the following order of operations is the correct approach:

 

  • Drop all non-clustered indexes on the table
  • Drop the clustered index (if any) on the table
  • Perform the large, bulk operation
  • Recreate the clustered index
  • Recreate the non-clustered indexes

 

This procedure will ensure the best performance during the bulk operation and keep the index in good shape.
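A sketch of the procedure in T-SQL, with hypothetical table, index and column names:

```sql
-- Sketch only: all names are hypothetical.
DROP INDEX IX_Sales_Product ON Facts.Sales;   -- 1) non-clustered indexes first
DROP INDEX CIX_Sales_Date ON Facts.Sales;     -- 2) then the clustered index

INSERT INTO Facts.Sales WITH (TABLOCK)        -- 3) the large bulk operation
SELECT DW_ID_Date, DW_EK_Product, SalesAmount
FROM   Staging.Sales;

CREATE CLUSTERED INDEX CIX_Sales_Date         -- 4) recreate clustered first
    ON Facts.Sales (DW_ID_Date);
CREATE NONCLUSTERED INDEX IX_Sales_Product    -- 5) then non-clustered
    ON Facts.Sales (DW_EK_Product);
```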

7.2.1.1    Using FILLFACTOR = 100

If you are always dropping and recreating indexes before modifying a table (as is often the case with fact tables) you can gain extra performance and reduce storage needs by using FILLFACTOR = 100. Search for “fill factor” in Books Online for documentation on this property.

 

Example:

The following statement creates the index CIX_DateFrom with a fill factor of 100:

 

CREATE UNIQUE CLUSTERED INDEX CIX_DateFrom

ON Dimensions.Product_T2 (DW_EK_Product, DW_Valid_From)

WITH (FILLFACTOR = 100)

 

7.2.2     Reorganizing indexes

If you decide to keep the indexes on the table while changing it you should make sure that the indexes are periodically maintained.

There are several strategies for reorganizing (consult Books Online) and you should make sure a strategy is implemented on the warehouse.

There are two conditions that you should monitor to check if the indexes are being maintained:

  • Is there enough free space on the pages in the index?
  • Is the index fragmented?

 

You can use the management view sys.dm_db_index_physical_stats to check for these conditions.
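For example, the following query (a sketch; the exact thresholds you act on are a judgment call) reports fragmentation and page fullness for all indexes in the current database:

```sql
-- Report fragmentation and free space per index in the current database
SELECT OBJECT_NAME(ps.object_id)          AS table_name,
       i.name                             AS index_name,
       ps.avg_fragmentation_in_percent,   -- high => reorganize or rebuild
       ps.avg_page_space_used_in_percent  -- low  => pages have much free space
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ps
JOIN   sys.indexes AS i
       ON i.object_id = ps.object_id AND i.index_id = ps.index_id;
```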

7.2.3     Using SORT_IN_TEMPDB

SQL Server 2005 supports an index option called SORT_IN_TEMPDB. Depending on the configuration of your warehouse it may be a useful option to consider.

TempDb is sometimes placed on a set of very fast disks that are separate from the data warehouse database. If this is the case, using the SORT_IN_TEMPDB index option will provide a performance gain. If TempDb sits on the same disks as your database, this option will not provide any performance gain and may actually degrade performance.
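The option is specified when creating (or rebuilding) the index. The index and table names below are hypothetical:

```sql
-- Sketch: place the index build's sort runs in TempDb
CREATE NONCLUSTERED INDEX IX_Sales_Product
ON Facts.Sales (DW_EK_Product)
WITH (SORT_IN_TEMPDB = ON);
```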