The Good SSIS Package

  • To be translated into Danish

  • Figures from the original are to be added

1       Introduction

This document describes the Visma MS best practice for creating ETL flows. The document focuses on Integration Services 2008 SP2 and later.

The important points are highlighted with the blue rule font. This helps the reader skim through the document quickly for guidance.

1.1      Structure of the document

In section 2 we introduce our best practice with an architectural overview of a typical Visma MS based ETL flow. This section also includes guidelines for which packages you would typically have in a full warehouse flow. Use this section as your starting point for digging into the rest of the document.

Zooming further into the flow structure, sections 3 and 4 describe techniques tailored towards fact and dimension flows. These sections also tie our architecture to the general Insighters ETL guideline.

Moving up one abstraction level we arrive at some general techniques common to all ETL packages. These techniques provide guidance that can be applied no matter which tool or implementation you apply. You can find these general tips and tricks in section 5.

The rest of the document is specific to Integration Services (SSIS) implementation. It describes the context that we work within and the best practice rules we follow. Do note that users of other technologies may still gain valuable insights by applying similar techniques in their own tools.

We begin our SSIS guidelines with a description of how to connect to data sources (section 6). From here we move on to the control flow (section 7).

Before we can introduce the pipeline we must discuss the proper usage of SQL. This precautionary measure is taken in section 8. With proper utilization of SQL in place we can finally move on to the SSIS pipeline in section 9.

Our next sections deal with the challenges of structured code maintenance and deployment. We introduce this with a discussion of package configurations in section 10. No maintenance discussion is of course complete until we properly treat logging – the subject of section 11.

We conclude our guidelines with sections 12 and 13 which deal exclusively with performance tuning. In these concluding sections you will find the guidance to squeeze the last “rows / sec” out of your flow.

 

2       Architecture – Flow perspective

The overall ETL flow in a Visma MS solution is divided into some main phases. These main phases happen in sequence.

Each of the main phases is implemented as its own SSIS package. Sub packages are used inside the phases to further partition and modularize the flow. Sections 3 and 4 contain implementation specifics for each phase.

The table below lists the main phases of an ETL flow:

Phase: Stage Dimensions
Purpose: Extracts the dimension data from the source systems and places these in the Extract schema.
SSIS package implementation: One package per source system. Optionally one package per source system per dimension table.

Phase: Transform/Load Dimensions
Purpose: Performs transformation on the data from the Extract schema into the EDW.
SSIS package implementation: One package per dimension table. If you have more than one source system loading a dimension table, you create one package per source system for this dimension.

Phase: Maintain Dimensions
Purpose: Maintains indexes on dimension tables. Will typically rebuild all indexes with a fillfactor of 100, perform materialization of performance-critical tables and other performance optimization tasks.
SSIS package implementation: One package.

Phase: Stage Facts
Purpose: Extracts fact data from the source systems and places these in the Extract schema.
SSIS package implementation: One package per source system. May be split into more if dealing with large fact tables.

Phase: Transform/Load Facts
Purpose: Performs transformation of fact data from the Extract schema into the EDW.
SSIS package implementation: One package per fact table. If you have more than one source system for a fact table, use one package per source system for this fact table.

Phase: Build Data Mart
Purpose: Builds the data marts. In a Microsoft solution these packages will typically process cubes.
SSIS package implementation: One package per data mart.

Phase: Archive Extracts
Purpose: Archives relevant Extract tables to enable replay of old data.
SSIS package implementation: One package per source system (this package will archive all relevant tables from the source system).

Phase: Maintain Facts
Purpose: Maintains the indexes and sometimes statistics on facts (optionally rebuilding them). This phase is optional since you may not use indexes on fact tables at all.
SSIS package implementation: One package.

Phase: Backup
Purpose: Performs backup of the changed warehouse.
SSIS package implementation: Not always implemented as a package. Can be performed by different tools depending on customer needs.

 

Naming guidelines for the packages are found in the SSIS Naming convention document in the document center.

2.1      Dependency between stages

Some of the stages can be run in parallel, while others have dependencies. The diagram below illustrates the dependencies in a flow.

The illustration below shows a typical flow with two source systems, some dimensions/facts and two data marts:

                                                                   

Figure 1 - Dependencies between SSIS Packages

 

2.2      Collecting packages into master packages

Depending on the scheduler tool you use, it may be an advantage to create aggregate[1] packages that execute one or more sub packages. Examples of such packages include:

  • Collect the packages that load a dimension or fact into one big aggregate package
  • Create a package that loads all data from a specific source system
  • Automate the entire warehouse load using just one aggregate package
  • Automate maintenance tasks in one big package
  • Create one aggregate package for weekend loads and another for daily loads

 

Which strategy is correct depends largely on the structure of your data, the batch window, the need for restarts etc. However, no matter which strategy you choose, you should always follow certain design patterns:

  • The aggregate packages should not contain any business logic. They may contain only these elements:
    • Execution of other packages
    • Start / stop of job logs
  • All logic in the sub packages should be executable on its own, without participating in an aggregate package
    • This leaves you free to recombine logic as you see fit in other packages

 

3       Fact flows

This section contains general guidelines for structuring the flow of data from source system to warehouse. The flow components in the section below are listed in the order they are normally used.

3.1      Overview

The fact flow is initiated by staging the fact data to the Extract schema. Why do we copy data here instead of moving it directly to the pipeline? Rationale: there are several reasons:

  • Performing a pure copy of the data minimizes the time we spend disturbing the source system. This is generally good news for the source system DBA
  • By taking a local copy we can quickly reload from our local database
  • Archiving, if necessary, becomes easier

In all but the most extreme data sizes the above reasons justify an extra copy of the data while loading.

Once data is staged you generally have to perform some or all of these operations:

  • Detect delta (i.e. find new rows and update existing ones)
    • Optionally this delta detection will happen as part of the staging.
  • Map keys to master data
  • Assign entity/surrogate keys
  • Perform business transformation

Delta detection is described in the sections below. For guidance on the other operations refer to section 5.

3.2      Delta detection for fact data

The discussion in this section will focus on delta detection of facts / transactions. A separate kind of delta detection, used for dimension data, is discussed in section 4.2.

The purpose of delta detection is to limit the amount of changes committed to the warehouse. Instead of reloading every table, you intelligently try to change only what is strictly necessary.

Delta detection of facts should generally be applied very early in the ETL flow – since it reduces the amount of rows you have to work with in later phases.

You basically have three strategies for delta detection:

  1. Perform delta detection in source system
  2. Perform delta detection in ETL flow
  3. Do not perform delta detection.

 

The following sections will treat these cases.

3.2.1     Performing delta detection at the source system

This delta detection technique, while optimal, is rarely feasible. Implementing source delta detection follows the pattern:

  1. Find candidate columns in the source system that indicate if a row has been changed or added
  2. Question source system experts on how these columns are updated
  3. Test the theory thoroughly
  4. Implement a loader that only extracts the delta from the source system

 

1) The purpose of this process is to locate columns that uniquely identify changes in the source system.

Typical candidate columns are date and time columns. Delta detection on date/time will only work if the columns are maintained by either database triggers or strict programming techniques[2].

Other candidate columns are keys that perpetually increase; examples include SQL Server IDENTITY columns and Oracle sequences. Note that while new rows can be located quickly by comparing keys, detecting changes in existing rows is not possible this way.

Some databases provide a feature to automatically timestamp a row. In Oracle the rownumber can sometimes be used and in SQL Server columns of type timestamp/rowversion automatically change whenever a row is changed.

Be aware that deletes are also changes. When you examine your source system, you should also look for ways to detect deleted rows. One option would be for the source to include an “IsDeleted” column which is updated when the row is removed (instead of physically removing the row).

2) When you have located candidate columns for delta detection you must question the source system experts closely about these. If they reply with confidence that these columns are correctly maintained (this is rare) you can move on to the next step. If you detect doubts or have trouble getting information about the model, stop wasting your time and consider moving on to another delta detection method.

3) Always test your candidate delta columns thoroughly. Try changing the source system and see if columns change appropriately.

4) If the last steps were successful it is now a simple matter to implement an ETL loader that only fetches the last changed columns from the source system. Remember to congratulate yourself for performing the necessary detective work and achieving the optimal delta solution.
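
Example (a minimal sketch, assuming a hypothetical source table Sales.Orders with a trigger-maintained ModifiedDate column; the @LastLoadDate value would in practice come from a load-control table):

DECLARE @LastLoadDate datetime
SET @LastLoadDate = '20100101'   -- illustrative value only

SELECT OrderID, CustomerNo, Amount, ModifiedDate
FROM Sales.Orders WITH (NOLOCK)  -- NOLOCK on source system tables, see section 8.2
WHERE ModifiedDate > @LastLoadDate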

3.2.2     Performing delta detection in the ETL flow

This technique follows the pattern:

  1. Fetch all data from source system
  2. Compare this data with a copy from the EDW to detect changes
  3. Perform the rest of ETL flow only on the changes

 

1) Fetching all data from the source system is often a costly affair. Make sure that you coordinate this with the DBAs of the source system. Consider disabling locking of source tables or even running on a replicated copy of the source.

2 + 3) You should maintain an archive copy of the extracted, unclean table as it looked the last time you saw it. By merge joining this archive with the newly received source data (sorted on the key columns) you can quickly perform delta detection in the ETL flow.

The illustration below demonstrates this technique:

  1. Compare the archive and the full source data and perform the delta.
    a. An SSIS merge join transformation between the source data and the archive will make the delta obvious.
    b. The output is a table containing the delta plus the archive.
  2. Replace the old archive with the new table.
  3. Take the delta from the new table and perform the next ETL steps on it.

 

Figure 2 – Fact Delta Detection in ETL flow
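
If you want to prototype the comparison in SQL before building the SSIS merge join, the delta can be sketched like this (hypothetical Extract.Sales_New and Extract.Sales_Archive tables with identical layouts):

-- Rows that are new or changed: present in the new extract but not identical in the archive
SELECT n.*
FROM Extract.Sales_New n WITH (TABLOCK)
EXCEPT
SELECT a.*
FROM Extract.Sales_Archive a WITH (TABLOCK)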

 

3.2.3     Running without delta detection

It is worth reminding ourselves why we do delta detection in the first place: because it reduces the data we need to transform.

Delta detection, especially in the ETL flow, is really only worth the effort if the expected delta is small. Performing delta detection costs both time and machine resources.

You should carefully compare the cost of the delta detection with full reloads. As a rule of thumb, if your delta is much larger than 10% of the total data, a full load is often faster than delta detection[3].

 

4       Dimension Flow

The dimension flow normally deals with much less data than the fact flows. However, the transformations you need to apply are typically much more complicated.

4.1      Overview

The dimension flow typically proceeds in these phases:

  1. Copy source system data to the Extract schema
    • Even for the most extreme dimension this should be a fast operation. No need to apply the guidance from the fact loaders here
  2. Clean data types and fix domain problems. Examples:
    • Trimming / casing of strings
    • Quick fix of values that are outside the acceptable range – for example setting early dates to a fixed date in the past
    • Removing leading/trailing zeros
  3. Cast columns to correct types
    • This is the first step that can produce error rows
  4. Perform business rules for both transformation and cleaning
    • This may include join operations and other combining of data streams
  5. Check and optionally fix integrity against other tables (apply the techniques in section 5.1)
    • This step may also generate error rows
  6. Land the final data in a “load ready” table
  7. Perform type 1 and type 2 delta detection
  8. Optionally: Archive extracts

 

Rationale:

First of all: debug friendliness, not performance, is our typical concern in dimension loaders. Hence, we gladly sacrifice some performance if this helps us locate errors in data with more elegance and flexibility.

Copying the source system allows us to quickly inspect what we received and eases the archiving.

Before we apply error detection it is generally a good idea to perform some basic cleaning of the data (step 2). It becomes much easier to write error detection code if we can make some basic assumptions about our data (untrimmed strings are a classic example of hard-to-locate errors).

Using the same argument as for step 2, the following step casts data into their acceptable types. This is the first step that can produce error rows.

After step 3 we can make some strong assumptions about our data and the column domains. We are now ready to perform steps 4 and 5 which apply integrity checks and business transformations. Again we may produce error rows that may need handling by users.

Before we move on to the delta detection we should land our data in a “load ready” table. As developers we can inspect this table and see the final results of the ETL flow before any delta detection is applied. Since the following step is generally the most complicated, having a “checkpoint” here is a good idea.

The final delta detection step performs the type 1 and type 2 delta detection. If you are using the meta framework – no problem – just call the built-in procedures for this. If not, refer to Kimball's books and the following section.

4.2      Delta detection on dimensional / entity data

This form of delta detection is used to detect type 1 and type 2 changes in dimensions. Very often, this is the most error prone part of a warehouse load.

By far the easiest technique is this:

  1. Make a full copy of the source system      entity table
  2. Clean and transform the entity into a      “current source system view of dimension” table
  3. Compare the final, transformed dimension      table with the current EDW table
    1. Use a merge between the current view and       the EDW dimension table
    2. In this merge, perform type 1 and type 2       delta detection

Step 3 can often be automated – which it indeed has been in the meta framework.
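
A minimal T-SQL sketch of this merge step, assuming a hypothetical Dimensions.Customer_T2 table (with DW_Valid_From/DW_Valid_To history columns), a Staging.Customer “current view” table, Email as a type 1 attribute and Segment as a type 2 attribute:

DECLARE @BatchDate datetime
SET @BatchDate = GETDATE()

-- Type 1: overwrite the attribute on the current rows
UPDATE d
SET d.Email = s.Email
FROM Dimensions.Customer_T2 d WITH (TABLOCKX)
INNER JOIN Staging.Customer s WITH (TABLOCK)
      ON s.Customer_BK = d.Customer_BK
WHERE d.DW_Valid_To IS NULL
  AND ISNULL(d.Email, '') <> ISNULL(s.Email, '')

-- Type 2, step 1: close the current row when a historical attribute changes
UPDATE d
SET d.DW_Valid_To = @BatchDate
FROM Dimensions.Customer_T2 d WITH (TABLOCKX)
INNER JOIN Staging.Customer s WITH (TABLOCK)
      ON s.Customer_BK = d.Customer_BK
WHERE d.DW_Valid_To IS NULL
  AND ISNULL(d.Segment, '') <> ISNULL(s.Segment, '')

-- Type 2, step 2: insert a new current row for changed members and for brand new members
INSERT INTO Dimensions.Customer_T2 (Customer_BK, Email, Segment, DW_Valid_From, DW_Valid_To)
SELECT s.Customer_BK, s.Email, s.Segment, @BatchDate, NULL
FROM Staging.Customer s WITH (TABLOCK)
LEFT JOIN Dimensions.Customer_T2 d
      ON d.Customer_BK = s.Customer_BK AND d.DW_Valid_To IS NULL
WHERE d.Customer_BK IS NULL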

The above technique works for all but the largest[4] dimensions. If dealing with large dimensions, you may decide to combine the technique in section  with the above ideas.

 

5       General Techniques

This section contains some general techniques that apply to all flow types.

5.1      Assigning new keys

Part of the warehouse load is the assignment of small surrogate or entity keys. You should perform this step early in the ETL flow: it will generally make your columns narrower by replacing character keys with integer values. This has a beneficial effect on memory and storage requirements.

When assigning keys, be especially aware of the following:

  • How will you handle source system keys that do not map to your EDW keys?
  • How will you handle referential integrity errors in the source system?

In general, you can apply the following solutions to the problems above:

  • Create “inferred” members in your tables. These members are rows that “should exist”[5] but have not yet been observed in the source system.
    • Be aware that inferred members need special treatment if you put them into dimension tables. A dimension table will contain many columns that you cannot populate from the knowledge you have at the time of inferring the member.
    • If you use inferred members you must supply some default values for the columns of the dimension.
  • Replace the source system keys with an “unknown member” key (see the sketch after this list).
    • In Visma MS we generally use a negative value for the surrogate or entity key to easily distinguish these special unknown members.
  • Throw away rows from the source system that do not match the keys in your EDW
    • Remember to somehow notify your users that data has been thrown away!
  • Move unmatched rows to an error table and let users inspect this
    • This is typically combined with one of the above solutions
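
Example of the “unknown member” mapping (a sketch only; the Maps.Customer table and column names are hypothetical):

SELECT ISNULL(m.DW_EK_Customer, -1) AS DW_EK_Customer,  -- -1 = unknown member
       e.OrderID,
       e.Amount
FROM Extract.Sales e WITH (TABLOCK)
LEFT JOIN Maps.Customer m WITH (TABLOCK)
      ON m.Customer_BK = e.CustomerNo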

5.2      Cleaning data and performing single row transforms

Once keys have been assigned you start transforming data. The first transformations should be single-row transformations that either reduce or maintain the number of rows in the pipeline. Examples include:

  • Throwing away rows you will not need
    • Remember to log the rows you throw away to ease debugging.
    • You should also control which rows are thrown away using some parameter table instead of hardcoding
  • Renaming and combining columns from source system names to the final EDW names
  • Aggregation of data
  • Removing duplicates
  • Concatenation or splitting of data
  • Renormalization – i.e. transforming de-normalized data from source systems to 3NF
  • Applying key transformations. Examples include: mapping between master data keys, deriving keys from columns.

5.3      Applying transformations that “blow up” data

You may also be required to perform transformations that increase the number of rows in the data. Examples of these are:

  • Filling out “holes” in the data (e.g. a “missing row” should be interpreted as a zero)
  • Changing YTD values to periods and vice versa
  • Changing a from/to date in the input data to one row for every day in the interval
  • Doing currency calculations.
  • Simulating transactions (e.g. performing a FIFO buffer simulation of inventory movements)

 

Such transformations can quickly increase the amount of data and should therefore be performed on the smallest and most efficient input. Hence, “blow up” transformations should not be applied until after you have reduced both the size and number of input rows.

 

6       Data sources and Connections

The following guidelines apply to data sources and connection managers.

6.1      Avoid using Shared Data sources

Integration Services has a feature called shared data sources. While this feature is in principle a good idea it has some flaws that make us avoid it:

  • It works badly with TFS and other source control systems.
    • Every time you check out a package, Visual Studio will attempt to check out the shared data source too (temporarily).
    • This presents an (unnecessary) merge problem if another user has the data source checked out.
  • The reference to the data source is maintained in the package. Changing the data source will break the reference and force an annoying prompt when opening packages that depend on it
  • A shared data source does not give package developers the freedom to experiment with connection-specific settings optimal for their package usage (for example using .NET connectors)

We feel that the above concerns and their impact on developer productivity outweigh any benefit the shared data source provides. Our recommendation: do not use shared data sources.

By the same line of reasoning, data source views in Integration Services should not be used either.

6.2      Connecting to SQL Server

There are many different ways to connect to SQL Server. Performance and stability characteristics for each connection type differ. Hence, it is important to choose the right connection type.

The following drivers (in order of preference) are available when connecting to SQL Server:

  1. SQL Native client (aka: Netlib)
  2. OLEDB
  3. ODBC
  4. SQL .NET library
  5. JDBC (For java clients)

Each step down the list adds another translation layer. Such layers are both overhead and sources of bugs.

In a warehouse environment you should always prefer SQL Native Client if available.

6.2.1     Picking the right protocol

Once you have chosen your driver you need to determine which protocol to use. SQL Server supports the following connection protocols (in order of performance):

  • Shared Memory
  • VIA
  • Named Pipes
  • TCP/IP

 

Shared memory (the fastest) is available only if you are running the ETL tools on the same server as the database. Shared Memory is somewhat more tolerant of timeouts than the other protocols and generally provides better stability and performance. Choose shared memory whenever possible for best performance and stability.

The VIA protocol is only very rarely used and requires special hardware. This protocol should be avoided.

Named Pipes uses the windows pipe system for connections and authentication. The protocol is not “firewall friendly” and it is generally only good when connecting to the server from inside the corporate network. However, do bear in mind that named pipes is quite a bit faster than TCP/IP.

TCP/IP requires only one port to be open to the server (the default is 1433). This protocol is especially useful in firewalled environments. Be aware that the TCP/IP protocol, especially on long running warehouse queries, is sensitive to bad network conditions and server timeout. Such timeout can occur if the server is running low on resources or if SQL Server is executing inefficient query plans.

The protocol used for connections can be enforced either through the connection string (see knowledge base articles below) or through a SQL Server alias.

You should always validate that you are using the right connection to SQL Server. The list of all connections and their active protocols can be obtained by querying the system view:

SELECT session_id, net_transport

FROM sys.dm_exec_connections

6.2.2     Using aliases

If you are using either named pipes or TCP/IP we generally recommend that you use aliases when connecting to SQL Server.

An alias is an alternate name for a SQL Server instance. In addition to the name it also provides a specification of the protocol to use when connecting. If a client has an alias defined it can use the alias to connect directly to the SQL Server without any additional connection information specified. This provides an abstraction layer that makes it easy to replace server names in connection strings.

An alias is created using the SQL Server Configuration Manager. On 64-bit machines you must create the alias both in the 32-bit and the 64-bit environment (this can be done through SQL Server Configuration Manager).

Be aware that you cannot use aliases for shared memory connections.

6.2.3     Setting Application Name

To help monitor the execution of SSIS packages you should always set the application name in your connection string to SQL Server.

By setting this application name property to the name of the executing package you are allowing the SQL Server DBA to easily troubleshoot bad SSIS behavior.

The application name is set like this:

 

Figure 3 – Setting the application name
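
For reference, an OLE DB connection string with the property set might look like this (server, database and package names are illustrative):

Provider=SQLNCLI10;Data Source=MyDwServer;Initial Catalog=EDW;Integrated Security=SSPI;Application Name=TransformLoadFactSales;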

 

6.2.4     Relevant MS Articles

The following knowledge base and MSDN articles provide useful guidance about SQL Server connectivity:

 

6.3      Connecting to Oracle

There are two OLE DB drivers that can be used to extract data from a source system running Oracle in SQL Server Integration Services.

  • Oracle Provider for OLE DB
  • Microsoft OLE DB Provider for Oracle

6.3.1     Oracle Provider for OLE DB

The Oracle Provider for OLE DB exists both in a 32 bit and a 64 bit version. The latest version at the time of writing is 10.1.0.4.0 and can be found at:

There is a bug in the current version: columns with the Oracle data type NUMBER fail to be extracted if no precision and scale are set. To avoid this, there are two solutions:

  • Make sure all columns of the data type NUMBER have precision and scale set on the Oracle data source.
  • Install a patch (p5203839_10201_MSWIN-x86-64.zip for the x64 version – there should be a 32 bit patch as well) from the Oracle Technology Network (http://www.oracle.com/technology/) – requires a paid subscription to OTN.

Our experience shows that the driver is stable and faster than the Microsoft OLE DB Provider for Oracle.

6.3.2     SSIS and AlwaysUseDefaultCodePage

You might get the following error, trying to connect to an Oracle data source in SSIS:

“Cannot retrieve the column code page info from the OLE DB Provider. If the component supports the "DefaultCodePage" property, the code page from that property will be used. Change the value of the property if the current string code page values are incorrect. If the component does not support the property, the code page from the component's locale ID will be used."

This error occurs when you are extracting non-unicode string values from Oracle.

The fix to this is simple. Set AlwaysUseDefaultCodePage to true in the properties for the connection.

The DefaultCodePage should be set to 1252 which is correct for the Western alphabet. If Oracle is not using the Western alphabet, you'll need to determine the character set used and set the default code page accordingly. The correct default code page can be found at:

6.3.3     Microsoft OLE DB Provider for Oracle

The Microsoft OLE DB Provider for Oracle only exists in a 32 bit version, and that version is deprecated by Microsoft.

The benefit of using this driver is that there are no problems with NUMBER without precision or scale.

However it is our experience that the Microsoft OLE DB Provider for Oracle is a lot slower than the Oracle Provider for OLE DB.

6.3.4     Other drivers

There are other methods to connect to an Oracle data source, for example Oracle Instant Client or ODBC, but we have not managed to get either to work. If you discover good alternatives please post on the forums.

Also, note that Oracle Instant Client is not officially supported by Microsoft as a method for connecting to Oracle (product support will kindly tell you to use the Oracle OLEDB driver instead).

6.4      Connecting to DB2

Need guidelines from subject matter expert

6.5      Connecting to TeraData

To be written if we get a project where this is needed

6.6      Reading text files

Need guidelines here from subject matter expert

6.7      Links on connectivity and drivers

 

 

 

7       Control flow

The control flow of an SSIS package provides the backbone of your package. This section provides guidance on both specific tasks and general guidelines for the control flow.

7.1      Layout of Control Flow

The control flow should be laid out to start at the top of the screen and flow downwards. If parallelism is added to the flow, indicate this by aligning the parallel tasks horizontally.

Use Sequence containers to group operations into “modules”.

7.2      Maintaining and Rebuilding Indexes

Indexes do not maintain themselves and will gradually become useless if they are not kept in shape.

In a warehouse, the ETL flow is generally responsible for maintaining indexes. Indexing is a task that has the focus of the DBA and it is vital that we as ETL developers behave properly when working with indexes.

There are two different strategies for keeping an index in good shape:

  • Drop and recreate – This involves completely removing the indexes, performing a big operation, and rebuilding the indexes afterwards
    • If you plan to modify a large part of a table (more than 20%) in bulk mode, it is good practice to drop the indexes and recreate them after the operation
    • Failure to do this will result in page splitting[6] which will show up in the DBA's monitoring and negatively impact your flow.
  • Reorganize – This strategy keeps the indexes on the table while the insert is happening.
    • This strategy is often used for dimension tables.
    • It is used when a table changes only slowly for each batch run.
    • You should periodically maintain the index with reorganization

7.2.1     Correct procedure for Dropping and Recreating Indexes

If you are dropping and recreating indexes on a table the following order of operations is the correct approach:

 

  • Drop all non-clustered indexes on the table
  • Drop the clustered index (if any) on the table
  • Perform the large, bulk operation
  • Recreate the clustered index
  • Recreate the non-clustered indexes

 

This procedure will ensure the best performance during the bulk operation and keep the index in good shape.
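
A sketch of the procedure, with hypothetical table and index names:

-- Non-clustered indexes first, then the clustered index
DROP INDEX IX_Sales_Product ON Facts.Sales
DROP INDEX CIX_Sales ON Facts.Sales

-- <perform the large, bulk operation here>

-- Clustered index first, then the non-clustered indexes
CREATE UNIQUE CLUSTERED INDEX CIX_Sales
ON Facts.Sales (DW_ID_Sales)
WITH (FILLFACTOR = 100)

CREATE INDEX IX_Sales_Product
ON Facts.Sales (DW_EK_Product)
WITH (FILLFACTOR = 100)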

7.2.1.1    Using FILLFACTOR = 100

If you are always dropping and recreating indexes before modifying a table (as is often the case with fact tables) you can gain extra performance and reduce storage needs by using FILLFACTOR = 100. Search for “fill factor” in Books Online for documentation on this property.

 

Example:

The following statement creates the index CIX_DateFrom with a fill factor of 100:

 

CREATE UNIQUE CLUSTERED INDEX CIX_DateFrom

ON Dimensions.Product_T2 (DW_EK_Product, DW_Valid_From)

WITH (FILLFACTOR = 100)

 

7.2.2     Reorganizing indexes

If you decide to keep the indexes on the table while changing it you should make sure that the indexes are periodically maintained.

There are several strategies for reorganizing (consult Books Online) and you should make sure a strategy is implemented on the warehouse.

There are two conditions that you should monitor to check if the indexes are being maintained:

  • Is there enough free space on pages in the index?
  • Is the index fragmented?

 

You can use the management view: sys.dm_db_index_physical_stats to check for these conditions.
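
A sketch of such a check (the 30% fragmentation threshold is only an illustrative starting point):

SELECT OBJECT_NAME(ps.object_id) AS TableName,
       i.name AS IndexName,
       ps.avg_fragmentation_in_percent,     -- is the index fragmented?
       ps.avg_page_space_used_in_percent    -- is there enough free space on the pages?
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ps
INNER JOIN sys.indexes i
      ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE ps.avg_fragmentation_in_percent > 30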

7.2.3     Using SORT_IN_TEMPDB

SQL Server 2005 supports an index option called SORT_IN_TEMPDB. Depending on the configuration of your warehouse it may be a useful option to consider.

TempDb is sometimes placed on a set of very fast disks that are separate from the data warehouse database. If this is the case, using the SORT_IN_TEMPDB index option will provide you with a performance gain. If TempDb is sitting on the same disks as your database, this option will not provide any performance gain and may actually degrade performance.

Another advantage of SORT_IN_TEMPDB is that the DBA can monitor your activity by looking at the free space in TempDb in Perfmon. You should talk to the DBA on the project and establish a strategy on whether to use this index option or not.
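
Example (hypothetical table and index names):

CREATE UNIQUE CLUSTERED INDEX CIX_SomeFact
ON Facts.SomeFact (DW_ID_SomeFact)
WITH (SORT_IN_TEMPDB = ON, FILLFACTOR = 100)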

7.3      Bulk Insert Task vs. Data Flow Task

We do not recommend the use of the Bulk Insert task. While the Bulk Insert task can be slightly faster in a very few cases[7], the flexibility of the pipeline more than makes up for this. Using a pipeline with a SQL Server destination is only marginally slower, if slower at all.

7.4      Processing Analysis Services objects

When processing objects in analysis services you should generally make this processing as fine grained as possible. This assists you in troubleshooting problems and provides restart points in the package.

You should apply the following guidelines:

  • Process each dimension in its own control flow task
  • When processing a cube, split the processing of partitions into at least two control flow tasks:
    • Process Data – This will read all data from the relational source and build the “base aggregates”. After this, the cube is online
    • Process Index – This will build all aggregates.
    • Rationale: By splitting the processing like this, you can restart the flow at more points and individually measure how long it takes to read data and build aggregates
  • For large cubes – consider splitting the processing of each measure group into separate tasks.

7.4.1     Keeping cubes online while processing

Need information from subject matter expert

7.5      Error Behaviour

One common scenario is that an error in a loop should not halt the whole package. When error behaviour is not actively configured, the package/component will fail. It is recommended that each component be evaluated with errors in mind. The control flow can be made to take certain actions when failing (cleaning temporary tables etc.). Control flows can be made to proceed after errors have occurred, but remember to address this in the master package as well, to avoid a halt in the control flow.

 

8       Proper use of SQL in the ETL flow

When using SQL from an ETL flow there are some general guidelines to observe. These ensure that we keep the DBA happy and make optimal use of the balance between Pipeline and SQL based transformations.

8.1      Execute SQL Task

When executing SQL in the control flow you should always use the Execute SQL Task.

The control flow task Execute T-SQL Statement Task should not be used – it has a much less flexible API and does not provide the same benefits as the Execute SQL Task.

8.1.1     Only one statement per task

When using the Execute SQL Task, never put more than one T-SQL statement in each task. If you add more SQL statements to the same control flow task, the error messages from Integration Services become less meaningful.

Instead – add one Execute SQL Task for each T-SQL statement. If you need to execute more statements in one task you should encapsulate them in a stored procedure.
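
A sketch of such an encapsulation (hypothetical procedure and table names):

CREATE PROCEDURE Staging.PrepareSalesLoad
AS
BEGIN
    SET NOCOUNT ON
    TRUNCATE TABLE Staging.Sales
    TRUNCATE TABLE Staging.SalesError
END

The Execute SQL Task then contains the single statement: EXEC Staging.PrepareSalesLoad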

8.1.2     Avoid SQL based control flow in SQL Tasks

Try to avoid using the T-SQL IF / THEN / ELSE constructs inside a SQL task if the Integration Services conditional can be used instead. By embedding control inside SQL you are essentially “hiding” part of the execution flow of the Integration Services package – this makes the code harder to read.

 

Example:

The following T-SQL Task is wrong:

 

IF @DailyLoad = 1 BEGIN

                      <DoSomething>

END

ELSE BEGIN

                      <DoSomethingElse>

END

 

The correct flow would look like this:

 

 

8.1.3     Using NET connectors

When you use the Execute SQL Task you should always connect the Execute SQL Task to SQL Server using the ConnectionType = ADO.NET (see Figure 4).

Be aware that this does mean that you must maintain both an OLEDB connection and a .NET connection in the connection managers. We believe this administration is worth the effort because ADO.NET gives us these benefits:

  • Executing stored procedures, using IsQueryStoredProcedure, does not require the question marks (?) for specifying parameters
    • Parameters to stored procedures can be directly named
    • Some basic validation is even done by SSIS
  • The interface for handling returned values (output parameters and return values) is much more flexible.
  • The ADO.NET connection has much better support for SQL Server data types
    • No need for the developer to map database types to generic SSIS data types.

 

 

Figure 4 - ADO.NET connection type

 

8.2      Use Table Locking

While loading the data warehouse you generally want to either change or read an entire table, or a large part of it.

The SQL Server locking mechanism is optimized for OLTP use and will generally try to take a lock that interferes as little as possible with queries executed by other users. The expectation is that a query will only return a small result set and thus require locking of only a few pages.

This expectation from SQL Server has an unfortunate side effect in data warehousing: the locking system in SQL Server tends to be put under pressure when queries are consistently asking for entire tables instead of very small result sets.

Also, bear in mind that in a warehouse batch you, the developer, are in control of which queries are executed. Hence, the locking mechanism in SQL Server is working against you and not for you.

Fortunately this is easily circumvented. You can instruct SQL Server to take table locks instead of page locks in your queries by using the locking hints TABLOCK and TABLOCKX (for shared and exclusive locks respectively).

You should use table locking on all tables in which you wish to read or modify a large number of rows. For tables in source systems you should use the NOLOCK hint.

 

Example:

This will table lock a select statement. The locking still allows other queries to read the table:

 

SELECT c1, c2 FROM Dimension.MyDim_T1 WITH (TABLOCK)

 

Use TABLOCK for lookup transformation and other reads on dimension and map tables.

 

Example:

The following query uses both TABLOCK and TABLOCKX to avoid taking a large number of locks on the updated table:

 

                      UPDATE f

                      SET f.Measure = s.Measure

                      FROM Facts.SomeFact f WITH (TABLOCKX)

                      INNER JOIN Staging.SomeFact s WITH (TABLOCK)

                              ON f.<KeyColumns> = s.<KeyColumns>

 

8.2.1     Index locking options

If you are running on a SQL Server 2005 database you can build your indexes in such a way that they enforce table locking. The index build options you must set are:

  • ALLOW_ROW_LOCKS = OFF
  • ALLOW_PAGE_LOCKS = OFF

 

Example: The following statement builds a clustered index and enforces table locking on it

 

                      CREATE UNIQUE CLUSTERED INDEX CIX_<SomeTable>

                      ON <SomeTable> (<Column>)

                      WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF)

 

 

8.2.2     Table locking in data sources

Pipeline source transformations should use the data access mode “SQL command” instead of the default “Table or view”.

In the SQL statement, make sure you include a TABLOCK hint. See Figure 5 for an example of how to configure this:

 

Figure 5 - Table locking a data source transformation

 

 

8.3      Use SQL for non-equi joins

Be aware that Integration Services does not support non-equi joins. The most common form of these is:

 

SELECT …

FROM TableA A

INNER JOIN TableB B

          ON B.Key = A.Key AND  B.Date >= A.Date_From AND B.Date < A.Date_To

 

The “non-equi” part here simply means that the join condition contains at least one less than or greater than operator or a similar construct that does not ensure that you hit only one row in the lookup.

Such non-equi joins cannot be done efficiently with Integration Services lookup transformations.

You can often rewrite the join into an “equi-join” like this:

 

SELECT …

FROM TableA A

INNER JOIN TableB B

          ON B.Key = A.Key AND  A.IsCurrent = 1

 

In this case you can implement the join as an SSIS lookup transformation.

Only resort to SQL for joining when there is no other (good) alternative in Integration Services.

8.4      Use TRUNCATE instead of DELETE

It is often necessary to delete the contents of a table before loading new data to it.

Whenever possible, prefer the use of TRUNCATE instead of DELETE. Truncation is a minimally logged operation that will generally be a lot faster than DELETE. Delete statements can be used when the target table is relatively small. In this case, complexity is reduced at the cost of an acceptable performance loss.

 

Example: This is wrong:

DELETE FROM MyFactTable

 

This is right:

TRUNCATE TABLE MyFactTable

 

Be aware that you cannot truncate a table that is referenced by constraints. You can consider temporarily removing these constraints if performance is a concern.

 

8.5      Choosing the right indexes

General index tuning strategies are outside the scope of this document. However, there are some guidelines that are worth highlighting in an OLAP environment.

These are:

 

  • Use small (narrow column) clustered indexes and always try to make them UNIQUE
  • Covering indexes should be used generously on dimension tables
    • Take advantage of the INCLUDE feature in SQL Server 2005 to create covering indexes[8]
  • FILLFACTOR 100 can often be leveraged on indexes
  • Avoid, or try to minimize, the use of indexed views

 

8.6      Forbidden SQL Commands

There are some SQL commands you should never use from an Execute SQL Task. These are:

Forbidden T-SQL statement: USE <Database>
Reason: The name of the database should be configurable and is set in the connection manager, not in the Execute SQL Task.

Forbidden T-SQL statement: GO
Reason: You should instead make two or more Execute SQL Tasks.

 

9       Pipeline / Data Flow Usage

The pipeline is (generally) where you spend most of your execution time while running the ETL flows. Hence, it is especially important to properly utilize this powerful feature of Integration Services.

This section provides standards that you should apply to achieve proper pipeline usage.

9.1      Sorting

Sorting, and avoiding it, is a very central part of any ETL flow, which is why we must dedicate a full section to it.

Sorted data provide many benefits to the pipeline:

 

  • A sorted buffer can use the Merge and Merge Join transformations
  • You can work with “the previous row” in a script component on a sorted buffer
    • Example: For setting To-dates based on From-dates this is very handy
  • Sorting is a central part of delta detection on large data sets
    • Merge join is a very fast way to perform delta detection once the input is sorted
  • Eliminating duplicates is easy on a sorted input
  • Creating aggregate data often requires sorting the input
  • Inserting pre-sorted buffers reduces the amount of page splitting that occurs in the database

 

With all these advantages, it looks like we should be using sort almost exclusively (which is actually the case in many bad implementations).

However, sorting data has a price. The disadvantages of sorting are:

  • Sorting is slow.
    • The time complexity of sort is O(n lg(n)).
    • This means that it does not scale linearly with the size of the data
    • As the warehouse grows, sort operations can cause you to miss your batch window
  • Sorting requires space
    • To sort an input you need temporary space
    • This can come either from memory or from disk
    • When the sort space comes from memory you may be causing other applications to slow down while you are sorting
    • When the sort space comes from disk you will take an even more significant performance hit
  • Sorting is blocking
    • Any operations relying on the sort must wait for the sort before they can proceed
    • This will reduce parallelism – and hence performance

 

With these pitfalls of sorting, every sort you do should be analyzed for its potential impact. Since you pay a high price for sorting you should generally design your ETL flow in such a way that you take maximum advantage of a sorted input. Every transformation that needs the input to be sorted in a specific way should be executed while the input is still sorted.

With these things in mind, let’s move on to how you properly handle sorting.

9.1.1     SQL Server vs. Integration Services

When it comes to sorting, SQL Server is still faster than Integration Services. Furthermore, for sorting large data sets, Integration Services will simply be so slow that it is unworkable.

Hence, we recommend that you use SQL Server to perform your sort operations.

There are two ways you can perform a sort inside SQL Server:

  • Build a clustered index on a table. This will sort the table based on the key columns in the index
  • Issue an ORDER BY statement when you SELECT data. If there is no index to support this statement SQL Server will perform a sort in TempDb (disk) and memory

In both these cases you must tell Integration Services that the data are leaving the database sorted (see below).

9.1.2     Marking output as sorted

Even if you have a clustered index on a table you should still execute the ORDER BY clause in your SELECT statement to ensure that your output is correctly sorted independent of query optimization.

Example: The following data source forces the database to sort the data:

 

Figure 6 - A sorted data source
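
The SQL statement inside such a data source could look like this (ProductName is an illustrative column; the table and key columns follow the example in section 7.2.1.1):

SELECT DW_EK_Product, DW_Valid_From, ProductName
FROM Dimensions.Product_T2 WITH (TABLOCK)
ORDER BY DW_EK_Product, DW_Valid_From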

 

Executing the ORDER BY is not enough to ensure that Integration Services treats the output as sorted; you should also change the metadata of the pipeline.

This is done in the advanced properties of the data source transformation. You have to perform two steps for this to work:

  • Mark the output as sorted
  • Declare which columns are sorted and in which order

 

Example: The following shows how you mark the output as sorted

 

Figure 7 - Marking the pipeline as sorted

 

Each key is then marked with its position in the sequence:

 

Figure 8 - Setting sort key position

 

A negative number indicates a descending sort. See SortKeyPosition in Book Online for more information.

9.1.3     Clustered indexes and sorted queries

When you execute an ORDER BY against SQL Server the database engine has two choices:

 

  1. Fetch the data first, sort, then return
  2. Use an index and scan it in sorted order, returning results as the scan proceeds

 

Seen from Integration Services the last option is of course the best – since it ensures that rows start flowing from the data source almost as soon as the package starts. However, to execute the last strategy SQL Server will need the correct indexes. In a warehouse, an index used to support sorting is often made the clustered index.

With this in mind, if you expect your table to be frequently accessed by a pipeline doing a sort you may want to put an index in place to support this.

You can easily distinguish between the two strategies by looking at the query plan for your SELECT statement. If you see the symbol shown in Figure 9, the query engine is using the first plan.

 

 

Figure 9 - The sort icon in a query plan

 

 

Example: The following query:

 

SELECT ProductId, StartDate, EndDate, StandardCost

FROM Production.ProductCostHistory

ORDER BY StartDate, ProductID

 

Has the query plan:

 

Notice how the sort is being done as part of the query. Either change the order of the columns in the ORDER BY or change the index to obtain the better query plan:

 

 

 

9.1.4     A word of warning

There are cases when sorting is inevitable. In these cases you must pay the price of sorting. Though you can move the cost between processes, for example by using SQL and building indexes instead of sorting the pipeline, you still have to pay the price of the sort.

The art of sorting is about paying the price as few times as possible (zero times at the optimal). When you do pay the price for the sort – do so in a structured way.

9.1.5     List of Flow components that maintain sorting

Since sorting is such a precious property to maintain in the pipeline, it is necessary to know which transformations maintain the sorting.

The following list describes the commonly used flow components and whether they maintain sorting or not.

 

Flow component: Merge
Maintains sorting of buffer? Yes

Flow component: Merge Join
Maintains sorting of buffer? Yes

Flow component: Conditional Split
Maintains sorting of buffer? Yes

Flow component: Multicast
Maintains sorting of buffer? Yes

Flow component: Lookup
Maintains sorting of buffer? Yes

Flow component: Derived Column and other synchronous transforms
Maintains sorting of buffer? Yes

Flow component: Script Component
Maintains sorting of buffer? Depends (synchronous will, asynchronous will not)

Flow component: Union All
Maintains sorting of buffer? No (note that Multicast does maintain it)

Flow component: Aggregate
Maintains sorting of buffer? No (!)

 

Conclusions from the list above:

  • Use Merge instead of Union All on sorted input
  • As a rule of thumb:
    • A synchronous transform will always maintain the sorted property.
    • Asynchronous transforms will only maintain it if designed for this purpose.

9.2      Configuring lookups

Lookups are one of the most efficient data transformations in SSIS. By making just a few configuration changes to this transformation – significant performance can be gained.

The following guidelines describe the correct configuration of a lookup transformation:

  • Use a SQL statement instead of a table name and only select the columns you need
  • Add a TABLOCK hint to this SQL statement. The illustration below illustrates this[9]:

 

Figure 10 – Configuring a lookup

 

  • Make sure that full caching is enabled on the lookup
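
A lookup query following these guidelines could look like this (hypothetical map table):

SELECT Customer_BK, DW_EK_Customer
FROM Maps.Customer WITH (TABLOCK)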

9.3      Script components

On script components you should always set the PreCompile property to true.

9.4      Aggregating

The aggregation component, like the sort, does not work well on large datasets. However, it is very handy for removing duplicates and making aggregates that dynamically depend on the flow.

Use the component with care on large datasets and test properly before putting it into production. If the dataset is too large, consider using a SQL based GROUP BY instead (see the sketch below).
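
A sketch of the SQL alternative, assuming a hypothetical Extract.Sales table:

SELECT DW_EK_Date, DW_EK_Product, SUM(Amount) AS Amount
FROM Extract.Sales WITH (TABLOCK)
GROUP BY DW_EK_Date, DW_EK_Product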

9.5      Data Sources Adaptors

We often move quite a lot of data when reading data sources. It is therefore important to perform this read efficiently.

The following best practices apply to data source adaptors:

  • Data source adaptors should be named after the table or view they reference.
  • If a data source adaptor references more than one table (with a join or union) consider creating a view in the database to serve the data source
    • This assists debugging since you can now easily reproduce the SQL executed from Management Studio
  • Do not reference the table or view directly, instead use a SQL statement
  • Add a TABLOCK hint to this SQL statement
  • Do not SELECT any unneeded columns

 

The following illustrates a correctly configured data source:

 

Figure 11 - Configuring a data source

9.6      Data Destination Adaptors

On data destination adaptors the following should be set:

  • Use the Tablock checkbox whenever possible
  • During development, check constraints. Once you are sure that constraints are not violated you can consider disabling the constraint check
  • If transaction log space is limited, set the Maximum Insert Commit Size to a value that is a multiple of the DefaultBufferMaxRows property of the pipeline

Remember that a bulk lock (SQL Lock type: BU) is compatible with another bulk lock. This means that two bulk operations can operate on the same table at the same time – speeding up the operation

In general you should use the OLEDB data destination adaptor since this is compatible with remote development. However, if speed is very critical – the SQL Server destination type is faster since it works on a shared memory connection.

10    Package Configurations

When configuring packages the following best practices apply:

 

Use one configuration file per connection manager.

Rationale: Not all packages use all connection managers. Having one file per connection manager instead of one big file eliminates error messages from “missing” connection managers.

 

Use indirect configuration – i.e.: Use a system environment variable to point to the configuration file.

Rationale: This makes it much easier to change configuration file locations

 

Make a separate configuration file or separate filtering in configuration table for package-specific configuration. Interesting configurations include:

  • FolderBufferTempStorage – Where the buffers are stored if RAM gets low
  • Checkpoints – points to the package checkpoint files
  • Furthermore, you probably want to set some global variables including:
    • FolderPackages – Points to the location of sub packages

 

Use DontSaveSensitive as the package protection level.

Rationale: When sensitive data is allowed in a package, it gets encrypted. This can happen even when no sensitive data is embedded and might render packages unusable when opened on a machine other than the original. Sensitive data can instead be supplied by configuration files or configuration tables.

 

11    Logging

The logging of SSIS packages provides important debug information, both at development time and when the package is put into production.

As a very minimum, you should log the following events:

  • OnError
  • OnWarning
  • OnTaskFailed
  • OnPreValidate – Useful for catching compile time errors
  • OnPreExecute – Logs the start of the step, no matter if an error occurred or not.
  • OnPostExecute – Logs the end of the step, allowing you to time the total runtime of the step.

 

We recommend that you log to a SQL Server table for ease of access and to perform post-run analysis and aggregation of run statistics. The logging server should be configured through an environment variable as described in section 10.
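
A sketch of a post-run timing query, assuming the default SQL Server log provider table dbo.sysssislog (named sysdtslog90 on SQL Server 2005):

DECLARE @ExecutionId uniqueidentifier
SET @ExecutionId = '00000000-0000-0000-0000-000000000000'  -- the run you want to inspect

SELECT source AS Step,
       MIN(starttime) AS StepStart,
       MAX(endtime) AS StepEnd,
       DATEDIFF(second, MIN(starttime), MAX(endtime)) AS DurationSec
FROM dbo.sysssislog
WHERE event IN ('OnPreExecute', 'OnPostExecute')
  AND executionid = @ExecutionId
GROUP BY source
ORDER BY MIN(starttime)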

 

12    Performance optimization

This section contains performance optimization tricks that did not fit into the previous best practice recommendations.

12.1    Getting more Parallelism

The design of the SSIS pipeline is generally very efficient and most ETL jobs will run quite fast.

However, on larger, multi-CPU machines you can often gain extra performance by redesigning the pipeline for more parallelism (thus taking advantage of the CPUs).

There are two design patterns that you can generally apply if your pipeline is not fast enough. These are described below.

12.1.1   Vertical Parallelism

Vertical parallelism is achieved by adding more buffers to an existing flow.

The Union All transformation can be used for this purpose since it is an asynchronous transformation.  By adding a union all between two transformation components a new pipeline buffer is created by SSIS. Having more buffers will increase parallelism but at the cost of more memory.

Information about this technique can be found here:

Be aware that Union All removes the sorted property of a flow.

12.1.2   Horizontal Parallelism

Horizontal parallelism is achieved by partitioning the input data and creating a pipeline for each of them.

The pattern for obtaining this performance optimization is:

  • Find a proper partition function on the input data
    • The goal is to split the input into several, almost equal sized, piles of data
    • Time columns often form a good partitioning function: for example the monthly partitioning of input
    • Demographic keys, such as customer numbers, can also be used
    • If you cannot find a good key for partitioning, resort to using a hash value over the input
      • Let a modulo operator on this hash value split the input into piles
      • Example: the expression binary_checksum(*) % 4 = x splits the input into 4 piles, with x determining which pile to take (see the sketch after this list)
  • Make a list of work that needs doing – one item on the list per pile created
  • Put the work items into a queue structure
  • Start several instances of the SSIS packages.
    • Each instance grabs the first element in the queue and starts working on the piles
    • Spawn about 1-2 SSIS pipelines for each CPU in the machine
    • When there is no more work to be done (queue empty) the packages stop.
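
As an illustration of the hash-based partitioning, each package instance could read its pile like this (Extract.Sales is hypothetical; ABS() is added because BINARY_CHECKSUM can return negative values):

DECLARE @Pile int
SET @Pile = 0   -- each package instance grabs a different pile number (0-3) from the queue

SELECT *
FROM Extract.Sales WITH (TABLOCK)
WHERE ABS(BINARY_CHECKSUM(*)) % 4 = @Pile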

 

The meta framework provides a SQL based queue data structure that is perfect for this purpose.

13    Monitoring and Measuring Packages

There are many good tools to monitor your SSIS packages and establish a baseline performance profile. You should generally perform some degree of monitoring to locate bad code and sanity check your installation in general.

This section describes some common approaches.

13.1    Perfmon counters

PerfMon is a powerful tool for monitoring your ETL flows. Bad code can be quickly located and corrected.

You can even schedule runs of PerfMon during your batch window using the command line interface: logman.

The table below lists some commonly used performance counters and their “comfort zone”. As a minimum you should monitor one full data flow for the conditions below and correct as appropriate.

 

Counter: SQL Server: Locks – Locks / Sec
Comfort zone: < 10K
Effects when outside zone: Lock timeouts and unstable queries
Possible resolutions: Use TABLOCK hints

Counter: SQL Server: Access Methods – Page Splits / Sec
Comfort zone: < 300
Effects when outside zone: I/O system taking unnecessary hits. Instability due to timeouts
Possible resolutions: Optimize indexing. Consider dropping indexes before bulk loading

Counter: Physical Disk – Avg. Disk Sec / Transfer
Comfort zone: < 0.020 (less if enterprise)
Effects when outside zone: Timeouts and the “Communication Link Failure” error message
Possible resolutions: Rewrite queries or add indexes for better query plans. Investigate the I/O subsystem

Counter: Processor – % Processor Time
Comfort zone: > 50%
Effects when outside zone: Flow not running as fast as it could
Possible resolutions: Use more parallelism. Investigate the I/O system

Counter: SQL Server: Transactions – Free Space In TempDb (KB)
Comfort zone: No big drops; > 20% of total space in TempDb
Effects when outside zone: Code might be doing inefficient joining or sorting
Possible resolutions: Investigate code for bad query plans

Counter: SQL Server: Databases – Log File Used (KB)
Comfort zone: < size of largest table in database
Effects when outside zone: Excessive logging and running out of disk space
Possible resolutions: Put the database in simple mode. Consider rewriting SQL to use the pipeline instead

Counter: SQL Server: Databases – Log Growths
Comfort zone: = 0
Effects when outside zone: Log file is growing and fragmenting, causing performance degradation and possibly running the disk full
Possible resolutions: Allocate a log file that is big enough

Counter: Process – Working Set (all DTExec processes)
Comfort zone: < available physical memory
Effects when outside zone: SSIS being paged out or causing general instability of the system
Possible resolutions: Rewrite SSIS code to eliminate bad memory usage. Use 64-bit computing and add more memory

Counter: SQL Server: Memory Manager – Target Server Memory and Total Server Memory
Comfort zone: Should match
Effects when outside zone: SQL Server is releasing memory to the OS or not getting enough memory due to other, competing applications
Possible resolutions: Set SQL Server to Lock Pages in Memory. Set a minimum memory size in SQL Server

Counter: SQL Server: SQL Errors – Errors / Sec (_Total)
Comfort zone: = 0
Effects when outside zone: Your flow might be ignoring warnings from the database
Possible resolutions: Inspect flow parts that have errors > 0 and locate possible problems in the SQL code.

Counter: Processor – Interrupts / Sec
Comfort zone: < 2000
Effects when outside zone: A driver or service on the machine is misbehaving
Possible resolutions: Locate the driver or service by shutting down components one at a time until the interrupts stop.

 

13.2    SQL Profiler

You can use SQL Profiler to trace all the SQL statements coming to the server. The template “Tuning” is perfect for this purpose (you may want to extend it with start and end times).

If you trace to a file or table – be aware that SQL Profiler traces can quickly grow large. Here are some suggestions for keeping files small:

  • Set a filter on DatabaseName so only queries going to your EDW / DSA database are shown
  • Filter on Duration – queries that run for less than 50 ms are rarely worth tuning (in a warehouse).
  • Don't trace any more events than you strictly need.