This document describes the Visma MS best practice for creating ETL flows. The document focuses on Integration Services 2008 SP2 and later.
The important points are highlighted with the blue rule font. This helps the reader skim through the document quickly for guidance.
In section 2 we introduce our best practice with an architectural overview of a typical Visma MS based ETL-flow. This section even includes guidelines for which packages you would typically have in a full warehouse flow. Use this section as your starting point for digging into the rest of the document.
Zooming further into the flow structure, sections 3 and 4 describe techniques tailored towards fact and dimension flows. These sections also tie our architecture to the general Insighters ETL guideline.
Moving up one abstraction level we arrive at some general techniques common to all ETL packages. These techniques provide guidance that can be applied no matter which tool or implementation you apply. You can find these general tips and tricks in section 5.
The rest of the document is specific to Integration Services (SSIS) implementation. It describes the context that we work within and the best practice rules we follow. Do note that users of other technologies may still gain valuable insights by applying similar techniques in their own tools.
We begin our SSIS guidelines with a description of how to connect to data sources (section 6). From there we move on to the control flow (section 7).
Before we can introduce the pipeline we must discuss the proper usage of SQL. This precautionary measure is taken in section 8. With proper utilization of SQL in place we can finally move on to the SSIS pipeline in section 9.
Our next sections deal with the challenges of structured code maintenance and deployment. We introduce this with a discussion about package configurations in section 10. No maintenance argument is of course complete until we properly treat logging – the subject of section 11.
We conclude our guidelines with sections 12 and 13 which deal exclusively with performance tuning. In these concluding sections you will find the guidance to squeeze the last “rows / sec” out of your flow.
The overall ETL flow in a Visma MS solution is divided into some main phases. These main phases happen in sequence.
Each of the main phases is implemented as its own SSIS package. Sub packages are used inside the phases to further partition and modularize the flow. Section 3 and 4 contain implementation specifics for each phase.
The table below lists the main phases of an ETL flow:
Phase | Purpose | SSIS Package implementation |
Stage Dimensions | Extracts the dimension data from the source systems and places it in the Extract schema | One package per source system. Optionally one package per source system per dimension table |
Transform/Load Dimensions | Performs transformation of the data from the Extract schema into the EDW | One package per dimension table. If more than one source system loads a dimension table, create one package per source system for that dimension |
Maintain Dimensions | Maintains indexes on dimension tables. Will typically rebuild all indexes with FILLFACTOR = 100, perform materialization of performance-critical tables and other performance optimization tasks | One package |
Stage Facts | Extracts fact data from the source systems and places it in the Extract schema | One package per source system. May be split into more packages when dealing with large fact tables |
Transform/Load Facts | Performs transformation of fact data from the Extract schema into the EDW | One package per fact table. If more than one source system loads a fact table, use one package per source system for that fact table |
Build Data Mart | Builds the data marts. In a Microsoft solution these packages will typically process cubes | One package per data mart |
Archive Extracts | Archives relevant Extract tables to enable replay of old data | One package per source system (this package archives all relevant tables from that source system) |
Maintain Facts | Maintains the indexes and sometimes statistics on facts (optionally rebuilding them). This phase is optional since you may not use indexes on fact tables at all | One package |
Backup | Performs backup of the changed warehouse | Not always implemented as a package. Can be performed by different tools depending on customer needs |
Naming guidelines for the packages are found in the SSIS Naming convention document in the document center.
Some of the stages can be run in parallel, while some have dependencies. The diagram below illustrates the dependencies in a flow
The illustration below show a typical flow with two source systems, some dimensions/facts and two data marts:
Figure 1 - Dependencies between SSIS Packages
Depending on the scheduler tool you use, it may be an advantage to create aggregate[1] packages that execute one or more sub packages. Examples of such packages include
Which strategy is correct depends largely on the structure of your data, the batch window, the need for restarts etc. However, no matter which strategy you choose, you should always follow certain design patterns:
This section contains general guidelines for structuring the flow of data from source system to warehouse. The flow components in the section below are listed in the order they are normally used.
The fact flow is initiated by staging the fact data to the Extract schema. Why do we copy data here instead of moving it directly to the pipeline? There are several reasons:
In all but the most extreme data sizes the above reasons justify an extra copy of the data while loading.
Once data is staged you generally have to perform some or all of these operations:
Delta detection is described in the sections below. For guidance on the other operations refer to section 5.
The discussion in this section will focus on delta detection of facts / transactions. There is a separate kind of delta detection used for dimension data, discussed in section 4.2.
The purpose of delta detection is to limit the amount of changes committed to the warehouse. Instead of reloading every table, you intelligently try to change only what is strictly necessary.
Delta detection of facts should generally be applied very early in the ETL flow – since it will reduce the amounts of rows you have to work with at later phases.
You basically have three strategies for delta detection:
The following sections will treat these cases.
This delta detection technique, while optimal, is rarely feasible. Implementing source delta detection follows the pattern:
1) The purpose of this process is to locate columns that uniquely identify changes in the source system.
Typical candidate columns are date and time columns. Delta detection on date/time will only work if the columns are maintained by either database triggers or strict programming techniques[2].
Another column candidate is keys that perpetually increase; examples include SQL Server IDENTITY columns and Oracle sequences. Note that while new rows can be located quickly by comparing keys, detecting changes in existing rows is not possible.
Some databases provide a feature to automatically timestamp a row. In Oracle the ORA_ROWSCN pseudocolumn can sometimes be used, and in SQL Server columns of type timestamp/rowversion automatically change whenever a row is changed.
Be aware that deletes are also changes. When you examine your source system, you should also look for ways to detect deleted rows. One option would be for the source to include an “IsDeleted” column which is updated when the row is removed (instead of physically removing the row).
2) When you have located candidate columns for delta detection you must question the source system experts closely about these. If they reply with confidence that these columns are correctly maintained (this is rare) you can move on to the next step. If you detect doubts or have trouble getting information about the model, stop wasting your time and consider moving on to another delta detection method.
3) Always test your candidate delta columns thoroughly. Try changing the source system and see if columns change appropriately.
4) If the last steps were successful it is now a simple matter to implement an ETL loader that only fetches the rows changed since the last load from the source system. Remember to congratulate yourself for performing the necessary detective work and achieving the optimal delta solution.
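As an illustration, such a loader might look like the following sketch. The table, column and watermark names are invented for the example; your source system will differ:

```sql
-- Assumes the source table Sales.SomeOrder has a trigger-maintained
-- ModifiedDate column, and that we store the high-water mark of the
-- previous load in our own watermark table.
DECLARE @LastLoad DATETIME;

SELECT @LastLoad = LastLoadTime
FROM Audit.LoadWatermark
WHERE TableName = 'Sales.SomeOrder';

-- Fetch only the rows changed since the previous load.
-- NOLOCK is used because this is a source system table (see section 8).
SELECT OrderID, CustomerID, Amount, ModifiedDate
FROM Sales.SomeOrder WITH (NOLOCK)
WHERE ModifiedDate > @LastLoad;
```

After a successful load, the watermark row is updated to the highest ModifiedDate seen in the extract.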
This technique follows the pattern:
1) Fetching all data from the source system is often a costly affair. Make sure that you coordinate this with the DBAs of the source system. Consider disabling locking of source tables or even running on a replicated copy of the source.
2 + 3) You should maintain an archive copy of the extracted, unclean table as it looked the last time you saw it. By merge joining this archive with the newly received source data (sorted on the key columns) you can quickly perform delta detection in the ETL flow.
The illustration below demonstrates this technique:
Figure 2 – Fact Delta Detection in ETL flow
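If you instead land both the fresh extract and the archive copy in the database, the same delta detection can be sketched in set-based T-SQL. The schema, table and column names here are illustrative:

```sql
-- New and changed rows: everything in the fresh extract that does not
-- have an identical counterpart in the archived copy of the previous
-- extract. Unchanged rows fall out of the result and are not reloaded.
SELECT e.TransactionKey, e.Amount, e.TransactionDate
FROM Extract.FactSales e WITH (TABLOCK)
EXCEPT
SELECT a.TransactionKey, a.Amount, a.TransactionDate
FROM Archive.FactSales a WITH (TABLOCK);
```

Detecting deleted rows is the mirror image: swap the two sides of the EXCEPT and keep only the key columns.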
It is worth reminding ourselves why we do delta detection in the first place: because it reduces the data we need to transform.
Delta detection, especially in the ETL flow, is really only worth the effort if the expected delta is small. Performing delta detection costs both time and machine resources.
You should carefully compare the cost of the delta detection with full reloads. As a rule of thumb, if your delta is much larger than 10% of the total data, a full load is often faster than delta detection[3]
The dimension flow normally deals with much less data than the fact flows. However, the transformations you need to apply are typically much more complicated.
The dimension flow typically proceeds in these phases:
Rationale:
First of all: Debug friendliness, not performance is our typical concern on dimension loaders. Hence, we gladly sacrifice some performance if this helps us locate errors in data with more elegance and flexibility.
Copying the source system allows us to quickly inspect what we received and eases the archiving.
Before we apply error detection it is generally a good idea to perform some basic cleaning of the data (step 2). It becomes much easier to write error detection code if we can make some basic assumptions about our data (untrimmed strings are a classic example of hard-to-locate errors).
Using the same argument as in step 2, the following step tries to cast data into their acceptable types. This is the first step that can produce error rows.
After step 3 we can make some strong assumptions about our data and the column domains. We are now ready to perform steps 4 and 5 which apply integrity checks and business transformations. Again we may produce error rows that may need handling by users.
Before we move on to delta detection we should land our data in a “load ready” table. As developers we can inspect this table and see the final results of the ETL flow before any delta detection is applied. Since the following step is generally the most complicated, having a “checkpoint” here is a good idea.
Step 8 performs the final type 1 and type 2 delta detection. If you are using the meta framework – no problem – just call the built-in procedures for this. If not, refer to Kimball's books and the following section.
This form of delta detection is used to detect type 1 and type 2 changes in dimensions. Very often, this is the most error prone part of a warehouse load.
By far the easiest technique is this:
Step 3 can often be automated – which it indeed has been in the meta framework.
The above technique works for all but the largest[4] dimensions. If dealing with large dimensions, you may decide to combine the fact delta detection technique described earlier with the above ideas.
This section contains some general techniques that apply to all flow types.
Part of the warehouse load is the assigning of small surrogate or entity keys. You should perform this step early in the ETL flow: it will generally make your columns narrow by replacing character keys with integer values. This has a beneficial effect on memory and storage requirements.
When assigning keys, be especially aware of the following:
In general, you can apply the following solutions to the problems above:
Once keys have been assigned you start transforming data. The first transformation should be single rows transformations that either reduce or maintain the number of rows in the pipeline. Examples include:
You may also be required to perform transformations that increase the number of rows in the data. Examples of these are:
Such transformations can quickly increase the amount of data and should therefore be performed on the smallest and most efficient input. Hence, “blow up” transformations should not be applied until after you have reduced both the size and number of input rows.
The following guidelines apply to data sources and connection managers.
Integration Services has a feature called shared data sources. While this feature is in principle a good idea it has some flaws that make us avoid it:
We feel that the above concerns and their impact on developer productivity outweigh any benefit the shared data source provides. Our recommendation: do not use shared data sources.
In the same line of reasoning data source views in Integration Services should not be used either.
There are many different ways to connect to SQL Server. Performance and stability characteristics for each connection type differ. Hence, it is important to choose the right connection type.
The following drivers (in order of preference) are available when connecting to SQL Server:
Each step down the list adds another translation layer. Such layers are both overhead and sources of bugs.
In a warehouse environment you should always prefer SQL Native Client if available.
Once you have chosen your driver you need to determine which protocol to use. SQL Server supports the following connection protocols (in order of performance):
Shared memory (the fastest) is available only if you are running the ETL tools on the same server as the database. Shared memory is somewhat more tolerant of timeouts than the other protocols and generally provides better stability and performance. Choose shared memory whenever possible.
The VIA protocol is only very rarely used and requires special hardware. This protocol should be avoided.
Named Pipes uses the windows pipe system for connections and authentication. The protocol is not “firewall friendly” and it is generally only good when connecting to the server from inside the corporate network. However, do bear in mind that named pipes is quite a bit faster than TCP/IP.
TCP/IP requires only one port to be open to the server (the default is 1433). This protocol is especially useful in firewalled environments. Be aware that the TCP/IP protocol, especially on long running warehouse queries, is sensitive to bad network conditions and server timeout. Such timeout can occur if the server is running low on resources or if SQL Server is executing inefficient query plans.
The protocol used for connections can be enforced either through the connection string (see knowledge base articles below) or through a SQL Server alias.
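With SQL Native Client, the protocol can be forced with a prefix on the server name directly in the connection string. A sketch; the server and database names are illustrative:

```
-- Shared memory (only works when running locally on the database server)
Data Source=lpc:DWSERVER;Initial Catalog=EDW;Provider=SQLNCLI10.1;Integrated Security=SSPI;

-- TCP/IP on the default port 1433
Data Source=tcp:DWSERVER,1433;Initial Catalog=EDW;Provider=SQLNCLI10.1;Integrated Security=SSPI;

-- Named pipes
Data Source=np:DWSERVER;Initial Catalog=EDW;Provider=SQLNCLI10.1;Integrated Security=SSPI;
```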
You should always validate that you are using the right connection to SQL Server. The list of all connections and their active protocols can be obtained by querying the system view:
SELECT session_id, net_transport
FROM sys.dm_exec_connections
If you are using either named pipes or TCP/IP we generally recommend that you use aliases when connecting to SQL Server.
An alias is an alternate name for a SQL Server instance. In addition to the name it also provides a specification of the protocol to use when connecting. If a client has an alias defined it can use the alias to connect directly to SQL Server without any additional connection information specified. This provides an abstraction layer that makes it easy to replace server names in connection strings.
An alias is created using the SQL Server configuration manager. On 64-bit machines you must create the alias both in the 32 and the 64 bit environment (this can be done through SQL configuration manager)
Be aware that you cannot use aliases for shared memory connections.
To help monitor the execution of SSIS packages you should always set the application name in your connection string to SQL Server.
By setting this application name property to the name of the executing package you are allowing the SQL Server DBA to easily troubleshoot bad SSIS behavior.
The application name is set like this:
Figure 3 – Setting the application name
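Once packages set the application name, the DBA can trace each connection back to the package that opened it. A sketch of such a monitoring query against the dynamic management views:

```sql
-- Lists every session together with the application name it connected
-- with and the protocol in use. When packages set Application Name to
-- the package name, bad SSIS behaviour can be traced directly back to
-- the offending package.
SELECT s.session_id,
       s.program_name,      -- the application name from the connection string
       c.net_transport      -- the protocol actually in use
FROM sys.dm_exec_sessions s
INNER JOIN sys.dm_exec_connections c
    ON c.session_id = s.session_id;
```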
The following knowledge base and MSDN articles provide useful guidance about SQL Server connectivity:
There are two OLE DB drivers that can be used in SQL Server Integration Services to extract data from a source system running Oracle.
The Oracle Provider for OLE DB exists both in a 32 bit and a 64 bit version. The latest version at the time of writing is 10.1.0.4.0 and can be found at:
There is a bug in the current version. Columns with the Oracle data type NUMBER fail to be extracted if no precision and scale are set. To avoid this, there are two solutions.
Our experience shows that the driver is stable and faster than the Microsoft OLE DB Provider for Oracle.
You might get the following error, trying to connect to an Oracle data source in SSIS:
“Cannot retrieve the column code page info from the OLE DB Provider. If the component supports the "DefaultCodePage" property, the code page from that property will be used. Change the value of the property if the current string code page values are incorrect. If the component does not support the property, the code page from the component's locale ID will be used."
This error occurs when you are extracting non-unicode string values from Oracle.
The fix to this is simple. Set AlwaysUseDefaultCodePage to true in the properties for the connection.
The DefaultCodePage should be set to 1252, which is correct for the Western European alphabet. If Oracle is not using a Western European character set, you'll need to determine the character set used and set the default code page accordingly. The correct default code page can be found at:
The Microsoft OLE DB Provider for Oracle only exists in a 32 bit version, and that version is deprecated by Microsoft.
The benefit of using this driver is that there are no problems with NUMBER without precision or scale.
However it is our experience that the Microsoft OLE DB Provider for Oracle is a lot slower than the Oracle Provider for OLE DB.
There are other methods to connect to an Oracle data source, for example Oracle Instant Client or ODBC, but we have not managed to get either to work. If you discover good alternatives, please post them on the forums.
Also, note that Oracle Instant Client is not officially supported by Microsoft as a method for connecting to Oracle (Product Support will kindly tell you to use the Oracle OLE DB driver instead).
Need guidelines from subject matter expert
To be written if we get a project where this is needed
Need guidelines here from subject matter expert
The control flow of an SSIS package provides the backbone of your package. This section provides guidance on both specific tasks and general guidelines for the control flow.
The control flow should be laid out to start at the top of the screen and flow downwards. If parallelism is added to the flow, indicate this by aligning the parallel tasks horizontally.
Use Sequence containers to group operations into “modules”
Indexes do not maintain themselves and will gradually become useless if they are not kept in shape.
In a warehouse, the ETL flow is generally responsible for maintaining indexes. Indexing is a task that has the focus of the DBA and it is vital that we as ETL developers behave properly when working with indexes.
There are two different strategies for keeping an index in good shape:
If you are dropping and recreating indexes on a table the following order of operations is the correct approach:
This procedure will ensure the best performance during the bulk operation and keep the index in good shape.
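As a sketch, the pattern looks like this; the fact table, column and index names are illustrative:

```sql
-- 1) Drop the index before the bulk operation so the load does not have
--    to maintain it row by row.
DROP INDEX CIX_SomeFact ON Facts.SomeFact;

-- 2) Perform the bulk load here (e.g. the SSIS pipeline writes to
--    Facts.SomeFact while the table is unindexed).

-- 3) Recreate the index afterwards. Since the table will not be modified
--    again before the next load, a fill factor of 100 packs the pages
--    completely (see the FILLFACTOR discussion below).
CREATE CLUSTERED INDEX CIX_SomeFact
ON Facts.SomeFact (DW_EK_Date)
WITH (FILLFACTOR = 100);
```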
If you are always dropping and recreating indexes before modifying a table (as is often the case with fact tables) you can gain extra performance and reduce storage needs by using FILLFACTOR = 100. Search for “fill factor” in Books Online for documentation on this property.
Example:
The following statement creates the index CIX_DateFrom with a fill factor of 100:
CREATE UNIQUE CLUSTERED INDEX CIX_DateFrom
ON Dimensions.Product_T2 (DW_EK_Product, DW_Valid_From)
WITH (FILLFACTOR = 100)
If you decide to keep the indexes on the table while changing it you should make sure that the indexes are periodically maintained.
There are several strategies for reorganizing (consult Books Online) and you should make sure a strategy is implemented on the warehouse.
There are two conditions that you should monitor to check if the indexes are being maintained:
You can use the dynamic management function sys.dm_db_index_physical_stats to check for these conditions.
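A sketch of such a check follows; the 30% fragmentation threshold is an illustrative starting point, not a fixed rule:

```sql
-- Reports fragmentation and page fullness for every index in the current
-- database. High avg_fragmentation_in_percent or low
-- avg_page_space_used_in_percent indicates that maintenance is overdue.
-- SAMPLED mode is used so page fullness is populated without a full scan.
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       i.name AS IndexName,
       ips.avg_fragmentation_in_percent,
       ips.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
INNER JOIN sys.indexes i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30;
```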
SQL Server 2005 supports an index option called SORT_IN_TEMPDB. Depending on the configuration of your warehouse it may be a useful option to consider.
TempDb is sometimes placed on a set of very fast disks that are separate from the data warehouse database. If this is the case, using the SORT_IN_TEMPDB index option will provide you with a performance gain. If TempDb is sitting on the same disks as your database, this option will not provide any performance gain and may actually degrade performance.
Another advantage of SORT_IN_TEMPDB is that the DBA can monitor your activity by looking at the free space in TempDb in Perfmon. You should talk to the DBA on the project and establish a strategy on whether to use this index option or not.
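The option is set when the index is created or rebuilt. Reusing the index example from the FILLFACTOR section above:

```sql
-- Performs the sort phase of the index build in TempDb instead of the
-- user database. Only beneficial when TempDb sits on separate, fast disks.
CREATE UNIQUE CLUSTERED INDEX CIX_DateFrom
ON Dimensions.Product_T2 (DW_EK_Product, DW_Valid_From)
WITH (FILLFACTOR = 100, SORT_IN_TEMPDB = ON);
```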
We do not recommend the use of the Bulk Insert task. While the Bulk Insert task can be slightly faster in a very few cases[7], the flexibility of the pipeline more than makes up for this. A pipeline with a SQL Server destination is only marginally slower, if slower at all.
When processing objects in analysis services you should generally make this processing as fine grained as possible. This assists you in troubleshooting problems and provides restart points in the package.
You should apply the following guidelines:
Need information from subject matter expert
One common scenario is that an error in a loop should not halt the whole package. If error behaviour is not actively configured, the package/component will fail. Each component should therefore be evaluated with errors in mind. The control flow can be made to take certain actions when failing (cleaning temporary tables etc.) and to proceed after errors have occurred, but remember to address this in the master package as well, to avoid a halt in the overall control flow.
When using SQL from an ETL flow there are some general guidelines to observe. These ensure that we keep the DBA happy and make optimal use of the balance between Pipeline and SQL based transformations.
When executing SQL in the control flow you should always use the Execute SQL Task.
The control flow task: Execute T-SQL Statement Task should not be used – it has a much less flexible API and does not provide the same benefits as Execute SQL Task
When using the Execute SQL Task, never put more than one T-SQL statement in each task. If you add more SQL statements to the same control flow task, your error messages from Integration Services become less meaningful.
Instead – add one Execute SQL Task for each T-SQL statement. If you need to execute several statements in one task, you should encapsulate them in a stored procedure.
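A sketch of such an encapsulation; the procedure and table names are invented for the example:

```sql
-- Wraps several related statements in one procedure so that the Execute
-- SQL Task only contains a single statement. Errors are then reported
-- against the procedure, which is easy to debug in Management Studio.
CREATE PROCEDURE Staging.TruncateFactStaging
AS
BEGIN
    TRUNCATE TABLE Staging.SomeFact;
    TRUNCATE TABLE Staging.SomeOtherFact;
END
```

The Execute SQL Task then simply runs `EXEC Staging.TruncateFactStaging`.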
Try to avoid using the T-SQL IF / THEN / ELSE constructs inside a SQL task if the Integration Services conditional can be used instead. By embedding control inside SQL you are essentially “hiding” part of the execute flow of the Integration Services package – this makes the code harder to read.
Example:
The following T-SQL Task is wrong:
IF @DailyLoad = 1 BEGIN
<DoSomething>
END
ELSE BEGIN
<DoSomethingElse>
END
The correct flow would look like this:
When you use the Execute SQL Task you should always connect the Execute SQL Task to SQL Server using the ConnectionType = ADO.NET (see Figure 4).
Be aware that this does mean that you must maintain both an OLEDB connection and a .NET connection in the connection managers. We believe this administration is worth the effort because ADO.NET gives us these benefits:
Figure 4 - ADO.NET connection type
While loading the data warehouse you generally want to either change or read an entire table, or a large part of it.
The SQL Server locking mechanism is optimized for OLTP use and will generally try to take a lock that interferes as little as possible with queries executed by other users. The expectation is that a query will only return a small result set and thus require locking of only a few pages.
This expectation from SQL Server has an unfortunate side effect in data warehousing: the locking system in SQL Server tends to come under pressure when queries are consistently asking for entire tables instead of very small result sets.
Also, bear in mind that in a warehouse batch you, the developer, are in control of which queries are executed. Hence, the locking mechanism in SQL Server is working against you, not for you.
Fortunately this is easily circumvented. You can instruct SQL Server to take table locks instead of page locks in your queries by using the locking hints TABLOCK and TABLOCKX (for shared and exclusive locks respectively).
You should do table locking on all tables in which you wish to read or modify a large number of rows. For tables in source systems you should use the NOLOCK hint instead.
Example:
This will table lock a SELECT statement. The lock still allows other queries to read the table:
SELECT c1, c2 FROM Dimension.MyDim_T1 WITH (TABLOCK)
Use TABLOCK for lookup transformation and other reads on dimension and map tables.
Example:
The following query uses both TABLOCKX and TABLOCK to take two table locks instead of a large number of row or page locks on the updated table:
UPDATE f
SET f.Measure = s.Measure
FROM Facts.SomeFact f WITH (TABLOCKX)
INNER JOIN Staging.SomeFact s WITH (TABLOCK)
ON f.<KeyColumn> = s.<KeyColumn>
If you are running on a SQL Server 2005 database you can build your indexes in such a way that they enforce table locking. The index build options you must set are:
Example: The following statement builds a clustered index and enforces table locking on it
CREATE UNIQUE CLUSTERED INDEX CIX_<SomeTable>
ON <SomeTable> (<Column>)
WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF)
Pipeline source transformations should use Data Access Mode = SQL Statement instead of the default “table or view”.
In the SQL Statement make sure you execute a TABLOCK hint. See Figure 5 for an example of how to configure this:
Figure 5 - Table locking a data source transformation
Be aware that Integration Services does not support non-equi joins. The most common form of these is:
SELECT …
FROM TableA A
INNER JOIN TableB B
ON B.Key = A.Key AND B.Date >= A.Date_From AND B.Date < A.Date_To
The “non-equi” part here simply means that the join condition contains at least one less than or greater than operator or a similar construct that does not ensure that you hit only one row in the lookup.
Such non-equi joins cannot be done efficiently with Integration Services lookup transformations.
You can often rewrite the join into an “equi-join” like this:
SELECT …
FROM TableA A
INNER JOIN TableB B
ON B.Key = A.Key AND A.IsCurrent = 1
In this case you can implement the join as an SSIS lookup transformation.
Only resort to SQL for joining when there is no other (good) alternative in Integration Services.
It is often necessary to delete the contents of a table before loading new data to it.
Whenever possible, prefer the use of TRUNCATE instead of DELETE. Truncation is a minimally logged operation that will generally be a lot faster than DELETE. Delete statements can be used when the target table is relatively small. In this case, complexity is reduced at the cost of an acceptable performance loss.
Example: This is wrong:
DELETE FROM MyFactTable
This is right:
TRUNCATE TABLE MyFactTable
Be aware that you cannot truncate a table that is referenced by constraints. You can consider temporarily removing these constraints if performance is a concern.
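A sketch of the pattern; the constraint, table and column names are illustrative:

```sql
-- Facts.SomeFact references Dimensions.Product, so Dimensions.Product
-- cannot be truncated until the foreign key is removed.
ALTER TABLE Facts.SomeFact DROP CONSTRAINT FK_SomeFact_Product;

-- Truncation is now allowed and is minimally logged.
TRUNCATE TABLE Dimensions.Product;

-- Reload Dimensions.Product here, then restore the constraint so
-- integrity is enforced again after the batch.
ALTER TABLE Facts.SomeFact
ADD CONSTRAINT FK_SomeFact_Product
FOREIGN KEY (DW_EK_Product) REFERENCES Dimensions.Product (DW_EK_Product);
```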
General index tuning strategies are outside the scope of this document. However, there are some guidelines that are worth highlighting in an OLAP environment.
These are:
There are some SQL commands you should never use from an Execute SQL Task. These are:
Forbidden T-SQL Statement | Reason |
USE <Database> | The name of the database should be configurable and is set in the connection manager not in the Execute SQL Task |
GO | You should instead make two or more Execute SQL tasks |
The pipeline is (generally) where you spend most of your execution time while running the ETL flows. Hence, it is especially important to properly utilize this powerful feature of Integration Services.
This section provides standards that you should apply to achieve proper pipelines usage.
Sorting, and avoiding it, is a very central part of any ETL flow which is why we must dedicate a full section to it.
Sorted data provide many benefits to the pipeline:
With all these advantages, it looks like we should be using sort almost exclusively (which is actually the case in many bad implementations).
However, sorting data has a price. The disadvantages of sorting are:
With these pitfalls of sorting, every sort you do should be analyzed for its potential impacts. Since you pay a high price for sort you should generally design your ETL flow in such a way that you take maximum advantage of a sorted input. Every transformation that needs the input to be sorted in a specific way should be executed while the input is still sorted.
With these things in mind, let’s move on to how you properly handle sorting.
When it comes to sorting, SQL Server is still faster than Integration Services. Furthermore, for sorting large data sets, Integration Services will simply be so slow that it is unworkable.
Hence, we recommend that you use SQL Server to perform your sort operations.
There are two ways you can perform a sort inside SQL Server:
In both these cases you must tell Integration Services that the data are leaving the database sorted (see below).
Even if you have a clustered index on a table you should still execute the ORDER BY clause in your SELECT statement to ensure that your output is correctly sorted independently of query optimization.
Example: The following data source forces the database to sort the data:
Figure 6 - A sorted data source
Executing the ORDER BY is not enough to make Integration Services treat the output as sorted; you must also change the metadata of the pipeline.
This is done in advanced properties of the data source transformation. You have to perform two steps for this to work:
Example: The following shows how you mark the output as sorted
Figure 7 - Marking the pipeline as sorted
Each key is then marked with its position in the sequence:
Figure 8 - Setting sort key position
A negative number indicates a descending sort. See SortKeyPosition in Book Online for more information.
When you execute an ORDER BY against SQL Server the database engine has two choices:
Seen from Integration Services the last option is of course the best – since this ensures that rows start flowing from the data source almost as soon as the package starts. However, to execute the last strategy SQL Server needs the correct indexes. In a warehouse, an index used to support a sort is often made the clustered index.
With this in mind, if you expect your table to be frequently accessed by a pipeline doing a sort you may want to put an index in place to support this.
You can easily distinguish between the two strategies by looking at the query plan for your SELECT statement. If you see the symbol shown in Figure 9, the query engine is using the first plan.
Figure 9 - The sort icon in a query plan
Example: The following query:
SELECT ProductId, StartDate, EndDate, StandardCost
FROM Production.ProductCostHistory
ORDER BY StartDate, ProductID
Has the query plan:
Notice how the sort is being done as part of the query. Either change the order of the columns in the ORDER BY or change the index to obtain the better query plan:
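One way to obtain the better plan is to align an index with the ORDER BY so the engine can stream rows without a sort operator. A sketch (the index name is illustrative, and whether to make it clustered or nonclustered should be evaluated against the table's existing keys):

```sql
-- Illustrative only: a covering index whose key order matches
-- the ORDER BY StartDate, ProductID, so no sort operator is needed.
CREATE NONCLUSTERED INDEX IX_ProductCostHistory_StartDate_ProductID
ON Production.ProductCostHistory (StartDate, ProductID)
INCLUDE (EndDate, StandardCost);
```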
There are cases where sorting is inevitable, and in those cases you must pay the price of sorting. Though you can move the cost between processes, for example by using SQL and building indexes instead of sorting in the pipeline, the price of the sort still has to be paid somewhere.
The art of sorting is about paying the price as few times as possible (zero times at the optimal). When you do pay the price for the sort – do so in a structured way.
Since sorting is such a precious property to maintain in the pipeline, it is necessary to know which transformations preserve it.
The following list describes the commonly used flow components and whether they maintain sorting or not.
Flow Component | Maintains sorting of buffer? |
Merge | Yes |
Merge Join | Yes |
Conditional Split | Yes |
Multicast | Yes |
Lookup | Yes |
Derived column and other synchronous | Yes |
Script component | Depends (synchronous will, async will not) |
Union All | No (note that multicast does maintain) |
Aggregate | No (!) |
Conclusions from the list above:
Lookups are one of the most efficient data transformations in SSIS. By making just a few configuration changes to this transformation – significant performance can be gained.
The following guidelines describe the correct configuration of a lookup transformation:
Figure 10 – Configuring a lookup
On script components you should always set the PreCompile property to true.
The aggregate component, like the sort, does not work well on large data sets. However, it is very handy for removing duplicates and for making aggregates that depend dynamically on the flow.
Use the component with care on large data sets and test properly before putting it into production. If the data set is too large, consider using a SQL based GROUP BY instead.
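A SQL based alternative to the aggregate component could look like the following sketch (the staging table and column names are illustrative):

```sql
-- Remove duplicates and pre-aggregate in the database instead of
-- in the SSIS pipeline; SQL Server handles large GROUP BYs better.
SELECT   CustomerKey,
         SUM(SalesAmount) AS SalesAmount,
         COUNT(*)         AS SourceRows
FROM     staging.SalesRows
GROUP BY CustomerKey;
```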
We often move quite a lot of data when reading from data sources. It is therefore important to perform this read efficiently.
The following best practices apply to data source adaptors:
The following illustrates a correctly configured data source:
Figure 11 - Configuring a data source
On data destination adaptors the following should be set:
Remember that a bulk lock (SQL lock type: BU) is compatible with other bulk locks. This means that two bulk operations can operate on the same table at the same time, speeding up the load.
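For example, two loads like the one below can run at the same time against the same heap table, because each BU lock is compatible with the other (the file path and table name are illustrative):

```sql
-- TABLOCK on a heap requests a bulk update (BU) lock,
-- which is compatible with other BU locks on the same table.
BULK INSERT dbo.FactSales
FROM 'D:\staging\factsales_part1.dat'
WITH (TABLOCK);
```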
In general you should use the OLE DB destination adaptor since it is compatible with remote development. However, if speed is very critical, the SQL Server destination is faster since it works over a shared memory connection (which requires the package to run on the database server itself).
When configuring packages the following best practices apply:
Use one configuration file per connection manager.
Rationale: Not all packages use all connection managers. Having one file per connection manager instead of one big file eliminates error messages from “missing” connection managers.
Use indirect configuration – i.e.: Use a system environment variable to point to the configuration file.
Rationale: This makes it much easier to change configuration file locations
Make a separate configuration file or separate filtering in configuration table for package-specific configuration. Interesting configurations include:
Use DontSaveSensitive as package security.
Rationale: With the other protection levels, sensitive data in a package is encrypted with a user- or machine-specific key. This can happen even when no sensitive data is embedded, and may render packages unusable when opened on another machine than the original. With DontSaveSensitive, sensitive values are instead supplied through configuration files or configuration tables.
The logging of SSIS packages provides important debug information, both at development time and when the package is put into production.
As a very minimum, you should log the following events:
We recommend that you log to a SQL Server table for ease of access and to perform post-run analysis and aggregation of run statistics. The logging server should be configured through an environment variable as described in section 10.
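With SQL Server logging enabled, SSIS 2008 writes to the dbo.sysssislog table by default. A sketch of a post-run duration analysis against that table could look like this:

```sql
-- Package duration per execution, based on the default SSIS log table.
SELECT   source                                         AS PackageName,
         executionid,
         MIN(starttime)                                 AS StartTime,
         MAX(endtime)                                   AS EndTime,
         DATEDIFF(second, MIN(starttime), MAX(endtime)) AS DurationSec
FROM     dbo.sysssislog
WHERE    event IN ('PackageStart', 'PackageEnd')
GROUP BY source, executionid
ORDER BY StartTime DESC;
```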
This section contains performance optimization tricks that did not fit into the previous best practice recommendations.
The design of the SSIS pipeline is generally very efficient and most ETL jobs will run quite fast.
However, on larger, multi-CPU machines you can often gain extra performance by redesigning the pipeline for more parallelism (thus taking advantage of the extra CPUs).
There are two design patterns that you can generally apply if your pipeline is not fast enough. They are described below.
Vertical parallelism is achieved by adding more buffers to an existing flow.
The Union All transformation can be used for this purpose since it is an asynchronous transformation. By adding a union all between two transformation components a new pipeline buffer is created by SSIS. Having more buffers will increase parallelism but at the cost of more memory.
Information about this technique can be found here:
Be aware that Union All removes the sorted property of a flow.
Horizontal parallelism is achieved by partitioning the input data and creating a pipeline for each partition.
The pattern for obtaining this performance optimization is:
The meta framework provides a SQL based queue data structure that is perfect for this purpose.
There are many good tools to monitor your SSIS packages and establish a baseline performance profile. You should generally perform some degree of monitoring to locate bad code and sanity check your installation in general.
This section describes some common approaches.
PerfMon is a powerful tool for monitoring your ETL flows. Bad code can be quickly located and corrected.
You can even schedule PerfMon collections during your batch window using the command line interface: logman.
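A sketch of such a scheduled collection with logman (the counter set name, sample interval and output path are illustrative):

```shell
:: Create a counter collection sampling every 15 seconds
logman create counter SSIS_Baseline ^
  -c "\Processor(_Total)\% Processor Time" "\PhysicalDisk(_Total)\Avg. Disk sec/Transfer" ^
  -si 00:00:15 -o C:\PerfLogs\SSIS_Baseline

:: Start and stop it around the batch window, e.g. from scheduled tasks
logman start SSIS_Baseline
logman stop SSIS_Baseline
```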
The table below lists some commonly used performance counters and their “comfort zone”. As a minimum you should monitor one full data flow against the conditions below and correct as appropriate.
Counter | Comfort Zone | Effects when outside zone | Possible Resolutions |
SQL Server : Locks Locks / Sec | < 10K | Lock timeouts and unstable queries | Use TABLOCK hints |
SQL Server : Access Methods Page Splits / Sec | < 300 | I/O system taking unnecessary hits. Instability due to timeouts | Optimize indexing. Consider dropping indexes before bulk loading |
Physical Disk / Avg. Disk Sec / Transfer | < 0.020 (less if enterprise) | Timeouts and the “Communication Link Failure” error message | Rewrite queries or add indexes for better query plans. Investigate I/O subsystem |
Processor / % Processor Time | > 50% | Flow not running as fast as it could | Use more parallelism Investigate I/O system |
SQL Server : Transactions / Free Space In TempDb (KB) | No drops larger than 20% of total space in TempDb | Code might be doing inefficient joining or sorting | Investigate code for bad query plans |
SQL Server : Databases / Log File Used (KB) | < size of largest table in database | Excessive logging and running out of disk space | Put database in simple recovery mode. Consider rewriting SQL to use the pipeline instead |
SQL Server : Databases Log Growths | = 0 | Log files is growing and fragmenting causing performance degradation and possible running disk full | Allocate a log file that is big enough |
Processes / Working Set (All DTExec processes) | < available physical memory | SSIS being paged out or causing general instability of system | Rewrite SSIS code to eliminate bad memory usage. Use 64-bit computing and add more memory |
SQL Server : Memory Manager / Target Server Memory And Total Server Memory | Should match | SQL Server is releasing memory to OS or not getting enough memory from other, blocking applications | Set SQL Server to Lock Pages in memory Set a minimum memory size in SQL Server |
SQL Server : SQL Errors Errors / Sec / _Total | = 0 | Your flow might be ignoring warnings from the database | Inspect flow parts that have errors > 0 and locate possible problems in SQL code |
Processor / Interrupts / Sec | < 2000 | Driver or service on machine is misbehaving | Locate the driver or service by shutting down components one at a time until interrupts stop |
You can use SQL Profiler to trace all the SQL statements coming to the server. The template “Tuning” is perfect for this purpose (you may want to extend it with start and end times).
If you trace to a file or table – be aware that SQL Profiler traces can quickly grow large. Here are some suggestions for keeping files small: