This document describes the Visma MS best practice for creating ETL flows. The document focuses on Integration Services 2008 SP2 and later.
The important points are highlighted with the blue rule font. This helps the reader skim through the document quickly for guidance.
In section 2 we introduce our best practice with an architectural overview of a typical Visma MS based ETL-flow. This section even includes guidelines for which packages you would typically have in a full warehouse flow. Use this section as your starting point for digging into the rest of the document.
Zooming further into the flow structure, sections 3 and 4 describe techniques tailored towards fact and dimension flows. These sections also tie our architecture to the general Insighters ETL guideline.
Moving up one abstraction level we arrive at some general techniques common to all ETL packages. These techniques provide guidance that can be applied no matter which tool or implementation you apply. You can find these general tips and tricks in section 5.
The rest of the document is specific to Integration Services (SSIS) implementation. It describes the context that we work within and the best practice rules we follow. Do note that users of other technologies may still gain valuable insights by applying similar techniques in their own tools.
We begin our SSIS guidelines with a description of how to connect to data sources (section 6). From there we move on to the control flow (section 7).
Before we can introduce the pipeline we must discuss the proper usage of SQL. This precautionary measure is taken in section 8. With proper utilization of SQL in place we can finally move on to the SSIS pipeline in section 9.
Our next sections deal with the challenges of structured code maintenance and deployment. We introduce this with a discussion about package configurations in section 10. No maintenance argument is of course complete until we properly treat logging – the subject of section 11.
We conclude our guidelines with sections 12 and 13 which deal exclusively with performance tuning. In these concluding sections you will find the guidance to squeeze the last “rows / sec” out of your flow.
The overall ETL flow in a Visma MS solution is divided into some main phases. These main phases happen in sequence.
Each of the main phases is implemented as its own SSIS package. Sub packages are used inside the phases to further partition and modularize the flow. Section 3 and 4 contain implementation specifics for each phase.
The table below lists the main phases of an ETL flow:
Phase | Purpose | SSIS Package implementation |
Stage Dimensions | Extracts the dimension data from the source systems and places it in the Extract schema | One package per source system. Optionally one package per source system per dimension table |
Transform/Load Dimensions | Performs transformation of the data from the Extract schema into the EDW | One package per dimension table. If more than one source system loads a dimension table, create one package per source system for that dimension |
Maintain Dimensions | Maintains indexes on dimension tables. Will typically rebuild all indexes with FILLFACTOR = 100, perform materialization of performance-critical tables and other performance optimization tasks | One package |
Stage Facts | Extracts fact data from the source systems and places it in the Extract schema | One package per source system. May be split into more packages when dealing with large fact tables |
Transform/Load Facts | Performs transformation of fact data from the Extract schema into the EDW | One package per fact table. If more than one source system loads a fact table, use one package per source system for that fact table |
Build Data Mart | Builds the data marts. In a Microsoft solution these packages will typically process cubes | One package per data mart |
Archive Extracts | Archives relevant Extract tables to enable replay of old data | One package per source system (this package archives all relevant tables from that source system) |
Maintain Facts | Maintains the indexes and sometimes statistics on facts (optionally rebuilding them). This phase is optional since you may not use indexes on fact tables at all | One package |
Backup | Performs backup of the changed warehouse | Not always implemented as a package. Can be performed by different tools depending on customer needs |
Naming guidelines for the packages are found in the SSIS Naming convention document in the document center.
Some of the stages can be run in parallel, while some have dependencies. The diagram below illustrates the dependencies in a flow
The illustration below show a typical flow with two source systems, some dimensions/facts and two data marts:
Figure 1 - Dependencies between SSIS Packages
Depending on the scheduler tool you use, it may be an advantage to create aggregate[1] packages that execute one or more sub packages. Examples of such packages include
Which strategy is correct depends largely on the structure of your data, the batch window, the need for restarts etc. However, no matter which strategy you choose, you should always follow certain design patterns:
This section contains general guidelines for structuring the flow of data from source system to warehouse. The flow components in the section below are listed in the order they are normally used.
The fact flow is initiated by staging the fact data to the Extract schema. Why do we copy data here instead of moving it directly to the pipeline? There are several reasons:
In all but the most extreme data sizes the above reasons justify an extra copy of the data while loading.
Once data is staged you generally have to perform some or all of these operations:
Delta detection is described in the sections below. For guidance on the other operations refer to section 5.
The discussion in this section will focus on delta detection of facts / transactions. There is a separate kind of delta detection used for dimension data, discussed in section 4.2.
The purpose of delta detection is to limit the amount of changes committed to the warehouse. Instead of reloading every table, you intelligently try to change only what is strictly necessary.
Delta detection of facts should generally be applied very early in the ETL flow – since it will reduce the amounts of rows you have to work with at later phases.
You basically have three strategies for delta detection:
The following sections will treat these cases.
This delta detection technique, while optimal, is rarely feasible. Implementing source delta detection follows the pattern:
1) The purpose of this process is to locate columns that uniquely identify changes in the source system.
Typical candidate columns are date and time columns. Delta detection on date/time will only work if the columns are maintained by either database triggers or strict programming techniques[2].
Another column candidate is keys that perpetually increase; examples include SQL Server IDENTITY columns and Oracle sequences. Note that while new rows can be located quickly by comparing keys, detecting changes in existing rows is not possible.
Some databases provide a feature to automatically timestamp a row. In Oracle the ORA_ROWSCN pseudocolumn can sometimes be used, and in SQL Server columns of type timestamp/rowversion automatically change whenever a row is changed.
Be aware that deletes are also changes. When you examine your source system, you should also look for ways to detect deleted rows. One option would be for the source to include an “IsDeleted” column which is updated when the row is removed (instead of physically removing the row).
2) When you have located candidate columns for delta detection you must question the source system experts closely about these. If they reply with confidence that these columns are correctly maintained (this is rare) you can move on to the next step. If you detect doubts or have trouble getting information about the model, stop wasting your time and consider moving on to another delta detection method.
3) Always test your candidate delta columns thoroughly. Try changing the source system and see if columns change appropriately.
4) If the last steps were successful it is now a simple matter to implement an ETL loader that only fetches the rows changed since the last load from the source system. Remember to congratulate yourself for performing the necessary detective work and achieving the optimal delta solution.
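As an illustration, such a loader might look like the following sketch. The table, column and watermark names are invented for the example; your source system will differ:

```sql
-- Assumes the source table Sales.SomeOrder has a trigger-maintained
-- ModifiedDate column, and that we store the high-water mark of the
-- previous load in our own watermark table.
DECLARE @LastLoad DATETIME;

SELECT @LastLoad = LastLoadTime
FROM Audit.LoadWatermark
WHERE TableName = 'Sales.SomeOrder';

-- Fetch only the rows changed since the previous load.
-- NOLOCK is used because this is a source system table (see section 8).
SELECT OrderID, CustomerID, Amount, ModifiedDate
FROM Sales.SomeOrder WITH (NOLOCK)
WHERE ModifiedDate > @LastLoad;
```

After a successful load, the watermark row is updated to the highest ModifiedDate seen in the extract.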
This technique follows the pattern:
1) Fetching all data from the source system is often a costly affair. Make sure that you coordinate this with the DBAs of the source system. Consider disabling locking of source tables or even running on a replicated copy of the source.
2 + 3) You should maintain an archive copy of the extracted, unclean table as it looked the last time you saw it. By merge joining this archive with the newly received source data (sorted on the key columns) you can quickly perform delta detection in the ETL flow.
The illustration below demonstrates this technique:
Figure 2 – Fact Delta Detection in ETL flow
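If you instead land both the fresh extract and the archive copy in the database, the same delta detection can be sketched in set-based T-SQL. The schema, table and column names here are illustrative:

```sql
-- New and changed rows: everything in the fresh extract that does not
-- have an identical counterpart in the archived copy of the previous
-- extract. Unchanged rows fall out of the result and are not reloaded.
SELECT e.TransactionKey, e.Amount, e.TransactionDate
FROM Extract.FactSales e WITH (TABLOCK)
EXCEPT
SELECT a.TransactionKey, a.Amount, a.TransactionDate
FROM Archive.FactSales a WITH (TABLOCK);
```

Detecting deleted rows is the mirror image: swap the two sides of the EXCEPT and keep only the key columns.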
It is worth reminding ourselves why we do delta detection in the first place: because it reduces the data we need to transform.
Delta detection, especially in the ETL flow, is really only worth the effort if the expected delta is small. Performing delta detection costs both time and machine resources.
You should carefully compare the cost of the delta detection with full reloads. As a rule of thumb, if your delta is much larger than 10% of the total data, a full load is often faster than delta detection[3]
The dimension flow normally deals with much less data than the fact flows. However, the transformations you need to apply are typically much more complicated.
The dimension flow typically proceeds in these phases:
Rationale:
First of all: Debug friendliness, not performance is our typical concern on dimension loaders. Hence, we gladly sacrifice some performance if this helps us locate errors in data with more elegance and flexibility.
Copying the source system allows us to quickly inspect what we received and eases the archiving.
Before we apply error detection it is generally a good idea to perform some basic cleaning of the data (step 2). It becomes much easier to write error detection code if we can make some basic assumptions about our data (untrimmed strings are a classic example of hard-to-locate errors).
Using the same argument as in step 2, the following step tries to cast data into their acceptable types. This is the first step that can produce error rows.
After step 3 we can make some strong assumptions about our data and the column domains. We are now ready to perform steps 4 and 5 which apply integrity checks and business transformations. Again we may produce error rows that may need handling by users.
Before we move on to delta detection we should land our data in a “load ready” table. As developers we can inspect this table and see the final results of the ETL flow before any delta detection is applied. Since the following step is generally the most complicated, having a “checkpoint” here is a good idea.
Step 8 performs the final type 1 and type 2 delta detection. If you are using the meta framework – no problem – just call the built-in procedures for this. If not, refer to Kimball's books and the following section.
This form of delta detection is used to detect type 1 and type 2 changes in dimensions. Very often, this is the most error prone part of a warehouse load.
By far the easiest technique is this:
Step 3 can often be automated – which it indeed has been in the meta framework.
The above technique works for all but the largest[4] dimensions. If dealing with large dimensions, you may decide to combine the fact delta detection technique described earlier with the above ideas.
This section contains some general techniques that apply to all flow types.
Part of the warehouse load is the assigning of small surrogate or entity keys. You should perform this step early in the ETL flow: it will generally make your columns narrow by replacing character keys with integer values. This has a beneficial effect on memory and storage requirements.
When assigning keys, be especially aware of the following:
In general, you can apply the following solutions to the problems above:
Once keys have been assigned you start transforming data. The first transformation should be single rows transformations that either reduce or maintain the number of rows in the pipeline. Examples include:
You may also be required to perform transformations that increase the number of rows in the data. Examples of these are:
Such transformations can quickly increase the amount of data and should therefore be performed on the smallest and most efficient input. Hence, “blow up” transformations should not be applied until after you have reduced both the size and number of input rows.
The following guidelines apply to data sources and connection managers.
Integration Services has a feature called shared data sources. While this feature is in principle a good idea it has some flaws that make us avoid it:
We feel that the above concerns and their impact on developer productivity outweigh any benefit the shared data source provides. Our recommendation: do not use shared data sources.
In the same line of reasoning data source views in Integration Services should not be used either.
There are many different ways to connect to SQL Server. Performance and stability characteristics for each connection type differ. Hence, it is important to choose the right connection type.
The following drivers (in order of preference) are available when connecting to SQL Server:
Each step down the list adds another translation layer. Such layers are both overhead and sources of bugs.
In a warehouse environment you should always prefer SQL Native Client if available.
Once you have chosen your driver you need to determine which protocol to use. SQL Server supports the following connection protocols (in order of performance):
Shared memory (the fastest) is available only if you are running the ETL tools on the same server as the database. Shared memory is somewhat more tolerant of timeouts than the other protocols and generally provides better stability and performance. Choose shared memory whenever possible.
The VIA protocol is only very rarely used and requires special hardware. This protocol should be avoided.
Named Pipes uses the windows pipe system for connections and authentication. The protocol is not “firewall friendly” and it is generally only good when connecting to the server from inside the corporate network. However, do bear in mind that named pipes is quite a bit faster than TCP/IP.
TCP/IP requires only one port to be open to the server (the default is 1433). This protocol is especially useful in firewalled environments. Be aware that the TCP/IP protocol, especially on long running warehouse queries, is sensitive to bad network conditions and server timeout. Such timeout can occur if the server is running low on resources or if SQL Server is executing inefficient query plans.
The protocol used for connections can be enforced either through the connection string (see knowledge base articles below) or through a SQL Server alias.
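With SQL Native Client, the protocol can be forced with a prefix on the server name directly in the connection string. A sketch; the server and database names are illustrative:

```
-- Shared memory (only works when running locally on the database server)
Data Source=lpc:DWSERVER;Initial Catalog=EDW;Provider=SQLNCLI10.1;Integrated Security=SSPI;

-- TCP/IP on the default port 1433
Data Source=tcp:DWSERVER,1433;Initial Catalog=EDW;Provider=SQLNCLI10.1;Integrated Security=SSPI;

-- Named pipes
Data Source=np:DWSERVER;Initial Catalog=EDW;Provider=SQLNCLI10.1;Integrated Security=SSPI;
```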
You should always validate that you are using the right connection to SQL Server. The list of all connections and their active protocols can be obtained by querying the system view:
SELECT session_id, net_transport
FROM sys.dm_exec_connections
If you are using either named pipes or TCP/IP we generally recommend that you use aliases when connecting to SQL Server.
An alias is an alternate name for a SQL Server instance. In addition to the name it also provides a specification of the protocol to use when connecting. If a client has an alias defined it can use the alias to connect directly to SQL Server without any additional connection information specified. This provides an abstraction layer that makes it easy to replace server names in connection strings.
An alias is created using the SQL Server configuration manager. On 64-bit machines you must create the alias both in the 32 and the 64 bit environment (this can be done through SQL configuration manager)
Be aware that you cannot use aliases for shared memory connections.
To help monitor the execution of SSIS packages you should always set the application name in your connection string to SQL Server.
By setting this application name property to the name of the executing package you are allowing the SQL Server DBA to easily troubleshoot bad SSIS behavior.
The application name is set like this:
Figure 3 – Setting the application name
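Once packages set the application name, the DBA can trace each connection back to the package that opened it. A sketch of such a monitoring query against the dynamic management views:

```sql
-- Lists every session together with the application name it connected
-- with and the protocol in use. When packages set Application Name to
-- the package name, bad SSIS behaviour can be traced directly back to
-- the offending package.
SELECT s.session_id,
       s.program_name,      -- the application name from the connection string
       c.net_transport      -- the protocol actually in use
FROM sys.dm_exec_sessions s
INNER JOIN sys.dm_exec_connections c
    ON c.session_id = s.session_id;
```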
The following knowledge base and MSDN articles provide useful guidance about SQL Server connectivity:
There are two OLE DB drivers that can be used in SQL Server Integration Services to extract data from a source system running Oracle.
The Oracle Provider for OLE DB exists both in a 32 bit and a 64 bit version. The latest version at the time of writing is 10.1.0.4.0 and can be found at:
There is a bug in the current version. Columns with the Oracle data type NUMBER fail to be extracted if no precision and scale are set. To avoid this, there are two solutions.
Our experience shows that the driver is stable and faster than the Microsoft OLE DB Provider for Oracle.
You might get the following error, trying to connect to an Oracle data source in SSIS:
“Cannot retrieve the column code page info from the OLE DB Provider. If the component supports the "DefaultCodePage" property, the code page from that property will be used. Change the value of the property if the current string code page values are incorrect. If the component does not support the property, the code page from the component's locale ID will be used."
This error occurs when you are extracting non-unicode string values from Oracle.
The fix to this is simple. Set AlwaysUseDefaultCodePage to true in the properties for the connection.
The DefaultCodePage should be set to 1252, which is correct for the Western European alphabet. If Oracle is not using a Western European character set, you'll need to determine the character set used and set the default code page accordingly. The correct default code page can be found at:
The Microsoft OLE DB Provider for Oracle only exists in a 32 bit version, and that version is deprecated by Microsoft.
The benefit of using this driver is that there are no problems with NUMBER without precision or scale.
However it is our experience that the Microsoft OLE DB Provider for Oracle is a lot slower than the Oracle Provider for OLE DB.
There are other methods to connect to an Oracle data source, for example Oracle Instant Client or ODBC, but we have not managed to get either to work. If you discover good alternatives, please post them on the forums.
Also, note that Oracle Instant Client is not officially supported by Microsoft as a method for connecting to Oracle (Product Support will kindly tell you to use the Oracle OLE DB driver instead).
Need guidelines from subject matter expert
To be written if we get a project where this is needed
Need guidelines here from subject matter expert
The control flow of an SSIS package provides the backbone of your package. This section provides guidance on both specific tasks and general guidelines for the control flow.
The control flow should be laid out to start at the top of the screen and flow downwards. If parallelism is added to the flow, indicate this by aligning the parallel tasks horizontally.
Use Sequence containers to group operations into “modules”
Indexes do not maintain themselves and will gradually become useless if they are not kept in shape.
In a warehouse, the ETL flow is generally responsible for maintaining indexes. Indexing is a task that has the focus of the DBA and it is vital that we as ETL developers behave properly when working with indexes.
There are two different strategies for keeping an index in good shape:
If you are dropping and recreating indexes on a table the following order of operations is the correct approach:
This procedure will ensure the best performance during the bulk operation and keep the index in good shape.
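As a sketch, the pattern looks like this; the fact table, column and index names are illustrative:

```sql
-- 1) Drop the index before the bulk operation so the load does not have
--    to maintain it row by row.
DROP INDEX CIX_SomeFact ON Facts.SomeFact;

-- 2) Perform the bulk load here (e.g. the SSIS pipeline writes to
--    Facts.SomeFact while the table is unindexed).

-- 3) Recreate the index afterwards. Since the table will not be modified
--    again before the next load, a fill factor of 100 packs the pages
--    completely (see the FILLFACTOR discussion below).
CREATE CLUSTERED INDEX CIX_SomeFact
ON Facts.SomeFact (DW_EK_Date)
WITH (FILLFACTOR = 100);
```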
If you are always dropping and recreating indexes before modifying a table (as is often the case with fact tables) you can gain extra performance and reduce storage needs by using FILLFACTOR = 100. Search for “fill factor” in Books Online for documentation on this property.
Example:
The following statement creates the index CIX_DateFrom with a fill factor of 100:
CREATE UNIQUE CLUSTERED INDEX CIX_DateFrom
ON Dimensions.Product_T2 (DW_EK_Product, DW_Valid_From)
WITH (FILLFACTOR = 100)
If you decide to keep the indexes on the table while changing it you should make sure that the indexes are periodically maintained.
There are several strategies for reorganizing (consult Books Online) and you should make sure a strategy is implemented on the warehouse.
There are two conditions that you should monitor to check if the indexes are being maintained:
You can use the dynamic management function sys.dm_db_index_physical_stats to check for these conditions.
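A sketch of such a check follows; the 30% fragmentation threshold is an illustrative starting point, not a fixed rule:

```sql
-- Reports fragmentation and page fullness for every index in the current
-- database. High avg_fragmentation_in_percent or low
-- avg_page_space_used_in_percent indicates that maintenance is overdue.
-- SAMPLED mode is used so page fullness is populated without a full scan.
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       i.name AS IndexName,
       ips.avg_fragmentation_in_percent,
       ips.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
INNER JOIN sys.indexes i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30;
```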
SQL Server 2005 supports an index option called SORT_IN_TEMPDB. Depending on the configuration of your warehouse it may be a useful option to consider.
TempDb is sometimes placed on a set of very fast disks that are separate from the data warehouse database. If this is the case, using the SORT_IN_TEMPDB index option will provide you with a performance gain. If TempDb is sitting on the same disks as your database, this option will not provide any performance gain and may actually degrade performance.
Another advantage of SORT_IN_TEMPDB is that the DBA can monitor your activity by looking at the free space in TempDb in Perfmon. You should talk to the DBA on the project and establish a strategy on whether to use this index option or not.
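The option is set when the index is created or rebuilt. Reusing the index example from the FILLFACTOR section above:

```sql
-- Performs the sort phase of the index build in TempDb instead of the
-- user database. Only beneficial when TempDb sits on separate, fast disks.
CREATE UNIQUE CLUSTERED INDEX CIX_DateFrom
ON Dimensions.Product_T2 (DW_EK_Product, DW_Valid_From)
WITH (FILLFACTOR = 100, SORT_IN_TEMPDB = ON);
```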
We do not recommend the use of the Bulk Insert task. While the Bulk Insert task can be slightly faster in a very few cases[7], the flexibility of the pipeline more than makes up for this. A pipeline with a SQL Server destination is only marginally slower, if slower at all.
When processing objects in analysis services you should generally make this processing as fine grained as possible. This assists you in troubleshooting problems and provides restart points in the package.
You should apply the following guidelines:
Need information from subject matter expert
One common scenario is that an error in a loop should not halt the whole package. If error behaviour is not actively configured, the package/component will fail. Each component should therefore be evaluated with errors in mind. The control flow can be made to take certain actions when failing (cleaning temporary tables etc.) and to proceed after errors have occurred, but remember to address this in the master package as well, to avoid a halt in the overall control flow.
When using SQL from an ETL flow there are some general guidelines to observe. These ensure that we keep the DBA happy and make optimal use of the balance between Pipeline and SQL based transformations.
When executing SQL in the control flow you should always use the Execute SQL Task.
The control flow task: Execute T-SQL Statement Task should not be used – it has a much less flexible API and does not provide the same benefits as Execute SQL Task
When using the Execute SQL Task, never put more than one T-SQL statement in each task. If you add more SQL statements to the same control flow task, your error messages from Integration Services become less meaningful.
Instead – add one Execute SQL Task for each T-SQL statement. If you need to execute several statements in one task, you should encapsulate them in a stored procedure.
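A sketch of such an encapsulation; the procedure and table names are invented for the example:

```sql
-- Wraps several related statements in one procedure so that the Execute
-- SQL Task only contains a single statement. Errors are then reported
-- against the procedure, which is easy to debug in Management Studio.
CREATE PROCEDURE Staging.TruncateFactStaging
AS
BEGIN
    TRUNCATE TABLE Staging.SomeFact;
    TRUNCATE TABLE Staging.SomeOtherFact;
END
```

The Execute SQL Task then simply runs `EXEC Staging.TruncateFactStaging`.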
Try to avoid using the T-SQL IF / THEN / ELSE constructs inside a SQL task if the Integration Services conditional can be used instead. By embedding control inside SQL you are essentially “hiding” part of the execute flow of the Integration Services package – this makes the code harder to read.
Example:
The following T-SQL Task is wrong:
IF @DailyLoad = 1 BEGIN
<DoSomething>
END
ELSE BEGIN
<DoSomethingElse>
END
The correct flow would look like this:
When you use the Execute SQL Task you should always connect the Execute SQL Task to SQL Server using the ConnectionType = ADO.NET (see Figure 4).
Be aware that this does mean that you must maintain both an OLEDB connection and a .NET connection in the connection managers. We believe this administration is worth the effort because ADO.NET gives us these benefits:
Figure 4 - ADO.NET connection type
While loading the data warehouse you generally want to either change or read an entire table, or a large part of it.
The SQL Server locking mechanism is optimized for OLTP use and will generally try to take a lock that interferes as little as possible with queries executed by other users. The expectation is that a query will only return a small result set and thus require locking of only a few pages.
This expectation from SQL Server has an unfortunate side effect in data warehousing: the locking system in SQL Server tends to come under pressure when queries are consistently asking for entire tables instead of very small result sets.
Also, bear in mind that in a warehouse batch you, the developer, are in control of which queries are executed. Hence, the locking mechanism in SQL Server is working against you, not for you.
Fortunately this is easily circumvented. You can instruct SQL Server to take table locks instead of page locks in your queries by using the locking hints TABLOCK and TABLOCKX (for shared and exclusive locks respectively).
You should do table locking on all tables in which you wish to read or modify a large number of rows. For tables in source systems you should use the NOLOCK hint instead.
Example:
This will table lock a SELECT statement. The lock still allows other queries to read the table:
SELECT c1, c2 FROM Dimension.MyDim_T1 WITH (TABLOCK)
Use TABLOCK for lookup transformation and other reads on dimension and map tables.
Example:
The following query uses both TABLOCKX and TABLOCK to take two table locks instead of a large number of row or page locks on the updated table:
UPDATE f
SET f.Measure = s.Measure
FROM Facts.SomeFact f WITH (TABLOCKX)
INNER JOIN Staging.SomeFact s WITH (TABLOCK)
ON f.<KeyColumn> = s.<KeyColumn>
If you are running on a SQL Server 2005 database you can build your indexes in such a way that they enforce table locking. The index build options you must set are:
Example: The following statement builds a clustered index and enforces table locking on it
CREATE UNIQUE CLUSTERED INDEX CIX_<SomeTable>
ON <SomeTable> (<Column>)
WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF)
Pipeline source transformations should use Data Access Mode = SQL Statement instead of the default “table or view”.
In the SQL Statement make sure you execute a TABLOCK hint. See Figure 5 for an example of how to configure this:
Figure 5 - Table locking a data source transformation
Be aware that Integration Services does not support non-equi joins. The most common form of these is:
SELECT …
FROM TableA A
INNER JOIN TableB B
ON B.Key = A.Key AND B.Date >= A.Date_From AND B.Date < A.Date_To
The “non-equi” part here simply means that the join condition contains at least one less than or greater than operator or a similar construct that does not ensure that you hit only one row in the lookup.
Such non-equi joins cannot be done efficiently with Integration Services lookup transformations.
You can often rewrite the join into an “equi-join” like this:
SELECT …
FROM TableA A
INNER JOIN TableB B
ON B.Key = A.Key AND A.IsCurrent = 1
In this case you can implement the join as an SSIS lookup transformation.
Only resort to SQL for joining when there is no other (good) alternative in Integration Services.
It is often necessary to delete the contents of a table before loading new data to it.
Whenever possible, prefer the use of TRUNCATE instead of DELETE. Truncation is a minimally logged operation that will generally be a lot faster than DELETE. Delete statements can be used when the target table is relatively small. In this case, complexity is reduced at the cost of an acceptable performance loss.
Example: This is wrong:
DELETE FROM MyFactTable
This is right:
TRUNCATE TABLE MyFactTable
Be aware that you cannot truncate a table that is referenced by constraints. You can consider temporarily removing these constraints if performance is a concern.
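A sketch of the pattern; the constraint, table and column names are illustrative:

```sql
-- Facts.SomeFact references Dimensions.Product, so Dimensions.Product
-- cannot be truncated until the foreign key is removed.
ALTER TABLE Facts.SomeFact DROP CONSTRAINT FK_SomeFact_Product;

-- Truncation is now allowed and is minimally logged.
TRUNCATE TABLE Dimensions.Product;

-- Reload Dimensions.Product here, then restore the constraint so
-- integrity is enforced again after the batch.
ALTER TABLE Facts.SomeFact
ADD CONSTRAINT FK_SomeFact_Product
FOREIGN KEY (DW_EK_Product) REFERENCES Dimensions.Product (DW_EK_Product);
```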
General index tuning strategies are outside the scope of this document. However, there are some guidelines that are worth highlighting in an OLAP environment.
These are:
There are some SQL commands you should never use from an Execute SQL Task. These are:
Forbidden T-SQL Statement | Reason |
USE <Database> | The name of the database should be configurable and is set in the connection manager not in the Execute SQL Task |
GO | You should instead make two or more Execute SQL tasks |
The pipeline is (generally) where you spend most of your execution time while running the ETL flows. Hence, it is especially important to properly utilize this powerful feature of Integration Services.
This section provides standards that you should apply to achieve proper pipelines usage.
Sorting, and avoiding it, is a very central part of any ETL flow which is why we must dedicate a full section to it.
Sorted data provide many benefits to the pipeline:
With all these advantages, it looks like we should be using sort almost exclusively (which is actually the case in many bad implementations).
However, sorting data has a price. The disadvantages of sorting are:
With these pitfalls of sorting, every sort you do should be analyzed for its potential impacts. Since you pay a high price for sort you should generally design your ETL flow in such a way that you take maximum advantage of a sorted input. Every transformation that needs the input to be sorted in a specific way should be executed while the input is still sorted.
With these things in mind, let’s move on to how you properly handle sorting.
When it comes to sorting, SQL Server is still faster than Integration Services. Furthermore, for sorting large data sets, Integration Services will simply be so slow that it is unworkable.
Hence, we recommend that you use SQL Server to perform your sort operations.
There are two ways you can perform a sort inside SQL Server:
In both these cases you must tell Integration Services that the data are leaving the database sorted (see below).
Even if you have a clustered index on a table you should still execute the ORDER BY clause in your SELECT statement to ensure that your output is correctly sorted independently of query optimization.
Example: The following data source forces the database to sort the data:
Figure 6 - A sorted data source
Executing the ORDER BY is not enough to make Integration Services treat the output as sorted; you must also change the metadata of the pipeline.
This is done in advanced properties of the data source transformation. You have to perform two steps for this to work:
Example: The following shows how you mark the output as sorted
Figure 7 - Marking the pipeline as sorted
Each key is then marked with its position in the sequence:
Figure 8 - Setting sort key position
A negative number indicates a descending sort. See SortKeyPosition in Book Online for more information.
When you execute an ORDER BY against SQL Server the database engine has two choices:
Seen from Integration Services the last option is of course the best – since this ensures that rows start flowing from the data source almost as soon as the package starts. However, to execute the last strategy SQL Server needs the correct indexes. In a warehouse, an index used to support a sort is often made the clustered index.
With this in mind, if you expect your table to be frequently accessed by a pipeline doing a sort you may want to put an index in place to support this.
You can easily distinguish between the two strategies by looking at the query plan for your SELECT statement. If you see the symbol shown in Figure 9, the query engine is using the first plan.
Figure 9 - The sort icon in a query plan
Example: The following query:
SELECT ProductId, StartDate, EndDate, StandardCost
FROM Production.ProductCostHistory
ORDER BY StartDate, ProductID
Has the query plan:
Notice how the sort is being done as part of the query. Either change the order of the columns in the ORDER BY or change the index to obtain the better query plan:
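One way to obtain the better plan is to align an index with the ORDER BY so the engine can stream rows without a sort operator. A sketch (the index name is illustrative, and whether to make it clustered or nonclustered should be evaluated against the table's existing keys):

```sql
-- Illustrative only: a covering index whose key order matches
-- the ORDER BY StartDate, ProductID, so no sort operator is needed.
CREATE NONCLUSTERED INDEX IX_ProductCostHistory_StartDate_ProductID
ON Production.ProductCostHistory (StartDate, ProductID)
INCLUDE (EndDate, StandardCost);
```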
There are cases where sorting is inevitable, and in those cases you must pay the price of sorting. Though you can move the cost between processes, for example by using SQL and building indexes instead of sorting in the pipeline, the price of the sort still has to be paid somewhere.
The art of sorting is about paying the price as few times as possible (zero times at the optimal). When you do pay the price for the sort – do so in a structured way.
Since sorting is such a precious property to maintain in the pipeline, it is necessary to know which transformations preserve it.
The following list describes the commonly used flow components and whether they maintain sorting or not.
Flow Component | Maintains sorting of buffer? |
Merge | Yes |
Merge Join | Yes |
Conditional Split | Yes |
Multicast | Yes |
Lookup | Yes |
Derived column and other synchronous | Yes |
Script component | Depends (synchronous will, async will not) |
Union All | No (note that multicast does maintain) |
Aggregate | No (!) |
Conclusions from the list above:
Lookups are one of the most efficient data transformations in SSIS. By making just a few configuration changes to this transformation – significant performance can be gained.
The following guidelines describe the correct configuration of a lookup transformation:
Figure 10 – Configuring a lookup
On script components you should always set the PreCompile property to true.
The aggregate component, like the sort, does not work well on large data sets. However, it is very handy for removing duplicates and for making aggregates that depend dynamically on the flow.
Use the component with care on large data sets and test properly before putting it into production. If the data set is too large, consider using a SQL based GROUP BY instead.
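A SQL based alternative to the aggregate component could look like the following sketch (the staging table and column names are illustrative):

```sql
-- Remove duplicates and pre-aggregate in the database instead of
-- in the SSIS pipeline; SQL Server handles large GROUP BYs better.
SELECT   CustomerKey,
         SUM(SalesAmount) AS SalesAmount,
         COUNT(*)         AS SourceRows
FROM     staging.SalesRows
GROUP BY CustomerKey;
```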
We often move quite a lot of data when reading from data sources. It is therefore important to perform this read efficiently.
The following best practices apply to data source adaptors:
The following illustrates a correctly configured data source:
Figure 11 - Configuring a data source
On data destination adaptors the following should be set:
Remember that a bulk lock (SQL lock type: BU) is compatible with other bulk locks. This means that two bulk operations can operate on the same table at the same time, speeding up the load.
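For example, two loads like the one below can run at the same time against the same heap table, because each BU lock is compatible with the other (the file path and table name are illustrative):

```sql
-- TABLOCK on a heap requests a bulk update (BU) lock,
-- which is compatible with other BU locks on the same table.
BULK INSERT dbo.FactSales
FROM 'D:\staging\factsales_part1.dat'
WITH (TABLOCK);
```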
In general you should use the OLE DB destination adaptor since it is compatible with remote development. However, if speed is very critical, the SQL Server destination is faster since it works over a shared memory connection (which requires the package to run on the database server itself).
When configuring packages the following best practices apply:
Use one configuration file per connection manager.
Rationale: Not all packages use all connection managers. Having one file per connection manager instead of one big file eliminates error messages from “missing” connection managers.
Use indirect configuration – i.e.: Use a system environment variable to point to the configuration file.
Rationale: This makes it much easier to change configuration file locations
Make a separate configuration file or separate filtering in configuration table for package-specific configuration. Interesting configurations include:
Use DontSaveSensitive as package security.
Rationale: With the other protection levels, sensitive data in a package is encrypted with a user- or machine-specific key. This can happen even when no sensitive data is embedded, and may render packages unusable when opened on another machine than the original. With DontSaveSensitive, sensitive values are instead supplied through configuration files or configuration tables.
The logging of SSIS packages provides important debug information, both at development time and when the package is put into production.
As a very minimum, you should log the following events:
We recommend that you log to a SQL Server table for ease of access and to perform post-run analysis and aggregation of run statistics. The logging server should be configured through an environment variable as described in section 10.
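With SQL Server logging enabled, SSIS 2008 writes to the dbo.sysssislog table by default. A sketch of a post-run duration analysis against that table could look like this:

```sql
-- Package duration per execution, based on the default SSIS log table.
SELECT   source                                         AS PackageName,
         executionid,
         MIN(starttime)                                 AS StartTime,
         MAX(endtime)                                   AS EndTime,
         DATEDIFF(second, MIN(starttime), MAX(endtime)) AS DurationSec
FROM     dbo.sysssislog
WHERE    event IN ('PackageStart', 'PackageEnd')
GROUP BY source, executionid
ORDER BY StartTime DESC;
```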
This section contains performance optimization tricks that did not fit into the previous best practice recommendations.
The design of the SSIS pipeline is generally very efficient and most ETL jobs will run quite fast.
However, on larger, multi-CPU machines you can often gain extra performance by redesigning the pipeline for more parallelism (thus taking advantage of the extra CPUs).
There are two design patterns that you can generally apply if your pipeline is not fast enough. They are described below.
Vertical parallelism is achieved by adding more buffers to an existing flow.
The Union All transformation can be used for this purpose since it is an asynchronous transformation. By adding a union all between two transformation components a new pipeline buffer is created by SSIS. Having more buffers will increase parallelism but at the cost of more memory.
Information about this technique can be found here:
Be aware that Union All removes the sorted property of a flow.
Horizontal parallelism is achieved by partitioning the input data and creating a pipeline for each partition.
The pattern for obtaining this performance optimization is:
The meta framework provides a SQL based queue data structure that is perfect for this purpose.
There are many good tools to monitor your SSIS packages and establish a baseline performance profile. You should generally perform some degree of monitoring to locate bad code and sanity check your installation in general.
This section describes some common approaches.
PerfMon is a powerful tool for monitoring your ETL flows. Bad code can be quickly located and corrected.
You can even schedule PerfMon collections during your batch window using the command line interface: logman.
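A sketch of such a scheduled collection with logman (the counter set name, sample interval and output path are illustrative):

```shell
:: Create a counter collection sampling every 15 seconds
logman create counter SSIS_Baseline ^
  -c "\Processor(_Total)\% Processor Time" "\PhysicalDisk(_Total)\Avg. Disk sec/Transfer" ^
  -si 00:00:15 -o C:\PerfLogs\SSIS_Baseline

:: Start and stop it around the batch window, e.g. from scheduled tasks
logman start SSIS_Baseline
logman stop SSIS_Baseline
```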
The table below lists some commonly used performance counters and their “comfort zone”. As a minimum you should monitor one full data flow against the conditions below and correct as appropriate.
Counter | Comfort Zone | Effects when outside zone | Possible Resolutions |
SQL Server : Locks Locks / Sec | < 10K | Lock timeouts and unstable queries | Use TABLOCK hints |
SQL Server : Access Methods Page Splits / Sec | < 300 | I/O system taking unnecessary hits. Instability due to timeouts | Optimize indexing. Consider dropping indexes before bulk loading |
Physical Disk / Avg. Disk Sec / Transfer | < 0.020 (less if enterprise) | Timeouts and the “Communication Link Failure” error message | Rewrite queries or add indexes for better query plans. Investigate I/O subsystem |
Processor / % Processor Time | > 50% | Flow not running as fast as it could | Use more parallelism Investigate I/O system |
SQL Server : Transactions / Free Space In TempDb (KB) | No drops larger than 20% of total space in TempDb | Code might be doing inefficient joining or sorting | Investigate code for bad query plans |
SQL Server : Databases / Log File Used (KB) | < size of largest table in database | Excessive logging and running out of disk space | Put database in simple recovery mode. Consider rewriting SQL to use the pipeline instead |
SQL Server : Databases Log Growths | = 0 | Log files is growing and fragmenting causing performance degradation and possible running disk full | Allocate a log file that is big enough |
Processes / Working Set (All DTExec processes) | < available physical memory | SSIS being paged out or causing general instability of system | Rewrite SSIS code to eliminate bad memory usage. Use 64-bit computing and add more memory |
SQL Server : Memory Manager / Target Server Memory And Total Server Memory | Should match | SQL Server is releasing memory to OS or not getting enough memory from other, blocking applications | Set SQL Server to Lock Pages in memory Set a minimum memory size in SQL Server |
SQL Server : SQL Errors Errors / Sec / _Total | = 0 | Your flow might be ignoring warnings from the database | Inspect flow parts that have errors > 0 and locate possible problems in SQL code |
Processor / Interrupts / Sec | < 2000 | Driver or service on machine is misbehaving | Locate the driver or service by shutting down components one at a time until interrupts stop |
You can use SQL Profiler to trace all the SQL statements coming to the server. The template “Tuning” is perfect for this purpose (you may want to extend it with start and end times).
If you trace to a file or table – be aware that SQL Profiler traces can quickly grow large. Here are some suggestions for keeping files small: