SSIS Best Practices
Posted by Sunil Reddy Enugala on July 28, 2011
1. In practice, Lookup transformations slow down considerably with huge reference tables (millions of rows). It is better to use Lookups against tables with smaller amounts of data, and for large tables to use an inner join in the source query itself, with proper indexing, for better performance.
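As a rough sketch (the table and column names here are made up, not from the article), the Lookup can be replaced by pushing the join into the source query:

```sql
-- Hypothetical staging and reference tables; an index on
-- DimCustomer.CustomerCode keeps the join cheap even at millions of rows.
SELECT s.SaleID,
       s.SaleAmount,
       d.CustomerKey               -- the value the Lookup would have returned
FROM   dbo.StageSales AS s
INNER JOIN dbo.DimCustomer AS d
        ON d.CustomerCode = s.CustomerCode;
```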
2. Use the SQL command option to select only the needed columns from the lookup table, rather than the table option, which selects all the columns. This reduces the amount of buffer memory used by the lookup cache.
3. Based on the type of the incoming and referenced data, choose the proper caching mode: Partial or Full. If only a small portion of the reference data is hit repeatedly by the incoming rows, use Partial mode and specify the memory constraint. Use Full mode when the reference table is small and is referenced fully by the incoming data.
4. Use a proper folder structure under a root folder to store the project, package configurations, logs, error files etc.
5. Avoid, as much as possible, hard-coding development-environment settings (folder paths, filenames, server names, logins, passwords, etc.) in the package. This keeps the package environment-independent.
6. It’s a best practice to read all the package configuration parameters from a configuration table where all the parameters can be stored centrally.
7. Try to build all of the configuration values, such as folder paths and connection strings, from this central list of configuration parameters.
8. Make maximum use of SSIS expressions for assigning and calculating property values for each component in the package. Property values calculated this way can draw on configured variables and keep the package machine-independent.
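For example (the variable names here are illustrative, not from the article), an expression on a flat file connection manager's ConnectionString property might be:

```
@[User::RootFolder] + "\\Input\\" + @[User::FileName]
```

Changing the User::RootFolder configuration value then repoints the package to a new machine or folder without editing the package itself.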
9. Build the ConnectionString property for each connection manager dynamically in this way, and don't hard-code passwords and other login information.
10. It's a best practice to determine the logging requirements pragmatically. SSIS provides a huge number of events that can be logged during package execution. While the log may provide good insight into the package execution, too much file writing can slow down performance, particularly when there are many components and a huge data set is being worked on.
11. It's a very good practice to identify all the logical units of tasks and mark them 'started' as each task starts and 'completed' or 'failed' as it completes or fails. Each SSIS component has a state (success, completed, or failed) that can be tracked, and based on which the corresponding marking can be done.
12. Variable scoping should follow the least-privilege rule: scope each variable to the lowest container possible.
13. Using dynamic SQL through variables (by enabling the 'SQL command from variable' option) is also a very good practice when similar SQL statements are required based on some parameters. This avoids unnecessary components and produces less code.
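A sketch of the idea, with made-up variable names: set EvaluateAsExpression=True on a string variable, build the statement from the parameter variables, then point the Execute SQL task (or OLE DB source) at that variable:

```
"DELETE FROM dbo." + @[User::TableName]
  + " WHERE LoadDate = '" + @[User::LoadDate] + "'"
```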
14. ADO.NET connections are a better option when executing SQL statements (e.g., via the Execute SQL task). ADO.NET connections generally work faster than OLEDB connections when working on a data set from the package.
15. It's a good practice to use Sequence containers to group related tasks that are performed in sequence. This also gives clear visibility and defines the parallel execution flow of the components when these containers are joined in parallel.
16. Using containers to group related units of work is also helpful in enforcing transactional control flow.
17. For complex logic, try to use the SSIS Script task or custom components rather than database stored procedures or complex SQL queries. While custom components have the advantage of being more reusable than Script tasks, the latter are simpler to write and may save development time if the logic is not used frequently.
18. The performance of the SQL command (OLE DB Command) component is exceptionally low when used with large data sets, because it runs once per row. For updates and similar tasks, it's a better option to employ set-based updates, processing the data in small chunks. The Execute SQL task may also be used for this purpose.
19. A universal best practice is to minimize the data set as much as possible to reduce overhead. Pointers for achieving this include:
· Filter out the data as much as possible at the source itself by selecting only the needed columns and applying other constraints (if any).
· Remove unused columns from the SSIS package. Warnings are logged for any unused columns in the data flow tasks.
· Use Conditional Split transformations to filter out unneeded data as close to the source as the data flow logic permits.
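Putting those pointers together, a source query (hypothetical table and columns) would select only what the data flow needs and filter at the database:

```sql
-- Only the needed columns, with the row filter pushed to the source
SELECT OrderID, CustomerID, OrderDate, Amount
FROM   dbo.Orders
WHERE  OrderDate >= '2011-01-01'
  AND  OrderStatus = 'A';
```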
20. Avoid blocking and memory-intensive transformations (Aggregate, Merge Join, Sort, etc.) wherever the data flow logic permits.
21. The SSIS Aggregate transformation normally performs very well. But since it is memory-intensive, it's always recommended to aggregate on integer columns such as keys and IDs rather than string values, and to look up the string values after the aggregation.
22. To avoid transaction-log explosion, it's always preferable to specify the batch and commit size in OLEDB destinations while inserting the data. Normally incremental loads are not very large while history loads are, so specify these parameters only for the latter.
23. If the destination table is not used by other processes during the load, enabling the 'Table lock' option will improve performance by enabling the fast load. You may also disable the 'Check constraints' option for better performance if the data rules permit.
24. If the data comes pre-sorted from the source, it is highly desirable to set the IsSorted property to TRUE. This majorly improves performance where sorting is required later in the data flow.
25. Specify the BufferTempStoragePath property for SSIS data flows that are likely to use memory intensively. This property gives the buffer pool somewhere to spill if DTExec runs short on virtual memory.
26. When dealing with huge data, increase the values of DefaultBufferMaxSize and DefaultBufferMaxRows to an optimal level. This reduces the number of buffers moving through the data flow, but increasing them too much may also degrade performance.
27. When dealing with flat file sources, using the FastParse option gives better performance.
28. Classified and intelligent use of the integrated Microsoft .NET framework environment may also help produce structured and better-performing code. For example, in adCenterBI a customized SSIS library (SSISLib) was written to load the same set of tables into different databases by providing different parameters. This allows extensive code reusability as well as better performance.
29. Set RetainSameConnection=TRUE on the connection manager and set DelayValidation=TRUE on all tasks that use the temp table. Setting RetainSameConnection=TRUE ensures that the temp table does not disappear when the task that creates it has completed. Setting DelayValidation=TRUE ensures that the tasks do not try to check that the table exists before it is created.
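A minimal sketch of the temp-table pattern, assuming two Execute SQL tasks on the same connection manager (object names invented here):

```sql
-- Execute SQL Task 1 (connection manager has RetainSameConnection=TRUE,
-- so the session, and therefore #Staging, survives past this task)
CREATE TABLE #Staging (ID int, Amount money);

-- Execute SQL Task 2 (same connection manager, DelayValidation=TRUE),
-- which would otherwise fail validation because #Staging does not
-- exist until run time
INSERT INTO #Staging (ID, Amount) VALUES (1, 100.00);
```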
30. In the Derived Column component it is not possible to replace a value in a column with a value of a different type; add a new column with the cast instead.
31. Synchronous components output the same number of rows as in the input. Asynchronous components may not.
32. Synchronous components output data in the same buffer as is input. Asynchronous components use a different buffer.
33. If you get a warning on an OLE DB Source component when connecting to Oracle you can probably fix it by setting AlwaysUseDefaultCodePage=‘True’
34. You can change the default type mappings that SSIS (and the Import Export wizard) uses for connecting to external sources by editing the XML files at C:\Program Files\Microsoft SQL Server\90\DTS\MappingFiles
35. If you are dynamically setting connection strings of connection managers at runtime, you may need to set DelayValidation=TRUE on the tasks that use those connection managers.
36. When storing connection manager properties in a configuration file you only need to store the ConnectionString property as this is an amalgamation of all the other properties.
37. SSIS will not store passwords for you in a configuration file as this is considered a security risk. You need to manually edit the configuration file yourself to enter the password.
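A sketch of what the edited .dtsConfig entry might look like (the connection name and values are invented; the real file also contains a generated header element):

```xml
<DTSConfiguration>
  <Configuration ConfiguredType="Property"
                 Path="\Package.Connections[SourceDB].Properties[ConnectionString]"
                 ValueType="String">
    <!-- Password appended by hand: SSIS strips it when it writes the file -->
    <ConfiguredValue>Data Source=MyServer;Initial Catalog=MyDB;Provider=SQLNCLI10.1;User ID=etl_user;Password=YourPasswordHere;</ConfiguredValue>
  </Configuration>
</DTSConfiguration>
```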
38. Setting IsSorted=TRUE on a component output does not sort the data for you, it only tells the engine that the data is sorted. Only set this property to be TRUE if you know for sure that the data is sorted.
39. When using OpenRowset Using FastLoad in the OLE DB Destination, the component will attempt to commit records in batches, the size of which is determined by the FastLoadMaxInsertCommitSize property. If you have set the component to divert failed rows then in the case of an insertion failure all rows in the batch will be diverted to the error output. Hence, you are likely to see perfectly valid rows in the error output.
40. You need to set DelayValidation=TRUE on any data-flow that contains a Raw File Source adapter.
41. Sometimes when developing you need to set DisableEventHandlers=TRUE in order to unit test some functionality. It is easy to forget to set it back to FALSE.
42. When accessing some relational data sources (particularly Oracle) you need to set AlwaysUseDefaultCodePage=TRUE in order to suppress a warning.
43. Setting BypassPrepare=TRUE on the Execute SQL task is sometimes a prerequisite to your SQL statement being validated.
44. In order for a package that uses checkpoints to restart from the previously failing task (rather than the subsequent task), the task in question needs to have FailPackageOnFailure=TRUE.
45. Packages that have ProtectionLevel=’DontSaveSensitive’ need to use configurations in order for passwords to be used at execution-time
46. Columns in the data-flow that are not used will cause warnings at execution-time. Those warnings are useful, but annoying.
47. You need to manually set IsSorted=TRUE if you know that data in a data-path is sorted and there is no SORT component in the data-path. Setting IsSorted=TRUE will not sort the data for you.
48. The Excel 2007 default output format for the SSIS Excel Destination is Excel Binary Format (.xlsb). If you’re creating a new Excel file, and have given it an .xlsx extension, you’ll get the following error when trying to open it in Excel:
Excel cannot open the file ‘xxx.xlsx’ because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
If you want the Excel Destination to output a standard .xlsx file (Excel XML format),
you'll need to tweak your Excel connection manager's ConnectionString property.
By default it will look something like this:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=c:\path\xxx.xlsb;Extended Properties="Excel 12.0;HDR=YES";
Changing "Excel 12.0" to "Excel 12.0 Xml" tells the provider to output in .xlsx format instead.
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=c:\path\xxx.xlsx;Extended Properties="Excel 12.0 Xml;HDR=YES";
49. Avoid using components unnecessarily. For example:
· Step 1. Declare the variable varServerDate.
· Step 2. Use ExecuteSQLTask in the control flow to execute a SQL query to get the server date-time and store it in the variable
· Step 3. Use the dataflow task and insert/update database with the server date-time from the variable varServerDate.
This sequence is advisable only in cases where the time difference between step 2 and step 3 really matters. If it doesn't, just use the GETDATE() function at step 3, as shown below:
-- create table #table1 (Lap_Id int, LAP_Date datetime)
INSERT INTO #table1 (Lap_Id, LAP_Date) VALUES (1, GETDATE())
50. Calling a child package multiple times from a parent with different parameter values.
When a child package is executed from a master package, the parameters that are passed from the master need to be configured in the child package. For this, you can use the ‘Parent Package Configuration’ option in the child package. But, for using the ‘Parent Package Configuration’, you need to specify the name of the ‘Parent Package Variable’ that is passed to the child package. If you want to call the same child package multiple times (each time with a different parameter value), declare the parent package variables (with the same name as given in the child package) with a scope limited to ‘Execute Package Tasks’.
SSIS allows declaring variables with the same name but the scope limited to different tasks – all inside the same package!
51. SQL job with many atomic steps.
For a SQL job that calls SSIS packages, create multiple steps, each doing a small task, rather than a single step doing all the tasks. With a single big step, the transaction log grows too large, and if a rollback happens, it may consume all the processing space of the server.
52. Avoid unnecessary typecasts.
Avoid unnecessary type casts. For example, the flat file connection manager, by default, uses the string [DT_STR] data type for all columns. If you want the actual data types, you have to change them manually. It is better to change them at the source level itself to avoid unnecessary type casts.
53. Transactions on large data sets.
Usually, ETL processes handle large volumes of data. In such a scenario, do not attempt a transaction on the whole package logic. However, SSIS supports transactions, and it is advisable to use them where the atomicity of the operation must be guaranteed.
For example, consider a scenario where a source record is to be split into 25 records at the target. Using a transaction, we can ensure that either all 25 records reach the destination or none do.
54. Distributed transaction spanning multiple tasks.
The control flow of an SSIS package threads together various control tasks. It is possible to have a transaction span multiple tasks using the same connection. To enable this, set the connection manager's RetainSameConnection property to "True".
55. Limit the package names to a maximum of 100 characters.
When an SSIS package whose name exceeds 100 characters is deployed to SQL Server, the name is trimmed to 100 characters, which may cause an execution failure. So, limit package names to a maximum of 100 characters.
56. Select * from…
Make sure that you are not passing any unnecessary columns from the source downstream. With the OLEDB source, the 'Table or view' data access mode is equivalent to 'SELECT * FROM <TABLE_NAME>', which fetches all the columns. Use 'SQL command' to fetch only the required columns and pass those downstream. At each downstream component, filter out the unnecessary columns.
57. Sort at the database level.
Sorting in SSIS is a time-consuming operation. Where we know the data is coming from database tables, it's better to perform the sorting in the database query itself.
58. Excel Source and 64-bit runtime.
The Excel source and Excel connection manager work only with the 32-bit runtime. When a package using the Excel source runs with the 64-bit runtime (the default), it will fail on the production server. Go to the solution's property pages > Debugging and set Run64BitRuntime to False.
59. On failure of a component, stop/continue the execution with the next component.
When a component fails, the FailParentOnFailure property can be used effectively either to stop the package execution or to continue with the next component. Exception: to stop/continue with the next component inside a Sequence container, the value of the precedence constraint connecting the components should be set to "Completion", and FailParentOnFailure should be left False (the default).
60. Package protection level.
To avoid most package deployment errors when moving from one system to another, set the package protection level to 'DontSaveSensitive'.
61. Copy pasting the Script component.
Once you copy-paste a script component and execute the package, it may fail. Just open the script editor of the pasted script component, save the script, and execute the package – it will work.
62. Configuration filter – Use as a filter.
It is a best practice to use the package name as the configuration filter for all configuration items specific to that package. This is especially useful when there are many packages with package-specific configuration items. For configuration items common to many packages, use a generic filter name.
63. Optimal use of configuration records.
Avoid recording the same configuration item under different filter/object names. For example, if two packages use the same connection string, you need only one configuration record. To enable this, use the same name for the connection manager in both packages, and use a generic configuration filter. This is quite convenient when porting from one environment to another (e.g., from UAT to production).
63 is not the end point… I'll add more whenever I get a new one…
Let me know your comments on this article…