Tuesday, August 9, 2011

Joiner Transformation Overview


Joiner Transformation Overview
Transformation type:
Active
Connected
Use the Joiner transformation to join source data from two related heterogeneous sources residing in different locations or file systems. You can also join data from the same source. The Joiner transformation joins sources with at least one matching column. The Joiner transformation uses a condition that matches one or more pairs of columns between the two sources.
The two input pipelines include a master pipeline and a detail pipeline or a master and a detail branch. The master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target.
To join more than two sources in a mapping, join the output from the Joiner transformation with another source pipeline. Add Joiner transformations to the mapping until you have joined all the source pipelines.
following limitations on the pipelines you connect to the Joiner transformation:
·         You cannot use a Joiner transformation when either input pipeline contains an Update Strategy transformation.
·         You cannot use a Joiner transformation if you connect a Sequence Generator transformation directly before the Joiner transformation.
you must configure the transformation properties, join type, and join condition.
You can configure the Joiner transformation for sorted input to improve Integration Service performance.
Joiner transformation properties:
Ø  Case-Sensitive String Comparison
Ø  Cache Directory
Ø  Join Type
Ø  Null Ordering in Master: Not applicable for this transformation type.
Ø  Null Ordering in Detail: Not applicable for this transformation type.
Ø  Tracing Level
Ø  Joiner Data Cache Size
Ø  Joiner Index Cache Size
Ø  Sorted Input
Ø  Master Sort Order
Ø  Transformation Scope

To improve performance for an unsorted Joiner transformation, use the source with fewer rows as the master source. To improve performance for a sorted Joiner transformation, use the source with fewer duplicate key values as the master.
By default, when you add ports to a Joiner transformation, the ports from the first source pipeline display as detail sources. Adding the ports from the second source pipeline sets them as master sources. To change these settings, click the M column on the Ports tab for the ports you want to set as the master source. This sets ports from this source as master ports and ports from the other source as detail ports.

Use one or more ports from the input sources of a Joiner transformation in the join condition. Additional ports increase the time necessary to join two sources. The order of the ports in the condition can impact the performance of the Joiner transformation. If you use multiple ports in the join condition, the Integration Service compares the ports in the order you specify.
The Designer validates datatypes in a condition. Both ports in a condition must have the same datatype. If you need to use two ports in the condition with non-matching datatypes, convert the datatypes so they match.
If you join Char and Varchar datatypes, the Integration Service counts any spaces that pad Char values as part of the string:
Char(40) = "abcd"
Varchar(40) = "abcd"
The Char value is “abcd” padded with 36 blank spaces, and the Integration Service does not join the two fields because the Char field contains trailing spaces.
Note: The Joiner transformation does not match null values. For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service does not consider them a match and does not join the two rows. To join rows with null values, replace null input with default values, and then join on the default values.
Note: A normal or master outer join performs faster than a full outer or detail outer join.
If you pass unsorted or incorrectly sorted data to a Joiner transformation configured to use sorted data, the session fails and the Integration Service logs the error in the session log file.
Do not place any of the following transformations between the sort origin and the Joiner transformation:

Ø  Custom

Ø  Unsorted Aggregator


Ø  Normalizer

Ø  Rank


Ø  Union transformation

Ø  XML Parser transformation


Ø  XML Generator transformation

Ø  Mapplet, if it contains one of the above transformations
Joining Two Branches of the Same Pipeline
When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the source qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input.
For example, you have a source with the following ports:
*
Employee
*
Department

*
Total Sales
In the target, you want to view the employees who generated sales that were greater than the average sales for their departments. To do this, you create a mapping with the following transformations:
*
Sorter transformation. Sorts the data.
*
Sorted Aggregator transformation. Averages the sales data and group by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.

*
Sorted Joiner transformation. Uses a sorted Joiner transformation to join the sorted aggregated data with the original data.
*
Filter transformation. Compares the average sales data against sales data for each employee and filter out employees with less than above average sales.
Source->src_source->str_src->.|……pipeline1…….Jnr->fil->trg
                                                     |   pipeline 2            |
                                                     Agg|
Joining two branches might impact performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.
Joining Two Instances of the Same Source
You can also join same source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances. If you want to join unsorted data, you must create two instances of the same source and join the pipelines.
Note: When you join data using this method, the Integration Service reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.

Guidelines

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:
*
Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.
*
Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.

*
Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
*
Join two instances of a source if one pipeline may process slower than the other pipeline.

*
Join two instances of a source if you need to join unsorted data.
Perform joins in a database when possible.
Performing a join in a database is faster than performing a join in the session. In some cases, this is not possible, such as joining tables from two different databases or flat file systems. If you want to perform a join in a database, use the following options:
·         Create a pre-session stored procedure to join the tables in a database.
·         Use the Source Qualifier transformation to perform the join.

1 comment:

  1. If I configure the Cache Size to 10KB and source the data of 50KB what happens?
    Q1. Does the session fail or does it create 5 different files of 10KB each and slow down the performance ?
    Q2. In case it creates 5 different files of 10KB, does that mean, that the value i defined as the cache size is the max size of ONE cache file and its not the limit of the complete cache size ?

    ReplyDelete