Informatica Depth Analysis: August 2011

Joiner Transformation Overview

Joiner Transformation Overview

Transformation type:

Active

Connected

Use the Joiner transformation to join source data from two related heterogeneous sources residing in different locations or file systems. You can also join data from the same source. The Joiner transformation joins sources with at least one matching column. The Joiner transformation uses a condition that matches one or more pairs of columns between the two sources.

The two input pipelines include a master pipeline and a detail pipeline or a master and a detail branch. The master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target.

To join more than two sources in a mapping, join the output from the Joiner transformation with another source pipeline. Add Joiner transformations to the mapping until you have joined all the source pipelines.

following limitations on the pipelines you connect to the Joiner transformation:

· You cannot use a Joiner transformation when either input pipeline contains an Update Strategy transformation.

· You cannot use a Joiner transformation if you connect a Sequence Generator transformation directly before the Joiner transformation.

you must configure the transformation properties, join type, and join condition.

You can configure the Joiner transformation for sorted input to improve Integration Service performance.

Joiner transformation properties:

Ø Case-Sensitive String Comparison

Ø Cache Directory

Ø Join Type

Ø Null Ordering in Master: Not applicable for this transformation type.

Ø Null Ordering in Detail: Not applicable for this transformation type.

Ø Tracing Level

Ø Joiner Data Cache Size

Ø Joiner Index Cache Size

Ø Sorted Input

Ø Master Sort Order

Ø Transformation Scope

To improve performance for an unsorted Joiner transformation, use the source with fewer rows as the master source. To improve performance for a sorted Joiner transformation, use the source with fewer duplicate key values as the master.

By default, when you add ports to a Joiner transformation, the ports from the first source pipeline display as detail sources. Adding the ports from the second source pipeline sets them as master sources. To change these settings, click the M column on the Ports tab for the ports you want to set as the master source. This sets ports from this source as master ports and ports from the other source as detail ports.

Use one or more ports from the input sources of a Joiner transformation in the join condition. Additional ports increase the time necessary to join two sources. The order of the ports in the condition can impact the performance of the Joiner transformation. If you use multiple ports in the join condition, the Integration Service compares the ports in the order you specify.

The Designer validates datatypes in a condition. Both ports in a condition must have the same datatype. If you need to use two ports in the condition with non-matching datatypes, convert the datatypes so they match.

If you join Char and Varchar datatypes, the Integration Service counts any spaces that pad Char values as part of the string:

Char(40) = "abcd"

Varchar(40) = "abcd"

The Char value is “abcd” padded with 36 blank spaces, and the Integration Service does not join the two fields because the Char field contains trailing spaces.

Note: The Joiner transformation does not match null values. For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service does not consider them a match and does not join the two rows. To join rows with null values, replace null input with default values, and then join on the default values.

Note: A normal or master outer join performs faster than a full outer or detail outer join.

If you pass unsorted or incorrectly sorted data to a Joiner transformation configured to use sorted data, the session fails and the Integration Service logs the error in the session log file.

Do not place any of the following transformations between the sort origin and the Joiner transformation:

	Ø Custom
	Ø Unsorted Aggregator

	Ø Normalizer
	Ø Rank

	Ø Union transformation
	Ø XML Parser transformation

	Ø XML Generator transformation
	Ø Mapplet, if it contains one of the above transformations

Joining Two Branches of the Same Pipeline

When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the source qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input.

For example, you have a source with the following ports:

	Employee
	Department

Total Sales

In the target, you want to view the employees who generated sales that were greater than the average sales for their departments. To do this, you create a mapping with the following transformations:

Sorter transformation. Sorts the data.

Sorted Aggregator transformation. Averages the sales data and group by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.

	Sorted Joiner transformation. Uses a sorted Joiner transformation to join the sorted aggregated data with the original data.
	Filter transformation. Compares the average sales data against sales data for each employee and filter out employees with less than above average sales.

Source->src_source->str_src->.|……pipeline1…….Jnr->fil->trg

| pipeline 2 |

Agg|

Joining two branches might impact performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.

Joining Two Instances of the Same Source

You can also join same source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances. If you want to join unsorted data, you must create two instances of the same source and join the pipelines.

Note: When you join data using this method, the Integration Service reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.

Guidelines

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:

	Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.
	Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.

	Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
	Join two instances of a source if one pipeline may process slower than the other pipeline.

Join two instances of a source if you need to join unsorted data.

Perform joins in a database when possible.

Performing a join in a database is faster than performing a join in the session. In some cases, this is not possible, such as joining tables from two different databases or flat file systems. If you want to perform a join in a database, use the following options:

· Create a pre-session stored procedure to join the tables in a database.

· Use the Source Qualifier transformation to perform the join.

Expression Transformation Overview

Expression Transformation Overview

Transformation type:

Passive

Connected

Use the Expression transformation to calculate values in a single row. For example, you might need to adjust employee salaries, concatenate first and last names, or convert strings to numbers. You can also use the Expression transformation to test conditional statements before you pass the results to a target or other transformations.

Use the Expression transformation to perform non-aggregate calculations. To perform calculations involving multiple rows, such as sums or averages, use the Aggregator transformation.

The naming convention for an Expression transformation is EXP_TransformationName.

You can enter multiple expressions in a single Expression transformation by creating an expression for each output port.

Note: After you make the transformation reusable, you cannot copy ports from the source qualifier or other transformations. You can create ports manually within the transformation.

Aggregator Transformation Overview

Aggregator Transformation Overview

The Aggregator transformation performs aggregate calculations, such as averages and sums.

The Integration Service performs aggregate calculations as it reads and stores data group and row data in an aggregate cache

The Aggregator transformation to perform calculations on groups. The Expression transformation permits you to perform calculations on a row-by-row basis only.

After you create a session that includes an Aggregator transformation, you can enable the session option, Incremental Aggregation. When the Integration Service performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform new aggregation calculations incrementally.

The Aggregator is an active transformation, changing the number of rows in the pipeline. The Aggregator transformation has the following components and options:

Aggregate cache. The Integration Service stores data in the aggregate cache until it completes aggregate calculations. It stores group values in an index cache and row data in the data cache

Aggregate expression. Enter an expression in an output port. The expression can include non-aggregate expressions and conditional clauses.

Group by port. Indicate how to create groups. The port can be any input, input/output, output, or variable port. When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.

Sorted input. Select this option to improve session performance. To use sorted input, you must pass data to the Aggregator transformation sorted by group by port, in ascending or descending order.

Aggregator Transformation Properties

· Cache Directory

· Tracing Level

· Sorted Input:

· Aggregator Data Cache Size

· Aggregator Index Cache Size

· Transformation Scope

Sorted Input: Indicates input data is presorted by groups. Select this option only if the mapping passes sorted data to the Aggregator transformation.

Aggregate Expressions

The Designer allows aggregate expressions only in the Aggregator transformation. An aggregate expression can include conditional clauses and non-aggregate functions. It can also include one aggregate function nested within another aggregate function

Nested Aggregate Functions

When you include single-level and nested functions in the same Aggregator transformation, the Designer marks the mapping or mapplet invalid. If you need to create both single-level and nested functions, create separate Aggregator transformations.

Conditional Clauses

Use conditional clauses in the aggregate expression to reduce the number of rows used in the aggregation. The conditional clause can be any clause that evaluates to TRUE or FALSE.

For example, use the following expression to calculate the total commissions of employees who exceeded their quarterly quota:

SUM( COMMISSION, COMMISSION > QUOTA )

Non-Aggregate Functions

You can also use non-aggregate functions in the aggregate expression.

The following expression returns the highest number of items sold for each item (grouped by item). If no items were sold, the expression returns 0.

IIF( MAX( QUANTITY ) > 0, MAX( QUANTITY ), 0))

Null Values in Aggregate Functions

When you configure the Integration Service, you can choose how you want the Integration Service to handle null values in aggregate functions. You can choose to treat null values in aggregate functions as NULL or zero. By default, the Integration Service treats null values as NULL in aggregate functions.

If you use sorted input and do not presort data correctly, you receive unexpected results.

Sorted Input Conditions

Do not use sorted input if either of the following conditions are true:

· The aggregate expression uses nested aggregate functions.

· The session uses incremental aggregation.

· Source data is data driven.

If you use sorted input and do not sort data correctly, the session fails.

Use sorted input to decrease the use of aggregate caches.

Sorted input reduces the amount of data cached during the session and improves session performance. Use this option with the Sorter transformation to pass sorted data to the Aggregator transformation.

Limit connected input/output or output ports.

Limit the number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.

Filter the data before aggregating it.

If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to reduce unnecessary aggregation.

Labels

Tuesday, August 9, 2011