Thursday, December 17, 2009

Lookup Cache in Informatica

Will an index on a column help in performance of a lookup

Let's say we are doing a lookup on the EMPLOYEE table, which has the columns Empno, Ename and Salary.
Now we want to check whether the Empno is already present: if it is, we update the row, otherwise we insert it. So we do a lookup on the table.

The data returned by the lookup query is stored in the cache (an index cache and a data cache), and each record from the source is looked up against this cache. The comparison on the condition port column is therefore done in the Informatica lookup cache and not in the database. Because of this, any index created in the database has no effect on the performance of a cached lookup.

Can we replicate the same index in the lookup cache? We don't need to. PowerCenter creates the index and data caches for the lookup automatically. In our case, the condition port column "EMPNO" is hashed and indexed in the index cache, and the remaining columns are stored in the data cache.


Now let's consider another case: a lookup with caching disabled. In this kind of lookup there is no cache; every time a row reaches the lookup, the SQL is executed against the database. Here a database index can help. But if the performance of the lookup is a problem, the cache-less lookup itself may be the problem.

We should go for a cache-less lookup only if the number of source records is smaller than the number of records in the lookup table. In that case indexing the condition ports will help; everywhere else it is merely a matter of luck whether the database picks up the index.

Putting a WHERE condition in the lookup SQL override, so that only the minimum required rows are fetched into the lookup, definitely adds to the performance and should always be taken care of.
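For example, a lookup SQL override along these lines (the EMPLOYEE table, the ACTIVE_FLAG filter and the port list are only illustrative) caches just the rows the mapping can actually match. In many PowerCenter versions the trailing "--" suppresses the ORDER BY clause the Integration Service would otherwise append:

SELECT EMPNO,
       ENAME,
       SALARY
FROM   EMPLOYEE
WHERE  ACTIVE_FLAG = 'Y'
ORDER BY EMPNO --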

Happy caching !!

Friday, December 4, 2009

How do you run a full DAC load

How do you run a full load (not incremental)

If DAC is running incremental everyday and due to some reason you want to do a full load then

You need to reset the Data Warehouse. In the DAC Client (ver 7.9.5), click on "Tools" --> "ETL Management" --> "Reset Data Warehouse". It will prompt:

This action will cause all tables that are loaded during
next ETL to be truncated.

Do you want to proceed? >> Yes

This will truncate all the tables, drop indexes, do a full load of dimensions and facts, then recreate indexes and compute stats.


What 'Reset Data Warehouse' actually does is update last_refresh_dt in w_etl_refresh_dt to NULL.
The other way to do it is to go to the metadata tables and set last_refresh_dt to NULL yourself:



select row_wid, last_upd, dbconn_wid, ep_wid, table_name, last_refresh_dt, total_count, curr_process_count, version_id from w_etl_refresh_dt;



update w_etl_refresh_dt set last_refresh_dt=NULL;
commit;


This will do a full DAC load.

Oracle Tuning

The optimizer mode can be set at the system-wide level, for an individual session, or for a specific SQL statement:

alter system set optimizer_mode=first_rows_10;
alter session set optimizer_goal = all_rows;
select /*+ first_rows(100) */ * from student;

We need to start by defining what the "best" execution plan for a SQL statement is. Is the best execution plan the one that begins to return rows the fastest, or the one that executes with the smallest amount of computing resources? Of course, the answer depends on the processing needs of your database.


The choices of execution plans made by the CBO are only as good as the statistics available to it. The old-fashioned analyze table and dbms_utility methods for generating CBO statistics are obsolete and somewhat dangerous to SQL performance. As we may know, the CBO uses object statistics to choose the best execution plan for all SQL statements.


The dbms_stats utility does a far better job in estimating statistics, especially for large partitioned tables, and the better statistics result in faster SQL execution plans.


Here is a dbms_stats example that creates histograms on all indexed columns:
BEGIN
  dbms_stats.gather_schema_stats(
    ownname          => 'TPCC',
    method_opt       => 'FOR ALL INDEXED COLUMNS SIZE SKEWONLY',
    cascade          => TRUE,
    estimate_percent => 100);
END;
/

There are several values for the OPTIONS parameter that we need to know about:

GATHER - reanalyzes the whole schema

GATHER EMPTY - only analyzes tables that have no existing statistics

GATHER STALE - only reanalyzes tables with more than 10 percent modifications (inserts, updates, deletes)

GATHER AUTO - will reanalyze objects that currently have no statistics and objects with stale statistics. Using GATHER AUTO is like combining GATHER STALE and GATHER EMPTY.
Note that both GATHER STALE and GATHER AUTO require monitoring. If you issue the ALTER TABLE XXX MONITORING command, Oracle tracks changed tables with the dba_tab_modifications view.
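As a sketch (the table and schema names are just placeholders), combining monitoring with GATHER AUTO might look like this:

-- Enable modification tracking (explicit in 9i; automatic from 10g onward)
ALTER TABLE emp MONITORING;

-- Reanalyze only objects that are empty or stale
BEGIN
  dbms_stats.gather_schema_stats(
    ownname => 'TPCC',
    options => 'GATHER AUTO',
    cascade => TRUE);
END;
/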



SQL> desc dba_tab_modifications;



Automating the sample size with dbms_stats. The better the quality of the statistics, the better the job the CBO will do when determining your execution plans. Unfortunately, doing a complete analysis on a large database could take days, so most shops must sample their database to get CBO statistics. The goal is to take a large enough sample of the database to provide top-quality data for the CBO.
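One way to let Oracle choose the sample size itself is dbms_stats.auto_sample_size; this sketch reuses the TPCC schema from the earlier example:

BEGIN
  dbms_stats.gather_schema_stats(
    ownname          => 'TPCC',
    estimate_percent => dbms_stats.auto_sample_size,  -- Oracle picks the sample size
    method_opt       => 'FOR ALL COLUMNS SIZE AUTO',
    cascade          => TRUE);
END;
/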


According to the Oracle documentation, the I/O and CPU costs are evaluated as follows:



Cost = (#SRds * sreadtim +
        #MRds * mreadtim +
        #CPUCycles / cpuspeed) / sreadtim

where:

#SRds      - number of single block reads
#MRds      - number of multi block reads
#CPUCycles - number of CPU cycles

sreadtim   - single block read time
mreadtim   - multi block read time
cpuspeed   - CPU cycles per second

Note that the costs are a function of the number of reads and the relative read times, plus the CPU cost estimate for the query. Also note the external costing does not consider the number of data blocks that reside in the RAM data buffers, but a future release of the optimizer is likely to consider this factor.

Here we see that Oracle uses both the CPU and I/O cost estimates in evaluating execution plans. The equation becomes even more complex when we factor in parallel query, where many concurrent processes are servicing the query.

The biggest benefit of CPU costing is for all_rows execution plans, where total cost matters more than it does with first_rows optimization.

The v$sql_plan view can also help us locate SQL tuning opportunities.

The EXPLAIN PLAN statement displays execution plans chosen by the Oracle optimizer for SELECT, UPDATE, INSERT, and DELETE statements. A statement's execution plan is the sequence of operations Oracle performs to execute the statement.


You can specify a statement Id when using the INTO clause.

EXPLAIN PLAN
INTO my_plan_table
SET STATEMENT_ID = 'bad1' FOR
SELECT last_name FROM employees;




SELECT lpad(' ',level-1)||operation||' '||options||' '||
object_name "Plan"
FROM plan_table
CONNECT BY prior id = parent_id
AND prior statement_id = statement_id
START WITH id = 0 AND statement_id = '&1'
ORDER BY id;
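From Oracle 9i onward, dbms_xplan is a simpler alternative to the CONNECT BY query above; this assumes the plan was saved into my_plan_table under statement id 'bad1', as in the earlier EXPLAIN PLAN example:

SELECT * FROM TABLE(dbms_xplan.display('my_plan_table', 'bad1'));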


CREATE TABLE emp_range
PARTITION BY RANGE(hire_date)
(
PARTITION emp_p1 VALUES LESS THAN (TO_DATE('1-JAN-1991','DD-MON-YYYY')),
PARTITION emp_p2 VALUES LESS THAN (TO_DATE('1-JAN-1993','DD-MON-YYYY')),
PARTITION emp_p3 VALUES LESS THAN (TO_DATE('1-JAN-1995','DD-MON-YYYY')),
PARTITION emp_p4 VALUES LESS THAN (TO_DATE('1-JAN-1997','DD-MON-YYYY')),
PARTITION emp_p5 VALUES LESS THAN (TO_DATE('1-JAN-1999','DD-MON-YYYY'))

) AS SELECT * FROM employees;

For the first example, consider the following statement:

EXPLAIN PLAN FOR SELECT * FROM emp_range;

EXPLAIN PLAN FOR SELECT * FROM emp_range
WHERE hire_date >= TO_DATE('1-JAN-1995','DD-MON-YYYY');



CREATE TABLE emp_comp PARTITION BY RANGE(hire_date) SUBPARTITION BY
HASH(department_id)
SUBPARTITIONS 3
(
PARTITION emp_p1 VALUES LESS THAN (TO_DATE('1-JAN-1991','DD-MON-YYYY')),
PARTITION emp_p2 VALUES LESS THAN (TO_DATE('1-JAN-1993','DD-MON-YYYY')),
PARTITION emp_p3 VALUES LESS THAN (TO_DATE('1-JAN-1995','DD-MON-YYYY')),
PARTITION emp_p4 VALUES LESS THAN (TO_DATE('1-JAN-1997','DD-MON-YYYY')),
PARTITION emp_p5 VALUES LESS THAN (TO_DATE('1-JAN-1999','DD-MON-YYYY'))

) AS SELECT * FROM employees;

EXPLAIN PLAN FOR SELECT * FROM emp_comp
WHERE hire_date = TO_DATE('15-FEB-1997', 'DD-MON-YYYY');

How to obtain explain plans

Explain plan for
Its main advantage is that it does not actually run the query; it only parses the SQL, so it executes quickly. In the early stages of tuning, EXPLAIN PLAN gives you an idea of the potential performance of your query without actually running it. You can then make a judgement as to any modifications you may choose to make.

Autotrace
Autotrace can be configured to run the SQL and give you a plan and statistics afterwards, or to just give you an explain plan without executing the query.
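For example, in SQL*Plus (the PLUSTRACE role is normally needed for the statistics part):

SET AUTOTRACE TRACEONLY EXPLAIN   -- show the plan without executing the SELECT
SELECT last_name FROM employees WHERE department_id = 10;

SET AUTOTRACE ON                  -- execute, then show the plan and run-time statistics
SELECT last_name FROM employees WHERE department_id = 10;

SET AUTOTRACE OFF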

Tkprof
Analyzes a trace file generated by SQL trace and produces a formatted report of execution statistics for each statement.
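A typical workflow (file names are illustrative) is to enable SQL trace for the session, run the statements of interest, and then format the raw trace file with tkprof:

ALTER SESSION SET sql_trace = TRUE;
-- ... run the SQL of interest ...
ALTER SESSION SET sql_trace = FALSE;

-- Then, at the operating system prompt:
-- tkprof orcl_ora_12345.trc tkprof_report.txt sys=no sort=exeela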


Using Oracle hints can be very complicated, and Oracle developers should use hints only as a last resort, preferring to adjust the statistics to change the execution plan. Oracle contains more than 124 hints, and many of them are not found in the Oracle documentation.


An optimizer hint is a directive placed inside comments in your SQL statement, used in those rare cases where the optimizer makes an incorrect decision about the execution plan. Because hints live inside comments, it is important to ensure that the hint name is spelled correctly and that the hint is appropriate to the query.

Oracle Hints

A hint is nothing more than a directive to change an execution plan

SELECT /*+ hint --or-- text */ statement body
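For example, an index hint (the index name emp_last_name_idx is hypothetical) looks like this:

SELECT /*+ INDEX(e emp_last_name_idx) */ e.last_name, e.salary
FROM   employees e
WHERE  e.last_name = 'Smith';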

Thursday, December 3, 2009

Oracle Table Lock

-- Check Lock Table

select c.owner,
       c.object_name,
       c.object_type,
       b.sid,
       b.serial#,
       b.status,
       b.osuser,
       b.machine
from   v$locked_object a,
       v$session b,
       dba_objects c
where  b.sid = a.session_id
and    a.object_id = c.object_id;


Or, step by step, do the following:


select object_id from dba_objects where object_name='tablename';

select * from v$locked_object where object_id = &lt;object_id from the previous query&gt;;

Note the "oracle_username" and "session_id".


Or you can query v$access

select sid from v$access where owner='table owner' and object='table name';

Note the session id number or "sid".

select sid, serial#, command, taddr from v$session where sid=&lt;session id number&gt;;

Now you have identified the user and what they are doing.

To terminate the session:

Alter system kill session 'sid, serial#' immediate;

The session should now be killed and the lock SHOULD release.

Rechecking "v$locked_object" will tell you this. If the lock does not
immediately release, there may be a rollback occuring.


To check for rollback:

select used_ublk from v$transaction where addr = &lt;value of TADDR from v$session&gt;;
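Written as a single query, the join between v$session and v$transaction on TADDR looks like this (&sid is the sid of the session being rolled back):

select s.sid,
       s.serial#,
       t.used_ublk          -- undo blocks still to be rolled back
from   v$session     s,
       v$transaction t
where  s.taddr = t.addr
and    s.sid   = &sid;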



select
username,
v$lock.sid,
trunc(id1/power(2,16)) rbs,
bitand(id1,to_number('ffff','xxxx'))+0 slot,
id2 seq,
lmode,
request
from v$lock, v$session
where v$lock.type = 'TX'
and v$lock.sid = v$session.sid
and v$session.username = USER;

Informatica Metadata

Informatica maintains metadata about mappings and their transformations, sessions, workflows and their run statistics. These details are maintained in a set of repository tables known as OPB and REP tables.

The term "widget" refers to the type of transformation stored in these tables.

Widget IDs and transformation types
widget_type   Type of transformation
1             Source
2             Target
3             Source Qualifier
4             Update Strategy
5             Expression
6             Stored Procedure
7             Sequence Generator
8             External Procedure
9             Aggregator
10            Filter
11            Lookup
12            Joiner
14            Normalizer
15            Router
26            Rank
44            Mapplet
46            Mapplet Input
47            Mapplet Output
55            XML Source Qualifier
80            Sorter
97            Custom Transformation

select z.widget_id, decode(z.porttype, 1, 'INPUT', 3, 'IN-OUT', 2, 'OUT', 32, 'VARIABLE', 8, 'LOOKUP', 10, 'OUT-LOOKUP', to_char(z.porttype)) Port_Type from opb_widget_field z;

If you want to know the mapping name, match the widget_id against the widget_id in opb_widget_inst, then pull the mapping_id, which can be joined to mapping_id in the opb_mapping table. If you want to know the folder name, map the subject_id from opb_mapping to subj_id in the OPB_SUBJECT table to get the subject name (see the sketch below).
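A sketch of that join (repository table and column names can differ slightly between PowerCenter versions, and 1234 is just a placeholder widget_id):

select s.subj_name     as folder_name,
       m.mapping_name,
       wi.instance_name,
       wi.widget_id
from   opb_widget_inst wi,
       opb_mapping     m,
       opb_subject     s
where  wi.mapping_id = m.mapping_id
and    m.subject_id  = s.subj_id
and    wi.widget_id  = 1234;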

OPB_EXPRESSION is the table that stores all the expressions in metadata. To associate an expression to a field in a transformation, OPB_WIDG_EXPR is the table to be used.

select g.expression from opb_widget_expr f, opb_expression g where f.expr_id = g.expr_id;

SQL overrides can be in Source Qualifiers and Lookup transformations.
To get the SQL Override from metadata, check REP_WIDGET_ATTR.ATTR_VALUE column.


select * from opb_widget_field where wgt_datatype=11

select g.expression from opb_widget_expr f, opb_expression g where f.expr_id = g.expr_id;

select * from REP_WIDGET_ATTR

select * from opb_mapping -- to check for a mapping/mapplet, its status of valid/invalid, etc.

select * from opb_server_info -- to get server hostname,ip,etc

select src_id from opb_src where source_name ='W_DAY_D' -- source table , owner info

select * from opb_src_fld where src_id=1753 -- to get table column/field list info

select * from opb_targ where target_name like 'W_DAY_D' -- target table info

select * from opb_targ_fld where target_id =1835 -- to get target field info

Friday, November 13, 2009

Data Warehouse Concepts - Part 2

A data warehouse is designed mainly to supply information to the business so that decisions can be made better and faster, based on analysis of historical data. So it is essential that we model its logical and physical design in the best way. Physical design is mainly concerned with the performance and functionality of the data warehouse, while logical design is a way to capture and present the business requirements in terms of entities.

In data modeling the following tasks are performed in an iterative manner:
• Identify entity types
• Identify attributes
• Apply naming conventions
• Identify relationships
• Apply data model patterns
• Assign keys
• Normalize to reduce data redundancy (Entity Relationship Model)
• Denormalize to improve performance (Dimensional Model)
The two types of data modeling are as follows:
• Logical modeling
• Physical modeling

Logical modeling deals with gathering business requirements and converting those requirements into a model. The logical model revolves around the needs of the business, not the database, although the needs of the business are used to establish the needs of the database. Logical modeling involves gathering information about business processes, business entities (categories of data), and organizational units. After this information is gathered, diagrams and reports are produced including entity relationship diagrams, business process diagrams, and eventually process flow diagrams. The diagrams produced should show the processes and data that exists, as well as the relationships between business processes and data. Logical modeling should accurately render a visual representation of the activities and data relevant to a particular business.

Typical deliverables of logical modeling include
Entity relationship diagrams :
An Entity Relationship Diagram is also referred to as an analysis ERD. The point of the initial ERD is to provide the development team with a picture of the different categories of data for the business, as well as how these categories of data are related to one another.
Business process diagrams :
The process model illustrates all the parent and child processes that are performed by individuals within a company. The process model gives the development team an idea of how data moves within the organization. Because process models illustrate the activities of individuals in the company, the process model can be used to determine how a database application interface is designed.
User feedback documentation :

Physical modeling involves the actual design of a database according to the requirements that were established during logical modeling. Logical modeling mainly involves gathering the requirements of the business, with the latter part of logical modeling directed toward the goals and requirements of the database. Physical modeling deals with the conversion of the logical, or business model, into a relational database model. When physical modeling occurs, objects are being defined at the schema level. A schema is a group of related objects in a database. A database design effort is normally associated with one schema.

During physical modeling, objects such as tables and columns are created based on entities and attributes that were defined during logical modeling. Constraints are also defined, including primary keys, foreign keys, other unique keys, and check constraints. Views can be created from database tables to summarize data or to simply provide the user with another perspective of certain data. Other objects such as indexes and snapshots can also be defined during physical modeling. Physical modeling is when all the pieces come together to complete the process of defining a database for a business.

Physical modeling is database software specific, meaning that the objects defined during physical modeling can vary depending on the relational database software being used. For example, most relational database systems have variations with the way data types are represented and the way data is stored, although basic data types are conceptually the same among different implementations. Additionally, some database systems have objects that are not available in other database systems.

Typical deliverables of physical modeling include the following:
Server model diagrams:
The server model diagram shows tables, columns, and relationships within a database.
User feedback documentation :
Database design documentation :

Designing the Data warehouse
The main emphasis in the design is on fast query retrieval rather than transactional performance. The design should be such that it enables analysis of data from any angle at any given point in time.

A dimensional data model is the best way of designing a data warehouse. The main terms used in dimensional modeling are facts and dimensions.

1. Fact - A fact is a single iteration in a historical record
2. Dimension - A dimension is something used to dig into, divide, and collate those facts into something useful

Facts represent historical or archived data and dimensions represent smaller static data entities. It follows that dimension entities will generally be small and fact entities can become frighteningly huge. Fact entities will always be appended to, and dimension entities can be changed, preferably not as often as the fact entities are appended to. The result is many very small entities related to data in groups from very large entities.

Granularity
While designing a fact, the most important point to keep in mind is the granularity: how much data to keep and to what level of detail. Do we need to store every transaction, or do we store data at a summary level? Take the case of a retailer and a manufacturer. A retailer would like to capture each and every piece of an item sold at its point of sale, i.e. at the lowest stock-keeping-unit level, to manage its inventory better, whereas the manufacturer might keep data at a level higher than individual pieces; it would not need information on each piece sold and might keep records at an item level, one level above the stock-keeping unit used by the retailer. So, depending on business needs, we need to decide the lowest possible granular level to keep in the facts to make the reporting accurate. From a planning perspective, it might be best to begin by retaining all facts down to the smallest detail if at all possible. Data warehouses are expected to be large, and disk space is cheap.


Star and Snowflake Schemas

A star schema contains one, or at least very few, very large fact entities, plus a large number of small dimensional entities. As already stated, effectively fact entities contain transactional histories and dimension entities contain static data describing the fact entity archive entries. The objective for performance is to obtain joins on a single join level, where one fact entity is joined to multiple small dimension entities, or perhaps even a single dimension entity.

A snowflake schema is a star schema in which the dimension entities have been normalized.

A single data warehouse can contain multiple fact entities and, therefore, multiple star schemas. Additionally, individual dimension entities can point to multiple fact entities. Dimension entities will occupy a small fraction of the storage space than fact entities do. Fact entities in a data warehouse can have trillions of rows, whereas dimensions are in the range of tens, hundreds or perhaps thousands. Any larger than thousands and those dimensions could possibly be facts.