-flexible
DATA WAREHOUSE
Data warehousing
provides architectures and tools for business
executives to
systematically organize,
understand, and use their data to make strategic
decisions. A
large number of
organizations have found that data warehouse
systems are valuable tools
in today's
competitive, fast evolving world. In the last
several years, many firms have
spent
millions of dollars in building enterprise-wide
data warehouses. Many people feel
that
with competition mounting in every industry, data
warehousing is the latest
must-have
marketing weapon
——
a way to
keep customers by learning more about their
needs.
“So
Data warehouses have been defined in
many ways, making it difficult to formulate a
rigorous definition. Loosely speaking,
a data warehouse refers to a database that is
maintained separately from an
organization's operational databases. Data
warehouse
systems allow for the
integration of a variety of application systems.
They support
information processing by
providing a solid platform of consolidated,
historical data for
analysis.
According to W. H. Inmon, a leading
architect in the construction of data warehouse
systems, “a data warehouse is a
subject
-oriented, integrated, time-
variant, and nonvolatile
collection of
data in support of management's decision making
process.
comprehensive definition
presents the major features of a data warehouse.
The four
keywords, subject-oriented,
integrated, time-variant, and nonvolatile,
distinguish data
warehouses from other
data repository systems, such as relational
database systems,
transaction
processing systems, and file systems. Let's take a
closer look at each of these
key
features.
(1).Subject-
oriented: A data warehouse is organized around
major subjects, such as
customer,
vendor, product, and sales. Rather than
concentrating on the day-to-day
operations and transaction processing
of an organization, a data warehouse focuses on
the
modeling and analysis of data for
decision makers. Hence, data warehouses typically
provide a simple and concise view
around particular subject issues by excluding data
that
are not useful in the decision
support process.
(2)
Integrated: A data warehouse is usually
constructed by integrating multiple
heterogeneous sources, such as
relational databases, flat files, and on-line
transaction
records. Data cleaning and
data integration techniques are applied to ensure
consistency
in naming conventions,
encoding structures, attribute measures, and so
on.
(3).Time-variant: Data are stored to
provide information from a historical perspective
(e.g., the past 5-10 years). Every key
structure in the data warehouse contains, either
implicitly or explicitly, an element of
time.
(4)Nonvolatile: A data
warehouse is always a physically separate store of
data
transformed from the application
data found in the operational environment. Due to
this
separation, a data warehouse does
not require transaction processing, recovery, and
concurrency control mechanisms. It
usually requires only two operations in data
accessing: initial loading of data and
access of data.
In sum, a
data warehouse is a semantically consistent data
store that serves as a
physical
implementation of a decision support data model
and stores the information on
which an
enterprise needs to make strategic decisions. A
data warehouse is also often
viewed as
an architecture, constructed by integrating data
from multiple heterogeneous
sources to
support structured and/or ad hoc queries,
analytical reporting, and decision
making.
“OK
Based on the
above, we view data warehousing as the process of
constructing and
using data warehouses.
The construction of a data warehouse requires data
integration,
data cleaning, and data
consolidation. The utilization of a data warehouse
often
necessitates a collection of
decision support technologies. This allows
“knowledge
workers
conveniently obtain
an overview of the data, and to make sound
decisions based on
information in the
warehouse. Some authors use the term “data
warehousing
only to the process of data
warehouse construction, while the term warehouse
DBMS is
used to refer to the management
and utilization of data warehouses. We will not
make
this distinction here.
“How are organizations using the
information from data warehouses?
Many
organizations are using
this information to support business decision
making activities,
including:
(1)
increasing customer focus, which includes the
analysis of customer buying
patterns
(such as buying preference, buying time, budget
cycles, and appetites for
spending),
(2) repositioning products
and managing product portfolios by comparing the
performance of sales by quarter, by
year, and by geographic regions, in order to fine-
tune
production strategies,
(3) analyzing operations and looking
for sources of profit,
(4)
managing the customer relationships, making
environmental corrections, and
managing
the cost of corporate assets.
Data warehousing is also very useful
from the point of view of heterogeneous
database integration. Many
organizations typically collect diverse kinds of
data and
maintain large databases from
multiple, heterogeneous, autonomous, and
distributed
information sources. To
integrate such data, and provide easy and
efficient access to it is
highly
desirable, yet challenging. Much effort has been
spent in the database industry and
research community towards achieving
this goal.
The traditional
database approach to heterogeneous database
integration is to build
wrappers and
integrators (or mediators) on top of multiple,
heterogeneous databases. A
variety of
data joiner and data blade products belong to this
category. When a query is
posed to a
client site, a metadata dictionary is used to
translate the query into queries
appropriate for the individual
heterogeneous sites involved. These queries are
then
mapped and sent to local query
processors. The results returned from the
different sites
are integrated into a
global answer set. This query-driven approach
requires complex
information filtering
and integration processes, and competes for
resources with
processing at local
sources. It is inefficient and potentially
expensive for frequent queries,
especially for queries requiring
aggregations.
Data
warehousing provides an interesting alternative to
the traditional approach of
heterogeneous database integration
described above. Rather than using a query-driven
approach, data warehousing employs an
update-driven approach in which information
from multiple, heterogeneous sources is
integrated in advance and stored in a warehouse
for direct querying and analysis.
Unlike on-line transaction processing databases,
data
warehouses do not contain the most
current information. However, a data warehouse
brings high performance to the
integrated heterogeneous database system since
data are
copied, preprocessed,
integrated, annotated, summarized, and
restructured into one
semantic data
store. Furthermore, query processing in data
warehouses does not interfere
with the
processing at local sources. Moreover, data
warehouses can store and integrate
historical information and support
complex multidimensional queries. As a result,
data
warehousing has become very
popular in industry.
1.
Differences between operational
database systems and data warehouses
Since most people are familiar with
commercial relational database systems, it is
easy to understand what a data
warehouse is by comparing these two kinds of
systems.
The major task of
on-line operational database systems is to perform
on-line
transaction and query
processing. These systems are called on-line
transaction processing
(OLTP) systems.
They cover most of the day-to-day operations of an
organization, such
as, purchasing,
inventory, manufacturing, banking, payroll,
registration, and accounting.
Data
warehouse systems, on the other hand, serve users
or “knowledge workers
role of data
analysis and decision making. Such systems can
organize and present data in
various
formats in order to accommodate the diverse needs
of the different users. These
systems
are known as on-line analytical processing (OLAP)
systems.
The major
distinguishing features between OLTP and OLAP are
summarized as
follows.
(1). Users and system orientation: An
OLTP system is customer-oriented and is used
for transaction and query processing by
clerks, clients, and information technology
professionals. An OLAP system is
market-oriented and is used for data analysis by
knowledge workers, including managers,
executives, and analysts.
(2). Data contents: An OLTP system
manages current data that, typically, are too
detailed to be easily used for decision
making. An OLAP system manages large amounts
of historical data, provides facilities
for summarization and aggregation, and stores and
manages information at different levels
of granularity. These features make the data
easier for use in informed decision
making.
(3). Database
design: An OLTP system usually adopts an entity-
relationship (ER)
data model and an
application -oriented database design. An OLAP
system typically
adopts either a star
or snowflake model, and a subject-oriented
database design.
(4). View:
An OLTP system focuses mainly on the current data
within an enterprise
or department,
without referring to historical data or data in
different organizations. In
contrast,
an OLAP system often spans multiple versions of a
database schema, due to the
evolutionary process of an
organization. OLAP systems also deal with
information that
originates from
different organizations, integrating information
from many data stores.
Because of their
huge volume, OLAP data are stored on multiple
storage media.
(5). Access
patterns: The access patterns of an OLTP system
consist mainly of short,
atomic
transactions. Such a system requires concurrency
control and recovery
mechanisms.
However, accesses to OLAP systems are mostly read-
only operations (since
most data
warehouses store historical rather than up-to-date
information), although many
could be
complex queries.
Other
features which distinguish between OLTP and OLAP
systems include database
size,
frequency of operations, and performance metrics
and so on.
2.
But, why have a separate data
warehouse?