-
XL
STAT
Generalities
Installation
Running
XL
STAT -
XL
STAT Manager
XL
STAT direct access
General remarks
Dialog boxes
Numerical efficiency
Missing
data / Numerical and categorical variables
Error messages
Shortcut keys
References
Help
files
Statistical tools
6D Plots
ANOVA / ANCOVA
AxesZoomer
Categorical sorting
Clustering 1 & 2
Correlations / Principal Component
Analysis (PCA)
Correspondence analysis
Crossed sorting / Flat sorting
DataFlagger
Descriptive
statistics
Discretize data
Discriminant analysis
Easy labels (two clicks to add labels
on a plot)
Extract a sample of rows
from a dataset
Factor
analysis
Fit data to density
functions and test fit
Histograms
Kruskal-Wallis test
MicroMover
MinMax Search
Models for binary response data (Logit,
Probit, ...)
MotriMax
Multidimensional Scaling
(MDS)
Multiple
correspondence analysis
Non-
linear fitting (GenFit)
Odds ratio
Similar rows detection
Plot
Transformer
Regression
Test
for comparing two proportions
Tests for
comparing two samples (Student, Wilcoxon, Fisher,
...)
Tests on contingency tables (Chi-
square and Exact tests)
Transposition
Microsoft Excel? and
Statistics
License conditions
Generalities
Installation
Required software : Excel ? 5.0 or
above
Required
hardware
:
XL
STAT
is
compatible
with
all
PCs
that
can
run
Microsoft
Excel?
5.0.
However,
for
optimal performance, a 486DX33 PC with
8Mb RAM or a PowerPC-based Macintosh is
recommended.
To install
XL
STAT you need to copy all
the files from the
XL
STAT
disk to the hard disk. You load
XL
STAT by
opening
the file with the open command in the File menu
of Excel, or better, you can use the
add-
in manager in the tools menu. A
double click on the file name or icon in the
Windows Explorer of file manager
is
also possible.
The
professional version of
XL
STAT is exactly the same
as the shareware one, the only difference being
that it
is not time limited and that a
free update service is provided if the author is
contacted at fahmy@.
XL
STAT website is :
Running XLSTAT - XLSTAT
Manager
To run
XLSTAT, please load the file by opening it with
Excel File|Open command or by using the
Tools|Add-in Manager menu.
Once
XL
STAT
is
opened,
the
available
commands
can
be
directly
accessed
from
the
menubar
or
from
the
XL
STAT
Manager
XL
STAT toolbar.
Using the XLSTAT Manager to
open the tools makes it possible to access them in
the
XL
STAT toolbar, as they
are added to it after selection. To
know what tool corresponds to what icon of the
toolbar, leave the mouse cursor
on the
icon, and read the
By
using
the
XL
STAT
Manager
to
select
tools,
you
can
also
save
a
configuration
as
Saving
a
configuration as
XL
STAT will load, it will
automatically open the tools you saved
in the configuration. When you click on save the
options corresponding to
the “O” button
of the toolbar are also saved for the next
sessions.
You
can
also
activate
the
XL
STAT
Manager
at
any
time,
using
the
XL
STAT
Manager
command
of
the
XL
STAT menu.
To edit the
XL
STAT standard ordering
sheet, you can use the
XL
STAT
or open
from your
XL
STAT directory.
After you registered you can use the
your
personal
license
number,
given
by
the
author
or
the
distributor.
You
will
then
immediately
become
an
XL
STAT-Pro !
XL
STAT direct
access
If you
want to
open
and
close
XL
STAT directly at any time
by just pressing a button, you only need to open
once in Excel the macro. It will add a
blue color button in the standard toolbar of
Excel, close to the
button.
You
can
move
or
delete
this
button
as
any
other
Excel
button
using
the
General remarks
In the
plots will be stored
in standard Chartsheets. If you don't select this
option, the plots will be stored on the same
sheet as the output tables, in the
upper left corner of the sheet.
XL
STAT sometimes creates
intermediate sheets
that are stored in
the results workbooks. If you want you can view
them by selecting the corresponding option.
Each time you load
XL
STAT, the reference option
is set to A1 if it was on R1C1. This is necessary
for
XL
STAT
to
work properly. If you change that option while
working, don’t forget to change it back before
using
XL
STAT
again.
Each
XL
STAT
copy
is
protected
by
copyright
laws,
and
can
only
be
used
on
one
computer
at
a
time.
Every
single shareware
release of
XL
STAT is time
limited.
To obtain the
latest version (updates are released frequently)
please contact Thierry FAHMY at the following
e-
mail addresses : fahmy@
Don't
hack
this
software,
and
more
importantly
don't
spread
illegal
copies.
Respect
other
people's
work,
or
shareware will disappear. Don’t forget
that if you buy
XL
STAT you
will get free updates for a year!
The Author disclaims any responsibility
for damages or subsequent loss caused by using
XL
STAT.
Dialog boxes
Dialog boxes have been
created to be as straight-forward and uniform as
possible.
The
Excel sheet the cell which will be the
upper left corner of the results tables.
When you select data,
variables always need to be stored in columns.
or
column
labels
correspond
to
the
labels
or
names
of
the
explanatory
variables.
If
you
select
the
variable labels within the data range
you must tick the corresponding option.
within the data
range, then you must specify tick the
corresponding option.
NB:
The Outputs range does not need to be on the same
sheet or on the same Workbook than the input
table.
Numerical efficiency
XL
STAT is
efficient for most methods, but it can be slow
when there are a lot of variables and rows.
depend on your computer processor and
on the RAM. This occurs because
XL
STAT is fully written in
VBA and
Excel macros are not compiled.
But the fact that
XL
STAT is
only VBA makes it possible for you to use it on
any
computer and any system.
Missing data -
Numerical and categorical data
XL
STAT is able
to detect missing data for the following tools :
PCA / Disc. Analysis / Classification / CA / MCA
/ Histograms / Descriptive Statistics /
Factor analysis.
Missing
data must be coded with empty cells.
Categorical
variables
can
be
coded
by
any
symbol
except
free
cells
(the
only
way
to
code
missing
data).
Numerical data needs
to be numbers.
Error messages
There are very few error messages and
they are explicit enough so that you can easily
understand what is wrong.
The most frequent message is :
specify
if
variable
or
observation
labels are
included,
or
because
you
forgot
to
select
the
upper
left
corner
of
output
tables.
Sometimes it can
also occur because there is some non-numerical
data in a column that should correspond to a
numerical variable.
References
To create XLSTAT, many
references have been consulted within the which :
?
a (1991),
Probabilité
s, Analyse de
donné
es et statistiques, Technip, Paris
?
one (1993),
Biomé
trie , Masson, Paris
?
A. Agresti
(1990), Categorical Data Analysis , Wiley
Interscience
Shortcut Keys
Ctrl +M : activates the
XLSTAT Manager
Help files
XL
STAT help file
On
PC
and
Macintosh
computers,
help
topics
can
be
accessed
from
either
the
XL
STAT
menu,
or
from
each
dialog box. On PCs the help file can
also be accessed from the file manager.
For users who would like to
have a printed version of the help file, they can
print the Word6 version of the help
file
Ordering XLSTAT
To
know
the
how
to
order
XLSTAT,
please
visit
/
and
go
to
the
order
page,
or
open
which is distributed
with XLSTAT modules. For any further information,
please contact Thierry Fahmy
at
fahmy@
6D Plots
This tool is one of the simplest ones
of XLSTAT for the mathematical background it
requires from the user, but it
helps
creating very surprising charts that will impress
of lot of the people who ignore XLSTAT. With this
tool
you can represent up to 6
dimensions at a time (or even seven if you use the
labels to display information), 4 of
which can be numerical and 2
categorical. This tool can easily be used to plot
2 dimensions and differentiating
the
belonging to various groups (a third dimension),
or four dimensions to distinguish the financial
results over
two
years
for
various
geographies
and
products.
The
only
compulsory
entries
for
this
tool
are
the
first
two
numerical dimensions and the output
location (it can be on an Excel spreadsheet or on
a separate chart sheet. Or
you can
simply use this tool to do two dimensional plots
using other object formats than the
traditional
Excel
ones (instead of using the small
circles to represent data you can use any image or
object you want !).
Dialog box entries
Select Xs
: select in this
entry box the data on the Excel sheet that
correspond to the first dimension (numerical
data only). This entry is compulsory
and is represented on the abscissa axis.
Select
Ys
:
select
in
this
entry
box
the
data
on
the
Excel
sheet
that
correspond
to
the
second
dimension
(numerical data only). This entry is
compulsory and is represented on the ordinates
axis.
Select
3
rd
D
: select in
this box the data on the Excel sheet that
correspond to the third dimension (optional). The
third dimension is represented by the
size
of the object.
Select
4
th
D
: select in
this box the data on the Excel sheet that
correspond to the fourth dimension (optional).
The fourth dimension is represented by
the
size
of the object.
Select Groups 1
: select in this box the data on the Excel sheet
that correspond to the fifth dimension. Whatever
the data type in this selection, XLSTAT
will take them as categorical data. The fifth
dimension is represented by
the
color
of the object.
Select Groups 2
: select in this box the data on the Excel sheet
that correspond to the sixth dimension. Whatever
the data type in this selection, XLSTAT
will take them as categorical data. You can select
data in this zone even
there is no
type
of the object.
Data
labels
included
:
if
this
option
is
selected,
you
must
have
included
and
selected
the
data
labels
when
selecting the Xs data. The labels must
be on the left side of the Xs.
Variable
labels
included
:
if
this
option
is
selected,
you
must
have
included
and
selected
a
name
for
each
dimension.
Outputs :
the output is the
chart itself. However you can decide to put it on
a special area of a spreadsheet (then
select a cell of your choice on this
spreadsheet in the
(then select
Autocolor :
deselect this option if want to control
the color and the shape of the objects that will
be used instead
of the usual Excel
formats (dots, circles, squares). A
default colors suggested by XLSTAT. A
XLSTAT suggests only four shapes :
circles, squares, triangles and diamonds, any kind
of shape or image can be
added.
With
Excel97
(and
later
versions)
you
can
take
a
full
advantage
of
the
shapes
and
textures
that
are
available.
Example
This example shows a simple 3 dimension
plot where the first two dimensions are
represented on the X and Y
axes,
and
the
third
dimension
is
represented
using
the
object
size.
More
examples
can
be
found
on
/
.
Table 1 : data
on Excel sheet
Figure 1 : dialog box as it should filled in
To
produce
the
plot
below
the
autocolor
option
has
been
deselected.
Then
the
default
object
suggested
by
XLSTAT in the
axes and of the background have been
manually changed by the user. The Micromover tool
of XLSTAT has been
used to move all
labels a little bit up, all at the same time.
Figure 2 : 3D representation of the
data
AxesZoomer
A simple but time saving
utility tool. Select a plot, then run this tool
and you will be able to modify the min and
max of the axes until you are
satisfied.
Categorical sorting
This
tool
makes
both
increasing
and
decreasing
sorting
(with
the
usual
alphabetical
conventions)
of
several
Categorical variables easy to achieve,
with decreasing priorities from the first to the
last variable. The sorting is
done
first on the first variable then on the second
variable, but within the first variable
categories, and so on,
until the last
variable.
Note : if you
want to sort numerical data it is better to use
the classical Excel? sorting tool.
Cluster
analysis 1 & 2
Cluster analysis is a data analysis
tool that allows you to classify into groups, a
set of observations described by
numerical
variables.
You
can
specify
the
number
of
groups
you
want
to
create.
XL
STAT
offers
you
two
different
clustering techniques. It is often a good idea to
try both and compare the results.
NB: If you want to classify
observations described by Categorical variables it
is advised to use first the Multiple
Correspondence Analysis and then use
the observations coordinates on factorial axes (in
the last results table of
MCA) as
numerical variables to finally classify the
observations.
XLSTAT offers
now the possibility to normalize and weight the
variables, some very important options for the
two methods suggested here which are
based on euclidean distances : the scale of a
variable might have a major
effect
on
the
final
result.
One
can
cancel
the
scale
effect
by
normalizing
the
variables.
Weighting
variables
enables
you
to
give
more
or
less
importance
to
the
variables
depending
on
how
much
you
want
them
to
influence the final
result of the clustering.
Cluster analysis 1
Classification using
centro?
ds method (also know as
k-means). It is a quick and powerful method , but
it does
not give any help in finding
the ideal number of clusters.
In the dialog box,
that it
might give some different results each time
(because the starting point is selected by
random). Only the
best classification
(in sense of inertia) is presented in the results
tables.
Note : sum of
Between-groups and Within-groups inertia is
constant and is equal to Total inertia. To choose
between the different iterations this
tool selects the one with the highest
between/within rate.
Cluster analysis 2
Ascendant hierarchical
cluster analysis using Ward's clustering
technique. If there are over 85 entries, it is not
possible to see the dendrogram graphic.
If you want to use a
dissimiliraty measure which is not the euclidean
distance (Ward's method), you may give as
input to
XL
STAT
the dissimilarity matrix instead of the raw data.
You can also select a correlation matrix as the
input data in that case, the
dissimilarities are automatically computed using
the following formula : dissimilarity
(i,j) = SquareRoot [ 2 x
(1-correlation(i,j))] which produces a sphere with
radius 1.
Results :
If you choose option
groups and only prints the abscissa (on
dendrogram) table and the knots description table,
and then creates the
level-histogram
and
the
dendrogram.
If
there
are
more
than
85
entries
XLSTAT
stops
to
ask
you
how
many
groups you want to
build. If you don’t know, put 1 and follow the
advice that will be written below one of the
tables.
If you
choose option
described above, but with
two more tables containing the data classification
results.
If you choose the
inertia criterion.
Correlations /
Principal Component Analysis
Introduction
Correlation coefficients enable
analyzing relations between numerical or
categorical ordinal variables. It's a way
to
check
if
two
variables
evolve
in
the
same
way
or
opposite,
or
if
they
are
independent.
If
you
have
N
variables,
the
result
of
this
analysis
is
an
NxN
symmetric
table
with
correlation
coefficients
for
each
pair
of
variables.
The
correlation
coefficients
matrix
is
the
most
common
basis
for
starting
a
Principal
Component
Analysis,
which most
important results are the building of N
independent factors (they are linear combines of
the initial
variables, which have the
property of have null correlation coefficients),
and an optimized representation of all
the data on a P-dimensional plot, where
P is often 2 in practice, although it can be from
1 to 3. The quality of
the plot is
given by the % of the N-dimensional variability is
has been able to show.
Dialog box entries
Reference on sheet : select in this box
the table that contains the data, with or without
the columns and rows
labels (if
necessary select the labels options). The data
must be numerical variables. If you want to do
a
full
PCA, then
this table must include rough data (several
individuals - in rows - described by several
variables - in
columns). But if all you
have is a correlations or a covariance matrix you
can analyze its structure and see the
correlations circle.
Data
must
be
numerical
data.
If
there
are
missing
data
in
your
dataset
(coded
with
free
cells),
you
can
ask
XL
STAT to
remove the corresponding rows for all the
calculations or only when the missing data are
involved
in the calculations. If the
deleted. For the correlations you can
choose if you want to keep the rows with missing
values or not.
Output range
: select the cell that will correspond to the
upper left corner of the results printed by
XL
STAT.
Number of additional rows : if they are
some data that you don't want to use for the
computations but only to
plot them
afterwards, you only need to include them at the
bottom of the selected table and specify the
number
of data that need to be used
only at the end.
Number of
additional variables : if they are some variables
that you don't want to use for the computations
but
only to plot them afterwards, you
only need to include them at the right of the
selected table and specify the
number
of variables that need to be used only post-
computations.
NB : the
classical correlation coefficient is usually
calculated for numerical continuous integer or
real data. Be
very careful before using
it with ordinal numerical data Kendall's and
Spearman's rank correlation coefficients
are much better for ordinal data.
Results :
Correlation coefficients / Covariances
If
you
want
to
use
the
covariance
matrix
for
the
PCA
and
to
print
the
covariance
matrix
within
the
results,
please select the
matrix, because it removes the scales
effects (the scale of the values of the various
variables is not taken into
account to
make them easy to compare in trends).
This
allows
you
to
calculate
the
correlation
coefficients
matrices.
You
can
either
calculate
the
classical
parametric coefficients, or the
Spearman's or Kendall's non-parametric rank
correlation coefficients. Spearman's
and Kendall's coefficients can be
useful if you want to do PCA with ordinal data.
If the PCA option of the
dialog
box
isn't
activated,
you
will only
get
the
correlation
matrix
as
a
result.
Two
calculation techniques
are possible. If you choose
containing
missing values are deleted. If not they will be
deleted only when the variable corresponding to
the
missing value will be involved in
calculations. If you think it is worth it, this
will allow you to keep as much
information as possible.
If you don't want to keep the rows with
missing values, then if
XL
STAT comes across missing
data in a row, this
row will be ignored
for all the calculations. If there are missing
value that you haven't seen and if you require a
PCA, XLSTAT will automatically detect
and remove the corresponding rows.
To make the reading of the correlation
matrix easier, you can choose to highlight the
significant
correlations
which are determined using the test
that compares the absolute values to 1/sqr(n-1),
where n is the number of
variables.
NB
:
for
Kendall's
and
Spearman's
coefficients,
to
find
out
if
it
is
significantly
different
from
zero
for
a
significance level you
choose, you need to look in the corresponding
tables which can be found at the end of any
book on statistics.
To help you in reading the correlation
matrix, you can select in the dialog box a range
within which all values
will appear in
a light green color. If you want to see all values
then put nothing or two equal values.
Note
:
for
Kendall's
and
Spearman's
coefficients,
to
know
if
there
significantly
different
from
zero
for
a
significance level you choose, you need
to look in the corresponding tables which can be
found at the end of any
book talking
about statistics.
Suggestion : you should use the
DataFlagger
tool of XLSTAT
to make more visible some particular values on
large correlation coefficients tables.
Results :
Principal component analysis (PCA)
After
the
calculation
of
the
correlation
matrix,
if
the
PCA
option
of
the
dialog
box
is
activated,
some
more
calculations will start, beginning the
Principal Component Analysis, at the end of the
which you will be able to
see the
you are projecting some n-dimensional
data on a 2-dimensional
plot,
as
much
variability
as
possible
has
been
saved.
The
%
calculated
from
the
eigenvalues
gives
you
an
idea
of
the
global
variability
which
is
represented when using
the axes of interest.
If
you want to view some other two dimensions data
representation, select other axis numbers in the
dialog box or
change the series
references on the chart. It's of course advised to
always start with axis 1 and 2.
The
correlations
circle
is
a
2-dimensional
plot
helpful
for
interpreting
the
correlations
between
the
initial
variables and the new variables which
are linear combines of the initial ones : the
closer a variable is to an axis
and
to
the
circle,
the
higher
the
correlation
with
the
corresponding
factor
will
be.
It
is
also
a
way
to
plot
the
correlations
between
the
initial
variables
:
if
two
variables
are
in
opposite
quarters,
they
are
negatively
correlated
;
if
two
variables
are
perpendicular
the
correlation
is
close
to
0
;
if
two
variables
are
close,
the
correlation is close to 1.
A biplot is also plotted. It is not
exactly a mix of the to previous plots, as the
variables coordinates are modified
to
take into account the representation power of each
axis and the scale of the plot of the data.
To avoid any mistaken
interpretation on the positions of the data in the
new representation space, one can use the
squared
cosines
printed
in
the
last
table
:
the
closer
a
squared
cosine
is
to
one,
the
closer
the
data
is
to
the
corresponding axis.
Example
Here is a simple example
with 8 observations being described by 6
variables. PCA is used here to visualize the
observations on a map to quickly see
which years have been comparable and which not.
Table 1 : data on Excel sheet
Figure 1 : the dialog box as it must be filled in
The
first
result
displayed
by
XLSTAT
is
the
correlation
coefficients
matrix.
Here
the
DataFlagger
(see
in
the
XLSTAT -
Excel Utilities folder) has been used to highlight
the strong positive and negative correlations.
Table 1 :
correlations matrix
Among
the 3 plots displayed by XLSTAT, the biplot is the
nicest one when there aren't too many observations
and variables (if not, it can be messy)
as it shows the mapping of the data simultaneously
with the mapping of the
initial of the
variables on the factorial axes.
Figure 2 : biplot of variables and observations
For example, using this
chart one can say that years 1965, 1966 and 1968
are close because production of citrus
fruits have been high and production of
rice has been low.
Correspondence
Analysis
Introduction
Correspondence
analysis
is
generally
useful
for
analyzing
survey
results
when
two
questions
have
been
asked
with
several possible answers. It can of course be used
for the analysis of any two-way contingency table.
For
example it is possible with this
tool to study the relation between parents jobs
and studies of children. A plot is
generated to make the interpretation of
the numerical results easier.
Dialog box
Contingency
table
:
select
in
this
box
the
table
that
contains
the
data,
with
or
without
the
columns
and
rows
labels (if necessary select the labels
options). The data must be numbers. A contingency
table is a table for which
a cell (i;j)
corresponds to the frequency observed for
simultaneous characteristics: category
category
Output
range : select the cell that will correspond to
the upper left corner of the results printed by
XLSTAT.
Number of
supplementary rows : default is zero. If you want
that some rows are not used for the computations
but only projected on the
representation space using the results obtained
using the other data, then put a value N
bigger than 0. The N last rows you
selected in the contingency table will be used as
additional rows.
Number
of
supplementary
columns
:
default
is
zero.
If
you
want
that
some
columns
are
not
used
for
the
computations but only
projected on the representation space using the
results obtained using the other data, then
put a value N bigger than 0. The N last
columns you selected in the contingency table will
be used as additional
columns.
Maximum
number
of
axes
:
if
you
to
accelerate
the
outputs
generation
and
to
improve
the
readability
of
the
outputs
you can ask to see only the results for the first
N axes.
Confidence
range
for
the
Chi2
:
this
value
is
necessary
for
testing
if
one
can
say
rows
and
columns
are
independent.
If
they
are,
then
there
is
no
use
to
do
a
correspondence
analysis,
and
there
are
no
significant
relations to
study.
Show
contributions
and
cosines
:
these
results
are
important
but
you
can
ask
not
to
see
them,
which
will
accelerate the generation of outputs.
Results
To make possible the
graphical representation of the proximities
between categories of the two variables many
calculations
are
necessary.
As
well
as
with
Principal
Component
Analysis,
it
leads
to
eigenvalues
and
eigenvectors
calculation.
Eigenvalues
can
be
interpreted
as
an
image
of
the
total
inertia
represented
on
the
corresponding axes.
A representation space is
designed so that the
big as possible. A
projection on the first two axes is plotted. This
allows an interpretation of relations between
categories. For example, if two
categories (MBA in famous business schools and
lawyer) are close on the graphic
one
can say that they are strongly linked.
Several
results
tables
allow
a
better
interpretation.
The
contributions
of
each
category
to
each
axis
make
it
possible
to
understand
their
importance
and
their
relations.
The
two
last
results
tables
evaluate
the
cosine
between
each
category
and
the
representation
plane.
So
one
can
avoid
wrong
interpretations
saying
two
categories are close
because of the plot though they were far on the
whole space before projection. The squared
cosines are called by some authors
Example
The source of the Table 1
is the American Social Attitudes Sourcebook
(Converse et al., 1980), and it shows the
results of a survey where people of
different ages were asked if civil rights people
were pushing too fast.
Table 1 : data from American Social
Attitudes Sourcebook
On
Figure 1, the dialog box is represented as it must
be filled in to perform a Correspondence Analysis
on the
data of Table 1.
Figure 1 : the dialog box as it must be
filled in
The first result
which is directly produced from the table itself
is the 3D plot (see Figure 2). It enables to see
some
strong
trends.
With
simple
and
small
tables
it
already
provides
some
interesting
information,
but
not
as
much as the factorial map (Figure 3).
Figure 2 : 3D plot of the table itself
On the factorial map, we are able to
represent in 2 dimensions already 97% of the total
inertia which is excellent.
This makes
the map a good representation of the links between
the categories of the two variables (here age and
opinion) hidden in the contingency
table.
Figure
3 : Factorial map on axes 1 and 2
References
For further information on
this method, please visit
/
Crossed sorting / Flat
sorting
These
techniques are mostly effective for survey
analysis. The variables can be questions with
several possible
answers, or any other
kind of categorical variables.
Crossed sorting
If
q
questions
are
asked
to
n
people,
and
if
you
have
the
results
in
a
n
rows
and
q
columns
array,
then
this
technique
will
allow
you
to
transform
this
array
into
q(q-1)/2
arrays
where
the
rows
will
represent
all
the
possible
answers to the Qi question, and the columns will
represent all the possible answers to the Qj
question.
This kind of array is often
called
The
interest
of
this
tool
is
that
it
allows
you
to
know
how
many
people
answered
X
to
Qi
and
Y
to
Qj.
Furthermore the percentages are also
written in the results array; this can help you
for an interpretation.
Flat sorting
Instead of presenting the results in a
contingency table, with this tool the data array
is transformed into a single
tree
:
the
first
column
corresponds
to
the
sorted
different
categories
of
the
first
variable;
the
second
column
corresponds to the various categories
of variable 2, sorted within each variable 1
categories and so on. Before
the last
column is written the number of combinations of
categories corresponding to the row. In the last
the
column is written the corresponding
percentage.
Note
:
the
result
table
will
look
different
for
each
different
permutation
of
the
initial
table.
The
hierarchy
is
respected during the transformation so
if the variable which interests you the most is
the i
variable, then swap
columns 1 and i in the initial
table.
DataFlagger
This tool allows you to
identify some Excel cells - within a selected
range - that correspond to some criterion you
define.
The
criterion can be a text, a value or an interval.
For highlighting the cells
you can change the color, or select the bold or
italic format,
Descriptive Statistics
This is
probably the most complete tool you can find to
analyze in a simple way data with Excel. And
probably
the most simple to use on the
market.
Data can be
categorical or numerical and various types of
plots are possible.
Analyzing numerical data
XL
STAT
gives
as
a
standard
output
a
series
of
indexes
that
help
understanding
the
structure
of
the
data.
Skewness and Kurtosis
values are computed using the same formulae as
Excel (see Excel help for details).
For the standard deviation, three
formulae are possible when the data are weighted :
1) Sum of weighted squares of
differences between data and weighted mean,
divided by sum of weights minus 1.
2)
Sum of weighted squares of differences between
data and weighted mean, divided by sum of weights.
3) Sum of weighted squares of
differences between data and weighted mean,
divided by sum of weights minus
average
value of weights.
For the
median,
XL
STAT uses the same
technique as most packages. For the 1st and 3rd
quartile Excel formula is
used is the
data are unweighted. For weighted data, another
approach base on a linear interpolation is used.
X/Y and X/X plots are
simple two dimensional plots.
X/Y and
Q/Q
plots
are
simple
two
dimensional
plots
when
X<>Y,
but
if
X=Y
a
Q-Q
plot
is
plotted
with
the
empirical cumulative distribution
function of X on the abscissa, and the normal
cumulative distribution function
on
ordinates (with the same mean and std deviation as
X). This is useful to compare the distribution of
X to a
normal distribution.
Box plots can be plotted vertically or
horizontally. The means are plotted in red color.
Box plots give you an idea
of
the
distribution
of
your
data
especially
in
terms
of
symmetry
and
scale.
It
is
useful
when
comparing
the
structure of several variables (or
groups if data are grouped) at the same time. The
limits of the box correspond to
the
1
st
and the
3
rd
quartiles (Q1 et Q3),
and the fences to respectively Q1-1.5(Q3-Q1) and
Q3+1.5(Q3-Q1). Data
which are outside
of these fences are considered as potential
outliers and are printed on the box plot as empty
circles. The min and max are always
printed with a black circle.
Scattergrams make possible to visualize
all the data and their mean for all the selected
variables. This tool can be
very useful
to visually compare several populations in terms
of distribution and density, without any
parametric
consideration (type of
distribution, parameters such as mean and
variance).
Stem-Leaf
plots
are
mostly
used
with
integer
values.
XL
STAT
uses
an
expert
system
to
determine
the
ideal
structure
of
the
stem.
For
this
technique,
details
can
be
found
at
the
following
Web
page:
/~droyster/maed3103/Lesson_7_Section_
Analyzing
categorical data
XLSTAT analyzes the structure of
categorical data by counting (using weights if
necessary) the frequencies for
each
category of each variable. Two plots are available
: histograms and pie charts.
If the option
easier the
comparison of the groups and whole population
structure.
References
Terry Sincich (1996), Business
Statistics by Example, Prentice Hall.
Discretize
data
Data discretization should be used when
you want to transform numerical data to
categorical data. To do so, the
numerical variable must be divided into
several ranges. The result is an ordinal
categorical variable because of
the
natural order between the intervals.
Discretization is very useful in
marketing studies where for
some
variables
the
numbers
are
not
interesting
by
themselves but only because they put
people into different groups, where the groups
have a marketing meaning.
Furthermore, if you want to map some
data described by both numerical and categorical
variables, it might be
interesting
to
discretize
the
numerical
data
and
then
to
use
multiple
correspondence
analysis
which
enables
a
mapping of categorical data and
variables. As the results of discretization is
categorical ordinal data, if the other
categorical variables of your dataset
are also ordinal, it is better to use Principal
Component Analysis based on
the rank
correlation coefficients (Spearman's or Kendall's
coefficients) in that case, categories must be
coded
using numbers (1, 2, 3 ... for
example) to indicate to XLSTAT the ranking among
categories.
There are
several ways to define the intervals limits before
starting discretizing a variable. The two most
common
techniques one are based on
equal size ranges and on equal density classes
(based on quantiles). XLSTAT does
not
automatically create the limits for you, but it is
easy to transmit your results to XLSTAT.
XLSTAT
offers
two
ways
to
define
the
intervals
once
the
limits
are
known
from
you.
You
can
use
the
might look :
Figure 1 :
definition table
A definition table must contain 3
columns. The first column must contain the name of
the category, the second
must contain
the type of operator, and the third one the limit
value. Up to the last category (not included) the
intervals must
be
defined
using
the
or
operators.
The
last
operator
must
be
if
the
latter
was
way that enables you to
surely and quickly define any series of intervals
you want, without excluding any data.
What is interesting in using a
definition table is that you can save it, change
it easily, use it later or regularly. But
if you want, XLSTAT allows you to enter
manually the definitions in the discretization
tool dialog box.
Figure 2 : data example
If the data you want to
discretize using the definition table shown on
Figure 1, are as on Figure 2, the dialog box
should be filled as presented on Figure
3.
Figure 3 :
dialog box
After pressing
OK, you will have the following results on
Figure 4 : results table
This example is based only on a few
data, but it works very well and very quickly for
thousands of data.
Discriminant
analysis
The
categorical variable are
stored. The variables do not necessarily need to
be contiguous. You can use the
key
to
do
multiple
selection
with
Excel.
You
can
also
do
it
manually
by
adding
a
(or
a
if
this
is
your
standard
list separator) between the ranges. Example :
$$B$$2:$$B$$10,$$F$$2:$$F$$10 .
If
you
want
to
include
columns
labels,
these
must
be
selected
with
the
Ys
and
the
Xs
(first
row)
and
the
corresponding option
must be ticked. If included, the data labels must
be stored in the first column of the initial
dataset.
The
an Excel sheet the cell which will be
the upper left corner of the results tables.
The categorical variable
position in the dataset (column number) must be
specified. Note that if the first column
of your dataset includes the
observations labels you mustn’t take it into
account when specifying the categorical
variable position.
Discriminant analysis allows you to
test if the assumed membership of the observations
to some a priori groups
(described
by
the
categorical
variable
containing
the
various
groups
names)
can
be
justified
using
some
explanatory
variables
(the
numerical
variables
of
the
initial
dataset).
The
explanatory
variables
need
to
be
numerical,
but
if
you
want
to
introduce
some
qualitative
variables
to
explain
the
membership
to
this
or
that
group, it is possible it you first do a
Multiple Correspondence Analysis and then use the
observations coordinates
(stored in one
of the results tables of MCA - coordinates on one
or two axes are usually enough) as quantitative
data.
Discriminant analysis can lead (if the
option is selected) to an optimal (in sense of
between-groups over within-
groups
variability) graphical representation of the
observations in a two dimensions space. XLSTAT
discriminant
analysis makes possible to
reclassify the observations that have been used
for the analysis (
also (optional) to do
a crossvalidation using some more observations
stored at the end of the initial dataset. The
number of crossvalidation data needs to
be specified in the dialog box. Note that for the
resubstitution and the
crossvalidation,
no
prior
probabilities
are
set
by
XLSTAT.
It
is
not
necessary
to
define
a
priori
belonging
to
groups for the crossvalidation data.
The linear discriminant
functions are often used to score people (e.g. by
financial organizations): knowing some
numerical data about an individual, you
can determine to which category he most probably
belongs calculating
the scores given by
each discriminant function.
-
-
-
-
-
-
-
-
-
上一篇:电子产品英文说明书
下一篇:PC1D5.9说明书