Original text for foreign-language translation
Title: Fundamentals_of_Statistics
Measures of Central Tendency and Location: mean, median, mode, percentiles, quartiles and deciles.
x     sorted x
53    53
55    53
70    53
58    55
64    57
57    57
53    58
69    64
57    68
68    69
53    70
The Measures of Central Tendency are Mean, Median and Mode
Mean (x-bar or $\bar{x}$): for a given variable, it is the sum of the values divided by the number of values, $\bar{x} = \sum x_i / n$. In this case, we have $n = 11$, so we need to add all of the values together and divide by 11: $\sum x_i = 657$, $\bar{x} = 657 / 11 \approx 59.73$.
Median: the number in a distribution of a variable's responses where one half of the values are above and one half of the values are below. To find the median, we first need to put our data in ascending order (smallest to largest). Then we can determine the median: if the value of $n$ is odd, it is simply the middle observation, but if the value of $n$ is even, it is the average of the two middle observations. In this case, $n$ is odd, so the median will be the middle observation of our sorted values (the 6th value), which is 57.
Mode: the value that occurs most frequently. If two different values tie for most frequent, the data are said to be bi-modal. If there are more than two modes, the distribution is said to be multi-modal. In this case, the value that occurs most often is 53. So, the mode is 53.
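The three measures above can be checked with a short Python sketch. It assumes the standard-library `statistics` module (Python 3.8 or later for `multimode`); the variable names are illustrative and not part of the original notes.

```python
from statistics import mean, median, multimode

# The eleven observations from the table of x values.
x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]

x_bar = mean(x)        # sum of the values divided by n
middle = median(x)     # sorts internally; averages the two middle values when n is even
modes = multimode(x)   # every most-frequent value; two entries would mean bi-modal data

print(round(x_bar, 2), middle, modes)  # 59.73 57 [53]
```

`multimode` is used rather than `mode` so that bi-modal or multi-modal data would show up as a list with more than one entry.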
The measures of location are Percentile, Quartile and Decile
Percentile: the $p$th percentile is a value such that at least $p$ percent of the observations are less than or equal to this value and at least $(100 - p)$ percent of the observations are greater than or equal to this value.
To calculate percentiles, we use an index $i$:
$i = (p/100)\,n$ for $p_1, p_2, p_3, \ldots, p_{99}$
If the index is a whole number (an integer), then the percentile is the average of the values at positions $i$ and $i + 1$, that is, at positions $(p/100)n$ and $1 + (p/100)n$.
If the index is not a whole number, we ALWAYS round up. The position is the next whole number (integer) greater than the computed index.
For example: $i_{(p50)} = (50/100) \times 11 = 5.5$, which rounds up to 6.
So, we would count from the lowest value of the sorted data to the index number (6). Since the calculated $i$ was not a whole number, we had to round up to find the value where at least 50% of the values are equal to or lower than this value and at least 50% are equal to or higher than this value. In this case, the value of the 50th percentile is the 6th value, 57. Does this look familiar? The 50th percentile is the same thing as the median.
What does it tell us? In this distribution, AT LEAST 50% of the observations are LESS THAN OR EQUAL TO 57 AND AT LEAST 50% of the observations are GREATER THAN OR EQUAL TO 57.
$i_{(p80)} = (80/100) \times 11 = 8.8$, which rounds up to 9. The 9th value is 68.
Again, since the index number is not a whole number, we round up. So, we would count from the lowest value of the sorted data to the index number (9). In this case, the value of the 80th percentile is 68.
Since this dataset has 11 observations, we won't have any instances where our calculated index number is a whole number. However, if we just remove our value of 70 and create a new distribution, we will be able to see an example...
53 53 53 55 57 57 58 64 68 69
$i_{(p30)} = (30/100) \times 10 = 3$. This is a whole number, so we must take the 3rd and 4th values and average them to find the 30th percentile.
(53 + 55)/2 = 54
So, the value of the 30th percentile is 54.
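The index rule can be sketched as a small Python function. This is only an illustration of the round-up and average-the-neighbours procedure described in these notes; the function name `percentile` is hypothetical, and library routines such as those in NumPy use different interpolation conventions by default, so their results may differ.

```python
def percentile(values, p):
    """p-th percentile (integer p from 1 to 99) using the index i = (p/100) * n."""
    data = sorted(values)
    n = len(data)
    # Integer arithmetic avoids floating-point surprises: i = whole + remainder/100.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # Whole-number index: average the values at 1-based positions i and i + 1.
        return (data[whole - 1] + data[whole]) / 2
    # Otherwise round up: 1-based position whole + 1 is 0-based index whole.
    return data[whole]

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
x_without_70 = [v for v in x if v != 70]

print(percentile(x, 50))             # 57
print(percentile(x, 80))             # 68
print(percentile(x_without_70, 30))  # 54.0
```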
Return to our original data distribution ...
Quartiles: are special cases of percentiles… $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$. These three values divide the distribution into 4 equal quarters.
$i_{(Q1)} = (25/100) \times 11 = 2.75$, which rounds up to 3, so Q1 is the 3rd value, 53.
$i_{(Q2)} = (50/100) \times 11 = 5.5$, which rounds up to 6, so Q2 is the 6th value, 57.
$i_{(Q3)} = (75/100) \times 11 = 8.25$, which rounds up to 9, so Q3 is the 9th value, 68.
Measures of Dispersion or Variability: Range, interquartile range (IQR), variance, standard deviation and coefficient of variation.
Range: This tells us how wide the span is from the maximum value to the minimum value. (Max − Min) = Range. In this instance, the maximum is 70 and the minimum is 53, so the range is 70 − 53 = 17.
Interquartile Range (IQR): This tells us how wide the span is in the middle 50% of the data. (Q3 − Q1) = IQR. In this case, Q1 is the 3rd sorted value (53) and Q3 is the 9th sorted value (68), so the IQR is 68 − 53 = 15.
We will use the IQR in later processes, so we will want to keep this value.
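As a rough check of the range and IQR, the sketch below applies the round-up index rule for percentiles to the sorted data. The helper `value_at` is a hypothetical name, and it only handles the non-whole-number case, which is all that can occur with $n = 11$. Counting positions in the sorted list, the 3rd value is 53 and the 9th value is 68.

```python
import math

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
data = sorted(x)
n = len(data)

def value_at(p):
    """Value at the rounded-up index i = (p/100) * n (never a whole number when n = 11)."""
    return data[math.ceil(p * n / 100) - 1]

q1, q2, q3 = value_at(25), value_at(50), value_at(75)
data_range = max(data) - min(data)   # Max - Min
iqr = q3 - q1                        # Q3 - Q1

print(q1, q2, q3)        # 53 57 68
print(data_range, iqr)   # 17 15
```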
x      (x - xbar)   (x - xbar)^2
53     -6.73        45.29
53     -6.73        45.29
53     -6.73        45.29
55     -4.73        22.37
57     -2.73        7.45
57     -2.73        7.45
58     -1.73        2.99
64     4.27         18.23
68     8.27         68.39
69     9.27         85.93
70     10.27        105.47
Sum:
657    -0.03        454.18
657/11 = 59.73      454.18/10 ≈ 45.42
We use the formula: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
The sum of squared deviations for these data is 454.18, so the variance is $454.18 / 10 \approx 45.42$. For our purposes here, the computation of variance is just a step towards the computation of the standard deviation.
Sample standard deviation ($s$) is the positive square root of the variance: $s = \sqrt{45.42} \approx 6.74$.
So the formula for sample standard deviation is… $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
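The deviation table and the two formulas can be reproduced step by step. This is a minimal sketch in plain Python; the names `ss`, `s2` and `s` are illustrative, and the standard-library `statistics.variance` and `statistics.stdev` are used only as a cross-check.

```python
import math
from statistics import variance, stdev

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
n = len(x)
x_bar = sum(x) / n

# Sum of squared deviations from the mean (the last column of the table).
ss = sum((xi - x_bar) ** 2 for xi in x)

s2 = ss / (n - 1)    # sample variance
s = math.sqrt(s2)    # sample standard deviation

print(round(ss, 2), round(s2, 2), round(s, 2))  # 454.18 45.42 6.74

# The library functions use the same n - 1 denominator.
assert math.isclose(s2, variance(x))
assert math.isclose(s, stdev(x))
```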
Population Variance ($\sigma^2$): uses the same formula in the numerator, but $N$ instead of $n - 1$ in the denominator. Since we rarely have information about the entire population, we almost always use the formula for sample variance, $s^2$.
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$… since we rarely have information from the entire population, we use the formula for sample standard deviation, $s$.
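To see the effect of the denominator, the sketch below treats the eleven values as if they were an entire population, purely for illustration (the notes treat them as a sample). It assumes the standard-library functions `pvariance` and `pstdev`, which divide by $N$, alongside `variance` and `stdev`, which divide by $n - 1$.

```python
from statistics import pvariance, pstdev, variance, stdev

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]

# Dividing by N = 11 (population formulas).
print(round(pvariance(x), 2), round(pstdev(x), 2))  # 41.29 6.43

# Dividing by n - 1 = 10 (sample formulas).
print(round(variance(x), 2), round(stdev(x), 2))    # 45.42 6.74
```

The population figures are always somewhat smaller than the sample figures for the same data, because the same numerator is divided by a larger number.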
Coefficient of Variation: $CV = (s / \bar{x}) \times 100$ tells us what percent the sample standard deviation is of the sample mean.
This number is "relative" and is only of use in comparing the distribution of two or more variables.
Suppose I have two samples, and I want to know which sample has more variability… If both samples have the same mean, the one with the higher standard deviation will have the greater variability. However, if they have different means, I need to calculate the coefficient of variation to determine which one has more variability.
$\bar{x} = 458$, $s = 112$ versus $\bar{x} = 687$, $s = 192$
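The comparison of the two samples can be worked through in a few lines. The function name `coefficient_of_variation` is hypothetical; the numbers are the two means and standard deviations quoted in the text.

```python
def coefficient_of_variation(s, x_bar):
    """Sample standard deviation as a percentage of the sample mean."""
    return s / x_bar * 100

cv_first = coefficient_of_variation(112, 458)
cv_second = coefficient_of_variation(192, 687)

print(round(cv_first, 2), round(cv_second, 2))  # 24.45 27.95
```

Under this measure the second sample shows more relative variability, since its standard deviation is about 28% of its mean against about 24% for the first.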
Standardized Data and Detecting Outliers
Z-score: $z = \dfrac{x - \bar{x}}{s}$
The z-score tells us how many standard deviations a value is from the mean.
We can look at a picture of what a z-score tells us. In the Normal Curve… the mean is at the highest point and the curve tails off symmetrically in both directions. The sign of the z-score tells us which direction the value is from the mean on the Normal Curve. Negative values will be to the left, and positive values will be to the right.
Standardizing Scores: Standard Normal Curve… the mean is zero, and the standard deviation is 1. The distribution is bell-shaped and symmetrical. The area under the curve is 1, and the tails of the curve extend out infinitely. They never actually touch the horizontal axis. The highest point on the curve is at the mean.
Return to our data… let's calculate the z-scores for each of the values…
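A minimal sketch of that calculation, assuming the sample mean and sample standard deviation from the standard-library `statistics` module; the printed layout is illustrative only.

```python
from statistics import mean, stdev

x = sorted([53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53])

x_bar = mean(x)
s = stdev(x)   # sample standard deviation (n - 1 denominator)

# z = (x - x_bar) / s for every observation.
z = [(xi - x_bar) / s for xi in x]

for xi, zi in zip(x, z):
    print(xi, round(zi, 2))
```

The smallest value, 53, sits about one standard deviation below the mean (z ≈ -1.00), and the largest, 70, about one and a half above it (z ≈ 1.52).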
Empirical Rule: used when the distribution is known or assumed to be approximately normal.
• Approximately 68% of the values will fall within 1 sd of the mean
• Approximately 95% of the values will fall within 2 sd of the mean
• Approximately 99.7% of the values will fall within 3 sd of the mean
Chebyshev's Theorem: doesn't require that the data have a normal distribution.
Says that at least $(1 - 1/z^2)$ of the values will fall within $z$ standard deviations of the mean.
$1 - 1/1^2 = 0$, $1 - 1/2^2 = 0.75$, $1 - 1/3^2 \approx 0.88889$, $1 - 1/4^2 = 0.9375$, $1 - 1/5^2 = 0.96$
• We can't make any assumptions about the percent of values that are within 1 sd of the mean
But…
• At least 75% of the values will fall within 2 sd of the mean
• At least 88.9% of the values will fall within 3 sd of the mean
We use Chebyshev's Theorem to estimate the variation in a distribution when
• $n < 30$, or
• the shape of the distribution is unknown, or
• the distribution is assumed to be non-normal.
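The Chebyshev bounds can be tabulated and compared with what the eleven observations actually do. This is an illustrative sketch; `chebyshev_bound` and `share_within` are hypothetical helper names, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def chebyshev_bound(z):
    """Minimum share of values within z standard deviations (informative only for z > 1)."""
    return max(0.0, 1 - 1 / z ** 2)

def share_within(values, z):
    """Observed share of values lying within z sample standard deviations of the mean."""
    x_bar, s = mean(values), stdev(values)
    return sum(abs(v - x_bar) <= z * s for v in values) / len(values)

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]

for z in (1, 2, 3):
    print(z, round(chebyshev_bound(z), 5), round(share_within(x, z), 3))
```

For these data, 8 of the 11 values lie within 1 sd and all of them within 2 sd, so the observed shares comfortably exceed the guaranteed minimums of 0 and 75%.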
Outliers: suspect or extreme values of data that must be identified and scrutinized. If they are instances of incorrectly entered data, they should be corrected. If the value was entered correctly and it is a valid number, it should remain in the dataset as part of the initial analysis.
When we use the z-score method for identifying outliers, we assume that any value that has a z-score with an absolute value greater than 3.0 (that is, less than -3.0 or greater than +3.0) is an outlier. Before we proceed with data analysis, we need to examine all outliers for accuracy. If we determine that the value is valid, we often run two sets of analysis: one with the outlier, and one without.
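The z-score screening rule might be sketched as follows. The function name `z_score_outliers` and the second dataset are hypothetical and exist only to show a flagged value; the first dataset is the one used throughout these notes.

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=3.0):
    """Return the values whose z-score has an absolute value above the threshold."""
    x_bar, s = mean(values), stdev(values)
    return [v for v in values if abs((v - x_bar) / s) > threshold]

x = [53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53]
print(z_score_outliers(x))             # [] : no z-score beyond +/-3

# Hypothetical data: twenty ordinary readings and one extreme entry.
hypothetical = [50] * 20 + [500]
print(z_score_outliers(hypothetical))  # [500]
```

A flagged value is only a candidate for scrutiny; as the text says, a valid value stays in the dataset and the analysis can be run with and without it.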
Another way to identify outliers…
Related to IQR is the Five Number Summary… minimum, Q1, Q2, Q3, and maximum. These values feed into upper and lower limits, and we graph them in a box plot.
Five Number Summary
Minimum    53