Rowley, Baluja, and Kanade: Neural Network-Based Face Detection (PAMI, January 1998)
Copyright 1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Neural Network-Based Face Detection
Henry A. Rowley, Shumeet Baluja, and Takeo Kanade
Abstract
We present a neural network-based upright frontal face detection system. A retinally connected neural network examines small windows of an image and decides whether each window contains a face. The system arbitrates between multiple networks to improve performance over a single network. We present a straightforward procedure for aligning positive face examples for training. To collect negative examples, we use a bootstrap algorithm, which adds false detections into the training set as training progresses. This eliminates the difficult task of manually selecting nonface training examples, which must be chosen to span the entire space of nonface images. Simple heuristics, such as using the fact that faces rarely overlap in images, can further improve the accuracy. Comparisons with several other state-of-the-art face detection systems are presented, showing that our system has comparable performance in terms of detection and false-positive rates.
Keywords: Face detection, Pattern recognition, Computer vision, Artificial neural networks, Machine learning
1 Introduction
In this paper, we present a neural network-based algorithm to detect upright, frontal views of faces in gray-scale images. The algorithm works by applying one or more neural networks directly to portions of the input image, and arbitrating their results. Each network is trained to output the presence or absence of a face. The algorithms and training methods are designed to be general, with little customization for faces.
Many face detection researchers have used the idea that facial images can be characterized directly in terms of pixel intensities. These images can be characterized by probabilistic models of the set of face images [4, 13, 15], or implicitly by neural networks or other mechanisms [3, 12, 14, 19, 21, 23, 25, 26]. The parameters for these models are adjusted either automatically from example images (as in our work) or by hand. A few authors have taken the approach of extracting features and applying either manually or automatically generated rules for evaluating these features [7, 11].
Training a neural network for the face detection task is challenging because of the difficulty in characterizing prototypical “nonface” images. Unlike face recognition, in which the classes to be discriminated are different faces, the two classes to be discriminated in face detection are “images containing faces” and “images not containing faces”. It is easy to get a representative sample of images which contain faces, but much harder to get a representative sample of those which do not. We avoid the problem of using a huge training set for nonfaces by selectively adding images to the
training set as training progresses [21]. This “bootstrap” method reduces the size of the training set needed. The use of arbitration between multiple networks and heuristics to clean up the results significantly improves the accuracy of the detector.
Detailed descriptions of the example collection and training methods, network architecture, and arbitration methods are given in Section 2. In Section 3, the performance of the system is examined. We find that the system is able to detect 90.5% of the faces over a test set of 130 complex images, with an acceptable number of false positives. Section 4 briefly discusses some techniques that can be used to make the system run faster, and Section 5 compares this system with similar systems. Conclusions and directions for future research are presented in Section 6.
2 Description of the System
Our system operates in two stages: it first applies a set of neural network-based filters to an image, and then uses an arbitrator to combine the outputs. The filters examine each location in the image at several scales, looking for locations that might contain a face. The arbitrator then merges detections from individual filters and eliminates overlapping detections.
2.1 Stage One: A Neural Network-Based Filter
The first component of our system is a filter that receives as input a 20x20 pixel region of the image, and generates an output ranging from 1 to -1, signifying the presence or absence of a face, respectively. To detect faces anywhere in the input, the filter is applied at every location in the image. To detect faces larger than the window size, the input image is repeatedly reduced in size (by subsampling), and the filter is applied at each size. This filter must have some invariance to position and scale. The amount of invariance determines the number of scales and positions at which it must be applied. For the work presented here, we apply the filter at every pixel position in the image, and scale the image down by a factor of 1.2 for each step in the pyramid.
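This scanning scheme can be sketched in a few lines. This is not the authors' code; the 20x20 window size and the 1.2 scale step come from the text, while the function names are illustrative:

```python
# Sketch of the pyramid scanning scheme: the 20x20 filter is applied at
# every pixel position of every pyramid level, with levels shrinking by 1.2x.

def pyramid_scales(width, height, window=20, step=1.2):
    """Scale factors at which the filter is applied, largest image first."""
    scales = []
    scale = 1.0
    # Keep shrinking until the image is smaller than one filter window.
    while width / scale >= window and height / scale >= window:
        scales.append(scale)
        scale *= step
    return scales

def window_positions(width, height, window=20):
    """Top-left corners of all window-sized regions at one pyramid level."""
    return [(x, y)
            for y in range(height - window + 1)
            for x in range(width - window + 1)]
```

At each scale the image is subsampled by the corresponding factor, and every returned position is preprocessed and passed to the network.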
The filtering algorithm is shown in Fig. 1. First, a preprocessing step, adapted from [21], is
applied to a window of the image. The window is then passed through a neural network, which decides whether the window contains a face. The preprocessing first attempts to equalize the intensity values across the window. We fit a function which varies linearly across the window to the intensity values in an oval region inside the window. Pixels outside the oval (shown in Fig. 2a) may represent the background, so those intensity values are ignored in computing the lighting variation across the face. The linear function will approximate the overall brightness of each part of the window, and can be subtracted from the window to compensate for a variety of lighting conditions. Then histogram equalization is performed, which non-linearly maps the intensity values to expand the range of intensities in the window. The histogram is computed for pixels inside an oval region in the window. This compensates for differences in camera input gains, as well as improving contrast in some cases. The preprocessing steps are shown in Fig. 2.
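The two preprocessing steps can be sketched as follows, assuming the window is a NumPy array and `mask` is a boolean array marking the oval region; the exact oval shape and the histogram bin count are our assumptions, not details taken from the paper:

```python
import numpy as np

def lighting_correction(window, mask):
    """Fit a plane a*x + b*y + c to the masked (oval) pixels by least
    squares and subtract it from the whole window."""
    ys, xs = np.nonzero(mask)
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(float)
    coeffs, *_ = np.linalg.lstsq(A, window[ys, xs].astype(float), rcond=None)
    yy, xx = np.mgrid[0:window.shape[0], 0:window.shape[1]]
    plane = coeffs[0] * xx + coeffs[1] * yy + coeffs[2]
    return window - plane

def histogram_equalization(window, mask, levels=256):
    """Map intensities through the CDF computed over the oval pixels."""
    hist, edges = np.histogram(window[mask], bins=levels)
    cdf = hist.cumsum() / hist.sum()
    idx = np.clip(np.digitize(window, edges[1:-1]), 0, levels - 1)
    return cdf[idx] * (levels - 1)

# Demo on a vertical brightness ramp: the fitted plane removes it exactly.
window = np.outer(np.arange(20.0), np.ones(20)) + 5.0
oval = np.ones((20, 20), dtype=bool)   # stand-in mask; real code uses an oval
flat = lighting_correction(window, oval)
equalized = histogram_equalization(window, oval)
```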
The preprocessed window is then passed through a neural network. The network has retinal connections to its input layer; the receptive fields of hidden units are shown in Fig. 1. There are three types of hidden units: 4 which look at 10x10 pixel subregions, 16 which look at 5x5 pixel subregions, and 6 which look at overlapping 20x5 pixel horizontal stripes. Each of these types was chosen to allow the hidden units to detect local features that might be important for face detection. In particular, the horizontal stripes allow the hidden units to detect such features as mouths or pairs of eyes, while the hidden units with square receptive fields might detect features such as individual eyes, the nose, or corners of the mouth. Although the figure shows a single hidden unit for each subregion of the input, these units can be replicated. For the experiments which are described later, we use networks with two and three sets of these hidden units. Similar input connection patterns are commonly used in speech and character recognition tasks [10, 24]. The network has a single, real-valued output, which indicates whether or not the window contains a face.
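The three receptive-field families can be enumerated directly. The counts (4, 16, and 6) come from the text, but the 3-pixel stripe step, chosen to yield exactly six overlapping 20x5 stripes, is our assumption:

```python
def receptive_fields(window=20):
    """Enumerate receptive fields as (x, y, width, height) tuples."""
    fields = []
    for y in range(0, window, 10):          # 4 units over 10x10 quadrants
        for x in range(0, window, 10):
            fields.append((x, y, 10, 10))
    for y in range(0, window, 5):           # 16 units over 5x5 subregions
        for x in range(0, window, 5):
            fields.append((x, y, 5, 5))
    for y in range(0, window - 5 + 1, 3):   # 6 overlapping 20x5 stripes
        fields.append((0, y, 20, 5))
    return fields
```

Each tuple corresponds to one hidden unit connected only to the pixels inside that region, which is what gives the network its "retinal" connection pattern.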
Examples of output from a single network are shown in Fig. 3. In the figure, each box represents the position and size of a window to which the neural network gave a positive response. The network has some invariance to position and scale, which results in multiple boxes around some faces. Note also that there are some false detections; they will be eliminated by methods presented in Section 2.2.
To train the neural network used in stage one to serve as an accurate filter, a large number of face and nonface images are needed. Nearly 1050 face examples were gathered from face databases at CMU, Harvard, and from the World Wide Web. The images contained faces of various sizes, orientations, positions, and intensities. The eyes, tip of nose, and corners and center of the mouth of each face were labelled manually. These points were used to normalize each face to the same scale, orientation, and position, as follows:
1. Initialize F̄, a vector which will be the average positions of each labelled feature over all the faces, with the feature locations in the first face, F_1.
2. The feature coordinates in F̄ are rotated, translated, and scaled, so that the average locations of the eyes will appear at predetermined locations in a 20x20 pixel window.
3. For each face i, compute the best rotation, translation, and scaling to align the face's features F_i with the average feature locations F̄. Such transformations can be written as a linear function of their parameters. Thus, we can write a system of linear equations mapping the features from F_i to F̄. The least squares solution to this over-constrained system yields the parameters for the best alignment transformation. Call the aligned feature locations F'_i.
4. Update F̄ by averaging the aligned feature locations F'_i for each face i.
5. Go to step 2.
The alignment algorithm converges within five iterations, yielding for each face a function which maps that face to a 20x20 pixel window.
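Step 3 of the procedure above reduces to an ordinary least-squares problem. The sketch below is ours rather than the authors': it parameterizes rotation and scale jointly as (a, b) and translation as (tx, ty), so each feature point contributes two linear equations.

```python
import numpy as np

def align_params(src, dst):
    """Least-squares similarity transform mapping src points onto dst:
    x' = a*x - b*y + tx,  y' = b*x + a*y + ty."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, -y, 1.0, 0.0]); rhs.append(u)
        rows.append([y,  x, 0.0, 1.0]); rhs.append(v)
    params, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return params  # a, b, tx, ty

def apply_transform(params, pts):
    a, b, tx, ty = params
    return [(a * x - b * y + tx, b * x + a * y + ty) for x, y in pts]
```

Iterating "align every face to the running average, then re-average" a handful of times reproduces the five-iteration convergence noted in the text.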
Fifteen face examples are generated for the training set from each original image, by randomly rotating the images (about their center points) up to 10°, scaling between 90% and 110%, translating up to half a pixel, and mirroring. Each 20x20 window in the set is then preprocessed (by applying lighting correction and histogram equalization). A few example images are shown in Fig. 4.
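A minimal sketch of sampling the fifteen-fold augmentation parameters, using the ranges quoted above (rotation within ±10°, scale 90-110%, translation up to half a pixel, optional mirroring); applying these to pixels would need an image-warping routine, which is omitted here:

```python
import random

def augmentation_params(n=15, seed=0):
    """Random per-example variations for one aligned face."""
    rng = random.Random(seed)
    return [dict(angle=rng.uniform(-10.0, 10.0),   # degrees
                 scale=rng.uniform(0.9, 1.1),
                 dx=rng.uniform(-0.5, 0.5),        # pixels
                 dy=rng.uniform(-0.5, 0.5),
                 mirror=rng.random() < 0.5)
            for _ in range(n)]
```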
The randomization gives the filter invariance to translations of less than a pixel and scalings of 20%. Larger changes in translation and scale are dealt with by applying the filter at every pixel position in an image pyramid, in which the images are scaled by factors of 1.2.
Practically any image can serve as a nonface example because the space of nonface images is much larger than the space of face images. However, collecting a “representative” set of nonfaces
is difficult. Instead of collecting the images before training is started, the images are collected during training, in the following manner, adapted from [21]:
1. Create an initial set of nonface images by generating 1000 random images. Apply the preprocessing steps to each of these images.
2. Train a neural network to produce an output of 1 for the face examples, and -1 for the nonface examples. The training algorithm is standard error backpropagation with momentum [8]. On the first iteration of this loop, the network's weights are initialized randomly. After the first iteration, we use the weights computed by training in the previous iteration as the starting point.
3. Run the system on an image of scenery which contains no faces. Collect subimages in which the network incorrectly identifies a face (an output activation > 0).
4. Select up to 250 of these subimages at random, apply the preprocessing steps, and add them into the training set as negative examples. Go to step 2.
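The four steps can be sketched as a loop. Everything here is schematic: `train` and `detect` stand in for the backpropagation trainer and the filter, and the tiny initial negative set stands in for the 1000 random images:

```python
import random

def bootstrap_negatives(train, detect, faces, scenery_windows,
                        iterations=3, cap=250, seed=0):
    """Bootstrap loop: train, scan face-free scenery, and fold up to
    `cap` false detections back into the negative training set."""
    rng = random.Random(seed)
    negatives = [rng.random() for _ in range(10)]  # stand-in for step 1
    model = None
    for _ in range(iterations):                    # steps 2-4
        model = train(faces, negatives)
        false_hits = [w for w in scenery_windows if detect(model, w)]
        rng.shuffle(false_hits)                    # pick up to `cap` at random
        negatives.extend(false_hits[:cap])
    return model, negatives
```

With a real network, `detect` would report windows whose output activation exceeds the face threshold; here any classifier with the same interface works.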
Some examples of nonfaces that are collected during training are shown in Fig. 5. Note that some of the examples resemble faces, although they are not very close to the positive examples shown in Fig. 4. The presence of these examples forces the neural network to learn the precise boundary between face and nonface images.
We used 120 images of scenery for collecting negative examples in the bootstrap manner described above. A typical training run selects approximately 8000 nonface images from the 146,212,178 subimages that are available at all locations and scales
in the training scenery images. A similar training algorithm was described in [5], where at each iteration an entirely new network was trained with the examples on which the previous networks had made mistakes.
2.2 Stage Two: Merging Overlapping Detections and Arbitration
The examples in Fig. 3 showed that the raw output from a single network will contain a number of false detections. In this section, we present two strategies to improve the reliability of the detector: merging overlapping detections from a single network and arbitrating among multiple networks.
2.2.1 Merging Overlapping Detections
Note that in Fig. 3, most faces are detected at multiple nearby positions or scales, while false detections often occur with less consistency. This observation leads to a heuristic which can eliminate many false detections. For each location and scale, the number of detections within a specified neighborhood of that location can be counted. If the number is above a threshold, then that location is classified as a face. The centroid of the nearby detections defines the location of the detection result, thereby collapsing multiple detections. In the experiments section, this heuristic will be referred to as “thresholding”.
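A sketch of the thresholding heuristic in a single scale (a full implementation also counts detections across neighboring pyramid levels); the `radius` and `threshold` values are illustrative, not taken from the paper:

```python
def merge_detections(detections, radius=2, threshold=2):
    """Count detections within `radius` of each detection; when the count
    reaches `threshold`, emit the centroid of that neighborhood as one face."""
    merged = []
    used = set()
    for i, (x, y) in enumerate(detections):
        if i in used:
            continue
        nearby = [j for j, (u, v) in enumerate(detections)
                  if abs(u - x) <= radius and abs(v - y) <= radius]
        if len(nearby) >= threshold:
            cx = sum(detections[j][0] for j in nearby) / len(nearby)
            cy = sum(detections[j][1] for j in nearby) / len(nearby)
            merged.append((cx, cy))    # collapse the cluster to its centroid
            used.update(nearby)
    return merged
```

Isolated responses never reach the threshold and are discarded, which is exactly why inconsistent false detections are suppressed while repeatedly detected faces survive.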