SHANGHAI JIAO TONG UNIVERSITY

Reinforcement Learning

Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning

Group Number: G-07

Group Members: Wang Wenqing, Gao Xiaoning, Qian Chen
Contents

1 Introduction
2 Deep Q-learning Network
  2.1 Q-learning
    2.1.1 Reinforcement Learning Problem
    2.1.2 Q-learning Formulation [6]
  2.2 Deep Q-learning Network
  2.3 Input Pre-processing
  2.4 Experience Replay and Stability
  2.5 DQN Architecture and Algorithm
3 Experiments
  3.1 Parameters Settings
  3.2 Results Analysis
4 Conclusion
5 References
Playing the Game of Flappy Bird with Deep Reinforcement Learning
Abstract

Letting machines play games has been one of the popular topics in AI. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of games and update its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game of Flappy Bird as the input to the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human beings.
1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on screen through the gaps between pairs of pipes by tapping the screen. When the player taps the screen, the bird jumps up; when the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.

Figure 1: (a) normal flight state (b) crash state (c) passing state
Our goal in this paper is to design an agent that plays Flappy Bird automatically from the same input a human player sees, which means that we use raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.
In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in doing so. First, most successful deep learning applications to date have required large amounts of hand-labelled training data, while RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy, and delayed. Second, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Third, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper demonstrates that a Convolutional Neural Network (CNN) can overcome those challenges and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.
2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters using stochastic gradient descent. In the following sections, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.
2.1 Q-learning

2.1.1 Reinforcement Learning Problem
Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].
Figure 2: Traditional Reinforcement Learning scenario
The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. In order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with this might be negative [5].
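To make this interaction loop concrete, here is a minimal Python sketch of one episode; the `env` object and its `reset()`/`step()` interface are our own illustrative assumptions, not part of the original project:

    # Minimal sketch of the agent-environment loop described above.
    # `env` and its reset()/step() interface are assumed for illustration;
    # any environment with this shape would work.
    def run_episode(env, policy):
        state = env.reset()                          # initial state s_0
        total_reward = 0.0
        done = False
        while not done:
            action = policy(state)                   # choose a_t given s_t
            state, reward, done = env.step(action)   # receive s_{t+1}, r_{t+1}
            total_reward += reward
        return total_reward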
2.1.2 Q-learning Formulation [6]
In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g., one game) forms a finite sequence of states, actions, and rewards:
$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$
Here $s_i$ represents the state, $a_i$ is the action, and $r_{i+1}$ is the reward after performing action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:
$$R_t = r_t + r_{t+1} + \cdots + r_{n-1} + r_n \qquad (1)$$
In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use the discounted future reward:

$$R_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-t-1} r_{n-1} + \gamma^{n-t} r_n = \sum_{i=t}^{n} \gamma^{i-t} r_i \qquad (2)$$
Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming equation (2) gives:
$$R_t = r_t + \gamma R_{t+1} \qquad (3)$$
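As a quick illustration of equations (2) and (3), the recursion can be evaluated by sweeping backwards over an episode's rewards; this short sketch (our own illustration, not from the original report) computes every $R_t$ in one pass:

    # Compute discounted returns R_t = r_t + gamma * R_{t+1} (equation (3))
    # by iterating backwards from the terminal step.
    def discounted_returns(rewards, gamma=0.99):
        returns = [0.0] * len(rewards)
        future = 0.0                   # R_{t+1}; zero beyond the terminal step
        for t in reversed(range(len(rewards))):
            future = rewards[t] + gamma * future
            returns[t] = future
        return returns

    # Example: three steps of reward 1 with gamma = 0.9 give
    # R_2 = 1, R_1 = 1 + 0.9*1 = 1.9, R_0 = 1 + 0.9*1.9 = 2.71
    print(discounted_returns([1, 1, 1], gamma=0.9))  # [2.71, 1.9, 1.0]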
In Q-learning, we define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:
$$Q(s_t, a_t) = \max R_{t+1} \qquad (4)$$
It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent would be to always choose the action that maximizes the discounted future reward:
$$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t) \qquad (5)$$
Here $\pi$ represents the policy, the rule by which we choose an action in each state. Given a transition $(s_t, a_t, s_{t+1})$, equations (3) and (4) yield the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:
$$Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a') \qquad (6)$$
The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s_t, a_t)$, which is stored as a table. The overall procedure is given in Algorithm 1:
Algorithm 1 Q-learning

    Initialize Q[num_states, num_actions] arbitrarily
    Observe initial state s_0
    Repeat
        Select and carry out an action a
        Observe reward r and new state s'
        Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
        s = s'
    Until terminated
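As a concrete rendering of Algorithm 1, here is a minimal tabular Q-learning sketch in Python; the environment interface (`reset`/`step`) and the hyper-parameter values are our own assumptions for illustration, not code from the original project:

    import random
    from collections import defaultdict

    # Illustrative hyper-parameters: learning rate alpha, discount gamma,
    # exploration rate epsilon.
    ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

    def q_learning(env, actions, episodes=1000):
        Q = defaultdict(float)             # Q[(state, action)], arbitrary init (0)
        for _ in range(episodes):
            s = env.reset()                # observe initial state s_0
            done = False
            while not done:
                if random.random() < EPSILON:                  # explore
                    a = random.choice(actions)
                else:                                          # exploit
                    a = max(actions, key=lambda x: Q[(s, x)])
                s2, r, done = env.step(a)  # observe reward r and new state s'
                # Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
                # (terminal-state bootstrapping omitted for brevity, as in Algorithm 1)
                target = r + GAMMA * max(Q[(s2, x)] for x in actions)
                Q[(s, a)] += ALPHA * (target - Q[(s, a)])
                s = s2
        return Q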
2.2 Deep Q-learning Network
In Q-learning, the state space is often too big to fit into main memory. A game frame of 80 × 80 binary pixels has $2^{6400}$ states, which is impossible to represent with a Q-table. What's more, when it encounters a previously unseen state during training, Q-learning can only perform a random action, meaning that it is not heuristic. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].
After training the DQN, a multilayer neural network can approximate the traditional optimal Q-table as follows:
$$Q(s_t, a_t; \theta) \approx Q^*(s_t, a_t) \qquad (7)$$
As for playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3.

Figure 3: In DQN, the CNN's input is the raw game image, while its outputs are the Q-values Q(s, a), with one output neuron corresponding to each action's Q-value.
In order to update the CNN's weights, we define the cost function and the gradient update rule as [9][10]:
$$L = \frac{1}{2}\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\right]^2 \qquad (8)$$

$$\nabla_{\theta} L = -\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\right] \nabla_{\theta} Q(s_t, a_t; \theta) \qquad (9)$$

$$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta) \qquad (10)$$
Here, $\theta$ denotes the DQN parameters that get trained, and $\theta^-$ denotes the non-updated parameters of the Q-value function. During training, we use equation (9) to update the weights of the CNN.
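For concreteness, here is a minimal sketch of one such update step, written in PyTorch; the network objects, optimizer, and tensor shapes are our illustrative assumptions, and the original report does not prescribe this code:

    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, target_net, optimizer, s, a, r, s_next, done, gamma=0.99):
        """One gradient step on loss (8); q_net holds theta, target_net theta^-.

        Assumed shapes: s, s_next are [B, ...] state batches; a is a [B]
        LongTensor of action indices; r and done are [B] float tensors.
        """
        # Q(s_t, a_t; theta) for the actions actually taken
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                     # theta^- is held fixed
            max_q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * max_q_next * (1.0 - done)
        loss = 0.5 * F.mse_loss(q_sa, target)     # equation (8)
        optimizer.zero_grad()
        loss.backward()                           # gradient, equation (9)
        optimizer.step()                          # parameter update, equation (10)
        return loss.item()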
Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The $\varepsilon$-greedy approach achieves this target: during training, we select a random action with probability $\varepsilon$ and otherwise choose the optimal action $a = \arg\max_{a'} Q(s_t, a'; \theta)$. The $\varepsilon$ anneals linearly to zero as the number of updates increases.
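A small sketch of this selection rule with linear annealing follows; the schedule constants and the two-action assumption (Flappy Bird's flap / do nothing) are illustrative choices, not values from the report:

    import random

    EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.0, 100_000  # assumed schedule

    def epsilon_at(step):
        """Linearly anneal epsilon from EPS_START to EPS_END over ANNEAL_STEPS."""
        frac = min(step / ANNEAL_STEPS, 1.0)
        return EPS_START + frac * (EPS_END - EPS_START)

    def select_action(q_values, step, num_actions=2):
        """Random action with probability epsilon, else the greedy action."""
        if random.random() < epsilon_at(step):
            return random.randrange(num_actions)
        return max(range(num_actions), key=lambda a: q_values[a])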
2.3 Input Pre-processing
Working directly with raw game frames, which are 288 × 512 pixel RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.
Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images, then down-sample them to a specific size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.
In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to an 80 × 80 image. The gray-scale image is then converted to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity reduces as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
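A sketch of this pipeline in Python with OpenCV is given below; the threshold value and helper names are our assumptions, since the report fixes only the 80 × 80 size, the binarization, and the 4-frame stacking:

    import cv2
    import numpy as np

    def preprocess_frame(frame_rgb):
        """Convert one 288x512 RGB frame into an 80x80 binary image (Figure 4)."""
        gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)   # to gray-scale
        small = cv2.resize(gray, (80, 80))                   # down-sample
        # Threshold value 1 is an assumption: any non-black pixel becomes white.
        _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)
        return binary

    def stack_state(last_four_frames):
        """Stack the last 4 preprocessed frames into one 80x80x4 state tensor."""
        return np.stack(last_four_frames, axis=-1)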
2.4 Experience Replay and Stability
By now we can estimate the future reward in each state using Q-learning and approximate the Q-function with a convolutional neural network.