SHANGHAI JIAO TONG UNIVERSITY

Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning
Group Number: G-07
Group Members: Wang Wenqing, Gao Xiaoning, Qian Chen
Contents

1 Introduction
2 Deep Q-learning Network
  2.1 Q-learning
    2.1.1 Reinforcement Learning Problem
    2.1.2 Q-learning Formulation [6]
  2.2 Deep Q-learning Network
  2.3 Input Pre-processing
  2.4 Experience Replay and Stability
  2.5 DQN Architecture and Algorithm
3 Experiments
  3.1 Parameters Settings
  3.2 Results Analysis
4 Conclusion
5 References
Playing the Game of Flappy Bird with Deep Reinforcement Learning
Abstract

Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of a game, updating its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game Flappy Bird as the input to the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human beings.
1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on screen through the gaps between pairs of pipes by tapping the screen. When the player taps the screen, the bird jumps up; when the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.
Figure 1: (a) normal flight state (b) crash state (c) passing state
Our goal in this paper is to design an agent that plays Flappy Bird automatically, given the same input as a human player; that is, we use only raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.
In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in doing so. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data; RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when
compared to the direct association between inputs and targets found in supervised learning. Thirdly, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Finally, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper demonstrates that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using a Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.
2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters using stochastic gradient descent.

In the following sections, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.
2.1 Q-learning

2.1.1 Reinforcement Learning Problem
Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].
Figure 2: Traditional Reinforcement Learning scenario
The goal of the agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with this might be negative [5].
2.1.2 Q-learning Formulation [6]
In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward received after performing the action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:
$$R_t = r_t + r_{t+1} + \ldots + r_{n-1} + r_n \qquad (1)$$
In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use a discounted future reward:
$$R_t = r_t + \gamma r_{t+1} + \ldots + \gamma^{n-t-1} r_{n-1} + \gamma^{n-t} r_n = \sum_{i=t}^{n} \gamma^{i-t} r_i \qquad (2)$$
Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming equation (2) gives:
$$R_t = r_t + \gamma R_{t+1} \qquad (3)$$
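To see why, factor $\gamma$ out of every term after $r_t$ in equation (2):

$$R_t = r_t + \gamma \left( r_{t+1} + \gamma r_{t+2} + \ldots \right) = r_t + \gamma R_{t+1}$$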
In Q-learning, define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:
$$Q(s_t, a_t) = \max R_{t+1} \qquad (4)$$
It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent is therefore to always choose the action that maximizes the discounted future reward:
$$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t) \qquad (5)$$
Here $\pi$ represents the policy, i.e., the rule by which we choose an action in each state. Given a transition $(s_t, a_t, s_{t+1})$, equations (3) and (4) yield the following Bellman equation: the maximum
future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$$Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a') \qquad (6)$$
The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s_t, a_t)$, which is represented as a table. The overall procedure is given in Algorithm 1:
Algorithm 1 Q-learning
    Initialize Q[num_states, num_actions] arbitrarily
    Observe initial state s_0
    Repeat
        Select and carry out an action a
        Observe reward r and new state s'
        Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
        s ← s'
    Until terminated
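As an illustration only, a minimal Python sketch of Algorithm 1 for a small discrete environment might look as follows. The environment interface (reset()/step()) and the hyperparameter values here are assumptions made for the sketch, not part of this project's code:

import numpy as np

def q_learning(env, num_states, num_actions,
               alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
    # Initialize Q[num_states, num_actions] arbitrarily (here: zeros).
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()                        # observe initial state
        done = False
        while not done:
            # epsilon-greedy action selection (see Section 2.2)
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)      # observe reward and new state
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q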
2.2 Deep Q-learning Network
In Q-learning, the state space is often too big to fit into main memory. A game frame of $80 \times 80$ binary pixels has $2^{6400}$ states, which is impossible to represent in a Q-table. What is more, when it encounters a previously unseen state during training, Q-learning can only perform a random action, meaning that it does not generalize. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].
After training the DQN, the multilayer neural network approximates the traditional optimal Q-table as follows:

$$Q(s_t, a_t; \theta) \approx Q^*(s_t, a_t) \qquad (7)$$
For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3:
Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values Q(s, a), one output neuron corresponding to one action's Q-value.
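For concreteness, a network of this shape could be sketched in PyTorch as follows. The layer sizes and strides here are illustrative assumptions, not the exact architecture used in this project (see Section 2.5 for that):

import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 binary 80x80 frames to one Q-value per action."""
    def __init__(self, num_actions=2):      # Flappy Bird: flap or do nothing
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32 x 19 x 19
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64 x 8 x 8
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64 x 6 x 6
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 512), nn.ReLU(),
            nn.Linear(512, num_actions),     # one output neuron per action's Q-value
        )

    def forward(self, x):                    # x: (batch, 4, 80, 80)
        return self.fc(self.conv(x))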
In order to update the CNN's weights, we define the cost function and the gradient update rule as in [9][10]:
$$L = \frac{1}{2} \left[ r + \gamma \max_{a'} Q(s_{t+1}, a'; \bar{\theta}) - Q(s_t, a_t; \theta) \right]^2 \qquad (8)$$

$$\nabla_{\theta} L = - \left[ r + \gamma \max_{a'} Q(s_{t+1}, a'; \bar{\theta}) - Q(s_t, a_t; \theta) \right] \nabla_{\theta} Q(s_t, a_t; \theta) \qquad (9)$$

$$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta) \qquad (10)$$
Here, $\theta$ denotes the DQN parameters that get trained, and $\bar{\theta}$ denotes the non-updated parameters used for the Q-value target. During training, equation (9) is used to update the weights of the CNN.
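As a sketch of how equations (8)-(10) could translate into code, using the illustrative QNetwork above: autograd computes the gradient (9) and the optimizer applies the update (10); the (1 - done) terminal mask is a standard detail not written out in the equations:

import torch

def dqn_update(q_net, target_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    # Q(s_t, a_t; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # no gradient flows through the frozen target parameters
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = 0.5 * (target - q_sa).pow(2).mean()   # equation (8)
    optimizer.zero_grad()
    loss.backward()                              # equation (9) via autograd
    optimizer.step()                             # equation (10)
    return loss.item()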
Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The $\varepsilon$-greedy approach achieves this: during training, a random action is selected with probability $\varepsilon$, and otherwise the optimal action $a = \arg\max_{a'} Q(s_t, a'; \theta)$ is chosen. The $\varepsilon$ anneals linearly to zero as the number of updates increases.
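A minimal sketch of this exploration scheme follows; the initial value of epsilon and the annealing horizon are placeholders, with the values actually used given in Section 3.1:

import random
import torch

INITIAL_EPSILON = 0.1       # placeholder starting exploration rate
ANNEAL_STEPS = 1_000_000    # placeholder number of updates to reach zero

def select_action(q_net, state, step, num_actions=2):
    # Anneal epsilon linearly from INITIAL_EPSILON down to 0
    epsilon = max(0.0, INITIAL_EPSILON * (1 - step / ANNEAL_STEPS))
    if random.random() < epsilon:
        return random.randrange(num_actions)            # explore
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())  # exploit: argmax_a' Q(s_t, a'; theta)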
2.3 Input Pre-processing
Working directly with raw game frames, which are $288 \times 512$ pixel RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.
Figure 4: Pre-processing game frames. First convert the frames to gray-scale images, then down-sample them to a specific size. Afterwards, convert them to binary images, and finally stack the last 4 frames as a state.
In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to an $80 \times 80$ image. The gray image is then converted to a binary image. In addition, the last 4 game frames are stacked together as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity reduces as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
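A minimal sketch of this pipeline using OpenCV, assuming the background has already been blacked out (the threshold value is an illustrative choice):

import cv2
import numpy as np

def preprocess(frame_rgb):
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)   # 288x512 RGB -> gray
    small = cv2.resize(gray, (80, 80))                   # down-sample to 80x80
    # Binarize: with a pure black background, any lit pixel becomes white.
    _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)
    return binary.astype(np.float32) / 255.0

def stack_state(last_frames):
    # last_frames: the 4 most recent preprocessed frames, oldest first
    return np.stack(last_frames, axis=0)                 # shape (4, 80, 80)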
2.4 Experience Replay and Stability
By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network. But the approximation of Q-values using non-linear functions is not very stable. In Q-learning, the experiences recorded in a sequential manner are highly correlated; if we use them in that order to update the DQN parameters, the training process might get stuck in a poor local minimum or diverge.
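Experience replay, the standard remedy [9][10], stores transitions in a memory and samples random minibatches from it for each update, breaking the correlation between consecutive samples. A minimal sketch (the capacity and batch size here are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates the training minibatch
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))   # lists of s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)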