SHANGHAI JIAO TONG UNIVERSITY

Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning
Group Number: G-07
Group Members: Wang Wenqing, Gao Xiaoning, Qian Chen
Contents

1 Introduction
2 Deep Q-learning Network
  2.1 Q-learning
    2.1.1 Reinforcement Learning Problem
    2.1.2 Q-learning Formulation [6]
  2.2 Deep Q-learning Network
  2.3 Input Pre-processing
  2.4 Experience Replay and Stability
  2.5 DQN Architecture and Algorithm
3 Experiments
  3.1 Parameters Settings
  3.2 Results Analysis
4 Conclusion
5 References
Playing the Game of Flappy Bird with Deep Reinforcement Learning
Abstract

Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of a game, updating its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game Flappy Bird as the input to the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human beings.
1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on screen through the gaps between pairs of pipes by tapping the screen. When the player taps the screen, the bird jumps up; when the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.
Figure 1: (a) normal flight state (b) crash state (c) passing state
Our goal in this paper is to design an agent that plays Flappy Bird automatically, given the same input as a human player; that is, we use only raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.
In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in doing so. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data; RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when
compared to the direct association between inputs and targets found in supervised learning. Thirdly, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Finally, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper demonstrates that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using a Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.
2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters using stochastic gradient descent.

In the following sections, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.
2.1 Q-learning

2.1.1 Reinforcement Learning Problem
Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].
Figure 2: Traditional Reinforcement Learning scenario
The goal of the agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with this might be negative [5].
2.1.2 Q-learning Formulation [6]
In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward received after performing the action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:
$$R_t = r_t + r_{t+1} + \ldots + r_{n-1} + r_n \qquad (1)$$
In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use a discounted future reward:
$$R_t = r_t + \gamma r_{t+1} + \ldots + \gamma^{n-t-1} r_{n-1} + \gamma^{n-t} r_n = \sum_{i=t}^{n} \gamma^{i-t} r_i \qquad (2)$$
Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming equation (2) gives:
$$R_t = r_t + \gamma R_{t+1} \qquad (3)$$
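To see why, factor $\gamma$ out of every term after $r_t$ in equation (2):

$$R_t = r_t + \gamma \left( r_{t+1} + \gamma r_{t+2} + \ldots \right) = r_t + \gamma R_{t+1}$$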
In Q-learning, define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:
$$Q(s_t, a_t) = \max R_{t+1} \qquad (4)$$
It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent is therefore to always choose the action that maximizes the discounted future reward:
$$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t) \qquad (5)$$
Here $\pi$ represents the policy, i.e., the rule by which we choose an action in each state. Given a transition $(s_t, a_t, s_{t+1})$, equations (3) and (4) yield the following Bellman equation: the maximum
future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$$Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a') \qquad (6)$$
The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s_t, a_t)$, which is represented as a table. The overall procedure is given in Algorithm 1:
Algorithm 1 Q-learning
    Initialize Q[num_states, num_actions] arbitrarily
    Observe initial state s_0
    Repeat
        Select and carry out an action a
        Observe reward r and new state s'
        Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
        s ← s'
    Until terminated
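As an illustration only, a minimal Python sketch of Algorithm 1 for a small discrete environment might look as follows. The environment interface (reset()/step()) and the hyperparameter values here are assumptions made for the sketch, not part of this project's code:

import numpy as np

def q_learning(env, num_states, num_actions,
               alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
    # Initialize Q[num_states, num_actions] arbitrarily (here: zeros).
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()                        # observe initial state
        done = False
        while not done:
            # epsilon-greedy action selection (see Section 2.2)
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)      # observe reward and new state
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q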
2.2 Deep Q-learning Network
In Q-learning, the state space is often too big to fit into main memory. A game frame of $80 \times 80$ binary pixels has $2^{6400}$ states, which is impossible to represent in a Q-table. What is more, when it encounters a previously unseen state during training, Q-learning can only perform a random action, meaning that it does not generalize. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].
After training the DQN, the multilayer neural network approximates the traditional optimal Q-table as follows:

$$Q(s_t, a_t; \theta) \approx Q^*(s_t, a_t) \qquad (7)$$
For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3:
Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values Q(s, a), one output neuron corresponding to one action's Q-value.
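For concreteness, a network of this shape could be sketched in PyTorch as follows. The layer sizes and strides here are illustrative assumptions, not the exact architecture used in this project (see Section 2.5 for that):

import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 binary 80x80 frames to one Q-value per action."""
    def __init__(self, num_actions=2):      # Flappy Bird: flap or do nothing
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32 x 19 x 19
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64 x 8 x 8
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64 x 6 x 6
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 512), nn.ReLU(),
            nn.Linear(512, num_actions),     # one output neuron per action's Q-value
        )

    def forward(self, x):                    # x: (batch, 4, 80, 80)
        return self.fc(self.conv(x))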
In order to update the CNN's weights, we define the cost function and the gradient update rule as in [9][10]:
$$L = \frac{1}{2} \left[ r + \gamma \max_{a'} Q(s_{t+1}, a'; \bar{\theta}) - Q(s_t, a_t; \theta) \right]^2 \qquad (8)$$

$$\nabla_{\theta} L = - \left[ r + \gamma \max_{a'} Q(s_{t+1}, a'; \bar{\theta}) - Q(s_t, a_t; \theta) \right] \nabla_{\theta} Q(s_t, a_t; \theta) \qquad (9)$$

$$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta) \qquad (10)$$
Here, $\theta$ denotes the DQN parameters that get trained, and $\bar{\theta}$ denotes the non-updated parameters used for the Q-value target. During training, equation (9) is used to update the weights of the CNN.
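As a sketch of how equations (8)-(10) could translate into code, using the illustrative QNetwork above: autograd computes the gradient (9) and the optimizer applies the update (10); the (1 - done) terminal mask is a standard detail not written out in the equations:

import torch

def dqn_update(q_net, target_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    # Q(s_t, a_t; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # no gradient flows through the frozen target parameters
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = 0.5 * (target - q_sa).pow(2).mean()   # equation (8)
    optimizer.zero_grad()
    loss.backward()                              # equation (9) via autograd
    optimizer.step()                             # equation (10)
    return loss.item()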
Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The $\varepsilon$-greedy approach achieves this: during training, a random action is selected with probability $\varepsilon$, and otherwise the optimal action $a = \arg\max_{a'} Q(s_t, a'; \theta)$ is chosen. The $\varepsilon$ anneals linearly to zero as the number of updates increases.
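A minimal sketch of this exploration scheme follows; the initial value of epsilon and the annealing horizon are placeholders, with the values actually used given in Section 3.1:

import random
import torch

INITIAL_EPSILON = 0.1       # placeholder starting exploration rate
ANNEAL_STEPS = 1_000_000    # placeholder number of updates to reach zero

def select_action(q_net, state, step, num_actions=2):
    # Anneal epsilon linearly from INITIAL_EPSILON down to 0
    epsilon = max(0.0, INITIAL_EPSILON * (1 - step / ANNEAL_STEPS))
    if random.random() < epsilon:
        return random.randrange(num_actions)            # explore
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())  # exploit: argmax_a' Q(s_t, a'; theta)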
2.3 Input Pre-processing
Working directly with raw game frames, which are $288 \times 512$ pixel RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.
Figure 4: Pre-processing game frames. First convert the frames to gray-scale images, then down-sample them to a specific size. Afterwards, convert them to binary images, and finally stack the last 4 frames as a state.
In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to an $80 \times 80$ image. The gray image is then converted to a binary image. In addition, the last 4 game frames are stacked together as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity reduces as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
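A minimal sketch of this pipeline using OpenCV, assuming the background has already been blacked out (the threshold value is an illustrative choice):

import cv2
import numpy as np

def preprocess(frame_rgb):
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)   # 288x512 RGB -> gray
    small = cv2.resize(gray, (80, 80))                   # down-sample to 80x80
    # Binarize: with a pure black background, any lit pixel becomes white.
    _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)
    return binary.astype(np.float32) / 255.0

def stack_state(last_frames):
    # last_frames: the 4 most recent preprocessed frames, oldest first
    return np.stack(last_frames, axis=0)                 # shape (4, 80, 80)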
2.4 Experience Replay and Stability
By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network. But the approximation of Q-values using non-linear functions is not very stable. In Q-learning, the experiences recorded in a sequential manner are highly correlated; if we use them in that order to update the DQN parameters, the training process might get stuck in a poor local minimum or diverge.
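Experience replay, the standard remedy [9][10], stores transitions in a memory and samples random minibatches from it for each update, breaking the correlation between consecutive samples. A minimal sketch (the capacity and batch size here are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates the training minibatch
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))   # lists of s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)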