SHANGHAI JIAO TONG UNIVERSITY

Reinforcement Learning

Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning

Group Number: G-07

Group Members: Wang Wenqing, Gao Xiaoning, Qian Chen
Contents

1 Introduction
2 Deep Q-learning Network
  2.1 Q-learning
    2.1.1 Reinforcement Learning Problem
    2.1.2 Q-learning Formulation [6]
  2.2 Deep Q-learning Network
  2.3 Input Pre-processing
  2.4 Experience Replay and Stability
  2.5 DQN Architecture and Algorithm
3 Experiments
  3.1 Parameters Settings
  3.2 Results Analysis
4 Conclusion
5 References
Playing the Game of Flappy Bird with Deep Reinforcement Learning
Abstract

Letting machines play games has been one of the popular topics in AI. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of games and update its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game of Flappy Bird as the input to the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human beings.
1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on screen through the gaps between pairs of pipes by tapping the screen. When the player taps the screen, the bird jumps up; when the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.

Figure 1: (a) normal flight state (b) crash state (c) passing state
Our goal in this paper is to design an agent that plays Flappy Bird automatically from the same input a human player sees, which means that we use raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.
In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in doing so. First, most successful deep learning applications to date have required large amounts of hand-labelled training data, while RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy, and delayed. Second, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Third, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper demonstrates that a Convolutional Neural Network (CNN) can overcome those challenges and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.
2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters using stochastic gradient descent. In the following sections, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.
2.1 Q-learning

2.1.1 Reinforcement Learning Problem
Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].
Figure 2: Traditional Reinforcement Learning scenario
The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. In order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with this might be negative [5].
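To make this interaction loop concrete, here is a minimal Python sketch of one episode; the `env` object and its `reset()`/`step()` interface are our own illustrative assumptions, not part of the original project:

    # Minimal sketch of the agent-environment loop described above.
    # `env` and its reset()/step() interface are assumed for illustration;
    # any environment with this shape would work.
    def run_episode(env, policy):
        state = env.reset()                          # initial state s_0
        total_reward = 0.0
        done = False
        while not done:
            action = policy(state)                   # choose a_t given s_t
            state, reward, done = env.step(action)   # receive s_{t+1}, r_{t+1}
            total_reward += reward
        return total_reward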
2.1.2 Q-learning Formulation [6]
In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g., one game) forms a finite sequence of states, actions, and rewards:
$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$
Here $s_i$ represents the state, $a_i$ is the action, and $r_{i+1}$ is the reward after performing action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:
$$R_t = r_t + r_{t+1} + \cdots + r_{n-1} + r_n \qquad (1)$$
In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use the discounted future reward:

$$R_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-t-1} r_{n-1} + \gamma^{n-t} r_n = \sum_{i=t}^{n} \gamma^{i-t} r_i \qquad (2)$$
Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming equation (2) gives:
$$R_t = r_t + \gamma R_{t+1} \qquad (3)$$
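As a quick illustration of equations (2) and (3), the recursion can be evaluated by sweeping backwards over an episode's rewards; this short sketch (our own illustration, not from the original report) computes every $R_t$ in one pass:

    # Compute discounted returns R_t = r_t + gamma * R_{t+1} (equation (3))
    # by iterating backwards from the terminal step.
    def discounted_returns(rewards, gamma=0.99):
        returns = [0.0] * len(rewards)
        future = 0.0                   # R_{t+1}; zero beyond the terminal step
        for t in reversed(range(len(rewards))):
            future = rewards[t] + gamma * future
            returns[t] = future
        return returns

    # Example: three steps of reward 1 with gamma = 0.9 give
    # R_2 = 1, R_1 = 1 + 0.9*1 = 1.9, R_0 = 1 + 0.9*1.9 = 2.71
    print(discounted_returns([1, 1, 1], gamma=0.9))  # [2.71, 1.9, 1.0]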
In Q-learning, we define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:
$$Q(s_t, a_t) = \max R_{t+1} \qquad (4)$$
It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent would be to always choose the action that maximizes the discounted future reward:
$$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t) \qquad (5)$$
Here $\pi$ represents the policy, the rule by which we choose an action in each state. Given a transition $(s_t, a_t, s_{t+1})$, equations (3) and (4) yield the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:
$$Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a') \qquad (6)$$
The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s_t, a_t)$, which is stored as a table. The overall procedure is given in Algorithm 1:
Algorithm 1 Q-learning

    Initialize Q[num_states, num_actions] arbitrarily
    Observe initial state s_0
    Repeat
        Select and carry out an action a
        Observe reward r and new state s'
        Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
        s = s'
    Until terminated
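As a concrete rendering of Algorithm 1, here is a minimal tabular Q-learning sketch in Python; the environment interface (`reset`/`step`) and the hyper-parameter values are our own assumptions for illustration, not code from the original project:

    import random
    from collections import defaultdict

    # Illustrative hyper-parameters: learning rate alpha, discount gamma,
    # exploration rate epsilon.
    ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

    def q_learning(env, actions, episodes=1000):
        Q = defaultdict(float)             # Q[(state, action)], arbitrary init (0)
        for _ in range(episodes):
            s = env.reset()                # observe initial state s_0
            done = False
            while not done:
                if random.random() < EPSILON:                  # explore
                    a = random.choice(actions)
                else:                                          # exploit
                    a = max(actions, key=lambda x: Q[(s, x)])
                s2, r, done = env.step(a)  # observe reward r and new state s'
                # Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
                # (terminal-state bootstrapping omitted for brevity, as in Algorithm 1)
                target = r + GAMMA * max(Q[(s2, x)] for x in actions)
                Q[(s, a)] += ALPHA * (target - Q[(s, a)])
                s = s2
        return Q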
2.2 Deep Q-learning Network
In Q-learning, the state space is often too big to fit into main memory. A game frame of 80 × 80 binary pixels has $2^{6400}$ states, which is impossible to represent with a Q-table. What's more, when it encounters a previously unseen state during training, Q-learning can only perform a random action, meaning that it is not heuristic. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].
After training the DQN, a multilayer neural network can approximate the traditional optimal Q-table as follows:
$$Q(s_t, a_t; \theta) \approx Q^*(s_t, a_t) \qquad (7)$$
As for playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3.

Figure 3: In DQN, the CNN's input is the raw game image, while its outputs are the Q-values Q(s, a), with one output neuron corresponding to each action's Q-value.
In order to update the CNN's weights, we define the cost function and the gradient update rule as [9][10]:
$$L = \frac{1}{2}\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\right]^2 \qquad (8)$$

$$\nabla_{\theta} L = -\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\right] \nabla_{\theta} Q(s_t, a_t; \theta) \qquad (9)$$

$$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta) \qquad (10)$$
Here, $\theta$ denotes the DQN parameters that get trained, and $\theta^-$ denotes the non-updated parameters of the Q-value function. During training, we use equation (9) to update the weights of the CNN.
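For concreteness, here is a minimal sketch of one such update step, written in PyTorch; the network objects, optimizer, and tensor shapes are our illustrative assumptions, and the original report does not prescribe this code:

    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, target_net, optimizer, s, a, r, s_next, done, gamma=0.99):
        """One gradient step on loss (8); q_net holds theta, target_net theta^-.

        Assumed shapes: s, s_next are [B, ...] state batches; a is a [B]
        LongTensor of action indices; r and done are [B] float tensors.
        """
        # Q(s_t, a_t; theta) for the actions actually taken
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                     # theta^- is held fixed
            max_q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * max_q_next * (1.0 - done)
        loss = 0.5 * F.mse_loss(q_sa, target)     # equation (8)
        optimizer.zero_grad()
        loss.backward()                           # gradient, equation (9)
        optimizer.step()                          # parameter update, equation (10)
        return loss.item()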
Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The $\varepsilon$-greedy approach achieves this target: during training, we select a random action with probability $\varepsilon$ and otherwise choose the optimal action $a = \arg\max_{a'} Q(s_t, a'; \theta)$. The $\varepsilon$ anneals linearly to zero as the number of updates increases.
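A small sketch of this selection rule with linear annealing follows; the schedule constants and the two-action assumption (Flappy Bird's flap / do nothing) are illustrative choices, not values from the report:

    import random

    EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.0, 100_000  # assumed schedule

    def epsilon_at(step):
        """Linearly anneal epsilon from EPS_START to EPS_END over ANNEAL_STEPS."""
        frac = min(step / ANNEAL_STEPS, 1.0)
        return EPS_START + frac * (EPS_END - EPS_START)

    def select_action(q_values, step, num_actions=2):
        """Random action with probability epsilon, else the greedy action."""
        if random.random() < epsilon_at(step):
            return random.randrange(num_actions)
        return max(range(num_actions), key=lambda a: q_values[a])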
2.3 Input Pre-processing
Working directly with raw game frames, which are 288 × 512 pixel RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.
Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images, then down-sample them to a specific size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.
In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to an 80 × 80 image. The gray-scale image is then converted to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity reduces as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
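A sketch of this pipeline in Python with OpenCV is given below; the threshold value and helper names are our assumptions, since the report fixes only the 80 × 80 size, the binarization, and the 4-frame stacking:

    import cv2
    import numpy as np

    def preprocess_frame(frame_rgb):
        """Convert one 288x512 RGB frame into an 80x80 binary image (Figure 4)."""
        gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)   # to gray-scale
        small = cv2.resize(gray, (80, 80))                   # down-sample
        # Threshold value 1 is an assumption: any non-black pixel becomes white.
        _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)
        return binary

    def stack_state(last_four_frames):
        """Stack the last 4 preprocessed frames into one 80x80x4 state tensor."""
        return np.stack(last_four_frames, axis=-1)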
2.4 Experience Replay and Stability
By now we can estimate the future reward in each state using Q-learning and approximate the Q-function with a convolutional neural network.