
SHANGHAI JIAO TONG UNIVERSITY










Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning

Group Number: G-07

Group Members: Wang Wenqing, Gao Xiaoning, Qian Chen





Contents

1  Introduction ............................................................ 1
2  Deep Q-learning Network ................................................. 2
   2.1  Q-learning ......................................................... 2
        2.1.1  Reinforcement Learning Problem .............................. 2
        2.1.2  Q-learning Formulation [6] .................................. 3
   2.2  Deep Q-learning Network ............................................ 4
   2.3  Input Pre-processing ............................................... 5
   2.4  Experience Replay and Stability .................................... 5
   2.5  DQN Architecture and Algorithm ..................................... 6
3  Experiments ............................................................. 7
   3.1  Parameters Settings ................................................ 7
   3.2  Results Analysis ................................................... 9
4  Conclusion .............................................................. 11
5  References .............................................................. 12



Playing the Game of Flappy Bird with Deep Reinforcement Learning



Abstract



Letting machines play games has been one of the popular topics in AI. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of games, updating its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game Flappy Bird as the input of the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human players.

1  Introduction

Flappy Bird has been a popular game worldwide in recent years. The player's goal is to guide the bird on the screen through the gap between two pipes by tapping the screen. When the player taps the screen, the bird jumps up; when the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) is the normal flight state, (b) is the crash state, and (c) is the passing state.









Figure 1: (a) normal flight state; (b) crash state; (c) passing state.


Our goal in this paper is to design an agent that plays Flappy Bird automatically from the same input available to a human player; that is, we use only raw images and rewards to teach the agent how to play the game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.


In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can also be used in reinforcement learning. However, there are four challenges in doing so. First, most successful deep learning applications to date have required large amounts of hand-labelled training data, whereas RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Second, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting compared to the direct association between inputs and targets found in supervised learning. Third, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.



This paper demonstrates that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using a Deep Q-learning Network (DQN), we build an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.


2  Deep Q-learning Network

Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters using stochastic gradient descent.


In the following sections, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.


2.1  Q-learning

2.1.1  Reinforcement Learning Problem

Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time step $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].




Figure 2: Traditional reinforcement learning scenario.


The goal of the agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with this might be negative [5].

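To make this interaction loop concrete, the following sketch runs one episode of agent-environment interaction. The environment interface (reset/step returning the next state, the reward and a termination flag) is an assumption for illustration and is not specified in this report.

    # Sketch of one episode of the agent-environment loop described above.
    # The env.reset()/env.step() interface is assumed for illustration.
    def run_episode(env, policy):
        state = env.reset()                          # initial state s_0
        total_reward = 0.0
        done = False
        while not done:
            action = policy(state)                   # choose a_t from the available actions
            state, reward, done = env.step(action)   # environment returns s_{t+1}, r_{t+1}, done flag
            total_reward += reward
        return total_reward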

2.1.2  Q-learning Formulation [6]

In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward after performing the action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:

$$R_t = r_t + r_{t+1} + \cdots + r_{n-1} + r_n \qquad (1)$$

In order to ensure convergence and to balance the immediate reward against the future reward, the total reward must use a discounted future reward:

$$R_t = r_t + \gamma\, r_{t+1} + \cdots + \gamma^{\,n-t-1} r_{n-1} + \gamma^{\,n-t} r_n = \sum_{i=t}^{n} \gamma^{\,i-t} r_i \qquad (2)$$

Here $\gamma$ is the discount factor between 0 and 1; the further into the future a reward is, the less we take it into consideration. Transforming equation (2), we get:

$$R_t = r_t + \gamma R_{t+1} \qquad (3)$$

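As a quick numerical check of the recursion in (3), the sketch below computes the discounted return of a made-up reward sequence both directly from (2) and backwards with (3); the rewards and $\gamma$ are illustrative only.

    # Check that the direct sum in (2) equals the recursion in (3).
    # The reward sequence and gamma are made up for illustration.
    gamma = 0.9
    rewards = [1.0, 0.0, 0.0, 5.0]                                 # r_t, r_{t+1}, r_{t+2}, r_{t+3}

    direct = sum(gamma ** i * r for i, r in enumerate(rewards))    # equation (2)

    R = 0.0
    for r in reversed(rewards):                                    # equation (3): R_t = r_t + gamma * R_{t+1}
        R = r + gamma * R

    assert abs(direct - R) < 1e-12                                 # both give 4.645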

In Q-learning, we define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:

$$Q(s_t, a_t) = \max R_{t+1} \qquad (4)$$

It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent would be to always choose the action that maximizes the discounted future reward:

$$\pi(s_t) = \operatorname*{argmax}_{a_t} Q(s_t, a_t) \qquad (5)$$



Here $\pi$ represents the policy, i.e., the rule by which we choose an action in each state. Given a transition $(s_t, a_t, s_{t+1})$, equations (3) and (4) yield the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$$Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a') \qquad (6)$$

The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s_t, a_t)$, which is stored as a table. The overall procedure is given in Algorithm 1:

Algorithm 1: Q-learning

    Initialize Q[num_states, num_actions] arbitrarily
    Observe initial state s_0
    Repeat
        Select and carry out an action a
        Observe reward r and new state s'
        Q(s, a) ← Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
        s ← s'
    Until terminated


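A direct transcription of Algorithm 1 into Python is sketched below. The discrete environment interface and the ε-greedy action selection are assumptions added for illustration, since Algorithm 1 only says "select and carry out an action".

    import numpy as np

    def q_learning(env, num_states, num_actions, episodes=1000,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        # Tabular Q-learning following Algorithm 1 (env interface is assumed).
        Q = np.zeros((num_states, num_actions))       # initialize Q arbitrarily (here: zeros)
        for _ in range(episodes):
            s = env.reset()                           # observe initial state s_0
            done = False
            while not done:                           # repeat until terminated
                if np.random.rand() < epsilon:        # epsilon-greedy selection (an added assumption)
                    a = np.random.randint(num_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)         # observe reward r and new state s'
                # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
                Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
                s = s_next                            # s <- s'
        return Q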


2.2  Deep Q-learning Network

In Q-learning, the state space is often too big to fit into main memory. A game frame of $80 \times 80$ binary pixels has $2^{6400}$ possible states, which is impossible to represent with a Q-table. What's more, during training, when Q-learning encounters an unseen state it can only perform a random action, meaning that it is not heuristic.


To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].

After training the DQN, the multilayer neural network can approximate the traditional optimal Q-table as follows:

$$Q(s_t, a_t; \theta) \approx Q^*(s_t, a_t) \qquad (7)$$

For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3:


Figure 3: In DQN, the CNN's input is the raw game image and its outputs are the Q-values Q(s, a), with one neuron corresponding to one action's Q-value.

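The exact network used in this project is described in Section 2.5. Purely as an illustration of a CNN whose output layer has one Q-value per action, a small PyTorch sketch with placeholder layer sizes is given below; all sizes are assumptions, not this report's settings.

    import torch.nn as nn

    class QNetwork(nn.Module):
        # Illustrative CNN: a stack of 4 preprocessed 80x80 frames in,
        # one Q-value per action out (2 actions: flap / do nothing).
        # Layer sizes are placeholders, not the settings of this report.
        def __init__(self, num_actions=2):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 80x80 -> 19x19
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 19x19 -> 8x8
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 8x8  -> 6x6
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 6 * 6, 512), nn.ReLU(),
                nn.Linear(512, num_actions),          # one neuron per action's Q-value
            )

        def forward(self, x):                         # x: (batch, 4, 80, 80)
            return self.head(self.features(x))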

In order to update the CNN's weights, we define the cost function and the gradient update rule as [9][10]:


$$L = \frac{1}{2}\Big[\, r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta) \,\Big]^2 \qquad (8)$$

$$\frac{\partial L(\theta)}{\partial \theta} = -\Big[\, r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta) \,\Big]\, \frac{\partial Q(s_t, a_t; \theta)}{\partial \theta} \qquad (9)$$

$$\theta \leftarrow \theta - \eta\, \frac{\partial L(\theta)}{\partial \theta} \qquad (10)$$

Here, $\theta$ denotes the DQN parameters that get trained, and $\theta^-$ denotes the non-updated (frozen) parameters used for the target Q-value. During training, equation (9) is used to update the weights of the CNN.

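With an automatic differentiation framework, equations (8)-(10) amount to one gradient step on the squared TD error. The sketch below uses PyTorch and an assumed batch layout (states, actions, rewards, next_states, dones), which is not specified in the report.

    import torch

    def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
        # One gradient step: the loss is Eq. (8), autograd supplies Eq. (9),
        # and optimizer.step() applies Eq. (10). The batch layout is assumed.
        states, actions, rewards, next_states, dones = batch
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; theta)
        with torch.no_grad():                                                 # theta^- stays frozen
            next_q = target_net(next_states).max(dim=1).values
            target = rewards + gamma * (1.0 - dones) * next_q                 # r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-)
        loss = 0.5 * (target - q_values).pow(2).mean()                        # Eq. (8), averaged over the batch
        optimizer.zero_grad()
        loss.backward()                                                       # Eq. (9)
        optimizer.step()                                                      # Eq. (10)
        return loss.item()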

Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The $\epsilon$-greedy approach can achieve this target: during training, select a random action with probability $\epsilon$, and otherwise choose the optimal action $a = \operatorname*{argmax}_{a'} Q(s_t, a'; \theta)$. The $\epsilon$ anneals linearly to zero as the number of updates increases.

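An ε-greedy selector with linear annealing might look like the following sketch; the starting value and annealing horizon are placeholders, and this report's actual parameter settings appear in Section 3.1.

    import random
    import torch

    def epsilon_by_step(step, eps_start=1.0, eps_end=0.0, anneal_steps=1_000_000):
        # Linearly anneal epsilon toward zero with the number of updates.
        # Start value and horizon are placeholders, not the report's settings.
        frac = min(step / anneal_steps, 1.0)
        return eps_start + frac * (eps_end - eps_start)

    def select_action(q_net, state, step, num_actions=2):
        # Random action with probability epsilon, otherwise argmax_a' Q(s_t, a'; theta).
        if random.random() < epsilon_by_step(step):
            return random.randrange(num_actions)
        with torch.no_grad():
            return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())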

2.3  Input Pre-processing

Working directly with raw game frames, which are $288 \times 512$ pixel RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.



Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images, then down-sample them to a specific size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as one state.


In order to improve the accuracy of the convolutional network, the background of the game was removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling them to an $80 \times 80$ image. Then the gray image is converted to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity reduces as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory on which the bird is currently moving.

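The preprocessing pipeline of Figure 4 can be sketched with OpenCV as follows; the threshold value and the exact resizing call are assumptions for illustration, not taken from this report.

    import cv2
    import numpy as np
    from collections import deque

    def preprocess_frame(frame_rgb, size=80, thresh=1):
        # RGB frame -> 80x80 binary image, as in Figure 4 (threshold is assumed).
        gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)              # to gray-scale
        small = cv2.resize(gray, (size, size))                          # down-sample to 80x80
        _, binary = cv2.threshold(small, thresh, 1, cv2.THRESH_BINARY)  # to {0, 1}
        return binary.astype(np.float32)

    class FrameStack:
        # Keep the last 4 preprocessed frames and stack them into one state.
        def __init__(self, k=4):
            self.frames = deque(maxlen=k)

        def reset(self, frame_rgb):
            f = preprocess_frame(frame_rgb)
            for _ in range(self.frames.maxlen):
                self.frames.append(f)
            return np.stack(self.frames, axis=0)                        # shape (4, 80, 80)

        def step(self, frame_rgb):
            self.frames.append(preprocess_frame(frame_rgb))
            return np.stack(self.frames, axis=0)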


2.4  Experience Replay and Stability

By now we can estimate the future reward in each state using Q-learning and
