
SHANGHAI JIAO TONG UNIVERSITY

Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning

Group Number: G-07

Group Members: Wang Wenqing, Gao Xiaoning, Qian Chen


Contents

1 Introduction
2 Deep Q-learning Network
  2.1 Q-learning
    2.1.1 Reinforcement Learning Problem
    2.1.2 Q-learning Formulation [6]
  2.2 Deep Q-learning Network
  2.3 Input Pre-processing
  2.4 Experience Replay and Stability
  2.5 DQN Architecture and Algorithm
3 Experiments
  3.1 Parameters Settings
  3.2 Results Analysis
4 Conclusion
5 References






Playing the Game of Flappy Bird with Deep Reinforcement Learning




Abstract




Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we use a convolutional neural network to represent the environment of a game and update its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game Flappy Bird as the input of the DQN, which keeps the approach scalable to other games. After training with some tricks, the DQN can greatly outperform human players.


1 Introduction

Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on the screen through the gaps between pairs of pipes by tapping the screen. When the player taps the screen, the bird jumps up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.
Figure 1: (a) normal flight state, (b) crash state, (c) passing state.




Our goal in this paper is to design an agent that plays Flappy Bird automatically from the same input available to a human player, which means that we use only raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.



In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning can extract high-level features from high-dimensional raw images, so it is natural to ask whether deep learning can also be used in reinforcement learning. However, there are four challenges in using deep learning here. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data, whereas RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and the resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared with the direct association between inputs and targets found in supervised learning. The third issue is that most deep learning algorithms assume the data samples are independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.




This paper demonstrates that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using a Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based purely on consecutive raw images.




2 Deep Q-learning Network

Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and updates its parameters efficiently with stochastic gradient descent.




In the following sections, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.



2.1 Q-learning

2.1.1 Reinforcement Learning Problem



Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].








Figure 2: Traditional reinforcement learning scenario.




The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with an action might be negative [5].



2.1.2 Q-learning Formulation [6]




In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g., one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward received after performing action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:



$$R_t = r_t + r_{t+1} + \dots + r_{n-1} + r_n \qquad (1)$$



In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use the discounted future reward:




$$R_t = r_t + \gamma r_{t+1} + \dots + \gamma^{n-t-1} r_{n-1} + \gamma^{n-t} r_n = \sum_{i=t}^{n} \gamma^{i-t} r_i \qquad (2)$$



Here $\gamma$ is the discount factor between 0 and 1; the further into the future a reward is, the less we take it into consideration. Transforming equation (2), we get:



$$R_t = r_t + \gamma R_{t+1} \qquad (3)$$



In Q-learning, we define a function $Q(s_t, a_t)$ representing the maximum discounted future reward obtainable when we perform action $a_t$ in state $s_t$:



$$Q(s_t, a_t) = \max R_{t+1} \qquad (4)$$



It is called the Q-function because it represents the “quality” of a certain action in a given state. A good strategy for an agent would be to always choose the action that maximizes the discounted future reward:



$$\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t) \qquad (5)$$



Here $\pi$ represents the policy, i.e., the rule by which we choose an action in each state. Given a transition $(s_t, a_t, s_{t+1})$, equations (3) and (4) yield the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:




$$Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a') \qquad (6)$$




The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s_t, a_t)$, which is represented as a table. The overall procedure is given in Algorithm 1:





Algorithm 1 Q-learning

Initialize Q[num_states, num_actions] arbitrarily
Observe initial state s_0
Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
    s ← s'
Until terminated
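As a concrete illustration, here is a minimal Python sketch of Algorithm 1 for a small discrete environment. The environment object with `reset()` and `step()` methods returning (next_state, reward, done), as well as the values of `alpha`, `gamma` and `epsilon`, are assumptions for illustration only and are not part of the original report.

```python
import numpy as np

def tabular_q_learning(env, num_states, num_actions,
                       alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
    """Tabular Q-learning following Algorithm 1 (assumed env interface)."""
    Q = np.zeros((num_states, num_actions))      # initialize Q arbitrarily (zeros here)
    for _ in range(episodes):
        s = env.reset()                          # observe initial state s_0
        done = False
        while not done:
            # epsilon-greedy selection of the next action
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # observe reward r and new state s'
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```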





2.2 Deep Q-learning Network



In Q-learning, the state space is often too big to fit into main memory. A game frame of $80 \times 80$ binary pixels has $2^{6400}$ states, which is impossible to represent with a Q-table. What is more, during training, when Q-learning encounters an unknown state it just performs a random action, meaning that it is not heuristic. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].



After training, the multilayer neural network of the DQN can approximate the traditional optimal Q-table as follows:



$$Q(s_t, a_t; \theta) \approx Q^*(s_t, a_t) \qquad (7)$$



For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3:





Figure 3: In DQN, the CNN's input is the raw game image and its outputs are the Q-values Q(s, a), with one output neuron corresponding to each action's Q-value.
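The exact network layout is given later in Section 2.5; purely to make Figure 3's input/output interface concrete, here is one plausible sketch in PyTorch that maps a stack of four 80 × 80 frames to one Q-value per action. The layer sizes are assumptions for illustration, not the architecture used in this report.

```python
import torch
import torch.nn as nn

class FlappyDQN(nn.Module):
    """Maps a state of 4 stacked 80x80 frames to Q-values, one per action."""
    def __init__(self, num_actions=2):             # Flappy Bird: flap or do nothing
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 80x80 -> 19x19
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 19x19 -> 8x8
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 8x8  -> 6x6
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 512), nn.ReLU(),
            nn.Linear(512, num_actions),            # one Q(s, a) per action
        )

    def forward(self, x):                            # x: (batch, 4, 80, 80)
        return self.head(self.features(x))
```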




In order to update the CNN's weights, we define the cost function and the gradient update as in [9][10]:



$$L = \frac{1}{2}\left[ r + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right]^2 \qquad (8)$$

$$\nabla_{\theta} L = -\left[ r + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right] \nabla_{\theta} Q(s_t, a_t; \theta) \qquad (9)$$

$$\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta) \qquad (10)$$



Here, $\theta$ are the DQN parameters that get trained and $\theta^{-}$ are the non-updated (target) parameters of the Q-value function. During training, the gradient in equation (9) is used to update the weights of the CNN via equation (10).
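A hedged sketch of a single training step implementing equations (8)–(10) in PyTorch. The separate target network `q_target` holding $\theta^{-}$, the optimizer, and the discount value are assumptions made for illustration; terminal transitions are masked with a `done` flag, which the equations above leave implicit.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One gradient step on L = 1/2 [r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta)]^2."""
    s, a, r, s_next, done = batch                                # minibatch of transition tensors
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s_t, a_t; theta)
    with torch.no_grad():                                        # theta^- is not updated here
        target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
    loss = 0.5 * F.mse_loss(q_sa, target)                        # equation (8)
    optimizer.zero_grad()
    loss.backward()                                              # gradient, equation (9)
    optimizer.step()                                             # parameter update, equation (10)
    return loss.item()
```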




Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting past experience. The $\epsilon$-greedy approach achieves this: during training, we select a random action with probability $\epsilon$ and otherwise choose the optimal action $a = \arg\max_{a'} Q(s_t, a'; \theta)$. The value of $\epsilon$ anneals linearly to zero as the number of updates increases.
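A small sketch of this action-selection rule with linear annealing. The starting value of $\epsilon$ and the annealing horizon are assumed values; the concrete settings used in this project appear only in Section 3.1.

```python
import random
import torch

def epsilon_greedy_action(q_net, state, step, num_actions=2,
                          eps_start=0.1, anneal_steps=1_000_000):
    """Pick a random action with probability eps (annealed linearly to zero), else argmax_a Q."""
    eps = max(0.0, eps_start * (1.0 - step / anneal_steps))       # linear annealing
    if random.random() < eps:
        return random.randrange(num_actions)                      # explore
    with torch.no_grad():                                         # exploit
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```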



2.3 Input Pre-processing



Working directly with raw game frames, which are $288 \times 512$ pixel RGB images, can be computationally demanding, so we apply a basic pre-processing step aimed at reducing the input dimensionality.





Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images, then down-sample them to a fixed size. Afterwards, convert them to binary images and finally stack the last 4 frames as one state.




In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are pre-processed by first converting their RGB representation to gray-scale and down-sampling it to an $80 \times 80$ image. The gray image is then converted to a binary image. In addition, the last 4 game frames are stacked as one state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity decreases as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
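A minimal sketch of this pre-processing pipeline with OpenCV and NumPy. The binarization threshold and the way the first state is built (repeating the first frame four times) are assumptions for illustration; the intensity-fading overlap described above is not reproduced here, only the plain 4-frame stack.

```python
import numpy as np
import cv2

def preprocess_frame(frame):
    """Convert one 288x512 RGB game frame to an 80x80 binary image in [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                 # RGB -> gray-scale
    small = cv2.resize(gray, (80, 80))                             # down-sample to 80x80
    _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)    # gray -> binary (threshold assumed)
    return binary.astype(np.float32) / 255.0

def make_state(frame, prev_state=None):
    """Stack the last 4 pre-processed frames into one (4, 80, 80) state."""
    f = preprocess_frame(frame)
    if prev_state is None:                                         # first frame of an episode
        return np.stack([f] * 4, axis=0)
    return np.concatenate([prev_state[1:], f[None]], axis=0)       # drop oldest, append newest
```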




2.4 Experience Replay and Stability



By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network. However, approximating Q-values with non-linear functions is not very stable. In Q-learning, the experiences recorded in a sequential manner are highly correlated. If we use them sequentially to update the DQN parameters, the training process might get stuck in a local minimum or diverge.
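The standard remedy that gives this section its name is experience replay: store transitions in a buffer and update the network on randomly sampled minibatches, so that consecutive updates are no longer drawn from correlated states. A minimal sketch follows; the buffer capacity and batch size are assumed values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done) and samples uncorrelated minibatches."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)             # oldest transitions are discarded first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random sampling breaks temporal correlation
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```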


