2014-04-02 47 views

Q-learning: which state does the reward come from?

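I am struggling to interpret the pseudocode for the Q-learning algorithm below, in particular which state the reward should be computed from: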

For each s, a initialize table entry Q(a, s) = 0
Observe current state s
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s′ ← δ(a, s)
    Update the table entry for Q(a, s) as follows:
        Q(a, s) ← r + γ * max_a′ Q(a′, s′)
    s ← s′

Should the reward be collected from the successor state s′ or from the current state s?

Answer


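The reward should be collected from the successor state s′, that is, the state you enter after performing action a in state s.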
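
To make that concrete, here is a minimal Python sketch of the deterministic update from the question. The environment is purely hypothetical (a four-state chain where entering the last state pays 1), and the names `step`, `q_learning`, `GAMMA`, and `N_STATES` are my own, not from any library. The table is indexed as `Q[s][a]` rather than Q(a, s). The point to notice is that `r` is produced by the transition and depends on the state that was entered, but it is credited to the pair (s, a) that caused the transition.

```python
import random

GAMMA = 0.9           # discount factor γ (assumed value)
N_STATES = 4          # hypothetical chain: states 0..3, state 3 is the goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right

def step(s, a):
    """Hypothetical deterministic environment: returns (s_next, r)."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    # The reward is determined by the state we ENTER, not the one we left.
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r

def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:          # an episode ends at the goal
            a = rng.choice(ACTIONS)       # purely random exploration
            s_next, r = step(s, a)        # execute a, receive r, observe s′
            # Deterministic update from the pseudocode: credit (s, a) with r.
            Q[s][a] = r + GAMMA * max(Q[s_next])
            s = s_next
    return Q

if __name__ == "__main__":
    for s, row in enumerate(q_learning()):
        print(s, row)
```

In this sketch the value of moving right from state 2 settles at 1, because that action leads into the goal, and the values of moving right from states 1 and 0 are discounted copies of it (γ and γ² respectively).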