Engineering, 07.03.2020 02:46 lukeperry

Show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

Answers

Answer from: vondah4014

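$$U(s) = \max_a \Big[ R'(s,a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s,a,\mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre},b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre},b,s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ \sum_{s'} T(s,a,s') \big( R(s,a,s') + \gamma\, U(s') \big) \Big]$$

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s,a,\mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post},b,s')\, U(s') \Big] \Big) \Big]$$

$$U(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \Big]$$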

Explanation:

MDPs can be formulated with a reward function R(s), with R(s, a) that depends on the action taken, or with R(s, a, s’) that depends on the action taken and the outcome state.

The task is to show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a ’pre-state’ pre(s, a, s’) for every s, a, s’ such that executing a in s leads not to s’ but to pre(s, a, s’). From the pre-state there is only one action b, which always leads to s’. Let the new MDP have transition model T’, reward R’, and discount γ’, defined by:

$$T'(s, a, \mathit{pre}(s,a,s')) = T(s,a,s')$$

$$T'(\mathit{pre}(s,a,s'), b, s') = 1$$

$$R'(s,a) = 0$$

$$R'(\mathit{pre}(s,a,s'), b) = \gamma^{-1/2}\, R(s,a,s')$$

$$\gamma' = \gamma^{1/2}$$

Then, using pre as shorthand for pre(s, a, s’):

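$$U(s) = \max_a \Big[ R'(s,a) + \gamma^{1/2} \sum_{\mathit{pre}} T'(s,a,\mathit{pre}) \Big( \max_b \Big[ R'(\mathit{pre},b) + \gamma^{1/2} \sum_{s'} T'(\mathit{pre},b,s')\, U(s') \Big] \Big) \Big]$$

Because b is the only action in a pre-state and leads to s’ with probability 1, the inner maximization reduces to γ^{-1/2} R(s, a, s’) + γ^{1/2} U(s’). Substituting this, together with R’(s, a) = 0 and T’(s, a, pre) = T(s, a, s’), gives

$$U(s) = \max_a \Big[ \sum_{s'} T(s,a,s') \big( R(s,a,s') + \gamma\, U(s') \big) \Big]$$

This is the Bellman equation of the original MDP, so the original states have the same utilities in both MDPs and the optimal actions in those states coincide.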
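As a numerical sanity check of the pre-state construction, the Python sketch below builds T’, R’, and γ’ for a small MDP and compares utilities and greedy policies on the original states. The two-state toy MDP, its numbers, the state and action labels, and every function name here are made up for illustration; only the construction itself comes from the answer above.

```python
import math

def q_sas(T, R, gamma, U, s, a):
    """Q-value when the reward R[s][a][s2] depends on the outcome state."""
    return sum(p * (R[s][a][s2] + gamma * U[s2]) for s2, p in T[s][a].items())

def q_sa(T, R, gamma, U, s, a):
    """Q-value when the reward R[s][a] depends only on state and action."""
    return R[s][a] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())

def value_iteration(T, R, gamma, q, iters=2000):
    """Synchronous value iteration; q selects the reward convention."""
    U = {s: 0.0 for s in T}
    for _ in range(iters):
        U = {s: max(q(T, R, gamma, U, s, a) for a in T[s]) for s in T}
    return U

def greedy(T, R, gamma, q, U, states):
    """Greedy action for each listed state under utilities U."""
    return {s: max(T[s], key=lambda a: q(T, R, gamma, U, s, a)) for s in states}

def add_pre_states(T, R, gamma):
    """Pre-state construction: returns T', R'(s, a) and gamma' = sqrt(gamma)."""
    g2 = math.sqrt(gamma)
    T2, R2 = {}, {}
    for s in T:
        T2[s], R2[s] = {}, {}
        for a in T[s]:
            T2[s][a] = {}
            R2[s][a] = 0.0                        # R'(s, a) = 0
            for s2, p in T[s][a].items():
                pre = ("pre", s, a, s2)           # hypothetical label for pre(s, a, s')
                T2[s][a][pre] = p                 # T'(s, a, pre) = T(s, a, s')
                T2[pre] = {"b": {s2: 1.0}}        # T'(pre, b, s') = 1
                R2[pre] = {"b": R[s][a][s2] / g2} # R'(pre, b) = gamma^(-1/2) R(s, a, s')
    return T2, R2, g2

# Toy MDP (assumed for illustration): T[s][a] maps next state -> probability.
T = {"x": {"stay": {"x": 1.0}, "go": {"x": 0.2, "y": 0.8}},
     "y": {"stay": {"y": 1.0}, "go": {"x": 0.5, "y": 0.5}}}
R = {"x": {"stay": {"x": 0.0}, "go": {"x": -1.0, "y": 2.0}},
     "y": {"stay": {"y": 1.0}, "go": {"x": 3.0, "y": -0.5}}}
gamma = 0.9

T2, R2, gamma2 = add_pre_states(T, R, gamma)
U = value_iteration(T, R, gamma, q_sas)
U2 = value_iteration(T2, R2, gamma2, q_sa)
pi = greedy(T, R, gamma, q_sas, U, T)
pi2 = greedy(T2, R2, gamma2, q_sa, U2, T)

for s in T:
    print(s, round(U[s], 4), round(U2[s], 4), pi[s], pi2[s])
```

For each original state, the two printed utilities should agree up to floating-point error and the two greedy actions should be the same. The pre-states carry their own utilities in U2, but they have a single action, so they add no policy choices.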

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

Similar to the pre-state construction above, create a state post(s, a) for every s, a such that

$$T'(s, a, \mathit{post}(s,a)) = 1$$

$$T'(\mathit{post}(s,a), b, s') = T(s,a,s')$$

$$R'(s) = 0$$

$$R'(\mathit{post}(s,a)) = \gamma^{-1/2}\, R(s,a)$$

$$\gamma' = \gamma^{1/2}$$

Then, using post as shorthand for post(s, a):

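$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathit{post}} T'(s,a,\mathit{post}) \Big( R'(\mathit{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathit{post},b,s')\, U(s') \Big] \Big) \Big]$$

Here R’(s) = 0, action a leads to post(s, a) with probability 1, and b is the only action in a post-state. Substituting R’(post) = γ^{-1/2} R(s, a) and T’(post, b, s’) = T(s, a, s’) gives

$$U(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \Big]$$

which is the Bellman equation of the original MDP with reward R(s, a).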
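The post-state construction can be checked numerically in the same way. The Python sketch below converts a small MDP with reward R(s, a) into one with reward R’(s) and compares utilities and greedy policies on the original states. The toy MDP, its numbers, the labels, and all function names are illustrative assumptions, not part of the original answer.

```python
import math

def q_sa(T, R, gamma, U, s, a):
    """Q-value for rewards R[s][a] (original MDP)."""
    return R[s][a] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())

def q_s(T, R, gamma, U, s, a):
    """Q-value for state-only rewards R[s]: R(s) + gamma * sum_s' T U(s')."""
    return R[s] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())

def value_iteration(T, R, gamma, q, iters=2000):
    """Synchronous value iteration; q selects the reward convention."""
    U = {s: 0.0 for s in T}
    for _ in range(iters):
        U = {s: max(q(T, R, gamma, U, s, a) for a in T[s]) for s in T}
    return U

def greedy(T, R, gamma, q, U, states):
    """Greedy action for each listed state under utilities U."""
    return {s: max(T[s], key=lambda a: q(T, R, gamma, U, s, a)) for s in states}

def add_post_states(T, R, gamma):
    """Post-state construction: returns T', R'(s) and gamma' = sqrt(gamma)."""
    g2 = math.sqrt(gamma)
    T2, R2 = {}, {}
    for s in T:
        T2[s] = {}
        R2[s] = 0.0                           # R'(s) = 0
        for a in T[s]:
            post = ("post", s, a)             # hypothetical label for post(s, a)
            T2[s][a] = {post: 1.0}            # T'(s, a, post) = 1
            T2[post] = {"b": dict(T[s][a])}   # T'(post, b, s') = T(s, a, s')
            R2[post] = R[s][a] / g2           # R'(post) = gamma^(-1/2) R(s, a)
    return T2, R2, g2

# Toy MDP (assumed for illustration): T[s][a] maps next state -> probability.
T = {"x": {"stay": {"x": 1.0}, "go": {"x": 0.2, "y": 0.8}},
     "y": {"stay": {"y": 1.0}, "go": {"x": 0.5, "y": 0.5}}}
R = {"x": {"stay": 0.0, "go": 1.4},
     "y": {"stay": 1.0, "go": 1.25}}
gamma = 0.9

T2, R2, gamma2 = add_post_states(T, R, gamma)
U = value_iteration(T, R, gamma, q_sa)
U2 = value_iteration(T2, R2, gamma2, q_s)
pi = greedy(T, R, gamma, q_sa, U, T)
pi2 = greedy(T2, R2, gamma2, q_s, U2, T)

for s in T:
    print(s, round(U[s], 4), round(U2[s], 4), pi[s], pi2[s])
```

As before, the utilities of the original states should match up to floating-point error, and the greedy actions should be identical in both MDPs.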

