Mathematics, 30.03.2021 19:40 zafyafimli

Consider an MDP with three states, A, B, and C, and two actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function of the MDP; instead, we are given samples of what an agent experiences when it interacts with the environment (we do know, however, that the agent never remains in the same state after taking an action). In this problem, we will first estimate the model (the transition function and the reward function) and then use the estimated model to find the optimal actions. To find the optimal actions, model-based RL computes the optimal V or Q value function with respect to the estimated T and R. This could be done with value iteration, policy iteration, or Q-value iteration. Last week you already solved some exercises that involved value iteration and policy iteration, so we will use Q-value iteration in this exercise.
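Q-value iteration repeatedly applies the Bellman optimality backup Q_{k+1}(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q_k(s', a')] until the values stop changing; the optimal action in each state is then argmax_a Q(s, a). Below is a minimal Python sketch, assuming the estimated model is stored as dicts keyed by (s, a, s') and using γ = 0.5 and the T, R values from the table in this problem. The two A/Clockwise rows are missing from that table, so the numbers used for them here are hypothetical placeholders that should be replaced with your own estimates.

```python
# Sketch of Q-value iteration on an estimated model (assumed dict layout).
STATES = ["A", "B", "C"]
ACTIONS = ["Clockwise", "Counterclockwise"]
GAMMA = 0.5  # discount factor given in the problem's table

# T[(s, a, s')] and R[(s, a, s')] as read from the problem's table.
# The A/Clockwise entries are missing there: the values below are
# HYPOTHETICAL placeholders; replace them with your own estimates.
T = {
    ("A", "Clockwise", "B"): 0.5,  # placeholder
    ("A", "Clockwise", "C"): 0.5,  # placeholder
    ("A", "Counterclockwise", "B"): 0.4,
    ("A", "Counterclockwise", "C"): 0.6,
    ("B", "Clockwise", "A"): 0.8,
    ("B", "Clockwise", "C"): 0.2,
    ("B", "Counterclockwise", "A"): 0.8,
    ("B", "Counterclockwise", "C"): 0.2,
    ("C", "Clockwise", "A"): 0.6,
    ("C", "Clockwise", "B"): 0.4,
    ("C", "Counterclockwise", "A"): 0.2,
    ("C", "Counterclockwise", "B"): 0.8,
}
R = {
    ("A", "Clockwise", "B"): 0.0,  # placeholder
    ("A", "Clockwise", "C"): 0.0,  # placeholder
    ("A", "Counterclockwise", "B"): 0.0,
    ("A", "Counterclockwise", "C"): -8.0,
    ("B", "Clockwise", "A"): -3.0,
    ("B", "Clockwise", "C"): 0.0,
    ("B", "Counterclockwise", "A"): -10.0,
    ("B", "Counterclockwise", "C"): 0.0,
    ("C", "Clockwise", "A"): 0.0,
    ("C", "Clockwise", "B"): 6.0,
    ("C", "Counterclockwise", "A"): 0.0,
    ("C", "Counterclockwise", "B"): -8.0,
}


def bellman_backup(Q, T, R, gamma):
    """One synchronous sweep: every Q(s, a) is recomputed from the previous Q."""
    return {
        (s, a): sum(
            T.get((s, a, s2), 0.0)
            * (R.get((s, a, s2), 0.0) + gamma * max(Q[(s2, b)] for b in ACTIONS))
            for s2 in STATES
        )
        for s in STATES
        for a in ACTIONS
    }


def q_value_iteration(T, R, gamma, iterations=100):
    """Start from Q = 0 and apply the backup a fixed number of times."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iterations):
        Q = bellman_backup(Q, T, R, gamma)
    return Q


def greedy_policy(Q):
    """Optimal action in each state: the action with the largest Q(s, a)."""
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}


Q = q_value_iteration(T, R, GAMMA)
for (s, a), q in sorted(Q.items()):
    print(f"Q({s}, {a}) = {q:.3f}")
print(greedy_policy(Q))
```

With γ = 0.5 each sweep at least halves the remaining error, so 100 sweeps is far more than needed; stopping once the largest change between sweeps falls below a tolerance is the usual alternative to a fixed iteration count.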
Consider the following samples that the agent encountered.
Samples (s, a, s', r):

Clockwise samples:
B 0.0 A -3.0 Clockwise B Clockwise Clockwise A C A 0.0 B 0.0 B 6.0 Clockwise Clockwise Clockwise A A 3.0 C A -3.0 B 0.0 B 6.0 Clockwise B Clockwise Clockwise A C A 3.0 C -10.0 A 0.0 Clockwise A B Clockwise Clockwise C 0.0 C -10.0 A 0.0 Clockwise Clockwise Clockwise A C

Counterclockwise samples:
s   a                  s'   r
A   Counterclockwise   C    -8.0
B   Counterclockwise   A    -10.0
C   Counterclockwise   B    -8.0
A   Counterclockwise   C    -8.0
B   Counterclockwise   A    -10.0
C   Counterclockwise   B    -8.0
C   Counterclockwise   B    -8.0
B   Counterclockwise   A    -10.0
A   Counterclockwise   B    0.0
A   Counterclockwise   B    0.0
B   Counterclockwise   A    -10.0
C   Counterclockwise   A    0.0
B   Counterclockwise   C    0.0
A   Counterclockwise   C    -8.0
C   Counterclockwise   B    -8.0
We start by estimating the transition function T(s, a, s') and the reward function R(s, a, s') for this MDP. Fill in the missing values in the following table for T(s, a, s') and R(s, a, s').
Discount factor γ = 0.5

s   a                  s'   T(s, a, s')   R(s, a, s')
A   Clockwise          B    ?             ?
A   Clockwise          C    ?             ?
A   Counterclockwise   B    0.400         0.000
A   Counterclockwise   C    0.600         -8.000
B   Clockwise          A    0.800         -3.000
B   Clockwise          C    0.200         0.000
B   Counterclockwise   A    0.800         -10.000
B   Counterclockwise   C    0.200         0.000
C   Clockwise          A    0.600         0.000
C   Clockwise          B    0.400         6.000
C   Counterclockwise   A    0.200         0.000
C   Counterclockwise   B    0.800         -8.000
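The entries in this table are count-based (maximum-likelihood) estimates: T(s, a, s') is the fraction of the times that taking a in s led to s', and R(s, a, s') is the average reward observed on that transition. A minimal Python sketch, assuming the samples are stored as (s, a, s', r) tuples; it uses only the 15 Counterclockwise samples and reproduces the Counterclockwise rows of the table. Appending the Clockwise samples in the same format would yield the A/Clockwise entries the same way.

```python
from collections import Counter, defaultdict

# The 15 Counterclockwise samples (s, a, s', r) from the sample list above.
SAMPLES = [
    ("A", "Counterclockwise", "C", -8.0),
    ("B", "Counterclockwise", "A", -10.0),
    ("C", "Counterclockwise", "B", -8.0),
    ("A", "Counterclockwise", "C", -8.0),
    ("B", "Counterclockwise", "A", -10.0),
    ("C", "Counterclockwise", "B", -8.0),
    ("C", "Counterclockwise", "B", -8.0),
    ("B", "Counterclockwise", "A", -10.0),
    ("A", "Counterclockwise", "B", 0.0),
    ("A", "Counterclockwise", "B", 0.0),
    ("B", "Counterclockwise", "A", -10.0),
    ("C", "Counterclockwise", "A", 0.0),
    ("B", "Counterclockwise", "C", 0.0),
    ("A", "Counterclockwise", "C", -8.0),
    ("C", "Counterclockwise", "B", -8.0),
]


def estimate_model(samples):
    """Count-based estimates:
    T(s, a, s') = N(s, a, s') / N(s, a)
    R(s, a, s') = mean reward observed on the transition (s, a, s')
    """
    n_sa = Counter()               # visits to each (s, a)
    n_sas = Counter()              # visits to each (s, a, s')
    reward_sum = defaultdict(float)
    for s, a, s2, r in samples:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s2)] += 1
        reward_sum[(s, a, s2)] += r
    T = {k: n / n_sa[k[:2]] for k, n in n_sas.items()}
    R = {k: reward_sum[k] / n for k, n in n_sas.items()}
    return T, R


T_hat, R_hat = estimate_model(SAMPLES)
for key in sorted(T_hat):
    print(*key, f"T = {T_hat[key]:.3f}", f"R = {R_hat[key]:.3f}")
```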
