QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: Sequential cost-sensitive decision-making with reinforcement learning
Views: 304, Unique: 199 
Subscribers: 0
What's
this?
Printer-Friendly Page
Subscribe to get & post, or stop messages by email Subscribe
About these ads
Who | When
Messagessort recent-bottom   
Post a new message
 
Gyozo Gidofalvi  9
04-04-2002 06:23 PM ET (US)
As Dave suggested the paper what very well structured and clearly presented the backbround on reinforcement learning.

I was a bit also unsure about how the simplistic approaches were trained and what information was available to them, but nontheless the results presented in figures 8 and 9 were very interesting and suggested that the approcah presented here seems to have achieved its goal.
Eric Wiewiora  8
04-04-2002 05:58 PM ET (US)
I have questions on the dataset used. Perhaps I'm naive,
but I'd think that the mailings recorded in the dataset
were the result of some simple policy. This would seem
to restrict the policy space to a subset of what has been
tried in the previous mailing campaigns. Does the learner
only learn who it was not worth mailing to in the past?
Dave KauchakPerson was signed in when posted  7
04-04-2002 04:21 PM ET (US)
I too thought this paper addressed an interesting approach to the marketing problem. I particularly enjoyed the introductory material, which I felt did a good job of giving background.

However, I found some of the experimental set-up and experimentation sections a bit vague. I had a hard time figuring out exactly what data was used for training and testing. Also, the paper compares a simplistic repeated version versus the batch learners. Is the simplistic approach simply trained once and then repeatedly applied? Is there extra information available after a campaign has been done? Do these systems relearn between campaigns? If not, I think the simplistic approach may be at a disadvantage, but that is just my intuition (and I don't claim to perfectly understand everything :).

Although the use of a simulation model is nice in that it provides a form of evaluation for testing methods, I wished the paper had analyzed the effects of testing using a simulation. For example, is there any real data availabe for which the simulation can be validated? What affect will the simulation simplification have on the results? Does the simulation model favor any of the algorithms?

Finally, just a style note, I think Figure 6 should probably be eliminated. The information presented in it is also presented in 7 and I don't think that it is that beneficial to have BatchRL(sarsa) shown by itself.
Dana Dahlstrom  6
04-04-2002 03:24 PM ET (US)
[follow-up to previous post]

Actually, looking forward, my proposed patch would
require modification to Equations 5 and 7 (the
Q-learning and sarsa update rules) as well. It's
probably better to revise the notation the other way:

1. Change the descriptions of the state, action, and
reward sequences to begin with zero elements.

2. Change the summations in Equations 1 and 2 to begin
with $t=0$.

3. Change $\gamma^{t-1}$ to $\gamma^t$ in Equation 1.

This way action $a_t$ is performed in state $s_t$ and
reward $r_t$ results; I believe this is conventional.
Dana Dahlstrom  5
04-04-2002 03:03 PM ET (US)
I think Kristin's qualm with the notation in section
2.1 is right; it seems there is some inconsistency:

In paragraph 3 the sequences of actions and rewards
begin with $a_1$ and $r_1$. The state $s_0$ is
mentioned, but not $a_0$ or $r_0$. This leads me to
suppose $r_i$ is a function of $s_{i-1}$ and $a_i$.

Equation 1 is consistent with this supposition: $r_1$
is discounted by $\gamma^{0} = 1$, and so is treated as
an immediate reward. But in Equation 2 $r_1$ is
discounted by $\gamma$, and suddenly there is an $a_0$.
I think this can be remedied by changing $\gamma^t$ to
$\gamma^{t-1}$ and $a_0$ to $a_1$ in Equation 2.

Equation 3 is consistent with the others if the pair
$(s,a)$ corresponds to pairs $(s_t,a_{t+1})$ and $r$
correponds to $r_{t+1}$ from the sequences. That is,
action $a_{t+1}$ is performed in state $s_t$, and the
reward $r_{t+1}$ results.

Perhaps I overlooked something. Does this sound right?
Aldebaro Klautau  4
04-04-2002 12:39 PM ET (US)
"Function approximation" is the key component to deal with the potentially too large state space (virtually infinite, considering there are real-valued features). Last year (http://www-cse.ucsd.edu/~elkan/254spring01/), Sameer talked about reinforcement learning applied to the game Othello, and a neural network was used as function approximator, to avoid the combinatorial explosion of states. I think the multivariate linear regression tree in ProbE performs a similar task. The tree can have nodes like "income < $30K", etc. and the system doesn't need to spend states for all values of feature "income".
Kristin Branson  3
04-04-2002 04:00 AM ET (US)
In response to Greg's question, I think the iteration in Figures 6 and 7 is the value iteration number (described in equation (4)). Iteration 0 represents the estimate of the Q* value function by Q_0, iteration 1 represents the estimate of the Q* value function by Q_1, ...

I thought that this paper thoroughly presented an interesting approach to this problem. Not being all that familiar with the targeted marketing problem, I would have appreciated a more in depth motivation for the algorithm (other than the results), i.e. some demonstration of how hurtful the assumptions made in single-event targeted marketing could be. However, I am impressed that Bianca was able to do this all in just a few months -- it seems like applying reinforcement learning to this application requires a lot of thought and attention to detail.

I am having trouble figuring out exactly what constitutes a single state. With the assumption of a finite number of states but continuous valued probability and amount of donation functions, as well as a large number of customer profiles that will change with time in different ways depending on what actions are taken, it seems like the number of states necessary to represent all this information would be huge. Probably, there is some simplification that I missed that explains how the number of states is kept tractible. An example of a state would be very helpful to me.

Finally, I have a qualm with the introduction to reinforcement learning and MDP's (section 2.1). I think equation (2) is missing some mention of the reward for performing action a in state s at the initial time, if it
is to be consistent with equation (3). I also think section 2.1 requires a mention/emphasis of the assumption that, for MDP's, the state, action, and reward at time t+1 only depends on the state at time t.
Greg Hamerly  2
04-04-2002 02:27 AM ET (US)
This paper was a fun read. One thing I'm unclear on is what iterations 1-6 represent in figures 6/7. Are they the recursion depth to which they allows the reinforcement learning to proceed? Also, how is "life time profit" (figure 6) different than "total profit" (figure 7) -- since the sarsa curves look the same in both.



I do think the most intriguing thing is shown in figure 8, where the sarsa-RL approach conserves its mailings, showing a rather intelligent long-term approach.
Charles ElkanPerson was signed in when posted  1
04-02-2002 03:52 PM ET (US)
Edited by author 04-02-2002 03:53 PM
This is the place to discuss "Sequential cost-sensitive decision-making with reinforcement learning" by E. Pednault, N. Abe, B. Zadrozny and colleagues.

Because the paper has not yet been published, no link to it is published. The paper will be distributed in class on Tuesday April 2.
RSS link What's this?
QuickTopicSM message boards
Over 200,000 topics served
Learn more Frequently asked questions  Acknowledgements
What they're saying about QuickTopic
 Questions, comments, or suggestions? Contact Us
Read our use policy before beginning. We value your privacy; please read our privacy statement.
Copyright ©1999-2006 Internicity Inc. All rights reserved.