|
Gyozo Gidofalvi
04-04-2002
06:23 PM ET (US)
|
As Dave suggested the paper what very well structured and clearly presented the backbround on reinforcement learning.
I was a bit also unsure about how the simplistic approaches were trained and what information was available to them, but nontheless the results presented in figures 8 and 9 were very interesting and suggested that the approcah presented here seems to have achieved its goal.
|
|
Eric Wiewiora
04-04-2002
05:58 PM ET (US)
|
I have questions on the dataset used. Perhaps I'm naive, but I'd think that the mailings recorded in the dataset were the result of some simple policy. This would seem to restrict the policy space to a subset of what has been tried in the previous mailing campaigns. Does the learner only learn who it was not worth mailing to in the past?
|
Dave Kauchak 
04-04-2002
04:21 PM ET (US)
|
I too thought this paper addressed an interesting approach to the marketing problem. I particularly enjoyed the introductory material, which I felt did a good job of giving background.
However, I found some of the experimental set-up and experimentation sections a bit vague. I had a hard time figuring out exactly what data was used for training and testing. Also, the paper compares a simplistic repeated version versus the batch learners. Is the simplistic approach simply trained once and then repeatedly applied? Is there extra information available after a campaign has been done? Do these systems relearn between campaigns? If not, I think the simplistic approach may be at a disadvantage, but that is just my intuition (and I don't claim to perfectly understand everything :).
Although the use of a simulation model is nice in that it provides a form of evaluation for testing methods, I wished the paper had analyzed the effects of testing using a simulation. For example, is there any real data availabe for which the simulation can be validated? What affect will the simulation simplification have on the results? Does the simulation model favor any of the algorithms?
Finally, just a style note, I think Figure 6 should probably be eliminated. The information presented in it is also presented in 7 and I don't think that it is that beneficial to have BatchRL(sarsa) shown by itself.
|
|
Dana Dahlstrom
04-04-2002
03:24 PM ET (US)
|
[follow-up to previous post]
Actually, looking forward, my proposed patch would require modification to Equations 5 and 7 (the Q-learning and sarsa update rules) as well. It's probably better to revise the notation the other way:
1. Change the descriptions of the state, action, and reward sequences to begin with zero elements.
2. Change the summations in Equations 1 and 2 to begin with $t=0$.
3. Change $\gamma^{t-1}$ to $\gamma^t$ in Equation 1.
This way action $a_t$ is performed in state $s_t$ and reward $r_t$ results; I believe this is conventional.
|
|
Dana Dahlstrom
04-04-2002
03:03 PM ET (US)
|
I think Kristin's qualm with the notation in section 2.1 is right; it seems there is some inconsistency:
In paragraph 3 the sequences of actions and rewards begin with $a_1$ and $r_1$. The state $s_0$ is mentioned, but not $a_0$ or $r_0$. This leads me to suppose $r_i$ is a function of $s_{i-1}$ and $a_i$.
Equation 1 is consistent with this supposition: $r_1$ is discounted by $\gamma^{0} = 1$, and so is treated as an immediate reward. But in Equation 2 $r_1$ is discounted by $\gamma$, and suddenly there is an $a_0$. I think this can be remedied by changing $\gamma^t$ to $\gamma^{t-1}$ and $a_0$ to $a_1$ in Equation 2.
Equation 3 is consistent with the others if the pair $(s,a)$ corresponds to pairs $(s_t,a_{t+1})$ and $r$ correponds to $r_{t+1}$ from the sequences. That is, action $a_{t+1}$ is performed in state $s_t$, and the reward $r_{t+1}$ results.
Perhaps I overlooked something. Does this sound right?
|
|
Aldebaro Klautau
04-04-2002
12:39 PM ET (US)
|
"Function approximation" is the key component to deal with the potentially too large state space (virtually infinite, considering there are real-valued features). Last year (http://www-cse.ucsd.edu/~elkan/254spring01/), Sameer talked about reinforcement learning applied to the game Othello, and a neural network was used as function approximator, to avoid the combinatorial explosion of states. I think the multivariate linear regression tree in ProbE performs a similar task. The tree can have nodes like "income < $30K", etc. and the system doesn't need to spend states for all values of feature "income".
|
|
Kristin Branson
04-04-2002
04:00 AM ET (US)
|
In response to Greg's question, I think the iteration in Figures 6 and 7 is the value iteration number (described in equation (4)). Iteration 0 represents the estimate of the Q* value function by Q_0, iteration 1 represents the estimate of the Q* value function by Q_1, ...
I thought that this paper thoroughly presented an interesting approach to this problem. Not being all that familiar with the targeted marketing problem, I would have appreciated a more in depth motivation for the algorithm (other than the results), i.e. some demonstration of how hurtful the assumptions made in single-event targeted marketing could be. However, I am impressed that Bianca was able to do this all in just a few months -- it seems like applying reinforcement learning to this application requires a lot of thought and attention to detail.
I am having trouble figuring out exactly what constitutes a single state. With the assumption of a finite number of states but continuous valued probability and amount of donation functions, as well as a large number of customer profiles that will change with time in different ways depending on what actions are taken, it seems like the number of states necessary to represent all this information would be huge. Probably, there is some simplification that I missed that explains how the number of states is kept tractible. An example of a state would be very helpful to me.
Finally, I have a qualm with the introduction to reinforcement learning and MDP's (section 2.1). I think equation (2) is missing some mention of the reward for performing action a in state s at the initial time, if it is to be consistent with equation (3). I also think section 2.1 requires a mention/emphasis of the assumption that, for MDP's, the state, action, and reward at time t+1 only depends on the state at time t.
|
|
Greg Hamerly
04-04-2002
02:27 AM ET (US)
|
This paper was a fun read. One thing I'm unclear on is what iterations 1-6 represent in figures 6/7. Are they the recursion depth to which they allows the reinforcement learning to proceed? Also, how is "life time profit" (figure 6) different than "total profit" (figure 7) -- since the sarsa curves look the same in both.
I do think the most intriguing thing is shown in figure 8, where the sarsa-RL approach conserves its mailings, showing a rather intelligent long-term approach.
|
Charles Elkan 
04-02-2002
03:52 PM ET (US)
|
This is the place to discuss "Sequential cost-sensitive decision-making with reinforcement learning" by E. Pednault, N. Abe, B. Zadrozny and colleagues. Because the paper has not yet been published, no link to it is published. The paper will be distributed in class on Tuesday April 2. Edited 04-02-2002 03:53 PM
|
|
|