DRR Framework

The Actor network

For a given user, the Actor network is responsible for generating an action $a$ based on the user's state $s$. Let us walk through the network from input to output. In DRR, the user state, given by the embeddings of the user's $n$ latest positively interacted items, is taken as the input. These embeddings are fed into a state representation module to produce a summarized representation $s$ for the user. For instance, at timestep $t$, the state can be defined as $s_t = f(H_t)$, where $f(\cdot)$ stands for the state representation module, $H_t = \{i_1, ..., i_n\}$ denotes the embeddings of the latest positive interaction history, and each $i_j \in \mathbb{R}^{1 \times k}$ is a $k$-dimensional vector.

When the recommender agent recommends an item $i_t$, if the user provides positive feedback, then at the next timestep the state is updated to $s_{t+1} = f(H_{t+1})$, where $H_{t+1} = \{i_2, ..., i_n, i_t\}$; otherwise, $H_{t+1} = H_t$. The reasons for defining the state in this manner are twofold: (i) a good recommender system should cater to the users' taste, i.e., the items the users like; (ii) the latest records reflect the users' recent interests more precisely.
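As a concrete illustration, here is a minimal sketch of this positive-only history update, assuming the history $H_t$ is kept as a fixed-length deque of item embeddings (the names are illustrative, not from the DRR paper):

from collections import deque
import numpy as np

n, k = 5, 100
history = deque(np.random.randn(n, k), maxlen=n)  # H_t: the n latest positively rated items

def update_history(history, item_embedding, positive_feedback):
    """H_{t+1}: slide the window only when the feedback is positive."""
    if positive_feedback:
        history.append(item_embedding)  # the deque drops the oldest item automatically
    return history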

Finally, through two ReLU layers and one Tanh layer, the state representation $s$ is transformed into an action $a = \pi_\theta(s)$, the output of the Actor network. In particular, the action $a$ is defined as a ranking function represented by a continuous parameter vector $a \in \mathbb{R}^{1 \times k}$. Using this action, the ranking score of item $i_t$ is defined as $score_t = i_t a^T$. The top-ranked item (w.r.t. the ranking scores) is then recommended to the user. Note that the widely used $\varepsilon$-greedy exploration technique is adopted here.
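A small sketch of how the action vector could be turned into a recommendation, with $\varepsilon$-greedy exploration on top; candidate_embeddings and epsilon are illustrative assumptions, not names from the paper:

import numpy as np

def recommend(action, candidate_embeddings, epsilon=0.1):
    """Score candidates with score_t = i_t a^T and return the index of the best one,
    exploring a random candidate with probability epsilon."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(candidate_embeddings))
    scores = candidate_embeddings @ action  # shape: (num_candidates,)
    return int(np.argmax(scores))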

Implementation in TensorFlow

import tensorflow as tf
import numpy as np

class ActorNetwork(tf.keras.Model):
    def __init__(self, embedding_dim, hidden_dim):
        super(ActorNetwork, self).__init__()
        # the state is 3*embedding_dim wide (e.g. the DRR-ave concatenation of
        # user, user-item interaction, and averaged item embeddings)
        self.inputs = tf.keras.layers.InputLayer(name='input_layer', input_shape=(3*embedding_dim,))
        self.fc = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden_dim, activation='relu'),
            tf.keras.layers.Dense(hidden_dim, activation='relu'),
            tf.keras.layers.Dense(embedding_dim, activation='tanh')
        ])

    def call(self, x):
        x = self.inputs(x)
        return self.fc(x)

class Actor(object):

    def __init__(self, embedding_dim, hidden_dim, learning_rate, state_size, tau):

        self.embedding_dim = embedding_dim
        self.state_size = state_size

        # actor network / target network
        self.network = ActorNetwork(embedding_dim, hidden_dim)
        self.target_network = ActorNetwork(embedding_dim, hidden_dim)
        # optimizer
        self.optimizer = tf.keras.optimizers.Adam(learning_rate)
        # soft target network update hyperparameter
        self.tau = tau

    def build_networks(self):
        # build the networks by running a dummy state through them once
        self.network(np.zeros((1, 3*self.embedding_dim)))
        self.target_network(np.zeros((1, 3*self.embedding_dim)))

    def update_target_network(self):
        # soft target network update: theta' <- tau*theta + (1 - tau)*theta'
        c_theta, t_theta = self.network.get_weights(), self.target_network.get_weights()
        for i in range(len(c_theta)):
            t_theta[i] = self.tau * c_theta[i] + (1 - self.tau) * t_theta[i]
        self.target_network.set_weights(t_theta)

    def train(self, states, dq_das):
        with tf.GradientTape() as g:
            outputs = self.network(states)
            # loss = outputs*dq_das
        # chain rule: dJ/dtheta = dQ/da * da/dtheta; the minus sign turns the
        # gradient ascent on Q into a descent step for the optimizer
        dj_dtheta = g.gradient(outputs, self.network.trainable_weights, -dq_das)
        grads = zip(dj_dtheta, self.network.trainable_weights)
        self.optimizer.apply_gradients(grads)

    def save_weights(self, path):
        self.network.save_weights(path)

    def load_weights(self, path):
        self.network.load_weights(path)
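For reference, a minimal usage sketch of the Actor wrapper above; the hyperparameter values are arbitrary placeholders:

actor = Actor(embedding_dim=100, hidden_dim=128, learning_rate=1e-3,
              state_size=10, tau=0.001)
actor.build_networks()

state = np.zeros((1, 3 * 100), dtype=np.float32)  # e.g. a DRR-ave state
action = actor.network(state)                     # shape (1, 100), values in [-1, 1]
actor.update_target_network()                     # nudge the target weights toward the online ones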

Implementation in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, in_features=100, out_features=18):
        super(Actor, self).__init__()
        self.in_features = in_features
        self.out_features = out_features

        self.linear1 = nn.Linear(self.in_features, self.in_features)
        self.linear2 = nn.Linear(self.in_features, self.in_features)
        self.linear3 = nn.Linear(self.in_features, self.out_features)

    def forward(self, state):
        output = F.relu(self.linear1(state))
        output = F.relu(self.linear2(output))
        output = torch.tanh(self.linear3(output))
        return output

The Critic network

The Critic part in DRR is a Deep Q-Network, which leverages a deep neural network parameterized as $Q_\omega(s, a)$ to approximate the true state-action value function $Q_\pi(s, a)$, namely, the Q-value function. The Q-value function reflects the merit of the action policy generated by the Actor network. Specifically, the input of the Critic network is the user state $s$ generated by the state representation module and the action $a$ generated by the policy network, and the output is the Q-value, which is a scalar. According to the Q-value, the parameters of the Actor network are updated in the direction of improving the performance of action $a$, i.e., boosting $Q_\omega(s, a)$.

Based on the deterministic policy gradient theorem, we can update the Actor by the sampled policy gradient $\nabla_\theta J(\pi_\theta) \approx \frac{1}{N}\sum_t \nabla_a Q_\omega(s,a)|_{s=s_t, a=\pi_\theta(s_t)} \nabla_\theta \pi_\theta(s)|_{s=s_t}$, where $J(\pi_\theta)$ is the expectation of all possible Q-values that follow the policy $\pi_\theta$. Here the mini-batch strategy is utilized and $N$ denotes the batch size. Moreover, the Critic network is updated accordingly by the temporal-difference learning approach, i.e., by minimizing the mean squared error $L = \frac{1}{N}\sum_i (y_i - Q_\omega(s_i, a_i))^2$, where $y_i = r_i + \gamma Q_{\omega'}(s_{i+1}, \pi_{\theta'}(s_{i+1}))$. The target network technique is also adopted in the DRR framework, where $\omega'$ and $\theta'$ are the parameters of the target Critic and Actor networks.

Implementation in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    def __init__(self, action_size=20, in_features=128, out_features=18):
        super(Critic, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.combo_features = in_features + action_size
        self.action_size = action_size

        self.linear1 = nn.Linear(self.in_features, self.in_features)
        self.linear2 = nn.Linear(self.combo_features, self.combo_features)
        self.linear3 = nn.Linear(self.combo_features, self.combo_features)
        self.output_layer = nn.Linear(self.combo_features, self.out_features)

    def forward(self, state, action):
        # the action is injected after the first state-only layer
        output = F.relu(self.linear1(state))
        output = torch.cat((action, output), dim=1)
        output = F.relu(self.linear2(output))
        output = F.relu(self.linear3(output))
        output = self.output_layer(output)
        return output
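The policy-gradient and temporal-difference updates described above translate fairly directly into code. Below is a minimal PyTorch sketch of one update step, assuming `actor`, `critic`, their target copies, their optimizers, and a sampled minibatch already exist, and that the critic outputs a scalar Q-value per sample; all of these names are illustrative:

import torch
import torch.nn.functional as F

def update_step(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.9):
    states, actions, rewards, next_states = batch   # tensors sampled from the replay buffer

    # Critic: minimize L = 1/N * sum_i (y_i - Q_w(s_i, a_i))^2
    with torch.no_grad():
        y = rewards + gamma * target_critic(next_states, target_actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the sampled deterministic policy gradient by
    # minimizing -1/N * sum_t Q_w(s_t, pi_theta(s_t))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()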

The State Representation Module

As noted above, the state representation module plays an important role in both the Actor and the Critic networks. Hence, it is crucial to design a good structure to model the state.

DRR-p Structure

(Figure: the DRR-p state representation structure)

DRR-u Structure

(Figure: the DRR-u state representation structure)

DRR-ave Structure

(Figure: the DRR-ave state representation structure)

Implementation in PyTorch

import torch
import torch.nn as nn
from numpy.random import RandomState

class DRRAveStateRepresentation(nn.Module):
    def __init__(self, n_items=5, item_features=100, user_features=100):
        super(DRRAveStateRepresentation, self).__init__()
        self.n_items = n_items
        self.random_state = RandomState(1)
        self.item_features = item_features
        self.user_features = user_features

        self.attention_weights = nn.Parameter(torch.from_numpy(0.1 * self.random_state.rand(self.n_items)).float())

    def forward(self, user, items):
        '''
        DRR-AVE State Representation
        :param items: (torch tensor) shape = (n_items x item_features),
            Matrix of items in history buffer
        :param user: (torch tensor) shape = (1 x user_features),
            User embedding
        :return: output: (torch tensor) shape = (3 * item_features)
        '''
        # weighted average pooling of the item embeddings
        right = items.t() @ self.attention_weights
        # element-wise user-item interaction
        middle = (user * right).flatten()
        # state = concat(user, user * ave_items, ave_items)
        output = torch.cat((user.flatten(), middle, right), 0)
        return output
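A quick usage sketch of the module above; the shapes follow the docstring:

state_module = DRRAveStateRepresentation(n_items=5, item_features=100, user_features=100)
user = torch.randn(1, 100)    # user embedding
items = torch.randn(5, 100)   # embeddings of the 5 latest positively rated items
state = state_module(user, items)
print(state.shape)            # torch.Size([300]): concat of user, user*ave, ave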

Training Procedure of the DRR Framework

The training procedure mainly includes two phases, i.e., transition generation (lines 7-12) and model updating (lines 13-17).

For the first stage, the recommender observes the current state $s_t$, which is calculated by the proposed state representation module, then generates an action $a_t = \pi_\theta(s_t)$ according to the current policy $\pi_\theta$ with $\varepsilon$-greedy exploration, and recommends an item $i_t$ according to the action $a_t$ (lines 8-9). Subsequently, the reward $r_t$ is calculated based on the user's feedback on the recommended item $i_t$, and the user state is updated (lines 10-11). Finally, the recommender agent stores the transition $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $D$ (line 12).
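Putting the earlier sketches together, one transition-generation step could look roughly like the following; it reuses `recommend`, `update_history`, `state_module`, and `actor` from the sketches above (with the history deque now holding torch tensors), while `get_reward` stands in for the user or simulator feedback. All names are illustrative:

replay_buffer = []  # D

def generate_transition(user, history, item_embeddings, state_module, actor, epsilon=0.1):
    state = state_module(user, torch.stack(list(history)))        # s_t from H_t
    action = actor(state)                                         # a_t = pi_theta(s_t)
    idx = recommend(action.detach().numpy(), item_embeddings.numpy(), epsilon)
    reward = get_reward(user, idx)                                # r_t from the feedback
    history = update_history(history, item_embeddings[idx], reward > 0)
    next_state = state_module(user, torch.stack(list(history)))   # s_{t+1}
    replay_buffer.append((state, action, reward, next_state))     # store the transition in D
    return history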

(Algorithm 1: the training procedure of the DRR framework)

In the second stage, model updating, the recommender samples a minibatch of $N$ transitions with the widely used prioritized experience replay sampling technique (line 13), which is essentially an importance sampling strategy. Then, the recommender updates the parameters of the Actor network and the Critic network respectively (lines 14-16). Finally, the recommender updates the target networks' parameters with the soft replace strategy.
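Prioritized experience replay is not shown in the snippets above. A minimal proportional-priority buffer could look like the sketch below (simplified to a plain list instead of a sum-tree; `alpha` and `beta` are the usual PER hyperparameters):

import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def push(self, transition, priority=1.0):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # importance-sampling weights correct the bias of the non-uniform sampling
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(float(err)) + eps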

Evaluation

The most straightforward way to evaluate RL-based models is to conduct online experiments on a recommender system where the recommender directly interacts with users. However, the underlying commercial risk and the costly deployment on the platform make this impractical. Therefore, throughout the testing phase, the models are evaluated on public offline datasets in two ways: offline evaluation and online evaluation with a simulator.

Offline evaluation

Intuitively, the offline evaluation of the trained models tests the recommendation performance with the learned policy, as described in Algorithm 2. Specifically, for a given session $S_j$, the recommender only recommends items that appear in this session, denoted as $I(S_j)$, rather than items from the whole item space. The reason is that we only have ground-truth feedback for the items in the session in the recorded offline log. At each timestep, the recommender agent takes an action $a_t$ according to the learned policy $\pi_\theta$ and recommends an item $i_t \in I(S_j)$ based on the action $a_t$ (lines 4-5). After that, the recommender observes the reward $r_t = R(s_t, a_t)$ according to the feedback on the recommended item $i_t$ (lines 5-6). Then the user state is updated to $s_{t+1}$ and the recommended item $i_t$ is removed from the candidate set $I(S_j)$ (lines 7-8). The offline evaluation procedure can be viewed as reranking the candidate set by iteratively selecting an item w.r.t. the action generated by the Actor network in the DRR framework. Moreover, the model parameters are not updated during the offline evaluation.

(Algorithm 2: the offline evaluation procedure of the DRR framework)
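Under the same illustrative naming as the earlier sketches, the offline (rerank) evaluation of a single recorded session could look like this, where `session_items` and `session_rewards` are the logged items $I(S_j)$ and their ground-truth feedback:

def evaluate_session(user, history, session_items, session_rewards, state_module, actor):
    """Rerank one session's items with the learned (frozen) policy."""
    candidates = list(range(len(session_items)))                  # indices into I(S_j)
    total_reward = 0.0
    for _ in range(len(session_items)):
        state = state_module(user, torch.stack(list(history)))   # s_t
        action = actor(state).detach().numpy()                   # a_t
        scores = session_items.numpy()[candidates] @ action
        picked = candidates[int(np.argmax(scores))]               # i_t in I(S_j)
        reward = session_rewards[picked]                          # ground-truth feedback from the log
        history = update_history(history, session_items[picked], reward > 0)
        candidates.remove(picked)                                 # remove i_t from the candidate set
        total_reward += reward
    return total_reward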

Online evaluation with environment simulator

As mentioned above, it is risky and costly to directly deploy RL-based models on a live recommender system. Therefore, online evaluation can instead be conducted with an environment simulator. Basically, a PMF model is pretrained as the environment simulator, i.e., to predict the feedback for items that the user has never rated before. The online evaluation procedure follows Algorithm 1, i.e., the parameters continue to update during the online evaluation stage. Its major difference from Algorithm 1 is that the feedback of a recommended item is obtained from the environment simulator. Moreover, before each recommendation session starts in the simulated online evaluation, the parameters are reset back to $\theta$ and $\omega$, i.e., the policy learned in the training stage, for a fair comparison.
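A minimal sketch of such a simulator, assuming the PMF model has already been factorized into user and item latent matrices `U` and `V`; the class name and the reward scaling are illustrative assumptions, not part of the DRR paper:

import numpy as np

class PMFSimulator:
    """Predicts feedback for unrated items from pretrained PMF factors."""
    def __init__(self, U, V, max_rating=5.0):
        self.U, self.V = U, V              # user / item latent factor matrices
        self.max_rating = max_rating

    def get_reward(self, user_id, item_id):
        predicted_rating = float(self.U[user_id] @ self.V[item_id])
        # map the predicted rating to a reward, e.g. normalize to [-1, 1]
        return 2.0 * (predicted_rating / self.max_rating) - 1.0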