
BPR

BPR stands for Bayesian Personalized Ranking. In matrix factorization (MF), the prediction is computed as the dot product of the user factors and the item factors:

$$\hat{x}_{ui} = \langle w_u, h_i \rangle = \sum_{f=1}^k w_{uf} \cdot h_{if}$$

The usual approach for item recommenders is to predict a personalized score $\hat{x}_{ui}$ that reflects the user's preference for the item, and then rank the items by sorting them according to that score. Machine learning models for this task are typically fit by treating observed items as positive samples and missing ones as the negative class. A perfect model would thus be useless, as it would classify as negative (non-interesting) all the items that were not observed at training time, which are exactly the ones it is supposed to rank. The only reason why such methods work is regularization.

BPR uses a different approach. The training dataset is composed of triplets $(u,i,j)$, each representing that user $u$ is assumed to prefer item $i$ over item $j$. For an implicit-feedback dataset this means that $u$ observed $i$ but not $j$:

$$D_S := \{(u,i,j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+\}$$
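As a toy illustration, here is a minimal Python sketch that enumerates these triplets from a made-up dense 0/1 interaction matrix (the data and variable names are hypothetical):

```python
import numpy as np

# Toy implicit-feedback matrix: rows are users, columns are items,
# 1 = observed interaction (made-up data).
R = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
])

def all_triplets(R):
    """Enumerate D_S = {(u, i, j) : i in I_u^+, j in I \\ I_u^+}."""
    triplets = []
    for u in range(R.shape[0]):
        pos = np.flatnonzero(R[u] == 1)  # I_u^+: items u interacted with
        neg = np.flatnonzero(R[u] == 0)  # I \ I_u^+: unobserved items
        for i in pos:
            for j in neg:
                triplets.append((u, i, j))
    return triplets

print(all_triplets(R)[:4])  # e.g. (0, 1, 0): user 0 prefers item 1 over item 0
```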

A machine learning model can be represented by a parameter vector $\Theta$ which is found at fitting time. BPR wants to find the parameter vector that is most probable given the desired, but latent, preference structure $>_u$. Assuming that users act independently of each other, and that the ordering of each item pair is independent of every other pair, the likelihood decomposes into a product over the training triples:

$$
\begin{aligned}
p(\Theta \mid >_u) &\propto p(>_u \mid \Theta)\, p(\Theta) \\
\prod_{u \in U} p(>_u \mid \Theta) &= \dots = \prod_{(u,i,j) \in D_S} p(i >_u j \mid \Theta)
\end{aligned}
$$

The probability that a user really prefers item $i$ to item $j$ is defined as:

$$p(i >_u j \mid \Theta) := \sigma(\hat{x}_{uij}(\Theta))$$

where $\sigma$ is the logistic sigmoid and $\hat{x}_{uij}(\Theta)$ is an arbitrary real-valued function of the parameter vector $\Theta$ (the output of whatever model you choose).

For example, user $u_1$ has interacted with item $i_2$ but not item $i_1$, so we assume that this user prefers item $i_2$ over $i_1$. For two items that the user has both interacted with, we cannot infer any preference. The same holds for two items that the user has not interacted with yet (e.g. items $i_1$ and $i_4$ for user $u_1$).

![Deriving pairwise preferences from implicit feedback](/img/content-models-raw-mp2-bpr-untitled.png)

To complete the Bayesian setting, we define a prior density for the parameters:

$$p(\Theta) \sim N(0, \Sigma_\Theta)$$

And we can now formulate the maximum posterior estimator:

$$
\begin{aligned}
\text{BPR-OPT} &:= \log p(\Theta \mid >_u) \\
&= \log p(>_u \mid \Theta)\, p(\Theta) \\
&= \log \prod_{(u,i,j) \in D_S} \sigma(\hat{x}_{uij})\, p(\Theta) \\
&= \sum_{(u,i,j) \in D_S} \log \sigma(\hat{x}_{uij}) + \log p(\Theta) \\
&= \sum_{(u,i,j) \in D_S} \log \sigma(\hat{x}_{uij}) - \lambda_\Theta \|\Theta\|^2
\end{aligned}
$$

where $\lambda_\Theta$ are model-specific regularization parameters.
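Before maximizing it, it can help to see the criterion as code. A minimal numpy sketch, assuming the score differences $\hat{x}_{uij}$ and a parameter vector are already available as arrays (the values below are made up):

```python
import numpy as np

def bpr_opt(x_uij, theta, lam=0.01):
    """BPR-OPT = sum_(u,i,j) log sigma(x_uij) - lambda * ||theta||^2.

    log(sigma(x)) = -log(1 + exp(-x)), computed stably via np.logaddexp.
    """
    log_sigmoid = -np.logaddexp(0.0, -np.asarray(x_uij, dtype=float))
    return log_sigmoid.sum() - lam * np.sum(np.asarray(theta) ** 2)

# Hypothetical score differences and parameters, just to exercise the function.
print(bpr_opt(x_uij=[0.8, -0.3, 1.5], theta=[0.1, -0.2, 0.05]))
```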

Once we have this criterion, we need to maximize it in order to find our optimal $\Theta$. As the criterion is differentiable, gradient descent algorithms are an obvious choice for maximization.

The basic version of gradient descent consists of evaluating the gradient over all available samples and then performing a single update. The problem, in our case, is that the training dataset is very skewed: suppose an item $i$ is very popular; then the loss contains many terms of the form $\hat{x}_{uij}$, because for many users $u$ the item $i$ is compared against all negative items $j$. The other popular approach is stochastic gradient descent, where an update is performed for each training sample. This works better, but the order in which the samples are traversed is crucial, since traversing them user-wise or item-wise means consecutive updates keep touching the same parameters. To solve this issue, BPR uses a stochastic gradient descent algorithm that draws the triples uniformly at random, with replacement (bootstrap sampling).
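Here is a minimal sketch of such a sampler. As in common practical implementations, it draws a user first and then a positive and a negative item for that user, sampling with replacement across calls:

```python
import numpy as np

rng = np.random.default_rng(42)
R = np.array([          # toy 0/1 interaction matrix, as before
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
])

def sample_triplet(R):
    """Draw one (u, i, j) at random, with replacement across calls."""
    while True:
        u = rng.integers(R.shape[0])
        pos = np.flatnonzero(R[u] == 1)  # observed items
        neg = np.flatnonzero(R[u] == 0)  # unobserved items
        if len(pos) and len(neg):        # skip users with no valid pairs
            return u, rng.choice(pos), rng.choice(neg)

print([sample_triplet(R) for _ in range(3)])
```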

The gradient of BPR-OPT with respect to the model parameters is:

$$
\begin{aligned}
\frac{\partial\, \text{BPR-OPT}}{\partial \Theta} &= \sum_{(u,i,j) \in D_S} \frac{\partial}{\partial \Theta} \log \sigma(\hat{x}_{uij}) - \lambda_\Theta \frac{\partial}{\partial \Theta} \|\Theta\|^2 \\
&\propto \sum_{(u,i,j) \in D_S} \frac{e^{-\hat{x}_{uij}}}{1+e^{-\hat{x}_{uij}}} \cdot \frac{\partial}{\partial \Theta} \hat{x}_{uij} - \lambda_\Theta \Theta
\end{aligned}
$$

In order to practically apply this learning scheme to an existing algorithm, we first split the real-valued preference term: $\hat{x}_{uij} := \hat{x}_{ui} - \hat{x}_{uj}$. Now we can plug in any standard collaborative filtering model that predicts $\hat{x}_{ui}$.

The problem of predicting $\hat{x}_{ui}$ can be seen as the task of estimating a matrix $X : U \times I$. With matrix factorization, the target matrix $X$ is approximated by the matrix product of two low-rank matrices $W : |U| \times k$ and $H : |I| \times k$:

$$X := W H^t$$

The prediction formula can also be written as:

$$\hat{x}_{ui} = \langle w_u, h_i \rangle = \sum_{f=1}^k w_{uf} \cdot h_{if}$$

Besides the dot product $\langle \cdot, \cdot \rangle$, in general any kernel can be used.
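A quick numpy illustration of these two equivalent views of the prediction, with randomly initialized factors purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 3, 4, 2
W = rng.normal(size=(n_users, k))  # user factors, |U| x k
H = rng.normal(size=(n_items, k))  # item factors, |I| x k

X_hat = W @ H.T      # whole predicted matrix, X := W H^t
x_ui = W[0] @ H[2]   # a single entry, <w_u, h_i> with u = 0, i = 2
assert np.isclose(X_hat[0, 2], x_ui)
```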

We can now specify the derivatives:

$$
\frac{\partial}{\partial \theta} \hat{x}_{uij} =
\begin{cases}
(h_{if} - h_{jf}) & \text{if } \theta = w_{uf}, \\
w_{uf} & \text{if } \theta = h_{if}, \\
-w_{uf} & \text{if } \theta = h_{jf}, \\
0 & \text{else}
\end{cases}
$$

Which basically means: user $u$ prefers $i$ over $j$, so we do the following (sketched in code after this list):

  • Increase the relevance (according to $u$) of features belonging to $i$ but not to $j$, and vice versa
  • Increase the relevance of features assigned to $i$
  • Decrease the relevance of features assigned to $j$
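Putting the pieces together, the following is a minimal sketch of one stochastic gradient-ascent update for BPR-MF, implementing the case-wise derivatives above; the learning rate and regularization values are illustrative:

```python
import numpy as np

def bpr_mf_step(W, H, u, i, j, lr=0.05, lam=0.01):
    """One stochastic gradient-ascent update on a sampled triple (u, i, j)."""
    x_uij = W[u] @ (H[i] - H[j])     # x_ui - x_uj
    g = 1.0 / (1.0 + np.exp(x_uij))  # e^{-x} / (1 + e^{-x}) = 1 - sigma(x)
    wu, hi, hj = W[u].copy(), H[i].copy(), H[j].copy()
    W[u] += lr * (g * (hi - hj) - lam * wu)  # d x_uij / d w_uf = h_if - h_jf
    H[i] += lr * (g * wu - lam * hi)         # d x_uij / d h_if = w_uf
    H[j] += lr * (-g * wu - lam * hj)        # d x_uij / d h_jf = -w_uf

# Quick smoke test with random factors and one hand-picked triple.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
H = rng.normal(scale=0.1, size=(4, 2))
bpr_mf_step(W, H, u=0, i=1, j=2)
```

In a full training loop, the triple $(u, i, j)$ would come from the bootstrap sampler shown earlier, and this step would be repeated until convergence.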