
Collaborative Filtering

Similarity methods

User-based similarity

Let's take the following user-item rating matrix as an example:

| UserID/ItemID | 1 | 2 | 3 | 4 | 5 | 6 | Mean Rating |
|---|---|---|---|---|---|---|---|
| 1 | 7 | 6 | 7 | 4 | 5 | 4 | 5.5 |
| 2 | 6 | 7 | NaN | 4 | 3 | 4 | 4.8 |
| 3 | NaN | 3 | 3 | 1 | 1 | NaN | 2 |
| 4 | 1 | 2 | 2 | 3 | 3 | 4 | 2.5 |
| 5 | 1 | NaN | 1 | 2 | 3 | 3 | 2 |

For each user, the mean rating is calculated over the items that user has actually rated:

$$ \mu_u = \frac{\sum_{k \in \mathcal{I}_u} r_{uk}}{|\mathcal{I}_u|} \quad \forall u \in \{1 \dots m\} $$

where $\mathcal{I}_u$ denotes the set of items rated by user $u$.
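As a quick check, the per-user means can be computed with NumPy's `nanmean`, which ignores missing entries. This is a minimal sketch; the array `R` below is just a 0-indexed encoding of the table above, with `np.nan` marking missing ratings:

```python
import numpy as np

# Rating matrix from the table above; np.nan marks a missing rating.
R = np.array([
    [7,      6,      7,      4, 5, 4],       # User 1
    [6,      7,      np.nan, 4, 3, 4],       # User 2
    [np.nan, 3,      3,      1, 1, np.nan],  # User 3
    [1,      2,      2,      3, 3, 4],       # User 4
    [1,      np.nan, 1,      2, 3, 3],       # User 5
])

# Mean rating per user, averaging over observed (non-NaN) items only.
mu = np.nanmean(R, axis=1)
print(mu)  # [5.5 4.8 2.  2.5 2. ]
```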

Two common approaches to measuring the similarity $\mathrm{Sim}(u, v)$ between two users $u$ and $v$ are cosine similarity and the Pearson correlation coefficient:

\begin{align*}
\mathrm{Cosine}(u,v) &= \frac{\sum_{k \in \mathcal{I}_u \cap \mathcal{I}_v} r_{uk} \, r_{vk}}{\sqrt{\sum_{k \in \mathcal{I}_u \cap \mathcal{I}_v} r_{uk}^2} \sqrt{\sum_{k \in \mathcal{I}_u \cap \mathcal{I}_v} r_{vk}^2}} \\
\mathrm{Pearson}(u,v) &= \frac{\sum_{k \in \mathcal{I}_u \cap \mathcal{I}_v} (r_{uk} - \mu_u)(r_{vk} - \mu_v)}{\sqrt{\sum_{k \in \mathcal{I}_u \cap \mathcal{I}_v} (r_{uk} - \mu_u)^2} \sqrt{\sum_{k \in \mathcal{I}_u \cap \mathcal{I}_v} (r_{vk} - \mu_v)^2}}
\end{align*}

For example, given the rating matrix above, the similarities between User 1 and User 3, computed over the items both have rated (Items 2 to 5), are:

\begin{align*}
\mathrm{Cosine}(1,3) &= \frac{6 \cdot 3 + 7 \cdot 3 + 4 \cdot 1 + 5 \cdot 1}{\sqrt{6^2+7^2+4^2+5^2} \sqrt{3^2+3^2+1^2+1^2}} = 0.956 \\
\mathrm{Pearson}(1,3) &= \frac{(6 - 5.5)(3 - 2) + (7 - 5.5)(3 - 2) + (4 - 5.5)(1 - 2) + (5 - 5.5)(1 - 2)}{\sqrt{0.5^2 + 1.5^2 + (-1.5)^2 + (-0.5)^2} \sqrt{1^2 + 1^2 + (-1)^2 + (-1)^2}} = 0.894
\end{align*}
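These two similarity measures can be sketched in a few lines of NumPy. The helper names `cosine` and `pearson` are our own (not from any library), and users are 0-indexed, so User 1 and User 3 become indices 0 and 2:

```python
import numpy as np

# Rating matrix from the table above (np.nan = missing).
R = np.array([
    [7, 6, 7, 4, 5, 4],
    [6, 7, np.nan, 4, 3, 4],
    [np.nan, 3, 3, 1, 1, np.nan],
    [1, 2, 2, 3, 3, 4],
    [1, np.nan, 1, 2, 3, 3],
])
mu = np.nanmean(R, axis=1)  # per-user mean over observed items

def cosine(u, v):
    """Cosine similarity over the items rated by both users u and v."""
    mask = ~np.isnan(R[u]) & ~np.isnan(R[v])
    a, b = R[u, mask], R[v, mask]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(u, v):
    """Pearson correlation, centering each user's ratings by their overall mean."""
    mask = ~np.isnan(R[u]) & ~np.isnan(R[v])
    a, b = R[u, mask] - mu[u], R[v, mask] - mu[v]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# User 1 vs User 3 (0-indexed as 0 and 2):
print(round(cosine(0, 2), 3))   # 0.956
print(round(pearson(0, 2), 3))  # 0.894
```

Note that, following the formulas above, `pearson` centers each user's ratings by their mean over *all* their rated items, not only the co-rated ones.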

The overall neighborhood-based prediction function is as follows:

$$ \hat{r}_{uj} = \mu_u + \frac{\sum_{v \in P_u(j)} \mathrm{Sim}(u,v) \cdot (r_{vj} - \mu_v)}{\sum_{v \in P_u(j)} |\mathrm{Sim}(u,v)|} $$

where $P_u(j)$ is the set of nearest neighbors of user $u$ who have rated item $j$.

For example, we calculate the predicted ratings of User 3 for Item 1 and Item 6 based on the two nearest neighbors, User 1 and User 2, whose Pearson similarities to User 3 are 0.894 and 0.939 respectively:

\begin{align*}
\hat{r}_{31} &= 2 + \frac{1.5 \cdot 0.894 + 1.2 \cdot 0.939}{0.894 + 0.939} = 3.35 \\
\hat{r}_{36} &= 2 + \frac{-1.5 \cdot 0.894 - 0.8 \cdot 0.939}{0.894 + 0.939} = 0.86
\end{align*}
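The prediction above can be reproduced directly from the prediction function. This is a sketch under the same assumptions as before (0-indexed `R`, our own helper names, neighbors passed in explicitly rather than selected automatically):

```python
import numpy as np

# Rating matrix from the table above (np.nan = missing).
R = np.array([
    [7, 6, 7, 4, 5, 4],
    [6, 7, np.nan, 4, 3, 4],
    [np.nan, 3, 3, 1, 1, np.nan],
    [1, 2, 2, 3, 3, 4],
    [1, np.nan, 1, 2, 3, 3],
])
mu = np.nanmean(R, axis=1)

def pearson(u, v):
    """Pearson correlation over co-rated items, centered by the users' overall means."""
    mask = ~np.isnan(R[u]) & ~np.isnan(R[v])
    a, b = R[u, mask] - mu[u], R[v, mask] - mu[v]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_user_based(u, j, neighbors):
    """Predict R[u, j]: u's mean plus the similarity-weighted average of the
    neighbors' mean-centered ratings of item j."""
    sims = np.array([pearson(u, v) for v in neighbors])
    devs = np.array([R[v, j] - mu[v] for v in neighbors])
    return mu[u] + sims @ devs / np.abs(sims).sum()

# User 3 on Items 1 and 6 (0-indexed: user 2, items 0 and 5),
# using neighbors User 1 and User 2.
print(round(predict_user_based(2, 0, [0, 1]), 2))  # 3.35
print(round(predict_user_based(2, 5, [0, 1]), 2))  # 0.86
```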

Item-based similarity

The Cosine and Pearson similarities can be applied for item-based methods as well, except that the feature vectors are now columns instead of rows as we measure similarity between items.

If the cosine similarity is computed on the mean-centered rating matrix, we obtain a variant called adjusted cosine. The adjusted cosine similarity between items (columns) $i$ and $j$ is defined as follows:

$$ \mathrm{AdjustedCosine}(i,j) = \frac{\sum_{u \in \mathcal{U}_i \cap \mathcal{U}_j} s_{ui} \, s_{uj}}{\sqrt{\sum_{u \in \mathcal{U}_i \cap \mathcal{U}_j} s_{ui}^2} \sqrt{\sum_{u \in \mathcal{U}_i \cap \mathcal{U}_j} s_{uj}^2}} $$

where $s_{ui}$ is the mean-centered rating that user $u$ gives to item $i$, and $\mathcal{U}_i$ is the set of users who have rated item $i$.

For example, we calculate adjusted cosine between Item 1 and Item 3 in the small sample dataset above as follows:

$$ \mathrm{AdjustedCosine}(1,3) = \frac{1.5 \cdot 1.5 + (-1.5)(-0.5) + (-1)(-1)}{\sqrt{1.5^2 + (-1.5)^2 + (-1)^2} \sqrt{1.5^2 + (-0.5)^2 + (-1)^2}} = 0.912 $$
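Adjusted cosine is the same computation as cosine, only applied to the columns of the mean-centered matrix. A minimal sketch under the same 0-indexed conventions as before (the name `adjusted_cosine` is ours, not a library function):

```python
import numpy as np

# Rating matrix from the table above (np.nan = missing).
R = np.array([
    [7, 6, 7, 4, 5, 4],
    [6, 7, np.nan, 4, 3, 4],
    [np.nan, 3, 3, 1, 1, np.nan],
    [1, 2, 2, 3, 3, 4],
    [1, np.nan, 1, 2, 3, 3],
])
# Subtract each user's mean from their row; NaN entries stay NaN.
S = R - np.nanmean(R, axis=1, keepdims=True)

def adjusted_cosine(i, j):
    """Cosine over mean-centered columns, restricted to users who rated both items."""
    mask = ~np.isnan(S[:, i]) & ~np.isnan(S[:, j])
    a, b = S[mask, i], S[mask, j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Item 1 vs Item 3 (0-indexed as 0 and 2):
print(round(adjusted_cosine(0, 2), 3))  # 0.912
```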

For prediction, we use the same form of prediction function as in user-based methods but aggregate the user's ratings on neighboring items:

$$ \hat{r}_{ut} = \mu_u + \frac{\sum_{j \in Q_t(u)} \mathrm{Sim}(j,t) \cdot (r_{uj} - \mu_u)}{\sum_{j \in Q_t(u)} |\mathrm{Sim}(j,t)|} $$

where $Q_t(u)$ is the set of items most similar to the target item $t$ that user $u$ has rated.

For example, below we predict the ratings that User 3 would give to Item 1 and Item 6. The rating for Item 1 is based on two nearest neighbors Item 2 and Item 3, while the rating for Item 6 is based on Item 4 and Item 5.

\begin{align*}
\hat{r}_{31} &= 2 + \frac{1 \cdot 0.735 + 1 \cdot 0.912}{0.735 + 0.912} = 3 \\
\hat{r}_{36} &= 2 + \frac{(-1) \cdot 0.829 + (-1) \cdot 0.730}{0.829 + 0.730} = 1
\end{align*}
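The item-based prediction above can also be sketched in code. As before, this assumes the 0-indexed matrix `R` and our own helper names, with the neighboring items supplied explicitly:

```python
import numpy as np

# Rating matrix from the table above (np.nan = missing).
R = np.array([
    [7, 6, 7, 4, 5, 4],
    [6, 7, np.nan, 4, 3, 4],
    [np.nan, 3, 3, 1, 1, np.nan],
    [1, 2, 2, 3, 3, 4],
    [1, np.nan, 1, 2, 3, 3],
])
mu = np.nanmean(R, axis=1)
S = R - mu[:, None]  # mean-centered rows; NaN entries stay NaN

def adjusted_cosine(i, j):
    """Cosine over mean-centered columns, restricted to users who rated both items."""
    mask = ~np.isnan(S[:, i]) & ~np.isnan(S[:, j])
    a, b = S[mask, i], S[mask, j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_item_based(u, t, neighbors):
    """Predict R[u, t]: u's mean plus the similarity-weighted average of
    u's own mean-centered ratings on the neighboring items."""
    sims = np.array([adjusted_cosine(j, t) for j in neighbors])
    devs = np.array([R[u, j] - mu[u] for j in neighbors])
    return mu[u] + sims @ devs / np.abs(sims).sum()

# User 3 on Item 1 via Items 2 and 3; on Item 6 via Items 4 and 5 (all 0-indexed).
print(round(predict_item_based(2, 0, [1, 2]), 2))  # 3.0
print(round(predict_item_based(2, 5, [3, 4]), 2))  # 1.0
```

Compared with the user-based version, only the aggregation changes: the weighted sum now runs over the same user's ratings of similar items, rather than over similar users' ratings of the same item.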