
Models

In the following sections, we systematically introduce these models:

📄️ A3C

A3C stands for Asynchronous Advantage Actor-Critic. The A3C algorithm builds upon the Actor-Critic class of algorithms by using a neural network to approximate the actor (and critic). The actor learns the policy function using a deep neural network, while the critic estimates the value function. The asynchronous nature of the algorithm allows the agent to learn from different parts of the state space, allowing parallel learning and faster convergence. Unlike DQN agents, which use an experience replay memory, the A3C agent uses multiple workers to gather more samples for learning.
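For concreteness, below is a minimal sketch (plain Python/NumPy, with hypothetical argument names) of the kind of advantage actor-critic update each asynchronous worker might compute on one rollout before pushing gradients to the shared model; network details and the entropy bonus are omitted.

```python
import numpy as np

# Illustrative sketch of a single A3C worker's loss on one rollout.
# rewards, values, log_probs are lists collected while interacting with the env.
def a3c_worker_loss(rewards, values, log_probs, gamma=0.99):
    returns, R = [], 0.0
    for r in reversed(rewards):              # discounted return, computed backwards
        R = r + gamma * R
        returns.insert(0, R)
    returns = np.array(returns)
    advantages = returns - np.array(values)  # advantage = return - critic's estimate
    policy_loss = -(np.array(log_probs) * advantages).sum()  # actor objective
    value_loss = (advantages ** 2).sum()                      # critic objective
    return policy_loss + 0.5 * value_loss    # gradients of this are sent to the shared model
```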

📄️ AFM

AFM stands for Attentional Factorization Machines. It improves FM by discriminating the importance of different feature interactions, learning the importance of each feature interaction from data via a neural attention network. Empirically, it is shown on regression tasks that AFM performs better than FM, with an 8.6% relative improvement, and consistently outperforms the state-of-the-art deep learning methods Wide&Deep and DeepCross with a much simpler structure and fewer model parameters.
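A minimal sketch of the attention-weighted pairwise interactions described above; the parameter names and shapes are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

def afm_score(embeddings, x, attn_w, attn_b, attn_h, p):
    """Sketch of AFM: attention over element-wise pairwise interactions.
    embeddings: (n_features, k) latent vectors; x: (n_features,) feature values;
    attn_w/attn_b/attn_h: attention-network parameters; p: (k,) projection vector."""
    n, k = embeddings.shape
    pairs = [embeddings[i] * embeddings[j] * x[i] * x[j]     # interaction of features i and j
             for i in range(n) for j in range(i + 1, n)]
    pairs = np.stack(pairs)                                  # (n_pairs, k)
    scores = np.maximum(0.0, pairs @ attn_w + attn_b) @ attn_h  # small attention MLP
    alpha = np.exp(scores) / np.exp(scores).sum()            # softmax over interactions
    return float((alpha[:, None] * pairs).sum(axis=0) @ p)   # weighted pooling, then projection
```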

📄️ CASER

CASER stands for Convolutional Sequence Embedding Recommendation. Top-N sequential recommendation models each user as a sequence of items interacted with in the past and aims to predict the top-N ranked items that the user will likely interact with in the near future. The order of interaction implies that sequential patterns play an important role, where more recent items in a sequence have a larger impact on the next item. The Convolutional Sequence Embedding Recommendation Model (Caser) addresses this requirement by embedding a sequence of recent items into an 'image' in the time and latent spaces and learning sequential patterns as local features of the image using convolutional filters. This approach provides a unified and flexible network structure for capturing both general preferences and sequential patterns. In other words, Caser adopts convolutional neural networks to capture the dynamic pattern influences of users' recent activities.
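A rough PyTorch sketch of this 'image' view of the last L items, with toy sizes and hypothetical filter counts; the final prediction layer is omitted.

```python
import torch
import torch.nn as nn

L, d, n_items = 5, 32, 1000
item_emb = nn.Embedding(n_items, d)
conv_h = nn.ModuleList([nn.Conv2d(1, 4, (h, d)) for h in range(1, L + 1)])  # horizontal filters
conv_v = nn.Conv2d(1, 4, (L, 1))                                            # vertical filter

recent = torch.randint(0, n_items, (1, L))        # one user's last L items
E = item_emb(recent).unsqueeze(1)                 # (1, 1, L, d) "image" in time x latent space
h_feats = [torch.relu(c(E)).squeeze(3).max(dim=2).values for c in conv_h]   # max-pool over time
v_feats = torch.relu(conv_v(E)).flatten(1)        # (1, 4*d) vertical features
z = torch.cat(h_feats + [v_feats], dim=1)         # sequence representation for the prediction layer
```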

📄️ CIGC

To pursue high efficiency, the target is to use only new data for model updating while not sacrificing recommendation accuracy compared with full model retraining. This is non-trivial to achieve, since the interaction data participates in both the graph structure for model construction and the loss function for model learning, whereas the old graph structure is not allowed to be used in model updating. Causal Incremental Graph Convolution (CIGC) estimates the output of full graph convolution. Incremental Graph Convolution (IGC) combines the old representations and the incremental graph, effectively fusing the long-term and short-term preference signals. Colliding Effect Distillation (CED) aims to avoid the out-of-date issue of inactive nodes that are not in the incremental graph, connecting the new data with inactive nodes through causal inference. In particular, CED estimates the causal effect of new data on the representation of inactive nodes through the control of their collider.

📄️ DCN

DCN stands for Deep and Cross Network. The manual, explicit feature-crossing process is laborious and inefficient. On the other hand, automatic implicit feature-crossing methods like MLPs cannot efficiently approximate even 2nd- or 3rd-order feature crosses. DCN provides a solution to this problem: it was designed to learn explicit and bounded-degree cross features more effectively. It starts with an input layer (typically an embedding layer), followed by a cross network containing multiple cross layers that model explicit feature interactions, combined with a deep network that models implicit feature interactions.
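A minimal sketch of the cross layer being stacked, following the commonly cited formula $x_{l+1} = x_0 (x_l^\top w_l) + b_l + x_l$; the dimensions below are toy values.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl (explicit feature crossing)."""
    return x0 * (xl @ w) + b + xl

# Toy usage: stack three cross layers on an embedded input vector.
d = 8
rng = np.random.default_rng(0)
x0 = rng.normal(size=d)
x = x0
for _ in range(3):
    w, b = rng.normal(size=d), np.zeros(d)
    x = cross_layer(x0, x, w, b)   # each layer adds one more degree of explicit crossing
```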

📄️ DeepFM

DeepFM stands for Deep Factorization Machines. It consists of an FM component and a deep component which are integrated in a parallel structure. The FM component is the same as a 2-way factorization machine and is used to model the low-order feature interactions. The deep component is a multi-layer perceptron that is used to capture high-order feature interactions and nonlinearities. These two components share the same inputs/embeddings, and their outputs are summed up as the final prediction.
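A minimal sketch of this parallel structure, assuming one active feature per field; `mlp` stands in for any callable deep component that maps the concatenated embeddings to a scalar (names and shapes are assumptions).

```python
import numpy as np

def deepfm_predict(field_ids, embeddings, w_linear, mlp, w0=0.0):
    """Sketch of DeepFM: FM part and deep part share the same field embeddings."""
    E = embeddings[field_ids]                     # (n_fields, k) shared embeddings
    linear = w0 + w_linear[field_ids].sum()       # first-order term
    s = E.sum(axis=0)
    fm = 0.5 * float((s * s - (E * E).sum(axis=0)).sum())   # second-order FM term (O(nk) trick)
    deep = mlp(E.reshape(-1))                     # MLP on concatenated embeddings -> scalar
    return 1.0 / (1.0 + np.exp(-(linear + fm + deep)))      # outputs summed, then sigmoid
```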

📄️ DQN

The Q-learning component of DQN was invented in 1989 by Christopher Watkins in his PhD thesis titled “Learning from Delayed Rewards”. Experience replay quickly followed, invented by Long-Ji Lin in 1992. This played a major role in improving the efficiency of Q-learning. In the years that followed, however, there were no major success stories involving deep Q-learning. This is perhaps not surprising given the combination of limited computational power in the 1990s and early 2000s, data-hungry deep learning architectures, and the sparse, noisy, and delayed feedback signals experienced in RL. Progress had to wait for the emergence of general-purpose GPU programming, for example with the launch of CUDA in 2006, and the reignition of interest in deep learning within the machine learning community that began in the mid-2000s and rapidly accelerated after 2012.

📄️ FM

Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating second-order feature interactions. Factorization Machine type algorithms are a combination of linear regression and matrix factorization; the idea behind this type of algorithm is to model interactions between features (a.k.a. attributes, explanatory variables) using factorized parameters. By doing so, it can estimate all interactions between features even with extremely sparse data.
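A minimal sketch of the FM prediction rule, $\hat{y} = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$, computed with the usual $O(kn)$ reformulation of the pairwise term; the sizes below are arbitrary.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM sketch: linear term plus factorized pairwise interactions.
    x: (n,) feature vector, w: (n,) linear weights, V: (n, k) latent factors."""
    linear = w0 + w @ x
    s = V.T @ x                        # (k,)
    s2 = (V.T ** 2) @ (x ** 2)         # (k,)
    pairwise = 0.5 * float((s ** 2 - s2).sum())
    return linear + pairwise

# Toy usage with a sparse, one-hot-style input.
n, k = 6, 3
rng = np.random.default_rng(1)
x = np.array([1.0, 0, 0, 1.0, 0, 1.0])
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```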

📄️ GRU4Rec

GRU4Rec uses a session-parallel mini-batch approach: we first create an order for the sessions and then use the first event of the first X sessions to form the input of the first mini-batch (the desired output is the second event of each active session). The second mini-batch is formed from the second events, and so on. If any of the sessions ends, the next available session is put in its place. Sessions are assumed to be independent, so we reset the appropriate hidden state when this switch occurs.
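A toy sketch of how such session-parallel mini-batches could be assembled; the session data and batch size are made up, and the GRU itself is left out.

```python
# Each inner list is one session of item ids (toy data).
sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11]]
batch_size = 2

active = list(range(batch_size))   # which session occupies each mini-batch slot
pos = [0] * batch_size             # current position within that session
next_sess = batch_size
done = False

while not done:
    inputs  = [sessions[s][p]     for s, p in zip(active, pos)]
    targets = [sessions[s][p + 1] for s, p in zip(active, pos)]
    print(inputs, "->", targets)   # feed `inputs` to the GRU, score against `targets`
    pos = [p + 1 for p in pos]
    for i in range(batch_size):
        if pos[i] + 1 >= len(sessions[active[i]]):   # this session is exhausted
            if next_sess >= len(sessions):
                done = True                          # no sessions left to swap in
            else:
                active[i], pos[i] = next_sess, 0     # swap in the next session and
                next_sess += 1                       # reset its hidden state here
```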

📄️ Markov Chains

Markov chains, named after Andrey Markov, are mathematical systems that hop from one "state" (a situation or set of values) to another. For example, if you made a Markov chain model of a baby's behavior, you might include "playing," "eating", "sleeping," and "crying" as states, which together with other behaviors could form a 'state space': a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or "transitioning," from one state to any other state---e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.
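A toy transition matrix for the baby example; the probabilities are invented purely for illustration.

```python
import numpy as np

states = ["playing", "eating", "sleeping", "crying"]
P = np.array([
    [0.5, 0.2, 0.2, 0.1],   # from "playing"
    [0.3, 0.1, 0.5, 0.1],   # from "eating"
    [0.2, 0.3, 0.4, 0.1],   # from "sleeping"
    [0.1, 0.3, 0.3, 0.3],   # from "crying"
])                          # each row sums to 1

# Probability of each state two steps ahead, starting from "playing".
start = np.array([1.0, 0.0, 0.0, 0.0])
two_steps = start @ np.linalg.matrix_power(P, 2)
print(dict(zip(states, two_steps.round(3))))
```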

📄️ MDP

The Markov decision process (MDP), the framework at the heart of reinforcement learning (RL), perfectly illustrates how machines have become intelligent in their own unique way. Humans build their decision process on experience. MDPs are memoryless. Humans use logic and reasoning to think problems through. MDPs apply random decisions 100% of the time. Humans think in words, labeling everything they perceive. MDPs take an unsupervised approach that uses no labels or training data. MDPs boost the machine thought process of self-driving cars (SDCs), translation tools, scheduling software, and more. This memoryless, random, and unlabeled machine thought process marks a historical change in the way a former human problem was solved.

📄️ NMRN

NMRN is a streaming recommender model based on neural memory networks with external memories to capture and store both long-term stable interests and short-term dynamic interests in a unified way. An adaptive negative sampling framework based on Generative Adversarial Nets (GAN) is developed to optimize the streaming recommender model, which effectively overcomes the limitations of classical negative sampling approaches and improves the effectiveness of model parameter inference.

📄️ Node2vec

Nodes in networks could be organized based on communities they belong to (i.e., homophily); in other cases, the organization could be based on the structural roles of nodes in the network (i.e., structural equivalence). For instance, in the below figure, we observe nodes $u$ and $s_1$ belonging to the same tightly knit community of nodes, while the nodes $u$ and $s_6$ in the two distinct communities share the same structural role of a hub node. Real-world networks commonly exhibit a mixture of such equivalences.

📄️ PPO

The PPO (Proximal Policy Optimization) algorithm was introduced by the OpenAI team in 2017 and quickly became one of the most popular reinforcement learning methods, pushing aside all other RL methods at that moment. PPO involves collecting a small batch of experiences by interacting with the environment and using that batch to update its decision-making policy. Once the policy is updated with that batch, the experiences are thrown away and a newer batch is collected with the newly updated policy. This is why it is an "on-policy learning" approach, where the experience samples collected are only useful for updating the current policy.
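Each such batch is typically used to maximize PPO's clipped surrogate objective; a minimal sketch follows (the array names are assumptions, and the value and entropy terms are omitted).

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Sketch of PPO's clipped surrogate objective on one batch (to be maximized)."""
    ratio = np.exp(new_log_probs - old_log_probs)           # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()            # pessimistic (clipped) bound
```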

📄️ SASRec

SASRec stands for Self-Attentive Sequential Recommendation. It relies on the sequence modeling capabilities of self-attentive neural networks to predict the occurrence of the next item in a user's consumption sequence. To be precise, given a user $u$ and their time-ordered consumption history $S^u = (S_1^u, S_2^u, \dots, S_{|S^u|}^u)$, SASRec first applies self-attention on $S^u$ followed by a series of non-linear feed-forward layers to finally obtain the next item likelihood.
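A rough PyTorch sketch of this forward pass with toy sizes: causally masked self-attention over the item sequence, a point-wise feed-forward layer, and a dot product against the item embeddings for next-item scores. The layer count and hyper-parameters here are assumptions.

```python
import torch
import torch.nn as nn

n_items, d, L = 1000, 64, 10
item_emb = nn.Embedding(n_items, d)
pos_emb = nn.Embedding(L, d)
attn = nn.MultiheadAttention(d, num_heads=2, batch_first=True)
ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

seq = torch.randint(0, n_items, (1, L))                   # one user's item sequence S^u
h = item_emb(seq) + pos_emb(torch.arange(L))              # item + position embeddings
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), 1)  # causal mask: attend to the past only
h, _ = attn(h, h, h, attn_mask=mask)
h = ffn(h)                                                # point-wise feed-forward layer
logits = h @ item_emb.weight.T                            # next-item scores over all items
```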

📄️ SSE-PT

Temporal information is crucial for recommendation problems because user preferences are naturally dynamic in the real world. Recent advances in deep learning, especially the discovery of various attention mechanisms and newer architectures in addition to the widely used RNNs and CNNs in natural language processing, have allowed for better use of the temporal ordering of items that each user has engaged with. In particular, the SASRec model, inspired by the popular Transformer model in natural language processing, has achieved state-of-the-art results. However, SASRec, just like the original Transformer model, is inherently an unpersonalized model and does not include personalized user embeddings. SSE-PT overcomes this limitation by employing a Personalized Transformer.
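A minimal sketch of the personalization idea: a learned user embedding is attached to every position of the item sequence before the Transformer layers. The sizes below are toy values, and the stochastic shared embeddings regularization is omitted.

```python
import torch
import torch.nn as nn

n_users, n_items, d_u, d_i, L = 500, 1000, 32, 32, 10
user_emb = nn.Embedding(n_users, d_u)
item_emb = nn.Embedding(n_items, d_i)

u = torch.tensor([3])                                  # one user id
seq = torch.randint(0, n_items, (1, L))                # that user's item sequence
u_vec = user_emb(u).unsqueeze(1).expand(-1, L, -1)     # repeat the user vector along the sequence
x = torch.cat([item_emb(seq), u_vec], dim=-1)          # (1, L, d_i + d_u) personalized input
# x is then fed to a causal self-attention stack, as in SASRec.
```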

📄️ VSKNN

VSKNN stands for Vector Multiplication Session-Based kNN. The idea of this variant is to put more emphasis on the more recent events of a session when computing the similarities. Instead of encoding a session as a binary vector, we use real-valued vectors to encode the current session. Only the very last element of the session obtains a value of “1”; the weights of the other elements are determined using a linear decay function that depends on the position of the element within the session, where elements appearing earlier in the session obtain a lower weight. As a result, when using the dot product as a similarity function between the current weight-encoded session and a binary-encoded past session, more emphasis is given to elements that appear later in the sessions.
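A toy sketch of this weighting scheme; the item ids, session contents, and the particular linear-decay function are illustrative.

```python
import numpy as np

n_items = 10
current_session = [2, 5, 7, 3]          # oldest ... newest
past_session = [5, 3, 8]

def encode_current(session, n_items):
    """Linear-decay weights: the last item gets 1, earlier items get lower weights."""
    v = np.zeros(n_items)
    L = len(session)
    for pos, item in enumerate(session, start=1):
        v[item] = pos / L               # one possible linear decay by position
    return v

def encode_past(session, n_items):
    v = np.zeros(n_items)
    v[session] = 1.0                    # past sessions stay binary
    return v

similarity = encode_current(current_session, n_items) @ encode_past(past_session, n_items)
print(similarity)   # recently shared items contribute more to the similarity
```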

CTR Prediction Models

In the early stage of recommender systems, people spent much time on tedious and onerous feature engineering. At that time, the dimensions of the raw features were relatively small, which made it possible to implement different combinations of raw features. The newly created features were then fed into a shallow model, such as Logistic Regression (LR) or Gradient Boosting Decision Trees (GBDT), which were widely used in CTR prediction. Then, Factorization Machines (FM) transformed the learning of user and item features into shared low-dimensional latent vectors. Building on this, Field-aware FM (FFM) and Field-weighted FM (FwFM) further consider the different impact of the fields that a feature belongs to in order to improve the performance of CTR prediction. Along this line, Attentional Factorization Machines (AFM) were proposed to automatically learn the weights of cross features, and Neural Factorization Machines (NFM) enhance FM by modelling higher-order and non-linear feature interactions.

Recently, the success of deep neural networks (DNNs) in natural language processing and computer vision has brought a new direction to recommender systems. Among them, Wide & Deep learning introduced deep neural networks to CTR prediction. It jointly trains a deep neural network along with a traditional wide linear model. Deep neural networks liberated people from feature engineering while generalizing better combinations of the features. Many variants of Wide & Deep learning have been proposed since it revolutionized the development of CTR prediction. The Deep & Cross network (DCN) replaces the wide linear part with a cross network, which generates explicit feature crossings among low- and high-level layers. DeepFM combines the power of DNNs and factorization machines for feature representation in recommender systems. Furthermore, xDeepFM extends this line of work by proposing a Compressed Interaction Network to enumerate and compress all feature interactions. Overall, the deep models mentioned above all construct a similar model structure by combining low-order and high-order features, which greatly reduces the effort of feature engineering and improves the performance of CTR prediction.

However, these aforementioned shallow or deep models take statistical and categorical features as input while discarding the sequential behavior information of users. For example, users may search for items in an e-commerce system, then view some items of interest, and these items are likely to be clicked or purchased next time. Since historical behaviors explicitly indicate the preferences of users, they have gained much more attention in recommender systems. Among them, DIN proposes a local activation unit that learns dynamic user interests from sequential behavior features. DIEN designs an interest evolving layer with an attentional update gate to model the dependency between sequential behaviors. The research above recognized the importance of users' historical behaviors. Unfortunately, it projects other information (i.e., user-specific and context features) into one vector and does not pay equal attention to the interactions between the candidate item and fine-grained information, while modeling this interaction has shown extensive progress in many tasks, such as search recommendation and knowledge distillation.

Different from all previous methods, MIAN can explore sequential behavior and other fine-grained information simultaneously. Specifically, compared to shallow and deep models, MIAN has a remarkable ability to encode user preferences from sequential behavior. Compared to sequential models, MIAN can better model fine-grained feature interactions when historical behavior is insufficient or unrepresentative.

| Model | Paper | Publication | Type | Description |
|---|---|---|---|---|
| LR | Predicting Clicks: Estimating the Click-Through Rate for New Ads | WWW'07 | Shallow | Logistic regression (LR) is a simple baseline model for CTR prediction. With the online learning algorithm FTRL, proposed by Google, LR has been widely adopted in industry. It is a widely used baseline and applies a linear transformation to model the relationship of all the features. |
| FM | Factorization Machines | ICDM'10 | Shallow | While LR fails to capture non-linear feature interactions, Rendle et al. propose the factorization machine (FM), which embeds features into dense vectors and models pairwise feature interactions as inner products of the corresponding embedding vectors. Notably, FM also has linear time complexity in the number of features. |
| CCPM | A Convolutional Click Prediction Model | CIKM'15 | Deep | CCPM reports the first attempt to use convolution for CTR prediction, where feature embeddings are aggregated hierarchically through convolution networks. |
| FFM | Field-aware Factorization Machines for CTR Prediction | RecSys'16 | Shallow | Field-aware factorization machine (FFM) is an extension of FM that considers field information for feature interactions. It was a winning model in several Kaggle contests on CTR prediction. |
| YoutubeDNN | Deep Neural Networks for YouTube Recommendations | RecSys'16 | Deep | A straightforward deep model that applies a fully-connected network (termed DNN) after the concatenation of feature embeddings for CTR prediction. |
| Wide&Deep | Wide & Deep Learning for Recommender Systems | DLRS'16 | Deep | Wide&Deep is a general learning framework proposed by Google that combines a wide (or shallow) network and a deep network to achieve the advantages of both. It jointly trains a linear model and a deep MLP model for CTR prediction. |
| IPNN | Product-based Neural Networks for User Response Prediction | ICDM'16 | Deep | PNN is a product-based network that feeds the inner (or outer) products of feature embeddings as the input of a DNN. Due to the huge memory requirement of pairwise outer products, the inner product version, IPNN, is used. |
| DeepCross | Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features | KDD'16 | Deep | Inspired by residual networks, Deep Crossing adds residual connections between the layers of a DNN. It is proposed to handle a set of sparse and dense features and to learn high-order crossing features jointly with a traditional deep MLP. |
| HOFM | Higher-Order Factorization Machines | NIPS'16 | Shallow | Since FM only captures second-order feature interactions, HOFM extends FM to higher-order factorization machines. However, it results in exponential feature combinations that consume huge memory and long running time. |
| DeepFM | DeepFM: A Factorization-Machine based Neural Network for CTR Prediction | IJCAI'17 | Deep | DeepFM is an extension of Wide&Deep that substitutes LR with FM to explicitly model second-order feature interactions. It combines the FM module with a deep MLP module and requires no manual feature engineering. |
| NFM | Neural Factorization Machines for Sparse Predictive Analytics | SIGIR'17 | Deep | Similar to PNN, NFM proposes a Bi-Interaction layer that pools the pairwise feature interactions into a vector and then feeds it to a DNN for CTR prediction. |
| AFM | Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks | IJCAI'17 | Deep | Instead of treating all feature interactions equally as in FM, AFM learns the weights of feature interactions via attention networks. Different from FwFM, AFM adjusts the weights dynamically according to the input data sample. |
| DCN | Deep & Cross Network for Ad Click Predictions | ADKDD'17 | Deep | In DCN, a cross network is proposed to perform high-order feature interactions in an explicit way. In addition, it integrates a DNN network following the Wide&Deep framework. |
| FwFM | Field-weighted Factorization Machines for Click-Through Rate Prediction in Display Advertising | WWW'18 | Shallow | FwFM considers field-wise weights of feature interactions. Compared with FFM, it reports comparable performance but uses far fewer model parameters. |
| xDeepFM | xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems | KDD'18 | Deep | While the high-order feature interactions modeled by DCN are bit-wise, xDeepFM captures high-order feature interactions in a vector-wise way via a Compressed Interaction Network (CIN), which enumerates and compresses all feature interactions to model an explicit order of interactions. |
| DIN | Deep Interest Network for Click-Through Rate Prediction | KDD'18 | Deep | An early work that exploits users' historical behaviors and uses an attention mechanism to activate the user behaviors relevant to the candidate item. |
| FiGNN | FiGNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction | CIKM'19 | Deep | FiGNN leverages the message passing mechanism of graph neural networks to learn high-order feature interactions. |
| AutoInt/AutoInt+ | AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks | CIKM'19 | Deep | AutoInt leverages self-attention networks to learn high-order feature interactions. AutoInt+ integrates AutoInt with a DNN network. |
| FiBiNET | FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction | RecSys'19 | Deep | FiBiNET leverages a squeeze-excitation network to capture important features and proposes bilinear interactions to enhance feature interactions. |
| FGCNN | Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction | WWW'19 | Deep | FGCNN applies convolution networks and recombination layers to generate additional combinatorial features that enrich existing feature representations. |
| HFM/HFM+ | Holographic Factorization Machines for Recommendation | AAAI'19 | Deep | HFM proposes holographic representations and computes compressed outer products via circular convolution to model pairwise feature interactions. HFM+ further integrates a DNN network with HFM. |
| ONN | Operation-aware Neural Networks for User Response Prediction | Neural Networks'20 | Deep | ONN (a.k.a. NFFM) is a model built on FFM. It feeds the interaction outputs from FFM to a DNN network for CTR prediction. |
| AFN/AFN+ | Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions | AAAI'20 | Deep | AFN applies logarithmic transformation layers to learn adaptive-order feature interactions. AFN+ further integrates AFN with a DNN network. |
| LorentzFM | Learning Feature Interactions with Lorentzian Factorization | AAAI'20 | Shallow | LorentzFM embeds features into a hyperbolic space and models feature interactions via the triangle inequality of Lorentz distance. |
| InterHAt | Interpretable Click-through Rate Prediction through Hierarchical Attention | WSDM'20 | Deep | InterHAt employs hierarchical attention networks to model high-order feature interactions in an efficient manner. |
| FLEN | FLEN: Leveraging Field for Scalable CTR Prediction | DLP-KDD'20 | Deep | FLEN leverages field information for scalable CTR prediction. |
| FmFM | FM^2: Field-matrixed Factorization Machines for Recommender Systems | WWW'21 | Deep | FmFM (FM^2) extends FM with field-matrixed feature interactions for recommender systems. |