Challenges

The construction of effective Recommender Systems (RS) is a complex process, mainly because RSs combine large-scale software systems with human interactions. Iterative development requires a deep understanding of the current baseline as well as the ability to estimate the impact of changes on multiple variables of interest. Simulations are well suited to address both challenges and can potentially lead to a high-velocity construction process, a fundamental requirement in commercial contexts. Recently, there has been significant interest in RS Simulation Platforms, which allow RS developers to easily craft simulated environments where their systems can be analyzed.

One of the most challenging aspects of estimating the causal impact of a new version on all the relevant metrics (user satisfaction and commercial gains, among others) is that, given a current baseline system, it is not obvious which aspect (such as diversity or relevance) or which component (such as the ranker, candidate selector, or serving infrastructure) should be the target of the next iteration. A common approach is to develop well-articulated hypotheses about issues (e.g. the current serving approach has high latency, leading to a higher abandonment rate) or opportunities for improvement (e.g. a point-wise ranker produces low diversity, leading to very similar recommendations) in the current version, and to use existing or new data to validate them. Unfortunately, in many cases this is not possible, either because the available data is not appropriate or because gathering new data is too expensive.

Once a concrete improvement opportunity is detected, the current system is modified accordingly, for example by changing a prediction algorithm, the user interface, or the serving infrastructure. These are usually local interventions with measurable local effects. But they also have (and are expected to have) global, system-wide effects, which are much harder to measure since they involve the system as a whole. On top of this, there is usually tension between variables of interest, such as relevance vs. latency, diversity vs. relevance, or adaptivity vs. user satisfaction, to name a few, which further increases the importance of system-wide analysis. The ability to make good, principled trade-offs is key to delivering a balanced and robust recommender system.
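To make the tension between two variables of interest concrete, the following minimal sketch shows how a single knob can trade relevance against diversity. It is an illustration only, not a method taken from this text: the greedy re-ranker, the function names, the `diversity_weight` parameter, and the toy catalog are all assumptions introduced here.

```python
from typing import Dict, List

# Hypothetical toy data: predicted relevance and category per item.
RELEVANCE: Dict[str, float] = {
    "shoe_a": 0.9, "shoe_b": 0.85, "shoe_c": 0.8, "hat_a": 0.6, "bag_a": 0.5,
}
CATEGORY: Dict[str, str] = {
    "shoe_a": "shoes", "shoe_b": "shoes", "shoe_c": "shoes",
    "hat_a": "hats", "bag_a": "bags",
}


def rerank(relevance: Dict[str, float], category: Dict[str, str],
           k: int, diversity_weight: float) -> List[str]:
    """Greedily select k items, penalizing categories already selected.

    diversity_weight = 0 gives a pure relevance ranking; larger values
    favor items from categories not yet in the list.
    """
    chosen: List[str] = []
    seen_categories = set()
    candidates = set(relevance)

    def score(item: str) -> float:
        penalty = diversity_weight if category[item] in seen_categories else 0.0
        return relevance[item] - penalty

    while candidates and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        seen_categories.add(category[best])
        candidates.remove(best)
    return chosen


def mean_relevance(items: List[str], relevance: Dict[str, float]) -> float:
    """Average predicted relevance of the selected items."""
    return sum(relevance[i] for i in items) / len(items)


def distinct_categories(items: List[str], category: Dict[str, str]) -> int:
    """Number of different categories in the selected items."""
    return len({category[i] for i in items})


relevance_only = rerank(RELEVANCE, CATEGORY, k=3, diversity_weight=0.0)
diversified = rerank(RELEVANCE, CATEGORY, k=3, diversity_weight=0.5)
```

On this toy catalog, the relevance-only list contains three items from one category, while the diversified list covers three categories at a lower mean relevance. The change is local (one parameter of the ranker), but whether it helps user satisfaction or commercial gains is a system-wide question that this sketch does not answer.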

In e-commerce, the number of products on the shelf is practically infinite. Users therefore have to navigate through a plethora of options in any category before making a purchase, and they often lose interest early in the process. E-commerce recommendation algorithms often operate in a challenging environment. For example:

A large retailer might have huge amounts of data: tens of millions of customers and millions of distinct catalog items.
Many applications require the result set to be returned in real time, in no more than half a second, while still producing high-quality recommendations.
New customers typically have extremely limited information, based on only a few purchases or product ratings.
Older customers can have a glut of information, based on thousands of purchases and ratings.
Customer data is volatile: Each interaction provides valuable customer data, and the algorithm must respond immediately to new information.