The multi-armed bandit (MAB) problem is a relatively simple (to *describe*, that is, not to *solve*) formalization of a sequential learning problem that has many applications. In its canonical form, a **gambler** faces a slot machine (colloquially called a one-armed bandit) with *k > 1* **arms** (hence the term multi-armed bandit). Given a (finite) amount of coins, the gambler sequentially chooses which arm to play — she inserts a coin and pulls the arm — after each play observing some **reward** (*e.g.,* a win or a loss — typically not winning coins that can be re-used). As the gambler *one-by-one* squanders her coins, she hopes to attain as high a cumulative reward as possible: this is tricky since each arm is assumed to have a different (static) probability of paying out. A successful strategy for playing the machine’s arms needs to, on the one hand, *explore* the options (*i.e.,* the gambler needs to play the arms whose expected rewards she is uncertain about to learn more), while on the other hand *exploiting* the available knowledge (*i.e.,* playing the arm with the highest expected pay-off as often as possible).

The simple gambling setup above matches, sometimes somewhat loosely, many applied problems. Consider the following alternatives for the terms in bold above:

- Gambler: medical doctor, online marketeer, e-commerce manager, …
- Arms: medical treatments, advertisements, products, …
- Rewards: medical outcomes (*e.g.,* life or death), customer responses (*e.g.,* a click on the ad), purchases (*e.g.,* yes or no), …

Because of this generality, the bandit problem formalization has been used to study many applied problems. Furthermore, effective strategies (often coined *policies* in the bandit literature which are formally mappings from the historical data to the next arm) to address bandit problems are used in many applications.

# Effective bandit policies

Before discussing how we implemented effective bandit policies on edge devices using WebAssembly, it is good to explore a few possible bandit policies:

1. **Just choose one**: One simple, but not very effective, strategy is to just choose the arm you think is best and play it all the time. This policy does not have a formal name in the bandit literature, but it is common in practice: online marketeers, for example, often simply create a single ad based on their own expert knowledge without doing any experimentation.
2. **e-first**: A better strategy — in terms of the expected cumulative reward, where the expectation is over multiple executions of the policy — is to first devote a fixed number of *n* plays to randomly exploring the different arms. After this exploration period, one selects the arm with the highest (observed) average reward for the remaining plays. Note that this strategy is known as the A/B test or the RCT, depending on the field one is in.
3. **e-greedy**: Where e-first first explores and subsequently exploits, it is relatively simple to do both continuously: with probability *p* the agent selects a random arm, and with probability *1-p* the arm with the highest (observed) average reward is selected. Often, *p* is relatively small (*< .1*).
4. **UCB methods**: The thoughtful reader will have noticed that exploitation means playing the arm with the highest expected reward (as estimated using the average of the historical observations), whereas exploration effectively reduces the uncertainty surrounding this expectation. UCB methods combine both objectives quite intuitively: the gambler plays the arm with the highest *upper confidence bound* (i.e., the average reward + some quantification of its uncertainty).
5. **Thompson sampling**: UCB methods can be shown to achieve (asymptotically) optimal rewards (or rather regret, but we won’t get too theoretical here). However, UCB methods are very frequentist in nature. Thompson sampling provides an (also asymptotically optimal) Bayesian alternative: the gambler plays each of the arms with a probability proportional to her belief (quantified in a Bayesian way) that the arm is the best arm.
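To make the UCB idea concrete, here is a minimal sketch of the classic UCB1 rule (the function and variable names are our own, not from any particular library):

```python
import math

def ucb1(counts, sums):
    """Pick the arm with the highest upper confidence bound (UCB1).

    counts[i] is the number of plays of arm i; sums[i] its cumulative reward.
    """
    # Play every arm once before the bound is well-defined.
    for arm, c in enumerate(counts):
        if c == 0:
            return arm
    t = sum(counts)  # total number of plays so far

    def bound(arm):
        average = sums[arm] / counts[arm]
        # Average reward plus an uncertainty bonus that shrinks
        # as the arm is played more often.
        return average + math.sqrt(2 * math.log(t) / counts[arm])

    return max(range(len(counts)), key=bound)
```

Note how the bonus term does the exploring: an arm that has been played only a few times keeps a large bound and will be re-tried, even if its observed average is mediocre.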

While both policies 4 and 5 above can be shown to be asymptotically optimal — depending a bit on the exact bandit setup and the exact bounds etc. involved — even policies 2 and 3 can be extremely useful and generally grossly outperform policy 1. For a comprehensive overview and analysis of bandit algorithms see this amazing book.
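For readers wondering what “regret” means here: using standard bandit notation (not introduced above), the cumulative (pseudo-)regret after *n* plays is

```latex
R_n = n\,\mu^{*} - \mathbb{E}\left[\sum_{t=1}^{n} \mu_{A_t}\right]
```

where *μ\** is the expected reward of the best arm and *A_t* is the arm played at round *t*. “Asymptotically optimal” then roughly means that the regret grows only logarithmically in *n*, matching the known lower bound for this problem.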

# Bandits on the edge?

Bandit policies are often implemented on servers: the gambler effectively lives in a single “location”. For example, the sequential selection of ads to maximize CTR is often centrally coordinated. However, there are various applications in which one would like to deploy a bandit policy on an edge device: for example, the problem of selecting the fastest route through a network can often be formalized as a bandit problem.

To implement bandit policies in a software system, one effectively needs to implement two functions:

- An action assignment function: When called, this function returns the next arm to pull.
- An update function: When called with an arm-reward combination, this function updates the internal state of the application such that the next call to the action assignment function can be based on the collected history.
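As a minimal sketch of this two-function interface — our own illustrative Python, not the Scailable API — using the e-greedy policy:

```python
import random

class EGreedyBandit:
    """Minimal two-function bandit interface: predict() and update()."""

    def __init__(self, n_arms, p=0.1):
        self.p = p                    # exploration probability
        self.counts = [0] * n_arms    # plays per arm
        self.sums = [0.0] * n_arms    # cumulative reward per arm

    def predict(self):
        """Action assignment: return the next arm to pull."""
        if random.random() < self.p:
            return random.randrange(len(self.counts))  # explore
        # Exploit: arm with the highest observed average reward
        # (unplayed arms default to an average of 0).
        avgs = [s / c if c else 0.0 for s, c in zip(self.sums, self.counts)]
        return max(range(len(avgs)), key=avgs.__getitem__)

    def update(self, arm, reward):
        """Update the internal state with an arm-reward observation."""
        self.counts[arm] += 1
        self.sums[arm] += reward
```

All state lives in `counts` and `sums`; any policy from the list above can be swapped in by changing `predict` while keeping the same two-function interface.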

This is detailed quite clearly in this recent Journal of Statistical Software paper.

Perhaps surprisingly, both functions are easy to implement — even in a modular way — on edge devices. Using WebAssembly, both steps can be implemented efficiently and effectively deployed on any edge device. We recently used the Scailable platform to provide a small demo for one of our clients: we implemented e-greedy in the standard exposed `predict` function of a Scailable binary (see this tutorial), and we used orthogonal persistence to implement the `update` function such that the state of the machine could be updated: see this Towards Data Science post for details. Iteratively calling the `predict` and `update` functions allowed us to implement a bandit policy on an ESP32 MCU. Pretty neat!

# Conclusion

The bandit problem formalization provides an extremely useful model for many real-world sequential learning problems. Furthermore, the idea of splitting up a bandit policy into two simple function calls is both simple and powerful, and can easily be extended to implement contextual bandit policies and batched policies. Finally, doing so in WebAssembly binaries allows for the efficient deployment and monitoring of bandit policies on various edge devices. This is pretty cool: although rudimentary, each little chip can now learn from its direct interactions with its environment!