Supply chain optimization is one of the toughest challenges among all enterprise applications of data science and ML. This challenge is rooted in the complexity of supply chain networks, which generally require optimizing decisions across multiple layers (echelons) of distribution centers and suppliers, multiple products, multiple time periods, multiple resource constraints, multiple objectives and costs, multiple transportation options, and multiple uncertainties related to product demand and prices. In the traditional approach, one usually develops a mathematical programming model that explicitly incorporates a relevant subset of these factors and obtains a solution by solving this model with optimization and forecasting methods. An extensive survey of such models is provided, for instance, in [1].
In this article, we explore how the problem can be approached from the reinforcement learning (RL) perspective that generally allows for replacing a handcrafted optimization model with a generic learning algorithm paired with a stochastic supply network simulator. We start by building a simple simulation environment that includes suppliers, factories, warehouses, and retailers, as depicted in the animation below; we then develop a deep RL model that learns how to optimize inventory and pricing decisions.
This work builds on the basic example we developed in one of the previous posts, and the complete implementation is available on GitHub as part of the TensorHouse project.
Our first step is to develop an environment that can be used to train supply chain management policies using deep RL. We choose to create a relatively small-scale model with just a few products and facilities but implement a relatively rich set of features including transportation, pricing, and competition. This environment can be viewed as a foundational framework that can be extended and/or adapted in many ways to study various problem formulations. Henceforth, we refer to this environment as the World of Supply (WoS).
The design of the WoS environment is outlined in the following class diagram and described in detail below:
Almost all entities in the above environment, including the World, Facilities, Units, and Transports, are agents in RL terms, so this is a hierarchical multi-agent environment. For the training process described later in this article, we simplified the control space by automating Transports (vehicles automatically find the shortest path for each order and deliver the freight); however, logistics decisions can be directly included in the optimization process.
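To make this structure more concrete, the sketch below shows one way the entity hierarchy could be laid out in code; the attribute names are illustrative and do not reproduce the actual TensorHouse implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Transport:
    # A vehicle that carries one order between two facilities; in our setup
    # transports are automated (shortest-path routing) rather than policy-controlled.
    source_id: int
    destination_id: int
    product_id: int
    quantity: int
    steps_remaining: int

@dataclass
class Unit:
    # A functional component of a facility (e.g., storage, manufacturing,
    # consumer, or seller unit) that exposes its own controls.
    facility_id: int
    unit_type: str

@dataclass
class Facility:
    # A node of the supply network (supplier, factory, warehouse, or retailer)
    # composed of several units.
    facility_id: int
    facility_type: str
    units: List[Unit] = field(default_factory=list)

@dataclass
class World:
    # The top-level container that owns all facilities and in-flight transports
    # and advances the simulation one time step at a time.
    facilities: List[Facility] = field(default_factory=list)
    transports: List[Transport] = field(default_factory=list)
    time_step: int = 0
```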
Although the environment described above is fairly simple, optimizing the control policy is complicated by dependencies between agents and by progress-blocking traps. For instance, global profit optimization requires all layers of the network to operate in concert: downstream facilities (e.g., retailers) cannot earn anything and learn their policies until their suppliers (e.g., warehouses) learn their policies, which in turn cannot happen until the whole chain works and the downstream facilities sell products to the end customers. Other examples of progress-blocking traps include storage overfills that block inbound inventory flows and can also block vehicles in the unloading state.
Our second step is to create an adapter that casts the WoS environment to the standard environment interface provided by RLlib—a reinforcement learning platform that we will use for policy training. This interface is similar to OpenAI Gym but provides comprehensive multi-agent features.
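The sketch below illustrates the general shape of such an adapter, assuming the older RLlib multi-agent API in which reset() returns a dictionary of observations and step() returns four dictionaries keyed by agent ID; WorldOfSupply, encode_state, and compute_reward are hypothetical stand-ins for the actual simulator code:

```python
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class WorldOfSupplyEnv(MultiAgentEnv):
    """Adapter that exposes the WoS simulator through RLlib's multi-agent interface."""

    def __init__(self, env_config):
        super().__init__()
        self.world = WorldOfSupply(env_config)   # hypothetical simulator class

    def reset(self):
        self.world.reset()
        # One observation per agent (producer and consumer unit of each facility)
        return {agent_id: encode_state(self.world, agent_id)
                for agent_id in self.world.agent_ids()}

    def step(self, action_dict):
        self.world.act(action_dict)              # apply all agents' controls
        self.world.advance()                     # simulate one time step
        obs = {a: encode_state(self.world, a) for a in action_dict}
        rewards = {a: compute_reward(self.world, a) for a in action_dict}
        dones = {"__all__": self.world.is_episode_over()}
        return obs, rewards, dones, {}
```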
One of the design decisions we need to make is the mapping between agents and policies. The nested structure of the WoS environment allows for a variety of mapping schemas. One extreme is to map the entire World to a single policy that would need to produce a high-dimensional action vector comprising all controls for all facilities. Another extreme is to map each Unit or each Facility to its own policy. After experimenting with various mappings, we concluded that it makes sense to maintain a balance between the dimensionality of the action space and the number of independent policies. One schema that achieves reasonably good performance is shown in the illustration below:
This schema maps all production-related controls to one policy, and all consumption-related controls to another policy. This reduces the dimensionality of the action spaces; thus, we have one 2-dimensional and one 3-dimensional action space. The policies of course need to take the type and location of each facility into account, but we handle this in the design of the state.
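In RLlib, this mapping is expressed through the multi-agent configuration; the sketch below assumes that agent IDs end with a producer/consumer suffix and that the observation and action spaces are defined elsewhere:

```python
# Every producer agent shares the "producer" policy, every consumer agent
# shares the "consumer" policy; both are trained with the default PPO policy class.
def policy_mapping_fn(agent_id):
    return "producer" if agent_id.endswith("producer") else "consumer"

multiagent_config = {
    "policies": {
        # (policy_cls, obs_space, act_space, config); None selects the default class
        "producer": (None, producer_obs_space, producer_action_space, {}),
        "consumer": (None, consumer_obs_space, consumer_action_space, {}),
    },
    "policy_mapping_fn": policy_mapping_fn,
}
```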
Next, we use the proximal policy optimization (PPO) algorithm for both producer and consumer policies [2]. The design of the state representation and network architecture for the consumer policy is shown in the figure below:
The state vector includes global features such as time; a comprehensive set of facility-level features, such as one-hot encoded facility type and identity; BOM inputs and outputs; information about which sources export which products; and what orders are already in transit. The model consists of two wide fully connected layers, and their outputs are mapped to three short linear layers with softmax on top to produce the probabilities of source ID, product ID, and quantity level—the consumer unit then assembles an order from these three values. In general, one can benefit from an LSTM layer that captures the temporal dynamics of the environment—similar to OpenAI Five [3]—however, we did not observe any significant improvement for the setup we used in the experiments described in the next sections. The producer network has a similar design but smaller dimensionality of the fully connected layers.
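A minimal PyTorch sketch of the consumer network is shown below; the layer widths and head sizes are illustrative rather than the exact values used in our experiments:

```python
import torch
import torch.nn as nn

class ConsumerPolicyNet(nn.Module):
    # Two wide fully connected layers followed by three small heads with softmax
    # that score the source ID, product ID, and quantity level independently.
    def __init__(self, state_dim, n_sources, n_products, n_quantity_levels):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.source_head = nn.Linear(512, n_sources)
        self.product_head = nn.Linear(512, n_products)
        self.quantity_head = nn.Linear(512, n_quantity_levels)

    def forward(self, state):
        h = self.backbone(state)
        return (torch.softmax(self.source_head(h), dim=-1),
                torch.softmax(self.product_head(h), dim=-1),
                torch.softmax(self.quantity_head(h), dim=-1))
```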
We use multi-discrete action spaces for both policies, so that product quantities and prices are drawn from a discrete set, although one can consider using continuous control spaces for these elements.
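For example, the action spaces can be declared as follows; the producer controls (production level and price level) and the set sizes are illustrative assumptions:

```python
from gym.spaces import MultiDiscrete

# Illustrative sizes; the actual values depend on the network configuration.
n_sources, n_products, n_quantity_levels = 4, 3, 6
n_production_levels, n_price_levels = 6, 6

# Consumer selects (source ID, product ID, quantity level);
# producer selects (production level, price level).
consumer_action_space = MultiDiscrete([n_sources, n_products, n_quantity_levels])
producer_action_space = MultiDiscrete([n_production_levels, n_price_levels])
```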
Finally, we use simulation episodes with 1,000 time steps that are downsampled by a factor of 20 before being fed into the policy, so that the policy makes 50 decisions per episode.
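One simple way to implement this downsampling is a wrapper that holds each joint action for a fixed number of simulator steps and accumulates the rewards; this is a sketch under that assumption, not the exact aggregation used in the actual environment:

```python
class DownsamplingWrapper:
    # Repeats each joint action for `factor` simulator steps and sums the rewards,
    # so a 1,000-step episode yields 50 policy decisions for factor=20.
    def __init__(self, env, factor=20):
        self.env = env
        self.factor = factor

    def reset(self):
        return self.env.reset()

    def step(self, action_dict):
        obs, total_rewards, dones, infos = {}, {}, {"__all__": False}, {}
        for _ in range(self.factor):
            obs, rewards, dones, infos = self.env.step(action_dict)
            for agent_id, r in rewards.items():
                total_rewards[agent_id] = total_rewards.get(agent_id, 0.0) + r
            if dones["__all__"]:
                break
        return obs, total_rewards, dones, infos
```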
Our third step is to implement a hand-coded policy that can be used as a baseline and, as we will see later, as a bootstrapping mechanism for learning PPO policies. The hand-coded policies implement very simple logic:
The following animation shows the debugging information for the beginning of one episode (the first 300 of 1,000 steps) played by the hand-coded policy:
The deadlocking dependencies between facilities make the training process challenging, but we can leverage several techniques to work around them.
First, we need to design the reward signal. The WoS environment provides flexibility in how exactly the optimization goal is set: we can view the entire network as a retailer's supply chain and ignore the profitability of individual facilities, or we can view it more as an economic game where each facility is an independent agent that aims to maximize its own profit. We choose to focus on global profit, that is, the sum of all revenues and losses across all facilities. Direct maximization of global profit, however, can be challenging because of the significant delay between a facility's actions and the profits ultimately obtained at the end of the chain. As a result, agents tend to get stuck in local do-nothing minima that merely minimize losses. We can facilitate the process using curriculum learning: start with easier subtasks and gradually increase the difficulty level. The following example shows a reward signal design based on these considerations:
$$
\begin{aligned}
\alpha &= \alpha_0 + \alpha_1 \cdot \frac{t}{T} \\
r_g &= \alpha\cdot p_g + (1 - \alpha) \cdot p_r \\
r_i &= \beta \cdot r_g + (1 - \beta) \cdot p_i
\end{aligned}
$$
where $\alpha$ is the weight of the global profit that increases during the training process ($\alpha_0$ and $\alpha_1$ are the intercept and slope of the linear growth function, $t$ is the training iteration, and $T$ is the total number of training iterations), $p_g$ is the global profit, $p_r$ is the mean retailer's profit or revenue, $p_i$ is the own profit of the $i$-th facility, and $r_i$ is the final reward metric for the $i$-th facility. In words, we initially focus on ensuring that the supply chain functions end to end and the retailers earn revenue $p_r$, and then shift the focus toward the global balance. The local balance of a facility is essentially a regularizer controlled by a constant parameter $\beta$.
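The reward computation for a single facility then boils down to a few lines; the default values of $\alpha_0$, $\alpha_1$, and $\beta$ below are placeholders rather than tuned settings:

```python
def facility_reward(t, T, p_global, p_retailer, p_own,
                    alpha_0=0.2, alpha_1=0.8, beta=0.8):
    # The weight of the global profit grows linearly over training iterations t = 0..T
    alpha = alpha_0 + alpha_1 * t / T
    # Blend of the global profit and the mean retailer profit/revenue
    r_global = alpha * p_global + (1 - alpha) * p_retailer
    # Final reward: the global term regularized by the facility's own profit
    return beta * r_global + (1 - beta) * p_own
```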
The reward design is illustrated in the plots below that show how the global balance, total reward, and rewards of individual agents evolve during the first half of the episode (500 steps) under the baseline policy:
Note that each facility is represented by two agents (in RL terms) because we have separated producer and consumer policies, and, in general, one can have different definitions of reward for producer and consumer actions.
The second technique we use is also based on a curriculum learning approach: we initially assign the baseline policy to all agents to ensure the basic functioning of the network, and gradually switch groups of facilities from the baseline to the PPO policies (e.g., toy factories first, distribution centers second, etc.).
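Conceptually, this can be implemented as a phased policy mapping that routes more and more facility types to the trainable policies as training progresses; the phase schedule and agent ID format below are assumptions, and in practice the function would be wrapped to match the signature that RLlib expects:

```python
# Facilities start under the hand-coded baseline policy and are switched to the
# trainable PPO policies in groups as the training iteration grows.
PHASE_SCHEDULE = [
    (0,   {"toy_factory"}),                          # phase 1: factories learn
    (200, {"toy_factory", "distribution_center"}),   # phase 2: add distribution centers
]

def curriculum_mapping_fn(agent_id, iteration):
    facility_type, role = agent_id.rsplit("_", 1)    # e.g., "toy_factory_consumer"
    trainable = set()
    for start_iteration, group in PHASE_SCHEDULE:
        if iteration >= start_iteration:
            trainable = group
    if facility_type in trainable:
        return "producer" if role == "producer" else "consumer"
    return "baseline"   # the hand-coded policy registered under this ID
```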
The training process is generally able to produce a meaningful policy for one layer of facilities (e.g., toy factories) in about 300,000 episodes. The reward trajectory has an inverted-U shape as we start with the revenue-focused rewards and gradually increase the contribution of profits. This process is illustrated in the two figures below that show the dynamics of the mean and distribution of the episode reward for one training session:
The learned policy demonstrates reasonably good performance, but we did not manage to outperform the baseline policy. The overall supply chain profit performance under the learned policy is inferior to the baseline because of imperfect selection of suppliers that results in order placement penalties and occasional progress collapses caused by environment blocking. The plots below show an example of balance and reward traces that can be contrasted to the baseline policy traces provided in the previous section:
The animation below provides a detailed view with the debugging information:
Once the policy for one layer of facilities (echelon) is stabilized, the process can be repeated for another layer or group of facilities, and so on.
The experiments above show that it is generally possible to apply multi-agent RL methods to multi-echelon supply chain optimization, but there are a number of challenges that need to be addressed: