“Intelligence is the ability to accomplish complex goals,” says the theoretical physicist Max Tegmark.[1] The complex goals of human intelligence include thinking, understanding, reasoning, planning, learning, criticising, imagining, solving problems, and mastering the use of languages. However, the abilities of current AI are far from this level of intelligence. Our current AI tools are good at learning and solving problems in a narrow domain but lack the capacity to solve a broad range of problems that would be easy even for a three-year-old child. At the moment, AI in the research lab or marketplace only possesses narrow intelligence.[2] It is still uncertain whether AI will one day acquire the same general intelligence[3] – the ability to execute multiple and diverse complex tasks at the same time[4] – that we humans possess.

AI research began during the 1956 Dartmouth Summer Research Project on Artificial Intelligence.[5] In the years since, there have been periods of optimism and disappointment. The AI winters of 1974-1980 and 1987-1993 witnessed funding cuts and the slowing down of research progress.[6] But an AI resurgence occurred in the twenty-first century due to the increase in computational power, the emergence of better theoretical understanding of AI techniques, and the availability of big data and open-source software tools.

The branch of AI known as “machine learning” figures very heavily in public discussions on technology. Indeed, the machine learning technique known as “deep learning” is one of the most publicly visible achievements of the current boom in AI research. Deep learning programmes use multi-layered artificial neural networks[7] to learn from data. These programmes can acquire computational intelligence through “supervised” learning, “unsupervised” learning, or “reinforcement” learning.

Supervised learning refers to the process by which machine learning programmes learn from large quantities of tagged or labelled data. Unsupervised learning, on the other hand, refers to the process by which machine learning programmes learn from large quantities of untagged or unlabelled data. For example, under supervised learning conditions, a deep learning programme may acquire the ability to recognize images of cats after being “trained” on millions of images of cats that are specifically tagged or labelled as cat images. In contrast, a deep learning programme labouring under unsupervised learning conditions would be trained on millions of unlabelled or untagged images of cats. In this situation, it learns to recognise images of cats by distilling patterns from the masses of unlabelled data in its training dataset.[8]
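To make the distinction concrete, the following minimal Python sketch trains one model with labels and one without; the scikit-learn digits dataset stands in for the cat images, and the choice of models is purely illustrative.

    # A minimal sketch of supervised versus unsupervised learning, assuming
    # scikit-learn is installed; the digits dataset stands in for cat images.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()                                  # images and their labels
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, random_state=0)

    # Supervised learning: the model is trained on labelled (tagged) examples.
    classifier = LogisticRegression(max_iter=5000)
    classifier.fit(X_train, y_train)                        # labels are supplied
    print("supervised accuracy:", classifier.score(X_test, y_test))

    # Unsupervised learning: the model sees only the images, never the labels,
    # and must distil patterns (here, ten clusters) from the data on its own.
    clusterer = KMeans(n_clusters=10, n_init=10, random_state=0)
    cluster_ids = clusterer.fit_predict(X_train)            # no labels supplied
    print("examples per cluster:", [int((cluster_ids == k).sum()) for k in range(10)])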

Reinforcement learning, in contrast to supervised learning and unsupervised learning, is best understood as a goal-directed learning process. Goals – an example of which would be to win a game of chess – do not lend themselves well to labelling. In a game of chess, for example, it would be difficult to define and to label the “right move” or the “correct answer” for each position. But it is important to note that reinforcement learning is not a subset of unsupervised learning. As explained above, unsupervised learning involves the distillation of patterns from unlabelled data. Reinforcement learning, on the other hand, is an associative learning process under which particular actions by the programme are encouraged or punished. The programme learns through feedback and rewards: “good moves” are rewarded while “bad moves” are punished. In other words, the programme aims, over the course of a large number of trials, to earn as large a reward as possible.[9]
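As a concrete illustration, the minimal Python sketch below trains a tabular Q-learning agent on a toy five-state corridor; the environment, rewards, and parameters are illustrative assumptions, not any of the systems discussed in this essay.

    import random

    # Toy corridor: states 0 to 4; the agent starts at 0 and is rewarded
    # only for reaching the goal state 4.
    N_STATES, GOAL = 5, 4
    ACTIONS = (-1, +1)                        # step left or step right
    alpha, gamma, epsilon = 0.5, 0.9, 0.1     # learning rate, discount, exploration

    # Q[(state, action)] is the learnt estimate of the long-term reward.
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    for episode in range(200):
        state = 0
        while state != GOAL:
            # Mostly exploit the current estimates, but occasionally explore.
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state = min(max(state + action, 0), N_STATES - 1)
            reward = 1.0 if next_state == GOAL else -0.01   # good moves rewarded
            # Trial-and-error update: nudge the estimate towards the reward
            # plus the best estimated value of the next state.
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    # After training, the greedy policy should step to the right at every state.
    print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})

Over many episodes the agent, striving to accumulate as large a reward as possible, discovers the shortest route to the goal without ever being told the “correct answer” for any state.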

AlphaGo and AlphaZero, two of the most exemplary illustrations of the power and the potential of AI, both acquired their computational intelligence through reinforcement learning. AlphaGo and AlphaZero were both engineered by Alphabet’s DeepMind. Powered by deep learning, AlphaGo was able to defeat the international Go champion, Lee Sedol, in 2016.[10] A year later, AlphaZero was able to attain world champion status in Go, chess and shogi after undergoing twenty-four hours of training through self-play. Through repeated games against different opponents (including itself), AlphaZero not only learnt good strategies, but also discovered novel strategies that allowed it to outmanoeuvre other champion-status programmes like Stockfish (chess) and Elmo (shogi).[11]

In short, repeated training via reinforcement learning improves algorithms and raises computational intelligence. Moreover, there is a formidable element of foresight in reinforcement learning that is somewhat analogous to Dr. Strange’s ability, as demonstrated in the Avengers’ fight with Thanos, to browse all possible futures in order to identify the best course of action. A deep learning programme that faces two options at each step and looks n steps ahead into the future will generate 2ⁿ possible options. Although each of these options has a reward associated with it, the algorithm in the neural network will only process a subset of these 2ⁿ rewards based on its computational experience, i.e. the training that it has gone through. This minimizes the use of computational resources. Initially, of course, the algorithm in the neural network will not make the best decisions with respect to these long-term rewards. But, as good decisions are rewarded and bad ones punished during the trial-and-error training process, the evolving algorithm in the neural network will gradually acquire the ability to predict the optimal long-term reward for each move by looking only at the most promising variations of the possible future.[12] This whole process is akin to a chess player looking ahead to anticipate the moves of his or her opponent.
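The Python sketch below illustrates this idea of examining only the most promising of the 2ⁿ possible futures; the toy game, the stand-in value function, and the pruning width are assumptions made for illustration, not AlphaZero’s actual search.

    from typing import List

    def legal_moves(state: int) -> List[int]:
        # Two options at every step, as in the example above.
        return [state + 1, state - 2]

    def value_estimate(state: int) -> float:
        # Stand-in for the neural network's learnt estimate of the long-term
        # reward; here states near 10 are (arbitrarily) deemed favourable.
        return -abs(state - 10)

    def lookahead(state: int, depth: int, top_k: int = 1) -> float:
        # Look depth steps ahead, but expand only the top_k most promising
        # children at each step instead of all 2^depth possibilities.
        if depth == 0:
            return value_estimate(state)
        children = sorted(legal_moves(state), key=value_estimate, reverse=True)
        return max(lookahead(child, depth - 1, top_k) for child in children[:top_k])

    best_first_move = max(legal_moves(0), key=lambda s: lookahead(s, depth=5))
    print("most promising first move leads to state:", best_first_move)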

Reinforcement learning has the potential to profoundly shape many of our social-technological systems. The humble elevator is no exception. The elevators in high-rise buildings are governed by elevator algorithms. The instructions in these elevator algorithms typically take the “if-then” form – if this then that. The standard elevator algorithm follows a protocol defined by three basic rules: (1) the elevator continues in its current direction as long as there is demand in that direction; (2) the elevator switches direction whenever there is no more demand in the current direction and there is demand in the opposite direction; (3) otherwise, the elevator stops and waits for the next call. It is possible to subject an algorithm that mimics the operation of an elevator to the rigours of reinforcement learning. This would involve “training” the mock elevator algorithm with a pre-defined set of rewards and punishments. This process would result in the creation of an elevator algorithm that operates without the use of any “if-then” rules.[13]
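A minimal Python sketch of this conventional protocol is given below; the function signature and the representation of outstanding calls are illustrative assumptions rather than a production elevator controller.

    def next_action(current_floor, direction, calls):
        # calls is the set of floors with outstanding demand (hall or car calls).
        demand_above = any(floor > current_floor for floor in calls)
        demand_below = any(floor < current_floor for floor in calls)
        # Rule 1: continue in the current direction while demand remains there.
        if direction == "up" and demand_above:
            return "up"
        if direction == "down" and demand_below:
            return "down"
        # Rule 2: switch direction when demand exists only in the opposite direction.
        if direction == "up" and demand_below:
            return "down"
        if direction == "down" and demand_above:
            return "up"
        # Rule 3: otherwise stop and wait for the next call.
        return "wait"

    print(next_action(3, "up", calls={5, 1}))   # demand above, so the elevator keeps going up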

An elevator algorithm shaped by reinforcement learning may even outperform conventional elevator algorithms governed by “if-then” rules. But this requires the passengers’ origin and destination levels to be factored into the reinforcement learning process. This is not an entirely unrealistic requirement: advanced elevator systems already exist that require passengers to input their destination levels before the elevator is activated and sent to their origin levels. We also need to choose a “winning” criterion. A sensible criterion would be the total commuting time of the passengers, where total commuting time denotes the sum of the elevator waiting time and travelling (or riding) time. Next, we need to define a reward and punishment structure that would train the elevator algorithm to minimize the average total commuting time of all passengers. By keying punishments to the amount of commuting time, we ensure that the programme, in striving across many trials to accumulate as few punishments as possible, learns to deliver as many passengers as possible within the least amount of time rather than simply executing the conventional “if-then” elevator protocol.
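One possible way to encode such a reward and punishment structure is sketched below in Python; the per-time-step penalty and the Passenger fields are illustrative assumptions, not the implementation used in our study.

    from dataclasses import dataclass

    @dataclass
    class Passenger:
        origin: int
        destination: int
        in_elevator: bool = False
        delivered: bool = False
        waiting_time: float = 0.0    # time spent waiting at the origin floor
        riding_time: float = 0.0     # time spent travelling inside the elevator

    def step_reward(passengers, time_step=1.0):
        # Punish the controller by the commuting time accrued in this time step.
        # Accumulating as few punishments as possible over an episode is then
        # equivalent to minimising the passengers' total (waiting + riding) time.
        penalty = 0.0
        for p in passengers:
            if p.delivered:
                continue                 # delivered passengers accrue no further cost
            if p.in_elevator:
                p.riding_time += time_step
            else:
                p.waiting_time += time_step
            penalty += time_step
        return -penalty

Because every undelivered passenger adds to the penalty at every time step, a controller trained against this signal is pushed towards delivering as many passengers as possible as quickly as possible.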

Our research has shown that this process of reinforcement learning will indeed lead to the emergence of a reshaped algorithm that outperforms a conventional elevator algorithm in terms of total commuting time. The programme’s enhanced computational intelligence comes with some surprises, though. As shown in the animation below (see Fig. 1), the programme makes a counter-intuitive move: it switches the direction of the upward-bound elevator and sends it down to the 1st floor to pick up three passengers who share the destination of the passengers already on board, before resuming the elevator’s upward journey and delivering all of the elevator’s passengers to the 4th floor. In contrast, a conventional elevator algorithm governed by “if-then” rules would have simply continued to direct the upward-bound elevator towards the 4th floor before attending to the three waiting passengers at the 1st floor. We can appreciate the disruptiveness of this new technology by imagining the psychological state of the passengers as the upward-bound elevator suddenly switches direction and moves towards the 1st floor.

While it is certainly possible to pre-announce any changes in the elevator’s direction, such changes would nonetheless fly in the face of the psychological habits of those of us who are accustomed to the conventional mode of motion of the standard elevator algorithm. Furthermore, the passengers in the upward-bound elevator right before it switched direction would end up with a longer journey because of the detour. In other words, the algorithm’s fast-delivery goal serves the “collective good” at the expense of a small group of passengers. In our example, the detour reduces the total commuting time of the passengers waiting at the 1st floor by more than it lengthens the journey of the passengers already on board. As a result, the average total commuting time of all passengers has indeed decreased.

Thus, we see that computationally intelligent programmes and machines can make seemingly unusual decisions that defy our comprehension if we are not privy to their goals and objectives. These unconventional decisions arise from the complexity of the intelligent machine’s goals. Instead of merely delivering passengers to their destinations, the intelligent machine in the example above also wants to complete the task as quickly as possible. As machines become more intelligent, they will be able to pursue and accomplish more complex goals. The question now becomes: Can we trust the machine as it “computes itself” in the drive to achieve complex goals? For the passengers in the elevator, it is a question of both whether they can trust a machine that acts in contradiction to their expectations and whether the machine has indeed acted for the “common good”. The fact of the matter is that intelligent machines are as yet unable to explain their actions.[14] Moreover, while scientists and engineers can often explain the machine’s actions in retrospect, they are as yet unable to understand or to explain certain types of machine behaviour. We have to ensure that the actions of computationally intelligent machines are explainable. Public trust and understanding rest on explainability. An incorrect understanding of how computational intelligence works will lead to public scepticism with regard to its benefits.[15]

[Fig. 1: Video animation of the elevator example described above.]


[1] Max Tegmark, Life 3.0: Being Human in the Age of Artificial Intelligence (New York: Alfred A. Knopf, 2017), 60.

[2] Artificial narrow intelligence is also known as weak AI.

[3] Artificial general intelligence is also known as strong AI.

[4] Ed Finn, What Algorithms Want: Imagination in the Age of Computing (Cambridge, MA: MIT Press, 2017), 135.

[5] Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (2nd ed.) (Upper Saddle River, NJ: Prentice Hall, 2003), 17.

[6] Daniel Crevier, AI: The Tumultuous History of the Search for Artificial Intelligence (New York: Basic Books, 1993), 203.

[7] An artificial neural network imitates the biological neural network of the brain.

[8] For a full explanation of the differences between supervised and unsupervised learning, see Ethem Alpaydin, Introduction to Machine Learning (3rd ed.) (Cambridge, MA: MIT Press, 2014), 11.

[9] For a full explanation of reinforcement learning, see Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction (Cambridge, MA: MIT Press, 1998), 2.

[10] “Google’s AlphaGo gets ‘divine’ Go ranking”, The Straits Times, 15 March 2016.

[11] See David Silver, Thomas Hubert, Julian Schrittwieser, and Demis Hassabis, “AlphaZero: Shedding New Light on Chess, Shogi, and Go”, DeepMind (company blog post), 6 December 2018. Available online at https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go. Last accessed on 1 December 2020. See, also, Matthew Sadler and Natasha Regan, Game Changer: AlphaZero’s Groundbreaking Chess Strategies and the Promise of AI (Alkmaar, The Netherlands: New in Chess, 2019).

[12] For a full explanation of this process, see David Silver et al., “A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-play”, Science 362, no. 6419 (2018): 1140.

[13] Ning Ning Chung, Hamed Taghavian, Mikael Johansson, and Lock Yue Chew, “An Explainable Neural-network-designed Elevator System that Operates Based on Reinforcement Learning”, manuscript in preparation, 2020.

[14] Paul Voosen, “How AI Detectives are Cracking Open the Black Box of Deep Learning”, Science, 6 July 2017. Available online at https://doi.org/10.1126/science.aan7059. Last accessed on 1 December 2020.

[15] Eliza Strickland, “IBM Watson, Heal Thyself: How IBM Overpromised and Underdelivered on AI Health Care”, IEEE Spectrum 56, no. 4 (2019): 24.