Buzzwordy title alert.
Although many individuals were worried about recursively self-improving AI, the alarms weren’t really sounded until Nick Bostrom wrote Superintelligence. Readers who are unfamiliar with why superintelligent AIs, AGIs for short, might be scary can look at my notes or this post here. Long story short, an AI that is vastly more intelligent than us and isn’t aligned with our interests may decide to do something that isn’t in our best interest.
The oft-quoted example of AGI, aka superintelligent AI, gone awry is the paperclip maximizer. While this example doesn’t capture all the nuance, it gets the gist of the problem across: an AGI is created whose sole goal is to create as many paperclips as possible, and because it’s so good at its job, it ends up killing all humans and turning all matter into paperclips. A more “human” example of an AGI gone awry is a corporation, say Enron or any oil company. Cash flow and profit, the internal metrics of success or objective functions they use, become divorced from the original purpose of creating a good for society. Bitcoin and other cryptocurrency networks also represent a kind of recursively improving organism with no clear point of disconnect, and they have some individuals worrying about blockchains and AI. AGIs gone awry would represent the principal-agent problem on steroids. You could well argue that Bitcoin and other cryptocurrencies, especially the Proof-of-Work variants, are a version of this paperclip maximizer.
The basic assumption that researchers in the field make is that AGI is going to happen someday. If not 15 years away, then less than 100. And 100 years in the course of the universe is nothing. Therefore, solving the problem of defining an objective function, or guardrails, for an AGI is of the utmost importance. Sadly, this work isn’t well incentivized today. Still, the work that has been done can be summed up as follows:
- Alignment: Making sure an AGI’s objective function doesn’t kill us. The work I’m most familiar with is coherent extrapolated volition and approval-directed agents.
- Capability restraint: For example, an AI that is air-gapped from the internet and can give just yes or no answers, aka becoming an oracle.
However, Bostrom presents another idea on AI control that I think doesn’t get enough coverage. In a few short words: “tie the objective function to the acquisition of some cryptographic token”. While this seems unintuitive at first, it is akin to us working to earn money, or dogs doing tricks for doggie treats. In the original proposal, Bostrom suggests using a centralized cryptographic token managed by scientists. Superintelligence was published before the current hype cycle and before theoretical work on new cryptographic primitives had begun. Since then, there’s been a fair bit of fervor over how blockchains can increase the capability of artificially intelligent systems, such as Computable providing more data sets, but not much has been written about the safety side. (No surprise there.) Here are some specific high-level proposals that can be stacked on top of each other to control and align agents.
- Use a decentralized cryptocurrency as the reward function. This one is straightforward enough. Using a centralized cryptographic token as the goal suffers from the same problem that kept centralized digital currencies from taking off: it introduces a single point of failure. If a scientist is somehow held at gunpoint by an AGI, he or she will probably hand over some tokens. It’s much harder to hold a network of miners and anonymous token holders at gunpoint.
- Instantiate the AGI as a DAO. This lets the entity operate trustlessly, which is a double-edged sword: the AGI can sustain itself and operate with or without supervision, but there is also an auditable trail of where and when the objective-function cryptocurrency was added to a specific address.
- Define the reward function as a smart contract to be executed trustlessly. This is where it starts to get a little harder to conceptualize. For reinforcement learning agents we can usually state the objective in plain English: in StarCraft or Go, it is simply to win the game. However, we may want to check up on the operation of the AI and update the objective function as we go, without letting the agent itself change any part of it. To do that, use a widely distributed governance token, so that pseudonymous actors can approve changes to the objective function. Keep their identities private so that the agent isn’t able to harass or bribe them, and keep a trail of “reputation” for voters so that past voting behavior can be checked on-chain for any bribery. (A sketch of such a governance-gated objective follows this list.)
- Use curve-bonded tokens to get rid of “takeover attacks”. Curve-bonded tokens have programmatically defined prices for minting and redeeming (and then burning) a set of tokens. To perform any goal, the agent is probably going to have a lot of cash on hand. What if it tries to buy up the supply of the governance token? That would be bad: it could then change its own objective function. To prevent this, we can set the curve so that the purchase price rises absurdly fast as more tokens are minted. Correspondingly, we can set an extremely small sell price to disincentivize any sales. (See the bonding-curve sketch after this list.)
- Use TCRs (or some other game-theoretically sound ranked list) to tokenize “human values” and direct an AGI to optimize for that set of values. The previous example talked about defining a goal in terms of ETH held. That would be easy to calculate if the goal of the agent were to maximize the NAV of its investment portfolio. However, as we know today, defining something just in terms of money can lead to some perverse outcomes. If the means of money become the ends, agents are pushed toward greedy, short-term actions.
- Instead, we might want to optimize for human well-being. How do we define this on-chain so the measure can’t be hacked by an autonomous agent? We use decentralized stake-based rating games, namely TCRs with a curve-bonded token for staking. You can read a little more about TCRs here. (A minimal TCR sketch also follows this list.)
- Back to representing human well-being “on-chain”. First, we have to look at how it is measured in the real world. Various NGOs and ratings orgs track things like the HDI, the Human Happiness Index, and GDP per capita. These are top-line objectives that countries may aim for through actions that make individual citizens happy. Of course, countries are free to ignore these ratings as well. Autonomous agents, however, won’t be if their objective functions are locked down.
- So how does that tie into the blockchain? These indexes have a large self-reported component right now, and TCRs are good for encoding intangible and subjective information into hard economic terms. By curating a list that might be composed of “happiness”, “wealth for humanity”, and “sugar, spice, and everything nice”, we might have the agent take off-chain actions that benefit humanity.
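To make the governance-gated objective function a bit more concrete, here is a minimal Python sketch under some loud assumptions: the on-chain plumbing is stubbed out, the 60% quorum and the metric names are made up, and a real version would live in a smart contract rather than in off-chain Python. The point is only the shape of the mechanism: the agent optimizes a weighted sum of on-chain metrics (e.g. its reward-token balance), while the weights can only change when a quorum of pseudonymous, token-weighted voters approves, and the agent has no code path to edit the weights itself.

```python
from dataclasses import dataclass, field
from typing import Dict

QUORUM = 0.6  # assumed fraction of governance tokens needed to approve a change


@dataclass
class ObjectiveSpec:
    """Weights over on-chain metrics that the agent is rewarded for."""
    weights: Dict[str, float] = field(default_factory=lambda: {"reward_token_balance": 1.0})


@dataclass
class Governance:
    """Pseudonymous voters, known only by address, weighted by governance tokens held."""
    token_holdings: Dict[str, float]  # address -> governance tokens held

    def approves(self, votes_for: Dict[str, bool]) -> bool:
        total = sum(self.token_holdings.values())
        in_favor = sum(t for addr, t in self.token_holdings.items() if votes_for.get(addr))
        return total > 0 and in_favor / total >= QUORUM


def update_objective(spec: ObjectiveSpec, proposed: Dict[str, float],
                     gov: Governance, votes: Dict[str, bool]) -> ObjectiveSpec:
    """Only a quorum of token-weighted votes can swap in new weights; the agent holds no vote here."""
    if gov.approves(votes):
        return ObjectiveSpec(weights=dict(proposed))
    return spec  # a rejected proposal leaves the objective untouched


def reward(spec: ObjectiveSpec, metrics: Dict[str, float]) -> float:
    """What the agent actually optimizes: a weighted sum of on-chain metrics."""
    return sum(w * metrics.get(name, 0.0) for name, w in spec.weights.items())


# Example: voters holding 70 of 100 governance tokens approve a new weighting.
gov = Governance(token_holdings={"0xaaa": 70.0, "0xbbb": 30.0})
spec = ObjectiveSpec()
spec = update_objective(spec, {"reward_token_balance": 0.5, "well_being_index": 0.5},
                        gov, votes={"0xaaa": True, "0xbbb": False})
print(reward(spec, {"reward_token_balance": 10.0, "well_being_index": 4.0}))  # 7.0
```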
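The curve-bonded token idea can be sketched the same way. The numbers here (a cubic mint price and a redemption price at a thousandth of it) are purely illustrative assumptions; the only claim is that with a curve like this, a cash-rich agent trying to buy up the governance supply faces costs that explode with supply, while dumping tokens back recovers almost nothing.

```python
def mint_price(supply: float) -> float:
    """Price to mint one more governance token; rises steeply (cubically) with supply."""
    return 0.01 * (supply + 1) ** 3


def redeem_price(supply: float) -> float:
    """Payout for burning a token; deliberately tiny so selling recovers almost nothing."""
    return 0.001 * mint_price(supply)


def cost_to_buy(start_supply: int, amount: int) -> float:
    """Total cost for a cash-rich agent to mint `amount` tokens on top of `start_supply`."""
    return sum(mint_price(start_supply + i) for i in range(amount))


# A takeover attempt gets priced out as the token supply grows.
print(f"{cost_to_buy(0, 100):,.0f}")       # early on: about 255,000 units of collateral
print(f"{cost_to_buy(10_000, 100):,.0f}")  # late takeover attempt: about a trillion units
```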
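Finally, a toy sketch of the TCR-style “human values” list. Again, the deposit size, the vote threshold, and the idea of an agent reading the curated list as its target set are assumptions for illustration, not a real TCR implementation; the point is the stake-and-challenge game that makes the list expensive to corrupt.

```python
from dataclasses import dataclass
from typing import Dict, List

MIN_DEPOSIT = 100.0   # assumed stake required to list or to challenge an entry
VOTE_THRESHOLD = 0.5  # assumed token-weighted share needed to keep a challenged entry


@dataclass
class Listing:
    value: str   # e.g. "happiness" or "wealth for humanity"
    owner: str   # pseudonymous address of whoever listed it
    stake: float


class ValueRegistry:
    """A toy token-curated registry of the values the agent is pointed at."""

    def __init__(self) -> None:
        self.listings: Dict[str, Listing] = {}

    def apply(self, value: str, owner: str, stake: float) -> None:
        if stake < MIN_DEPOSIT:
            raise ValueError("stake below the minimum deposit")
        self.listings[value] = Listing(value, owner, stake)

    def challenge(self, value: str, challenger_stake: float,
                  votes_to_keep: float, total_votes: float) -> None:
        """A token-weighted vote decides; the losing side's stake is forfeited."""
        if challenger_stake < MIN_DEPOSIT or value not in self.listings:
            return
        if votes_to_keep / total_votes >= VOTE_THRESHOLD:
            # Entry survives; the challenger's stake is added to its backing.
            self.listings[value].stake += challenger_stake
        else:
            # Entry is delisted; in a real TCR the lister's stake would go to the challenger.
            del self.listings[value]

    def current_values(self) -> List[str]:
        """The curated list an objective function could point at."""
        return sorted(self.listings)


registry = ValueRegistry()
registry.apply("happiness", owner="0xaaa", stake=150.0)
registry.apply("paperclip output", owner="0xagent", stake=150.0)
registry.challenge("paperclip output", challenger_stake=120.0,
                   votes_to_keep=10.0, total_votes=100.0)
print(registry.current_values())  # ['happiness']
```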
The largest points of failure would seem to be the voters, especially if their identities are revealed. Perhaps we can have less intelligent agents that vote on issues for the most intelligent agent, each with their own objective functions that would need to be modified. With any organization or incentive structure, there always needs to be a balance between being able to change something and not letting the wrong actors change things. I think this game is especially fun to play when thinking through an actor that is vastly more intelligent than I am.