Maintaining and Resetting Consensus in Autonomous Services
Autonomous services can be designed for a variety of use-cases, for example, acting as a decentralized oracle for smart contracts, or executing complex investing strategies that cannot be easily encoded on-chain. In many of these applications, it is expected that their business logic requires the maintenance of some state during the lifetime of the service. Recall that, in the Autonolas network, autonomous services are implemented as agent services (i.e., multi-agent systems based on Autonomous Economic Agents (AEAs)) using the Open Autonomy framework.
One of the main features of an agent service is the ability to synchronize its internal state across agents using a consensus mechanism. Each operator in an agent service is in charge of running an agent instance, which itself manages the running of its consensus node. The consensus node communicates with the corresponding nodes from other operators in the agent service through a local network. We call the collection of these consensus nodes and the local consensus network the consensus gadget. Autonolas agent services use a consensus gadget built on top of Tendermint. Tendermint is a well-known consensus software that offers a high-level view of what is stored in the blockchain that it manages.
By interacting with the consensus gadget, the agent instances (also called simply agents for convenience) reach a secure, consistent view of certain messages at certain points in the execution flow of the agent service.
By consistent, we mean that every honest agent has the same view of which transactions were recorded by the other honest agents. For context, we say that an agent instance is honest if the operator behind the instance is running the agent code as defined by the service.
An interaction between an agent instance and its associated Tendermint node can occur in one of two ways:
- The Tendermint node executes callbacks through the ABCI interface of the agent. These calls are required, e.g., when the agent has to validate some information before it is added to the mempool, or when the node delivers a settled transaction.
- The agent instance acts as a “client” of the blockchain and wishes, e.g., to submit a transaction to be included in the local Tendermint blockchain.
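To make the two directions concrete, here is a toy sketch. It does not use the real Tendermint or ABCI APIs: the classes ToyAgentApp and ToyNode and all of their methods are hypothetical stand-ins, loosely named after the kind of callbacks described above.

```python
# Toy model of the two interaction directions; NOT the real Tendermint/ABCI API.
from typing import List


class ToyAgentApp:
    """Agent side: receives callbacks from the node (ABCI-style)."""

    def __init__(self) -> None:
        self.delivered: List[str] = []

    def check_tx(self, tx: str) -> bool:
        # Callback: validate a transaction before it enters the mempool.
        return tx.startswith("payload:")

    def deliver_tx(self, tx: str) -> None:
        # Callback: a settled transaction is handed to the agent.
        self.delivered.append(tx)


class ToyNode:
    """Node side: accepts client submissions and calls back into the app."""

    def __init__(self, app: ToyAgentApp) -> None:
        self.app = app
        self.mempool: List[str] = []

    def submit_tx(self, tx: str) -> bool:
        # Client call: the agent asks for a transaction to be included.
        if not self.app.check_tx(tx):
            return False
        self.mempool.append(tx)
        return True

    def commit_block(self) -> None:
        # Settle everything in the mempool and notify the app.
        for tx in self.mempool:
            self.app.deliver_tx(tx)
        self.mempool.clear()


app = ToyAgentApp()
node = ToyNode(app)
accepted = node.submit_tx("payload:period=3")  # agent acting as a client
rejected = node.submit_tx("garbage")           # fails validation in check_tx
node.commit_block()                            # node calls back into the agent
```

The point of the sketch is only the direction of the calls: submit_tx goes from agent to node, while check_tx and deliver_tx go from node to agent.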
In this blog post we are going to discuss what we call the “Tendermint blockchain reset” operation. That is, how to instruct the local Tendermint nodes to forget the previous history of their blockchain and start minting blocks from a new genesis. This might be surprising and counterintuitive at first glance, and it is certainly a non-standard way of interacting with Tendermint. The reasoning behind this approach is that, due to the nature of the operations executed in an agent service, only a small number of the transactions that took place in the past need to be stored. This is in contrast to most blockchains or public ledgers, where all transactions since the start of the ledger are required to determine the correctness of any future transaction. Thus, periodically resetting the blockchain has the benefit of minimizing the amount of information that needs to be stored.
Tendermint Reset - High Level Approach
Agent services have a dedicated stage to handle the reset of the local consensus blockchain. Take, for example, the Autonolas Oracle that we discussed here:
The Reset and Pause stage is where the actual resetting of the local blockchain occurs. If we zoom in on that stage, we can see that it is implemented as an FSM with three states:
- ResetAndPauseRound: This state triggers the blockchain reset mechanism once it has been visited a given number of times. That is, the blockchain is not necessarily reset at every cycle of the main execution flow of the service, but rather at periodic intervals. This state also pauses the execution of service-level logic for a pre-specified amount of time. We use a variable called “period” to indicate how many times this state has been reached.
- FinishedResetAndPauseRound: This is a “dummy” state that is reached when there has not been any issue during the reset.
- FinishedResetAndPauseErrorRound: As its name suggests, this state is reached when some error has occurred during the execution of the reset.
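As a rough illustration, the three states can be written down as a small transition table. This is a sketch rather than the framework's own definition: the event names DONE and NO_MAJORITY are the ones used later in this post, and mapping them to the success and error states respectively is our assumption. The names TRANSITIONS, FINAL_STATES and next_state are hypothetical.

```python
from enum import Enum


class Event(Enum):
    DONE = "done"
    NO_MAJORITY = "no_majority"


# (current state, event) -> next state. The mapping is an assumption.
TRANSITIONS = {
    ("ResetAndPauseRound", Event.DONE): "FinishedResetAndPauseRound",
    ("ResetAndPauseRound", Event.NO_MAJORITY): "FinishedResetAndPauseErrorRound",
}

FINAL_STATES = {"FinishedResetAndPauseRound", "FinishedResetAndPauseErrorRound"}


def next_state(state: str, event: Event) -> str:
    """Look up the successor state; final states have no outgoing transitions."""
    if state in FINAL_STATES:
        raise ValueError(f"{state} is a final state")
    return TRANSITIONS[(state, event)]
```

For instance, next_state("ResetAndPauseRound", Event.DONE) evaluates to "FinishedResetAndPauseRound".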
But how is the reset mechanism actually implemented? In the following section, we will give some hints about what takes place under the hood of the ResetAndPauseRound state, while avoiding unnecessary details that would be too involved, given the scope of this post.
First, recall that in an agent service, each state in the FSMs linked to an agent service is associated with two objects:
- The rounds, which implement the reactive actions of the agents at that state. Ultimately, the rounds receive the results of the callbacks that the Tendermint nodes make to the ABCI interface (by way of the ABCIHandler, which is in charge of addressing them).
- The behaviors, which implement proactive actions. These might include, in particular, client calls to the Tendermint node.
Pre-Defined Functionalities for Rounds
Before delving into the details of the ResetAndPauseRound and the ResetAndPauseBehaviour, let us give general insights on some functionalities that can be readily incorporated into custom objects to ease rapid development of agent services. More concretely, for the case of Rounds, the framework provides a number of helper base classes that encode common patterns involving interaction with the consensus gadget, like:
- waiting until all nodes have sent the same payload,
- waiting until a majority of nodes (>= 2/3) have sent the same payload,
- waiting until all nodes have sent some (possibly different) payload.
These helper base classes handle the callbacks made by the Tendermint node, transparently executing the consensus algorithm, and provide a method called end_block(self), which is invoked when Tendermint notifies the end of a block in the blockchain.
For example, if the developer expects all agents to send the same value in a given state of the application, they can simply reuse the corresponding functionality. The framework then takes care of the cumbersome logic of ensuring that all agents indeed send the same value, and of producing the error events when this condition is not met. All that remains is for the developer to specify the business logic that the agent service executes once the consensus logic has finished collecting the same value from all the agents. This is done by overriding the end_block(self) method.
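The division of labor can be sketched as follows. The class below is a hypothetical stand-in for such a helper base class, not the framework's actual code: the name CollectSameUntilThreshold, its methods, and the threshold formula (ceil(2n/3), following the ">= 2/3" wording above) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Optional


class CollectSameUntilThreshold:
    """Hypothetical helper: collect one payload per agent until enough agree."""

    def __init__(self, n_agents: int) -> None:
        self.n_agents = n_agents
        # ceil(2n/3) in integer arithmetic; an assumption based on ">= 2/3".
        self.threshold = (2 * n_agents + 2) // 3
        self.collection: Dict[str, int] = {}

    def process_payload(self, sender: str, payload: int) -> None:
        # Each agent may contribute at most one payload per round.
        if sender in self.collection:
            raise ValueError(f"{sender} already sent a payload")
        self.collection[sender] = payload

    def most_voted(self) -> Optional[int]:
        # Return the value that reached the threshold, or None if none has yet.
        counts = Counter(self.collection.values())
        if not counts:
            return None
        value, votes = counts.most_common(1)[0]
        return value if votes >= self.threshold else None

    def end_block(self) -> Optional[str]:
        raise NotImplementedError("subclasses define the business logic")


class MyRound(CollectSameUntilThreshold):
    def end_block(self) -> Optional[str]:
        # The only part the developer writes: what to do once agents agree.
        value = self.most_voted()
        return None if value is None else f"agreed on {value}"


round_ = MyRound(n_agents=4)
round_.process_payload("agent_a", 7)
round_.process_payload("agent_b", 7)
before = round_.end_block()   # only 2 of 4 agree; threshold is 3
round_.process_payload("agent_c", 7)
after = round_.end_block()    # 3 of 4 agree
```

Here the base class owns the bookkeeping, and the subclass contains nothing but the reaction to a successful collection.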
Thus, the ResetAndPauseRound benefits from these predefined functionalities. Concretely, this round requires collecting the same payload from >= 2/3 of the agents. In this case, the payload sent by the agents (i.e., the “same value” that needs to be collected) is simply the period count (recall that the period count is incremented by 1 each time this round is visited). The round must therefore explicitly define what happens (i.e., where the FSM should transition) when:
- The agents reach a consensus at that state, i.e., a majority of agents successfully reset their local Tendermint node. They indicate this by sending to the consensus gadget a payload containing the current period value.
- The consensus gadget is unable to reach consensus among all the agents in the service. In this case, the agent service determines the underlying cause of the problem and triggers an error-recovery procedure.
It is an architectural requirement that, upon completion of the end_block(self) method, the round (for any stage, not only Reset and Pause) must output the appropriately updated collection of data objects synchronized across all agents (synchronized_data), together with the resulting event, which will trigger the next transition in the FSM. Hence, the sequence of actions carried out by the ResetAndPauseRound is as follows:
First, if the threshold for the payload has been reached,
- Create a new synchronized_data period, while retaining the so-called cross-period persistent data.
- Return synchronized_data and event DONE
Otherwise, if no majority is possible,
- Return the current synchronized_data and event NO_MAJORITY
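The two branches above can be sketched as a plain function. This is an illustration under several assumptions, not the framework's implementation: synchronized_data is modeled as a dictionary, the keys period_count, safe_address and latest_price are hypothetical, and we assume that creating a new period increments the period count.

```python
from typing import Any, Dict, Optional, Set, Tuple

SyncData = Dict[str, Any]


def end_block(
    sync_data: SyncData,
    threshold_reached: bool,
    majority_possible: bool,
    cross_period_keys: Set[str],
) -> Optional[Tuple[SyncData, str]]:
    """Return (synchronized_data, event), or None while still collecting."""
    if threshold_reached:
        # New period: keep only the cross-period persistent entries.
        new_period = {k: v for k, v in sync_data.items() if k in cross_period_keys}
        new_period["period_count"] = sync_data["period_count"] + 1
        return new_period, "DONE"
    if not majority_possible:
        # No agreement can be reached any more: keep the data untouched.
        return sync_data, "NO_MAJORITY"
    return None


data = {"period_count": 2, "safe_address": "0xabc", "latest_price": 42}
done = end_block(data, True, True, {"safe_address"})
failed = end_block(data, False, False, {"safe_address"})
waiting = end_block(data, False, True, {"safe_address"})
```

In this example, latest_price is dropped when the new period starts, while safe_address survives because it is listed as cross-period persistent.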
As you can see, the business logic of the round is quite easy to implement, as it only needs to determine which event is emitted (based upon the conditions it sees) and execute some “housekeeping” actions. What remains to be seen are the intricacies of the actual blockchain reset, which is executed in the associated behavior, namely the ResetAndPauseBehaviour.
All the behaviors in an agent service define an async_act(self) method, which is periodically called by the framework, and this is where all of the “hard work” takes place. This method is defined under an asynchronous programming paradigm using Python generators, which allow the method to be “frozen” at certain points and to resume execution from that point when it is called again. This paradigm is especially useful when the object needs to wait for certain external actions to happen (e.g., making RPC calls, or database operations on external services).
Let us explore what happens inside the async_act(self) method within the ResetAndPauseBehaviour. It is important to note that there are some tasks that might take a certain amount of time, or might produce an “inconclusive” output while being executed. In this case, the method is designed to identify what could have gone wrong once it is called again, and to resume the process at the required point. At a high level, this is what happens inside async_act(self):
- If N periods have happened after last Tendermint reset, then: Reset Tendermint and wait. This is done through a dedicated method in the base class that will be discussed below.
- Otherwise, wait for a prespecified amount of time. This is to avoid, for example, the saturation of an oracle service with many close calls.
- Generate payload, which in this case is the period count. This is the payload expected by the round, as discussed above.
- Send payload.
- Wait until the current round ends.
- Set the behavior as done. This sets a flag so that when the framework calls this behavior again, it ignores it, as it has executed all of its expected tasks.
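The steps above can be sketched as a generator. This is a simplified illustration and not the actual ResetAndPauseBehaviour: the parameters reset_interval, round_ended and log are hypothetical, the real waiting and sending are replaced by log entries, and we assume that "N periods since the last reset" can be checked with a modulo on the period count.

```python
from typing import Callable, Generator, List


def async_act(
    period_count: int,
    reset_interval: int,
    round_ended: Callable[[], bool],
    log: List[str],
) -> Generator[None, None, None]:
    # Either reset Tendermint or just pause (modulo check is an assumption).
    if period_count > 0 and period_count % reset_interval == 0:
        log.append("reset tendermint and wait")
    else:
        log.append("pause")
    yield  # hand control back to the framework while waiting

    # Generate and send the payload, which is the period count.
    payload = period_count
    log.append(f"send payload {payload}")

    # Wait until the current round ends, yielding on every tick.
    while not round_ended():
        yield

    # Mark the behavior as done.
    log.append("done")


log: List[str] = []
ticks = iter([False, False, True])  # the round ends on the third check
for _ in async_act(4, 2, lambda: next(ticks), log):
    pass  # each iteration stands for one periodic call by the framework
```

Every yield marks a point where the method gives up control and is later resumed exactly where it left off.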
The Internals when Interacting with the Tendermint Node
The ResetAndPauseBehaviour inherits from the BaseBehaviour class, which provides a number of helper methods, including the reset_tendermint_with_wait(self) method, which is where the actual interaction with the Tendermint node occurs. Let us now describe what happens there:
1. Wait for half of the time that the previous state is expected to take to complete. This is to ensure that the majority of the agents have finished receiving Tendermint updates from previous states, and have finished any pending internal operations.
2. Set a flag is_healthy = False. Setting this flag to false indicates that the Tendermint node is not processing incoming transactions. We set this flag to false here, as this will be the case when we are in the middle of a blockchain reset process. This flag will be used to identify how far in the process we are, in case we need to revisit the method.
3. Instruct Tendermint to restart the current blockchain. This operation consists of: a) stopping the node, b) calling the unsafe-reset-all Tendermint command, and c) restarting the node. The reset operation is executed so that the initial height of the restarted blockchain is set to an unseen value, namely last_height + threshold. Here, threshold is a value, usually larger than 10, to avoid race conditions and ensure that all nodes indeed restart the blockchain with an unseen value: since there is no guarantee that all nodes are reset at the same time, some of them could still be processing a few extra blocks. This is an important consideration, as the consensus mechanism will fail to synchronize across all Tendermint nodes if the blockchain is restarted with a previously seen height value.
4. If the previous step is successful, set a flag is_healthy = True, meaning that the Tendermint node is accepting client calls. Also, clear the agent's in-memory copy of the local blockchain contents. Unlike the Tendermint node, where the entire blockchain is removed, the agent retains a configurable amount of historical values.
5. Query the local Tendermint node with the RPC call GET /status, to make sure that the node is ready to start producing blocks again before continuing, and verify that the blockchain height coincides with the expected value (i.e., last_height + threshold).
6. If all the steps above are successful, we declare that the process has been executed satisfactorily.
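A compact way to see how these steps fit together is the sketch below. It runs against a fake in-memory node rather than a real Tendermint instance: FakeNode, Agent, keep_last and reset_with_wait are hypothetical names, the initial waiting step is omitted, and the real method is asynchronous whereas this one is a plain function.

```python
from typing import List


class FakeNode:
    """Stand-in for a local Tendermint node; not a real client."""

    def __init__(self, height: int) -> None:
        self.height = height
        self.running = True

    def reset(self, initial_height: int) -> bool:
        # Mimics: stop the node, wipe its data, restart at initial_height.
        self.running = False
        self.height = initial_height
        self.running = True
        return True

    def status(self) -> dict:
        return {"running": self.running, "height": self.height}


class Agent:
    def __init__(self, node: FakeNode, history: List[int], keep_last: int) -> None:
        self.node = node
        self.history = history      # agent-side copy of past values
        self.keep_last = keep_last  # how many entries survive a reset
        self.is_healthy = True

    def reset_with_wait(self, threshold: int) -> bool:
        # Mark the node as not processing transactions during the reset.
        self.is_healthy = False
        # Restart at a height no node has seen yet.
        expected = self.node.height + threshold
        if not self.node.reset(expected):
            return False
        # Reset succeeded: node accepts calls again; trim agent-side history.
        self.is_healthy = True
        start = max(0, len(self.history) - self.keep_last)
        self.history = self.history[start:]
        # Check that the node is up and at the expected height.
        status = self.node.status()
        return status["running"] and status["height"] == expected


node = FakeNode(height=100)
agent = Agent(node, history=[1, 2, 3, 4, 5], keep_last=2)
ok = agent.reset_with_wait(threshold=12)
```

Note that if the reset fails, the function returns with is_healthy still set to False, which is what lets a later call tell how far the previous attempt got.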
As you might expect, there are numerous error conditions that can occur in the steps depicted above, and the recovery from those is not always easy to handle. Luckily, the asynchronous programming approach makes it easier to recover from these potential situations. The async_act(self) method is encoded as a Python generator. Recall that the framework periodically calls it (or more specifically, it periodically calls the generator produced by it). There are two main points to consider: time-consuming operations, and errors. The approach taken is as follows:
- If the async_act(self) method needs to “pause” while waiting for some time-consuming operation to complete, the method will exit with a yield statement. This will cause the method to resume from the same point when the framework calls it again. This happens, for example, when we are waiting for some Tendermint RPC call to complete.
- If the async_act(self) method needs to restart from scratch due to some error condition, it will exit using a return False statement. This requires the framework to produce a new generator. This situation occurs, for example, when the Tendermint node has not restarted the blockchain with the appropriate height value in Step 5 above. Exiting with a yield is not enough, because the method needs to restart from the beginning. Flags (e.g., is_healthy) or other internal state can be used to determine the execution path of this new run of the method.
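The difference between the two exits can be sketched with a toy generator and a toy driver. Both are assumptions made for illustration: run plays the role of the framework, the state dictionary and its height_checks_to_fail counter are hypothetical, and the real method has many more branches.

```python
from typing import Any, Dict, Generator, List


def async_act(state: Dict[str, Any], trace: List[str]) -> Generator[None, None, bool]:
    if not state["is_healthy"]:
        trace.append("reset node")
        yield                       # pause while the reset is in flight
        state["is_healthy"] = True
    trace.append("check height")
    yield                           # pause while the status call is in flight
    if state["height_checks_to_fail"] > 0:
        state["height_checks_to_fail"] -= 1
        return False                # error: ask for a restart from scratch
    trace.append("finished")
    return True


def run(state: Dict[str, Any], trace: List[str], max_ticks: int = 20) -> bool:
    """Toy framework loop: resume on yield, rebuild the generator on False."""
    gen = async_act(state, trace)
    for _ in range(max_ticks):
        try:
            next(gen)               # resumes right after the last yield
        except StopIteration as stop:
            if stop.value:
                return True
            gen = async_act(state, trace)  # fresh generator, same flags
    return False


state = {"is_healthy": False, "height_checks_to_fail": 1}
trace: List[str] = []
succeeded = run(state, trace)
```

In this run the first height check fails, so the method starts again from the top. Because is_healthy is already True at that point, the second pass skips the reset and goes straight to the height check.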
Below we present a simplified graphical representation of how the Tendermint process works internally. We remark that we have omitted some of the technical details and error conditions that might occur, but the diagram should give you a general idea of how the different sources of error are managed in the process. Note that in the diagram, the asynchronous operations are marked in blue.
Observe that if the “STOP EXECUTION” ending point is reached (meaning that an error or unexpected condition has occurred), the next time the method is invoked it will start from the beginning, but its execution path will depend on whether the is_healthy flag is set to True (i.e., an earlier call of the method reached the third green box) or not.
To conclude this post, we would like to point out a couple of technical issues related to Tendermint itself that must be addressed so that the application can safely reset the blockchain periodically:
- First, as mentioned earlier, the restarted Tendermint blockchain must be started with a value last_height + threshold, or more specifically, a height value that the nodes have not already seen. Whereas it is possible to set it to 0 for a single node, we have discovered that the nodes fail to synchronize appropriately when they are initialized with an already seen value. This happens because the nodes that have reset might see blocks arriving from the other nodes, which are still on the old blockchain, and this will lead to a consensus failure.
- Second, when resetting the Tendermint node, a parameter used in the blockchain called AppHash (i.e., the root hash of the app state Merkle tree) must also be specified. This value needs to be different between resets, since, even though the blockchain will be empty again, the service internal state will be different when compared to an earlier start of the blockchain. Again, if this condition is not met, the Tendermint node will panic and fail.
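As a small illustration of both requirements, the sketch below computes a pair of reset parameters. It is only one possible way to satisfy them, not a description of how the framework does it: the function reset_parameters, the dictionary keys, and in particular the idea of deriving the AppHash value by hashing the period count are assumptions.

```python
import hashlib


def reset_parameters(last_height: int, threshold: int, period_count: int) -> dict:
    """Hypothetical helper: an unseen height and an app hash that changes per reset."""
    # Hashing the period count is an assumption; any value that differs
    # between resets would serve the same purpose in this sketch.
    app_hash = hashlib.sha256(f"period:{period_count}".encode()).hexdigest()
    return {"initial_height": last_height + threshold, "app_hash": app_hash}


first = reset_parameters(last_height=100, threshold=12, period_count=5)
second = reset_parameters(last_height=40, threshold=12, period_count=10)
```

Because the period count grows with every visit to the round, two consecutive resets never produce the same hash in this scheme.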
As you can see, the local consensus gadget that we use in our agent services is built on top of Tendermint. The latter is a very useful tool for our purposes, but it also comes with numerous gears and levers that must be adequately maneuvered and managed to make the most of its functionality. We hope that this post has given you some insights on the motivation behind resetting the blockchain of the consensus nodes in an agent service, and how we have tackled a number of technical problems that arise when interacting in a “non-standard way” with Tendermint. We need to point out that the information described in this post is by no means exhaustive, as we have preferred to omit and simplify some tedious technical details for the sake of clarity of exposition.
Finally, it is worth noting that the current implementation of the consensus gadget using Tendermint is somewhat too conservative for our purposes: Tendermint is a fully-featured, general blockchain engine of which we only require a limited number of features. Also, although feasible, as we have demonstrated in this post, blockchain reset is an exceptional operation in the Tendermint ecosystem. For these reasons, we are looking into building a more tailor-made blockchain solution for agent services which “natively” forgets its state beyond a certain point in the past.
We hope that you’ve enjoyed this description of our use of Tendermint and how Autonolas services make use of it to manage, maintain and reset consensus. If you’ve found what we’ve described here interesting, or would like guidance in developing your own autonomous services using Autonolas, we invite you to reach out to us on Discord and follow us on Twitter.
You may also want to check out our Academy program. The Academy is a self-guided educational course designed to get you started with the development of your own apps and services using our stack. Upon completion, you can submit a project for our consideration and apply to participate in a "Builder Track", where you will receive expert guidance from our team. A new cohort of the Builder Track starts on September 7th, so there is still time to join. You can find all of the details here.