SIMD-0326: Proposal for the New Alpenglow Consensus Protocol

Having read through your response to my post and others, I think my feedback at this point can be summarized as follows -

1 - Rethink the SIMD governance structure for Alpenglow adoption

It’s unclear how the governance process, which has been under development and refinement for over two years, will (or won’t) be followed through the Alpenglow adoption process. You mention that this SIMD is “special”, which to me says that it’s expected to be exceptional. The governance process has been put in place for a reason. Used well, it can provide a regulator that helps build confidence in Alpenglow adoption and increases the chances of a successful deployment.

For example, your response about Votor being more than “replacing the current voting mechanism” only strengthens my initial comment that the scope of these SIMDs needs to be more thoroughly and specifically defined.

Your comment stating that Alpenglow implementation is already underway also feels confusing to me. If it’s already underway, what’s the purpose of this SIMD and why is a general consensus needed now?

More concretely, here’s a general structure for consideration -

SIMD-X Signaling general support for Votor deployment on testnet
SIMD-X+1 Signaling general support for Votor deployment on mainnet
SIMD-X+2 Signaling general support for Rotor deployment on testnet
SIMD-X+3 Signaling general support for Rotor deployment on mainnet

2 - Form a validator advisory committee

As I also mentioned here, adopting a validator advisory committee would help balance the theoretical research being done with the very valuable real-world, practical experience that battle-tested validator operators have gained through the years. I feel this would also build goodwill and confidence in subsequent deployment steps.

Overall, and I say this as respectfully as I can convey here on this forum, I’m sensing a bit of an ivory tower, “trust us, it will be fine” approach here. In a world where we don’t trust but verify, knowing that a team of experienced validators is there to verify seems like it would be very beneficial for the reasons stated above.

We seem to have a huge misunderstanding. Alpenglow is not relying on “altruism” any more than any other blockchain (including Solana right now). I’m also super willing to listen to validators, and I’m actually looking forward to our meeting tomorrow.

I don’t think separate SIMDs for testnet and mainnet are necessary or sensible. A SIMD should be a complete technical description of a proposal that can be implemented on its own by a competent developer; that is the purpose of the SIMD process.

Separately but adjacent, the governance process is to obtain stake-weighted approval (current implementation) to deploy a significant change to the chain’s economics, security, core programs, etc.

The current approach appears to be one governance proposal for Votor, and one governance proposal for Rotor, which I think is fine. The SIMD title (both in the SIMD PR and the SIMD file) is unfortunately not specific enough, as it just says “Alpenglow”, and this is generally understood to include both Votor and Rotor. However, the SIMD body does specify what the scope is and that it excludes Rotor. So while it is not optimally expressed, I think it is clear that this proposal and the SIMD refer to Votor only. I’d encourage updated wording in the SIMD to further clarify this.

Yes, I would second what @laine said. I want to make sure the governance process helps us to get a signal from the validator operators, but I don’t think it should gate testing or development of a new feature. There are plenty of times when the code for a SIMD is started before the SIMD is accepted. I think that’s fine as long as all breaking changes are SIMDs.

Secondly, I will say that Alpenglow is only special in the sense that it is a very important change to the protocol. I think more specificity on certain points would be helpful.

Here are concrete rule text suggestions to harden SIMD-0326 against latency griefing and vote manipulation:

  • Reward timeliness: boost for 80% fast-path, reduced if only 60%.

  • Late vote decay: linear down to zero after ~150 ms.

  • Conflicting votes = full epoch reward forfeiture.

  • ≥2 aggregators per window (primary + backup).

  • Validators deliver votes to ≥2 upcoming aggregators.

  • Aggregator rewards require ≥95% inclusion of timely votes.

  • Validators must always cast notarize/skip; >5% missing = reward loss.

  • On-chain metadata records per-slot vote arrival quintiles.

  • Leader aggregate rewards fixed per slot, not dependent on who voted.

@john-tri1lium

Thanks for your suggestions. At first I thought I would discuss them in the meeting. But then I started answering some (see below), and now I’m not so sure whether they are meeting material. But please get back to me (before the meeting) if I misunderstood you:

  • Reward timeliness: boost for 80% fast-path, reduced if only 60%.

I don’t understand this. Who gets the reward? The leader? Everybody? In Alpenglow, in any slot, some validators finish on the 80% path, some others finish on the 2x60% path. Whatever is faster for them. Every validator is the only one who knows how they finished. So as a validator I just claim that I finished on the 80% path to get the higher reward? Or is this reward actually for the leader? Do we have an additional vote to figure out (democratically) how many finished on the 80% path and how many on the 2x60%? If I don’t like the leader, why would I not simply lie and claim that I finished on the 2x60% path?
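To make the point about locally known finalization paths concrete, here is a minimal sketch. It assumes each validator only has its own local timestamps for when it observed the 80% condition and the second 60% condition; the function name and the example timings are hypothetical, not taken from the SIMD or the white paper.

```python
from typing import Optional, Tuple

def local_finalization(
    t_fast_80: Optional[float],
    t_slow_2x60: Optional[float],
) -> Optional[Tuple[str, float]]:
    """Return (path, time_ms) for whichever path this validator completed first.

    Each argument is the local time (ms since slot start) at which this
    validator observed the condition, or None if it never observed it.
    """
    options = []
    if t_fast_80 is not None:
        options.append((t_fast_80, "fast-80"))
    if t_slow_2x60 is not None:
        options.append((t_slow_2x60, "slow-2x60"))
    if not options:
        return None  # nothing finalized locally yet
    t, path = min(options)  # earliest observation wins
    return path, t

# Hypothetical timings for the same slot seen by two validators.
validator_a = local_finalization(t_fast_80=120.0, t_slow_2x60=140.0)
validator_b = local_finalization(t_fast_80=210.0, t_slow_2x60=180.0)
```

In this made-up example, validator A finishes on the fast path and validator B on the slow path for the same block. Both inputs are private local observations, so nobody else can check which path a validator reports.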

  • Late vote decay: linear down to zero after ~150 ms.

Who is measuring? The vote goes to everybody, so who is deciding this? Do we again vote on votes and take the median? Why would I not lie about the delay of others just to make sure that they don’t get a reward? Also, screw the validator in Australia who might not be able to vote as quickly as the others. In the paper we claim 150ms median delay. However, this is the median over all leaders, relays, and validators. 150ms is not achievable for a validator who is geographically disadvantaged. On earth, the maximum possible delay from any node to any other node is roughly 100ms, roundtrip 200ms. An incentive like this would kill our global blockchain immediately. Now 60% of stake is in (close to) Europe. After this 100% of the stake is going to be in FRA and AMS, perfectly located to have the EU regime kill Solana.
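A back-of-the-envelope check of the "roughly 100ms, roundtrip 200ms" figure, as a sketch. It assumes two nodes on opposite sides of the planet, a straight great-circle fiber path, and light travelling at about two thirds of its vacuum speed in fiber. All three are simplifying assumptions; real routes are longer and add switching delay.

```python
EARTH_CIRCUMFERENCE_KM = 40_075    # approximate equatorial circumference
SPEED_OF_LIGHT_KM_S = 299_792      # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: light in fiber is ~2/3 of c

def one_way_delay_ms(distance_km: float) -> float:
    """Propagation delay in milliseconds over an idealized fiber of this length."""
    speed_km_s = SPEED_OF_LIGHT_KM_S * FIBER_FACTOR
    return distance_km / speed_km_s * 1000.0

# Worst case on Earth: the two nodes are antipodal (half the circumference apart).
antipodal_km = EARTH_CIRCUMFERENCE_KM / 2
one_way = one_way_delay_ms(antipodal_km)
round_trip = 2 * one_way
```

Under these assumptions `one_way` comes out at about 100 ms and `round_trip` at about 200 ms, which is why a fixed 150 ms cutoff cannot be met by every honest validator regardless of hardware.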

  • Conflicting votes = full epoch reward forfeiture.

Okay, this one is actually slashing rather than rewards. We discussed this among ourselves months ago, and some Anza people even started writing tables of provable offenses and how they can be punished. This one was usually at the top of the list. So I’m totally okay with this. We will probably put these slashing (or “no rewards for you”) rules in a separate SIMD.

  • ≥2 aggregators per window (primary + backup).

Maybe I don’t quite understand what this means. So far in Alpenglow we only have one leader per block, and that leader decides what goes into the block. So the backup aggregator sends its aggregate to the leader, and the leader then does not include that (or steals all votes from it, and publishes themselves)? Or is the idea that we have aggregations for slot s in slots s+4, s+8, s+12, s+16 to make sure that every vote is accounted for? This would be close to a suggestion we once discussed among ourselves. In the end we decided it’s not worth it, because it adds 10x complexity for almost no real-world improvement. Simplicity is very dear to us. If we ever want to implement MCP, then we need to start out with the simplest possible protocol. Otherwise MCP is doomed to fail (and we are basically back to a very messy protocol … cough … TowerBFT … cough).

  • Validators deliver votes to ≥2 upcoming aggregators.

What do you mean? Validators already deliver their votes to ALL validators.

  • Aggregator rewards require ≥95% inclusion of timely votes.

So far, the aggregator/leader gets rewards for every vote they include. So they have a clear incentive to include as many votes as possible. You suggest instead that they only get a flat reward if they include 95%, and nothing if they don’t have 95%?! How is that any better than just incentivizing every single vote? Why would a byzantine (just 5% byzantine needed) not simply not send their vote just to let you lose your reward completely? We don’t want a protocol where 5% can mess with everybody, do we?
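A small sketch comparing the two reward rules under discussion. It assumes the leader's reward under the current rule scales with the fraction of stake whose votes are included, and that the alternative pays a flat amount only at or above a 95% threshold. The function names and the reward pool of 100 are hypothetical.

```python
def per_vote_reward(included_stake: float, reward_pool: float) -> float:
    """Reward grows with every included vote (here: proportional to included stake)."""
    return reward_pool * included_stake

def threshold_reward(
    included_stake: float, reward_pool: float, threshold: float = 0.95
) -> float:
    """Flat reward, paid only if the included stake reaches the threshold."""
    return reward_pool if included_stake >= threshold else 0.0

POOL = 100.0

# Slightly more than 5% of stake withholds its votes, so the leader can
# include at most 94% no matter how well it behaves.
included = 0.94
leader_per_vote = per_vote_reward(included, POOL)
leader_threshold = threshold_reward(included, POOL)
```

With 6% of stake withholding, the per-vote rule still pays the leader about 94 out of 100, while the threshold rule pays nothing. A small minority can therefore wipe out an honest leader's reward only under the binary rule.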

BTW (“story time”): We currently split the rewards 50/50 between leader and voters. You might think that we decided this ratio on a whim. But Max Resnick wrote a whole 6-page document just to compute the lowest possible split that is still incentive compatible with the leader. We discussed this for a long time. In the end we went for 50/50, also because of simplicity. (Yes, I told this story because I can’t shake the feeling that some validators seem to believe that Alpenglow was written in two days and a lot of “let’s just do this” and “sounds about right.”)

  • Validators must always cast notarize/skip; >5% missing = reward loss.

Is this the same as above? Or do you now argue that a validator will lose their rewards for the whole epoch if they didn’t vote for 95% of all slots? Who is counting? Do we again take the aggregates? If so, 5% bad stake is enough to mess with everything, i.e. 5% can make it happen that we have no rewards for anybody? Alpenglow has 20+20 security. With this we are essentially back to 2+2 security.

  • On-chain metadata records per-slot vote arrival quintiles.

Sorry, but I don’t know what you mean here. We have all the votes (in the aggregates) on-chain, not just quintiles.

  • Leader aggregate rewards fixed per slot, not dependent on who voted.

Okay, this is similar to above. I hope you agree that fixed thresholds are always worse. If you have binary thresholds, it’s an invitation for those that like to play games.

Please tell me if I misunderstood something. I appreciate a lot that you sent this to me before the meeting, because it’s a lot more difficult to address new proposals in a 100-person meeting.

In my opinion Alpenglow is special because it’s by far the biggest protocol change for Solana. There will be 50+ SIMDs just for Alpenglow. SIMD-0326 is the most important, the basis for everything. Most other Alpenglow SIMDs will be more technical, and they will not need a governance vote. We cannot have all these little technical details in one SIMD; it would be unreadable. People already complain that the Alpenglow White Paper is 50+ pages. If we specified everything about Alpenglow in one SIMD, it would be 300+ pages long.

@tigarcia @laine

… moreover, writing one SIMD just based on the whitepaper would be impossible because many questions (and solutions) only emerge while actually coding up the protocol. This was already true during white paper development, where some questions emerged because of Quentin’s prototype Alpenglow implementation. The Anza and FD engineers are now on it, and whenever something is not 100% clear, a SIMD is written. Those writing the software need to understand the rules and regulations better than the lawyers, bureaucrats and (possibly) scientists that wrote the law. Implementation does not have much room for MAYBEs and SHOULDs.

This is above my paygrade, but @tigarcia already answered.

But I would like to say that we don’t see ourselves as “theoreticians.” We try to come up with the best possible protocol. If something was not clear during the development process, we discussed it with Solana experts (Toly mostly, but also many others). We also discussed with validators. If anybody from the Solana universe (validators and others) has any concerns and suggestions, I’m all ears. The more concrete you can make your proposal or question, the better.

Thank you for the timely and in-depth responses. I see that I probably don’t understand all of the mechanics in the current Alpenglow plan well enough to make precise rule text suggestions. Instead, what seems most important for this SIMD is that we commit to on-chain, verifiable mechanisms that (i) incentivize “good behavior” and (ii) penalize provable bad behavior, in ways that are measurable and resistant to manipulation.

It may also be useful to explicitly recognize that this SIMD is narrowly scoped: it replaces Tower with Alpenglow consensus. We should not assume Rotor or other AG features will necessarily follow. That means we need to think carefully about how Votor + Turbine on its own can be gamed, and ensure the base protocol already includes guardrails against those vectors.

Hi Alpenglowers :alien_monster:
Some of the narrative here is as if “Tower is incentive-compatible and good and Alpenglow will be gamed”. I want to argue it’s the opposite. :sweat_smile: Brace for a big autistic rant.

-

Timely Vote Credits were mentioned above.
Apparently 1/3 of the stake used to vote slowly. What were the reasons? They didn’t want to buy a good machine/connection? But if a potato is too slow, it would be simply falling more and more behind. So why would they be just a bit behind consistently? Here comes Tower: you are rewarded to agree with the majority. In many truthful executions you will NOT agree with the majority, and then you have a perverse incentive. Then it obviously makes sense to wait and see what others vote, and just agree with that.
If they were too slow with connection or execution, wouldn’t they be losing a ton of money as the leader?

Point is, in Tower you don’t get all your rewards if you just execute stock. As a minimum for a correct protocol, you shouldn’t be worse off by running stock than by modding voting behavior.

-

Does TVC address this issue?

I get less time to observe the votes, so it mitigates it. There is still room to mod the voting behavior, and it’s more profitable if I’m in Amsterdam rather than in Sydney.

Does TVC force me to execute kinda on-time at least?

Tower also propagates votes in gossip, so say I make a mod that looks in gossip for X% votes on a block, and if I haven’t voted yet I also vote for that block. If performance as a leader doesn’t matter, maybe it can be done without execution at all on a raspberry pi. :thinking:

To conclude, TVC doesn’t ensure that you follow the protocol or perform. It’s a double-edged mitigation for something bad in Tower. We don’t have this “something bad” in Alpenglow.

-

The concern of “I’ll just vote skip forever!” in Alpenglow is different. You wouldn’t get more money; you would just go out of your way to do something differently. There are many, many things you can go out of your way to do differently in Tower without getting more rewards. The problem in Tower is that you have reasons to deviate and get more rewards!

There are other game-theoretic problems with Tower that are scary to mention and Alpenglow improves immensely there, but we didn’t mention them above so I’ll stick to what was mentioned above.

In other words the Tower-potato-argument was never true, and it was always just nodes waiting to see other votes first…

Thanks everybody for the call, and please continue to ask questions (and post criticism).

Thanks a lot for the fantastic paper. I’m still making my way through the proofs. Please correct me if I’m wrong, but I don’t see any mention of the correctness of the staged deployment of Votor on top of Turbine. It’s fair to ask whether it’s going to crash.

Since we start with Turbine, block propagation latency needs to be respected in Alpenglow’s parameters (timeouts). Also, the resilience of Turbine is a bit lower than Rotor’s. Otherwise block propagation will not explode; it will just stay as it is a bit longer. :grinning_face_with_smiling_eyes:

This doesn’t sound like a proof to me. Which parameters and what values? Thanks.

We can just add 400 ms to the timeouts described in the whitepaper. However, we also want to do measurements while testing to fix these parameters. And we want to make sure that they are not too low no matter where you are.

@vkomenda
If you’re happy assuming that in good network conditions a bad guy can’t stop Turbine from propagating blocks with some latency bound, you just need to put this bound in the timeout definition.

Otherwise if a bad guy can stop/delay a good guy from propagating blocks with Turbine :sweat_smile: things would just stay as they are until Rotor.

Mainly a curiosity question but since validators will now be voting on blocks before executing the transactions in them (?) where/how does execution tie back into things?

Execution is “eager”: You get the block and you execute it (pipelined). If you got the whole block and executed all of it, then you vote (before the timeout).
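As a rough sketch of the "execute eagerly, then vote before the timeout" flow: the function below is hypothetical and simplified. It assumes times are measured in ms from the start of the slot, and that a validator which has not finished executing the full block by its timeout casts a skip vote instead, which is how I read the notarize/skip votes mentioned earlier in this thread.

```python
from typing import Optional

def vote_for_slot(
    block_received_ms: Optional[float],
    execution_done_ms: Optional[float],
    timeout_ms: float,
) -> str:
    """Decide which vote to cast for one slot.

    block_received_ms: when the whole block had arrived (None if it never did).
    execution_done_ms: when pipelined execution of the whole block finished
                       (None if it never finished).
    timeout_ms:        the local timeout for this slot.
    """
    if block_received_ms is None or execution_done_ms is None:
        return "skip"      # nothing complete to vote on before the timeout
    if execution_done_ms <= timeout_ms:
        return "notarize"  # got the whole block and executed all of it in time
    return "skip"          # finished too late (assumed behavior)

# Hypothetical timings: the block arrives at 200 ms with a 400 ms timeout.
on_time = vote_for_slot(200.0, 350.0, 400.0)
too_slow = vote_for_slot(200.0, 450.0, 400.0)
no_block = vote_for_slot(None, None, 400.0)
```

Because execution is pipelined with block arrival, `execution_done_ms` can be only slightly later than `block_received_ms` in the good case, so voting still happens after execution rather than before it.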