Alternative Archival Storage Technologies

Context

Archival storage for Solana has historically been expensive and centralized at the technology level; at the moment, BigTable is effectively the only reasonable choice for RPC providers to store historical data back to genesis.

See this RFP on developing alternative storage technologies that enable new providers to offer low-cost, high-efficiency access to Solana archival data.

Logistics

Take note of the end date (8/13) and ensure all criteria are met before submitting an application. The listed grant amount is a maximum allocation; it is issued in USD-equivalent locked SOL and gated behind delivery milestones.

Ground Rules

This thread can be used for comments, questions, praise, and/or criticism, and is intended as an open forum for any prospective responders. It is also an experiment in increasing the transparency with which RFPs are fielded by the Solana ecosystem, so please be mindful that we’re all here to learn and grow.

Responses to this RFP are not required to be public, but if it is helpful to share notes or combine forces, then please use this thread for such purposes.


Is there some publicly available documentation of the often-cited method of using Filecoin for this storage? I see it consistently mentioned by aeyakovenko:

Would love to see what the pros and cons of this approach have been so far. Based on my knowledge of Filecoin, the cost might be quite prohibitive.

I am working on this from the Filecoin side along with folks from Triton. Triton just released https://old-faithful.net/ which has more details on how data is being onboarded to Filecoin. Happy to answer any follow-up questions!

To the point about cost being prohibitive, Filecoin is actually the cheapest option today. See this from Messari:


Can you elaborate on these requirements?

  • Solution should provide relevant connection logic for the Solana RPC client
  • Solution must prove equivalence to the Solana ledger as determined by random-sampling of RPC calls

Does this mean that the solution must include a separate RPC that runs on a subset of the data? (eg. an epoch)

and

  • A complete security audit must be completed prior to production launch.
    who is responsible for this? if the submitter, should this be factored as a cost? (problematic as it’s an unknown)

Does this mean that the solution must include a separate RPC that runs on a subset of the data? (eg. an epoch)

No, you can use the existing Solana RPC code. Today, that RPC packs data into BigTable and serves archival requests out of BigTable. The proposed solution needs to plug into that existing code as a suitable drop-in replacement, and needs to store data from genesis to tip.
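To make the random-sampling requirement concrete, here is a minimal, hypothetical sketch of what an equivalence spot-check might look like. `sample_equivalence` and the dict-backed toy archives are my own stand-ins, not anything from the RFP; a real check would issue `getBlock` JSON-RPC calls against the reference (BigTable-backed) and candidate backends and compare the responses.

```python
import random

def sample_equivalence(reference, candidate, slots, n_samples, seed=0):
    """Return the sampled slots at which the two backends disagree.

    `reference` and `candidate` are modeled here as plain dicts mapping
    slot -> block; in practice each lookup would be an RPC getBlock call.
    """
    rng = random.Random(seed)  # seeded so the audit is reproducible
    mismatches = []
    for slot in rng.sample(slots, n_samples):
        if reference.get(slot) != candidate.get(slot):
            mismatches.append(slot)
    return mismatches

# Toy data: a 100-slot "ledger", plus a copy with one corrupted slot.
ledger = {slot: {"blockhash": f"hash-{slot}"} for slot in range(100)}
corrupted = dict(ledger)
corrupted[42] = {"blockhash": "bogus"}

# Identical archives produce no mismatches; sampling every slot of the
# corrupted archive is an exhaustive check and catches the bad slot.
clean = sample_equivalence(ledger, dict(ledger), list(ledger), n_samples=20)
caught = sample_equivalence(ledger, corrupted, list(ledger), n_samples=100)
```

In practice the sample size trades audit cost against the probability of missing a localized corruption, which is why sampling should be spread uniformly across the slot range.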

who is responsible for this? if the submitter, should this be factored as a cost? (problematic as it’s an unknown)

Good point. I don’t have a perfect answer for you; since this component isn’t security-critical enough to require an audit, it’s probably okay to waive this concern for the time being.


Hello,

Was an applicant accepted for this grant or did it just expire?

I am unable to view the RFP at this time but I am interested in working on this.

The solution I have in mind uses Apache Parquet files archived to commodity object storage such as Amazon S3. I am confident that this approach would reduce costs while providing fast RPC access.

Compared to the Old Faithful approach, this would not be decentralized or verifiable, but it should be a more “plug and play” replacement for RPCs. Parquet has a lot of benefits: great compression, efficient remote queries, and high-quality Rust crates, but it’s not a deterministic format. That being said, the lower operational costs would make building your own verified archives from the ledgers much more accessible.


Hey Matta,

We closed this RFP about a month ago and are starting the implementation process with the final participants. We had more than 13 applicants across a wide spectrum of tooling choices and will be able to share more once everything is finalized.

We will likely have follow-up RFPs as the landscape for archival and the state of RPCs is ever evolving. We’ll be sure to post any new details on the forum.

The solution I have in mind uses Apache Parquet files archived to commodity object storage such as Amazon S3. I am confident that this approach would reduce costs while providing fast RPC access.

Sorry for the late reply – I’m very interested in the Apache Parquet solution. We can make it verifiable for sure. The community needs a ledger data format that is language-agnostic and compact so we can replay Solana Labs data in Firedancer and vice versa. Have you started any work on this? A Parquet format would be easier to work with than Filecoin/CAR (which solves a different problem).


Hey there,

We are doing some work adjacent to this, but we aren’t working on Parquet specifically at the moment. I’m actually in the process of backporting your patches for Geyser support to older ledger tool versions :laughing:

I did some initial research into using Parquet, and the most straightforward encoding of the ledger data is not actually that compact. The data Parquet compresses well (slots, block times, etc.) is not what takes up the majority of the space. Here are my notes from when I was researching this: https://gist.github.com/matt-allan/851499ed79ffdd48af3c4949270866fc
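The compression asymmetry described above is easy to demonstrate. The sketch below uses Python’s stdlib `zlib` purely as a stand-in for Parquet’s general-purpose column codecs (gzip/zstd/snappy): a repetitive slot column shrinks dramatically, while random signature-like bytes barely compress at all.

```python
import os
import zlib

N_ROWS = 10_000

# A "slot" column: consecutive 8-byte integers, highly compressible.
slots = b"".join(n.to_bytes(8, "little") for n in range(N_ROWS))

# A "signature" column: random 64-byte values (like Ed25519 signatures),
# high-entropy and essentially incompressible.
signatures = os.urandom(64 * N_ROWS)

# Compressed size as a fraction of the original size.
slot_ratio = len(zlib.compress(slots)) / len(slots)
sig_ratio = len(zlib.compress(signatures)) / len(signatures)
```

Since signatures, hashes, and account data dominate ledger bytes, the overall archive size ends up governed by the incompressible columns, which matches the findings in the notes linked above.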

I would love to talk more and see if we can collaborate on something; I will send you a message. If anyone else on this thread is interested in collaborating, please let me know!
