HIP-679: Bucket restructuring for changing stream data folder structure

Author Mugdha Goel
Working Group Richard Bair, Nick Poorman, Jasper Potts, Dan Alvizu, Steven Sheehy
Discussions-To https://github.com/hashgraph/hedera-improvement-proposal/discussions/678
Status Accepted
Needs Council Approval Yes
Review period ends Thu, 13 Apr 2023 07:00:00 +0000
Type Standards Track
Category Core
Created 2023-01-09
Updated 2023-04-12

Abstract

We propose changing the current stream data bucket folder structure from one based on node account IDs to one based on node IDs. Additionally, we propose incorporating the shard into the path so that it is forward compatible with sharding when it is enabled. We also propose a rollup mechanism to reduce cloud storage duplication and improve historical sync times for mirror nodes.

Motivation

Node account IDs are Hedera account IDs that each node has obtained, and they can change over time. In the future we plan to introduce community nodes, increasing the number of unique nodes across the Hedera network. Granting bucket access scoped to a specific node ID will be of utmost importance for security reasons. The current folder structure uses an account ID per node. Account IDs are not ideal because a council member may want to change their account ID over time. To future-proof this, we want to move from account IDs to network, shard, and node ID in the path.

Rationale

  • There have been council members that want to change their account ID but have not been able to. Multiple nodes can also share the same node account ID, in which case they would need to upload files to different bucket folders.
  • There might also be a need to recreate an account for security reasons.
  • By adopting the new node ID based path, we give council members the flexibility to change their account IDs and restrict bucket access to specific nodes for future community node inclusion.
  • Adding the network name to the path helps:
    • Create bucket paths for ephemeral networks
    • Combine preprod environments
    • Avoid having to recreate buckets for resettable networks
  • The idea of introducing rollups is to make it faster for new mirror nodes to download historical files starting from epoch. Rollups also help us avoid having to rename millions of files: renaming must be done one file at a time via S3 APIs and would take months. With rollups we can create separate aggregated files while leaving the individual files intact for a longer period, until we have verified the authenticity of each rollup.

User stories

  • As a mirror node operator, I’d like to be compatible with both the historical and new bucket path formats.
  • As a node operator, I’d like to be able to change my node’s account ID.
  • As a network operator, I’d like to create the new bucket structure that supports sharding.
  • As a network operator, I’d like to merge buckets for all preprod environments and be able to use bucket paths for ephemeral networks.
  • As a network operator, I’d like to reduce cloud storage costs by eliminating duplicate files and compressing historical data.
  • As a new mirror node operator, I’d like to be able to download historical data in the new aggregated format. This reduces the time it takes to start up a new mirror node with all historical data synchronized and reduces the cost of downloads to bootstrap a new mirror node.

Specification

This HIP changes the stream data bucket folder structure and, in the process, rolls up data for faster ingestion.

Current Structure:

    accountBalances/balance{nodeAccountID}/
    eventsStreams/events_{nodeAccountID}/
    recordstreams/record{nodeAccountID}/
    
Example:

    accountBalances/balance0.0.3/
    eventsStreams/events_0.0.3/
    recordstreams/record0.0.3/

Proposed Structure:

    {network}/{shard}/{nodeID}/balance/
    {network}/{shard}/{nodeID}/event/
    {network}/{shard}/{nodeID}/record/
    
Example:

    Standard network:
    mainnet/0/3/balance/
    mainnet/0/3/event/
    mainnet/0/3/record/
    
    Resettable network:
    testnet-2023-01/0/3/balance/
    testnet-2023-01/0/3/event/
    testnet-2023-01/0/3/record/    
Problems with the current structure naming:
  • The word “stream” is duplicated in eventsStreams and recordstreams but missing from accountBalances.
  • Casing is inconsistent between eventsStreams and recordstreams.
  • The stream name is repeated after the separator, and events_0.0.3 has an underscore whereas the others don’t.

The new structure proposes consistent naming. In the future, when we start supporting multiple shards, the new bucket structure will work without any changes required.

Proposed naming structure:
  • The network name in the bucket path will not change for non-resettable networks.
  • For networks that can be reset, the network key in the path will start with the network name prefix followed by a timestamp, as in the testnet example above. After a reset, to find the latest network one would filter by the network prefix (e.g. testnet) and then sort the output to get the latest network to read from (see the sketch after this list).
  • The granularity of the timestamp would depend on how often that particular network is reset.
  • The remaining bucket path after the network name would be similar for all networks.
  • Use of the network name would be strictly internal to the mirror node.
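
As an illustration, the prefix-filter-and-sort step could look like the minimal Python sketch below, assuming an S3-compatible bucket and the boto3 client; the bucket name and function are illustrative, not part of this HIP.

    import boto3

    def latest_network(bucket: str, network_prefix: str = "testnet") -> str:
        """Return the most recent network folder, e.g. 'testnet-2023-01'."""
        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=network_prefix, Delimiter="/")
        # CommonPrefixes holds the top-level "folders" such as "testnet-2023-01/".
        networks = [p["Prefix"].rstrip("/") for p in resp.get("CommonPrefixes", [])]
        # Timestamped names sort lexicographically, so max() is the latest reset.
        return max(networks)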

We therefore propose a phased approach with three phases.

Phase One:

Update the mirror node downloader to read from the new {network}/{shard}/{nodeID}/* node ID based path while continuing to support reading from the */*0.0.3/ account ID based path.

Algorithm:
  • Add a pathType property with values AUTO, ACCOUNT_ID, and NODE_ID with a default of ACCOUNT_ID
  • Later, once consensus nodes are close to writing to the new path, change the default to AUTO
  • Add a pathRefreshInterval property with a default value of 1m
  • If pathType is AUTO:
    • Attempt to list signatures from a path with node account IDs
    • If successful, update a per node timestamp and boolean to ensure account ID is used until pathRefreshInterval elapses
    • If no results, attempt to list signatures from a path with node IDs
    • If successful, set pathType as NODE_ID so the node ID path is used going forward

In the existing scenario the default pathType would be ACCOUNT_ID, which would then be changed to AUTO as Phase Two approaches.
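
The following is a minimal sketch of the AUTO resolution logic above, written in Python for illustration. The pathType and pathRefreshInterval names come from this HIP; list_signatures, account_id_path, and node_id_path are hypothetical helpers.

    import time
    from enum import Enum

    class PathType(Enum):
        ACCOUNT_ID = "account_id"
        NODE_ID = "node_id"
        AUTO = "auto"

    path_type = PathType.AUTO      # default is ACCOUNT_ID until Phase Two nears
    path_refresh_interval = 60     # seconds; the 1m default from this HIP
    last_account_hit = {}          # per-node timestamp of last account ID success

    def resolve(node):
        global path_type
        if path_type is not PathType.AUTO:
            return path_type
        # Keep using the account ID path until pathRefreshInterval elapses.
        if time.time() - last_account_hit.get(node, 0.0) < path_refresh_interval:
            return PathType.ACCOUNT_ID
        if list_signatures(account_id_path(node)):    # hypothetical helper
            last_account_hit[node] = time.time()
            return PathType.ACCOUNT_ID
        if list_signatures(node_id_path(node)):       # hypothetical helper
            path_type = PathType.NODE_ID              # use node ID path going forward
            return PathType.NODE_ID
        return PathType.ACCOUNT_ID                    # no results anywhere; retry later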

Phase Two:

Update the consensus nodes to write stream files to the new {network}/{shard}/{nodeID}/* node ID based path.
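
For illustration, the object key a node would write to can be derived directly from the network, shard, and node ID; the helper below is a hypothetical sketch, not consensus node code.

    def stream_object_key(network, shard, node_id, stream, filename):
        # e.g. mainnet/0/3/record/2022-07-13T08_46_08.041986003Z.rcd.zst
        return f"{network}/{shard}/{node_id}/{stream}/{filename}"

    key = stream_object_key("mainnet", 0, 3, "record",
                            "2022-07-13T08_46_08.041986003Z.rcd.zst")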

Phase Three:

Create a new lambda function which performs data rollup along with compression.

Compression, zstd vs. gzip:
  • zstd has a better compression ratio than gzip.
  • (zstd:gzip)
    • Faster compression: about a 5:1 ratio.
    • Faster decompression: about a 3:1 ratio.
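
As a rough illustration of this trade-off, the two codecs can be compared on a sample stream file. This sketch assumes the zstandard package from PyPI; the file name is only an example.

    import gzip
    import zstandard as zstd

    data = open("2022-07-13T08_46_08.041986003Z.rcd", "rb").read()
    zstd_bytes = zstd.ZstdCompressor(level=3).compress(data)
    gzip_bytes = gzip.compress(data, compresslevel=6)
    print(f"zstd: {len(zstd_bytes)} bytes, gzip: {len(gzip_bytes)} bytes")
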
Algorithm:
  • Read stream files from the current locations:
          accountBalances/balance0.0.3/ 
          eventsStreams/events_0.0.3/     
          recordstreams/record0.0.3/ 
    
  • The new path for aggregated files would be:

          {network}/{shard}/record/
          {network}/{shard}/balance/
          {network}/{shard}/event/
    
  • The lambda will decompress the file, if it is compressed.
  • Calculate a hash of the file contents.
  • Maintain an in-memory map of content hashes for the files currently being processed, in order to remove duplicate files with the same content.
  • Compress the file using zstd if it is not a signature file.
  • Monitor the size/time of the potential aggregated file.
  • For aggregation:
    • Calculate the node ID from the current node account ID: nodeId = (number portion of) nodeAccountId - 3. For example, the node ID for balance0.0.4 would be 4 - 3 = 1.
    • The new file structure would be as follows. Example:
      >tar -tf 2022-07-13T08_46.tar | sort  
      
      2022-07-13T08_46/
      2022-07-13T08_46/0/
      2022-07-13T08_46/0/2022-07-13T08_46_08.041986003Z.rcd.zst
      2022-07-13T08_46/0/2022-07-13T08_46_08.041986003Z.rcd_sig
      2022-07-13T08_46/0/2022-07-13T08_46_11.304284003Z.rcd.zst
      2022-07-13T08_46/0/2022-07-13T08_46_11.304284003Z.rcd_sig
      2022-07-13T08_46/0/2022-07-13T08_46_11.304284003Z_01.rcd.zst
      2022-07-13T08_46/1/
      2022-07-13T08_46/1/2022-07-13T08_46_08.041986003Z.rcd.zst
      2022-07-13T08_46/1/2022-07-13T08_46_08.041986003Z.rcd_sig
          
      

      The above compressed tar file includes the following files in node id based folders:

      • Record stream files (.rcd.zst)
      • Signature files (.rcd_sig)
      • Sidecar files (_01.rcd.zst)

      NOTE: File 2022-07-13T08_46_08.041986003Z.rcd.zst appears in two separate node folders. These files have distinct content; when downloaded by the mirror node, one of the two will be selected based on the consensus majority.

      The aggregation into the new files would be size/time based, as determined by performance testing.

  • Once the limit has been reached, aggregate the files and write the aggregate to the new bucket location.
  • Delete the existing files that have been read from the current bucket location.
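
Below is a condensed Python sketch of the rollup step described above. The node ID formula and the dedup and compression rules follow this HIP; the in-memory files input, the tarfile packaging, and the helper names are illustrative assumptions, and the size/time threshold check is omitted for brevity.

    import gzip
    import hashlib
    import io
    import tarfile
    import zstandard as zstd

    def node_id_from_account(node_account_id: str) -> int:
        # nodeId = (number portion of) nodeAccountId - 3, e.g. 0.0.4 -> 1
        return int(node_account_id.split(".")[-1]) - 3

    def rollup(files, period):
        """files: dict mapping (nodeAccountId, filename) -> raw bytes.
        period: aggregation window name, e.g. '2022-07-13T08_46'."""
        seen = set()               # content hashes, to drop duplicate files
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for (account_id, name), data in sorted(files.items()):
                if name.endswith(".gz"):              # decompress if compressed
                    data, name = gzip.decompress(data), name[:-3]
                digest = hashlib.sha384(data).hexdigest()
                if digest in seen:
                    continue                          # same content already packed
                seen.add(digest)
                path = f"{period}/{node_id_from_account(account_id)}/{name}"
                if not name.endswith("_sig"):         # signature files stay as-is
                    data, path = zstd.ZstdCompressor().compress(data), path + ".zst"
                info = tarfile.TarInfo(path)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        return buf.getvalue()                         # write this to the new path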

A significant amount of time after launching Phase Three, we will remove the mirror node properties and default to downloading from the node ID based path.

Mirror Node Implications

The existing mirror nodes would need to implement Phase One.

Dev ops Implications

After the completion of Phase One, and once notified, DevOps will need to implement bucket folder permissions for specific nodes for Phase Two. Any related configuration changes would also be a DevOps responsibility, followed by the rollout of Phase Two.

Engineering team Implications

Updates to the consensus node software and any code changes to mirror.py for Phase Two would be implemented by the engineering team.

Backwards Compatibility

This HIP proposes breaking changes. All existing mirror node operators would need to change code to support reading from the new as well as the old bucket location. Under the current plan for Phases One and Two, operators will have ample time until the completion of Phase Two, plus a significant notification period, before mirror nodes become backwards incompatible.

Security Implications

With this change, we are improving security. Previously, each node had to be granted access to three separate paths, since the stream types were the top level folders. Write access to the new node ID based path grants access only to that node's specific bucket path, which prevents a malicious node from writing to or corrupting other nodes' stream files. The rollup process could be seen as centralized and a single point of failure, and since the lambda deletes files it has already aggregated, this might be perceived as a problem. However, all files are cryptographically signed by each node, and the same data is in each file. We would just be downloading data and aggregating it: record file bytes would be unchanged and signatures would still be present, so authenticity can be verified as it is currently done in the mirror node.

How to Teach This

There will be a communication sent about the deprecation period along with instructions. Hedera docs will be updated with information about the new bucket structure.

Reference Implementation

Please follow this issue to track progress of the reference implementation.

Rejected Ideas

Use the ledger ID instead of the network name in the bucket path. This idea was rejected because having a 32-byte hex string as the folder name was not user friendly. Instead, it was decided to keep the ledger name internal.

Open Issues

We do not know of any open concerns with this proposal.

Copyright/license

This document is licensed under the Apache License, Version 2.0 – see LICENSE or https://www.apache.org/licenses/LICENSE-2.0

Citation

Please cite this document as: