Safe RPC Failover Checklist

Failover is not just "try the next endpoint." On Solana, a backup path can keep an app online, but it can also create stale reads, split-brain decisions, duplicate transaction sends, and retry storms if every worker makes its own choice.

Use this page when your app already uses Carbium RPC and you are adding a secondary path for resilience. The goal is not to turn every product into a multi-provider control plane. The goal is to decide which requests may fail over, which ones must be checked first, and which ones should stop instead of multiplying traffic.

Part of the Carbium Solana infrastructure stack.

The short rule

Reads can usually fail over after a health and freshness check. Writes need a stricter policy.

Request type	Fail over automatically?	Required guardrail
Health probe	Yes	Keep probes cheap and rate-limited
Balance, account, or slot read	Usually	Compare freshness and commitment expectations
Quote support read	Carefully	Do not mix stale reads with fresh route decisions
Transaction submission	Rarely	Check signature status before any second send
Confirmation check	Yes	Query by known signature, not by rebuilding intent
Backfill or support lookup	Yes	Prefer slower, explicit fallback over user-facing retries
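
The table above can be encoded as data so that one routing layer, rather than each feature, answers the failover question. The following is a minimal sketch, not Carbium's implementation: the `RequestClass` names, the three `FailoverMode` values, and `mayFailOver` are hypothetical, and collapsing "Usually" and "Carefully" into one `conditional` mode is a simplifying assumption.

```typescript
// Hypothetical request classes mirroring the rows of the policy table.
type RequestClass =
  | "healthProbe"
  | "basicRead"
  | "quoteSupportRead"
  | "transactionSend"
  | "confirmationCheck"
  | "backfill";

// Assumed simplification: "Yes" -> automatic, "Usually"/"Carefully" ->
// conditional, "Rarely" -> manual.
type FailoverMode = "automatic" | "conditional" | "manual";

interface FailoverRule {
  mode: FailoverMode;
  guardrail: string;
}

const failoverPolicy: Record<RequestClass, FailoverRule> = {
  healthProbe: { mode: "automatic", guardrail: "cheap, rate-limited probes" },
  basicRead: { mode: "conditional", guardrail: "compare freshness and commitment" },
  quoteSupportRead: { mode: "conditional", guardrail: "no stale reads mixed with fresh route decisions" },
  transactionSend: { mode: "manual", guardrail: "check signature status before any second send" },
  confirmationCheck: { mode: "automatic", guardrail: "query by known signature" },
  backfill: { mode: "automatic", guardrail: "slower, explicit fallback" },
};

// Returns true when the routing layer may switch paths for this request.
// guardrailPassed is the caller's result for the rule's guardrail check.
function mayFailOver(requestClass: RequestClass, guardrailPassed: boolean): boolean {
  const rule = failoverPolicy[requestClass];
  if (rule.mode === "automatic") return true;
  if (rule.mode === "conditional") return guardrailPassed;
  return false; // "manual": never switch paths without an explicit decision
}
```

Keeping the policy in one lookup makes it easy to audit: a transaction send never fails over through this function, whatever the caller passes.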

This page owns the failover policy. For full failure classification, use RPC Errors Reference. For blockhash expiry, use Blockhash Expiry Recovery Playbook. For backend relay structure, use Sending Transactions through Carbium RPC.

A safer failover shape

Do not let every feature directly decide where to send traffic. Put the policy in one small routing layer:

flowchart TD
    A["Application request"] --> B{"Read or write?"}
    B -->|Read| C["Check primary health<br/>and freshness"]
    C --> D{"Primary usable?"}
    D -->|Yes| E["Use Carbium RPC"]
    D -->|No| F["Use fallback<br/>with freshness label"]
    B -->|Write| G["Submit once<br/>store signature"]
    G --> H{"Unclear result?"}
    H -->|Yes| I["Check signature status<br/>before any retry"]
    H -->|No| J["Advance app state"]

The important part is the split:

read failover is about freshness and user impact
write failover is about avoiding duplicate execution
confirmation failover is about asking the network what happened to a known signature

Health checks are not enough

Solana's getHealth method reports whether an RPC node considers itself healthy, meaning it is within its configured slot distance of the cluster tip. That is useful, but it is only the first signal.

A production readiness check should separate these questions:

Check	What it answers	Example method
Is the endpoint reachable?	Can the client connect and receive JSON-RPC?	`getHealth` or `getSlot`
Is the endpoint fresh enough?	Is it close enough to the slot you expect?	`getSlot`
Is the method path working?	Does the method your app depends on still work?	A low-cost representative read
Is account access correct?	Are keys, restrictions, and plan access valid?	Known-good authenticated request
Are errors local or systemic?	Did one worker, one method, or all traffic fail?	Logs plus Carbium monitoring

Keep health checks boring. A health check that calls expensive methods too often can become the traffic pattern that triggers the incident.
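
One way to keep the questions above separate is to classify a probe result in a pure function and throttle how often probes run. This is a sketch under stated assumptions: `ProbeResult`, `classifyEndpoint`, and `shouldProbe` are hypothetical names, the probe fields are expected to be filled from your own getHealth and getSlot calls, and the reference slot is whatever your app treats as the cluster tip.

```typescript
// Hypothetical probe result, filled in by the caller from real RPC calls.
interface ProbeResult {
  reachable: boolean;       // did the endpoint answer with valid JSON-RPC?
  reportedHealthy: boolean; // did getHealth report the node as healthy?
  slot: number | null;      // slot from getSlot, or null if that call failed
}

type Readiness = "usable" | "stale" | "unhealthy" | "unreachable";

// Reachability, self-reported health, and freshness are checked in order,
// so a healthy-but-lagging endpoint is reported as "stale", not "usable".
function classifyEndpoint(
  probe: ProbeResult,
  referenceSlot: number,
  maxSlotLag: number
): Readiness {
  if (!probe.reachable) return "unreachable";
  if (!probe.reportedHealthy) return "unhealthy";
  // Unknown freshness is treated conservatively as stale.
  if (probe.slot === null) return "stale";
  return referenceSlot - probe.slot > maxSlotLag ? "stale" : "usable";
}

// Simple throttle: allow a probe only if minIntervalMs has passed.
function shouldProbe(
  lastProbeAtMs: number | null,
  nowMs: number,
  minIntervalMs: number
): boolean {
  return lastProbeAtMs === null || nowMs - lastProbeAtMs >= minIntervalMs;
}
```

Passing the clock in as `nowMs` keeps the throttle testable and makes the probe rate an explicit setting instead of a side effect of request volume.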

Decide by request class

Normal reads

For balance, account, slot, token, and display reads, automatic fallback can be reasonable when the primary path is unavailable or clearly stale.

Add two labels to the response your app receives:

which endpoint answered
which slot or commitment level the answer came from, when available

That keeps downstream code from treating every read as equal. A wallet balance view can usually show a fresh fallback read. A trading decision may need to stop if the freshest available read is behind the policy threshold.
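
The two labels can travel with the value in a small wrapper type, so each consumer applies its own freshness threshold. A minimal sketch follows; `LabeledRead` and `isFreshEnough` are hypothetical names, and the thresholds in the usage note are illustrative, not recommended values.

```typescript
// Hypothetical wrapper that carries the source labels with the value.
interface LabeledRead<T> {
  value: T;
  endpoint: string;    // which path answered, e.g. "carbium" or "fallback"
  slot: number | null; // context slot, when the method returns one
  commitment: "processed" | "confirmed" | "finalized";
}

// A read with no slot context cannot be shown to be fresh, so it is
// rejected for any decision that has a lag threshold.
function isFreshEnough<T>(
  read: LabeledRead<T>,
  referenceSlot: number,
  maxSlotLag: number
): boolean {
  if (read.slot === null) return false;
  return referenceSlot - read.slot <= maxSlotLag;
}

const balanceRead: LabeledRead<number> = {
  value: 1_500_000,
  endpoint: "fallback",
  slot: 990,
  commitment: "confirmed",
};
```

With a reference slot of 1000, `isFreshEnough(balanceRead, 1000, 50)` accepts the read for a display view, while `isFreshEnough(balanceRead, 1000, 4)` rejects the same read for a latency-sensitive decision.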

Quote and execution support reads

Swap and bot systems often combine several reads into one decision. Be careful when only one part of the pipeline fails over.

Bad pattern:

Fetch route or state from one endpoint.
Fetch blockhash or confirmation context from another endpoint.
Send based on mixed assumptions without logging the source of each step.

Better pattern:

Record the source and slot context for each material read.
Define a maximum allowed slot gap for the decision.
If the gap is too large, rebuild the decision instead of pushing it forward.
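
The better pattern reduces to a small check over the recorded reads. In this sketch, `MaterialRead`, `slotGap`, and `decideOnReads` are hypothetical names, and the maximum gap is an app-level policy value you choose.

```typescript
// Hypothetical record of one material read that feeds a decision.
interface MaterialRead {
  step: string;     // e.g. "route", "blockhash"
  endpoint: string; // which path answered
  slot: number;     // context slot of the answer
}

// Distance between the oldest and newest slot among the reads.
function slotGap(reads: MaterialRead[]): number {
  if (reads.length === 0) return 0;
  const slots = reads.map((read) => read.slot);
  return Math.max(...slots) - Math.min(...slots);
}

// Rebuild the decision when its inputs are too far apart in slot terms.
function decideOnReads(
  reads: MaterialRead[],
  maxAllowedGap: number
): "proceed" | "rebuild" {
  return slotGap(reads) > maxAllowedGap ? "rebuild" : "proceed";
}

const mixedReads: MaterialRead[] = [
  { step: "route", endpoint: "carbium", slot: 2000 },
  { step: "blockhash", endpoint: "fallback", slot: 1988 },
];
```

Here the two reads are 12 slots apart, so `decideOnReads(mixedReads, 8)` returns `"rebuild"`, and the recorded `endpoint` values show which step failed over.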

For the full quote-to-submit path, use Quote to Swap Integration Guide.

Transaction sends

Treat sendTransaction as a special case. Solana's RPC docs state that a successful sendTransaction response means the RPC service accepted the transaction, not that the cluster processed or confirmed it. They also point users to getSignatureStatuses for confirmation.

That changes the failover rule:

Send once through the chosen path.
Store the signature as soon as you have it.
If the result is unclear, query signature status before sending again.
If the blockhash expired, rebuild and re-sign instead of replaying stale bytes.

Do not send the same signed transaction from multiple endpoints in parallel unless you have intentionally designed for that behavior and can reconcile the result.
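
The four rules above can be written as one decision function that takes a signature status and the blockhash validity window. This is a sketch under stated assumptions: `SignatureStatusLike` models only a subset of the per-signature fields returned by getSignatureStatuses, `nextWriteAction` and the action names are hypothetical, and `lastValidBlockHeight` is assumed to be the value returned alongside the blockhash the transaction was built with. It also assumes the status came from a node that is fresh enough to trust, since a null status from a lagging node proves nothing.

```typescript
// Subset of the per-signature status fields; null means the node has
// no record of the signature.
interface SignatureStatusLike {
  confirmationStatus?: "processed" | "confirmed" | "finalized";
  err: unknown; // null when the transaction did not fail
}

type WriteAction =
  | "advance"            // landed: move app state forward
  | "mark-failed"        // landed with an error: do not resend as-is
  | "keep-polling"       // outcome still open: query again later
  | "rebuild-and-resign"; // blockhash expired: old bytes can no longer land

function nextWriteAction(
  status: SignatureStatusLike | null,
  currentBlockHeight: number,
  lastValidBlockHeight: number
): WriteAction {
  if (status !== null) {
    if (status.err !== null) return "mark-failed";
    if (
      status.confirmationStatus === "confirmed" ||
      status.confirmationStatus === "finalized"
    ) {
      return "advance";
    }
    return "keep-polling"; // seen by the node but not yet confirmed
  }
  // Unknown signature: only an expired blockhash justifies a new send,
  // and that send must be a rebuilt, re-signed transaction.
  if (currentBlockHeight > lastValidBlockHeight) return "rebuild-and-resign";
  return "keep-polling";
}
```

The function never returns an action that resends the original bytes. That is a deliberate choice in this sketch: while the blockhash is still valid it keeps polling by signature instead.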

A minimal TypeScript policy

This sketch keeps the details compact. Adapt it to your logging, queue, and metrics stack.

import { Connection } from "@solana/web3.js";

type RpcPath = {
  name: "carbium" | "fallback";
  connection: Connection;
};

declare function storeSignature(signature: string): Promise<void>;
declare function recordPrimaryReadFailure(): Promise<void>;

const paths: RpcPath[] = [
  {
    name: "carbium",
    connection: new Connection(
      `https://rpc.carbium.io/?apiKey=${process.env.CARBIUM_RPC_KEY}`,
      "confirmed"
    ),
  },
  {
    name: "fallback",
    connection: new Connection(process.env.FALLBACK_RPC_URL!, "confirmed"),
  },
];

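// Compare confirmed slots on both paths; maxSlotLag is an app policy value.
async function chooseReadPath(maxSlotLag = 8) {
  const [primary, fallback] = paths;

  // Probe concurrently so a failing fallback cannot break a healthy primary.
  const [primaryResult, fallbackResult] = await Promise.allSettled([
    primary.connection.getSlot("confirmed"),
    fallback.connection.getSlot("confirmed"),
  ]);

  if (primaryResult.status === "rejected") {
    await recordPrimaryReadFailure();
    if (fallbackResult.status === "rejected") {
      throw new Error("No RPC path is reachable for reads");
    }
    return { path: fallback, slot: fallbackResult.value };
  }

  // Fall back only when the fallback is reachable and clearly ahead.
  const primarySlot = primaryResult.value;
  if (
    fallbackResult.status === "fulfilled" &&
    primarySlot + maxSlotLag < fallbackResult.value
  ) {
    return { path: fallback, slot: fallbackResult.value };
  }

  return { path: primary, slot: primarySlot };
}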

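// Submit through the primary path only, persist the signature, then read
// its status once. A null status right after submission is normal; callers
// should keep polling by signature rather than sending again.
async function sendOnceAndConfirm(rawTransaction: Buffer) {
  const primary = paths[0];
  const signature = await primary.connection.sendRawTransaction(
    rawTransaction,
    {
      skipPreflight: false,
      maxRetries: 3, // RPC-side rebroadcast attempts, not client resends
    }
  );

  // Persist first so any later retry path can query by signature.
  await storeSignature(signature);

  const { value } = await primary.connection.getSignatureStatuses(
    [signature],
    { searchTransactionHistory: true }
  );

  // value[0] is null while the node has no record of the signature.
  return { path: primary.name, signature, status: value[0] };
}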

The policy is deliberately asymmetric:

reads may choose the freshest usable path
writes use the primary path first
unclear writes become signature-status checks before retries

What to monitor after adding failover

Failover should be visible. If it is invisible, it will hide problems until the fallback becomes the default.

Track:

failover count by method
primary health-check failures
primary vs fallback slot distance
write attempts by signature
confirmation latency by path
429, 5xx, and JSON-RPC error rate by path
user-facing errors after fallback was used
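
As an illustration of the first item, failover count by method, the sketch below keeps in-memory counters keyed by method and path. `FailoverCounters` is a hypothetical class; a production setup would more likely export these numbers through whatever metrics client you already run.

```typescript
type PathName = "carbium" | "fallback";

// Hypothetical in-memory counters, keyed by "method:path".
class FailoverCounters {
  private counts = new Map<string, number>();

  // Record which path served one request for a method.
  record(method: string, path: PathName): void {
    const key = `${method}:${path}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  get(method: string, path: PathName): number {
    return this.counts.get(`${method}:${path}`) ?? 0;
  }

  // Fraction of a method's requests served by the fallback path.
  // Returns 0 when the method has no recorded requests.
  fallbackShare(method: string): number {
    const primary = this.get(method, "carbium");
    const fallback = this.get(method, "fallback");
    const total = primary + fallback;
    return total === 0 ? 0 : fallback / total;
  }
}

const counters = new FailoverCounters();
counters.record("getBalance", "carbium");
counters.record("getBalance", "carbium");
counters.record("getBalance", "carbium");
counters.record("getBalance", "fallback");
```

A rising `fallbackShare` for one method while others stay flat suggests a method-level problem rather than an endpoint-wide outage.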

Use Status and Metrics for the Carbium-side signal map, and Calls Monitoring for method-level usage visibility.

Launch checklist

Before enabling failover in production:

Reads and writes have separate retry policies.
Health checks are cheap, rate-limited, and not counted as proof of freshness by themselves.
Read responses record endpoint name and slot or commitment context where available.
The app has a maximum slot-lag policy for latency-sensitive decisions.
Transaction sends persist the signature before any retry path runs.
Retry logic checks getSignatureStatuses before any duplicate send decision.
Blockhash expiry triggers rebuild and re-sign, not replay.
Dashboards show when fallback is active, not just when requests fail.
Support logs redact API keys and full authenticated endpoint URLs.
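
For the last item, a small helper can scrub key-bearing query parameters from any string before it is logged. This sketch assumes the key travels in an `apiKey` query parameter, as in the endpoint URL used on this page; `redactRpcUrl` and `SECRET_QUERY_PARAMS` are hypothetical names, and other auth schemes (headers, path tokens) would need their own handling.

```typescript
// Query parameter names treated as secrets. "apiKey" matches the URL
// shape used on this page; extend the list for other providers.
const SECRET_QUERY_PARAMS = ["apiKey"];

// Replace the value of each secret query parameter with "REDACTED".
// Works on a bare URL or on a longer log line that contains one.
function redactRpcUrl(text: string): string {
  let redacted = text;
  for (const name of SECRET_QUERY_PARAMS) {
    // Match "?name=value" or "&name=value" up to the next delimiter.
    const pattern = new RegExp(`([?&]${name}=)[^&#\\s]*`, "gi");
    redacted = redacted.replace(pattern, "$1REDACTED");
  }
  return redacted;
}

const safeLine = redactRpcUrl(
  "request to https://rpc.carbium.io/?apiKey=abc123&x=1 failed"
);
```

Applying the helper at the logging boundary, rather than at each call site, means a newly added log statement is redacted too.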

Failover should make production calmer. If adding a fallback path increases request volume, hides stale reads, or makes duplicate sends harder to reason about, the routing policy needs more work before launch.

Source checks

This policy is based on current Solana RPC behavior and the scope boundaries of Carbium's existing docs:

Solana getHealth reports whether an RPC node is healthy relative to its configured slot distance from the cluster tip.
Solana sendTransaction relays a signed transaction and returns before cluster confirmation; use getSignatureStatuses to confirm processing and final state.
Solana transaction submission can still fail when the transaction's recent blockhash expires before landing.
Carbium's docs already separate rate-limit mitigation, RPC error triage, blockhash expiry recovery, and signed-transaction relay. This page links to those owners instead of duplicating them.