Safe RPC Failover Checklist
Add backup Solana RPC paths without creating stale reads, duplicate transaction sends, or noisy health checks that hide the real failure.
Failover is not just "try the next endpoint." On Solana, a backup path can keep an app online, but it can also create stale reads, split-brain decisions, duplicate transaction sends, and retry storms if every worker makes its own choice.
Use this page when your app already uses Carbium RPC and you are adding a secondary path for resilience. The goal is not to turn every product into a multi-provider control plane. The goal is to decide which requests may fail over, which ones must be checked first, and which ones should stop instead of multiplying traffic.
Part of the Carbium Solana infrastructure stack.
The short rule
Reads can usually fail over after a health and freshness check. Writes need a stricter policy.
| Request type | Fail over automatically? | Required guardrail |
|---|---|---|
| Health probe | Yes | Keep probes cheap and rate-limited |
| Balance, account, or slot read | Usually | Compare freshness and commitment expectations |
| Quote support read | Carefully | Do not mix stale reads with fresh route decisions |
| Transaction submission | Rarely | Check signature status before any second send |
| Confirmation check | Yes | Query by known signature, not by rebuilding intent |
| Backfill or support lookup | Yes | Prefer slower, explicit fallback over user-facing retries |
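One way to keep this table enforceable is to encode it as data that the routing layer consults. The sketch below is a hypothetical encoding: `RequestClass`, `TableAnswer`, `failoverRules`, and `mayFailOverAutomatically` are illustrative names, not part of any Carbium or Solana API, and the mapping from "Usually" and "Carefully" to a guardrail check is an assumed interpretation.

```typescript
// Hypothetical encoding of the table above. All names are illustrative.
type RequestClass =
  | "healthProbe"
  | "normalRead"
  | "quoteSupportRead"
  | "transactionSend"
  | "confirmationCheck"
  | "backfillLookup";

// Mirrors the "Fail over automatically?" column.
type TableAnswer = "yes" | "usually" | "carefully" | "rarely";

interface FailoverRule {
  answer: TableAnswer;
  guardrail: string;
}

const failoverRules: Record<RequestClass, FailoverRule> = {
  healthProbe: { answer: "yes", guardrail: "Keep probes cheap and rate-limited" },
  normalRead: { answer: "usually", guardrail: "Compare freshness and commitment expectations" },
  quoteSupportRead: { answer: "carefully", guardrail: "Do not mix stale reads with fresh route decisions" },
  transactionSend: { answer: "rarely", guardrail: "Check signature status before any second send" },
  confirmationCheck: { answer: "yes", guardrail: "Query by known signature, not by rebuilding intent" },
  backfillLookup: { answer: "yes", guardrail: "Prefer slower, explicit fallback over user-facing retries" },
};

// Assumed interpretation: "yes" always allows automatic failover, "rarely" never
// does (it needs an explicit decision), and the middle answers depend on the guardrail.
function mayFailOverAutomatically(cls: RequestClass, guardrailPassed: boolean): boolean {
  const { answer } = failoverRules[cls];
  if (answer === "yes") return true;
  if (answer === "rarely") return false;
  return guardrailPassed;
}
```

Keeping the rules in one record means a new request class cannot be added without someone choosing its failover answer.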
This page owns the failover policy. For full failure classification, use RPC Errors Reference. For blockhash expiry, use Blockhash Expiry Recovery Playbook. For backend relay structure, use Sending Transactions through Carbium RPC.
A safer failover shape
Do not let every feature directly decide where to send traffic. Put the policy in one small routing layer:
    flowchart TD
      A["Application request"] --> B{"Read or write?"}
      B -->|Read| C["Check primary health<br/>and freshness"]
      C --> D{"Primary usable?"}
      D -->|Yes| E["Use Carbium RPC"]
      D -->|No| F["Use fallback<br/>with freshness label"]
      B -->|Write| G["Submit once<br/>store signature"]
      G --> H{"Unclear result?"}
      H -->|Yes| I["Check signature status<br/>before any retry"]
      H -->|No| J["Advance app state"]
The important part is the split:
- read failover is about freshness and user impact
- write failover is about avoiding duplicate execution
- confirmation failover is about asking the network what happened to a known signature
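The split above can be sketched as a single entry point that every feature calls instead of picking an endpoint itself. This is a minimal sketch under stated assumptions: `classifyMethod`, `planRoute`, and `RoutePlan` are hypothetical names, only `sendTransaction` and `getSignatureStatuses` are treated specially, and writes are assumed to stay on the primary path so that an unusable primary surfaces as an error rather than a silent reroute.

```typescript
type RequestKind = "read" | "write" | "confirmation";
type EndpointName = "carbium" | "fallback";

interface RoutePlan {
  kind: RequestKind;
  endpoint: EndpointName;
  note: string;
}

// Classify by JSON-RPC method name. Anything not listed is treated as a read.
function classifyMethod(method: string): RequestKind {
  if (method === "sendTransaction") return "write";
  if (method === "getSignatureStatuses") return "confirmation";
  return "read";
}

// Decide where a request goes. Writes are never rerouted here (assumption):
// the caller handles a failed primary send by checking signature status.
function planRoute(method: string, primaryUsable: boolean): RoutePlan {
  const kind = classifyMethod(method);
  if (kind === "write") {
    return { kind, endpoint: "carbium", note: "submit once and store the signature" };
  }
  const endpoint: EndpointName = primaryUsable ? "carbium" : "fallback";
  if (kind === "confirmation") {
    return { kind, endpoint, note: "query by the known signature" };
  }
  const note = endpoint === "fallback" ? "label the response with its freshness" : "primary read";
  return { kind, endpoint, note };
}
```

For example, `planRoute("getBalance", false)` selects the fallback with a freshness label, while `planRoute("sendTransaction", false)` still targets the primary path.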
Health checks are not enough
Solana's getHealth method tells you whether an RPC node reports healthy within its configured slot distance from the cluster tip. That is useful, but it is only the first signal.
A production readiness check should separate these questions:
| Check | What it answers | Example method |
|---|---|---|
| Is the endpoint reachable? | Can the client connect and receive JSON-RPC? | getHealth or getSlot |
| Is the endpoint fresh enough? | Is it close enough to the slot you expect? | getSlot |
| Is the method path working? | Does the method your app depends on still work? | A low-cost representative read |
| Is account access correct? | Are keys, restrictions, and plan access valid? | Known-good authenticated request |
| Are errors local or systemic? | Did one worker, one method, or all traffic fail? | Logs plus Carbium monitoring |
Keep health checks boring. A health check that calls expensive methods too often can become the traffic pattern that triggers the incident.
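A simple way to keep probes boring is to put a gate in front of them so that many workers or call sites reuse one recent result. The sketch below assumes a hypothetical `ProbeGate` helper; the caller still performs the actual probe (for example a `getHealth` or `getSlot` call) and passes in its own clock value.

```typescript
// Hypothetical helper: limits how often a probe may run and keeps the last result.
class ProbeGate {
  private minIntervalMs: number;
  private lastProbeAtMs: number | null = null;
  private lastHealthy: boolean | null = null;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // True when no probe has been recorded yet or the minimum interval has elapsed.
  shouldProbe(nowMs: number): boolean {
    return this.lastProbeAtMs === null || nowMs - this.lastProbeAtMs >= this.minIntervalMs;
  }

  // Store the outcome of a probe the caller just ran.
  record(nowMs: number, healthy: boolean): void {
    this.lastProbeAtMs = nowMs;
    this.lastHealthy = healthy;
  }

  // Last recorded result, or null if no probe has completed yet.
  cached(): boolean | null {
    return this.lastHealthy;
  }
}
```

A caller would check `shouldProbe(Date.now())`, run the probe only when it returns true, and otherwise rely on `cached()`.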
Decide by request class
Normal reads
For balance, account, slot, token, and display reads, automatic fallback can be reasonable when the primary path is unavailable or clearly stale.
Add two labels to the response your app receives:
- which endpoint answered
- which slot or commitment level the answer came from, when available
That keeps downstream code from treating every read as equal. A wallet balance view can usually show a fresh fallback read. A trading decision may need to stop if the freshest available read is behind the policy threshold.
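The two labels can travel with the value as a small wrapper type. This is a sketch, not a Carbium API: `LabeledRead` and `isFreshEnough` are hypothetical names, and treating a read with no slot context as not fresh enough is an assumption that suits strict callers.

```typescript
// Hypothetical wrapper carrying the two labels alongside the value.
interface LabeledRead<T> {
  value: T;
  endpoint: "carbium" | "fallback";
  slot: number | null; // null when slot context is not available
  commitment: "processed" | "confirmed" | "finalized";
}

// Compare the read against a reference slot the caller trusts.
// Reads without slot context are rejected (assumption for strict callers).
function isFreshEnough(read: LabeledRead<unknown>, referenceSlot: number, maxSlotLag: number): boolean {
  if (read.slot === null) return false;
  return referenceSlot - read.slot <= maxSlotLag;
}

// Example values only: a fallback balance read that is 20 slots behind the reference.
const balanceRead: LabeledRead<number> = {
  value: 1_500_000,
  endpoint: "fallback",
  slot: 1000,
  commitment: "confirmed",
};

const okForWalletView = isFreshEnough(balanceRead, 1020, 50); // lenient display threshold
const okForTrading = isFreshEnough(balanceRead, 1020, 8); // strict decision threshold
```

The same read passes the lenient display threshold and fails the strict one, which is the distinction the wallet and trading cases rely on.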
Quote and execution support reads
Swap and bot systems often combine several reads into one decision. Be careful when only one part of the pipeline fails over.
Bad pattern:
- Fetch route or state from one endpoint.
- Fetch blockhash or confirmation context from another endpoint.
- Send based on mixed assumptions without logging the source of each step.
Better pattern:
- Record the source and slot context for each material read.
- Define a maximum allowed slot gap for the decision.
- If the gap is too large, rebuild the decision instead of pushing it forward.
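The better pattern above reduces to a small check over the recorded reads. In this sketch, `MaterialRead`, `slotGap`, and `shouldRebuildDecision` are hypothetical names, and the gap is assumed to be the distance between the oldest and newest slot among the reads that feed one decision.

```typescript
// Hypothetical record of one read that materially affects a decision.
interface MaterialRead {
  step: string; // e.g. "route" or "blockhash"
  source: "carbium" | "fallback";
  slot: number;
}

// Distance between the oldest and newest slot context in the decision.
function slotGap(reads: MaterialRead[]): number {
  if (reads.length === 0) return 0;
  const slots = reads.map((r) => r.slot);
  return Math.max(...slots) - Math.min(...slots);
}

// True when the reads are too far apart to be combined into one decision.
function shouldRebuildDecision(reads: MaterialRead[], maxSlotGap: number): boolean {
  return slotGap(reads) > maxSlotGap;
}

// Example values only: the route came from the primary, the blockhash from the fallback.
const decisionReads: MaterialRead[] = [
  { step: "route", source: "carbium", slot: 100 },
  { step: "blockhash", source: "fallback", slot: 112 },
];
```

With these example reads the gap is 12 slots, so a policy of 8 asks for a rebuild while a policy of 16 lets the decision proceed.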
For the full quote-to-submit path, use Quote to Swap Integration Guide.
Transaction sends
Treat sendTransaction as a special case. Solana's RPC docs state that a successful sendTransaction response means the RPC service accepted the transaction, not that the cluster processed or confirmed it. They also point users to getSignatureStatuses for confirmation.
That changes the failover rule:
- Send once through the chosen path.
- Store the signature as soon as you have it.
- If the result is unclear, query signature status before sending again.
- If the blockhash expired, rebuild and re-sign instead of replaying stale bytes.
Do not send the same signed transaction from multiple endpoints in parallel unless you have intentionally designed for that behavior and can reconcile the result.
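The rule can be reduced to a pure decision function that runs after an unclear send. This is a simplified sketch: `ObservedStatus`, `NextAction`, and `nextActionAfterUnclearSend` are hypothetical names, and it assumes the status lookup and block height come from a path you trust to be fresh. `lastValidBlockHeight` is the value Solana returns alongside a blockhash from `getLatestBlockhash`.

```typescript
// Result of a getSignatureStatuses lookup, reduced to what the decision needs.
type ObservedStatus = { found: false } | { found: true; failed: boolean };

type NextAction =
  | "stop-sending" // the cluster has seen the signature; track it, do not resend
  | "wait-and-recheck" // not seen yet and the blockhash is still valid
  | "rebuild-and-resign"; // not seen and the blockhash has expired

function nextActionAfterUnclearSend(
  observed: ObservedStatus,
  currentBlockHeight: number,
  lastValidBlockHeight: number
): NextAction {
  // Any recorded status, success or failure, means a second send is not the answer.
  if (observed.found) return "stop-sending";
  // Past the last valid block height the signed bytes can no longer land.
  if (currentBlockHeight > lastValidBlockHeight) return "rebuild-and-resign";
  return "wait-and-recheck";
}
```

A found status with `failed: true` still maps to `stop-sending`: the failure should be surfaced to the caller, not retried with the same bytes.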
A minimal TypeScript policy
This sketch keeps the details compact. Adapt it to your logging, queue, and metrics stack.
    import { Connection } from "@solana/web3.js";

    type RpcPath = {
      name: "carbium" | "fallback";
      connection: Connection;
    };

    // Hooks into your own persistence and metrics; implementations are app-specific.
    declare function storeSignature(signature: string): Promise<void>;
    declare function recordPrimaryReadFailure(): Promise<void>;

    const paths: RpcPath[] = [
      {
        name: "carbium",
        connection: new Connection(
          `https://rpc.carbium.io/?apiKey=${process.env.CARBIUM_RPC_KEY}`,
          "confirmed"
        ),
      },
      {
        name: "fallback",
        connection: new Connection(process.env.FALLBACK_RPC_URL!, "confirmed"),
      },
    ];

    // Pick the path for a read: the primary, unless it fails or lags the
    // fallback by more than maxSlotLag slots.
    async function chooseReadPath(maxSlotLag = 8) {
      const [primary, fallback] = paths;
      // The fallback slot is the freshness reference. If this call throws,
      // the error propagates to the caller.
      const fallbackSlot = await fallback.connection.getSlot("confirmed");
      try {
        const primarySlot = await primary.connection.getSlot("confirmed");
        if (primarySlot + maxSlotLag >= fallbackSlot) {
          return { path: primary, slot: primarySlot };
        }
      } catch {
        await recordPrimaryReadFailure();
      }
      return { path: fallback, slot: fallbackSlot };
    }

    // Send once through the primary path, persist the signature, then ask for its status.
    async function sendOnceAndConfirm(rawTransaction: Buffer) {
      const primary = paths[0];
      const signature = await primary.connection.sendRawTransaction(
        rawTransaction,
        {
          skipPreflight: false,
          maxRetries: 3, // RPC-node rebroadcast attempts, not application-level resends
        }
      );
      await storeSignature(signature);
      // A null entry means the node has not seen the signature yet;
      // it does not prove the send failed.
      const status = await primary.connection.getSignatureStatuses(
        [signature],
        { searchTransactionHistory: true }
      );
      return { path: primary.name, signature, status };
    }

The policy is deliberately asymmetric:
- reads may choose the freshest usable path
- writes use the primary path first
- unclear writes become signature-status checks before retries
What to monitor after adding failover
Failover should be visible. If it is invisible, it will hide problems until the fallback becomes the default.
Track:
- failover count by method
- primary health-check failures
- primary vs fallback slot distance
- write attempts by signature
- confirmation latency by path
- 429, 5xx, and JSON-RPC error rate by path
- user-facing errors after fallback was used
Use Status and Metrics for the Carbium-side signal map, and Calls Monitoring for method-level usage visibility.
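As a starting point for the first item in the list, failover count by method can be tracked with two counters per method. The sketch below is a hypothetical in-memory helper (`PathUsageCounters` is not a Carbium API); a real deployment would export these numbers to its own metrics stack.

```typescript
// Hypothetical in-memory counters keyed by JSON-RPC method name.
class PathUsageCounters {
  private primaryByMethod = new Map<string, number>();
  private fallbackByMethod = new Map<string, number>();

  // Record which path answered one request.
  record(method: string, servedBy: "carbium" | "fallback"): void {
    const bucket = servedBy === "carbium" ? this.primaryByMethod : this.fallbackByMethod;
    bucket.set(method, (bucket.get(method) ?? 0) + 1);
  }

  // Number of requests for this method answered by the fallback.
  failoverCount(method: string): number {
    return this.fallbackByMethod.get(method) ?? 0;
  }

  // Share of requests for this method answered by the fallback, from 0 to 1.
  fallbackShare(method: string): number {
    const fallback = this.failoverCount(method);
    const total = fallback + (this.primaryByMethod.get(method) ?? 0);
    return total === 0 ? 0 : fallback / total;
  }
}
```

Alerting on `fallbackShare` rather than on the raw count makes it easier to notice when the fallback is quietly becoming the default path.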
Launch checklist
Before enabling failover in production:
- Reads and writes have separate retry policies.
- Health checks are cheap, rate-limited, and not counted as proof of freshness by themselves.
- Read responses record endpoint name and slot or commitment context where available.
- The app has a maximum slot-lag policy for latency-sensitive decisions.
- Transaction sends persist the signature before any retry path runs.
- Retry logic checks getSignatureStatuses before any duplicate send decision.
- Blockhash expiry triggers rebuild and re-sign, not replay.
- Dashboards show when fallback is active, not just when requests fail.
- Support logs redact API keys and full authenticated endpoint URLs.
Failover should make production calmer. If adding a fallback path increases request volume, hides stale reads, or makes duplicate sends harder to reason about, the routing policy needs more work before launch.
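For the log-redaction item in the checklist, one conservative option is to log only the endpoint host and never the full URL. This is a sketch with a hypothetical helper name, `endpointLabelForLogs`; it relies on the standard `URL` class and deliberately drops the path and query string, since either can carry a key.

```typescript
// Return a log-safe label for an RPC endpoint: the host only.
// Path and query are dropped because either may contain credentials.
function endpointLabelForLogs(rawUrl: string): string {
  try {
    return new URL(rawUrl).host;
  } catch {
    // Never echo an unparseable value back into logs.
    return "[unparseable-endpoint]";
  }
}

// Example with a placeholder key.
const safeLabel = endpointLabelForLogs("https://rpc.carbium.io/?apiKey=PLACEHOLDER_KEY");
```

Here `safeLabel` contains the host without the `apiKey` parameter, which is enough to tell the primary and fallback paths apart in support logs.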
Source checks
This policy is based on current Solana RPC behavior and the boundaries between existing Carbium docs:
- Solana getHealth reports whether an RPC node is healthy relative to its configured slot distance from the cluster tip.
- Solana sendTransaction relays a signed transaction and returns before cluster confirmation; use getSignatureStatuses to confirm processing and final state.
- Solana transaction submission can still fail when the transaction's recent blockhash expires before landing.
- Carbium's docs already separate rate-limit mitigation, RPC error triage, blockhash expiry recovery, and signed-transaction relay. This page links to those owners instead of duplicating them.
