CEX Resilience · Cue Cards

Slide 1
Opening
开场
First, a big thank-you to Joshua. That was a fantastic talk. Really great content.
首先,非常感谢 Joshua。刚才那段分享真的非常棒,内容也非常精彩。
For my part, I'll focus on something different.
我这一部分会讲点不一样的。
Over the past two years, we've talked with many CEX customers.
过去两年,我们和很多 CEX 客户有过深入交流。
We saw the same patterns come up again and again.
我们反复看到一些共性的 pattern 出现。
Today I want to share what we saw.
今天我想分享一下我们的观察。
The common patterns. How architectures change over time. And how these exchanges build their resilience.
这些共性的 pattern。架构怎么随时间演进。以及这些交易所怎么构建他们的韧性。
This 40-minute session has two parts.
这 40 分钟分两部分。
First — me. I'll walk through the architecture patterns we see across CEX customers.
第一部分 — 我来讲。我会讲我们在 CEX 客户身上看到的架构 pattern。
Second — my colleague Kenny will go deeper on Aeron, with more details.
第二部分 — 我的同事 Kenny 会深入讲 Aeron,以及更多细节。
You'll see this on the agenda in a moment.
这些等下你会在 agenda 里看到。
One more thing — please feel free to interrupt me anytime.
还有一点 — 任何时候都可以打断我。
This is a conversation, not a one-way presentation.
这是一次交流,不是单向的分享。
And if my English fails me, my colleagues here will rescue me.
另外如果我英语卡住了,旁边同事会救场。
OK, let's start.
好,我们开始。
Slide 2
Agenda
议程
Quick look at the agenda.
快速看一下议程。
First, why resilience matters. We'll start with a short look at major AWS incidents over the past 15 years.
首先 — 为什么韧性重要。我们会快速回顾过去 15 年 AWS 的几次大事故。
Second — our resilience framework. The mental model we use with customers.
其次 — 我们的韧性框架。我们和客户对话时用的思维模型。
Third — CEX architecture tiers. From leg to head. And how each tier handles HA and DR.
第三 — CEX 架构分层。从腿部到头部。每一层是怎么做 HA 和 DR 的。
At the end, I'll touch on Aeron Premium, and walk through one real deployment case.
最后,我会简单提一下 Aeron Premium,并过一个真实部署案例。
Slide 3
Three-Layer Defense + Chaos Engineering
三层防护 + 混沌工程
So this is the mental model we use.
这是我们的思维模型。
Resilience has three layers. Plus one practice that ties them together.
韧性分三层。加上一个贯穿全局的实践。
Layer 1 is Backup. The mission is simple — no data loss.
第一层是 Backup。目标很简单 — 数据不丢。
Periodic snapshots. Cross-region replication. Recovery drills.
定期快照。跨 Region 复制。恢复演练。
Cost is low — about 5 to 10 percent.
成本很低 — 大概 5% 到 10%。
Every customer needs this. No exceptions.
所有客户都需要。没有例外。
Layer 2 is HA. The mission — no service outage.
第二层是 HA。目标 — 服务不停。
Today, most mid-tier customers stop at single-AZ HA.
现在,大多数腰部客户停在单 AZ HA。
But AZ-level failures do happen. They are rare, but they happen.
但 AZ 级故障真的会发生。罕见,但会发生。
So we strongly recommend Multi-AZ HA as the new baseline.
所以我们强烈推荐 Multi-AZ HA 作为新基线。
Cost goes up to 100 to 150 percent. But recovery becomes seconds to minutes.
成本上升到 100% 到 150%。但恢复时间是秒到分钟级。
Layer 3 is DR. The mission — business continuity.
第三层是 DR。目标 — 业务延续。
This is multi-region active or warm standby. For region-level events.
这是多 Region active 或 warm standby。应对 Region 级事件。
For mission-critical workloads, this is not optional.
对核心业务来说,这不是可选项。
Now — the most important part.
现在 — 最重要的部分。
Chaos Engineering.
混沌工程。
All these layers only work if you have actually tested them.
前面这些层,只有你真的演练过,才会 work。
We see this again and again. Companies build great architectures.
我们反复看到这一点。公司建了很好的架构。
Three-AZ. Multi-region. All the right patterns.
三 AZ。多 Region。所有该有的 pattern 都有。
But when something real happens — the runbook is out of date.
但真出事的时候 — runbook 已经过时。
The team has never practiced. Decisions get made under pressure.
团队从来没演练过。压力下做决定。
Often at three in the morning.
经常是凌晨三点。
Chaos Engineering is only 2 to 5 percent extra cost.
混沌工程只增加 2% 到 5% 的成本。
But it tells you whether the previous 150 percent of investment will actually pay off.
但它决定了前面 150% 的投入是否真的有用。
Without chaos drills, the rest is just paper architecture.
没有混沌演练,前面都是纸上谈兵。
Slide 4
Architecture Summary
架构分层概览
Now let's look at where real CEX customers actually sit today.
现在我们看一下,真实的 CEX 客户今天都在什么位置。
We map customers into five tiers — leg, waist, chest, shoulder, and head.
我们把客户分成五层 — 腿、腰、胸、肩颈、头。
But — important — this is by architectural maturity, not by business size.
但要注意 — 这是按架构成熟度分的,不是按业务体量。
Some very large exchanges by volume actually sit at the waist or chest tier.
有些体量很大的交易所,其实架构上还在腰部或胸部。
At the bottom — leg tier.
最底下 — 腿部。
Single-leader OMS. Kafka. Single-leader matching engine. Synchronous DB writes.
单主 OMS。Kafka。单主撮合引擎。同步写库。
Throughput under 10K TPS per trading pair. P99 around 30 ms.
单币吞吐量 1 万以下。P99 大约 30 毫秒。
The DB is the bottleneck.
瓶颈在数据库。
Waist tier — most CEX customers are here.
腰部 — 大多数 CEX 客户在这里。
Active-standby OMS. Sharding. In-memory matching with async DB.
主备 OMS。分片。内存化撮合,异步写库。
20K to 50K TPS. P99 10 to 20 ms.
2 万到 5 万 TPS。P99 10 到 20 毫秒。
Some are multi-AZ. Many are still single-AZ.
有些是多 AZ。很多还停在单 AZ。
Chest tier. OMS already on Raft. Kafka still in the middle. ME still active-standby.
胸部。OMS 已经上 Raft。Kafka 还在中间。ME 还是主备。
40K to 150K TPS. P99 around 15 ms.
4 万到 15 万 TPS。P99 大约 15 毫秒。
The OMS upgrade to Raft is the key step here.
OMS 升级到 Raft,是这一层的关键一步。
Shoulder tier. Full-stack Sofa-Jraft. Kafka removed from the trading path.
肩颈部。全链路 Sofa-Jraft。Kafka 从交易路径上去掉了。
150K to 500K TPS. P99 under 10 ms.
15 万到 50 万 TPS。P99 在 10 毫秒以内。
TCP at the limit.
TCP 极限。
Head tier. Aeron or in-house multicast. UDP-based, end-to-end Raft, mostly multi-AZ.
头部。Aeron 或自研组播。基于 UDP,全链路 Raft,大多多 AZ。
400K to 1.2 million TPS. P99 in single-digit milliseconds.
40 万到 120 万 TPS。P99 个位数毫秒。
The key observation —
关键观察 —
Head-tier exchanges are not just chasing performance.
头部交易所不只是在追性能。
They also have the strongest resilience setup. Multi-AZ, automatic failover, the works.
他们的韧性设计也是最强的。多 AZ、自动切换,该有的都有。
Below the shoulder, most customers are still single-AZ.
肩颈部以下,大多数客户还停在单 AZ。
That is where we see the biggest engagement opportunity. Helping them evolve to multi-AZ without sacrificing latency.
这是我们看到最大的 engagement 空间。帮他们演进到多 AZ,同时不牺牲延迟。
Slide 5
Three Tech Stacks at a Glance
三种技术栈对比
Now let's compress this view into three technology stacks.
现在我们把刚才看到的客户,归纳成三种技术栈来看。
Kafka standard — about 80 percent of CEX customers sit here.
Kafka 标准 — 大约 80% 的 CEX 客户在这里。
Mature ecosystem. Standardized components. Optional dual-AZ.
生态成熟。组件标准化。可选双 AZ 部署。
P99 15 to 20 ms. 20K to 100K TPS. Easiest to operate.
P99 在 15 到 20 毫秒。2 万到 10 万 TPS。运维最简单。
Sofa-Jraft over gRPC — about 10 percent.
Sofa-Jraft 配 gRPC — 大约 10%。
No message broker. Heavy TCP optimization — Core Pinning, DPDK, NUMA, ENA.
没有消息中间件。TCP 极致优化 — Core Pinning、DPDK、NUMA、ENA。
P99 under 10 ms. Up to 500K TPS.
P99 在 10 毫秒以内。最高到 50 万 TPS。
Aeron over UDP — about 5 percent. Head tier.
Aeron 配 UDP — 大约 5%。头部所。
Reliable UDP multicast. Raft consensus. GC-free. Ring buffer zero-copy. Memory-mapped files.
可靠 UDP 组播。Raft 共识。GC-free。Ring Buffer 零拷贝。Memory Mapped File。
P99 under 1 ms. Up to 1.2 million TPS.
P99 在 1 毫秒以内。最高到 120 万 TPS。
The key point — complexity and cost go up steeply as you move right.
关键点 — 越往右,复杂度和成本陡升。
Kafka is three stars and three dollars.
Kafka 是三颗星、三块钱。
Aeron is five and five.
Aeron 是五颗星、五块钱。
So our advice is — match the stack to the business stage. Not the other way around.
所以我们的建议是 — 让技术栈匹配业务阶段。不是反过来。
Don't pick Aeron because it's cool. Pick it because you've outgrown the alternatives.
不要因为 Aeron 酷就选它。要等你真的发现别的栈不够用了再选。
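The "match the stack to the business stage" advice can be sketched as a small selection rule: walk the stacks from simplest to most complex and stop at the first one that meets the requirement. The figures are the ones quoted on this slide. The `Stack` type and the `simplest_stack` helper are hypothetical names used only for illustration, and a real decision would weigh far more than two numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stack:
    name: str
    p99_ms: float   # upper end of the P99 range quoted on the slide
    max_tps: int    # upper end of the TPS range quoted on the slide


# Ordered from simplest to most complex to operate.
STACKS = [
    Stack("Kafka standard", 20.0, 100_000),
    Stack("Sofa-Jraft over gRPC", 10.0, 500_000),
    Stack("Aeron over UDP", 1.0, 1_200_000),
]


def simplest_stack(required_tps: int, p99_budget_ms: float) -> Optional[Stack]:
    """Return the first (simplest) stack that meets both requirements."""
    for stack in STACKS:
        if stack.max_tps >= required_tps and stack.p99_ms <= p99_budget_ms:
            return stack
    return None  # nothing on the slide covers this requirement


print(simplest_stack(30_000, 25.0).name)    # Kafka standard
print(simplest_stack(300_000, 10.0).name)   # Sofa-Jraft over gRPC
print(simplest_stack(300_000, 2.0).name)    # Aeron over UDP
```

Because the list is ordered by complexity, Aeron is only returned when the two simpler stacks have been ruled out.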
Slide 6
Learner vs Follower
Learner vs Follower
Quick concept page — Learner vs Follower.
快速过一个概念 — Learner 和 Follower。
In Raft, Followers vote. They count toward quorum.
在 Raft 里,Follower 是投票的。他们计入多数派。
Three voting members, the leader plus two Followers, means majority is two. A write succeeds once two of them have confirmed it.
3 个投票成员(Leader 加 2 个 Follower),多数派就是 2。其中两个确认了,写入就成功。
Learners do not vote. They receive logs, apply logs. But they don't participate in elections.
Learner 不投票。他们接收日志,应用日志。但不参与选举。
Why does this matter? Three reasons.
为什么这个重要?三个原因。
One — read-only replicas. Distribute read load without slowing writes.
一 — 只读副本。分担读压力,不影响写入。
Two — cross-region deployment. Far-away nodes don't slow down primary writes.
二 — 跨地域部署。远端节点不会拖慢主集群的写入。
Three — node warm-up. A new node syncs as Learner first, then promotes when ready.
三 — 节点预热。新节点先以 Learner 同步,准备好之后再提升。
The key insight is on the right.
关键 insight 在右边。
Three voting members, two Learners.
3 个投票成员,2 个 Learner。
A write needs acks from two of the three voting members, and the leader counts as one.
写入只需要 3 个投票成员中的 2 个确认,Leader 自己算一个。
The two Learners — even if they're far away or in another region — do not block the write.
那 2 个 Learner — 即使在很远或者别的 Region — 不会阻塞写入。
So this is how Jraft and Aeron achieve cross-region DR without sacrificing primary latency.
所以这就是 Jraft 和 Aeron 实现跨 Region DR 而不牺牲主集群延迟的原理。
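The quorum rule on this slide can be sketched in a few lines. This is a toy model, not Sofa-Jraft or Aeron code. It assumes three voting members (the leader plus two followers) and two learners, and the ack latencies are made-up numbers. The names `Replica`, `majority` and `commit_latency_ms` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    ack_ms: float   # assumed time for this replica to persist and ack an entry
    voter: bool     # True for leader and followers, False for learners


def majority(voter_count: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voter_count // 2 + 1


def commit_latency_ms(replicas: list[Replica]) -> float:
    """Time until a majority of voters has acked. Learners are ignored."""
    voter_acks = sorted(r.ack_ms for r in replicas if r.voter)
    if not voter_acks:
        raise ValueError("need at least one voter")
    return voter_acks[majority(len(voter_acks)) - 1]


cluster = [
    Replica("leader-az1", 0.1, voter=True),    # the leader's own local write
    Replica("follower-az2", 1.0, voter=True),
    Replica("follower-az3", 1.2, voter=True),
    Replica("learner-region-b-1", 80.0, voter=False),
    Replica("learner-region-b-2", 85.0, voter=False),
]

print(majority(3))                  # 2
print(commit_latency_ms(cluster))   # 1.0
```

The commit latency is set by the second-fastest voter. The learners can be 80 ms or 10 seconds away and the result does not change.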
Slide 7
Resilience Evolution Across Three Stacks
三种栈的韧性演进
Now — the evolution.
现在我们看演进路径。
Two dimensions on this page.
这一页有两个维度。
Horizontal — left to right — is the stack upgrade. Kafka, Jraft, Aeron.
横向 — 从左到右 — 是技术栈升级。Kafka、Jraft、Aeron。
Vertical — bottom to top — is the resilience level.
纵向 — 从下往上 — 是韧性级别。
Every customer follows some version of this path.
每个客户都会沿着这条路径的某个版本演进。
Bottom — baseline.
最底层 — 基础。
Kafka with single-leader OMS and ME. Jraft and Aeron with a basic Raft cluster.
Kafka 配单主 OMS 和 ME。Jraft 和 Aeron 配基础 Raft 集群。
Next step — Baseline HA.
下一步 — 基础 HA。
Kafka adds cross-AZ active-standby to OMS and ME.
Kafka 这边给 OMS 和 ME 加跨 AZ 的主备。
Jraft and Aeron add cross-AZ Learners. Same purpose, different mechanism.
Jraft 和 Aeron 加跨 AZ 的 Learner。目的一样,机制不同。
Next — Standard HA.
再上一层 — 标准 HA。
Kafka customers begin migrating OMS to Raft. This is the qualitative jump.
Kafka 客户开始把 OMS 迁到 Raft。这是质变的一步。
Jraft and Aeron — continue strengthening Multi-AZ Learners.
Jraft 和 Aeron — 继续加强多 AZ 的 Learner。
Standard DR — cross-region Learner.
标准 DR — 跨 Region 的 Learner。
You'll notice the Kafka column is empty here.
你会看到 Kafka 这一列是空的。
That is intentional. In these architectures Kafka is the message pipe. The trading state lives in OMS and ME.
这是故意的。在这些架构里,Kafka 只是消息管道。交易状态在 OMS 和 ME 里。
Cross-region replication on Kafka itself isn't where we recommend customers invest.
我们不建议客户在 Kafka 这一层投入跨 Region 复制。
The right place to do cross-region DR is on the stateful layer — OMS, ME.
跨 Region DR 该做在有状态的服务上 — OMS、ME。
That is where Raft and Aeron Learners shine.
这正是 Raft 和 Aeron Learner 发挥作用的地方。
Top — Advanced DR.
顶层 — 进阶 DR。
All three stacks can reach this. Cross-region Learners. Full multi-region active or warm standby.
三种栈都可以到达。跨 Region 的 Learner。完整的多 Region active 或 warm standby。
The takeaway — every customer can be on this map.
启示 — 每个客户都能定位在这张图上。
The conversation is — where are you now, and what's the next step.
对话就变成 — 你现在在哪里,下一步是什么。
Slide 8
Multi-AZ MSK · Active-Standby Reference
多 AZ MSK 主备参考架构
Let me make this concrete with four real architectures.
我们用四张真实架构图把它落地。
I'll keep these brief. The goal is to show what these patterns look like end-to-end. Not to deep-dive.
我会简短带过。目的是让大家看到这些 pattern 端到端长什么样。不深入细节。
First — waist tier.
第一张 — 腰部。
OMS active-standby across two AZs. Two-AZ MSK Kafka. ME active-standby on EC2.
OMS 跨双 AZ 主备。两 AZ 的 MSK Kafka。EC2 上的 ME 主备。
Routing decided at the trading server based on user ID.
Trading Server 根据 user ID 做路由。
Odd user IDs go to AZ-1. Even user IDs go to AZ-2. That's the basic idea.
单数 user 到 AZ-1。双数 user 到 AZ-2。大概就是这样。
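The parity rule above is small enough to write out. This is a minimal sketch that assumes user IDs are non-negative integers. `route_az` is a hypothetical name, and the real trading server very likely does more than this (health checks, failover to the standby side).

```python
def route_az(user_id: int) -> str:
    """Odd user IDs go to AZ-1, even user IDs go to AZ-2."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return "AZ-1" if user_id % 2 == 1 else "AZ-2"


print([route_az(uid) for uid in (1001, 1002, 1003)])
# ['AZ-1', 'AZ-2', 'AZ-1']
```

The rule is deterministic, so any trading server instance sends a given user to the same side.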
Snapshots every 10 minutes. Async write to Aurora downstream.
每 10 分钟一次 snapshot。下游异步写 Aurora。
P99 around 10 ms. Each shard handles around 50K TPS.
P99 大约 10 毫秒。每个分片大约 50K TPS。
The Chinese sticky notes on the diagram cover snapshot strategy and offset sync logic. I'll translate as we go.
图里的中文 sticky 是 snapshot 策略和 offset 同步的逻辑。我边讲边翻。
The key takeaway — this is the most adopted pattern.
关键 takeaway — 这是最主流的 pattern。
It works well for waist-tier exchanges that want resilience without going to Sofa-Jraft yet.
适合腰部所 — 想要韧性,但还没到上 Sofa-Jraft 的阶段。
Slide 9
Single-AZ Self-Managed Kafka · Raft OMS
单 AZ 自建 Kafka + Raft OMS
Second case — chest tier in transition.
第二个 — 胸部过渡态。
The big change here — OMS is now Raft.
这里的大变化 — OMS 已经上 Raft 了。
But Kafka is still in the middle. And now it's self-managed in a single AZ. Not MSK.
但 Kafka 还在中间。而且变成自建单 AZ 了。不是 MSK。
ME is still active-standby.
ME 还是主备。
Why this hybrid?
为什么是这种混合?
Customers at this tier want OMS consistency guarantees from Raft.
这一层的客户想要 OMS 的 Raft 一致性保证。
But they don't want to give up Kafka. Their ecosystem already depends on it.
但又不想丢 Kafka。他们的整个生态都依赖 Kafka。
So they upgrade the front first — OMS to Raft. And leave the rest stable.
所以他们先升级前端 — OMS 上 Raft。后面留稳定。
One ME snapshot per minute here. More aggressive than the previous case.
ME 在这里是每分钟一次 snapshot。比上一个更激进。
P99 stays around 10 to 15 ms.
P99 保持在 10 到 15 毫秒。
Slide 10
Sofa-Jraft End-to-End
Sofa-Jraft 全链路
Third — shoulder tier. Full Sofa-Jraft.
第三个 — 肩颈部。全链路 Sofa-Jraft。
Kafka is gone from the trading path.
Kafka 从交易路径上消失了。
The OMS cluster and the ME cluster are both Raft, both gRPC, and both deployed on bare metal.
OMS 集群和 ME 集群 — 都是 Raft,都用 gRPC,都部署在裸金属上。
Notice OMS-shard-1 and OMS-shard-2 each have their own Raft group. With leader and followers.
你会看到 OMS-shard-1 和 shard-2 各自有自己的 Raft group。有 leader 和 followers。
Same on the ME side.
ME 那边也一样。
Snapshots are still taken every 10 minutes. Sofa-Jraft handles consistency. Snapshots are for fast recovery.
Snapshot 还是 10 分钟。Sofa-Jraft 管一致性。Snapshot 用于快速恢复。
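Why a snapshot speeds up recovery can be shown with a toy state machine: a restarting node loads the latest snapshot and replays only the log entries after it, instead of replaying the whole log. This is an illustrative sketch, not Sofa-Jraft's snapshot API. The balance-style state and the names `apply` and `replay` are assumptions made for the example.

```python
import copy
from typing import Optional


def apply(state: dict, entry: tuple) -> None:
    """Apply one log entry (account, delta) to the state."""
    account, delta = entry
    state[account] = state.get(account, 0) + delta


def replay(log: list, start_state: Optional[dict] = None, from_index: int = 0) -> dict:
    """Rebuild state by applying log[from_index:] on top of start_state."""
    state = copy.deepcopy(start_state) if start_state else {}
    for entry in log[from_index:]:
        apply(state, entry)
    return state


log = [("alice", 100), ("bob", 50), ("alice", -30), ("bob", 20), ("carol", 5)]

# Snapshot taken after the first three entries were applied.
snapshot_index = 3
snapshot_state = replay(log[:snapshot_index])

full = replay(log)                                    # replays 5 entries
fast = replay(log, snapshot_state, snapshot_index)    # replays 2 entries

print(full == fast)                 # True
print(len(log) - snapshot_index)    # 2
```

Both paths reach the same state. The replicated log decides what the state is, and the snapshot only shortens how much of it has to be replayed.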
Customers at this tier have invested heavily — DPDK, NUMA, custom kernels.
这一层的客户投入很重 — DPDK、NUMA、自定义内核。
P99 under 10 ms is achievable, with the right tuning.
调优好可以做到 P99 10 毫秒以内。
One trade-off worth mentioning — the Sofa-Jraft codebase is mature but evolves slowly.
有一个 trade-off 值得提一下 — Sofa-Jraft 代码库成熟但更新慢。
Most customers at this tier maintain their own forks.
这一层的客户大多维护自己 fork 的版本。
Slide 11
Aeron Cluster End-to-End · Head Tier
Aeron Cluster 全链路 · 头部参考
Last one — head tier. Aeron Cluster end-to-end.
最后 — 头部。全链路 Aeron Cluster。
This is the publicly known architecture pattern. Based on Frank Yu's QCon and Aeron MeetUp talks.
这是公开已知的架构 pattern。来自 Frank Yu 在 QCon 和 Aeron MeetUp 的演讲。
Aeron Cluster on both OMS and ME sides. UDP Aeron Transport between them.
OMS 端和 ME 端都是 Aeron Cluster。中间是 UDP 的 Aeron Transport。
All deployed on metal instances.
全部部署在裸金属上。
Two sharding strategies — OMS shards by user. ME shards by trading pair.
两种分片策略 — OMS 按 user 分片。ME 按交易对分片。
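The two sharding keys can be sketched with one stable hash function. This is an assumption-heavy illustration: the talk does not say how shards are actually assigned, and `stable_shard`, `oms_shard` and `me_shard` are hypothetical names. CRC32 is used only because it gives the same result on every process and host, unlike Python's built-in `hash` for strings.

```python
import zlib


def stable_shard(key: str, shard_count: int) -> int:
    """Map a key to a shard index that is identical on every host."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return zlib.crc32(key.encode("utf-8")) % shard_count


def oms_shard(user_id: int, shard_count: int) -> int:
    """OMS side: all orders of one user land on the same shard."""
    return stable_shard(f"user:{user_id}", shard_count)


def me_shard(pair: str, shard_count: int) -> int:
    """ME side: one trading pair, and so one order book, lives on one shard."""
    return stable_shard(f"pair:{pair.upper()}", shard_count)


# The same pair always maps to the same matching-engine shard.
print(me_shard("BTC-USDT", 4) == me_shard("btc-usdt", 4))   # True
```

Sharding the OMS by user keeps one user's orders and balances in one place. Sharding the ME by pair keeps each order book on a single Raft group.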
What gives this architecture its edge —
是什么让这个架构有优势 —
Reliable UDP multicast.
可靠的 UDP 组播。
Aeron Cluster Raft consensus.
Aeron Cluster 的 Raft 共识。
NAK-based precise retransmission.
基于 NAK 的精准重传。
GC-free, zero-copy ring buffers.
GC-free 的零拷贝 Ring Buffer。
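The idea behind NAK-based retransmission is that the receiver stays silent while data arrives in order and asks only for the exact sequence numbers it is missing. The sketch below models that idea. It is not Aeron's implementation: `NakReceiver` is a hypothetical class, and a real receiver would send NAKs on a timer with back-off instead of on every message.

```python
class NakReceiver:
    """Tracks sequence numbers and reports exactly which ones are missing."""

    def __init__(self) -> None:
        self.next_expected = 0                  # lowest sequence not yet delivered
        self.buffered: dict[int, bytes] = {}    # out-of-order messages held back
        self.delivered: list[bytes] = []

    def on_message(self, seq: int, payload: bytes) -> list[int]:
        """Accept one message and return the sequence numbers to NAK."""
        if seq < self.next_expected or seq in self.buffered:
            return []                           # duplicate, nothing to do
        self.buffered[seq] = payload
        # Deliver everything that is now contiguous.
        while self.next_expected in self.buffered:
            self.delivered.append(self.buffered.pop(self.next_expected))
            self.next_expected += 1
        if not self.buffered:
            return []                           # in order, so no NAK
        highest = max(self.buffered)
        return [s for s in range(self.next_expected, highest)
                if s not in self.buffered]


rx = NakReceiver()
print(rx.on_message(0, b"a"))   # []
print(rx.on_message(1, b"b"))   # []
print(rx.on_message(4, b"e"))   # [2, 3]
print(rx.on_message(2, b"c"))   # [3]
print(rx.on_message(3, b"d"))   # []
print(rx.delivered)             # [b'a', b'b', b'c', b'd', b'e']
```

On the happy path the receiver sends no feedback at all, which is the main difference from per-message acknowledgements.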
The numbers speak — P99 under 1 millisecond.
数字会说话 — P99 在 1 毫秒以内。
A note for our peers here today.
给在场各位同事一个 note。
What we draw on this slide is the public pattern.
这一页上画的是公开的 pattern。
We've also been spending time understanding what makes this architecture work in production. And where the edge cases are.
我们也花了不少时间研究这个架构在生产中怎么 work。以及哪些是 edge case。
So we'd love to hear your perspective later. Especially on how this pattern evolves over the next 12 to 24 months.
所以稍后非常希望听听各位的看法。特别是这个 pattern 在接下来 12 到 24 个月怎么演进。
That's the overview. Happy to take questions.
overview 到此结束。欢迎随时提问。
And if any of these architectures are familiar to you, I'd love to hear more.
如果这些架构里有你们熟悉的,也很想多聊聊。