Like a Drumbeat, Broadcom Doubles Ethernet Bandwidth with “Tomahawk 5”
If there is one thing hyperscalers and cloud builders value above all else, it’s consistency and predictability. They have enough uncertainty to deal with when it comes to customer demand that they like their systems to behave as deterministically as possible and they like a steady pace of innovation from their silicon partners.
In a way, hyperscalers and cloud builders created the market for merchant switch and router chips, and did so by encouraging upstarts such as Broadcom, Marvell, Mellanox, Barefoot Networks, and Innovium to create chips capable of running their own custom network operating systems and network telemetry and management tools. These same massive data center operators encouraged upstart switch makers such as Arista Networks, Accton, Delta, H3C, Inventec, Wistron, and Quanta to adopt merchant silicon in their switches, which put pressure on networking incumbents such as Cisco Systems and Juniper Networks on several fronts. All of this has forced the disaggregation of hardware and software in switching and the opening up of network platforms, thanks to support for the Switch Abstraction Interface (SAI) created by Microsoft and the opening up of proprietary SDKs (mainly via API interfaces) by the major network ASIC makers.
No one has benefited from all this openness as much as Broadcom. And because of fierce competition in the merchant network silicon market, now joined by Cisco’s Silicon One merchant switch and router chips, Broadcom has had to be relentless, delivering six to eight variants of switch and router chips each year to meet the needs of its enterprise customers, telecommunications operators, service providers, hyperscalers, and cloud builders.
Broadcom started with its “Trident” line of switch chips, moved into deep-buffer switching and routing with the “Jericho” line, and eventually had to fork its line of data center switch chips to add the cost-optimized, high-performance, no-frills “Tomahawk” line. It is these three families of switch chips that still form the core of its data center network chip business today.
Peter Del Vecchio, product manager for the Tomahawk and Trident lines at Broadcom, tells The Next Platform that many customers mix and match ASICs in their networks. For example, many cloud vendors who sell compute and network capacity for a living (many of whom are also hyperscalers with ad-supported or subscription-based services at the software level) want to offer bare metal servers or move enterprise applications onto their systems. They might therefore choose a Trident ASIC for their top-of-rack switches and then use Tomahawk ASICs in the rest of the network fabric above. At higher levels of the network fabric, some cloud builders and hyperscalers choose Jericho ASICs because of their gigabytes of packet buffers, which help manage congestion, and their ability to scale to Internet routing tables. But in general, Trident is for enterprises that need to support the widest range of protocols, Tomahawk is for hyperscalers that focus on cost per bit, thermals, and scale, and Jericho is for telcos and service providers who run huge network backbones.
All of these Broadcom ASIC families share a single set of APIs in a single SDK as well as support for SAI interfaces, which provide a minimum level of compatibility across switch ASIC vendors, who have been compelled by the purchasing power of the hyperscalers and cloud builders to support SAI.
With the “Tomahawk 5” StrataXGS, Broadcom is arriving right on time with another doubling of bandwidth, and is doing so with a monolithic chip design etched in a 5-nanometer process from foundry partner Taiwan Semiconductor Manufacturing Co.
This single-chip Tomahawk 5 can handle 51.2 Tb/sec of aggregate bandwidth, double that of the Tomahawk 4 that was unveiled two years ago, exactly on the two-year cadence that Broadcom likes to keep for its switch ASICs. With the Tomahawk 4, Broadcom had a version with 512 SerDes running at 50 Gb/sec using PAM-4 modulation to drive a total of 25.6 Tb/sec of bandwidth, another with 256 SerDes running at 100 Gb/sec, and a bunch of variants with a smaller aggregate bandwidth of 12.8 Tb/sec for more modest use cases. With the Tomahawk 5, the 512 SerDes surrounding the packet processing engines and buffers run at 100 Gb/sec, yielding 51.2 Tb/sec of bandwidth. We can expect a bunch of different variants in the Tomahawk 5 ASIC family, but these have yet to be revealed. What we do know is that the SerDes design of the Tomahawk 5 is all new, and for good reason.
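The bandwidth figures above are simply the SerDes lane count multiplied by the per-lane signaling rate. As a quick sanity check, here is a minimal Python sketch; the function name and dictionary labels are illustrative and not anything from Broadcom.

```python
def aggregate_tbps(serdes_lanes: int, lane_gbps: int) -> float:
    """Aggregate bandwidth in Tb/sec: lane count times per-lane rate in Gb/sec."""
    return serdes_lanes * lane_gbps / 1000.0


# Lane counts and lane speeds as quoted in the article; the labels are ours.
configs = {
    "Tomahawk 4, 512 x 50 Gb/sec": (512, 50),
    "Tomahawk 4, 256 x 100 Gb/sec": (256, 100),
    "Tomahawk 5, 512 x 100 Gb/sec": (512, 100),
}

for label, (lanes, gbps) in configs.items():
    print(f"{label}: {aggregate_tbps(lanes, gbps):.1f} Tb/sec")
```

Both Tomahawk 4 configurations land on 25.6 Tb/sec, and keeping 512 lanes while doubling the lane speed to 100 Gb/sec is what gets Tomahawk 5 to 51.2 Tb/sec.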
“This generation of SerDes was designed from the ground up to be very flexible,” says Del Vecchio. “This 100 Gb/sec SerDes can push copper up to 4 meters, and of course we can handle front panel pluggable optical modules, and we can also drive our co-packaged optics.”
We’ve gotten wind of the co-packaged optics that Broadcom has been working on, and we’ll follow up on that. But let’s just say that Broadcom thinks it can lower the cost of switching with co-packaged optics, something not everyone, including and perhaps especially Arista Networks’ Andy Bechtolsheim, believes is possible over the course of this generation of switches. We can assure you that Arista Networks will be at the front of the line peddling switches with co-packaged optics if the economics work out as Broadcom says.
The important thing for hyperscalers and cloud builders, and what’s driving these generational shifts in networking, is that the cost per bit moved is going down and the watts per bit moved is going down as well. With the Tomahawk 5 ASIC, Del Vecchio says Broadcom can deliver less than 1 watt per 100 Gb/sec of signaling.
The Tomahawk 5 chip can drive 64 ports running at 800 Gb/sec, 128 ports running at 400 Gb/sec, or 256 ports running at 200 Gb/sec. These days, hyperscalers and cloud builders favor switches with 128 ports running at 400 Gb/sec per port, with 64 ports going down to the servers in the rack and 64 ports going up into the spine of the network fabric. But that could change depending on how far they want to push the economics of a port. Some of the flat machine learning clusters will use only 256 ports running at 200 Gb/sec.
In theory, a single Tomahawk 5 could be carved up with cable splitters into 512 ports running at 100 Gb/sec, drastically reducing the cost of a 100 Gb/sec Ethernet port. But the reduction in the number of switch ASICs that comes simply from doubling the bandwidth from one generation to the next is enough to help reduce the cost per port as the bandwidth doubles.
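The port counts quoted above all fall out of dividing the chip's 51.2 Tb/sec by the port speed, with each port built from a whole number of 100 Gb/sec SerDes lanes. A small sketch of that arithmetic follows; the constant and function names are hypothetical.

```python
TOMAHAWK5_GBPS = 51_200  # 51.2 Tb/sec aggregate, expressed in Gb/sec
LANE_GBPS = 100          # per-SerDes signaling rate quoted for Tomahawk 5


def port_count(chip_gbps: int, port_gbps: int) -> int:
    """Number of ports of a given speed that one chip can expose."""
    if chip_gbps % port_gbps:
        raise ValueError("port speed does not divide chip bandwidth evenly")
    return chip_gbps // port_gbps


for port_gbps in (800, 400, 200, 100):
    ports = port_count(TOMAHAWK5_GBPS, port_gbps)
    lanes_per_port = port_gbps // LANE_GBPS
    print(f"{ports:3d} ports x {port_gbps} Gb/sec ({lanes_per_port} lanes per port)")
```

The 100 Gb/sec row corresponds to the theoretical 512-port breakout, where every port is a single SerDes lane.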
To be specific, here’s what moving from Tomahawk 4 or any other 25.6 Tb/sec switch chip to Tomahawk 5 gets you in terms of building 51.2 Tb/sec of non-blocking aggregate bandwidth:
It takes six 25.6 Tb/sec chips interconnected in what amounts to a leaf/spine network inside a switch to provide the same ports that a single 51.2 Tb/sec chip can provide on its own. This has huge economic and thermal implications. Typically, in a generational move at the switch level, a port has 2x the bandwidth and costs about 1.5x per port initially, eventually dropping to 1.3x per port over time. Initially, the cost per bit moved only drops by 25 percent in this scenario, which isn’t great, but it is a Moore’s Law kind of improvement, and you get 2x the bandwidth per port, which is worth something. But the big advantage is that it costs 4x less to provide a total of 51.2 Tb/sec of aggregate bandwidth. So the network can scale much further or just cost less, depending on the topology and the number of ports you need. (A typical hyperscaler or cloud data center has on the order of 100,000 machines.) This kind of shrinkage is how Broadcom was able to increase bandwidth by 80x over a dozen years and reduce the energy consumed moving data (in terms of joules per bit moved) by 95 percent over the same period.
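Both the six-chip count and the 25 percent figure can be reproduced with a few lines of arithmetic. The sketch below assumes a simple two-tier leaf/spine layout in which each leaf chip devotes half of its bandwidth to external ports and half to uplinks; that layout is our assumption about the comparison, and the function names are hypothetical.

```python
def chips_for_nonblocking(target_gbps: int, chip_gbps: int) -> int:
    """Chips needed to expose target_gbps of non-blocking bandwidth.

    Assumes a two-tier leaf/spine: each leaf uses half its bandwidth for
    external ports and half for uplinks, and the spines carry all uplinks.
    """
    if chip_gbps >= target_gbps:
        return 1  # a single chip covers the target on its own
    external_per_leaf = chip_gbps // 2
    leaves = -(-target_gbps // external_per_leaf)  # ceiling division
    spines = -(-target_gbps // chip_gbps)          # uplink total equals target
    return leaves + spines


def relative_cost_per_bit(port_cost_multiple: float, bandwidth_multiple: float) -> float:
    """Cost per bit relative to the prior generation."""
    return port_cost_multiple / bandwidth_multiple


print(chips_for_nonblocking(51_200, 25_600))  # four leaves plus two spines
print(chips_for_nonblocking(51_200, 51_200))  # one Tomahawk 5
print(f"{1 - relative_cost_per_bit(1.5, 2.0):.0%} lower cost per bit at launch")
print(f"{1 - relative_cost_per_bit(1.3, 2.0):.0%} lower cost per bit over time")
```

The last line is derived from the 1.3x per-port figure above rather than quoted from Broadcom: once port pricing settles at 1.3x for 2x the bandwidth, the cost per bit ends up about 35 percent lower than in the prior generation.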
This pace of improvement is what makes the economics of modern hyperscale and cloud networking work. You can see it in the market data for the 400 Gb/sec ramp, and you will see it in the 800 Gb/sec ramp that will start in about a year (it takes about that long to go from switch ASIC sampling to deployment at hyperscalers and clouds) and in the 1.6 Tb/sec ramp that will kick off, if all goes well, about two years after that. Take a look:
Del Vecchio says 400 Gb/sec port volumes have doubled every year between 2019 and 2022, and now account for about 15 percent of total data center switching revenue. The market for ports running at 400 Gb/sec or higher speeds is expected to grow to 57 percent of revenue by 2026, according to the Dell’Oro market research cited in the chart above. Del Vecchio says that most 800 Gb/sec ports will initially be sold with splitters that carve them into two 400 Gb/sec ports. Later, the market will shift to native 800 Gb/sec for new ports, and existing ports can be unsplit to yield 800 Gb/sec ports as well.
One of the things that Broadcom says is a big differentiator for the Tomahawk 5 chips is how they handle buffering inside the ASIC. Broadcom uses a shared buffer inside Tomahawk 5, compared to a sliced buffer inside an unnamed competing design:
“Our shared packet buffer has excellent burst absorption,” says Del Vecchio. “What’s extremely important for machine learning training is congestion control, the speed at which they can ingest data from different sources into the systems running the machine learning frameworks. We have high-precision time synchronization, which can be used to synchronize jobs across the network. We also have hardware-based link failover, and we can determine which links are going to fail by carefully monitoring the codeword errors in the forward error correction that is necessary because of PAM-4 encoding. You can watch and see that a link is degrading over time. For hyperscalers and clouds it is still mostly Clos and fat tree topologies, but we can do this for torus, dragonfly, and other interconnects.”
The Tomahawk 5 chip is sampling now and will ramp over the next year.