NVIDIA Shares Plunge Nearly 5%

On a Monday morning in mid-January, reports emerged that NVIDIA, a prominent name in the AI and computing hardware industry, faced significant setbacks with its next-gen artificial intelligence chip, known as Blackwell. The issues primarily revolved around overheating of server racks and connection irregularities, which raised concerns about the viability of these chips in data center environments.

Such obstacles in the deployment of Blackwell chips notably impacted various high-profile clients, including tech giants like Microsoft, Amazon's AWS, Google, and Meta. Many of these clients had to scale down their orders for the Blackwell GB200 racks, which had been anticipated to revolutionize AI processing capabilities. The anticipated performance enhancements of the Blackwell chip compared to the previous generation, Hopper, had garnered over $10 billion in orders from each major client involved.

The Blackwell chips were celebrated for their remarkable energy efficiency, four times higher than that of Hopper chips, which seemed tailor-made for the burgeoning demands of AI workloads.

Yet, integrating such high-powered chips into server racks proved to be more difficult than NVIDIA had projected. For context, each Blackwell rack towers taller than a standard refrigerator and weighs nearly as much as a compact car, creating unique operational challenges, particularly concerning thermal management. Traditional air cooling cannot cope with the extreme density of these chips, so the racks require a water cooling solution, a daunting task for many AI developers and data center operators unaccustomed to such integration processes.

The repercussions of these deployment issues were swift and severe. Due to the overheating and connectivity problems, numerous clients opted to delay their Blackwell GB200 orders. Some chose to wait for an improved version expected later in the year, while others reverted to relying on NVIDIA's older AI chips as stop-gap measures. Reports suggested that, despite NVIDIA promoting a comprehensive rack solution, some clients might pivot towards acquiring individual Blackwell chips to handle assembly themselves, avoiding the complexities tied to the complete rack system.

Amid these emerging difficulties, NVIDIA still has opportunities for recovery.

Should the company address these technical issues in a timely manner, it might well see clients reinstating their orders. Furthermore, even with the technical challenges surrounding these racks, the Blackwell chips are performing efficiently compared to their predecessors, meaning NVIDIA could potentially find other buyers for the flawed racks.

In November of the previous year, NVIDIA had forecast that the Blackwell chips would contribute billions in revenue and help propel its total annual data center chip revenue from $47.5 billion to a staggering $150 billion. The enticing energy efficiency of these chips was a critical factor that attracted cloud service providers keen to maximize computational efficacy under fixed energy constraints.

The delay in chip deliveries materially disrupted data center deployment plans. Microsoft, as a key server provider for OpenAI, had initially intended to install at least 50,000 Blackwell chips within one of its Phoenix facilities.

However, as delays accumulated from the previous year onward, OpenAI pressured Microsoft to supplement its needs with the older H200 generation chips. Consequently, the Phoenix data center originally planned for numerous GB200 installations became entirely filled with H200 chips instead.

Reports indicate that Microsoft now expects to install only 12,000 Blackwell chips within the GB200 racks by March, less than a quarter of the at least 50,000 it had originally planned. At the same time, an associate involved with Microsoft's strategy hinted that the company also plans to purchase the forthcoming GB300 racks, expected to launch later in the year, underscoring its ongoing reliance on NVIDIA technology despite existing issues.

Initially brimming with confidence, NVIDIA had projected deliveries of the Blackwell racks to commence by the end of the preceding year. However, unforeseen chip design flaws necessitated a three-month postponement of initial deliveries.

Although NVIDIA acted swiftly to address these flaws, by November, fresh reports regarding overheating began to surface. The underlying problem arose during real-world operations: as Blackwell chips operated within specific server racks, excessive heat severely hampered system stability and performance. NVIDIA found itself in a challenging loop, repeatedly liaising with suppliers to adjust the rack designs to mitigate these overheating concerns.

Yet, it became evident that the problems were not fully resolved. According to sources involved in testing the racks, clients noted inconsistencies in data transfer between chips, leading to longer setup times for the Blackwell racks than anticipated. If left unresolved, these issues could result in performance levels falling beneath what NVIDIA had initially promised, posing a risk to the company's reputation and its long-term partnerships.

This entire scenario highlights a crucial aspect of technological advancement: while expectations for new capabilities can be sky-high, the underlying complexity of hardware design and deployment must not be underestimated.