Nvidia takes the wraps off Kepler, its next-generation GPU, and quite possibly the greatest step forward for graphics processing since it unveiled the G80 in November 2006.
The first desktop card out of the gate is the GTX 680. Unlike the 5xx series, which was based on a refined version of the Fermi architecture Nvidia debuted back in 2010, Kepler uses a new GK104 GPU — and its design is a sharp departure from Nvidia’s previous architectures.
Over the past five years, Nvidia’s GPU strategy has more or less amounted to “Everything+Kitchen Sink and we’ll sort things out when we do the refresh.” After the disastrous debut of its R600 architecture in 2007, AMD adopted a strategy of building smaller, mid-range oriented parts and doubling them up to address the high end of the market — Nvidia, in contrast, adamantly stuck to its monolithic guns. Until Kepler.
Here’s a table comparing the vitals of Nvidia’s GT200, which debuted in 2008 (the “Tesla” moniker refers to the GPU family, not the high-end scientific computing cards), Fermi, and GK104. The transistor counts below are from Nvidia; Kepler’s die size is estimated but should be close to the mark. Kepler’s die size and transistor count are notable achievements in and of themselves, but we’ve barely scratched the surface of the new core.
Shaders are now clocked at the same speed as the graphics core. Kepler’s core clock is 30% higher than Fermi’s and the chip packs 3x as many cores, but we want to highlight a change Nvidia wouldn’t explain during its presentation: the GK104’s cores aren’t as capable as the GF110’s. With 3x the core count and a 30% clock speed boost, Kepler “only” offers twice the GFLOP throughput. Not that that’s a bad thing.
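The arithmetic is easy to check against the published reference clocks: the GTX 680 runs its shaders at the 1006 MHz core clock, while the GTX 580’s shaders ran at 1544 MHz, twice its 772 MHz core clock. A quick sketch, assuming the usual one fused multiply-add (2 FLOPs) per core per shader clock; the `peak_gflops` helper is ours, not Nvidia’s:

```python
# Back-of-the-envelope peak FP32 throughput, assuming one fused
# multiply-add (2 FLOPs) per core per shader clock.
def peak_gflops(cores, shader_clock_mhz, flops_per_clock=2):
    return cores * shader_clock_mhz * flops_per_clock / 1000

# GTX 580 (GF110): shaders ran at 2x the 772 MHz core clock.
gf110 = peak_gflops(512, 1544)
# GTX 680 (GK104): shaders run at the 1006 MHz core clock.
gk104 = peak_gflops(1536, 1006)

print(f"GF110: {gf110:.0f} GFLOPS")    # ~1581 GFLOPS
print(f"GK104: {gk104:.0f} GFLOPS")    # ~3090 GFLOPS
print(f"ratio: {gk104 / gf110:.2f}x")  # roughly 2x, despite 3x the cores
```

The missing factor is the shader clock: triple the cores, each running at roughly two-thirds the old shader speed, works out to about twice the raw throughput.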
A number of other GPU resources have been shuffled around as well.
Nvidia’s ratio column is remarkably unhelpful; it only describes the increase from Fermi to Kepler rather than how resources are distributed relative to each other. GK104 packs four times the special function units (SFUs) and twice the texture units of GF110; the core is capable of processing twice as many instructions per clock (though it has three times as many cores to fill with those instructions).
One area Nvidia did shed some light on is the changes it made to its warp scheduler. In weaving (with a loom), the term “warp” refers to the longitudinal threads in a pattern; Nvidia uses the term to mean a group of threads. For our purposes it roughly corresponds to the thread scheduler.
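Concretely, a warp on Nvidia hardware is a group of 32 threads, and the scheduler issues work one warp at a time, so thread counts are effectively rounded up to the nearest multiple of 32. A toy illustration (`warps_needed` is our own helper, not part of any Nvidia API):

```python
import math

WARP_SIZE = 32  # threads per warp on both Fermi and Kepler

def warps_needed(threads):
    # The hardware schedules whole warps, so a block of, say, 100
    # threads still occupies 4 warps (128 thread slots).
    return math.ceil(threads / WARP_SIZE)

print(warps_needed(100))  # 4
print(warps_needed(256))  # 8
```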
Fermi’s scheduler was designed with hardware stages to “prevent data hazards in the math datapath itself.” Registers were tracked and checked before data was issued to ensure that they were ready for new instructions, while decoded instructions were kept available for fast dispatch when applicable. Kepler simplifies this structure and handles some of the checking in software; instruction dispatch latencies are now determined ahead of time by the compiler and encoded alongside the instructions themselves.
The company notes: “We also developed a new design for the processor execution core, again with a focus on best performance per watt. Each processing unit was scrubbed to maximize clock gating efficiency and minimize wiring and retiming overheads.”
What all this adds up to is a rearchitected GPU with a focus on power efficiency that’s been notably lacking from the company’s previous high-end efforts. Those of you familiar with Nvidia’s historic naming schemes will recognize the GK104 moniker as one that Team Green would typically reserve for a mid-range GPU. Thus far, there’s no indication of a higher-end part in the works, and no obvious places where NV might have disabled compute units to improve yields, as it did with GF100.
Nvidia wasn’t able to ship us a card for testing — heck, the company wasn’t even able to brief us until less than 24 hours ago — so we have to preface our data with the hefty caveat that these figures are drawn from Nvidia’s own testing. The only good news is that these figures are from the company’s whitepaper rather than poorly labeled slides, meaning we were able to at least check the appendices for config details. The company also included results in a wide range of titles and two prominent resolutions. Generally speaking, the more confident a company is in a product’s performance, the more data it will hand you on launch day.
For those of you who are curious, the GTX 680 is a consistent 14% faster than the highest-end Radeon 7970. That’s not a margin that blows the doors off, but there are other factors to consider. The GTX 680 is priced at $499 (we’ll see if NV can hold the price there post-launch), while the Radeon 7970 is $50 more. This time around, Nvidia appears to have beaten AMD on both die size and transistor count. Other factors, such as opting for 2GB of RAM instead of 3GB, and using a smaller 256-bit memory bus instead of the Radeon’s 384-bit option, also tilt cost structures in Nvidia’s favor.
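The narrower bus does cost Nvidia some peak memory bandwidth, which the published effective memory data rates (6008 MT/s for the GTX 680, 5500 MT/s for the Radeon 7970) make easy to quantify; `bandwidth_gbs` is our own illustrative helper:

```python
# Peak memory bandwidth = (bus width in bytes) x (effective data rate).
# Data rates are the published reference specs for each card.
def bandwidth_gbs(bus_bits, data_rate_mtps):
    return (bus_bits / 8) * data_rate_mtps / 1000  # GB/s

gtx680 = bandwidth_gbs(256, 6008)  # ~192 GB/s
hd7970 = bandwidth_gbs(384, 5500)  # ~264 GB/s
print(f"GTX 680: {gtx680:.0f} GB/s, HD 7970: {hd7970:.0f} GB/s")
```

In other words, the Radeon has roughly a third more raw bandwidth on paper, which makes the GTX 680’s performance lead all the more notable — and the smaller bus means a simpler board and cheaper memory configuration.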
It’s been years since Nvidia has been able to claim a GPU that decisively took home both the power efficiency and performance crowns, but the GK104-based GTX 680 appears to have done just that. We’ll reserve final judgment until we’re able to run our own numbers, but this chip is impressive on multiple fronts. The one fly in the ointment is its GPU compute performance; figures on that front were very noticeably absent from Nvidia’s briefing yesterday, but the technical data available suggests that GK104 trades some raw math muscle for its new gaming oomph. Then again, that’s not necessarily a bad thing — AMD has effectively left the GPGPU compute field (at least where scientific computing is concerned).