Following last month’s sixth anniversary of AMD’s Opteron and the related talk about the sexa-core Istanbul coming out next month, a fresh mention of Magny-Cours and its gigantic Socket G34 has appeared. But hey, AMD did ridicule Intel’s dual-die MCM approach to quickie quad-core solutions before – so what is this now?
Yeah, it is an MCM, and a big one – not as large as those multi-CPU POWER5 and POWER6 modules, but still larger than anything Intel put into the LGA775 and LGA771 sockets over the past few years. It has to provide space for two Istanbul dies, plus some reserve area in case the Bulldozer-and-beyond dies end up needing a bit more girth.
So, is AMD architecturally doing a better job than Intel with these MCMs? Let’s look at what we should get here: the G34 substrate carries two 45 nm dies, each with six cores, a private 512 KB L2 per core, and a shared – probably undersized – 6 MB L3 per die. I feel that AMD, having been at the forefront of dense cache memory cell technologies [anyone remember Z-RAM?], should have added a bit more here when it jumped from four to six cores. Remember, Core i7 and Gainestown Nehalems [Nehalem-EP] have 8 MB of L3 per four cores, fed by a 216-bit triple-channel DDR3 interface, while Beckton [Nehalem-EX] will have eight cores and 24 MB of L3, fed by a 288-bit quad-channel memory controller. So, for memory-intensive threads – where Istanbul has 6 MB of L3 per six cores, fed by just 144-bit dual-channel DDR3 – a cache increase would have been helpful, as long as the latency is managed.
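To make the cache complaint concrete, here is the L3-per-core arithmetic implied by the figures above, as a minimal sketch [the figures are the ones quoted in the text, nothing more]:

```python
# L3 cache per core for each part discussed above, using the
# (cores, L3 megabytes) figures quoted in the text.

platforms = {
    # name: (cores, l3_megabytes)
    "Istanbul (per die)":   (6, 6),
    "Nehalem-EP":           (4, 8),
    "Beckton (Nehalem-EX)": (8, 24),
}

l3_per_core = {name: l3 / cores for name, (cores, l3) in platforms.items()}

for name, ratio in l3_per_core.items():
    print(f"{name}: {ratio:.2f} MB of L3 per core")
```

Istanbul ends up with 1 MB of L3 per core against 2 MB for Nehalem-EP and 3 MB for Beckton – that is the gap being complained about.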
Now the good stuff: instead of sharing the FSB between the two CPU dies in a single package and then with the North Bridge [a three-load FSB obviously has a speed penalty compared to a two-load one], AMD can, as I have suggested many times before, make good use of HyperTransport to link the two Istanbul dies directly at full speed on the substrate – in fact, at much more than the full standard HT 3.1 speed.
Each die in the Magny-Cours MCM should have four HT 3.1 connections, each providing 25.6 GB/s of bidirectional bandwidth in 16-bit mode. And, as you know, the HT3 spec supports link splitting as well as link aggregation, up to a maximum 2×32-bit bidirectional link.
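For those who want to see where the 25.6 GB/s comes from, here is the arithmetic: HT 3.1 runs the link clock at 3.2 GHz with double-pumped signalling, so each lane moves 6.4 GT/s, in both directions at once:

```python
# How the 25.6 GB/s figure falls out of the HT 3.1 numbers: a 3.2 GHz
# link clock, double-pumped, gives 6.4 GT/s per lane; a 16-bit link
# carries 2 bytes per transfer, in each direction simultaneously.

def ht_bandwidth_gbs(link_clock_ghz=3.2, width_bits=16):
    """Aggregate bidirectional bandwidth of one double-pumped HT link, in GB/s."""
    transfers_per_sec = link_clock_ghz * 2           # DDR signalling -> GT/s
    bytes_per_transfer = width_bits / 8
    return transfers_per_sec * bytes_per_transfer * 2  # count both directions

print(ht_bandwidth_gbs())               # one 16-bit link: 25.6 GB/s
print(ht_bandwidth_gbs(width_bits=32))  # aggregated 32-bit link: 51.2 GB/s
```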
Out of the four links per die, two from each die will go outside the package to link with other Magny-Cours parts, FPGA accelerators, or maybe even HT-based shared-memory cluster interconnect controllers. That gives four external HT3 links per G34 package. As for the remaining two HT links per die, they get, well, linked together! This move inside the K10.5 architecture only makes us wonder why AMD didn’t push the Torrenza “HT-glued” platform harder.
Even at the standard HT 3.1 speed, and without taking any benefit of the ultrafast MCM substrate, the doubled 51.2 GB/s total bandwidth and halved round-trip latency will help manage the NUMA hop penalty between the two dies. However, if AMD has balls [which very often it didn’t in the past few years], it could literally double the HT 3.1 bandwidth between the dies again, by running these internal HT links off a separate, faster clock than the external HT links to the other CPUs. Over 100 GB/s between the dies? No problem – IBM moves even more than that in the POWER5+ and POWER6 architectures.
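The two inter-die scenarios sketched above reduce to a couple of multiplications [the doubled-clock case is, as said, pure speculation on my part]:

```python
# Inter-die bandwidth scenarios for the two on-substrate HT links:
# both links ganged at spec speed, and the speculative case where
# their clock is doubled relative to the external links.

PER_LINK_GBS = 25.6   # one 16-bit HT 3.1 link, bidirectional

spec_speed = 2 * PER_LINK_GBS     # two links at spec: 51.2 GB/s
double_clock = spec_speed * 2     # hypothetical doubled clock: 102.4 GB/s

print(spec_speed, double_clock)
```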
The implication? Nearly no bandwidth penalty for inter-die memory transactions compared to on-die ones, and just a small latency hit. AMD has always had higher memory utilization than Intel [in the range of 90–94%, compared to 70-odd percent for Intel], but can the company pull off such a cross-link memory exchange? Only time will tell.
Talking about memory: yes, in this case both pairs of DDR3 channels are exposed to the board. So, 12 DIMMs per CPU socket – not bad. Also, each die can access both its own DDR3 channels and its sister die’s channels at the same time, or at least queue up for access if the other die also wants a bite of it. Looking at the numbers, with DDR3-1333 memory you should have 42.66 GB/s of total memory bandwidth per socket. Twelve 4 GB DDR3-1333 DIMMs would give you 48 GB per socket, and if you go with custom DIMMs from MetaRAM, you could even end up with 192 GB of DDR3-1066 memory per single CPU socket. In a 4-way box, you could have 48 cores and 768 GB of DDR3 memory!
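The memory arithmetic above, spelled out as a quick sketch. The DIMM and channel counts come from the text; the 16 GB MetaRAM DIMM size is merely inferred from the quoted 192 GB-per-socket figure [192 / 12 = 16]:

```python
# Per-socket bandwidth and capacity for G34, from the figures above.

channels_per_socket = 4        # two DDR3 channels per die, two dies
gbs_per_channel = 10.667       # one 64-bit DDR3-1333 channel

bandwidth_per_socket = channels_per_socket * gbs_per_channel  # ~42.7 GB/s

dimms_per_socket = 12
standard_capacity = dimms_per_socket * 4    # 4 GB DIMMs -> 48 GB/socket
metaram_capacity = dimms_per_socket * 16    # inferred 16 GB DIMMs -> 192 GB/socket

sockets = 4
print(metaram_capacity * sockets)           # total memory in a 4-way box
print(6 * 2 * sockets)                      # six cores x two dies x four sockets
```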
Now, could Intel do the same with Beckton? Yes – if it wanted to create a new, even larger socket with a 200+ W TDP, it could have a 16-core, 32-thread monster built from two Beckton Nehalem-EX dies and eight memory channels in a single socket. The same benefits of sped-up QPI links on the substrate would apply here too, with even higher bandwidth between the dies. Will they do it? Not in the 45 nm generation, I think. Once the 32 nm shrinks come along, I won’t be surprised at all to see dual-die MCM Westmere or Sandy Bridge chips with upwards of 12 or 16 cores…
Original Author: Nebojsa Novakovic