http://arstechnica.com/hardware/news/2009/09/ibms-8-core-power7-twice-the-muscle-half-the-transistors.ars
IBM’s 8-core POWER7: twice the muscle, half the transistors
Speaking of a POWER7 core’s back end, each core contains a very robust suite of execution resources. There are 12 execution units in total, broken down as follows:
- 2 integer units
- 2 load-store units
- 4 double-precision floating-point units
- 1 branch unit
- 1 condition register unit
- 1 vector unit
- 1 decimal floating-point unit
話說好像把POWER7整個遺忘了….4DP乘加+1個vector+1個十進位浮點。
雖然不能同時使用,不過4個DP乘加本身就已經有點誇張。記得Nehalem和Tukwila都是2乘加?
這樣反而變成load/store和記憶體頻寬吞吐夠不夠….能省電晶體的原因好像是因為32MB on-die L3用eDRAM。
—-
總是有人要開第一槍(噴飯)
http://www.brightsideofnews.com/news/2009/9/30/nvidia-gt300s-fermi-architecture-unveiled-512-cores2c-up-to-6gb-gddr5.aspx
nVidia GT300’s Fermi architecture unveiled: 512 cores, up to 6GB GDDR5
GPU specifications
This is the meat part you always want to read fist. So, here it how it goes:
* 3.0 billion transistors
* 40nm TSMC
* 384-bit memory interface
* 512 shader cores [renamed into CUDA Cores]
* 32 CUDA cores per Shader Cluster
* 1MB L1 cache memory [divided into 16KB Cache – Shared Memory]
* 768KB L2 unified cache memory
* Up to 6GB GDDR5 memory
* Half Speed IEEE 754 Double Precision
DP一口氣拉高到半速(等於全速啦),每個shader cluster變成32core,不過其實看起來比較像是4個以前的8sp合併成32sp,同時shared memory也合併了4片變成64KB,同步的需求就變小了。
話說cluster變大,那麼和TMU之類的比例其實也可能變了,比方說可能變回2個cluster一組大core?然後又變成8x2x[32sp+16TMU]這樣。
這次最重要的是有個全core共用的768KB L2 unified cache memory,而且合併的64KB share memory可以切成兩塊(16KB+48KB),各自當成scratch pad memory和cache(用途可互換)。至於RV870的32KB LDS只能當scratch pad不能當cache。