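The ATI X1000 series Taiwan launch event was held at 2 p.m. today.
According to ikari, who attended, David Wang (Director of Engineering) presented in person... (wow, the head of ArtX! XD)
But apparently not many people went; looking around, there were only about ten...
And the questions did not get much in the way of answers. XD
In short, ATI's stance on GPGPU is likewise to release an API; everything else (such as H.264 in AVIVO) is still undecided, wait and see.
So there really wasn't much worth reporting, then..... _A_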
source:
http://bbs.gzeasy.com/index.php?showtopic=461982&st=22
Mike Houston's comments on the R520:
mhouston
GP
Joined: 02 Sep 2003
Posts: 241
Location: Stanford University
Posted: Wed Oct 05, 2005 5:01 pm Post subject: A little R520 info
——————————————————————————–
Now that things are public, I can talk about some things:
The board is 32-bit. The precision on ops is slightly better in general than Nvidia, but not in all cases (from GPUBench precision test). ATI cuts corners, much like Nvidia, when it comes to denorms.
Readback rates are still a problem under GL (450MB/s), but not under DX on Nforce4 or ATI chipsets (900+MB/s). There are performance problems on Intel chipsets for some reason. Still below where I’d like to see them, but at least closer to Nvidia performance.
The board has really good latency hiding, much like the R4XX series. Your performance is generally the max(ALU, tex, branch). Where tex is the total fetch latency: 4 cycles for a 128-bit fetch which is a cache hit, and 8 cycles for a 128-bit streaming fetch. You can look at the ClawHMMer paper for more analysis of latency hiding.
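As a rough illustration of the max(ALU, tex, branch) rule above (this sketch is not from the original post): the function name and the example kernel numbers are hypothetical, and it assumes latency hiding is perfect, so the slowest unit alone sets the cost. Only the 4-cycle and 8-cycle fetch costs come from the post.

```python
def estimated_cycles(alu_cycles, cached_fetches, streaming_fetches, branch_cycles):
    """Rough per-fragment cost, assuming latency hiding is perfect."""
    CACHE_HIT_CYCLES = 4   # per 128-bit fetch that hits the cache (from the post)
    STREAMING_CYCLES = 8   # per 128-bit streaming fetch (from the post)
    tex_cycles = (cached_fetches * CACHE_HIT_CYCLES
                  + streaming_fetches * STREAMING_CYCLES)
    # With everything overlapped, the slowest unit determines the total.
    return max(alu_cycles, tex_cycles, branch_cycles)


# Hypothetical kernel: 20 ALU cycles, 6 cached fetches, 2 streaming fetches,
# 5 branch cycles. tex = 6*4 + 2*8 = 40, so the kernel is fetch bound.
print(estimated_cycles(20, 6, 2, 5))  # 40
```

Under this model, adding ALU work to a fetch-bound kernel is free until the ALU count passes the fetch cost.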
The board supports generalized scatter, yet it’s not currently exposed (there is no way to do this cleanly in DX, so it might be GL only (<- I’m working on this)…)
The board has 1.5 ALUs. The half can do add/sub but not MUL/MAD/etc. This gives the X1800XT ~120GFlop peak. Raw MAD rate is 83GFlops, which is lower than Nvidia.
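The peak figures above can be roughly reproduced with back-of-the-envelope arithmetic (not from the original post). The 625 MHz core clock is quoted later in the thread; the 16 fragment pipelines and 4 float lanes per ALU are assumptions made for this sketch.

```python
CORE_CLOCK_HZ = 625e6   # X1800XT core clock, quoted later in the thread
PIPELINES = 16          # assumption: number of fragment pipelines
LANES = 4               # assumption: float components processed per ALU


def gflops(flops_per_lane_per_clock):
    """Theoretical rate in GFlops for a given number of flops per lane per clock."""
    return CORE_CLOCK_HZ * PIPELINES * LANES * flops_per_lane_per_clock / 1e9


mad_rate = gflops(2)    # full ALU only: one MAD = multiply + add = 2 flops
peak_rate = gflops(3)   # full ALU MAD plus one add/sub on the "half" ALU

print(mad_rate, peak_rate)  # 80.0 120.0
```

Under these assumptions the peak matches the ~120 GFlops in the post, while the MAD-only figure comes out at 80 rather than the quoted 83; the post does not say how the 83 was obtained.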
Cache bandwidth is 42GB/s and streaming is 21GB/s for the X1800XT.
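These bandwidth numbers are close to what the fetch costs quoted earlier would predict. A small sketch (not from the original post), assuming 16 pipelines each issuing one 128-bit fetch every 4 or 8 cycles at the 625 MHz core clock of the X1800XT:

```python
CORE_CLOCK_HZ = 625e6   # X1800XT core clock, quoted later in the thread
PIPELINES = 16          # assumption: number of fragment pipelines
BYTES_PER_FETCH = 16    # one 128-bit fetch


def fetch_bandwidth_gb_s(cycles_per_fetch):
    """Aggregate fetch bandwidth in GB/s if every pipeline fetches back to back."""
    return BYTES_PER_FETCH * PIPELINES * CORE_CLOCK_HZ / cycles_per_fetch / 1e9


print(fetch_bandwidth_gb_s(4))  # cache hits: 40.0
print(fetch_bandwidth_gb_s(8))  # streaming:  20.0
```

The estimate gives 40 and 20 GB/s against the quoted 42 and 21 GB/s, so the 2:1 ratio matches but the absolute values are slightly off under these assumptions.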
Branch granularity is ~16 fragments. Branch performance, at least from really basic tests, seems very good.
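One way to picture what a branch granularity of 16 means (a simplified sketch, not from the original post): if any fragment in a block takes a branch, the whole block pays for that path. The function name is hypothetical, and the sketch groups fragments as consecutive runs in a 1D list, which is an assumption; real hardware would group them by screen-space tiles.

```python
def blocks_paying_for_branch(takes_branch, granularity=16):
    """Return (blocks with at least one fragment taking the branch, total blocks)."""
    blocks = [takes_branch[i:i + granularity]
              for i in range(0, len(takes_branch), granularity)]
    return sum(1 for block in blocks if any(block)), len(blocks)


# 64 fragments, 4 of which take the expensive branch.
clustered = [True] * 4 + [False] * 60        # all four land in the same block
spread = [i % 16 == 0 for i in range(64)]    # one per block

print(blocks_paying_for_branch(clustered))  # (1, 4)
print(blocks_paying_for_branch(spread))     # (4, 4)
```

The same number of branching fragments costs four times as many blocks when they are spread out, which is why coherent branching matters.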
ATI has claimed to be more committed to supporting academic research and GPGPU in general. They say they will open up a lot more information about their architecture and provide lower level interfaces to access their hardware. Only time will tell how this will play out.
Let me know if you have other questions, and I’ll try to answer them as soon and as well as I can. At the moment, I only have a X1800XL here, so I’ll try to put up some GPUBench results for the board later today on the GPUBench site.
-Mike
Last edited by mhouston on Thu Oct 06, 2005 12:49 am; edited 1 time in total
[quote]The board supports generalized scatter, yet it’s not currently exposed (no way to do this cleanly in DX, so it might be GL only (<- I’m working on this)…)
[/quote]
Posted: Thu Oct 06, 2005 1:17 am Post subject:
——————————————————————————–
You can have an arbitrary number of outputs from a shader; well, up to the instruction limit I guess, so 512.
You can basically do a[i] = x. The writes are uncached, so there will be a performance penalty (think in the thousands of cycles), but if you do lots of ops, some of the latency can be hidden, at least in theory. You are responsible for making sure fragments don’t clobber each other. Also, you cannot read and write to the same buffer, i.e. no read-modify-write. I haven’t tested it yet, since it’s not exposed currently in any available driver, but the memory controller and memory system were designed to handle this.
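The a[i] = x semantics and its two rules (no clobbering, no read-modify-write) can be emulated on the CPU. This is only a sketch of the behavior described above, not of any real driver interface; the function name and the explicit collision check are hypothetical, and on hardware nothing would detect colliding writes.

```python
def scatter(indices, values, out_size, fill=0.0):
    """Emulate a[i] = x: "fragment" k writes values[k] to out[indices[k]]."""
    if len(indices) != len(values):
        raise ValueError("need exactly one index per value")
    if any(not 0 <= i < out_size for i in indices):
        raise ValueError("index outside the output buffer")
    if len(set(indices)) != len(indices):
        # The caller is responsible for avoiding this; hardware would just clobber.
        raise ValueError("two fragments write the same address")
    # The output buffer is separate from the inputs and is never read here,
    # which mirrors the "no read-modify-write" restriction.
    out = [fill] * out_size
    for i, x in zip(indices, values):
        out[i] = x
    return out


print(scatter([3, 0, 2], [1.5, 2.5, 3.5], 4))  # [2.5, 0.0, 3.5, 1.5]
```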
Posted: Thu Oct 06, 2005 1:29 am Post subject:
——————————————————————————–
Yes. But you can also output more than 16 floating point values (4 float4’s) as well. Both are useful. We’ve been asking the graphics card companies about this one for a while, as it solves some issues with variable output from kernels as well as stream filtering. It’s going to be interesting to see if it’s cheaper than the known methods, like Daniel Horn’s chapter in GPU Gems 2.
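To make the stream filtering use case concrete, here is a sketch (not from the original post) of the standard prefix-sum-then-scatter approach to compaction. It is shown only as one way scatter could be used and is not claimed to be the method from the GPU Gems 2 chapter; the function name is hypothetical.

```python
def filter_stream(values, keep):
    """Compact `values`, keeping elements where keep(v) is true, via scan + scatter."""
    flags = [1 if keep(v) else 0 for v in values]

    # Exclusive prefix sum: addresses[k] is the output slot for element k if kept.
    addresses = []
    running = 0
    for flag in flags:
        addresses.append(running)
        running += flag

    # Scatter step: every kept element writes to its own unique address,
    # so no two writes collide.
    out = [None] * running
    for value, flag, address in zip(values, flags, addresses):
        if flag:
            out[address] = value
    return out


print(filter_stream([5, -1, 7, -3, 2], lambda v: v > 0))  # [5, 7, 2]
```

Because each surviving element gets a distinct address from the prefix sum, the scatter satisfies the no-clobber requirement mentioned earlier in the thread.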
[quote]I just checked the GPUBench page and compared the X1800 to the X800XT PCIe. It seems to me that the computational power remained nearly the same. (instruction issue, scalar vs. vector instruction issue, basic throughput, FP Bandwidth)
The most significant differences seem to be the new branching and 32-Bit support.
So is it faster than the former ATI cards e.g. X850? Or as fast as those cards, but now with 32 Bit support?[/quote]
Posted: Thu Oct 06, 2005 2:07 pm Post subject:
——————————————————————————–
The X1800XL has roughly the same clock rates as the X8XX boards, 500 core/500 mem. The branching, 32-bit, scatter, no dependent texture limit, no dynamic instruction limit, and fully associative cache are the biggest new things. The R520 is ~20% faster than the R4XX clock for clock and has a MUCH better memory subsystem so it handles random reads better. Basically, all our apps got a little faster on the XL, ~10-15%.
The X1800XT is clocked at 625c/750m, so it is substantially faster. We’ve seen compute bound applications get ~30% and memory bound applications get 50-100% depending on the memory access patterns. The latter is from the new cache design (many fewer misses) and the memory subsystem handling incoherent reads much better.