On Thu, Dec 29, 2016 at 8:03 AM, Lauri Kasanen <cand@gmx.com> wrote:
On Thu, 29 Dec 2016 00:44:33 +0000 Luke Kenneth Casson Leighton <lkcl@lkcl.net> wrote:
that's interesting. i worked for aspex semi in 2003 and they had the exact same problem: programming ultra-parallel devices is limited to a few hundred competent people in the entire world.
interesting to me because ericsson bought aspex.
Surely that's changed now, with the ubiquity of modern GPUs? It's entirely normal there to handle hundreds or thousands of threads at once, on massively parallel workloads.
for the ASP: no. not a chance. it was a massively-parallel (deep SIMD) *two-bit* string-array processor with (in some variations) 256 bits of content-addressable memory, that used "tagging" to decide whether any one SIMD instruction would be executed or not.
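to give a rough flavour of what that "tagging" means, here's a tiny illustrative c model of tag-predicated execution - every name, width and number in it is made up for illustration, it is not the real ASP ISA:

    /* toy model of tag-predicated SIMD execution - purely illustrative,
     * all names and widths are invented, not the real ASP ISA */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_APUS 4096          /* assumed number of processing elements */

    struct apu {
        uint8_t reg;               /* tiny per-element value (the real ALU was 2-bit) */
        uint8_t tag;               /* 1 = this element takes part in the next op */
    };

    static struct apu apus[NUM_APUS];

    /* set the tag on every element whose value matches 'pattern' - a crude
     * stand-in for the content-addressable-memory match */
    static void tag_match(uint8_t pattern)
    {
        for (int i = 0; i < NUM_APUS; i++)
            apus[i].tag = (apus[i].reg == pattern);
    }

    /* a SIMD "add immediate" that only touches tagged elements */
    static void simd_add(uint8_t imm)
    {
        for (int i = 0; i < NUM_APUS; i++)
            if (apus[i].tag)
                apus[i].reg += imm;
    }

    int main(void)
    {
        for (int i = 0; i < NUM_APUS; i++)
            apus[i].reg = i & 3;
        tag_match(2);              /* select only the elements holding the value 2 */
        simd_add(1);               /* ... and increment just those */
        printf("apu[1] = %u, apu[2] = %u\n",
               (unsigned)apus[1].reg, (unsigned)apus[2].reg);
        return 0;
    }

the real thing did this at the bit level, in hardware, across thousands of elements at once; the sketch is only there to show the "select by tag, then execute on the tagged ones" pattern.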
it was so far outside of mainstream processing "norms" that they actually had to use gcc -E pre-processor "macros" which expanded into pre-defined pipeline-stuffing c code, loaded with hexadecimal representations of the assembly-code instructions to be sent to the SIMD unit.
a similar trick was deployed by ingenic for their X-Burst VPU: their pre-processing mechanism is a dog's-dinner mess of awk and perl that hunts for appropriate patterns in pre-existing c code, whereas Aspex's technique was simply to put capitalised macros directly interspersed in the c code and let the pre-processing phase explicitly take care of them.
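for a feel of what that looks like in practice, here's a hypothetical sketch of the capitalised-macro style - the macro name, the queue and the hex opcodes are all invented for illustration, this is not Aspex's actual header:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t simd_fifo[64];   /* stand-in for the SIMD instruction queue */
    static int fifo_pos;

    /* a capitalised macro, interspersed directly in ordinary c code, that
     * gcc -E expands into "pipeline-stuffing" writes of raw hex opcode words.
     * both the macro name and the opcodes are invented. */
    #define ASP_ADD_TAGGED()                                            \
        do {                                                            \
            simd_fifo[fifo_pos++] = 0x0000a201u; /* invented opcode  */ \
            simd_fifo[fifo_pos++] = 0x0000000fu; /* invented operand */ \
        } while (0)

    int main(void)
    {
        /* ordinary scalar c code here ... */
        ASP_ADD_TAGGED();            /* ... SIMD work dropped in via the macro */
        /* ... and ordinary c code continues */
        printf("queued %d words for the SIMD unit\n", fifo_pos);
        return 0;
    }

the point of the trick was that after gcc -E the c compiler only ever sees plain writes of magic numbers: it has no idea a SIMD unit is involved at all.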
it was utterly horrible and insane, and it was only tolerated on the *promise* that, at the time each architecture was announced, it could do *CERTAIN* tasks a hundred times faster than the AVAILABLE silicon of the day.
of course... by the time each architecture revision actually came out (18+ months later) the speed of pentium processors had increased so greatly that the advantage had shrunk to only 20, 10 or even 5 times....
to write code for the ASP you measured productivity in DAYS per line of (assembly-style) code. you actually had to write a spreadsheet to work out whether it was more efficient to map the operands single-bit, linearly, one per processor, or to use the "string" feature to process operands spread out in parallel across multiple neighbouring APUs.
the factor which made this analysis so insanely complex was that the "load and unload" had to be done linearly over a standard memory bus, and took a looong time relative to the clock rate of the APUs. thus, if you only needed to do a small amount of computation it was best to use the single-bit technique (4,000 answers in a slower time, to match the "load and unload" time), but if you had a lot of computation to perform it was better to use the parallel technique, in order to keep the little buggers busy whilst waiting for the load or unload.
... or... anything in between. 2, 4, 5, 6, 8, 12, 24, 32, 64, 96, 128 or 256-bit parallel computation: it was all the same to an array-string massively-parallel deep SIMD *bit-level* processor.
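the decision itself is easier to see as a toy calculation. below is a very rough sketch of the kind of sum the spreadsheet had to do: per pass, the computation overlaps the linear load/unload, so the longer of the two dominates, and the layout changes the number of passes, the per-op cost and the per-pass i/o all at once. every figure in it is invented for illustration - none of these are real ASP numbers:

    #include <stdio.h>

    struct layout {
        const char *name;
        int    operands_per_pass;   /* operands resident in the array at once     */
        double cycles_per_op;       /* assumed cost of one operation, this layout */
        double io_cycles_per_pass;  /* assumed linear load+unload time per pass   */
    };

    /* per pass, compute overlaps the linear load/unload stream, so whichever
     * of the two is longer dominates; multiply by the number of passes needed */
    static double total_cycles(const struct layout *l, int num_operands, int ops)
    {
        int    passes  = (num_operands + l->operands_per_pass - 1) / l->operands_per_pass;
        double compute = ops * l->cycles_per_op;
        double io      = l->io_cycles_per_pass;
        return passes * (compute > io ? compute : io);
    }

    int main(void)
    {
        /* two candidate layouts - every figure below is invented */
        struct layout serial = { "one operand per APU (bit-serial)",   4096, 32.0, 262144.0 };
        struct layout spread = { "operand as a string across 32 APUs",  128,  2.0,   8192.0 };
        int num_operands = 65536;

        for (int ops = 8; ops <= 32768; ops *= 8)
            printf("%6d ops/operand: bit-serial %.0f cycles, string %.0f cycles\n",
                   ops,
                   total_cycles(&serial, num_operands, ops),
                   total_cycles(&spread, num_operands, ops));
        return 0;
    }

where the crossover lands depends entirely on the assumed per-op and i/o costs (and on how the arithmetic scales when an operand straddles several APUs), which is exactly why it took a spreadsheet - per layout, per algorithm - to find out.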
but it made programming it absolutely flat-out totally impractical and even undesirable, except for those very very rare cases, usually related to the ultra-fast content-addressable-memory capability.
i.e. extremely, extremely rare.
putting a "normal" c compiler on top of the ASP, or porting OpenCL to it, would be an estimated 50-man-year research and programming effort all on its own. just... not worth the effort, sadly.
l.