World Community Grid Forums
Thread Status: Active | Total posts in this thread: 17
Sandvika
Advanced Cruncher | United Kingdom | Joined: Apr 27, 2007 | Post Count: 112 | Status: Offline
You lost me there - I'm referring to hyper-threading as implemented in the current high-end desktop and server CPUs from Intel. The optimisation is achieved by ordering the instructions in the execution pipeline to maximise use of the different execution units within the core; having two threads from which to select instructions provides greater scope for that optimisation. The scope for optimisation is purely a matter of sequencing - it has nothing to do with 32-bit versus 64-bit. Nearly all new PC processors have been 64-bit for the last decade, so it would make no sense to design anything new to accommodate 32-bit.
----------------------------------------
Jesse Viviano
Cruncher | United States of America | Joined: Dec 14, 2007 | Post Count: 15 | Status: Offline
I understand that, but 32-bit programs give Hyper-Threading far more optimization opportunities than 64-bit programs, because of design decisions that were sensible when the architecture was created but that changing technology later turned into faults.

Basically, the 16-bit and 32-bit modes of x86 have eight registers per register file. Registers are the fastest type of memory in a processor core and are the only locations where data is guaranteed to stay inside the core no matter what. Unfortunately, registers were expensive when the 16-bit and 32-bit modes were designed (in the late 1970s and mid-1980s respectively), so processor designers did not want too many of them. Registers have since become cheap compared with other CPU hardware such as out-of-order and superscalar execution logic, especially since the turn of the century. In the main register file, two of the eight registers are reserved for critical purposes (the stack pointer and, by convention, the frame pointer), so they must either be left alone or have their contents saved and restored around any use inside a function. Since six to eight registers are generally not enough to hold the working data of almost any modern program segment, the processor is forced to spill some of that data to memory to make room for other data nearly all of the time, and must then go back to memory to retrieve it as needed. These frequent memory accesses often stall threads, opening up bubbles in the pipeline that Hyper-Threading can exploit to run another thread. Caches help, but they have problems: there is no guarantee that the needed data is in the cache, and even when it is in the fastest cache, threads will stall if instructions need the cache at the same time and it does not have enough ports to satisfy all the requests simultaneously.

64-bit mode extends most of the register files to sixteen registers and widens the main integer registers to 64 bits. Suddenly, many small functions can keep all of their data in registers, and therefore cannot stall on memory accesses for as long as they do so. Hyper-Threading loses much of its ability to boost performance in such cases, because it has fewer pipeline bubbles to exploit when one or both threads keep all their data in registers; in that pathological situation, each thread runs at half speed if both hog the same units. Hyper-Threading remains useful for workloads that hog dissimilar units (e.g. an integer workload alongside a floating point workload) or that need so much data at once that they constantly access memory or storage (e.g. database, big data, or file-serving workloads), since those constantly create pipeline bubbles.
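A quick way to see the register pressure described above is to compile one function for 32-bit and for 64-bit x86 and compare the generated assembly. A minimal sketch, assuming GCC on x86 (the file name, function and flags are illustrative only):

```c
/* spill.c - more live values than 32-bit x86 has registers.
 * Compare the generated assembly:
 *   gcc -O2 -m32 -S spill.c   -> spill traffic to the stack
 *   gcc -O2 -m64 -S spill.c   -> values stay in the 16 GPRs
 */
long many_live_values(long a, long b, long c, long d,
                      long e, long f, long g, long h)
{
    long t1 = a * b, t2 = c * d, t3 = e * f, t4 = g * h;
    long t5 = a + c, t6 = b + d, t7 = e + g, t8 = f + h;
    /* Up to sixteen values are live here -- far more than the
     * six or so usable 32-bit general-purpose registers. */
    return (t1 ^ t2) + (t3 ^ t4) + (t5 ^ t6) + (t7 ^ t8);
}
```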
----------------------------------------
Basically, sensible design decisions from when 16-bit and 32-bit x86 were designed became faults that leave the processor waiting for data, and that waiting time is exactly what Hyper-Threading is designed to patch with another thread. 64-bit mode fixes the faults and therefore allows many CPU-hogging programs to run much closer to peak CPU utilization, because their data is guaranteed to be ready when they need it. Consequently, Hyper-Threading often has much less waiting time to patch when one or both threads are in 64-bit mode.
[Edit 2 times, last edit by Jesse Viviano at Jan 19, 2015 4:50:43 PM]
Sandvika
Advanced Cruncher | United Kingdom | Joined: Apr 27, 2007 | Post Count: 112 | Status: Offline
Thanks, I think I get where you are coming from now.
----------------------------------------
Thanks to Lynn Conway's work on dynamic instruction scheduling, the instruction pipeline can be resequenced on superscalar processors 1) to optimise use of the available execution units and 2) to minimise the chance of incorrect branch prediction. For example, a branch instruction relying on the result of an integer operation will, where possible, not be executed immediately after it, since the cost in wasted processor cycles of a mispredicted branch would be very high. Her innovation, and the need for assemblers to be aware of it, marked the end of self-modifying code - which was always a bad idea for debugging anyway.

My understanding is that the purpose of hyper-threading is to enhance these capabilities:
1) to separate dependent instructions on the same thread further than is otherwise possible, by inserting instructions from the other thread, so that an input is hardly ever stalled waiting on a previous output (a 1-to-1 interleave of the two threads becomes the theoretical worst case in the pipeline);
2) to separate branch instructions far enough from the instructions they depend on that branching is certain, or branch prediction almost never wrong; and
3) to optimise the use of the different execution units within the core, so that, for example, a 256-bit-wide floating point instruction which takes multiple clock cycles to load its operands and retrieve its result can execute in parallel with independent integer or branch instructions from either or both threads.
This is achieved in the hardware design by doubling up ALL the architectural registers, so that each hyper-thread retains its full context concurrently with, yet independently of, the other. (A single-threaded analogue of the interleaving in point 1 is sketched after this post.)

The difference you refer to looks to me more like the difference in cost between a context switch on an old process-oriented scheduler and one on a contemporary thread-oriented scheduler. Since a context switch (whether minimal or maximal) essentially amounts to pushing and popping the registers off a stack that will often be retained in level 1 cache, it is still orders of magnitude faster than accessing RAM. In fact, in contemporary Intel processors even the shared level 3 cache runs at processor clock speed, let alone the level 1 and level 2 caches, so the cost of any context switch (even for single-threaded processes like WCG) is minimal. Furthermore, in multiprocessor systems employing NUMA to increase memory bandwidth, with the PCIe lanes similarly distributed across the processors, QPI provides much faster L3-to-L3 inter-processor transfer than would be achievable via a RAM-based sharing architecture. So, for example, the overhead of a disk controller sitting on a PCIe bus belonging to a different processor than the one running the dependent thread(s) is also minimised.

So, again, 32-bit vs 64-bit doesn't come into it - except that, for example, I would not expect any 32-bit compiler/assembler to handle Intel Advanced Vector Extensions (AVX), so the 32-bit calculation would be much more long-winded. Which brings me back to where I started: the CEP2 WUs being 32-bit and BOINC being hyper-thread unaware is a poor use of the available CPU power in the best case, and a complete waste of CPU time if a WU is terminated prematurely. Consequently I am optimising my contribution to WCG by now running 64-bit WUs on my multi-processor, multi-core, HT-enabled and AVX-capable server, and limping towards my gold CEP2 badge on my laptop.
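A single-threaded analogue of the bubble-filling in point 1, as promised: interleaving two independent dependency chains overlaps their latencies in just the way instructions from a second hyper-thread fill the stalls of the first. A minimal sketch (the functions and data are illustrative only):

```c
/* One accumulator: every add waits on the previous add, leaving
 * pipeline bubbles equal to the floating-point add latency. */
double sum1(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two independent accumulators: while one chain's add is still in
 * flight, the other chain's add can issue -- the same trick
 * hyper-threading performs using a second thread's instructions. */
double sum2(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (n & 1)              /* odd element count: pick up the last one */
        s0 += x[n - 1];
    return s0 + s1;
}
```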
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
At some point there was talk of grid work for CEP2 that needed more than 4 GB of memory, which would have required a 64-bit compile; I've heard next to nothing about it for quite a long time. Until now, though, I've happily 'limped' on hyper-threaded devices to well over 3 years of contribution to CEP2, and I will continue doing that with 1 thread per host, 24/7. :)
Jesse Viviano
Cruncher | United States of America | Joined: Dec 14, 2007 | Post Count: 15 | Status: Offline
That is not where I was coming from. My point was that 32-bit x86 goes to the memory system more often than other current architectures even when there is no other thread to switch to, so it carries more memory-system overhead even in a purely single-threaded situation. A modern x86 processor running a 16-bit or 32-bit program usually spends much of its time waiting for the memory system.

L1 cache runs at the speed of the processor, and x86 chips are optimized to use results directly from L1; cache misses, however, can ruin that. Also, instructions that access memory directly are broken down into more micro-ops (the simple internal instructions that x86 instructions are translated into by the decoders, because raw x86 instructions are too complex for high-speed execution), since more work is required of them. Furthermore, if two instructions that are otherwise ready to execute in parallel both need the cache, they fight over its ports and one has to wait for the next cycle, reducing the number of instructions per cycle (IPC). Instructions that reference only registers generate no memory micro-op overhead and therefore leave room for more computation micro-ops in flight, raising IPC. L2 cache often runs at the processor's clock speed, but it takes longer to search, because its lookup logic has to be simpler for the cache to be dense enough to work. L3 caches are shared among multiple threads and must be denser still, so they are slower again despite their matching clock speeds, and they suffer contention from the multiple threads sharing them.
----------------------------------------
AMD64 provides more registers for 64-bit programs, which has several benefits. First, more instructions refer only to registers, so fewer micro-ops are issued for memory-system overhead and more micro-ops doing real work can be held in the processor. Since every micro-op in a thread has guaranteed access to the registers (as long as two micro-ops do not try to write the same register at once), there is less fighting over the cache, greater parallelism, and greater IPC. Second, because fewer instructions perform memory accesses, there is less contention across the whole memory system, raising effective memory throughput. The one significant downside of having more registers in a high-performance processor is that more registers must be stored to memory when a context switch happens. All of the problems that 32-bit and 16-bit x86 generate are opportunities for Hyper-Threading to exploit, because they leave execution units idle.

Your idea about optimizing branches is a brilliant use of Hyper-Threading, and it works in 64-bit mode applications as well. As for what is hyper-threading unaware: BOINC plays a small part if it is unaware, but the operating system plays a large part in damaging performance if it is. BOINC should disable half the threads when a task is in danger of missing its deadline (e.g. in earliest-deadline-first mode), unless the task is a multithreaded task using more threads than there are physical cores. A hyper-threading-aware OS would then redistribute threads so that every physical core stays busy; an unaware OS might leave some physical cores idle while loading others with two threads, hurting the threads on the doubly-loaded cores. (A sketch of that one-thread-per-core placement follows below.)
[Edit 2 times, last edit by Jesse Viviano at Jan 22, 2015 3:03:51 AM]
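To illustrate that last point, here is a hedged sketch of the placement a hyper-threading-aware scheduler aims for: one compute thread pinned per physical core. It assumes Linux with pthreads and the common (but not universal) enumeration in which logical CPUs 0 to cores-1 are the first sibling of each physical core; a real implementation would read /sys/devices/system/cpu/cpuN/topology/thread_siblings_list rather than assuming an enumeration.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to a single logical CPU. With the enumeration
 * assumed above, pinning worker i to CPU i for i < physical_cores
 * gives each worker a whole physical core and leaves the sibling
 * hyper-threads idle -- the "disable half the threads" behaviour
 * described in the post. Returns 0 on success. */
static int pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}
```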
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Five of us have now failed due to an error on:
E227707_591_S.304.C41H30N4.QLVKWKOFLWZPKL-UHFFFAOYSA-N.13_s1_14
Minimum Quorum: 1
Replication: 1
The highest CPU Time / Elapsed Time for any of us was 11.44 hours; mine was 10.16 hours. I can't see how the 18-hour limit had any play in this. Guess it happens - hope it means something to somebody.
Sandvika
Advanced Cruncher | United Kingdom | Joined: Apr 27, 2007 | Post Count: 112 | Status: Offline
Thanks Jesse, I think I finally got there. The gist of what you are saying is that the old 32-bit processors had few registers, so the 'great big pool' of registers in the register file of 64-bit processors is mostly unused when running 32-bit binaries, which is why those binaries head off to cache/RAM much more often than 64-bit applications. Yes, that makes sense. The long-term issue has been, and remains, feeding the processors quickly enough to keep them busy. Throughput rather than peak speed was my key criterion when building my computer, hence opting for a pair of E5-2620 v2 processors rather than an i7-4960X: lots of cache, hyperthreads, and almost double the memory bandwidth (since it's a NUMA architecture). It just happens that the CEP2 project bites back in this case.
----------------------------------------
For the L3 cache, it is actually segmented per core, with a ring stop per core, and Intel claims this means the bandwidth increases as the core count increases, which obviously helps scalability. There's also a comparison of the micro-architectures showing how Intel and AMD have opted for rather different approaches.

So I think we are essentially in agreement that by providing only 32-bit executables, CEP2 is shunning most of the 'new' capability of most processors produced in the last decade! Searching the forum, in August 2013 it was suggested that 64-bit only improved things by 4 to 8% and was slower in some cases. In 2011 a CEP scientist said that the only benefit was increased memory addressability, and that this was of no use to the project!! There was an earlier claim that 64-bit processors only provide 32-bit floating point. Yet there are switches for the 64-bit compilers to exploit the 256-bit AVX instructions, with fallback to SSE2 where AVX is not present (a sketch of this follows after this post) - so a 32-bit scalar build is using perhaps 1/8th of the available vector width. This floating-point capability is why these Intel and AMD processors are used in supercomputers. Though the Q-Chem software used for CEP2 has been available in 64-bit since before this project, I expect they decided to enable participation by as many people as possible on WCG, at the expense of efficiency, by looking at the aggregated profiles of participants - i.e. they also tuned it for throughput.

In the final analysis, since I noticed WUs being killed after 18 hours, the chance of one actually being recorded as "error" status and receiving no credit was about 30%. Half of those were killed in job zero (which I understand to be the point of the kill, because some of the model permutations result in 'massive' calculations that take many times longer and can't checkpoint often enough to be workable) and the rest were killed in jobs 1 to 4. WUs killed in job 5, 6 or 7 - the remaining 70% of those killed - seem to be sufficiently advanced to be validated and receive credit. My laptop is less than a day away from pushing my contribution over the line for the gold badge, and I have already changed its profile. Since I'm caught between hyperthreads that are too slow on my desktop and a laptop that needs WCG shut down at the drop of a hat, this project is the wrong one for me, so I'll duck out and leave it to others.
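On the AVX-with-SSE2-fallback point above, here is a minimal sketch of how a 64-bit build can get that behaviour from a single binary, using GCC's target_clones function multiversioning (GCC 6 or newer; the function and loop are purely illustrative - nothing here is taken from the actual Q-Chem/CEP2 build):

```c
/* GCC emits an AVX version plus the baseline version (x86-64's
 * "default" already implies SSE2) and dispatches between them at
 * run time based on CPUID, so one binary serves both old and new
 * processors. */
__attribute__((target_clones("avx", "default")))
void scale(double *x, double s, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= s;   /* when vectorised: 4 doubles per iteration
                        under AVX, 2 under SSE2 */
}
```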