| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 15
|
|
| Author |
|
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline Project Badges:
|
To give this project a boost after hearing that it was moving slowly, I decided to put all my machines on it full time. The idea was to move from my current 4 years of runtime to 10 years, that is, to add 6 years of runtime. "All my machines" means 21 machines with 12 threads per machine.

It appears that 4 machines are, for the moment, allergic to this project: they generate a lot of errors. The problem is that many of these errors happen after CPU runtimes that are not insignificant. The vast majority fail at around 20 minutes per WU, but some run for 6 hours or even more. I get zero credit for these even though credit is claimed, so adding up all these runtimes makes for days of wasted CPU time. For now I have switched all the allergic machines to FAAH (the goal there is 30 years). As my machines have pretty much the same hardware and software, it is impossible to identify the reason for this allergy.

Another impact I see is that overall crunching performance has gone down. The high upload volume also slows crunching noticeably, as the CPU has to devote more time to that task. In the end this project is really not very CPU efficient. Maybe the best thing is to limit the number of WUs per machine, say to six units of CEP2 mixed with the other project, FAAH, for more balanced crunching. |
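For anyone who wants to check the numbers in the post above, here is a rough sketch in Python. The machine count, thread count and 6-year target come from the post. The rest is assumption, not something the poster stated: every thread crunches around the clock, every result is valid, and a year is counted as 365 days. The function name is made up for illustration.

```python
# Rough arithmetic for the runtime goal described in the post above.
# From the post: 21 machines, 12 threads each, 6 extra years of runtime.
# Assumptions (not from the thread): all threads crunch 24 hours a day,
# no results error out, and one year is counted as 365 days.

MACHINES = 21
THREADS_PER_MACHINE = 12
EXTRA_RUNTIME_YEARS = 6
DAYS_PER_YEAR = 365


def wall_clock_days(machines, threads, runtime_years, days_per_year=DAYS_PER_YEAR):
    """Days of real time needed to accumulate the given CPU runtime."""
    concurrent_tasks = machines * threads
    runtime_days = runtime_years * days_per_year
    return runtime_days / concurrent_tasks


total_threads = MACHINES * THREADS_PER_MACHINE
days_needed = wall_clock_days(MACHINES, THREADS_PER_MACHINE, EXTRA_RUNTIME_YEARS)
print(f"{total_threads} threads -> about {days_needed:.1f} days of wall-clock time")
```

Under those assumptions the full fleet would need a bit under nine days, which is why tasks that error out after hours of CPU time add up to a visible loss.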
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The "advisory sheet" says that half of the total threads in a host is, grosso modo, the limit for reasonable performance. Running on all cores is taxing and would have to be tested on a per-device basis.

What is the error? By chance the same as the one you posted in the current beta thread? "too many exits" is a classic sign of an overloaded cruncher. Given that you're the only one so far reporting this on a test of probably about 40,000 WUs, it is probably not the science app, not forgetting that this Beta is the same app as used for GFAM and DSFL. Can't remember if you had problems with those, but I think the common denominator in this case is too many concurrent CEP2 tasks, should those Betas be running on the same hosts.

--//--
----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 1, 2012 11:26:37 PM] |
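The "half of total threads" rule of thumb above can be written out as a small Python sketch. It only does the arithmetic and does not change any BOINC or World Community Grid setting. The function names and the dictionary keys are hypothetical, chosen just for this illustration.

```python
# Sketch of the "half of total threads" rule of thumb from the post above.
# cep2_cap and split_threads are illustrative names, not part of BOINC or WCG.


def cep2_cap(total_threads: int, fraction: float = 0.5) -> int:
    """Suggested maximum number of concurrent CEP2 tasks on one host."""
    if total_threads < 1:
        raise ValueError("total_threads must be at least 1")
    # Round down, but always allow at least one CEP2 task.
    return max(1, int(total_threads * fraction))


def split_threads(total_threads: int) -> dict:
    """Split a host's threads between CEP2 and other projects."""
    cep2 = cep2_cap(total_threads)
    return {"cep2": cep2, "other": total_threads - cep2}


print(split_threads(12))
```

For a 12-thread host this gives 6 CEP2 tasks and 6 threads for other projects, the same split as the "six units of CEP2 mixed with FAAH" idea in the opening post. As the post says, the real limit would still have to be tested per device.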
||
|
|
KWSN - A Shrubbery
Master Cruncher Joined: Jan 8, 2006 Post Count: 1585 Status: Offline |
CEP2 is not a science that runs well with itself. Your efficiency will suffer horribly if you attempt to utilize all cores concurrently.

Another thing that can cause errors (if you're running Linux) is a congested network. Again, this is an area where CEP2 can overwhelm a system when you're running that many tasks. As Sekerob suggested, run fewer tasks at a time and mix them with other projects. This should fix the errors as well as the efficiency.
----------------------------------------
Distributed computing volunteer since September 27, 2000 |
||
|
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline Project Badges:
|
You are both right. The errors are all similar and related to the famous "No heartbeat for 30 seconds", which is probably the sign of an overloaded cruncher.

What is amazing is that only four boxes generate those errors, and they are identical to others that report no errors while also running 12 CEP2 threads. Anyway, to avoid any "contamination" I will lower the number of units loaded on the CPU at the same time.
----------------------------------------
[Edit 1 times, last edit by Hypernova at Feb 2, 2012 6:58:37 AM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hypernova, is it possible that the computers you are having problems with have a heat problem? The builds may be the same, but not all heatsinks are. Even a bad application of heatsink compound could give you overheating in spots, and heatsink compound does deteriorate over time.
----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 2, 2012 8:59:48 AM] |
||
|
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline Project Badges:
|
Fluxcore, I do check core temps regularly on each machine, and with the low external temps we are having these days, heat is not really an issue. As my machines are all OC, I set them so that max temps remain under 70 deg Celsius.

The boxes are also regularly vacuumed so that filters and fins remain pretty clean. I must say that in Switzerland, where I am, the air is pretty pure and dry alpine air. Very little dust accumulates. The low humidity also does not help dust stick to surfaces, so it is removed easily.

In the past I tried to find out why, on some projects, certain boxes (never the same ones) would error out on a large share (up to 50%) of the crunched WUs. I had this issue with HPF2; I made a very long and exhaustive search for the causes, and it was impossible to identify them. The case of C4CW is also strange. When the project started I also had error issues with certain machines, but today all machines crunch it fully without even one error. It is a very mysterious phenomenon.
----------------------------------------
[Edit 2 times, last edit by Hypernova at Feb 2, 2012 9:21:21 AM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
70 deg Celsius seems very high. At that temp there are not a lot of heatsinks that can deal with that without problems. And your thermal paste would degrade fast as well.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Looking at your host list, it appears that you're running i7 980Xs under Windows 7 across the board. 64-bit?

My 980X/Win7 Ultimate 64-bit runs a constant 10 cores/threads of CEP2 (plus a GPU project on the display adapter, so using some CPU for that, and I remote desktop into other boxes from it and run browser windows on it). It is running an XMP-1600 profile and a constant 3600 MHz on the processor. Ambients of 25C/75F and below allow me to keep core temps below 70C/158F using air cooling (Cooler Master Hyper 212+). Running with RAID 0 paired SSDs and a BOINC data directory on a RAID 10 made up of WDC drives, both hanging off an Intel SATA controller, I get ~35 results/day. (All but 1 GB of swapfile relocated to RAID 10, but with 12 gig of memory...)

That was for the sake of example. As to your set-ups: when you say your boxes are "identical", are you saying all the way down to hard disk manufacturer/model, SATA controller (Intel vs. JMicron vs. Marvell), RAID configuration, write-caching policies, and antivirus scanning/exclusions? All of which make a difference...

(As a side note: don't remote desktop into a box with BOINC Manager running. Every time I do it, at least, the options/preferences/exit confirmation/etc. dialogue window positions and sizes get sent to la-la land, way left and up of the viewing area, often with a window size of [0, 0], which is a mite hard to work with. I have to use something like Winlister to center the exit confirmation window, tell it not to stop crunching on exit, and restart BOINC Manager. Highly annoying.)
----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 2, 2012 3:28:28 PM] |
||
|
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline Project Badges:
|
"70 deg Celsius seems very high. At that temp there are not a lot of heatsinks that can deal with that without problems. And your thermal paste would degrade fast as well."

Good air coolers like the Noctua NH-D14, with two fans (a 140mm and a 120mm), cope very well. 70 is a maximum; in fact temps hover around 60-67. |
||
|
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline Project Badges:
|
"That was for the sake of example. As to your set-ups: When you say your boxes are "identical", are you saying all the way down to hard disk manufacturer/model, SATA controller (Intel vs. JMicron vs. Marvell), RAID configuration, write-caching policies, and antivirus scanning/exclusions?"

It is correct to say that I have identical boxes, down to the casing, motherboard, HDD, RAM, CPU, OS, security kit (firewall/antivirus), video card and CD/DVD reader. This is not true for all 21, which can be grouped into identical subsets. But again, nothing coherent has emerged. Some units of each subset can develop a common allergy, and sometimes the errors are not even of the same type, so it's a nightmare. The best thing is just to take out the machines that are allergic and let them crunch what they like. |
||
|
|
|