World Community Grid Forums
Thread Status: Active Total posts in this thread: 109
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Info dump from a PC that has a lot of errors:

02/02/2010 11:35:06 Starting BOINC client version 6.10.18 for windows_x86_64
02/02/2010 11:35:06 log flags: file_xfer, sched_ops, task
02/02/2010 11:35:06 Libraries: libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3
02/02/2010 11:35:06 Data directory: C:\ProgramData\BOINC
02/02/2010 11:35:06 Running under account [censored]
02/02/2010 11:35:06 Processor: 8 GenuineIntel Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 5]
02/02/2010 11:35:06 Processor: 256.00 KB cache
02/02/2010 11:35:06 Processor features: fpu tsc pae nx sse sse2 pni
02/02/2010 11:35:06 OS: Microsoft Windows 7: x64 Edition, (06.01.7600.00)
02/02/2010 11:35:06 Memory: 5.99 GB physical, 6.38 GB virtual
02/02/2010 11:35:06 Disk: 139.73 GB total, 42.98 GB free
02/02/2010 11:35:06 Local time is UTC +1 hours
02/02/2010 11:35:06 ATI GPU 0: ATI Radeon HD 4700/4800 (RV740/RV770) (CAL version 1.4.427, 1024MB, 1360 GFLOPS peak)

It's a 4890. Motherboard is Asus P6T Deluxe with an Intel X58 chipset. RAM is from Crucial, 3*2GB.

This is from just now, when the client wanted to wait 18h to get new WUs even though I had 8 ready to report. So I'm running through a lot of WUs atm, looks like 7 successful starts in 45min.

02/02/2010 12:06:42 World Community Grid Starting nc664_00073_2
02/02/2010 12:06:42 World Community Grid Starting task nc664_00073_2 using hpf2 version 603
02/02/2010 12:06:55 World Community Grid Sending scheduler request: To fetch work.
02/02/2010 12:06:55 World Community Grid Requesting new tasks for CPU
02/02/2010 12:07:00 World Community Grid Scheduler request completed: got 1 new tasks
02/02/2010 12:07:02 World Community Grid Starting nc664_00066_16
02/02/2010 12:07:02 World Community Grid Starting task nc664_00066_16 using hpf2 version 603
02/02/2010 12:07:15 World Community Grid Sending scheduler request: To fetch work.
02/02/2010 12:07:15 World Community Grid Requesting new tasks for CPU
02/02/2010 12:07:20 World Community Grid Scheduler request completed: got 1 new tasks
02/02/2010 12:07:22 World Community Grid Starting nc664_00002_2
02/02/2010 12:07:22 World Community Grid Starting task nc664_00002_2 using hpf2 version 603
02/02/2010 12:07:58 World Community Grid Computation for task nc664_00073_2 finished
02/02/2010 12:07:58 World Community Grid Output file nc664_00073_2_0 for task nc664_00073_2 absent
02/02/2010 12:08:21 World Community Grid Computation for task nc664_00066_16 finished
02/02/2010 12:08:21 World Community Grid Output file nc664_00066_16_0 for task nc664_00066_16 absent
02/02/2010 12:08:39 World Community Grid Computation for task nc664_00002_2 finished
02/02/2010 12:08:39 World Community Grid Output file nc664_00002_2_0 for task nc664_00002_2 absent

I've given up on keeping track of my WUs.

The desktop is unstable when running HPF2. It happens only when running HPF2. When I ran everything but HPF2 I had no problems at all with my shiny new desktop (at about $1000); it didn't matter how many different projects I ran or which client version. Now I'm getting tons of errors. The number of cores running and the client version don't matter. At times it also just stops all input and output except the mouse. Very annoying, as it forces me to reset my desktop.

Interestingly enough, my laptop doesn't experience that freezing problem, so it's probably part of my desktop setup that is causing it. But the laptop too is experiencing errors. Lenovo G550. Clean install of Win7.

And my old single-core is crunching along happily without errors. |
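For anyone comparing logs like the one above, here is a rough Python sketch that pairs each "Starting task" line with its "Computation ... finished" line and flags tasks whose output file was reported absent. The function name `summarize`, the regular expressions, and the assumption that timestamps are day/month/year are mine, not part of BOINC; the sample text is a subset of the lines posted above.

```python
import re
from datetime import datetime

# Subset of the log lines from the post above.
LOG = """\
02/02/2010 12:06:42 World Community Grid Starting task nc664_00073_2 using hpf2 version 603
02/02/2010 12:07:02 World Community Grid Starting task nc664_00066_16 using hpf2 version 603
02/02/2010 12:07:22 World Community Grid Starting task nc664_00002_2 using hpf2 version 603
02/02/2010 12:07:58 World Community Grid Computation for task nc664_00073_2 finished
02/02/2010 12:07:58 World Community Grid Output file nc664_00073_2_0 for task nc664_00073_2 absent
02/02/2010 12:08:21 World Community Grid Computation for task nc664_00066_16 finished
02/02/2010 12:08:21 World Community Grid Output file nc664_00066_16_0 for task nc664_00066_16 absent
02/02/2010 12:08:39 World Community Grid Computation for task nc664_00002_2 finished
02/02/2010 12:08:39 World Community Grid Output file nc664_00002_2_0 for task nc664_00002_2 absent
"""

STAMP = r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
START = re.compile(STAMP + r" .*Starting task (\S+) using")
DONE = re.compile(STAMP + r" .*Computation for task (\S+) finished")
ABSENT = re.compile(r"Output file \S+ for task (\S+) absent")


def parse_time(text):
    # Assumption: the client logs dates as day/month/year.
    return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")


def summarize(log_text):
    """Map task name -> wall-clock seconds from start to finish, plus an absent-output flag."""
    started, summary = {}, {}
    for line in log_text.splitlines():
        if m := START.search(line):
            started[m.group(2)] = parse_time(m.group(1))
        elif m := DONE.search(line):
            task = m.group(2)
            if task in started:
                secs = (parse_time(m.group(1)) - started[task]).total_seconds()
                summary[task] = {"seconds": secs, "output_absent": False}
        elif m := ABSENT.search(line):
            if m.group(1) in summary:
                summary[m.group(1)]["output_absent"] = True
    return summary


for task, info in summarize(LOG).items():
    print(task, info)
```

On the lines quoted here, all three tasks finish between 76 and 79 seconds after starting, each with its output file absent.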
||
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 2982 Status: Offline |
Although I've now moved off HPF2 (having reached my current goal), I recently crunched the equivalent of 76 CPU days' worth of WUs for this project (300+ WUs), of which I can remember only 1 error. That one didn't fit the pattern of aborting within a few seconds of starting, so I suspect it aborted for some other reason.

Anyhow, for the sake of providing more information to go on, here's my set-up:

02/02/2010 04:31:45||Starting BOINC client version 6.2.28 for windows_intelx86
02/02/2010 04:31:45||log flags: task, file_xfer, sched_ops
02/02/2010 04:31:45||Libraries: libcurl/7.19.0 OpenSSL/0.9.8i zlib/1.2.3
02/02/2010 04:31:45||Data directory: C:\Documents and Settings\All Users\Application Data\BOINC
02/02/2010 04:31:45||Running under account GB009761
02/02/2010 04:31:46||Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz [x86 Family 6 Model 15 Stepping 11]
02/02/2010 04:31:46||Processor features: fpu tsc sse sse2 mmx
02/02/2010 04:31:46||OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 2, (05.01.2600.00)
02/02/2010 04:31:46||Memory: 1.96 GB physical, 3.81 GB virtual
02/02/2010 04:31:46||Disk: 111.78 GB total, 16.96 GB free
02/02/2010 04:31:46||Local time is UTC +0 hours
02/02/2010 04:31:46|World Community Grid|URL: http://www.worldcommunitygrid.org/; Computer ID: 785082; location: (none); project prefs: default
02/02/2010 04:31:46||General prefs: from World Community Grid (last modified 28-Jan-2010 16:14:00)
02/02/2010 04:31:46||Host location: none
02/02/2010 04:31:46||General prefs: using your defaults
02/02/2010 04:31:46||Preferences limit memory usage when active to 1705.29MB
02/02/2010 04:31:46||Preferences limit memory usage when idle to 1905.91MB
02/02/2010 04:31:46||Preferences limit disk usage to 16.49GB

During the bulk of the period, I was running both cores with HPF2 work, although towards the end I slowly changed over to FA@H.

Hopefully, this'll help... |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello WCG.
Attention: Sekerob - Community Advisor
Reference: Your [Oct 16, 2009 10:08:50 AM] post http://www.worldcommunitygrid.org/forums/wcg/...ead,26706_offset,0#254060

Sir:

Your [Oct 16, 2009 10:08:50 AM] post spoke of ".. the rate of failure increased depending on the concurrent number of HPF2 jobs". I confirm that this matches my own case with the HPF2 errors. Also, I have yet to come across a thread that reports the HPF2 error while running HPF2 WUs on a single-core machine.

Your post also spoke of "improving system performance and lowers temperature". Connect those dots and it would seem to point in the direction of high-temperature-induced errors when crunching HPF2 WUs on a multi-core CPU.

If that is the case, what do you think of working on a solution that works around the temperature situation? For example, the scheduler in BOINC could make sure that HPF2 WUs are interleaved with non-HPF2 WUs (as you also suggested in your post). This approach would suit crunchers who choose to stick with HPF2 (because of the importance of the underlying science, for example), and it could be made available to them while the root cause of the HPF2 error is not yet pinned down.

And should the culprit turn out to be the (high) temperature (made necessary, perhaps, by the compute-intensive nature of an HPF2 WU), then the issue with the HPF2 error would have been mitigated.

Good day. ; |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
andzgrid,

The temp part you could confirm by setting the BOINC throttle, which itself has in the past been a source of failure, or by trying out the ThreadMaster GUI (the package with graphical interface, link in the 3rd party software FAQ). This kit I'm running on my W7 laptop, set at 90% presently... works for Woz since W2K. I really would not know if a CPU at micro level could get too hot without the CPU signaling the system that something is too hot. Vaguely I remember the programmers once found a science in the launch phase showing this, but I am unable to trace it back.

The alternative is trying TThrottle, specifically written for BOINC. It works, but I did not like that it was consuming quite a bit of CPU time on my systems, so I reverted to the very smoothly operating ThreadMaster.

Thanks for taking the time to analyse the past hundreds of posts... and here is a killer feature requested from the developers at times: a setting to prevent more than X instances of the same science. Try PrimeGrid on a quad and that will get your fans to make the curtains move... memory slogging is already addressed: BOINC will pause 1 or more jobs and try to find/start smaller sciences if available.
WCG
Please help to make the Forums an enjoyable experience for All! |
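The "no more than X instances of the same science" idea mentioned above can be sketched in a few lines of Python. This is only an illustration of the selection rule, not how the BOINC client schedules work; `pick_tasks`, the `faah` short name, and the `faah_task_*` task names are hypothetical, while the `nc664_*` names come from the log earlier in the thread.

```python
from collections import Counter


def pick_tasks(queue, free_slots, max_per_app):
    """Choose up to free_slots tasks in queue order, at most max_per_app per application."""
    chosen, per_app = [], Counter()
    for name, app in queue:
        if len(chosen) == free_slots:
            break
        if per_app[app] < max_per_app:
            chosen.append(name)
            per_app[app] += 1
    return chosen


# (task name, application) pairs; the faah entries are made up for illustration.
queue = [
    ("nc664_00073_2", "hpf2"),
    ("nc664_00066_16", "hpf2"),
    ("nc664_00002_2", "hpf2"),
    ("faah_task_a", "faah"),
    ("faah_task_b", "faah"),
]

# Four free cores, but never more than two HPF2 jobs at once.
print(pick_tasks(queue, free_slots=4, max_per_app=2))
```

With these inputs the third HPF2 task is skipped and the two other tasks fill the remaining cores, which is the kind of mixing the earlier posts ask for.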
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
[snip] Connect those dots and it would seem to point in the direction of highTemperature-induced error in crunching HPF2 WUs in a multi-core CPU. If that is the case, what do you think of the idea of working on a solution based on working around the temperature situation? For example, for the scheduler in BOINC to make sure that HPF2 WUs are spread with other non-HPF2 WUs (As you also suggested in your post)? This approach would suit those crunchers who chose to stick with HPF2 (because of the importance of the underlying science, for example); and that this approach be made available to those crunchers while the root cause of the error in HPF2 is not yet pinned down. And should the culprit turn out to be the (high) temperature (made necessary, perhaps, by the compute-intensive nature of an HPF2 WU), then the issue with HPF2 error would have been mitigated. Good day. ;

Many motherboards come equipped with sensors to monitor that; various programs can query them and display/track those data, sound a warning when it gets too hot, or even shut the computer off if it happens while unattended.

However you find out, if you're experiencing high-temp CPU conditions, consider removing the fan/heatsink, cleaning it and (very carefully) the top of the CPU (I use a paper towel and 99% isopropyl), then applying a good conductive paste. In my experience the little phase-change pads that came with a lot of CPU/heatsink combos the last few years were plain junk. Some had clumps in them that would hold the sink off the CPU and let the rest run right out when it reached its change point. Whether it's Arctic Silver or plain old white Dow 340, heatsink paste works better than many of those pads did, IMHO. Hope that helps.

I would postulate that this problem shows up on multi-core CPUs because most machines running Vista and Win7 have multi-core CPUs... not because of the multiple cores themselves.

I have a couple of quad-core machines (1 AMD, 1 Intel) running Fedora that have never had an error of any kind running the HPF2 task (and when they run it, they typically run nothing but HPF2). |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello WCG.
Attention: Sekerob - Community Advisor
ZoSo - Advanced Cruncher
Reference: Sekerob [Feb 2, 2010 5:25:10 PM] post
ZoSo [Feb 3, 2010 6:02:35 AM] post

Gentlemen:

Thanks for responding. Your posts touched on a lot of good points.

On a related matter, the uplinger [Feb 1, 2010 9:04:35 PM] post spoke of: "..working very hard to bring two more science applications online before we are able to dedicate more of our time to fixing this issue". I can't wait to try those projects' WUs out.

Good day ; |
||
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline |
Many motherboards come equipped with sensors to monitor that; various programs can query them and display/track those data, sound a warning when it gets too hot, or even shut the computer off if it happens while unattended. However you find out, if you're experiencing high-temp CPU conditions, consider removing the fan/heatsink, cleaning it and (very carefully) the top of the CPU (I use a paper towel and 99% isopropyl), then applying a good conductive paste. In my experience the little phase-change pads that came with a lot of CPU/heatsink combos the last few years were plain junk. Some had clumps in them that would hold the sink off the CPU and let the rest run right out when it reached its change point. Whether it's Arctic Silver or plain old white Dow 340, heatsink paste works better than many of those pads did, IMHO. Hope that helps.

Very well said. I can add one thing: avoid buying boxed Intel CPUs with the bundled fan. The boxed fan is not worth the money. It is inefficient and noisy, and you can forget overclocking with it. Buy the CPU standalone and fit a real CPU cooler. You can then monitor the core temperatures with small applications you can find on the net. |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
As I do with every new alpha client installed, I fetch 1 or 2 to confirm they're still running fine. As it happened, 2 days ago I did a test fetch to see if the work-unavailable reports were enduring, which fortunately they weren't. And here it is: the periodic save at 45 minutes CPU time, 14% progress, no issue at the 2-minute hurdle, and I expect it to finish as always, without a hitch.

25/02/2010 14:31:19 World Community Grid [checkpoint_debug] result ne015_00046_6 checkpointed

W7-64 all patched to the latest, Q6600, alpha 6.10.34, 64 bit, running on 4 cores.
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
So the current theory is that it is high temps that are causing the crash?

When the crashes happen, they always happen within a couple of seconds of starting, which is not enough time to send the temps up so high as to make the CPU shut down the core the WU is running on. Even if it did, then on my i7-920 with HT on I would get more than one error at the same time. I am also running serious HSF units and keep my temps in check even when running LinX or IBT ... it is not the temps directly.

As for the likelihood of it happening on multicore ... it never happened on my laptop with two cores. More likely it is some bizarrely synchronous event that causes a buggy DLL, which WCG and the scientists have no control over, to try to write to a static memory space that is already protected by another process. That's why you would see it more often with multiple instances of HPF2: they are more likely to cross swords writing to the same static location as some other random code that already has a lock on that position. |
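To put a number on why more concurrent HPF2 instances would show more errors under a contention theory like this, here is a toy Python calculation. It assumes each start-up independently hits the collision with the same probability; the 5% figure is invented purely for illustration, and the model cannot by itself tell contention apart from a temperature explanation.

```python
def p_any_failure(p_single, n):
    """Chance that at least one of n independent start-ups fails."""
    return 1 - (1 - p_single) ** n


# p_single = 0.05 is a made-up value, not a measured HPF2 error rate.
for n in (1, 2, 4, 8):
    print(n, "concurrent starts ->", round(p_any_failure(0.05, n), 3))
```

Even with a small per-start probability, the chance of seeing at least one error climbs quickly as the number of concurrent instances goes from 1 to 8, which would fit a single-core machine rarely showing the problem.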
||
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline |
High temps can cause the crash, and so can memory timings.

I limit my core temps to a target of 75C average, with 80C max. Some would run their machines at 85C or 90C. I tested at those temps for two weeks and it runs, but the risk is that if your cooling efficiency goes down due to high room temps or other reasons, you reach Tj and then STOP. To my knowledge there are no official figures from Intel on Tj. It is said to be 100C, but if some of you know better I am interested. And core temp measuring software has an error of +/- 2-3C.

The other, more frequent cause is aggressive memory timings with overclocking. I had crashes with Patriot 2000 MHz DDR3 CL8 (8-8-8-24) when I had the frequency around 1950 MHz. To be stable at those timings I have to be under 1900 MHz. With Patriot 1600 MHz CL8 DDR3 RAM it remained stable very near to 1600. So you have to test: once you get the BSOD, go back about 100 MHz and test again over 3 to 4 days of continuous crunching, and if you get one more BSOD go back another 50 MHz. Then it should be OK; if not, your DDR3 memory does not meet its specified max timings or is bad.

Then you have the combination of overclocked memory and CPU. If you want to pull the max, you are in for long manual tuning of many parameters. For me it is not worth the effort. Keep a reasonable margin if you crunch non-stop and do not always have access to your machines. BSODs with restarts, or a BSOD with a frozen system, will cost you more than a less optimized system that never stops. |
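The memory back-off routine described above (drop about 100 MHz after a BSOD, retest for a few days, drop another 50 MHz if it still fails) can be written down as a short Python sketch. `find_stable_frequency` and `pretend_test` are hypothetical names; in practice the stability check is several days of continuous crunching, not a function call, and the under-1900 MHz threshold below only mirrors the Patriot 2000 MHz example from the post.

```python
def find_stable_frequency(start_mhz, is_stable, steps=(100, 50)):
    """Step the memory clock down until is_stable(freq) passes or the steps run out."""
    freq = start_mhz
    if is_stable(freq):
        return freq
    for step in steps:
        freq -= step
        if is_stable(freq):
            return freq
    # Still unstable after all steps: per the post, suspect the modules
    # or their rated timings rather than stepping down further.
    return None


def pretend_test(freq_mhz):
    # Stand-in for "3 to 4 days of crunching without a BSOD";
    # assumes the kit is stable only below 1900 MHz, as in the post.
    return freq_mhz < 1900


print(find_stable_frequency(1950, pretend_test))  # prints 1850
```

Starting from 1950 MHz, the first 100 MHz step lands at 1850 MHz, which is under the threshold the post reports as stable, so the second 50 MHz step is not needed.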