World Community Grid - View Thread - anyone else seeing these kinds of errors? I'm getting tons of them.

World Community Grid Forums

Category: Completed Research

Forum: Human Proteome Folding - Phase 2

Thread Status: Active
Total posts in this thread: 109

This topic has been viewed 950962 times and has 108 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Info dump from a PC that has a lot of errors;

02/02/2010 11:35:06 Starting BOINC client version 6.10.18 for windows_x86_64
02/02/2010 11:35:06 log flags: file_xfer, sched_ops, task
02/02/2010 11:35:06 Libraries: libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3
02/02/2010 11:35:06 Data directory: C:\ProgramData\BOINC
02/02/2010 11:35:06 Running under account [censored]
02/02/2010 11:35:06 Processor: 8 GenuineIntel Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 5]
02/02/2010 11:35:06 Processor: 256.00 KB cache
02/02/2010 11:35:06 Processor features: fpu tsc pae nx sse sse2 pni
02/02/2010 11:35:06 OS: Microsoft Windows 7: x64 Edition, (06.01.7600.00)
02/02/2010 11:35:06 Memory: 5.99 GB physical, 6.38 GB virtual
02/02/2010 11:35:06 Disk: 139.73 GB total, 42.98 GB free
02/02/2010 11:35:06 Local time is UTC +1 hours
02/02/2010 11:35:06 ATI GPU 0: ATI Radeon HD 4700/4800 (RV740/RV770) (CAL version 1.4.427, 1024MB, 1360 GFLOPS peak)

It's a 4890.
Motherboard is Asus P6T Deluxe with an Intel X58 chipset.
RAM is from Crucial, 3*2GB

This from just now, when the client wanted to wait 18h to get new WUs even though I had 8 ready to report. So I'm running through a lot of WUs atm, looks like 7 successful starts in 45min.

02/02/2010 12:06:42 World Community Grid Starting nc664_00073_2
02/02/2010 12:06:42 World Community Grid Starting task nc664_00073_2 using hpf2 version 603
02/02/2010 12:06:55 World Community Grid Sending scheduler request: To fetch work.
02/02/2010 12:06:55 World Community Grid Requesting new tasks for CPU
02/02/2010 12:07:00 World Community Grid Scheduler request completed: got 1 new tasks
02/02/2010 12:07:02 World Community Grid Starting nc664_00066_16
02/02/2010 12:07:02 World Community Grid Starting task nc664_00066_16 using hpf2 version 603
02/02/2010 12:07:15 World Community Grid Sending scheduler request: To fetch work.
02/02/2010 12:07:15 World Community Grid Requesting new tasks for CPU
02/02/2010 12:07:20 World Community Grid Scheduler request completed: got 1 new tasks
02/02/2010 12:07:22 World Community Grid Starting nc664_00002_2
02/02/2010 12:07:22 World Community Grid Starting task nc664_00002_2 using hpf2 version 603
02/02/2010 12:07:58 World Community Grid Computation for task nc664_00073_2 finished
02/02/2010 12:07:58 World Community Grid Output file nc664_00073_2_0 for task nc664_00073_2 absent
02/02/2010 12:08:21 World Community Grid Computation for task nc664_00066_16 finished
02/02/2010 12:08:21 World Community Grid Output file nc664_00066_16_0 for task nc664_00066_16 absent
02/02/2010 12:08:39 World Community Grid Computation for task nc664_00002_2 finished
02/02/2010 12:08:39 World Community Grid Output file nc664_00002_2_0 for task nc664_00002_2 absent
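
The pattern in the log above (a task starts, finishes about a minute later, and its output file is absent) can be pulled out with a short script. This is only a minimal sketch, assuming the message-log line layout shown above and day/month/year timestamps; `summarize_tasks` and the other names are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Log excerpt copied from the post above (only the lines the script needs).
LOG = """\
02/02/2010 12:06:42 World Community Grid Starting task nc664_00073_2 using hpf2 version 603
02/02/2010 12:07:02 World Community Grid Starting task nc664_00066_16 using hpf2 version 603
02/02/2010 12:07:22 World Community Grid Starting task nc664_00002_2 using hpf2 version 603
02/02/2010 12:07:58 World Community Grid Computation for task nc664_00073_2 finished
02/02/2010 12:07:58 World Community Grid Output file nc664_00073_2_0 for task nc664_00073_2 absent
02/02/2010 12:08:21 World Community Grid Computation for task nc664_00066_16 finished
02/02/2010 12:08:21 World Community Grid Output file nc664_00066_16_0 for task nc664_00066_16 absent
02/02/2010 12:08:39 World Community Grid Computation for task nc664_00002_2 finished
02/02/2010 12:08:39 World Community Grid Output file nc664_00002_2_0 for task nc664_00002_2 absent
"""

# Assumption: every line starts with a dd/mm/yyyy hh:mm:ss timestamp.
STAMP = r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
STARTED = re.compile(STAMP + r" .*Starting task (\S+) using")
FINISHED = re.compile(STAMP + r" .*Computation for task (\S+) finished")
ABSENT = re.compile(r"Output file \S+ for task (\S+) absent")


def parse_stamp(text):
    """Turn a log timestamp into a datetime (day/month/year assumed)."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")


def summarize_tasks(log):
    """Map task name -> (wall-clock seconds from start to finish, output absent?)."""
    started, summary = {}, {}
    for line in log.splitlines():
        if m := STARTED.match(line):
            started[m.group(2)] = parse_stamp(m.group(1))
        elif m := FINISHED.match(line):
            task = m.group(2)
            if task in started:
                elapsed = parse_stamp(m.group(1)) - started[task]
                summary[task] = [int(elapsed.total_seconds()), False]
        elif m := ABSENT.search(line):
            if m.group(1) in summary:
                summary[m.group(1)][1] = True
    return {task: tuple(values) for task, values in summary.items()}


for task, (seconds, absent) in summarize_tasks(LOG).items():
    print(task, seconds, "s,", "output absent" if absent else "output present")
```

On this excerpt, each of the three tasks ran for well under a minute and a half before finishing with its output file absent.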

I've given up on keeping track of my WUs. The desktop is unstable when running HPF2, and only when running HPF2.
When I ran everything but HPF2 I had no problems at all with my shiny new desktop (about $1000); it didn't matter how many different projects I ran or which client version. Now I'm getting tons of errors, regardless of the number of cores running or the client version.
At times it also stops all input and output except the mouse. Very annoying, as it forces me to reset my desktop.
Interestingly enough, my laptop (a Lenovo G550 with a clean install of Win7) doesn't freeze like that, so the freezing is probably down to some part of my desktop setup. But the laptop is experiencing errors too.
And my old single-core is crunching along happily without errors.

[Feb 2, 2010 11:24:44 AM]

gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Although I've now moved off HPF2 (due to reaching my current goal), just recently I crunched the equivalent of 76 CPU days' worth of WUs for this project (300+ WUs), of which I can only remember 1 error. That one didn't fall into the pattern of aborting within a few seconds of starting, so I suspect it aborted for some other reason.

Anyhow, for the sake of providing more information to go on, here's my set-up:
02/02/2010 04:31:45||Starting BOINC client version 6.2.28 for windows_intelx86
02/02/2010 04:31:45||log flags: task, file_xfer, sched_ops
02/02/2010 04:31:45||Libraries: libcurl/7.19.0 OpenSSL/0.9.8i zlib/1.2.3
02/02/2010 04:31:45||Data directory: C:\Documents and Settings\All Users\Application Data\BOINC
02/02/2010 04:31:45||Running under account GB009761
02/02/2010 04:31:46||Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz [x86 Family 6 Model 15 Stepping 11]
02/02/2010 04:31:46||Processor features: fpu tsc sse sse2 mmx
02/02/2010 04:31:46||OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 2, (05.01.2600.00)
02/02/2010 04:31:46||Memory: 1.96 GB physical, 3.81 GB virtual
02/02/2010 04:31:46||Disk: 111.78 GB total, 16.96 GB free
02/02/2010 04:31:46||Local time is UTC +0 hours
02/02/2010 04:31:46|World Community Grid|URL: http://www.worldcommunitygrid.org/; Computer ID: 785082; location: (none); project prefs: default
02/02/2010 04:31:46||General prefs: from World Community Grid (last modified 28-Jan-2010 16:14:00)
02/02/2010 04:31:46||Host location: none
02/02/2010 04:31:46||General prefs: using your defaults
02/02/2010 04:31:46||Preferences limit memory usage when active to 1705.29MB
02/02/2010 04:31:46||Preferences limit memory usage when idle to 1905.91MB
02/02/2010 04:31:46||Preferences limit disk usage to 16.49GB

During the bulk of the period, I was running both cores with HPF2 work, although towards the end, I slowly changed over to FA@H.

Hopefully, this'll help...

----------------------------------------

[Feb 2, 2010 11:45:48 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Hello WCG.

Attention: Sekerob - Community Advisor
Reference: Your [Oct 16, 2009 10:08:50 AM] post
http://www.worldcommunitygrid.org/forums/wcg/...ead,26706_offset,0#254060

Sir:

Your [Oct 16, 2009 10:08:50 AM] post said that ".. the rate of failure increased depending on the concurrent number of HPF2 jobs". I can confirm that this matches my own experience with the HPF2 errors. Also, I have yet to come across a post from anyone who ran HPF2 WUs on a single-core machine and still encountered the HPF2 error. Your post also spoke of "improving system performance and lowers temperature". Connect those dots and it would seem to point in the direction of high-temperature-induced errors when crunching HPF2 WUs on a multi-core CPU. If that is the case, what do you think of working on a solution that works around the temperature situation? For example, the BOINC scheduler could make sure that HPF2 WUs are interleaved with non-HPF2 WUs (as you also suggested in your post). This approach would suit those crunchers who choose to stick with HPF2 (because of the importance of the underlying science, for example), and it could be made available to them while the root cause of the HPF2 error has not yet been pinned down. And should the culprit turn out to be high temperature (brought on, perhaps, by the compute-intensive nature of an HPF2 WU), then the HPF2 error would have been mitigated.

Good day.
;

[Feb 2, 2010 5:01:11 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

andzgrid,

You could confirm the temperature part by setting the BOINC throttle (which has itself been a source of failures in the past) or by trying out ThreadMaster GUI (the package with a graphical interface; link in the 3rd party software FAQ). I'm running this kit on my W7 laptop, presently set at 90%... works for Woz since W2K. I really would not know whether a CPU could get too hot at the micro level without signaling the system that something is too hot. I vaguely remember the programmers once finding this with a science that was in the launch phase, but I am unable to trace it back.

The alternative is TThrottle, written specifically for BOINC. It works, but I did not like that it consumed quite a bit of CPU time on my systems, so I reverted to the very smooth-running ThreadMaster.

Thanks for taking the time to analyze the past hundreds of posts... and here is a killer feature that has been requested from the developers at times: a setting to prevent more than X instances of the same science from running. Try PrimeGrid, and that on a quad will get your fans to make the curtains move... Memory slogging is already addressed: BOINC will pause 1 or more jobs and try to find/start smaller sciences if available.

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[Feb 2, 2010 5:25:10 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

[snip]
Connect those dots and it would seem to point in the direction of high-temperature-induced errors when crunching HPF2 WUs on a multi-core CPU. If that is the case, what do you think of working on a solution that works around the temperature situation? For example, the BOINC scheduler could make sure that HPF2 WUs are interleaved with non-HPF2 WUs (as you also suggested in your post). This approach would suit those crunchers who choose to stick with HPF2 (because of the importance of the underlying science, for example), and it could be made available to them while the root cause of the HPF2 error has not yet been pinned down. And should the culprit turn out to be high temperature (brought on, perhaps, by the compute-intensive nature of an HPF2 WU), then the HPF2 error would have been mitigated.

Good day.
;

Many motherboards come equipped with sensors to monitor CPU temperature; various programs can query them and display/track the data, sound a warning when it gets too hot, or even shut the computer off if that happens while unattended.
However you find out, if you're experiencing high-temp CPU conditions, consider removing the fan/heatsink, cleaning it and (very carefully) the top of the CPU (I use a paper towel and 99% isopropyl), then applying a good conductive paste. In my experience the little phase-change pads that came with a lot of CPU/heatsink combos the last few years were plain junk. Some had clumps in them that would hold the sink off the CPU and let the rest run right out when it reached its change point. Whether it's Arctic Silver or plain old white Dow 340, heatsink paste works better than many of those pads did, IMHO. Hope that helps.

I would postulate that the reason this problem shows up on multi-core CPUs is that most machines running Vista and Win7 have multi-core CPUs, not the multiple cores themselves. I have a couple of quad-core machines (1 AMD; 1 Intel) running Fedora that have never had an error of any kind running the HPF2 task (and when they run it, they typically run nothing but HPF2).

[Feb 3, 2010 6:02:35 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Hello WCG.

Attention: Sekerob - Community Advisor
ZoSo - Advanced Cruncher

Reference: Sekerob [Feb 2, 2010 5:25:10 PM] post
ZoSo [Feb 3, 2010 6:02:35 AM] post

Gentlemen:

Thanks for responding. Your posts touched on a lot of good points. On a related matter, the uplinger [Feb 1, 2010 9:04:35 PM] post spoke of "..working very hard to bring two more science applications online before we are able to dedicate more of our time to fixing this issue". I can't wait to try out those projects' WUs.

Good day
;

[Feb 3, 2010 8:50:33 AM]

Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

Very well said. I can add to that: avoid buying boxed Intel CPUs that come with a fan. The boxed fan is not worth the money. It is inefficient and noisy, and you can forget overclocking with it. Buy the CPU on its own and fit a real CPU cooler. You can then monitor the core temperatures with small applications you can find on the net.

----------------------------------------

[Feb 3, 2010 2:42:31 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

As I do with every new alpha client I install, I fetch 1 or 2 to confirm they're still running fine. As it happened, 2 days ago I did a test fetch to see if the work-unavailable reports were persisting, which fortunately they weren't. And here it is: the periodic save at 45 minutes of CPU time, 14% progress, no issue at the 2-minute hurdle, and I expect it to finish as always, without a hitch.

25/02/2010 14:31:19 World Community Grid [checkpoint_debug] result ne015_00046_6 checkpointed

W7-64 all patched to the latest, Q6600, alpha 6.10.34, 64 bit, running on 4 cores.

----------------------------------------


[Feb 25, 2010 1:35:53 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

So the current theory is that high temps are causing the crash?
When the crashes happen, they always happen within a couple of seconds of starting, which is not enough time to send the temps so high that the CPU shuts down the core the WU is running on. Even if it did, on my i7-920 with HT on I would get more than one error at the same time. I also run serious HSF units and keep my temps in check even when running LinX or IBT... it is not the temps directly. As for the likelihood of it happening on multi-core... it never happened on my laptop with two cores. More likely it is some bizarrely synchronous event that causes a buggy DLL, one that WCG and the scientists have no control over, to try to write to a static memory space that is already protected by another process. That's why you would see it more often with multiple instances of HPF2: they are more likely to cross swords writing to the same static location as some other random code that already has a lock on that position.

[Feb 25, 2010 2:00:40 PM]

Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline


Re: anyone else seeing these kinds of errors? I'm getting tons of them.

High temps can cause the crash, and so can memory timings.
I limit my core temps to a target of 75C average, with 80C max. Some would run their machines at 85C or 90C. I tested at those temps for two weeks and it runs, but the risk is that if your cooling efficiency goes down due to high room temps or other reasons, you reach Tj and then STOP. To my knowledge there are no official figures from Intel on Tj. It is said to be 100C, but if any of you know better I am interested. And core temp measuring software is only accurate to +/- 2-3C.
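
To make the margins above concrete, here is a minimal sketch that classifies a core temperature reading against the figures quoted in this post. The defaults (75C target, 80C max, an unofficial Tj of 100C, and 3C of sensor error) come from the post; the function name and the status labels are invented for illustration and do not belong to any monitoring tool.

```python
def thermal_status(core_temp_c, target_c=75.0, max_c=80.0,
                   tj_c=100.0, sensor_error_c=3.0):
    """Classify a core temperature reading and report headroom to Tj.

    The reading is padded by the sensor error, so headroom is the
    worst-case distance to Tj (the unofficial 100C figure by default).
    """
    worst_case = core_temp_c + sensor_error_c
    headroom = tj_c - worst_case
    if worst_case >= tj_c:
        status = "at-tj"          # may already be at the shutdown point
    elif core_temp_c > max_c:
        status = "over-max"       # above the self-imposed 80C ceiling
    elif core_temp_c > target_c:
        status = "above-target"   # between the 75C target and 80C max
    else:
        status = "ok"
    return status, headroom


for reading in (72.0, 78.0, 90.0, 98.0):
    print(reading, thermal_status(reading))
```

Padding the reading by the sensor error is a deliberately pessimistic choice: a displayed 98C could already be at or past Tj.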

The other, more frequent cause is aggressive memory timings with overclocking. I had crashes with Patriot 2000 MHz DDR3 CL8 (8-8-8-24) when I had the frequency around 1950 MHz. To be stable at those timings I have to be under 1900 MHz.
With Patriot 1600 MHz CL8 DDR3 RAM it remained stable very near 1600. So you have to test: once you get the BSOD, go back about 100 MHz and test again over three to four days of continuous crunching, and if you get one more BSOD, go back another 50 MHz. Then it should be OK; if not, your DDR3 memory does not meet its specified max timings or is bad.
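
The back-off procedure just described can be written down as a small routine. This is a sketch only: `is_stable` stands in for the multi-day crunching test (it is a hypothetical callback, not a real tool), and the 100 MHz and 50 MHz steps are the ones suggested in this post.

```python
def back_off_frequency(crash_mhz, is_stable, first_step=100, second_step=50):
    """Return a memory frequency that passes the stability test, or None.

    crash_mhz: frequency at which the crash was seen.
    is_stable: callable taking a frequency in MHz and returning True if
               several days of continuous crunching produced no crash.
    """
    candidate = crash_mhz - first_step
    if is_stable(candidate):
        return candidate
    candidate -= second_step
    if is_stable(candidate):
        return candidate
    # Still crashing after both steps: suspect the memory itself.
    return None


# Example using the 2000 MHz kit from the post: crashes around 1950 MHz,
# stable below 1900 MHz.
def stable_below_1900(mhz):
    return mhz < 1900

print(back_off_frequency(1950, stable_below_1900))
```

A `None` result corresponds to the last case in the post: memory that does not hold its specified timings, or is simply bad.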

Then you have the combination of overclocked memory and CPU. If you want to pull the max, you are in for long manual tuning of many parameters. For me it is not worth the effort. Keep a reasonable margin if you crunch non-stop and do not always have access to your machines. A BSOD with a restart, or a BSOD with a frozen system, will cost you more than a less optimized system that never stops.

----------------------------------------

[Feb 28, 2010 8:48:43 PM]
