| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 86
|
|
| Author |
|
|
Vuj
Cruncher Joined: Nov 21, 2004 Post Count: 33 Status: Offline Project Badges:
|
> 36. Ability for the program to be used as a system service so Win 2K/XP users don't have to log in to start the program. (I currently have it set up as a scheduled task that runs at boot.)
> Check this webpage: http://www.tacktech.com/display.cfm?ttid=197
> OK, there is a way to run the client as a service or on the second processor, but does anybody know a way to check the progress of the task (since there is no icon in the task bar)? Maybe it would be easy for the guys from WCG to write a little program that can do this?
If you're running XP Pro, go to Administrative Services and find where you set up WCG to run as a service, then right-click and select Properties. The second tab allows interaction with the desktop. HTH |
||
|
|
Viktors
Former World Community Grid Tech Joined: Sep 20, 2004 Post Count: 653 Status: Offline Project Badges:
|
Regarding Linux ETA, we have promised this before the end of the year. If things go smoothly, you might be pleasantly surprised much sooner. That is about all I can say right now. Sorry.
|
||
|
|
debrouxl
Advanced Cruncher France Joined: Dec 31, 2004 Post Count: 61 Status: Offline Project Badges:
|
It would be great to have a client that saves the current WU every, say, 10%, so that even if the WU gets corrupted and aborted, part of it can still be returned.
I've lost a dozen ultra-long WUs to corruption (after more than 20 hours, and as many as 100 hours, of crunching; they would have been shorter and less likely to be corrupted if the fans had been cleaned beforehand. I have a notebook with a P4A at 2.6 GHz). That is really counter-productive and annoying. A simple shutdown and reboot can corrupt the WU, as I have seen more than once, most recently yesterday with a WU that aborted after ~22 hours at ~80% (WU 2615455). The UD Monitor reports that when the agent is started again, it saves the WU once and, some time later, aborts. |
||
|
|
Viktors
Former World Community Grid Tech Joined: Sep 20, 2004 Post Count: 653 Status: Offline Project Badges:
|
Some work units do take much longer (10 times or more) than others. However, I have seen run-away processes in some machines which consume 100% of the CPU and since they don't run at lowest priority, Rosetta gets no CPU time (although the "Run time" continues to build). "Run time" is the wall clock time during which the agent is able to run as long as no other processing is occurring on the machine. If you see no progress for many hours, I would use the Task Manager to check for something else which might be consuming CPU time. I have seen virus scan software get stuck in an infinite loop, print spoolers consume 100% of the cpu while waiting for an unconnected printer, and other stranger things.
Rebooting does not corrupt a work unit. Rosetta simply resumes from the last checkpoint. Checkpoints normally occur every several minutes (depending on the speed of the machine) and at most after about an hour (or more on slower machines) if the particular protein fold is non-converging. The progress percentage is updated at the time of the checkpoint.
If you are using UD monitor, what can happen is that a work unit can time out because it has taken too long (2 weeks run time, 3 weeks wall clock time). If something else modifies the files in any way (including a sector going bad on your disk), the agent simply quits the current work unit and gets new work. This also happens if Rosetta crashes for any reason.
Normally, crashes are a sign of some hardware failure or of running out of virtual memory. You might want to increase the maximum virtual memory paging file size by 200MB or more to be on the safe side. Otherwise, most relevant hardware problems can be discovered using tools such as memtest86, scandisk, and the "hot cpu tester". See: Tools
If you are seeing what you call "work unit corruption," exactly what are the symptoms? Can it be explained by any of the above? |
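The resume-from-checkpoint behavior described above can be illustrated with a minimal sketch. This is not the agent's or Rosetta's actual code: the JSON file format, the function names, and the `stop_after` parameter (which simulates a shutdown) are all assumptions made for illustration. The only idea taken from the post is that progress is saved periodically and a restart continues from the last save, losing at most the work done since then.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a shutdown in the middle of a save leaves the
    previous checkpoint intact rather than a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_work_unit(path, total_steps, checkpoint_every, stop_after=None):
    """Resume from the last checkpoint and run until done.

    stop_after (hypothetical) simulates a reboot after that many steps.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # interrupted: work since the last save is lost
        state["total"] += state["step"]  # stand-in for real computation
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

In this sketch, interrupting a 100-step run after 25 steps with a checkpoint every 10 steps leaves a checkpoint at step 20; the next call repeats steps 20 to 24 and then finishes normally.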
||
|
|
debrouxl
Advanced Cruncher France Joined: Dec 31, 2004 Post Count: 61 Status: Offline Project Badges:
|
> Some work units do take much longer (10 times or more) than others.
Yes.
> If you see no progress for many hours, I would use the Task Manager to check for something else which might be consuming CPU time. I have seen virus scan software get stuck in an infinite loop, print spoolers consume 100% of the cpu while waiting for an unconnected printer, and other stranger things.
I know that, but none of those apply here, sorry.
> Rebooting does not corrupt a work unit.
Of course, but more than once I have seen a WU become corrupted right after (within 1 or 2 checkpoints) the PC was stopped and rebooted (actually, after the agent restarted), and that had already happened before I installed Linux on my hard disk. Anyway, my Linux cannot be the cause of the corruption: it has read support for NTFS, but not write support.
> Rosetta simply resumes from the last checkpoint. Checkpoints normally occur every several minutes (depending on the speed of the machine) and at most after about an hour (or more on slower machines) if the particular protein fold is non-converging. The progress percentage is updated at the time of the checkpoint.
Yes.
> If you are using UD monitor, what can happen is that a work unit can timeout because it has taken too long (2 weeks run time, 3 weeks wall clock time).
Yes, but this never happened to me, as I always kept a reasonable number of slots (formerly 6, now 4) and never got too many long WUs at a time on the computer.
> If something else modifies the files in any way (including a sector going bad on your disk), the agent simply quits on the current work unit and gets new work.
Well, the disk is scanned from time to time, and it has never had any bad sectors.
> This also happens if Rosetta crashes for any reason.
Well, it has never crashed here in more than 200 WUs. That said, I noticed at least once that a WU aborted a short time after another application crashed (that was the application's fault, as it was a very buggy beta, not the hardware's).
> Normally, crashes are a sign of some hardware failure or running out of virtual memory. You might want to increase the maximum virtual memory paging file size by 200MB or more to be on the safe side.
I have already run out of VM while WCG was running, due to a buggy GreaseMonkey under Firefox (Firefox had allocated more than 350 MB of VM), but Windows XP smoothly increased the amount of VM, and the WU did not abort.
As for the entire computer being too hot when saving a WU... maybe. Before I cleaned it (I made a topic about that, on the Member-to-member forum IIRC), it overheated all the time and aborted most WUs: the UDMon logs show it pretty well. When working in a very hot office, it aborted a number of WUs the following way: stop the PC at the end of the day, reboot, and the WU is soon corrupted.
> If you are seeing what you call "work unit corruption," exactly what are the symptoms?
The WU aborts, a directory and a number of files are erased, and the cache slot becomes empty, with no seemingly valid reason, and in particular no timeout (wall clock time, non-convergence, etc.). I know this is different from a timeout, because I have seen a WU abort smoothly (it hit the maximum time between two checkpoints), and a small result was returned. That WU was definitely too long for my older computer.
I estimate that my corrupted WUs add up to at least three weeks of CPU time since 2004/12/31, peaking just before I cleaned the computer, which was in dire need of cleaning. If I had had "reliable checkpoints" (what I'm suggesting: checkpoints every 10% or so, which can still be sent to the server when the WU is corrupted), well, WCG and I would not have lost all that crunching time...
Actually, I did try to restore files from the UDMon backups once, and it worked for some time (the WU got further than the percentage it had aborted at), but it aborted again some time later, and I gave up. My point is that such a feature could be built into the official software. |
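The "reliable checkpoints" suggestion above could look roughly like the following sketch. Everything here is hypothetical: the snapshot layout, the SHA-256 checksum, and the function names are assumptions chosen for illustration, and nothing in the thread says the agent works this way. The idea shown is only that a snapshot is kept at every 10% milestone and that, if the newest data turns out to be corrupt, the most recent snapshot that still verifies can be returned instead.

```python
import hashlib
import json


def make_snapshot(percent, payload):
    """Serialize partial results and attach a checksum of the serialized body."""
    body = json.dumps({"percent": percent, "payload": payload}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"body": body, "sha256": digest}


def is_valid(snapshot):
    """A snapshot is valid if its body still matches the stored checksum."""
    digest = hashlib.sha256(snapshot["body"].encode("utf-8")).hexdigest()
    return digest == snapshot["sha256"]


def latest_valid(snapshots):
    """Return the decoded newest snapshot that verifies, or None if none do."""
    for snap in reversed(snapshots):
        if is_valid(snap):
            return json.loads(snap["body"])
    return None


def crunch(total_steps, milestone_percent=10):
    """Run a stand-in computation, keeping a snapshot at each milestone."""
    snapshots = []
    partial = 0
    next_milestone = milestone_percent
    for step in range(1, total_steps + 1):
        partial += step  # stand-in for real work
        percent = step * 100 // total_steps
        if percent >= next_milestone:
            snapshots.append(make_snapshot(percent, partial))
            next_milestone = (percent // milestone_percent + 1) * milestone_percent
    return snapshots
```

With this layout, a work unit whose last snapshot is damaged would still yield the 90% result rather than nothing, which is the saving the post is asking for.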
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
> Regarding Linux ETA, we have promised this before the end of the year. If things go smoothly, you might be pleasantly surprised much sooner. That is about all I can say right now. Sorry.
WOOOOOOOOOOOOOHOOOOOOOOOOOOOOOOOOOOOOO Bring it on |
||
|
|
debrouxl
Advanced Cruncher France Joined: Dec 31, 2004 Post Count: 61 Status: Offline Project Badges:
|
I tried again to restore a WU from the UDMon backups. This time, the WU had aborted for no reason within seconds of a save, while I was doing nothing fancy with my computer (Explorer, etc.)...
I restored the backup from ~40 minutes earlier. So far, it has crunched for more than 4 hours since it started working on that WU again, and is now ~15% further than it was when it aborted (now at ~88%). We'll see if it turns into a returned result. The WU number is 2644630. |
||
|
|
retsof
Former Community Advisor USA Joined: Jul 31, 2005 Post Count: 6824 Status: Offline Project Badges:
|
> Maybe a blank screen after some time in the screen saver configuration, like the SETI client, so it won't waste time showing images.
Alther replied to a query like this, saying that the screensaver can be changed using the normal Windows settings without bothering the Rosetta program. In other words, the screen saver is optional. Many of us take out the WCG screensaver entirely and run with the screensaver set to (none). That gives the largest share of CPU time to crunching. You can always roll over the icon to check the status, or click on it to go to the large screen. The energy-saving feature of this monitor is set to turn it off after 30 minutes .... good for overnight work.
SUPPORT ADVISOR
----------------------------------------Work+GPU i7 8700 12threads School i7 4770 8threads Default+GPU Ryzen 7 3700X 16threads Ryzen 7 3800X 16 threads Ryzen 9 3900X 24threads Home i7 3540M 4threads50% [Edit 1 times, last edit by retsof at Sep 13, 2005 4:20:55 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
PepeG seems to have solved the “Unable to Process Task Data - Backing Off” error at http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=1858#31030
This problem shows up right after installation. We could add an explanation to the long list under 'Trouble-shooting' at http://www.worldcommunitygrid.org/help/viewTopic.do?shortName=trouble
Looking over this page, the section titled 'Why does my PC show 100% CPU use?' could be followed by one titled 'Why does my PC show 50% CPU use?' that reassures members whose computers use hyperthreading.
The next section, titled 'My CPU is overheating running while running the agent.', could contain a link to http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=2683 , the post by Viktors explaining the CPU Throttle Feature. Also, the section title could be changed from 'My CPU is overheating running while running the agent.' to 'My CPU is overheating while running the agent.' without any risk of confusing the reader.
Added: Trouble-shooting is a gigantic page, but if you try to get to it from HELP, it is treated as a title and you can only reach a sub-page. And it can be difficult to get from the sub-page to Trouble-shooting. Try it and see for yourself. This is not acceptable. We need to make it possible for prospective members who are having trouble to at least reach our help pages. Otherwise they will probably grow too frustrated and give up in disgust.[Edit 2 times, last edit by Former Member at Sep 19, 2005 5:38:55 PM] |
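The 50% figure mentioned for hyperthreaded machines follows from simple arithmetic, sketched below. The sketch assumes the agent runs a single compute thread and that the operating system reports CPU use averaged over all logical processors; both are assumptions for illustration, and the function name is hypothetical.

```python
def overall_cpu_percent(busy_threads, logical_cpus):
    """Overall CPU use when each busy thread fully occupies one logical CPU."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    # A thread cannot use more than one logical CPU at a time, so the number
    # of fully busy logical CPUs is capped at the number available.
    return 100.0 * min(busy_threads, logical_cpus) / logical_cpus


# One compute thread on a hyperthreaded CPU that exposes two logical processors:
print(overall_cpu_percent(1, 2))  # 50.0
# The same thread on a CPU without hyperthreading:
print(overall_cpu_percent(1, 1))  # 100.0
```

Under these assumptions, a reading of 50% on a hyperthreaded machine means the agent is using everything one thread can use, not that it is running at half speed.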
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I agree about the max processor load. I was running this on my work laptop, and ended up uninstalling it because it kept overheating to the point of shutting itself off.
|
||
|
|
|