World Community Grid Forums
Thread Status: Active Total posts in this thread: 20
Author |
|
dguntner
Cruncher Joined: Jan 17, 2013 Post Count: 8 Status: Offline |
It's been a long time since I last was using a grid computing client. Recently (within the last couple of months), I've been in a state where I can do so again.
I'm running Debian 6.0.6 with the Debian-packaged version of BOINC, which is 6.10.58. I've got computing time split 50/50 between FightAIDS@Home and SETI@Home. The SETI work units have been running just swimmingly; no issues, no problem, no nuthin'.

Unfortunately, I can't say the same for FightAIDS@Home. To date (since restarting a couple of months ago), every single work unit that has been sent to me has gotten the result of "result {whatever} is no longer usable." The last couple of days, I've gotten no new work units at all. And when I tried manually doing an update this morning in the BOINC client, I got a message that the project was currently offline for maintenance.

I'm starting to come to the conclusion that I should just drop the FA@H project and concentrate all CPU cycles on the S@H project - at least they are accepting my work units when my machine is ready to send them in and don't seem to have any problems sending more.... I think FA@H is a very worthwhile project and I'd like to contribute to it, but if every work unit I get is just thrown away as being "unusable" when sent back, then the CPU usage is just wasted cycles.

But before giving up on it completely, I figured I'd ask here and see if anyone knows what's going on. Is this just a short-term problem? Something else? Will it likely be sorted out soon? I tried searching the forum but didn't really come up with anything useful on this particular subject. So, if anyone knows what's going on WRT my stated problem here, I'd love to know what it is.

Thanks.

--Dave |
||
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 2982 Status: Offline |
The FA@H WU's that you're returning, are they after the set deadline of 10 days? If so, then the question is, why? (Perhaps the cache setting is too high for the amount of time your computer is on - after all, if you haven't been in a position to run BOINC for some time, it may take a little while for it to determine the percentage of time your computer is on.) |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi,
Not commenting on SETI's schwimmingness, just from reading their alternate [when they are down] cafe thread at Berkeley they're down like 3 of 7 days a week. Members can tell you about the WCG uptime by comparison and the general rate of validation they experience in processing FAAH tasks... which is like very high.

On that message, really like you to post the log from the Result Status page. Go to My Grid > Result Status, then click on the link of an FAAH task in the Status column where it says for instance Server Aborted, Error, Invalid. Also, like you to go back into the message/event log or the stdoutdae.txt file [log record] and find the point where such a task produces this message and post this plus the before and after so we can see the sequence.

How much computing time is being logged on the Result Status page for the "result ... no longer usable"? My impression is that the task is too old, and then gets annulled with that message. That would mean such a task is sitting a long time on your system... more than 10 days, probably 12 or more without getting processed. But, that's for the moment all guessing. The log info and other bits asked for will tell us how/where to look further.

edit: Think gb is on to something. If you're caching high because SETI is off-line so often/long, then FAAH may never get its foot in sideways. Cache settings are generally advised never to exceed the shortest deadline of any project that's active on a host [WCG longest deadline is 10 days]. Either fetching is rejected because the server is told the result won't come back in time, or if with high cache you manage to get FAAH, it would process with high priority and be returned in time.

[Edit 1 times, last edit by Former Member at Jan 17, 2013 4:36:39 PM] |
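The log search asked for above (find the message, then post the lines before and after it) could be done with a small script. This is only a minimal sketch: the function name `find_with_context` is made up for illustration, and the location of stdoutdae.txt depends on how BOINC was installed, so the path in the usage note is an assumption.

```python
def find_with_context(lines, needle, before=3, after=3):
    """Return one list per match: the matching line plus its surrounding lines.

    lines  -- the log as a list of strings
    needle -- the text to look for, e.g. "no longer usable"
    before -- how many lines of context to keep ahead of each match
    after  -- how many lines of context to keep following each match
    """
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)          # do not run off the top of the log
            hits.append(lines[start:i + after + 1])
    return hits
```

To use it, read the log into a list first, for example `lines = open("stdoutdae.txt").read().splitlines()` from inside the BOINC data directory (wherever that is on your system), then print each group returned by `find_with_context(lines, "no longer usable")`.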
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7662 Status: Offline |
Another thing you might post is the machine specifications and how much time during an average day you leave it crunching.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
dguntner
Cruncher Joined: Jan 17, 2013 Post Count: 8 Status: Offline |
Thank you to those who have replied so far. I'll post my replies here for the current set of replies. :-)
As an also-note: After posting my original message, I decided to try using the "reset project" button in the BOINC client, and that did in fact result in a new work unit being sent to me. We'll see what happens with this one.... Today is 1/17; it gives me a due-date of 1/27. One thing I note that seems kind-of odd, though, is that it was downloaded over two hours ago and yet no processing has started on that work unit (it's still working on the current S@H unit), even though my preferences state that it should switch between applications every 120 minutes. Is that setting for wall-clock time, or the amount of CPU time being run up by an application? I *have* in fact seen the FAAH task go active in the BOINC client in the past; I'll be keeping an eye on it to make sure it does this time as well.

Anyway, onto my replies:

From gb009761:

"The FA@H WU's that you're returning, are they after the set deadline of 10 days? if so, then the question is, why? (perhaps the cache setting is too high for the amount of time your computer is on - after all, if you haven't been in a position to run BOINC for some time, it may take a little while for it to determine the percentage of time your computer is on)."

I don't *think* they've been getting submitted after the deadline, but to be honest, I don't really know. I'll pay closer attention to it this time and see if that's what's happening. The computer is used as a server for my home network; as such, it's on 24/7.

From SekeRob:

"Hi, Not commenting on SETI's schwimmingness, just from reading their alternate [when they are down] cafe thread at Berkeley they're down like 3 of 7 days a week. Members can tell you about the WCG uptime by comparison and the general rate of validation they experience in processing FAAH tasks... which is like very high."

Useful to know.

"On that message, really like you to post the log from the Result Status page. Go to My Grid > Result Status, then click on the link of an FAAH task in the Status column where it says for instance Server Aborted, Error, Invalid."

I checked that page as you suggested. Unfortunately, under the Results Status page (thanks for describing how to get there!), the only thing showing is the currently-downloaded work unit. And on that, it just shows "in progress." No old units of any kind are showing. Filters are all set to "all." :-/

"Also, like you to go back into the message/event log or the stdoutdae.txt file [log record] and find the point where such a task produces this message and post this plus the before and after so we can see the sequence."

I looked through the file. The latest incident of it contained several (over a few days) worth of this:
And then at the point where the failure occurs is this:
Let me know if I missed something that you need.

"How much computing time is being logged on the Result Status page for the 'result ... no longer usable'? My impression is that the task is too old, and then gets annulled with that message. That would mean such a task is sitting a long time on your system... more than 10 days, probably 12 or more without getting processed. But, that's for the moment all guessing. The log info and other bits asked for will tell us how/where to look further."

Well, as mentioned above, unfortunately the only work unit being shown at the moment on My Grid is the one currently downloaded, so I've got no other information I can give you. Maybe the log entries above help? I know I'm not liking the looks of some of them, though I don't know enough about this to know if they actually indicate a problem.

"edit: Think gb is on to something. If you're caching high because SETI is off-line so often/long, then FAAH may never get its foot in sideways. Cache settings are generally advised never to exceed the shortest deadline of any project that's active on a host [WCG longest deadline is 10 days]. Either fetching is rejected because the server is told the result won't come back in time, or if with high cache you manage to get FAAH, it would process with high priority and be returned in time."

Is there a way to determine if that is what's happening?

From Sgt.Joe:

"Another thing you might post is the machine specifications and how much time during an average day you leave it crunching."

Well, you're in luck. :-) Going through the stdoutdae.txt file for the above, I saw where it listed the machine specs as it started for the first time. The hardware hasn't changed since then, so here's that info:
And after that, of course, I reattached it to my logins in both places (after being away for a few years, I was surprised that the accounts still existed!) and let it grab the projects. --Dave |
||
|
dguntner
Cruncher Joined: Jan 17, 2013 Post Count: 8 Status: Offline |
As a quick followup to the information I provided yesterday: when I checked today, after almost 24 hours, there were three S@h tasks showing, two of which were done and waiting to be uploaded and one being worked on (the third one wasn't there when I posted yesterday). The FAAH task had not even been started on, even though I've got BOINC configured as a 50/50, and set to switch applications every 60 minutes. What the heck is going on?
I was able to manually get it to start on the FAAH task by suspending the S@h project briefly (fortunately, when I hit resume, the S@h task is showing as "waiting to run" and the FAAH is still running). I'll keep an eye on it and see how it behaves once enough time has passed that it switches back to S@h. I don't think I should have to be quite so manually involved, though.... :-) Are there any tweaks I can apply that will get this thing to balance out better? --Dave |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
You may look in the Project > Select Project > Properties screen of BOINC Manager. It would tell what the priorities are. If WCG is overworked, it would not get a turn, or only when it's getting late... that is with 6.10.58. The scheduling behavior with 7.0.4x is much more benign, but don't know if there's a package of that series which fits on Debian 6.0.6.
Switch 60 minutes... I've got it to 240 minutes... Much less flip flopping, which also on 7.0.4x won't be too often. This client tends to bulk work on one project, then move on to the next and back again. If you set [trick alert] the switch time to longer than the FAAH run time, it will likely finish in one go, when there's no High Priority processing interruption. |
||
|
dguntner
Cruncher Joined: Jan 17, 2013 Post Count: 8 Status: Offline |
Thanks for the reply, SekeRob. I checked, and the only packaged version for Squeeze (Debian 6.0.x) is 6.10.58. There's no newer already-packaged version, even via the backports channel. I did some hunting, and found that the BOINC package that's included in Wheezy (the next Debian version, which is currently in testing) is 7.0.27. So, once Wheezy goes stable, I'll be able to upgrade to that and get the newer version of BOINC in the process (of course, that's not at the 7.0.4x level you mention above, so I still won't see that particular benefit). Until then, though, I'm stuck where I am unless I decide it bugs me enough to abandon the packaged version and grab current and compile/install from there (which, at the moment, it doesn't).
As an aside, I checked, and after an hour of processing the FAAH task, it switched back to the outstanding S@h task, and then it's stayed there. <grumble> I have no idea why it's doing this....

I checked the project properties as you describe above. On the scheduling section of that pop-up window for the FAAH project, it says that CPU scheduling priority is -1252.17, CPU work fetch priority of 0.00, CPU work fetch deferred for ---, CPU work fetch deferral interval ---, Duration correction factor 1.0000. For S@h, the numbers are: CPU scheduling priority 809.72, CPU work fetch priority -84965.46, CPU work fetch deferred for ---, CPU work fetch deferral interval ---, Duration correction factor 1.8788.

Don't really understand those numbers, although I suspect the "CPU scheduling priority" is why S@h seems to be getting all the love from my system, despite the fact that both projects show "resource share" of 100 (50.00%) on the Projects tab. Is there any way to adjust those values to help even things out better?

FWIW, I've not been touching the client preferences file directly. I've left that clear, instead preferring to set the preference values via the website settings both here and at the S@h page. The reason I was using a 60-minute switch was because when at the settings page for S@h (which I've been part of way longer (1999) than here at WCG (2008)), it said that 60 was recommended. So I just stuck with that number. I'll go ahead and adjust the settings on both pages to move it to 120 to see if that makes any kind of impact.

Where is the [trick alert] switch found?

--Dave |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
If you know where to set it from 60 to 120 minutes you know where to set it to e.g. 1440 minutes [freakish long]. In the client the option is on the Processor tab of local preferences and is called "Switch between applications every..." (which is the field the trick alert was about).
That field worked slightly differently on the 5 series clients. It would force any task ahead if the value was set longer than the deadline of e.g. repair jobs [4 days at WCG]. Some would input 6000 or 7000 minutes to rush process these tasks.

Can't really understand why 50:50 is not running 50:50. Search the stdoutdae.txt log to see if you have messages such as "wont send work... on 99% and of that 100% computing". A very high cache setting would tell the client and the WCG server "look, if I give you work, but you won't send it back in time, I will only give you 1 task, as long as you keep reporting tasks too late".

Think your hand-forcing FAAH on and setting an exorbitant switch time will get things moving towards more time for WCG [or temp suspend S@H], then backfilling of the buffer has to come from WCG... or lower the cache to something less than 10 days... try 8 days. No matter what, in FIFO order of processing that task will/should be finished before the 8th day is over.

P.S. The CPU priority of -1252.17 is the reason... WCG used up its 60 minutes and handed over back to the alien search team. Can't remember ATM if that value is seconds or minutes. If minutes, see you tomorrow before the FAAH task gets another hour. |
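The cache advice in this thread (keep the work buffer no longer than the shortest deadline of any project on the host) can be written as a quick check. This is a rough sketch, not how the BOINC client decides anything internally; the function name is hypothetical, and the only deadline taken from this thread is the 10 days quoted for WCG.

```python
def cache_fits_deadlines(cache_days, deadlines_days):
    """Return True if the work buffer does not exceed the shortest deadline.

    cache_days     -- how many days of work the client is asked to keep queued
    deadlines_days -- deadlines (in days) of the projects attached to the host
    """
    shortest = min(deadlines_days)
    return cache_days <= shortest


# WCG deadline of 10 days, as quoted in this thread.
WCG_DEADLINE_DAYS = 10

print(cache_fits_deadlines(8, [WCG_DEADLINE_DAYS]))   # suggested 8-day cache
print(cache_fits_deadlines(12, [WCG_DEADLINE_DAYS]))  # a cache longer than the deadline
```

With an 8-day cache the check passes, which matches the FIFO reasoning above: a task fetched today should be reached before day 8, ahead of its 10-day deadline.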
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Oh, errata: If you have a WCG task and it won't move... alternate trick: Temporarily reduce the resource share / project weight value to something extremely low and update client. If WCG only gets for instance 1%, all its work will be processed by priority because the client will think it's only going to schedule 14.4 minutes per day. Because it knows that in 10 days it normally only schedules 144 minutes, which won't be enough, it will rush the FAAH job. My money is though on setting the switch app time to 1440 minutes... if FAAH starts [automatic or manual], it will run to the end, that is if your host is powerful enough to complete a FAAH task in under 24 hours.
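The arithmetic behind the 1% trick is easy to reproduce. The sketch below assumes a single core crunching 24 hours a day and treats resource share as a plain fraction of wall-clock time; the real scheduler is more involved, and the function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def minutes_per_day(share, total_share):
    """Minutes per day a project would get under a simple proportional split."""
    return MINUTES_PER_DAY * share / total_share


def minutes_before_deadline(share, total_share, deadline_days):
    """Total minutes a project would get before a deadline that many days away."""
    return minutes_per_day(share, total_share) * deadline_days


# WCG reduced to 1% of the total share, with the 10-day deadline from this thread.
print(round(minutes_per_day(1, 100), 1))              # 14.4
print(round(minutes_before_deadline(1, 100, 10), 1))  # 144.0

# The 100/100 split shown on the Projects tab, for comparison.
print(round(minutes_per_day(100, 200), 1))            # 720.0
```

If a FAAH task needs more than 144 minutes of CPU time, the proportional split cannot finish it before the deadline, which is why the client would be expected to run it at high priority instead.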
|
||
|