| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 49
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Never saw the stuck state on the first dozen run off Linux-64 11.10. Then, on queue v.v. the 1 per host limitation, I did a remote boot last night [from a far, far away land] with the free version of the multiplatform TeamViewer7 into W7-64 and am now seeing 99.88% efficiency... a good half percent better than Linux [and 25% faster at that, for this version of the app on W7-64].
--//-- |
||
|
|
pvh513
Senior Cruncher Joined: Feb 26, 2011 Post Count: 260 Status: Offline Project Badges:
|
I have seen the stuck WU problem once (on 5 WUs simultaneously if memory serves). I stopped the client and restarted immediately afterwards (with LAIM enabled) and everything appeared to be normal after that, just like kateiacy reported. Haven't seen the problem since even though I run cfsw exclusively on an 8- and 24-core system (if I can get the WUs) so the problem is rare. I have completed 190 WUs so far. Stopping to send out units seems rather harsh to me in view of the fact that the problem is so rare and very easy to fix. I run openSUSE 11.4 and BOINC 7.0.23
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quick question to the WCG Techs:
Since the limitation of CFSW WU's to the 4 core Linux machines was implemented, I have not been able to get any WU's from any project on my 4 core machine except C4CW. I am not even able to snag the one CFSW at a time limit. Is this by intent?
Message log snipped:
Fri 20 Apr 2012 12:46:06 PM MST World Community Grid [sched_op_debug] CPU work request: 286524.36 seconds; 0.00 CPUs
Fri 20 Apr 2012 12:46:06 PM MST World Community Grid [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Scheduler request completed: got 0 new tasks
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid [sched_op_debug] Server version 700
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks sent
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Computing for Sustainable Water
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Say No to Schistosoma
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for GO Fight Against Malaria
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Drug Search for Leishmaniasis
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for The Clean Energy Project - Phase 2
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Help Cure Muscular Dystrophy - Phase 2
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Help Fight Childhood Cancer
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Help Conquer Cancer
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for FightAIDS@Home
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Discovering Dengue Drugs - Together - Phase 2
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Discovering Dengue Drugs - Together - Phase 2 (Type A)
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for the applications you have selected.
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Project requested delay of 11 seconds |
||
|
|
Jason1478963
Senior Cruncher United States Joined: Sep 18, 2005 Post Count: 295 Status: Offline Project Badges:
|
I also think this is a very minor problem considering the number of machines/cores running this and the work units done so far. Could this just be an issue with a machine and its install?
|
||
|
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 728 Status: Offline Project Badges:
|
I've had two, both on dual core crunchers running Ubuntu 10.04 server with 2GB of RAM. I have several quad core machines with less RAM that have had no issues so far, and another dual core that, according to the Results Status page, has a page and a half of cached CfSW work, none of which is actually on the machine. Other projects show up just fine.
---------------------------------------- Currently being moderated under false pretences |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
The reason we have implemented the 1 per host limit for Linux is that there is a chance that your machine will go mostly idle. By this I mean that 3 out of the 4 cores on a machine will be stalled and not processing any work. BOINC, however, views these as being in a Running state, which means they will show on the client as running for 1+ day. Our fear is that if someone has CFSW selected as their only project, their computer may get stuck in this state for days on end. I noticed it on my test machines after the launch, and some work units showed almost 2 days of runtime at 5% done. They were stalled, so when I had to stop them, they started from the last checkpoint, which was back at an hour of work. As you know, many members would be completely furious with this issue, as not every member checks their machines every day. Instead of wasting computer idle time with stalled work units, this allows members to contribute to other projects while we work out the root cause of this issue.
Now, why this wasn't caught in alpha/beta is due to the fact that we have two grids and the work units were being suspended to allow work from the production grid to operate, thus forcing even the stalled work units to stop and restart every hour or so. Thanks, -Uplinger |
||
|
|
pvh513
Senior Cruncher Joined: Feb 26, 2011 Post Count: 260 Status: Offline Project Badges:
|
I think that users should be allowed to make up their own mind about this. They can always opt out of cfsw until the issue is resolved if they don't want to run the risk of having part of their rig running idle. Leaving the users the freedom to choose what they want to do has always been part of the Linux philosophy. Also, stuck work units are nothing new, e.g. Rosetta@home had a similar problem for a long time and I recently had a stuck work unit with Poem@Home as well. There is always some risk that work units go awry; it is foolish to expect that they never do.
|
||
|
|
KWSN - A Shrubbery
Master Cruncher Joined: Jan 8, 2006 Post Count: 1585 Status: Offline |
Uplinger, I simply have to say the obvious. We need more betas
---------------------------------------- Distributed computing volunteer since September 27, 2000 |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
uplinger: additional info you requested

It appears to me like there are 5 stuck wu's, each getting this message in boincview:

Host Project Date Message
Ramjet-OctiCore5 World Community Grid 4/19/2012 5:30:36 PM Task cfsw_0010_00010215_0 exited with zero status but no 'finished' file
Ramjet-OctiCore5 World Community Grid 4/19/2012 5:30:36 PM If this happens repeatedly you may need to reset the project.

I have tried suspending all other work, running only 1 stuck wu at a time, but they all come up with that error msg. I now have them suspended, running my last 6 tasks and they seem to run fine. But I have 96 hours of runtime in the 5 stuck ones, hate to abort them but I see no other choice.

ramjet@Ramjet-OctiCore5:~$ ps -ef | grep wcg
boinc 5098 963 98 Apr18 ? 23:37:19 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 5099 5098 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 5100 5099 0 Apr18 ? 00:00:18 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 12020 963 98 Apr18 ? 22:26:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12021 12020 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12022 12021 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12025 963 90 Apr18 ? 20:21:29 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12026 963 22 Apr18 ? 04:59:01 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12027 12025 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12028 963 4 Apr18 ? 01:04:23 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12029 12026 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12030 12027 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12031 963 0 Apr18 ? 00:12:12 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12032 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12033 12029 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12034 963 0 Apr18 ? 00:12:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12035 12028 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12036 12032 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12037 963 0 Apr18 ? 00:12:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12038 12034 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12039 12036 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12041 12035 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12043 12038 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12045 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
boinc 12046 12031 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12047 12046 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12051 12045 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
boinc 12052 12037 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12053 12052 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12054 12051 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
ramjet 12415 12398 0 16:24 pts/0 00:00:00 grep --color=auto wcg
ramjet@Ramjet-OctiCore5:~$

Jason,
Yeah, you have the stuck work units on your machine. You can see that usually there are 3 lines with -s <number>, but the work units that are stuck only have 2. For example, 290659766 has 3 threads, but 491499442 has only 2, which are at 0% cpu usage. You can do a kill on the pids for the stuck work units; kill the master of the two. You can tell which one is the master because the 3rd column of the other line has a reference to the master's pid, which is shown in the second column. This will stop both threads and the work unit will restart from the last checkpoint. Unfortunately, the time shown in boinc is going to go down to the last cpu time at the previous checkpoint. I am sorry for this and we are testing possible solutions for this problem. Thanks, -Uplinger |
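For anyone who wants to script the check described above, here is a rough Python sketch. It assumes the usual `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and that a healthy work unit shows three processes sharing the same `-s <number>`. The function names, the `./app` command and the sample rows are made up for illustration and are not taken from the listing above. It only reports the master pid; it does not kill anything.

```python
import re
from collections import defaultdict

# Matches the "-s <number>" argument that identifies a work unit.
SEED_RE = re.compile(r"\s-s\s+(\d+)")


def group_by_seed(ps_text):
    """Group (pid, ppid) pairs from `ps -ef` output by their -s value."""
    groups = defaultdict(list)
    for line in ps_text.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header, prompt, or other non-process line
        match = SEED_RE.search(fields[7])
        if match:
            groups[match.group(1)].append((int(fields[1]), int(fields[2])))
    return groups


def find_stuck_masters(ps_text, expected=3):
    """Return {seed: master_pid} for seeds with fewer than `expected` processes.

    The master is taken to be the process whose parent is outside the group,
    i.e. the one whose pid the other line references in its PPID column.
    """
    stuck = {}
    for seed, procs in group_by_seed(ps_text).items():
        if len(procs) >= expected:
            continue
        pids = {pid for pid, _ in procs}
        masters = [pid for pid, ppid in procs if ppid not in pids]
        if len(masters) == 1:
            stuck[seed] = masters[0]
    return stuck


# Hypothetical sample: seed 111 is healthy (3 processes), seed 222 is not (2).
SAMPLE = """\
boinc 100 1 98 Apr18 ? 23:37:19 ./app -t 240 -s 111 -n 8
boinc 101 100 0 Apr18 ? 00:00:00 ./app -t 240 -s 111 -n 8
boinc 102 101 0 Apr18 ? 00:00:18 ./app -t 240 -s 111 -n 8
boinc 200 1 0 Apr18 ? 00:12:17 ./app -t 240 -s 222 -n 8
boinc 201 200 0 Apr18 ? 00:00:00 ./app -t 240 -s 222 -n 8
"""

if __name__ == "__main__":
    for seed, pid in find_stuck_masters(SAMPLE).items():
        print(f"seed {seed}: master pid {pid}")
```

To try it on a real host you could paste the output of `ps -ef | grep wcg` into a string and pass it to `find_stuck_masters`; check the reported pid by eye before running `kill` on it.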
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Uplinger, I simply have to say the obvious. We need more betas

You would like to see that...but you also have to view it from my standpoint, I don't want anyone else with Sapphire beta :P On a serious note, when we do have a solution to the problem there will be another beta, and it will more than likely be focused on the Linux platform only. Thanks, -Uplinger |
||
|
|
|