Thread Status: Active
Total posts in this thread: 49
This topic has been viewed 122660 times and has 48 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Known Issue with Linux stuck workunits

Never saw the stuck state on the first dozen run off Linux-64 11.10, then on queue v.v. the 1 per host limitation did a remote boot last night [from a far far away land] with the free version of the multiplatform TeamViewer7 into W7-64 and now seeing 99.88% efficiency... a good half percent better than Linux [and 25% faster at that, for this version of app on W7-64].

--//--
[Apr 20, 2012 11:17:26 AM]
pvh513
Senior Cruncher
Joined: Feb 26, 2011
Post Count: 260
Status: Offline
Re: Known Issue with Linux stuck workunits

I have seen the stuck WU problem once (on 5 WUs simultaneously, if memory serves). I stopped the client and restarted it immediately afterwards (with LAIM enabled), and everything appeared to be normal after that, just like kateiacy reported. I haven't seen the problem since, even though I run cfsw exclusively on an 8- and a 24-core system (if I can get the WUs), so the problem is rare. I have completed 190 WUs so far. Ceasing to send out units seems rather harsh to me, given that the problem is so rare and very easy to fix. I run openSUSE 11.4 and BOINC 7.0.23.
[Apr 20, 2012 5:55:30 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Known Issue with Linux stuck workunits

Quick question to the WCG Techs:

Since the limitation of CFSW WU's to the 4 core Linux machines was implemented, I have not been able to get any WU's from any project on my 4 core machine except C4CW. I am not even able to snag the one CFSW at a time limit. Is this by intent?

Message log snipped:
Fri 20 Apr 2012 12:46:06 PM MST World Community Grid [sched_op_debug] CPU work request: 286524.36 seconds; 0.00 CPUs
Fri 20 Apr 2012 12:46:06 PM MST World Community Grid [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Scheduler request completed: got 0 new tasks
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid [sched_op_debug] Server version 700
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks sent
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Computing for Sustainable Water
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Say No to Schistosoma
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for GO Fight Against Malaria
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Drug Search for Leishmaniasis
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for The Clean Energy Project - Phase 2
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Help Cure Muscular Dystrophy - Phase 2
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Help Fight Childhood Cancer
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Help Conquer Cancer
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for FightAIDS@Home
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Discovering Dengue Drugs - Together - Phase 2
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for Discovering Dengue Drugs - Together - Phase 2 (Type A)
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Message from server: No tasks are available for the applications you have selected.
Fri 20 Apr 2012 12:46:08 PM MST World Community Grid Project requested delay of 11 seconds
[Apr 20, 2012 9:34:56 PM]
Jason1478963
Senior Cruncher
United States
Joined: Sep 18, 2005
Post Count: 295
Status: Offline
Re: Known Issue with Linux stuck workunits

I also think this is a very minor problem considering the number of machines/cores running this and the work units done so far. Could this just be an issue with a machine and its install?
[Apr 20, 2012 9:35:22 PM]
Dark Angel
Veteran Cruncher
Australia
Joined: Nov 11, 2005
Post Count: 728
Status: Offline
Re: Known Issue with Linux stuck workunits

I've had two, both on dual core crunchers running Ubuntu 10.04 server with 2GB of RAM. I have several quad core machines, with less RAM, that have had no issues so far, and another dual core that, according to the Results Status page, has a page and a half of cached CfSW work, none of which is actually on the machine. Other projects show up just fine.
----------------------------------------

Currently being moderated under false pretences
[Apr 20, 2012 10:05:56 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Known Issue with Linux stuck workunits

The reason we have implemented the 1 per host limit for Linux is that there is a chance your machine will go mostly idle. By this I mean that 3 out of the 4 cores on a machine will be stalled and not processing any work. BOINC, however, views these as being in a Running state, which means they will show on the client as running for 1+ days. Our fear is that if someone has CFSW selected as their only project, their computer may get stuck in this state for days on end. I have noticed it on my test machines after the launch, and some work units showed almost 2 days of runtime at 5% done. They were stalled, so when I had to stop them, they restarted from the last checkpoint, which was back at an hour of work. As you know, many members would be completely furious with this issue, as not every member checks their machines every day. Instead of wasting computer idle time with stalled work units, this allows members to contribute to other projects while we work out the root cause of this issue.
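For anyone who wants to check for this state without watching the client, here is a minimal sketch of one possible approach. It is not a WCG or BOINC tool. It assumes you take two snapshots of the TIME column from `ps -ef` for the top-level science app processes, some minutes apart, and treats a process whose cumulative CPU time has not advanced as possibly stalled. The function names and the `min_delta` threshold are made up for illustration.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME value such as '23:37:19' or '1-02:03:04' to seconds."""
    days = 0
    if "-" in time_field:
        # ps prints long-running totals as DAYS-HH:MM:SS
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in time_field.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad MM:SS up to HH:MM:SS
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def possibly_stalled(before, after, min_delta=1):
    """Return PIDs present in both snapshots whose CPU time grew by less
    than min_delta seconds.

    before/after map PID -> cumulative CPU seconds. min_delta is a
    hypothetical threshold; pick the sampling interval and threshold to
    suit your own machine.
    """
    return sorted(
        pid
        for pid in before
        if pid in after and after[pid] - before[pid] < min_delta
    )
```

Helper processes of a healthy task can also sit at almost no CPU time, so this only makes sense for the top-level process of each task, and a flagged PID is a hint to go look, not proof of a stall.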

Now, the reason this wasn't caught in alpha/beta is that we have two grids, and the work units were being suspended to allow work from the production grid to run. Thus, every hour or so, even the stalled work units were forced to stop and restart.

Thanks,
-Uplinger
[Apr 20, 2012 11:38:17 PM]
pvh513
Senior Cruncher
Joined: Feb 26, 2011
Post Count: 260
Status: Offline
Re: Known Issue with Linux stuck workunits

I think that users should be allowed to make up their own minds about this. They can always opt out of cfsw until the issue is resolved if they don't want to run the risk of having part of their rig running idle. Leaving users the freedom to choose what they want to do has always been part of the Linux philosophy. Also, stuck work units are nothing new, e.g. Rosetta@home had a similar problem for a long time, and I recently had a stuck work unit with Poem@Home as well. There is always some risk that work units go awry; it is foolish to expect that they never do.
[Apr 21, 2012 1:50:41 AM]
KWSN - A Shrubbery
Master Cruncher
Joined: Jan 8, 2006
Post Count: 1585
Status: Offline
Re: Known Issue with Linux stuck workunits

Uplinger, I simply have to say the obvious. We need more betas :D
----------------------------------------

Distributed computing volunteer since September 27, 2000
[Apr 21, 2012 2:57:03 AM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Known Issue with Linux stuck workunits

uplinger: additional info you requested

It appears to me like there are 5 stuck wu's, each getting this message in boincview:
Host Project Date Message
Ramjet-OctiCore5 World Community Grid 4/19/2012 5:30:36 PM Task cfsw_0010_00010215_0 exited with zero status but no 'finished' file
Ramjet-OctiCore5 World Community Grid 4/19/2012 5:30:36 PM If this happens repeatedly you may need to reset the project.

I have tried suspending all other work, running only 1 stuck wu at a time, but they all come up with that error msg. I now have them suspended and am running my last 6 tasks, which seem to run fine. But I have 96 hours of runtime in the 5 stuck ones; I hate to abort them, but I see no other choice. :(


ramjet@Ramjet-OctiCore5:~$ ps -ef | grep wcg
boinc 5098 963 98 Apr18 ? 23:37:19 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 5099 5098 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 5100 5099 0 Apr18 ? 00:00:18 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 290659766 -c baygame.db -Q A00010215.sql -n 8
boinc 12020 963 98 Apr18 ? 22:26:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12021 12020 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12022 12021 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 245577611 -c baygame.db -Q A00010121.sql -n 8
boinc 12025 963 90 Apr18 ? 20:21:29 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12026 963 22 Apr18 ? 04:59:01 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12027 12025 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12028 963 4 Apr18 ? 01:04:23 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12029 12026 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12030 12027 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 402454482 -c baygame.db -Q A00055469.sql -n 8
boinc 12031 963 0 Apr18 ? 00:12:12 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12032 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12033 12029 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 668651325 -c baygame.db -Q A00055211.sql -n 8
boinc 12034 963 0 Apr18 ? 00:12:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12035 12028 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12036 12032 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12037 963 0 Apr18 ? 00:12:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12038 12034 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12039 12036 0 Apr18 ? 00:00:17 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 871081157 -c baygame.db -Q A00054952.sql -n 8
boinc 12041 12035 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 675555241 -c baygame.db -Q A00055084.sql -n 8
boinc 12043 12038 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 1982038206 -c baygame.db -Q A00054510.sql -n 8
boinc 12045 963 0 Apr18 ? 00:12:14 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
boinc 12046 12031 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12047 12046 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 311102980 -c baygame.db -Q A00055064.sql -n 8
boinc 12051 12045 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
boinc 12052 12037 0 Apr18 ? 00:00:00 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12053 12052 0 Apr18 ? 00:00:16 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 491499442 -c baygame.db -Q A00064633.sql -n 8
boinc 12054 12051 0 Apr18 ? 00:00:15 ../../projects/www.worldcommunitygrid.org/wcgrid_cfsw_baygame_6.05_i686-pc-linux-gnu -t 240 -y 1990 -s 212781843 -c baygame.db -Q A00066840.sql -n 8
ramjet 12415 12398 0 16:24 pts/0 00:00:00 grep --color=auto wcg
ramjet@Ramjet-OctiCore5:~$



Jason,

Yeah, you have the stuck work units on your machine. You can see that usually there are 3 lines with -s <number>, but the work units that are stuck only have 2. For example, 290659766 has 3 threads, but 491499442 has only 2, which are at 0% CPU usage. You can do a kill on the PIDs for the stuck work units; kill the master of the two. You can tell which one is the master because the 3rd column (the parent PID) of the other line refers to the PID in the master's second column. This will stop both threads, and the work unit will restart from the last checkpoint. Unfortunately, the time shown in BOINC is going to go down to the CPU time at the previous checkpoint. I am sorry for this, and we are testing possible solutions for this problem.
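The check described above can be sketched in a few lines of Python. This is only an illustration, not a WCG tool. It assumes the standard `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and groups the lines by their `-s <number>` argument. A group is flagged when it has fewer than 3 processes or when every process in it shows 0 in the C (CPU) column; the second condition and all function names are assumptions added for this sketch.

```python
import re
from collections import defaultdict

EXPECTED_PROCESSES = 3  # a healthy work unit usually shows 3 lines per seed


def parse_ps_line(line):
    """Return (pid, ppid, cpu, seed) for a science-app line, else None."""
    fields = line.split(None, 7)  # the 8th field is the whole command line
    if len(fields) < 8 or not all(f.isdigit() for f in fields[1:4]):
        return None  # header line or something unexpected
    seed = re.search(r"\s-s\s+(\d+)", fields[7])
    if seed is None:
        return None  # e.g. the grep process itself
    return int(fields[1]), int(fields[2]), int(fields[3]), seed.group(1)


def group_by_seed(ps_output):
    """Map each -s seed to a list of (pid, ppid, cpu) tuples."""
    groups = defaultdict(list)
    for line in ps_output.splitlines():
        parsed = parse_ps_line(line)
        if parsed:
            pid, ppid, cpu, seed = parsed
            groups[seed].append((pid, ppid, cpu))
    return groups


def master_pid(procs):
    """The master is the process whose parent is not part of the group."""
    pids = {pid for pid, _, _ in procs}
    masters = [pid for pid, ppid, _ in procs if ppid not in pids]
    return masters[0] if len(masters) == 1 else None


def suspected_stuck(ps_output, expected=EXPECTED_PROCESSES):
    """Map seed -> master PID for groups that look stuck (heuristic only)."""
    stuck = {}
    for seed, procs in group_by_seed(ps_output).items():
        too_few = len(procs) < expected
        all_idle = all(cpu == 0 for _, _, cpu in procs)  # assumption, see above
        if too_few or all_idle:
            stuck[seed] = master_pid(procs)
    return stuck
```

Paste the output of `ps -ef | grep wcg` into a string and pass it to `suspected_stuck`. The sketch deliberately only reports PIDs; review the list yourself before running `kill` on a master PID.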

Thanks,
-Uplinger
[Apr 21, 2012 3:20:35 AM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Known Issue with Linux stuck workunits

Uplinger, I simply have to say the obvious. We need more betas :D


You would like to see that... but you also have to view it from my standpoint: I don't want anyone else with Sapphire beta :P

On a serious note, when we do have a solution to the problem, there will be another beta, and it will more than likely focus on the Linux platform only.

Thanks,
-Uplinger
[Apr 21, 2012 3:22:10 AM]