Thread Status: Active
Total posts in this thread: 99
This topic has been viewed 222304 times and has 98 replies
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

FWIW I have tasks on 1 machine that run in excess of 15 hours. The longest running task from the previous 2 beta runs was 3 3/4 hours on said machine. Just curious as to what the techs are looking for with 4x longer running tasks.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


[Jan 10, 2015 1:11:38 PM]
yoro42
Ace Cruncher
United States
Joined: Feb 19, 2011
Post Count: 8976
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

World Community Grid | Coltrane 7.16 beta20 | BETA_OET1_0000309_xMBGP-OM_rig_0421_1 | 1/9/2015 02:31:08 PM | 15:13:53 (14:39:54) | 01:41:31 | 1/13/2015 02:31:07 PM | Running | 90.00 | [1] 00:43:11 | yoro42 | 52.45 MB | 49.75 MB
World Community Grid | Coltrane 7.16 beta20 | BETA_OET1_0000309_xMBGP-OM_rig_1228_1 | 1/9/2015 02:40:13 PM | 13:25:17 (13:03:18) | 01:29:28 | 1/13/2015 02:40:12 PM | Running | 90.00 | [1] 00:39:13 | yoro42 | 49.02 MB | 46.38 MB
World Community Grid | Coltrane 7.16 beta20 | BETA_OET1_0000309_xMBGP-OM_rig_1321_0 | 1/9/2015 02:40:13 PM | 13:28:23 (12:40:20) | 01:29:49 | 1/13/2015 02:40:11 PM | Running | 90.00 | [1] 00:54:01 | yoro42 | 48.84 MB | 46.38 MB
World Community Grid | Coltrane 7.16 beta20 | BETA_OET1_0000309_xMBGP-OM_rig_1326_0 | 1/9/2015 02:40:13 PM | 13:28:23 (12:53:29) | 03:22:05 | 1/13/2015 02:40:11 PM | Running | 80.00 | [1] 01:42:05 | yoro42 | 49.04 MB | 46.33 MB
World Community Grid | Coltrane 7.16 beta20 | BETA_OET1_0000309_xMBGP-OM_rig_1302_1 | 1/9/2015 02:40:13 PM | 13:28:23 (12:27:42) | 03:22:05 | 1/13/2015 02:40:12 PM | Running | 80.00 | [1] 01:05:24 | yoro42 | 52.55 MB | 49.82 MB
----------------------------------------

[Jan 10, 2015 1:12:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

"Just curious as to what the techs are looking for with 4x longer running tasks."
Probably answered briefly in this post earlier in the thread.
[Jan 10, 2015 1:41:29 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

1 1/2 hr. run time. No progress. What's the point of this?



Computer: homepremium764
Project World Community Grid

Name BETA_OET1_0000309_xMBGP-OM_rig_1332_0

Application Beta - Outsmart Ebola Together 7.16
Workunit name BETA_OET1_0000309_xMBGP-OM_rig_1332
State Running
Received 1/9/2015 4:36:02 PM
Report deadline 1/13/2015 4:36:06 PM
Estimated app speed 2.69 GFLOPs/sec
Estimated task size 14,530 GFLOPs
CPU time at last checkpoint 00:00:00
CPU time 01:36:08
Elapsed time 01:35:56

Estimated time remaining --
Fraction done 0.000%
Virtual memory size 50.48 MB
Working set size 53.17 MB
Directory slots/3
Process ID 3800
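
For reference, a rough sketch of the runtime those two "Estimated" figures imply, assuming the expected runtime is simply task size divided by app speed (whether the BOINC client applies further corrections is not shown here; the function names are made up for illustration):

```python
def estimated_runtime_seconds(task_size_gflop: float, app_speed_gflops: float) -> float:
    """Naive runtime estimate: amount of work divided by processing speed."""
    if app_speed_gflops <= 0:
        raise ValueError("app speed must be positive")
    return task_size_gflop / app_speed_gflops


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS (rounded to the nearest second)."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Values copied from the task properties above.
estimate = estimated_runtime_seconds(14530, 2.69)
print(format_hms(estimate))  # prints 01:30:01
```

Under that assumption the elapsed time of 01:35:56 is already past the naive estimate while the fraction done still reads 0.000%.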
----------------------------------------
[Edit 1 times, last edit by nanoprobe at Jan 10, 2015 2:09:16 PM]
[Jan 10, 2015 1:57:37 PM]
Sandvika
Advanced Cruncher
United Kingdom
Joined: Apr 27, 2007
Post Count: 112
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

Also with batch 309, 6 WUs running but not checkpointing after 4 hours. Changing LAIM setting rewrites the init_data.xml file but nothing else is going on in the slots directories since the WUs started. I guess with multiple folds in protein targets we will see huge variability in the completion times which will most likely mask the CPU times lost with suspend/resume with LAIM off. I was expecting that the suspend/resume actions would result in something appearing at least in the logs in the slots directories, but there is nothing. I would expect beta WUs to produce more verbose logs to explain what's actually happening (or not).

As for the progress and remaining time estimates: after starting, the percentages for the WUs progressed rapidly, consistent with the initial estimate of remaining time, but by >80% the linkage broke and the remaining time was in some cases "---". Eventually they dropped back to 10% one by one, and the remaining times are now either static or increasing, and contradictory in the context of the elapsed time and progress, i.e. if 10% after 4H00 then the remaining 90% won't be 0H22!

A 10% WU that I suspended and resumed without a checkpoint has presumably restarted from the beginning but the progress and remaining time did not revert to their initial behaviour. I'm expecting it to be the last one to get to 20% as it has already slipped to last position amongst those still on 10%. We shall see.

Edit: Checkpointing started at 3H47 for the WU at 30%, and at 4H09, 4H54 and 5H06 for the WUs at 20%; the other 2 WUs have not checkpointed yet at >=5H30. The WU suspended/resumed with LAIM off is not the last to get to 20% either, so overall a great deal of variation. Anecdotally, the first WUs which checkpointed are the only ones where the other copy in the quorum already has the result returned, so the first checkpoint time might prove to be proportional to the duration of the WU.
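
To put a number on the "10% after 4H00" contradiction mentioned above, here is a minimal sketch of a straight-line extrapolation. It assumes progress is linear in time, which these WUs clearly do not guarantee (progress moves in 10% jumps), so treat it as a sanity check only; the function name is hypothetical.

```python
def remaining_hours_linear(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate remaining time, assuming progress is linear in elapsed time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1 - fraction_done) / fraction_done


# 10% done after 4 hours implies roughly 36 more hours, not 0H22.
print(f"{remaining_hours_linear(4.0, 0.10):.1f} hours remaining")
```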
----------------------------------------
[Edit 1 times, last edit by Sandvika at Jan 10, 2015 4:46:52 PM]
[Jan 10, 2015 3:24:39 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

task: BETA_OET1_0000311_xSDGP-OM_rig_0910_0
elapsed, cpu: 00:02:39 (00:05:21)
checkpoint, cpu%:[0] 00:05:21 100,00
progress: 2,519
system: win7_32
ram, virtual: 42.95 MB 40.47 MB
This just-received work unit shows a higher CPU time than elapsed time. I suspended it after a couple of seconds with LAIM off, then resumed it.
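
A small sketch of how the anomaly above could be flagged automatically. It assumes a single-threaded task, for which CPU time should not exceed wall-clock time; the helper names are made up for illustration.

```python
def parse_duration(text: str) -> int:
    """Convert 'HH:MM:SS' or 'MM:SS' into a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def cpu_exceeds_elapsed(elapsed: str, cpu: str, threads: int = 1) -> bool:
    """True when the reported CPU time is more than `threads` cores could accumulate."""
    return parse_duration(cpu) > parse_duration(elapsed) * threads


# Values from the task listing above: elapsed 00:02:39, CPU 00:05:21.
print(cpu_exceeds_elapsed("00:02:39", "00:05:21"))  # prints True
```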
----------------------------------------
[Edit 1 times, last edit by Former Member at Jan 10, 2015 4:13:54 PM]
[Jan 10, 2015 4:12:07 PM]
wiesel111
Cruncher
Joined: Aug 14, 2010
Post Count: 3
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

I caught my first 4 betas today.

I'm running 3 CEP2 tasks and one beta (batch 310), with LAIM turned off. I suspended the beta task two minutes after it started. BOINC Manager showed that the first beta had stopped and a second beta had started. Looking at Task Manager, though, I see 5 tasks running, 3 CEP2 and two betas, all at about 20%! Normally I cannot run more than 4 tasks at the same time.

I have no way to stop a beta task once it has started, so I waited for the next checkpoint of a CEP2 task and suspended that one instead. Now 2 CEP2 tasks and two betas are running.

Can I let 5 tasks run in parallel, or might this be a problem?

edit:
task1: BETA_OET1_0000310_xSDGP-F_rig_0614_0
task2: BETA_OET1_0000310_xSDGP-F_rig_0601_0
----------------------------------------
[Edit 1 times, last edit by wiesel111 at Jan 10, 2015 5:26:28 PM]
[Jan 10, 2015 4:42:24 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1316
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

Confirmation of the issue already several times noted in this thread:

A suspended/paused beta task (LAIM on) shows as suspended in BOINC Manager, but the task keeps running in the background. This overcommits the machine, because BOINC will start or resume another task and/or the user wants to use the machine for their own processes.

With LAIM off and resuming after the first checkpoint is made, the task is restarted from that last checkpoint.

Another issue I saw is that the jump to 10% progress does not mean a checkpoint has been made.

I got several tasks from new batches 310, 311, 313 and 314.

The 310s seem to run faster, but they also have only 1 job within the task.
The 314s are very short.
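
As a rough model of the intended suspend/resume behaviour described above (not of the beta application's actual code), the sketch below computes where a task should continue from. It assumes that with LAIM on the process stays in memory and loses nothing, and that with LAIM off it restarts from the last checkpoint, or from zero if none exists. All names are hypothetical.

```python
from typing import Sequence


def resume_point(cpu_seconds: int, checkpoints: Sequence[int], leave_in_memory: bool) -> int:
    """Return the CPU time (seconds) from which work continues after a suspend/resume."""
    if leave_in_memory:
        # LAIM on: the process is only paused, so no work is lost.
        return cpu_seconds
    # LAIM off: the process is removed from memory and restarts from the
    # latest checkpoint written so far, or from the beginning if there is none.
    written = [c for c in checkpoints if c <= cpu_seconds]
    return max(written, default=0)


def lost_seconds(cpu_seconds: int, checkpoints: Sequence[int], leave_in_memory: bool) -> int:
    """CPU time that has to be redone after the resume."""
    return cpu_seconds - resume_point(cpu_seconds, checkpoints, leave_in_memory)


# A task suspended at 5000 s of CPU time with checkpoints at 1800 s and 4200 s.
print(lost_seconds(5000, [1800, 4200], leave_in_memory=False))  # prints 800
```

With no checkpoint written yet, the same call returns the full CPU time as lost, which matches the restarts from zero reported in this thread.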
----------------------------------------
[Edit 2 times, last edit by Crystal Pellet at Jan 11, 2015 8:49:19 AM]
[Jan 10, 2015 5:27:41 PM]
Trotador
Senior Cruncher
Joined: Mar 26, 2009
Post Count: 154
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

I picked up 31 beta units last night, 309 type, on a dual Sandy Bridge Xeon at 3 GHz (hyperthreading ON), Linux 64, BOINC 7.0.65.

The first batch of 10 units ran without interruption and took between 5:50 and 8:20 hours. Most of them have already validated, some are waiting for the wingmen to finish, and none is invalid so far.

With the remaining 21 units I've been playing with suspending and resuming, both with and without closing BOINC Manager, which has the same effect with respect to resuming. Checkpoints were taken between 1:30 and 1:50 for the first ones and around 40 to 60 minutes for the following ones.

The progress percentage advances in jumps of 10% (not related to checkpoints being made). The estimated times to completion are certainly off from the actual values.

If a task is suspended and resumed, the elapsed time goes right back to the time of the last checkpoint. However, the progress bar percentage takes some minutes to go back, and in several cases (not sure if in all of them) it goes back much further than it should, for example to 10% when it was at 80% before suspending. This does not seem to affect the WU completion time.

I've noticed that when a task resumes, the date of all the checkpoint files, even the ones from previous checkpoints, is updated to the time of the resume, not the time when the checkpoint was written.
----------------------------------------
[Edit 1 times, last edit by Trotador at Jan 10, 2015 7:36:06 PM]
[Jan 10, 2015 7:34:48 PM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 325
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

These three work units behaved as expected. With LAIM off suspend, removed from memory, resume. STDERR showed a restart with zero CPU time.
BETA_OET1_0000312_xZAGP-F_rig_0533_1
BETA_OET1_0000312_xZAGP-F_rig_0504_1
BETA_OET1_0000310_xSDGP-F_rig_0265_0

The other work unit behaved oddly.
With LAIM off: suspend, but the task was left in memory. BoincTasks does not increase the times and shows the task as suspended, while Process Explorer shows the work unit still running with times increasing.
I displayed the properties in BoincTasks, which I think (relying on my memory) showed CPU time at last checkpoint as zero.
After resuming the work unit, the properties display was definitely very odd:
CPU Last checkpoint 28:12
CPU 30:17
elapsed 26:55
STDERR did not show a resumption
BETA_OET1_0000312_xZAGP-F_rig_0506_1
[Jan 10, 2015 8:20:43 PM]