Total posts in this thread: 99
This topic has been viewed 354306 times and has 98 replies
[CSF] Thomas Dupont
Veteran Cruncher
Joined: Aug 25, 2013
Post Count: 685
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

FYI, we are continuing the beta test, with more flexible work units. The application will appear as 7.16. We are running one batch right now and another 5 if we determine it is running well. It has a minor change, but should give us more insight into how it is running on the grid.

Thanks,
-Uplinger

Thanks for the heads-up, Uplinger.
----------------------------------------
[Edit 1 times, last edit by [CSF] Thomas Dupont at Jan 10, 2015 5:58:43 AM]
[Jan 10, 2015 5:57:56 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

I received two WUs both with an estimated time of 1:36. Settings allow disk access every 30 minutes, LAIM off.

[WORK UNIT, RUNTIME, PROGRESS, CHECKPOINT]

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 00:35:00, 36%, no
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 00:35:00, 36%, no

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 00:48:00, 10%, no
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 00:48:00, 46%, no

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 01:10:00, 10%, no
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 01:10:00, 59%, no

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 01:17:00, 10%, no, estimated time to go (00:07:10)
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 01:17:00, 63%, no, estimated time to go (---)

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 01:31:00, 20%, no, estimated time to go (00:14:40)
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 01:31:00, 10%, no, estimated time to go (00:08:10)

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 02:15:00, 30%, yes (01:51:52), estimated time to go (00:28:30)
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 02:15:00, 10%, no, estimated time to go (00:12:17)
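
The samples above show the progress figure dropping back (36% to 10% on one unit, 63% to 10% on the other). As a rough sketch only, readings in this comma-separated format could be checked for such drops automatically. The function names `parse_samples` and `find_regressions` are made up for illustration, and the layout is assumed to be exactly the one used in this post, not anything BOINC itself produces.

```python
import re

# Matches "NAME, HH:MM:SS, NN%" at the start of a reading (format assumed from this post).
LINE_RE = re.compile(r"^(?P<name>\S+), (?P<runtime>\d{2}:\d{2}:\d{2}), (?P<progress>\d+)%")


def parse_samples(text):
    """Return (name, runtime_seconds, progress_percent) for each matching line."""
    samples = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip headers and blank lines
        h, mi, s = (int(part) for part in m.group("runtime").split(":"))
        samples.append((m.group("name"), h * 3600 + mi * 60 + s, int(m.group("progress"))))
    return samples


def find_regressions(samples):
    """Return (name, runtime_seconds, previous, current) wherever progress dropped."""
    last_seen = {}
    drops = []
    for name, runtime, progress in samples:
        if name in last_seen and progress < last_seen[name]:
            drops.append((name, runtime, last_seen[name], progress))
        last_seen[name] = progress
    return drops


# A few readings copied from the list above.
LOG = """\
BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 00:35:00, 36%, no
BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 00:48:00, 10%, no
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 01:17:00, 63%, no, estimated time to go (---)
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 01:31:00, 10%, no, estimated time to go (00:08:10)
"""

for name, runtime, before, after in find_regressions(parse_samples(LOG)):
    print(f"{name}: progress fell from {before}% to {after}% at {runtime // 60} min")
```
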

I'll keep them running. At the same time 4 MCM work units saved checkpoints as expected.

Edit: After 5:34 hours both betas had checkpoints, so I rebooted the machine and the work units restarted from their checkpoints. Nothing to complain about so far.
Well, it would be nice to have the checkpoints more frequently, but hey, we have very reliable machines, don't we?

Edit: Both units finished, but only 1242 (runtime nearly 19 hours) was valid. 0463 took more than 8 hours and was invalid.

System: Quad Core, no HT, Windows 7-64, BOINC 7.2.47
----------------------------------------
[Edit 4 times, last edit by Former Member at Jan 11, 2015 4:46:16 PM]
[Jan 10, 2015 7:05:50 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

#3 is the weirdest yet.
nanoprobe, I'm glad you mentioned the weird suspend behaviour, because I thought I was going mad. I suspended Beta tasks on one machine with LAIM off, resumed them, then checked stderr - but there was no indication of a restart. The CPU Time and Progress figures in BOINC Manager seemed to continue from where they were before the suspend. This was before the tasks had checkpointed at all. In my case, all 4 cores of the quad were running Beta tasks at the time, but with UGM tasks waiting to run. Here's a Result Log (my Suspend was at 22:19 GMT) - no sign of a restart; this task is now Pending Validation:

Result Name: BETA_OET1_0000309_xMBGP-OM_rig_0519_0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[22:09:29] Number of tasks = 1
[22:09:29] Running task 0,CPU time at start of task 0 was 0.000000
[22:09:29] ./ZINC01671640.pdbqt size = 20 1 ../../projects/www.worldcommunitygrid.org/beta20.xMBGP-OM_rig.pdbqt size = 1930 0
[05:11:09] Finished task #0 cpu time used 24020.191575
05:11:10 (17420): called boinc_finish

</stderr_txt>
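
One way to cross-check a log like this is to compare the wall-clock span between its first and last timestamps with the reported CPU time. The following is only a sketch: `wall_seconds` and `summarize` are illustrative names, the line formats are assumed from this single log, and at most one midnight rollover is assumed. A gap between the two figures does not by itself prove a suspend or a restart.

```python
import re
from datetime import datetime, timedelta


def wall_seconds(start, end):
    """Elapsed seconds between two HH:MM:SS stamps, assuming at most one midnight rollover."""
    fmt = "%H:%M:%S"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)  # the log crossed midnight
    return int((t1 - t0).total_seconds())


def summarize(stderr_text):
    """Return (wall_clock_seconds, cpu_seconds) taken from a stderr excerpt."""
    stamps = re.findall(r"^\[(\d{2}:\d{2}:\d{2})\]", stderr_text, flags=re.MULTILINE)
    cpu = re.search(r"cpu time used ([\d.]+)", stderr_text)
    return wall_seconds(stamps[0], stamps[-1]), float(cpu.group(1))


# Two lines copied from the result log above.
SAMPLE = """\
[22:09:29] Number of tasks = 1
[05:11:09] Finished task #0 cpu time used 24020.191575
"""

wall, cpu = summarize(SAMPLE)
print(f"wall clock: {wall} s, CPU: {cpu:.0f} s, difference: {wall - cpu:.0f} s")
```
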


PS Another Beta task is still running after 10 hours CPU Time, 90% Progress, last checkpoint at 8h 52m.
[Jan 10, 2015 8:16:23 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Status: Offline

Again, proof that VINA runs much faster on Linux.
On the same server hardware I have 2 VMs: one 16-core Linux and one 28-core Windows.
All 16 betas on the Linux VM were finished and returned this morning, with an average CPU runtime of 7 hours and 56 minutes.
On the Windows VM I started 8 betas; they have now been running for 11 hours and show progress of 50%, 60%, 60%, 70%, 70%, 80%, 80%, 80%.

On another (faster) Windows host I got 8 tasks.
1 returned (10h52m), 4 in progress (70% and 80%) and 3 tasks reserved to test this afternoon with suspending/resuming after the first 10% checkpoint.

So far this Beta with flexible work units is running smoothly. Well done, team.
[Jan 10, 2015 9:12:17 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

I got a number of these betas yesterday (8 Jan)(299 and 308 series) and today (9 Jan)(309 series), for Linux-x64 only.
(BTW, the Linux x64 beta program is 64-bit).
All tasks had "_rig_" in their names.
I couldn't seem to bust any by checkpointing them.
However, my i7-970 got 1 invalid result, and this machine hasn't had one of those for several years, if ever: BETA_OET1_0000308_xMBGP-F_rig_1656_1--. The result log shows 3 starts (2 suspends?) and no error messages.

Today's tasks (309s) are running much longer than yesterday's 299s and 308s, by a factor of about 5.
Both series seem to checkpoint about 10 times during their runs.
Thus the shorter tasks checkpointed acceptably often, but the 309s often run over 2 hrs between checkpoints on the 970. I am monitoring the tasks using BoincTasks 1.66, which shows time since last checkpoint in the Checkpoint column of its Tasks tab (a handy feature), and have seen up to 2h30m and counting there. I think that's longer than is desirable.
I know that one could get more accurate info on checkpoint frequency by setting the checkpoint_debug flag and trawling through the messages files, but I haven't got time.
OTOH, I would expect that the timestamps on the .ckp files in the slot directories would reflect the checkpoint times, but under Linux at least, all files have identical timestamps.
Why is this so?

My "farm" consists of the abovementioned 970 plus several 2600Ks and 3770Ks. On tasks other than these betas, the 970 (thirsty slug) takes about 40% longer than the LGA1155 machines at similar clock speeds. With these betas, it seems to take about 100% longer. Is there a simple explanation for this, eg different compiler or compiler flags used, more extensive use of CPU SIMD instructions, eg AVX?

The 309 betas on the 970 are still running, showing 20% complete after 9-10 hours. Are they likely to bump into an upper limit on CPU time?
---
BTW, another handy feature of BoincTasks is that one can set a "Suspend at checkpoint" condition on a task. (Right-click the task's entry in the Tasks tab). It's been useful in the current exercise, and would be useful when running any projects with long checkpoint intervals, e.g. when one wants to shut down a machine without losing crunching time. If the flag is set there will be an '=' next to the task's Checkpoint column entry. It does exhibit some quirky behaviour though. Using it may consume considerable computing resources since it appears to function by polling, and this may be the reason such a feature has not been implemented for BOINC Manager.
[Jan 10, 2015 9:22:18 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Status: Offline

...
The 309 betas on the 970 are still running, showing 20% complete after 9-10 hours. Are they likely to bump into an upper limit on CPU time?
...

I checked the upper limit -> fpops_bound / fpops = maximum run time in seconds

For the machine where I checked this, it is 42 hours.
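
The arithmetic behind that check can be sketched as follows. The parameter names simply follow the post (`fpops_bound` as the work unit's operation bound, `fpops` as the speed used for the host), and the interpretation of the divisor as operations per second is an assumption. The numbers in the example are invented so that the result comes out at 42 hours; they were not read from a real work unit.

```python
def max_runtime_hours(fpops_bound, fpops):
    """Upper run-time limit in hours: operation bound divided by operations per second."""
    if fpops <= 0:
        raise ValueError("fpops must be positive")
    seconds = fpops_bound / fpops
    return seconds / 3600.0


# Hypothetical values, chosen only to illustrate a 42-hour limit.
example_bound = 4.536e14  # floating-point operations allowed for the task (assumed)
example_speed = 3.0e9     # floating-point operations per second for the host (assumed)
print(f"{max_runtime_hours(example_bound, example_speed):.1f} h")
```

Since the limit scales inversely with the speed figure, a host rated at half the speed would get twice the allowance for the same work unit.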
[Jan 10, 2015 10:24:40 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

I didn't stay up last night to babysit this latest 309 batch, and I'm glad I didn't!

On my lappie the first checkpoint this time was after 4.5 hours. On my ancient deskside it was after 10 hours. Only one has finished and it's still PV. All the ones from the previous 299 batch are now validated, and I had restarted all of those after the first checkpoint.

I've found that, although the checkpoint files in the slot directory all have the time of the latest write, those in the enclosed vina_checkpoint directory have what appears to be the time of the actual checkpoint.
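
For anyone who wants to read those times without clicking through a file manager, here is a minimal sketch that lists the modification times of the files in such a directory and the gaps between them. The path `slots/0/vina_checkpoint` is only a placeholder, and `checkpoint_times` and `intervals_minutes` are illustrative names. Gaps only show up if files from earlier checkpoints are still present; if every checkpoint overwrites the same files, only the latest time is visible.

```python
from datetime import datetime
from pathlib import Path


def checkpoint_times(checkpoint_dir):
    """Sorted modification times (epoch seconds) of the regular files in checkpoint_dir."""
    return sorted(p.stat().st_mtime for p in Path(checkpoint_dir).iterdir() if p.is_file())


def intervals_minutes(times):
    """Minutes between consecutive modification times."""
    return [(later - earlier) / 60 for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Placeholder path: substitute the slot directory of the running task.
    directory = Path("slots/0/vina_checkpoint")
    if directory.is_dir():
        times = checkpoint_times(directory)
        for t in times:
            print(datetime.fromtimestamp(t).strftime("%Y-%m-%d %H:%M:%S"))
        print("gaps (min):", [round(gap) for gap in intervals_minutes(times)])
    else:
        print(f"{directory} not found")
```
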

Subsequent checkpoints seem to come more quickly (around 2 hours on my lappie for batch 309), but are still what I consider to be a long way apart for my rather "ordinary" machines. I would argue that a project where any of the checkpoints are likely to be more than 3 hours apart on a typical volunteer's box should be opt-in only, but that's just my view. It is based on the point I already made about machines being switched on and switched off again before a checkpoint has occurred.
[Jan 10, 2015 11:26:11 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

I would argue that a project where any of the checkpoints are likely to be more than 3 hours apart on a typical volunteer's box should be opt-in only, but that's just my view. It is based on the point I already made about machines being switched on and switched off again before a checkpoint has occurred.
And the reason for the opt-in status should be made clear in the System Requirements. My Projects does ask you to review them for OET, and LAIM on is "encouraged", but I would like to see the reasoning given too.
[Jan 10, 2015 11:52:04 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

Hmm, weird display in BM now: both units were back to 10% although they have been running for more than 6 hours, and they show an estimated time to go of about 1 hour and 38 minutes?

[WORK UNIT, RUNTIME, PROGRESS, CHECKPOINT]

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 06:18:00, 20%, yes, estimated time to go (01:00:40)
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 07:13:00, 10%, yes, estimated time to go (00:39:01)

Edit:

BETA_OET1_0000309_xMBGP-OM_rig_0463_0, 08:25:00, 100%, yes, uploaded to host
BETA_OET1_0000309_xMBGP-OM_rig_1242_1, 12:20:00, 40%, yes, estimated time to go (02:57:41 and rising)

Edit: Both units finished, but only 1242 (runtime nearly 19 hours) was valid. 0463 took more than 8 hours and was invalid.

System: Quad Core, no HT, Windows 7-64, BOINC 7.2.47
----------------------------------------
[Edit 2 times, last edit by Former Member at Jan 11, 2015 4:47:33 PM]
[Jan 10, 2015 12:42:27 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

The estimated time to completion was highly misleading for the Beta tasks I've just completed, too. For most of the processing time it increased.
[Jan 10, 2015 12:56:32 PM]