Total posts in this thread: 9
This topic has been viewed 1788 times and has 8 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Is this a non-converging work unit?

I read a response by cleanenergy. Is this what is happening with this work unit? If so, should I abort it? Or should I let it continue? Is a non-converging work unit still useful for the science?

Let me state first off that I have read: Project Checkpoint Saving - How to Minimize Progress Loss on Close/Restart

I also did a “checkpoint” search in the CEP forum. I read the most relevant of the 113 posts, some of which I understood and some of which I did not.

I read, and understood, that there are 16 jobs to each work unit. I also understand that a checkpoint is only made after each job. I also gather that these checkpoints can be hours apart.

Work unit number E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0 has been running for over 9 hours. The time without a checkpoint DOES NOT worry me.

What worries me is the fact that it is over 70% complete without a checkpoint. Is it possible that the first of the 16 jobs has not been completed by now?
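A minimal sketch of the checkpoint behaviour described above, assuming (as stated in this post) that a checkpoint is written only when a whole job finishes. This is a toy model, not CEP2 code; the function name and the `interrupt_during` parameter are hypothetical.

```python
# Toy model of per-job checkpointing; NOT the actual CEP2 code.
# Assumption: a checkpoint is written only when a whole job finishes.

def run_work_unit(num_jobs, start_job=0, interrupt_during=None):
    """Run jobs start_job..num_jobs-1 in order.

    interrupt_during is the index of the job during which the client is
    shut down (hypothetical), or None to run to completion.
    Returns the job index to resume from, i.e. the last checkpoint.
    """
    checkpoint = start_job  # every job before this index is safely saved
    for job in range(start_job, num_jobs):
        if job == interrupt_during:
            # Work done inside this job is lost; only the checkpoint survives.
            return checkpoint
        # ... the job's calculation would happen here ...
        checkpoint = job + 1  # checkpoint written after the job completes
    return checkpoint


# Interrupted during the first job: nothing has been checkpointed yet.
print(run_work_unit(16, interrupt_during=0))
# Interrupted during the fifth job (index 4): resume from job index 4.
print(run_work_unit(16, interrupt_during=4))
# No interruption: all 16 jobs are done.
print(run_work_unit(16))
```

In this model, a first job that runs for many hours leaves the checkpoint at zero no matter how much time has passed, which matches what the task properties showed here.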

I am going to try to provide you with any information/screenshots that I think you may or may not ask for. I know, I know, I forgot something...

BOINC message tab:
10/21/2011 2:11:30 PM     Starting BOINC client version 6.10.58 for windows_intelx86
10/21/2011 2:11:30 PM log flags: file_xfer, sched_ops, task, checkpoint_debug, file_xfer_debug
10/21/2011 2:11:30 PM Libraries: libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
10/21/2011 2:11:30 PM Data directory: C:\Documents and Settings\All Users\Application Data\BOINC
10/21/2011 2:11:30 PM Running under account Deb
10/21/2011 2:11:30 PM Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.06GHz [Family 15 Model 2 Stepping 7]
10/21/2011 2:11:30 PM Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pbe
10/21/2011 2:11:30 PM OS: Microsoft Windows XP: Home x86 Edition, Service Pack 3, (05.01.2600.00)
10/21/2011 2:11:30 PM Memory: 1022.79 MB physical, 2.40 GB virtual
10/21/2011 2:11:30 PM Disk: 146.48 GB total, 123.91 GB free
10/21/2011 2:11:30 PM Local time is UTC -5 hours
10/21/2011 2:11:30 PM No usable GPUs found
10/21/2011 2:11:30 PM World Community Grid URL http://www.worldcommunitygrid.org/; Computer ID 1754347; resource share 100
10/21/2011 2:11:30 PM World Community Grid General prefs: from World Community Grid (last modified 21-Oct-2011 13:55:00)
10/21/2011 2:11:30 PM World Community Grid Host location: none
10/21/2011 2:11:30 PM World Community Grid General prefs: using your defaults
10/21/2011 2:11:30 PM Preferences:
10/21/2011 2:11:30 PM max memory usage when active: 767.09MB
10/21/2011 2:11:30 PM max memory usage when idle: 920.51MB
10/21/2011 2:11:31 PM max disk usage: 10.00GB
10/21/2011 2:11:31 PM max CPUs used: 1
10/21/2011 2:11:31 PM don't use GPU while active
10/21/2011 2:11:31 PM (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
10/21/2011 2:11:31 PM Not using a proxy
10/21/2011 2:11:31 PM World Community Grid Restarting task E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0 using cep2 version 640
10/21/2011 6:36:40 PM World Community Grid Sending scheduler request: To fetch work.
10/21/2011 6:36:40 PM World Community Grid Requesting new tasks
10/21/2011 6:36:42 PM World Community Grid Scheduler request completed: got 1 new tasks
10/21/2011 6:36:44 PM World Community Grid Started download of 07edef018edea7e550e285daecdd66ef.zip
10/21/2011 6:36:44 PM World Community Grid [file_xfer_debug] URL: https://grid.worldcommunitygrid.org/boinc/dow...ea7e550e285daecdd66ef.zip
10/21/2011 6:36:46 PM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/21/2011 6:36:46 PM World Community Grid [file_xfer_debug] file transfer status 0
10/21/2011 6:36:46 PM World Community Grid Finished download of 07edef018edea7e550e285daecdd66ef.zip
10/21/2011 6:36:46 PM World Community Grid [file_xfer_debug] Throughput 64971 bytes/sec


BOINC cc_config.xml file:
<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<file_xfer_debug>1</file_xfer_debug>
<proxy_debug>0</proxy_debug>
<http_debug>0</http_debug>
<checkpoint_debug>1</checkpoint_debug>
</log_flags>
<options>
<client_version_check_url>http://www.worldcommunitygrid.org/download.ph...ent_version_check_url>
<client_download_url>http://www.worldcommunitygrid.org/download.php</client_download_url>
<network_test_url>http://www.ibm.com/</network_test_url>
<save_stats_days>90</save_stats_days>
<dont_contact_ref_site>0</dont_contact_ref_site>
</options>
</cc_config>
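
A small Python sketch for double-checking which log flags a cc_config.xml switches on. It uses only the `<log_flags>` section shown above (the `<options>` section is left out), and the helper name `enabled_log_flags` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The <log_flags> section of the cc_config.xml above, copied as-is.
CC_CONFIG_LOG_FLAGS = """
<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<file_xfer_debug>1</file_xfer_debug>
<proxy_debug>0</proxy_debug>
<http_debug>0</http_debug>
<checkpoint_debug>1</checkpoint_debug>
</log_flags>
</cc_config>
"""


def enabled_log_flags(xml_text):
    """Return the names of all flags under <log_flags> that are set to 1."""
    root = ET.fromstring(xml_text.strip())
    flags = root.find("log_flags")
    if flags is None:
        return []
    return [flag.tag for flag in flags if (flag.text or "").strip() == "1"]


print(enabled_log_flags(CC_CONFIG_LOG_FLAGS))
```

The four names it prints also appear in the "log flags:" line of the message tab; the extra sched_ops entry there is not set in this file.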


WCG website device profile 1

WCG website device profile 2

BOINC local preferences – Processor

BOINC local preferences – Network

BOINC local preferences – Disk/Memory

BOINC disk graph

Antivirus: Norton 2011
Firewall: ZA Free 9.2.106.000

My computer is hyper-threaded, which is why I set the multi-processor usage to 50%.
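
The arithmetic behind that setting, as a rough sketch. The rounding rule (round down, but never below one CPU) is an assumption made for illustration, not something taken from the BOINC source; the function name is hypothetical.

```python
def usable_cpus(logical_cpus, percent):
    """Number of CPUs the client may use for a given multiprocessor percentage.

    Assumption: the result is rounded down, with a minimum of one CPU.
    """
    return max(1, int(logical_cpus * percent / 100))


# Two logical processors (one hyper-threaded core) at 50%.
print(usable_cpus(2, 50))
```

This gives 1 for two logical processors at 50%, which agrees with the "max CPUs used: 1" line in the message tab.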

------------------------------

[append]

The work unit finished without ever doing a checkpoint.

Here are the rest of the messages:
10/22/2011 5:44:30 AM World Community Grid Computation for task E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0 finished
10/22/2011 5:44:30 AM World Community Grid Starting E203544_136_C.28.C23H12N2S2Se.00704106.3.set1d06_0
10/22/2011 5:44:30 AM World Community Grid Starting task E203544_136_C.28.C23H12N2S2Se.00704106.3.set1d06_0 using cep2 version 640
10/22/2011 5:44:31 AM World Community Grid [fxd] starting upload, upload_offset 0
10/22/2011 5:44:31 AM World Community Grid Started upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_0
10/22/2011 5:44:31 AM World Community Grid [file_xfer_debug] URL: https://grid.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler
10/22/2011 5:44:31 AM World Community Grid [fxd] starting upload, upload_offset -1
10/22/2011 5:44:31 AM World Community Grid Started upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_1
10/22/2011 5:44:31 AM World Community Grid [file_xfer_debug] URL: https://grid.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler
10/22/2011 5:44:33 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:44:33 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
10/22/2011 5:44:33 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:44:33 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:44:33 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>
10/22/2011 5:44:33 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:44:33 AM World Community Grid [fxd] starting upload, upload_offset 0
10/22/2011 5:44:34 AM World Community Grid [file_xfer_debug] file transfer status 0
10/22/2011 5:44:34 AM World Community Grid Finished upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_0
10/22/2011 5:44:34 AM World Community Grid [file_xfer_debug] Throughput 2665 bytes/sec
10/22/2011 5:44:34 AM World Community Grid [fxd] starting upload, upload_offset -1
10/22/2011 5:44:34 AM World Community Grid Started upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_2
10/22/2011 5:44:34 AM World Community Grid [file_xfer_debug] URL: https://grid.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler
10/22/2011 5:44:35 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:44:35 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>
10/22/2011 5:44:35 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:44:35 AM World Community Grid [fxd] starting upload, upload_offset 0
10/22/2011 5:44:37 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:44:37 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
10/22/2011 5:44:37 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:44:37 AM World Community Grid [file_xfer_debug] file transfer status 0
10/22/2011 5:44:37 AM World Community Grid Finished upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_2
10/22/2011 5:44:37 AM World Community Grid [file_xfer_debug] Throughput 60778 bytes/sec
10/22/2011 5:44:37 AM World Community Grid [fxd] starting upload, upload_offset 0
10/22/2011 5:44:37 AM World Community Grid Started upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_3
10/22/2011 5:44:37 AM World Community Grid [file_xfer_debug] URL: https://grid.worldcommunitygrid.org/boinc/wcg_cgi/file_upload_handler
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] file transfer status 0
10/22/2011 5:44:38 AM World Community Grid Finished upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_1
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] Throughput 60433 bytes/sec
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] file transfer status 0
10/22/2011 5:44:38 AM World Community Grid Finished upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_3
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] Throughput 389 bytes/sec
10/22/2011 5:44:38 AM World Community Grid [fxd] starting upload, upload_offset -1
10/22/2011 5:44:38 AM World Community Grid Started upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_4
10/22/2011 5:44:38 AM World Community Grid [file_xfer_debug] URL: https://cleanenergy.worldcommunitygrid.org/prod/cep2/file_upload_handler
10/22/2011 5:44:40 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:44:40 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>
10/22/2011 5:44:40 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:44:40 AM World Community Grid [fxd] starting upload, upload_offset 0
10/22/2011 5:45:23 AM World Community Grid [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
10/22/2011 5:45:23 AM World Community Grid [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
10/22/2011 5:45:23 AM World Community Grid [file_xfer_debug] parsing status: 0
10/22/2011 5:45:23 AM World Community Grid [file_xfer_debug] file transfer status 0
10/22/2011 5:45:23 AM World Community Grid Finished upload of E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0_4
10/22/2011 5:45:23 AM World Community Grid [file_xfer_debug] Throughput 97119 bytes/sec
10/22/2011 5:46:10 AM World Community Grid update requested by user
10/22/2011 5:46:19 AM World Community Grid Sending scheduler request: Requested by user.
10/22/2011 5:46:19 AM World Community Grid Reporting 1 completed tasks, not requesting new tasks
10/22/2011 5:46:20 AM World Community Grid Scheduler request completed
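
For anyone wanting to measure the gap between two of these messages, here is a hedged Python sketch. It assumes every message starts with a date, a time and AM/PM in the format seen above, and that the interpreter runs in an English/C locale so that `%p` matches "AM" and "PM". The helper name `parse_log_time` is made up for this example.

```python
from datetime import datetime

# Timestamp layout used at the start of each message line above.
LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_log_time(line):
    """Parse the leading 'MM/DD/YYYY H:MM:SS AM|PM' of a message line."""
    date, clock, meridiem = line.split()[:3]
    return datetime.strptime(f"{date} {clock} {meridiem}", LOG_TIME_FORMAT)


# Two lines taken from the messages in this post.
restarted = ("10/21/2011 2:11:31 PM World Community Grid Restarting task "
             "E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0 using cep2 version 640")
finished = ("10/22/2011 5:44:30 AM World Community Grid Computation for task "
            "E203528_777_C.27.C22H14OSSeSi2.00148624.4.set1d06_0 finished")

elapsed = parse_log_time(finished) - parse_log_time(restarted)
print(elapsed)
```

The result is wall clock time since the client was restarted, not the total CPU time of the task, which had already been running before that restart.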


So, was this a non-converging work unit?

If I get another one like this, should I abort it, or should I let it complete?



Off Topic:
I thought these work units timed out at 12 hours. This one went for 12 hours and 23 minutes. Two more work units in my queue are estimated for over 12 hours also. screenshot
----------------------------------------
[Edit 2 times, last edit by Former Member at Oct 23, 2011 4:23:40 AM]
[Oct 22, 2011 7:03:05 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

Dear debsgr8,
It is a bit complicated to explain what non-convergence means, but Wikipedia sums it up quite nicely:
http://en.wikipedia.org/wiki/Iterative_method
Most numerical calculations in science (such as in quantum chemistry as employed in CEP2) use iterative algorithms, i.e., they generate a sequence of improving approximate solutions that get closer and closer to the actual result until they reach (=converge to) it, and no more change occurs. Sometimes, these algorithms can fail (for many reasons), in which case the intermediate solutions do not approach (=converge to) the desired result but, e.g., start to oscillate or shoot off to no man's land. (It can also sometimes converge to a wrong result, but that is a different story.)
Anyways, the code we use and the wus we designed try to reduce the chance of failure as much as possible so it only happens very rarely. But if it happens in CEP2, there is not much one can do about it. The wus at some point give up on a job and move on to the next, no manual intervention necessary. Please don't abort wus, they do that by themselves once it's hopeless.
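A toy Python illustration of the three behaviours just described (converging, oscillating, shooting off) and of giving up on a job and moving on to the next. This is not CEP2 code: the tolerance, the iteration limit, the example functions and all names are assumptions picked only to show the idea.

```python
import math


def iterate(step, x0, tol=1e-10, max_iter=200):
    """Repeat x = step(x) until the change is below tol, or give up.

    Returns (status, last_value, iterations_used).
    """
    x = x0
    for i in range(1, max_iter + 1):
        x_new = step(x)
        if not math.isfinite(x_new):
            return "diverged", x_new, i       # shot off to infinity
        if abs(x_new - x) < tol:
            return "converged", x_new, i      # no more change occurs
        x = x_new
    return "gave up", x, max_iter             # e.g. stuck oscillating


# Three hypothetical "jobs": (name, update rule, starting guess).
jobs = [
    ("converging", math.cos, 1.0),
    ("oscillating", lambda x: 3.2 * x * (1.0 - x), 0.5),
    ("diverging", lambda x: x * x + 1.0, 2.0),
]

results = {}
for name, step, x0 in jobs:
    # A failed job does not stop the loop; the next job simply starts.
    results[name] = iterate(step, x0)
    print(name, results[name][0], "after", results[name][2], "iterations")
```

The second job bounces between two values forever, so only the iteration limit ends it; that is the kind of case where a work unit gives up by itself.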
Your example could well be a case which does not converge. In these cases we have to manually rerun certain parts of the calcs in-house and with specialized settings, but that's ok.
The progress setting is a very crude tool and cannot account for extraordinary circumstances. The 12h limit is for CPU time; you are looking at wall clock time, which is commonly higher.
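The difference between the two clocks can be seen with a short Python sketch. The workload is made up (a bit of arithmetic followed by a pause standing in for time when the task is not on the CPU); only the standard `time` module is used.

```python
import time


def timed(work):
    """Run work() and return (wall_clock_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall clock
    cpu_start = time.process_time()    # CPU time used by this process
    work()
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)


def busy_then_wait():
    sum(i * i for i in range(200_000))  # uses the CPU
    time.sleep(0.5)                     # waits; uses almost no CPU


wall, cpu = timed(busy_then_wait)
print(f"wall clock: {wall:.2f}s, CPU time: {cpu:.2f}s")
```

The pause counts toward wall clock time but not toward CPU time, so the first number comes out larger than the second.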
Best wishes
Your Harvard CEP team

BTW: You use LAIM, correct?
[Oct 23, 2011 6:12:02 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

cleanenergy said:
Please don't abort wus, they do that by themselves once it's hopeless.


Thank you. That's what I wanted to know.

cleanenergy asked:
BTW: You use LAIM, correct?


Yes.
[Oct 24, 2011 4:46:51 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

Work unit E203560_910_C.28.C22H15N3SSi2.00144301.2.set1d06_1 is doing the same thing. I'm at 80% without a checkpoint.

Am I just a lucky cruncher who happened to get 2 of these? Or is there something about my machine and/or settings that is causing this to happen?
[Oct 24, 2011 7:07:12 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

Hello debsgr8,
Checkpoints are set in the work unit; they are not affected by the computer as long as it computes correctly. I would just monitor the Results Status page and only get worried if something strange showed up there. I admit, I just exaggerated. Personally, I would check Properties in Tasks for that work unit in BOINC Manager repeatedly, trying to figure it out. But I would just be puzzled as long as Results Status validated.

What does the checkpoint entry say in the Properties popup window?

Lawrence
[Oct 24, 2011 10:12:40 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

Hi debsgr8,
While problematic molecules are rare, they tend to cluster: there may be a structural feature which causes problems, and since similar molecules and their wus are created at the same time, they may end up with the same host, if the host downloads a series of results.
You should just monitor the situation and if the Results Status persistently shows problems over the next two weeks, then we should have a closer look.
Best wishes from
Your Harvard CEP team
[Oct 25, 2011 3:13:09 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

lawrencehardin said:
... I would just monitor the Results Status page and only get worried if something strange showed up there.
cleanenergy said:
... You should just monitor the situation and if the Results Status persistently shows problems over the next two weeks, then we should have a closer look.
Neither of those two work units is even listed on my Results Status page... even with the Result Status filter set to 'All'.

cleanenergy said:
...they tend to cluster...
I tend to believe that's it, as I've done other CEP2 work units without problems and have received valid results on them. I'm currently working on a work unit that is over 54% with 2 checkpoints already.

It's a puzzle. I love puzzles. I've just set my device profile for CEP2 only in an attempt to sort this out.
[Oct 26, 2011 4:40:40 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

No ideas, but do you really want to run a lot of CEP2 units simultaneously?
Lawrence
[Oct 26, 2011 5:14:32 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is this a non-converging work unit?

Simultaneously? No. I have my HT setting at 50% so I'm only crunching on one at a time.

But if you mean exclusively, then yes... just until I can determine whether or not it is my computer and/or settings.
[Oct 26, 2011 5:28:47 AM]