| World Community Grid Forums
Thread Status: Active Total posts in this thread: 22
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
I have had several OET tasks (8-9) error out over the last 3 weeks, while hundreds completed successfully. It is not a huge deal, but each error forces a large number of tasks into pending verification, and any new ones that come in are quorum 2 for the next while. Seems like a waste of compute power. From looking around I understand it is likely a timing issue. I tried turning off Bitdefender and setting Avast not to scan the BOINC directories, but it is still happening every couple of days. Just wanted to report it and check whether there are any suggestions. Here is a sample of the error.

Result Name: OET1_0003973_x4O6IchA_rig_75422_0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<message>
finish file present too long
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[00:41:05] Number of tasks = 1
[00:41:05] Running task 0,CPU time at start of task 0 was 0.000000
[00:41:05] ./ZINC12500059.pdbqt size = 29 5 ../../projects/www.worldcommunitygrid.org/oet1.x4O6IchA_rig.pdbqt size = 1867 0
[04:30:23] Finished task #0 cpu time used 10741.682857
04:30:23 (5720): called boinc_finish(0)
</stderr_txt>
]]>

Thanks, |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
|
||
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
Actually the machine has been rebooted several times during this period. It is a single error, and then it will run a couple of days before it runs into another one. The same pattern continues with or without a reboot.
|
||
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
Had another occurrence last night. The error file was a little different this time, in that it appears the job may not have started. It's strange, in that several hundred have run successfully over the last few days.

Result Log
Result Name: OET1_0003996_x3MWPp_rig_37385_0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<message>
finish file present too long
</message>
]]> |
||
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
Strange, had another error. Based on the timing, it almost seems that I get one in every thousand tasks, then carry on for the next thousand (no IPL required, the rest of the tasks complete successfully). If you are interested in looking at it, let me know what details you would need.

Result Log
Result Name: OET1_0004010_x3MWP_rig_5947_0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<message>
finish file present too long
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[00:32:38] Number of tasks = 1
[00:32:38] Running task 0,CPU time at start of task 0 was 0.000000
[00:32:38] ./ZINC00319214.pdbqt size = 22 5 ../../projects/www.worldcommunitygrid.org/oet1.x3MWP_rig.pdbqt size = 1765 0
[04:26:16] Finished task #0 cpu time used 9846.611519
04:26:16 (4236): called boinc_finish(0)
</stderr_txt>

Thanks, Susan |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
No promises, but if you have antivirus software, you're advised to add a scan exclusion for the BOINC data directory. E.g. for Avast you go to Active Protection > File System Shield > Customize > Exclusions > Add, where you then browse to [on Windows usually] C:\ProgramData\BOINC ** and OK.
----------------------------------------
** The path of the data directory is printed at the top of the BOINC event log (Ctrl+Shift+E keys).

Edit: The exclusion is safe. BOINC is sandboxed, with limited rights, and everything that is sent/received is verified for integrity [MD5, RSA etc]. In addition, BOINC installed as a service (AKA Protected Application Execution) runs under a special boinc account which can only touch what's in the BOINC data dir.
[Edit 1 times, last edit by Former Member at Jul 15, 2016 7:38:03 AM] |
||
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
Anti-virus was/is configured with explicit exclusions on both the BOINC program and APPDATA directories with no scheduled scans or automatic updates.
Had 3 more occurrences today:
OET1_0003820_x3MWPp_rig_83699_2--
OET1_0003820_x3MWPp_rig_84099_0--
OET1_0003820_x3MWPp_rig_35985_0--

I have noted that the error message stanza that is inserted
<message> finish file present too long </message>
is usually before <stderr_txt> but sometimes it is after. This suggests two different logic paths in the Boinc client?

The "finish_file_called" scan at a 10 second interval appears to be designed to look for a science application (in this case the OET VINA app) that has "hung" during its exit. I am puzzled as to how this occurs, since the app has already:
- created the output file in the Projects directory
- cleaned up all of its checkpoint files in
  - the \SLOTxx directory it was running under
  - the associated \SLOTxx\Vina_Checkpoint directory
- updated stderr to indicate completion time and CPU used

before it creates this finish file. After that it only appears to:
- delete the Boinc_Lockfile
- write to stdout.txt
- close stdout.txt and stderr.txt
- close the associated OET zip file in the Projects directory

Not sure how the Boinc client gets control or is notified when the app exits. Whatever the internal process, I am very surprised that more than 10 seconds could elapse before it runs. The machine is essentially dedicated to BOINC and does nothing else.
- 96% or 100% processors with no CPU activity suspend threshold

Questions:
1) Before flagging "error", does the Boinc client check to see if
- the PID is still running
- output was created in the projects folder
- the lockfile was deleted
Based on observed behaviour, I would say no. However, if these conditions are all met then the output file(s) should be good. Could the result not be uploaded for verification rather than killed as an error?
- Since errors push the machine into "Boinc jail" (everything pending)
2) It would be helpful to understand the timestamp when the scan detects this condition. Is this logged somewhere by the client?
The app has already timestamped when boinc_finish was called in stderr.
3) Can the 10 second scan interval be altered? It would be nice to be able to wait for up to 60 seconds. Small price to pay for an app that has run for 2-4 hours.
4) Would it be possible to re-send the machine one of the tasks above that errored out? If there is no error, this would confirm it had nothing to do with the work unit itself.
5) Is there a debug option that could help?

Thanks, Susan |
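
The checks proposed in question 1 can be sketched as a small decision function. This is only an illustration of the poster's suggestion, not the BOINC client's actual logic: the function name, its parameters, and the return labels are all hypothetical, and the 10-second default is simply the interval quoted in the post.

```python
def classify_finished_task(finish_file_age_s: float,
                           process_alive: bool,
                           output_exists: bool,
                           lockfile_exists: bool,
                           timeout_s: float = 10.0) -> str:
    """Hypothetical decision for a task whose finish file has been seen.

    Returns "upload", "wait", or "error". None of this is taken from
    the BOINC source; it only encodes the conditions listed in the post.
    """
    # The app has exited, left its output, and removed its lock file:
    # treat the result as good and send it for verification.
    if not process_alive and output_exists and not lockfile_exists:
        return "upload"
    # Still inside the grace period: give the app more time to exit.
    if finish_file_age_s < timeout_s:
        return "wait"
    # Grace period expired and the exit does not look clean.
    return "error"


# Example: finish file seen 23 s ago, app gone, output present, lock removed.
print(classify_finished_task(23.0, False, True, False))
```

Under this sketch a task is only flagged as an error when the grace period has expired and at least one of the three conditions is not met.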
||
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
Another occurrence today on OET1_0003841_x3MX2_rig_43387_2
I was able to determine when the finish file scan detected the error by looking in the Event Log. In this case the results show that BOINC_FINISH(0) was called at 3:45:38 PM and the error was detected by the Boinc client 23 seconds later. The event log is included below for the time range of interest:

7/19/2016 3:43:47 PM | World Community Grid | Reporting 1 completed tasks
7/19/2016 3:43:47 PM | World Community Grid | Requesting new tasks for CPU
7/19/2016 3:43:49 PM | World Community Grid | Scheduler request completed: got 1 new tasks
7/19/2016 3:43:51 PM | World Community Grid | Started download of c548b85964a135f0a5f47e3c47bbad0c.job
7/19/2016 3:43:51 PM | World Community Grid | Started download of 8ff71af3e8b5d7e052aba36758b14b2a.zip
7/19/2016 3:43:52 PM | World Community Grid | Finished download of c548b85964a135f0a5f47e3c47bbad0c.job
7/19/2016 3:43:52 PM | World Community Grid | Finished download of 8ff71af3e8b5d7e052aba36758b14b2a.zip
7/19/2016 3:43:52 PM | World Community Grid | Started download of 040cbbf23805b23bbb607c02c616aca8.pdbqt
7/19/2016 3:43:53 PM | World Community Grid | Finished download of 040cbbf23805b23bbb607c02c616aca8.pdbqt
3:45:38 PM finish called
7/19/2016 3:45:51 PM | World Community Grid | Sending scheduler request: To fetch work.
7/19/2016 3:45:51 PM | World Community Grid | Requesting new tasks for CPU
7/19/2016 3:45:52 PM | World Community Grid | Scheduler request completed: got 1 new tasks
7/19/2016 3:45:54 PM | World Community Grid | Started download of fcc1ae4a42212fbab5225fa52febaaf0.job
7/19/2016 3:45:54 PM | World Community Grid | Started download of 2c9bd3308f47fbcdbf5187c805617520.zip
7/19/2016 3:45:55 PM | World Community Grid | Finished download of fcc1ae4a42212fbab5225fa52febaaf0.job
7/19/2016 3:45:55 PM | World Community Grid | Finished download of 2c9bd3308f47fbcdbf5187c805617520.zip
7/19/2016 3:45:55 PM | World Community Grid | Started download of b3cdde0aeb0213ffcc5bc85c028eaafe.pdbqt
7/19/2016 3:45:56 PM | World Community Grid | Finished download of b3cdde0aeb0213ffcc5bc85c028eaafe.pdbqt
3:46:01 PM After about 23 seconds the scan must have found finish_file_called and triggered error
7/19/2016 3:46:01 PM | World Community Grid | Computation for task OET1_0003841_x3MX2_rig_43387_2 finished
7/19/2016 3:46:01 PM | World Community Grid | Starting task OET1_0003841_x3MX2_rig_53131_0
7/19/2016 3:46:03 PM | World Community Grid | Started upload of OET1_0003841_x3MX2_rig_43387_2_r534633098_0
7/19/2016 3:46:05 PM | World Community Grid | Finished upload of OET1_0003841_x3MX2_rig_43387_2_r534633098_0
================
Here is the Result Log. Note that there is no trailing </stderr_txt>, which is the 1st time I have seen this. The cpu time and claimed points appear to be in the expected range for this task.

Result Name: OET1_0003841_x3MX2_rig_43387_2--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<message>
finish file present too long
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[12:10:35] Number of tasks = 1
[12:10:35] Running task 0,CPU time at start of task 0 was 0.000000
[12:10:35] ./ZINC04065585.pdbqt size = 34 5 ../../projects/www.worldcommunitygrid.org/oet1.x3MX2_rig.pdbqt size = 1784 0
[15:45:38] Finished task #0 cpu time used 12176.782856
15:45:38 (3880): called boinc_finish(0)
------------------
Given that the finish-to-error interval was about 23 seconds, I am now not convinced that increasing the scan timer will help. Since the Boinc client downloaded a new task after the OET app called finish, it would suggest that the machine has available cpu cycles. The client appears to have simply missed the notification that the task had already ended. Would love to understand how the finish process is picked up by the client. The fact that this scan is there at all suggests some kind of an issue was encountered during program development. This machine and the OET tasks are somehow creating the perfect storm to trigger this weird error condition.

Despite having done so a week ago, I have rebooted to see if the condition clears or re-appears. Not hopeful that I have seen the last of this issue.

Susan |
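
The 23-second interval quoted in the post above can be reproduced from the two timestamps. The following is a minimal Python sketch: the helper name `seconds_between` is hypothetical, and the timestamp layout is assumed from the event log lines shown in the post.

```python
from datetime import datetime

# Timestamp layout assumed from the event log lines quoted in the post,
# e.g. "7/19/2016 3:45:38 PM".
EVENT_LOG_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two event-log style timestamps."""
    t0 = datetime.strptime(start, EVENT_LOG_FORMAT)
    t1 = datetime.strptime(end, EVENT_LOG_FORMAT)
    return (t1 - t0).total_seconds()


# boinc_finish(0) was called at 3:45:38 PM; the client logged the task as
# finished at 3:46:01 PM.
elapsed = seconds_between("7/19/2016 3:45:38 PM", "7/19/2016 3:46:01 PM")
print(elapsed)
```

The same helper could be pointed at other occurrences to see whether the gap is always well beyond the 10 seconds mentioned earlier in the thread.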
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
Have you changed anything on that machine recently that may explain the change in behaviour? You may want to try the latest client from the boinc download page and see if that has any effect on the issue.
Thanks, armstrdj |
||
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
I am already running the most current version (7.2.24).
The problem has increased in frequency since enabling hyperthreading, but it was occurring before then. Have had 3 more errors in the last 3 days. |
||
|
|
|