World Community Grid Forums
Thread Status: Active Total posts in this thread: 171
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 971 Status: Offline
@knreed - any chance something like this could be done as a side-effect of the OPNG down-time?

This is something that we will pass along to the Krembil staff as we transition everything over to them. The next few months will be an intensive focus on migrating the system over to Krembil's infrastructure and cross-training their staff to become familiar with the different pieces of the system (some of this has been going on for a while now, but some couldn't start until the change became official). As a result, I don't think there will be much capacity to take this on in the short term, but I think that is something that can be revisited once the migration is complete.

Thanks for the acknowledgment - I didn't really think there was much chance for now, especially if it didn't take too long to sort out the validator, but I did think it was worth a mention...

Thanks also for the effort entailed in sorting out the validator; it looks as if it is helping with these awkward tasks! Good luck with the ongoing work.

Cheers - Al.
||
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
(slightly off-topic, but closely related ...)
I've had 15 WUs from 3 devices error out, mostly or all when restarting from sleep or hibernation. I haven't seen this behaviour before.

The detailed error messages are in the form:
Error: Unable to open job description file OPNG_<WU ID number>.job
and the exit codes are:
<message> The environment is incorrect. (0xa) - exit code 10 (0xa)</message>

Example of the end part of a log file:
INFO:[08:59:15] Start AutoDock for OB3ZINC001107003131--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #64)...
OpenCL device: GeForce GTX 970
INFO:[08:59:45] End AutoDock...
INFO:[08:59:45] Start AutoDock for OB3ZINC000903000887--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #65)...
OpenCL device: GeForce GTX 970
projects/www.worldcommunitygrid.org/wcgrid_opng_autodockgpu_7.28_windows_x86_64__opencl_nvidia_102 -jobs OPNG_0087276_00021.job -input OPNG_0087276_00021.zip -seed 660440409 -wcgruns 13300 -wcgdpf 266
INFO: Using gpu device from app init data 0
Error: Unable to open job description file OPNG_0087276_00021.job
15:10:10 (7176): called boinc_finish(10)

The machine (3770K, Win7-x64) went into scheduled sleep at 09:00 and was woken manually at around 15:09 (UTC+10, 24/9/21).

-----
Off-topic: I trialled putting the parts of the above message that are in Courier font into "small" font, but they become ridiculously tiny. Yes, ridiculously tiny! Could we perhaps have something in-between?

And when typing in the initial message, I selected some text, then selected Courier in the Font box, and the forum software just inserted <square bracket> font=courier new] and <square bracket> font] around the selected text, in situ. But when I did this for some other text after the first preview, I got popup windows which duplicated the text upon closing the popups. Inconsistent.

And why is the text in the results status pages so huge? And grey? And why is the "apply filters" button there initially scrolled off-page?

HTH
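For anyone collecting these failures across several machines, a log tail like the one quoted above can be summarised with a small script. This is only a sketch: the three patterns are inferred from the single excerpt in this post, not from any documented log format, and `summarize_log` and `SAMPLE` are hypothetical names used for illustration.

```python
import re

# Patterns assumed from the one log excerpt quoted in this thread;
# other application versions may format these lines differently.
START_RE = re.compile(
    r"^INFO:\[(\d{2}:\d{2}:\d{2})\] Start AutoDock for (\S+?)\(Job #(\d+)\)"
)
ERROR_RE = re.compile(r"^Error: Unable to open job description file (\S+)")
FINISH_RE = re.compile(r"called boinc_finish\((\d+)\)")


def summarize_log(text):
    """Return the last job started, any missing .job file, and the exit code."""
    summary = {"last_job": None, "missing_job_file": None, "exit_code": None}
    for line in text.splitlines():
        line = line.strip()
        if m := START_RE.match(line):
            summary["last_job"] = int(m.group(3))
        elif m := ERROR_RE.match(line):
            summary["missing_job_file"] = m.group(1)
        elif m := FINISH_RE.search(line):
            summary["exit_code"] = int(m.group(1))
    return summary


# Lines copied from the excerpt above (device and command lines omitted).
SAMPLE = """\
INFO:[08:59:45] End AutoDock...
INFO:[08:59:45] Start AutoDock for OB3ZINC000903000887--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #65)...
INFO: Using gpu device from app init data 0
Error: Unable to open job description file OPNG_0087276_00021.job
15:10:10 (7176): called boinc_finish(10)
"""

print(summarize_log(SAMPLE))
```

The summary makes the pattern in the report easy to spot: a job was started, the task was relaunched, and the relaunch failed on the .job file before any further job ran.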
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2186 Status: Offline
Yes, I noticed the same here. Not that I put my computer to sleep, but I tried suspending the WU, shut down BOINC, and then restarted BOINC, and resumed the WU.
It was, like yours it seems, one of the "big" ones with many "jobs" in it. The "normal" ones are perfectly safe to suspend and resume, as well as to put the computer to sleep. However, these "big" ones do not seem able to handle that without erroring out.

My WU (Windows 8.1): https://www.worldcommunitygrid.org/contribution/workunit/821144975

Part of data from stderr_txt:
<core_client_version>7.16.7</core_client_version>

And at the end part of stderr.txt:
INFO:[14:55:43] Start AutoDock for OB3ZINC000923890044_2--7jji_003_mgl_rot-N121--AARG190_inert.dpf(Job #15)...

[Edit 3 times, last edit by Grumpy Swede at Sep 24, 2021 1:34:43 PM]
||
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
Thanks, Grumpy Swede
I just found another oddity: a WU where all 7 copies were deemed "Too late": OPNG_0086420_00045. My log file for it looks normal.
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2186 Status: Offline
Thanks, Grumpy Swede
I just found another oddity: a WU where all 7 copies were deemed "Too late": OPNG_0086420_00045. My log file for it looks normal.

That's from a batch known to be problematic. The reason for the GPU pause the other day was for the team to try to fix that issue with an updated validator. There are thousands of those problematic batch 0086XXX WUs out there. So, that one you can forget about.

Edit: The validator fix didn't solve the "tight" issue totally, but it became at least a bit better. We're now past the 0086XXX batch, so until another equally "tight" batch comes our way, we will hopefully see fewer of the "Too Late" and "Invalid" messages.

[Edit 2 times, last edit by Grumpy Swede at Sep 24, 2021 4:08:24 PM]
||
|
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline
Something strange and new is happening with the validator. I got replication 7 (_6) of https://www.worldcommunitygrid.org/contribution/workunit/826331107.
None of the others have passed verification yet.

Edit - first and last are valid, the five in the middle are invalid. What's up with that?

[Edit 1 times, last edit by Richard Haselgrove at Sep 29, 2021 7:47:30 AM]
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2186 Status: Offline
Something strange and new is happening with the validator. I got replication 7 (_6) of https://www.worldcommunitygrid.org/contribution/workunit/826331107.
None of the others have passed verification yet. Edit - first and last are valid, the five in the middle are invalid. What's up with that?

I think this one is on its way to the same high number of replications: https://www.worldcommunitygrid.org/contribution/workunit/827428602
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
Something to sort out for someone …
workunit 273672597 OPNG_0172048_00083_0 Linux Ubuntu Inval 2023-02-28T20:14:15 2023-02-28T21:24:11 0.25/0.25 58.6/0.0
workunit 273984646 OPNG_0172088_00112_0 Linuxmint Inval 2023-03-01T01:44:42 2023-03-01T01:55:48 0.09/0.09 58.6/0.0
workunit 276778923 OPNG_0171938_00043_0 Linux Ubuntu Inval 2023-04-06T09:51:33 2023-04-06T10:01:49 0.03/0.03 58.6/0.0

[Edit 2 times, last edit by adriverhoef at Apr 29, 2023 9:34:50 PM]
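Rows in this layout can be split into fields for sorting or filtering. The sketch below assumes only what is visible in the rows themselves: the post does not label the columns, so the field names (`t1`, `t2`, `pair_a`, `pair_b`) are deliberately neutral and hypothetical, and the guess that the two timestamps are sent and returned times is an assumption, not something stated in the thread.

```python
import re
from datetime import datetime

# Column meanings are NOT given in the post; names here are placeholders.
# The OS name may contain spaces ("Linux Ubuntu"), so it is matched lazily.
TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
ROW_RE = re.compile(
    r"^workunit (?P<workunit>\d+) (?P<task>\S+) (?P<os>.+?) (?P<status>\S+) "
    rf"(?P<t1>{TS}) (?P<t2>{TS}) (?P<pair_a>\S+) (?P<pair_b>\S+)$"
)


def parse_row(line):
    """Split one 'workunit ...' row into a dict, or return None if it does not match."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    row["workunit"] = int(row["workunit"])
    row["t1"] = datetime.fromisoformat(row["t1"])
    row["t2"] = datetime.fromisoformat(row["t2"])
    return row


row = parse_row(
    "workunit 273672597 OPNG_0172048_00083_0 Linux Ubuntu Inval "
    "2023-02-28T20:14:15 2023-02-28T21:24:11 0.25/0.25 58.6/0.0"
)
print(row["task"], row["os"], row["status"], row["t2"] - row["t1"])
```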
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
It's been a few weeks since I saw an all-Invalid workunit. Here follows one I stumbled across today:
workunit 295237528 OPNG_0187139_00111_0 Fedora Linux Inval 2023-04-29T12:56:41 2023-04-29T13:34:02 0.15/0.15 58.6/0.0

And this is what I found in its error log:
INFO:[18:34:42] Start AutoDock for OB37744998--5rmm_rna5_0--BASN179.dpf(Job #8)...

EDIT 09-05-2023: It wasn't over yet, wingmen _7 and _8 have been added.
EDIT 15-05-2023: It wasn't over yet, wingmen _9 and _10 have been added.

[Edit 8 times, last edit by adriverhoef at May 15, 2023 5:20:35 PM]
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
It's two days later, another workunit with all-Invalid tasks ...
workunit 296839243 OPNG_0185377_00039_0 Fedora Linux Inval 2023-05-01T16:37:18 2023-05-01T16:41:40 0.06/0.07 58.6/0.0
----------------------------------------

Details:
Logfile:
No further errors until the task was completed and uploaded, same outcome for each completed task as part of the workunit.

Note: task _0 led to Invalid, upon which task _1 was sent. When that one led to Invalid, tasks _2, _3 and _4 were sent. When task _3 went Invalid, tasks _5 and _6 were sent out. When task _6 got an Invalid outcome, tasks _7 and _8 were transmitted. When task _5 fell into the Invalid trap, too, all possible remaining tasks were Server Aborted.

If you wonder 'where did I see this error before?', well, here it is in thread 44931.

Adri

PS This workunit also had a due time of 2023-05-03T04:37:18+0000 initially (task _0), that's 1½ days instead of the usual three.

EDIT 15-05-2023: It wasn't over yet, wingmen _9 and _10 have been added.

[Edit 5 times, last edit by adriverhoef at May 15, 2023 5:23:22 PM]
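The 1½-day figure in the PS can be checked directly from the two timestamps. This is a small worked example, assuming the first timestamp in the row (2023-05-01T16:37:18) is the send time of task _0 and is in UTC like the due time; neither is stated explicitly in the post. `deadline_window_days` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def deadline_window_days(sent_iso, due_iso):
    """Return the time between send and due timestamps, in days."""
    sent = datetime.fromisoformat(sent_iso)
    due = datetime.fromisoformat(due_iso)
    return (due - sent) / timedelta(days=1)


# "+00:00" is written with a colon so that datetime.fromisoformat accepts it
# on Python versions older than 3.11; the post writes the same offset as "+0000".
window = deadline_window_days(
    "2023-05-01T16:37:18+00:00",  # assumed send time of task _0, assumed UTC
    "2023-05-03T04:37:18+00:00",  # due time quoted in the PS
)
print(window)  # 1.5
```

Under those assumptions the window is exactly 36 hours, which matches the "1½ days instead of the usual three" in the post.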
||