| World Community Grid Forums
Thread Status: Active Total posts in this thread: 15
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I've been getting a lot of these errors recently on one host (my apologies if that link isn't publicly viewable -- not sure how to do that here). I don't think I made any changes that would affect its operation or preferences and it's normally been very solid here and at other projects. The stderr output includes something similar to this:
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x000007FEFD483C32
Engaging BOINC Windows Runtime Debugger...

Disk usage limitations are set locally and I allow ~150GB of disk usage (BOINC taking about 10GB and WCG only 175MB). Host in question is an AMD FX8150/Win7-64/16GB RAM. Is there any hardware problem that could cause an error like this?

Thanks for any feedback.
MarkR |
||
|
|
BobCat13
Senior Cruncher Joined: Oct 29, 2005 Post Count: 295 Status: Offline |
Go to your Data directory and check all of the slots folders to see if there are leftover VM files from any of the CERN projects.
Another project's message board has several mentions of this occurring. |
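For anyone who would rather not click through every slot folder by hand, the check described above can be sketched in a few lines of Python. This is only an illustration: the function name `find_large_slot_files` and the 1 GB default threshold are my own choices, and the only assumption about BOINC is that the data directory contains a `slots` subfolder, as mentioned in the post.

```python
import os

def find_large_slot_files(data_dir, min_bytes=1 * 1024**3):
    """Return (path, size) pairs for files under <data_dir>/slots
    that are at least min_bytes in size, largest first."""
    slots_dir = os.path.join(data_dir, "slots")
    hits = []
    for root, _dirs, files in os.walk(slots_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or cannot be read; skip it
            if size >= min_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Example call (the path is a placeholder; use your own data directory):
# for path, size in find_large_slot_files(r"C:\path\to\BOINC\data"):
#     print(f"{size / 1024**2:10.1f} MB  {path}")
```

Anything it lists that belongs to a project you are no longer running is a candidate for a closer look. The sketch only reports files; it deletes nothing.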
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
ritterm,
Find the Event log file 'stdoutdae.txt' in the BOINC data directory [often hidden; the path is printed at client startup in the BOINC Event Log window, which opens with Ctrl+Shift+E]. In the file, find the local time for one of your errored UGM jobs. Copy all the related task log lines and paste them into a reply. There's a global BOINC "maximum disk usage exceeded" limit [default 10GB], but this could also be job-specific. The event log would tell. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Go to your Data directory and check all of the slots folders to see if there are leftover VM files from any of the CERN projects..."

I checked the slot directories and there were two with VM image files, each over 5GB. Since I'm not running any VM projects -- and wanted to solve my WCG problem! -- I went ahead and deleted those files and have been running trouble free for the last 10 hours or so.

I'm not sure, though, why that would be the problem. Would BOINC have tried to use one or both of those slots -- thinking, perhaps, it was empty -- only to find a large file in it already, and then abort because the disk limit was exceeded? If it's something like that, is the VM project at fault for leaving the image file there or is the BOINC client not managing the slots properly?

That seems to have been the problem for this host. But, I checked my other hosts and have one with similar local disk usage limitations (I think) that also has two slot directories with similarly large VM image files. That host is having no trouble at all with WCG tasks.

I think my immediate problem is solved, but I'd like to know more. I'll check the stdoutdae.txt file at SekeRob's suggestion and see what that might show.

Thanks for the help!
Cheers, MarkR |
||
|
|
BobCat13
Senior Cruncher Joined: Oct 29, 2005 Post Count: 295 Status: Offline Project Badges:
|
"I checked the slot directories and there were two with VM image files, each over 5GB. Since I'm not running any VM projects -- and wanted to solve my WCG problem! -- I went ahead and deleted those files and have been running trouble free for the last 10 hours or so. I'm not sure, though, why that would be the problem. Would BOINC have tried to use one or both of those slots -- thinking, perhaps, it was empty -- only to find a large file in it already, and then abort because the disk limit was exceeded? If it's something like that, is the VM project at fault for leaving the image file there or is the BOINC client not managing the slots properly?"

It would be the responsibility of the project application to clean up the slot directory when finished. If those .vdi files are being left behind, it is most likely due to the VM task erroring and not cleaning up properly. The BOINC client only creates slots when needed and deletes empty slots upon a client startup. The thread from the offending project: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=34

I am not sure how BOINC totals up the disk usage, but it appears to add up all the files in a slot, even files that are not from the current task.

"That seems to have been the problem for this host. But, I checked my other hosts and have one with similar local disk usage limitations (I think) that also has two slot directories with similarly large VM image files. That host is having no trouble at all with WCG tasks."

It may be that neither of those slot directories has been reused by the client yet. If the 5GB .vdi files stay there and the slots are reused, then there is a very good chance those tasks will error as well.

"I think my immediate problem is solved, but I'd like to know more. I'll check the stdoutdae.txt file at SekeRob's suggestion and see what that might show. Thanks for the help! Cheers, MarkR"

Glad I could help. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Find the Event log 'stdoutdae.txt' file in the BOINC data directory [often hidden, path printed at client startup in BOINC Event log window, to open with Ctrl+Shft+E]. In the file find the local time for one of your errored UGM jobs. Copy all the related task log lines and paste them to a reply."

This is typical of the event log output for my failed tasks:

01-May-2015 16:18:21 [World Community Grid] Aborting task ugm1_ugm1_11587_0304_0: exceeded disk limit: 4707.01MB > 500.00MB
01-May-2015 16:18:22 [World Community Grid] Computation for task ugm1_ugm1_11587_0304_0 finished
01-May-2015 16:18:22 [World Community Grid] Output file ugm1_ugm1_11587_0304_0_0 for task ugm1_ugm1_11587_0304_0 absent

[Edit 1 times, last edit by Former Member at May 2, 2015 2:39:44 PM] |
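If there are many failed tasks, pulling the "exceeded disk limit" lines out of stdoutdae.txt by hand gets tedious. Below is a rough Python sketch that does it. The regular expression is modelled only on the single "Aborting task" line quoted above, so treat the format as an assumption; other client versions may word the message differently. The names `ABORT_RE` and `find_disk_limit_aborts` are mine, not anything from BOINC.

```python
import re

# Pattern modelled on the "Aborting task" line quoted in this thread.
ABORT_RE = re.compile(
    r"^(?P<when>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Aborting task (?P<task>[^\s:]+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)

def find_disk_limit_aborts(lines):
    """Return one dict per 'exceeded disk limit' abort found in lines."""
    aborts = []
    for line in lines:
        m = ABORT_RE.match(line.strip())
        if m:
            aborts.append({
                "when": m["when"],
                "project": m["project"],
                "task": m["task"],
                "used_mb": float(m["used"]),
                "limit_mb": float(m["limit"]),
            })
    return aborts

# Example call (assumes stdoutdae.txt is in the current directory):
# with open("stdoutdae.txt", errors="replace") as f:
#     for a in find_disk_limit_aborts(f):
#         print(a["when"], a["task"], a["used_mb"], ">", a["limit_mb"])
```

A reported usage far above the limit, like the 4707.01MB against 500.00MB above, suggests something other than the task's own files is being counted in the slot.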
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
ritterm,
Really like to see the event log. Reason is, jobs are not supposed to go out like that when BOINC is short of [allowed] disk space... 1 or more tasks are supposed to be paused with "waiting for disk space", same as when running short on memory there's the "waiting for memory". Speculating right now that the two VM slots already took the full space, so the jobs probably crashed right at set-up. Increase the default to 15-20GB and it won't happen again when [LHC/Atlas?] gets sloppy again... if they are, what is it they specify as the minimum permanent requirement? For sure, if a job finishes and the result is reported, the slot has to be emptied and eventually BOINC housekeeping will either delete the excess slot or re-use it. At any rate, the log would tell when a collapsing job was started, what it complained about, and how long it took to go belly up.

Edit: OK, let me do a bit of pondering on this log you posted just before my reply. Indeed crashing right on set-up; that is, I don't see the 'starting xyz' line giving the timestamp of when it was activated.

[Edit 3 times, last edit by Former Member at May 2, 2015 2:53:24 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Edit: OK, let me do a bit of pondering on this log you posted just before my reply. Indeed crashing right on set-up; that is, I don't see the 'starting xyz' line giving the timestamp of when it was activated."

Indeed, those three messages are all that I find in the event log -- nothing about the task starting. I searched the file looking for the task ID, but maybe that didn't catch everything that is pertinent. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
This has become a topic on the Berkeley developers' alpha mailing list, as this issue with the slots not being emptied by CERN/LHC/ATLAS is seemingly having a snowballing effect at at least 5 projects... slots not emptied, though BOINC registers them as empty. Inherently, BOINC should not be reusing a non-empty slot, but, well, e.g., in what BOINC version did this start?

"I looked at this and couldn't immediately see the problem. The BOINC client deletes everything in a slot directory before using it for a new job. If a deletion fails (e.g. because a file is in use by another app) it doesn't use that slot directory. I verified this by opening some Word docs in slot directories.

Notes:
* There's a "slot_debug" log flag for messages related to slot directories. Unfortunately it doesn't print messages about failed file deletions; I'll add this.
* The "disk limit exceeded" errors refer to the per-job disk limit, not the user's disk usage preferences; I'll change the message to clarify this.
* Apps aren't responsible for cleaning out their slot dirs; BOINC does this. It may be that BOINC is failing to delete VM images because they're still in use by the VirtualBox executive.

Bottom line: I'll need some more info to debug this. If anyone is seeing this reproducibly, let me know. Otherwise we'll release a client with more debugging output to help us investigate.
-- David"

Just curious: When WCG said it uses 172MB, how much does BOINC indicate is used for the VM project [when there are no more VM jobs running or paused]?

[Edit 1 times, last edit by Former Member at May 4, 2015 8:56:46 AM] |
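To make the behaviour David describes a bit more concrete, here is a small Python sketch of the idea: empty the slot before use, and skip the slot if anything in it cannot be deleted. This is not BOINC's code (the client is written in C++), and `clean_out_dir`, `pick_slot` and the `max_slots` cap are hypothetical names of my own. The real client also has to skip slots that belong to running tasks, which this sketch leaves out.

```python
import os
import shutil

def clean_out_dir(path):
    """Try to delete everything in path.
    Return True only if the directory ends up empty."""
    for name in os.listdir(path):
        full = os.path.join(path, name)
        try:
            if os.path.isdir(full) and not os.path.islink(full):
                shutil.rmtree(full)
            else:
                os.remove(full)
        except OSError:
            return False  # e.g. file still held open by another program
    return not os.listdir(path)

def pick_slot(slots_dir, max_slots=64):
    """Return the first slot directory that could be fully emptied,
    or None if no slot up to max_slots is usable."""
    for n in range(max_slots):
        slot = os.path.join(slots_dir, str(n))
        os.makedirs(slot, exist_ok=True)
        if clean_out_dir(slot):
            return slot
        # Deletion failed: leave this slot alone and try the next one.
    return None
```

With logic like this, a leftover VM image that cannot be deleted should cause the slot to be skipped rather than reused, which is why the errors reported in this thread looked surprising.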
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Not what everyone would have wanted: the fix comes in a future client [after 7.5.0 ALPHA], where the root problem is caused by CERN/ATLAS's VM image not allowing BOINC to clean up properly. The fix, in programmer lingo:

David Anderson [Mon, 4 May 2015 21:48:34 +0000]
client: detect errors in directory enumeration

Previously, the dir_scan() function didn't distinguish between
- reaching the end of the directory
- errors
It just returned nonzero in either case.

This means that the function that cleans out a slot dir (client_clean_out_dir()) could potentially return success even though the directory is nonempty. This could potentially cause the recently-reported problem where a slot dir contains a VM image from a previous job.

And more importantly, moving the VM outside the slot:

Rom Walton [Mon, 4 May 2015 23:44:00 +0000]
VBOX: Make sure vboxsvc is launched outside the slot directory in sandbox mode on Windows.

At any rate, those running VirtualBox projects in BOINC need to be aware, and may want to wager their crunching future on temporarily running the next alpha build if they don't want these errors to occur! [Will advise when out] |
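The dir_scan() problem in the first commit message is easier to see in a toy example. The Python below is not the BOINC source; `FakeDir`, the status constants and both `clean_out_*` functions are invented purely to illustrate how treating "error" and "end of directory" alike lets a clean-up routine report success while files remain.

```python
END_OF_DIR = 1   # hypothetical status: no more entries
SCAN_ERROR = 2   # hypothetical status: enumeration failed

class FakeDir:
    """Toy directory whose enumeration can fail part-way through."""
    def __init__(self, entries, fail_at=None):
        self.entries = list(entries)
        self.fail_at = fail_at
        self.pos = 0

    def scan(self):
        """Return (status, name); status 0 means name is a valid entry."""
        if self.fail_at is not None and self.pos == self.fail_at:
            return SCAN_ERROR, None
        if self.pos >= len(self.entries):
            return END_OF_DIR, None
        name = self.entries[self.pos]
        self.pos += 1
        return 0, name

def clean_out_old(d, deleted):
    """Old pattern: any nonzero status is taken to mean 'directory done'."""
    while True:
        status, name = d.scan()
        if status != 0:
            return True          # success is reported even after an error
        deleted.append(name)

def clean_out_fixed(d, deleted):
    """Fixed pattern: only a real end-of-directory counts as success."""
    while True:
        status, name = d.scan()
        if status == END_OF_DIR:
            return True
        if status != 0:
            return False         # enumeration error: directory may be nonempty
        deleted.append(name)
```

With entries `["a.vdi", "b.txt"]` and a failure injected at position 1, the old pattern returns success after handling only `a.vdi`, while the fixed pattern returns failure so the caller knows not to trust that slot.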
||
|
|
|