World Community Grid Forums
Thread Status: Active Total posts in this thread: 70
julemand101
Cruncher Denmark Joined: Feb 28, 2020 Post Count: 7 Status: Offline |
mdxi: I have the same problem on a recently updated Arch Linux server. Besides what you have already described, I can provide a more detailed stack trace from journalctl:
systemd-coredump[4847]: Process 2621 (wcgrid_mip1_ros) of user 969 dumped core.
----------------------------------------
[Edit 1 times, last edit by julemand101 at Sep 2, 2020 10:55:36 PM] |
||
|
birdmoot
Cruncher Joined: Dec 19, 2017 Post Count: 1 Status: Offline |
Kernel update landed yesterday (5.8.6) so I tried again. MIP1 WUs still error out.
The only new information I have is that it happens between the 60% checkpoint and the 80% checkpoint, and that it doesn't happen at the same time for all WUs. That is, a WU with less runtime may crash before one with more runtime; it isn't a strict FIFO situation. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
No clue, just throwing a few things at it
Exit code -1073741819 is a classic https://www.worldcommunitygrid.org/forums/wcg...ead,16468_offset,0#129404 and MIP1 only exists as a 32-bit app, last I looked. |
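For reference, the decimal exit code quoted above can be converted to the hexadecimal form that Windows error listings normally use. The following is only a minimal Python sketch; the helper name `to_ntstatus_hex` is made up for illustration. Reinterpreting -1073741819 as an unsigned 32-bit value gives 0xC0000005, which Windows defines as STATUS_ACCESS_VIOLATION.

```python
def to_ntstatus_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex.

    Hypothetical helper for illustration; not part of BOINC or WCG tooling.
    """
    # Masking with 0xFFFFFFFF maps a negative value onto its two's-complement
    # unsigned 32-bit equivalent (here: -1073741819 + 2**32 = 3221225477).
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus_hex(-1073741819))
```

The same conversion works for other negative exit codes reported for Windows tasks, which makes them easier to look up.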
||
|
littlepeaks
Veteran Cruncher USA Joined: Apr 28, 2007 Post Count: 748 Status: Offline |
I got 3 errored work units today
"<message> WU download error: couldn't get input files: <file_xfer_error><file_name>mip1.MIP1_00318498.2</file_name>"
That was MIP1_ 00318498_ 8438_0--
The other two were:
MIP1_ 00318498_ 8439_ 0--
MIP1_ 00318498_ 8437_ 0--
Starting at MIP1_ 00318499_ 0305_ 0-- everything's OK again -- that WU returned a "Valid"
----------------------------------------
[Edit 3 times, last edit by littlepeaks at Sep 7, 2020 7:31:56 PM] |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline |
Good Morning,
I'm running some of these in standalone mode to see if I can recreate the issues. Members are noticing them on a wide range of host types/systems and batch numbers.
Thanks,
-Uplinger |
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline |
Thank you!
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I'm running 7.14.3 / 3.0.1 on an x86 processor with 8G RAM.
WUs have not been erroring out, just sticking at 60% or 40% complete (exactly).
Tried restarting, deleting the app and re-installing, but they still stick at the same % level.
Not a memory issue, as I have tried running with few additional apps.
Example:
Application: Microbiome Immunity Project 7.16
Name: MIP1_00317793_3961
State: Running
Received: 01/09/2020 16:05:04
Report deadline: 11/09/2020 16:05:04
Estimated computation size: 21,592 GFLOPs
CPU time: 02:56:06
CPU time since checkpoint: 02:56:06
Elapsed time: 03:15:11
Estimated time remaining: 01:35:54
Fraction done: 60.000%
Virtual memory size: 330.55 MB
Working set size: 265.18 MB
Directory: slots/1
Process ID: 16088
Progress rate: 18.000% per hour
Executable: wcgrid_mip1_rosetta_7.16_windows_intelx86 |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I've found MIP1 to be heavy duty and thus limit the number of concurrent tasks to a maximum of 2 in app_config.xml. It would be interesting to learn whether the problem still occurs if you limit MIP1 to just one at a time.
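As an illustration of the limit described above, here is a minimal Python sketch that generates an app_config.xml restricting one application to a fixed number of concurrent tasks. The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are the standard BOINC ones; the short name `mip1` is an assumption inferred from the file names quoted in this thread, and the helper `build_app_config` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return the text of a BOINC app_config.xml limiting one application.

    Hypothetical helper; app_name must match the project's short app name.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # ET.indent only exists on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumption: "mip1" is the short name of the MIP1 application.
print(build_app_config("mip1", 1))
```

The resulting text would be saved as app_config.xml in the World Community Grid project directory under the BOINC data directory, after which the client needs to re-read its config files (or be restarted) for the limit to take effect.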
(sorry if this is a repeat of what has already been said in this thread) |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"I'm running 7.14.3 / 3.0.1 on an x86 processor with 8G RAM. WUs have not been erroring out, just sticking at 60% or 40% complete (Exactly)"
I see this on my Ryzen 3700X systems. I haven't looked for it on others. It reminds me of the Win9X Progress Bar. You know, where the system was too busy doing real work to be bothered with updating its status so the user wouldn't kill it.
[Edit 1 times, last edit by Former Member at Sep 11, 2020 10:34:06 PM] |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 987 Status: Recently Active |
"I see this on my Ryzen 3700X systems. I haven't looked for it on others. It reminds me of the Win9X Progress Bar. You know, where the system was too busy doing real work to be bothered with updating its status so the user wouldn't kill it."
Good comparison!!!
I can confirm this behaviour and offer an explanation (for which, read what follows); the important thing to take away from this is that there is nothing wrong with your jobs - it's just a BOINC "feature" that this application highlights, so Lightofmylife and others need not be worried by this.
Now, the explanation...
Not all BOINC applications put regular status update data where the client can find it, and the Progress figure can only be "accurate" as and when such data is posted. If the client has never seen any progress data, it estimates it based on the expected run time and the time used so far - this counts up consistently until proper data is available, at which point it gets set to that value. Sometimes this can result in progress appearing to have been lost (usually when the estimated run time is too low).
There seem to be a lot of MIP1 jobs with one long structure at present, and I notice progress for these seeming to stop at 40% for 10 to 15 minutes before going up to 60% and sticking at that value until the job finishes (at which point it shows 100% as expected).
I found some jobs with more, shorter, structures and observed a pattern in how those reported as well - it seems that each structure reports at 40% and 60%, some also at 20%, but none at 80%. For instance, a 3-structure job showed fixed progress at 6.667%, 13.333% and 20% (structure 1), 46.667% and 53.333% (structure 2), 73.333%, 80% and 86.667% (structure 3). I also watched an 11-structure job which showed similar behaviour.
Hope that's of interest.
Cheers - Al.
[Edit 1 times, last edit by alanb1951 at Sep 12, 2020 11:36:34 AM] |
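The figures in the post above fit a simple pattern: each of N structures owns an equal 1/N share of the progress bar, and within a structure the value only moves at the 20%, 40% and 60% marks. The Python sketch below reproduces the observed numbers under that assumption. It is inferred from the reported values only, not taken from the MIP1 application, and the function name is made up for illustration.

```python
def expected_progress_points(n_structures, steps_pct=(20, 40, 60)):
    """List the overall progress values (in %) at which each structure may report.

    Assumes structure k (0-based) covers the range k/N .. (k+1)/N of the bar
    and reports only at the given per-structure percentages.
    """
    points = []
    for k in range(n_structures):
        # 100 * k is the progress banked by the k structures already finished.
        points.append(
            [round((100 * k + step) / n_structures, 3) for step in steps_pct]
        )
    return points


for number, values in enumerate(expected_progress_points(3), start=1):
    print(f"structure {number}: {values}")
```

For a 3-structure job this yields 6.667, 13.333 and 20 for the first structure, 40, 46.667 and 53.333 for the second, and 73.333, 80 and 86.667 for the third. The 40% value for the second structure is its 20% mark, which the post notes is only reported for some structures. With a single structure the same rule gives 20, 40 and 60, matching the jobs that appear to stick at 40% and 60%.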
||
|