Total posts in this thread: 70
Posts: 70   Pages: 7   [ Previous Page | 1 2 3 4 5 6 7 | Next Page ]
This topic has been viewed 46846 times and has 69 replies
julemand101
Cruncher
Denmark
Joined: Feb 28, 2020
Post Count: 7
Status: Offline
Re: Lots of MIP1 WUs error out

mdxi: I have the same problem on a recently updated Arch Linux server. Besides what you have already described, I can provide a more detailed stack trace from journalctl:

systemd-coredump[4847]: Process 2621 (wcgrid_mip1_ros) of user 969 dumped core.

Stack trace of thread 2621:
#0 0x00000000047cc521 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x43cc521)
#1 0x00000000047bcb49 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x43bcb49)
#2 0x00000000047a815e n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x43a815e)
#3 0x00000000047b4936 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x43b4936)
#4 0x00000000047b1b27 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x43b1b27)
#5 0x00000000046c88c4 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x42c88c4)
#6 0x00000000046ddb60 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x42ddb60)
#7 0x00007f258bafee84 __nss_readline (/usr/lib/libc-2.32.so + 0x124e84)
#8 0x00007f258ccff63d n/a (/usr/lib/libnss_files-2.32.so + 0x663d)
#9 0x00007f258ccff9c4 _nss_files_getpwuid_r (/usr/lib/libnss_files-2.32.so + 0x69c4)
#10 0x0000000004808bbc n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x4408bbc)
#11 0x00000000048089ec n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x44089ec)
#12 0x00000000041d5234 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x3dd5234)
#13 0x0000000002d6bca8 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x296bca8)
#14 0x0000000002d7505d n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x297505d)
#15 0x0000000002d7d0f3 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x297d0f3)
#16 0x0000000002d7d35b n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x297d35b)
#17 0x0000000002d8c5e4 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x298c5e4)
#18 0x0000000002cd62f3 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x28d62f3)
#19 0x0000000002d5f48d n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x295f48d)
#20 0x00000000030e2439 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x2ce2439)
#21 0x00000000030e79bd n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x2ce79bd)
#22 0x0000000002d0d035 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x290d035)
#23 0x0000000002470ae3 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x2070ae3)
#24 0x000000000247106a n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x207106a)
#25 0x00000000010a1b94 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0xca1b94)
#26 0x0000000000faf2cc n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0xbaf2cc)
#27 0x0000000000fb17fc n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0xbb17fc)
#28 0x0000000000411ec4 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x11ec4)
#29 0x0000000004794bb4 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x4394bb4)
#30 0x0000000004794ce6 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x4394ce6)
#31 0x00000000009658d6 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x5658d6)

Stack trace of thread 2727:
#0 0x00000000046dd3d1 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x42dd3d1)
#1 0x000000000480b3b4 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x440b3b4)
#2 0x000000000469fa6f n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x429fa6f)
#3 0x000000000468c44d n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x428c44d)
#4 0x00000000046d6925 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x42d6925)
#5 0x000000000480ec89 n/a (/var/lib/boinc/projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu + 0x440ec89)

----------------------------------------
[Edit 1 times, last edit by julemand101 at Sep 2, 2020 10:55:36 PM]
[Sep 2, 2020 10:54:02 PM]
birdmoot
Cruncher
Joined: Dec 19, 2017
Post Count: 1
Status: Offline
Re: Lots of MIP1 WUs error out

Kernel update landed yesterday (5.8.6) so I tried again. MIP1 WUs still error out.

The only new information I have is that it happens between the 60% checkpoint and the 80% checkpoint, and that it doesn't happen at the same time for all WUs. That is, a WU with less runtime may crash before one with more runtime; it isn't a strict FIFO situation.
[Sep 6, 2020 5:20:22 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Lots of MIP1 WUs error out

No clue, just throwing a few things at it

Exit code -1073741819 is a classic (see https://www.worldcommunitygrid.org/forums/wcg...ead,16468_offset,0#129404), and MIP1 only exists as a 32-bit app, last I looked.
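
For anyone wondering why that number looks so odd: Windows exit codes are 32-bit values, and this one is shown as a signed integer. A minimal sketch of the conversion follows (the function name `to_unsigned_hex` is made up for illustration). Masking to 32 bits gives 0xC0000005, which is the Windows status code for an access violation.

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    # Masking with 0xFFFFFFFF maps negative values onto their two's-complement form.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# The exit code quoted in this post.
print(to_unsigned_hex(-1073741819))  # 0xC0000005
```

Note that this only decodes the number; it does not say which part of the application made the bad memory access.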
[Sep 6, 2020 6:56:05 PM]
littlepeaks
Veteran Cruncher
USA
Joined: Apr 28, 2007
Post Count: 748
Status: Offline
Re: Lots of MIP1 WUs error out

I got 3 errored work units today

"<message>
WU download error: couldn't get input files:
<file_xfer_error><file_name>mip1.MIP1_00318498.2</file_name>"
That was MIP1_ 00318498_ 8438_0--
The other two were:
MIP1_ 00318498_ 8439_ 0--
MIP1_ 00318498_ 8437_ 0--

Starting at MIP1_ 00318499_ 0305_ 0-- everything's OK again -- that WU returned a "Valid"
----------------------------------------
[Edit 3 times, last edit by littlepeaks at Sep 7, 2020 7:31:56 PM]
[Sep 7, 2020 7:25:29 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Lots of MIP1 WUs error out

Good Morning,

I'm running some of these in standalone mode to see if I can recreate the issues. Members are noticing them on a wide range of host types/systems and batch numbers.

Thanks,
-Uplinger
[Sep 8, 2020 1:10:37 PM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Status: Offline
Re: Lots of MIP1 WUs error out

Thank you!
----------------------------------------

[Sep 9, 2020 5:31:45 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Lots of MIP1 WUs error out

I'm running 7.14.3 / 3.0.1 on an x86 processor with 8 GB RAM.
WUs have not been erroring out, just sticking at exactly 60% or 40% complete.
I tried restarting, and deleting and reinstalling the app, but they still stick at the same percentage.
It's not a memory issue, as I have tried running with few additional apps.

Example:
Application: Microbiome Immunity Project 7.16
Name: MIP1_00317793_3961
State: Running
Received: 01/09/2020 16:05:04
Report deadline: 11/09/2020 16:05:04
Estimated computation size: 21,592 GFLOPs
CPU time: 02:56:06
CPU time since checkpoint: 02:56:06
Elapsed time: 03:15:11
Estimated time remaining: 01:35:54
Fraction done: 60.000%
Virtual memory size: 330.55 MB
Working set size: 265.18 MB
Directory: slots/1
Process ID: 16088
Progress rate: 18.000% per hour
Executable: wcgrid_mip1_rosetta_7.16_windows_intelx86
[Sep 11, 2020 1:43:33 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Lots of MIP1 WUs error out

I've found MIP1 to be heavy duty, so I limit it to at most 2 concurrent tasks in app_config.xml. It would be interesting to learn whether the problem still occurs if you limit MIP1 to just one at a time.

(sorry if this is a repeat of what has already been said in this thread)
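
In case it helps anyone who hasn't set this up before, here is a rough sketch of what such a limit might look like. It assumes the application's short name is `mip1` (please check your own client's event log or client_state.xml, as I'm guessing from the file names seen in this thread), and the helper name `render_app_config` is just for illustration. The resulting text would go into app_config.xml in the project directory, followed by asking the client to re-read its config files.

```python
import xml.etree.ElementTree as ET

# Template for a minimal app_config.xml that caps concurrent tasks for one app.
APP_CONFIG_TEMPLATE = """<app_config>
  <app>
    <name>{name}</name>
    <max_concurrent>{limit}</max_concurrent>
  </app>
</app_config>
"""


def render_app_config(name: str, limit: int) -> str:
    """Return app_config.xml text limiting `name` to `limit` concurrent tasks."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    text = APP_CONFIG_TEMPLATE.format(name=name, limit=limit)
    ET.fromstring(text)  # sanity check: raises ParseError if not well-formed
    return text


# "mip1" is an assumed short name; 2 is the limit mentioned in this post.
print(render_app_config("mip1", 2))
```

Changing the second argument to 1 gives the one-at-a-time test suggested above.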
[Sep 11, 2020 4:38:00 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Lots of MIP1 WUs error out

I'm running 7.14.3 / 3.0.1 on an x86 processor with 8 GB RAM.
WUs have not been erroring out, just sticking at exactly 60% or 40% complete.
I tried restarting, and deleting and reinstalling the app, but they still stick at the same percentage.
It's not a memory issue, as I have tried running with few additional apps.


I see this on my Ryzen 3700X systems. I haven't looked for it on others. It reminds me of the Win9X Progress Bar. You know, where the system was too busy doing real work to be bothered with updating its status so the user wouldn't kill it.
----------------------------------------
[Edit 1 times, last edit by Former Member at Sep 11, 2020 10:34:06 PM]
[Sep 11, 2020 10:29:12 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 987
Status: Recently Active
Re: Lots of MIP1 WUs error out

I see this on my Ryzen 3700X systems. I haven't looked for it on others. It reminds me of the Win9X Progress Bar. You know, where the system was too busy doing real work to be bothered with updating its status so the user wouldn't kill it.

Good comparison!!! I can confirm this behaviour and offer an explanation (for which, read what follows); the important thing to take away from this is that there is nothing wrong with your jobs - it's just a BOINC "feature" that this application highlights! So Lightofmylife and others need not be worried by this.

Now, the explanation...

Not all BOINC applications put regular status update data where the client can find it, and the Progress figure can only be "accurate" as and when such data is posted.

If the client has never seen any progress data, it estimates it based on the expected run time and the time used so far - this counts up consistently until proper data is available, at which point it gets set to that value. Sometimes this can result in progress appearing to have been lost (usually when the estimated run time is too low).
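
To make that concrete, here is a deliberately simplified sketch of the behaviour as I've described it. It is not the BOINC client's actual code, and the function and parameter names are my own invention.

```python
from typing import Optional


def displayed_progress(elapsed_s: float,
                       estimated_total_s: float,
                       reported: Optional[float]) -> float:
    """Return the fraction done (0.0 to 1.0) that a client might display.

    Simplified model: use the application's own figure once one exists,
    otherwise fall back to elapsed time divided by estimated run time.
    """
    if reported is not None:
        return reported
    if estimated_total_s <= 0:
        return 0.0
    return min(elapsed_s / estimated_total_s, 1.0)


# Estimated run time of 1 hour, 30 minutes elapsed, nothing reported yet.
before = displayed_progress(1800, 3600, None)  # time-based estimate
# The application then posts its first real figure, 40%.
after = displayed_progress(1800, 3600, 0.40)
print(before, after)
```

With an estimate that is too low, the time-based figure (0.5 here) is ahead of the first real figure (0.4), so the display appears to go backwards.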

There seem to be a lot of MIP1 jobs with one long structure at present, and I notice progress for these seeming to stop at 40% for 10 to 15 minutes before going up to 60% and sticking at that value until the job finishes (at which point it shows 100% as expected).

I found some jobs with more, shorter, structures and observed a pattern in how those reported as well - it seems that each structure reports at 40% and 60%, some also at 20%, but none at 80%. For instance, a 3-structure job showed fixed progress at 6.667%, 13.333% and 20% (structure 1), 46.667% and 53.333% (structure 2), 73.333%, 80% and 86.667% (structure 3). I also watched an 11-structure job which showed similar behaviour.
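
The figures above fit a simple pattern: each structure gets an equal share of the progress bar, and the job reports at fixed fractions within that share. A small sketch of the arithmetic (my own reconstruction from the observed numbers, with a made-up function name, not anything taken from the application):

```python
def overall_progress(n_structures: int, structure_index: int,
                     fraction_within: float) -> float:
    """Overall percentage when structure `structure_index` (0-based) of
    `n_structures` equally weighted structures is `fraction_within` done."""
    return 100.0 * (structure_index + fraction_within) / n_structures


# 3-structure job: report points at 20%, 40% and 60% of each structure.
for index in range(3):
    points = [round(overall_progress(3, index, f), 3) for f in (0.2, 0.4, 0.6)]
    print(f"structure {index + 1}: {points}")

# Single-structure job: the familiar 40% and 60% sticking points.
print([round(overall_progress(1, 0, f), 3) for f in (0.4, 0.6)])
```

For the 3-structure job this reproduces the values I saw, including 40% for structure 2, which would be the 20% report that some structures skip.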

Hope that's of interest.

Cheers - Al.
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Sep 12, 2020 11:36:34 AM]
[Sep 12, 2020 11:35:28 AM]