World Community Grid Forums
Thread Status: Active · Total posts in this thread: 11
Sgt.Joe
Ace Cruncher, USA · Joined: Jul 4, 2006 · Post Count: 7846 · Status: Recently Active
I just put an HP G7 rack mount server into production with two AMD Opteron 6234 CPUs. They are 12-core parts, so it is running 24 work units at a time. This unit was pretty finicky to get running, but I finally succeeded yesterday. It is running exclusively OPN work units. It has returned about 11 valid units, 10 pending validation, and 13 units which have errored out with the message "Finish file present too long."

Does anyone have any clue why this happens? I have returned over 100,000 units for this project and not seen this problem on any of my other machines. The OS is Linux Mint 18, the same OS as all my other Linux machines. If I cannot find a solution, I will take this machine out of the mix, because I don't want to waste time returning this many units in error. I got the machine for nothing, but maybe there was a reason it was free. I may also try MCM on it to see if the problem also occurs with that project. Thanks for any suggestions.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
Martin Schnellinger
Advanced Cruncher · Joined: Apr 29, 2007 · Post Count: 128 · Status: Offline
Dear Sgt. Joe,

On GitHub, I found the following on the topic "finish file too long": https://github.com/BOINC/boinc/pull/3019

As far as I understand, the timeout limit must be increased. Citation:

"When an app finishes, it writes a "finish file", which ensures the client that the app really finished. If the app process is still there N seconds after the finish file appears, the client assumes that something went wrong, and it aborts the job. Previously N was 10. This was too small during periods of heavy paging. I increased it to 300. It has been pointed out that if the app creates the finish file, and its output files are present, it should be treated as successful regardless of whether it exits. This is probably true, but right now we don't have a mechanism for killing a job and marking it as success. The longer timeout makes this moot."

I do not know a really good solution, but would propose trying to uncheck the option "leave BOINC in memory when it pauses". This is only a more or less educated guess, on a trial and error basis.

All the best in these times. I think we should be able to fix this problem, as it is apparently not a new one. Greetings, M
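The mechanism described in that PR can be sketched roughly as follows. This is a simplified illustration, not the actual BOINC client code; the finish-file path and process id here are hypothetical, and 300 is the timeout value from the quoted description:

```shell
#!/bin/sh
# Sketch of the check the PR describes: once a task's finish file
# appears, the client waits up to a timeout for the science app
# process to exit; if it is still alive after that, the task is
# aborted with "Finish file present too long".
check_finish() {
    finish_file=$1   # e.g. <slot_dir>/boinc_finish_called (hypothetical path)
    pid=$2           # process id of the science app
    timeout=$3       # was 10 s before PR 3019, raised to 300 s
    # No finish file yet: the app is simply still working.
    [ -f "$finish_file" ] || { echo "still running"; return 0; }
    waited=0
    # Finish file present: poll until the process exits or we time out.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Finish file present too long"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "finished cleanly"
}
```

Under heavy paging the app can need well over 10 seconds to exit after writing its finish file, which is why the old short timeout produced spurious aborts.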
Sgt.Joe
Ace Cruncher, USA · Joined: Jul 4, 2006 · Post Count: 7846 · Status: Recently Active
Thanks, I will try that.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
Martin Schnellinger
Advanced Cruncher · Joined: Apr 29, 2007 · Post Count: 128 · Status: Offline
Hello,
Additional info: the problem has been discussed in depth here: https://boinc.bakerlab.org/forum_thread.php?id=13860&postid=95357#95357

It seems that changing the cache size could help. Citation:

"Linux has its own built-in cache, you just need to set the size. 1 GB of cache and 1/2 hour write-delay should work wonders; probably half that amount or even less would fix this problem; 5 minutes should be more than enough."

https://lonesysadmin.net/2013/12/22/better-li...rformance-vm-dirty_ratio/
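For reference, the knobs the linked article discusses are the kernel's `vm.dirty_*` sysctls, which control how much dirty data the page cache may hold and how long it may sit before writeback. The byte values below are illustrative examples in the spirit of the quoted advice, not tested recommendations; adjust for your RAM size and workload:

```shell
# Inspect the current writeback settings (the ratio forms are
# percentages of RAM; the *_bytes forms are absolute limits):
sysctl vm.dirty_background_ratio vm.dirty_ratio
sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

# Example (requires root): switch to absolute byte limits, allowing
# up to ~1 GB of dirty pages before writers are throttled.
#   sudo sysctl -w vm.dirty_background_bytes=268435456    # 256 MB
#   sudo sysctl -w vm.dirty_bytes=1073741824              # 1 GB

# To persist across reboots, add the equivalent lines to /etc/sysctl.conf:
#   vm.dirty_background_bytes = 268435456
#   vm.dirty_bytes = 1073741824
```

Note that setting the `*_bytes` form of a knob zeroes out its `*_ratio` counterpart, and vice versa, so pick one style and stick with it.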
Sgt.Joe
Ace Cruncher, USA · Joined: Jul 4, 2006 · Post Count: 7846 · Status: Recently Active
Thank you. I will investigate.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
geophi
Advanced Cruncher, U.S. · Joined: Sep 3, 2007 · Post Count: 113 · Status: Offline
I used to occasionally get this message on some climateprediction.net tasks, especially when interrupting them for any reason while heavy disk writes were ongoing. This error was discussed quite a bit in the main BOINC support forums (and at SETI), and a newer version of BOINC fixed it for me. Since upgrading in April, I've had no problems, no matter how the task was interrupted. The Linux version of BOINC that has this fix is 7.16.6 (https://boinc.berkeley.edu/forum_thread.php?id=13562&postid=97382), which would be in the repository for Ubuntu 20.04 or Linux Mint 20. Or you could run the BOINC version hosted at Berkeley, which may run on Mint 18 but certainly runs on 19 and 20: https://boinc.berkeley.edu/dl/boinc_ubuntu_7.16.6_x86_64-pc-linux-gnu.sh
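To confirm whether a given box already has the fix before reinstalling, you can query the running client and compare version numbers. `boinccmd --client_version` is a real option of the tool that ships with the distro `boinc-client` package; the `version_ge` helper below is just an illustrative comparison built on `sort -V`:

```shell
#!/bin/sh
# Query the running client's version (needs a running BOINC client;
# the output includes a line like "Client version: 7.16.6"):
#   boinccmd --client_version

# Illustrative helper: succeeds when version $1 is at least version $2.
# sort -V orders version strings numerically per component, so the
# required version sorting first means the installed one is >= it.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example with an arbitrary older version number:
if version_ge 7.9.3 7.16.6; then
    echo "has the fix"
else
    echo "needs upgrade"   # prints this: 7.9.3 sorts before 7.16.6
fi
```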
Sgt.Joe
Ace Cruncher, USA · Joined: Jul 4, 2006 · Post Count: 7846 · Status: Recently Active
I changed the mix from all OPN to half OPN and half MCM. There have been no more errors since 12:00 UTC Dec. 15.
----------------------------------------
Thanks to all for the suggestions. Cheers
Sgt. Joe
*Minnesota Crunchers*
Sgt.Joe
Ace Cruncher, USA · Joined: Jul 4, 2006 · Post Count: 7846 · Status: Recently Active
Well, I did not fix the entire problem. I have gotten the incidence down to about 1 to 2 errors per hundred units. I will do some more tweaking to try to eliminate them entirely.
----------------------------------------
Once again, thank you all for your suggestions. Cheers
Sgt. Joe
*Minnesota Crunchers*
Bryn Mawr
Senior Cruncher · Joined: Dec 26, 2018 · Post Count: 384 · Status: Offline
The problem is exacerbated by the fact that this is a new machine running just one application, which means you're likely to have 24 WUs finishing at pretty much the same time.

As the tasks spread out, the box will process the output and send it in a smooth flow rather than having it back up and hang around.
Sgt.Joe
Ace Cruncher, USA · Joined: Jul 4, 2006 · Post Count: 7846 · Status: Recently Active
Quote:
"The problem is exacerbated by the fact that this is a new machine running just one application which means that you're likely to have 24 WUs finishing at pretty much the same time. As the tasks spread out the box will process the output and send it in a smooth flow rather than having the backing up and hanging around."

You may very well have a point. I had also thought I might be saturating my bandwidth, as I have 144 threads running through an "N" connection on my range extender. However, the errors were specific to one machine, which had been a bit finicky to set up in the first place. At any rate, with some tweaking of the work unit mix, I seem to have alleviated most if not all of the problem. I will still try to optimize the mix a bit more if needed. So far today I have zero errors. Cheers
Sgt. Joe
*Minnesota Crunchers*