| World Community Grid Forums
Thread Status: Active Total posts in this thread: 24
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
CEP2 is my favorite crunching project. I have a Phenom II X4-based Win7-64 PC that has been crunching workunits for CEP2 (2 at a time) alongside workunits from Fight AIDS at Home without issue for some time. I have rearranged my crunching PCs and, due to its excellent performance, I decided to allow this PC to exclusively run four CEP2 units at a time. It has the high speed RAM, high speed hard drive and network bandwidth to handle the workload without issue. (Set to run all the time, I don't even notice the workunits are running.)
This afternoon, I came home to find that just shy of 200 work units had errored out. The message is one I've seen before, and about which I posted to the forums. At that time, an experienced cruncher told me it was a fairly generic error, and not to worry about it. Back then, after several work units, things got back to normal. However, after seeing 200 of them error out, one after another, I am once again concerned.
The work units in question are in the series with names beginning E210286, E210285, E210284 and one from E210278. Other work units have completed normally, and two are currently running.
This is the error message I am seeing:
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
 - exit code 195 (0xc3)
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[16:58:34] Number of jobs = 16
[16:58:34] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[16:58:36] Finished Job #0
16:58:41 (4848): called boinc_finish
</stderr_txt>
Should I be concerned? Any input (reassurance?) would be appreciated. Thanks!
[Edit 2 times, last edit by Former Member at Nov 9, 2012 2:59:22 PM] |
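When close to 200 results fail the same way, it can help to pull the key fields out of each pasted result and compare them. The following is only a minimal sketch in Python: the function name `summarize_result` is hypothetical, the sample text is the error message quoted above, and the regular expressions assume every result uses the same wording as that one message.

```python
import re

# Sample text copied from the error message quoted in the post above.
RESULT_TEXT = """<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
 - exit code 195 (0xc3)
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[16:58:34] Number of jobs = 16
[16:58:34] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[16:58:36] Finished Job #0
16:58:41 (4848): called boinc_finish
</stderr_txt>"""


def summarize_result(text):
    """Extract exit code, job count and per-job return codes (hypothetical helper)."""
    summary = {
        "exit_code": None,      # decimal value from the <message> block
        "exit_code_hex": None,  # the same value parsed from its hex form
        "job_count": None,      # "Number of jobs = N"
        "return_codes": [],     # every "RC = 0x..." value, as integers
    }

    match = re.search(r"exit code (\d+) \((0x[0-9a-fA-F]+)\)", text)
    if match:
        summary["exit_code"] = int(match.group(1))
        summary["exit_code_hex"] = int(match.group(2), 16)

    match = re.search(r"Number of jobs = (\d+)", text)
    if match:
        summary["job_count"] = int(match.group(1))

    summary["return_codes"] = [
        int(rc, 16) for rc in re.findall(r"RC = (0x[0-9a-fA-F]+)", text)
    ]
    return summary


if __name__ == "__main__":
    print(summarize_result(RESULT_TEXT))
```

The decimal and hex forms in the message are the same number (0xc3 is 195), so the only extra detail in this result is that job 0 of 16 returned 0x1 within about two seconds of starting.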
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just thought I'd add a new twist. After downloading and erroring out over 200 workunits, BOINC stopped downloading new workunits, and now is saying "Message from server: This computer has finished a daily quota of 63 tasks." I don't get the math . . . and now have two willing (and soon four willing) cores without work to do??
|
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7849 Status: Offline Project Badges:
|
First course of action is to reboot the machine. Then wait and see whether the new jobs you get also error out.
----------------------------------------Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I decided to do just that, and as I was waiting for the reboot, logged into the forum and saw your message. Seems to have worked. Strange that two work units ran without issue on the same PC where 200+ workunits failed. It seems to be working now.
Thanks for the reply. |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7849 Status: Offline Project Badges:
|
It has happened to me a couple of times with machines that ran for long periods of time 24/7 without a problem until one day they had a problem which affected every WU. The reboot took care of it.
----------------------------------------Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
OK, I'm back with the same problem, so I've removed RESOLVED from the header. Part of my day job is to troubleshoot such problems on Windows PCs, but I simply do not have enough information to troubleshoot this one. "Computation error," "Exit Code 195" and "Application exited with RC = 0x1" are not enough for me to go on.
Googling for Exit Code 195, I find "The science application running the task failed for unknown reasons." Googling for RC = 0x1 yielded only five confusing results, two of which were in Chinese, and didn't translate well into English.
The PC in question is extremely stable. It runs an array of different applications for months at a time without a reboot, which is why I chose to have it run CEP2 on each core. In general, I only have to reboot when something like a Windows Update forces it. It runs Win7-64 SP1 on a Phenom II X4 945 running at 3 GHz (no overclock). It has 8GB DDR3 1333 MHz RAM running without overclock. It has a 10K RPM SATA2 hard drive. It connects to a gigabit LAN which connects to a very reliable 15 megabit internet connection. It boggles the mind how a very stable rig with these specs could present a problem for any app.
The Elapsed vs. CPU time speaks for itself. Often, there are only a few minutes between them when the PC is in active use, and even less when it is running unattended.
CEP2's extended time between checkpoints means I'm not willing to keep rebooting the thing just to get this one app to behave. I understand the basic notion that "CEP2 takes more resources," but I need to know exactly what those demands are that this rig cannot handle.
This time it only ran two days before the problem recurred. So, for two days, it ran four CEP2 tasks at a time without issue, and then suddenly it started failing again.
By the way, I am running BOINC 6.10.58, because that is what WCG distributes. Would updating to a more recent version have any benefit?
Can anyone shed light or point me to documentation? Any information would be appreciated. Thanks! |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7849 Status: Offline Project Badges:
|
Since the reboot did help for a short period of time, I am speculating that something is occurring within the system. The first thing which comes to mind is corrupted memory. You could try running memtest for a while to see if anything pops out. The second item would be to check for overheating, either CPU related or memory related. If the machine has been running for a long time with no problems, I would check the CPU heatsink to see if it is clogged with dust. A third, really remote possibility might be a failing hard drive, because they do get hammered with CEP2. Running a check on it might not hurt.
If I think of anything else I will post it. Hope this helps.
----------------------------------------Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Like I said, this hardware is extremely stable. When I examine the processes, CEP2 is not taxing my hardware anywhere near as much as some of the other apps I run on that box.
Temp only becomes a problem with this box when ambient temp approaches 80F/27C, and the temp in the room at the time of the CEP2 errors was 72F/22C. Since every other app runs fine, including high stress diagnostics, and CEP2 is the only one causing the problem, I've changed the profile for this PC to allow only 3 CEP2 tasks at a time, and added FAAH to take up the slack. This means that FAAH monopolizes the PC. I hate that, since I love contributing as much as possible to CEP2, and my understanding is that they can use all the cores they can get. But if CEP2 won't behave, I don't have much choice. |
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
I think I got that exit code a couple of years ago when I was running CEP2 on a generation 1 SSD. The writes to the drive couldn't keep up. Since you are running a mechanical drive, it could be the same thing.
I now have a much faster SSD (Crucial m4), but more to the point I use a Ramdisk to protect it from all the writes, and have no errors on CEP2. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Among my many projects, I build my own PCs. I carefully select the hardware and then test it extensively, especially as it relates to vendor-claimed throughput. This hard drive has a high sustained read and write rate.
When I don my programmer's hat, I often require much more of it than CEP2 does. One of my big data projects was converting a 10GB database into a new one, record by record, converting some field formats as I went, depending on record contents. That process asked way more of the drive than CEP2 is asking, with the drive running at 100% load (according to Resource Monitor) for a few hours at a time.
Testing indicates that the drive is easily capable of writing fast enough to save the entire memory footprint of CEP2 in a matter of a few seconds, and it is even faster at reading. For example, when doing video editing, this PC can transcode a 4GB NTSC video file from one format to another in four minutes or less.
I've encountered this error before, and posted to the forums about it. The advice I was given then was that sometimes workunits have such problems, and I shouldn't be too concerned about my hardware. As I was advised, several workunits errored out, but then I started getting more that would run without issue, all without rebooting. That was July, and I've not encountered the problem again until this week.
Since I cannot get hardware or software errors from any other app, including diagnostics, I cannot explain it. As a dogged troubleshooter, I hate it when I can't explain something. I do appreciate your responses. |
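For anyone who wants to put a number on the sustained write rate discussed above, here is a minimal Python sketch. The function name `measure_write_throughput` and the 64 MB / 4 MB sizes are assumptions chosen for illustration. It times one sequential stream only, which is not necessarily how CEP2 writes to disk (this thread does not describe CEP2's I/O pattern), and many small writes can behave differently from a single large one.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_mb=64, chunk_mb=4):
    """Write about total_mb of random data to a temp file; return MB/s (hypothetical helper)."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    chunks = total_mb // chunk_mb

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(chunks):
                handle.write(chunk)
            # Flush Python's buffer, then ask the OS to push the data to disk,
            # so the timing is not just measuring the write cache.
            handle.flush()
            os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    finally:
        # Always clean up the temporary file, even if the write failed.
        os.remove(path)

    return (chunks * chunk_mb) / elapsed


if __name__ == "__main__":
    rate = measure_write_throughput(tempfile.gettempdir())
    print(f"Sequential write: {rate:.1f} MB/s")
```

To test the drive BOINC actually uses, pass a directory on that drive instead of the system temp directory.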
||
|
|
|