World Community Grid Forums
Thread Status: Active | Total posts in this thread: 24
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hello MarkHNC,

Here is my unsolicited opinion on unexplained science-application errors. Right now, DDDT2 is stalled while the scientists try to overcome problems running the program on West Nile encephalitis targets. These problems can occur when an algorithm suddenly finds itself in an impossible situation while processing a particular molecule; for example, the algorithm may not converge to a single value. That is not necessarily a programming bug. It can happen simply because the algorithm cannot handle a particular molecular configuration. For whatever reason, almost all quantum chemistry programs will fail over some ranges of input data. I just shrug and let the project scientists worry about it.

On some projects, such as HCMD2, we end up rerunning some molecules with different parameters. (Which is what we have been doing for several months. Almost through!!)

Lawrence
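To make the non-convergence point concrete, here is a toy sketch of my own (nothing like the actual DDDT2/CEP2 code) of the kind of iterative loop these applications run internally. For one input it settles; for another, the very same correct code bounces around forever and can only bail out:

    import math

    class ConvergenceError(RuntimeError):
        pass

    def fixed_point(update, x0, tol=1e-10, max_iter=200):
        # Iterate x -> update(x) until successive values agree to tol.
        x = x0
        for i in range(max_iter):
            x_new = update(x)
            if abs(x_new - x) < tol:
                return x_new, i + 1              # converged
            x = x_new
        raise ConvergenceError("no convergence after %d iterations" % max_iter)

    # A well-behaved "molecule": a contraction mapping, converges quickly.
    print(fixed_point(math.cos, 1.0))

    # A pathological "molecule": the same solver oscillates chaotically.
    try:
        fixed_point(lambda x: 4.0 * x * (1.0 - x), 0.3)
    except ConvergenceError as e:
        print("error:", e)

Neither outcome is a bug in the loop; the second input simply has no stable fixed point for this update rule. That is roughly the situation the project scientists have to work around.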
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7849 | Status: Offline
I presume you have a quad-core system, because you stated you are now running only 3 CEP2 to one FAAH. Has this allowed CEP2 to behave over time? If so, it seems to point to a potential bottleneck someplace in your system. You have stated that you use only high-end components and have otherwise maximally stressed your system with other applications without experiencing errors, so I am at a loss as to where that bottleneck might be. It also does not explain why you were able to run CEP2 for an extensive period of time without a problem. The only other item which comes to mind is a transient problem with your power supply, but given your care with components, that is probably unlikely. Other than that I am at a dead end for any further suggestions, but I would be curious to know if you ever do figure out the cause. Good luck.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
When I don my programmer's hat, I often require much more of this machine than CEP2 does. One of my big data projects was converting a 10 GB database into a new one, record by record, converting some field formats as I went, depending on record contents. That process asked far more of the drive than CEP2 does, with the drive running at 100% load (according to Resource Monitor) for a few hours at a time. Testing indicates that the drive is easily capable of writing the entire memory footprint of CEP2 in a matter of seconds, and it is even faster at reading. For example, when doing video editing, this PC can transcode a 4 GB NTSC video file from one format to another in four minutes or less.

I wonder whether the difference between sequential and random writes is the cause. Sequential writes are easier on the drive, whether it is mechanical or SSD, but random writes slow things down, especially if they require a lot of head movement on a mechanical drive. I expect that the real-time operation of CEP2 requires more random writes than a database conversion does, but I have not tested that myself.

If you want to have some fun, try out the free trial (180 days) of FancyCache. Even if it does not solve this problem, it will speed things up if you use write caching and set a long write delay of at least a few minutes (I usually used 24 hours).
http://www.romexsoftware.com/en-us/fancy-cache/index.html
Note that the "Volume" version caches a single partition, which is normally what I use; the "Disk" version can cache one or more entire drives. The caveats about backup power supplies still apply, since you lose anything in the write cache during a power outage or crash. However, if you set the write delay to only 10 minutes or so, you still get a worthwhile reduction in writes while not losing very much in a crash. (Note that in addition to reducing the number of writes to the drive for CEP2, the cache also serializes them, which makes things easier on the drive.)

[Edit 1 times, last edit by Jim1348 at Nov 10, 2012 5:07:04 PM]
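If you want to see the sequential-versus-random gap on your own drive, here is a quick sketch of my own (not anything from WCG; the scratch-file name is made up) that times 4 KB writes to the same file in order, then shuffled:

    import os, random, time

    PATH = "writetest.bin"      # hypothetical scratch file
    BLOCK = 4096                # 4 KB blocks
    COUNT = 2048                # 8 MB total

    def timed(label, offsets):
        with open(PATH, "r+b") as f:
            start = time.perf_counter()
            for off in offsets:
                f.seek(off * BLOCK)
                f.write(os.urandom(BLOCK))
                f.flush()
                os.fsync(f.fileno())    # force each block to the platters
            print("%s: %.2f s" % (label, time.perf_counter() - start))

    # Pre-allocate the file so both passes overwrite the same extent.
    with open(PATH, "wb") as f:
        f.truncate(BLOCK * COUNT)

    sequential = list(range(COUNT))
    shuffled = sequential[:]
    random.shuffle(shuffled)

    timed("sequential", sequential)
    timed("random", shuffled)
    os.remove(PATH)

On a mechanical drive the shuffled pass is typically many times slower because of head movement; on an SSD the gap is smaller but usually still visible.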
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Part of why I chose enterprise-class Western Digital 10K RPM VelociRaptors as my primary workstation hard drives is their superior random access speed. In testing, these drives outperform all of my 7200 RPM drives; you have to move from SATA to SAS to get better performance. The more recent models I am using also feature TLER, so the possible bottleneck of error recovery is mitigated. (These drives also feature an "IcePack," a 3.5" heat sink surrounding the 2.5" drive, which means they run at the same or lower temperatures as my 7200 RPM drives at the same ambient temperature, and mount in a 3.5" bay.)

I already have write caching enabled on these drives, because each of my PCs has a battery backup.
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
"I already have write caching enabled on these drives, because each of my PCs has a battery backup."

That enables the internal cache on the drive (e.g., 32 MB or whatever), and it will help the data get to the drive if the drive is busy doing other things when the OS wants to write to it. It will bridge the gap for a few seconds, depending on how fast you are sending data, and it might be enough to eliminate the above-noted problem in your case, since you have very fast drives to begin with. But CEP2 writes a LOT of data, depending on how many cores it is running on at a time.

A caching program like FancyCache uses your main memory, which allows you to set the cache size to a GB or more; I think I used 4 GB when setting the write delay to 24 hours, but that is on the large side and not necessary with shorter write delays. The write cache in FancyCache is also a read cache, so any data still in the cache can be read from main memory rather than fetched from the drive. That cuts down a lot on head movement and reduces delays in getting the data out, too. In fact, when you set the write delay to more than about 12 hours, you are in effect creating a RAM disk: all the data in a work unit is written to, and read from, main memory rather than the disk. It may not be needed in your case, but it would definitely help on slower drives.
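For anyone curious how the write-delay idea works, here is a toy write-back cache of my own (FancyCache itself is closed source, so this is only the concept): writes sit in RAM until they are older than the delay, reads of hot blocks never touch the drive, and a flush pushes the dirty blocks out in sorted (serialized) order. The crash caveat above is visible right in the code: anything still in `dirty` at the moment of a power loss is gone.

    import time

    class WriteBackCache:
        def __init__(self, path, write_delay=600.0):   # delay in seconds
            self.path = path
            self.write_delay = write_delay
            self.dirty = {}    # offset -> (data, time it became dirty)

        def write(self, offset, data):
            # Repeated writes to the same offset cost one physical write.
            self.dirty[offset] = (data, time.monotonic())

        def read(self, offset, length):
            if offset in self.dirty:            # serve hot data from RAM
                return self.dirty[offset][0][:length]
            with open(self.path, "rb") as f:    # cache miss: hit the disk
                f.seek(offset)
                return f.read(length)

        def flush(self, force=False):
            # Push blocks older than write_delay (or all, if forced) to disk.
            now = time.monotonic()
            due = {o: d for o, (d, t) in self.dirty.items()
                   if force or now - t >= self.write_delay}
            if not due:
                return
            with open(self.path, "r+b") as f:
                for offset in sorted(due):      # sorted = serialized writes
                    f.seek(offset)
                    f.write(due[offset])
            for offset in due:
                del self.dirty[offset]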
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The failures of CEP2 are not occurring during periods of high drive activity, which is why I don't think drive throughput is the problem. There is high drive activity as each CEP2 workunit starts (the first 3 or 4 seconds), but the activity quickly drops to nominal. It's between 10 and 15 seconds into the run that CEP2 fails.

In any event, if Win7's Resource Monitor is to be trusted, the rate at which data is being written comes nowhere near the known top write speed of this drive, even on random writes, and that high drive activity has moderated well before CEP2 fails. If all four CEP2 processes were starting or doing heavy writes at the same time, it would seem more plausible that it was reaching the limits of the drive.

I have been running this PC since I joined in May of this year with "Run always" selected on the Activity menu and the preferences set to use 100% of processors and 100% of CPU time while processor usage is less than 50%. If I know I'm going to push the 50% threshold often (e.g., video editing), I simply suspend BOINC. In all other respects, I am very proud of the efficiency/low overhead on this PC. For example, the last validated CEP2 workunit from this afternoon (while the PC did not have an active user) shows 5.16 hours of CPU time out of 5.19 hours elapsed. At the time, it was running alongside another CEP2 and two FAAH workunits.

I have changed "Run always" to "Run based on preferences" to see if that might stabilize the situation. I have also kept FAAH in the rotation and set CEP2 to a maximum of 3 workunits at a time, which makes it impossible for all four cores to run CEP2 at the same time.

I must confess some bewilderment at the lack of a useful error code/log. As a Windows desktop app developer, I always write error events to the log (and optionally other events), so that there is at least some information to support troubleshooting. This often gives me valuable information to convince a user/client that they have a bad drive or a flaky network connection.

I'm going to let it run this way for a while and see how it behaves.
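In the same spirit, here is the kind of minimal safety net I mean, sketched in Python rather than my actual Windows code (the log-file name is made up): hook unhandled exceptions so that even a crash leaves a traceback in the log instead of vanishing silently.

    import logging, sys

    logging.basicConfig(
        filename="app_errors.log",   # hypothetical log path
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )

    def log_unhandled(exc_type, exc_value, exc_tb):
        # Record the full traceback, then defer to the default handler.
        logging.critical("unhandled exception",
                         exc_info=(exc_type, exc_value, exc_tb))
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = log_unhandled

    logging.info("application started")
    buffer = bytearray(4096)
    buffer[8192] = 0    # deliberate out-of-range write: logged, not silent

Even this much would tell you whether a task died at second 12 from an I/O error, a bad allocation, or a computation fault.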
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
"I just shrug and let the project scientists worry about it."

Thanks, Lawrence. I somehow missed your encouraging post. Much appreciated.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
"It also does not explain why you were able to run CEP2 for an extensive period of time without a problem."

Agreed, Sgt. Joe. It doesn't make sense to me either. It would run 24 to 36 hours without a problem, and then suddenly start throwing errors. We get very high-quality, stable power from the electricity provider in our area, and I have a good battery backup on that PC, which conditions the power. All my PCs also have high-efficiency power supplies rated for at least twice the wattage they require, so total power requirements shouldn't be an issue. With heavy primary drive activity, all four cores at 100%, and the LED/LCD monitor awake, this rig maxes out around 150 W draw on a 900 W battery backup.

As to running 3 CEP2 to 1 FAAH, I can only manage that if I micromanage the queue. I have BOINC on that PC set for half a day's cache. Since CEP2 is currently limited to 3 workunits at a time, it keeps a much larger cache of FAAH than CEP2, so there are times when only FAAH workunits are running. Wish I could tell it to do 3 to 1 . . .
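For what it's worth, sufficiently recent BOINC clients can enforce that kind of per-application cap themselves via an app_config.xml file placed in the project's data directory. A minimal sketch follows; the short application name "cep2" is my assumption, so check the name your client actually reports in client_state.xml before using it:

    <!-- app_config.xml in the World Community Grid project directory.
         Requires a BOINC client new enough to support app_config.xml.
         The short name "cep2" is an assumption; verify it first. -->
    <app_config>
        <app>
            <name>cep2</name>
            <!-- never run more than 3 CEP2 tasks at once -->
            <max_concurrent>3</max_concurrent>
        </app>
    </app_config>

After saving the file, re-read the config files from the BOINC Manager (or restart the client) for it to take effect.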
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I do think I'd at least consider loading the combination of eFMer's BoincTasks and TThrottle, to give yourself the ability to monitor and log processor core temperatures and to throttle the processor down when it overheats. You could then state firmly that your core-temperature data rules out overheating as an issue, rather than relying on whatever assumptions you are making about ambient temperatures.

You'd be surprised at how quickly environmental conditions can change where cooling is involved... as when your cleaning service, significant other, or mother does you the great good favor of dosing the carpets with a substance like Carpet Fresh, or construction upwind of your A/C fresh-air intakes blesses you with some grade A rock or cement dust.
CandymanWCG
Senior Cruncher | Romania | Joined: Dec 20, 2010 | Post Count: 421 | Status: Offline
"I do think I'd at least consider loading the combination of eFMer's BoincTasks and TThrottle..."

...or you could use the recommended setting for this project and choose to run it on just 50% of your CPU cores (which, by coincidence or not, is the same as your old setting of running only 2 units at a time).

My two cents on the matter are these: even if the error you see is because of the huge amount of I/O on whatever type of storage you have, and assuming you somehow get that fixed, I would bet it won't be long before you start seeing another nice error: "no heartbeat from client for more than x seconds...exiting". And that, my friend, is the CPU getting clogged up. So my advice: go back to running it on 2 cores. If you have other devices that run other sciences, maybe you can add a unit or two there to compensate for the other 2 cores that you won't be running CEP2 on.

Oh, and by the way, the fact that BOINC refused to give you any more units is just a fail-safe put in place for exactly this type of situation, where a machine spits out only bad results. You would put a leash on it too, right? But no worries: as soon as you get it fixed, your daily quota will increase little by little, and since the maximum is not that far off, you shouldn't even notice when you are back to normal.

Hope this helps. Cheers!

Knowledge is limited. Imagination encircles the world! - Albert Einstein

[Edit 1 times, last edit by CandymanWCG at Nov 14, 2012 11:02:45 PM]