World Community Grid Forums
Thread Status: Active. Total posts in this thread: 44.
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline
Not really on topic, but my non-HT Q6600, filtering with pirogue's WCGDAWS tool, has had exactly 1 task hitting the 12-hour mark in the last 60 days, doing about 6 per day [50% of cores]. From a tech note a week or so ago, there's going to be something that will allow these to run up to 24 hours; opt-in was my interpretation. And from the statistics, the average run time has actually dropped since the implementation of ZR, from 8 hours to 6 hours, so that high cutoff percentage is largely self-inflicted. Certainly 57% is not exactly representative, but if it bothers you, emulating non-HT crunching by restricting BOINC to the physical cores would possibly remove the bulk of your waste, and might even increase your number of results... you're good with numbers ;>)

Taking a quick look at all CEP2, not only the ones from August until now, the i7 had 120 of 592 hitting the limit, or 20%, while the other computer had 4 out of 548, or 0.7%, hitting the limit. This is definitely better, but it is still inefficient to use the i7 for CEP2. As for disabling HT, I did benchmark this over a year ago, and using HT gave 20% higher Hadam3P production compared to not using HT, so I'm definitely keeping it enabled.

In any case, bringing things back on topic: personally I'm limiting CEP2 to a maximum of 4 at a time, and I normally start them manually to limit the chance of them starting simultaneously and thrashing excessively. Having the 6466 application files under the qcaux directory tree duplicated for each task is inefficient, and is at least partially responsible for CEP2 crapping out as often as it does. I once tried putting 16 copies of qcaux under the project directory (meaning no CEP2 had even started), and just choosing the "Disk" tab in BOINC Manager was enough for all running tasks (AFAIK HCC at the time) to get the dreaded "no heartbeat". I even tried spreading the 16 directories across 3 hard drives, but this gave the same result. So, until CEP2 decreases the duplication of the application files, or decreases how many application files they use, there is a limit to how many CEP2 tasks a computer can have started at once (it doesn't matter if some of them are stopped and not even loaded into memory; just having all the files under a slot directory is enough to cause problems).

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
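Note that the clients discussed in this thread predate it, but newer BOINC clients (7.x) added an app_config.xml that can cap concurrent tasks per application, which automates the kind of manual limit described above. A minimal sketch, assuming the file goes in projects/www.worldcommunitygrid.org/ and that the short application name is cep2 (the real name should be checked in client_state.xml):

    <app_config>
        <app>
            <name>cep2</name>                     <!-- assumed short app name; verify in client_state.xml -->
            <max_concurrent>4</max_concurrent>    <!-- run at most 4 CEP2 tasks at once -->
        </app>
    </app_config>

This only limits how many tasks run at the same time; it does not reduce the per-slot duplication of the qcaux files, so the disk-scan problem described above remains.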
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Storage I/O bottleneck? Then again, why would I have it on a RevoDrive 3, while on a VelociRaptor it works fine with 24 cores on it?
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Where is it not progressing? Please post a result log. That will tell more directly why your CEP2 tasks bonk when you run them on a large number of cores concurrently. Ingleside makes the same heartbeat observation. I can run all cores, but it becomes hugely inefficient, and I definitely have to let the client pause when using the device myself, else they go south anyhow.

Someone commented at one time, I think it was skgiven, that an older drive would do better, or something along that line. I did not quite follow the logic. Drive caches? CPU L2/L3 caches? CNRE --//--
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I was running 12x CEP2 for a long time with no problem, until...

First, my internet connection got too slow to handle all the large file uploads. The fix was to edit /var/lib/boinc-client/cc_config.xml and insert max_file_xfers, for example:

    <options>
        <max_file_xfers>1</max_file_xfers>
    </options>

The second time I had problems was when I went from 3x2 GB of memory to 6x2 GB. The fix was to back the memory speed down from 2000 MHz to 1333 MHz; then it worked fine again for a while. (I think the X58 board didn't like having all 6 slots filled.)

The third time I had problems was when, at some point, something changed with the CEP2 work units. I think the techs did some tweaking to the code. That was the hard one to fix, because the only way I could make them run 12x was to temporarily change over to Windows XP x64. Then they ran fine again. That was the hardest one for me, because I am a diehard Linux fan.

I don't know if any of these might help, but I thought I'd kick in my 2 cents.

OS: Linux Mint 7 x64 / Windows XP Pro x64 / Windows 7 Pro x64
MB: Gigabyte X58A-UD7
CPU: i7-980X
HD: Crucial C300 RealSSD 256 GB (single drive)
Mem: Corsair Dominator GT 12 GB 2000 MHz 6x2

[Edit 1 times, last edit by Former Member at Nov 26, 2011 7:52:57 PM]
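For reference, the <options> fragment above belongs inside a <cc_config> root element; a minimal complete cc_config.xml might look like this sketch:

    <cc_config>
        <options>
            <max_file_xfers>1</max_file_xfers>  <!-- allow only one file transfer at a time -->
        </options>
    </cc_config>

The client picks the change up after re-reading the config file from BOINC Manager's Advanced menu, via boinccmd --read_cc_config, or after a client restart.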
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hi
Here is a result log as an example.

Result log
Result name: E204101_111_C.31.C23H10N4S4.00595379.3.set1d06_0--

<core_client_version>6.12.34</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[20:32:56] Number of jobs = 16
[20:32:56] Starting job 0,CPU time has been restored to 0.000000.
[20:34:56] Finished Job #0
[20:34:56] Starting job 1,CPU time has been restored to 119.156250.
[20:41:13] Finished Job #1
[20:41:13] Starting job 2,CPU time has been restored to 492.187500.
Quit requested: Exiting
[21:10:36] Number of jobs = 16
[21:10:36] Starting job 2,CPU time has been restored to 492.187500.
22:06:24 (4288): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[22:07:02] Number of jobs = 16
[22:07:02] Starting job 2,CPU time has been restored to 492.187500.
Quit requested: Exiting
[08:56:14] Number of jobs = 16
[08:56:14] Starting job 2,CPU time has been restored to 492.187500.
10:55:24 (5172): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[10:56:00] Number of jobs = 16
[10:56:00] Starting job 2,CPU time has been restored to 492.187500.
12:54:15 (7672): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:54:54] Number of jobs = 16
[12:54:54] Starting job 2,CPU time has been restored to 492.187500.
[14:55:03] Finished Job #2
[14:55:03] Starting job 3,CPU time has been restored to 7646.156250.
[15:01:54] Finished Job #3
[15:01:54] Starting job 4,CPU time has been restored to 8053.609375.
[15:07:26] Finished Job #4
[15:07:26] Starting job 5,CPU time has been restored to 8384.187500.
[15:13:07] Finished Job #5
[15:13:07] Starting job 6,CPU time has been restored to 8723.171875.
[15:18:41] Finished Job #6
[15:18:41] Starting job 7,CPU time has been restored to 9055.906250.
[15:26:44] Finished Job #7
[15:26:44] Starting job 8,CPU time has been restored to 9535.718750.
[15:32:45] Finished Job #8
[15:32:45] Starting job 9,CPU time has been restored to 9895.718750.
[15:39:18] Finished Job #9
[15:39:18] Starting job 10,CPU time has been restored to 10286.906250.
[15:52:19] Finished Job #10
[15:52:19] Starting job 11,CPU time has been restored to 11065.734375.
[16:00:21] Finished Job #11
[16:00:21] Starting job 12,CPU time has been restored to 11545.359375.
Application exited with RC = 0xc0000005
[16:49:35] Finished Job #12
[16:49:35] Starting job 13,CPU time has been restored to 14492.531250.
[16:49:35] Skipping Job #13
[16:49:35] Starting job 14,CPU time has been restored to 14492.531250.
[16:49:35] Skipping Job #14
[16:49:35] Starting job 15,CPU time has been restored to 14492.531250.
[16:49:35] Skipping Job #15
16:49:44 (4544): called boinc_finish
</stderr_txt>
]]>
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Yup, exactly what I thought it was [and Ingleside too]:

22:06:24 (4288): No heartbeat from core client for 30 sec - exiting

Your system bottlenecks somewhere; tomast's post is of interest. He does not mention it [I'm tegnorant], but my guess is it's DDR3 memory using triple channel on his mobo. --//--

edit: @tomast, the downside of your max_file setting of 1 is that if an upload stalls, you will not get anything else through to WCG. My upload bandwidth is not great, 1 Mb, but by limiting the upload speed [a BOINC setting] I found that downloading concurrently goes much faster. Others found the same.

[Edit 1 times, last edit by Former Member at Nov 27, 2011 8:25:38 AM]
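For anyone who prefers a file over the GUI, the upload-rate cap mentioned here corresponds to the max_bytes_sec_up preference, which can be placed in a global_prefs_override.xml in the BOINC data directory. A minimal sketch, with the value purely as an example for a roughly 1 Mbit uplink:

    <global_preferences>
        <max_bytes_sec_up>65536</max_bytes_sec_up>  <!-- example: cap uploads around 64 KB/s -->
    </global_preferences>

The manager's network preferences write the same override file, so whichever route is used, the client applies it once preferences are re-read.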
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I think I know where the problem is coming from. It's most probably because the RevoDrive is not connected to PCIe slot 1.
Currently it is in slot 7, but there it works in 2.5 Gb/s mode. I'll check that next week; I can't move it right now because of the watercooling.
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline
> Your system bottlenecks somewhere; tomast's post is of interest. He does not mention it [I'm tegnorant], but my guess is it's DDR3 memory using triple channel on his mobo.

Any i7-9xx system, including all 12-core (with HT) systems and Xeon systems, uses triple channel, and both mention they're using 6x 2 GB sticks. It's only the more recent low-end i7-2600 and similar systems that are dual-channel only.

> edit: @tomast, the downside of your max_file setting of 1 is that if an upload stalls, you will not get anything else through to WCG. My upload bandwidth is not great, 1 Mb, but by limiting the upload speed [a BOINC setting] I found that downloading concurrently goes much faster. Others found the same.

Uploads and downloads aren't permanently stalled. As long as you're not stuck with the ancient v5.10.45 client (and likely also the case with v6.2.xx clients), any stuck transfer should time out within 5 minutes. The timeout is now controllable in v6.12.27 and later, and can also terminate a slow transfer: use <http_transfer_timeout>seconds</http_transfer_timeout> and <http_transfer_timeout_bps>bps</http_transfer_timeout_bps> as options in cc_config.xml.

I have for years limited transfers to 1 at a time without any problems. But granted, if you're mostly running non-CEP2 and WCG-only, many WCG tasks use multiple files, and many of these files are only a couple of KB in size, so even increasing from the default 2 at a time can be an advantage. Also, for anyone running at most 1 CEP2 at a time, keeping it at the default 2 shouldn't be a problem. The only problem is that WCG doesn't have an option for getting only 1 CEP2 at a time, since in my experience the other WCG tasks' small download size will keep the computer below the bandwidth limit, so you'll never get any CEP2 at all.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
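Putting the transfer cap and the two timeout options together, a cc_config.xml along these lines is one possible sketch; the numbers are illustrative only, not recommendations, and as noted above the timeout options need client v6.12.27 or later:

    <cc_config>
        <options>
            <max_file_xfers>1</max_file_xfers>                            <!-- one transfer at a time -->
            <http_transfer_timeout>300</http_transfer_timeout>            <!-- give up on a stalled transfer after 300 seconds -->
            <http_transfer_timeout_bps>2000</http_transfer_timeout_bps>   <!-- also abort transfers that stay below this rate -->
        </options>
    </cc_config>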
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
One more of these settings only the in-crowd knows about. It does not help any middle-of-the-road users who want things GUI-driven and don't want to RTFM. Yup, crunching is really simple, set and forget... Sunday afternoon live, and lol.

My duo presently reads as having a download bandwidth of 284 Kb and an upload of 29 Kb, yet aside from CEP2 there are only the two-part small (one part zipped) downloads for DSFL/GFAM coming down (about 6-8 per day), and they're even smaller, 0.1 Mb. Set to the default of 1, CEP2 keeps coming and there are no complaints in the logs. I think the requirement was lowered to 64 Kb, and though there is a reference to the System Requirements page on the profile, I can't seem to see what the bandwidth limit presently is (I did not get a personal email to notify me of the change ;>). So for me the small files do not have a detrimental impact on what's registered by BOINC. --//--

[Edit 1 times, last edit by Former Member at Nov 27, 2011 1:51:37 PM]
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hmm, I also checked the Windows logs and noticed the following:

The system time changed to 2011-11-27T16:40:57.500000000Z from 2011-11-27T16:42:08.506835900Z.

while BOINC logged:

warmachine-SR2 749 27/11/2011 17:42:07 System clock was turned backwards; clearing timeouts

So what is going on? Why do I get such a time update? Even with only 6 of 24 cores engaged, I still have the issue. I have disabled NTP synchronization for testing.

[Edit 2 times, last edit by Former Member at Nov 27, 2011 6:42:33 PM]
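For anyone checking the same thing on Windows: the time-sync status can be inspected with the standard w32tm /query /status command, and the Windows Time service can be stopped temporarily with net stop w32time; whether that actually cures the backwards clock steps seen here is only a guess.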