World Community Grid Forums
Thread Status: Active. Total posts in this thread: 44.
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline
Not really on topic, but my non-HT Q6600, filtering with pirogue's WCGDAWS tool, has had exactly 1 task hitting the 12-hour mark in the last 60 days, doing about 6 per day [50% of cores]. From a tech note a week or so ago, there's going to be something that will allow these to run up to 24 hours; opt-in was my interpretation. And from the statistics, the average run time has actually dropped since the implementation of ZR, from 8 hours to 6 hours, so that high cutoff percentage is largely self-inflicted. Certainly 57% is not exactly representative, but if it bothers you, emulating non-HT crunching by restricting BOINC to the physical cores would possibly remove the bulk of your waste, and might even increase your number of results... you're good with numbers ;>)

Taking a quick look at all CEP2, not only the ones from August until now, the i7 had 120 of 592 hitting the limit, or 20%, while the other computer had 4 out of 548, or 0.7%, hitting the limit. This is definitely better, but it is still inefficient to use the i7 for CEP2. As for disabling HT, I did benchmark this over a year ago, and using HT gave 20% higher Hadam3P production compared to not using HT, so I'm definitely keeping it enabled.

In any case, bringing things back on topic: personally I'm limiting CEP2 to a maximum of 4 at a time, and I normally start them manually to limit the chance of them starting simultaneously and thrashing excessively. Having the 6466 application files under the qcaux directory tree duplicated for each task is inefficient, and is at least partially responsible for CEP2 crapping out as often as it does. I once tried putting 16 copies of qcaux under the project directory (meaning no CEP2 had even started), and just choosing the "Disk" tab in BOINC Manager was enough for all running tasks (AFAIK HCC at the time) to get the dreaded "no heartbeat". I even tried spreading the 16 directories across 3 hard drives, but this gave the same result. So, until CEP2 decreases the duplication of the application files, or decreases how many application files they use, there is a limit to how many CEP2 tasks a computer can have started at once (it doesn't matter if some of them are stopped and not even loaded into memory; just having all the files under a slot directory is enough to cause problems).

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
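Note that the clients discussed in this thread predate it, but newer BOINC clients (7.x) added an app_config.xml that can cap concurrent tasks per application, which automates the kind of manual limit described above. A minimal sketch, assuming the file goes in projects/www.worldcommunitygrid.org/ and that the short application name is cep2 (the real name should be checked in client_state.xml):

    <app_config>
        <app>
            <name>cep2</name>                     <!-- assumed short app name; verify in client_state.xml -->
            <max_concurrent>4</max_concurrent>    <!-- run at most 4 CEP2 tasks at once -->
        </app>
    </app_config>

This only limits how many tasks run at the same time; it does not reduce the per-slot duplication of the qcaux files, so the disk-scan problem described above remains.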
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Storage I/O bottleneck? Then again, why would I have it on a RevoDrive 3, while on a VelociRaptor it works fine with 24 cores on it?
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Where is it not progressing? Please post a result log. That will tell more directly why your CEP2 tasks bonk when you run them on a large number of cores concurrently. Ingleside makes the same heartbeat observation. I can run all cores, but it becomes hugely inefficient, and I definitely have to let the client pause when using the device myself, else they go south anyhow.

Someone commented at one time, I think it was skgiven, that an older drive would do better, or something along that line. I did not quite follow the logic. Drive caches? CPU L2/L3 caches? CNRE --//--
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I was running 12x CEP2 for a long time with no problem, until...

First, my internet connection got too slow to handle all the large file uploads. The fix was to edit /var/lib/boinc-client/cc_config.xml and insert max_file_xfers, for example:

    <options>
        <max_file_xfers>1</max_file_xfers>
    </options>

The second time I had problems was when I went from 3x2 GB of memory to 6x2 GB. The fix was to back the memory speed down from 2000 MHz to 1333 MHz; then it worked fine again for a while. (I think the X58 board didn't like having all 6 slots filled.)

The third time I had problems was when, at some point, something changed with the CEP2 work units. I think the techs did some tweaking to the code. That was the hard one to fix, because the only way I could make them run 12x was to temporarily change over to Windows XP x64. Then they ran fine again. That was the hardest one for me, because I am a diehard Linux fan.

I don't know if any of these might help, but I thought I'd kick in my 2 cents.

OS: Linux Mint 7 x64 / Windows XP Pro x64 / Windows 7 Pro x64
MB: Gigabyte X58A-UD7
CPU: i7-980X
HD: Crucial C300 RealSSD 256 GB (single drive)
Mem: Corsair Dominator GT 12 GB 2000 MHz 6x2

[Edit 1 times, last edit by Former Member at Nov 26, 2011 7:52:57 PM]
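For reference, the <options> fragment above belongs inside a <cc_config> root element; a minimal complete cc_config.xml might look like this sketch:

    <cc_config>
        <options>
            <max_file_xfers>1</max_file_xfers>  <!-- allow only one file transfer at a time -->
        </options>
    </cc_config>

The client picks the change up after re-reading the config file from BOINC Manager's Advanced menu, via boinccmd --read_cc_config, or after a client restart.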
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hi
Here is a result log as an example.

Result log
Result name: E204101_111_C.31.C23H10N4S4.00595379.3.set1d06_0--

<core_client_version>6.12.34</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[20:32:56] Number of jobs = 16
[20:32:56] Starting job 0,CPU time has been restored to 0.000000.
[20:34:56] Finished Job #0
[20:34:56] Starting job 1,CPU time has been restored to 119.156250.
[20:41:13] Finished Job #1
[20:41:13] Starting job 2,CPU time has been restored to 492.187500.
Quit requested: Exiting
[21:10:36] Number of jobs = 16
[21:10:36] Starting job 2,CPU time has been restored to 492.187500.
22:06:24 (4288): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[22:07:02] Number of jobs = 16
[22:07:02] Starting job 2,CPU time has been restored to 492.187500.
Quit requested: Exiting
[08:56:14] Number of jobs = 16
[08:56:14] Starting job 2,CPU time has been restored to 492.187500.
10:55:24 (5172): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[10:56:00] Number of jobs = 16
[10:56:00] Starting job 2,CPU time has been restored to 492.187500.
12:54:15 (7672): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:54:54] Number of jobs = 16
[12:54:54] Starting job 2,CPU time has been restored to 492.187500.
[14:55:03] Finished Job #2
[14:55:03] Starting job 3,CPU time has been restored to 7646.156250.
[15:01:54] Finished Job #3
[15:01:54] Starting job 4,CPU time has been restored to 8053.609375.
[15:07:26] Finished Job #4
[15:07:26] Starting job 5,CPU time has been restored to 8384.187500.
[15:13:07] Finished Job #5
[15:13:07] Starting job 6,CPU time has been restored to 8723.171875.
[15:18:41] Finished Job #6
[15:18:41] Starting job 7,CPU time has been restored to 9055.906250.
[15:26:44] Finished Job #7
[15:26:44] Starting job 8,CPU time has been restored to 9535.718750.
[15:32:45] Finished Job #8
[15:32:45] Starting job 9,CPU time has been restored to 9895.718750.
[15:39:18] Finished Job #9
[15:39:18] Starting job 10,CPU time has been restored to 10286.906250.
[15:52:19] Finished Job #10
[15:52:19] Starting job 11,CPU time has been restored to 11065.734375.
[16:00:21] Finished Job #11
[16:00:21] Starting job 12,CPU time has been restored to 11545.359375.
Application exited with RC = 0xc0000005
[16:49:35] Finished Job #12
[16:49:35] Starting job 13,CPU time has been restored to 14492.531250.
[16:49:35] Skipping Job #13
[16:49:35] Starting job 14,CPU time has been restored to 14492.531250.
[16:49:35] Skipping Job #14
[16:49:35] Starting job 15,CPU time has been restored to 14492.531250.
[16:49:35] Skipping Job #15
16:49:44 (4544): called boinc_finish
</stderr_txt>
]]>
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Yup, exactly what I thought it was [and Ingleside too]:

22:06:24 (4288): No heartbeat from core client for 30 sec - exiting

Your system bottlenecks somewhere; tomast's post is of interest. He does not mention it [I'm tegnorant], but my guess is it's DDR3 memory using triple channel on his mobo. --//--

edit: @tomast, the downside of your max_file setting of 1 is that if an upload stalls, you will not get anything else through to WCG. My upload bandwidth is not great, 1 Mb, but by limiting the upload speed [a BOINC setting] I found that downloading concurrently goes much faster. Others found the same.

[Edit 1 times, last edit by Former Member at Nov 27, 2011 8:25:38 AM]
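For anyone who prefers a file over the GUI, the upload-rate cap mentioned here corresponds to the max_bytes_sec_up preference, which can be placed in a global_prefs_override.xml in the BOINC data directory. A minimal sketch, with the value purely as an example for a roughly 1 Mbit uplink:

    <global_preferences>
        <max_bytes_sec_up>65536</max_bytes_sec_up>  <!-- example: cap uploads around 64 KB/s -->
    </global_preferences>

The manager's network preferences write the same override file, so whichever route is used, the client applies it once preferences are re-read.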
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
I think I know where the problem is coming from. It's most probably because the RevoDrive is not connected to PCIe slot 1.
Currently it is in slot 7, but there it works in 2.5 Gb/s mode. I'll check that next week; I can't move it right now because of the watercooling.
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline
> Your system bottlenecks somewhere; tomast's post is of interest. He does not mention it [I'm tegnorant], but my guess is it's DDR3 memory using triple channel on his mobo.

Any i7-9xx system, including all 12-core (with HT) systems and Xeon systems, uses triple channel, and both mention they're using 6x 2 GB sticks. It's only the more recent low-end i7-2600 and similar systems that are dual-channel only.

> edit: @tomast, the downside of your max_file setting of 1 is that if an upload stalls, you will not get anything else through to WCG. My upload bandwidth is not great, 1 Mb, but by limiting the upload speed [a BOINC setting] I found that downloading concurrently goes much faster. Others found the same.

Uploads and downloads aren't permanently stalled. As long as you're not stuck with the ancient v5.10.45 client (and likely also the case with v6.2.xx clients), any stuck transfer should time out within 5 minutes. The timeout is now controllable in v6.12.27 and later, and can also terminate a slow transfer: use <http_transfer_timeout>seconds</http_transfer_timeout> and <http_transfer_timeout_bps>bps</http_transfer_timeout_bps> as options in cc_config.xml.

I have for years limited transfers to 1 at a time without any problems. But granted, if you're mostly running non-CEP2 and WCG-only, many WCG tasks use multiple files, and many of these files are only a couple of KB in size, so even increasing from the default 2 at a time can be an advantage. Also, for anyone running at most 1 CEP2 at a time, keeping it at the default 2 shouldn't be a problem. The only problem is that WCG doesn't have an option for getting only 1 CEP2 at a time, since in my experience the other WCG tasks' small download size will keep the computer below the bandwidth limit, so you'll never get any CEP2 at all.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
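Putting the transfer cap and the two timeout options together, a cc_config.xml along these lines is one possible sketch; the numbers are illustrative only, not recommendations, and as noted above the timeout options need client v6.12.27 or later:

    <cc_config>
        <options>
            <max_file_xfers>1</max_file_xfers>                            <!-- one transfer at a time -->
            <http_transfer_timeout>300</http_transfer_timeout>            <!-- give up on a stalled transfer after 300 seconds -->
            <http_transfer_timeout_bps>2000</http_transfer_timeout_bps>   <!-- also abort transfers that stay below this rate -->
        </options>
    </cc_config>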
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
One more of these settings only the in-crowd knows about. It does not help any middle-of-the-road users who want things GUI-driven and don't want to RTFM. Yup, crunching is really simple, set and forget... Sunday afternoon live, and lol.

My duo presently reads as having a download bandwidth of 284 Kb and an upload of 29 Kb, yet aside from CEP2 there are only the two-part small (one part zipped) downloads for DSFL/GFAM coming down (about 6-8 per day), and they're even smaller, 0.1 Mb. Set to the default of 1, CEP2 keeps coming and there are no complaints in the logs. I think the requirement was lowered to 64 Kb, and though there is a reference to the System Requirements page on the profile, I can't seem to see what the bandwidth limit presently is (I did not get a personal email to notify me of the change ;>). So for me the small files do not have a detrimental impact on what's registered by BOINC. --//--

[Edit 1 times, last edit by Former Member at Nov 27, 2011 1:51:37 PM]
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hmm, I also checked the Windows logs and noticed the following:

The system time changed to 2011-11-27T16:40:57.500000000Z from 2011-11-27T16:42:08.506835900Z.

while BOINC logged:

warmachine-SR2 749 27/11/2011 17:42:07 System clock was turned backwards; clearing timeouts

So what is going on? Why do I get such a time update? Even with only 6 of 24 cores engaged, I still have the issue. I have disabled NTP synchronization for testing.

[Edit 2 times, last edit by Former Member at Nov 27, 2011 6:42:33 PM]
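For anyone checking the same thing on Windows: the time-sync status can be inspected with the standard w32tm /query /status command, and the Windows Time service can be stopped temporarily with net stop w32time; whether that actually cures the backwards clock steps seen here is only a guess.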