Thread Status: Active
Total posts in this thread: 33
This topic has been viewed 142074 times and has 32 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

Interesting. One task appears as running for 55h 20m, properties report 2h 39m of CPU time, and Windows Task Manager reports 0:01:42 (1 hour 42 minutes, or 1 minute 42 seconds?) of time for that PID.

But, a task that shows as 99% complete in three hours shows only 0:00:01 CPU time in Task Manager. So it seems that my processes aren't showing what I see in BOINC.

Anyway, reducing to a heptacore didn't do anything; I still see the same behaviour.


Hi ashes999.

As no one else has asked: can you shut down BOINC, restart it, then copy the first 30+ lines of the event log (up to where the tasks start) and paste them here? Seeing what this rig is doing might help.

Also, have you changed the checkpoint save interval in BOINC for some reason?
----------------------------------------
[Edit 2 times, last edit by Former Member at May 23, 2013 12:44:55 AM]
[May 23, 2013 12:21:42 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

Since, like CEP2, the VINA jobs launch a new 'job' after each checkpoint, you will never see the total CPU time in the TM, just the time for the one docking. BOINC shows the accumulated CPU time. Time in TM is shown as hh:mm:ss, so when you see 0:12:30 it's 12 minutes 30 seconds.

It makes absolutely zero sense that, when you reduce from 8 to 7 cores in BOINC, there's still 1 VINA job exhibiting the Elapsed / CPU time discrepancy. Switch on checkpoint logging by adding the <checkpoint_debug> tag to the <log_flags> section of cc_config.xml, and start counting the entries per job. If one never checkpoints, it will never have a log entry.
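A rough Python sketch of the two steps above: writing the log flag into cc_config.xml and counting checkpoint entries per job. The `timestamp | project | message` layout follows the event log excerpts posted in this thread. The marker text `checkpointed`, the function names, and the sample task names are assumptions for illustration only, so adjust the marker to whatever your client actually prints once <checkpoint_debug> is on.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with the given log flags switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


def count_checkpoints(log_lines, task_names, marker="checkpointed"):
    """Count log lines containing the marker, per task name.

    Tasks that never checkpoint stay at 0, which is the case to look for.
    The marker string is an assumption, not a confirmed BOINC message format.
    """
    counts = {name: 0 for name in task_names}
    for line in log_lines:
        # Event log lines look like "timestamp | project | message".
        message = line.split("|", 2)[-1]
        if marker not in message:
            continue
        words = message.split()
        for name in task_names:
            if name in words:  # exact word match, not substring
                counts[name] += 1
    return counts


if __name__ == "__main__":
    print(build_cc_config(["checkpoint_debug"]))
    # Hypothetical sample lines; real task names and messages will differ.
    sample = [
        "23/05/2013 11:00:00 AM | Project | [checkpoint] result job_A_0 checkpointed",
        "23/05/2013 11:10:00 AM | Project | [checkpoint] result job_A_0 checkpointed",
        "23/05/2013 11:10:05 AM | Project | Restarting task job_B_0",
    ]
    print(count_checkpoints(sample, ["job_A_0", "job_B_0"]))
```

A job whose count stays at 0 while the others keep climbing would be the one to look at.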
[May 23, 2013 6:26:41 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

BTW, you never answered what the 'System idle process' was showing when BOINC ran and used all 8 cores. Also, on that note, I assume BOINC is still set to use 100% of CPU time?
[May 23, 2013 6:29:33 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

Hi ashes999.

If you're using Windows, have you had a look at what the core/thread speeds are under load, with something like CPU-Z, to see if it is slowing down for some reason? I just thought that might account for some of the lost time.

Plus you might want to check your BIOS to see if there is something in there that is slowing it down.
[May 23, 2013 6:38:57 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

P.P.L., it's only on 1 of 8 threads. Plus, even if you slow a CPU, the Elapsed time and CPU time still count on the same clock... jobs will just take longer.
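To illustrate the point about the two clocks, here is a small Python sketch. It is a general operating-system illustration, not how BOINC itself measures anything, and the function names are made up for the example. A busy loop advances elapsed time and CPU time together however fast the core runs; a gap only opens up when the process is not actually executing, which `time.sleep` stands in for here.

```python
import time


def measure(work):
    """Run work() and return (elapsed_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def busy(seconds=0.2):
    """Spin on the CPU for roughly the given wall-clock duration."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass


def idle(seconds=0.2):
    """Sleep: the wall clock advances while CPU time barely moves."""
    time.sleep(seconds)


if __name__ == "__main__":
    for name, work in (("busy", busy), ("idle", idle)):
        elapsed, cpu = measure(work)
        print(f"{name}: elapsed {elapsed:.2f}s, cpu {cpu:.2f}s")
```

So a large Elapsed / CPU gap on a single job points at that job not getting scheduled (or waiting on something), rather than at a slow clock.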
[May 23, 2013 6:47:55 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

ashes999, is the CPU time for the processor set in BOINC to 100%? The startup log as per P.P.L. is indeed now of interest, as is a piece of log when BOINC is in full swing, with the suggested checkpoint logging activated.
[May 23, 2013 7:35:11 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

ashes999.

Another thing: since this is a work PC, is it possible that the IT person/dept has installed some software on the rig that you don't know about and, more than likely, wouldn't be able to see and fiddle with?
----------------------------------------
[Edit 1 times, last edit by Former Member at May 23, 2013 7:53:54 AM]
[May 23, 2013 7:46:29 AM]
NightBlade
Advanced Cruncher
Joined: Jun 10, 2008
Post Count: 89
Status: Offline
Re: WUs take forever and don't checkpoint

Hi all,

Thanks for all the feedback and information. The main issue is that this happens sporadically, not all the time; I see usually none, sometimes one, and rarely two WUs in this weird state.

@SekeRob thanks for explaining about the TM stuff, that makes perfect sense now. I've integrated your cc_config.xml changes and set my CPU back to 100% of CPUs and 85% usage; let's see if that causes the problem to emerge. It seemed like switching to 75% (6/8) solved the problem, but it occurs non-deterministically, so I can't say if this is for sure or not.

As for support, I have full permission and full control to do whatever I want to my machine. There are a few rare exceptions, like disabling the firewall. It's very unlikely they installed anything on it, because the machine came to me very, very bare-bones with just an OS.

As requested @P.P.L., here are the first 30-ish lines of my log file. In cc_config.xml, I added checkpoint_debug and cpu_sched_debug. Again, this is with 100% CPUs and 85% utilization.


23/05/2013 10:53:16 AM | | Starting BOINC client version 7.0.64 for windows_x86_64
23/05/2013 10:53:16 AM | | log flags: file_xfer, sched_ops, task, checkpoint_debug, cpu_sched_debug
23/05/2013 10:53:16 AM | | Libraries: libcurl/7.25.0 OpenSSL/1.0.1 zlib/1.2.6
23/05/2013 10:53:16 AM | | Running as a daemon
23/05/2013 10:53:16 AM | | Data directory: D:\Program Files (x86)\BOINC\Data
23/05/2013 10:53:16 AM | | Running under account boinc_master
23/05/2013 10:53:16 AM | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [Family 6 Model 58 Stepping 9]
23/05/2013 10:53:16 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 popcnt aes nx lm vmx smx tm2 pbe
23/05/2013 10:53:16 AM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
23/05/2013 10:53:16 AM | | Memory: 15.96 GB physical, 31.91 GB virtual
23/05/2013 10:53:16 AM | | Disk: 186.31 GB total, 172.94 GB free
23/05/2013 10:53:16 AM | | Local time is UTC -4 hours
23/05/2013 10:53:16 AM | | No usable GPUs found
23/05/2013 10:53:16 AM | VolPEx | URL http://volpex.cs.uh.edu/VCP/; Computer ID 7122; resource share 900
23/05/2013 10:53:16 AM | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 2346629; resource share 100
23/05/2013 10:53:16 AM | | General prefs: from http://bam.boincstats.com/ (last modified 27-Dec-2012 11:32:09)
23/05/2013 10:53:16 AM | | Host location: none
23/05/2013 10:53:16 AM | | General prefs: using your defaults
23/05/2013 10:53:16 AM | | Reading preferences override file
23/05/2013 10:53:16 AM | | Preferences:
23/05/2013 10:53:16 AM | | max memory usage when active: 8169.22MB
23/05/2013 10:53:16 AM | | max memory usage when idle: 14704.59MB
23/05/2013 10:53:16 AM | | max disk usage: 100.00GB
23/05/2013 10:53:16 AM | | (to change preferences, visit a project web site or select Preferences in the Manager)
23/05/2013 10:53:16 AM | | [cpu_sched_debug] Request CPU reschedule: Prefs update
23/05/2013 10:53:16 AM | | [cpu_sched_debug] Request CPU reschedule: Startup
23/05/2013 10:53:16 AM | | Not using a proxy
23/05/2013 10:53:16 AM | | [cpu_sched_debug] Request CPU reschedule: Idle state change
23/05/2013 10:53:16 AM | | [cpu_sched_debug] Request CPU reschedule: periodic CPU scheduling
23/05/2013 10:53:16 AM | | [cpu_sched_debug] schedule_cpus(): start
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0584_0 (CPU job, priority order) (prio -1.000000)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0624_0 (CPU job, priority order) (prio -1.005208)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0778_0 (CPU job, priority order) (prio -1.010417)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0449_0 (CPU job, priority order) (prio -1.015625)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0011_0 (CPU job, priority order) (prio -1.020833)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0673_0 (CPU job, priority order) (prio -1.026042)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0708_0 (CPU job, priority order) (prio -1.031250)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0156_0 (CPU job, priority order) (prio -1.036458)
23/05/2013 10:53:16 AM | | [cpu_sched_debug] enforce_schedule(): start
23/05/2013 10:53:16 AM | | [cpu_sched_debug] preliminary job list:
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 0: SN2S_Smp102070_0000095_0584_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 1: SN2S_Smp102070_0000095_0624_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 2: SN2S_Smp102070_0000095_0778_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 3: SN2S_Smp102070_0000095_0449_0 (MD: no; UTS: yes)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 4: SN2S_Smp102070_0000095_0011_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 5: SN2S_Smp102070_0000095_0673_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 6: SN2S_Smp102070_0000095_0708_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 7: SN2S_Smp102070_0000095_0156_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | | [cpu_sched_debug] final job list:
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 0: SN2S_Smp102070_0000095_0584_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 1: SN2S_Smp102070_0000095_0624_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 2: SN2S_Smp102070_0000095_0778_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 3: SN2S_Smp102070_0000095_0449_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 4: SN2S_Smp102070_0000095_0011_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 5: SN2S_Smp102070_0000095_0673_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 6: SN2S_Smp102070_0000095_0708_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] 7: SN2S_Smp102070_0000095_0156_0 (MD: no; UTS: no)
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0584_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0624_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0778_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0449_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0011_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0673_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0708_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] scheduling SN2S_Smp102070_0000095_0156_0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0584_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0624_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0778_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0449_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0011_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0673_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0708_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | [cpu_sched_debug] SN2S_Smp102070_0000095_0156_0 sched state 1 next 2 task state 0
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0584_0 using sn2s version 620 in slot 2
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0624_0 using sn2s version 620 in slot 11
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0778_0 using sn2s version 620 in slot 12
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0449_0 using sn2s version 620 in slot 6
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0011_0 using sn2s version 620 in slot 5
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0673_0 using sn2s version 620 in slot 9
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0708_0 using sn2s version 620 in slot 13
23/05/2013 10:53:16 AM | World Community Grid | Restarting task SN2S_Smp102070_0000095_0156_0 using sn2s version 620 in slot 14
23/05/2013 10:53:16 AM | | [cpu_sched_debug] enforce_schedule: end

----------------------------------------

[May 23, 2013 2:57:41 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WUs take forever and don't checkpoint

The 85% utilization is the potential issue [it has been in the past]... it's a meaningless setting for desktops/servers anyhow, which is why I proposed 100% CPU time. 85% translates to running 17/20ths of a unit of time, which is counted in whole seconds, and pausing 3/20ths [to cool down, which is meant for laptops]. 85% is ineffective even for laptops anyway; only something like 50% [WCG default] will let the client run 1 second, pause 1 second, to prevent CPU fan oscillation.
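The arithmetic behind those fractions can be sketched in Python. This is only the run/pause ratio worked out under the whole-second granularity described above; it is not BOINC's actual throttling code, and the function names are hypothetical.

```python
from fractions import Fraction


def duty_cycle(cpu_time_percent):
    """Return (run, pause) as fractions of one throttling period."""
    run = Fraction(cpu_time_percent, 100)
    return run, 1 - run


def whole_second_cycle(cpu_time_percent):
    """Smallest run/pause pattern in whole seconds matching the percentage."""
    run, pause = duty_cycle(cpu_time_percent)
    period = run.denominator  # seconds in one full cycle
    return int(run * period), int(pause * period)


if __name__ == "__main__":
    for percent in (85, 50, 100):
        run_s, pause_s = whole_second_cycle(percent)
        print(f"{percent}%: run {run_s}s, pause {pause_s}s")
```

For 85% this gives 17 seconds running and 3 seconds paused per 20-second cycle; for 50% it gives 1 second on, 1 second off.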

I'd take the cpu_sched_debug out... it generates lots of output with little value in this situation.
[May 23, 2013 3:08:08 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Some VINA based WUs take forever and don't checkpoint

P.S. Please edit the opening post and specify [Some VINA based WU take forever and don't checkpoint] as I did to this post. As is evident, some readers misunderstand it as being a broader problem.
[May 23, 2013 3:14:39 PM]