 

Thread Status: Active
Total posts in this thread: 13
Posts: 13   Pages: 2   [ 1 2 | Next Page ]
This topic has been viewed 4040 times and has 12 replies.
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
ARP1-task failing . . .

Is this error caused by a lack of memory after more than 9 hours of run time?
I see 'malloc' in the result log, which could mean memory allocation.

Result Name: ARP1_0034761_000_0--

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The storage control block address is invalid.
(0x9) - exit code 9 (0x9)</message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[10:14:56] INFO: Checkpoint taken at 2018-07-01_06:00:00
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 1
: Not enough space
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 2
: Not enough space
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 3
: Not enough space

</stderr_txt>
[Dec 7, 2019 9:59:54 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: ARP1-task failing . . .

If we believe the error messages, Crystal Pellet (and there is no indication that we shouldn't), then the function rsl_malloc tried three times to allocate almost 25 MB of memory and failed each time. The error message string can be found in the executable (on my Linux device, an ELF 64-bit LSB executable):
# strings wcgrid_arp1_wrf_7.27_x86_64-pc-linux-gnu | grep rsl_malloc
rsl_malloc failed allocating %d bytes, called %s, line %d, try %d
The reason for these errors is reported by the operating system itself: "Not enough space". Since the requested block of memory could not be allocated, there is no valid address associated with it, which explains the message "The storage control block address is invalid". (That text is also the standard Windows description of error code 9, which matches the "exit code 9 (0x9)" in the log.)
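To pull these failures out of a longer stderr log, here is a minimal Python sketch. It matches the format string shown above and converts the requested size to MB and MiB. The function name find_malloc_failures is only an illustration and is not part of BOINC or WRF. The sketch also assumes the log lines look exactly like the ones in the first post.

```python
import re

# Pattern built from the format string found in the ARP1 executable:
# "rsl_malloc failed allocating %d bytes, called %s, line %d, try %d"
RSL_MALLOC_RE = re.compile(
    r"rsl_malloc failed allocating (\d+) bytes, called ([^,]+), line (\d+), try (\d+)"
)


def find_malloc_failures(stderr_text):
    """Return (size_bytes, source_file, source_line, attempt) for each rsl_malloc failure."""
    return [
        (int(size), source, int(line), int(attempt))
        for size, source, line, attempt in RSL_MALLOC_RE.findall(stderr_text)
    ]


# The three failure lines from the result log in the first post.
stderr_text = r"""
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 1
: Not enough space
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 2
: Not enough space
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 3
: Not enough space
"""

for size, source, line, attempt in find_malloc_failures(stderr_text):
    # 1 MB = 10**6 bytes, 1 MiB = 2**20 bytes
    print(f"try {attempt}: {size} bytes = {size / 1e6:.1f} MB "
          f"({size / 2**20:.1f} MiB), {source} line {line}")
```

Each of the three tries asks for the same block of about 24.9 MB (23.8 MiB) from the same call site, rsl_bcast.c line 270.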
[Dec 7, 2019 2:54:28 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: ARP1-task failing . . .

Looks like not enough swap space.
----------------------------------------
In 1969 I took an oath to defend and protect the U.S. Constitution against all enemies, both foreign and domestic. There was no expiration date.


[Dec 7, 2019 4:13:16 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Re: ARP1-task failing . . .

Try allocating more disc space in device profiles.

Mike
[Dec 10, 2019 2:41:10 AM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Re: ARP1-task failing . . .

@Crystal Pellet, how much RAM does that machine have, and how many CPU cores/threads?
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Dec 10, 2019 4:06:58 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: ARP1-task failing . . .

Mike.Gibson wrote: "Try allocating more disc space in device profiles."

BOINC is designed to pause tasks ("waiting for memory") when the memory it is allowed to use is not sufficient, i.e. this has nothing to do with the disk space settings in the BOINC device profiles.
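To illustrate the difference, here is a deliberately simplified Python model of a memory-based admission check. Tasks are started in order while their measured working sets fit within a RAM budget, and the rest wait. This is not BOINC's actual scheduler code, and the names Task and admit_by_memory are hypothetical. The budget of 23039.55 MB is the "max memory usage when idle" figure Crystal Pellet quotes for VM1. The task mix is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    working_set_mb: float  # memory measured for the task at the time of the check


def admit_by_memory(tasks, ram_budget_mb):
    """Simplified model: start tasks in order while they fit in the RAM budget."""
    running, waiting = [], []
    used_mb = 0.0
    for task in tasks:
        if used_mb + task.working_set_mb <= ram_budget_mb:
            running.append(task.name)
            used_mb += task.working_set_mb
        else:
            waiting.append(task.name)  # would show as "waiting for memory"
    return running, waiting, used_mb


# Hypothetical mix: 10 ARP1 at their observed peak plus 8 ECM in their high-memory phase.
tasks = [Task(f"ARP1 {i}", 1012) for i in range(1, 11)]
tasks += [Task(f"ECM {i}", 1800) for i in range(1, 9)]

running, waiting, used_mb = admit_by_memory(tasks, 23039.55)
print(f"running {len(running)}, waiting for memory: {waiting}, using {used_mb:.0f} MB")
```

A check like this only sees memory usage at the moment it is measured. An ECM task measured during its low-memory phase would be admitted at almost no cost and could still grow to its peak later.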
[Dec 10, 2019 7:57:48 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Re: ARP1-task failing . . .

hchc wrote: "@Crystal Pellet, how much RAM does that machine have, and how many CPU cores/threads?"
It's a Windows VM with 30 cores, used only for BOINC, running on an Opteron Linux server.

VM1: 10 Dec 09:40:47 max memory usage when idle: 23039.55 MB

ARP1 is limited to a maximum of 10 tasks; the other threads run MCM1.
I noticed that ARP1 tasks normally use about 700-750 MB of RAM,
but for very short periods their usage can grow to 1012 MB.

On that VM I was also running ECM (elliptic-curve factorization method) tasks, and those are a bit tricky in their memory usage.
For ~70% of the run they use almost no memory, and during the last part their RAM usage goes sky high:
depending on the type it can reach 1800 MB.
Therefore I run them staggered, so that only a few tasks need the higher amount of memory at the same time.
Somehow that must have been the reason for the lack of memory that caused the ARP1 failure.
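The effect of staggering can be estimated with a rough worst-case budget. This is a minimal Python sketch using the figures from this post. Several assumptions are hypothetical and not stated in the thread:

* 10 concurrent ECM tasks.
* Equal ECM run times, with each finished ECM replaced immediately.
* All ARP1 tasks at their short peak at once, which is deliberately pessimistic.
* MCM1 usage left as an unquantified extra parameter.

```python
# Figures quoted in this thread; the ECM task count and MCM1 usage are hypothetical.
RAM_BUDGET_MB = 23039.55  # "max memory usage when idle" reported for VM1
ARP1_PEAK_MB = 1012       # short ARP1 peak (normally about 700-750 MB)
ECM_PEAK_MB = 1800        # ECM high-memory phase, upper end for some types
ECM_HIGH_PERCENT = 30     # high phase is roughly the last 30% of an ECM run


def ecm_in_high_phase(n_ecm, staggered):
    """How many ECM tasks can be in their high-memory phase at the same moment."""
    if not staggered:
        return n_ecm  # worst case: all started together, so all peak together
    # Start times spread evenly over one run length: a window covering 30% of the
    # run contains at most ceil(0.30 * n_ecm) of them (integer ceiling division).
    return (ECM_HIGH_PERCENT * n_ecm + 99) // 100


def peak_demand_mb(n_arp1, n_ecm, staggered, other_mb=0.0):
    """Worst-case RAM demand; other_mb covers MCM1 etc., not quantified in the thread."""
    return n_arp1 * ARP1_PEAK_MB + ecm_in_high_phase(n_ecm, staggered) * ECM_PEAK_MB + other_mb


for staggered in (False, True):
    demand = peak_demand_mb(10, 10, staggered)
    verdict = "over" if demand > RAM_BUDGET_MB else "within"
    print(f"staggered={staggered}: {demand:.0f} MB, {verdict} the {RAM_BUDGET_MB} MB budget")
```

Without staggering, this hypothetical mix would need about 28120 MB. With staggering it needs about 15520 MB plus whatever MCM1 uses. The headroom depends entirely on the stagger holding.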
[Dec 10, 2019 9:03:23 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Re: ARP1-task failing . . .

The VM itself will utilise some capacity, but not much. Are you limiting ARP1 in device profiles or in app_config? Try limiting ARP1 to 10 in Device Profiles and to 5 in app_config, and if that cures the problem, increase the app_config limit slowly. That way you will hold a cache of 10 units and only run 5 at a time. Not very scientific, but at present rates we have a long way to go, so we can take our time finding the best combination.
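For reference, a BOINC app_config.xml that limits how many tasks of one application run concurrently uses the <app>, <name> and <max_concurrent> elements. The Python sketch below builds such a file with the standard library so the structure is visible. The application name "arp1" is an assumption, so check the exact name in the BOINC event log or client_state.xml. The limit of 5 is the suggested starting value from the post above. The helper build_app_config is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting how many tasks of one app run at once."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


# "arp1" is an assumed application name; 5 is the suggested starting limit.
print(build_app_config("arp1", 5))
```

To use it:

1. Save the output as app_config.xml in the World Community Grid project folder inside the BOINC data directory.
2. Use Options > Read config files in BOINC Manager, or restart the client.
3. To raise the limit later, change the max_concurrent value.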

The project will not suffer from your temporary lower throughput as there are lots of machines under-utilised.

Mike
----------------------------------------
[Edit 1 times, last edit by Mike.Gibson at Dec 10, 2019 1:31:41 PM]
[Dec 10, 2019 1:23:10 PM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Re: ARP1-task failing . . .

I've limited ARP1 to 10 in a device profile precisely because of their RAM hunger. I don't want to build up a buffer of ARP1s, because I want to return them as soon as possible to stay 'reliable' for ARP1 (returned within 2.5 days nowadays). ARP1 run times on that VM are 47 to 52 hours, so when I see that a new one has arrived, I push it into the running state by suspending an MCM1 task.
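The arithmetic behind not buffering is short. This minimal Python sketch computes how long a task could wait in the cache before starting and still be returned within 2.5 days (60 hours). It uses the 47-52 hour run times from this post. It ignores download and upload time and any pauses while running, which is an assumption. The name max_queue_wait_h is hypothetical.

```python
DEADLINE_TARGET_H = 2.5 * 24  # "return within 2.5 days" = 60 hours


def max_queue_wait_h(runtime_h, target_h=DEADLINE_TARGET_H):
    """Longest a task may sit in the cache before starting and still meet the target.

    Ignores download/upload time and any pauses while running (an assumption)."""
    return target_h - runtime_h


for runtime_h in (47, 52):  # ARP1 run times on the VM, from the post
    print(f"{runtime_h} h run time -> at most {max_queue_wait_h(runtime_h):.0f} h in the queue")
```

That leaves only 8 to 13 hours of slack. A queued ARP1 waiting for another 47-52 hour ARP1 to finish would miss the target, so starting each new one immediately is the consistent choice.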
[Dec 10, 2019 2:55:11 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Re: ARP1-task failing . . .

10 in Device Profiles and 5 in app_config would give you a cache of 2 per running task. If you are taking less than 24 hours per unit, they will all be returned within the 2.5 days. You can then slowly increase the 5 until the problem starts again. It should only take a few days to find the optimum.
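The cache arithmetic can be checked with a small Python sketch. It assumes all 10 tasks arrive at once and that each of the 5 running slots works through its 2 tasks back to back. Download and upload time are ignored. The function name worst_turnaround_h is hypothetical.

```python
import math


def worst_turnaround_h(cached, max_concurrent, runtime_h):
    """Turnaround of the last task when `cached` tasks arrive together and at most
    `max_concurrent` run at once: each running slot works through its share back to back."""
    tasks_per_slot = math.ceil(cached / max_concurrent)
    return tasks_per_slot * runtime_h


TARGET_H = 2.5 * 24  # 60 hours
for runtime_h in (24, 47, 52):
    t = worst_turnaround_h(10, 5, runtime_h)
    print(f"{runtime_h} h per task: last task back after {t} h "
          f"({'meets' if t <= TARGET_H else 'misses'} the 60 h target)")
```

With 24-hour tasks, the last one is back after 48 hours. With the 47 to 52 hour run times mentioned in the previous post, it would be 94 to 104 hours. In that case a 2-per-slot cache only meets the 2.5-day target if the limits are equal, so that nothing waits.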

Mike
[Dec 10, 2019 3:06:02 PM]