World Community Grid Forums
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
Is this error caused by lack of memory after >9 hours of run time?
I see 'malloc' in the result log, which could mean memory allocation.

Result Name: ARP1_0034761_000_0--

    <core_client_version>7.14.2</core_client_version>
    <![CDATA[
    <message>
    The storage control block address is invalid.
     (0x9) - exit code 9 (0x9)
    </message>
    <stderr_txt>
    INFO: Initializing
    INFO: No state to restore. Start from the beginning.
    Starting WRFMain
    [10:14:56] INFO: Checkpoint taken at 2018-07-01_06:00:00
    rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 1 : Not enough space
    rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 2 : Not enough space
    rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 3 : Not enough space
    </stderr_txt>
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline
If we believe the error messages, Crystal Pellet, and there is no indication that we shouldn't believe them, then we see that the function rsl_malloc tried three times to allocate almost 25 MB of memory and failed in all three cases. The error message string can be found in the executable (on my Linux device it is an ELF 64-bit LSB executable):

    # strings wcgrid_arp1_wrf_7.27_x86_64-pc-linux-gnu | grep rsl_malloc

The obvious reason for the error messages comes from the operating system: "Not enough space". Since the storage area in the computer's memory can't be allocated, there is no valid address associated with it, which explains the other message: "The storage control block address is invalid".
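For illustration only, here is a minimal C sketch of the kind of retry-on-failure allocator the stderr output suggests rsl_malloc is. The function name, the three retries and the message format are taken from the log above; the real WRF/RSL_LITE source will differ in detail:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch: try an allocation a few times and report each
     * failure the way the ARP1 stderr output does, then give up. */
    static void *try_alloc(size_t nbytes, const char *file, int line)
    {
        for (int attempt = 1; attempt <= 3; attempt++) {
            void *p = malloc(nbytes);
            if (p != NULL)
                return p;
            fprintf(stderr,
                    "rsl_malloc failed allocating %zu bytes, called %s, line %d, try %d : %s\n",
                    nbytes, file, line, attempt, strerror(errno));
        }
        return NULL; /* the caller decides whether this is fatal */
    }

    int main(void)
    {
        /* 24911668 bytes is the request that failed in the reported task. */
        void *buf = try_alloc(24911668, "..\\external\\RSL_LITE\\rsl_bcast.c", 270);
        if (buf == NULL)
            return 9; /* mimic the task's exit code 9 */
        free(buf);
        return 0;
    }

On a machine with free RAM this simply succeeds; the interesting path is the one the VM hit, where the operating system refuses even a 25 MB request.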
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline
Looks like not enough swap space.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline
Try allocating more disc space in device profiles.
Mike
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline
@Crystal Pellet, how much RAM does that machine have, and how many CPU cores/threads?
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
    Try allocating more disc space in device profiles.
    Mike

BOINC is designed to pause a task ("waiting for memory") when it does not have enough memory available, i.e. this has nothing to do with the BOINC profiles.
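For reference, the limit behind that "waiting for memory" state comes from the BOINC computing preferences (memory usage), not from the disk settings. Locally it can be overridden in global_prefs_override.xml in the BOINC data directory; a rough sketch with example values only, using the standard preference names expressed as fractions of total RAM:

    <global_preferences>
       <!-- use at most 90% of RAM while the computer is in use ... -->
       <ram_max_used_busy_frac>0.90</ram_max_used_busy_frac>
       <!-- ... and at most 95% while it is idle -->
       <ram_max_used_idle_frac>0.95</ram_max_used_idle_frac>
    </global_preferences>

When the combined working sets of the running tasks exceed that limit, the client suspends one or more of them and shows "waiting for memory" until enough RAM is free again.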
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
    @Crystal Pellet, how much RAM does that machine have, and how many CPU cores/threads?

It's a Windows VM with 30 cores, only used for BOINC, on an Opteron Linux server.

    VM1: 10 Dec 09:40:47 max memory usage when idle: 23039.55 MB

The max for ARP1 is set to 10 tasks, and the rest is MCM1. I noticed that the ARP1's normally use about 700-750 MB of RAM, but during very short periods the usage can grow to 1012 MB. On that VM I also had ECM's running (elliptic-curve factorization method), and those tasks are a bit tricky with their memory usage: during ~70% of the run they use almost no memory, and during the last part the RAM goes sky high, up to 1800 MB depending on the type. Therefore I run them staggered, so that only a few tasks need the higher amount of memory at the same time. That must have been the reason for the lack of memory that somehow caused the ARP failure.
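A rough back-of-the-envelope using only the figures above (and assuming, purely for illustration, that five ECM tasks hit their peak at the same moment, with the ~23 GB figure shown as what the VM has to work with):

    10 ARP1 tasks x 1012 MB peak  ~ 10.1 GB
     5 ECM tasks  x 1800 MB peak  ~  9.0 GB
                                  ---------
                                  ~ 19.1 GB

Add the remaining MCM1 tasks and the OS itself, and the ~23 GB is quickly within reach, at which point even a 25 MB request from an ARP1 task can fail.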
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline
A VM will utilise some capacity, but not too much. Are you limiting arp1 in device profiles or app_config? Try limiting arp1 to 10 in Device Profiles and to 5 in app_config, and if that cures the problem, slowly increase. That way you will hold a cache of 10 units and only run 5 at a time. Not very scientific, but we have a long way to go at present rates, so we can take our time finding the best combinations.

The project will not suffer from your temporary lower throughput as there are lots of machines under-utilised.

Mike
[Edit 1 times, last edit by Mike.Gibson at Dec 10, 2019 1:31:41 PM]
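For anyone wanting to try this, the per-application concurrency cap goes in an app_config.xml file inside the project's directory under the BOINC data directory. A sketch along the lines Mike describes, assuming the application name is arp1 as used in this thread:

    <app_config>
       <app>
          <name>arp1</name>
          <!-- run at most 5 ARP1 tasks at the same time -->
          <max_concurrent>5</max_concurrent>
       </app>
    </app_config>

After editing the file, use the BOINC Manager's "read config files" option (or restart the client) for the limit to take effect; the device-profile limit of 10 then only caps how many ARP1 tasks are held on the machine.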
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
I've limited the ARP1 to 10 in a device profile just because of their RAM-hungriness. I don't want to create a buffer of ARP1's, because I want them returned as soon as possible to stay 'reliable' for ARP1 (return within 2.5 days nowadays). The ARP1 runtimes on that VM are 47 to 52 hours, so when I see a new one has arrived, I push it to the running state by suspending an MCM1.
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline
10 in Device profiles and 5 in app_config would give you a cache of 2 per thread. If you are taking less than 24 hours per unit then they will all be returned within the 2.5 days. You can then slowly increase the 5 until you start getting the problem again. Should only take a few days to find the optimum.
Mike