Author | Message
All of my hosts are now receiving 50 WUs rather than a number appropriate to the amount of time requested. Not particularly a problem, as I think I'll be able to return them on time, but it's not right :)
Cheers,
Al.
____________
Same here, except that I will never finish them on time and am forced to abort 40 or so at a time. Apart from hating to abort any task, I cannot remember a simple little trick I once knew for aborting multiple tasks at once, so I must do it one at a time. Perhaps the trick no longer works on BOINC v6.12.34. I was somehow able to drag the mouse to highlight multiple tasks. One of the F keys, perhaps? A better solution would be more accurate time-to-completion estimates, as they seem to be underestimated by a factor of 20 or so on my boxes.
____________
Hold down Shift and move the cursor to highlight tasks in BM. This does not work in BT. In BT you can use the Ctrl key and click the top and the bottom of the task rows to select a group to terminate.
Thanks for the reminder, BB!
The rsc_fpops_est value in the workunits is badly wrong in this project, something you can see on your host pages (the Duration Correction Factor should be as close to 1 as possible; here it is somewhere near 20).
I guess the server-side scheduler uses rsc_fpops_est to decide how many tasks cover the requested amount of work, and does not apply the correction factor when calculating the assigned time.
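If that guess is right, the arithmetic would look something like this (a rough sketch of my own, not the real BOINC client or scheduler code; the variable names and values are only illustrative):
[code]
// Minimal sketch of the suspected mismatch (illustrative values, not real BOINC code).
#include <cmath>
#include <cstdio>

int main() {
    double raw_est   = 760.0;     // per-task estimate derived from rsc_fpops_est (sec)
    double dcf       = 20.0;      // host's Duration Correction Factor
    double requested = 14400.0;   // seconds of work the client asks for (say, a 4-hour buffer)

    double client_est = raw_est * dcf;                    // what the client thinks a task takes
    double needed     = requested / client_est;           // tasks the client actually needs (~0.9)
    double sent       = std::ceil(requested / raw_est);   // tasks a DCF-blind scheduler sends (~19)

    std::printf("client needs %.1f tasks, scheduler sends %.0f\n", needed, sent);
    return 0;
}
[/code]
The over-assignment factor works out to roughly the DCF itself, which would explain estimates being off by a factor of 20 or so.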
This seems to have settled down now on my hosts, getting a sensible number of tasks again :) Only had to abort about 10 in total, so not too bad.
Al.
This seems to have settled down now on my hosts, getting a sensible number of tasks again :) Only had to abort about 10 in total, so not too bad.
I spoke too soon ;)
For some reason, on all but one host, I always get 10 WUs per core, regardless of the amount of work requested. I can easily crunch them in time, but it means my caches are filled to way beyond my normal cache time, which stops work fetch for other projects.
Maybe I just need to leave them to settle down again, but I don't know what changed to cause this to happen :(
Cheers,
Al.
The problem is still around; it dumped 40 tasks on my machine when I had space allocated for a mere 2 tasks.
Recently I've made some observations that I'm not so glad to present (noise from other attached projects has been deleted from the logs).
20-May-2013 05:33:39 [---] [wfd]: work fetch start
20-May-2013 05:33:39 [primaboinca] chosen: minor shortfall CPU: 0.00 inst, 1393.80 sec
20-May-2013 05:33:39 [---] [wfd] ------- start work fetch state -------
20-May-2013 05:33:39 [---] [wfd] target work buffer: 28797.12 + 14402.88 sec
20-May-2013 05:33:39 [---] [wfd] CPU: shortfall 1393.80 nidle 0.00 saturated 41806.20 busy 0.00 RS fetchable 100.00 runnable 300.00
20-May-2013 05:33:39 [ABC@home] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 0.00 int 61440.00 (comm deferred)
20-May-2013 05:33:39 [SZTAKI Desktop Grid] [wfd] CPU: fetch share 0.00 LTD -18454.46 backoff dt 0.00 int 0.00 (comm deferred)
20-May-2013 05:33:39 [PrimeGrid] [wfd] CPU: fetch share 0.00 LTD -225.14 backoff dt 0.00 int 0.00 (comm deferred)
20-May-2013 05:33:39 [primaboinca] [wfd] CPU: fetch share 1.00 LTD -59228.18 backoff dt 0.00 int 0.00
20-May-2013 05:33:39 [ABC@home] [wfd] overall LTD 0.00
20-May-2013 05:33:39 [SZTAKI Desktop Grid] [wfd] overall LTD -28786.98
20-May-2013 05:33:39 [PrimeGrid] [wfd] overall LTD -6654.77
20-May-2013 05:33:39 [primaboinca] [wfd] overall LTD -75407.47
20-May-2013 05:33:39 [---] [wfd] ------- end work fetch state -------
20-May-2013 05:33:39 [primaboinca] [wfd] request: 1393.80 sec CPU (1393.80 sec, 0.00)
20-May-2013 05:33:39 [primaboinca] Sending scheduler request: To fetch work.
20-May-2013 05:33:39 [primaboinca] Reporting 46 completed tasks, requesting new tasks
20-May-2013 05:33:53 [primaboinca] Scheduler request completed: got 2 new tasks
At the time of observation, the real estimated run time is ~7000 sec. As you can see, the client asks for 1393.80 sec, which would be 0.199 WU. Instead it gets 2.
Another one, just a couple hours later.
20-May-2013 09:33:19 [---] [wfd]: work fetch start
20-May-2013 09:33:19 [primaboinca] chosen: major shortfall CPU: 0.00 inst, 83650.42 sec
20-May-2013 09:33:19 [---] [wfd] ------- start work fetch state -------
20-May-2013 09:33:19 [---] [wfd] target work buffer: 28797.12 + 14402.88 sec
20-May-2013 09:33:19 [---] [wfd] CPU: shortfall 83650.42 nidle 0.00 saturated 27744.42 busy 0.00 RS fetchable 100.00 runnable 300.00
20-May-2013 09:33:19 [ABC@home] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 46862.72 int 86400.00
20-May-2013 09:33:19 [SZTAKI Desktop Grid] [wfd] CPU: fetch share 0.00 LTD -24033.97 backoff dt 0.00 int 0.00 (comm deferred)
20-May-2013 09:33:19 [PrimeGrid] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 0.00 int 0.00 (comm deferred)
20-May-2013 09:33:19 [primaboinca] [wfd] CPU: fetch share 1.00 LTD -65457.82 backoff dt 0.00 int 0.00
20-May-2013 09:33:19 [ABC@home] [wfd] overall LTD 0.00
20-May-2013 09:33:19 [SZTAKI Desktop Grid] [wfd] overall LTD -35727.97
20-May-2013 09:33:19 [PrimeGrid] [wfd] overall LTD -4461.58
20-May-2013 09:33:19 [primaboinca] [wfd] overall LTD -74931.11
20-May-2013 09:33:19 [---] [wfd] ------- end work fetch state -------
20-May-2013 09:33:19 [primaboinca] [wfd] request: 83650.42 sec CPU (83650.42 sec, 0.00)
20-May-2013 09:33:19 [primaboinca] Sending scheduler request: To fetch work.
20-May-2013 09:33:19 [primaboinca] Reporting 9 completed tasks, requesting new tasks
20-May-2013 09:33:30 [primaboinca] Scheduler request completed: got 50 new tasks
The client requests 83650.42 sec, which would be 11.95 WU. Instead it gets 50.
I have a theory about what happens. Regulars should remember that from the server's point of view, the estimated run time is ~760 sec. Now, 1393.80 divided by 760 is 1.833 WU, which is pretty close to the 2 WU from the first example. 83650.42 sec divided by 760 is a whopping 110 WU. Why does it get 50 instead? We all know: because those are the 50 WU that are always ready.
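To put numbers on the theory (a back-of-envelope check only; the ~760 sec server-side estimate and the 50 always-ready tasks are the assumptions from above, not anything read out of the project's configuration):
[code]
// Back-of-envelope check of the theory, using the numbers quoted in the two logs above.
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    const double server_est = 760.0;    // assumed per-task estimate on the server side (sec)
    const double real_est   = 7000.0;   // real run time observed on this host (sec)
    const double ready_cap  = 50.0;     // assumed number of tasks kept ready to send

    const double requests[] = { 1393.80, 83650.42 };   // the two work requests logged above
    for (double req : requests) {
        double expected = req / real_est;                                    // what the client needs
        double sent     = std::min(std::ceil(req / server_est), ready_cap);  // what it presumably gets
        std::printf("request %8.2f sec: client needs %6.3f WU, server sends %2.0f WU\n",
                    req, expected, sent);
    }
    return 0;
}
[/code]
Rounding the ratio up and capping it at the ready queue reproduces both results above (2 and 50 tasks), which is why the theory looks plausible to me.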
Now, what if some newcomer, after a couple of hours, finds that he has ten times too much workload and desperately tries to get back to a reasonable amount by deleting everything (or aborting, which doesn't matter for the purpose of this theory)? In no time the server will get loads of resends (just like right now, at the time of posting: ~2.5 kWU). Then there's a hard-coded limit:
[I didn't know we had message limits; the forum cut off the pathetic part of last week's post. Luckily, I had network problems last time.]
23-May-2013 01:33:33 [---] [wfd]: work fetch start
23-May-2013 01:33:33 [primaboinca] chosen: major shortfall CPU: 0.00 inst, 103518.66 sec
23-May-2013 01:33:33 [---] [wfd] ------- start work fetch state -------
23-May-2013 01:33:33 [---] [wfd] target work buffer: 28797.12 + 14402.88 sec
23-May-2013 01:33:33 [---] [wfd] CPU: shortfall 103518.66 nidle 0.00 saturated 28430.55 busy 0.00 RS fetchable 100.00 runnable 300.00
23-May-2013 01:33:33 [ABC@home] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 14122.52 int 86400.00
23-May-2013 01:33:33 [SZTAKI Desktop Grid] [wfd] CPU: fetch share 0.00 LTD -52535.69 backoff dt 0.00 int 0.00 (comm deferred)
23-May-2013 01:33:33 [PrimeGrid] [wfd] CPU: fetch share 0.00 LTD 0.00 backoff dt 0.00 int 0.00 (comm deferred)
23-May-2013 01:33:33 [primaboinca] [wfd] CPU: fetch share 1.00 LTD -151432.65 backoff dt 0.00 int 0.00 (overworked)
23-May-2013 01:33:33 [ABC@home] [wfd] overall LTD 0.00
23-May-2013 01:33:33 [SZTAKI Desktop Grid] [wfd] overall LTD -61975.66
23-May-2013 01:33:33 [PrimeGrid] [wfd] overall LTD -7903.51
23-May-2013 01:33:33 [primaboinca] [wfd] overall LTD -162476.08
23-May-2013 01:33:33 [---] [wfd] ------- end work fetch state -------
23-May-2013 01:33:33 [primaboinca] [wfd] request: 103518.66 sec CPU (103518.66 sec, 0.00)
23-May-2013 01:33:33 [primaboinca] Sending scheduler request: To fetch work.
23-May-2013 01:33:33 [primaboinca] Reporting 15 completed tasks, requesting new tasks
23-May-2013 01:33:52 [primaboinca] Scheduler request completed: got 80 new tasks
23-May-2013 01:33:52 [---] [wfd] Request work fetch: RPC complete
What can we do about it? We can either get out of this building or, as they say, "we can be patient". For a long time I've been bothered by the client stuffing in a workload of about a day per core. Now I understand what happens: after a couple of tries (about a week) the client gives up on stabilizing the running workload and settles into a daily rhythm (as you can see, I have my buffer set to 8+4 hours).
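The numbers in the logs line up with that reading, by the way (rough arithmetic only; the 4-core figure is just an assumption of mine to tie it to the "day per core" point):
[code]
// Rough arithmetic on the figures above (the core count is an assumption for illustration).
#include <cstdio>

int main() {
    double min_buf = 28797.12, extra_buf = 14402.88;   // target work buffer from the logs (sec)
    std::printf("target buffer: %.1f + %.1f hours\n", min_buf / 3600.0, extra_buf / 3600.0);

    double real_est = 7000.0;   // real per-task run time (sec)
    double tasks    = 50.0;     // one dump from the scheduler
    double cores    = 4.0;      // assumed core count, for illustration only
    double hours    = tasks * real_est / 3600.0;
    std::printf("a 50-task dump is %.0f hours of work, ~%.0f hours per core\n",
                hours, hours / cores);
    return 0;
}
[/code]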
I have only one question. Fabio, tell me: is it possible to do something about the user of the day? That thing hangs on the front page for a darn month! If you don't care, just shut it down; what's the big deal?
____________
I'm counting for science.
Points just make me sick.
Now, what if some newcomer, after a couple of hours, finds that he has ten times too much workload and desperately tries to get back to a reasonable amount by deleting everything (or aborting, which doesn't matter for the purpose of this theory)?
Hi.
Just joined this project for the first time in order to test something and got ...
161 CPU tasks (!), all with virtually the same completion deadline one week from now. (That's on a part-time P4 with 25% devoted to crunching.)
Bonkers.
Luckily I know how to manage the situation, as I have BOINC'd for years :-D