World Community Grid Forums
Thread Status: Active | Total posts in this thread: 17
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
I've been playing with the web API recently, and I realized it returns duplicated results.

In my case, even if I set the query limit to 1000, I get at most 250 results back. Say there are 730 results; I do:

  Fetch from offset=0: returned 250 results
  Fetch from offset=250: returned 250 results
    Keys already exist:
      OET1_0003699_x3MWPp_rig_61457_0
      ugm1_ugm1_25075_0758_1
      ugm1_ugm1_25075_0679_0
  Fetch from offset=500: returned 230 results
  Results returned from web: 727

When I combine them, the total is not 730 but 727 results. The three duplicated keys have exactly the same data in both queries. It's not hard for me to work around this, since Python's dict update handles it naturally anyway. I'm just wondering whether it's something the WCG team would care about, since it might be a bug somewhere. If this is not the right place to report API issues (or general bugs), please let me know and I can move it to the appropriate channel. Thanks.
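For readers following along, here is a minimal sketch of the paginated fetch with dict-based de-duplication described above. The real WCG endpoint, its query parameters, and the result field names are not shown; `fetch_page()` is a stand-in stub over fake data, so only the looping and merging logic is illustrated.

```python
PAGE_SIZE = 250  # observed server-side cap per request

# Fake backing data: 730 results keyed by name. A real fetch_page()
# would issue an HTTP request with limit/offset query parameters.
_FAKE_DB = [{"Name": f"wu_{i:04d}", "ValidateState": 1} for i in range(730)]

def fetch_page(offset, limit=PAGE_SIZE):
    """Stand-in for one API call; returns up to `limit` results."""
    return _FAKE_DB[offset:offset + limit]

def fetch_all():
    results = {}  # keyed by result name, so repeated keys merge silently
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:                # empty page => nothing left
            break
        for wu in page:
            results[wu["Name"]] = wu
        if len(page) < PAGE_SIZE:   # short page => last page
            break
        offset += len(page)
    return results

all_results = fetch_all()
print(len(all_results))  # 730 unique work units
```

Because the dict is keyed by result name, a work unit returned twice (as in the log above) simply overwrites its earlier copy instead of inflating the total.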
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
Since the Result Status data is 'live' [dynamic] and the fetch per call is restricted to 250, there is a good chance that a subsequent call will fetch a result again: in the meantime, any of the first 250 results, or a later result in the fetch order, may have changed status and moved to the top, shifting everything one or more places down. The faster your internet connection and the more tightly your consecutive queries run, the less chance this will occur.
----------------------------------------
[Edit 1 times, last edit by SekeRob* at Apr 24, 2016 8:19:24 PM]
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
> Since the Result Status data is 'live' [dynamic] and the fetch per call is restricted to 250, there is a good chance that a subsequent call will fetch a result again, as in the meantime any of the first 250, or a later result in the fetch order, may have moved to the top, shifting everything one or more places down. The faster your internet connection and the more tightly your consecutive queries run, the less chance this will occur.

I thought about this, because I do see the order changing sometimes. However, the overall fetched results should match the number of available results across multiple back-to-back runs, and each run only takes seconds. In my tests, the same WUs have consistently shown up twice. That's why I think the data is duplicated, rather than it being a timing issue on my side.

Honestly it's not really a concern for me, given that I can tolerate losing some stats. If it's just a timing issue, I would probably end up fetching all results anyway, since I run this periodically.

I wonder why the limit is capped at 250? It doesn't really help, since most people have to loop and query all results anyway. If the goal is to save server resources, it might be much more efficient to let people pass in a ModTime and only return WUs modified since that time.

[Edit 1 times, last edit by wujj123456 at Apr 24, 2016 8:51:25 PM]
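A minimal client-side sketch of the ModTime watermark idea suggested above. The web API does not expose such a filter, so this is purely illustrative; the field names and the epoch-seconds representation are assumptions.

```python
def newer_than(results, since):
    """Keep only results modified strictly after `since` (epoch seconds)."""
    return [r for r in results if r["ModTime"] > since]

# Hypothetical batch as returned by a full fetch.
batch = [
    {"Name": "wu_a", "ModTime": 100},
    {"Name": "wu_b", "ModTime": 250},
    {"Name": "wu_c", "ModTime": 300},
]

last_seen = 200                      # watermark from the previous run
fresh = newer_than(batch, last_seen)  # only wu_b and wu_c survive

# Advance the watermark for the next periodic run.
last_seen = max((r["ModTime"] for r in fresh), default=last_seen)
print([r["Name"] for r in fresh], last_seen)
```

If the server supported the same filter, the loop in the previous post would transfer only the handful of recently modified work units instead of the whole result set each time.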
Tullus
Cruncher | Joined: Nov 14, 2008 | Post Count: 29 | Status: Offline
Hi wujj123456. If you are playing with the web API from Python, you might be interested in parts of:
https://code.google.com/archive/p/py-boinc-plotter/
You will most likely want to look at parser.HTMLParser_worldcommunitygrid and task.Task_web_worldcommunitygrid. I haven't bothered to move the project away from google.code, so some of it might be a bit dated, but it should not be too bad.
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
> Hi wujj123456. If you are playing with the web API from Python, you might be interested in parts of: https://code.google.com/archive/p/py-boinc-plotter/ You will most likely want to look at parser.HTMLParser_worldcommunitygrid and task.Task_web_worldcommunitygrid. I haven't bothered to move the project away from google.code, so some of it might be a bit dated, but it should not be too bad.

Thanks. Well, I guess you missed Google's announcement a year ago: http://google-opensource.blogspot.com/2015/03/farewell-to-google-code.html

Google Code is gone, along with all the projects hosted on it. :-( From the method names, are you parsing HTML instead of using the API?
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
>> Since the Result Status data is 'live' [dynamic] and the fetch per call is restricted to 250, there is a good chance that a subsequent call will fetch a result again, as in the meantime any of the first 250, or a later result in the fetch order, may have moved to the top, shifting everything one or more places down.

> I thought about this, because I do see the order changing sometimes. However, the overall fetched results should match the number of available results across multiple back-to-back runs, and each run only takes seconds. In my tests, the same WUs have consistently shown up twice. That's why I think the data is duplicated, rather than it being a timing issue on my side. [...] I wonder why the limit is capped at 250? It doesn't really help, since most people have to loop and query all results anyway. It might be much more efficient to let people pass in a ModTime, if the goal is to save server resources.

On the emphasized point: no, because between the first and second fetch request one or another canonical result could already have been migrated off; those are the misses. To give an approximation, 12 results come on and another 12 come off per second [a million-plus per day], which on accounts that have thousands of entries on their Result Status pages leads to continuous reordering due to status changes and the [ModTime] of the momentary transaction.

Duplicates in the database: on a willy-nilly system that would be possible, yes. (The Result Status pages are a direct window into what's going on in the core BOINC task/result scheduling system.)

[Edit 1 times, last edit by SekeRob* at Apr 25, 2016 11:39:23 AM]
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
I tried a few more times, and it looks like the duplicates aren't always the same, even though the number of duplicates I get for the same number of total results is mostly the same. So it does look like a random timing issue, at least across longer periods. I guess the repeated offenders I got yesterday might just be a coincidence.

Looking more closely at the fields, the results don't seem to be ordered in any way. (I am only querying results with ValidateState=1.) However, on the website there is a way to order results. Do you happen to know if I can specify an order with the web API? It's an SQL query in the end, but not necessarily exposed through the web API, I suppose. All I need is some ordering that stays stable for a minute, so that I can deterministically get all results back; anything other than ModTime will probably work. (I assume getting all results in one shot is off-limits, given that a cap was implemented in the first place.)
Tullus
Cruncher | Joined: Nov 14, 2008 | Post Count: 29 | Status: Offline
> Thanks. Well, I guess you missed Google's announcement a year ago: http://google-opensource.blogspot.com/2015/03/farewell-to-google-code.html Google Code is gone, along with all the projects hosted on it. :-( From the method names, are you parsing HTML instead of using the API?

Yes, I know about the Google Code farewell; I was just too lazy to move. I was about to tell you that the 'downloads' still work, but found they are mostly empty .tar files, so I must have messed up. I've moved it to GitHub: https://github.com/obtitus/py-boinc-plotter/ I still need to move the wiki somehow.

I was originally parsing the HTML, hence the name :) I am still parsing XML for the badges and such.
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7846 | Status: Offline
> (I am only querying results with ValidateState=1.)

Even doing this, you will get duplicate units where everything is the same except for the mod time. I have stopped wondering why the mod time would change on a unit which has already been validated; it must make sense to the techs. Once I stick my query results into a spreadsheet, I can deal with the duplicates.

Cheers
Sgt. Joe
*Minnesota Crunchers*
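The spreadsheet de-duplication described above can also be done in a few lines of Python: treat rows as duplicates when every field except ModTime matches, and keep the most recently modified copy. The column names here are assumptions for illustration.

```python
# Hypothetical rows as they might land in a spreadsheet export:
# the first two are the same validated work unit, differing only
# in ModTime.
rows = [
    {"Name": "wu_a_0", "ValidateState": 1, "Credit": 120.5, "ModTime": 1000},
    {"Name": "wu_a_0", "ValidateState": 1, "Credit": 120.5, "ModTime": 1042},
    {"Name": "wu_b_1", "ValidateState": 1, "Credit": 98.7,  "ModTime": 1010},
]

key_cols = ("Name", "ValidateState", "Credit")  # everything except ModTime

# Keep only the latest copy per key.
latest = {}
for r in rows:
    key = tuple(r[c] for c in key_cols)
    if key not in latest or r["ModTime"] > latest[key]["ModTime"]:
        latest[key] = r

deduped = sorted(latest.values(), key=lambda r: r["Name"])
print(len(deduped))  # 2 rows survive
```

Keying on everything except ModTime is exactly what makes the otherwise-identical validated duplicates collapse into one row.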
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
> I have stopped wondering why the mod time would change on a unit which has already been validated. It must make sense to the techs.

This one I happen to have seen happening: ModTime can change after validation because the result files are deleted. That will eventually happen for all results, but yeah, from a user's point of view a result is no longer interesting once it has been validated.