World Community Grid Forums
Thread Status: Active | Total posts in this thread: 24
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Ah, that's great! Btw, I realized this morning that I should have been checking for database connectivity before I execute the API call and the load. I had MySQL down and got an error when I ran it this morning. So I added a test function for MySQL connectivity.
So my next steps are to add error traps in all functions, as well as a case statement to allow the various functions that could operate standalone to be specified as arguments at run time. I also incorporated the content of the README into the script, function by function, to help clarify each one's purpose. Best, -mark

[Edit 2 times, last edit by Former Member at Apr 17, 2019 1:30:02 AM]
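The two ideas in the post above (a database connectivity test that runs before the API call and load, and a case statement so individual functions can be invoked standalone) might look roughly like this in bash. This is only a sketch under assumed names and connection settings, not the actual wcg_bash code:

```bash
#!/usr/bin/env bash
# Hypothetical sketch only - not the actual wcg_bash code. The connection
# settings, function names, and argument names are all assumptions.
# Credentials are expected to come from ~/.my.cnf rather than the command line.

DB_HOST="localhost"
DB_USER="wcg"
DB_NAME="wcg_stats"

# Placeholder stand-ins for the real download/load functions.
download_workunits() { echo "(placeholder) would call the WCG API here"; }
load_workunits()     { echo "(placeholder) would load the CSV into MySQL here"; }

# Verify the MySQL server is reachable before calling the API or loading data.
check_mysql() {
    if ! mysql --host="$DB_HOST" --user="$DB_USER" "$DB_NAME" \
               --execute="SELECT 1;" >/dev/null 2>&1; then
        echo "ERROR: cannot connect to MySQL on ${DB_HOST} - aborting." >&2
        exit 1
    fi
}

# Case statement so individual functions can be run standalone at run time.
case "${1:-all}" in
    check)    check_mysql ;;
    download) check_mysql && download_workunits ;;
    load)     check_mysql && load_workunits ;;
    all)      check_mysql && download_workunits && load_workunits ;;
    *)        echo "Usage: $0 {check|download|load|all}" >&2; exit 2 ;;
esac
```

Defaulting the argument to "all" keeps the normal scheduled run unchanged while still letting each step be exercised on its own, which is the usual payoff of this kind of dispatcher.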
twilyth
Master Cruncher | US | Joined: Mar 30, 2007 | Post Count: 2130 | Status: Offline
Quoting an earlier post in the thread:

"When I saw your post at the end of March I too was starting almost the same project! To that end, I've got a small script together that uses the WCG API to download current workunits, reformats that output to CSV, and then loads it into a MySQL database. The code can be found on GitHub here: https://github.com/msellan/wcg_bash
The script contains a function to create a table with appropriate schema definitions. So far it's been running for a week without issue. I'm happy to answer questions and am interested in feedback or suggestions from anyone."

I looked at the new overview and code and lifted your table definition parms. I'm going to try the MySQL Python connector, and I'm curious about how you update your table. Do you save all of the data for every work unit from dispatch to validation and credit, or do you overwrite obsolete information? For example, when a status field is updated, do you save that in a new record or overlay the old information? I'm wondering if there might be some useful information to be gleaned from following a WU over its life, assuming one is able to catch every set of updates.

Quoting an earlier post in the thread:

"Sure, when I first downloaded sample workunits from the API I thought there were 18 fields. But later, when I downloaded a full set of workunits, I noticed that some records had an extra field, 'ReceivedTime', which is not sent for a workunit until after the work has completed and it has been 'received' by the server. In the JSON the field doesn't show up for a record until it has a timestamp, so a complete record really has 19 fields. I decided to insert the 'ReceivedTime' field as a placeholder even for records that don't have it assigned yet. I chose to fill it with the Unix Epoch date rather than NULL, as it stands out and allows me to query all workunits that haven't been received, knowing that later UPDATE statements will overwrite the Epoch date once the workunit is processed."

This probably saved me at least a few hours of aggravation. Thank you.

[Edit 1 times, last edit by twilyth at Apr 24, 2019 12:12:59 PM]
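The "Unix Epoch as a placeholder" idea quoted above can be pictured with a short sketch. The table and column names here are invented for illustration and are not the real wcg_bash schema; the point is just that an epoch-dated ReceivedTime makes un-received workunits trivial to query:

```bash
#!/usr/bin/env bash
# Illustrative sketch of the epoch-placeholder idea. Table and column names
# are assumptions, not the actual wcg_bash schema; credentials come from ~/.my.cnf.

mysql --user="wcg" wcg_stats <<'SQL'
CREATE TABLE IF NOT EXISTS workunits (
    workunit_id   BIGINT,
    result_name   VARCHAR(255) NOT NULL,
    app_name      VARCHAR(255),
    device_name   VARCHAR(255),
    status        VARCHAR(64),
    sent_time     DATETIME,
    -- The 19th field: absent from the JSON until the result is returned,
    -- so it is stored as the Unix Epoch to flag "not yet received".
    received_time DATETIME NOT NULL DEFAULT '1970-01-01 00:00:00',
    PRIMARY KEY (result_name)
);

-- All workunits that have not been received yet stand out immediately:
SELECT result_name, sent_time
FROM   workunits
WHERE  received_time = '1970-01-01 00:00:00';
SQL
```

Using a visible sentinel instead of NULL also means later UPDATEs simply overwrite the placeholder once the server reports a real ReceivedTime.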
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
That's really a great question. The hardest part of this little project for me was puzzling over what I wanted to do with the data once I started downloading it!
In the end, I decided to simply do UPDATEs on the records rather than INSERTs for every change to each work unit. I realized after a while that my main goal was to be able to query the outcomes of each work unit (how many succeeded, how long they took to process, which projects fail more often than others, which of my PCs process more units, and so on), so I overwrite the obsolete data, as you mention.

I too wonder if there's interesting or useful information in tracking a workunit across its lifespan. I don't think it would be hard to catch all the updates. I just reworked the most inefficient part of the script, and it now executes in about 10 seconds, down from 3-4 minutes before. So it seems conceivable that it could run every 10 minutes or so; I'm not sure what the minimum frequency would need to be to ensure every change is caught. But since it only makes one call to the API, even at a 10 minute frequency the WCG server(s) shouldn't be hammered too hard.

I'm looking forward to seeing what you create. I've been a unix admin/shell programmer for most of my career, which is why I wrote it in bash, but originally I was thinking about writing it in Python as a Python learning exercise. So seeing a Python solution will be helpful. Best, -mark
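One way to picture the "overwrite rather than accumulate" choice and the 10-minute schedule is the hedged sketch below. The post doesn't say how the UPDATEs are actually generated; this uses MySQL's upsert syntax against the hypothetical workunits table sketched earlier, with made-up example values and an assumed script path in the crontab line:

```bash
#!/usr/bin/env bash
# Sketch only: one way to keep a single, current row per result.
# The real script may issue plain UPDATE statements instead; the result name,
# values, and paths below are invented for illustration.

mysql --user="wcg" wcg_stats <<'SQL'
-- Insert the workunit if it is new, otherwise overwrite the obsolete fields.
INSERT INTO workunits (result_name, app_name, status, sent_time, received_time)
VALUES ('EXAMPLE_WU_0001_0', 'ExampleProject', 'In Progress',
        '2019-04-20 10:15:00', '1970-01-01 00:00:00')
ON DUPLICATE KEY UPDATE
    status        = VALUES(status),
    received_time = VALUES(received_time);
SQL

# A 10-minute polling schedule could be a single crontab entry, for example:
#   */10 * * * * /path/to/wcg_bash/load_script.sh >> /var/log/wcg_load.log 2>&1
```

Because each poll is one API call followed by keyed overwrites, the table size stays bounded by the number of workunits rather than growing with every status change.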
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
@twilyth - I realized the other day that although I'm overwriting obsolete data in the MySQL table, I'm also archiving the original API data downloads and the script-generated SQL load scripts (which contain all the data, just reformatted) into a folder on the filesystem.
So effectively, I could reload all of it into another table, doing INSERTs rather than UPDATEs for all the stages of the work units, to see what that might look like as a fun comparison. If I get around to that experiment sometime soon, I'll post back the results. Best, -mark
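If that experiment ever happens, the replay might look something like the sketch below: create a second table with no unique key and INSERT every archived snapshot, so each stage of a workunit is preserved as its own row. The archive layout (per-run CSV files) and all names are assumptions about how the data might be stored, not a description of the actual script:

```bash
#!/usr/bin/env bash
# Sketch under assumed names: replay archived per-run CSV snapshots into a
# history table so every stage of a workunit is kept as its own row.
# Requires local_infile to be enabled on the MySQL server and client.

ARCHIVE_DIR="$HOME/wcg_archive"   # assumed location of the archived downloads

# Same columns as the (hypothetical) workunits table, but without a primary
# key so repeated states of the same workunit never collide.
mysql --user="wcg" wcg_stats --execute="CREATE TABLE IF NOT EXISTS workunits_history LIKE workunits;"
mysql --user="wcg" wcg_stats --execute="ALTER TABLE workunits_history DROP PRIMARY KEY;" 2>/dev/null || true

# Load the snapshots in filename order (chronological if names are timestamped),
# INSERT-only, never UPDATE.
for csv in "$ARCHIVE_DIR"/*.csv; do
    mysql --local-infile=1 --user="wcg" wcg_stats --execute="
        LOAD DATA LOCAL INFILE '$csv'
        INTO TABLE workunits_history
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\n';"
done
```

The history table could then be queried for per-workunit timelines (time from dispatch to receipt, number of status changes, and so on) without touching the live overwrite-style table.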