BOINC: Recovering from Failures

The BOINC interface works as follows:

  1. The PyMW worker Task gets packaged as a BOINC “work unit” (or job)
  2. These work units get sent off to volunteer compute nodes
  3. The results get “assimilated”, creating an output file for PyMW to consume
  4. Finally, the results get turned into completed PyMW Tasks

In this process, a work unit can fail for many reasons (execution problems, computer crashes, the user hitting “abort”, etc.), and BOINC will automatically reprocess the work unit. After a user-specified number of failures, BOINC gives up and flags the work unit with the error code “too many errors”. At this point, the work unit still gets assimilated by the PyMW assimilator, but instead of sending output files back to the PyMW app, the assimilator produces an error file specifying exactly why the work unit failed.

The PyMW BOINC interface will automatically turn these failure files into Python exceptions and attach them to the Task that failed. The interface then deletes the failure file so that future runs don’t get confused and think they failed as soon as they begin.
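The idea can be sketched roughly as follows. This is only an illustration, not PyMW's actual code: the `TaskFailedError` class, the `collect_failure` function, and the assumption that the error file is plain text are all hypothetical.

```python
import os


class TaskFailedError(Exception):
    """Hypothetical exception type; PyMW's real class may differ."""


def collect_failure(error_path):
    """Return an exception built from the error file, or None if there is none.

    Assumes (for illustration) that the error file is plain text whose
    contents describe why the work unit failed.
    """
    if not os.path.exists(error_path):
        return None
    with open(error_path) as f:
        reason = f.read().strip()
    # Delete the file so a later run does not mistake it for a fresh failure.
    os.remove(error_path)
    return TaskFailedError(reason)
```

The caller would attach the returned exception to the failed Task. A second call with the same path returns `None`, because the file is already gone.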

This exposes two rough patches in the BOINC interface that I hope to fix soon:

  1. All runs of an application produce files with the same names. BOINC doesn’t like this, and the PyMW BOINC interface gets confused by it as well (as noted above).
  2. If the task exception isn’t handled correctly in the PyMW application (your application running on top of PyMW), BOINC will be left in an unpredictable state. This will most likely result in many output files piling up in the task directory. These files will then be viewed as results the next time the application is run, which is wrong.

Fixing the first problem is fairly straightforward (give each file a unique name), but the second is harder. Automatically canceling all existing work units may be the wrong thing to do for some applications, so that isn’t really an option. On the other hand, leaving it up to the developer to handle exceptions and either wait for all work units to complete or cancel them isn’t ideal either.
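For the first problem, one possible approach is to generate a random identifier once per application run and include it in every file name. The sketch below assumes that scheme; the function names and the `pymw_..._task....out` naming pattern are made up for illustration and are not what PyMW uses today.

```python
import uuid


def new_run_id():
    """Return a random identifier, generated once per application run."""
    return uuid.uuid4().hex


def output_name(run_id, task_index):
    """Build an output file name that is unique to this run and task.

    The naming pattern is hypothetical.
    """
    return "pymw_%s_task%04d.out" % (run_id, task_index)
```

Because the run identifier changes on every run, leftover files from an earlier run no longer match the names the current run is looking for.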

The current plan is to provide some sort of sane default behavior that can be overridden or changed by client applications as needed.
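One shape this could take is a failure policy that the application passes in as a plain function. Everything in this sketch is hypothetical, including the names and the choice of “cancel everything” as the default; nothing here is decided yet.

```python
def cancel_outstanding(pending):
    """Assumed default policy: cancel every work unit that is still pending.

    Empties the pending list in place and returns what was cancelled.
    """
    cancelled = list(pending)
    del pending[:]
    return cancelled


def keep_outstanding(pending):
    """Alternative policy: leave pending work units alone."""
    return []


class FailureHandler(object):
    """Hypothetical hook that applies a policy when a Task fails."""

    def __init__(self, on_failure=cancel_outstanding):
        self.on_failure = on_failure

    def handle(self, pending):
        # Delegate to whichever policy the application selected.
        return self.on_failure(pending)
```

An application that wants its remaining work units to finish would construct `FailureHandler(on_failure=keep_outstanding)`; one that is happy with the default would pass nothing.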

Posted by Jeremy
