BOINC: Data Bundles

Saturday, July 4th, 2009

I’ve added support for bundled data files to the BOINC interface, but adding this new feature has exposed a new issue in the interface. I’ve known for a while that BOINC holds an odd assumption of immutable files (any file ever seen by BOINC is expected to *never* change its contents), but when running a PyMW application, the executable (for example, “monte_pi”) is reused over and over despite any changes that may have occurred in the code.

This hasn’t been an issue up until now, mainly because I was running the PyMW example applications and not modifying them between executions. However, with the introduction of PyMW data bundles, this problem has become painfully obvious. Since data bundles are given a temporary file name and this file name is dynamically embedded into the body of the executable, the executable is now changing its contents on every run.

Because the file contents change while the file name stays the same, BOINC keeps only one copy of the file (it treats a file name as an immutable version). The end result is that when a work unit executes on a worker machine, it tries to open the first data bundle file name that was ever created, because the first copy of the executable was cached and never updated.

To fix this, I’ve added some code into the BOINC interface that deletes all work unit related files from the BOINC “download” directory on every execution. This has fixed the problem for now, but the interface should rename all files to a unique name before execution. This is one of my goals for the next iteration of the BOINC interface.
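One way the renaming could work is to derive each file name from the file’s contents, so a changed executable automatically gets a new name. This is only a rough sketch and not the code in the PyMW BOINC interface: the function name `unique_copy` and the naming scheme are hypothetical, made up for illustration.

```python
import hashlib
import os
import shutil


def unique_copy(src_path, dest_dir):
    """Copy src_path into dest_dir under a name derived from its contents.

    Hypothetical helper: the name pattern <base>_<hash><ext> is an
    assumption, not something BOINC or PyMW defines.
    """
    with open(src_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    base, ext = os.path.splitext(os.path.basename(src_path))
    dest = os.path.join(dest_dir, "%s_%s%s" % (base, digest, ext))
    # Identical contents map to the same name, so an existing copy is reused.
    if not os.path.exists(dest):
        shutil.copyfile(src_path, dest)
    return dest
```

With a scheme like this, two runs with different embedded data bundle names would produce two differently named executables, so BOINC’s immutable-file assumption would hold without clearing the “download” directory.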

BOINC: Recovering from Failures

Wednesday, July 1st, 2009

The BOINC interface works as follows:

  1. The PyMW worker Task gets packaged as a BOINC “work unit” (or job)
  2. These work units get sent off to volunteer compute nodes
  3. The results get “assimilated” creating an output file for PyMW to consume
  4. Finally, the results get turned into completed PyMW Tasks

In this process, a work unit can fail for many reasons (execution problems, computer crashes, the user hitting “abort”, etc.) and BOINC will automatically reprocess the work unit. After a user-specified number of failures, BOINC will give up and flag the work unit with an error code of “too many errors”. At this point, the work unit gets assimilated by the PyMW assimilator, but instead of sending output files back to the PyMW app, an error file is produced specifying exactly why the work unit failed.

The PyMW BOINC interface will automatically turn these failure files into Python exceptions and attach them to the Task that failed. The interface then deletes the failure file so that future runs don’t get confused and think they failed as soon as they begin.
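The idea can be sketched roughly as follows. This is not the actual interface code: the `.error` suffix, the `WorkUnitError` class, and the `collect_failure` function are all hypothetical names chosen for illustration.

```python
import os


class WorkUnitError(Exception):
    """Hypothetical exception type for a failed BOINC work unit."""


def collect_failure(task_dir, task_name, suffix=".error"):
    """Return an exception built from a task's failure file, or None.

    Assumes (for illustration only) that the assimilator writes the
    failure reason to <task_dir>/<task_name><suffix>.
    """
    path = os.path.join(task_dir, task_name + suffix)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        reason = f.read().strip()
    # Delete the failure file so a later run does not mistake it
    # for a fresh failure.
    os.remove(path)
    return WorkUnitError("%s failed: %s" % (task_name, reason))
```

The caller would attach the returned exception to the matching Task rather than raising it immediately, so the application can decide how to handle it.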

This exposes two rough patches in the BOINC interface that I hope to fix soon:

  1. All runs of an application produce files with the same names. BOINC doesn’t like this and the PyMW BOINC interface gets confused by it as well (as noted above).
  2. If the task exception isn’t handled correctly in the PyMW application (your application running on top of PyMW), BOINC will be left in an unpredictable state. This will most likely result in many output files piling up in the task directory. These files will then be viewed as results the next time the application is run, which is wrong.

Fixing the first problem is fairly straightforward (give each file a unique name), but the second is harder. Automatically canceling all existing work units may be the wrong thing to do for some applications, so that isn’t really an option. On the other hand, leaving it up to the developer to handle exceptions and either wait for all work units to complete or cancel them isn’t ideal either.

The current plan is to provide some sort of sane default behavior that can be overridden or changed by client applications as needed.