PyMPI vs. mpi4py

Tuesday, June 29th, 2010

In PyMW 0.4 we decided to switch the PyMW MPI interface from pympi to mpi4py. pympi has not been updated in six years, whereas mpi4py is part of the SciPy suite and very well supported. The change also lets us move the MPI manager function out of its separate file, so the MPI interface is now entirely self-contained. There may still be problems with external data when using MPI without a shared filesystem, though these will be fixed in the next release. The interface now also checks for mpi4py correctly and quits gracefully if it is not available.
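As a rough illustration of the availability check, here is a minimal sketch of how an optional dependency can be tested for at startup. The helper name require_module and the wording of the message are made up for this post; this is not the actual PyMW code.

```python
import importlib
import sys


def require_module(name):
    """Import and return the named module, or exit with a readable message.

    Hypothetical helper for illustration only, not the PyMW implementation.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        # sys.exit with a string prints it to stderr and exits with status 1,
        # which is friendlier than an ImportError traceback.
        sys.exit("%s is required by the MPI interface but is not installed" % name)
```

An interface could then call something like MPI = require_module("mpi4py.MPI") before doing any real work, so a missing install is reported once, up front.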

Ganga Interface

Monday, June 14th, 2010

Big thanks to Wayne San for submitting the Ganga interface!

Ganga is an easy-to-use frontend for job definition and management, implemented in Python. It has been developed to meet the needs of the ATLAS and LHCb experiments for a Grid user interface, and includes built-in support for configuring and running applications based on the Gaudi / Athena framework common to the two experiments. Ganga allows trivial switching between testing on a local batch system and large-scale processing on Grid resources.

Wayne’s Ganga interface will be included in the 0.4 release of PyMW.

BOINC: Data Bundles

Saturday, July 4th, 2009

I’ve added support to the BOINC interface for bundled data files, but adding this new feature has exposed a new issue in the BOINC interface. I already knew that BOINC holds an odd assumption of immutable files: any file ever seen by BOINC is expected to *never* change its contents for all time. When running a PyMW application, however, the executable (for example, “monte_pi”) is reused over and over under the same name despite any changes that may have been made to the code.

This hasn’t been an issue up until now, mainly because I was running the PyMW example applications and not modifying them between executions. However, with the introduction of PyMW data bundles, this problem has become painfully obvious. Since data bundles are given a temporary file name and this file name is dynamically embedded into the body of the executable, the executable is now changing its contents on every run.

The fact that the file is changing and the file name remains the same means that BOINC keeps only one copy of the file (because of file name immutability/versioning). The end result is that when a work unit executes on a worker machine, it tries to open the first data bundle file name that was ever created because that first file was cached and never updated.

To fix this, I’ve added some code into the BOINC interface that deletes all work unit related files from the BOINC “download” directory on every execution. This has fixed the problem for now, but the interface should rename all files to a unique name before execution. This is one of my goals for the next iteration of the BOINC interface.
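One possible way to do the renaming is to put a hash of the file contents into the file name, so a changed executable can never collide with a cached copy. This is only a sketch of the idea: the function name unique_copy and the naming scheme are assumptions, and nothing here touches the BOINC API.

```python
import hashlib
import os
import shutil


def unique_copy(path, dest_dir):
    """Copy path into dest_dir under a name derived from its contents.

    Hypothetical helper: identical contents always map to the same name,
    and any change to the contents produces a different name.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()[:12]
    base, ext = os.path.splitext(os.path.basename(path))
    dest = os.path.join(dest_dir, "%s_%s%s" % (base, digest, ext))
    shutil.copyfile(path, dest)
    return dest
```

With names like this, the immutable-file assumption works in our favor: a file that really is unchanged keeps its name and can stay cached, while a modified one is seen as a brand new file.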

Bundling files with tasks

Thursday, July 2nd, 2009

Currently, PyMW tasks consist of only functions and input data.  However, many tasks require separate input files that are often shared between different tasks.  To accommodate this, I modified PyMW to support packing and unpacking of files associated with tasks.  Because of the way this works, there is slightly more burden placed on interface designers.

Some notes:

  • Groups of files are packed into a zip file.  The names of files in a group are hashed and put in a dictionary which references the file they’re packed into. This way, if a given group of files is used in multiple tasks they will only be zipped once.
  • This new feature adds a slight burden to interfaces. Interfaces are now responsible for checking whether a task has an associated package of files (task._data_file_zip). If so, the interface must copy the file to the worker’s directory. Unzipping the files occurs in the task, but the interface should clean up all files after computation. This also lets tasks create temporary files which will be destroyed at task completion.
  • With regard to the previous point, the generic interface now creates a temporary directory for each worker where temp files can be placed. These directories are completely wiped after computation is finished.

This should also simplify the future work of including Python modules for tasks.  Bundled modules will likely be in a different archive for several reasons:

  • Future versions of PyMW will support different modules for different platforms, so we need to distinguish the WinXP module archive from the Linux module archive.  As far as I can tell, there’s no reason to have platform-dependent data files though.
  • The PyZipFile writepy() function packs a module or a whole package directory into the archive correctly and compiles the .pyc files.  Although we could do all this in the same archive as the data files, it feels cleaner to separate them.
  • On some platforms it might be good to separate executable code from other data for security reasons.
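For reference, here is a small sketch of building a module archive with the standard library's zipfile.PyZipFile. The wrapper build_module_archive is a hypothetical name; only PyZipFile and writepy() come from the standard library.

```python
import zipfile


def build_module_archive(module_path, archive_path):
    """Pack a .py file or package directory into a zip of compiled modules.

    Hypothetical wrapper around the standard PyZipFile.writepy() call.
    Returns the list of names stored in the archive.
    """
    with zipfile.PyZipFile(archive_path, "w") as zf:
        # writepy() compiles the source if needed and stores the .pyc file(s).
        zf.writepy(module_path)
        return zf.namelist()
```

Keeping this in its own archive means the data bundle code never has to care about compilation or platform differences.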

BOINC: Recovering from Failures

Wednesday, July 1st, 2009

The BOINC interface works as follows:

  1. The PyMW worker Task gets packaged as a BOINC “work unit” (or job)
  2. These work units get sent off to volunteer compute nodes
  3. The results get “assimilated” creating an output file for PyMW to consume
  4. Finally, the results get turned into completed PyMW Tasks

In this process, a work unit can fail for many reasons (execution problems, computer crashes, the user hitting “abort”, etc.) and BOINC will automatically reprocess the work unit. After a user-specified number of failures, BOINC will give up and flag the work unit with an error code of “too many errors”. At this point, the work unit gets assimilated by the PyMW assimilator, but instead of sending output files back to the PyMW app, the assimilator produces an error file specifying exactly why the work unit failed.

The PyMW BOINC interface will automatically turn these failure files into Python exceptions and attach them to the Task that failed. The interface then deletes the failure file so that future runs don’t get confused and think they failed as soon as they begin.
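The failure handling described above might look something like the sketch below. The Task class, the TaskError exception, and the ".error" file suffix are all stand-ins invented for illustration; the real interface uses PyMW's own task objects and file naming.

```python
import os


class TaskError(Exception):
    """Raised on behalf of a work unit that BOINC gave up on (hypothetical)."""


class Task(object):
    """Minimal stand-in for a PyMW task, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.error = None


def collect_failure(task, task_dir, suffix=".error"):
    """Attach a failure file's contents to the task as an exception.

    Returns True if a failure file was found. The file is deleted afterwards
    so a later run does not mistake it for a fresh failure.
    """
    path = os.path.join(task_dir, task.name + suffix)
    if not os.path.exists(path):
        return False
    with open(path) as f:
        reason = f.read().strip()
    task.error = TaskError(reason)
    os.remove(path)
    return True
```

The application can then inspect or re-raise the attached exception when it collects the finished task.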

This exposes two rough patches in the BOINC interface that I hope to fix soon:

  1. All runs of an application produce files with the same names. BOINC doesn’t like this and the PyMW BOINC interface gets confused by it as well (as noted above).
  2. If the task exception isn’t handled correctly in the PyMW application (your application running on top of PyMW), BOINC will be left in an unpredictable state. This will most likely result in many output files piling up in the task directory. These files will then be mistaken for results the next time the application is run.

Fixing the first problem is fairly straightforward (give each file a unique name), but the second is harder. Automatically canceling all existing work units may be the wrong thing to do for some applications, so that isn’t really an option. On the other hand, leaving it up to the developer to handle exceptions and wait for all work units to complete (or cancel them) isn’t ideal either.

The current plan is to provide some sort of sane default behavior that can be overridden or changed by client applications as needed.