Bundling files with tasks

Thursday, July 2nd, 2009

Currently, PyMW tasks consist only of functions and input data.  However, many tasks require separate input files, and these are often shared between different tasks.  To accommodate this, I modified PyMW to support packing and unpacking of files associated with tasks.  The way this works places slightly more burden on interface designers.

Some notes:

  • Groups of files are packed into a zip file.  The names of the files in a group are hashed, and the hash is stored in a dictionary that references the zip file the group is packed into.  This way, if a given group of files is used in multiple tasks, it will only be zipped once.
  • This new feature adds a slight burden to interfaces.  Interfaces are now responsible for checking whether a task has an associated package of files (task._data_file_zip).  If so, the interface must copy the file to the worker's directory.  Unzipping the files occurs in the task, but the interface should clean up all files after computation.  This also lets tasks create temporary files, which will be destroyed at task completion.
  • With regard to the previous point, the generic interface now creates a temporary directory for each worker where temp files can be placed.  These directories are completely wiped after computation is finished.
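
The notes above can be sketched roughly as follows.  This is not PyMW's actual code, only a minimal illustration under some assumptions: the names pack_files, run_in_worker_dir and _zip_cache are hypothetical, SHA-1 is an arbitrary choice of hash, and the work callable stands in for the task, which is the part that unzips the archive.

```python
import hashlib
import os
import shutil
import tempfile
import zipfile

# Hypothetical cache: hash of a file group's names -> path of the zip holding it.
_zip_cache = {}


def pack_files(file_names, archive_dir):
    """Zip a group of files once and return the path of the archive."""
    # Sort first so the same group listed in a different order gets the same key.
    joined = "\n".join(sorted(file_names))
    key = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    if key not in _zip_cache:
        zip_path = os.path.join(archive_dir, key + ".zip")
        with zipfile.ZipFile(zip_path, "w") as archive:
            for name in file_names:
                # Store files flat; assumes base names in a group are unique.
                archive.write(name, os.path.basename(name))
        _zip_cache[key] = zip_path
    return _zip_cache[key]


def run_in_worker_dir(data_file_zip, work):
    """Interface side: copy the archive to a fresh worker directory,
    run work(worker_dir, local_zip), then wipe the directory."""
    worker_dir = tempfile.mkdtemp(prefix="pymw_worker_")
    try:
        local_zip = None
        if data_file_zip is not None:
            local_zip = shutil.copy(data_file_zip, worker_dir)
        return work(worker_dir, local_zip)
    finally:
        # Removes the copied zip, unpacked files, and any temp files the task made.
        shutil.rmtree(worker_dir, ignore_errors=True)
```

Because only the file names are hashed, a second call to pack_files with the same group returns the cached archive instead of zipping again.  Cleaning up in a finally block means the worker directory is removed even if the task raises.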

This should also simplify the future work of including Python modules for tasks.  Bundled modules will likely be in a different archive for several reasons:

  • Future versions of PyMW will support different modules for different platforms, so we need to distinguish the WinXP module archive from the Linux module archive.  As far as I can tell, there's no reason to have platform-dependent data files though.
  • The PyZipFile writepy() method walks package directories, packages the modules together correctly, and compiles the .pyc files.  Although we could do all this in the same archive as the data files, it feels cleaner to separate them.
  • On some platforms it might be good to separate executable code from other data for security reasons.
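
As a rough sketch of what a separate module archive might look like, the standard library's zipfile.PyZipFile can be used as below.  The function name pack_modules and the "modules-<platform>.zip" naming scheme are assumptions made for illustration, not part of PyMW.

```python
import os
import zipfile


def pack_modules(module_paths, archive_dir, platform_tag):
    """Compile .py files or package directories into a per-platform archive."""
    # Hypothetical naming scheme: one module archive per platform.
    archive_path = os.path.join(archive_dir, "modules-%s.zip" % platform_tag)
    with zipfile.PyZipFile(archive_path, "w") as archive:
        for path in module_paths:
            # writepy() adds the compiled form of a .py file, or of every
            # module in a package directory, to the archive.
            archive.writepy(path)
    return archive_path
```

A call such as pack_modules(["mymod.py"], "/tmp", "linux") would produce /tmp/modules-linux.zip, kept apart from the data file archive.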