NaCTeM

Seminar — Kenjiro Taura

Speaker: Professor Kenjiro Taura, The University of Tokyo
Title: Large scale text processing made simple by GXP make: A Unixish way to parallel workflow processing
Date: Monday 15th June at 11:00
Location: Lecture Theatre (MLG.001) in the MIB Building
Abstract:

In the first part of this talk, I will introduce a simple tool called GXP make. GXP is a general-purpose parallel shell (a process launcher) for multicore machines, unmanaged clusters accessed via SSH, clusters or supercomputers managed by a batch scheduler, distributed machines, or any mixture thereof. GXP make is a 'make' execution engine that executes regular UNIX makefiles in parallel. Make, though typically used for software builds, is in fact a general framework for concisely describing workflows consisting of sequential commands. Installation of GXP requires no root privileges and needs to be done only on the user's home machine. GXP make easily scales to more than 1,000 CPU cores. The net result is that GXP make allows an easy migration of workflows from serial environments to clusters and to distributed environments.

In the second part, I will talk about our experience of running a complex text processing workflow developed by NLP experts. It is an entire workflow that processes MEDLINE abstracts with deep NLP tools (e.g., the Enju parser) to generate the search indices of MEDIE, a semantic retrieval engine for MEDLINE, which is one of NaCTeM's services. It was originally described in a Makefile without any particular provision for parallel processing, yet GXP make was able to run it on clusters with almost no changes to the original Makefile. The time for processing the abstracts published in a single day was reduced from approximately eight hours (on a single machine) to twenty minutes with a trivial amount of effort. A larger-scale experiment processing all abstracts published so far, and the remaining challenges, will also be presented.

 

Presentation slides