Most computing tasks at home are interactive, but there are some common batch jobs such as file compression, audio/video encoding, and graphics manipulation. Many Linux users prefer command-line tools for such tasks. It's easy to run them on multiple files with for or find. But if you have a multi-core CPU or multiple CPUs, running jobs in parallel can greatly reduce the time it takes to process all the files.
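For comparison, the sequential approach with for or find looks roughly like the sketch below. The name run_each is a hypothetical helper made up for illustration, and gzip in the usage comment is only a stand-in for any per-file command.

```shell
# Run a command once per regular file in a directory, one file after another.
# run_each is a hypothetical helper name, not part of any tool.
run_each() {
    dir=$1
    shift
    for f in "$dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory.
        [ -f "$f" ] || continue
        "$@" "$f"
    done
}

# Example (not run here): compress every file in ~/test-ppss, one at a time.
# run_each ~/test-ppss gzip
# Roughly the same with find, which also descends into subdirectories:
# find ~/test-ppss -type f -exec gzip {} \;
```

Either way only one core is busy at any moment, which is the problem the rest of this post is about.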
PPSS is a shell script that executes commands in parallel. There are many tools like that; what's different about PPSS is that it's dead simple to use. Let's start with an example. I put a bunch of JPG files into ~/test-ppss and I want to convert them to TIFF. All I need is:
ppss -d ~/test-ppss -c 'jpg2tiff '
As you can see, it only needs two parameters:
-d /directory/with/files/to/be/processed
-c 'command '
Note the trailing space in the command. It's the simplest form: the filename will be appended at the end.
OK, I cheated a little bit. You probably won't find a jpg2tiff command on your system. It's a shell script I created that contains:
convert -compress lzw "$1" "`basename "$1" jpg`tiff"
It uses "convert" from the ImageMagick package. I created it to have an easy-to-grasp command line for my first PPSS example. Now let's try without the script. How do we put a filename somewhere in the middle of the PPSS command line? PPSS sets the $ITEM environment variable, so all you need to use is:
-c 'command -i "$ITEM" -s some -o other -a arguments'
In our example:
ppss -d ~/test-ppss -c 'BASENAME=$(basename "$ITEM" .jpg); convert -compress lzw "$ITEM" "$BASENAME.tiff"'
As you can imagine, it took me more than one try to write it, so a helper script might be a good idea.
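Such a helper script might look like the sketch below. It assumes ImageMagick's convert is installed, and tiff_name is a hypothetical function name chosen for illustration. As in the one-liner, the output lands in the current directory.

```shell
#!/bin/sh
# Sketch of a helper script; tiff_name is a hypothetical name used for illustration.

# Derive the output name: drop the directory and the .jpg suffix, add .tiff.
tiff_name() {
    printf '%s.tiff\n' "$(basename "$1" .jpg)"
}

# Convert one JPG to an LZW-compressed TIFF in the current directory.
# Assumes ImageMagick's convert is on the PATH.
jpg2tiff() {
    convert -compress lzw "$1" "$(tiff_name "$1")"
}

# When called with a filename (PPSS appends it to the command), convert that file.
if [ "$#" -gt 0 ]; then
    jpg2tiff "$1"
fi
```

Saved as jpg2tiff somewhere on the PATH and made executable, it is called just like the first example: ppss -d ~/test-ppss -c 'jpg2tiff '. Keeping the name logic in its own function makes it easy to check by hand before a whole directory is processed.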
There are other arguments for PPSS. By default, it runs one parallel job for each CPU core it detects and doesn't take HyperThreading into account. You can change that with -j, though it probably won't make things run much faster, as HT is not as efficient as multi-core or SMP. You can also set the number of tasks manually with -p. It's useful if you want to keep one core free for interactive tasks.
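The value for -p doesn't have to be hard-coded. Here is a minimal sketch that keeps one core free, assuming a Linux system with nproc from GNU coreutils. The name spare_one_core is a hypothetical helper, and -p is used only as described above. Note that nproc counts logical processors, so hyperthreads are included in its result.

```shell
# Print the number of jobs to run so that one core stays free,
# never going below one job. Takes the core count as its argument.
# spare_one_core is a hypothetical helper name.
spare_one_core() {
    cores=$1
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}

# Example (not run here): pass the result to PPSS with -p.
# ppss -d ~/test-ppss -c 'jpg2tiff ' -p "$(spare_one_core "$(nproc)")"
```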
Shell scripts on multiple machines
PPSS can distribute tasks to other machines using SSH. The setup is slightly harder, but still way simpler than with typical distributed programming frameworks. All you need is:
- Passwordless (key-based) SSH authentication. You should create accounts and keyfiles for this sole purpose. Don't use your login account.
- Config file for PPSS with node list and parameters.
- Optionally, an SMB or NFS share with the files to be processed. PPSS can also send the input and fetch the results with scp, but unless the files are very small and the task is very CPU intensive, that won't make much sense. All the gains from parallel processing would be eaten up by the time it takes to distribute the files.
I haven't tested it yet; you can find the explanation and read the manual on the PPSS website.

With GNU Parallel http://www.gnu.org/software/parallel/ you probably do not need several tries to get it right. Without the helper script it looks like this:
find images | parallel convert -compress lzw {} {.}.tiff
To run one process per CPU core add -j+0:
find images | parallel -j+0 convert -compress lzw {} {.}.tiff
To run on remote computers you need to tell GNU Parallel to transfer the files using --trc. This will run one job per CPU core on server1 and on your local computer (:), even if server1 does not have the same number of cores:
find images | parallel -S server1,: --trc {.}.tiff -j+0 convert -compress lzw {} {.}.tiff
See the video: http://www.youtube.com/watch?v=LlXDtd_pRaY
I have tried PPSS on a computer that has two quad-core CPUs, running a large Monte Carlo study in Stata's batch mode. With PPSS I finally get all CPUs up to 100% utilization (well, one hovers at 99.5%). With PPSS I cut the processing time by approximately 50%. Great.