HOWTO: Getting Condor going on a cluster of Ubuntu machines with a custom algorithm

Sunday, December 16th, 2007

The following instructions follow on from my previous article concerning a cluster of Ubuntu machines and BOINC.

 

Like BOINC, Condor uses a client-server architecture, so I have instructions for a master machine and also for a bunch of slaves (the client machines).

 

Master Machine

 

Download condor-6.8.7-linux-x86-rhel.tar.gz and uncompress it.

 

cd /data/condor-6.8.7

 

I have opted to make two directories, condor_root and condor_local:

 

mkdir condor_root
mkdir condor_local

 

Now run Condor’s configuration program, then set an environment variable that points to the configuration file it creates (you might want to make that more permanent than a one-off export):

 

sudo ./condor_configure --install-dir=/data/condor-6.8.7/condor_root/ --type=manager,submit --local-dir=/data/condor-6.8.7/condor_local/ --owner=bcg --install=/data/condor-6.8.7/release.tar

 

export CONDOR_CONFIG=/data/condor-6.8.7/condor_root/etc/condor_config
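
One way to make the variable more permanent, as a minimal sketch: append the export line to a shell startup file if it is not already there. The helper name persist_condor_config and the choice of ~/.bashrc are my assumptions, not anything Condor provides; use whichever startup file your shell actually reads.

```shell
# Hypothetical helper: add the CONDOR_CONFIG export to a startup file once.
persist_condor_config() {
    profile=$1
    line='export CONDOR_CONFIG=/data/condor-6.8.7/condor_root/etc/condor_config'
    touch "$profile"
    # -x matches the whole line, -F treats it as a fixed string;
    # the line is appended only when it is not already present.
    grep -qxF "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

# Usage (assumes bash reads ~/.bashrc when you log in):
#   persist_condor_config "$HOME/.bashrc"
```

Running it twice leaves a single copy of the line, so it is safe to put in a setup script.
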
Edit condor_root/etc/condor_config:

 

Set RELEASE_DIR to /data/condor-6.8.7/condor_root/

 

Set HOSTALLOW_WRITE to *

 

Set HOSTALLOW_ADMINISTRATOR = $(FULL_HOSTNAME)

 

Start Condor:

 

sudo condor_root/sbin/condor_master

 

Stop Condor:

 

sudo condor_root/sbin/condor_off -master

 

Check that it’s running:

 

ps -ef | egrep condor_

 

bcg@rhdl-a2:/data/condor-6.8.7$ ps -ef | egrep condor_
bcg 24421 1 0 12:25 ? 00:00:00 condor_root/sbin/condor_master
bcg 24422 24421 0 12:25 ? 00:00:00 condor_collector -f
bcg 24423 24421 0 12:25 ? 00:00:00 condor_negotiator -f
bcg 24424 24421 0 12:25 ? 00:00:00 condor_schedd -f
bcg 24425 24421 7 12:25 ? 00:00:07 condor_startd -f
bcg 24475 5431 0 12:27 pts/0 00:00:00 grep -E condor_
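
If you would rather not eyeball the listing, the check can be scripted. The sketch below reads a ps listing on standard input and reports which of the daemons shown above are absent. The function name check_condor_daemons is made up for illustration, and the default daemon list is simply the five that appear in my listing; pass your own list as arguments for a machine that runs fewer.

```shell
# Hypothetical helper: report Condor daemons missing from a `ps -ef` listing.
check_condor_daemons() {
    # Default to the daemons seen on the master machine in this article.
    [ "$#" -gt 0 ] || set -- condor_master condor_collector \
        condor_negotiator condor_schedd condor_startd
    # Read stdin, dropping the grep process that also matches "condor_".
    listing=$(grep -v 'grep' || true)
    status=0
    for daemon in "$@"; do
        case "$listing" in
            *"$daemon"*) ;;                           # daemon is present
            *) echo "missing: $daemon"; status=1 ;;   # daemon is absent
        esac
    done
    return "$status"
}

# Usage:
#   ps -ef | check_condor_daemons
#   ps -ef | check_condor_daemons condor_master condor_startd
```

The function returns a non-zero status when anything is missing, so it can be used in an if statement or a cron job.
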

 

Create a job:

 

Put the following in a text file called mandelbrot16.condor (the filename can be anything really).

 

# file name: mandelbrot16.condor
# Condor submit description file for mandelbrot
Executable = /data/condor-6.8.7/mandelbrot16/mandelbrot
Universe = vanilla
Error = logs/err.$(cluster)
Output = logs/out.$(cluster)
Log = logs/log.$(cluster)

 

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/mandelbrot_00000001
Arguments = mandelbrot_00000001 mandelbrot_00000001_out
Queue
Create a directory to put the job in (the submit file above goes in here too):

 

mkdir mandelbrot16
cd mandelbrot16
mkdir logs
mkdir files

 

Copy the files that you want Condor to work with (such as the input files for the algorithm) into the files directory. Put the algorithm itself (the mandelbrot executable) in the mandelbrot16 directory.
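
With more than one input file, writing the submit file by hand gets tedious. Here is a rough sketch that generates one, with a Queue statement per file found in the files directory. It relies on Condor accepting several Queue statements in one submit file, with the settings between them changed for each job. The function name gen_mandelbrot_submit is hypothetical, and adding $(process) to the error and output file names is my own tweak so that jobs in the same cluster do not write over each other.

```shell
# Hypothetical generator: print a submit file with one job per input file.
gen_mandelbrot_submit() {
    jobdir=$1   # e.g. /data/condor-6.8.7/mandelbrot16
    printf 'Executable = %s/mandelbrot\n' "$jobdir"
    printf 'Universe = vanilla\n'
    # Single quotes keep $(cluster) and $(process) literal for Condor.
    printf 'Error = logs/err.$(cluster).$(process)\n'
    printf 'Output = logs/out.$(cluster).$(process)\n'
    printf 'Log = logs/log.$(cluster)\n'
    printf 'should_transfer_files = YES\n'
    printf 'when_to_transfer_output = ON_EXIT\n'
    for path in "$jobdir"/files/mandelbrot_*; do
        [ -e "$path" ] || continue   # the glob matched nothing
        name=${path##*/}             # strip the directory part
        printf 'transfer_input_files = files/%s\n' "$name"
        printf 'Arguments = %s %s_out\n' "$name" "$name"
        printf 'Queue\n'
    done
}

# Usage:
#   cd /data/condor-6.8.7/mandelbrot16
#   gen_mandelbrot_submit "$PWD" > mandelbrot16.condor
```
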

 

Submit a job

 

condor_root/bin/condor_submit mandelbrot16/mandelbrot16.condor

 

Check on jobs

 

All jobs:

 

condor_root/bin/condor_q

 

A job (say job id 3):

 

condor_root/bin/condor_q 3

 

Client Machine

 

Download condor-6.8.7-linux-x86-rhel.tar.gz and uncompress it.

 

cd /home/bcg/condor-6.8.7

 

Same kind of setup as the Master box here. Note that the type is execute only:

 

mkdir condor_root
mkdir condor_local

 

sudo ./condor_configure --install-dir=/home/bcg/condor-6.8.7/condor_root/ --type=execute --local-dir=/home/bcg/condor-6.8.7/condor_local/ --owner=bcg --install=/home/bcg/condor-6.8.7/release.tar

 

export CONDOR_CONFIG=/home/bcg/condor-6.8.7/condor_root/etc/condor_config

 

Edit condor-6.8.7/condor_root/etc/condor_config

 

Set UID_DOMAIN = $(FULL_HOSTNAME)
Set FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
Set HOSTALLOW_ADMINISTRATOR = $(FULL_HOSTNAME)
Set HOSTALLOW_WRITE to *

 

Edit condor-6.8.7/condor_local/condor_config.local. Set CONDOR_HOST to the IP address of your master machine. Set NETWORK_INTERFACE to the IP address of the client machine you are setting up.

 

CONDOR_HOST = 144.6.40.251
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
NETWORK_INTERFACE = 144.6.40.115

 

These settings, in the same file, make the client work on jobs as quickly as possible and with as much effort as possible regardless of user actions on the client machine. Remember, these instructions are for a dedicated cluster - you might not want to do this with your desktop machine.

 

WANT_SUSPEND = FALSE
CONTINUE = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
START = TRUE
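
With a bunch of clients to set up, the condor_config.local edits can be scripted. The sketch below appends the settings from this section to a given file, taking the master and client IP addresses as arguments. The function name write_client_local_config is made up for illustration. It also assumes that a setting appended at the end of the file overrides any earlier definition of the same name, which is worth checking against the manual for your Condor version.

```shell
# Hypothetical helper: append the dedicated-cluster client settings.
write_client_local_config() {
    file=$1
    master_ip=$2
    client_ip=$3
    # Unquoted heredoc: shell variables expand, while \$ keeps the
    # Condor macro $(FULL_HOSTNAME) literal in the output.
    cat >> "$file" <<EOF
CONDOR_HOST = $master_ip
NETWORK_INTERFACE = $client_ip
UID_DOMAIN = \$(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = \$(FULL_HOSTNAME)
WANT_SUSPEND = FALSE
CONTINUE = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
START = TRUE
EOF
}

# Usage, with the addresses from this article:
#   write_client_local_config \
#       /home/bcg/condor-6.8.7/condor_local/condor_config.local \
#       144.6.40.251 144.6.40.115
```
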

 

That’s it. Start Condor on all the machines and wait for the computation to start. Condor can take several (5 to 10) minutes to get things underway, but once she starts she chugs through the work pretty quickly.