HOWTO: Getting Condor going on a cluster of Ubuntu machines with a custom algorithm.
The following instructions follow on from my previous article concerning a cluster of Ubuntu machines and BOINC.
Like BOINC, Condor uses a client-server architecture, so I have instructions for a master machine and also for a bunch of client (slave) machines.
Master Machine
Download condor-6.8.7-linux-x86-rhel.tar.gz and uncompress it.
cd /data/condor-6.8.7
I have opted to make two directories, condor_root and condor_local:
mkdir condor_root
mkdir condor_local
Now run Condor’s configuration program, then set the CONDOR_CONFIG environment variable to point to the configuration file it generates (you might want to make that more permanent than a one-off export):
sudo ./condor_configure --install-dir=/data/condor-6.8.7/condor_root/ --type=manager,submit --local-dir=/data/condor-6.8.7/condor_local/ --owner=bcg --install=/data/condor-6.8.7/release.tar
export CONDOR_CONFIG=/data/condor-6.8.7/condor_root/etc/condor_config
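One way to make the variable permanent is to append the export line to a shell startup file. The following is a minimal sketch, assuming a Bourne-style shell that reads ~/.bashrc; persist_condor_config is a hypothetical helper name, not something that ships with Condor.

```shell
# persist_condor_config CONFIG_PATH [RC_FILE]
# Hypothetical helper: appends the CONDOR_CONFIG export to a startup file
# (default: ~/.bashrc, an assumption about your shell), but only once.
persist_condor_config() {
    config_path=$1
    rc_file=${2:-"$HOME/.bashrc"}
    line="export CONDOR_CONFIG=$config_path"
    # -x matches whole lines and -F treats the pattern as a fixed string,
    # so running this twice does not add a duplicate line.
    if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example (uncomment to apply to your own ~/.bashrc):
# persist_condor_config /data/condor-6.8.7/condor_root/etc/condor_config
```

New login shells will then pick the variable up; the current shell still needs the plain export shown above.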
Edit condor_root/etc/condor_config:
Set RELEASE_DIR to /data/condor-6.8.7/condor_root/
Set HOSTALLOW_WRITE to *
Set HOSTALLOW_ADMINISTRATOR = $(FULL_HOSTNAME)
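If you are setting up several machines, these edits can be scripted instead of made by hand. The following is a rough sketch only: set_condor_option is a hypothetical helper, and it assumes each setting sits on a single "KEY = value" line (it does not handle values continued over several lines with a backslash).

```shell
# set_condor_option FILE KEY VALUE
# Hypothetical helper: replaces an existing "KEY = ..." line in FILE,
# or appends "KEY = VALUE" if the key is not set anywhere.
set_condor_option() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        # An uncommented assignment to exactly this key.
        $0 ~ ("^[ \t]*" key "[ \t]*=") { print key " = " value; found = 1; next }
        { print }
        END { if (!found) print key " = " value }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example for the master machine (uncomment to apply). Single quotes keep
# the shell from expanding * and $(FULL_HOSTNAME):
# cfg=/data/condor-6.8.7/condor_root/etc/condor_config
# set_condor_option "$cfg" RELEASE_DIR /data/condor-6.8.7/condor_root/
# set_condor_option "$cfg" HOSTALLOW_WRITE '*'
# set_condor_option "$cfg" HOSTALLOW_ADMINISTRATOR '$(FULL_HOSTNAME)'
```

Writing back with cat rather than mv keeps the original file's owner and permissions.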
Start Condor:
sudo condor_root/sbin/condor_master
Stop Condor (when you need to shut it down):
sudo condor_root/sbin/condor_off -master
Check that it’s running:
ps -ef | egrep condor_
bcg@rhdl-a2:/data/condor-6.8.7$ ps -ef | egrep condor_
bcg 24421 1 0 12:25 ? 00:00:00 condor_root/sbin/condor_master
bcg 24422 24421 0 12:25 ? 00:00:00 condor_collector -f
bcg 24423 24421 0 12:25 ? 00:00:00 condor_negotiator -f
bcg 24424 24421 0 12:25 ? 00:00:00 condor_schedd -f
bcg 24425 24421 7 12:25 ? 00:00:07 condor_startd -f
bcg 24475 5431 0 12:27 pts/0 00:00:00 grep -E condor_
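Rather than eyeballing the process list, the check can be scripted. This is a small sketch that takes the five daemon names from the listing above as the expected set on the master; missing_condor_daemons is a hypothetical helper name, and the list would be shorter on an execute-only machine.

```shell
# missing_condor_daemons
# Hypothetical helper: reads `ps -ef` output on stdin and prints the name
# of each expected daemon that does not appear in it.
missing_condor_daemons() {
    ps_output=$(cat)
    for daemon in condor_master condor_collector condor_negotiator \
                  condor_schedd condor_startd; do
        printf '%s\n' "$ps_output" | grep -q "$daemon" || printf '%s\n' "$daemon"
    done
}

# Example: prints nothing when all five daemons are up.
# ps -ef | missing_condor_daemons
```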
Create a job:
Put the following in a text file called mandelbrot16.condor (the filename can be anything really).
# file name: mandelbrot16.condor
# Condor submit description file for mandelbrot
Executable = /data/condor-6.8.7/mandelbrot16/mandelbrot
Universe = vanilla
Error = logs/err.$(cluster)
Output = logs/out.$(cluster)
Log = logs/log.$(cluster)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/mandelbrot_00000001
Arguments = mandelbrot_00000001 mandelbrot_00000001_out
Queue
Create a directory to put the job in:
mkdir mandelbrot16
cd mandelbrot16
mkdir logs
mkdir files
Copy the files that you want Condor to work with (such as the input files for the algorithm) into the files directory. Put the algorithm itself in the mandelbrot16 directory.
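With more than one input file in the files directory, the submit description can be generated instead of typed out. The sketch below is an illustration under a few assumptions: gen_submit is a hypothetical helper, the executable takes an input name and an output name as in the example above, and the output name is the input name with _out appended. It also adds $(process) to the error and output file names so that jobs queued from the same file do not overwrite each other's logs, which is a change from the single-job file above.

```shell
# gen_submit JOB_DIR EXECUTABLE
# Hypothetical helper: prints a submit description with one Queue
# statement per regular file found in JOB_DIR/files.
gen_submit() {
    job_dir=$1 executable=$2
    # \$ keeps $(cluster) and $(process) literal for Condor to expand.
    cat <<EOF
Executable = $executable
Universe = vanilla
Error = logs/err.\$(cluster).\$(process)
Output = logs/out.\$(cluster).\$(process)
Log = logs/log.\$(cluster)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
EOF
    for input in "$job_dir"/files/*; do
        [ -f "$input" ] || continue   # skip when the directory is empty
        name=$(basename "$input")
        printf '\ntransfer_input_files = files/%s\n' "$name"
        printf 'Arguments = %s %s_out\n' "$name" "$name"
        printf 'Queue\n'
    done
}

# Example (uncomment to write the file):
# gen_submit mandelbrot16 /data/condor-6.8.7/mandelbrot16/mandelbrot \
#     > mandelbrot16/mandelbrot16.condor
```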
Submit a job
condor_root/bin/condor_submit mandelbrot16/mandelbrot16.condor
Check on jobs
All jobs:
condor_root/bin/condor_q
A job (say job id 3):
condor_root/bin/condor_q 3
Client Machine
Download condor-6.8.7-linux-x86-rhel.tar.gz and uncompress it.
cd /home/bcg/condor-6.8.7
Same kind of setup as the Master box here. Note that the type is execute only:
mkdir condor_root
mkdir condor_local
sudo ./condor_configure --install-dir=/home/bcg/condor-6.8.7/condor_root/ --type=execute --local-dir=/home/bcg/condor-6.8.7/condor_local/ --owner=bcg --install=/home/bcg/condor-6.8.7/release.tar
export CONDOR_CONFIG=/home/bcg/condor-6.8.7/condor_root/etc/condor_config
Edit condor-6.8.7/condor_root/etc/condor_config
Set UID_DOMAIN = $(FULL_HOSTNAME)
Set FILESYSTEM_DOMAIN=$(FULL_HOSTNAME)
Set HOSTALLOW_ADMINISTRATOR = $(FULL_HOSTNAME)
Set HOSTALLOW_WRITE to *
Edit condor-6.8.7/condor_local/condor_config.local. Set CONDOR_HOST to the IP address of your master machine and NETWORK_INTERFACE to the IP address of the client machine you are setting up. Set UID_DOMAIN and FILESYSTEM_DOMAIN to $(FULL_HOSTNAME) here as well:
CONDOR_HOST = 144.6.40.251
NETWORK_INTERFACE = 144.6.40.115
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
These settings, in the same file, make the client work on jobs as quickly as possible and with as much effort as possible regardless of user actions on the client machine. Remember, these instructions are for a dedicated cluster - you might not want to do this with your desktop machine.
WANT_SUSPEND = FALSE
CONTINUE = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
START = TRUE
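Since only the two IP addresses differ from one client to the next, the local settings can be produced by a small script. This is just a sketch: print_local_config is a hypothetical helper name, and it emits exactly the settings listed above for condor_config.local.

```shell
# print_local_config MASTER_IP CLIENT_IP
# Hypothetical helper: prints the condor_config.local settings for one
# dedicated execute machine.
print_local_config() {
    master_ip=$1 client_ip=$2
    # \$ keeps $(FULL_HOSTNAME) literal so Condor expands it, not the shell.
    cat <<EOF
CONDOR_HOST = $master_ip
NETWORK_INTERFACE = $client_ip
UID_DOMAIN = \$(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = \$(FULL_HOSTNAME)
WANT_SUSPEND = FALSE
CONTINUE = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
START = TRUE
EOF
}

# Example (uncomment to append to this client's local config):
# print_local_config 144.6.40.251 144.6.40.115 \
#     >> /home/bcg/condor-6.8.7/condor_local/condor_config.local
```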
That’s it. Start Condor up on all the machines and wait for the computation to start. Condor can take several (5 to 10) minutes to get things underway, but once she starts she chugs through the work pretty quickly.
December 17th, 2007 at 6:26 pm
Regarding your last “5 to 10 minutes” statement… It will be much quicker if you set
NEGOTIATOR_INTERVAL = 30
UPDATE_INTERVAL = 30
in the configuration. Both values are in seconds. The first one is how often Condor attempts to match jobs with machines; the second is how often machines report their status back to the master.
December 17th, 2007 at 7:01 pm
I’ll look into that. I find that there is only that delay when you start the whole cluster cold. If the machines all have a chance to get in touch with the master first, they seem to take and compute jobs pretty much straight away, as you would expect given the WANT_SUSPEND, CONTINUE, etc. settings.