HTCondor Project

The website and blog for the HTCondor project on github.


Planet HTCondor

Configuring a Personal Hadoop Development Environment on Fedora 18

Background

The following post outlines the setup and configuration of a “personal hadoop” development environment, much akin to a “personal condor” setup. The primary purpose is to have a single source for configuration and logs, along with a soft-link to development-built binaries, so that switching to a different build is a matter of updating the soft-link while all other data and configuration stay in place.


Use Cases

  • Comparison testing in a local sandbox without altering an existing system installation.
  • Single source configuration and logs

Disclaimers

  • Currently this is a non-native development setup that uses the existing maven dependencies. For details on native packaging please visit https://fedoraproject.org/wiki/Features/Hadoop
  • The setup listed below creates a “Single-Node-Cluster”.

Prerequisites

Configure Password-less ssh

yum install openssh openssh-clients openssh-server
# generate a public/private key, if you don't already have one
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/*

# testing ssh:
ps -ef | grep sshd     # verify sshd is running
ssh localhost          # accept the host key when prompted
sudo passwd root       # make sure root has a password

Install Other Build Dependencies

yum install cmake git subversion dh-make ant autoconf automake sharutils libtool asciidoc xmlto curl protobuf-compiler maven gcc-c++ 

Install Oracle 1.6 JDK

You want JDK 6 (not JDK 7), for example jdk-6u43-linux-x64-rpm.bin. Run it and it will unpack into several “.rpm” files:

jdk-6u43-linux-amd64.rpm
sun-javadb-client-10.6.2-1.1.i386.rpm  
sun-javadb-core-10.6.2-1.1.i386.rpm
sun-javadb-docs-10.6.2-1.1.i386.rpm
sun-javadb-common-10.6.2-1.1.i386.rpm  
sun-javadb-demo-10.6.2-1.1.i386.rpm 
sun-javadb-javadoc-10.6.2-1.1.i386.rpm

Install:

yum localinstall jdk-6u43-linux-amd64.rpm sun-javadb-*.rpm

Append to your .bashrc file:

export JAVA_HOME="/usr/java/jdk1.6.0_43"
export JVM_ARGS="-Xmx1024m -XX:MaxPermSize=512m"
export PATH=${JAVA_HOME}/bin:${PATH}

Building and Setting up a “personal-hadoop”

Building

git clone git://git.apache.org/hadoop-common.git
cd hadoop-common
git checkout -b branch-2.0.4-alpha origin/branch-2.0.4-alpha
mvn clean package -Pdist -DskipTests

Creating Your “personal-hadoop” Sandbox

This configuration assumes a home directory of /home/tstclair; substitute your own where it appears.

cd ~
mkdir personal-hadoop
cd personal-hadoop
mkdir -p conf data name logs/yarn
ln -sf <your-git-loc>/hadoop-dist/target/hadoop-2.0.4-alpha home
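
Switching the sandbox to a different build later means re-pointing the home link. The following is a minimal Python sketch of that step, not part of the original setup; switch_build and the build directory names are made up for illustration. It creates the new link under a temporary name and renames it over the old one, so the sandbox always has a valid home link.

```python
import os
import tempfile


def switch_build(sandbox, build_dir, link_name="home"):
    """Point <sandbox>/<link_name> at build_dir, replacing any existing link."""
    link_path = os.path.join(sandbox, link_name)
    # Create the new link under a temporary name, then rename it over the
    # old one so the sandbox never lacks a "home" link.
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(build_dir, tmp_path)
    os.replace(tmp_path, link_path)
    return os.path.realpath(link_path)


# Demonstration in a throwaway directory; the build names are hypothetical.
with tempfile.TemporaryDirectory() as root:
    sandbox = os.path.join(root, "personal-hadoop")
    build_a = os.path.join(root, "build-a")
    build_b = os.path.join(root, "build-b")
    for d in (sandbox, build_a, build_b):
        os.makedirs(d)
    switch_build(sandbox, build_a)
    current = switch_build(sandbox, build_b)
    print(os.path.basename(current))  # build-b
```

When re-pointing an existing link from the shell instead, note that GNU ln needs the -n flag (ln -sfn) so that it replaces the link rather than creating a new link inside the directory the old one points to.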

Override your environment

Append to your .bashrc file:

# Hadoop env override:
export HADOOP_BASE_DIR=${HOME}/personal-hadoop
export HADOOP_LOG_DIR=${HOME}/personal-hadoop/logs
export HADOOP_PID_DIR=${HADOOP_BASE_DIR}
export HADOOP_CONF_DIR=${HOME}/personal-hadoop/conf
export HADOOP_COMMON_HOME=${HOME}/personal-hadoop/home
export HADOOP_HDFS_HOME=${HADOOP_COMMON_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_COMMON_HOME}
# Yarn env override:
export HADOOP_YARN_HOME=${HADOOP_COMMON_HOME}
export YARN_LOG_DIR=${HADOOP_LOG_DIR}/yarn
#classpath override to search hadoop loc
export CLASSPATH=/usr/share/java/:${HADOOP_COMMON_HOME}/share
#Finally update your PATH
export PATH=${HADOOP_COMMON_HOME}/bin:${HADOOP_COMMON_HOME}/sbin:${HADOOP_COMMON_HOME}/libexec:${PATH}

Verify your setup

source ~/.bashrc
which hadoop    # verify it should be ${HOME}/personal-hadoop/home/bin  
hadoop -help    # verify classpath is correct.
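
The same verification can be scripted. Below is a small Python sketch, assuming only the variable names exported in the .bashrc snippet above; missing_vars and hadoop_on_path are helper names invented for this example. It reports which overrides are unset and whether ${HADOOP_COMMON_HOME}/bin is on the PATH.

```python
import os

# Variables exported in the .bashrc snippet above.
EXPECTED_VARS = [
    "HADOOP_BASE_DIR", "HADOOP_LOG_DIR", "HADOOP_PID_DIR", "HADOOP_CONF_DIR",
    "HADOOP_COMMON_HOME", "HADOOP_HDFS_HOME", "HADOOP_MAPRED_HOME",
    "HADOOP_YARN_HOME", "YARN_LOG_DIR",
]


def missing_vars(env):
    """Return the expected variables that are unset or empty in env."""
    return [name for name in EXPECTED_VARS if not env.get(name)]


def hadoop_on_path(env):
    """True if ${HADOOP_COMMON_HOME}/bin is one of the PATH entries."""
    home = env.get("HADOOP_COMMON_HOME", "")
    entries = env.get("PATH", "").split(os.pathsep)
    return bool(home) and os.path.join(home, "bin") in entries


# Check the current shell environment.
print("missing:", missing_vars(os.environ))
print("hadoop on PATH:", hadoop_on_path(os.environ))
```

An empty "missing" list and a True on the second line suggest the overrides were picked up after sourcing ~/.bashrc.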

Creating the Initial Single-Node Configuration

First copy in the default configuration files:

cp ${HADOOP_COMMON_HOME}/etc/hadoop/* ${HADOOP_BASE_DIR}/conf

NOTE: As your configuration testing space expands, it is sometimes useful to make your conf directory a soft-link to a set of configuration templates as well.

Next update your hdfs-site.xml with the following:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Override tstclair with your home directory -->

<configuration>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost/</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///home/tstclair/personal-hadoop/name</value>
    </property>
    <property>
        <name>dfs.http.address</name>
        <value>0.0.0.0:50070</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///home/tstclair/personal-hadoop/data</value>
    </property>
    <property>
        <name>dfs.datanode.address</name>
        <value>0.0.0.0:50010</value>
    </property>
    <property>
        <name>dfs.datanode.http.address</name>
        <value>0.0.0.0:50075</value>
    </property>
    <property>
        <name>dfs.datanode.ipc.address</name>
        <value>0.0.0.0:50020</value>
    </property>

</configuration>
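
To double-check what a *-site.xml file actually sets, the properties can be read back into a dictionary. This is a hedged sketch using Python's standard xml.etree.ElementTree; read_site_xml is a name made up here, and the sample embeds just two of the properties from the file above.

```python
import xml.etree.ElementTree as ET

# Two properties copied from the hdfs-site.xml above.
SAMPLE = """<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///home/tstclair/personal-hadoop/name</value>
    </property>
    <property>
        <name>dfs.http.address</name>
        <value>0.0.0.0:50070</value>
    </property>
</configuration>
"""


def read_site_xml(text):
    """Map each <name> to its stripped <value> in a Hadoop *-site.xml document."""
    root = ET.fromstring(text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name:
            props[name] = value
    return props


props = read_site_xml(SAMPLE)
print(props["dfs.http.address"])  # 0.0.0.0:50070
```

Stripping the value is deliberate: a value element that spans two lines with nothing in it comes back as an empty string, which makes unset properties easy to spot.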

Append, or update, your mapred-site.xml with the following:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Update or append these vars -->

<configuration>
    <property>
        <name>mapreduce.cluster.temp.dir</name>
        <value>
        </value>
        <description>No description</description>
        <final>true</final>
    </property>
    <property>
        <name>mapreduce.cluster.local.dir</name>
        <value>
        </value>
        <description>No description</description>
        <final>true</final>
    </property>
</configuration>

Finally update your yarn-site.xml with the following:

<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8031</value>
        <description>host is the hostname of the resource manager and
                    port is the port on which the NodeManagers contact the Resource Manager.
        </description>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
        <description>host is the hostname of the resourcemanager and port is the port
                     on which the Applications in the cluster talk to the Resource Manager.
        </description>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
        <description>In case you do not want to use the default scheduler</description>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8032</value>
        <description>the host is the hostname of the ResourceManager and the port is the port on
                    which the clients can talk to the Resource Manager. </description>
    </property>
    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>
        </value>
        <description>the local directories used by the nodemanager</description>
    </property>
    <property>
        <name>yarn.nodemanager.address</name>
        <value>localhost:8034</value>
        <description>the nodemanagers bind to this port</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>10240</value>
        <description>the amount of memory on the NodeManager in MB</description>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce.shuffle</value>
        <description>shuffle service that needs to be set for Map Reduce to run </description>
    </property>
</configuration>

NOTE: You may notice that I’ve included default variables and their corresponding port numbers to ease default hunting.

Starting Your Single Node Hadoop Cluster

Format your namenode (only needed for the 1st setup):

hadoop namenode -format
#verify output is correct.

Start HDFS:

start-dfs.sh

Open a browser to http://localhost:50070 and verify you have 1 live node.

Next start yarn:

start-yarn.sh

Verify the logs show it’s running normally.
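
One rough way to complement reading the logs is to check that the daemons are listening on the ports configured earlier. The following Python sketch assumes the port numbers from the hdfs-site.xml and yarn-site.xml files in this post and that everything runs on localhost; port_open is a helper name made up for this example. A listening port does not prove the daemon is healthy, only that something accepted the connection.

```python
import socket

# Ports taken from the configuration files above.
PORTS = {
    "namenode web UI (dfs.http.address)": 50070,
    "resourcemanager (yarn.resourcemanager.address)": 8032,
    "scheduler (yarn.resourcemanager.scheduler.address)": 8030,
    "resource tracker (yarn.resourcemanager.resource-tracker.address)": 8031,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for label, port in PORTS.items():
    state = "listening" if port_open("localhost", port) else "not reachable"
    print(f"{label}: port {port} {state}")
```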

Finally check to see if you can run an MR application:

cd ${HADOOP_COMMON_HOME}/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.0.4-alpha.jar randomwriter out

HAPPY HACKING!!!

Installing Spark on Fedora 18

The Spark project is an actively-developed open-source engine for data analytics on clusters using Scala, Python, or Java. It offers map, filter, and reduce operations over in-memory collections, data from local files, or data taken from HDFS, but unlike standard map-reduce frameworks, it offers the opportunity to cache intermediate results across the cluster (and can thus offer orders-of-magnitude improvements over standard map-reduce when implementing iterative algorithms). I’ve been using it lately and have been really impressed. However — as with many cool projects in the “big data” space — the chain of dependencies to get a working installation can be daunting.

In this post, we’ll walk through setting up Spark to run on a stock Fedora 18 installation. We’ll also build the Mesos cluster manager so that we can run Spark jobs under Mesos, and we’ll build Hadoop with support for Mesos (so that we’ll have the option to run standard Hadoop MapReduce jobs under Mesos as well). By following these steps, you should be up and running with Spark quickly and painlessly.

Preliminary steps

You’ll first need to install some packages so that you’ll have all of the build dependencies for the packages you’ll want to install. You may have some of these already installed, depending on what Fedora installation type you’ve used and what packages you’ve already installed, but this list should cover bringing you from a minimal Fedora installation to one that can support building Mesos, Hadoop, and Spark.

First, we’ll install some essential tools that might not already be on your system:

sudo yum install -y git wget patch tar autoconf automake autotools libtool bzip2

Then, we’ll install some compilers, language environments, build tools, and support libraries:

sudo yum install -y gcc gcc-c++ python scala
sudo yum install -y java-devel python-devel zlib-devel libcurl-devel openssl-devel
sudo yum install -y ant maven maven2

We should now have all of the dependencies installed to build Mesos, Hadoop, and Spark.

Building Mesos

Set JAVA_HOME in your environment:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0/

Then create a working directory and clone the Mesos source repository. This will take a little while, since it’s a large (~180 MB) repository:

mkdir ~/build
cd ~/build
git clone https://github.com/apache/mesos.git

We’re going to be working from the 0.12.x branch of the Mesos repository. Check that out:

cd mesos
git checkout 0.12.x

Now we can actually build Mesos:

./bootstrap
./configure --with-webui --with-included-zookeeper --disable-perftools
make && sudo make install

Building and running Hadoop

You can run Spark jobs under Mesos without using Hadoop at all, and Spark running under Mesos can get data from HDFS even if the HDFS service isn’t itself running under Mesos. However, in order to run Hadoop MapReduce jobs under Mesos, we’ll need to build a patched version of Hadoop. If you already have Hadoop installed and running and are interested simply in running Spark against data in HDFS, you can skip this step, but if you don’t have Hadoop installed, installing it this way is simple and will give you greater flexibility in the future.

Building the patched Hadoop is straightforward, since patches and build scripts are bundled with Mesos. Simply run the following command from within your mesos directory and follow the prompts:

./hadoop/TUTORIAL.sh

It will explain what it is doing while downloading, patching, and building Hadoop. It will then run an example Hadoop MapReduce job to make sure everything is working properly and remind you that you’ll need to make changes to the MapReduce configuration before running MapReduce jobs on your Mesos cluster. Since we aren’t going to be running Hadoop MapReduce jobs under Mesos right away, we’ll skip that step for now. However, we will be making some minor configuration changes.

First, while you’re still in the mesos directory, change to the directory where you built your patched Hadoop:

cd hadoop/hadoop-0.20.205.0/

Then edit conf/hadoop-env.sh with your favorite editor, replacing the commented-out line that sets JAVA_HOME with the following:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0/

Finally, edit conf/core-site.xml and add the following property:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9100</value>
</property>

Now we’re ready to run Hadoop. Check to make sure that you can ssh to your local machine without a password, since the Hadoop startup scripts will do this several times and you will get tired of typing your password. If you can’t and you already have an SSH key pair, append your public key to your authorized_keys file and make sure it’s only readable by you. If you don’t already have an SSH key pair on your machine, you can simply type in the following commands for a local-only setup:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
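
The "only readable by you" condition can be checked programmatically. Here is a minimal Python sketch, with only_owner_can_access as an invented helper name; it runs against a temporary file rather than the real authorized_keys, so it is safe to try anywhere.

```python
import os
import stat
import tempfile


def only_owner_can_access(path):
    """True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstration on a temporary file rather than the real authorized_keys.
with tempfile.NamedTemporaryFile() as f:
    os.chmod(f.name, 0o644)
    print(only_owner_can_access(f.name))  # False
    os.chmod(f.name, 0o600)
    print(only_owner_can_access(f.name))  # True
```

To check the real file, pass os.path.expanduser("~/.ssh/authorized_keys") to the function.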

Now you’re ready to format your HDFS storage:

./bin/hadoop namenode -format

Next, start all of the Hadoop daemons:

cd ~/build/mesos/hadoop/hadoop-0.20.205.0/
./bin/start-all.sh

If you’re running as root (shame on you!), you’ll get an error that the -jvm option isn’t supported. You can work around this error by running sed -i 's/jvm //' bin/hadoop or — if you prefer to do things manually — editing bin/hadoop and replacing the line that reads

HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"

with this line:

HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"

If you had to make this change, run bin/stop-all.sh and then bin/start-all.sh again.
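
To see why the short sed expression is enough, here is its effect on the line in question, reproduced in Python. This sketch only mimics the substitution on a single string; it does not edit bin/hadoop.

```python
original = 'HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"'

# sed's s/jvm // without the g flag removes the first "jvm " on each line;
# str.replace with count=1 does the same for this single line.
patched = original.replace("jvm ", "", 1)

print(patched)
# HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
```

Removing "jvm " turns "-jvm server" into "-server". Since sed applies the expression to every line of the file, any other line containing "jvm " followed by a space would be changed as well, so it may be worth reviewing the file afterwards.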

Storing an input file to HDFS

As long as we’re still thinking about Hadoop, let’s put some data there so we can process it after we’ve built Spark. For our Spark example, we’ll work with a Common Log Format file (like an Apache HTTPD AccessLog). (You probably have one sitting around somewhere.) Mine is called ~/access_log. Here’s how I’ll load it into HDFS:

./bin/hadoop fs -mkdir input
./bin/hadoop fs -put ~/access_log input

I can check that it’s actually there:

./bin/hadoop fs -ls input | grep access_log

and, sure enough, I’ll get something like this in return:

-rw-r--r--   3 root supergroup  792036736 2013-04-10 14:32 /user/root/input/access_log

Shame on me, too, I guess. Now we’re ready to build Spark and run some example jobs against these data.

Building Spark

First, fetch the source tarball:

wget http://spark-project.org/files/spark-0.7.0-sources.tgz
tar -xzvf spark-0.7.0-sources.tgz
cd spark-0.7.0

In order to be able to fetch input data from HDFS, Spark needs to know what version of Hadoop you’re running before you build it. To do this, either open project/SparkBuild.scala in your favorite editor or run the following command:

sed -i 's/= "1.0.4"/= "0.20.205.0"/' project/SparkBuild.scala

Now we’ll use the Scala build tool to compile Spark:

./sbt/sbt package

We can now run a simple job to make sure Spark works. The SparkPi example uses a Monte Carlo method to approximate Pi:

export SCALA_HOME=/usr/share/scala
./run spark.examples.SparkPi local 1000
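
For readers unfamiliar with the Monte Carlo approach, here is the idea in plain, single-process Python. This is an illustrative sketch and not the SparkPi source; estimate_pi and its seed parameter are made up for the example. It samples random points in the square from -1 to 1 on both axes and uses the fraction that lands inside the unit circle.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area / square area = pi / 4.
    return 4.0 * inside / samples


print(estimate_pi(100_000))
```

The samples are independent of one another, which is what makes this kind of computation easy to spread across many workers.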

Now you should have a working Spark installation and can run some jobs locally. We’ll look at an example interaction with Spark next.

Running an example Spark job

The Spark shells use the MASTER environment variable to determine how to run jobs. Set it to local for now:

export MASTER=local

This will run your jobs locally, using only one thread. We’ll also want to enable Spark to use more memory if we’re going to work with a large dataset:

export SPARK_MEM=2g

If you want to use more or less than 2 gigabytes, change that setting appropriately; it uses the same format as Java heap size arguments. Now we’ll start up the Python Spark environment:

./pyspark

You’ll find yourself in a Python REPL with a SparkContext object initialized in the sc variable. Now, create a Spark resilient distributed dataset from the lines in the log file we uploaded to HDFS (noting, of course, that your URL may be different depending on how you stored the file):

log_entries = sc.textFile("hdfs://localhost:9100/user/root/input/access_log")

We’ll then write a short series of operations to count the number of log entries for each remote host:

ips = log_entries.map(lambda s: s.split()[0])
ips.map(lambda ip: (ip, 1)).reduceByKey(lambda a, b: a + b).collect()

This example won’t benefit all that much from Spark’s caching support, but it will run faster in parallel. If you have many cores in your machine, try using them! Here’s how you’d set MASTER if you wanted to use 8 threads:

export MASTER='local[8]'

Exit out of pyspark and try running the example again with more threads, if you feel so inclined.
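
If it is unclear what the per-host count from the pyspark session computes, the same result can be reproduced on a few lines without Spark. The sketch below uses made-up log lines with documentation-range addresses; collections.Counter stands in for the map to (ip, 1) pairs followed by reduceByKey.

```python
from collections import Counter

# Made-up Common Log Format lines; the addresses are documentation examples.
sample_lines = [
    '192.0.2.1 - - [10/Apr/2013:14:32:00 -0500] "GET / HTTP/1.1" 200 512',
    '192.0.2.7 - - [10/Apr/2013:14:32:01 -0500] "GET /a HTTP/1.1" 200 128',
    '192.0.2.1 - - [10/Apr/2013:14:32:02 -0500] "GET /b HTTP/1.1" 404 64',
]

# Same first step as the Spark job: the remote host is the first field.
ips = [line.split()[0] for line in sample_lines]

# Counter plays the role of map-to-(ip, 1) followed by reduceByKey.
counts = Counter(ips)

print(sorted(counts.items()))
# [('192.0.2.1', 2), ('192.0.2.7', 1)]
```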

Next steps

Now that you have Spark up and running, here are some things to consider trying:

Reprocessing CMS events with Bosco

Prior to the LHC long shutdown, the CMS experiment increased the trigger rate of the detector, increasing the data coming off the detector.  The Tier-0 was unable to process all of the events coming off of the detector, so the events were only stored and not processed.  After the run, the experiment wanted to process the backlog of events, but didn't have the computing power available to do it.  So they turned to opportunistic computing and Bosco.

The CMS collaborators at UCSD worked with the San Diego Supercomputer Center to run the processing on the Gordon supercomputer.  Gordon is an XSEDE resource and does not include a traditional OSG Globus Gatekeeper.  Also, we did not have root access to the cluster to install a gatekeeper.  Therefore, Bosco was used to submit GlideinWMS Condor glideins to the resource and manage them.

Running jobs at Gordon, the SDSC supercomputer


As you can see from the graph, we reached nearly 4,000 CMS processing jobs on Gordon.  4k cores is larger than most CMS Tier-2s, and as big as a European Tier-1.  With Bosco, overnight, Gordon became one of the largest CMS clusters in the world.

Full details will be written in a submitted paper to CHEP '13 in Amsterdam, and Bosco will be presented in a poster (and paper) as well.  I hope to see you there!

(If I got any details wrong about the CMS side of this run, please let me know.  I have intimate knowledge of the Gordon side, but not so much the CMS side).


Bosco Download

HTCondor 7.8.8 released! (March 28, 2013)

The HTCondor team is pleased to announce the release of HTCondor 7.8.8. This release contains bug fixes for reconnection failure when using CCB, introduces automatic retries for some glexec errors, and fixes several other grid related bugs. A complete list of bugs fixed can be found in the Version History. HTCondor 7.8.8 binaries and source code are available from our Downloads page.

Smooth Gradients for Cubic Hermite Splines

One of the advantages of cubic Hermite splines is that their interval interpolation formula is an explicit function of gradients \( m_0, m_1, \ldots, m_{n-1} \) at knot-points: [...]

Paradyn/HTCondor Week 2013 registration open (March 8, 2013)

We want to invite you to HTCondor Week 2013, our annual HTCondor user conference, in beautiful Madison, WI, April 29-May 3, 2013. (HTCondor Week was formerly named Condor Week, matching a name change for the software.) We will again host HTCondor Week at the Wisconsin Institutes for Discovery, a state-of-the-art facility for academic and private research specifically designed to foster private and public collaboration. It provides HTCondor Week attendees a compelling environment to attend tutorials and talks from HTCondor developers and users like you. It also provides many comfortable spaces for one-on-one or small group collaborations throughout the week.

This year we continue our partnership with the Paradyn Tools Project, making this year Paradyn/HTCondor Week 2013. There will be a full slate of tutorials and talks for both HTCondor and Paradyn. Our current development series, 7.9, is well underway toward our upcoming production release. When you attend, you will learn how to take advantage of the latest features such as per-job PID namespaces, cgroup enforced resource limits, Python bindings, CPU affinity, BOSCO for submitting jobs to remote batch systems without administrator assistance, EC2 spot instance support, and a variety of speed and memory optimizations. You'll also get a peek into our longer term development plans--something you can only get at HTCondor Week!

We will have a variety of in-depth tutorials, talks, and panels where you can not only learn more about HTCondor, but also learn how other people are using and deploying HTCondor. Best of all, you can establish contacts and learn best practices from people in industry, government and academia who are using HTCondor to solve hard problems, many of which may be similar to those facing you.

Speaking of learning from the community, we'd love to have you give a talk at HTCondor Week. Talks are 20 minutes long and are a great way to share your ideas and get feedback from the community. If you have a compelling use of HTCondor you'd like to share, let Alan De Smet know (adesmet@cs.wisc.edu) and he'll help you out. More information on speaking at HTCondor Week is available at the HTCondor Week web site. You can register, get the hotel details and see the agenda overview on the HTCondor Week 2013 site. See you soon in Madison!

HTCondor 7.9.4 released! (February 20, 2013)

The HTCondor team is pleased to announce the release of HTCondor 7.9.4. This release supports per job PID namespaces for Linux RHEL 6, improvements to the resource usage of the EC2 GAHP, support for capping the size of input and output file transfer, and new analysis modes for condor_q -analyze. A complete list of bugs fixed and features can be found in the Version History. HTCondor 7.9.4 binaries and source code are available from our Downloads page.

Statistic changes in HTCondor 7.7

Notice to HTCondor 7.8 users - Statistics implemented during the 7.5 series that landed in 7.7.0 were rewritten by the time 7.8 was released. If you were using the original statistics for monitoring and/or reporting, here is a table to help you map old (left column) to new (right column). See – 7.6 -> 7.8 [...]

How accounting group configuration could work with Wallaby

Configuration of accounting groups in HTCondor is too often an expert task that requires coordination between administrators and their tools. Wallaby provides a coordination point, so long as a little convention is employed, and can provide a task specific interface to simplify configuration. Quick background, Wallaby provides semantic configuration for HTCondor. It models a pool [...]

Some htcondor-wiki stats

A few years ago I discovered Web Numbr, a service that will monitor a web page for a number and graph that number over time. I installed a handful of webnumbrs to track things at HTCondor’s gittrac instance. http://webnumbr.com/search?query=condor Things such as - Tickets resolved with no destination: tickets that don’t indicate what version they [...]

Concurrency Limits: Group defaults

Concurrency limits allow for protecting resources by providing a way to cap the number of jobs requiring a specific resource that can run at one time. For instance, limit licenses and filer access at four regional data centers. Notice the repetition. In addition to the repetition, every license.* and filer.* must be known and recorded [...]

Your API is a feature, give it real resource management

So much these days is about distributed resource management. That’s anything that can be created and destroyed in the cloud[0]. Proper management is especially important when the resource’s existence is tied to a real economy, e.g. your user’s credit card[1]. Above is a state machine required to ensure that resources created in AWS EC2 are [...]

The Mean of the Modulus Does Not Equal the Modulus of the Mean

I've been considering models for the effects of HTCondor negotiation cycle cadence on pool loading and accounting group starvation, which led me to thinking about the effects of taking the modulus of a random variable, for reasons I plan to discuss in future posts. [...]

A Demonstration of Negotiator-Side Resource Consumption

HTCondor supports a notion of aggregate compute resources known as partitionable slots (p-slots), which may be consumed by multiple jobs. Historically, at most one job could be matched against such a slot in a single negotiation cycle, which limited the rate at which partitionable slot resources could be utilized. More recently, the scheduler has been enhanced with logic to allow it to acquire multiple claims against a partitionable slot, which increases the p-slot utilization rate. However, as this potentially bypasses the negotiator's accounting of global pool resources such as accounting group quotas and concurrency limits, it places some constraints on what jobs can safely acquire multiple claims against any particular p-slot: for example, only other jobs on the same scheduler can be considered. Additionally, candidate job requirements must match the requirements of the job that originally matched in the negotiator. Another significant impact is that the negotiator is still forced to match an entire p-slot, which may have a large match cost (weight): these large match costs cause accounting difficulties when submitter shares and/or group quotas drop below the cost of a slot. This particular problem is growing steadily larger, as machines with ever-larger numbers of cores and other resources appear in HTCondor pools. [...]

Role enforcement in Cumin

Roles in Cumin scope activities and content in the UI. There are currently two roles defined in Cumin, admin and user. The admin role is a superset of the user role, and every new account has the user role by default. [...]

Best practices for Wallaby's default group

Recall that Wallaby applies partial configurations to groups of nodes. Groups can be either explicit, that is, a named subset of nodes created by the user, or special groups that are built in to Wallaby; each node’s group memberships have...

Welcome To The HTCondor Project Github Site

Welcome to the HTCondor Project GitHub website! This site is the github web and blog presence for the HTCondor project. [...]

Configuring high-availability Condor central managers with Wallaby

Rob Rati and I gave a tutorial on highly-available job queues at Condor Week this year. While it was not a Wallaby-specific tutorial, we did point out that configuring highly-available job queues is easier for users who manage and deploy...

Using Cluster Suite's GUI to configure High Availability Schedulers

In an earlier post I talked about using Cluster Suite to manage high availability schedulers and referenced the command line tools available to perform the configuration. I'd like to focus on using the GUI that is part of Cluster Suite to configure an HA schedd. It's a pretty simple process but does require you to run a wallaby shell command to complete the configuration. [...]

Using Cluster Suite to Manage a High Availability Scheduler

Condor provides simple and easy to configure HA functionality for the schedd that relies upon shared storage (usually NFS). The shared store is used to store the job queue log and coordinate which node is running the schedd. This means that each node that can run a particular schedd must not only have Condor configured but also be configured to access the shared storage. [...]

Integrating Cumin with LDAP for Authentication

Past versions of Cumin have relied on a local database for storing user accounts. However, that solution adds extra maintenance for site administrators who already have or plan to have a central authentication mechanism for their users. Consequently, development is ongoing to integrate Cumin with common central auth mechanisms. LDAP integration is available now, with support for other technologies planned for the future. [...]

So What is Cumin Anyway?

Cumin is a Python web UI developed in the Fedora community for managing Condor pools and Qpid messaging brokers. It is packaged for Fedora but may be run from sources and would probably be easy to port to other Linux distributions (or just run Fedora on a node or two in a heterogeneous environment!) The current development focus for Cumin is on expanding the Condor management facilities. [...]

Authorization for Wallaby clients

Wallaby 0.16.0, which updates the Wallaby API version to 20101031.6, includes support for authorizing broker users with various roles that can interact with Wallaby in different ways. This post will explain how the authorization support works and show how to get started using it. If you just want to get started using Wallaby with authorization support as quickly as possible, skip ahead to the section titled “Getting Started” below. Detailed information about which role is required for each Wallaby API method is available here. [...]

Highly-available configuration data with Wallaby

Many Condor users are interested in high-availability (HA) services: they don’t want their compute resources to become unavailable due to the failure of a single machine that is running an important Condor daemon. (See this talk that Rob Rati and I gave at Condor Week this year for a couple of solutions to HA with the Condor schedd.) So it’s only natural that Condor users who are interested in configuring their pools with Wallaby might wonder how Wallaby responds in the face of failure. [...]

Using the skeleton group

In Wallaby, Condor nodes are configured by applying features and parameter settings to groups. In order for the group abstraction to be fully general, Wallaby provides two kinds of special groups: the default group, which contains every node (but which is the lowest-priority membership for each node), and a set of identity groups, each of which only contains a single node (and which is always its highest-priority membership, so that special settings applied to a node’s identity group always take precedence over settings from that node’s other memberships). [...]

Troubleshooting Condor with Wallaby

Often, if you’re trying to reproduce a problem someone else is having with Condor, you’ll need their configuration. Likewise, if you’re trying to help someone reproduce a problem you’re having, you’ll want to send along your configuration to aid them in replicating your setup. For installations that use legacy flat-file configurations (optionally with a local configuration directory), this can be a pain, since you’ll need to copy several files from site to site (ensuring that you’ve included all the files necessary to replicate your configuration, perhaps across multiple machines on the site experiencing the problem). [...]