HTCondor Project

The website and blog for the HTCondor project on github.

Planet HTCondor

Hiding all the details: Grid jobs in Docker

For a long time, HTCondor has strived to have the job runtime environment be run and defined by the submit host. But, that is surprisingly difficult to do. There are many reasons why the environment should be controlled by the submit host, for example:

  1. Enable OS updates independent of the job environment. The sysadmins may want to run newer operating systems.
  2. Allow users to define their own execution environment. Many applications require many dependencies that can be packaged together.

Previously we would have considered virtual machines. But… virtual machines are difficult to author and maintain. Virtual machines tend to be large (in GBs). And they can have potentially large overheads, especially for IO.

In 2013, Greg Thain presented putting users in a box for job isolation. It used three technologies in order to enable the isolation:

  1. PID Namespaces - Isolation of the job processes from other jobs. Also, job processes cannot view system processes.
  2. CHROOTs - Create an isolated filesystem to protect the system from modifications.
  3. CGroups - Control Groups to isolate resource usage between jobs and the system.

We used these chroot in Nebraska’s transition from RHEL5 to RHEL6. But the chroot capability has degraded over time since it is difficult to author and maintain raw chroots.

A New Approach

chroot, namespaces, and cgroups are all part of Docker’s containerization solution. Docker provides a very approachable way to compose and publish images. Further, we don’t need to maintain a RHEL6 image, only our local customizations on top.

We decided to use HTCondor’s new Docker universe. We want to trnasform incomping grid jobs into Docker universe jobs.

Why Docker

We chose Docker over Virtual Machines due to potential IO bottlenecks that have been identified in recent publications.

Docker Vs VM IO performance


Our host environment consists of:

  • CentOS 7.2: This is our admin’s preferred OS.
  • Docker v1.9.1: Default version of Docker for RHEL7.
  • HTCondor 8.5.4: Contains a few useful bug fixes and new features over the current stable series.

The default container is based off of CentOS 6. It includes the OSG WN packages, gcc, glibc-headers… for various system depenedencies for the CMS project. Here is the Docker Hub.

The full DockerFile is on Github and below.

FROM centos:centos6

RUN yum -y install && \
    yum -y install epel-release && \
    yum -y install osg-wn-client osg-wn-client-glexec cvmfs && \
    yum -y install glibc-headers && \
    yum -y install gcc && \
    yum -y install redhat-lsb-core sssd-client && \
    yum clean all && \
    yum -y update

# Create condor user and group
RUN groupadd -r condor && \
    useradd -r -g condor -d /var/lib/condor -s /sbin/nologin condor

# Add lcmaps.db
COPY lcmaps.db /etc/lcmaps.db

RUN yum -y install openssh-clients && yum clean all

That’s it, that’s all of the DockerFile.

There are a few important directories from the host that need to be available to the container - for example, the HDFS-based storage system. Docker refers to these as volume mounts. Currently, we bring in a total of 6 different directories. Most volumes are marked read only - no need for the jobs to write to these. Exception is SSSD: need to write to a Unix socket to lookup usernames.

DOCKER_VOLUME_DIR_CVMFS         = /cvmfs:/cvmfs:ro
DOCKER_VOLUME_DIR_ETC_CVMFS     = /etc/cvmfs:/etc/cvmfs:ro
DOCKER_VOLUME_DIR_HDFS          = /mnt/hadoop:/mnt/hadoop:ro
DOCKER_VOLUME_DIR_GRID_SECURITY = /etc/grid-security:/etc/grid-security:ro
DOCKER_VOLUME_DIR_SSSD          = /var/lib/sss/pipes/nss
DOCKER_VOLUME_DIR_NSSWITCH      = /etc/nsswitch.conf:/etc/nsswitch.conf:ro

OSG Flow

The HTCondor-CE accepts jobs into the cluster from external submitters.

  1. GlideinWMS factories submit to the HTCondor-CE
  2. The Job Router component transforms the CE job to use Docker universe.
    • Surprisingly, no new JobUniverse.
    • Sets DockerImage.
    • Changes the Cmd string.

Snippets from the condor_job_router transform language

  • Cmd needs to be prepended with ./:

    copy_Cmd = "orig_Cmd"
    eval_set_Cmd = ifThenElse(regexp("^/", orig_Cmd), orig_Cmd, strcat("./",orig_Cmd))
  • DockerImage needs to be set:

    copy_DockerImage = "orig_DockerImage"
    eval_set_DockerImage = ifThenElse(isUndefined(orig_DockerImage),

Current Status

Running Production CMS and OSG jobs on Nebraska’s CMS Tier 2. Currently ~10% of the Nebraska Tier 2 is Docker-enabled. Will be expanding to the entire cluster in the coming weeks: goal is to be done by the end-of-summer. Next step is to further explore how to (safely) expose this capability to OSG VOs and users.

Future Directions

HTCondor treats all Docker images the same. We want to differentiate the images that come from the “good guys” (us) versus the “bad guys” (users). Still uncomfortable with the idea of allowing users to request arbitrary images. RHEL7.2 includes various sandboxing mechanisms: there’s no (publicly) known ways to break out, but the track record is relatively poor.


Things that would simplify our setup:

  1. Pass resource accounting (CPU, memory usage) from Docker to HTCondor. Scheduled for 8.5.5.
  2. Avoid prepending ./ to the Cmd.
  3. Make volume mounts conditional: we only want to expose HDFS and SSSD to CMS jobs.
  4. Ability to whitelist particular images - evaluated on worker node!
  5. Ability to mark jobs in “untrusted images” with the Linux “NO_NEW_PRIVS” flag (prevents setuid).


Notes for my HTCondor Week talk

I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that HTCondor users who aren’t already leading data science initiatives are well-equipped to start doing so. The talk is brief and high-level, so here are a few quick links to learn more if you’re interested:

I also gave a quick overview of some of my team’s recent data science projects; visit these links to learn more:

HTCondor Week tutorials free to UW-Madison faculty, staff and students ( May 11, 2016 )

Just a reminder that the Tuesday tutorials at HTCondor Week are free to UW-Madison faculty, staff and students. If you have any interest in using HTCondor and Center for High Throughput Computing resources, we'd love to see you on Tuesday. However, as everyone knows, there ain't no such thing as a free lunch -- people who attend the tutorials without paying are not eligible for the lunch (you can still get snacks at the breaks, though).

Silex 0.0.10

My team and I are pleased to announce the latest release of our Silex library, featuring cool new functionality from all of the core contributors. Silex is a library of reusable components for Apache Spark factored out of our data science work in Red Hat’s Emerging Technology group. You can:

Enjoy, and let us know how you’re finding it useful!

Log analytics talk at Apache: Big Data

As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a virtual handout for that talk. (I’ll also be presenting another talk on scalable log data analysis later this summer. That talk is also inspired by my recent work with logs but will focus on different parts of the problem, so stay tuned if you’re interested in the domain!)

Some general links:

  • You can download a PDF of my slide deck. I recognize that people often want to download slides, although I’d prefer you look at the rest of this post instead since my slides are not intended to stand alone without my presentation.
  • Check out my team’s Silex library, which is intended to extend the standard Spark library with high-quality, reusable components for real-world data science. The most recent release includes the self-organizing map implementation I mentioned in my talk.
  • Watch this short video presentation showing some of the feature engineering and dimensionality-reduction techniques I discussed in the talk.

The following blog posts provide a deeper dive into some of the topics I covered in the talk:

  • When I started using Spark and ElasticSearch, the upstream documentation was pretty sparse (it was especially confusing because it required some unidiomatic configuration steps). So I wrote up my experiences getting things working. This is an older post but may still be helpful.
  • If you’re interested in applying natural-language techniques to log data, you should consider your preprocessing pipeline. Here are the choices I made when I was evaluating word2vec on log messages.
  • Here’s a brief (and not-overly technical) overview of self-organizing maps, including static visual explanations and an animated demo.

Random Forest Clustering of Machine Package Configurations in Apache Spark

In this post I am going to describe some results I obtained for clustering machines by which RPM packages that were installed on them. The clustering technique I used was Random Forest Clustering.

The Data

The data I clustered consisted of 135 machines, each with a list of installed RPM packages. The number of unique package names among all 135 machines was 4397. Each machine was assigned a vector of Boolean values: a value of 1 indicates that the corresponding RPM was installed on that machine. This means that the clustering data occupied a space of nearly 4400 dimensions. I discuss the implications of this later in the post, and what it has to do with Random Forest Clustering in particular.

For ease of navigation and digestion, the remainder of this post is organized in sections:

Introduction to Random Forest Clustering
        (The Pay-Off)
Package Configuration Clustering Code
Clustering Results

Random Forests and Random Forest Clustering

Full explainations of Random Forests and Random Forest Clustering could easily occupy blog posts of their own, but I will attempt to summarize them briefly here. Random Forest learning models per se are well covered in the machine learning community, and available in most machine learning toolkits. With that in mind, I will focus on their application to Random Forest Clustering, as it is less commonly used.

A Random Forest is an ensemble learning model, consisting of some number of individual decision trees, each trained on a random subset of the training data, and which choose from a random subset of candidate features when learning each internal decision node.

Random Forest Clustering begins by training a Random Forest to distinguish between the data to be clustered, and a corresponding synthetic data set created by sampling from the marginal distributions of each feature. If the data has well defined clusters in the joint feature space (a common scenario), then the model can identify these clusters as standing out from the more homogeneous distribution of synthetic data. A simple example of what this looks like in 2 dimensional data is displayed in Figure 1, where the dark red dots are the data to be clustered, and the lighter pink dots represent synthetic data generated from the marginal distributions:

Figure 1

Each interior decision node, in each tree of a Random Forest, typically divides the space of feature vectors in half: the half-space <= some threshold, and the half-space > that threshold. The result is that the model learned for our data can be visualized as rectilinear regions of space. In this simple example, these regions can be plotted directly over the data, and show that the Random Forest did indeed learn the location of the data clusters against the background of synthetic data:

Figure 2

Once this model has been trained, the actual data to be clustered are evaluated against this model. Each data element navigates the interior decision nodes and eventually arrives at a leaf-node of each tree in the Random Forest ensemble, as illustrated in the following schematic:

Figure 3

A key insight of Random Forest Clustering is that if two objects (or, their feature vectors) are similar, then they are likely to arrive at the same leaf nodes more often than not. As the figure above suggests, it means we can cluster objects by their corresponding vectors of leaf nodes, instead of their raw feature vectors.

If we map the points in our toy example to leaf ids in this way, and then cluster the results, we obtain the following two clusters, which correspond well with the structure of the data:

Figure 4

A note on clustering leaf ids. A leaf id is just that -- an identifier -- and in that respect a vector of leaf ids has no algebra; it is not meaningful to take an average of such identifiers, any more than it would be meaningful to take the average of people's names. Pragmatically, what this means is that the popular k-means clustering algorithm cannot be applied to this problem.

These vectors do, however, have distance: for any pair of vectors, add 1 for each corresponding pair of leaf ids that differ. If two data elements arrived at all the same leafs in the Random Forest model, all their leaf ids are the same, and their distance is zero (with respect to the model, they are the same). Therefore, we can apply k-medoids clustering.

The Pay-Off

What does this somewhat indirect method of clustering buy us? Why not just cluster objects by their raw feature vectors?

The problem is that in many real-world cases (unlike in our toy example above), feature vectors computed for objects have many dimensions -- hundreds, thousands, perhaps millions -- instead of the two dimensions in this example. Computing distances on such objects, necessary for clustering, is often expensive, and worse yet the quality of these distances is frequently poor due to the fact that most features in large spaces will be poorly correlated with any structure in the data. This problem is so common, and so important, it has a name: the Curse of Dimensionality.

Random Forest Clustering, which clusters on vectors of leaf-node ids from the trees in the model, side-steps the curse of dimensionality because the Random Forest training process, by learning where the data is against the background of the synthetic data, has already identified the features that are useful for identifying the structure of the data! If any particular feature was poorly correlated with that struture, it has already been ignored by the model. In other words, a Random Forest Clustering model is implicitly examining exactly those features that are most useful for clustering , thus providing a cure for the Curse of Dimensionality.

The machine package configurations whose clustering I describe for this post are a good example of high dimensional data that is vulnerable to the Curse of Dimensionality. The dimensionality of the feature space is nearly 4400, making distances between vectors potentially expensive to evaluate. Any individual feature contributes little to the distance, having to contend with over 4000 other features. Installed packages are also noisy. Many packages, such as kernels, are installed everywhere. Others may be installed but not used, making them potentially irrelevant to grouping machines. Furthermore, there are only 135 machines, and so there are far more features than data examples, making this an underdetermined data set.

All of these factors make the machine package configuration data a good test of the strenghts of Random Forest Clustering.

Package Configuration Clustering Code

The implementation of Random Forest Clustering I used for the results in this post is a library available from the silex project, a package of analytics libraries and utilities for Apache Spark.

In this section I will describe three code fragments that load the machine configuration data, perform a Random Forest clustering, and format some of the output. This is the code I ran to obtain the results described in the final section of this post.

The first fragment of code illustrates the logistics of loading the feature vectors from file train.txt that represent the installed-package configurations for each machine. A corresponding "parallel" file nodesclean.txt contains corresponding machine names for each vector. A third companion file rpms.txt contains names of each installed package. These are used to instantiate a specialized Scala function (InvertibleIndexFunction) between feature indexes and human-readable feature names (in this case, names of RPM packages). Finally, another specialized function (Extractor) for instantiating Spark feature vectors is created.

Note: Extractor and InvertibleIndexFunction are also component libraries of silex

```scala // Load installed-package feature vectors val fields = spark.textFile(s"$dataDir/train.txt").map(_.split(" ").toVector)

// Pair feature vectors with machine names val nodes = spark.textFile(s"$dataDir/nodesclean.txt").map { _.split(" ")(1) } val ids = fields.paste(nodes)

// Load map from feature indexes to package names val inp = spark.textFile(s"$dataDir/rpms.txt").map(.split(" ")) .map(r => (r(0).toInt, r(1))) .collect.toVector.sorted val nf = InvertibleIndexFunction(

// A feature extractor maps features into sequence of doubles val m = fields.first.length - 1 val ext = Extractor(m, (v: Vector[String]) => :FeatureSeq) .withNames(nf) .withCategoryInfo(IndexFunction.constant(2, m)) ```

The next section of code is where the work of Random Forest Clustering happens. A RandomForestCluster object is instantiated, and configured. Here, the configuration is for 7 clusters, 250 synthetic points (about twice as many synthetic points as true data), and a Random Forest of 20 trees. Training against the input data is a simple call to the run method.

The predictWithDistanceBy method is then applied to the data paired with machine names, to yield tuples of cluster-id, distance to cluster center, and the associated machine name. These tuples are split by distance into data with a cluster, and data considered to be "outliers" (i.e. elements far from any cluster center). Lastly, the histFeatures method is applied, to examine the Random Forest Model and identify any commonly-used features.

```scala // Train a Random Forest Clustering Model val rfcModel = RandomForestCluster(ext) .setClusterK(7) .setSyntheticSS(250) .setRfNumTrees(20) .setSeed(37) .run(fields)

// Evaluate to get tuples: (cluster, distance, machine-name) val cid = => x))

// Split by closest distances into clusters and outliers
val (clusters, outliers) = cid.splitFilter { case (, dist, ) => dist <= 5 }

// Generate a histogram of features used in the RF model val featureHist = rfcModel.randomForestModel.histFeatures(ext.names) ```

The final code fragment simply formats clusters and outliers into a tabular form, as displayed in the next section of this post. Note that there is neither Spark nor silex code here; standard Scala methods are sufficient to post-process the clustering data:

```scala // Format clusters for display val clusterStr = { case (j, d, n) => (j, (d, n)) } .groupByKey .collect .map { case (j, nodes) => { case (d, n) => s"$d  $n" }.mkString("\n")

} .mkString("\n\n")

// Format outliers for display val outlierStr = outliers.collect .map { case (_, d,n) => (d, n) } .toVector.sorted .map { case (d, n) => s"$d $n" } .mkString("\n") ```

Package Configuration Clustering Results

The result of running the code in the previous section is seven clusters of machines. In the following files, the first column represents distance from the cluster center, and the second is the actual machine's node name. A cluster distance of 0.0 indicates that the machine was indistinguishable from cluster center, as far as the Random Forest model was concerned. The larger the distance, the more different from the cluster's center a machine was, in terms of its installed RPM packages.

Was the clustering meaningful? Examining the first two clusters below is promising; the machine names in these clusters are clearly similar, likely configured for some common task by the IT department. The first cluster of machines appears to be web servers and corresponding backend services. It would be unsurprising to find their RPM configurations were similar.

The second cluster is a series of executor machines of varying sizes, but presumably these would be configured similarly to one another.

The second pair of clusters (3 & 4) are small. All of their names are similar (and furthermore, similar to some machines in other clusters), and so an IT administrator might wonder why they ended up in oddball small clusters. Perhaps they have some spurious, non-standard packages installed that ought to be cleaned up. Identifying these kinds of structure in a clustering is one common clustering application.

Cluster 5 is a series of bugzilla web servers and corresponding back-end bugzilla data base services. Although they were clustered together, we see that the web servers have a larger distance from the center, indicating a somewhat different configuration.

Cluster 6 represents a group of performance-related machines. Not all of these machines occupy the same distance, even though most of their names are similar. These are also the same series of machines as in clusters 3 & 4. Does this indicate spurious package installations, or some other legitimate configuration difference? A question for the IT department...

Cluster 7 is by far the largest. It is primarily a combination of OpenStack machines and yet more perf machines. This clustering was relatively stable -- it appeared across multiple independent clustering runs. Because of its stability I would suggest to an IT administrator that the performance and OpenStack machines are sharing some configuration similarities, and the performance machines in other clusters suggest that there might be yet more configuration anomalies. Perhaps these were OpenStack nodes that were re-purposed as performance machines? Yet another question for IT...


This last grouping represents machines which were "far" from any of the previous cluster centers. They may be interpreted as "outliers" - machines that don't fit any model category. Of these the node frodo is clearly somebody's personal machine, likely with a customized or idiosyncratic package configuration. Unsurprising that it is farthest of all machines from any cluster, with distance 9.0. The jenkins machine is also somewhat unique among the nodes, and so perhaps not surprising that its registers as anomalous. The remaining machines match node series from other clusters. Their large distance is another indication of spurious configurations for IT to examine.

I will conclude with another useful feature of Random Forest Models, which is that you can interrogate them for information such as which features were used most frequently. Here is a histogram of model features (in this case, installed packages) that were used most frequently in the clustering model. This particular histogram i sinteresting, as no feature was used more than twice. The remaining features were all used exactly once. This is a bit unusual for a Random Forest model. Frequently some features are used commonly, with a longer tail. This histogram is rather "flat," which may be a consequence of there being many more features (over 4000 installed packages) than there are data elements (135 machines). This makes the problem somewhat under-determined. To its credit, the model still achieves a meaningful clustering.

Lastly I'll note that full histogram length was 186; in other words, of the nearly 4400 installed packages, the Random Forest model used only 186 of them -- a tiny fraction! A nice illustration of Random Forest Clustering performing in the face of high dimensionality!

Red Hat Data Science talks at Apache: Big Data 2016

If you’ll be at Apache: Big Data next week, you should definitely check out some talks from my teammates in Red Hat’s Emerging Technology group and our colleague Suneel Marthi from the CTO office:

Unfortunately, my talk is at the same time as Suneel’s, so I won’t be able to attend his, but these are all great talks and you should be sure to put as many as possible on your schedule if you’ll be in Vancouver!

Building CentOS packages on Travis-CI

Build Status

The Travis-CI Continuous Integration service is great for building and testing software for each commit. But, it is limited to only supporting builds and tests on the Ubuntu OS. The OSG, on the other hand, only supports the EL6 and EL7 family of OS’s (such as CentOS, Scientific Linux, and RHEL). With the recent move of all OSG internal software projects to Github, we have the opportunity to utilize Travis-CI infrastructure to build and test each change to our software.

In this post, I hope to describe how we used Docker on Travis-CI to create a CentOS 6 and 7 environment to build and test OSG software.

Creating the .travis.yml

Any Travis-CI build requires a .travis.yml file in the top level directory of your Github repoistory. It is used to describe how to build and test your software. We adapted a .travis.yml from Ansible testing.

sudo: required
  - OS_TYPE=centos OS_VERSION=6
  - OS_TYPE=centos OS_VERSION=7
  - docker
  - sudo apt-get update
  - echo 'DOCKER_OPTS="-H tcp:// -H unix:///var/run/docker.sock -s devicemapper"' | sudo tee /etc/default/docker > /dev/null
  - sudo service docker restart
  - sleep 5
  - sudo docker pull centos:centos${OS_VERSION}

 # Run tests in Container
- tests/ ${OS_VERSION}

In the .travis.yml file above, first we require sudo access so that we can start a Docker image. We want to build and test the packages on both CentOS 6 and 7, so we create a matrix so that Travis-CI will create 2 builds for each change in the repo, one for CentOS 6, and the other CentOS 7. Next, we require the docker service to be available so that we can start our image.

In the before_install, we set some docker options (which may not be necessary) and download the CentOS docker image from Docker Hub. Finally, in the script section, we run another script that will start the docker images.

Setting up the Tests

CentOS 6 and 7 are significantly different and require different docker startup procedures to get a usable system for testing OSG software. This includes starting systemd in CentOS 7, which is necessary to test services.

#!/bin/sh -xe

# This script starts docker and systemd (if el7)

# Version of CentOS/RHEL

 # Run tests in Container
if [ "$el_version" = "6" ]; then

sudo docker run --rm=true -v `pwd`:/htcondor-ce:rw centos:centos${OS_VERSION} /bin/bash -c "bash -xe /htcondor-ce/tests/ ${OS_VERSION}"

elif [ "$el_version" = "7" ]; then

docker run --privileged -d -ti -e "container=docker"  -v /sys/fs/cgroup:/sys/fs/cgroup -v `pwd`:/htcondor-ce:rw  centos:centos${OS_VERSION}   /usr/sbin/init
DOCKER_CONTAINER_ID=$(docker ps | grep centos | awk '{print $1}')
docker exec -ti $DOCKER_CONTAINER_ID /bin/bash -xec "bash -xe /htcondor-ce/tests/ ${OS_VERSION};
  echo -ne \"------\nEND HTCONDOR-CE TESTS\n\";"
docker ps -a


In the file above, we have two different startups for CentOS 6 and 7. For both startups we mount the repo, in this case the HTCondor-CE, so that the docker image has access to the repo files when it builds and tests the software.

For CentOS 6, the startup is simple. The docker image is run, and the only command is to run the script, which we will describe in the next section.

For CentOS 7, we must first start docker in privileged mode so that systemd may see and use the cgroup device. Our initial docker run command only starts /usr/sbin/init, which is systemd. Next, it starts our script, which will start systemd services. When the tests have completed, it will stop and remove the docker image.

Running the Tests

Finally, running tests on the software repository is completely dependent on the software being tested.

A full test file can be found in the HTCondor-CE Repo.

  1. Clean the yum cache.
  2. Install the EPEL and OSG repositories
  3. Install RPMs required for building the software.
  4. Build and package the software in RPMs
  5. Install the newly package RPMs
  6. Run the osg-test integration tests against the new packages.

It should be noted that all of the above scripts run bash with the arguments -xe. The x means to print each line before executing it, useful for debugging. The e means to exit the bash script immediately if any command returns a non-zero exit status. Since these scripts are designed to test software, we want to capture any faults in the tests or testing infrastructure.


By moving to Github for OSG’s software repositories, we have made it easy to build and test each change to repos. Additionally, we can fun full integration tests on each package for each change. This has the potential to catch many errors.

Here is a list of OSG builds using the above configuration:

HTCondor 8.5.4 released! ( May 2, 2016 )

The HTCondor team is pleased to announce the release of HTCondor 8.5.4. This development series release contains new features that are under development. This release contains all of the bug fixes from the 8.4.6 stable release. For sites using partitionable slots, one of the fixes from the 8.4.6 release corrects a serious regression in version 8.5.3. Highlights of the release are: Fixed a bug that delays schedd response when significant attributes change; Fixed a bug where the group ID was not set in Docker universe jobs; Limit update rate of various attributes to not overload the collector; To make job router configuration easier, added implicit "target" scoping; To make BOSCO work, the blahp does not generate limited proxies by default; condor_status can now display utilization per machine rather than per slot; Improve performance of condor_history and other tools. Further details can be found in the Version History. HTCondor 8.5.4 binaries and source code are available from our Downloads page.

Self-organizing maps in Spark

Self-organizing maps are a useful technique for identifying structure in high-dimensional data sets. The map itself is a low-dimensional arrangement of cells, where each cell is an object comparable to the objects in the training set. The goal of self-organizing map training is to arrange a grid of cells so that nearby cells will be the best matches for similar objects. Once we’ve built up the map, we can identify clusters of similar objects (based on the cells that they map to) and even detect outliers (based on the distributions of map quality).

Here are a few snapshots of the training process on color data, which I developed as a test for a parallel implementation of self-organizing maps in Apache Spark. For this demo, I used angular similarity in the RGB color space (not Euclidean distance) as a measure of color similarity. This means that, for example, a darker color would be considered similar to a lighter color with a similar hue.

We start with a random map:

Matches made in the first training iteration essentially affect the whole map, producing a blurred, unsaturated, undifferentiated map:

Some structure begins to emerge pretty rapidly, though; after one quarter of our training iterations, we can already see clear clusters of colors:

The map begins to get more and more saturated as similar colors are grouped together. Here’s what it looks like after half of the training iterations:

…and three-quarters of the training iterations:

As training proceeds, it gradually affects smaller and smaller neighborhoods of the map until the very end, when each training match only affects a single cell (and thus the impact of darker colors becomes apparent, since they can cluster together in single cells that are not the best matching unit for any brighter colors):

In a future post, I’ll cover the training algorithm, introduce the code, and provide some tips for implementing similar techniques in Spark. For now, though, here is a demo video that shows an animation of the whole map training process:

HTCondor-CE-Bosco Upcoming Release

The HTCondor-CE-Bosco (CE-Bosco) is one of the largest changes for the upcoming OSG 3.3.12 release, to be released on 2016/05/10. The HTCondor-CE-Bosco is a special configuration of the HTCondor-CE. The HTCondor-CE-Bosco does not submit directly to a local scheduler such as Slurm or PBS, instead, it will submit jobs to a remote cluster over SSH.

Why do supercomputers have to be so big? ( April 26, 2016 )

In this Blue Sky Science article Lauren Michael of the Center for High Throughput Computing answers a question from an eight-year-old: Why do supercomputers have to be so big? The article can also be viewed at the Wisconsin State Journal and the Baraboo News Republic.

HTCondor Week registration deadline is May 9 ( April 26, 2016 )

The registration deadline for HTCondor Week 2016 has been extended to Monday, May 9. This will be the final extension -- we need to finalize attendance numbers for the caterers. Also note that Wednesday, April 27, is the last day to be guaranteed to get the conference rate at the DoubleTree Hotel.

HTCondor 8.4.6 released! ( April 21, 2016 )

The HTCondor team is pleased to announce the release of HTCondor 8.4.6. A stable series release contains significant bug fixes. Highlights of this release are: fixed a bug that could cause a job to fail to start in a dynamic slot; fixed a negotiator memory leak when using partitionable slot preemption; fixed a bug that caused supplemental groups to be wrong during file transfer; properly identify the Windows 10 platform; fixed a typographic error in the LIMIT_JOB_RUNTIMES policy; fixed a bug where maximum length IPv6 addresses were not parsed; a few other bug fixes, consult the version history. Further details can be found in the Version History. HTCondor 8.4.6 binaries and source code are available from our Downloads page.

A New Blogging Platform

I have decided to move my Blog from Blogspot to Github Pages, and Jekyll publisher. I made this move for many reasons, but my latest blog post showed how difficult it can be to create technical posts in Blogger. For example, syntax highlighting is very difficult. Something as easy as below was nearly impossible to look right. And I resorted to using gists.

Querying an Elasticsearch Cluster for Gratia Records

For the last few days I have been working on email reports for GRACC, OSG's new prototype accounting system.  The source of the email reports are located on Github.I have learned a significant amount about queries and aggregations for ElasticSearch.  For example, below is the query that counts the number of records for a date range. The above query searches for queries in the date range specific, and counts the number of records.  It uses the Elasticsearch-dsl python library.  It does not return the actual records, just a number.  This is useful for generating raw counts and a delta for records processed over the last few days.The other query I designed is to aggregate the number of records per probe.  This query is designed to help us understand differences in specific probe's reporting behavior.This query is much more complicated than the simple count query above.  First, it creates a search selecting the "gracc-osg-*" indexes.  It also creates an aggregation "A" which will be used later to aggregate by the ProbeName field. Next, we create a bucket called day_range which is of type range.  It aggregates in two ranges, the last 24 hours and the 24 hours previous to that.  Next, we attach our ProbeName aggregation "A" defined above.  In return we get an aggregation for each of the ranges, for each of the probes, how many records exist for that probe. This nested aggregation is a powerful feature that will be used in the summarization of the records.

Computing Simplex Vertex Locations From Pairwise Object Distances

Suppose I have a collection of (N) objects, and distances d(j,k) between each pair of objects (j) and (k); that is, my objects are members of a metric space. I have no knowledge about my objects, beyond these pair-wise distances. These objects could be construed as vertices in an (N-1) dimensional simplex. However, since I have no spatial information about my objects, I first need a way to assign spatial locations to each object, in vector space R(N-1), with only my object distances to work with. [...]

Efficient Multiplexing for Spark RDDs

In this post I'm going to propose a new abstract operation on Spark RDDs -- multiplexing -- that makes some categories of operations on RDDs both easier to program and in many cases much faster. [...]

Very Fast Reservoir Sampling

In this post I will demonstrate how to do reservoir sampling orders of magnitude faster than the traditional "naive" reservoir sampling algorithm, using a fast high-fidelity approximation to the reservoir sampling-gap distribution. [...]