Archive for April, 2012

Service as a Job: HDFS NameNode

April 16, 2012

Scheduling an HDFS DataNode is a powerful building block. However, an operational HDFS instance also requires a NameNode. Here is an example of how a NameNode can be scheduled, followed by scheduled DataNodes, to create a complete HDFS instance.

From here, HDFS instances can be dynamically created on shared resources. Workflows can be built to manage, grow and shrink HDFS instances. Multiple HDFS instances can be deployed on a single set of resources.
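
For instance, growing an instance is just another condor_submit of a datanode, and shrinking it is a condor_rm of one. A minimal sketch, reusing the hdfs_datanode.job and the example namenode address that appear later in this post:

# Grow: add another datanode to the running instance
$ condor_submit -a NameNodeAddress=hdfs://eeyore.local:38174 hdfs_datanode.job

# Shrink: remove a datanode job; kill_sig = SIGTERM lets it shut down cleanly
$ condor_rm 210

A production workflow would likely decommission a datanode before removing it, but the mechanics are just job submission and removal.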

The control script is based on hdfs_datanode.sh. It discovers the NameNode’s endpoints and chirps them back into the job ad.

hdfs_namenode.sh

#!/bin/sh -x

# condor_chirp in /usr/libexec/condor
export PATH=$PATH:/usr/libexec/condor

HADOOP_TARBALL=$1

# Note: bin/hadoop uses JAVA_HOME to find the runtime and tools.jar,
#       except tools.jar does not seem necessary therefore /usr works
#       (there's no /usr/lib/tools.jar, but there is /usr/bin/java)
export JAVA_HOME=/usr

# When we get SIGTERM, which Condor will send when
# we are kicked, kill off the namenode
function term {
   ./bin/hadoop-daemon.sh stop namenode
}

# Unpack
tar xzfv $HADOOP_TARBALL

# Move into tarball, inefficiently
cd $(tar tzf $HADOOP_TARBALL | head -n1)

# Configure,
#  . fs.default.name,dfs.http.address must be set to port 0 (ephemeral)
#  . dfs.name.dir must be in _CONDOR_SCRATCH_DIR for cleanup
# FIX: Figure out why a hostname is required, instead of 0.0.0.0:0 for
#      fs.default.name
cat > conf/hdfs-site.xml <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$HOSTNAME:0</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>$_CONDOR_SCRATCH_DIR/name</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:0</value>
  </property>
</configuration>
EOF

# Try to shutdown cleanly
trap term SIGTERM

export HADOOP_CONF_DIR=$PWD/conf
export HADOOP_LOG_DIR=$_CONDOR_SCRATCH_DIR/logs
export HADOOP_PID_DIR=$PWD

./bin/hadoop namenode -format
./bin/hadoop-daemon.sh start namenode

# Wait for pid file
PID_FILE=$(echo hadoop-*-namenode.pid)
while [ ! -s $PID_FILE ]; do sleep 1; done
PID=$(cat $PID_FILE)

# Wait for the log
LOG_FILE=$(echo $HADOOP_LOG_DIR/hadoop-*-namenode-*.log)
while [ ! -s $LOG_FILE ]; do sleep 1; done

# It would be nice if there were a way to get these without grepping logs
while [ ! $(grep "IPC Server listener on" $LOG_FILE) ]; do sleep 1; done
IPC_PORT=$(grep "IPC Server listener on" $LOG_FILE | sed 's/.* on \(.*\):.*/\1/')
while [ ! $(grep "Jetty bound to port" $LOG_FILE) ]; do sleep 1; done
HTTP_PORT=$(grep "Jetty bound to port" $LOG_FILE | sed 's/.* to port \(.*\)$/\1/')

# Record the port number where everyone can see it
condor_chirp set_job_attr NameNodeIPCAddress \"$HOSTNAME:$IPC_PORT\"
condor_chirp set_job_attr NameNodeHTTPAddress \"$HOSTNAME:$HTTP_PORT\"

# While namenode is running, collect and report back stats
while kill -0 $PID; do
   # Collect stats and chirp them back into the job ad
   # Nothing to do.
   sleep 30
done
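
The stats loop above is intentionally empty. As a sketch of what it could report, here is a hypothetical version that chirps the live datanode count back into the job ad. LiveDataNodes is a made-up attribute name, and it assumes dfsadmin accepts the generic -fs option to point at the discovered IPC endpoint:

# Hypothetical stats loop: chirp the live datanode count into the job ad
while kill -0 $PID; do
   LIVE=$(./bin/hadoop dfsadmin -fs hdfs://$HOSTNAME:$IPC_PORT -report 2>/dev/null | \
          sed -n 's/^Datanodes available: \([0-9]*\).*/\1/p')
   [ -n "$LIVE" ] && condor_chirp set_job_attr LiveDataNodes $LIVE
   sleep 30
done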

The job description file is standard at this point.

hdfs_namenode.job

cmd = hdfs_namenode.sh
args = hadoop-1.0.1-bin.tar.gz

transfer_input_files = hadoop-1.0.1-bin.tar.gz

#output = namenode.$(cluster).out
#error = namenode.$(cluster).err

log = namenode.$(cluster).log

kill_sig = SIGTERM

# Want chirp functionality
+WantIOProxy = TRUE

should_transfer_files = yes
when_to_transfer_output = on_exit

requirements = HasJava =?= TRUE

queue

In operation –

Submit the namenode and find its endpoints,

$ condor_submit hdfs_namenode.job
Submitting job(s).
1 job(s) submitted to cluster 208.

$ condor_q -long 208 | grep NameNode
NameNodeHTTPAddress = "eeyore.local:60182"
NameNodeIPCAddress = "eeyore.local:38174"

Open a browser window to http://eeyore.local:60182 to find the cluster summary,

1 files and directories, 0 blocks = 1 total. Heap Size is 44.81 MB / 888.94 MB (5%)
  Configured Capacity                   :        0 KB
  DFS Used                              :        0 KB
  Non DFS Used                          :        0 KB
  DFS Remaining                         :        0 KB
  DFS Used%                             :       100 %
  DFS Remaining%                        :         0 %
 Live Nodes                             :           0
 Dead Nodes                             :           0
 Decommissioning Nodes                  :           0
  Number of Under-Replicated Blocks     :           0

Add a datanode,

$ condor_submit -a NameNodeAddress=hdfs://eeyore.local:38174 hdfs_datanode.job 
Submitting job(s).
1 job(s) submitted to cluster 209.

Refresh the cluster summary,

1 files and directories, 0 blocks = 1 total. Heap Size is 44.81 MB / 888.94 MB (5%)
  Configured Capacity                   :     9.84 GB
  DFS Used                              :       28 KB
  Non DFS Used                          :     9.54 GB
  DFS Remaining                         :   309.79 MB
  DFS Used%                             :         0 %
  DFS Remaining%                        :      3.07 %
 Live Nodes                             :           1
 Dead Nodes                             :           0
 Decommissioning Nodes                  :           0
  Number of Under-Replicated Blocks     :           0

And another,

$ condor_submit -a NameNodeAddress=hdfs://eeyore.local:38174 hdfs_datanode.job
Submitting job(s).
1 job(s) submitted to cluster 210.

Refresh,

1 files and directories, 0 blocks = 1 total. Heap Size is 44.81 MB / 888.94 MB (5%)
  Configured Capacity                   :    19.69 GB
  DFS Used                              :       56 KB
  Non DFS Used                          :    19.26 GB
  DFS Remaining                         :   435.51 MB
  DFS Used%                             :         0 %
  DFS Remaining%                        :      2.16 %
 Live Nodes                             :           2
 Dead Nodes                             :           0
 Decommissioning Nodes                  :           0
  Number of Under-Replicated Blocks     :           0

Those are all the building blocks necessary to run HDFS on scheduled resources.

Service as a Job: HDFS DataNode

April 4, 2012

Building on other examples of services run as jobs, such as Tomcat, Qpidd and memcached, here is an example for the Hadoop Distributed File System’s DataNode.

Below is the control script for the datanode. It mirrors the memcached script, but does not publish statistics. However, datanode statistics/metrics could be pulled and published; a sketch of what that might look like follows the script.

hdfs_datanode.sh

#!/bin/sh -x

# condor_chirp in /usr/libexec/condor
export PATH=$PATH:/usr/libexec/condor

HADOOP_TARBALL=$1
NAMENODE_ENDPOINT=$2

# Note: bin/hadoop uses JAVA_HOME to find the runtime and tools.jar,
#       except tools.jar does not seem necessary therefore /usr works
#       (there's no /usr/lib/tools.jar, but there is /usr/bin/java)
export JAVA_HOME=/usr

# When we get SIGTERM, which Condor will send when
# we are kicked, kill off the datanode and gather logs
function term {
   ./bin/hadoop-daemon.sh stop datanode
# Useful if we can transfer data back
#   tar czf logs.tgz logs
#   cp logs.tgz $_CONDOR_SCRATCH_DIR
}

# Unpack
tar xzfv $HADOOP_TARBALL

# Move into tarball, inefficiently
cd $(tar tzf $HADOOP_TARBALL | head -n1)

# Configure,
#  . dfs.data.dir must be in _CONDOR_SCRATCH_DIR for cleanup
#  . address,http.address,ipc.address must be set to port 0 (ephemeral)
cat > conf/hdfs-site.xml <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$NAMENODE_ENDPOINT</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>$_CONDOR_SCRATCH_DIR/data</value>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:0</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:0</value>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:0</value>
  </property>
</configuration>
EOF

# Try to shutdown cleanly
trap term SIGTERM

export HADOOP_CONF_DIR=$PWD/conf
export HADOOP_PID_DIR=$PWD
export HADOOP_LOG_DIR=$_CONDOR_SCRATCH_DIR/logs
./bin/hadoop-daemon.sh start datanode

# Wait for pid file
PID_FILE=$(echo hadoop-*-datanode.pid)
while [ ! -s $PID_FILE ]; do sleep 1; done
PID=$(cat $PID_FILE)

# Report back some static data about the datanode
# e.g. condor_chirp set_job_attr SomeAttr SomeData
# None at the moment.

# While the datanode is running, collect and report back stats
while kill -0 $PID; do
   # Collect stats and chirp them back into the job ad
   # None at the moment.
   sleep 30
done
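
The empty stats loop above is the natural place for the metrics mentioned earlier. A minimal sketch that chirps the datanode’s on-disk usage back into the job ad, with DataNodeDFSUsedKB as a made-up attribute name:

# Hypothetical stats loop: chirp the datanode's on-disk usage into the job ad
while kill -0 $PID; do
   USED_KB=$(du -sk $_CONDOR_SCRATCH_DIR/data 2>/dev/null | awk '{print $1}')
   [ -n "$USED_KB" ] && condor_chirp set_job_attr DataNodeDFSUsedKB $USED_KB
   sleep 30
done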

The job description below uses a templating technique. The description uses a variable NameNodeAddress, which is not defined in the description. Instead, the value is provided as an argument to condor_submit. In fact, a complete job can be defined without a description file, e.g. echo queue | condor_submit -a executable=/bin/sleep -a args=1d, but more on that some other time.
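
For instance, a rough command-line-only equivalent of hdfs_datanode.job, sketched from the description below (each -a line is simply appended to the submit description):

echo queue | condor_submit \
  -a cmd=hdfs_datanode.sh \
  -a "args = hadoop-1.0.1-bin.tar.gz hdfs://$HOSTNAME:9000" \
  -a transfer_input_files=hadoop-1.0.1-bin.tar.gz \
  -a "kill_sig = SIGTERM" \
  -a "+WantIOProxy = TRUE" \
  -a should_transfer_files=yes \
  -a when_to_transfer_output=on_exit \
  -a "requirements = HasJava =?= TRUE"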

hdfs_datanode.job

# Submit w/ condor_submit -a NameNodeAddress=<address>
# e.g. <address> = hdfs://$HOSTNAME:2007

cmd = hdfs_datanode.sh
args = hadoop-1.0.1-bin.tar.gz $(NameNodeAddress)

transfer_input_files = hadoop-1.0.1-bin.tar.gz

# RFE: Ability to get output files even when job is removed
#transfer_output_files = logs.tgz
#transfer_output_remaps = "logs.tgz logs.$(cluster).tgz"
output = datanode.$(cluster).out
error = datanode.$(cluster).err

log = datanode.$(cluster).log

kill_sig = SIGTERM

# Want chirp functionality
+WantIOProxy = TRUE

should_transfer_files = yes
when_to_transfer_output = on_exit

requirements = HasJava =?= TRUE

queue

hadoop-1.0.1-bin.tar.gz is available from http://archive.apache.org/dist/hadoop/core/hadoop-1.0.1/

Finally, here is a running example,

A namenode is already running –

$ ./bin/hadoop dfsadmin -conf conf/hdfs-site.xml -report
Configured Capacity: 0 (0 KB)
Present Capacity: 0 (0 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 0 (0 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 0 (0 total, 0 dead)

Submit a datanode, knowing the namenode’s IPC port is 9000 –

$ condor_submit -a NameNodeAddress=hdfs://$HOSTNAME:9000 hdfs_datanode.job
Submitting job(s).
1 job(s) submitted to cluster 169.

$ condor_q
-- Submitter: eeyore.local : <127.0.0.1:59889> : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD 
 169.0   matt            3/26 15:17   0+00:00:16 R  0   0.0 hdfs_datanode.sh h
1 jobs; 0 idle, 1 running, 0 held

Storage is now available, though not very much as my disk is almost full –

$ ./bin/hadoop dfsadmin -conf conf/hdfs-site.xml -report
Configured Capacity: 63810015232 (59.43 GB)
Present Capacity: 4907495424 (4.57 GB)
DFS Remaining: 4907466752 (4.57 GB)
DFS Used: 28672 (28 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Submit 9 more datanodes –

$ condor_submit -a NameNodeAddress=hdfs://$HOSTNAME:9000 hdfs_datanode.job
Submitting job(s).
1 job(s) submitted to cluster 170.
...
1 job(s) submitted to cluster 178.
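
The repeated submissions are easily scripted, e.g.

$ for i in $(seq 9); do condor_submit -a NameNodeAddress=hdfs://$HOSTNAME:9000 hdfs_datanode.job; done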

$ ./bin/hadoop dfsadmin -conf conf/hdfs-site.xml -report
Configured Capacity: 638100152320 (594.28 GB)
Present Capacity: 40399958016 (37.63 GB)
DFS Remaining: 40399671296 (37.63 GB)
DFS Used: 286720 (280 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 10 (10 total, 0 dead)

At this point you can run a workload against the storage, visit the namenode at http://localhost:50070, or simply use ./bin/hadoop fs to interact with the filesystem.
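
For example, a quick smoke test of the filesystem using the same client configuration as the dfsadmin reports above (the /demo path is arbitrary):

$ ./bin/hadoop fs -conf conf/hdfs-site.xml -mkdir /demo
$ ./bin/hadoop fs -conf conf/hdfs-site.xml -put hadoop-1.0.1-bin.tar.gz /demo/
$ ./bin/hadoop fs -conf conf/hdfs-site.xml -ls /demo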

Remember, all the datanodes were dispatched by a scheduler, run alongside existing workloads on your pool, and are completely manageable by standard policies.