Permanent Kerberos tickets for interactive users of Hadoop cluster

dataiku hadoop
dataiku multi user security
dataiku dss
hadoop gs

I have a Hadoop cluster which uses the company's Active Directory as its Kerberos realm. The nodes and the end-user Linux workstations are all Ubuntu 16.04, joined to the same domain using PowerBroker PBIS, so SSH logons between the workstations and the grid nodes are single sign-on. End users run long-running scripts from their workstations which repeatedly use SSH to launch Spark / YARN jobs on the cluster and then track their progress; these scripts have to keep running overnight and over weekends, well beyond the 10-hour lifetime of a Kerberos ticket.

I'm looking for a way to install permanent, service-style, Kerberos keytabs for the users, relieving them of the need to deal with kinit. I understand this would imply anyone with shell access to the grid as a particular user would be able to authenticate as that user.

I've also noticed that performing a non-SSO SSH login with a password automatically creates a new ticket valid from the time of the login. If this behaviour could be enabled for SSO logins as well, that would solve my problem.


You just have to ask users to add --principal and --keytab arguments to their Spark jobs. Then Spark (actually YARN) code will renew tickets for you automatically. We have jobs that run for weeks using this approach.

See for example https://spark.apache.org/docs/latest/security.html#yarn-mode

For long-running apps like Spark Streaming apps to be able to write to HDFS, it is possible to pass a principal and keytab to spark-submit via the --principal and --keytab parameters respectively. The keytab passed in will be copied over to the machine running the Application Master via the Hadoop Distributed Cache (securely - if YARN is configured with SSL and HDFS encryption is enabled). The Kerberos login will be periodically renewed using this principal and keytab and the delegation tokens required for HDFS will be generated periodically so the application can continue writing to HDFS.

You can see in the Spark driver logs when YARN renews the Kerberos ticket.
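
As a concrete illustration, a submission using these flags might look roughly like this (the principal, keytab path and jar name are placeholders, not values from this cluster):

spark-submit --master yarn --deploy-mode cluster \
  --principal someuser@EXAMPLE.COM \
  --keytab /etc/security/keytabs/someuser.headless.keytab \
  my-long-running-job.jar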


If you are accessing Hive/HBase or any other component that needs a Kerberos ticket, make your Spark code re-login when the ticket expires. You have to log in from a keytab rather than relying on a TGT already existing in the cache. This is done using the UserGroupInformation class from the Hadoop security package. Add the snippet below to your Spark job for long-running work:

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod

// Point the Hadoop security layer at the cluster configuration
val configuration = new Configuration
configuration.addResource("/etc/hadoop/conf/hdfs-site.xml")
UserGroupInformation.setConfiguration(configuration)
UserGroupInformation.getCurrentUser.setAuthenticationMethod(AuthenticationMethod.KERBEROS)

// Log in from the keytab and run the HBase/Hive work as that principal.
// Replace the principal and the keytab path with your own values.
UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "your-principal@YOUR.REALM", "/path/to/your.keytab")
  .doAs(new PrivilegedExceptionAction[Unit]() {
    override def run(): Unit = {
      // HBase / Hive connection
      // logic
    }
  })

Above we specify the name of our service principal and the path to the keytab file we generated. As long as that keytab is valid our program will use the desired service principal for all actions, regardless of whether or not the user running the program has already authenticated and received a TGT.

If nothing other than Spark itself needs the ticket, then you don't need the code above. Simply provide the keytab and principal in your spark-submit command:

spark-submit --master yarn --deploy-mode cluster --keytab "xxxxxx.keytab" --principal "svc-xxxx@xxxx.COM" xxxx.jar


I took the suggestion above to use the --keytab argument to point at a custom keytab on the grid node from which I submit to Spark. I create my own per-user keytab using the script below; it remains valid until the user changes their password.

Note that the script makes the simplifying assumption that the Kerberos realm is the same as the DNS domain and as the LDAP directory in which users are defined. This holds for my setup; use with care on yours. It also expects the users to be sudoers on that grid node. A more refined script might separate keytab generation from installation.

#!/usr/bin/python2.7

from __future__ import print_function

import os
import sys
import stat
import getpass
import subprocess
import collections
import socket
import tempfile

def runSudo(cmd, pw):
    try:
        subprocess.check_call("echo '{}' | sudo -S -p '' {}".format(pw, cmd), shell = True)
        return True
    except subprocess.CalledProcessError:
        return False

def testPassword(pw):
    subprocess.check_call("sudo -k", shell = True)
    if not runSudo("true", pw):
        print("Incorrect password for user {}".format(getpass.getuser()), file = sys.stderr)
        sys.exit(os.EX_NOINPUT)    

class KeytabFile(object):
    def __init__(self, pw):
        self.userName = getpass.getuser()
        self.pw = pw
        self.targetPath = "/etc/security/keytabs/{}.headless.keytab".format(self.userName)
        self.tempFile = None

    KeytabEntry = collections.namedtuple("KeytabEntry", ("kvno", "principal", "encryption"))

    def LoadExistingKeytab(self):
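        # If no keytab exists yet, synthesize default entries for the user's
        # principal with common encryption types; otherwise parse the entries
        # reported by 'klist -ek'.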
        if not os.access(self.targetPath, os.R_OK):

            # Note: the assumption made here, that the Kerberos realm is same as the DNS domain,
            # may not hold in other setups
            domainName = ".".join(socket.getfqdn().split(".")[1:])

            encryptions = ("aes128-cts-hmac-sha1-96", "arcfour-hmac", "aes256-cts-hmac-sha1-96")
            return [
                self.KeytabEntry(0, "@".join( (self.userName, domainName)), encryption)
                    for encryption in encryptions ]

        def parseLine(keytabLine):
            tokens = keytabLine.strip().split(" ")
            return self.KeytabEntry(int(tokens[0]), tokens[1], tokens[2].strip("()"))

        cmd ="klist -ek {} | tail -n+4".format(self.targetPath)
        entryLines = subprocess.check_output(cmd, shell = True).splitlines()
        return map(parseLine, entryLines)

    class KtUtil(subprocess.Popen):
        def __init__(self):
            subprocess.Popen.__init__(self, "ktutil",
                stdin = subprocess.PIPE, stdout = subprocess.PIPE, stderr=subprocess.PIPE, shell = True)

        def SendLine(self, line, expectPrompt = True):
            self.stdin.write(bytes(line + "\n"))
            self.stdin.flush()
            if expectPrompt:
                self.stdout.readline()

        def Quit(self):
            self.SendLine("quit", False)
            rc = self.wait()
            if rc != 0:
                raise subprocess.CalledProcessError(rc, "ktutil")


    def InstallUpdatedKeytab(self):
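        # Re-create every entry with an incremented key version number (kvno)
        # via ktutil, write them to a temporary keytab, then move it into
        # place with sudo.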
        fd, tempKt = tempfile.mkstemp(suffix = ".keytab")
        os.close(fd)
        entries = self.LoadExistingKeytab()
        ktutil = self.KtUtil()
        for entry in entries:
            cmd = "add_entry -password -p {} -k {} -e {}".format(
                entry.principal, entry.kvno + 1, entry.encryption)

            ktutil.SendLine(cmd)
            ktutil.SendLine(self.pw)

        os.unlink(tempKt)
        ktutil.SendLine("write_kt {}".format(tempKt))
        ktutil.Quit()

        if not runSudo("mv {} {}".format(tempKt, self.targetPath), self.pw):
            os.unlink(tempKt)
            print("Failed to install the keytab to {}.".format(self.targetPath), file = sys.stderr)
            sys.exit(os.EX_CANTCREAT)

        os.chmod(self.targetPath, stat.S_IRUSR)
        # TODO: Also change group to 'hadoop'

if __name__ == '__main__':

    def main():
        userPass = getpass.getpass("Please enter your password: ")
        testPassword(userPass)
        kt = KeytabFile(userPass)
        kt.InstallUpdatedKeytab()

    main()
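
For reference, the resulting keytab can then be used along these lines (the script name, realm and paths below are placeholders for my setup; adjust to yours):

# generate/refresh the per-user keytab (prompts for the AD password)
./make_user_keytab.py

# inspect the entries that were written
klist -ek /etc/security/keytabs/$USER.headless.keytab

# obtain a TGT from the keytab without typing a password, e.g. from cron
kinit -kt /etc/security/keytabs/$USER.headless.keytab $USER@EXAMPLE.COM

# or hand it straight to spark-submit as described in the answers above
spark-submit --master yarn --deploy-mode cluster \
  --principal $USER@EXAMPLE.COM \
  --keytab /etc/security/keytabs/$USER.headless.keytab \
  my-job.jar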
