Understanding Hadoop Kerberos Authentication

Hadoop supports authenticating its clients and users with Kerberos for security. The whole mechanism isn't easy to understand, otherwise I wouldn't compose this blog post. That's not only because Kerberos itself is complex, but also because it involves other complicated pieces such as SASL, GSSAPI, and JAAS. To start with, here is a rough overall picture where I tried to put everything together simply. I'll then explain each part, and when you come back to this picture again, I hope it will be more meaningful and clear.

In the Kerberos authentication mechanism, both the server side and the client side need to authenticate into the system. By server side in Hadoop, I mean Hadoop services, like namenode/datanode, jobtracker/tasktracker, etc.; by client side, I mean the HDFS client, job client, and other tools used by Hadoop users. Of course it's not limited to such user tools, since other services like HBase and Oozie can also be clients of the HDFS and MapReduce services. But the situations involving Kerberos authentication still boil down to two scenarios: the user client side and the service server side.

Server side authentication (service)
Hadoop services themselves need to authenticate to the Kerberos KDC. So how does that go? During startup, a service logs in from a keytab in UserGroupInformation, via the JAAS framework and Krb5LoginModule, using the service principal and keytab configured in core-site.xml or the like. The principal asserts that it is the right service, which can be proved by the right key held in the keytab. Krb5LoginModule reads the keytab file and authenticates with the KDC directly (no other framework is involved at this point), obtaining the needed ticket. As with any JAAS login module, the ticket is put into the subject's private credentials set, and the subject is then put into the AccessControlContext of the current thread, so both are easy to pick up later. That's all: once this succeeds, the service has passed authentication, finishes starting up, and can then serve client requests.
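To make this concrete, here is a minimal sketch of a service-side login through UserGroupInformation. The principal name and keytab path are hypothetical placeholders; in a real deployment they come from core-site.xml or the service's own config.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ServiceLogin {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell Hadoop to use Kerberos instead of simple authentication.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Log in from the keytab. Behind the scenes this drives JAAS and
            // Krb5LoginModule, and stores the resulting ticket in the
            // subject's private credentials set. Principal and keytab path
            // below are placeholders.
            UserGroupInformation.loginUserFromKeytab(
                    "nn/namenode.example.com@EXAMPLE.COM",
                    "/etc/security/keytabs/nn.service.keytab");

            System.out.println("Logged in as: "
                    + UserGroupInformation.getLoginUser().getUserName());
        }
    }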

Client side authentication (user)
Essentially, a Hadoop user should authenticate to the Kerberos KDC with his/her own user principal when accessing a Hadoop service via a client tool. First, the user logs in to a client machine that can talk to the Hadoop cluster, then executes the kinit command with the principal and password. kinit does the work of authenticating the user to the KDC, gets the resulting Kerberos TGT, and puts it into the ticket cache on the file system. The user can then run a client tool, for example submitting a MapReduce job via the job client. Now it comes to the client software. Just as on the server side, it also performs a login in UserGroupInformation via the JAAS framework and Krb5LoginModule, but from the ticket cache instead of a keytab, and without prompting. The login module reads the TGT from the ticket cache left by the user's kinit run and puts it into the subject's private credentials set in the current thread's access control context, just as the service side does. At this point, the client has bootstrapped and can issue a connection request to the server side.
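Here is a sketch of that client path; the user principal and HDFS path are assumptions for illustration, not taken from a real setup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ClientLogin {
        public static void main(String[] args) throws Exception {
            // The user has already obtained a TGT on this machine, e.g.:
            //   kinit alice@EXAMPLE.COM
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // getLoginUser() triggers a JAAS login from the ticket cache
            // (no keytab, no password prompt) and wires the TGT into the
            // subject, just like the service side does.
            UserGroupInformation ugi = UserGroupInformation.getLoginUser();
            System.out.println("Authenticated as: " + ugi.getUserName());

            // Any subsequent RPC, e.g. an HDFS call, now runs as this user.
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.exists(new Path("/tmp")));
        }
    }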

Client side and server side connection (C/S)
OK, now both the server side and the client side have authenticated to Kerberos successfully: the server is waiting for client requests, and the client is ready to issue one. Upon a request, both sides come down to the SASL/GSSAPI stack. The client stack picks up the client TGT from the above-mentioned subject object in the current access control context. Using the TGT, it requests a service ticket from the KDC targeting the service/server the user or client software is accessing, and presents the resulting service ticket to the server's SASL/GSSAPI stack, which decrypts and thus authenticates it using its own key, also picked up from the corresponding subject in the current access control context. Note that this stack is very complex, as the SASL and GSSAPI specs involving Kerberos run deep, and it's not just for authentication: along the way it also builds a secure context and channel for the two sides. Three levels of protection are provided by this stack: auth (authentication), int (integrity), and privacy (encryption). These options are exposed and can be configured in recent Hadoop versions, so encryption of the RPC channel can be easily enabled.
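These three levels correspond to the standard SASL quality-of-protection values and are selected through Hadoop's hadoop.rpc.protection property. A minimal sketch of requesting the strongest level (the setting alone is shown; a real cluster needs the matching server-side configuration too):

    import org.apache.hadoop.conf.Configuration;

    public class RpcProtection {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // hadoop.rpc.protection selects the SASL quality of protection:
            //   "authentication" -> SASL QOP "auth"      (authentication only)
            //   "integrity"      -> SASL QOP "auth-int"  (plus message integrity)
            //   "privacy"        -> SASL QOP "auth-conf" (plus encryption)
            conf.set("hadoop.rpc.protection", "privacy");
            System.out.println(conf.get("hadoop.rpc.protection"));
        }
    }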

Notes
One may ask how the server and client sides know which KDC to authenticate against, since no explicit configuration for that is found in the Hadoop configuration files. Very different from other authentication mechanisms, Kerberos is heavyweight and deployed at the system level, not per application, so Kerberos must be configured in advance on both the server and client machines, which is actually out of scope of Hadoop. The configuration file is krb5.conf, found on Linux as /etc/krb5.conf, where you can see the Kerberos realms it supports and also the KDCs. As discussed above, Krb5LoginModule (and likewise the GSSAPI client and server code) finds this configuration file and obtains the necessary KDC address out of band.
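As a small illustration, the JRE's Kerberos classes read that file by default, and its location can even be overridden per JVM via a standard system property. The realm, KDC host, and path below are placeholders.

    public class KrbConfLocation {
        public static void main(String[] args) {
            // On Linux the Kerberos classes read /etc/krb5.conf by default.
            // A typical minimal file looks roughly like:
            //
            //   [libdefaults]
            //       default_realm = EXAMPLE.COM
            //   [realms]
            //       EXAMPLE.COM = {
            //           kdc = kdc.example.com
            //       }
            //
            // The location can also be overridden for a single JVM:
            System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
            System.out.println(System.getProperty("java.security.krb5.conf"));
        }
    }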

Another thing worth noting is that two user objects are involved in the authentication and connection setup process: the protocol user, which is written by the client and read by the server from the connection header of the client/server protocol, and the authentication user, which is determined by the server in the SASL/GSSAPI stack. The two can be the same, or they can be different, depending on what value the client writes into the connection header as the protocol user. Allowing this difference mainly comes from the often-mentioned proxy-user need, where the user the service should act as, on the client's behalf, is not the user doing the Kerberos authentication: the former may not have any Kerberos principal at all, or it may be impossible for other technical reasons, as in Oozie. In Oozie, MapReduce jobs are actually submitted with the oozie service principal, but the jobs themselves should run as the Oozie users in order to access the privileged resources/files owned by those users. To achieve this, the proxy-user mechanism is used: the Oozie software, as a client of the Hadoop cluster, simply passes the user it wants to act as via the connection header.
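Here is a hedged sketch of the proxy-user mechanism from the client side, using Hadoop's UserGroupInformation API. The user name and path are placeholders, and the cluster must separately whitelist the proxying principal via the hadoop.proxyuser.* properties in core-site.xml.

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserDemo {
        public static void main(String[] args) throws Exception {
            // The real (authentication) user: the one holding Kerberos
            // credentials, e.g. the oozie service principal already logged
            // in from its keytab.
            UserGroupInformation realUser = UserGroupInformation.getLoginUser();

            // The effective (protocol) user we want the cluster to act as.
            // "alice" is a placeholder; she needs no Kerberos principal.
            UserGroupInformation proxyUgi =
                    UserGroupInformation.createProxyUser("alice", realUser);

            // Everything inside doAs() sends "alice" in the connection
            // header, while the SASL/GSSAPI layer still authenticates as
            // realUser.
            proxyUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
                FileSystem fs = FileSystem.get(new Configuration());
                System.out.println(fs.exists(new Path("/user/alice")));
                return null;
            });
        }
    }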

OK, I have to stop here for now. I hope to have more time for this. To be continued.
