Community is improving Hadoop RPC and SASL authentication

Very interesting: active Hadoop contributor Daryn Sharp is working on improving RPC and SASL to support the following:

  • support multiple SASL mechanisms
  • support negotiation of SASL mechanisms
  • support multiple protocols per mechanism
  • add server id hints for SASL clients
    • support Kerberos auth to servers with arbitrary service principals
    • completely decouple host/IP from tokens
  • the above supports servers with multiple NICs
  • clients may access a server via any hostname, IP, or even a CNAME for the server

Worth watching this as groundwork for implementing token-based authentication.

Posted in authentication, Security, Uncategorized | Leave a comment

Unified authorization framework for Hadoop

As another goal of the Hadoop Rhino project, a unified authorization framework for Hadoop is now under architecture and design, since its main dependency, token-based authentication and single sign on, has just published its design. Based on that big picture, this authorization framework is fully supported by the overall plan. Hopefully the design will be coming soon. If you're interested, please watch the JIRA, HADOOP-9466.

Posted in authorization, Hadoop, Security | Leave a comment

Token based authentication and single sign on for Hadoop

The Hadoop Rhino project initiated token-based authentication plus single sign on. To learn about Rhino, please look at the Rhino project page for details. Simply speaking, it mainly targets security for Hadoop, with features like data encryption, token-based authentication and SSO, a unified authorization framework, advanced HBase authorization, and so on. It was launched by Intel as a contribution to Hadoop.

We have been busy these days with the design; thanks to the team and to HBase committer Andrew Purtell, the doc is now out. If you're interested, go to the JIRA for the doc, HADOOP-9392. Your comments and any feedback would be appreciated.

Some main points presented by the design:

  • A new tokenAuthn method added to Hadoop's authentication methods, alongside the existing ones like simple and kerberos;
  • An identity token to bridge external authentication systems and mechanisms into Hadoop;
  • Multiple deployments of the token authentication service with different token realms for multiple organizations;
  • A relaxed Hadoop Kerberos authentication deployment mode as a best practice based on this framework.

Posted in Apache, authentication, Hadoop, Security | 3 Comments

Understanding Hadoop Kerberos Authentication

Hadoop supports authenticating its clients and users using Kerberos for security. Understanding the whole mechanism isn't easy (otherwise I wouldn't compose this blog post), not only because Kerberos itself is complex, but also because it involves other complicated pieces such as SASL, GSSAPI and JAAS. To start with, here is a rough overall picture where I tried to put everything together simply. After that I'd like to try to explain it, and when you come back to this picture again, I hope it will be more meaningful and clear.

Per the Kerberos authentication mechanism, both the server side and the client side need to authenticate into the system. By server side in Hadoop I mean the Hadoop services, like namenode/datanode, jobtracker/tasktracker, etc.; by client side I mean the hdfs client, job client, etc. used by Hadoop users. Of course it's not limited to such user tools, since other services like HBase or Oozie can also be clients to the HDFS and MapReduce services. But the situations involving Kerberos authentication still boil down to two scenarios: the user client side and the service server side.

Server side authentication (service)
Hadoop services themselves need to authenticate to the Kerberos KDC. So how does it go? During startup, a service logs in from its keytab in UserGroupInformation via the JAAS framework and Krb5LoginModule, using the service principal and keytab configured in core-site.xml or the like. The principal asserts it is the right service, which can be proved by the right key held in the keytab. Krb5LoginModule reads the keytab file and authenticates with the KDC directly (without involving the other framework pieces this time), getting the needed ticket. As is normal for a JAAS login module, the ticket is put into the subject's private credentials set, and the subject is then put into the AccessControlContext of the current thread, so it's easy to pick them up later. That's all: when this succeeds, the service has passed authentication, finishes starting up, and can then serve client requests.
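To make this concrete, here is a minimal sketch in Java of what that startup login amounts to. It is not actual NameNode code; the principal name and keytab path are made-up placeholders, and the UserGroupInformation calls are the public API that wraps the JAAS/Krb5LoginModule machinery described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class ServiceKerberosLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Turn on Kerberos for Hadoop security.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // JAAS + Krb5LoginModule run under the hood; the resulting TGT ends up
    // in the login subject's private credentials set.
    UserGroupInformation.loginUserFromKeytab(
        "nn/namenode.example.com@EXAMPLE.COM",        // example service principal
        "/etc/security/keytabs/nn.service.keytab");   // example keytab path

    System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
  }
}

Real services typically reach the same point indirectly through SecurityUtil.login() with the principal and keytab keys taken from their configuration files.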

Client side authentication (user)
Essentially, a Hadoop user should of course authenticate to the Kerberos KDC when accessing a Hadoop service via a client tool, using his/her own user principal. First, the user logs in to a client machine which can talk to the Hadoop cluster, then executes the kinit command with the principal and password. kinit does the work of authenticating the user to the KDC, gets the resulting Kerberos TGT and puts it into the ticket cache on the file system. Then the user can run a client tool, for example submitting a MapReduce job via the job client. Now it comes to the client software. Just as on the server side, it also performs a login in UserGroupInformation via the JAAS framework and Krb5LoginModule, but from the ticket cache instead of a keytab, and without prompting. The login module reads the TGT from the ticket cache left by the user's kinit run and puts it into the subject's private credentials set in the current thread's access control context, just as the service side does. At this point the client has bootstrapped and can issue a connection request to the server side.
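A hedged sketch of the client side, assuming the user has already run kinit and that core-site.xml points at a secure cluster (the class name is just for illustration): with no keytab given, UserGroupInformation's login picks the TGT up from the ticket cache without prompting.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class ClientKerberosLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // loads core-site.xml etc.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // No keytab here: Krb5LoginModule reads the TGT that kinit left in the
    // ticket cache (no password prompt).
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();
    System.out.println("Client user: " + ugi.getUserName());

    // From here on, RPC calls (e.g. to the NameNode) authenticate via SASL/GSSAPI.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Home directory: " + fs.getHomeDirectory());
  }
}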

Client side and server side connection (C/S)
OK, now both the server side and the client side have authenticated to Kerberos successfully: the server is waiting for client requests, and the client is just ready to issue one. Upon the request, both sides come down to the SASL/GSSAPI stack. The client stack picks up the client's TGT from the above-mentioned subject in the current access control context. Using the TGT it then requests a service ticket from the KDC targeting the specific service/server which the user or client software is accessing, and presents the resulting service ticket to the server's SASL/GSSAPI stack, which decrypts and thus authenticates it using its own credentials, also picked up from the corresponding subject of the current access control context. Note this stack is very complex regarding the SASL and GSSAPI specs involving Kerberos, and it's not just for authentication: along the way it also builds a secure context and channel between the two sides. Three levels of protection are provided by this stack: auth (authentication), int (integrity) and privacy (encryption). These options are exposed and can be configured in recent Hadoop versions so that encryption of the RPC channel can be easily enabled.
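The protection level is controlled by the hadoop.rpc.protection property, which maps to the SASL quality of protection: authentication (auth), integrity (auth-int) and privacy (auth-conf, i.e. encryption). It normally lives in core-site.xml; the tiny sketch below only sets it programmatically to show the knob, it is not how the RPC layer itself reads it.

import org.apache.hadoop.conf.Configuration;

public class RpcProtectionExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // "authentication", "integrity" or "privacy"; privacy encrypts the RPC channel.
    conf.set("hadoop.rpc.protection", "privacy");
    System.out.println("RPC protection: " + conf.get("hadoop.rpc.protection"));
  }
}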

Notes
One may ask how the server and client sides know which KDC to authenticate against, since no explicit configuration for that is found in the Hadoop-related configuration files. Unlike many other authentication mechanisms, Kerberos is heavyweight and deployed at the system level, not per application, so on both server and client machines Kerberos should be configured beforehand, which is actually out of scope for Hadoop. The configuration file is krb5.conf, found on Linux as /etc/krb5.conf, where you can see the Kerberos realms it supports and also the KDCs. As discussed above, Krb5LoginModule and the Krb5 GSS client/server find this configuration file and get the necessary KDC address out of band.
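On the JVM side, my understanding is that the JDK Kerberos classes used by Krb5LoginModule and GSSAPI locate the realm and KDC on their own: they honor the java.security.krb5.conf system property if it is set, and otherwise fall back to the OS default such as /etc/krb5.conf on Linux. A trivial sketch:

public class Krb5ConfigLocation {
  public static void main(String[] args) {
    // Equivalent to passing -Djava.security.krb5.conf=/etc/krb5.conf to the JVM;
    // only needed when the file is not in the default location.
    System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
    System.out.println("krb5 config: " + System.getProperty("java.security.krb5.conf"));
  }
}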

Another thing worth noting is that there are two user objects involved in the authentication and connection setup process: the protocol user, which is put into the connection header by the client and read by the server in the client/server protocol, and the authentication user, which is determined by the server in the SASL/GSSAPI stack. The two can be the same, or they can differ, depending on what value the client writes into the connection header as the protocol user. Allowing this difference mainly comes from the often-mentioned proxy-user need, where the user the service should act as, on the client's behalf, is not the user doing the Kerberos authentication, since the former may not have any Kerberos principal account or, for technical reasons as in Oozie, cannot authenticate itself. In Oozie, MapReduce jobs are actually submitted under the oozie service principal, but the jobs themselves should run as the Oozie end users in order to access the privileged resources/files owned by those users. To achieve this, the proxy-user mechanism is used, where the Oozie software, as the Hadoop cluster's client, simply passes the user it wants to act as via the connection header.
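A hedged sketch of that proxy-user pattern from the client side, assuming the service (an Oozie-like daemon) has already done its own Kerberos login and that the cluster allows impersonation via the hadoop.proxyuser.<service>.hosts/.groups settings; the end user "alice" is made up:

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserExample {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();

    // The service's own (Kerberos-authenticated) identity.
    UserGroupInformation realUser = UserGroupInformation.getLoginUser();

    // Impersonate the end user: authentication still happens as the service
    // principal, but the connection header carries "alice" as the protocol user.
    UserGroupInformation proxyUgi =
        UserGroupInformation.createProxyUser("alice", realUser);

    proxyUgi.doAs(new PrivilegedExceptionAction<Void>() {
      @Override
      public Void run() throws Exception {
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Acting as: " + UserGroupInformation.getCurrentUser());
        System.out.println("Home directory: " + fs.getHomeDirectory());
        return null;
      }
    });
  }
}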

OK, I have to stop here for now. Hope to find more time for this. To be continued.

Posted in Apache, Hadoop, Security | Tagged | 1 Comment

Multiple group mapping providers are now possible

User group mapping is an important and very basic piece of functionality in Hadoop. It's used to query group info for Hadoop services and users. As we all know, Hadoop-related components such as HDFS, MapReduce, HBase, Hive, Oozie, etc. all implement their authorization features based on a user + group model. User info comes from the client request; how to get the group info then depends on the configured provider that does the user group mapping. Currently Hadoop supports configuring one such provider, and it's pluggable. There are two providers we can choose from: ShellBasedUnixGroupsMapping (and the like), and LdapGroupsMapping. The former just gets groups from the *nix OS via the "id -Gn" command, and the latter queries an AD/LDAP server for group entries. Generally speaking, ShellBasedUnixGroupsMapping is efficient, reliable and well suited for Hadoop service users such as hdfs, mapred, hbase, hive, oozie, etc., while the LDAP one can be used for Hadoop end users, which avoids having to add group entries for large numbers of end users into the *nix OS.
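For reference, this is roughly how group resolution is exercised from code: the Groups service wraps whichever provider hadoop.security.group.mapping is configured with, so callers never see the concrete implementation. A small sketch (the user name "hdfs" is just an example):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Groups;

public class GroupLookupExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Returns the service backed by the configured provider
    // (ShellBasedUnixGroupsMapping, LdapGroupsMapping, ...).
    Groups groups = Groups.getUserToGroupsMappingService(conf);
    List<String> result = groups.getGroups("hdfs");
    System.out.println("Groups for hdfs: " + result);
  }
}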

So what's the problem? The problem is that currently Hadoop only supports ONE provider, and we can't configure more. You may wonder why we would need more than one. OK, let's look at a typical use case. In big organizations AD is often used as the users' identity store. When a Hadoop cluster is deployed with Kerberos authentication in such an organization, the best practice is to use MIT Kerberos plus AD, where MIT Kerberos trusts the AD realm: Hadoop service principals authenticate with MIT Kerberos, while end users still authenticate with AD as they traditionally do. Regarding user group mapping, since there are two user sources there may well be two group sources, and if only one provider is used then the two group sources have to be merged into a single place, which can be a big overhead or headache. It does not make much sense to add group entries into AD for service principals just for this, and similarly for adding group entries into the *nix OS for end users.

To resolve this problem, one possible solution would be to implement yet another provider that deals with the more complex situation of multiple group sources. That's direct and most flexible, since you can do whatever you want, but obviously you need to develop it yourself. Of course you may not need to if one already exists that does the job. Well, CompositeGroupsMapping can be that one, since it was written exactly for this.

CompositeGroupsMapping can make use of and combine multiple existing group mapping provider implementations to create a virtual one that deals with more than one group source. Using it, one can go with
ShellBasedUnixGroupsMapping + LdapGroupsMapping,
or even more complex,
ShellBasedUnixGroupsMapping + LdapGroupsMapping for domain X + LdapGroupsMapping for domain Y.

So how do you use it? Let's illustrate how to configure the more complex, latter setup:
ShellBasedUnixGroupsMapping for service principals + LdapGroupsMapping for domain X + LdapGroupsMapping for domain Y.

First, configure Hadoop to use the CompositeGroupsMapping provider:

<property>
<name>hadoop.security.group.mapping</name>
<value>org.apache.hadoop.security.CompositeGroupsMapping</value>
<description>
Class for user to group mapping (get groups for a given user) for ACL, which
makes use of other multiple providers to provide the service.
</description>
</property>

Then configure which named providers to combine:

<property>
<name>hadoop.security.group.mapping.providers</name>
<value>shell4services,ad4usersX,ad4usersY</value>
<description>
Comma-separated names of the providers that together provide the user to group mapping.
</description>
</property>

<property>
<name>hadoop.security.group.mapping.provider.shell4services</name>
<value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
<description>
Class for group mapping provider named by 'shell4services'. The name can then be referenced
by hadoop.security.group.mapping.providers property.
</description>
</property>

<property>
<name>hadoop.security.group.mapping.provider.ad4usersX</name>
<value>org.apache.hadoop.security.LdapGroupsMapping</value>
<description>
Class for group mapping provider named by 'ad4usersX'. The name can then be referenced
by hadoop.security.group.mapping.providers property.
</description>
</property>

<property>
<name>hadoop.security.group.mapping.provider.ad4usersY</name>
<value>org.apache.hadoop.security.LdapGroupsMapping</value>
<description>
Class for group mapping provider named by 'ad4usersY'. The name can then be referenced
by hadoop.security.group.mapping.providers property.
</description>
</property>

Next, configure which users (by their domain/realm) should go to which provider for group info:

<property>
<name>hadoop.security.group.mapping.provider.ad4usersX.domain</name>
<value>EXAMPLE-X.COM</value>
<description>
Domain or realm of the users which should go to the provider named 'ad4usersX' when doing group mapping.
</description>
</property>

<property>
<name>hadoop.security.group.mapping.provider.ad4usersY.domain</name>
<value>EXAMPLE-Y.COM</value>
<description>
Domain or realm of the users which should go to the provider named 'ad4usersY' when doing group mapping.
</description>
</property>

Also, we need to set the AD-specific configuration for each LdapGroupsMapping instance:

<property>
<name>hadoop.security.group.mapping.provider.ad4usersX.ldap.url</name>
<value>ldap://ad-host-for-users-X:389</value>
<description>
ldap url for the provider named by 'ad4usersX'. Note this property comes from
'hadoop.security.group.mapping.ldap.url'.
</description>
</property>

<property>
<name>hadoop.security.group.mapping.provider.ad4usersY.ldap.url</name>
<value>ldap://ad-host-for-users-Y:389</value>
<description>
ldap url for the provider named by 'ad4usersY'. Note this property comes from
'hadoop.security.group.mapping.ldap.url'.
</description>
</property>

more here omitted ...

As of writing this blog, the patch for the CompositeGroupsMapping provider is just ready for review. But if you need it right now, you can simply download the patch, apply it and build it yourself. Believe me, or the tests: it works fine.

Your feedback is welcome, thanks.

Posted in Apache, Development, Hadoop | Tagged | Leave a comment

Things worth knowing when creating a patch for Hadoop

So you want to contribute some code to Apache Hadoop; there are surely a few things you need to be aware of regarding how to wrap your code up as a patch.

Open a JIRA

Assuming you're not working on an existing JIRA issue, you may need to open one before you start coding. You need a JIRA mainly because you document there what you're going to resolve and gather feedback and comments, and it's also where you name, attach and submit your patch.

Code in the style

Make sure you're writing the code in the right style. Check your code by running:

mvn checkstyle:checkstyle

Write necessary unit tests

You may need to add your own unit test cases, and at the very least you need to make sure you're not breaking the existing ones. Refer to the unit test development guidelines: HowToDevelopUnitTests.
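For what it's worth, a minimal skeleton of such a test might look like the following; the class and method names are placeholders, not an actual Hadoop test:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestYourPatch {

  @Test
  public void testNewBehavior() {
    // Replace with assertions that exercise the code your patch changes.
    assertEquals(4, 2 + 2);
  }
}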

Run your unit test case

Assuming you add a unit test case named TestYourPatch.java, you can run it alone, without mixing in the other tests, by:

mvn -Dtest=TestYourPatch test

Debug your unit test case

Your patch failed a unit test case! So how do you debug it? You can launch just that test case in debug mode as follows, then remote debug it from your IDE.

mvn -Dmaven.surefire.debug="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8001" -Dtest=TestYourPatch test

Update the document

You may need to update the documentation, either at the API (javadoc) level or in the user guide. Check the javadoc by running:

mvn javadoc:javadoc

Create a patch with the right name

Check the list of files that should be included in your patch:

svn st

Take the key of the JIRA you're working on and name your patch after it:

HADOOP-1234.patch

Create the patch

svn diff > HADOOP-1234.patch

Test your patch

dev-support/test-patch.sh /path/to/my.patch

Attach your patch

Then attach your patch file to the JIRA, asking for code review.

Submit your patch

When you believe that your patch is ready to be committed, select the Submit Patch link on the issue’s Jira.

Reference

Note these are just some quick points and may not be enough, so please read the official HowToContribute guide when necessary.

Posted in Apache, Development, Hadoop | Tagged | Leave a comment

How to quickly checkout an Apache project

Sometimes it's frustrating to follow the repository URL listed on a project's official site only to find it doesn't work, simply because it's obsolete. I ran into such situations recently when working on Hadoop-related projects, and found it doesn't have to be hard to quickly get the SVN or Git URL for any Apache project. We can just get it from the following two index pages.

Git
Browse and look up Git repository URLs here:
http://git.apache.org/

SVN
Browse and look up SVN repository URLs here:
http://svn.apache.org/repos/asf

Posted in Apache, Hadoop | Leave a comment