Knox Incubator

Apache Knox Gateway 0.3.x (Incubator) User’s Guide

Table Of Contents

Introduction

The Apache Knox Gateway is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal is to simplify Hadoop security for both users (i.e. those who access the cluster data and execute jobs) and operators (i.e. those who control access and manage the cluster). The gateway runs as a server (or cluster of servers) that provides centralized access to one or more Hadoop clusters.

Quick Start

Here are the steps to have Apache Knox up and running against a Hadoop Cluster:

  1. Verify system requirements
  2. Download a virtual machine (VM) with Hadoop
  3. Download Apache Knox Gateway
  4. Start the virtual machine with Hadoop
  5. Install Knox
  6. Start the LDAP embedded within Knox
  7. Start the Knox Gateway
  8. Do Hadoop with Knox

1 - Requirements

Java

Java 1.6 or later is required for the Knox Gateway runtime. Use the command below to check the version of Java installed on the system where Knox will be running.

java -version

Hadoop

Knox supports Hadoop 1.x and 2.x. The quick start instructions assume a Hadoop 2.x virtual machine based environment.

2 - Download Hadoop 2.x VM

The quick start provides a link to download a Hadoop 2.0 based Hortonworks Sandbox virtual machine. Please note that Knox supports other Hadoop distributions and can be configured against a full blown Hadoop cluster. Configuring Knox for Hadoop 1.x/2.x versions, for Hadoop deployed in EC2 or for a custom Hadoop cluster is documented in the advanced deployment guide.

3 - Download Apache Knox Gateway

Download one of the distributions below from the Apache mirrors.

Apache Knox Gateway releases are available under the Apache License, Version 2.0. See the NOTICE file contained in each release artifact for applicable copyright attribution notices.

Verify

Verification is optional but recommended. You can verify the integrity of any downloaded files using the PGP signatures. Please read Verifying Apache HTTP Server Releases for more information on why you should verify our releases.

The PGP signatures can be verified using PGP or GPG. First download the KEYS file as well as the .asc signature files for the relevant release packages. Make sure you get these files from the main distribution directory linked above, rather than from a mirror. Then verify the signatures using one of the methods below.

% pgpk -a KEYS
% pgpv knox-incubating-0.3.0.zip.asc

or

% pgp -ka KEYS
% pgp knox-incubating-0.3.0.zip.asc

or

% gpg --import KEYS
% gpg --verify knox-incubating-0.3.0.zip.asc

4 - Start Hadoop virtual machine

Start the Hadoop virtual machine.

5 - Install Knox

The steps required to install the gateway will vary depending upon which distribution format (zip | rpm) was downloaded. In either case you will end up with a directory where the gateway is installed. This directory will be referred to as your {GATEWAY_HOME} throughout this document.

ZIP

If you downloaded the Zip distribution you can simply extract the contents into a directory. The example below provides a command that can be executed to do this. Note the {VERSION} portion of the command must be replaced with an actual Apache Knox Gateway version number. This might be 0.3.0 for example and must match the version in the name of the downloaded file.

jar xf knox-incubating-{VERSION}.zip

This will create a directory knox-incubating-{VERSION} in your current directory. The directory knox-incubating-{VERSION} will be considered your {GATEWAY_HOME}.

RPM

If you downloaded the RPM distribution you can install it using normal RPM package tools. It is important that the installation be performed as the user that will be running the gateway server. This is because several directories are created that are owned by this user. These commands will install Knox to /usr/lib/knox following the pattern of other Hadoop components. This directory will be considered your {GATEWAY_HOME}.

sudo yum localinstall knox-incubating-{VERSION}.rpm

or

sudo rpm -ihv knox-incubating-{VERSION}.rpm

6 - Start LDAP embedded in Knox

Knox comes with an LDAP server for demonstration purposes.

cd {GATEWAY_HOME}
java -jar bin/ldap.jar conf &

7 - Start Knox

The gateway can be started in one of two ways, as java -jar or with a shell script.

Starting via Java

This is the simplest way to start the gateway. Starting this way will result in all logging being written directly to standard output.

cd {GATEWAY_HOME}
java -jar bin/gateway.jar

Upon start, Knox server will prompt you for the master secret (i.e. password). This secret is used to secure artifacts used by the gateway server for things like SSL and credential/password aliasing. This secret will have to be entered at startup unless you choose to persist it.

Starting via script (*nix only)

Run the setup command with root privileges.

cd {GATEWAY_HOME}
sudo bin/gateway.sh setup

The server will prompt you for the master secret (i.e. password).

The server can then be started without root privileges using this command.

cd {GATEWAY_HOME}
bin/gateway.sh start

When starting the gateway this way the process will be run in the background. The log output is written into the directory /var/log/knox. In addition a PID (process ID) is written into /var/run/knox.

In order to stop a gateway that was started with the script use this command.

cd {GATEWAY_HOME}
bin/gateway.sh stop

If for some reason the gateway is stopped other than by using the command above you may need to clear the tracking PID.

cd {GATEWAY_HOME}
bin/gateway.sh clean

NOTE: This command will also clear any log output in /var/log/knox so use this with caution.

8 - Do Hadoop with Knox

Put a file in HDFS via Knox.
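A minimal sketch of this using curl is shown below. It assumes the sandbox topology and guest credentials used elsewhere in this guide and a local file named LICENSE; adjust the file and path for your environment. WebHDFS uses a two-step create: the first request returns a Location header (rewritten by the gateway) to which the file content is then sent.

curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp/LICENSE?op=CREATE'

curl -i -k -u guest:guest-password -T LICENSE -X PUT '{Location-from-response-above}'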

CAT a file in HDFS via Knox.
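A corresponding sketch for reading the file back (again assuming the sandbox topology, guest credentials and the /tmp/LICENSE file created above; the response contains the file content or a redirect to it that can be followed).

curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp/LICENSE?op=OPEN'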

Invoke the LISTSTATUS operation on WebHDFS via the gateway.

This will return a directory listing of the root (i.e. /) directory of HDFS.

curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'

The results of the above command should be something along the lines of the output below. The exact information returned is subject to the content within HDFS in your Hadoop cluster. Successfully executing this command at a minimum proves that the gateway is properly configured to provide access to WebHDFS. It does not necessarily prove that any of the other services are correctly configured to be accessible. To validate that see the sections for the individual services in Service Details.

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 760
Server: Jetty(6.1.26)

{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595859762,"owner":"hdfs","pathSuffix":"apps","permission":"755","replication":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"group":"mapred","length":0,"modificationTime":1350595874024,"owner":"mapred","pathSuffix":"mapred","permission":"755","replication":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350596040075,"owner":"hdfs","pathSuffix":"tmp","permission":"777","replication":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595857178,"owner":"hdfs","pathSuffix":"user","permission":"755","replication":0,"type":"DIRECTORY"}
]}}

Submit a MR job via Knox.

Get status of a MR job via Knox.

Cancel a MR job via Knox.
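The sketches below illustrate these operations via WebHCat (Templeton) through the gateway. They assume the sandbox topology and guest credentials used above; the jar path, class name, arguments and {job-id} are hypothetical placeholders for your own job.

# Submit a MapReduce job (the response contains the job ID).
curl -i -k -u guest:guest-password -X POST \
    -d jar=/user/guest/hadoop-examples.jar -d class=wordcount \
    -d arg=/tmp/input -d arg=/tmp/output \
    'https://localhost:8443/gateway/sandbox/templeton/v1/mapreduce/jar'

# Get the status of the job using the returned job ID.
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/templeton/v1/jobs/{job-id}'

# Cancel (kill) the job.
curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/templeton/v1/jobs/{job-id}'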

More Examples

Apache Knox Details

This section provides everything you need to know to get the Knox gateway up and running against a Hadoop cluster.

Hadoop

An existing Hadoop 1.x or 2.x cluster is required for Knox to sit in front of and protect. It is possible to use a Hadoop cluster deployed on EC2 but this will require additional configuration not covered here. It is also possible to use a limited set of services in a Hadoop cluster secured with Kerberos. This too requires additional configuration that is not described here. See Supported Services for details on what is supported for this release.

Ensure that the Hadoop cluster has at least WebHDFS, WebHCat (i.e. Templeton) and Oozie configured, deployed and running. HBase/Stargate and Hive can also be accessed via the Knox Gateway given the proper versions and configuration.

The instructions that follow assume a few things:

  1. The gateway is not collocated with the Hadoop clusters themselves.
  2. The host names and IP addresses of the cluster services are accessible by the gateway wherever it happens to be running.

All of the instructions and samples provided here are tailored and tested to work “out of the box” against a Hortonworks Sandbox 2.x VM.

Apache Knox Directory Layout

Knox can be installed by expanding the zip file or with the RPM. With an RPM based install the following directories are created in addition to those described in this section.

/usr/lib/knox
/var/log/knox
/var/run/knox

The directory /usr/lib/knox is considered your {GATEWAY_HOME} and will adhere to the layout described below. The directory /var/log/knox will contain the output files from the server. The directory /var/run/knox will contain the process ID for a currently running gateway server.

Regardless of the installation method used the layout and content of the {GATEWAY_HOME} will be identical. The table below provides a brief explanation of the important files and directories within {GATEWAY_HOME}.

Directory Purpose
conf/ Contains configuration files that apply to the gateway globally (i.e. not cluster specific ).
bin/ Contains the executable shell scripts, batch files and JARs for clients and servers.
deployments/ Contains topology descriptors used to configure the gateway for specific Hadoop clusters.
lib/ Contains the JARs for all the components that make up the gateway.
dep/ Contains the JARs for all of the components upon which the gateway depends.
ext/ A directory where user supplied extension JARs can be placed to extend the gateway's functionality.
samples/ Contains a number of samples that can be used to explore the functionality of the gateway.
templates/ Contains default configuration files that can be copied and customized.
README Provides basic information about the Apache Knox Gateway.
ISSUES Describes significant known issues.
CHANGES Enumerates the changes between releases.
LICENSE Documents the license under which this software is provided.
NOTICE Documents required attribution notices for included dependencies.
DISCLAIMER Documents that this release is from a project undergoing incubation at Apache.

Supported Services

This table enumerates the versions of various Hadoop services that have been tested to work with the Knox Gateway. When secured via Kerberos, only the more recent versions of some Hadoop components can be accessed via the Knox Gateway.

Service             Version  Non-Secure  Secure
WebHDFS             2.1.0    y           y
WebHCat/Templeton   0.11.0   y           n
                    0.12.0   y           y
Oozie               4.0.0    y           y
HBase/Stargate      0.95.2   y           n
Hive (via WebHCat)  0.11.0   y           n
                    0.12.0   y           y
Hive (via JDBC)     0.11.0   n           n
                    0.12.0   y           n
Hive (via ODBC)     0.11.0   n           n
                    0.12.0   n           n

More Examples

These examples provide more detail about how to access various Apache Hadoop services via the Apache Knox Gateway.

Gateway Details

TODO

URL Mapping

The gateway functions much like a reverse proxy. As such it maintains a mapping of URLs that are exposed externally by the gateway to URLs that are provided by the Hadoop cluster. Example mappings for WebHDFS and WebHCat are shown below; Oozie and Stargate/HBase follow the same pattern. These mappings are generated from the combination of the gateway configuration file (i.e. {GATEWAY_HOME}/conf/gateway-site.xml) and the cluster topology descriptors (e.g. {GATEWAY_HOME}/deployments/{cluster-name}.xml). The port numbers shown for the Cluster URLs represent the default ports for these services. The actual port number may be different for a given cluster.

The values for {gateway-host}, {gateway-port}, {gateway-path} are provided via the gateway configuration file (i.e. {GATEWAY_HOME}/conf/gateway-site.xml).

The value for {cluster-name} is derived from the file name of the cluster topology descriptor (e.g. {GATEWAY_HOME}/deployments/{cluster-name}.xml).

The value for {webhdfs-host}, {webhcat-host}, {oozie-host} and {hbase-host} are provided via the cluster topology descriptor (e.g. {GATEWAY_HOME}/deployments/{cluster-name}.xml).

Note: The ports 50070, 50111, 11000 and 60080 are the defaults for WebHDFS, WebHCat, Oozie and Stargate/HBase respectively. Their values can also be provided via the cluster topology descriptor if your Hadoop cluster uses different ports.
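As an illustration (a sketch derived from the placeholders and default ports described above), the WebHDFS and WebHCat mappings follow this pattern; Oozie and Stargate/HBase are mapped the same way using their respective hosts and default ports.

WebHDFS
    Gateway:  https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs
    Cluster:  http://{webhdfs-host}:50070/webhdfs

WebHCat (Templeton)
    Gateway:  https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton
    Cluster:  http://{webhcat-host}:50111/templeton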

Configuration

Topology Descriptors

The topology descriptor files provide the gateway with per-cluster configuration information. This includes configuration for both the providers within the gateway and the services within the Hadoop cluster. These files are located in {GATEWAY_HOME}/deployments. The general outline of this document looks like this.

<topology>
    <gateway>
        <provider>
        </provider>
    </gateway>
    <service>
    </service>
</topology>

There are typically multiple <provider> and <service> elements.

/topology
Defines the provider and configuration and service topology for a single Hadoop cluster.
/topology/gateway
Groups all of the provider elements
/topology/gateway/provider
Defines the configuration of a specific provider for the cluster.
/topology/service
Defines the location of a specific Hadoop service within the Hadoop cluster.
Provider Configuration

Provider configuration is used to customize the behavior of a particular gateway feature. The general outline of a provider element looks like this.

<provider>
    <role>authentication</role>
    <name>ShiroProvider</name>
    <enabled>true</enabled>
    <param>
        <name></name>
        <value></value>
    </param>
</provider>
/topology/gateway/provider
Groups information for a specific provider.
/topology/gateway/provider/role
Defines the role of a particular provider. There are a number of pre-defined roles used by out-of-the-box provider plugins for the gateway. These roles are: authentication, identity-assertion, authorization, rewrite and hostmap.
/topology/gateway/provider/name
Defines the name of the provider for which this configuration applies. There can be multiple provider implementations for a given role. Specifying the name identifies which particular provider is being configured. Typically each topology descriptor should contain only one provider for each role but there are exceptions.
/topology/gateway/provider/enabled
Allows a particular provider to be enabled or disabled via true or false respectively. When a provider is disabled any filters associated with that provider are excluded from the processing chain.
/topology/gateway/provider/param
These elements are used to supply provider configuration. There can be zero or more of these per provider.
/topology/gateway/provider/param/name
The name of a parameter to pass to the provider.
/topology/gateway/provider/param/value
The value of a parameter to pass to the provider.
Service Configuration

Service configuration is used to specify the location of services within the Hadoop cluster. The general outline of a service element looks like this.

<service>
    <role>WEBHDFS</role>
    <url>http://localhost:50070/webhdfs</url>
</service>
/topology/service
Provides information about a particular service within the Hadoop cluster. Not all services are necessarily exposed as gateway endpoints.
/topology/service/role
Identifies the role of this service. Currently supported roles are: WEBHDFS, WEBHCAT, WEBHBASE, OOZIE, HIVE, NAMENODE and JOBTRACKER. Additional service roles can be supported via plugins.
/topology/service/url
The URL identifying the location of a particular service within the Hadoop cluster.

Hostmap Provider

The purpose of the Hostmap provider is to handle situations where hosts are known by one name within the cluster and another name externally. This frequently occurs when virtual machines are used, in particular with cloud hosting services. Currently the Hostmap provider is configured as part of the topology file. The basic structure is shown below.

<topology>
    <gateway>
        ...
        <provider>
            <role>hostmap</role>
            <name>static</name>
            <enabled>true</enabled>
            <param><name>external-host-name</name><value>internal-host-name</value></param>
        </provider>
        ...
    </gateway>
    ...
</topology>

This mapping is required because the Hadoop services running within the cluster are unaware that they are being accessed from outside the cluster. Therefore URLs returned as part of REST API responses will typically contain internal host names. Since clients outside the cluster will be unable to resolve those host names they must be mapped to external host names.

Hostmap Provider Example - EC2

Consider an EC2 example where two VMs have been allocated. Each VM has an external host name by which it can be accessed via the internet. However the EC2 VM is unaware of this external host name and instead is configured with the internal host name.

External HOSTNAMES:
ec2-23-22-31-165.compute-1.amazonaws.com
ec2-23-23-25-10.compute-1.amazonaws.com

Internal HOSTNAMES:
ip-10-118-99-172.ec2.internal
ip-10-39-107-209.ec2.internal

The Hostmap configuration required to allow access external to the Hadoop cluster via the Apache Knox Gateway would be this.

<topology>
    <gateway>
        ...
        <provider>
            <role>hostmap</role>
            <name>static</name>
            <enabled>true</enabled>
            <param>
                <name>ec2-23-22-31-165.compute-1.amazonaws.com</name>
                <value>ip-10-118-99-172.ec2.internal</value>
            </param>
            <param>
                <name>ec2-23-23-25-10.compute-1.amazonaws.com</name>
                <value>ip-10-39-107-209.ec2.internal</value>
            </param>
        </provider>
        ...
    </gateway>
    ...
</topology>
Hostmap Provider Example - Sandbox

The Hortonworks Sandbox 2.x poses a different challenge for host name mapping. This version of the Sandbox uses port mapping to make the Sandbox VM appear as though it is accessible via localhost. However the Sandbox VM is internally configured to consider sandbox.hortonworks.com as the host name. So from the perspective of a client accessing Sandbox the external host name is localhost. The Hostmap configuration required to allow access to Sandbox from the host operating system is this.

<topology>
    <gateway>
        ...
        <provider>
            <role>hostmap</role>
            <name>static</name>
            <enabled>true</enabled>
            <param><name>localhost</name><value>sandbox,sandbox.hortonworks.com</value></param>
        </provider>
        ...
    </gateway>
    ...
</topology>
Hostmap Provider Configuration

Details about each provider configuration element are enumerated below.

topology/gateway/provider/role
The role for a Hostmap provider must always be hostmap.
topology/gateway/provider/name
The Hostmap provider supplied out-of-the-box is selected via the name static.
topology/gateway/provider/enabled
Host mapping can be enabled or disabled by providing true or false.
topology/gateway/provider/param
Host mapping is configured by providing parameters for each external to internal mapping.
topology/gateway/provider/param/name
The parameter name represents an external host name associated with the internal host names provided by the value element. This can be a comma separated list of host names that all represent the same physical host. When mapping from internal to external host names the first external host name in the list is used.
topology/gateway/provider/param/value
The parameter value represents the internal host names associated with the external host name provided by the name element. This can be a comma separated list of host names that all represent the same physical host. When mapping from external to internal host names the first internal host name in the list is used.

Logging

If necessary you can enable additional logging by editing the log4j.properties file in the conf directory. Changing the rootLogger value from ERROR to DEBUG will generate a large amount of debug logging. A number of useful, more fine-grained loggers are also provided in the file.

Java VM Options

TODO - Java VM options doc.

Persisting the Master Secret

The master secret is required to start the server. This secret is used to access secured artifacts by the gateway instance. Keystore, trust stores and credential stores are all protected with the master secret.

You may persist the master secret by supplying the -persist-master switch at startup. This will result in a warning indicating that persisting the secret is less secure than providing it at startup. We do make some provisions in order to protect the persisted password.

It is encrypted with AES 128 bit encryption and where possible the file permissions are set to only be accessible by the user that the gateway is running as.

After persisting the secret, ensure that the file at conf/security/master has the appropriate permissions set for your environment. This is probably the most important layer of defense for the master secret. Do not assume that the encryption is sufficient protection.

A specific user should be created to run the gateway; this will help protect a persisted master file.
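The -persist-master switch is supplied on the gateway startup command line. For example (a sketch, assuming the java -jar startup described above):

cd {GATEWAY_HOME}
java -jar bin/gateway.jar -persist-master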

Management of Security Artifacts

There are a number of artifacts that are used by the gateway in ensuring the security of wire level communications, access to protected resources and the encryption of sensitive data. These artifacts can be managed from outside of the gateway instances or generated and populated by the gateway instance itself.

The following is a description of how this is coordinated with both standalone (development, demo, etc) gateway instances and instances as part of a cluster of gateways in mind.

Upon start of the gateway server we:

  1. Look for an identity store at conf/security/keystores/gateway.jks. The identity store contains the certificate and private key used to represent the identity of the server for SSL connections and signature creation.
  2. Look for a credential store at conf/security/keystores/__gateway-credentials.jceks. This credential store is used to store secrets/passwords that are used by the gateway. For instance, this is where the pass-phrase for accessing the gateway-identity certificate is kept.

Upon deployment of a Hadoop cluster topology within the gateway we:

  1. Look for a credential store for the topology. For instance, we have a sample topology that gets deployed out of the box. We look for conf/security/keystores/sandbox-credentials.jceks. This topology specific credential store is used for storing secrets/passwords that are used for encrypting sensitive data with topology specific keys.

By leveraging the algorithm described above we can provide a window of opportunity for management of these artifacts in a number of ways.

  1. Using a single gateway instance as a master instance the artifacts can be generated or placed into the expected location and then replicated across all of the slave instances before startup.
  2. Using an NFS mount as a central location for the artifacts would provide a single source of truth without the need to replicate them over the network. Of course, NFS mounts have their own challenges.

Keystores

In order to provide your own certificate for use by the gateway, you will need to either import an existing key pair into a Java keystore or generate a self-signed cert using the Java keytool.

Importing a key pair into a Java keystore

—-NEEDS TESTING

One way to accomplish this is to start with a PKCS12 store for your key pair and then convert it to a Java keystore or JKS.

openssl pkcs12 -export -in cert.pem -inkey key.pem > server.p12

The above example uses openssl to create a PKCS12 encoded store from your provided certificate and private key.

keytool -importkeystore -srckeystore {server.p12} -destkeystore gateway.jks -srcstoretype pkcs12

This example converts the PKCS12 store into a Java keystore (JKS). It should prompt you for the keystore and key passwords for the destination keystore. You must use the master-secret for both.

While using this approach there are a couple of important things to be aware of:

  1. the alias MUST be “gateway-identity”
  2. the name of the expected identity keystore for the gateway MUST be gateway.jks
  3. the passwords for the keystore and the imported key MUST both be the master secret for the gateway install

NOTE: The password for the keystore as well as that of the imported key must be the master secret for the gateway instance.

—-END NEEDS TESTING

Generating a self-signed cert for use in testing or development environments
keytool -genkey -keyalg RSA -alias gateway-identity -keystore gateway.jks \
    -storepass {master-secret} -validity 360 -keysize 2048

Keytool will prompt you for a number of elements that will comprise the distinguished name (DN) within your certificate.

NOTE: When it prompts you for your First and Last name be sure to type in the hostname of the machine that your gateway instance will be running on. This is used by clients during hostname verification to ensure that the presented certificate matches the hostname that was used in the URL for the connection - so they need to match.

NOTE: When it prompts for the key password just press enter to ensure that it is the same as the keystore password, which, as described earlier, must match the master secret for the gateway instance.

Credential Store

Whenever you provide your own keystore with either a self-signed cert or a real certificate signed by a trusted authority, you will need to create an empty credential store. This is necessary for the current release in order for the system to utilize the same password for the keystore and the key.

The credential stores in Knox use the JCEKS keystore type as it allows for the storage of general secrets in addition to certificates.

keytool -genkey -alias {anything} -keystore __gateway-credentials.jceks \
    -storepass {master-secret} -validity 360 -keysize 1024 -storetype JCEKS

Follow the prompts again for the DN for the cert of the credential store. This certificate isn’t really used for anything at the moment but is required to create the credential store.

Provisioning of Keystores

Once you have created these keystores you must move them into place for the gateway to discover them and use them to represent its identity for SSL connections. This is done by copying the keystores to the {GATEWAY_HOME}/conf/security/keystores directory for your gateway install.
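For example (a sketch, assuming the gateway.jks and __gateway-credentials.jceks files created above are in the current directory):

cp gateway.jks __gateway-credentials.jceks {GATEWAY_HOME}/conf/security/keystores/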

Summary of Secrets to be Managed

  1. Master secret - the same for all gateway instances in a cluster of gateways
  2. All security related artifacts are protected with the master secret
  3. Secrets used by the gateway itself are stored within the gateway credential store and are the same across all gateway instances in the cluster of gateways
  4. Secrets used by providers within cluster topologies are stored in topology specific credential stores and are the same for the same topology across the cluster of gateway instances. However, they are specific to the topology - so secrets for one hadoop cluster are different from those of another. This allows for fail-over from one gateway instance to another even when encryption is being used while not allowing the compromise of one encryption key to expose the data for all clusters.

NOTE: the SSL certificate will need special consideration depending on the type of certificate. Wildcard certs may be able to be shared across all gateway instances in a cluster. When certs are dedicated to specific machines the gateway identity store will not be able to be blindly replicated as host name verification problems will ensue. Obviously, trust-stores will need to be taken into account as well.

Authentication

There are two types of providers supported in Knox for establishing a user’s identity:

  1. Authentication Providers
  2. Federation Providers

Authentication providers directly accept a user's credentials and validate them against some particular user store. Federation providers, on the other hand, validate a token that has been issued for the user by a trusted Identity Provider (IdP).

The current release of Knox ships with an authentication provider based on the Apache Shiro project and is initially configured for BASIC authentication against an LDAP store. This has been specifically tested against Apache Directory Server and Active Directory.

This section will cover the general approach to leveraging Shiro within the bundled provider including:

  1. General mapping of provider config to shiro.ini config
  2. Specific configuration for the bundled BASIC/LDAP configuration
  3. Some tips on what may need to be customized for your environment
  4. How to setup the use of LDAP over SSL or LDAPS

General Configuration for Shiro Provider

As is described in the configuration section of this document, providers have a name-value based configuration - as is the common pattern in the rest of Hadoop.

The following example shows the format of the configuration for a given provider:

<provider>
    <role>authentication</role>
    <name>ShiroProvider</name>
    <enabled>true</enabled>
    <param>
        <name>{name}</name>
        <value>{value}</value>
    </param>
</provider>

Internally, the Shiro provider currently expects a shiro.ini file in the WEB-INF directory of the cluster specific web application.

The following example illustrates a configuration of the bundled BASIC/LDAP authentication config in a shiro.ini file:

[urls]
/**=authcBasic
[main]
ldapRealm=org.apache.shiro.realm.ldap.JndiLdapRealm
ldapRealm.contextFactory.authenticationMechanism=simple
ldapRealm.contextFactory.url=ldap://localhost:33389
ldapRealm.userDnTemplate=uid={0},ou=people,dc=hadoop,dc=apache,dc=org

In order to fit into the context of an INI file format, at deployment time we interrogate the parameters provided in the provider configuration and parse the INI section out of the parameter names. The following provider config illustrates this approach. Notice that the section names in the above shiro.ini match the beginning of the param names that are in the following config:

<gateway>
    <provider>
        <role>authentication</role>
        <name>ShiroProvider</name>
        <enabled>true</enabled>
        <param>
            <name>main.ldapRealm</name>
            <value>org.apache.shiro.realm.ldap.JndiLdapRealm</value>
        </param>
        <param>
            <name>main.ldapRealm.userDnTemplate</name>
            <value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
        </param>
        <param>
            <name>main.ldapRealm.contextFactory.url</name>
            <value>ldap://localhost:33389</value>
        </param>
        <param>
            <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
            <value>simple</value>
        </param>
        <param>
            <name>urls./**</name>
            <value>authcBasic</value>
        </param>
    </provider>
</gateway>

This happens to be the way that we are currently configuring Shiro for BASIC/LDAP authentication. This same config approach may be used to achieve other authentication mechanisms or variations on this one. We however have not tested additional uses for it for this release.

LDAP Configuration

This section discusses the LDAP configuration used above for the Shiro Provider. Some of these configuration elements will need to be customized to reflect your deployment environment.

main.ldapRealm - this element indicates the fully qualified class name of the Shiro realm to be used in authenticating the user. The class name provided by default in the sample is org.apache.shiro.realm.ldap.JndiLdapRealm. This implementation provides us with the ability to authenticate but by default has authorization disabled. In order to provide authorization - which is seen by Shiro as dependent on an LDAP schema that is specific to each organization - an extension of JndiLdapRealm is generally used to override and implement the doGetAuthorizationInfo method. In this particular release we are providing a simple authorization provider that can be used along with the Shiro authentication provider.

main.ldapRealm.userDnTemplate - in order to bind a simple username to an LDAP server that generally requires a full distinguished name (DN), we must provide the template into which the simple username will be inserted. This template allows for the creation of a DN by injecting the simple username into the common name (CN) portion of the DN. This element will need to be customized to reflect your deployment environment. The template provided in the sample is only an example and is valid only within the LDAP schema distributed with Knox, which is represented by the users.ldif file in the {GATEWAY_HOME}/conf directory.

main.ldapRealm.contextFactory.url - this element is the URL that represents the host and port of the LDAP server. It also includes the scheme of the protocol to use. This may be either ldap or ldaps depending on whether you are communicating with the LDAP server over SSL (highly recommended). This element will need to be customized to reflect your deployment environment.

main.ldapRealm.contextFactory.authenticationMechanism - this element indicates the type of authentication that should be performed against the LDAP server. The current default value is simple which indicates a simple bind operation. This element should not need to be modified and no mechanism other than a simple bind has been tested for this particular release.

urls./** - this element represents a single URL_Ant_Path_Expression and the value is the Shiro filter chain to apply to it. This particular sample indicates that all paths into the application have the same Shiro filter chain applied. The paths are relative to the application context path. The use of the value authcBasic here indicates that BASIC authentication is expected for every path into the application. Adding an additional Shiro filter to that chain for validating that the request isSecure() and over SSL can be achieved by changing the value to ssl, authcBasic. It is not likely that you need to change this element for your environment.

Active Directory - Special Note

You would use LDAP configuration as documented above to authenticate against Active Directory as well.

Some Active Directory specific things to keep in mind:

A typical AD main.ldapRealm.userDnTemplate value looks slightly different, such as cn={0},cn=users,DC=lab,DC=sample,dc=com

Please compare this with a typical Apache DS main.ldapRealm.userDnTemplate value and note the difference: uid={0},ou=people,dc=hadoop,dc=apache,dc=org

If your AD is configured to authenticate based on just the cn and password and does not require a user DN, you do not have to specify a value for main.ldapRealm.userDnTemplate.

LDAP over SSL (LDAPS) Configuration

In order to communicate with your LDAP server over SSL (again, highly recommended), you will need to modify the topology file in a couple ways and possibly provision some keying material.

  1. main.ldapRealm.contextFactory.url must be changed to use the ldaps protocol scheme and the port must be the SSL listener port on your LDAP server (see the sketch after this list).
  2. Identity certificate (keypair) provisioned to LDAP server - your LDAP server specific documentation should indicate what is required for providing a cert or keypair to represent the LDAP server identity to connecting clients.
  3. Trusting the LDAP Server’s public key - if the LDAP Server’s identity certificate is issued by a well known and trusted certificate authority and is already represented in the JRE’s cacerts truststore then you don’t need to do anything for trusting the LDAP server’s cert. If, however, the cert is self-signed or issued by an untrusted authority you will need to either add it to the cacerts keystore or to another truststore that you may direct Knox to utilize through a system property.
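A sketch of the corresponding topology change is shown below; the {ldap-host} placeholder and port 636 (the standard LDAPS port) are assumptions to be replaced with your LDAP server's actual SSL listener host and port.

<param>
    <name>main.ldapRealm.contextFactory.url</name>
    <value>ldaps://{ldap-host}:636</value>
</param>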

Session Configuration

Knox maps each cluster topology to a web application and leverages standard JavaEE session management.

To configure the session idle timeout for the topology, specify a value for the sessionTimeout parameter of the ShiroProvider in your topology file. If you do not specify a value for this parameter, it defaults to 30 minutes.

The definition would look like the following in the topology file:

...
<provider>
    <role>authentication</role>
    <name>ShiroProvider</name>
    <enabled>true</enabled>
    <param>
        <!--
        Session timeout in minutes. This is really idle timeout.
        Defaults to 30 minutes, if the property value is not defined.
        Current client authentication will expire if client idles
        continuously for more than this value
        -->
        <name>sessionTimeout</name>
        <value>30</value>
    </param>
</provider>
...

At present, the ShiroProvider in Knox leverages the JavaEE session to maintain authentication state for a user across requests using the JSESSIONID cookie. So, instead of submitting a userid/password with every request, a client that has authenticated with Knox can pass the JSESSIONID cookie with repeated requests as long as the session has not timed out. Presenting a valid session cookie in place of a userid/password also performs better because additional credential store lookups are avoided.
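As an illustrative sketch (assuming the sandbox topology and guest credentials used elsewhere in this guide), curl can capture the session cookie on the first request and replay it on subsequent requests:

# First request authenticates and saves the JSESSIONID cookie.
curl -i -k -u guest:guest-password -c knox-cookies.txt \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'

# Later requests present the cookie instead of credentials,
# until the session idle timeout expires.
curl -i -k -b knox-cookies.txt \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'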

Identity Assertion

The identity assertion provider within Knox plays the critical role of communicating the identity principal to be used within the Hadoop cluster to represent the identity that has been authenticated at the gateway.

The general responsibility of the identity assertion provider is to interrogate the current Java Subject that has been established by the authentication or federation provider and:

  1. determine whether it matches any principal mapping rules and apply them appropriately
  2. determine whether it matches any group principal mapping rules and apply them
  3. if it is determined that the principal will be impersonating another through a principal mapping rule then a Subject.doAs is required so that providers farther downstream can determine the appropriate effective principal name and groups for the user

The following configuration is required for asserting the user's identity to the Hadoop cluster using Pseudo or Simple “authentication”.

<provider>
    <role>identity-assertion</role>
    <name>Pseudo</name>
    <enabled>true</enabled>
</provider>

This particular configuration indicates that the Pseudo identity assertion provider is enabled and that there are no principal mapping rules to apply to identities flowing from the authentication in the gateway to the backend Hadoop cluster services. The primary principal of the current subject will therefore be asserted via a query parameter or as a form parameter - i.e. ?user.name={primaryPrincipal}

<provider>
    <role>identity-assertion</role>
    <name>Pseudo</name>
    <enabled>true</enabled>
    <param>
        <name>principal.mapping</name>
        <value>guest=hdfs;</value>
    </param>
    <param>
        <name>group.principal.mapping</name>
        <value>*=users;hdfs=admin</value>
    </param>
</provider>

This configuration identifies the same identity assertion provider but does provide principal and group mapping rules. In this case, when a user is authenticated as “guest” his identity is actually asserted to the Hadoop cluster as “hdfs”. In addition, since there are group principal mappings defined, he will also be considered as a member of the groups “users” and “admin”. In this particular example the wildcard “*” is used to indicate that all authenticated users need to be considered members of the “users” group and that only the user “hdfs” is mapped to be a member of the “admin” group.

NOTE: These group memberships are currently only meaningful for Service Level Authorization using the AclsAuthorization provider. The groups are not currently asserted to the Hadoop cluster at this time. See the Authorization section within this guide to see how this is used.

The principal mapping aspect of the identity assertion provider is important to understand in order to fully utilize the authorization features of this provider.

This feature allows us to map the authenticated principal to a runas or impersonated principal to be asserted to the Hadoop services in the backend.

When a principal mapping is defined that results in an impersonated principal being created the impersonated principal is then the effective principal.

If there is no mapping to another principal then the authenticated or primary principal is then the effective principal.

Principal Mapping

<param>
    <name>principal.mapping</name>
    <value>{primaryPrincipal}[,...]={impersonatedPrincipal}[;...]</value>
</param>

For instance:

<param>
    <name>principal.mapping</name>
    <value>guest=hdfs</value>
</param>

For multiple mappings:

<param>
    <name>principal.mapping</name>
    <value>guest,alice=hdfs;mary=alice2</value>
</param>

Group Principal Mapping

<param>
    <name>group.principal.mapping</name>
    <value>{userName[,*|userName...]}={groupName[,groupName...]}[,...]</value>
</param>

For instance:

<param>
    <name>group.principal.mapping</name>
    <value>*=users;hdfs=admin</value>
</param>

this configuration indicates that all (*) authenticated users are members of the “users” group and that user “hdfs” is a member of the admin group. Group principal mapping has been added along with the authorization provider described in this document.

Authorization

Service Level Authorization

The Knox Gateway has an out-of-the-box authorization provider that allows administrators to restrict access to the individual services within a Hadoop cluster.

This provider utilizes a simple and familiar pattern of using ACLs to protect Hadoop resources by specifying users, groups and IP addresses that are permitted access.

Note: In the examples below {serviceName} represents a real service name (e.g. WEBHDFS) and would be replaced with these values in an actual configuration.

Usecases
USECASE-1: Restrict access to specific Hadoop services to specific Users
<param>
    <name>{serviceName}.acl</name>
    <value>guest;*;*</value>
</param>
USECASE-2: Restrict access to specific Hadoop services to specific Groups
<param>
    <name>{serviceName}.acl</name>
    <value>*;admins;*</value>
</param>
USECASE-3: Restrict access to specific Hadoop services to specific Remote IPs
<param>
    <name>{serviceName}.acl</name>
    <value>*;*;127.0.0.1</value>
</param>
USECASE-4: Restrict access to specific Hadoop services to specific Users OR users within specific Groups
<param>
    <name>{serviceName}.acl.mode</name>
    <value>OR</value>
</param>
<param>
    <name>{serviceName}.acl</name>
    <value>guest;admin;*</value>
</param>
USECASE-5: Restrict access to specific Hadoop services to specific Users OR users from specific Remote IPs
<param>
    <name>{serviceName}.acl.mode</name>
    <value>OR</value>
</param>
<param>
    <name>{serviceName}.acl</name>
    <value>guest;*;127.0.0.1</value>
</param>
USECASE-6: Restrict access to specific Hadoop services to users within specific Groups OR from specific Remote IPs
<param>
    <name>{serviceName}.acl.mode</name>
    <value>OR</value>
</param>
<param>
    <name>{serviceName}.acl</name>
    <value>*;admin;127.0.0.1</value>
</param>
USECASE-7: Restrict access to specific Hadoop services to specific Users OR users within specific Groups OR from specific Remote IPs
<param>
    <name>{serviceName}.acl.mode</name>
    <value>OR</value>
</param>
<param>
    <name>{serviceName}.acl</name>
    <value>guest;admin;127.0.0.1</value>
</param>
USECASE-8: Restrict access to specific Hadoop services to specific Users AND users within specific Groups
<param>
    <name>{serviceName}.acl</name>
    <value>guest;admin;*</value>
</param>
USECASE-9: Restrict access to specific Hadoop services to specific Users AND users from specific Remote IPs
<param>
    <name>{serviceName}.acl</name>
    <value>guest;*;127.0.0.1</value>
</param>
USECASE-10: Restrict access to specific Hadoop services to users within specific Groups AND from specific Remote IPs
<param>
    <name>{serviceName}.acl</name>
    <value>*;admins;127.0.0.1</value>
</param>
USECASE-11: Restrict access to specific Hadoop services to specific Users AND users within specific Groups AND from specific Remote IPs
<param>
    <name>{serviceName}.acl</name>
    <value>guest;admins;127.0.0.1</value>
</param>

Configuration

ACLs are bound to services within the topology descriptors by introducing the authorization provider with configuration like:

<provider>
    <role>authorization</role>
    <name>AclsAuthz</name>
    <enabled>true</enabled>
</provider>

The above configuration enables the authorization provider but does not indicate any ACLs yet and therefore there is no restriction to accessing the Hadoop services. In order to indicate the resources to be protected and the specific users, groups or IPs to grant access, we need to provide parameters like the following:

<param>
    <name>{serviceName}.acl</name>
    <value>username[,*|username...];group[,*|group...];ipaddr[,*|ipaddr...]</value>
</param>

where {serviceName} would need to be the name of a configured Hadoop service within the topology.

NOTE: ipaddr is unique among the parts of the ACL in that you are able to specify a wildcard to indicate that the remote address must begin with the string prior to the asterisk within the ipaddr ACL. For instance:

<param>
    <name>{serviceName}.acl</name>
    <value>*;*;192.168.*</value>
</param>

This indicates that the request must come from an IP address that begins with ‘192.168.’ in order to be granted access.

Note also that configuration without any ACLs defined is equivalent to:

<param>
    <name>{serviceName}.acl</name>
    <value>*;*;*</value>
</param>

meaning: all users, groups and IPs have access. Each of the elements of the acl param supports multiple values via a comma separated list and the * wildcard to match any.

For instance:

<param>
    <name>webhdfs.acl</name>
    <value>hdfs;admin;127.0.0.2,127.0.0.3</value>
</param>

this configuration indicates that ALL of the following are satisfied:

  1. the user “hdfs” has access AND
  2. users in the group “admin” have access AND
  3. any authenticated user from either 127.0.0.2 or 127.0.0.3 will have access

This allows us to craft policy that restricts the members of a large group to a subset that should have access. Removing a user from the group will cause access to be denied even though their username may still be listed in the ACL.

An additional configuration element may be used to alter the processing of the ACL to be OR instead of the default AND behavior:

<param>
    <name>{serviceName}.acl.mode</name>
    <value>OR</value>
</param>

this processing behavior requires that the effective user satisfy one of the parts of the ACL definition in order to be granted access. For instance:

<param>
    <name>webhdfs.acl</name>
    <value>hdfs,guest;admin;127.0.0.2,127.0.0.3</value>
</param>

You may also set the ACL processing mode at the top level for the topology. This essentially sets the default for the managed cluster. It may then be overridden at the service level as well.

<param>
    <name>acl.mode</name>
    <value>OR</value>
</param>

this configuration indicates that ONE of the following must be satisfied to be granted access:

  1. the user is “hdfs” or “guest” OR
  2. the user is in “admin” group OR
  3. the request is coming from 127.0.0.2 or 127.0.0.3

Other Related Configuration

The principal mapping aspect of the identity assertion provider is important to understand in order to fully utilize the authorization features of this provider.

This feature allows us to map the authenticated principal to a runas or impersonated principal to be asserted to the Hadoop services in the backend. When a principal mapping is defined that results in an impersonated principal being created the impersonated principal is then the effective principal. If there is no mapping to another principal then the authenticated or primary principal is then the effective principal. Principal mapping has actually been available in the identity assertion provider from the beginning of Knox and is documented fully in the Identity Assertion section of this guide.

<param>
    <name>principal.mapping</name>
    <value>{primaryPrincipal}[,...]={impersonatedPrincipal}[;...]</value>
</param>

For instance:

<param>
    <name>principal.mapping</name>
    <value>guest=hdfs</value>
</param>

In addition, we allow the administrator to map groups to effective principals. This is done through another param within the identity assertion provider:

<param>
    <name>group.principal.mapping</name>
    <value>{userName[,*|userName...]}={groupName[,groupName...]}[,...]</value>
</param>

For instance:

<param>
    <name>group.principal.mapping</name>
    <value>*=users;hdfs=admin</value>
</param>

this configuration indicates that all (*) authenticated users are members of the “users” group and that user “hdfs” is a member of the admin group. Group principal mapping has been added along with the authorization provider described in this document.

For more information on principal and group principal mapping see the Identity Assertion section of this guide.

These additional mapping capabilities are used together with the authorization ACL policy. An example of a full topology that illustrates these together is below.

<topology>
    <gateway>
        <provider>
            <role>authentication</role>
            <name>ShiroProvider</name>
            <enabled>true</enabled>
            <param>
                <name>main.ldapRealm</name>
                <value>org.apache.shiro.realm.ldap.JndiLdapRealm</value>
            </param>
            <param>
                <name>main.ldapRealm.userDnTemplate</name>
                <value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.url</name>
                <value>ldap://localhost:33389</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
                <value>simple</value>
            </param>
            <param>
                <name>urls./**</name>
                <value>authcBasic</value>
            </param>
        </provider>
        <provider>
            <role>identity-assertion</role>
            <name>Pseudo</name>
            <enabled>true</enabled>
            <param>
                <name>principal.mapping</name>
                <value>guest=hdfs;</value>
            </param>
            <param>
                <name>group.principal.mapping</name>
                <value>*=users;hdfs=admin</value>
            </param>
        </provider>
        <provider>
            <role>authorization</role>
            <name>AclsAuthz</name>
            <enabled>true</enabled>
            <param>
                <name>acl.mode</name>
                <value>OR</value>
            </param>
            <param>
                <name>webhdfs.acl.mode</name>
                <value>AND</value>
            </param>
            <param>
                <name>webhdfs.acl</name>
                <value>hdfs;admin;127.0.0.2,127.0.0.3</value>
            </param>
            <param>
                <name>webhcat.acl</name>
                <value>hdfs;admin;127.0.0.2,127.0.0.3</value>
            </param>
        </provider>
        <provider>
            <role>hostmap</role>
            <name>static</name>
            <enabled>true</enabled>
            <param>
                <name>localhost</name>
                <value>sandbox,sandbox.hortonworks.com</value>
            </param>
        </provider>
    </gateway>

    <service>
        <role>JOBTRACKER</role>
        <url>rpc://localhost:8050</url>
    </service>

    <service>
        <role>WEBHDFS</role>
        <url>http://localhost:50070/webhdfs</url>
    </service>

    <service>
        <role>WEBHCAT</role>
        <url>http://localhost:50111/templeton</url>
    </service>

    <service>
        <role>OOZIE</role>
        <url>http://localhost:11000/oozie</url>
    </service>

    <service>
        <role>WEBHBASE</role>
        <url>http://localhost:60080</url>
    </service>

    <service>
        <role>HIVE</role>
        <url>http://localhost:10000</url>
    </service>
</topology>

Secure Clusters

See these documents for setting up a secure Hadoop cluster:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html#Configuration_in_Secure_Mode

http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.1/bk_installing_manually_book/content/rpm-chap14.html

Once you have a Hadoop cluster that is using Kerberos for authentication, you have to do the following to configure Knox to work with that cluster.

Create Unix account for Knox on Hadoop master nodes

useradd -g hadoop knox

Create Kerberos principal, keytab for Knox

One way of doing this, assuming your KDC realm is EXAMPLE.COM, is to ssh into your host running the KDC and execute kadmin.local. That will result in an interactive session in which you can execute commands.

ssh into your host running KDC

kadmin.local
add_principal -randkey knox/knox@EXAMPLE.COM
ktadd -k /etc/security/keytabs/knox.service.keytab -norandkey knox/knox@EXAMPLE.COM
exit

Grant Proxy privileges for Knox user in core-site.xml on Hadoop master nodes

Update core-site.xml and add the following lines towards the end of the file.

Replace FQDN_OF_KNOX_HOST with the fully qualified domain name of the host running the gateway. You can usually find this by running hostname -f on that host.

You could use * for local developer testing if Knox host does not have static IP.

<property>
    <name>hadoop.proxyuser.knox.groups</name>
    <value>users</value>
</property>
<property>
    <name>hadoop.proxyuser.knox.hosts</name>
    <value>FQDN_OF_KNOX_HOST</value>
</property>

Grant proxy privilege for Knox in webhcat-site.xml on Hadoop master nodes

Update webhcat-site.xml and add the following lines towards the end of the file.

Replace FQDN_OF_KNOX_HOST with right value in your cluster. You could use * for local developer testing if Knox host does not have static IP.

<property>
    <name>hadoop.proxyuser.knox.groups</name>
    <value>users</value>
</property>
<property>
    <name>hadoop.proxyuser.knox.hosts</name>
    <value>FQDN_OF_KNOX_HOST</value>
</property>

Grant proxy privilege for Knox in oozie-site.xml on Oozie host

Update oozie-site.xml and add the following lines towards the end of the file.

Replace FQDN_OF_KNOX_HOST with right value in your cluster. You could use * for local developer testing if Knox host does not have static IP.

<property>
   <name>oozie.service.ProxyUserService.proxyuser.knox.groups</name>
   <value>users</value>
</property>
<property>
   <name>oozie.service.ProxyUserService.proxyuser.knox.hosts</name>
   <value>FQDN_OF_KNOX_HOST</value>
</property>

Copy knox keytab to Knox host

Add unix account for the knox user on Knox host

useradd -g hadoop knox

Copy the knox.service.keytab created on the KDC host to your Knox host at /etc/knox/conf/knox.service.keytab

chown knox knox.service.keytab
chmod 400 knox.service.keytab

Update krb5.conf at /etc/knox/conf/krb5.conf on Knox host

You could copy the templates/krb5.conf file provided in the Knox binary download and customize it to suit your cluster.
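For example (a sketch, assuming an RPM based install so that /etc/knox/conf exists and {GATEWAY_HOME} is your install directory):

cp {GATEWAY_HOME}/templates/krb5.conf /etc/knox/conf/krb5.conf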

Update krb5JAASLogin.conf at /etc/knox/conf/krb5JAASLogin.conf on Knox host

You could copy the templates/krb5JAASLogin.conf file provided in the Knox binary download and customize it to suit your cluster.

Update gateway-site.xml on Knox host

Update conf/gateway-site.xml in your Knox installation and set the value of gateway.hadoop.kerberos.secured to true.
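The entry would look something like the following (a sketch; verify against the existing entries in your gateway-site.xml):

<property>
    <name>gateway.hadoop.kerberos.secured</name>
    <value>true</value>
</property>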

Restart Knox

After you complete the above configuration and restart Knox, it will use SPNEGO to authenticate with Hadoop services and Oozie. There is no change in the way you make calls to Knox whether you use curl or the Knox DSL.

Client Details

Hadoop requires a client that can be used to interact remotely with the services provided by a Hadoop cluster. This will also be true when using the Apache Knox Gateway to provide perimeter security and centralized access for these services. The two primary existing clients for Hadoop are the CLI (i.e. Command Line Interface, hadoop) and HUE (i.e. Hadoop User Environment). For several reasons however, neither of these clients can currently be used to access Hadoop services via the Apache Knox Gateway.

This led to thinking about a very simple client that could help people use and evaluate the gateway. The list below outlines the general requirements for such a client.

The result is a very simple DSL (Domain Specific Language) of sorts that is used via Groovy scripts. Here is an example of a command that copies a file from the local file system to HDFS.

Note: The variables session, localFile and remoteFile are assumed to be defined.

Hdfs.put( session ).file( localFile ).to( remoteFile ).now()

This work is very early in development but is also very useful in its current state. We are very interested in receiving feedback about how to improve this feature and the DSL in particular.

A note of thanks to REST-assured which provides a Fluent interface style DSL for testing REST services. It served as the initial inspiration for the creation of this DSL.

Assumptions

This document assumes a few things about your environment in order to simplify the examples.

Basics

The DSL requires a shell to interpret the Groovy script. The shell can either be used interactively or to execute a script file. To simplify use, the distribution contains an embedded version of the Groovy shell.

The shell can be run interactively. Use the command exit to exit.

java -jar bin/shell.jar

When running interactively it may be helpful to reduce some of the output generated by the shell console. Use the following command in the interactive shell to reduce that output. This only needs to be done once as these preferences are persisted.

set verbosity QUIET
set show-last-result false

Also, when running interactively, use the exit command to terminate the shell. Using ^C to exit can sometimes leave the parent shell in a problematic state.

The shell can also be used to execute a script by passing a single filename argument.

java -jar bin/shell.jar samples/ExampleWebHdfsPutGetFile.groovy

Examples

Once the shell is launched the DSL can be used to interact with the gateway and Hadoop. Below is a very simple example of an interactive shell session to upload a file to HDFS.

java -jar bin/shell.jar
knox:000> session = Hadoop.login( "https://localhost:8443/gateway/sandbox", "guest", "guest-password" )
knox:000> Hdfs.put( session ).file( "README" ).to( "/tmp/example/README" ).now()

The knox:000> in the example above is the prompt from the embedded Groovy console. If your output doesn’t look like this you may need to set the verbosity and show-last-result preferences as described above in the Basics section.

If you receive an error HTTP/1.1 403 Forbidden it may be because that file already exists. Try deleting it with the following command and then try again.

knox:000> Hdfs.rm(session).file("/tmp/example/README").now()

Without using some other tool to browse HDFS it is hard to tell that this command did anything. Execute this to get a bit more feedback.

knox:000> println "Status=" + Hdfs.put( session ).file( "README" ).to( "/tmp/example/README2" ).now().statusCode
Status=201

Notice that a different filename is used for the destination. Without this an error would have resulted. Of course the DSL also provides a command to list the contents of a directory.

knox:000> println Hdfs.ls( session ).dir( "/tmp/example" ).now().string
{"FileStatuses":{"FileStatus":[{"accessTime":1363711366977,"blockSize":134217728,"group":"hdfs","length":19395,"modificationTime":1363711366977,"owner":"guest","pathSuffix":"README","permission":"644","replication":1,"type":"FILE"},{"accessTime":1363711375617,"blockSize":134217728,"group":"hdfs","length":19395,"modificationTime":1363711375617,"owner":"guest","pathSuffix":"README2","permission":"644","replication":1,"type":"FILE"}]}}

It is a design decision of the DSL to not provide type safe classes for various request and response payloads. Doing so would provide an undesirable coupling between the DSL and the service implementation. It also would make adding new commands much more difficult. See the Groovy section below for a variety of capabilities and tools for working with JSON and XML to make this easy. The example below shows the use of JsonSlurper and GPath to extract content from a JSON response.

knox:000> import groovy.json.JsonSlurper
knox:000> text = Hdfs.ls( session ).dir( "/tmp/example" ).now().string
knox:000> json = (new JsonSlurper()).parseText( text )
knox:000> println json.FileStatuses.FileStatus.pathSuffix
[README, README2]

In the future, “built-in” methods to slurp JSON and XML may be added to make this a bit easier. This would allow for this type of single line interaction.

println Hdfs.ls(session).dir("/tmp").now().json().FileStatuses.FileStatus.pathSuffix

A shell session should always be ended by shutting down the session. The examples above do not touch on it but the DSL supports the simple execution of commands asynchronously. The shutdown command attempts to ensure that all asynchronous commands have completed before exiting the shell.

knox:000> session.shutdown()
knox:000> exit

All of the commands above could have been combined into a script file and executed as a single line.

java -jar bin/shell.jar samples/ExampleWebHdfsPutGet.groovy

This would be the content of that script.

import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import groovy.json.JsonSlurper

gateway = "https://localhost:8443/gateway/sandbox"
username = "guest"
password = "guest-password"
dataFile = "README"

session = Hadoop.login( gateway, username, password )
Hdfs.rm( session ).file( "/tmp/example" ).recursive().now()
Hdfs.put( session ).file( dataFile ).to( "/tmp/example/README" ).now()
text = Hdfs.ls( session ).dir( "/tmp/example" ).now().string
json = (new JsonSlurper()).parseText( text )
println json.FileStatuses.FileStatus.pathSuffix
session.shutdown()
exit

Notice the Hdfs.rm command. This is included simply to ensure that the script can be rerun. Without this an error would result the second time it is run.

Futures

The DSL supports the ability to invoke commands asynchronously via the later() invocation method. The object returned from the later() method is a java.util.concurrent.Future parametrized with the response type of the command. This is an example of how to asynchronously put a file to HDFS.

future = Hdfs.put(session).file("README").to("tmp/example/README").later()
println future.get().statusCode

The future.get() method will block until the asynchronous command is complete. To illustrate the usefulness of this, however, multiple concurrent commands are required.

readmeFuture = Hdfs.put(session).file("README").to("tmp/example/README").later()
licenseFuture = Hdfs.put(session).file("LICENSE").to("tmp/example/LICENSE").later()
session.waitFor( readmeFuture, licenseFuture )
println readmeFuture.get().statusCode
println licenseFuture.get().statusCode

The session.waitFor() method will wait for one or more asynchronous commands to complete.

Closures

Futures alone only provide asynchronous invocation of the command. What if some processing should also occur asynchronously once the command is complete? Support for this is provided by closures. Closures are blocks of code that are passed into the later() invocation method. In Groovy these are contained within {} immediately after a method. These blocks of code are executed once the asynchronous command is complete.

Hdfs.put(session).file("README").to("tmp/example/README").later(){ println it.statusCode }

In this example the put() command is executed on a separate thread and once complete the println it.statusCode block is executed on that thread. The it variable is automatically populated by Groovy and is a reference to the result that is returned from the future or now() method. The future example above can be rewritten to illustrate the use of closures.

readmeFuture = Hdfs.put(session).file("README").to("tmp/example/README").later() { println it.statusCode }
licenseFuture = Hdfs.put(session).file("LICENSE").to("tmp/example/LICENSE").later() { println it.statusCode }
session.waitFor( readmeFuture, licenseFuture )

Again, the session.waitFor() method will wait for one or more asynchronous commands to complete.

Constructs

In order to understand the DSL there are three primary constructs that need to be understood.

Session

This construct encapsulates the client side session state that will be shared between all command invocations. In particular it will simplify the management of any tokens that need to be presented with each command invocation. It also manages a thread pool that is used by all asynchronous commands which is why it is important to call one of the shutdown methods.

The syntax associated with this is expected to change; we expect that credentials will not need to be provided directly to the gateway. Rather, it is expected that some form of access token will be used to initialize the session.

Services

Services are the primary extension point for adding new suites of commands. The current built in examples are: Hdfs, Job and Workflow. The desire for extensibility is the reason for the slightly awkward Hdfs.ls(session) syntax. Certainly something more like session.hdfs().ls() would have been preferred but this would prevent adding new commands easily. At a minimum it would result in extension commands with a different syntax from the “built-in” commands.

The service objects essentially function as a factory for a suite of commands.

Commands

Commands provide the behavior of the DSL. They typically follow a Fluent interface style in order to allow for single line commands. There are really three parts to each command: Request, Invocation, Response

Request

The request is populated by all of the methods between the “verb” method and the “invoke” method. For example, in Hdfs.rm( session ).file( dir ).now() the request is populated between the “verb” method rm() and the “invoke” method now().

Invocation

The invocation method controls how the request is invoked. Synchronous and asynchronous invocation are currently supported. The now() method executes the request and returns the result immediately. The later() method submits the request to be executed later and returns a future from which the result can be retrieved. In addition, the later() invocation method can optionally be provided a closure to execute when the request is complete. See the Futures and Closures sections above for additional detail and examples.
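
As a compact recap, the same request can be invoked in the three styles already demonstrated above (the variables session, localFile and remoteFile are assumed to be defined).

// Synchronous invocation: executes the request and returns the response immediately.
Hdfs.put( session ).file( localFile ).to( remoteFile ).now()

// Asynchronous invocation: returns a future from which the response can later be retrieved.
future = Hdfs.put( session ).file( localFile ).to( remoteFile ).later()
println future.get().statusCode

// Asynchronous invocation with a closure that is executed once the request completes.
Hdfs.put( session ).file( localFile ).to( remoteFile ).later() { println it.statusCode }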

Response

The response contains the results of the invocation of the request. In most cases the response is a thin wrapper over the HTTP response. In fact many commands will share a single BasicResponse type that only provides a few simple methods.

public int getStatusCode()
public long getContentLength()
public String getContentType()
public String getContentEncoding()
public InputStream getStream()
public String getString()
public byte[] getBytes()
public void close();

Thanks to Groovy these methods can be accessed as attributes. In some of the examples above the statusCode was retrieved, for example:

println Hdfs.put( session ).file( localFile ).to( remoteFile ).now().statusCode

Groovy will invoke the getStatusCode method to retrieve the statusCode attribute.

The three methods getStream(), getBytes() and getString() deserve special attention. The HTTP body must be fully read once and only once, therefore exactly one of these methods must be called, and called only once. Calling one of them more than once will cause an error. Failing to call one of them will result in lingering open HTTP connections. The close() method may be used if the caller is not interested in reading the result body. Most commands that do not expect a response body will call close implicitly. If the body is retrieved via getBytes() or getString(), the close() method need not be called. When using getStream(), care must be taken to consume the entire body, otherwise lingering open HTTP connections will result. The close() method may be called after reading the body partially to discard the remainder of the body.
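
A short sketch of these rules in practice, reusing request forms from the examples above (the paths are illustrative and a session is assumed to be defined).

// Read the body exactly once via getString(); no explicit close() is needed afterwards.
response = Hdfs.ls( session ).dir( "/tmp/example" ).now()
text = response.string
println text

// Discard a body that is not of interest; close() releases the underlying HTTP connection.
Hdfs.put( session ).file( "README" ).to( "/tmp/example/README" ).now().close()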

Services

The built-in supported client DSL for each Hadoop service can be found in the Service Details section.

Extension

Extensibility is a key design goal of the KnoxShell and client DSL. There are two ways to provide extended functionality for use with the shell. The first is to simply create Groovy scripts that use the DSL to perform a useful task. The second is to add new services and commands. In order to add new services and commands, new classes must be written in either Groovy or Java and added to the classpath of the shell. Fortunately there is a very simple way to add classes and JARs to the shell classpath. The first time the shell is executed it will create a configuration file in the same directory as the JAR with the same base name and a .cfg extension.

bin/shell.jar
bin/shell.cfg

That file contains both the main class for the shell as well as a definition of the classpath. Currently that file will by default contain the following.

main.class=org.apache.hadoop.gateway.shell.Shell
class.path=../lib; ../lib/*.jar; ../ext; ../ext/*.jar

Therefore to extend the shell you should copy any new service and command classes to the ext directory or, if they are packaged within a JAR, copy the JAR to the ext directory. The lib directory is reserved for JARs that may be delivered with the product.

Below are samples for the service and command classes that would need to be written to add new commands to the shell. These happen to be Groovy source files but could, with very minor changes, be Java files. The easiest way to add these to the shell is to compile them directly into the ext directory. Note: This command depends upon having the Groovy compiler installed and available on the execution path.

groovy -d ext -cp bin/shell.jar samples/SampleService.groovy \
    samples/SampleSimpleCommand.groovy samples/SampleComplexCommand.groovy

These source files are available in the samples directory of the distribution but are included here for convenience.

Sample Service (Groovy)

import org.apache.hadoop.gateway.shell.Hadoop

class SampleService {

    static String PATH = "/webhdfs/v1"

    static SimpleCommand simple( Hadoop session ) {
        return new SimpleCommand( session )
    }

    static ComplexCommand.Request complex( Hadoop session ) {
        return new ComplexCommand.Request( session )
    }

}

Sample Simple Command (Groovy)

import org.apache.hadoop.gateway.shell.AbstractRequest
import org.apache.hadoop.gateway.shell.BasicResponse
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.utils.URIBuilder

import java.util.concurrent.Callable

class SimpleCommand extends AbstractRequest<BasicResponse> {

    SimpleCommand( Hadoop session ) {
        super( session )
    }

    private String param
    SimpleCommand param( String param ) {
        this.param = param
        return this
    }

    @Override
    protected Callable<BasicResponse> callable() {
        return new Callable<BasicResponse>() {
            @Override
            BasicResponse call() {
                URIBuilder uri = uri( SampleService.PATH, param )
                addQueryParam( uri, "op", "LISTSTATUS" )
                HttpGet get = new HttpGet( uri.build() )
                return new BasicResponse( execute( get ) )
            }
        }
    }

}

Sample Complex Command (Groovy)

import com.jayway.jsonpath.JsonPath
import org.apache.hadoop.gateway.shell.AbstractRequest
import org.apache.hadoop.gateway.shell.BasicResponse
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.http.HttpResponse
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.utils.URIBuilder

import java.util.concurrent.Callable

class ComplexCommand {

    static class Request extends AbstractRequest<Response> {

        Request( Hadoop session ) {
            super( session )
        }

        private String param;
        Request param( String param ) {
            this.param = param;
            return this;
        }

        @Override
        protected Callable<Response> callable() {
            return new Callable<Response>() {
                @Override
                Response call() {
                    URIBuilder uri = uri( SampleService.PATH, param )
                    addQueryParam( uri, "op", "LISTSTATUS" )
                    HttpGet get = new HttpGet( uri.build() )
                    return new Response( execute( get ) )
                }
            }
        }

    }

    static class Response extends BasicResponse {

        Response(HttpResponse response) {
            super(response)
        }

        public List<String> getNames() {
            return JsonPath.read( string, "\$.FileStatuses.FileStatus[*].pathSuffix" )
        }

    }

}

Groovy

The shell included in the distribution is basically an unmodified packaging of the Groovy shell. The distribution does however provide a wrapper that makes it very easy to setup the class path for the shell. In fact the JARs required to execute the DSL are included on the class path by default. Therefore these commands are functionally equivalent if you have Groovy installed. See below for a description of the JARs required by the DSL from the lib and dep directories.

java -jar bin/shell.jar samples/ExampleWebHdfsPutGet.groovy
groovy -classpath {JARs required by the DSL from lib and dep} samples/ExampleWebHdfsPutGet.groovy

The interactive shell isn’t exactly equivalent. The only difference is that the shell.jar automatically executes some additional imports that are useful for the KnoxShell client DSL. So these two sets of commands should be functionally equivalent. However, there is currently a class loading issue that prevents the groovysh command from working properly.

java -jar bin/shell.jar

groovysh -classpath {JARs required by the DSL from lib and dep}
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job
import org.apache.hadoop.gateway.shell.workflow.Workflow
import java.util.concurrent.TimeUnit

Alternatively, you can use the Groovy Console which does not appear to have the same class loading issue.

groovyConsole -classpath {JARs required by the DSL from lib and dep}

import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job
import org.apache.hadoop.gateway.shell.workflow.Workflow
import java.util.concurrent.TimeUnit

The JARs currently required by the client DSL are

lib/gateway-shell-${gateway-version}.jar
dep/httpclient-4.2.3.jar
dep/httpcore-4.2.2.jar
dep/commons-lang3-3.1.jar
dep/commons-codec-1.7.jar

So on Linux/MacOS you would need this command

groovy -cp lib/gateway-shell-0.2.0-SNAPSHOT.jar:dep/httpclient-4.2.3.jar:dep/httpcore-4.2.2.jar:dep/commons-lang3-3.1.jar:dep/commons-codec-1.7.jar samples/ExampleWebHdfsPutGet.groovy

and on Windows you would need this command

groovy -cp lib/gateway-shell-0.2.0-SNAPSHOT.jar;dep/httpclient-4.2.3.jar;dep/httpcore-4.2.2.jar;dep/commons-lang3-3.1.jar;dep/commons-codec-1.7.jar samples/ExampleWebHdfsPutGet.groovy

The exact list of required JARs is likely to change from release to release so it is recommended that you utilize the wrapper bin/shell.jar.

In addition because the DSL can be used via standard Groovy, the Groovy integrations in many popular IDEs (e.g. IntelliJ, Eclipse) can also be used. This makes it particularly nice to develop and execute scripts to interact with Hadoop. The code-completion features in modern IDEs in particular provide immense value. All that is required is to add the gateway-shell JAR (e.g. lib/gateway-shell-${gateway-version}.jar) to the project’s class path.

There are a variety of Groovy tools that make it very easy to work with the standard interchange formats (i.e. JSON and XML). In Groovy the creation of XML or JSON is typically done via a “builder” and parsing done via a “slurper”. In addition, once JSON or XML is “slurped”, GPath, an XPath-like feature built into Groovy, can be used to access data.
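
For example, an XML payload could be handled much like the JSON example shown earlier; below is a minimal sketch using XmlSlurper and GPath (the XML literal is a contrived stand-in, not an actual service response).

// XmlSlurper is part of standard Groovy and needs no extra JARs.
xml = '<FileStatuses><FileStatus><pathSuffix>README</pathSuffix></FileStatus><FileStatus><pathSuffix>README2</pathSuffix></FileStatus></FileStatuses>'
root = new XmlSlurper().parseText( xml )
// GPath navigation, analogous to json.FileStatuses.FileStatus.pathSuffix above.
println root.FileStatus.pathSuffix*.text()   // prints: [README, README2]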

Service Details

In the sections that follow the integrations currently available out of the box with the gateway will be described. In general these sections will include examples that demonstrate how to access each of these services via the gateway. In many cases this will include both the use of cURL as a REST API client as well as the use of the Knox Client DSL. You may notice that there are some minor differences between using the REST API of a given service directly and via the gateway. In general this is necessary in order to achieve the goal of not leaking internal Hadoop cluster details to the client.

Keep in mind that the gateway uses a plugin model for supporting Hadoop services. Check back with the Apache Knox site for the latest news on plugin availability. You can also create your own custom plugin to extend the capabilities of the gateway.

These are the current Hadoop services with built-in support.

Assumptions

This document assumes a few things about your environment in order to simplify the examples.

Customization

Using these samples with other Hadoop installations will require changes to the steps described here as well as changes to the referenced sample scripts. This will also likely require changes to the gateway’s default configuration. In particular host names, ports, user names and passwords may need to be changed to match your environment. These changes may need to be made to the gateway configuration and also to the Groovy sample script files in the distribution. All of the values that may need to be customized in the sample scripts can be found together at the top of each of these files.

cURL

The cURL HTTP client command line utility is used extensively in the examples for each service. In particular this form of the cURL command line is used repeatedly.

curl -i -k -u guest:guest-password ...

The option -i (aka --include) is used to output HTTP response header information. This will be important when the content of the HTTP Location header is required for subsequent requests.

The option -k (aka --insecure) is used to avoid any issues resulting from the use of demonstration SSL certificates.

The option -u (aka --user) is used to provide the credentials to be used when the client is challenged by the gateway.

Keep in mind that the samples do not use the cookie features of cURL for the sake of simplicity. Therefore each request via cURL will result in an authentication.

WebHDFS

REST API access to HDFS in a Hadoop cluster is provided by WebHDFS. The WebHDFS REST API documentation is available online. WebHDFS must be enabled in the hdfs-site.xml configuration file. In sandbox this configuration file is located at /etc/hadoop/conf/hdfs-site.xml. Note the properties shown below as they are related to configuration required by the gateway. Some of these represent the default values and may not actually be present in hdfs-site.xml.

<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.namenode.rpc-address</name>
    <value>sandbox.hortonworks.com:8020</value>
</property>
<property>
    <name>dfs.namenode.http-address</name>
    <value>sandbox.hortonworks.com:50070</value>
</property>
<property>
    <name>dfs.https.namenode.https-address</name>
    <value>sandbox.hortonworks.com:50470</value>
</property>

The values above need to be reflected in each topology descriptor file deployed to the gateway. The gateway by default includes a sample topology descriptor file {GATEWAY_HOME}/deployments/sandbox.xml. The values in this sample are configured to work with an installed Sandbox VM.

<service>
    <role>NAMENODE</role>
    <url>hdfs://localhost:8020</url>
</service>
<service>
    <role>WEBHDFS</role>
    <url>http://localhost:50070/webhdfs</url>
</service>

The URL provided for the role NAMENODE does not result in an endpoint being exposed by the gateway. This information is only required so that other URLs can be rewritten that reference the Name Node’s RPC address. This prevents clients from needing to be aware of the internal cluster details.

By default the gateway is configured to use the HTTP endpoint for WebHDFS in the Sandbox. This could alternatively be configured to use the HTTPS endpoint by providing the correct address.

WebHDFS URL Mapping

For Name Node URLs, the mapping of Knox Gateway accessible WebHDFS URLs to direct WebHDFS URLs is simple.

Gateway https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs
Cluster http://{webhdfs-host}:50070/webhdfs

However, there is a subtle difference to URLs that are returned by WebHDFS in the Location header of many requests. Direct WebHDFS requests may return Location headers that contain the address of a particular Data Node. The gateway will rewrite these URLs to ensure subsequent requests come back through the gateway and internal cluster details are protected.

A WebHDFS request to the Name Node to retrieve a file will return a URL of the form below in the Location header.

http://{datanode-host}:{data-node-port}/webhdfs/v1/{path}?...

Note that this URL contains the network location of a Data Node. The gateway will rewrite this URL to look like the URL below.

https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs/data/v1/{path}?_={encrypted-query-parameters}

The {encrypted-query-parameters} will contain the {datanode-host} and {datanode-port} information. This information along with the original query parameters are encrypted so that the internal Hadoop details are protected.

WebHDFS Examples

The examples below upload a file, download the file and list the contents of the directory.

WebHDFS via client DSL

You can use the Groovy example scripts and interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleWebHdfsPutGet.groovy
java -jar bin/shell.jar samples/ExampleWebHdfsLs.groovy

You can manually type the client DSL script into the KnoxShell interactive Groovy interpreter provided with the distribution. The command below starts the KnoxShell in interactive mode.

java -jar bin/shell.jar

Each line below could be typed or copied into the interactive shell and executed. This is provided as an example to illustrate the use of the client DSL.

// Import the client DSL and a useful utilities for working with JSON.
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import groovy.json.JsonSlurper

// Setup some basic config.
gateway = "https://localhost:8443/gateway/sandbox"
username = "guest"
password = "guest-password"

// Start the session.
session = Hadoop.login( gateway, username, password )

// Cleanup anything leftover from a previous run.
Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()

// Upload the README to HDFS.
Hdfs.put( session ).file( "README" ).to( "/user/guest/example/README" ).now()

// Download the README from HDFS.
text = Hdfs.get( session ).from( "/user/guest/example/README" ).now().string
println text

// List the contents of the directory.
text = Hdfs.ls( session ).dir( "/user/guest/example" ).now().string
json = (new JsonSlurper()).parseText( text )
println json.FileStatuses.FileStatus.pathSuffix

// Cleanup the directory.
Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()

// Clean the session.
session.shutdown()

WebHDFS via cURL

You can use cURL to directly invoke the REST APIs via the gateway.

Optionally cleanup the sample directory in case a previous example was run without cleaning up.

curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'

Register the name for a sample file README in /user/guest/example.

curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/README?op=CREATE'

Upload README to /user/guest/example. Use the README in {GATEWAY_HOME}.

curl -i -k -u guest:guest-password -T README -X PUT \
    '{Value of Location header from command above}'

List the contents of the directory /user/guest/example.

curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=LISTSTATUS'

Request the content of the README file in /user/guest/example.

curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/README?op=OPEN'

Read the content of the file.

curl -i -k -u guest:guest-password -X GET \
    '{Value of Location header from command above}'

Optionally cleanup the example directory.

curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'

WebHDFS client DSL

get() - Get a file from HDFS (OPEN).
ls() - Query the contents of a directory (LISTSTATUS)
mkdir() - Create a directory in HDFS (MKDIRS); see the sketch after this list
put() - Write a file into HDFS (CREATE)
rm() - Delete a file or directory (DELETE)
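
Of the commands above, mkdir() is the only one not exercised by the earlier examples. Below is a minimal sketch, assuming the request is populated with a dir() method analogous to ls(); consult the samples in the distribution for the authoritative form.

// Create a directory via WebHDFS; session is assumed to be defined and dir() is an assumed request method.
Hdfs.mkdir( session ).dir( "/user/guest/example/newdir" ).now()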

WebHCat

WebHCat is a related but separate service from Hive. As such it is installed and configured independently. The WebHCat wiki pages describe this process. In sandbox the configuration file for WebHCat is located at /etc/hadoop/hcatalog/webhcat-site.xml. Note the properties shown below as they are related to configuration required by the gateway.

<property>
    <name>templeton.port</name>
    <value>50111</value>
</property>

Also important is the configuration of the JOBTRACKER RPC endpoint. For Hadoop 2 this can be found in the yarn-site.xml file. In Sandbox this file can be found at /etc/hadoop/conf/yarn-site.xml. The property yarn.resourcemanager.address within that file is relevant for the gateway’s configuration.

<property>
    <name>yarn.resourcemanager.address</name>
    <value>sandbox.hortonworks.com:8050</value>
</property>

See WebHDFS for details about locating the Hadoop configuration for the NAMENODE endpoint.

The gateway by default includes a sample topology descriptor file {GATEWAY_HOME}/deployments/sandbox.xml. The values in this sample are configured to work with an installed Sandbox VM.

<service>
    <role>NAMENODE</role>
    <url>hdfs://localhost:8020</url>
</service>
<service>
    <role>JOBTRACKER</role>
    <url>rpc://localhost:8050</url>
</service>
<service>
    <role>WEBHCAT</role>
    <url>http://localhost:50111/templeton</url>
</service>

The URLs provided for the roles NAMENODE and JOBTRACKER do not result in endpoints being exposed by the gateway. This information is only required so that other URLs can be rewritten that reference the appropriate RPC address for Hadoop services. This prevents clients from needing to be aware of the internal cluster details. Note that for Hadoop 2 the JOBTRACKER RPC endpoint is provided by the Resource Manager component.

By default the gateway is configured to use the HTTP endpoint for WebHCat in the Sandbox. This could alternatively be configured to use the HTTPS endpoint by providing the correct address.

WebHCat URL Mapping

For WebHCat URLs, the mapping of Knox Gateway accessible URLs to direct WebHCat URLs is simple.

Gateway https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton
Cluster http://{webhcat-host}:{webhcat-port}/templeton

WebHCat Example

This example will submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. There are several ways to do this depending upon your preference.

You can use the “embedded” Groovy interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleWebHCatJob.groovy

You can manually type the KnoxShell DSL script into the “embedded” Groovy interpreter provided with the distribution.

java -jar bin/shell.jar

Each line from the file samples/ExampleWebHCatJob.groovy would then need to be typed or copied into the interactive shell.
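
For orientation only, below is a rough sketch of the kind of script that file contains. The fluent request methods shown (jar(), app(), input(), output(), jobId()) are assumptions and may differ from the actual sample; samples/ExampleWebHCatJob.groovy in the distribution is the authoritative version.

import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

gateway = "https://localhost:8443/gateway/sandbox"
session = Hadoop.login( gateway, "guest", "guest-password" )

// Stage the job jar and some input in HDFS.
Hdfs.put( session ).file( "samples/hadoop-examples.jar" ).to( "/user/guest/example/hadoop-examples.jar" ).now()
Hdfs.put( session ).file( "README" ).to( "/user/guest/example/input/README" ).now()

// Submit the WordCount job via WebHCat; the request methods below are assumed, not documented here.
jobId = Job.submitJava( session ) \
    .jar( "/user/guest/example/hadoop-examples.jar" ) \
    .app( "wordcount" ) \
    .input( "/user/guest/example/input" ) \
    .output( "/user/guest/example/output" ) \
    .now().jobId
println "Submitted job " + jobId

// Check the status of the job by its ID.
println Job.queryStatus( session ).jobId( jobId ).now().string

session.shutdown()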

WebHCat Client DSL

submitJava() - Submit a Java MapReduce job.
submitPig() - Submit a Pig job.
submitHive() - Submit a Hive job.
queryQueue() - Return a list of all job IDs registered to the user.
queryStatus() - Check the status of a job and get related job information given its job ID.

Oozie

Oozie is a Hadoop component that allows complex job workflows to be submitted and managed. Please refer to the latest Oozie documentation for details.

In order to make Oozie accessible via the gateway there are several important Hadoop configuration settings. These all relate to the network endpoints exposed by various Hadoop services.

The HTTP endpoint at which Oozie is running can be found via the oozie.base.url property in the oozie-site.xml file. In a Sandbox installation this can typically be found in /etc/oozie/conf/oozie-site.xml.

<property>
    <name>oozie.base.url</name>
    <value>http://sandbox.hortonworks.com:11000/oozie</value>
</property>

The RPC address at which the Resource Manager exposes the JOBTRACKER endpoint can be found via the yarn.resourcemanager.address in the yarn-site.xml file. In a Sandbox installation this can typically be found in /etc/hadoop/conf/yarn-site.xml.

<property>
    <name>yarn.resourcemanager.address</name>
    <value>sandbox.hortonworks.com:8050</value>
</property>

The RPC address at which the Name Node exposes its RPC endpoint can be found via the dfs.namenode.rpc-address in the hdfs-site.xml file. In a Sandbox installation this can typically be found in /etc/hadoop/conf/hdfs-site.xml.

<property>
    <name>dfs.namenode.rpc-address</name>
    <value>sandbox.hortonworks.com:8020</value>
</property>

The information above must be provided to the gateway via a topology descriptor file. These topology descriptor files are placed in {GATEWAY_HOME}/deployments. An example that is setup for the default configuration of the Sandbox is {GATEWAY_HOME}/deployments/sandbox.xml. These values will need to be changed for non-default Sandbox or other Hadoop cluster configuration.

<service>
    <role>NAMENODE</role>
    <url>hdfs://localhost:8020</url>
</service>
<service>
    <role>JOBTRACKER</role>
    <url>rpc://localhost:8050</url>
</service>
<service>
    <role>OOZIE</role>
    <url>http://localhost:11000/oozie</url>
</service>

Oozie URL Mapping

For Oozie URLs, the mapping of Knox Gateway accessible URLs to direct Oozie URLs is simple.

Gateway https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/oozie
Cluster http://{oozie-host}:{oozie-port}/oozie

Oozie Request Changes

TODO - In some cases the Oozie requests need to be slightly different when made through the gateway. These changes are required in order to protect the client from knowing the internal structure of the Hadoop cluster.

Oozie Example via Client DSL

This example will also submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. However in this case the job will be submitted via an Oozie workflow. There are several ways to do this depending upon your preference.

You can use the “embedded” Groovy interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleOozieWorkflow.groovy

You can manually type the KnoxShell DSL script into the “embedded” Groovy interpreter provided with the distribution.

java -jar bin/shell.jar

Each line from the file samples/ExampleOozieWorkflow.groovy will need to be typed or copied into the interactive shell.

Oozie Example via cURL

The example below illustrates the sequence of curl commands that could be used to run a “word count” map reduce job via an Oozie workflow.

It utilizes the hadoop-examples.jar from a Hadoop install for running a simple word count job. A copy of that jar has been included in the samples directory for convenience.

In addition a workflow definition and configuration file is required. These have not been included but are available for download. Download workflow-definition.xml and workflow-configuration.xml and store them in the {GATEWAY_HOME} directory. Review the contents of workflow-configuration.xml to ensure that it matches your environment.

Take care to follow the instructions below where replacement values are required. These replacement values are identified with { } markup.

# 0. Optionally cleanup the test directory in case a previous example was run without cleaning up.
curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'

# 1. Create the inode for workflow definition file in /user/guest/example
curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/workflow.xml?op=CREATE'

# 2. Upload the workflow definition file.  This file can be found in {GATEWAY_HOME}/templates
curl -i -k -u guest:guest-password -T workflow-definition.xml -X PUT \
    '{Value of Location header from command above}'

# 3. Create the inode for hadoop-examples.jar in /user/guest/example/lib
curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/lib/hadoop-examples.jar?op=CREATE'

# 4. Upload hadoop-examples.jar to /user/guest/example/lib.  Use a hadoop-examples.jar from a Hadoop install.
curl -i -k -u guest:guest-password -T samples/hadoop-examples.jar -X PUT \
    '{Value of Location header from command above}'

# 5. Create the inode for a sample input file readme.txt in /user/guest/example/input.
curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/input/README?op=CREATE'

# 6. Upload readme.txt to /user/guest/example/input.  Use the readme.txt in {GATEWAY_HOME}.
# The sample below uses this README file found in {GATEWAY_HOME}.
curl -i -k -u guest:guest-password -T README -X PUT \
    '{Value of Location header from command above}'

# 7. Submit the job via Oozie
# Take note of the Job ID in the JSON response as this will be used in the next step.
curl -i -k -u guest:guest-password -H Content-Type:application/xml -T workflow-configuration.xml \
    -X POST 'https://localhost:8443/gateway/sandbox/oozie/v1/jobs?action=start'

# 8. Query the job status via Oozie.
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/oozie/v1/job/{Job ID from JSON body}'

# 9. List the contents of the output directory /user/guest/example/output
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/output?op=LISTSTATUS'

# 10. Optionally cleanup the test directory
curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'

Oozie Client DSL

submit() - Submit a workflow job.

status() - Query the status of a workflow job.

HBase

TODO

HBase URL Mapping

TODO

HBase Examples

TODO

The examples below illustrate the set of basic operations with an HBase instance using the Stargate REST API. Use the following link to get more details about the HBase/Stargate API: http://wiki.apache.org/hadoop/Hbase/Stargate.

HBase Stargate Setup

Launch Stargate

The command below launches the Stargate daemon on port 60080

sudo /usr/lib/hbase/bin/hbase-daemon.sh start rest -p 60080

Port 60080 is used because it was specified in sample Hadoop cluster deployment {GATEWAY_HOME}/deployments/sandbox.xml.

Configure Sandbox port mapping for VirtualBox

  1. Select the VM
  2. Select menu Machine>Settings…
  3. Select tab Network
  4. Select Adapter 1
  5. Press Port Forwarding button
  6. Press Plus button to insert new rule: Name=Stargate, Host Port=60080, Guest Port=60080
  7. Press OK to close the rule window
  8. Press OK on the Network window to save the changes

Port 60080 is used because it was specified in the sample Hadoop cluster deployment {GATEWAY_HOME}/deployments/sandbox.xml.

HBase Restart

If it becomes necessary to restart HBase you can log into the hosts running HBase and use these steps.

sudo /usr/lib/hbase/bin/hbase-daemon.sh stop rest
sudo -u hbase /usr/lib/hbase/bin/hbase-daemon.sh stop regionserver
sudo -u hbase /usr/lib/hbase/bin/hbase-daemon.sh stop master

sudo -u hbase /usr/lib/hbase/bin/hbase-daemon.sh start regionserver
sudo -u hbase /usr/lib/hbase/bin/hbase-daemon.sh start master
sudo /usr/lib/hbase/bin/hbase-daemon.sh start rest -p 60080

HBase/Stargate client DSL

For more details about client DSL usage please see this page: https://cwiki.apache.org/confluence/display/KNOX/Client+Usage.

systemVersion() - Query Software Version.

clusterVersion() - Query Storage Cluster Version.

status() - Query Storage Cluster Status.

table().list() - Query Table List.

table(String tableName).schema() - Query Table Schema.

table(String tableName).create() - Create Table Schema.

table(String tableName).update() - Update Table Schema.

table(String tableName).regions() - Query Table Metadata.

table(String tableName).delete() - Delete Table.

table(String tableName).row(String rowId).store() - Cell Store.

table(String tableName).row(String rowId).query() - Cell or Row Query.

table(String tableName).row(String rowId).delete() - Row, Column, or Cell Delete.

table(String tableName).scanner().create() - Scanner Creation.

table(String tableName).scanner(String scannerId).getNext() - Scanner Get Next.

table(String tableName).scanner(String scannerId).delete() - Scanner Deletion.

HBase/Stargate via Client DSL

This example illustrates the sequence of all basic HBase operations:

  1. get system version
  2. get cluster version
  3. get cluster status
  4. create the table
  5. get list of tables
  6. get table schema
  7. update table schema
  8. insert single row into table
  9. query row by id
  10. query all rows
  11. delete cell from row
  12. delete entire column family from row
  13. get table regions
  14. create scanner
  15. fetch values using scanner
  16. drop scanner
  17. drop the table

There are several ways to do this depending upon your preference.

You can use the Groovy interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleHBase.groovy

You can manually type the KnoxShell DSL script into the interactive Groovy interpreter provided with the distribution.

java -jar bin/shell.jar

Each line from the file below will need to be typed or copied into the interactive shell.

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.gateway.shell.hbase

import org.apache.hadoop.gateway.shell.Hadoop

import static java.util.concurrent.TimeUnit.SECONDS

gateway = "https://localhost:8443/gateway/sandbox"
username = "guest"
password = "guest-password"
tableName = "test_table"

session = Hadoop.login(gateway, username, password)

println "System version : " + HBase.session(session).systemVersion().now().string

println "Cluster version : " + HBase.session(session).clusterVersion().now().string

println "Status : " + HBase.session(session).status().now().string

println "Creating table '" + tableName + "'..."

HBase.session(session).table(tableName).create()  \
    .attribute("tb_attr1", "value1")  \
    .attribute("tb_attr2", "value2")  \
    .family("family1")  \
        .attribute("fm_attr1", "value3")  \
        .attribute("fm_attr2", "value4")  \
    .endFamilyDef()  \
    .family("family2")  \
    .family("family3")  \
    .endFamilyDef()  \
    .attribute("tb_attr3", "value5")  \
    .now()

println "Done"

println "Table List : " + HBase.session(session).table().list().now().string

println "Schema for table '" + tableName + "' : " + HBase.session(session)  \
    .table(tableName)  \
    .schema()  \
    .now().string

println "Updating schema of table '" + tableName + "'..."

HBase.session(session).table(tableName).update()  \
    .family("family1")  \
        .attribute("fm_attr1", "new_value3")  \
    .endFamilyDef()  \
    .family("family4")  \
        .attribute("fm_attr3", "value6")  \
    .endFamilyDef()  \
    .now()

println "Done"

println "Schema for table '" + tableName + "' : " + HBase.session(session)  \
    .table(tableName)  \
    .schema()  \
    .now().string

println "Inserting data into table..."

HBase.session(session).table(tableName).row("row_id_1").store()  \
    .column("family1", "col1", "col_value1")  \
    .column("family1", "col2", "col_value2", 1234567890l)  \
    .column("family2", null, "fam_value1")  \
    .now()

HBase.session(session).table(tableName).row("row_id_2").store()  \
    .column("family1", "row2_col1", "row2_col_value1")  \
    .now()

println "Done"

println "Querying row by id..."

println HBase.session(session).table(tableName).row("row_id_1")  \
    .query()  \
    .now().string

println "Querying all rows..."

println HBase.session(session).table(tableName).row().query().now().string

println "Querying row by id with extended settings..."

println HBase.session(session).table(tableName).row().query()  \
    .column("family1", "row2_col1")  \
    .column("family2")  \
    .times(0, Long.MAX_VALUE)  \
    .numVersions(1)  \
    .now().string

println "Deleting cell..."

HBase.session(session).table(tableName).row("row_id_1")  \
    .delete()  \
    .column("family1", "col1")  \
    .now()

println "Rows after delete:"

println HBase.session(session).table(tableName).row().query().now().string

println "Extended cell delete"

HBase.session(session).table(tableName).row("row_id_1")  \
    .delete()  \
    .column("family2")  \
    .time(Long.MAX_VALUE)  \
    .now()

println "Rows after delete:"

println HBase.session(session).table(tableName).row().query().now().string

println "Table regions : " + HBase.session(session).table(tableName)  \
    .regions()  \
    .now().string

println "Creating scanner..."

scannerId = HBase.session(session).table(tableName).scanner().create()  \
    .column("family1", "col2")  \
    .column("family2")  \
    .startRow("row_id_1")  \
    .endRow("row_id_2")  \
    .batch(1)  \
    .startTime(0)  \
    .endTime(Long.MAX_VALUE)  \
    .filter("")  \
    .maxVersions(100)  \
    .now().scannerId

println "Scanner id=" + scannerId

println "Scanner get next..."

println HBase.session(session).table(tableName).scanner(scannerId)  \
    .getNext()  \
    .now().string

println "Dropping scanner with id=" + scannerId

HBase.session(session).table(tableName).scanner(scannerId).delete().now()

println "Done"

println "Dropping table '" + tableName + "'..."

HBase.session(session).table(tableName).delete().now()

println "Done"

session.shutdown(10, SECONDS)

HBase/Stargate via cURL

Get software version

Set Accept Header to “text/plain”, “text/xml”, “application/json” or “application/x-protobuf”

%  curl -ik -u guest:guest-password\
 -H "Accept:  application/json"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase/version'

Get version information regarding the HBase cluster backing the Stargate instance

Set Accept Header to “text/plain”, “text/xml” or “application/x-protobuf”

%  curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase/version/cluster'

Get detailed status on the HBase cluster backing the Stargate instance.

Set Accept Header to “text/plain”, “text/xml”, “application/json” or “application/x-protobuf”

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase/status/cluster'

Get the list of available tables.

Set Accept Header to “text/plain”, “text/xml”, “application/json” or “application/x-protobuf”

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase'

Create table with two column families using xml input

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"   -H "Content-Type: text/xml"\
 -d '<?xml version="1.0" encoding="UTF-8"?><TableSchema name="table1"><ColumnSchema name="family1"/><ColumnSchema name="family2"/></TableSchema>'\
 -X PUT 'https://localhost:8443/gateway/sandbox/hbase/table1/schema'

Create table with two column families using JSON input

curl -ik -u guest:guest-password\
 -H "Accept: application/json"  -H "Content-Type: application/json"\
 -d '{"name":"table2","ColumnSchema":[{"name":"family3"},{"name":"family4"}]}'\
 -X PUT 'https://localhost:8443/gateway/sandbox/hbase/table2/schema'

Get table metadata

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase/table1/regions'

Insert single row into table

curl -ik -u guest:guest-password\
 -H "Content-Type: text/xml"\
 -H "Accept: text/xml"\
 -d '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><CellSet><Row key="cm93MQ=="><Cell column="ZmFtaWx5MTpjb2wx" >dGVzdA==</Cell></Row></CellSet>'\
 -X POST 'https://localhost:8443/gateway/sandbox/hbase/table1/row1'

Insert multiple rows into table

curl -ik -u guest:guest-password\
 -H "Content-Type: text/xml"\
 -H "Accept: text/xml"\
 -d '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><CellSet><Row key="cm93MA=="><Cell column=" ZmFtaWx5Mzpjb2x1bW4x" >dGVzdA==</Cell></Row><Row key="cm93MQ=="><Cell column=" ZmFtaWx5NDpjb2x1bW4x" >dGVzdA==</Cell></Row></CellSet>'\
 -X POST 'https://localhost:8443/gateway/sandbox/hbase/table2/false-row-key'

Get all data from table

Set Accept Header to “text/plain”, “text/xml”, “application/json” or “application/x-protobuf”

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase/table1/*'

Execute cell or row query

Set Accept Header to “text/plain”, “text/xml”, “application/json” or “application/x-protobuf”

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase/table1/row1/family1:col1'

Delete entire row from table

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X DELETE 'https://localhost:8443/gateway/sandbox/hbase/table2/row0'

Delete column family from row

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X DELETE 'https://localhost:8443/gateway/sandbox/hbase/table2/row0/family3'

Delete specific column from row

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X DELETE 'https://localhost:8443/gateway/sandbox/hbase/table2/row0/family3:column1'

Create scanner

Scanner URL will be in Location response header

curl -ik -u guest:guest-password\
 -H "Content-Type: text/xml"\
 -d '<Scanner batch="1"/>'\
 -X PUT 'https://localhost:8443/gateway/sandbox/hbase/table1/scanner'

Get the values of the next cells found by the scanner

curl -ik -u guest:guest-password\
 -H "Accept: application/json"\
 -X GET 'https://localhost:8443/gateway/sandbox/hbase/table1/scanner/13705290446328cff5ed'

Delete scanner

curl -ik -u guest:guest-password\
 -H "Accept: text/xml"\
 -X DELETE 'https://localhost:8443/gateway/sandbox/hbase/table1/scanner/13705290446328cff5ed'

Delete table

curl -ik -u guest:guest-password\
 -X DELETE 'https://localhost:8443/gateway/sandbox/hbase/table1/schema'

Hive

TODO

Hive URL Mapping

TODO

Hive Examples

This guide provides detailed examples for how to do some basic interactions with Hive via the Apache Knox Gateway.

Hive Setup
  1. Make sure you are running the correct version of Hive to ensure JDBC/Thrift/HTTP support.
  2. Make sure Hive is running on the correct port.
  3. In hive-server.xml add the property “hive.server2.servermode=http”
  4. Client side (JDBC):
    1. Hive JDBC in HTTP mode depends on the following libraries to run successfully (they must be in the classpath): Hive Thrift artifacts classes, commons-codec.jar, commons-configuration.jar, commons-lang.jar, commons-logging.jar, hadoop-core.jar, hive-cli.jar, hive-common.jar, hive-jdbc.jar, hive-service.jar, hive-shims.jar, httpclient.jar, httpcore.jar, slf4j-api.jar;
    2. Import the gateway certificate into the default JRE truststore, which is located at <java-home>/lib/security/cacerts: keytool -import -alias hadoop.gateway -file hadoop.gateway.cer -keystore <java-home>/lib/security/cacerts. Alternatively you can run your sample with additional parameters: -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStore=<path-to-trust-store> -Djavax.net.ssl.trustStorePassword=<trust-store-password>
    3. The connection URL has to be the following: jdbc:hive2://<gateway-host>:<gateway-port>/?hive.server2.servermode=https;hive.server2.http.path=<gateway-path>/<cluster-name>/hive
    4. Look at https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DDLOperations for examples. Hint: For testing it would be better to execute “set hive.security.authorization.enabled=false” as the first statement. Hint: Good examples of Hive DDL/DML can be found here: http://gettingstarted.hadooponazure.com/hw/hive.html

Customization

This example may need to be tailored to the execution environment. In particular host name, host port, user name, user password and context path may need to be changed to match your environment. There is also one example file in the distribution that may need to be customized. Take a moment to review this file. All of the values that may need to be customized can be found together at the top of the file.

Client JDBC Example

Below is a sample for creating a new table, loading data into it from the local file system and querying data from that table.

Java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import java.util.logging.Level;
import java.util.logging.Logger;

public class HiveJDBCSample {

  public static void main( String[] args ) {
    Connection connection = null;
    Statement statement = null;
    ResultSet resultSet = null;

    try {
      String user = "guest";
      String password = user + "-password";
      String gatewayHost = "localhost";
      int gatewayPort = 8443;
      String contextPath = "gateway/sandbox/hive";
      String connectionString = String.format( "jdbc:hive2://%s:%d/?hive.server2.servermode=https;hive.server2.http.path=%s", gatewayHost, gatewayPort, contextPath );

      // load Hive JDBC Driver
      Class.forName( "org.apache.hive.jdbc.HiveDriver" );

      // configure JDBC connection
      connection = DriverManager.getConnection( connectionString, user, password );

      statement = connection.createStatement();

      // disable Hive authorization - it could be omitted if Hive authorization
      // was configured properly
      statement.execute( "set hive.security.authorization.enabled=false" );

      // create sample table
      statement.execute( "CREATE TABLE logs(column1 string, column2 string, column3 string, column4 string, column5 string, column6 string, column7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '" );

      // load data into Hive from file /tmp/log.txt which is placed on the local file system
      statement.execute( "LOAD DATA LOCAL INPATH '/tmp/log.txt' OVERWRITE INTO TABLE logs" );

      resultSet = statement.executeQuery( "SELECT * FROM logs" );

      while ( resultSet.next() ) {
        System.out.println( resultSet.getString( 1 ) + " --- " + resultSet.getString( 2 ) + " --- " + resultSet.getString( 3 ) + " --- " + resultSet.getString( 4 ) );
      }
    } catch ( ClassNotFoundException ex ) {
      Logger.getLogger( HiveJDBCSample.class.getName() ).log( Level.SEVERE, null, ex );
    } catch ( SQLException ex ) {
      Logger.getLogger( HiveJDBCSample.class.getName() ).log( Level.SEVERE, null, ex );
    } finally {
      if ( resultSet != null ) {
        try {
          resultSet.close();
        } catch ( SQLException ex ) {
          Logger.getLogger( HiveJDBCSample.class.getName() ).log( Level.SEVERE, null, ex );
        }
      }
      if ( statement != null ) {
        try {
          statement.close();
        } catch ( SQLException ex ) {
          Logger.getLogger( HiveJDBCSample.class.getName() ).log( Level.SEVERE, null, ex );
        }
      }
      if ( connection != null ) {
        try {
          connection.close();
        } catch ( SQLException ex ) {
          Logger.getLogger( HiveJDBCSample.class.getName() ).log( Level.SEVERE, null, ex );
        }
      }
    }
  }
}

Groovy

Make sure that the GATEWAY_HOME/ext directory contains the following jars/classes for successful execution: Hive Thrift artifact classes, commons-codec.jar, commons-configuration.jar, commons-lang.jar, commons-logging.jar, hadoop-core.jar, hive-cli.jar, hive-common.jar, hive-jdbc.jar, hive-service.jar, hive-shims.jar, httpclient.jar, httpcore.jar, slf4j-api.jar

There are several ways to execute this sample depending upon your preference.

You can use the Groovy interpreter provided with the distribution.

java -jar bin/shell.jar samples/hive/groovy/jdbc/sandbox/HiveJDBCSample.groovy

You can manually type in the KnoxShell DSL script into the interactive Groovy interpreter provided with the distribution.

java -jar bin/shell.jar

Each line from the file below will need to be typed or copied into the interactive shell.

import java.sql.DriverManager

user = "guest";
password = user + "-password";
gatewayHost = "localhost";
gatewayPort = 8443;
contextPath = "gateway/sandbox/hive";
connectionString = String.format( "jdbc:hive2://%s:%d/?hive.server2.servermode=https;hive.server2.http.path=%s", gatewayHost, gatewayPort, contextPath );

// Load Hive JDBC Driver
Class.forName( "org.apache.hive.jdbc.HiveDriver" );

// Configure JDBC connection
connection = DriverManager.getConnection( connectionString, user, password );

statement = connection.createStatement();

// Disable Hive authorization - this can be omitted if Hive authorization is configured properly
statement.execute( "set hive.security.authorization.enabled=false" );

// Create sample table
statement.execute( "CREATE TABLE logs(column1 string, column2 string, column3 string, column4 string, column5 string, column6 string, column7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '" );

// Load data into Hive from file /tmp/log.txt which is placed on the local file system
statement.execute( "LOAD DATA LOCAL INPATH '/tmp/sample.log' OVERWRITE INTO TABLE logs" );

resultSet = statement.executeQuery( "SELECT * FROM logs" );

while ( resultSet.next() ) {
  System.out.println( resultSet.getString( 1 ) + " --- " + resultSet.getString( 2 ) );
}

resultSet.close();
statement.close();
connection.close();

The examples above use a file ‘log.txt’ with the following content:

2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851
2012-02-03 18:35:34 SampleClass4 [FATAL] system problem at id 1991281254
2012-02-03 18:35:34 SampleClass3 [DEBUG] detail for id 1304807656
2012-02-03 18:35:34 SampleClass3 [WARN] missing id 423340895
2012-02-03 18:35:34 SampleClass5 [TRACE] verbose detail for id 2082654978
2012-02-03 18:35:34 SampleClass0 [ERROR] incorrect id  1886438513
2012-02-03 18:35:34 SampleClass9 [TRACE] verbose detail for id 438634209
2012-02-03 18:35:34 SampleClass8 [DEBUG] detail for id 2074121310
2012-02-03 18:35:34 SampleClass0 [TRACE] verbose detail for id 1505582508
2012-02-03 18:35:34 SampleClass0 [TRACE] verbose detail for id 1903854437
2012-02-03 18:35:34 SampleClass7 [DEBUG] detail for id 915853141
2012-02-03 18:35:34 SampleClass3 [TRACE] verbose detail for id 303132401
2012-02-03 18:35:34 SampleClass6 [TRACE] verbose detail for id 151914369
2012-02-03 18:35:34 SampleClass2 [DEBUG] detail for id 146527742
...

Expected output:

2012-02-03 --- 18:35:34 --- SampleClass6 --- [INFO]
2012-02-03 --- 18:35:34 --- SampleClass4 --- [FATAL]
2012-02-03 --- 18:35:34 --- SampleClass3 --- [DEBUG]
2012-02-03 --- 18:35:34 --- SampleClass3 --- [WARN]
2012-02-03 --- 18:35:34 --- SampleClass5 --- [TRACE]
2012-02-03 --- 18:35:34 --- SampleClass0 --- [ERROR]
2012-02-03 --- 18:35:34 --- SampleClass9 --- [TRACE]
2012-02-03 --- 18:35:34 --- SampleClass8 --- [DEBUG]
2012-02-03 --- 18:35:34 --- SampleClass0 --- [TRACE]
2012-02-03 --- 18:35:34 --- SampleClass0 --- [TRACE]
2012-02-03 --- 18:35:34 --- SampleClass7 --- [DEBUG]
2012-02-03 --- 18:35:34 --- SampleClass3 --- [TRACE]
2012-02-03 --- 18:35:34 --- SampleClass6 --- [TRACE]
2012-02-03 --- 18:35:34 --- SampleClass2 --- [DEBUG]
...

Limitations

Secure Oozie POST/PUT Request Payload Size Restriction

With one exception there are no known size limits for request or response payloads that pass through the gateway. The exception involves POST or PUT request payload sizes for Oozie in a Kerberos secured Hadoop cluster. In this one case there is currently a 4KB payload size limit for the first request made to the Hadoop cluster. This is a result of how the gateway negotiates a trust relationship between itself and the cluster via SPNego. There is an undocumented configuration setting to modify this limit’s value if required. In the future this will be made more easily configurable and at that time it will be documented.

LDAP Groups Acquisition

The LDAP authenticator currently does not support the acquisition of group information “out of the box”. This can be addressed by implementing a custom Shiro Realm extension. Building this into the default implementation is on the roadmap.
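
Until then, one hedged approach is to extend Shiro’s JndiLdapRealm (the realm used in the default topology) and override queryForAuthorizationInfo so that directory groups listing the authenticated user as a member are returned as roles. The sketch below is illustrative only: the class name, group search base and member attribute are assumptions that must be adjusted to match your directory schema, and the compiled class would need to be made available to the gateway (for example under {GATEWAY_HOME}/ext) and referenced from the Shiro provider configuration in the topology file.

import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;
import javax.naming.ldap.LdapContext;

import org.apache.shiro.authz.AuthorizationInfo;
import org.apache.shiro.authz.SimpleAuthorizationInfo;
import org.apache.shiro.realm.ldap.JndiLdapRealm;
import org.apache.shiro.realm.ldap.LdapContextFactory;
import org.apache.shiro.subject.PrincipalCollection;

public class GroupAwareLdapRealm extends JndiLdapRealm {

  // Illustrative values - change these to match your directory layout.
  private String groupSearchBase = "ou=groups,dc=hadoop,dc=apache,dc=org";
  private String memberAttribute = "member";

  @Override
  protected AuthorizationInfo queryForAuthorizationInfo( PrincipalCollection principals, LdapContextFactory ldapContextFactory ) throws NamingException {
    // Resolve the DN of the authenticated user via the realm's userDnTemplate.
    String userDn = getUserDn( (String) getAvailablePrincipal( principals ) );
    SimpleAuthorizationInfo info = new SimpleAuthorizationInfo();
    LdapContext context = ldapContextFactory.getSystemLdapContext();
    try {
      SearchControls controls = new SearchControls();
      controls.setSearchScope( SearchControls.SUBTREE_SCOPE );
      // Find every group entry that lists the user's DN in its member attribute.
      NamingEnumeration<SearchResult> results = context.search(
          groupSearchBase, "(" + memberAttribute + "={0})", new Object[]{ userDn }, controls );
      while ( results.hasMore() ) {
        // Use the group entry's DN as the role name.
        info.addRole( results.next().getNameInNamespace() );
      }
    } finally {
      context.close();
    }
    return info;
  }

}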

Group Membership Propagation

Groups that are acquired via Identity Assertion Group Principal Mapping are not propagated to the Hadoop services. Therefore groups used for Service Level Authorization policy may not match those acquired within the cluster via GroupMappingServiceProvider plugins.

Troubleshooting

Finding Logs

When things aren’t working the first thing you need to do is examine the diagnostic logs. Depending upon how you are running the gateway these diagnostic logs will be output to different locations.

java -jar bin/gateway.jar

When the gateway is run this way the diagnostic output is written directly to the console. If you want to capture that output you will need to redirect the console output to a file using OS specific techniques.

java -jar bin/gateway.jar > gateway.log

bin/gateway.sh start

When the gateway is run this way the diagnostic output is written to /var/log/knox/knox.out and /var/log/knox/knox.err. Typically only knox.out will have content.

Increasing Logging

The log4j.properties file in {GATEWAY_HOME}/conf can be used to change the granularity of the logging done by Knox. The Knox server must be restarted in order for these changes to take effect. There are various useful loggers pre-populated but commented out.

log4j.logger.org.apache.hadoop.gateway=DEBUG # Use this logger to increase the debugging of Apache Knox itself.
log4j.logger.org.apache.shiro=DEBUG          # Use this logger to increase the debugging of Apache Shiro.
log4j.logger.org.apache.http=DEBUG           # Use this logger to increase the debugging of Apache HTTP components.
log4j.logger.org.apache.http.client=DEBUG    # Use this logger to increase the debugging of Apache HTTP client component.
log4j.logger.org.apache.http.headers=DEBUG   # Use this logger to increase the debugging of Apache HTTP header.
log4j.logger.org.apache.http.wire=DEBUG      # Use this logger to increase the debugging of Apache HTTP wire traffic.

LDAP Server Connectivity Issues

If the gateway cannot contact the configured LDAP server you will see errors in the gateway diagnostic output.

13/11/15 16:30:17 DEBUG authc.BasicHttpAuthenticationFilter: Attempting to execute login with headers [Basic Z3Vlc3Q6Z3Vlc3QtcGFzc3dvcmQ=]
13/11/15 16:30:17 DEBUG ldap.JndiLdapRealm: Authenticating user 'guest' through LDAP
13/11/15 16:30:17 DEBUG ldap.JndiLdapContextFactory: Initializing LDAP context using URL    [ldap://localhost:33389] and principal [uid=guest,ou=people,dc=hadoop,dc=apache,dc=org] with pooling disabled
13/11/15 16:30:17 DEBUG servlet.SimpleCookie: Added HttpServletResponse Cookie [rememberMe=deleteMe; Path=/gateway/vaultservice; Max-Age=0; Expires=Thu, 14-Nov-2013 21:30:17 GMT]
13/11/15 16:30:17 DEBUG authc.BasicHttpAuthenticationFilter: Authentication required: sending 401 Authentication challenge response.

The client should see something along the lines of:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: BASIC realm="application"
Content-Length: 0
Server: Jetty(8.1.12.v20130726)

Resolving this will require ensuring that the LDAP server is running and that connection information is correct. The LDAP server connection information is configured in the cluster’s topology file (e.g. {GATEWAY_HOME}/deployments/sandbox.xml).

Hadoop Cluster Connectivity Issues

If the gateway cannot contact one of the services in the configured Hadoop cluster you will see errors in the gateway diagnostic output.

13/11/18 18:49:45 WARN hadoop.gateway: Connection exception dispatching request: http://localhost:50070/webhdfs/v1/?user.name=guest&op=LISTSTATUS org.apache.http.conn.HttpHostConnectException: Connection to http://localhost:50070 refused
org.apache.http.conn.HttpHostConnectException: Connection to http://localhost:50070 refused
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:645)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at org.apache.hadoop.gateway.dispatch.HttpClientDispatch.executeRequest(HttpClientDispatch.java:99)

The resulting behavior on the client will differ by client. For the client DSL executing {GATEWAY_HOME}/samples/ExampleWebHdfsLs.groovy the output will look like this.

Caught: org.apache.hadoop.gateway.shell.HadoopException: org.apache.hadoop.gateway.shell.ErrorResponse: HTTP/1.1 500 Server Error
org.apache.hadoop.gateway.shell.HadoopException: org.apache.hadoop.gateway.shell.ErrorResponse: HTTP/1.1 500 Server Error
  at org.apache.hadoop.gateway.shell.AbstractRequest.now(AbstractRequest.java:72)
  at org.apache.hadoop.gateway.shell.AbstractRequest$now.call(Unknown Source)
  at ExampleWebHdfsLs.run(ExampleWebHdfsLs.groovy:28)

When executing requests via cURL the output might look similar to the following example.

Set-Cookie: JSESSIONID=16xwhpuxjr8251ufg22f8pqo85;Path=/gateway/sandbox;Secure
Content-Type: text/html;charset=ISO-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 21856
Server: Jetty(8.1.12.v20130726)

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>

Resolving this will require ensuring that the Hadoop services are running and that connection information is correct. Basic Hadoop connectivity can be evaluated using cURL as described elsewhere. Otherwise the Hadoop cluster connection information is configured in the cluster’s topology file (e.g. {GATEWAY_HOME}/deployments/sandbox.xml).

Check Hadoop Cluster Access via cURL

When you are experiencing connectivity issues it can be helpful to “bypass” the gateway and invoke the Hadoop REST APIs directly. This can easily be done using the cURL command line utility or many other REST/HTTP clients. Exactly how to use cURL depends on the configuration of your Hadoop cluster. In general, however, you will use a command line similar to the one that follows.

curl -ikv -X GET 'http://namenode-host:50070/webhdfs/v1/?op=LISTSTATUS'

If you are using Sandbox the WebHDFS or NameNode port will be mapped to localhost so this command can be used.

curl -ikv -X GET 'http://localhost:50070/webhdfs/v1/?op=LISTSTATUS'
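
If cURL is not available, any HTTP client can perform the same check. Below is a minimal, hedged sketch using the JDK’s HttpURLConnection against the same Sandbox address; it assumes an unsecured cluster, and the user.name value is illustrative.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsDirectCheck {

  public static void main( String[] args ) throws Exception {
    // Same WebHDFS listing as the cURL example above, bypassing the gateway.
    URL url = new URL( "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS&user.name=guest" );
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod( "GET" );
    int status = connection.getResponseCode();
    System.out.println( "HTTP " + status );
    // Read the success or error body so the JSON response is visible.
    InputStream stream = status < 400 ? connection.getInputStream() : connection.getErrorStream();
    if ( stream != null ) {
      BufferedReader reader = new BufferedReader( new InputStreamReader( stream ) );
      String line;
      while ( ( line = reader.readLine() ) != null ) {
        System.out.println( line );
      }
      reader.close();
    }
    connection.disconnect();
  }

}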

If you are using a cluster secured with Kerberos you will need to have used kinit to authenticate to the KDC. Then the command below should verify that WebHDFS in the Hadoop cluster is accessible.

curl -ikv --negotiate -u : -X GET 'http://localhost:50070/webhdfs/v1/?op=LISTSTATUS'

Authentication Issues

The following log information is available when you enable debug level logging for Shiro. This can be done within the conf/log4j.properties file. Note the “Password not correct for user” message.

13/11/15 16:37:15 DEBUG authc.BasicHttpAuthenticationFilter: Attempting to execute login with headers [Basic Z3Vlc3Q6Z3Vlc3QtcGFzc3dvcmQw]
13/11/15 16:37:15 DEBUG ldap.JndiLdapRealm: Authenticating user 'guest' through LDAP
13/11/15 16:37:15 DEBUG ldap.JndiLdapContextFactory: Initializing LDAP context using URL [ldap://localhost:33389] and principal [uid=guest,ou=people,dc=hadoop,dc=apache,dc=org] with pooling disabled
2013-11-15 16:37:15,899 INFO  Password not correct for user 'uid=guest,ou=people,dc=hadoop,dc=apache,dc=org'
2013-11-15 16:37:15,899 INFO  Authenticator org.apache.directory.server.core.authn.SimpleAuthenticator@354c78e3 failed to authenticate: BindContext for DN 'uid=guest,ou=people,dc=hadoop,dc=apache,dc=org', credentials <0x67 0x75 0x65 0x73 0x74 0x2D 0x70 0x61 0x73 0x73 0x77 0x6F 0x72 0x64 0x30 >
2013-11-15 16:37:15,899 INFO  Cannot bind to the server
13/11/15 16:37:15 DEBUG servlet.SimpleCookie: Added HttpServletResponse Cookie [rememberMe=deleteMe; Path=/gateway/vaultservice; Max-Age=0; Expires=Thu, 14-Nov-2013 21:37:15 GMT]
13/11/15 16:37:15 DEBUG authc.BasicHttpAuthenticationFilter: Authentication required: sending 401 Authentication challenge response.
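
The Basic header in the first log line above is simply the Base64 encoded username:password pair, so decoding it is a quick way to confirm exactly which credentials the client actually sent. Below is a minimal sketch that assumes commons-codec (already listed among the Hive sample dependencies) is on the classpath.

import org.apache.commons.codec.binary.Base64;

public class DecodeBasicAuthToken {

  public static void main( String[] args ) {
    // Token copied from the "Attempting to execute login with headers" log line above.
    String token = "Z3Vlc3Q6Z3Vlc3QtcGFzc3dvcmQw";
    // Prints guest:guest-password0 - the stray trailing character explains the failed bind.
    System.out.println( new String( Base64.decodeBase64( token ) ) );
  }

}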

The client will likely see something along the lines of:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: BASIC realm="application"
Content-Length: 0
Server: Jetty(8.1.12.v20130726)

Using ldapsearch to verify LDAP connectivity and credentials

If your authentication to Knox fails and you believe you are using correct credentials, you can verify the connectivity and credentials using ldapsearch, assuming you are using an LDAP directory for authentication.

Assuming you are using the default values that ship with Knox, your ldapsearch command would look like the following.

ldapsearch -h localhost -p 33389 -D "uid=guest,ou=people,dc=hadoop,dc=apache,dc=org" -w guest-password -b "uid=guest,ou=people,dc=hadoop,dc=apache,dc=org" "objectclass=*"

This should produce output like the following

# extended LDIF
#
# LDAPv3
# base <uid=guest,ou=people,dc=hadoop,dc=apache,dc=org> with scope subtree
# filter: objectclass=*
# requesting: ALL
#

# guest, people, hadoop.apache.org
dn: uid=guest,ou=people,dc=hadoop,dc=apache,dc=org
objectClass: organizationalPerson
objectClass: person
objectClass: inetOrgPerson
objectClass: top
uid: guest
cn: Guest
sn: User
userpassword:: Z3Vlc3QtcGFzc3dvcmQ=

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

In a more general form the ldapsearch command would be

ldapsearch -h {HOST} -p {PORT} -D {DN of binding user} -w {bind password} -b {DN of binding user} "objectclass=*"

Hostname Resolution Issues

The deployments/sandbox.xml topology file has the host mapping feature enabled. This is required due to the way networking is set up in the Sandbox VM. Specifically the VM’s internal hostname is sandbox.hortonworks.com. Since this hostname cannot be resolved to the actual VM, Knox needs to map that hostname to something resolvable.

If, for example, host mapping is disabled but the Sandbox VM is still used you will see an error in the diagnostic output similar to the one below.

13/11/18 19:11:35 WARN hadoop.gateway: Connection exception dispatching request: http://sandbox.hortonworks.com:50075/webhdfs/v1/user/guest/example/README?op=CREATE&namenoderpcaddress=sandbox.hortonworks.com:8020&user.name=guest&overwrite=false java.net.UnknownHostException: sandbox.hortonworks.com
java.net.UnknownHostException: sandbox.hortonworks.com
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)

On the other hand, if you are migrating from the Sandbox based configuration to a cluster you have deployed you may see a similar error. However in this case you may need to disable host mapping. This can be done by modifying the topology file (e.g. deployments/sandbox.xml) for the cluster.

...
<provider>
    <role>hostmap</role>
    <name>static</name>
    <enabled>false</enabled>
    <param><name>localhost</name><value>sandbox,sandbox.hortonworks.com</value></param>
</provider>
...

Job Submission Issues - HDFS Home Directories

If you see an error like the following in your console while submitting a job using the Groovy shell, it is likely that the authenticated user does not have a home directory on HDFS.


Caught: org.apache.hadoop.gateway.shell.HadoopException: org.apache.hadoop.gateway.shell.ErrorResponse: HTTP/1.1 403 Forbidden
org.apache.hadoop.gateway.shell.HadoopException: org.apache.hadoop.gateway.shell.ErrorResponse: HTTP/1.1 403 Forbidden

You would also see this error if you try a file operation on the home directory of the authenticated user.

The error would look a little different, as shown below, if you are attempting the operation with cURL.


{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=tom, access=WRITE, inode=\"/user\":hdfs:hdfs:drwxr-xr-x"}}* 

Resolution

Create the home directory for the user on HDFS. The home directory is typically of the form /user/{userid} and should be owned by that user. The ‘hdfs’ user can create such a directory (for example with hadoop fs -mkdir) and make the user the owner of the directory (for example with hadoop fs -chown).

Job Submission Issues - OS Accounts

If the Hadoop cluster is not secured with Kerberos, the user submitting a job need not have an OS account on the Hadoop NodeManagers.

If the Hadoop cluster is secured with Kerberos, the user submitting the job should have an OS account on the Hadoop NodeManagers.

In either case, if the user does not have such an OS account, their file permissions are based on user ownership of files or the “other” permissions in the “ugo” POSIX permission model. The user does not get any file permissions as a member of any group if you are using the default hadoop.security.group.mapping.

TODO: add sample error message from running test on secure cluster with missing OS account

HBase Issues

If you experience problems running the HBase samples with the Sandbox VM it may be necessary to restart HBase and Stargate. This can sometimes occur when the Sandbox VM is restarted from a saved state. If the client hangs after emitting the last line in the sample output below you are most likely affected.

System version : {...}
Cluster version : 0.96.0.2.0.6.0-76-hadoop2
Status : {...}
Creating table 'test_table'...

HBase and Stargate can be restarted using the following commands on the Hadoop Sandbox VM. You will need to SSH into the VM in order to run these commands.

sudo -u hbase /usr/lib/hbase/bin/hbase-daemon.sh stop master
sudo -u hbase /usr/lib/hbase/bin/hbase-daemon.sh start master
sudo -u hbase /usr/lib/hbase/bin/hbase-daemon.sh restart rest -p 60080

SSL Certificate Issues

Clients that do not trust the certificate presented by the server will behave in different ways. A browser will typically warn you of the inability to trust the received certificate and give you an opportunity to add an exception for the particular certificate. cURL will present you with the following message and instructions for turning off certificate verification:

curl performs SSL certificate verification by default, using a "bundle" 
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.
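
For Java clients, such as the Hive JDBC sample earlier in this guide, one common remedy is to import the gateway’s certificate into a truststore and point the JVM at it using the standard JSSE system properties. The sketch below is illustrative only: the truststore path and password are placeholders, and whether a given client library honors these default JSSE settings depends on how it builds its SSL context.

public class GatewayTrustStoreSetup {

  public static void main( String[] args ) {
    // Illustrative placeholders - point these at a truststore into which the
    // gateway's certificate has been imported (for example with the JDK keytool).
    System.setProperty( "javax.net.ssl.trustStore", "/tmp/gateway-client-trust.jks" );
    System.setProperty( "javax.net.ssl.trustStorePassword", "changeit" );
    // ... continue with HTTPS client calls, e.g. the JDBC connection shown earlier.
  }

}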

Filing Bugs

Bugs can be filed using Jira. Please include the results of the command below in the Environment section. Also include the version of Hadoop being used in the same section.

cd {GATEWAY_HOME}
java -jar bin/gateway.jar -version

Export Controls

Apache Knox Gateway includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country’s laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Trademarks

Apache Knox, Apache Knox Gateway, Apache, the Apache feather logo and the Apache Knox Gateway project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.

License

Apache Knox uses the standard Apache license.

Privacy Policy

Apache Knox uses the standard Apache privacy policy.

Information about your use of this website is collected using server access logs and a tracking cookie. The collected information consists of the following:

Part of this information is gathered using a tracking cookie set by the Google Analytics service. Google’s policy for the use of this information is described in their privacy policy. See your browser’s documentation for instructions on how to disable the cookie if you prefer not to share this data with Google.

We use the gathered information to help us make our site more useful to visitors and to better understand how and when our site is used. We do not track or collect personally identifiable information or associate gathered data with any personally identifying information from other sources.

By using this website, you consent to the collection of this data in the manner and for the purpose described above.