Accessing HDFS from Spark in Zeppelin

Categories: BigData

When using a Zeppelin notebook for data analysis (e.g. on a Hortonworks platform), files stored in HDFS can be manipulated from a paragraph using the shell interpreter (`%sh` header) together with the `hdfs` command-line tool. However, it is sometimes nicer to access HDFS directly from Spark/Scala code rather than requiring a separate shell paragraph.

Shell-based access to HDFS is well documented, e.g. in the “intro to spark” demo notebook provided by Hortonworks:

%sh

hdfs dfs -ls /tmp

Enabling access from Spark/Scala is not documented anywhere that I could find, but it is quite simple:

%spark

val hfs = org.apache.hadoop.fs.FileSystem.newInstance(sc.hadoopConfiguration)

// and here are some functions that demonstrate using the `hfs` variable (and are themselves useful):

// list the contents of an HDFS directory, returning an empty array if the path does not exist
def listHdfsDir(path: String): Array[org.apache.hadoop.fs.Path] = {
    val p = new org.apache.hadoop.fs.Path(path)
    if (hfs.exists(p))
      hfs.listStatus(p).map(_.getPath())
    else
      Array[org.apache.hadoop.fs.Path]()
}

// print the contents of an HDFS directory
def showHdfsDir(path: String): Unit = {
    val entries = listHdfsDir(path)
    println(s"Dir $path contains ${entries.size} entries:")
    entries.foreach(println)
    println()
}

// delete an HDFS path (recursively) if it exists
def deleteHdfs(path: String): Unit = {
    val p = new org.apache.hadoop.fs.Path(path)
    if (hfs.exists(p)) {
        hfs.delete(p, true) // true => delete directories recursively
    }
}
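For completeness, a Zeppelin paragraph using these helpers might look something like the following sketch. The paths here are just illustrative placeholders, not anything from a real cluster:

%spark

// hypothetical usage of the helpers defined above; paths are placeholders
showHdfsDir("/tmp")

// only show entries whose final path component ends in ".csv"
listHdfsDir("/tmp").filter(_.getName.endsWith(".csv")).foreach(println)

// remove an old output directory, then confirm it is gone
deleteHdfs("/tmp/some-old-output")
showHdfsDir("/tmp")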