Accessing HDFS from Spark in Zeppelin

Categories: BigData

When using a Zeppelin notebook for data analysis (e.g. on a Hortonworks platform), files stored in HDFS can be manipulated from a paragraph that uses the shell interpreter (%sh header) and then invokes the hdfs command-line tool. However, it is sometimes nicer to access HDFS directly from Spark/Scala code rather than requiring a separate block.

Shell-based access to HDFS is well documented, e.g. in the “intro to spark” demo notebook provided by Hortonworks:


hdfs dfs -ls /tmp

Enabling access from Spark/Scala is not documented anywhere that I could find, but it is quite simple:


val hfs = org.apache.hadoop.fs.FileSystem.newInstance(sc.hadoopConfiguration)

// and here are some functions that demonstrate using the `hfs` variable (and are themselves useful)..

// list the entries directly under an HDFS path (empty if the path does not exist)
def listHdfsDir(path:String): Array[org.apache.hadoop.fs.Path] = {
    val p = new org.apache.hadoop.fs.Path(path)
    if (hfs.exists(p)) hfs.listStatus(p).map(_.getPath)
    else Array.empty[org.apache.hadoop.fs.Path]
}

// print a summary of an HDFS directory's contents
def showHdfsDir(path:String): Unit = {
    val entries = listHdfsDir(path)
    println(s"Dir $path contains ${entries.size} entries:")
    entries.foreach(entry => println(s"  $entry"))
}

// delete a file or directory if it exists
def deleteHdfs(path:String): Unit = {
    val p = new org.apache.hadoop.fs.Path(path)
    if (hfs.exists(p)) {
        // second argument enables recursive deletion of directories
        hfs.delete(p, true)
    }
}
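
Once those helpers are defined in one paragraph, later Spark paragraphs can call them directly; a minimal sketch (the paths here are just made-up examples):


showHdfsDir("/tmp")

deleteHdfs("/tmp/my-old-output")   // clear a previous run's output before rewriting it

One design note: FileSystem.newInstance returns a fresh filesystem object rather than the process-wide cached instance that FileSystem.get would return, so calling hfs.close() when you are done should not disturb the filesystem instance Spark itself is using.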