Accessing HDFS from Spark in Zeppelin
Categories: BigData
When using a Zeppelin notebook for data analysis (e.g. on a Hortonworks platform), files stored in HDFS can be manipulated from a paragraph that uses the shell interpreter (%sh
header) together with the hdfs command-line tool. However, it is sometimes nicer to access HDFS directly from Spark/Scala code rather than requiring a separate block.
Shell-based access to HDFS is well documented, e.g. in the “intro to spark” demo notebook provided by Hortonworks:
%sh
hdfs dfs -ls /tmp
Enabling access from Spark/Scala is not documented anywhere that I could find, but is quite simple:
%spark
val hfs = org.apache.hadoop.fs.FileSystem.newInstance(sc.hadoopConfiguration)

// and here are some functions that demonstrate using the `hfs` variable (and are themselves useful)..

// Return the paths directly under `path`, or an empty array if the directory does not exist.
def listHdfsDir(path: String): Array[org.apache.hadoop.fs.Path] = {
  val p = new org.apache.hadoop.fs.Path(path)
  if (hfs.exists(p))
    hfs.listStatus(p).map(_.getPath())
  else
    Array[org.apache.hadoop.fs.Path]()
}

// Print a summary of the entries directly under `path`.
def showHdfsDir(path: String): Unit = {
  val entries = listHdfsDir(path)
  println(s"Dir $path contains ${entries.size} entries:")
  entries.foreach(println)
  println()
}

// Recursively delete `path` if it exists.
def deleteHdfs(path: String): Unit = {
  val p = new org.apache.hadoop.fs.Path(path)
  if (hfs.exists(p)) {
    hfs.delete(p, true)
  }
}
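As a quick sketch of how these helpers might then be used from a later %spark paragraph (the paths and the parquetDirs variable below are purely illustrative, not part of the original notebook):

%spark
// print the contents of a directory
showHdfsDir("/tmp")

// recursively delete an old output directory, if one exists (example path)
deleteHdfs("/tmp/old-job-output")

// listHdfsDir returns the raw Path objects, so the result can be filtered further,
// e.g. keeping only entries whose names end in ".parquet"
val parquetDirs = listHdfsDir("/tmp").filter(_.getName.endsWith(".parquet"))

Note that FileSystem.newInstance returns a fresh client rather than the JVM-wide cached instance that FileSystem.get would give you, so it can later be closed with hfs.close() without disturbing the filesystem object that Spark itself is using.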