A Lambda Architecture with Spark Streaming from Walmart Labs
I was part of a project that tried to do stream processing with Spark a year or so ago. That didn’t go at all well; we had few resources and little time, and (IMO) Spark Streaming was simply not mature enough for production.
One of the nasty problems we had was that landing data into Hive created large numbers of small files; Walmart Labs solved that by using KairosDB as the target storage instead. KairosDB is a time-series layer on top of Cassandra, i.e. the underlying store is HBase-like.
Another serious problem with Spark Streaming is session detection; it is possible, but only with significant complexity. If I understand correctly, they solved that via the lambda architecture: rough session detection in streaming, and better detection in the batch pass.
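To make the rough-versus-better distinction concrete, here is a minimal sketch in plain Python (no Spark involved). It assumes sessions are defined by an inactivity gap, which the article may or may not do; the `sessionize` helper, the event data and the 30-minute gap are all made up for illustration.

```python
from collections import defaultdict


def sessionize(events, gap_seconds):
    """Group (user, timestamp) events into (user, start, end) sessions.

    A new session starts whenever a user is inactive for more than
    gap_seconds. This is a hypothetical helper, not code from the article.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        start = last = stamps[0]
        for ts in stamps[1:]:
            if ts - last > gap_seconds:
                # Gap too long: close the current session, open a new one.
                sessions.append((user, start, last))
                start = ts
            last = ts
        sessions.append((user, start, last))
    return sessions


# Made-up click events: (user, epoch seconds).
events = [("alice", 0), ("alice", 50), ("alice", 70), ("alice", 2000), ("bob", 10)]
GAP = 1800  # assumed 30-minute inactivity gap

# Streaming pass: each micro-batch is sessionized in isolation, so a
# session that straddles a batch boundary gets cut in two.
micro_batches = [
    [e for e in events if e[1] < 60],
    [e for e in events if e[1] >= 60],
]
rough = [s for batch in micro_batches for s in sessionize(batch, GAP)]

# Batch pass: sees every event at once and stitches the session together.
exact = sessionize(events, GAP)

print("rough:", rough)
print("exact:", exact)
```

In the streaming pass alice's first visit shows up as two sessions (0 to 50 and 70 to 70) because it crosses the micro-batch boundary; the batch pass reports it as one session from 0 to 70. Avoiding that split in pure streaming means carrying per-user state across batches, which is where the complexity comes from.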
They still apparently had to fiddle with lots of Spark Streaming parameters though (batch duration, memory.fraction, locality.wait, executor/core ratios), and write custom monitoring code. And they were running on a dedicated Spark cluster, not YARN. My conclusion from this is: yes, Spark Streaming can work for production use-cases, but it is hard.
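For readers who have not met these knobs, here is a rough sketch of where they live. I am assuming the shorthand above maps to the Spark properties `spark.memory.fraction` and `spark.locality.wait`, and that the executor/core ratio is controlled via `spark.executor.instances` and `spark.executor.cores`. The values are placeholders, not the settings from the article, and `spark_submit_args` is a helper invented for this example.

```python
def spark_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder values for illustration only; the right numbers depend
# entirely on the workload and the cluster.
tuning = {
    "spark.memory.fraction": "0.6",
    "spark.locality.wait": "3s",
    "spark.executor.instances": "10",
    "spark.executor.cores": "4",
}

# The batch duration is not a property: it is the interval handed to the
# StreamingContext constructor when the streaming job is created.
BATCH_DURATION_SECONDS = 10

print(" ".join(spark_submit_args(tuning)))
```

None of these settings is exotic on its own; the hard part is that they interact, so a batch duration that works with one executor layout can fall behind with another.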
After my experiences, and some confirmation from this article, a solution based on Flink, Kafka Streams, or maybe Apache Beam seems simpler to me. Those are all robust enough to process data fully in streaming mode, i.e. the kappa architecture.
While talking about Spark, here is an unrelated but interesting article on Spark for Data Science: the Good, Bad and Ugly.