# datascience
i
Hi, all. Is there any tip for using Kotlin Spark on Jupyter? Here at work we have successfully developed data pipelines using Spark on Kotlin, and it has been a great experience so far. However, we haven't been able to use it on Jupyter due to some serialization problems.
a
Could you please dump the stack trace here? It should work fine.
r
@ita Yep, we fixed serialization issues with Spark in Jupyter a while ago, but its support is probably outdated by now. If you could share a simple notebook with us, it would help us a lot! (cc @Ilya Muradyan)
i
I’ll provide a sample notebook.
i
Make sure you are using the %dumpClassesForSpark magic. If it still doesn't work, please share the stack trace.
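A minimal first cell could look like this (a rough sketch; the Point class is just an illustration of a user-defined class that Spark executors would otherwise fail to deserialize):
%dumpClassesForSpark
// The magic makes classes compiled by the notebook available on disk,
// so Spark executors can load them when deserializing user objects.
data class Point(val x: Double, val y: Double)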
👀 1
i
I ended up using %use spark, which calls %dumpClassesForSpark behind the scenes, and it worked smoothly. However, if I try to create a Spark DataFrame from an AWS S3 file, I get java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found when running from Kotlin. Using PySpark in the same container, I can create it from the S3 file. I noticed that Kotlin Jupyter uses the $HOME/.m2 directory to find the Spark libraries. How can I make it use the Spark libraries already installed under my $SPARK_HOME?
i
@ita I'm not a Spark expert, but this problem seems to be common: https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f The first thing I can suggest is to check whether the Spark version on your AWS cluster matches the version of Spark in the descriptor: https://github.com/Kotlin/kotlin-jupyter-libraries/blob/master/spark.json. The default is 2.4.4, as you can see. To use another version of Spark, specify it in the %use magic, e.g.:
%use spark(spark=2.4.8, scala=...)
It may turn out that you don't need all the libraries in this descriptor's dependencies block, or that you need some other libraries. It may also turn out that you need other imports or initialization code. In this case, you can try to avoid using %use spark at all. First, execute the %dumpClassesForSpark magic. Then, specify the desired dependencies, e.g.:
@file:DependsOn("org.apache.spark:spark-mllib_2.11:2.4.8")
If you want to use locally installed libraries, specify the path to the JAR instead of GAV coordinates:
@file:DependsOn("/path/to/library.jar")
Alternatively, if $SPARK_HOME is a local Maven repository (not just a directory with JARs), you can add it as a repository:
@file:Repository("/spark/home/path")
and then add dependencies via GAV coordinates, as mentioned above. Then, specify the desired imports, e.g.:
import org.apache.spark.sql.*
And finally, write the code that initializes the Spark session, e.g.:
// Build (or reuse) a local Spark session
val spark = SparkSession
    .builder()
    .appName("Spark example")
    .master("local")
    .getOrCreate()
// The underlying SparkContext, if you need the RDD API
val sc = spark.sparkContext()
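For your S3 case specifically, something like the following might work (a sketch, not tested: the hadoop-aws version must match the Hadoop version your Spark was built with, and the bucket path and credentials are placeholders):
@file:DependsOn("org.apache.hadoop:hadoop-aws:2.7.3")

import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .appName("S3 example")
    .master("local")
    .getOrCreate()

// Point the S3A connector at your credentials; in practice these
// often come from the environment or an instance profile instead.
spark.sparkContext().hadoopConfiguration().apply {
    set("fs.s3a.access.key", "<access-key>")
    set("fs.s3a.secret.key", "<secret-key>")
}

// s3a:// paths become readable once hadoop-aws is on the classpath
val df = spark.read().csv("s3a://some-bucket/some/file.csv")
df.show(5)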
a
And we can ask @Pasha Finkelshteyn
👌 1
a
@altavir But for me Kotlin Spark is just a library; I don't know anything about Jupyter's magics 😞 Of course, the mentioned exception means that there is no such class on the classpath, but the question of why remains open. Should it be part of Spark or of Jupyter? I'm not sure 😞