# datascience
i
Hi, all. Is there any tip for using Kotlin Spark on Jupyter? Here at work we have successfully developed data pipelines using Spark on Kotlin, and it has been a great experience so far. However, we haven't been able to use it on Jupyter due to some serialization problems.
a
Could you please dump the stack trace here? It should work fine.
r
@ita Yep, we fixed serialization issues with Spark in Jupyter a while ago, but its support is probably outdated by now. If you could share a simple notebook with us, it would help us a lot! (cc @Ilya Muradyan)
i
I’ll provide a sample notebook.
i
Make sure you are using the %dumpClassesForSpark magic. If it still doesn't work, please share the stack trace.
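A minimal first cell could look like this (a rough sketch; the Point class is just an illustration of a user-defined class that Spark executors would otherwise fail to deserialize):
%dumpClassesForSpark
// The magic makes classes compiled by the notebook available on disk,
// so Spark executors can load them when deserializing user objects.
data class Point(val x: Double, val y: Double)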
👀 1
i
I ended up using %use spark, which calls %dumpClassesForSpark behind the scenes, and it worked smoothly. However, if I try to create a Spark DataFrame from an AWS S3 file, I get java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found when running from Kotlin. Using PySpark in the same container, I can create it from the S3 file. I noticed that Kotlin Jupyter uses the $HOME/.m2 directory to find the Spark libraries. How can I make it use the Spark libraries already installed under my $SPARK_HOME?
i
@ita I'm not a Spark expert, but this problem seems to be common: https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f The first thing I can suggest is to check whether the Spark version on your AWS cluster matches the version of Spark in the descriptor: https://github.com/Kotlin/kotlin-jupyter-libraries/blob/master/spark.json. The default is 2.4.4, as you can see. To use another version of Spark, specify it in the %use magic, e.g.:
%use spark(spark=2.4.8, scala=...)
It may turn out that you don't need all the libraries in this descriptor's dependencies block, or that you need some other libraries. It may also turn out that you need other imports or initialization code. In this case, you can try to avoid using %use spark at all. First, execute the %dumpClassesForSpark magic. Then, specify the desired dependencies, e.g.:
@file:DependsOn("org.apache.spark:spark-mllib_2.11:2.4.8")
If you want to use locally installed libraries, specify the path to the JAR instead of GAV coordinates:
@file:DependsOn("/path/to/library.jar")
Alternatively, if $SPARK_HOME is a local Maven repository (not just a directory with JARs), you can add it as a repository:
@file:Repository("/spark/home/path")
and then add dependencies via GAV coordinates, as mentioned above. Then, specify the desired imports, e.g.:
import org.apache.spark.sql.*
And finally, write the code that initializes the Spark session, e.g.:
// Build (or reuse) a local Spark session
val spark = SparkSession
    .builder()
    .appName("Spark example")
    .master("local")
    .getOrCreate()
// The underlying SparkContext, if you need the RDD API
val sc = spark.sparkContext()
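For your S3 case specifically, something like the following might work (a sketch, not tested: the hadoop-aws version must match the Hadoop version your Spark was built with, and the bucket path and credentials are placeholders):
@file:DependsOn("org.apache.hadoop:hadoop-aws:2.7.3")

import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .appName("S3 example")
    .master("local")
    .getOrCreate()

// Point the S3A connector at your credentials; in practice these
// often come from the environment or an instance profile instead.
spark.sparkContext().hadoopConfiguration().apply {
    set("fs.s3a.access.key", "<access-key>")
    set("fs.s3a.secret.key", "<secret-key>")
}

// s3a:// paths become readable once hadoop-aws is on the classpath
val df = spark.read().csv("s3a://some-bucket/some/file.csv")
df.show(5)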
a
And we can ask @Pasha Finkelshteyn
👌 1
a
@altavir But for me Kotlin Spark is just a library; I don't know anything about Jupyter's magics 😞 Of course, the mentioned exception means that there is no such class on the classpath, but the question of why remains open. Should it be part of Spark or of Jupyter? I'm not sure 😞