# kotlin-spark
z
Good morning everyone, I just started using Spark and I'm delighted that Kotlin supports it too. (That Java API thing is, in my opinion, a creepy mess πŸ˜‘) But we need Spark NLP, which currently only works with Spark 2.3.x and 2.4.x. Does anyone know how long it'll take to reach full 2.4.x support along with Scala 2.11? (Or how well the current version works with 2.4.x?) I didn't find much on this topic. πŸ˜„
i
As I understand it, the Kotlin API is compatible only with Spark 3.0. However, @Pasha Finkelshteyn may have more reliable info.
p
Thank you for mentioning me, @Ilya Muradyan @zeugederunity. We have an almost complete implementation for 2.4.0; it's not released yet because we need to fix one bug with array support. It's worth mentioning that Scala 2.11 is supported only with Spark 2.4.0; starting from 2.4.1 it works only with Scala 2.12. If you are ready to give the snapshot version a try, I can provide you with all the necessary instructions.
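For anyone who wants to try it, a hedged sketch of what pulling in the snapshot could look like with Gradle's Kotlin DSL. The artifact coordinate is the one that appears later in this thread; the snapshot repository URL is an assumption, not an official instruction:
```kotlin
// build.gradle.kts — sketch only; the snapshot repository URL is an assumption
repositories {
    mavenCentral()
    maven("https://oss.sonatype.org/content/repositories/snapshots")
}

dependencies {
    // Coordinate taken from the resolution error quoted later in this thread
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api-2.4_2.11:1.0.0-preview2-SNAPSHOT")
    implementation("org.apache.spark:spark-sql_2.11:2.4.0")
}
```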
z
Hello @Pasha Finkelshteyn, thank you for that kind offer, I would really appreciate your help once you can provide a snapshot πŸ˜„ (I'm planning to use it in my PhD thesis and generally at our chair, so I don't mind using snapshots, alphas, etc.) I just started working with Spark NLP (and Spark in general), so a snapshot would really help me out.
p
Oh, it's an honor for me to be part of a thesis 😍 I'll DM you soon
πŸ™ 1
z
@Pasha Finkelshteyn OK, a first (short) bit of feedback regarding Spark, Scala, and Kotlin: I wrote a small wrapper for `UDFRegistration::register` to place the functional-interface parameter at the end of the parameter list. Furthermore, I added some type checks to the register call to make sure that the parameters of the `UDF<N>` function are not subclasses of kotlin.collections.Iterable or any type of array (this causes errors). I also added a class that wraps `functions.callUDF` with more safety checks. Creating and calling a UDF now looks like this in my code:
```kotlin
// Register a UDF via the wrapper; the functional parameter is now a trailing lambda.
val joinArray = spark.sqlContext()
    .udf()
    .register("joinArray", DataTypes.StringType) { array: WrappedArray<String> ->
        // Spark passes array columns as a scala WrappedArray, not a Kotlin collection.
        array.asMutableCollection().joinToString(" ")
    }

// Apply the registered UDF to a column.
val processedResult = result.withColumn("colName", joinArray(result.col("colName")))
```
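For readers following along, here is a minimal sketch of what such a wrapper might look like, assuming Spark 2.4's Java `UDF1` API. This is a reconstruction from the description above, not the actual contributed code, and the `callUDF` wrapper class is simplified to a plain lambda:
```kotlin
import org.apache.spark.sql.Column
import org.apache.spark.sql.UDFRegistration
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataType

// Sketch: a one-argument `register` overload that takes the lambda last and
// rejects parameter types Spark will not actually pass to a UDF at runtime.
inline fun <reified T, R> UDFRegistration.register(
    name: String,
    returnType: DataType,
    noinline func: (T) -> R
): (Column) -> Column {
    // Spark hands array columns to a UDF as a scala WrappedArray, so Kotlin
    // Iterables and arrays as parameter types would fail later; fail fast here.
    require(!Iterable::class.java.isAssignableFrom(T::class.java) && !T::class.java.isArray) {
        "Parameter type ${T::class} is not supported; use scala WrappedArray for array columns"
    }
    register(name, UDF1<T, R> { t -> func(t) }, returnType)
    // Return a handle around functions.callUDF so the UDF can be applied to a Column.
    return { column -> functions.callUDF(name, column) }
}
```
With something like this in scope, the `joinArray` snippet above both type-checks its parameter at registration time and yields a value that can be applied directly to a `Column`.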
Another point: a function like this for `WrappedArray` instances would be nice in the framework:
```kotlin
import scala.collection.JavaConverters
import scala.collection.mutable.WrappedArray

// Expose a scala WrappedArray as a Java/Kotlin collection.
fun <T> WrappedArray<T>.asMutableCollection(): MutableCollection<T> =
    JavaConverters.asJavaCollectionConverter(this).asJavaCollection()
```
It took me pretty long to find out that this converter function exists. Especially when the error message only says that the conversion to a `List` is not possible, it is quite hard for someone new to Spark and Scala to figure out what is going wrong.
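To make the pitfall concrete for other newcomers, here is a hedged illustration (names are hypothetical): the registration below compiles, but when Spark invokes the UDF on an array column it passes a `scala.collection.mutable.WrappedArray`, so the implicit cast to `kotlin.collections.List` fails at runtime with exactly that kind of opaque error.
```kotlin
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types.DataTypes

// Compiles fine, fails at runtime: Spark passes a WrappedArray for array
// columns, and the generated cast to kotlin.collections.List throws.
fun registerBrokenUdf(spark: SparkSession) {
    spark.udf().register(
        "joinList",
        UDF1<List<String>, String> { xs -> xs.joinToString(" ") },
        DataTypes.StringType
    )
}
```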
p
Wow
I have several questions if you have time to answer them πŸ™‚
z
of course πŸ˜„
p
1. Would you mind contributing your wrapper for `UDFRegistration`?
2. Where the hell did you get `WrappedArray`?
z
Regarding 1: I won't mind at all, it would be my pleasure, but I don't know where I should add my code. I only tested it with 2.4 and I don't know if it works with 3.0 ^^' Regarding 2: I got it from the Scala package. When I tried to register a function with `List<String>` it failed; after some trial and error I found out that the type I actually got was a `WrappedArray`.
p
2. Looks like a bug
1. Let's put it into a separate file called UDFSupport and write a small test for it. After that we'll be able to copy it to 3.0 and check whether it works there too
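Such a test could look roughly like this; a sketch only, assuming a local `SparkSession`, the trailing-lambda `register` wrapper sketched above, and `kotlin.test` as the framework (all names here are hypothetical):
```kotlin
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DataTypes
import kotlin.test.Test
import kotlin.test.assertEquals

class UDFSupportTest {
    @Test
    fun `UDF registered via the wrapper is callable from SQL`() {
        val spark = SparkSession.builder()
            .master("local[1]")
            .appName("udf-support-test")
            .getOrCreate()
        // Uses the hypothetical trailing-lambda `register` wrapper.
        spark.udf().register("shout", DataTypes.StringType) { s: String -> s.uppercase() }
        val actual = spark.sql("SELECT shout('hi')").head().getString(0)
        assertEquals("HI", actual)
        spark.stop()
    }
}
```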
z
Nice πŸ˜„ Then I'll fork it and make a merge request. πŸ˜„
p
TBH I didn't have in mind support for UDFs because I thought about support for UDFs
Thank you so much for the effort!
z
When I open the fork as a project I get the following sync error: `Cannot resolve org.jetbrains.kotlinx.spark:kotlin-spark-api-2.4_2.11:1.0.0-preview2-SNAPSHOT` for `Kotlin Spark API: Examples for Spark 2.4+ (Scala 2.11)`
p
I believe you need to activate the Maven profile `scala-2.11`
This can be done by calling Maven with `-Pscala-2.11`, or in IDEA by selecting the `scala-2.11` profile in the Maven toolbar
z
OK, that works, thanks πŸ˜„
OK, I have a polished version of the code now. There may be a need for more unit tests, and I have some ideas for more elaborate safety checks, but for now I'll create a pull request with the code shown above. For implementing the mentioned advanced checks I'll need feedback from you or someone else with more experience on whether that is possible/makes sense.
My next planned merge request is wrappers for the converter functions, because I think they are quite important for usability. Or are they already in production? Then I won't have to write them myself :D