# kotlin-spark
e
👋 Hello, team! I'm using kotlin-spark-api_3.3.2_2.12:1.2.4 to read and write some parquet files from MinIO through hadoop-aws and aws-java-sdk. I have a standalone cluster using docker compose, and I have a Jupyter notebook on the same network as well. Everything works fine if I launch it from Jupyter or from IntelliJ, but when I create a fat-jar with maven-shade-plugin and use `spark-submit`, it just stalls forever:
```
24/03/12 12:09:26 INFO StandaloneSchedulerBackend: Granted executor ID app-20240312110926-0002/1 on hostPort 172.20.0.4:42795 with 1 core(s), 1024.0 MiB RAM
24/03/12 12:09:26 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.55:58381 with 434.4 MiB RAM, BlockManagerId(driver, 192.168.1.55, 58381, None)
24/03/12 12:09:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.55, 58381, None)
24/03/12 12:09:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.55, 58381, None)
24/03/12 12:09:26 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
```
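For reference, a minimal sketch (not the actual job) of the kind of MinIO/S3A setup described above, written against the plain Spark SQL API from Kotlin; the endpoint, credentials, and bucket paths are placeholders rather than the real project configuration:

```kotlin
// Minimal sketch (assumed setup, not the actual job): a SparkSession pointed at MinIO
// through hadoop-aws's S3A connector, then a simple parquet read/write round-trip.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

fun main() {
    val spark = SparkSession.builder()
        .appName("minio-parquet-sketch")
        .master("spark://localhost:7077")  // standalone master, as in the spark-submit command below
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  // hypothetical MinIO endpoint
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")       // placeholder credentials
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")      // MinIO generally needs path-style access
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .getOrCreate()

    val df = spark.read().parquet("s3a://some-bucket/input/")          // hypothetical bucket/path
    df.write().mode(SaveMode.Append).parquet("s3a://some-bucket/output/")

    spark.stop()
}
```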
I'm using JVM 17. My `spark-submit` command is:
```
$SPARK_HOME/bin/spark-submit \
  --class "com.gearsofleo.platform.jobs.GeneratorApplication" \
  --master spark://localhost:7077 \
  --packages org.jetbrains.kotlin:kotlin-reflect:1.8.20,org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.2_2.12:1.2.4,org.apache.hadoop:hadoop-aws:3.2.4,com.amazonaws:aws-java-sdk:1.12.595 \
  target/ReportingLakehouse-1.0-SNAPSHOT.jar
```
my shade conf is:
```xml
<plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.2</version>
                <configuration>
                    <minimizeJar>true</minimizeJar>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>module-info.class</exclude>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                                <exclude>META-INF/*.md</exclude>
                                <exclude>META-INF/*.markdown</exclude>
                                <exclude>**/*.txt</exclude>
                                <exclude>**/*.proto</exclude>
                                <exclude>**/pom.properties</exclude>
                                <exclude>**/pom.xml</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            <mainClass>com.somecomp.platform.jobs.GeneratorApplication</mainClass>
                        </transformer>
                    </transformers>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.spark:*</exclude>
                                </excludes>
                            </artifactSet>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
```
So basically I'm open to any suggestions, because I've tried everything I can think of already.
j
Hmm. I'm not sure I can help here. I'm not sure this is due to the kotlin-spark-api or some other Spark shenanigans. If it's the latter, I'm not experienced enough to guide you, unfortunately 😅 The only thing I can think of is to try to run something similar without the kotlin-spark-api to see if the issue lies there
@asm0dey maybe? 🙂
e
But it should work with JVM 17, right? And using `spark-submit`?
knowing that works is a good start 😄
j
e
Because you need a lot of --add-exports on JVM 17 apparently; maybe it's safer to try with JVM 8.
j
now that you say that... That does sound familiar
e
except it works well from intellij and from jupyter :S
j
And yes, 8 or 11 should probably work! I'll get back to you on that export story, because I can't stand that I can't remember where I've seen it before, haha
🙏 1
might be the key
a
I'm not even sure spark works on 17 actually
e
It works as long as I run it either from jupyter or intellij
a
Are you sure they really use Java 17 to run it?
e
Yes, it's the same docker-compose with my standalone cluster, MinIO and Jupyter, plus my IntelliJ on the host.
```
Spark context Web UI available at http://7cc13da6fbf1:4040
Spark context available as 'sc' (master = local[*], app id = local-1710266430566).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
         
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17.0.8)
```
a
Well, this is hard to beat.
e
It's just when running through `spark-submit` that it stalls.
from intellij it works:
```
minRegisteredResourcesRatio: 0.0
2024-03-11 15:52:38 INFO  GeneratorApplication:25 - Starting spark job
2024-03-11 15:52:39 INFO  StandaloneAppClient$ClientEndpoint:61 - Executor updated: app-20240311145238-0014/0 is now RUNNING
2024-03-11 15:52:39 INFO  StandaloneAppClient$ClientEndpoint:61 - Executor updated: app-20240311145238-0014/1 is now RUNNING
2024-03-11 15:52:40 INFO  SharedState:61 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir
```
from jupyter as well 🙂
this is all fake data btw
j
Found it! https://jakubpradzynski.pl/posts/example-spark-kotlin-gradle-project/ Are you using --add-opens, like mentioned here?
e
I'm using some, yes; will check the link
j
That is what I remember from helping someone before with Spark and java 17
I believe this needed to be inserted in their notebook before it worked, but I can't find that exactly
e
So this: https://github.com/Kotlin/kotlin-spark-api/wiki/Quick-Start-Guide#building-the-application-with-maven runs on my cluster, which means I need to clean my Maven deps and simplify
now I get:
```
Executor app-20240312233908-0011/82 finished with state EXITED message Command exited with code 1 exitStatus 1
```
so this is progress I guess
```
NOTE: Picked up JDK_JAVA_OPTIONS: '--add-opens=java.base/java.lang=ALL-UNNAMED, --add-opens=java.base/java.lang.invoke=ALL-UNNAMED, --add-opens=java.base/java.lang.reflect=ALL-UNNAMED, --add-opens=java.base/java.io=ALL-UNNAMED, --add-opens=java.base/java.net=ALL-UNNAMED, --add-opens=java.base/java.nio=ALL-UNNAMED, --add-opens=java.base/java.util=ALL-UNNAMED, --add-opens=java.base/java.util.concurrent=ALL-UNNAMED, --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED, --add-opens=java.base/sun.nio.ch=ALL-UNNAMED, --add-opens=java.base/sun.nio.cs=ALL-UNNAMED, --add-opens=java.base/sun.security.action=ALL-UNNAMED, --add-opens=java.base/sun.util.calendar=ALL-UNNAMED, --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED'
WARNING: Unknown module:  --add-opens=java.base/java.lang.invoke=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.lang.reflect=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.io=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.net=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.nio=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.util=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.util.concurrent=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.nio.ch=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.nio.cs=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.security.action=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.util.calendar=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED specified to --add-opens
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3081f72c) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3081f72c
        at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
        at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:222)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:376)
        at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:210)
        at
```
FYI:
```
24/03/13 14:05:55 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
+--------------------+--------------------+--------------------+------------------+----------+------------+---------------+------------------+-------------+---------------+----------------+------------+----------+----------------------------------------------------------------------------------------------------+-----------+--------------------+-------------------+-------------------+--------------------+---------------------+--------------------+-------------------+---------------------+----------------+---------------------+
|playerUid           |idPremio            |idAposta            |estadoOrigemAposta|tipoAposta|statusAposta|motivoSuspensao|motivoCancelamento|codModalidade|competicao     |eventoModalidade|statusEvento|qtdMercado|idMercado                                                                                           |tipoMercado|quotaFixaMercado    |inicioEvento       |fimEvento          |quotaFixaTotal      |valorAposta          |valorIRPFRetido     |dataHoraAposta     |ganhoApostador       |indicadorCashOut|valorCashOut         |
+--------------------+--------------------+--------------------+------------------+----------+------------+---------------+------------------+-------------+---------------+----------------+------------+----------+----------------------------------------------------------------------------------------------------+-----------+--------------------+-------------------+-------------------+--------------------+---------------------+--------------------+-------------------+---------------------+----------------+---------------------+
|01006538959909174408|lm3YsqZ7PtETsxRjmcij|YPDtXmmlnAWw0LVQKsRy|74                |MULTIPLE  |CANCELED    |null           |EVENT_CANCELLATION|315          |UEFA           |Masters Cup     |IN_PROGRESS |200       |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|false      |5.300000000000000000|2024-03-13 11:58:20|2024-03-13 13:58:10|5.300000000000000000|46.000000000000000000|3.220000000000000000|2024-03-13 13:28:17|null                 |false           |null                 |
|03575822208816394959|ys9brAv1mikZkZ5rx1dl|F8myePRImL5j1bkSdKbA|56                |MULTIPLE  |ONGOING     |null           |null              |259          |NHL            |NHL             |POSTPONED   |706       |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|true       |1.240000000000000000|2024-03-13 11:24:11|2024-03-13 14:06:47|1.240000000000000000|76.000000000000000000|5.320000000000000000|2024-03-13 12:26:18|null                 |true            |76.000000000000000000|
|54630212552887067444|xw0Q2Vlf5ytQJpknpZo6|jSks02LyRSgRIow5jnAC|53                |SIMPLE    |NOT_AWARDED |null           |null              |958          |NBA            |NHL             |FINISHED    |null      |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|false      |5.300000000000000000|2024-03-13 11:51:56|2024-03-13 13:57:22|null                |68.000000000000000000|4.760000000000000000|2024-03-13 12:01:44|null                 |false           |null                 |
|84150253222367488864|IYnWIdndntRfPjLoegk5|pPJHSUI4pKvFPtnIpQ8O|19                |MULTIPLE  |ONGOING     |null           |null              |838          |UEFA           |Euro League Cup |LATE        |930       |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|true       |3.450000000000000000|2024-03-13 11:20:27|2024-03-13 14:54:36|3.450000000000000000|55.000000000000000000|3.850000000000000000|2024-03-13 13:48:28|null                 |false           |null                 |
|94856581322015931314|IR7BaHNgrSpcXuF4p4vD|QvNd0UGnw5EIUdoLURnv|42                |MULTIPLE  |SUSPENDED   |OTHER          |null              |960          |Masters Cup    |NHL             |CANCELLED   |936       |J5o0FtwHulKD62mQLpz73Y
```
It's writing to the parquet that stalls it, for some reason.
j
```
Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3081f72c) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3081f72c
        at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
        at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:222)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:376)
        at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:210)
        at
```
is interesting. Here on Stack Overflow, someone solves it by adding even more --add-opens: https://stackoverflow.com/questions/73465937/apache-spark-3-3-0-breaks-on-java-17-with-cannot-access-class-sun-nio-ch-direct
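In case it helps, a hedged sketch of one way to pass that flag list along: on Java 17 the executors need the same --add-opens set as the driver, and spark.executor.extraJavaOptions is one place to put it (the driver JVM needs the flags at launch time instead, e.g. via spark-submit --driver-java-options). The module list below is simply the one already visible in the JDK_JAVA_OPTIONS line above; note it is space-separated, since the "Unknown module" warnings in that log look like the comma-joined string was parsed as a single --add-opens value.

```kotlin
// Sketch (assumed wiring, not the actual job): hand the Java 17 --add-opens flags to the
// executors through Spark conf. The flags are joined with spaces, not commas.
import org.apache.spark.sql.SparkSession

// Same module/package list that appears in the JDK_JAVA_OPTIONS line in the log above.
val addOpens = listOf(
    "java.base/java.lang", "java.base/java.lang.invoke", "java.base/java.lang.reflect",
    "java.base/java.io", "java.base/java.net", "java.base/java.nio",
    "java.base/java.util", "java.base/java.util.concurrent", "java.base/java.util.concurrent.atomic",
    "java.base/sun.nio.ch", "java.base/sun.nio.cs", "java.base/sun.security.action",
    "java.base/sun.util.calendar", "java.security.jgss/sun.security.krb5"
).joinToString(" ") { "--add-opens=$it=ALL-UNNAMED" }

fun main() {
    val spark = SparkSession.builder()
        .appName("java17-flags-sketch")
        .config("spark.executor.extraJavaOptions", addOpens)
        // The driver JVM is already running when this executes, so the driver needs the
        // same flags at launch time, e.g. spark-submit --driver-java-options "...".
        .getOrCreate()
    spark.stop()
}
```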
e
I suspect it's a reflection thing.
```kotlin
fun <T> writeToParquet(ds: Dataset<T>, parquetName: String, partitionColumnName: String): Unit =
    ds.repartition(col(partitionColumnName))
        .write()
        .mode(SaveMode.Append)
        .partitionBy(partitionColumnName)
        .parquet("$parquetOutputPath/domains/$parquetName")
```
With this function it stalls.
If I write it directly though:
```kotlin
sportDS.repartition(col("sports"))
            .write()
            .mode(SaveMode.Append)
            .partitionBy("playerUid")
            .parquet("$parquetOutputPath/domains/sports")
```
It works, as opposed to using the function:
```kotlin
writeToParquet(sports.toDS(), "sports", "playerUid")
```
again, running from intellij this works fine
both ways work fine
j
Really odd. Have you also tried different Kotlin versions? I know they changed something some time ago that touches the serializability of Kotlin functions on the JVM
e
I was using 1.9.0 and then went back to 1.8.20
actually, scratch what I said, it still stalls on writing parquet
It's just that the `show` before executes fine.
j
And kotlin 1.9.23, just for the sake of completeness?
I suspect it won't change anything, but I also don't really know what else to try anymore
e
Will open more modules, just to be sure
So the issue has to do with incompatible libraries needed to work with S3/MinIO and nothing to do with the kotlin-spark-api.
I tested on GCP Dataproc (which doesn't use these libs) and it works perfectly
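For anyone who lands here later, a small sketch of one way to sanity-check that, assuming the suspicion is a Hadoop/AWS library version mismatch rather than anything project-specific: print the Hadoop version actually on the classpath at runtime, so hadoop-aws (and its matching AWS SDK) can be pinned to the same release line.

```kotlin
// Sketch: print the Hadoop version actually on the classpath at runtime, so that
// hadoop-aws (and the matching AWS SDK) can be pinned to the same release line.
// Mixing hadoop-aws with a different hadoop-common version is a common way to break S3A.
import org.apache.hadoop.util.VersionInfo

fun main() {
    println("Hadoop on the classpath: ${VersionInfo.getVersion()}")
}
```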
a
Dude! So cool you solved it! How did you find the root cause?
🙏 1
e
blood and tears... tears mostly 😞
Basically I tried out different operations except read/write and everything worked fine, plus I enabled DEBUG level for the logs.
I could see all the relevant dirs were being listed and all tasks created but when it was time to actually read the file... nothing
What I'd like to know is how exactly Jupyter "submits" the job.
my uberjar includes something that breaks the whole hadoop-aws/aws-sdk thing
j
Jupyter might behave differently due to class loading. It's loaded first and the integration with spark is loaded into it later. That might cause the difference
👍 1
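A small diagnostic sketch along those lines, for comparing the two environments: run it both in the notebook and in the spark-submit driver to see which jar each S3-related class is actually loaded from. The class names are the usual hadoop-aws / aws-java-sdk entry points, assumed here rather than taken from the project:

```kotlin
// Sketch: report which jar (if any) a class is loaded from, to compare the Jupyter
// classpath with the shaded-jar classpath used by spark-submit.
fun whereIs(className: String) {
    val location = try {
        Class.forName(className).protectionDomain?.codeSource?.location
    } catch (e: ClassNotFoundException) {
        null
    }
    println("$className -> ${location ?: "not found (or no code source)"}")
}

fun main() {
    whereIs("org.apache.hadoop.fs.s3a.S3AFileSystem")    // hadoop-aws
    whereIs("com.amazonaws.services.s3.AmazonS3Client")  // aws-java-sdk
    whereIs("org.apache.hadoop.fs.FileSystem")           // hadoop-common
}
```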