# kotlin-spark
e
👋 Hello, team! I'm using kotlin-spark-api_3.3.2_2.12:1.2.4 to read and write some parquet files from MinIO through hadoop-aws and aws-java-sdk. I have a standalone cluster using docker compose, and I have a Jupyter notebook on the same network as well. Everything works fine if I launch it from Jupyter or from IntelliJ, but when I create a fat-jar with maven-shade-plugin and use `spark-submit`, it just stalls forever:
```
24/03/12 12:09:26 INFO StandaloneSchedulerBackend: Granted executor ID app-20240312110926-0002/1 on hostPort 172.20.0.4:42795 with 1 core(s), 1024.0 MiB RAM
24/03/12 12:09:26 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.55:58381 with 434.4 MiB RAM, BlockManagerId(driver, 192.168.1.55, 58381, None)
24/03/12 12:09:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.55, 58381, None)
24/03/12 12:09:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.55, 58381, None)
24/03/12 12:09:26 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
```
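For reference, a minimal sketch (not the actual job) of the kind of MinIO/S3A setup described above, written against the plain Spark SQL API from Kotlin; the endpoint, credentials, and bucket paths are placeholders rather than the real project configuration:

```kotlin
// Minimal sketch (assumed setup, not the actual job): a SparkSession pointed at MinIO
// through hadoop-aws's S3A connector, then a simple parquet read/write round-trip.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

fun main() {
    val spark = SparkSession.builder()
        .appName("minio-parquet-sketch")
        .master("spark://localhost:7077")  // standalone master, as in the spark-submit command below
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  // hypothetical MinIO endpoint
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")       // placeholder credentials
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")      // MinIO generally needs path-style access
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .getOrCreate()

    val df = spark.read().parquet("s3a://some-bucket/input/")          // hypothetical bucket/path
    df.write().mode(SaveMode.Append).parquet("s3a://some-bucket/output/")

    spark.stop()
}
```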
I'm using JVM 17. My `spark-submit` command is:
```
$SPARK_HOME/bin/spark-submit \
  --class "com.gearsofleo.platform.jobs.GeneratorApplication" \
  --master spark://localhost:7077 \
  --packages org.jetbrains.kotlin:kotlin-reflect:1.8.20,org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.2_2.12:1.2.4,org.apache.hadoop:hadoop-aws:3.2.4,com.amazonaws:aws-java-sdk:1.12.595 \
  target/ReportingLakehouse-1.0-SNAPSHOT.jar
```
my shade conf is:
```xml
<plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.2</version>
                <configuration>
                    <minimizeJar>true</minimizeJar>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>module-info.class</exclude>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                                <exclude>META-INF/*.md</exclude>
                                <exclude>META-INF/*.markdown</exclude>
                                <exclude>**/*.txt</exclude>
                                <exclude>**/*.proto</exclude>
                                <exclude>**/pom.properties</exclude>
                                <exclude>**/pom.xml</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            <mainClass>com.somecomp.platform.jobs.GeneratorApplication</mainClass>
                        </transformer>
                    </transformers>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.spark:*</exclude>
                                </excludes>
                            </artifactSet>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
```
So basically I'm open to any suggestions, because I've tried everything I can think of already.
j
Hmm. I'm not sure I can help here. I'm not sure this is due to the kotlin-spark-api or some other Spark shenanigans. If it's the latter, I'm not experienced enough to guide you, unfortunately 😅 The only thing I can think of is to try to run something similar without the kotlin-spark-api to see if the issue lies there
@asm0dey maybe? 🙂
e
But it should work with JVM 17, right? And using `spark-submit`?
knowing that works is a good start 😄
j
e
Because you need a lot of --add-exports on JVM 17 apparently; maybe it's safer to try with JVM 8.
j
now that you say that... That does sound familiar
e
except it works well from intellij and from jupyter :S
j
And yes, 8 or 11 should probably work! I'll get back to you on that export story, because I can't stand that I can't remember where I've seen it before, haha
🙏 1
might be the key
a
I'm not even sure spark works on 17 actually
e
It works as long as I run it either from jupyter or intellij
a
Are you sure they really use Java 17 to run it?
e
Yes, it's the same docker-compose with my standalone cluster, MinIO and Jupyter, plus my IntelliJ on the host.
```
Spark context Web UI available at http://7cc13da6fbf1:4040
Spark context available as 'sc' (master = local[*], app id = local-1710266430566).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
         
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17.0.8)
```
a
Well, this is hard to beat.
e
It's just when running through `spark-submit` that it stalls.
from intellij it works:
```
minRegisteredResourcesRatio: 0.0
2024-03-11 15:52:38 INFO  GeneratorApplication:25 - Starting spark job
2024-03-11 15:52:39 INFO  StandaloneAppClient$ClientEndpoint:61 - Executor updated: app-20240311145238-0014/0 is now RUNNING
2024-03-11 15:52:39 INFO  StandaloneAppClient$ClientEndpoint:61 - Executor updated: app-20240311145238-0014/1 is now RUNNING
2024-03-11 15:52:40 INFO  SharedState:61 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir
```
from jupyter as well 🙂
this is all fake data btw
j
Found it! https://jakubpradzynski.pl/posts/example-spark-kotlin-gradle-project/ Are you using --add-opens, like mentioned here?
e
I'm using some, yes; will check the link
j
That is what I remember from helping someone before with Spark and java 17
I believe this needed to be inserted in their notebook before it worked, but I can't find that exactly
e
So this: https://github.com/Kotlin/kotlin-spark-api/wiki/Quick-Start-Guide#building-the-application-with-maven runs on my cluster, which means I need to clean my Maven deps and simplify
now I get:
```
Executor app-20240312233908-0011/82 finished with state EXITED message Command exited with code 1 exitStatus 1
```
so this is progress I guess
```
NOTE: Picked up JDK_JAVA_OPTIONS: '--add-opens=java.base/java.lang=ALL-UNNAMED, --add-opens=java.base/java.lang.invoke=ALL-UNNAMED, --add-opens=java.base/java.lang.reflect=ALL-UNNAMED, --add-opens=java.base/java.io=ALL-UNNAMED, --add-opens=java.base/java.net=ALL-UNNAMED, --add-opens=java.base/java.nio=ALL-UNNAMED, --add-opens=java.base/java.util=ALL-UNNAMED, --add-opens=java.base/java.util.concurrent=ALL-UNNAMED, --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED, --add-opens=java.base/sun.nio.ch=ALL-UNNAMED, --add-opens=java.base/sun.nio.cs=ALL-UNNAMED, --add-opens=java.base/sun.security.action=ALL-UNNAMED, --add-opens=java.base/sun.util.calendar=ALL-UNNAMED, --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED'
WARNING: Unknown module:  --add-opens=java.base/java.lang.invoke=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.lang.reflect=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.io=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.net=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.nio=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.util=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.util.concurrent=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.nio.ch=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.nio.cs=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.security.action=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.base/sun.util.calendar=ALL-UNNAMED specified to --add-opens
WARNING: Unknown module:  --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED specified to --add-opens
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3081f72c) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3081f72c
        at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
        at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:222)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:376)
        at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:210)
        at
```
FYI:
```
24/03/13 14:05:55 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
+--------------------+--------------------+--------------------+------------------+----------+------------+---------------+------------------+-------------+---------------+----------------+------------+----------+----------------------------------------------------------------------------------------------------+-----------+--------------------+-------------------+-------------------+--------------------+---------------------+--------------------+-------------------+---------------------+----------------+---------------------+
|playerUid           |idPremio            |idAposta            |estadoOrigemAposta|tipoAposta|statusAposta|motivoSuspensao|motivoCancelamento|codModalidade|competicao     |eventoModalidade|statusEvento|qtdMercado|idMercado                                                                                           |tipoMercado|quotaFixaMercado    |inicioEvento       |fimEvento          |quotaFixaTotal      |valorAposta          |valorIRPFRetido     |dataHoraAposta     |ganhoApostador       |indicadorCashOut|valorCashOut         |
+--------------------+--------------------+--------------------+------------------+----------+------------+---------------+------------------+-------------+---------------+----------------+------------+----------+----------------------------------------------------------------------------------------------------+-----------+--------------------+-------------------+-------------------+--------------------+---------------------+--------------------+-------------------+---------------------+----------------+---------------------+
|01006538959909174408|lm3YsqZ7PtETsxRjmcij|YPDtXmmlnAWw0LVQKsRy|74                |MULTIPLE  |CANCELED    |null           |EVENT_CANCELLATION|315          |UEFA           |Masters Cup     |IN_PROGRESS |200       |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|false      |5.300000000000000000|2024-03-13 11:58:20|2024-03-13 13:58:10|5.300000000000000000|46.000000000000000000|3.220000000000000000|2024-03-13 13:28:17|null                 |false           |null                 |
|03575822208816394959|ys9brAv1mikZkZ5rx1dl|F8myePRImL5j1bkSdKbA|56                |MULTIPLE  |ONGOING     |null           |null              |259          |NHL            |NHL             |POSTPONED   |706       |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|true       |1.240000000000000000|2024-03-13 11:24:11|2024-03-13 14:06:47|1.240000000000000000|76.000000000000000000|5.320000000000000000|2024-03-13 12:26:18|null                 |true            |76.000000000000000000|
|54630212552887067444|xw0Q2Vlf5ytQJpknpZo6|jSks02LyRSgRIow5jnAC|53                |SIMPLE    |NOT_AWARDED |null           |null              |958          |NBA            |NHL             |FINISHED    |null      |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|false      |5.300000000000000000|2024-03-13 11:51:56|2024-03-13 13:57:22|null                |68.000000000000000000|4.760000000000000000|2024-03-13 12:01:44|null                 |false           |null                 |
|84150253222367488864|IYnWIdndntRfPjLoegk5|pPJHSUI4pKvFPtnIpQ8O|19                |MULTIPLE  |ONGOING     |null           |null              |838          |UEFA           |Euro League Cup |LATE        |930       |eLxCSAx0crX0zaKI29DulLXakx4RA38ZyyltIHWpxXBxnWbKzmifTLBcwZaOMc9hka8WTJPMIBEypfEB0tdlZsQitjizuAYLcKvc|true       |3.450000000000000000|2024-03-13 11:20:27|2024-03-13 14:54:36|3.450000000000000000|55.000000000000000000|3.850000000000000000|2024-03-13 13:48:28|null                 |false           |null                 |
|94856581322015931314|IR7BaHNgrSpcXuF4p4vD|QvNd0UGnw5EIUdoLURnv|42                |MULTIPLE  |SUSPENDED   |OTHER          |null              |960          |Masters Cup    |NHL             |CANCELLED   |936       |J5o0FtwHulKD62mQLpz73Y
```
It's writing to the parquet that stalls it, for some reason.
j
```
Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3081f72c) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3081f72c
        at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
        at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:222)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:376)
        at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:210)
        at
```
is interesting. Here on Stack Overflow, someone solves it by adding even more --add-opens: https://stackoverflow.com/questions/73465937/apache-spark-3-3-0-breaks-on-java-17-with-cannot-access-class-sun-nio-ch-direct
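In case it helps, a hedged sketch of one way to pass that flag list along: on Java 17 the executors need the same --add-opens set as the driver, and spark.executor.extraJavaOptions is one place to put it (the driver JVM needs the flags at launch time instead, e.g. via spark-submit --driver-java-options). The module list below is simply the one already visible in the JDK_JAVA_OPTIONS line above; note it is space-separated, since the "Unknown module" warnings in that log look like the comma-joined string was parsed as a single --add-opens value.

```kotlin
// Sketch (assumed wiring, not the actual job): hand the Java 17 --add-opens flags to the
// executors through Spark conf. The flags are joined with spaces, not commas.
import org.apache.spark.sql.SparkSession

// Same module/package list that appears in the JDK_JAVA_OPTIONS line in the log above.
val addOpens = listOf(
    "java.base/java.lang", "java.base/java.lang.invoke", "java.base/java.lang.reflect",
    "java.base/java.io", "java.base/java.net", "java.base/java.nio",
    "java.base/java.util", "java.base/java.util.concurrent", "java.base/java.util.concurrent.atomic",
    "java.base/sun.nio.ch", "java.base/sun.nio.cs", "java.base/sun.security.action",
    "java.base/sun.util.calendar", "java.security.jgss/sun.security.krb5"
).joinToString(" ") { "--add-opens=$it=ALL-UNNAMED" }

fun main() {
    val spark = SparkSession.builder()
        .appName("java17-flags-sketch")
        .config("spark.executor.extraJavaOptions", addOpens)
        // The driver JVM is already running when this executes, so the driver needs the
        // same flags at launch time, e.g. spark-submit --driver-java-options "...".
        .getOrCreate()
    spark.stop()
}
```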
e
I suspect it's a reflection thing.
```kotlin
fun <T> writeToParquet(ds: Dataset<T>, parquetName: String, partitionColumnName: String): Unit =
    ds.repartition(col(partitionColumnName))
        .write()
        .mode(SaveMode.Append)
        .partitionBy(partitionColumnName)
        .parquet("$parquetOutputPath/domains/$parquetName")
```
With this function it stalls.
If I write it directly though:
```kotlin
sportDS.repartition(col("sports"))
            .write()
            .mode(SaveMode.Append)
            .partitionBy("playerUid")
            .parquet("$parquetOutputPath/domains/sports")
```
It works, as opposed to using the function:
```kotlin
writeToParquet(sports.toDS(), "sports", "playerUid")
```
again, running from intellij this works fine
both ways work fine
j
Really odd. Have you also tried different Kotlin versions? I know they changed something some time ago that touches the serializability of Kotlin functions on the JVM
e
I was using 1.9.0 and then went back to 1.8.20
actually, scratch what I said, it still stalls on writing parquet
It's just that the `show` before executes fine.
j
And kotlin 1.9.23, just for the sake of completeness?
I suspect it won't change anything, but I also don't really know what else to try anymore
e
Will open more modules, just to be sure
So the issue has to do with incompatible libraries needed to work with S3/MinIO and nothing to do with the kotlin-spark-api.
I tested on GCP Dataproc (which doesn't use these libs) and it works perfectly
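For anyone who lands here later, a small sketch of one way to sanity-check that, assuming the suspicion is a Hadoop/AWS library version mismatch rather than anything project-specific: print the Hadoop version actually on the classpath at runtime, so hadoop-aws (and its matching AWS SDK) can be pinned to the same release line.

```kotlin
// Sketch: print the Hadoop version actually on the classpath at runtime, so that
// hadoop-aws (and the matching AWS SDK) can be pinned to the same release line.
// Mixing hadoop-aws with a different hadoop-common version is a common way to break S3A.
import org.apache.hadoop.util.VersionInfo

fun main() {
    println("Hadoop on the classpath: ${VersionInfo.getVersion()}")
}
```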
a
Dude! So cool you solved it! How did you find the root cause?
🙏 1
e
blood and tears... tears mostly 😞
Basically I tried out different operations except read/write and everything worked fine, plus I enabled DEBUG level for the logs.
I could see all the relevant dirs were being listed and all tasks created but when it was time to actually read the file... nothing
What I'd like to know is how exactly Jupyter "submits" the job.
my uberjar includes something that breaks the whole hadoop-aws/aws-sdk thing
j
Jupyter might behave differently due to class loading. It's loaded first and the integration with spark is loaded into it later. That might cause the difference
👍 1
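A small diagnostic sketch along those lines, for comparing the two environments: run it both in the notebook and in the spark-submit driver to see which jar each S3-related class is actually loaded from. The class names are the usual hadoop-aws / aws-java-sdk entry points, assumed here rather than taken from the project:

```kotlin
// Sketch: report which jar (if any) a class is loaded from, to compare the Jupyter
// classpath with the shaded-jar classpath used by spark-submit.
fun whereIs(className: String) {
    val location = try {
        Class.forName(className).protectionDomain?.codeSource?.location
    } catch (e: ClassNotFoundException) {
        null
    }
    println("$className -> ${location ?: "not found (or no code source)"}")
}

fun main() {
    whereIs("org.apache.hadoop.fs.s3a.S3AFileSystem")    // hadoop-aws
    whereIs("com.amazonaws.services.s3.AmazonS3Client")  // aws-java-sdk
    whereIs("org.apache.hadoop.fs.FileSystem")           // hadoop-common
}
```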