jimn

02/21/2020, 7:49 AM
what I'm not finding after only a few minutes of googling is anyone who has undertaken benchmarks across jvm<->non-jvm dataframe tools. there appear to be some articles, but this looks like a very non-standardized area. i found microbenchmarks of pandas operations as well, but the machine-killing heap hazards are not in those benchmarks.
jimn

02/21/2020, 9:31 AM
a page about benchmarks without a benchmark hehehe nice
altavir

02/21/2020, 9:32 AM
🤷‍♂️
Usually Java libraries tend to take quite a different approach from Python, so direct comparison is hard.
jimn

02/21/2020, 9:48 AM
i'm no connoisseur of the space, but i can appreciate how pandas does one thing and keras does something else, and specific to your link, nd4j is the little engine for dl4j. comparing java libs to python definitely piles on the impedance a mile high. at this point in time i'm going to guess that kotlin enters well below java performance expectations and in the ballpark of optimal python tooling, but the ability to make a concise script for both looks like a much smaller gap
i dont lack the usecases, but i don't have any machine killer datasets yet that aren't under NDA.
the basis of my search is a small script adaptation from py to kotlin pandas or similar tool.
bjonnh

02/22/2020, 3:41 PM
what kind of data are you handling?
altavir

02/22/2020, 4:52 PM
I would ask not only about the data, but also about its structure. I've already said multiple times that if no operations are performed on the whole set and it is evaluated row by row, it seems absolutely meaningless to load it into memory, or even to use a memory-mapped file instead of streaming.
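To make the row-by-row point concrete, here is a minimal Kotlin sketch. It assumes a comma-separated text file with a numeric column; `sumColumn` and the file layout are hypothetical and not taken from anyone's code in this thread. Only one line is held in memory at a time.

```kotlin
import java.io.File

// Hypothetical helper: sums one numeric column of a comma-separated file.
// useLines exposes the file as a lazy Sequence and closes it afterwards,
// so only the current line is held in memory, never the whole file.
fun sumColumn(file: File, columnIndex: Int): Double =
    file.useLines { lines ->
        lines
            .map { it.split(',') }
            .sumOf { it[columnIndex].toDouble() }
    }
```

Calling `sumColumn(File("data.csv"), 1)` (file name made up) would stream the file once and return the total of the second column.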
bjonnh

02/22/2020, 6:46 PM
that's what I implied by kind of data. thanks for detailing
altavir

02/22/2020, 6:48 PM
The problem is that we have had this discussion with the same participants several times. Kotlin does allow implementing better streaming and "flowing" capabilities without using huge buffers, but that requires a more precise problem description.
bjonnh

02/22/2020, 7:09 PM
totally agreed
maybe we need a checklist for discussions
jimn

02/28/2020, 10:09 AM
I don't see the voice of experience in Alex's viewpoint, but despite his criticisms, i went and nailed it. there is a question of what super advanced planning strategy gets you O(n) results for groupby; perhaps my straightforward attempt could be touched up.
https://github.com/jnorthrup/columnar this does in kotlin what i need from Pandas. the choice of IO and off-heap scales up where pandas and numpy will hit swap before eventually exhausting swap
I'm 99% convinced that Alex doesn't actually read code or descriptions before he judges this space based on his own straw man attempts.
@bjonnh my data, per the readme, is ISAM from immutable mapped FWF format input, at this time both textually parsed fields and an intermediary NIO binary format writable using the NIO driver. Without kotlin, I might not have arrived at the design that efficiently maps the features to the execution. With kotlin, i have something that performs better than python and can be cleanly ported to c or c++ to make use of cleaner standard libraries and libc features.
i have also released the dataset on github as well, but maybe have some last minute touches to add the munging
altavir

02/28/2020, 11:24 AM
@jimn It is really hard for me to follow your ideas. I intend to look through your code, but I have not found time yet. What I understood from the three previous posts is that what you need is a grouping operation. Grouping does not require the whole data to be loaded simultaneously. Do you still insist that you must have access to the whole data? Maybe I still get your use case wrong? I am talking about map-reduce.
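As an illustration of grouping without loading everything, here is a minimal Kotlin sketch. The `Row` type and `sumByKey` are hypothetical names invented for this example. It folds a lazily produced sequence into per-key sums, so memory grows with the number of distinct keys rather than with the number of rows.

```kotlin
// Hypothetical row type, for illustration only.
data class Row(val key: String, val value: Double)

// Folds a lazy sequence of rows into per-key sums in a single pass.
// Each row can be discarded as soon as it has been added to its total.
fun sumByKey(rows: Sequence<Row>): Map<String, Double> {
    val totals = HashMap<String, Double>()
    for (row in rows) {
        totals[row.key] = (totals[row.key] ?: 0.0) + row.value
    }
    return totals
}
```

This only covers a plain group-and-sum. It does not address a pivot step that needs to see the whole dataset.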
jimn

02/28/2020, 11:34 AM
the basic idea is to take n rows of x columns and assign n/m rows of f' columns which could be a mapreduce operation, but for lstm time series you also wind up with a full pivot operation to make it x rows of n columns first. so now you may run into cascading loops of mapreduce or cartesians depending on how you do it. for 17000 columns apache arrow is untested, and i broke it and submitted the repeatable issue tracking details. i went forward with my thing at that point and didn't return. if this is ISAM, then you can do random access seek, discard, and have immutable data giving idempotent function results. FWF is an ISAM compatible layout, and if you don't have ISAM, pandas's CSV reader is pure c++ anyways, you can do that simple IO, but i digress, i wrote a jdbc streamer to stdout for fwf with meta on stderr.
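The random-access property of fixed-width (FWF) records mentioned above can be sketched roughly as follows. This is not the code from the columnar repository. `FixedWidthTable`, its parameters, and the ASCII assumption are all hypothetical. The point is only that a cell's byte offset is computed directly as row * recordLength + fieldStart, so no other row has to be read.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical layout: every record is recordLength bytes long (line terminator included),
// and every field is an inclusive byte range inside the record.
class FixedWidthTable(
    private val data: ByteBuffer,
    private val recordLength: Int,
    private val fields: List<IntRange>
) {
    val rowCount: Int get() = data.limit() / recordLength

    // Reads one cell through absolute gets at a computed offset; no scan, no parsing of other rows.
    // Int offsets limit this sketch to buffers under 2 GiB.
    fun cell(row: Int, column: Int): String {
        val range = fields[column]
        val bytes = ByteArray(range.last - range.first + 1)
        val start = row * recordLength + range.first
        for (i in bytes.indices) bytes[i] = data.get(start + i)
        return String(bytes, StandardCharsets.US_ASCII).trim()
    }
}
```

Because the buffer is only read, repeated calls with the same coordinates return the same value, which is the idempotence described in the message.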
altavir

02/28/2020, 11:38 AM
OK, now I finally understand what you are talking about. And some of the things you've said before are making sense.
jimn

02/28/2020, 11:40 AM
woops, early return key. mapreduce is not inherently a unary operator that gives you composable functions. it's a big stack of code that drops out some results as arbitrarily as it pleases, but yes, embarrassingly parallel. except for re-reduce passes. pandas does give you that, so perhaps it's a false blanket to have a dsel or at least a dataframe to arrive back at for more of the same interface. unary operators seem like a good thing for human comprehension. pandas excels at the conceptual models, but for the dreaded python overheads..
altavir

02/28/2020, 11:42 AM
I still think there should be better ways than to read the whole data into memory, but at least we won't be talking about completely different things. Sorry, but you tend to drop super-long texts, which are impossible to understand without some background.
jimn

02/28/2020, 11:43 AM
my earliest attempts were indeed to tackle coroutines as flows and then i swiftly hit heap limits and went with mmap, flow, and hit heap limits a bit later, but still too soon. for my third iteration i built the internals of a sql engine without a sql parser. it iterates rows, and cursors on x,y,z driver coordinates (fwf is y,x access) and gets the groupby job done. when i wanted to utilize coroutineContextElements as the formative glue to create maximum extensibility, i noticed the threadlocal protection was like 3/4 of my wall time.
now i have lean single threaded minimalist cursor iterators.
at this point I know there are jvm hard stop barriers that cannot be sidestepped in the areas of kernel operations and thread affinity. to me this guarantees that any threadlocal abstractions are going to be spinning the barrel on russian roulette every single time suspend "resumes" into a thread.
so my current adequate build using mmap is still suffering some jvm related biases, like a threadlocal check (removed, now it's a singleton) of the MAXINT window mapped on a file, but the unary operator features and the composability of the dsel as well as the coroutineContextElements is to my liking. except coroutineContext is now not a suspension access, it's just a bag
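For readers unfamiliar with the MAXINT window: a single `FileChannel.map` call cannot map more than `Integer.MAX_VALUE` bytes, so larger files have to be covered by several mappings. A minimal sketch of that idea follows. `mapWindows` and `byteAt` are hypothetical helpers, not the repository's driver, and `windowSize` is a parameter only so the sketch can be tried on small files.

```kotlin
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Maps a file read-only as consecutive windows of at most windowSize bytes each.
// The mappings stay valid after the channel is closed.
fun mapWindows(path: Path, windowSize: Long = Int.MAX_VALUE.toLong()): List<MappedByteBuffer> {
    require(windowSize in 1L..Int.MAX_VALUE.toLong()) { "windowSize must fit in a positive Int" }
    return FileChannel.open(path, StandardOpenOption.READ).use { channel ->
        val size = channel.size()
        val windows = ArrayList<MappedByteBuffer>()
        var offset = 0L
        while (offset < size) {
            val length = minOf(windowSize, size - offset)
            windows.add(channel.map(FileChannel.MapMode.READ_ONLY, offset, length))
            offset += length
        }
        windows
    }
}

// Reads one byte at an absolute file position by picking the window that contains it.
fun byteAt(windows: List<MappedByteBuffer>, position: Long, windowSize: Long = Int.MAX_VALUE.toLong()): Byte =
    windows[(position / windowSize).toInt()].get((position % windowSize).toInt())
```

The mapped pages live outside the Java heap and are managed by the OS page cache, which fits the later remark that the heap stays safe while the OS manages the mmap handles.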
the usecases are io bound, even if we're talking about kernel disk buffers, the mmap is an io handle ultimately. but, heap is safe in the process and OS management of those mmap handles. i know that when i do a sort using gnusort, it's a million times faster. but kotlin still spanks python.
it is likely that i can add a fileio driver and it would outperform mmap in some access patterns, but the unary operator interfaces are clean enough that it makes sense to try a c++ port using namespace aliases and cpp macros to emulate the typealiased pairs i employ
my current runtime uses less than 256k of RAM regardless of the input dimensions.
the slowdown comes from projection overheads of pivoting then groupby done without heap. it's clean, but there are certainly some datasets small enough to benefit from buffer windows to hold known repeats. O(n^2) groupby is not typically reading any row twice, but it is doing an elevator from the first row to the last row for each cluster
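One way to read the "elevator" description is as one full scan per cluster, which trades time for memory. The sketch below is an interpretation, not the actual implementation. `sumForKey`, `sumPerKey`, and the accessor lambdas are hypothetical stand-ins for cursor reads.

```kotlin
// Sums the values of one cluster with a full first-to-last scan.
// Nothing is retained between rows except the running total.
fun sumForKey(rowCount: Int, key: Int, keyAt: (Int) -> Int, valueAt: (Int) -> Double): Double {
    var total = 0.0
    for (row in 0 until rowCount) {
        if (keyAt(row) == key) total += valueAt(row)
    }
    return total
}

// One scan per cluster: clusters * rows reads in total, which approaches
// quadratic cost as the number of clusters grows with the number of rows.
fun sumPerKey(rowCount: Int, keys: IntArray, keyAt: (Int) -> Int, valueAt: (Int) -> Double): DoubleArray =
    DoubleArray(keys.size) { i -> sumForKey(rowCount, keys[i], keyAt, valueAt) }
```

A hash-based single pass would be linear in rows but needs heap proportional to the number of clusters, which is the trade-off under discussion.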
i have considered making a zoned IO and doing re-reduce logic to combine these. the problem is that the pivots span the entire dataset, so it's potentially better to do a full tablescan pivot and break up the large intermediates into groups before a multistage groupby, which is not far from mapreduce
using a blackboard ideology for defining the IO access patterns and composing the IO drivers from traits, using kotlin, is also a nightmare. orthogonal maybe, but less overlap than i anticipated
my #2 slowdown in the code was having lambdas. these don't inline. when i inlined every single little thing the performance improved remarkably.
i'm sure that my inlining is overboard at this point, but no matter how far i abuse it i don't see performance degradation yet.
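For context on the inlining remark, here is a small Kotlin sketch of the difference. The function names are made up. A lambda passed to an ordinary higher-order function is compiled to a function object on the JVM, and a `(Double) -> Double` call through it typically boxes its argument and result. Marking the function `inline` copies the loop and the lambda body into the call site instead.

```kotlin
// Ordinary higher-order function: transform is a function object,
// and every element goes through an interface call.
fun sumWith(values: DoubleArray, transform: (Double) -> Double): Double {
    var total = 0.0
    for (v in values) total += transform(v)
    return total
}

// Inline variant: the compiler substitutes this body and the lambda at each call site,
// so the loop works on primitive doubles with no function object involved.
inline fun sumWithInlined(values: DoubleArray, transform: (Double) -> Double): Double {
    var total = 0.0
    for (v in values) total += transform(v)
    return total
}
```

Both return the same result. Any speed difference depends on the JIT and the workload, so it is worth measuring rather than assuming.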
my groupby itself is actually maybe 5 seconds against 2.8 million rows. but REIFYING one row can take up to a minute.
2.8 million source rows for 3 years timeseries by day pivots to 1100 rows with 9800 columns in my machine killer dataset. so average groupby is summing 9800 columns on 2.8million/9800 rows, to reify each of the 1100 final rows. flow capture overhead was fatal for these kinds of numbers