# datascience
t
My latest article questioning the supremacy of Python in Machine Learning, followed by the state of the ecosystem in Kotlin and an example in Spark. https://www.thomaslegrand.tech/kotlin/2019/04/07/machine-learning-kotlin.html
👍 3
a
t
Thank you 🙂
a
It was kind of a funny coincidence: I read your message at the same time I started writing my response.
t
haha I guess. Life sometimes 🤷‍♂️
j
Python is a language for quickly hacking something together and doing what you want without hassle. There is almost no wait time to run some code, whereas with Kotlin you have to deal with a build system (Gradle etc.) and wait for compilation and a build every time you change some code. Python's advantage here matters even more when you are not sure how, let's say, a 3rd-party library function works (or even how some Python code works): you can quickly test it in the Python console, with no compiling or building. Creating a project, setting up a virtual environment and installing requirements are also easier than in Java or Kotlin. So Python is really good for small and medium projects. On the other hand, the main problem could be dynamic typing. Python 3 has type hinting, but it does not solve the problem completely. You can type hint your own code, but when reading library documentation it is still difficult to understand what types a function expects and returns, which forces you to dive deep into the library code. Another difference comes into play in how all the code and libraries work together. numpy, pandas, matplotlib (or seaborn), sklearn… all work together in harmony, just as they should. You can convert one library's data type (pandas.DataFrame) to another library's data type (numpy.ndarray), very cool. In the Java ecosystem there are libraries that are alternatives to the numpy data types, for reading CSV, for plotting graphics etc. But do they all work together as a single piece, just like in Python? Are they as mature as the Python libraries? Those are big concerns. So when it comes to a huge project, the Java ecosystem vs. Python can be discussed, but for a small-to-medium project I don't think it should be. 🙂
a
You can run kotlin in REPL/notebook mode the same way as you run python. There is a built-in Kotlin REPL (it is not very convenient, but it works), and you can run kotlin in a notebook, for example with beakerx. Also, I do not think that "test in a console if you do not know what it does" is a good argument. In a statically typed language you never need to do that, because you can rely on types and documentation. In the worst case, you can fall back to the sources. In python most sources are just wrappers for C/Fortran, so that does not help.
j
“test in a console if you do not know what it does”: this argument is not about the type system. It is about learning or using something new, when you can be unsure how it works: not the data type, but the functionality. The same approach can be used in the java ecosystem. For example, when you are a novice learning RxJava and you don’t know how the RxJava operators or hot and cold observables behave, you open IntelliJ and create a small project to quickly check whether the code behaves as you expect. But the Python console makes this kind of test really easy (no need to create a separate project, just open the console, import the library, try your code).
a
You can't understand how a function works just by running it in the console. You can only understand what input and output it has. I am not saying that python is useless, I am saying that the console is the wrong argument.
Python does have simplified data access and visualization tools out of the box. It is convenient when you need quick and dirty analysis and reproducible plots. But serious development - nah. Julia looks promising, but has a lot of problems of its own.
t
@Jemshit Iskenderov for sure, Python is convenient for prototyping and good enough for a lot of small projects. My post was mainly dealing with production code, where I'm really annoyed with Python sometimes. I heavily use type hints in Python 3; they have the merit of existing, but they are completely useless at runtime. You couldn't be more right about library interoperability in Python, and I've seen the painful reality of the current ML ecosystem on the JVM. We would gain from agreeing on common interfaces. Disclosure: I'm not preaching that anyone should fully move to Kotlin for Machine Learning, but it was pointed out to me that more and more people complain about Python in production, and that if you really want to, the JVM is not that bad a choice for ML.
👍 1
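To make the remark about type hints being useless at runtime concrete, here is a minimal Python sketch. The function `mean` and the sample values are made up for the illustration and do not come from the article; the point is only that the interpreter stores annotations without enforcing them.

```python
from typing import List


def mean(values: List[float]) -> float:
    """Average of a sequence; the annotations document intent only."""
    return sum(values) / len(values)


# The hints are stored as metadata on the function object...
hints = mean.__annotations__

# ...but the interpreter never checks them: a tuple of ints is accepted.
result = mean((1, 2, 3))

# A wrong argument only fails once the body actually touches it.
try:
    mean("abc")
    failed = False
except TypeError:
    failed = True
```

An external checker such as mypy would flag both calls before the code runs, but that is a separate tool, not something the runtime does.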
I would like to see kotlin m/p overtake the jvm in terms of language and in places where pandas does a better job of being succinct than kotlin provides, a DSL delta that bears discussion here. Arrow-kt seems like it has some vision of expressive power, but they also seem inured to jvm indefinitely.
a
In my opinion, Arrow is an evolutionary dead end for Kotlin, especially if we are talking about high performance computing. Haskell could be pretty good, but the Haskell compiler is highly optimized for its style of work. That is not possible in other cases. Also, you do not need Arrow to write kotlin in a functional style. As for Pandas replacements, there are several already. There is a pretty good table implementation in my old DataForge version. I currently do not have time to migrate it and I do not have appropriate feature requests. If there is a general request for that, it would really help if someone drew up a list of requirements and looked at current implementations like tablesaw for pros/cons.
j
re: py->kotlin ... i tried my hand at this once... https://sourceforge.net/projects/snakeskin/ i love reviewing this every few years to see how things have advanced since java 1.2
this thread is a singular source of fresh air for me having a jvm/kotlin distributed computing background and a begrudging 20 year visitor pass in the python world for cleaning up messes. im still processing. i haven't had time to step out of jupyter and examine what i can get done in kotlin and intellij, I'm a circus bear with keras for my present client just trying to play catchup myself
i think that operator overloading should be turned up to 11 in kotlin, and infix should be likewise taken to reducto+absurdome-1
then we'll be able to blow past pandas conventions.
a
We are working on just that. 🙂 And I believe that a few teams from JB are doing that as well. We will probably discuss it at KotlinConf
j
"We are working just on that" where can i see?
i believe pandas is just another PHP success, something worked, and the neophytes had no sense of code-smells to look elsewhere, so critical mass settled in. there is an apology from the author to this effect.
same as with nodejs
a
https://github.com/mipt-npm/kmath. The distributed parts are in different repositories and are still difficult to grasp.
j
hazelcast makes distributed java cake. that said, it makes kotlin into a steaming pile of workarounds
kind of like hibernate and serialization, all over again
i am considering authoring a kotlin-first multicast DHT which is 99% of the value. among a hundred other things that yield a few minutes of slack per month, let alone code.
@altavir opening up operator overloading and syntax freedoms in kotlin is very intriguing, and to my knowledge a very dead cat. are you saying that you're working out how to increase the latitude of the AST to become more transparent with other expression conventions via overloading/infix, etc.?
my experience with boost::spirit has given me some kind of unrealistic bias toward programming language syntax, where c++ can represent entirely new (pseudo) EBNF with compile-time metaprogramming. neither c++ nor kotlin really approximate a compile-time AST change, but kotlin's still incomplete and has potential yet.
a
@jimn I am not sure why you are so concerned about operator overloading. In Kmath we solve most of the problems by using context encapsulation, so most operations are available only in specific lexical scopes. I think that the same approach could be used in table manipulation. See my article on the matter: https://proandroiddev.com/diving-deeper-into-context-oriented-programming-in-kotlin-3ecb4ec38814
j
@altavir i see no reason why Kotlin can't be extended along the basis of minimizing kolmogorov complexity, in order to perform e.g. datascience with less programmer translation, fewer constraints on representation transparency, and to enable a language that emits object code within a compiler capable of bootstrapping dialects and extensions of grammar to focus less on turing complete and more toward shannon's limit and minimize comprehension costs.
it may be that the momentum and politics of Kotlin will keep it in that lower right quadrant of wikipedia operator overloading forever, in which case, a new AST might be better for llvm and borrowing from the kotlin lib where it makes sense, and borrowing from other native resources as it makes sense.
a
I am still not sure what you need. Could you write code or pseudocode of what you want?
Operator overloading is needed very rarely in mathematical operations.
j
the right language flexibility should be able to operate like graph indirection to the degree where multiple languages coexist in the same compilation context such as java with c++ with kotlin, described by something like an EBNF jit with a libc. I'm not familiar with haskell but my understanding is that it takes syntax overloading pretty far.
if you want to talk about comparing apples to apples with pandas and kotlin code, python has a ton of positional and index operators that kotlin goes out of the way to prevent
a
Any syntax overloading kills tooling almost immediately. And I still do not see what you want to achieve.
Could you give a pandas example, so I can rewrite it in kotlin and we can compare?
j
a[:-1] is not impossible in kotlin, but it doesn't come to me off the top of my head
then there's a[:-1,["x","y"]]
we're still barely scratching the surface. pandas breaks python 3.5 type hints completely, that's a win for kotlin code.
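For readers following the a[:-1] and a[:-1,["x","y"]] examples, the sketch below shows how Python hands such expressions to `__getitem__`: the first arrives as a `slice`, the second as a tuple of a `slice` and a list, which the class then has to interpret at runtime. `Table` is a hypothetical toy class written for this illustration; it is not pandas and does not reproduce pandas' actual indexing rules.

```python
class Table:
    """Hypothetical toy table: a list of row dicts (not pandas)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __getitem__(self, key):
        # a[:-1] passes a slice; a[:-1, ["x", "y"]] passes (slice, list).
        if isinstance(key, tuple):
            row_sel, cols = key
        else:
            row_sel, cols = key, None
        if not isinstance(row_sel, slice):
            raise TypeError("this sketch only supports slice row selectors")
        selected = self.rows[row_sel]
        if cols is not None:
            # Keep only the requested columns in each selected row.
            selected = [{c: row[c] for c in cols} for row in selected]
        return Table(selected)


a = Table([
    {"x": 1, "y": 2, "z": 3},
    {"x": 4, "y": 5, "z": 6},
    {"x": 7, "y": 8, "z": 9},
])
all_but_last = a[:-1]
subset = a[:-1, ["x", "y"]]
```

The brevity comes from inspecting the key's type at runtime, so nothing in the signature tells a reader or a tool which key shapes are valid.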
a
It is actually an example of not-so-good usage of python, since it forgoes all type safety and uses dynamic features to parse the query. I can do overloads in kotlin that will work like a[null,-1]. But I won't, since it is much better to do it like a.select{ toIndex = -1}
j
destructuring operations and tuples are hand-annotated declarations at present. you can overload get(arity 1) but you don't get to redefine get with flexible arity, or the operator choice of symbols chosen by the kotlin spec
how does kotlin code even approximate eigen vector and operator notation? I'm not a math pro but I really don't like the idea of learning the equation grammar and then learning the adaptations in kotlin or any other imperative language separately
this is a high kolmogorov complexity, two languages for one intent
pandas is not an empirical study of efficient representation, to be sure. it just happened to enable the c++ native library wrapper and was a useful tool for hitting a critical mass
and iiuc it has some harsh deprecations in its history as well, as the inefficiencies were addressed
a
Well, as I already said, you can't get both language morphing and safety simultaneously. Kotlin is really expressive in terms of creating declarative builders for anything, but it is limited by its static nature and tooling support. Anyway, I would start by writing down what types of expressions you need.
j
im just barely getting my legs about me in tensorflow code to start to detect when sample code really is a useless dead end to begin with. it involves rote memorization of pandas code upon toy datasets, which are impossible at the scale of real-world value when you have 20000+ features (columns) times millions of rows in python in-ram
a
You are starting at the wrong end. You are trying to understand which code is good. You should start by writing down what mathematics you need and how you expect to use it. Write use cases.
j
take a csv file, and then instead of importing it into sqlite and permuting and transforming it with sql, group-bys, and whatnot, you just load up everything in RAM and start nailing the python GIL as hard as you can to do the same thing. then you have something you can start to ... translate... into a TF "model" that expects numpy bindings
when you have 9 gigs of CSV source data, you may find time series features in the hundreds of gigs in RAM
that's my experience.
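As a point of comparison for the load-everything-in-RAM workflow described above, here is a minimal sketch of a group-by mean that reads a CSV one row at a time and keeps only running totals per group. This is a swapped-in alternative using only the Python standard library, not something anyone in the thread says they used; the function name `grouped_mean`, the column names and the sample data are all made up for the illustration.

```python
import csv
import io
from collections import defaultdict


def grouped_mean(lines, key_col, value_col):
    """Mean of value_col per key_col, reading one row at a time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        sums[row[key_col]] += float(row[value_col])
        counts[row[key_col]] += 1
    # Memory use grows with the number of groups, not the number of rows.
    return {key: sums[key] / counts[key] for key in sums}


# Tiny in-memory stand-in for a large CSV file (hypothetical data).
sample = io.StringIO("sensor,value\na,1.0\nb,4.0\na,3.0\n")
means = grouped_mean(sample, "sensor", "value")
```

With a real file, an open file handle can be passed in place of the `StringIO` object, since `csv.DictReader` accepts any iterable of lines.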
a
This has nothing to do with syntax. And this is something we can easily manage with kotlin
j
while i understand the capabilities of kotlin, there's a gap, and an impossibly large herd of data science disciples who quite simply couldn't give a rat's ass about code or readability; they are the ones who tend to publish and market the tutorials of cut and paste snippets to every nuance of variation.
in their estimation, jupyter and pandas are the adequate approximation of something better than matlab and R for getting results out the door +- into GPU DL frameworks
a
You can use Kotlin with jupyter. As for Python users, it is not our aim to replace Python with Kotlin at the moment. This is just not possible. The aim is to create comfortable tools in kotlin and present them to people sick and tired of Python.
j
my personal recurring usecases are time series forecasts. the unanimous public opinion is that this is well handled by LSTM with some data features blown out from source data. the plan of attack, is referring to the above mentioned samples for an understanding of the DL target execution, and for the moment minimizing the python coding bottleneck which boils down to less than perfect docs, good, and complete, but vastly under-representing the potential applications, and therefore examples. likewise the propeller-heads that are comfortable may omit shorthand in their long time travels which get them to point B for a problem faster. the hyper-parameter tuning space is something that has no "math expression", convergence is an arbitrary result still subject to hand-tuning and experimentation
all roads lead to numpy, and typically thru pandas, and pandas recipes, though occasionally, due to python's miserable performance, the experts will lay out a very solution-specific tweak for a dataset that may not apply to anyone else's problems, like how to solve parallelism inadequacies typically.
you can say it's an expression based space, and for something like modelica simulations, it absolutely is, but DL is less math and more tooling than dedicated math libs might address
the usecases of how to process 250 million csv rows with transforms, groupby, normalization, these are python's big logjam
my preference would be to focus on an operator based grammar well ahead of compromising with a catalog of functions and classes to support data operations. the closer to a syntax that looks like latex output, the more succinct and robust it will be in the eyes of tool jocks like myself
@altavir if you want to work backwards from a timeseries LSTM Aparapi/GPU accelerated example in all-kotlin, you would find plentiful pandas analogs, datasets, kaggle solutions, and linear regression and markov hidden layer datasets for which the only small consolation i know of is: keras is slightly less intense than tensorflow.
this would be a huge marketshare draw to a kotlin ecosystem.