My latest article about questioning the supremacy of Python kotlinlang #datascience

Thomas Legrand

04/09/2019, 11:57 AM

My latest article questions the supremacy of Python in Machine Learning, then looks at the state of the ecosystem in Kotlin and walks through an example in Spark. https://www.thomaslegrand.tech/kotlin/2019/04/07/machine-learning-kotlin.html

👍 3

altavir

04/09/2019, 1:14 PM

Just used the link to answer a question on the forum: https://discuss.kotlinlang.org/t/source-code-converter-from-python-to-kotlin/12270

Thomas Legrand

04/09/2019, 1:23 PM

Thank you 🙂

altavir

04/09/2019, 1:24 PM

It was kind of a funny coincidence: I read your message at the same time I started writing the response.

Thomas Legrand

04/09/2019, 1:26 PM

haha I guess. Life sometimes 🤷‍♂️

Jemshit Iskenderov

04/11/2019, 2:08 PM

Python is a language for quickly hacking something together and doing what you want without hassle. There is almost no wait time to run code, whereas with Kotlin you have to deal with a build system (Gradle etc.) and wait for compilation and a build every time you change some code. Python's advantage here matters even more when you are not sure how, let's say, a 3rd-party library function works (or even how some Python code works): you can quickly test it in the Python console, with no compiling or building. Creating a project, setting up a virtual environment and installing requirements are easier (better) than in Java or Kotlin. So Python is really good for small and medium projects. On the other hand, the main problem could be dynamic typing. Python 3 has type hinting, but it does not solve the problem completely. You can type hint your own code, but when reading library documentation it is still difficult to understand what types a function expects and returns, which forces you to dive deep into the library code. Another difference is how all the code and libraries work together. There are numpy, pandas, matplotlib (or seaborn), sklearn… libraries that all work together in harmony, just as they should. You can convert one library's data type (pandas.DataFrame) to another library's data type (numpy.ndarray), very cool. In the Java ecosystem there are libraries that are alternatives to the numpy data types, for reading CSV, for plotting graphics etc. But do they all work together as a single piece, just like in Python? Are they as mature as the Python libraries? Those are big concerns. So when it comes to a huge project, the Java ecosystem vs Python can be discussed, but for a small-to-medium project I don't think it should be. 🙂

altavir

04/11/2019, 2:20 PM

You can run Kotlin in REPL/notebook mode the same way you run Python. There is a built-in Kotlin REPL (it is not very convenient, but it works), and you can run Kotlin in a notebook, for example with BeakerX. Also, I do not think that "test in a console if you do not know what it does" is a good argument. In a statically typed language you never need to do it, because you can rely on types and documentation for that. In the worst-case scenario, you can fall back to the sources. In Python most sources are just wrappers for C/Fortran, so that does not help.

Jemshit Iskenderov

04/11/2019, 2:25 PM

“test in a console if you do not know what it does”: this argument is not about the type system. It is about learning or using something new, when you can be unsure how it works; not the data types, but the functionality. The same approach can be used in the Java ecosystem. For example, when you are a novice learning RxJava and don't know how RxJava operators or hot and cold observables behave, you open IntelliJ and create a small project to quickly check whether the code behaves as you expect. But the Python console makes this kind of test really easy (no need to create a separate project; just open the console, import the library, and try your code).

altavir

04/11/2019, 2:26 PM

You can't understand how a function works just by running it in a console. You can only understand what input and output it has. I am not saying that Python is useless, I am saying that the console is the wrong argument.

altavir

04/11/2019, 2:28 PM

Python does have simplified data access and visualization tools out of the box. It is convenient when you need quick and dirty analysis and reproducible plots. But serious development - nah. Julia looks promising, but has a lot of problems of its own.

Thomas Legrand

04/11/2019, 2:46 PM

@Jemshit Iskenderov for sure, Python is convenient for prototyping and good enough for a lot of small projects. My post was mainly dealing with production code, where I'm really annoyed with Python sometimes. I heavily use types in Python 3; they have the merit of existing, but they are completely useless at runtime. You couldn't be more right about library interoperability in Python, and I've seen the painful reality in the current ML ecosystem on the JVM. We would gain from agreeing on common interfaces. Disclosure: I don't preach for anyone to fully move to Kotlin for Machine Learning, but it was pointed out to me that more and more people complain about Python in production, and that if you really want to, the JVM is not that bad a choice for ML.

👍 1

jimn

09/20/2019, 7:20 AM

https://kotlinlang.slack.com/archives/C5UPMM0A0/p1568715873032800 my own tirade

jimn

09/20/2019, 7:25 AM

I would like to see kotlin m/p overtake the jvm in terms of language and in places where pandas does a better job of being succinct than kotlin provides, a DSL delta that bears discussion here. Arrow-kt seems like it has some vision of expressive power, but they also seem inured to jvm indefinitely.

altavir

09/20/2019, 7:32 AM

In my opinion, Arrow is an evolutionary dead end for Kotlin, especially if we are talking about high performance computing. Haskell could be pretty good, but the Haskell compiler is highly optimized for its style of work. That is not possible in other cases. Also, you do not need Arrow to write Kotlin in a functional style. As for Pandas replacements, there are several already. There is a pretty good table implementation in my old DataForge version. I currently do not have time to migrate it, and I do not have appropriate feature requests. If there is a general request for that, it would really help if someone drew up a list of requirements and looked at current implementations like tablesaw for pros/cons.

jimn

09/20/2019, 7:36 AM

re: py->kotlin ... i tried my hand at this once... https://sourceforge.net/projects/snakeskin/ i love reviewing this every few years to see how things have advanced since java 1.2

jimn

09/20/2019, 7:42 AM

this thread is a singular source of fresh air for me, having a jvm/kotlin distributed computing background and a begrudging 20 year visitor pass in the python world for cleaning up messes. i'm still processing. i haven't had time to step out of jupyter and examine what i can get done in kotlin and intellij, i'm a circus bear with keras for my present client just trying to play catch-up myself

jimn

09/20/2019, 7:43 AM

i think that operator overloading should be turned up to 11 in kotlin, and infix should be likewise taken to reducto+absurdome-1

jimn

09/20/2019, 7:43 AM

then we'll be able to blow past pandas conventions.

altavir

09/20/2019, 7:43 AM

We are working on just that. 🙂 And I believe that a few teams from JB are doing that as well. We will probably discuss it at KotlinConf

jimn

09/20/2019, 7:45 AM

"We are working just on that" where can i see?

jimn

09/20/2019, 7:47 AM

i believe pandas is just another PHP success: something worked, and the neophytes had no sense of code-smells to look elsewhere, so critical mass settled in. there is an apology from the author to this effect.

jimn

09/20/2019, 7:47 AM

same as with nodejs

altavir

09/20/2019, 8:16 AM

https://github.com/mipt-npm/kmath. The distributed things are in different repositories and are still difficult to grasp.

jimn

09/20/2019, 8:18 AM

hazelcast makes distributed java cake. that said, it makes kotlin into a steaming pile of workarounds

jimn

09/20/2019, 8:18 AM

kind of like hibernate and serialization, all over again

jimn

09/20/2019, 8:20 AM

i am considering authoring a kotlin-first multicast DHT which is 99% of the value. among a hundred other things that yield a few minutes of slack per month, let alone code.

jimn

09/20/2019, 8:32 AM

@altavir opening up operator overloading and syntax freedoms in kotlin is very intriguing, and to my knowledge a very dead cat. are you saying that you're working out how to increase the latitude of the AST to become more transparent with other expression conventions via overloading/infix, etc.?

jimn

09/20/2019, 8:38 AM

my experience with boost::spirit has given me some kind of unrealistic bias toward programming language syntax, where c++ can represent entirely new (pseudo) EBNF with compile-time metaprogramming. neither c++ nor kotlin really approximate a compile-time AST change, but kotlin's still incomplete and has potential yet.

altavir

09/20/2019, 10:58 AM

@jimn I am not sure why you are so concerned about operator overloading. In Kmath we solve most of the problems by using context encapsulation, so most operations are available only in specific lexical scopes. I think that the same approach could be used in table manipulation. You can see my articles on the matter: https://proandroiddev.com/diving-deeper-into-context-oriented-programming-in-kotlin-3ecb4ec38814
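As a rough illustration of the "context encapsulation" idea mentioned above, here is a minimal sketch in which arithmetic operators exist only while an algebraic context is in scope. The names `Ring`, `Complex`, `ComplexRing` and `sumOfSquares` are hypothetical and chosen for this example; they are not the KMath API.

```kotlin
// Hypothetical names throughout; this is a sketch, not the KMath API.

// An algebraic context: it defines what "+" and "*" mean for values of type T.
interface Ring<T> {
    val zero: T
    fun add(a: T, b: T): T
    fun multiply(a: T, b: T): T

    // Member extensions: these operators only resolve while a Ring<T> is an
    // implicit receiver, for example inside with(ring) { ... }.
    operator fun T.plus(other: T): T = add(this, other)
    operator fun T.times(other: T): T = multiply(this, other)
}

// A plain data class with no arithmetic of its own.
data class Complex(val re: Double, val im: Double)

object ComplexRing : Ring<Complex> {
    override val zero = Complex(0.0, 0.0)
    override fun add(a: Complex, b: Complex) = Complex(a.re + b.re, a.im + b.im)
    override fun multiply(a: Complex, b: Complex) =
        Complex(a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re)
}

// Generic code is written once against the context, not a concrete type.
fun <T> Ring<T>.sumOfSquares(values: List<T>): T =
    values.fold(zero) { acc, v -> acc + v * v }

fun main() {
    val i = Complex(0.0, 1.0)
    val one = Complex(1.0, 0.0)

    // Inside the scope, operator syntax is available for Complex.
    val result = with(ComplexRing) { i * i + one }
    println(result)

    // Outside the scope, `i * i` would not compile: Complex declares no operators.
    println(ComplexRing.sumOfSquares(listOf(one, i)))
}
```

The design choice is that the data type stays a dumb container and the meaning of each operation lives in the scope, so a table-manipulation context could expose its own operations in the same way without touching the global namespace.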

jimn

09/20/2019, 5:38 PM

@altavir i see no reason why Kotlin can't be extended along the basis of minimizing Kolmogorov complexity, in order to perform e.g. data science with less programmer translation and fewer constraints on representation transparency, and to enable a language that emits object code within a compiler capable of bootstrapping dialects and extensions of grammar, to focus less on Turing completeness and more on Shannon's limit, and to minimize comprehension costs.

jimn

09/20/2019, 5:40 PM

it may be that the momentum and politics of Kotlin will keep it in that lower right quadrant of wikipedia operator overloading forever, in which case, a new AST might be better for llvm and borrowing from the kotlin lib where it makes sense, and borrowing from other native resources as it makes sense.

altavir

09/20/2019, 5:43 PM

I am still not sure what you need. Could you write code or pseudocode of what you want?

altavir

09/20/2019, 5:44 PM

Operator overloading is needed very rarely in mathematical operations.

jimn

09/20/2019, 5:44 PM

the right language flexibility should be able to operate like graph indirection to the degree where multiple languages coexist in the same compilation context, such as java with c++ with kotlin, described by something like an EBNF JIT with a libc. I'm not familiar with haskell but my understanding is that it takes syntax overloading pretty far.

jimn

09/20/2019, 5:45 PM

if you want to talk about comparing apples to apples with pandas and kotlin code, python has a ton of positional and index operators that kotlin goes out of the way to prevent

altavir

09/20/2019, 5:45 PM

Any syntax overloading kills tooling almost immediately. And I still do not see what you want to achieve.

altavir

09/20/2019, 5:46 PM

Could you give a pandas example, so I can rewrite it in Kotlin and we can compare?

jimn

09/20/2019, 5:47 PM

a[:-1] is not impossible in kotlin, but it doesn't come to me off the top of my head

jimn

09/20/2019, 5:47 PM

then there's a[:-1,["x","y"]]

jimn

09/20/2019, 5:48 PM

we're still barely scratching the surface. pandas breaks python 3.5 type hints completely, that's a win for kotlin code.

altavir

09/20/2019, 5:50 PM

It is actually an example of not-so-good usage of Python, since it forgoes all type safety and uses dynamic features to parse the query. I can do overloads in Kotlin that will work like a[null,-1]. But I won't do it, since it is much better to do it like a.select{ toIndex = -1}
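To make the two forms mentioned above concrete, here is a minimal sketch of how both could be supported on one type. `Column` and `Selection` are hypothetical names invented for this example, and the negative-index handling simply imitates Python slicing; it is not taken from any existing Kotlin table library.

```kotlin
// Hypothetical Column type, for illustration only; not an existing library API.

// Mutable range description filled in by the select { ... } builder.
class Selection(var fromIndex: Int = 0, var toIndex: Int? = null)

class Column<T>(private val values: List<T>) {

    // Positional form: a[null, -1] behaves like Python's a[:-1].
    // null means "open end"; negative indices count from the end.
    operator fun get(from: Int?, to: Int?): List<T> {
        val start = resolve(from ?: 0)
        val end = resolve(to ?: values.size)
        return if (start < end) values.subList(start, end) else emptyList()
    }

    // Named form: a.select { toIndex = -1 } states which bound is being set.
    fun select(block: Selection.() -> Unit): List<T> {
        val selection = Selection().apply(block)
        return this[selection.fromIndex, selection.toIndex]
    }

    // Translate a possibly negative index and clamp it to the valid range.
    private fun resolve(index: Int): Int =
        (if (index < 0) values.size + index else index).coerceIn(0, values.size)
}

fun main() {
    val a = Column(listOf(10, 20, 30, 40))
    println(a[null, -1])                 // [10, 20, 30]
    println(a.select { toIndex = -1 })   // [10, 20, 30]
    println(a.select { fromIndex = 1 })  // [20, 30, 40]
}
```

Both calls are statically typed. The builder form costs a few more characters but names the bound it sets, which is what IDE completion and code review benefit from.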

jimn

09/20/2019, 5:51 PM

destructuring operations and tuples are hand-annotated declarations at present. you can overload get(arity 1) but you don't get to redefine get with flexible arity, or the operator choice of symbols chosen by the kotlin spec

jimn

09/20/2019, 5:54 PM

how does kotlin code even approximate eigenvector and operator notation? I'm not a math pro but I really don't like the idea of learning the equation grammar and then learning the adaptations in kotlin or any other imperative language separately

jimn

09/20/2019, 5:54 PM

this is a high kolmogorov complexity, two languages for one intent
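For reference, a small sketch of how far standard Kotlin gets toward written vector notation. `Vec` is a hypothetical type made up for this example. Only the operator symbols Kotlin predefines can be overloaded; anything else (such as a dot product) has to be an infix function with a word for a name.

```kotlin
import kotlin.math.sqrt

// Hypothetical Vec type for illustration; no size checks, to keep the sketch short.
class Vec(private vararg val c: Double) {
    val size: Int get() = c.size

    operator fun get(i: Int): Double = c[i]
    operator fun plus(o: Vec) = Vec(*DoubleArray(size) { c[it] + o[it] })
    operator fun minus(o: Vec) = Vec(*DoubleArray(size) { c[it] - o[it] })
    operator fun times(k: Double) = Vec(*DoubleArray(size) { c[it] * k })
    operator fun unaryMinus() = this * -1.0

    // New symbols cannot be introduced, but infix functions read almost like operators.
    infix fun dot(o: Vec): Double = (0 until size).sumOf { c[it] * o[it] }

    val norm: Double get() = sqrt(this dot this)

    override fun toString() = c.joinToString(prefix = "(", postfix = ")")
}

// Scalar on the left (k * v), defined as an extension on Double.
operator fun Double.times(v: Vec): Vec = v * this

fun main() {
    val u = Vec(1.0, 2.0, 2.0)
    val v = Vec(0.0, 3.0, 4.0)

    // Reads close to the written formula w = 2u - v.
    val w = 2.0 * u - v
    println(w)        // (2.0, 1.0, 0.0)
    println(u dot v)  // 14.0
    println(u.norm)   // 3.0
}
```

Subscripts, transposes and similar typographic notation still have to be spelled out as named functions, so the gap described in the messages above narrows but does not disappear.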

jimn

09/20/2019, 5:56 PM

pandas is not an empirical study of efficient representation, to be sure. it just happened to enable the c++ native library wrapper and was a useful tool for hitting a critical mass

jimn

09/20/2019, 5:57 PM

and iiuc it has some harsh deprecations in its history as well, as the inefficiencies were addressed

altavir

09/20/2019, 5:57 PM

Well, as I already said, you can't get both language morphing and safety simultaneously. Kotlin is really expressive in terms of creating declarative builders for anything, but it is limited by its static nature and tooling support. Anyway, I would start by writing down what types of expressions you need.

jimn

09/20/2019, 6:00 PM

i'm just barely getting my legs about me in tensorflow code to start to detect when sample code really is a useless dead end to begin with. it involves rote memorization of pandas code upon toy datasets, which are impossible at the scale of real-world value when you have 20000+ features (columns) times millions of rows in python in RAM

altavir

09/20/2019, 6:01 PM

You are starting at the wrong end. You are trying to understand which code is good. You should start by writing down what mathematics you need and how you expect to use it. Write use cases.

jimn

09/20/2019, 6:02 PM

take a csv file, and then instead of importing it into sqlite and permuting and transforming it with sql, group-bys, and whatnot, you just load up everything in RAM and start nailing the python GIL as hard as you can to do the same thing. then you have something you can start to ... translate... into a TF "model" that expects numpy bindings

jimn

09/20/2019, 6:04 PM

when you have 9 gigs of CSV source data, you may find time series features in the hundreds of gigs in RAM

jimn

09/20/2019, 6:04 PM

that's my experience.

altavir

09/20/2019, 6:04 PM

This has nothing to do with syntax. And this is something we can easily manage with Kotlin
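As one possible reading of "manage with Kotlin", here is a minimal sketch of a group-by over CSV rows that streams the input instead of loading it all into RAM, using only the Kotlin standard library. The function `meanBy`, the `Stats` type and the column names are hypothetical, and the parser assumes a simple file: a header row, comma separators, no quoted fields.

```kotlin
import java.io.Reader
import java.io.StringReader

// Running aggregate for one group; hypothetical helper type.
data class Stats(val count: Long = 0, val sum: Double = 0.0) {
    val mean: Double get() = if (count == 0L) 0.0 else sum / count
    fun add(x: Double) = Stats(count + 1, sum + x)
}

// Streams rows one at a time and keeps only one Stats per key in memory.
// Assumes a simple CSV: header row, comma separator, no quoted fields.
fun meanBy(reader: Reader, keyColumn: String, valueColumn: String): Map<String, Stats> =
    reader.useLines { lines ->
        val rows = lines.iterator()
        val header = rows.next().split(",")
        val k = header.indexOf(keyColumn)
        val v = header.indexOf(valueColumn)
        rows.asSequence()
            .filter { it.isNotBlank() }
            .map { it.split(",") }
            .groupingBy { it[k] }
            .fold(Stats()) { acc, row -> acc.add(row[v].toDouble()) }
    }

fun main() {
    val csv = """
        sensor,value
        a,1.0
        b,10.0
        a,3.0
    """.trimIndent()

    // For a real file, java.io.File("data.csv").reader() could be passed instead.
    val stats = meanBy(StringReader(csv), "sensor", "value")
    println(stats["a"]?.mean)  // 2.0
    println(stats["b"]?.mean)  // 10.0
}
```

Memory use grows with the number of distinct keys rather than the number of rows. The sketch does not cover parallelism or derived time-series features, which are the harder parts raised in this thread.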

jimn

09/20/2019, 6:06 PM

while i understand the capabilities of kotlin, there's a gap, and an impossibly large herd of data science disciples who quite simply couldn't give a rat's ass about code or readability; they are the ones who tend to publish and market the tutorials of cut and paste snippets to every nuance of variation.

jimn

09/20/2019, 6:08 PM

in their estimation, jupyter and pandas are the adequate approximation of something better than matlab and R for getting results out the door +- into GPU DL frameworks

altavir

09/20/2019, 6:13 PM

You can use Kotlin with jupyter. As for Python users, it is not our aim to replace Python with Kotlin at the moment. This is just not possible. The aim is to create comfortable tools in kotlin and present them to people sick and tired of Python.

jimn

09/20/2019, 6:17 PM

my personal recurring usecases are time series forecasts. the unanimous public opinion is that this is well handled by LSTM with some data features blown out from source data. the plan of attack, is referring to the above mentioned samples for an understanding of the DL target execution, and for the moment minimizing the python coding bottleneck which boils down to less than perfect docs, good, and complete, but vastly under-representing the potential applications, and therefore examples. likewise the propeller-heads that are comfortable may omit shorthand in their long time travels which get them to point B for a problem faster. the hyper-parameter tuning space is something that has no "math expression", convergence is an arbitrary result still subject to hand-tuning and experimentation

jimn

09/20/2019, 6:20 PM

all roads lead to numpy, and typically thru pandas, and pandas recipes, though occasionally, due to python's miserable performance, the experts will lay out a very solution-specific tweak for a dataset that may not apply to anyone else's problems, like how to solve parallelism inadequacies typically.

jimn

09/20/2019, 6:22 PM

you can say it's an expression based space, and for something like Modelica simulations, it absolutely is, but DL is less math and more tooling than dedicated math libs might address

jimn

09/20/2019, 6:23 PM

the usecases of how to process 250 million csv rows with transforms, groupby, normalization, these are python's big logjam

jimn

09/20/2019, 6:25 PM

my preference would be to focus on an operator based grammar well ahead of compromising with a catalog of functions and classes to support data operations. the closer to a syntax that looks like latex output, the more succinct and robust it will be in the eyes of tool jocks like myself

jimn

09/20/2019, 6:35 PM

@altavir if you want to work backwards from a timeseries LSTM Aparapi/GPU accelerated example in all-kotlin, you would find plentiful pandas analogs, datasets, kaggle solutions, and linear regression and markov hidden layer datasets for which the only small consolation i know of is: keras is slightly less intense than tensorflow.

jimn

09/20/2019, 6:36 PM

this would be a huge marketshare draw to a kotlin ecosystem.
