# datascience
p
Hi, I would like to join the discussion on building Kotlin's mathematical libraries by showing you what I've done. I'm currently in a Signal and Image Processing Master's program, where we are continuously assigned projects in Python and Matlab, as they are the industry standard for quick prototyping and experimentation. Having experience with Kotlin, I see a huge opportunity to improve the existing tools by taking advantage of a modern programming language and its excellent tooling, the same way you do. I've been using @kyonifer's koma, which is very helpful and familiar to newcomers from Python, and I had the same thoughts as @altavir: mimicking Python's and/or Matlab's syntax leads to the same drawbacks I was trying to avoid. I'm very interested in @altavir's kmath and his context-driven approach; I think it models mathematical primitives the right way.
I believe Python became the first choice for data science because of numpy. Python's data science community managed to converge on a common library for numerical primitives which, unlike Matlab and others, is based on ND Arrays, not matrices. ND Arrays are a more natural way to model engineering problems, particularly for modern applications such as differentiable models (a.k.a. "neural networks/AI"), and the fact that all engineering libraries use the same numerical primitives in their interfaces makes it easy to compose and share work.
This is why I started my own project for NDArrays in pure Kotlin, https://github.com/TomasVolker/numeriko/tree/master/numeriko-core, to see what such a library would look like. It is in a completely experimental state, far from being fully documented and tested. My intention is not to fragment the library ecosystem even more, but to show you how I envision a library that takes advantage of Kotlin's features and uses the same conventions.
Equally important, visualization is fundamental for data science, and Python has matplotlib for this. I have a basic matplotlib wrapper implemented in Kotlin that uses a DSL for a declarative syntax: https://github.com/TomasVolker/numeriko/tree/master/kyplot. This should be a temporary solution until there is a more complete and powerful Kotlin library based on OpenGL, which I'm already experimenting with for 3D plots.
There is a lot to discuss about implementation details, but I believe the priority is to agree on what we want from a common ND Array library that we can all build on: an idiomatic Kotlin equivalent of numpy. I would like to know your views/thoughts/feedback on the next steps for building Kotlin's data science, engineering and math libraries.
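To make the "declarative DSL" idea above concrete, here is a minimal sketch of how such a syntax could be built with Kotlin's type-safe builders. All names (`Figure`, `LineSeries`, `figure`, `line`, `describeSquares`) are hypothetical and are not taken from kyplot; the sketch only builds a description of the plot and leaves rendering to some backend.

```kotlin
// Hypothetical declarative plotting DSL; none of these names come from kyplot.

@DslMarker
annotation class PlotDsl

@PlotDsl
class LineSeries {
    var x: List<Double> = emptyList()
    var y: List<Double> = emptyList()
    var label: String = ""
}

@PlotDsl
class Figure {
    var title: String = ""
    val lines = mutableListOf<LineSeries>()

    // Adds one line series, configured by the given block.
    fun line(init: LineSeries.() -> Unit) {
        lines += LineSeries().apply(init)
    }
}

// Builds a plain description of the plot; a backend would render it later.
fun figure(init: Figure.() -> Unit): Figure = Figure().apply(init)

// Example: the whole plot is declared in one expression, with no global plotting state.
fun describeSquares(): Figure = figure {
    title = "Squares"
    line {
        x = listOf(0.0, 1.0, 2.0)
        y = x.map { it * it }
        label = "x^2"
    }
}
```

Unlike a stateful API, the plot here is an ordinary value that can be passed around, inspected, or handed to different renderers.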
👍 1
t
Is a side purpose also to use this lib as a basis for a Dataframe library? I think Kotlin mostly needs convergence on these issues so that every developer contributes toward the same goals (mathematical, dataframe, and viz libs).
k
I tend to agree @Thomas Legrand. While choice is sometimes a good thing (e.g. having 10 plotting libraries is fine imo), I think one of the reasons Python scientific computing took off is that Numeric and numarray converged into numpy. If the Kotlin library ecosystem is fractured between 7 ndarray implementations, it will make building an ecosystem impossible.
👍 1
@Percha I'm not sure I understood your explanation of why you started your own NDArray implementation. Is there something in particular about koma that was a roadblock? Or just a distaste for the numpy-like API?
a
@kyonifer It is OK to have 10 plotting libraries, but not 10 plotting APIs
p
@kyonifer my main limitation with koma was that it is too matrix oriented, which is the same reason I dislike Matlab. In my experience, engineering applications are better modeled by NDArrays (for example, why do we have to use row or column vectors? If elements are indexed by one value, it's naturally a 1D array). Although koma provides basic NDArray support, I don't think it takes full advantage of Kotlin's type system and conventions (for example, having mutable/read-only interfaces and low-rank types: 1D, 2D and 3D). I also disagree with using numpy's naming (I prefer explicit names, not randn, cumsum, etc.) and matplotlib's or Matlab's stateful plotting API when we can have a declarative DSL. I understand that we should provide a familiar library for people coming from Python and other languages, but we have the opportunity to build an ecosystem using better practices. Maybe we can provide a Python-like library as a wrapper for a more idiomatic Kotlin one. I do agree that there should be a de facto standard for ndarrays, and numeriko's goal is to show what I believe a complete interface should look like: mutable/read-only interfaces, primitive specialized types (until Project Valhalla at least) and low-rank specializations (1D, 2D, 3D), as most applications deal with these.
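As a rough illustration of the interface shape described above (read-only/mutable pairs, primitive specialization and low-rank types), here is a minimal sketch. Every name in it (`DoubleArray1D`, `MutableDoubleArray1D`, `DoubleArray2D`, `DenseDoubleArray1D`, `sum`, `scaleInPlace`) is an assumption made for illustration and is not numeriko's or koma's actual API.

```kotlin
// Hypothetical interfaces for illustration; not numeriko's or koma's actual API.

// Read-only 1D array: indexed by a single value, with no row/column notion.
interface DoubleArray1D {
    val size: Int
    operator fun get(i: Int): Double
}

// Mutable variant adds write access, mirroring List / MutableList.
interface MutableDoubleArray1D : DoubleArray1D {
    operator fun set(i: Int, value: Double)
}

// Read-only 2D specialization: exactly two indices, checked by the compiler.
interface DoubleArray2D {
    val rows: Int
    val cols: Int
    operator fun get(i: Int, j: Int): Double
}

// Simple implementation backed by a primitive DoubleArray (no boxing of elements).
class DenseDoubleArray1D(private val data: DoubleArray) : MutableDoubleArray1D {
    override val size: Int get() = data.size
    override operator fun get(i: Int): Double = data[i]
    override operator fun set(i: Int, value: Double) { data[i] = value }
}

// Read-only parameter: the signature documents that the input is not modified.
fun sum(values: DoubleArray1D): Double {
    var total = 0.0
    for (i in 0 until values.size) total += values[i]
    return total
}

// Mutable parameter: the signature documents in-place modification.
fun scaleInPlace(values: MutableDoubleArray1D, factor: Double) {
    for (i in 0 until values.size) values[i] = values[i] * factor
}
```

A `DenseDoubleArray1D` can be passed to both functions, while a value typed as `DoubleArray1D` is rejected by `scaleInPlace` at compile time.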
As for @Thomas Legrand, I don't deal with Dataframes much, so I didn't have that in mind, but indeed these issues should converge. Once we agree on a common ndarray framework, we should build libraries for all domains on top of it: visualization, dataframes, machine learning, signal processing, etc.
👌 1
k
One could consider the 2D specialization to just be the Matrix type. As you mentioned, there is one inconsistency between the two types, namely that matrices always have dimension 2 (e.g. 5x1), whereas an NDArray with N=1 would not have the row/col vector distinction. It's been proposed to eliminate this distinction before in koma, see https://github.com/kyonifer/koma/issues/83#issuecomment-436113181, so I don't think that's an irreconcilable point.
In linear algebra, there is a distinction between col/row vectors, but a programming lib has to decide which way to go. numpy decided to deprecate its matrix type in favor of ndarrays only.
I wouldn't look too closely at koma's plotting support; that was just a simple line plot wrapper I put together so I could have something at all, circa 2015, before anything else existed. There are better projects going on now. I do think there is room for a complete matplotlib wrapper as a plotting option, for people who are used to MATLAB-style plotting, because plotting choices are always good.
👍 1
I think the rest of your issues come down to not liking numpy. In that case, the top-level functions ("numpy-like") could be split out into a different package, so that they don't interact with the implementations.
Note though that making a non-frustrating ndarray API that has immutable/mutable types can be tricky. In koma at one point we tried to return array views (non-copy) when someone asked for a transpose (something that is still to be added). We then tried making the `.T` be an immutable container sub-type. This invariably causes two things to happen:
1) People writing functions must go through all their code and decide which things actually need mutability. Invariably half their stuff will need immutable, half will need mutable.
2) People using said functions will have to play a guessing game of whether what they have is immutable or not. Maybe they were passed immutable but they need to call mutable, or vice versa. Or maybe they were given mutable but called .T on it or sliced it (got an immutable view), and now they're back to needing mutable again.
The result is a hodgepodge of code needing mutable and non-mutable, with asMutable() littered everywhere, along with people trying to fiddle with function in/out types to make it work with this or that other function without `asMutable` noise everywhere.
So while I'm not categorically against such, I think it needs to be thought out carefully so that research scientists can get their jobs done without fiddling with type systems, if we ever want researchers to actually use what we build and not just programmers
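The friction described above can be sketched with a non-copy transpose that returns a read-only view. The types and functions here (`ReadOnlyMatrix`, `DenseMatrix`, `transposedView`, `copyToDense`, `zeroDiagonalInPlace`) are hypothetical and only illustrate the pattern; they are not koma's API, and an explicit copy stands in for whatever conversion a real library would offer.

```kotlin
// Hypothetical types for illustration; not koma's actual API.

interface ReadOnlyMatrix {
    val rows: Int
    val cols: Int
    operator fun get(i: Int, j: Int): Double
}

class DenseMatrix(override val rows: Int, override val cols: Int) : ReadOnlyMatrix {
    private val data = DoubleArray(rows * cols)

    override operator fun get(i: Int, j: Int): Double = data[i * cols + j]
    operator fun set(i: Int, j: Int, value: Double) { data[i * cols + j] = value }

    // Non-copy transpose: a read-only view that swaps the indices on access.
    fun transposedView(): ReadOnlyMatrix = object : ReadOnlyMatrix {
        override val rows: Int get() = this@DenseMatrix.cols
        override val cols: Int get() = this@DenseMatrix.rows
        override operator fun get(i: Int, j: Int): Double = this@DenseMatrix[j, i]
    }
}

// Explicit copy: one way back from a read-only view to a mutable value.
fun ReadOnlyMatrix.copyToDense(): DenseMatrix {
    val result = DenseMatrix(rows, cols)
    for (i in 0 until rows) for (j in 0 until cols) result[i, j] = this[i, j]
    return result
}

// Needs write access, so it cannot accept the transposed view directly.
fun zeroDiagonalInPlace(m: DenseMatrix) {
    for (i in 0 until minOf(m.rows, m.cols)) m[i, i] = 0.0
}
```

Calling `zeroDiagonalInPlace(m.transposedView())` does not compile, so the caller has to insert `copyToDense()` or an equivalent conversion, which is the kind of noise the message warns about.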
👍 1
p
I think numpy did the right thing by focusing on ndarrays rather than matrices, avoiding the matrix-ndarray segmentation. NDArrays turn out to be more flexible and better suited for many applications, with linear algebra functions available on 2D arrays. As for the mutable/read-only problem, I'm proposing the same approach as Kotlin's Collection interfaces. It's not about immutability but about having a read-only interface like List/MutableList, which encourages a functional style that is very well suited for mathematical modeling and makes functions that modify their inputs self-documenting. If I understood correctly, you tried making an immutable Matrix subtype, not a read-only supertype. That is Java's original approach, which only catches illegal mutation at runtime, not at compile time. I understand it's not trivial, and there are many cases where mutability is necessary for performance, especially in scientific computing, so I experimented with an `asMutable` function for those cases. In my experience this approach has worked very well: I'm using numeriko in my projects and it feels very natural to have the distinction, just like with List and MutableList. I understand this distinction will make non-Kotlin programmers change their habits, so it's important we have the discussion.
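The difference between a read-only supertype and a runtime-checked wrapper can be seen with the standard collections alone. This is a minimal JVM-only sketch; `total` and `tryMutateUnmodifiable` are names made up for the example, while `List`, `MutableList` and `java.util.Collections.unmodifiableList` are the real standard library types.

```kotlin
import java.util.Collections

// Read-only supertype: List exposes no mutators, so misuse is a compile error.
fun total(values: List<Int>): Int {
    // values.add(4)   // does not compile: List<Int> has no add()
    return values.sum()
}

// Java-style unmodifiable wrapper: the mutable type is kept and mutation
// is only rejected when the call actually happens.
fun tryMutateUnmodifiable(): Boolean {
    val wrapped: MutableList<Int> = Collections.unmodifiableList(mutableListOf(1, 2, 3))
    return try {
        wrapped.add(4)
        true
    } catch (e: UnsupportedOperationException) {
        false
    }
}
```

Note that a Kotlin `List` is read-only rather than immutable: the same object may still be changed through a `MutableList` reference held elsewhere.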