<http://web.eecs.utk.edu/~azh/blog/notebookpainpoi...
# datascience
a
Notebook-based research is actually a very bad thing when used outside of prototyping or report preparation. Kotlin notebooks solve some problems with build reproducibility (those are the nastiest ones in Python/Julia), but the problems of cell order, dangling intermediate state and other things are still there. I've started to work on DataForge to mitigate those problems, but its implementation is still far from broad use.
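To make the "dangling intermediate state" problem concrete, here is a toy sketch in plain Kotlin (the cell contents are hypothetical): every cell reads shared mutable state, so re-running one cell out of order silently stales the outputs of the others.

```kotlin
// Hypothetical cell contents, modelled as plain Kotlin: each "cell" shares
// mutable notebook state, so results depend on which cell ran last.

var threshold = 0.5                           // cell 1: define a parameter
val data = listOf(0.2, 0.6, 0.9)              // cell 2: load some data

fun cell3() = data.filter { it > threshold }  // cell 3: an analysis step

fun main() {
    println(cell3())   // [0.6, 0.9]
    threshold = 0.8    // re-run an edited cell 1, out of order
    println(cell3())   // [0.9]: anything derived from the old threshold is
                       // now stale, but a notebook would still display
                       // cell 3's old output
}
```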
b
> Notebook-based research is actually a very bad thing when used outside of prototyping or report preparation
Do you mean this kind of HCI research based on notebooks or the general practice of using notebooks for data science? I'm not sure what better alternatives exist today besides IDEs, which are starting to provide more notebook-oriented features. Have you tried using DataLore before? What did you think? What features does DataForge aim to support that would provide additional value to the development experience? Do you see a fundamental reason why notebooks could not fix the reproducibility problems around statefulness and provide a more IDE-like experience?
a
I tried DataLore, and it is not fundamentally different from any other notebook. The general problem with notebooks is that you can't treat cells as computation stages. If you run them sequentially, they are no better than a single block of code. If you start to run them in non-linear order, you can't get reproducible results. The idea of DataForge is to provide a declarative way to define analysis tasks and then combine them in a pull-based data flow. I am not sure it is applicable to classical "data science", but it definitely helps a lot to automate "big data" analysis in physics.
It is possible to add cell-dependency functionality to a notebook, and it could mitigate some problems, but there will always be the problem of dependency validation (in DataForge it is solved by comparing analysis declarations).
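Roughly the idea, as a toy Kotlin sketch: declarative tasks combined in a pull-based flow, with results validated by hashing the declarations. This is an illustration of the concept, not the actual DataForge API.

```kotlin
import java.security.MessageDigest

// A toy pull-based task graph. Each task carries a declaration string;
// results are cached and invalidated by comparing declaration hashes.
class Task<T>(
    val declaration: String,                    // what this stage claims to compute
    val dependencies: List<Task<*>> = emptyList(),
    val compute: () -> T
) {
    private var cached: T? = null
    private var cachedFingerprint: String? = null

    // The fingerprint covers this declaration and, transitively, all dependencies.
    fun fingerprint(): String {
        val text = declaration + dependencies.joinToString { it.fingerprint() }
        return MessageDigest.getInstance("SHA-256")
            .digest(text.toByteArray()).joinToString("") { "%02x".format(it) }
    }

    // Pull: recompute only if this declaration (or a dependency's) changed.
    fun pull(): T {
        val fp = fingerprint()
        if (cached == null || cachedFingerprint != fp) {
            cached = compute()
            cachedFingerprint = fp
        }
        return cached!!
    }
}

fun main() {
    val load = Task("load: raw.csv") { listOf(1.0, 2.0, 3.0) }
    val stats = Task("stats: mean", listOf(load)) { load.pull().average() }
    println(stats.pull())  // computed once; repeated pulls reuse the cache
    println(stats.pull())
}
```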
b
Some people are working on "different" approaches as well: https://www.fast.ai/2019/12/02/nbdev/. I tend to use notebooks only for proofs of concept, the same way I draw on a whiteboard or a piece of paper before designing something complex…
i
DataForge sounds intriguing; is it IDEA-based?
a
No, it is a stand-alone system. We do not have time to work properly on a GUI configurator right now. In the old version I used a TornadoFX-based output and Groovy config files. Now it is fully Kotlinized, but GUI support is still limited.
p
There’s an interesting comment thread on HN regarding that study: https://news.ycombinator.com/threads?id=fryguy
a
It is still all about state management and cache validation. I do not believe that notebooks can be used for any complicated or long-running analysis. You just need a way to cache intermediate results, and as soon as you do caching, you need validation mechanisms. The only way I can imagine doing it in a notebook is to add a cell-dependency mechanism and invalidate a cell's results, along with those of all dependent cells, whenever its text changes. I tried to "sell" the idea to the BeakerX developers a few years ago, but they decided it was too complicated to be usable.
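A hypothetical sketch of that mechanism in Kotlin (no existing notebook exposes this API): each cell caches its result, and editing a cell's source drops that result together with the results of every transitively dependent cell.

```kotlin
// Hypothetical notebook model: a cell remembers its source text, and an
// edit invalidates its cached result plus all transitive dependents.

class Cell(val id: String, var source: String, val dependsOn: List<Cell> = emptyList()) {
    var result: Any? = null   // cached output; null means "needs re-run"
}

class Notebook(private val cells: List<Cell>) {
    fun edit(cell: Cell, newSource: String) {
        if (newSource == cell.source) return   // no real change, keep caches
        cell.source = newSource
        invalidate(cell)
    }

    private fun invalidate(cell: Cell) {
        cell.result = null
        // invalidate direct dependents; recursion handles the transitive ones
        cells.filter { cell in it.dependsOn && it.result != null }
            .forEach { invalidate(it) }
    }
}

fun main() {
    val load = Cell("load", "data = read()")
    val clean = Cell("clean", "clean(data)", listOf(load))
    val plot = Cell("plot", "plot(clean)", listOf(clean))
    val nb = Notebook(listOf(load, clean, plot))
    listOf(load, clean, plot).forEach { it.result = "ok" }  // pretend all cells ran

    nb.edit(load, "data = read_v2()")   // editing the root cell...
    println(listOf(load, clean, plot).map { it.id to it.result })
    // [(load, null), (clean, null), (plot, null)]: all downstream results dropped
}
```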
b
It just goes along with the 200 different systems for building pipelines, which are just state management and cache validation systems as well.
a
I agree, but with two caveats. First, when you use Spark or something like it, the notebook is not your primary tool anymore; it is just a scripting entry point, and no actual development is done in it. Second, Spark does not actually provide cache validation. There is no way to make an automated cache unless you can track and compare changes in the source code. In Spark, you can cache results only within a single computation, while the code stays unchanged.
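For illustration, a minimal Kotlin sketch of that missing piece: a cache keyed by a hash of the analysis source, so results survive across runs but are invalidated automatically when the code changes. The file naming and the `analysisSource` string are illustrative assumptions, not anything Spark provides.

```kotlin
import java.io.File
import java.security.MessageDigest

// Persistent result cache keyed by a hash of the code that produced the
// result: unchanged code reuses the cache across runs, changed code recomputes.

fun sha256(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray()).joinToString("") { "%02x".format(it) }

fun cachedRun(analysisSource: String, compute: () -> String): String {
    val cacheFile = File("cache-${sha256(analysisSource).take(16)}.txt")
    if (cacheFile.exists()) return cacheFile.readText()  // code unchanged: reuse
    return compute().also { cacheFile.writeText(it) }    // code changed: recompute
}

fun main() {
    val source = "mean of column x"        // stand-in for real analysis code
    println(cachedRun(source) { "42.0" })  // first run computes and stores
    println(cachedRun(source) { "42.0" })  // second run hits the cache; editing
    // 'source' would change the hash, so the stale entry would be ignored
}
```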