How can I process a file line-by-line in Kotlin/JS...
# javascript
j
How can I process a file line-by-line in Kotlin/JS (in a webpage context) without reading the whole file into memory? I'm currently trying to start with a JS FileReader, though I'd be willing to use a different API if it worked.
The browser is willing to load the whole file with `readAsArrayBuffer` (it wasn't with `readAsText`), but I crash with "oh, snap" (presumably out of memory) when I try to use `decodeToString` on it. Is there any way to get it as a `CharSequence` that I could use `splitToSequence` on? Or any other suggestions?
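Roughly, the failing approach looks like this (a reconstruction from the description above, not the asker's exact code; `naiveReadLines` is just a placeholder name):

```kotlin
import org.khronos.webgl.ArrayBuffer
import org.khronos.webgl.Int8Array
import org.w3c.files.File
import org.w3c.files.FileReader

// Load the whole File as an ArrayBuffer, decode it into a single String, then split it.
// The decodeToString call is the step that runs out of memory for a 500 MB file.
fun naiveReadLines(file: File, onLines: (Sequence<String>) -> Unit) {
    val reader = FileReader()
    reader.onload = {
        // A Kotlin/JS ByteArray is represented as an Int8Array, so this view costs nothing.
        val bytes = Int8Array(reader.result as ArrayBuffer).unsafeCast<ByteArray>()
        val text = bytes.decodeToString()          // crashes the tab on large files
        onLines(text.splitToSequence('\n'))
    }
    reader.readAsArrayBuffer(file)
}
```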
e
You could `slice()` the `File` into multiple chunks.
AFAIK that's the only clean way to do it.
j
So to line-delimit I'd need to implement my own stdio-style buffering I guess. (With the additional fun of the slices not necessarily cleanly matching UTF-8 boundaries.)
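A minimal sketch of that chunked buffering (hypothetical; `readLinesChunked` and its parameters are made-up names, not an existing API). It splits only at `'\n'` bytes, so a multi-byte UTF-8 sequence never straddles a decode:

```kotlin
import org.khronos.webgl.ArrayBuffer
import org.khronos.webgl.Int8Array
import org.w3c.files.File
import org.w3c.files.FileReader

// Read `file` slice by slice, emit complete lines, and carry the unfinished tail
// of each slice over to the next one.
fun readLinesChunked(
    file: File,
    chunkSize: Int = 8 * 1024 * 1024,
    onLine: (String) -> Unit,
    onDone: () -> Unit,
) {
    val total = file.size.toInt()          // Int offsets are fine for a 500 MB file
    var offset = 0
    var carry = ByteArray(0)               // bytes after the last newline seen so far

    fun readNext() {
        if (offset >= total) {
            if (carry.isNotEmpty()) onLine(carry.decodeToString().trimEnd('\r'))
            onDone()
            return
        }
        val end = minOf(offset + chunkSize, total)
        val reader = FileReader()
        reader.onload = {
            // A Kotlin/JS ByteArray is an Int8Array under the hood, so this view is free.
            val chunk = Int8Array(reader.result as ArrayBuffer).unsafeCast<ByteArray>()
            val data = carry + chunk
            var lineStart = 0
            for (i in data.indices) {
                if (data[i] == '\n'.code.toByte()) {
                    // Decode one complete line; trimEnd('\r') tolerates DOS line endings.
                    onLine(data.decodeToString(lineStart, i).trimEnd('\r'))
                    lineStart = i + 1
                }
            }
            carry = data.copyOfRange(lineStart, data.size)
            offset = end
            readNext()
        }
        reader.readAsArrayBuffer(file.slice(offset, end))
    }
    readNext()
}
```

Since decoding happens one line at a time, the 500 MB string is never materialized; each slice's leftover bytes simply roll into the next read.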
e
You can use `readAsText(chunk)` instead of `readAsArrayBuffer` to simplify it, and then somehow manage the EOL chars.
Although... you're right that it might result in a broken string. I'm not sure how to solve this part.
Sorry, I missed the "it wasn't with `readAsText`" part. You mean it did not load the entire content? That's strange.
j
It looks like the behavior with `readAsText` is to call the event callback with an empty string if the file is too big for memory.
e
Damn how big is the file?
j
500 MB
(That's after filtering; the original file was 3 GB. This is a stats viewer for stats from our server, and the particular thing I'm trying to view stats for lasted like 20 hours.)
e
Quickly found https://stackoverflow.com/a/32753261/1392277. See this comment:
"I just noticed this with some rather large datasets: a 257MB file reads but a 459MB file returns an empty string, Chrome 49"
j
`readAsArrayBuffer` loads the file fine (presumably it's just mmap'ing it or something internally), but my attempt to convert it to text fails with "oh snap".
e
Probably fails for the same reason `readAsText` fails, although the latter doesn't crash the page.
j
Yeah, agreed
e
So yeah, looks like you'll need custom logic, but to me it looks too messy
j
I was hoping someone else had written it already, but I guess not. 🙂
e
You could read the entire buffer, and then scan bytes to find line delimiters.
j
Yeah, that's probably a cleaner way to avoid issues with partial UTF-8. (It's a bit ugly because the files can have either Unix or DOS line endings, but that's manageable.)
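A sketch of that byte-scanning as a lazy line splitter over an already-loaded buffer, handling both Unix and DOS endings (`lineSequence` here is a made-up extension, not stdlib):

```kotlin
// Lazily split an in-memory ByteArray into lines, decoding only one line's worth
// of UTF-8 at a time instead of the whole buffer at once.
fun ByteArray.lineSequence(): Sequence<String> = sequence {
    var start = 0
    for (i in indices) {
        if (this@lineSequence[i] != '\n'.code.toByte()) continue
        // Drop a trailing '\r' so DOS-style endings produce the same lines as Unix ones.
        val end = if (i > start && this@lineSequence[i - 1] == '\r'.code.toByte()) i - 1 else i
        yield(decodeToString(start, end))
        start = i + 1
    }
    if (start < size) yield(decodeToString(start, size))   // last line without a trailing newline
}
```

Combined with `readAsArrayBuffer`, only the lines actually iterated get decoded, so no file-sized string is ever created.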
e
And what if the entire 500 MB text file is a single line? 👀
j
Well, that would fail no matter what... unless we have a buffered JSON parser, but I don't think we do... and even then I'd need the memory for the whole parsed JSON.
e
Also, check out the kotlinx-io UTF-8 reader functions, which should accept byte arrays. Not sure if they'll work for your use case though.
j
Yeah, I was wondering if kotlinx-io had tools in general for this, but I couldn't find them
e
You need to wrap your JS buffer into a kotlinx-io `Buffer`, and then you can use `Source.readString()`. Or you can even wrap your JS buffer into a custom `Source` implementation to avoid copying bytes.
Ah yeah that one was failing
Better if I go sleep, I'm forgetting stuff lol
The stdlib and kotlinx-io decoding implementations differ, so giving kotlinx-io a try looks OK.
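For the kotlinx-io route, a sketch assuming kotlinx-io-core's `Buffer`, its `write(ByteArray)` overload, and the `Source.readLine()` extension (worth double-checking against the current API); it uses `readLine()` rather than the `readString()` mentioned above since lines are what's wanted here:

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.readLine

// Feed the raw bytes into a kotlinx-io Buffer and let its Source API do the
// per-line UTF-8 decoding, instead of decoding the whole buffer at once.
fun linesViaKotlinxIo(bytes: ByteArray): Sequence<String> = sequence {
    val buffer = Buffer()
    buffer.write(bytes)                       // Buffer is also a Sink, so it accepts a ByteArray
    while (true) {
        yield(buffer.readLine() ?: break)     // readLine() returns null once the buffer is exhausted
    }
}
```

A custom `RawSource` over the JS buffer, as suggested above, would avoid copying into a `ByteArray` first; the `Buffer` version is just the shortest thing to show.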
t
Do you have browser limitations?