Following problem: I maintain a Wolfram Language plugin for IntelliJ and one of its tasks is to chec...

halirutan

07/29/2018, 12:39 AM

Following problem: I maintain a Wolfram Language plugin for IntelliJ and one of its tasks is to check if a symbol/variable is defined in the core of the language. I have information of about 50000 (yes, 50k) built-in symbols and for each symbol I have things like name, namespace, options, usage message, etc.. A symbol is uniquely defined by its full name consisting of name and namespace e.g. ``Developer`ToPackedArray`` or ``System`Plot`` where `` ` `` is the divider like

in Java/Kotlin and the last part is the name and the first part is the namespace (it's called context). Namespaces can be nested like in Java too, so there might be symbol thats in ``My`name`space`symbol``. A simple datastructure for this is to have a hashset with entries

fullSymbolNames -> symbolInformation

but I have some operations that need to be blazingly fast because they are called very often 1. Does symbol

Plot

exist in the namespace ``System` ``? This can easily be done by prepending the namespace to the symbol-name and simply check if the keys contain such an entry 2. In which namespaces does the symbol

FooBar

exists? Not really fast as I have to filter the whole list of values or process all keys. 3. What symbols are in the namespace

PacletManager

? Again, I need to filter all values for this. My current implementation in Java makes this possible by handling several different hashsets/lists but I wonder if anyone can think of a better data-structure to make the 3 operations fast. Note that once I loaded all the information about symbols, the data is read-only and I do not mutate any of it.

karelpeeters

07/29/2018, 12:41 AM

A Tree? (or you might call it a Trie if you want to impress CS people)

halirutan

07/29/2018, 12:43 AM

Yes, I have considered a tree, because my full-symbol-names have a defined order (you can sort them) and this is probably faster, but I cannot think how it helps with points 2 and 3.

karelpeeters

07/29/2018, 12:44 AM

3 is eazy right, just list the children of the

PacletManager

node.

halirutan

07/29/2018, 12:44 AM

Ahh, OK, then you were thinking of something different. You mean a tree of namespaces?

karelpeeters

07/29/2018, 12:44 AM

For 2 this does indeed not help, you could maintain a separate map from symbols to nodes for that.

karelpeeters

07/29/2018, 12:44 AM

Yup!

karelpeeters

07/29/2018, 12:51 AM

Now I'm curious, what kind of tree were you thinking of?

halirutan

07/29/2018, 12:52 AM

I need to think about this because my tree wouldn't be consistent. E.g. the node

System

will have leafs that are real symbols and it can contain another namespace as child.

karelpeeters

07/29/2018, 12:57 AM

sealed

classes?

halirutan

07/29/2018, 1:00 AM

OK, for problem 1, a trie is indeed a very good data-structure.

halirutan

07/29/2018, 1:01 AM

(and for problem 3 as well)

halirutan

07/29/2018, 1:02 AM

The thing is, at some point I need to access all properties of a certain symbol. So maybe there is no way around storing names and other information in different data-structures. Let me re-read about

sealed classes

halirutan

07/29/2018, 1:02 AM

Thanks btw.

karelpeeters

07/29/2018, 1:04 AM

Actually they're not even necessary:

class Node(val children: Map<String, Node>, val symbols: Map<String, PropertyInfo>)

should do the trick.

halirutan

07/29/2018, 1:08 AM

The

symbols

map would be very large for some nodes and then I'm back to a hashmap when searching for a specific name in a namespace. Maybe it's too much trouble to think about it, but I plan to rewrite it anyway so I thought I asked.

karelpeeters

07/29/2018, 1:08 AM

Huh? HashMaps are O(1) key lookup.

halirutan

07/29/2018, 1:11 AM

O(n) worst case. But maybe you are right. The question is, should I go through the trouble of a tree or simply put all 50k names into one hash-map?

karelpeeters

07/29/2018, 1:12 AM

It's only O(n) if all keys hash to the same value, which I doubt will happen. I'd go with a tree because just appending strings together is ugly IMO, and you might need an actual tree for something else in the future.

karelpeeters

07/29/2018, 1:13 AM

50k items is not much for a hashmap, performance should be fine either way.

karelpeeters

07/29/2018, 1:13 AM

(for a single item lookup at least)

halirutan

07/29/2018, 1:14 AM

My first idea was simply to use

Map<String, Map<String, Property>>

where the first string is the namespace and the inner map is the list of symbols in that namespace.

halirutan

07/29/2018, 1:15 AM

The only complex problem is still number 2 with this.

halirutan

07/29/2018, 1:17 AM

And it's indeed not really complex as I have only about 500 namespaces and for each namespace I need to make a

containsKey

Open in Slack

Previous Next

kotlinlang

A modern programming language that makes developers happier.