Following problem: I maintain a Wolfram Language p...
# announcements
h
Following problem: I maintain a Wolfram Language plugin for IntelliJ and one of its tasks is to check if a symbol/variable is defined in the core of the language. I have information of about 50000 (yes, 50k) built-in symbols and for each symbol I have things like name, namespace, options, usage message, etc.. A symbol is uniquely defined by its full name consisting of name and namespace e.g. ``Developer`ToPackedArray`` or ``System`Plot`` where `` ` `` is the divider like
.
in Java/Kotlin and the last part is the name and the first part is the namespace (it's called context). Namespaces can be nested like in Java too, so there might be symbol thats in ``My`name`space`symbol``. A simple datastructure for this is to have a hashset with entries
fullSymbolNames -> symbolInformation
but I have some operations that need to be blazingly fast because they are called very often 1. Does symbol
Plot
exist in the namespace ``System` ``? This can easily be done by prepending the namespace to the symbol-name and simply check if the keys contain such an entry 2. In which namespaces does the symbol
FooBar
exists? Not really fast as I have to filter the whole list of values or process all keys. 3. What symbols are in the namespace
PacletManager
? Again, I need to filter all values for this. My current implementation in Java makes this possible by handling several different hashsets/lists but I wonder if anyone can think of a better data-structure to make the 3 operations fast. Note that once I loaded all the information about symbols, the data is read-only and I do not mutate any of it.
k
A Tree? (or you might call it a Trie if you want to impress CS people)
h
Yes, I have considered a tree, because my full-symbol-names have a defined order (you can sort them) and this is probably faster, but I cannot think how it helps with points 2 and 3.
k
3 is eazy right, just list the children of the
PacletManager
node.
h
Ahh, OK, then you were thinking of something different. You mean a tree of namespaces?
k
For 2 this does indeed not help, you could maintain a separate map from symbols to nodes for that.
Yup!
Now I'm curious, what kind of tree were you thinking of?
h
I need to think about this because my tree wouldn't be consistent. E.g. the node
System
will have leafs that are real symbols and it can contain another namespace as child.
k
sealed
classes?
h
OK, for problem 1, a trie is indeed a very good data-structure.
(and for problem 3 as well)
The thing is, at some point I need to access all properties of a certain symbol. So maybe there is no way around storing names and other information in different data-structures. Let me re-read about
sealed classes
.
Thanks btw.
k
Actually they're not even necessary:
class Node(val children: Map<String, Node>, val symbols: Map<String, PropertyInfo>)
should do the trick.
h
The
symbols
map would be very large for some nodes and then I'm back to a hashmap when searching for a specific name in a namespace. Maybe it's too much trouble to think about it, but I plan to rewrite it anyway so I thought I asked.
k
Huh? HashMaps are O(1) key lookup.
h
O(n) worst case. But maybe you are right. The question is, should I go through the trouble of a tree or simply put all 50k names into one hash-map?
k
It's only O(n) if all keys hash to the same value, which I doubt will happen. I'd go with a tree because just appending strings together is ugly IMO, and you might need an actual tree for something else in the future.
50k items is not much for a hashmap, performance should be fine either way.
(for a single item lookup at least)
h
My first idea was simply to use
Map<String, Map<String, Property>>
where the first string is the namespace and the inner map is the list of symbols in that namespace.
The only complex problem is still number 2 with this.
And it's indeed not really complex as I have only about 500 namespaces and for each namespace I need to make a
containsKey
.