Im looking for a data structure that is basically ...
# datascience
g
Im looking for a data structure that is basically a
double[10_000][300]
with fast
contains
and fast
indexOf
, any suggestions?
h
Do you care about memory?
g
yes but anything you give me will be better than what I have now
What I think i want is a NavigableSet that preserves insertion order...
that would let me implement
indexOf(elem)
as
headSet(elem).size
h
First of flat structures are great for speed. That means the array should be
double[10_000*300]
Then you can have a simple HashMap which points to the array. What are you gonna use it for? There’s most likely better data structure depending on how it’s gonna be used
i
Something like that should work for you:
Copy code
class OptimizedArray2D(val n: Int, val m: Int) {
    private val _data = DoubleArray(n*m)
    private val _ind : MutableMap<Double, MutableSet<Int>> = HashMap<Double, MutableSet<Int>>()
    
    init {
        for (i in 0 until n) {
            for (j in 0 until m) {
                set(i, j, 0.0)
            }
        }
    }
    
    operator fun get(i: Int, j: Int) = _data[i * m + j]
    operator fun set(i: Int, j: Int, el: Double) {
        val index = i * m + j
        
        val oldEl = _data[index]
        _data[index] = el      
        
        _ind[oldEl]?.remove(index)
        
        val indSet = _ind[el]
        if (indSet == null) {
            _ind[el] = mutableSetOf(index)
        } else {
            indSet.add(index)
        }
    }
    
    fun indexOf(el: Double): Pair<Int, Int>? {
        val indices = _ind[el] ?: return null
        val firstIndex = indices.firstOrNull() ?: return null
        return (firstIndex / m) to (firstIndex % m)
    }
    
    fun contains(el: Double): Boolean {
        return _ind.contains(el)
    }
}
p
I'm afraid that naive equality checking for floating point values is not a good idea due to precision-related issues.
i
Sometimes it may make sense, we don't know for sure. If it matters, HashSet may be replaced with TreeSet to preserve keys order and intertae them in some radius
g
sorry I wasnt clear, I dont need
contains(it: Double)
i need
contains(row: DoubleArray)
(or similar row type), I also need fast
indexOf(row: DoubleArray)
and yeah @Ilya Muradyan i think your point is this: linearize the data and then use a HashSet as a secondary index for fast contains. Unfortunately it doesn't get me fast
indexOf
i
Well, and what should
indexOf()
find, index of the first subrow?
I think, string algorithms play quite well here, but only if you construct your array once, and then make queries to it
You may then construct, i.e., trigram indices, and for exact match use some modification of i.e. Aho–Corasick
g
@Ilya Muradyan im dumber than that, so I wrote my own collection largely for fun but now im actually considering using it. https://gist.github.com/Groostav/d454f1711ffd98dbe6323ff7a112d79a
I would like to linearize the array but that would require some way of encoding "end of bin" into the array. Another index would do it I suppose --and I think I can use a
j.u.BitSet
for that which is a kind've fun performant object ive only spent a little bit of time with.