# datascience
h
🚀 londogard-nlp-toolkit (https://github.com/londogard/londogard-nlp-toolkit) just merged initial support for HuggingFace Transformers and other Transformer-based models:
• PyTorch through JIT-saved TorchScript models
• ONNX models directly through the hub, e.g. TokenClassificationPipeline.create("optimum/bert-base-NER")
  ◦ where optimum/bert-base-NER is a model on the HuggingFace Hub
• Load both PyTorch (TorchScript) & ONNX models through a local path
• ClassificationPipeline and TokenClassificationPipeline exist
  ◦ See the following test for some examples of how to use them
A 1.2.0-BETA release has been cut!
👍 7
a
I tried to use LogisticRegression, but it always returns index 1. I followed the example in your ClassifierTest.
h
So the test fails? That seems weird, as the test case is still passing on CI/CD. Could you provide your sample code?
a
@OptIn(ExperimentalTime::class)
fun logisticTest() {
    val labelsMap = mapOf(
        0 to "Bank Charges",
        1 to "Betting",
        2 to "Card fees",
        3 to "Food",
        4 to "Lifestyle",
        5 to "Loan",
        6 to "Reversal",
        7 to "Salary",
        8 to "Unknown",
        9 to "Utilities & Bills",
        10 to "Withdrawal"
    )
    val data = listOf(
        BankT("Vat amount charges", "Bank Charges"),
        BankT("Loan payment credit", "Loan"),
        BankT("Salary for Aug", "Salary"),
        BankT("Payment from betking","Betting"),
        BankT("Purchase from Shoprite","Food"),
    )
    val simpleTok = SimpleTokenizer()
    val xData = data.map { it.narration }.map(simpleTok::split)
    val yData = data.map { it.category!! }.mapToIndex()
    val y = mk.ndarray(yData, yData.size, 1)
    val tfidf = TfIdfVectorizer<Float>()
    val lr = com.londogard.nlp.meachinelearning.predictors.classifiers.LogisticRegression()
    val transformedData = tfidf.fitTransform(xData)
    val time = measureTime {
        lr.fit(transformedData, y)
    }
    println("Fitting: $time")

    val nar = xData[2]
    val list = listOf(nar)
    val mx = tfidf.transform(list)
    val prediction = lr.predict(mx).first()
    println("Predicted label is: $prediction. This corresponds to class. ${labelsMap[prediction]}")



}
@Hampus Londögård there you go. I have a feeling I did something wrong. Or can't it be used as a multi-class predictor?
h
What does mapToIndex do?
a
It maps each category to its index. If a BankT category is Bank Charges, it replaces it with 0.
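The mapToIndex helper itself is never shown in this thread, so the following is only a guess at what it might look like, based on the description above. The explicit labels parameter and the error on unknown categories are assumptions made for illustration, not the poster's actual code.

```kotlin
// Label table copied from the snippet earlier in the thread.
val labelsMap = mapOf(
    0 to "Bank Charges", 1 to "Betting", 2 to "Card fees", 3 to "Food",
    4 to "Lifestyle", 5 to "Loan", 6 to "Reversal", 7 to "Salary",
    8 to "Unknown", 9 to "Utilities & Bills", 10 to "Withdrawal"
)

// Hypothetical reconstruction of mapToIndex: replaces each category name
// with its integer index. Taking the label table as a parameter is an
// assumption; the original helper takes no arguments.
fun List<String>.mapToIndex(labels: Map<Int, String>): List<Int> {
    // Invert index -> name into name -> index once, then look up each name.
    val indexByName = labels.entries.associate { (index, name) -> name to index }
    return map { name -> indexByName[name] ?: error("Unknown category: $name") }
}

fun main() {
    val categories = listOf("Bank Charges", "Loan", "Salary")
    println(categories.mapToIndex(labelsMap)) // prints [0, 5, 7]
}
```

Failing loudly on an unknown name is a deliberate choice here: silently defaulting to 0 would hide labelling mistakes behind the "Bank Charges" class.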
h
Running this I get
Fitting: 785.423920ms
Predicted label is: 0. This corresponds to class. Bank Charges
a
Yeah, try a different text.
h
Currently LogisticRegression only supports binary classes. It's been a while since I wrote it, so I had forgotten about that. I can make an effort to make it multiclass if that's something of interest 👍
By changing one line I could make it multi-class, supporting classes in the format of one-hot encodings, i.e. class 1 = [0, 1, 0, ..., 0], class 5 = [0, 0, 0, 0, 0, 1, 0, ...]
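To make the one-hot format concrete, here is a minimal plain-Kotlin sketch of encoding a class index and decoding it again with an argmax. The names oneHot and argMax are made up for this illustration and are not part of londogard-nlp-toolkit.

```kotlin
// Encode a class index as a one-hot vector of length numClasses.
// Illustrative helper only, not a library function.
fun oneHot(classIndex: Int, numClasses: Int): IntArray {
    require(classIndex in 0 until numClasses) {
        "classIndex $classIndex is outside 0 until $numClasses"
    }
    return IntArray(numClasses) { i -> if (i == classIndex) 1 else 0 }
}

// Decode a vector of per-class scores back to a class index by taking
// the position of the largest score (the first one wins on ties).
fun argMax(scores: FloatArray): Int {
    require(scores.isNotEmpty()) { "scores must not be empty" }
    return scores.indices.maxByOrNull { scores[it] }!!
}

fun main() {
    println(oneHot(5, 11).toList())             // prints [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    println(argMax(floatArrayOf(0.1f, 0.7f, 0.2f))) // prints 1
}
```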
a
I'd be interested.
The Naive Bayes classifier throws an exception regarding multi-class usage.
But it'd be great if you implemented multi-class.
h
Running with my small changes I get the following:
Fitting: 651.062949ms
Predicted label is: [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
Would you prefer a one-hot-encoding or simple-class? E.g. should it be one-hot-encoding as input or the true class (1,2,3)?
a
True class
h
I'll have something up tomorrow night most likely in a 1.2.1-BETA release
a
Quick question: how do classifiers like LogisticRegression and NaiveBayes compare to a neural network? What are your opinions?
h
Usually much weaker, but they're good baselines. Much faster too! LinearSVC in scikit-learn with TF-IDF using bigrams is usually a really powerful classifier though. Unfortunately I don't have that implemented currently.
A 1.2.0-BETA2 release has been drafted; it should be live within a few hours. This is a beta, so the API could still change. You can see the current implementation at https://github.com/londogard/londogard-nlp-toolkit/blob/main/src/test/kotlin/com/londogard/nlp/machinelearning/ClassifierTest.kt#L48
Important: if you wish to use the "simple API" with the true class, apply
val lr = LogisticRegression().asAutoOneHotClassifier()
a
Okay noted
The weird thing is that when I add the lib to IntelliJ (Gradle), the classes and sources are empty until I import the jar file.
Also, when you predict using tfidf.transform() and then predictSimple(), the output is always zero.
h
I’ve only used Gradle to import libraries, which worked the last time I tested. Are you running implementation("com.londogard:nlp:1.2.0-BETA2")?
"Also, when you predict using tfidf.transform() and then predictSimple(), the output is always zero."
Did you look at the test I created? What’s the difference between your use-case and the one I did?
a
Yeah, it's an issue with my IDE. Sorry to bother you.
h
No worries 🙂
a
About the prediction: I looked at your test. The difference is that you are predicting using the result from tfidf.fitTransform(), but I'll be predicting using the result from tfidf.transform(), which always returns 0.
h
That shouldn’t really matter, unless you have new data that the model hasn’t been trained on. This type of model can’t handle unseen data.
a
I tested with trained data actually
h
I’ll validate this weekend and give you a notebook
val labelsMap = mapOf(
    0 to "Bank Charges",
    1 to "Betting",
    2 to "Card fees",
    3 to "Food",
    4 to "Lifestyle",
    5 to "Loan",
    6 to "Reversal",
    7 to "Salary",
    8 to "Unknown",
    9 to "Utilities & Bills",
    10 to "Withdrawal"
)
val reversedLabelMap = labelsMap.asSequence().map { it.value to it.key }.toMap()

val (data, categories) = listOf(
    "Vat amount charges" to "Bank Charges",
    "Loan payment credit" to "Loan",
    "Salary for Aug" to "Salary",
    "Payment from betking" to "Betting",
    "Purchase from Shoprite" to "Food",
).unzip()
val simpleTok = SimpleTokenizer()
val xData = data.map(simpleTok::split)
val yList = categories.map { category -> reversedLabelMap.getOrDefault(category, 0) }
val y = mk.ndarray(yList)

val tfidf = TfIdfVectorizer<Float>()
val lr = LogisticRegression().asAutoOneHotClassifier()

val transformedData = tfidf.fitTransform(xData)
lr.fit(transformedData, y)

lr.predictSimple(tfidf.transform(xData)) shouldBeEqualTo lr.predictSimple(transformedData)
lr.predictSimple(transformedData) shouldBeEqualTo y
works as expected