londogard-nlp-toolkit | kotlinlang #datascience

Hampus Londögård

08/28/2022, 6:53 PM

🚀 londogard-nlp-toolkit (https://github.com/londogard/londogard-nlp-toolkit) just merged initial support for HuggingFace Transformers and other Transformer-based models:
• ✅ PyTorch through JIT-saved TorchScript models
• ✅ ONNX models directly through the hub, e.g. TokenClassificationPipeline.create("optimum/bert-base-NER")
  ◦ Where optimum/bert-base-NER is a model on the HuggingFace Hub
• ✅ Load both PyTorch (TorchScript) & ONNX models through a local path
• ClassificationPipeline and TokenClassificationPipeline exist
  ◦ See the following test for some examples of how to use them

A 1.2.0-BETA release has been cut!

ayodele

08/30/2022, 5:11 PM

I tried to use LogisticRegression, but it always returns index 1. I followed the example in your ClassifierTest.

Hampus Londögård

08/30/2022, 5:15 PM

So the test fails? That seems odd, as the test case is still passing on CI/CD. Could you provide your sample code?

ayodele

08/30/2022, 5:30 PM

@OptIn(ExperimentalTime::class)
fun logisticTest() {
    val labelsMap = mapOf(
        0 to "Bank Charges",
        1 to "Betting",
        2 to "Card fees",
        3 to "Food",
        4 to "Lifestyle",
        5 to "Loan",
        6 to "Reversal",
        7 to "Salary",
        8 to "Unknown",
        9 to "Utilities & Bills",
        10 to "Withdrawal"
    )
    val data = listOf(
        BankT("Vat amount charges", "Bank Charges"),
        BankT("Loan payment credit", "Loan"),
        BankT("Salary for Aug", "Salary"),
        BankT("Payment from betking","Betting"),
        BankT("Purchase from Shoprite","Food"),
    )
    val simpleTok = SimpleTokenizer()
    val xData = data.map { it.narration }.map(simpleTok::split)
    val yData = data.map { it.category!! }.mapToIndex()
    val y = mk.ndarray(yData, yData.size, 1)
    val tfidf = TfIdfVectorizer<Float>()
    val lr = com.londogard.nlp.meachinelearning.predictors.classifiers.LogisticRegression()
    val transformedData = tfidf.fitTransform(xData)
    val time = measureTime {
        lr.fit(transformedData, y)
    }
    println("Fitting: $time")

    val nar = xData[2]
    val list = listOf(nar)
    val mx = tfidf.transform(list)
    val prediction = lr.predict(mx).first()
    println("Predicted label is: $prediction. This corresponds to class. ${labelsMap[prediction]}")



}

ayodele

08/30/2022, 5:32 PM

@Hampus Londögård there you go. I have a feeling I did something wrong. Or can't it be used as a multi-class predictor?

Hampus Londögård

08/30/2022, 5:35 PM

What does mapToIndex do?

ayodele

08/30/2022, 5:37 PM

It maps each category to its index. If a BankT category is Bank Charges, it replaces it with its index.

Hampus Londögård

08/30/2022, 5:42 PM

Running this I get

Fitting: 785.423920ms

Predicted label is: 0. This corresponds to class. Bank Charges

ayodele

08/30/2022, 5:44 PM

Yeah, try a different text.

Hampus Londögård

08/30/2022, 5:51 PM

Currently LogisticRegression only supports binary classes. I wrote it a while back, so I had forgotten about that. I can work on making it multi-class if that's of interest 👍
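To illustrate why a binary classifier cannot produce eleven distinct labels, here is a minimal sketch of a binary logistic decision rule. It is a generic textbook version, not the toolkit's implementation, and the names sigmoid and predictBinary are hypothetical. Thresholding a single probability can only ever yield 0 or 1, whatever label indices were passed in during training.

```kotlin
import kotlin.math.exp

// Logistic (sigmoid) function: squashes any real number into (0, 1).
fun sigmoid(z: Double): Double = 1.0 / (1.0 + exp(-z))

// Hypothetical binary decision rule: one weight vector, one probability, one threshold.
// The result is always 0 or 1, so class indices such as 5 or 7 can never be predicted.
fun predictBinary(weights: DoubleArray, bias: Double, x: DoubleArray): Int {
    require(weights.size == x.size) { "weights and features must have the same length" }
    val z = weights.indices.sumOf { weights[it] * x[it] } + bias
    return if (sigmoid(z) >= 0.5) 1 else 0
}

fun main() {
    val weights = doubleArrayOf(1.0)
    println(predictBinary(weights, 0.0, doubleArrayOf(2.0)))
    println(predictBinary(weights, 0.0, doubleArrayOf(-2.0)))
}
```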

Hampus Londögård

08/30/2022, 5:55 PM

By changing one line I could make it multi-class, supporting classes in the format of one-hot encodings, i.e. class 1 = [0, 1, 0, ..., 0], class 5 = [0, 0, 0, 0, 0, 1, 0, ...]
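The one-hot format described here can be sketched as follows. This is a standalone illustration under the assumption of zero-based class indices. The helper names oneHot and fromOneHot are hypothetical and are not part of the toolkit.

```kotlin
// Encode a zero-based class index as a one-hot vector of length numClasses.
fun oneHot(classIndex: Int, numClasses: Int): IntArray {
    require(classIndex in 0 until numClasses) { "class index out of range" }
    return IntArray(numClasses) { i -> if (i == classIndex) 1 else 0 }
}

// Decode a one-hot (or score) vector back to a class index via argmax.
fun fromOneHot(encoded: IntArray): Int {
    require(encoded.isNotEmpty()) { "cannot decode an empty vector" }
    return encoded.indices.maxByOrNull { encoded[it] }!!
}

fun main() {
    // Class 1 out of 11 classes.
    println(oneHot(1, 11).joinToString(prefix = "[", postfix = "]"))
    // Round trip: encode class 5, then decode it again.
    println(fromOneHot(oneHot(5, 11)))
}
```

Decoding with argmax also works on real-valued score vectors, which is one way a "true class" API could sit on top of a one-hot one.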

ayodele

08/30/2022, 5:56 PM

I'll be interested.

ayodele

08/30/2022, 5:56 PM

The Naive Bayes classifier throws an exception regarding multi-class usage

ayodele

08/30/2022, 5:57 PM

But it'll be great if you implement multi-class

Hampus Londögård

08/30/2022, 5:58 PM

Running with my small changes I get the following:

Fitting: 651.062949ms
Predicted label is: [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]

Would you prefer one-hot encoding or a simple class? I.e. should the input be a one-hot encoding or the true class (1, 2, 3)?

ayodele

08/30/2022, 5:59 PM

True class

Hampus Londögård

08/30/2022, 7:17 PM

I'll have something up tomorrow night most likely in a 1.2.1-BETA release

ayodele

08/30/2022, 7:58 PM

Quick question: how do classifiers like LogisticRegression and NaiveBayes compare to a neural network? What's your opinion?

Hampus Londögård

08/31/2022, 4:07 AM

Usually much weaker, but they're good baselines, and much faster too! LinearSVC in scikit-learn with TF-IDF using bigrams is usually a really powerful classifier, though. Unfortunately I don't have that implemented currently.
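For readers unfamiliar with bigram features, the sketch below shows how a token list can be expanded into unigrams plus bigrams before vectorizing. It only uses the Kotlin standard library. The function names ngrams and unigramsAndBigrams are hypothetical, and nothing here reflects the toolkit's own API.

```kotlin
// Build word n-grams by sliding a window of size n over the tokens.
// windowed() drops partial windows by default, so short inputs give an empty list.
fun ngrams(tokens: List<String>, n: Int): List<String> {
    require(n >= 1) { "n must be at least 1" }
    return tokens.windowed(n).map { it.joinToString(" ") }
}

// Feature list similar in spirit to an n-gram range of (1, 2): unigrams followed by bigrams.
fun unigramsAndBigrams(tokens: List<String>): List<String> = tokens + ngrams(tokens, 2)

fun main() {
    println(unigramsAndBigrams(listOf("salary", "for", "aug")))
}
```

The expanded token lists could then be fed to any bag-of-words or TF-IDF vectorizer in place of the plain tokens.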

Hampus Londögård

08/31/2022, 4:42 AM

A 1.2.0-BETA2 release has been drafted; it should be live within a few hours. This is a beta and the API could change. You can see the current implementation at https://github.com/londogard/londogard-nlp-toolkit/blob/main/src/test/kotlin/com/londogard/nlp/machinelearning/ClassifierTest.kt#L48

Hampus Londögård

08/31/2022, 4:42 AM

Important: if you wish to use the "simple API" with true classes, apply

val lr = LogisticRegression().asAutoOneHotClassifier()

ayodele

08/31/2022, 8:12 AM

Okay noted

ayodele

09/01/2022, 7:16 AM

Weird thing is, when I add the lib to IntelliJ (via Gradle), the classes and sources are empty

ayodele

09/01/2022, 7:17 AM

Until I import the jar file

ayodele

09/01/2022, 8:12 AM

Also, when you predict using tfidf.transform() and then predictSimple(), the output is always zero

Hampus Londögård

09/01/2022, 8:33 AM

I’ve only used Gradle to import libraries, which worked last time I tested. Are you running

implementation("com.londogard:nlp:1.2.0-BETA2")

Hampus Londögård

09/01/2022, 8:33 AM

> Also, when you predict using tfidf.transform() and then predictSimple(), the output is always zero

Did you look at the test I created? What’s the difference between your use-case and the one I did?

ayodele

09/01/2022, 8:37 AM

Yeah, it's an issue with my IDE. Sorry to bother you

Hampus Londögård

09/01/2022, 8:37 AM

No worries 🙂

ayodele

09/01/2022, 8:40 AM

About the prediction: I looked at your test. The difference is that you are predicting using the result from tfidf.fitTransform(), but I'll be predicting using the result from tfidf.transform()

ayodele

09/01/2022, 8:41 AM

Which always returns zero

Hampus Londögård

09/01/2022, 8:41 AM

That shouldn’t really matter, unless you have new data that the model hasn’t been trained on. This type of model can’t handle unseen data
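The relationship between fitTransform() and transform() can be sketched with a toy vectorizer. ToyTfIdf below is a hypothetical, simplified stand-in written only for illustration. It is not the toolkit's TfIdfVectorizer, and its smoothed IDF formula is an assumption. The point is that transform() reuses the vocabulary learned during fitting, so it matches fitTransform() on the training data, while tokens never seen during fitting contribute nothing.

```kotlin
import kotlin.math.ln

// Toy TF-IDF vectorizer (hypothetical; for illustration only).
class ToyTfIdf {
    private var vocab: Map<String, Int> = emptyMap()
    private var idf: DoubleArray = DoubleArray(0)

    // Learn the vocabulary and smoothed inverse document frequencies.
    fun fit(docs: List<List<String>>) {
        vocab = docs.flatten().distinct().withIndex().associate { (i, tok) -> tok to i }
        idf = DoubleArray(vocab.size)
        for ((tok, i) in vocab) {
            val df = docs.count { tok in it }
            idf[i] = ln((1.0 + docs.size) / (1.0 + df)) + 1.0
        }
    }

    // Vectorize using the learned vocabulary; unknown tokens are skipped.
    fun transform(docs: List<List<String>>): List<DoubleArray> = docs.map { doc ->
        val row = DoubleArray(vocab.size)
        for (tok in doc) {
            val i = vocab[tok] ?: continue
            row[i] += idf[i] / doc.size
        }
        row
    }

    fun fitTransform(docs: List<List<String>>): List<DoubleArray> {
        fit(docs)
        return transform(docs)
    }
}

fun main() {
    val train = listOf(listOf("salary", "for", "aug"), listOf("loan", "payment", "credit"))
    val tfidf = ToyTfIdf()
    val fitted = tfidf.fitTransform(train)
    val again = tfidf.transform(train)
    // Same vectors for the training data either way.
    println(fitted.indices.all { fitted[it].contentEquals(again[it]) })
    // A document made only of unseen tokens becomes an all-zero vector.
    println(tfidf.transform(listOf(listOf("shoprite"))).single().all { it == 0.0 })
}
```

An all-zero input vector gives a linear model nothing to work with beyond its bias term, which is one plausible reading of "can't handle unseen data" here.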

ayodele

09/01/2022, 9:00 AM

I tested with trained data actually

Hampus Londögård

09/02/2022, 5:49 AM

I’ll validate this weekend and give you a notebook

Hampus Londögård

09/02/2022, 7:15 PM

// Label index -> category name, plus the reverse lookup used to encode the targets.
val labelsMap = mapOf(
    0 to "Bank Charges",
    1 to "Betting",
    2 to "Card fees",
    3 to "Food",
    4 to "Lifestyle",
    5 to "Loan",
    6 to "Reversal",
    7 to "Salary",
    8 to "Unknown",
    9 to "Utilities & Bills",
    10 to "Withdrawal"
)
val reversedLabelMap = labelsMap.entries.associate { (index, label) -> label to index }

val (data, categories) = listOf(
    "Vat amount charges" to "Bank Charges",
    "Loan payment credit" to "Loan",
    "Salary for Aug" to "Salary",
    "Payment from betking" to "Betting",
    "Purchase from Shoprite" to "Food",
).unzip()
val simpleTok = SimpleTokenizer()
val xData = data.map(simpleTok::split)
val yList = categories.map { category -> reversedLabelMap.getOrDefault(category, 0) }
val y = mk.ndarray(yList)

val tfidf = TfIdfVectorizer<Float>()
// Wrap the classifier so it accepts true class indices instead of one-hot vectors.
val lr = LogisticRegression().asAutoOneHotClassifier()

val transformedData = tfidf.fitTransform(xData)
lr.fit(transformedData, y)

// transform() on the training data should give the same predictions as fitTransform().
lr.predictSimple(tfidf.transform(xData)) shouldBeEqualTo lr.predictSimple(transformedData)
lr.predictSimple(transformedData) shouldBeEqualTo y

works as expected

Hampus Londögård

09/02/2022, 7:22 PM

https://datalore.jetbrains.com/view/notebook/mnH1wldy1w1UeQWXMFoxN6
