#skrape-it

xexiz

09/13/2021, 11:46 AM
Hi!! First of all, thanks @Christian Dräger for that pretty library, it really makes skraping much more fun hehe 🙂 However, I’ve been trying it with a site and I’m always getting timeouts (even if I override the timeout param to 30 sec). Are there any logs we can activate to get more info on what’s going on and why it always times out? Here’s what I’m trying for now (I did manage to get a successful call a few times, to make sure my selectors were OK):
private suspend fun getTotalPages(): Int =
    withContext(Dispatchers.IO) {
        skrape(AsyncFetcher) {
            request {
                url = "https://www.capfriendly.com/browse/active"
            }
            response {
                htmlDocument {
                    div {
                        withClass = "pagination"
                        findFirst {
                            div {
                                withClass = "r"
                                val paginationText = findByIndex(1).text
                                paginationText.substringAfter(" of ").toInt()
                            }
                        }
                    }
                }
            }
        }
    }
Environment: kotlin 1.5.30, AGP 7.1.0-alpha11, API 31, skrapeit:1.1.5. Thanks!

Christian Dräger

09/14/2021, 6:37 PM
Glad you like it. I will try it out tomorrow and see if I can find something :)
I did a quick check by putting your code in a JUnit test and it's working fine for me:
@Test
    fun `can get total pages`() {
        runBlocking {
            withContext(Dispatchers.IO) {
                val totalPages = skrape(AsyncFetcher) {
                    request {
                        url = "https://www.capfriendly.com/browse/active"
                    }
                    response {
                        htmlDocument {
                            div {
                                withClass = "pagination"
                                findFirst {
                                    div {
                                        withClass = "r"
                                        val paginationText = findByIndex(1).text
                                        paginationText.substringAfter(" of ").toInt()
                                    }
                                }
                            }
                        }
                    }
                }

                println(totalPages)
            }
        }
    }
To avoid the string parsing (the link text will probably change more frequently than its attributes), I would maybe do something like this:
@Test
    fun `can get total pages`() = runBlocking {
        withContext(Dispatchers.IO) {
            val totalPages = skrape(AsyncFetcher) {
                request {
                    url = "https://www.capfriendly.com/browse/active"
                }
                response {
                    htmlDocument {
                        div {
                            withClass = "pagination"
                            findFirst {
                                div {
                                    withClass = "r"
                                    a {
                                        findAll { find { it.text == "Last" }?.attribute("data-val") }
                                    }
                                }
                            }
                        }
                    }
                }
            }

            println(totalPages)
        }
    }
👍 1
Just to make sure, have you checked the troubleshooting section of the Android example? https://github.com/skrapeit/skrape.it/tree/master/examples/android#troubleshooting
I will try to build a little Android app to check. It seems to be something Android-specific, since the example is running fine on the JVM server side.

xexiz

09/15/2021, 2:23 PM
Thanks for the suggestion. I reused most of the Android example and adapted it a bit to fetch the total number of pages and then fetch all 30 pages of this list. I also bumped all library versions.
It seems it might be the website that’s causing the problems. I just tried with IMDb to display a list of movies and it’s working in my project, but not with capfriendly. No idea why though.
private suspend fun fetchImdb(): List<User> =
    withContext(Dispatchers.IO) {
        skrape(AsyncFetcher) {
            request {
                url = "https://www.imdb.com/chart/top/"
                sslRelaxed = true
            }.also { println("call ${it.preparedRequest.url}") }
            response {
                htmlDocument {
                    table {
                        tbody {
                            withClass = "lister-list"
                            tr {
                                findAll {
                                    map {
                                        val title = it.td { findSecond { text } }
                                        User(name = title, "", "")
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
Yeah, the website really has something skrape{it} doesn’t like. I ran the same loop against another website and I can fetch all 8005 players:
private suspend fun fetchDB(): List<User> {
    var players = listOf<User>()
    withContext(Dispatchers.IO) {
        val deferred = ('a'..'z').filterNot { it == 'x' }.map { async { getHockeyDb(it) } }
        players = deferred.awaitAll().flatten()
    }
    println("players total: ${players.size}")
    println("players 5: ${players[5]}")
    println("players 100: ${players[100]}")
    println("players 250: ${players[250]}")
    println("players 500: ${players[500]}")
    println("players 5000: ${players[5000]}")
    return players
}

private suspend fun getHockeyDb(letter: Char): List<User> {
    return withContext(Dispatchers.IO) {
        skrape(AsyncFetcher) {
            request {
                url = "https://www.hockeydb.com/ihdb/players/player_ind_$letter.html"
                sslRelaxed = true
            }.also { println("call ${it.preparedRequest.url}") }
            response {
                htmlDocument {
                    table {
                        tbody {
                            tr {
                                findAll {
                                    map {
                                        val name = it.td { it.a { findFirst { text } } }
                                        val team = it.td { findByIndex(1) { text } }
                                        val salary = it.td { findLast { text } }
                                        User(name, team, salary)
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
Have you had time to take a look yet @Christian Dräger?

Christian Dräger

09/17/2021, 6:36 PM
Sorry, I haven't managed to take a deeper look so far. But from your examples it really looks like it has something to do with that particular URL. From that point of view it would be interesting to find the reason, since it could be a bug in skrape{it}. There is currently another open issue related to some HTTPS URL and Android. The author of the issue assumes it has something to do with self-signed or invalid TLS certificates: https://github.com/skrapeit/skrape.it/issues/162 I couldn't reproduce that one either, and as with your error, I haven't found the time to really test on Android. Maybe it is the same error or at least related 🤷‍♂️ but I can't say much so far.
👍 1

xexiz

09/17/2021, 6:50 PM
All right, thanks. While playing with the previously mentioned example (using www.hockeydb.com) I also noticed that it seems to work only on API 29+. I’ll try to isolate the problem and open a bug if I find more info, but the HTTP client seems to be responsible: it hasn't worked on API 25, 26, 27, or 28 so far. I’ll try to bump ktor and all the other dependencies in your lib; using the latest versions of 3rd-party libraries is always a good first guess.
Interestingly enough, I just rebuilt your library and swapped the Apache client used in the AsyncFetcher with the OkHttp client (from ktor-client-okhttp, https://ktor.io/docs/http-client-engines.html#okhttp) and all my problems are fixed. I can now fetch all 1465 players from capfriendly.com, and it works on all the Android versions I previously mentioned, not only on API 29+ 🙂
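For readers following along, here is a minimal sketch of what selecting the OkHttp engine looks like with Ktor's client API in general. It is not the actual change made to skrape{it}'s AsyncFetcher: the function name `buildOkHttpClient`, the 30-second default, and the redirect setting are assumptions for illustration, and it presumes the `ktor-client-okhttp` artifact is on the classpath.

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.engine.okhttp.OkHttp
import java.util.concurrent.TimeUnit

// Hypothetical helper (not part of skrape{it}): builds a Ktor client
// backed by the OkHttp engine instead of the Apache engine.
fun buildOkHttpClient(timeoutSeconds: Long = 30): HttpClient =
    HttpClient(OkHttp) {
        engine {
            // `config` exposes the underlying OkHttpClient.Builder.
            config {
                connectTimeout(timeoutSeconds, TimeUnit.SECONDS)
                readTimeout(timeoutSeconds, TimeUnit.SECONDS)
                followRedirects(true)
            }
        }
    }

fun main() {
    val client = buildOkHttpClient()
    // Requests would be issued here; close the client to release
    // the engine's threads and connections when done.
    client.close()
}
```

The request call itself is left out on purpose, since its shape differs between Ktor 1.x and 2.x; only the engine selection is shown.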

Christian Dräger

09/24/2021, 5:24 PM
woooho
Thx for investigating, mate. OK, I think it would make sense to change the default implementations of HttpFetcher and AsyncFetcher to use ktor-client-okhttp instead of Apache 🙂
👍 1
Since I am running short on time these days, would you be open to sending a PR (since you already have the code anyway)? 🙂 That way we can fix it upstream for everyone. Would be super awesome.

xexiz

09/24/2021, 6:00 PM
Yes, I could. I had removed everything authentication- or proxy-related though, so I’ll need to figure out what to do with that. I would need to get more familiar with the project to make sure swapping it doesn’t break other stuff. So are you more in favor of totally replacing the Apache implementation with the OkHttp one, or of supporting both so that users can decide which one they want? And what about the BrowserFetcher? In the Android world, OkHttp is pretty much the standard and most supported client, so it would make sense to make it the default, except that originally I think your lib was mostly for unit tests, right?

Christian Dräger

09/25/2021, 7:36 PM
It started out being for unit testing, but why not support Android as well as possible. Since I am not into Android development (building backends and web frontends is my daily bread and butter ^^), it's sometimes hard for me to keep up with such Android support topics :D Since OkHttp works perfectly fine on servers and in unit tests, I think it's a good idea to just swap the implementation. In general, everything crucial should already be covered by tests, so to verify that the new implementation works it should be enough to run them. The BrowserFetcher is somewhat special because it sticks to HtmlUnit (and will have to in the future) to support the JS rendering thingy. Not sure if it causes trouble on Android as well, or what to do to support BrowserFetcher on Android. If you don't feel comfortable with the authentication (which is more like beta currently anyway) and the proxy, you could just leave them open and I could add them to the PR :)