UBot Underground

Scraping Text Off Websites Without Using Selectors?



I want to use Google to find top-ranking websites and then scrape the text off those websites. The problem is that every website is different, so using scrape attribute requires an attribute that will be common across every possible website: I don't see how that could work. I tried using page scrape, but again, you need tags on either side of the required content, and those will differ from site to site. I tried using tags like <p> and </p>, but the results were messy.

 

So instead I've looked at using meta sites which convert any website into plain text, which can then be scraped. Textise.com and TextMirror.com can do this, but again, it's not ideal.

 

Am I missing a trick somewhere? Does anyone have ideas about how this can be done?

 

Thanks

Steve


Welcome to the wonderful world of Google!

 

We have all encountered the manipulations that Google imposes.  Basically, you just have to plan for the worst and hope for the best.

 

Way back when I used to scrape Google, I would document the variations of their changes and then adapt my scrape to each particular change they made.  I then learned that I could search for each pattern and then apply my scraping code to the target area.  It worked pretty well.
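The "document each variation, then match the pattern" idea above can be sketched like this (outside UBot, in Python; the regex patterns here are made up for illustration, not real Google markup):

```python
import re

# One regex per documented layout variation; try each in turn.
# These two patterns are hypothetical examples, not actual Google HTML.
PATTERNS = [
    re.compile(r'<h3 class="r"><a href="([^"]+)"'),        # assumed older layout
    re.compile(r'<div class="yuRUbf"><a href="([^"]+)"'),  # assumed newer layout
]

def scrape_links(page_source):
    """Return the links found by the first pattern that matches."""
    for pattern in PATTERNS:
        links = pattern.findall(page_source)
        if links:  # first matching variation wins
            return links
    return []

sample = '<div class="yuRUbf"><a href="https://example.com/">Example</a></div>'
print(scrape_links(sample))  # ['https://example.com/']
```

The point is the structure, not the patterns: when the layout changes, you add one new pattern instead of rewriting the whole scrape.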

 

That's what I used to do.

 

Buddy


Thanks @UBotBuddy - true words!

But I haven't been clear enough in my original post. What I meant was that I want to find high-ranking websites, then navigate to the page that Google is pointing to, and scrape the text off that page. I think I can get to the website easily enough; it's the scraping of the pages that is the problem, because they could be Wikipedia pages, YouTube pages, news sites - anything.

I guess what I'm looking to do is like selecting the whole page and pasting it into Notepad, so that all that's left is text. Textise seems to work, and currently my favourite is https://www.w3.org/services/html2txt

 

Neither is ideal, though, and both need a fair amount of messing around. That's why I wondered if there was a smarter approach.

 

Steve


When you load a website, its full page source, including the text, appears in the document text parameter, so all you need to do is strip the HTML tags and you will get all the text that appears on the page (unless it uses frames). You can also limit it to the text inside the <body> tag.

But obviously you will also get script data and other functional code, which you will need to remove as well, such as everything between <script> tags and so on.

I think using text converters is a good idea; otherwise you need to build a parser to get the inner text between the specific tags that may contain text.
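For anyone wanting to try the parser route described above, here is a minimal sketch (outside UBot, in Python's standard library) that strips tags, keeps only text inside <body>, and skips <script>/<style> blocks. Real pages (frames, comments, malformed markup) will need more care:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text: inside <body>, outside <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.in_body = False   # only collect text once we're inside <body>
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

def html_to_text(source):
    parser = TextExtractor()
    parser.feed(source)
    return "\n".join(parser.chunks)

html = ('<html><head><title>t</title></head>'
        '<body><p>Hello</p><script>var x=1;</script><p>World</p></body></html>')
print(html_to_text(html))  # prints "Hello" then "World"
```

Note how the <title> text and the script body are both dropped - that is exactly the "functions data and other functionality code" problem mentioned above.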


I would love for UBS to have a text-based browser, but it doesn't.  I advocated for such a browser, but at the time I believe I was the only one who voted for it.  A company cannot really justify creating something like that for one customer.

 

Since then, I have had several bots that needed that capability, but with the techniques needed to strip out the non-text (functions, images, CSS code and even random HTML), the overhead just got to be too much, so I abandoned the projects.

 

In my case, I wanted to analyze the text (articles/posts) and process the data in such a way that I could make educated predictions.  Done manually, my predictions were 85% accurate.  Very encouraging, to say the least.  But I honestly could not do that amount of data research using the browsers within UBot.  The scripting aspect was working great!  But the scraped data was too convoluted, and that is where I got stuck, cleaning the crap out of the text.

 

After reading your post and seeing what you have tried, I may resurrect my project and try Textise.com and TextMirror.com.

 

But this time I'll employ Ex Browser.

 

If I can get reasonably clean data from across the many websites I was targeting, then I can start feeding my database.

 

That's it from me.

 

Buddy


You can identify common containers - WordPress sites, for example, will probably all use some common container like id="post" or something similar.

 

Also, you can try removing elements labelled "footer", "header" or "navigation".

 

One thing I would do is check out some reader browser extensions and see how they do it.

 

And another thing: you can look for a feed instead; that way the article will be formatted via RSS.
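The feed idea above is easy to automate: many sites advertise their feed with a <link rel="alternate" type="application/rss+xml"> tag in the page head. A minimal Python sketch (again outside UBot) of discovering it:

```python
from html.parser import HTMLParser

class FeedFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags pointing at RSS/Atom feeds."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in ("application/rss+xml",
                                      "application/atom+xml")):
            self.feeds.append(a.get("href"))

def find_feeds(page_source):
    finder = FeedFinder()
    finder.feed(page_source)
    return finder.feeds

head = '<head><link rel="alternate" type="application/rss+xml" href="/feed/"></head>'
print(find_feeds(head))  # ['/feed/']
```

Once you have the feed URL, the article content comes back as structured XML instead of arbitrary page HTML, which sidesteps the whole selector problem.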

