UBot Underground

Scraping Text Off Websites Without Using Selectors?



I want to use Google to find top-ranking websites and then scrape the text off those websites. The problem is that every website is different, so using scrape attribute requires an attribute that will be common across every possible website: I don't see how that could work. I tried using page scrape, but again, you need tags on either side of the required content, and those will differ from site to site. I tried using tags like <p> and </p>, but the results were messy.

 

So instead I've looked at using meta sites which convert any website into plain text, which can then be scraped. Textise.com and TextMirror.com can do this, but again, it's not ideal.

 

Am I missing a trick somewhere? Does anyone have ideas about how this can be done?

 

Thanks

Steve


Welcome to the wonderful world of Google!

 

We have all encountered the manipulations that Google imposes.  Basically, you just have to plan for the worst and hope for the best.

 

Way back when I used to scrape Google, I would document the variations of their changes and then adapt my scrape to each particular change they made.  I then learned that I could search for each pattern and then apply my scraping code to the target area.  It worked pretty well.
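The "document each variation, then match the pattern" idea above can be sketched like this (outside UBot, in Python; the regex patterns here are made up for illustration, not real Google markup):

```python
import re

# One regex per documented layout variation; try each in turn.
# These two patterns are hypothetical examples, not actual Google HTML.
PATTERNS = [
    re.compile(r'<h3 class="r"><a href="([^"]+)"'),        # assumed older layout
    re.compile(r'<div class="yuRUbf"><a href="([^"]+)"'),  # assumed newer layout
]

def scrape_links(page_source):
    """Return the links found by the first pattern that matches."""
    for pattern in PATTERNS:
        links = pattern.findall(page_source)
        if links:  # first matching variation wins
            return links
    return []

sample = '<div class="yuRUbf"><a href="https://example.com/">Example</a></div>'
print(scrape_links(sample))  # ['https://example.com/']
```

The point is the structure, not the patterns: when the layout changes, you add one new pattern instead of rewriting the whole scrape.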

 

That's what I used to do.

 

Buddy


Thanks @UBotBuddy - true words!

But I haven't been clear enough in my original post. What I meant was that I want to find high-ranking websites, then navigate to the page that Google is pointing to, and scrape the text off that page. I think I can get to the website easily enough; it's the scraping of the pages that is the problem, because they could be Wikipedia pages, YouTube pages, news sites - anything.

I guess what I'm looking to do is like selecting the whole page and pasting it into Notepad, so that all that's left is text. Textise seems to work, and currently my favourite is https://www.w3.org/services/html2txt

 

Neither is ideal, though, and both need a fair amount of messing around. That's why I wondered if there was a smarter approach.

 

Steve


When you load a website, its full page source, including the text, appears in the document text parameter, so all you need to do is strip the HTML tags and you will get all the text that appears on the page (unless it uses frames). You can also limit it to the text inside the <body> tag.

But obviously you will also get script data and other functional code, which you will need to remove as well, such as everything between <script> tags and so on.

I think using text converters is a good idea; otherwise you need to build a parser to get the inner text between the specific tags that may contain text.
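For anyone wanting to try the parser route described above, here is a minimal sketch (outside UBot, in Python's standard library) that strips tags, keeps only text inside <body>, and skips <script>/<style> blocks. Real pages (frames, comments, malformed markup) will need more care:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text: inside <body>, outside <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.in_body = False   # only collect text once we're inside <body>
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

def html_to_text(source):
    parser = TextExtractor()
    parser.feed(source)
    return "\n".join(parser.chunks)

html = ('<html><head><title>t</title></head>'
        '<body><p>Hello</p><script>var x=1;</script><p>World</p></body></html>')
print(html_to_text(html))  # prints "Hello" then "World"
```

Note how the <title> text and the script body are both dropped - that is exactly the "functions data and other functionality code" problem mentioned above.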


I would love for UBS to have a text-based browser, but it doesn't.  I advocated for such a browser, but at the time I believe I was the only one who voted for it.  A company cannot really justify creating something like that for one customer.

 

Since then, I have had several bots that needed that capability, but with the techniques needed to strip out the non-text (functions, images, CSS code and even random HTML), the overhead just got to be too much, so I abandoned the projects.

 

In my case, I wanted to analyze the text (articles/posts) and process the data in such a way that I could make educated predictions.  Done manually, my predictions were 85% accurate.  Very encouraging, to say the least.  But I honestly could not do that amount of data research using the browsers within UBot.  The scripting aspect was working great!  But the scraped data was too convoluted, and that is where I got stuck, cleaning the crap out of the text.

 

After reading your post and seeing what you have tried, I may resurrect my project and try Textise.com and TextMirror.com.

 

But this time I'll employ Ex Browser.

 

If I can get reasonably clean data from across the many websites I was targeting, then I can start feeding my database.

 

That's it from me.

 

Buddy


You can identify common containers - WordPress sites, for example, will probably all use some common container like id="post" or something similar.

 

Also, you can try removing elements labelled "footer", "header" or "navigation".

 

One thing I would do is check out some reader browser extensions and see how they do it.

 

And another thing: you can look for a feed instead; that way the article will be formatted via RSS.
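The feed idea above is easy to automate: many sites advertise their feed with a <link rel="alternate" type="application/rss+xml"> tag in the page head. A minimal Python sketch (again outside UBot) of discovering it:

```python
from html.parser import HTMLParser

class FeedFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags pointing at RSS/Atom feeds."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in ("application/rss+xml",
                                      "application/atom+xml")):
            self.feeds.append(a.get("href"))

def find_feeds(page_source):
    finder = FeedFinder()
    finder.feed(page_source)
    return finder.feeds

head = '<head><link rel="alternate" type="application/rss+xml" href="/feed/"></head>'
print(find_feeds(head))  # ['/feed/']
```

Once you have the feed URL, the article content comes back as structured XML instead of arbitrary page HTML, which sidesteps the whole selector problem.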

