Scraping articles from webpages

General Lee · November 6, 2012

I'm building a bot that visits URLs from a list and scrapes the articles published there. The websites are all different, but I only want to scrape the article text. Does anyone know of a good way to grab just the text?

Thanks!!

a2mateit · November 6, 2012

What websites are you trying to scrape from?

A little more info would make it easier to help you out.

General Lee · November 6, 2012

That's the thing, they're all different. Some are WordPress, some are other HTML templates; it's a huge variety. I thought maybe there was a simple generic attribute for scraping text, but I'm still searching.

gabel · November 6, 2012

Not really. Different sites format their text differently, so you'll have to write code for each one (or one piece of code per platform, for sites that share a platform).

General Lee · November 7, 2012

Thanks for the replies, I'll keep tinkering.

Aymen · November 7, 2012

You can't scrape the text unless you know which div tags you are scraping from. You can write an if statement that covers all the possible platforms, then scrape based on the platform you detect.
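A minimal Python sketch of that per-platform idea, using only the standard library. It guesses the platform from the page's `<meta name="generator">` tag and then collects the text of one container div. The platform names, the `CONTAINER_CLASS` mapping, and the class names in it are assumptions for illustration; real themes vary, so check them against the actual pages.

```python
import re
from html.parser import HTMLParser

# Hypothetical mapping: platform -> class of the <div> wrapping the article.
# These class names are assumptions and differ between themes.
CONTAINER_CLASS = {
    "wordpress": "entry-content",
    "blogger": "post-body",
}


def detect_platform(html):
    """Guess the platform from <meta name="generator" content="...">.

    Only handles the attribute order name-then-content; returns None
    when the tag is missing or the platform is not in CONTAINER_CLASS.
    """
    match = re.search(
        r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)',
        html,
        re.IGNORECASE,
    )
    if not match:
        return None
    generator = match.group(1).lower()
    for platform in CONTAINER_CLASS:
        if platform in generator:
            return platform
    return None


class ContainerText(HTMLParser):
    """Collects text inside the first <div> whose class list has target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # 0 = outside the container, >0 = div nesting inside it
        self.done = False   # True once the container has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def scrape_article(html):
    """Return the container text for a recognised platform, else None."""
    platform = detect_platform(html)
    if platform is None:
        return None
    parser = ContainerText(CONTAINER_CLASS[platform])
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Adding a platform then means adding one entry to the mapping rather than another branch of the if statement. Pages with no generator tag return `None` and need some other handling.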

Kev · November 7, 2012

Perhaps try a website that provides read-only versions of websites? That way you might find it easier to strip out the text.

Some sources: http://howto.cnet.com/8301-11310_39-20089311-285/how-to-view-text-only-versions-of-web-sites/
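If an external text-only service is not an option, a rough local approximation is to strip the markup yourself. The sketch below is an assumption about what "text-only" could mean here, not what the linked article describes: it drops `script`, `style`, and `noscript` content and keeps every other text node. The names `TextOnly` and `to_text_only` are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps visible text nodes and skips script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a skipped element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.lines.append(text)


def to_text_only(html):
    """Return the page's text nodes, one per line, with markup removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

This still includes menus, footers, and other non-article text, so it only makes the stripping step easier; it does not pick out the article by itself.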

General Lee · November 7, 2012

Aymen wrote: "You can't scrape the text unless you know which div tags you are scraping from. You can write an if statement that covers all the possible platforms, then scrape based on the platform you detect."

Yeah, that's what I'm currently doing: scraping tags with an if-exists check for stuff like * * and *

It's grabbing a lot of data this way, but it's still missing tons of articles and data.

I'll keep trying. Getting closer, though.
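For pages where none of the known tags exist, one generic fallback is to collect every `<p>` element and keep only the longer ones, on the assumption that article paragraphs tend to be longer than menu or share-button text. This is a heuristic sketch, not a reliable extractor; `min_words` is an arbitrary threshold chosen for illustration, and the tags from the post above are not reproduced because they were lost from the thread.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element, including inline children."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()      # an unclosed <p> ends where the next one starts
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)

    def flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []
        self.in_p = False


def article_paragraphs(html, min_words=8):
    """Return <p> texts with at least min_words words (arbitrary cutoff)."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [p for p in parser.paragraphs if len(p.split()) >= min_words]
```

Running the platform-specific checks first and falling back to this only when they find nothing keeps the precise rules in charge where they work.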

General Lee · November 7, 2012

Kev wrote: "Perhaps try a website that provides read-only versions of websites? That way you might find it easier to strip out the text. Some sources: http://howto.cnet.co...s-of-web-sites/"

You know, you just might be on to something there. Thanks!

Aymen · November 13, 2012

General Lee wrote: "Yeah, that's what I'm currently doing: scraping tags with an if-exists check for stuff like * * and * It's grabbing a lot of data this way, but it's still missing tons of articles and data. I'll keep trying. Getting closer, though."

Well, I've tested it with an ezine article and it worked just great.

scraperdev · November 16, 2012

I don't know whether UBot is capable of editing the scraper's regular expression so that it scrapes the correct tags, etc.
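Outside UBot, the regular-expression approach might look like the Python sketch below. It is not UBot syntax, and the tag list (`h1`, `h2`, `p`) is an assumption chosen for illustration. Regular expressions are fragile on real HTML (nested or unclosed tags break them), so treat this as a quick experiment rather than a robust scraper.

```python
import re

# Matches <h1>, <h2> or <p> elements and captures their inner HTML.
# \b stops "<p" from matching "<pre>"; (?P=tag) requires the same closing tag.
TAG_RE = re.compile(
    r"<(?P<tag>h1|h2|p)\b[^>]*>(?P<body>.*?)</(?P=tag)>",
    re.IGNORECASE | re.DOTALL,
)
INNER_TAG_RE = re.compile(r"<[^>]+>")


def scrape_tags(html):
    """Return (tag, text) pairs for each matched element, inner markup removed."""
    results = []
    for match in TAG_RE.finditer(html):
        text = INNER_TAG_RE.sub("", match.group("body"))
        text = " ".join(text.split())  # collapse whitespace and newlines
        if text:
            results.append((match.group("tag").lower(), text))
    return results
```

Changing which tags get scraped only means editing the alternation in `TAG_RE`. HTML entities such as `&amp;` are left undecoded here.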
