General Lee, posted November 6, 2012: I'm building a bot that visits some URLs from a list and scrapes the articles published there. The websites are all different, but I'm only looking to scrape the article text. Anyone know of a good way to go about grabbing just the text? Thanks!!
a2mateit, posted November 6, 2012: What websites are you trying to scrape from? A little more info would help us help you.
General Lee, posted November 6, 2012: That's the thing, they're all different. Some are WordPress, some are other HTML templates, a huge variety. I thought maybe there was a simple generic attribute for scraping text, but I'm still searching.
gabel, posted November 6, 2012: Not really. Different sites format their text differently, so you'll have to come up with code for each one (or one per platform, for sites that share a platform).
General Lee, posted November 7, 2012: Thanks for the replies, I'll keep tinkering.
Aymen, posted November 7, 2012: You can't scrape text unless you know the div tags you are scraping from. You can make an if statement covering all the possible platforms, then scrape based on whichever platform is found!
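The "if statement per platform" idea could look roughly like the Python sketch below (not uBot code). It assumes the page advertises its platform in a `<meta name="generator">` tag, which many sites omit. The `CONTAINER_CLASS` mapping and the names `detect_platform` and `container_for` are hypothetical, and the class names in the mapping are guesses you would need to check against real pages.

```python
from html.parser import HTMLParser

# Hypothetical mapping: platform name -> class of the element that usually
# wraps the article body. These values are assumptions, not verified.
CONTAINER_CLASS = {
    "wordpress": "entry-content",
    "blogger": "post-body",
}


class GeneratorFinder(HTMLParser):
    """Records the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.generator is None:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "generator":
                self.generator = (attributes.get("content") or "").lower()


def detect_platform(html):
    """Return a known platform name guessed from the generator tag, or None."""
    finder = GeneratorFinder()
    finder.feed(html)
    finder.close()
    if not finder.generator:
        return None
    for platform in CONTAINER_CLASS:
        if platform in finder.generator:
            return platform
    return None


def container_for(html):
    """Return the container class to scrape, or None for unknown platforms."""
    return CONTAINER_CLASS.get(detect_platform(html))
```

When `container_for` returns `None`, the bot would need some generic fallback, since the platform could not be identified.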
Kev, posted November 7, 2012: Perhaps try a website that will provide you with text-only versions of websites? That way you might find it easier to strip out the text. One source: http://howto.cnet.com/8301-11310_39-20089311-285/how-to-view-text-only-versions-of-web-sites/
General Lee, posted November 7, 2012: Quoting Aymen: "You can't scrape text unless you know the div tags you are scraping from. You can make an if statement covering all the possible platforms, then scrape based on whichever platform is found!" Yeah, that's what I'm currently doing: scraping tags with if exists for stuff like <font>*</font>, <p>*</p> and <span>*</span>. It's grabbing a lot of data this way, but it's still missing tons of articles and data. I'll keep trying. Getting closer, though.
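One generic fallback for pages where the platform is unknown is to collect every `<p>` block, including text nested in inline tags such as `<span>` or `<font>`, and keep only the longer ones, since menus and footers tend to be short. This is a rough Python sketch under that assumption, not a uBot feature. The names `ParagraphCollector` and `extract_article_text` and the 80-character threshold are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p>, ignoring script and style content."""

    SKIP = {"script", "style"}

    def __init__(self, min_chars=80):
        super().__init__()
        self.min_chars = min_chars
        self.paragraphs = []
        self._buffer = None      # None while outside a <p>
        self._skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self.flush()         # an unclosed <p> ends where the next begins
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        # Text inside nested inline tags still arrives here, so it is kept.
        if self._buffer is not None and not self._skip_depth:
            self._buffer.append(data)

    def flush(self):
        """Store the current paragraph if it is long enough, then reset."""
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if len(text) >= self.min_chars:
                self.paragraphs.append(text)
        self._buffer = None


def extract_article_text(html, min_chars=80):
    """Return the long paragraphs of a page joined by blank lines."""
    collector = ParagraphCollector(min_chars)
    collector.feed(html)
    collector.close()
    collector.flush()            # pick up a trailing unclosed <p>
    return "\n\n".join(collector.paragraphs)
```

The length threshold is a blunt filter: it will drop short but legitimate paragraphs and will not help on sites that lay out articles with `<div>` or `<br>` instead of `<p>`.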
General Lee, posted November 7, 2012: Quoting Kev: "Perhaps try a website that will provide you with text-only versions of websites? That way you might find it easier to strip out the text. One source: http://howto.cnet.co...s-of-web-sites/" You know, you just may be on to something there. Thanks!
Aymen, posted November 13, 2012: Well, I've tested it with Ezine Articles and it worked just great. In reply to General Lee: "Yeah, that's what I'm currently doing: scraping tags with if exists for stuff like <font>*</font>, <p>*</p> and <span>*</span>. It's grabbing a lot of data this way, but it's still missing tons of articles and data. I'll keep trying. Getting closer, though."
scraperdev, posted November 16, 2012: I don't know whether uBot is capable of editing the scraper's regular expression so that it scrapes the correct tags, etc.
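For comparison, here is what a regular-expression approach to pulling `<p>` contents might look like outside uBot, as a small Python sketch. Regexes are brittle on real-world HTML (nested or unclosed tags will trip them up), so treat this as an illustration only. The function name `paragraphs_by_regex` is hypothetical.

```python
import html as html_lib
import re

# Matches <p ...>...</p>, case-insensitively and across line breaks.
P_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
# Matches any remaining tag inside a paragraph, e.g. <b> or <span ...>.
TAG_RE = re.compile(r"<[^>]+>")


def paragraphs_by_regex(source):
    """Return the plain text of each non-empty <p>...</p> in source."""
    results = []
    for inner in P_RE.findall(source):
        text = html_lib.unescape(TAG_RE.sub("", inner))
        text = " ".join(text.split())   # collapse runs of whitespace
        if text:
            results.append(text)
    return results
```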