UBot Underground

Scraping articles from webpages



I'm building a bot that visits URLs from a list and scrapes the articles published there. The websites are all different, but I only want to scrape the article text. Does anyone know of a good way to grab just the text?

 

Thanks!!


Not really. Different sites format their text differently, so you'll have to write code for each one (or one routine for all the sites that run on the same platform).


You can't scrape the text unless you know which div tags it sits in. You can write an if statement that covers all the possible platforms, then scrape based on whichever platform is detected!
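A rough sketch of that per-platform idea, written in Python rather than UBot purely for illustration. It assumes the page announces its platform in a `<meta name="generator">` tag and that the article sits in a container with a known CSS class. The `PLATFORM_CONTAINERS` table and the class names in it (`entry-content`, `post-body`) are assumptions for the example; real themes vary, so each entry would need checking against the actual sites.

```python
from html.parser import HTMLParser

# Hypothetical table: generator keyword -> CSS class of the article container.
# The class names are common theme conventions, not guarantees.
PLATFORM_CONTAINERS = {
    "wordpress": "entry-content",
    "blogger": "post-body",
}

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class GeneratorSniffer(HTMLParser):
    """Records the content of <meta name="generator" ...> if the page has one."""

    def __init__(self):
        super().__init__()
        self.generator = ""

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "generator":
            self.generator = (attr.get("content") or "").lower()


class ContainerText(HTMLParser):
    """Collects text found inside the first-level element carrying css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # 0 means we are outside the container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def detect_platform(html):
    """Return the matching platform key, or None if the page is unrecognised."""
    sniffer = GeneratorSniffer()
    sniffer.feed(html)
    for name in PLATFORM_CONTAINERS:
        if name in sniffer.generator:
            return name
    return None


def scrape_article(html):
    """The 'if statement': pick the container by platform, then pull its text."""
    platform = detect_platform(html)
    if platform is None:
        return None
    parser = ContainerText(PLATFORM_CONTAINERS[platform])
    parser.feed(html)
    return " ".join(parser.chunks)
```

The depth counter relies on tags being properly closed, so badly formed pages can cut the text short or let extra text through. Pages with no recognised generator tag return `None`, which is where a fallback would have to take over.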



 

Yeah, that's what I'm currently doing: scraping tags with an if-exists check for stuff like <font>*</font>, <p>*</p>, and <span>*</span>.

 

It's grabbing a lot of data this way, but it's still missing tons of articles and data.

 

I'll keep trying. Getting closer though :)
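For comparison, here is a minimal sketch of the tag-by-tag approach described above, in Python for illustration rather than UBot. It gathers text from `<p>`, `<span>`, and `<font>` elements and ignores `<script>` and `<style>` content. The names `TagText`, `extract_text`, and the `min_words` filter are made up for this example.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"p", "span", "font"}   # tags whose text we want to keep
SKIP_TAGS = {"script", "style"}     # tags whose content is never article text


class TagText(HTMLParser):
    """Collects text that appears inside any of TEXT_TAGS."""

    def __init__(self):
        super().__init__()
        self.open_text_tags = 0   # how many TEXT_TAGS are currently open
        self.skip_depth = 0       # how many SKIP_TAGS are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in TEXT_TAGS:
            self.open_text_tags += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in TEXT_TAGS and self.open_text_tags:
            self.open_text_tags -= 1

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse runs of whitespace
        if text and self.open_text_tags and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, min_words=0):
    """Return text chunks from TEXT_TAGS, dropping those shorter than min_words."""
    parser = TagText()
    parser.feed(html)
    return [c for c in parser.chunks if len(c.split()) >= min_words]
```

Text that sits directly in a `<div>`, `<td>`, or `<li>` is never collected by this sketch, which may be one reason an approach like this misses articles. Raising `min_words` is a crude way to drop menu and button labels.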


Perhaps try a website that provides read-only versions of other websites? That way you might find it easier to strip out the text.

 

Some sources: http://howto.cnet.co...s-of-web-sites/

 

You know, you just may be on to something there. Thanks!


Well, I've tested it with an ezine article and it worked just great.

 

