UBot Underground

Scraping Rss



I've been trying to get this working for about 6 hours without any luck. I'm trying to get the links from the Google News RSS feed here:

 

https://news.google.com/news?cf=all&hl=en&ned=us&q=Red+Skelton&output=rss

 

The links look like this:

<link>http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNGdVPVljckRj3DjAnoe4B1Bs8I6Ow&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779022175028&ei=u9yJVvi_MerrwQGb67GQBA&url=http://www.buffalonews.com/life-arts/book-reviews/book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103</link>

However, I just need this:

http://www.buffalonews.com/life-arts/book-reviews/book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103

The UBot browser tries to parse the feed as HTML, but it's an RSS feed, so I can't get it to scrape correctly.

Thanks for your patience with new folks :-)

 

Peace,
EJ


Try this:

alert($find regular expression("<link>http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNGdVPVljckRj3DjAnoe4B1Bs8I6Ow&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779022175028&ei=u9yJVvi_MerrwQGb67GQBA&url=http://www.buffalonews.com/life-arts/book-reviews/book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103</link>","(?<=url=).*?(?=<\\/link>)"))
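
For anyone who wants to check the pattern outside UBot, here is a minimal Python sketch (Python is an assumption here, nobody in this thread used it). It applies the same lookbehind/lookahead idea to the sample `<link>` element from the first post. The function name `extract_target_url` is made up for illustration.

```python
import re

# Same idea as the UBot pattern: match everything after "url=" up to "</link>".
PATTERN = re.compile(r"(?<=url=).*?(?=</link>)")

# The sample <link> element quoted in the first post, split only for readability.
SAMPLE = (
    "<link>http://news.google.com/news/url?sa=t&fd=R&ct2=us"
    "&usg=AFQjCNGdVPVljckRj3DjAnoe4B1Bs8I6Ow"
    "&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779022175028"
    "&ei=u9yJVvi_MerrwQGb67GQBA"
    "&url=http://www.buffalonews.com/life-arts/book-reviews/"
    "book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103"
    "</link>"
)


def extract_target_url(link_element):
    """Return the text between 'url=' and '</link>', or None if there is no match."""
    match = PATTERN.search(link_element)
    return match.group(0) if match else None


print(extract_target_url(SAMPLE))
```

Note that the `/news/url?sa=t` part of the redirect does not trigger the lookbehind, because it is followed by `?` rather than `=`.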

 


Hi Pash,

 

I got the file downloaded and the regex figured out, but I'm having a hard time figuring out how to scrape from the .txt file that I created with the RSS code in it.  Here's the Google News RSS feed URL in case someone needs it:

https://news.google.com/news?cf=all&hl=en&ned=us&q=KEYWORD&output=rss

Or

https://news.google.com/news?cf=all&hl=en&ned=us&q=YOUR+KEYWORD&output=rss

Here's the REGEX to extract the links:

(?<=\&url=).*?(?=<\/link>)

Thanks in advance for your help!

Peace,

Z


You don't have to download it. You can just use the read file command on the feed URL, something like this:

add list to list(%links,$find regular expression($read file("https://news.google.com/news?cf=all&hl=en&ned=us&q=Red+Skelton&output=rss"),"(?<=\\&url=).*?(?=<\\/link>)"),"Delete","Global")
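
The one-liner above reads the feed, pulls every match, and adds the results to a list with the "Delete" option, which I take to mean dropping duplicates (that is an assumption about `add list to list`). A rough Python equivalent of the extract-and-dedupe step is sketched below. It uses a made-up function name and a tiny inline feed with placeholder `example.com` URLs instead of a live download.

```python
import re

# Same pattern as posted above: everything after "&url=" up to "</link>".
LINK_PATTERN = re.compile(r"(?<=&url=).*?(?=</link>)")


def extract_unique_links(feed_text):
    """Return every target URL in feed_text, first occurrence only, in feed order."""
    matches = LINK_PATTERN.findall(feed_text)
    # dict.fromkeys keeps insertion order, so this removes repeats without reordering.
    return list(dict.fromkeys(matches))


# Placeholder feed text: three <link> elements, one of them a repeat.
FEED = (
    "<link>http://news.google.com/news/url?sa=t&url=http://example.com/a</link>\n"
    "<link>http://news.google.com/news/url?sa=t&url=http://example.com/b</link>\n"
    "<link>http://news.google.com/news/url?sa=t&url=http://example.com/a</link>\n"
)

for link in extract_unique_links(FEED):
    print(link)
```

The same function works on text loaded from a saved .txt file, since it only needs the feed contents as a string.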


Try this:

set(#Html,$read file("https://news.google.com/news?cf=all&hl=en&ned=us&q=KEYWORD&output=rss"),"Global")
set(#Html,$replace(#Html,"&lt;","<"),"Global")
set(#Html,$replace(#Html,"&gt;",">"),"Global")
set(#Html,$replace(#Html,"&quot;","\""),"Global")
set(#Html,$replace(#Html,"&#39;","\'"),"Global")
set(#Html,$replace(#Html,"&apos;","\'"),"Global")
loop(10) {
    set(#Html,$replace(#Html,"&amp;","&"),"Global")
}
load html(#Html)
wait(2)
add list to list(%Links,$find regular expression(#Html,"(?<=&url=).*?(?=(</link>|\"))"),"Delete","Global")
load html("
{$replace(%Links,"
","<br>")}")
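
The script above appears to decode HTML entities in the feed text, repeating the ampersand step in case the text was encoded more than once, and only then runs the regex. Here is a hedged Python sketch of that decode-then-extract idea. It swaps in the standard library's `html.unescape` for the chain of `$replace` calls, so it is not a line-by-line port. The function names and the `example.com` URL are placeholders.

```python
import html
import re

# Everything after "&url=" up to "</link>" or a double quote.
# The group is non-capturing so findall returns the whole match.
LINK_PATTERN = re.compile(r'(?<=&url=).*?(?=(?:</link>|"))')


def decode_entities(text, max_passes=10):
    """Unescape HTML entities repeatedly until the text stops changing."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def extract_links(feed_text):
    """Decode entities first, then pull out every target URL."""
    return LINK_PATTERN.findall(decode_entities(feed_text))


# Placeholder input with a double-encoded ampersand before "url=".
RAW = "<link>http://news.google.com/news/url?sa=t&amp;amp;url=http://example.com/a</link>"
print(extract_links(RAW))
```

Decoding first matters because a pattern looking for a literal `&url=` will not match text that still contains `&amp;url=`.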

ds062692,

 

Thanks so much, perfect!

 

Thanks to you too pash!

 

My head is swimming, there's so much to learn that it seems a bit overwhelming right now, but I'm getting it slowly...

 

Peace,

EJ
