Scraping RSS

Learjet · January 4, 2016

I've been trying to get this working for about 6 hours without any luck. I'm trying to extract the links from this Google News RSS feed:

https://news.google.com/news?cf=all&hl=en&ned=us&q=Red+Skelton&output=rss

The links look like this:

<link>http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNGdVPVljckRj3DjAnoe4B1Bs8I6Ow&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779022175028&ei=u9yJVvi_MerrwQGb67GQBA&url=http://www.buffalonews.com/life-arts/book-reviews/book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103</link>

However, I just need this:

http://www.buffalonews.com/life-arts/book-reviews/book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103

The UBot browser tries to parse the feed as HTML, but it's an RSS feed, so I can't get it to scrape correctly.

Thanks for your patience with new folks :-)

Peace,
EJ

pash · January 4, 2016

Try this:

alert($find regular expression("<link>http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNGdVPVljckRj3DjAnoe4B1Bs8I6Ow&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779022175028&ei=u9yJVvi_MerrwQGb67GQBA&url=http://www.buffalonews.com/life-arts/book-reviews/book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103</link>","(?<=url=).*?(?=<\\/link>)"))
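
For anyone who wants to check the pattern outside UBot, here is a rough Python sketch of the same lookbehind/lookahead idea. It is only an illustration: the names SAMPLE_LINK, LINK_PATTERN and extract_target are made up for this example, and Python's regex engine is assumed to treat this pattern the same way UBot's does.

```python
import re

# Sample <link> element copied from the feed in the question.
SAMPLE_LINK = (
    "<link>http://news.google.com/news/url?sa=t&fd=R&ct2=us"
    "&usg=AFQjCNGdVPVljckRj3DjAnoe4B1Bs8I6Ow"
    "&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779022175028"
    "&ei=u9yJVvi_MerrwQGb67GQBA"
    "&url=http://www.buffalonews.com/life-arts/book-reviews/"
    "book-review-limping-on-water-by-phil-beuth-with-kc-schulberg-20160103</link>"
)

# (?<=url=)    lookbehind: start right after "url=" without including it
# .*?          lazy match: take as little as possible
# (?=</link>)  lookahead: stop right before the closing tag
LINK_PATTERN = re.compile(r"(?<=url=).*?(?=</link>)")


def extract_target(link_element):
    """Return the article URL inside one <link> element, or None if absent."""
    match = LINK_PATTERN.search(link_element)
    return match.group(0) if match else None


print(extract_target(SAMPLE_LINK))
```

The lookarounds only assert what surrounds the match, so the result contains the article URL and nothing from the Google redirect or the closing tag.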

Learjet · January 6, 2016

Hi Pash,

I got the file downloaded and the regex figured out, but I'm having a hard time figuring out how to scrape from the .txt file that I created with the RSS code in it. Here's the Google News RSS feed URL format in case someone needs it:

https://news.google.com/news?cf=all&hl=en&ned=us&q=KEYWORD&output=rss

Or, with a multi-word keyword:

https://news.google.com/news?cf=all&hl=en&ned=us&q=YOUR+KEYWORD&output=rss

Here's the REGEX to extract the links:

(?<=\&url=).*?(?=<\/link>)

Thanks in advance for your help!

Peace,

Z

ds062692 · January 6, 2016

You don't have to download it. You can just use the read file command, something like this:

add list to list(%links,$find regular expression($read file("https://news.google.com/news?cf=all&hl=en&ned=us&q=Red+Skelton&output=rss"),"(?<=\\&url=).*?(?=<\\/link>)"),"Delete","Global")
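
As a point of comparison, the same "read the feed and collect every target link" step can be sketched in Python. This is a different technique from the regex above: it parses the XML and reads the url query parameter from each link. The function names and SAMPLE_FEED are hypothetical, the sample items use example.com rather than real feed data, and fetch_feed is shown but not run here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen


def fetch_feed(feed_url):
    """Download the feed and return the raw bytes (not called in this sketch)."""
    with urlopen(feed_url) as response:
        return response.read()


def links_from_feed(xml_data):
    """Collect the url= parameter from every <link> element that has one."""
    root = ET.fromstring(xml_data)  # the XML parser decodes &amp; to &
    targets = []
    for link in root.iter("link"):
        query = parse_qs(urlparse(link.text or "").query)
        # Caveat: if the target URL carries its own unencoded &key=value
        # pairs, parse_qs splits them off as separate parameters.
        if "url" in query:
            targets.append(query["url"][0])
    return targets


# Made-up feed fragment with the same shape as the links in the question.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>sample</title>
<link>http://news.google.com/news?q=Red+Skelton</link>
<item><link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;url=http://www.example.com/story-one</link></item>
<item><link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;url=http://www.example.com/story-two</link></item>
</channel></rss>"""

print(links_from_feed(SAMPLE_FEED))
```

The channel-level link has no url parameter, so it is skipped and only the per-item article links come back.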

pash · January 6, 2016

Try:

set(#Html,$read file("https://news.google.com/news?cf=all&hl=en&ned=us&q=KEYWORD&output=rss"),"Global")
set(#Html,$replace(#Html,"&lt;","<"),"Global")
set(#Html,$replace(#Html,"&gt;",">"),"Global")
set(#Html,$replace(#Html,"&quot;","\""),"Global")
set(#Html,$replace(#Html,"&#39;","\'"),"Global")
set(#Html,$replace(#Html,"&apos;","\'"),"Global")
loop(10) {
    set(#Html,$replace(#Html,"&amp;","&"),"Global")
}
load html(#Html)
wait(2)
add list to list(%Links,$find regular expression(#Html,"(?<=&url=).*?(?=(</link>|\"))"),"Delete","Global")
load html("
{$replace(%Links,"
","<br>")}")

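The decode-then-extract approach in the script above can be sketched in Python too, as a rough equivalent rather than a translation. The assumptions: html.unescape stands in for the chain of replace calls, the decode_fully and extract_links names are invented for this example, and SAMPLE is a made-up, doubly escaped fragment, not real feed output.

```python
import html
import re

# Same idea as the script's pattern: start after "&url=" and stop before
# either the closing </link> tag or a double quote.
URL_PATTERN = re.compile(r'(?<=&url=).*?(?=</link>|")')


def decode_fully(text, max_rounds=10):
    """Unescape HTML entities repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def extract_links(escaped_text):
    """Decode the escaped markup, then pull out every redirect target."""
    return URL_PATTERN.findall(decode_fully(escaped_text))


# Made-up fragment where "&" was escaped twice, as "&amp;amp;".
SAMPLE = (
    "&lt;link&gt;http://news.google.com/news/url?sa=t&amp;amp;"
    "url=http://www.example.com/a&lt;/link&gt; "
    "&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;"
    "url=http://www.example.com/b&quot;&gt;"
)

print(extract_links(SAMPLE))
```

Looping until the text is stable plays the same role as the loop(10) over the ampersand replacement: it copes with entities that were escaped more than once. One caveat: html.unescape also decodes a few legacy entities written without a trailing semicolon, so unusual query parameter names could be altered.
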
Learjet · January 6, 2016

ds062692,

Thanks so much, perfect!

Thanks to you too pash!

My head is swimming. There's so much to learn that it seems a bit overwhelming right now, but I'm getting it slowly...

Peace,

EJ
