Scraping of pdf files...

Cary Duke · May 27, 2010

Hi guys,

Looking for a little direction. Not only am I a newbie to Ubot, I am also totally new to programming... I want to make a bot that will scrape membership site information. Some of these sites include PDFs of the membership directory, and the directory includes emails. That's the info I want to scrape and put into .csv format for use in Outlook.

With that said, I am watching the tutorials, practicing, and learning... I have the 3.3 beta version. The tutorials are somewhat different. Should I go back to the previous version?

Can Ubot scrape a pdf file? I ask because when I'm on the pdf page, the right-click drop-down options aren't there.

Suggestions?

Thanks in advance

Cary

TommyTx · May 27, 2010

I would guess Ubot won't scrape a pdf. However, it can download it, and if you only need certain things from it, there are programs that can convert it to a Word-type document so that the data can be retrieved...

UBotBuddy · May 27, 2010

No, I am pretty sure it can't scrape from a PDF. NOW THAT my friends would be a VERY cool trick. Also, it cannot scrape from a Flash form either... I tried. I thought they were regular forms. They sure looked like regular web forms.

The_Brit · May 27, 2010

You can do it in a roundabout way. If the pdf can be downloaded via UBot, you can execute one of the programs that TommyTx mentioned and convert it to plain text. You then use the Navigate option to load the text file back into UBot, where you can scrape the contents. Instead of http:// you use file://

I used this for obtaining the shortened URL from lil.io for somebody. Not translating a pdf but loading a text file to scrape the data.
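For anyone who would rather do the last step outside UBot, here is a minimal sketch in Python of what happens once the pdf has been converted to plain text: pull out the email addresses and write them to a .csv. The regular expression is deliberately simple, and the "E-mail Address" column name is only an assumption about what your Outlook import will want to map, so adjust it as needed.

```python
import csv
import io
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 matcher):
# it will miss some unusual addresses and may accept a few invalid ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique addresses in first-seen order, compared case-insensitively."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


def write_csv(emails, out):
    """Write one address per row under a single header column."""
    writer = csv.writer(out)
    writer.writerow(["E-mail Address"])  # hypothetical header name
    for email in emails:
        writer.writerow([email])


# Stand-in for the text produced by the pdf converter.
sample = (
    "Jane Doe  jane@example.com\n"
    "John Roe  JOHN@example.org\n"
    "Jane again jane@example.com\n"
)

emails = extract_emails(sample)
buffer = io.StringIO()
write_csv(emails, buffer)
print(buffer.getvalue())
```

To write a real file, open it with open("directory.csv", "w", newline="") and pass that to write_csv instead of the in-memory buffer.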

Hope this helps

Dave

UBotBuddy · May 27, 2010

Hmmmmm.... Interesting. That is an interesting solution. I wonder if there is a way to snapshot a Flash form to a PDF and do this method you suggested. That would be a solution to a problem I need to address.

Net66 · May 27, 2010

You could grab the URL of the pdf and then feed it into the Adobe online conversion tool to convert it to an html or text file.

http://www.adobe.com/products/acrobat/access_onlinetools.html

If you convert to html the content can be scraped :-)
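As a rough illustration of why html is easier to work with, here is a small Python sketch that uses only the standard library to pull the visible text and any mailto: links out of a converted page. The sample page is made up, and real converter output will likely be messier, so treat this as a starting point rather than the way the Adobe tool's output must be handled.

```python
from html.parser import HTMLParser


class TextAndMailto(HTMLParser):
    """Collect visible text and the addresses found in mailto: links."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.mailto = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the "mailto:" scheme and any ?subject=... query string.
                self.mailto.append(value[len("mailto:"):].split("?", 1)[0])

    def handle_data(self, data):
        self.text_parts.append(data)


def html_to_text_and_links(html):
    """Return (whitespace-normalised text, list of mailto addresses)."""
    parser = TextAndMailto()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())
    return text, parser.mailto


# Made-up stand-in for a converted directory page.
page = (
    "<html><body>"
    '<p>Jane Doe <a href="mailto:jane@example.com?subject=Hi">email</a></p>'
    "<p>Office: john@example.org</p>"
    "</body></html>"
)

text, links = html_to_text_and_links(page)
print(text)
print(links)
```

Addresses that appear only as plain text (like the second one in the sample) still need a pattern search over the returned text.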

I'd use proxies if you are doing more than one file in succession.

Andy

P.S. Welcome to the Ubot Underground Cary!

BizWebCoach · September 30, 2010

I am trying to download a PDF, but I'm unsuccessful. I can use the 'download file' command to save a file that ends up being a PDF, but when I try to open it, I get an error saying the file is not a valid PDF.

On the dialog to save the file for the download file command, I can name the file and choose where to save it, but I cannot specify that it should be saved as a PDF file type in the file type dropdown.

Is there a problem with mime types, or am I doing something wrong? I would appreciate instructions about how to save a PDF from within the adobe helper window that controls the browser when viewing a PDF with the browser.

Thanks to anyone who can help!

MiriamMB · September 30, 2010

Hmm...it seems there's no reason why it should not be working.

I just tested this with the download file command (see the attached picture), and it was able to save when I browsed to the folder I wanted and then typed in "neruda.pdf".

Try typing the name together with the extension, the way I saved mine, so: file.pdf
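If the saved file still will not open, it can help to check what was actually downloaded. A real PDF starts with the bytes %PDF-, so a quick look at the first bytes tells you whether you got a PDF at all. One possible cause (an assumption on my part, not something confirmed in this thread) is that the saved file is really an html page, such as a login or viewer page, that has simply been given a .pdf name. A minimal Python sketch of that check:

```python
def classify_download(data: bytes) -> str:
    """Guess what a downloaded file is from its first bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    # Ignore leading whitespace and letter case when looking for html markers.
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


def classify_file(path: str) -> str:
    """Read the first 1024 bytes of a file on disk and classify them."""
    with open(path, "rb") as handle:
        return classify_download(handle.read(1024))


# In-memory examples; for a real file call classify_file("neruda.pdf").
print(classify_download(b"%PDF-1.4\n%rest of file"))
print(classify_download(b"\n<!DOCTYPE html><html><body>Please log in</body></html>"))
print(classify_download(b"hello"))
```

If the result is "html", open the file in a text editor to see which page was saved in place of the PDF.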

jimbourekas@yahoo.gr · June 12, 2013

Hello,

I know I'm waking this thread up more than 2 years later. I am wondering, is it possible with version 4 to display a pdf in the browser area, just like in your screenshot?

jim

OK, I got support to answer this. Version 3 was based on the Internet Explorer core, while version 4 is not. So the answer is that version 4 cannot display a pdf...

Edited June 13, 2013 by jimbourekas@yahoo.gr
