UBot Underground

Scraping of pdf files...



Hi guys,

 

Looking for a little direction. Not only am I a newbie to UBot, but I am also totally new to programming... I want to make a bot that will scrape membership site information. Some of these sites include PDFs of the membership directory, and those directories include email addresses. That's the info I want to scrape and put into .csv format for use in Outlook.

 

With that said, I am watching the tutorials, practicing, and learning... I have the 3.3 beta version, and the tutorials look somewhat different from it. Should I go back to the previous version?

 

Can UBot scrape a PDF file? I ask because when I'm on the PDF page, the right-click drop-down options aren't there.

 

Suggestions?

 

Thanks in advance

 

Cary


I would guess UBot won't scrape a PDF. However, it can download it, and if you need just certain things from it, there are programs that can convert it to a Word-type document so that the data can be retrieved...


No, I am pretty sure it can't scrape from a PDF. NOW THAT, my friends, would be a VERY cool trick. Also, it cannot scrape from a Flash form either... I tried. I thought they were regular forms. They sure looked like regular web forms.


You can do it in a roundabout way. If the PDF can be downloaded via UBot, you can execute one of the programs that TommyTX mentioned and convert it to plain text. You then use the Navigate option to load it back into UBot, where you can scrape the contents. Instead of http:// you use file://

 

I used this to obtain the shortened URL from lil.io for somebody. That wasn't converting a PDF, just loading a text file to scrape the data.

 

Hope this helps

 

Dave
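For anyone who wants to see the last step of this approach outside UBot: once the PDF has been converted to plain text, pulling out the email addresses and writing the .csv can be sketched in a few lines of Python. This is only a rough sketch under assumptions. The function names, the simple regex, and the single "Email" column header are all made up for illustration and are not part of UBot or Outlook; you would map the column yourself when importing.

```python
import csv
import re
from pathlib import Path

# Deliberately simple pattern (an assumption, not a full address validator);
# it will miss some unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique email addresses in order of first appearance."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # treat addresses as case-insensitive duplicates
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails


def write_email_csv(emails, csv_path):
    """Write one address per row under a single hypothetical header column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Email"])
        for email in emails:
            writer.writerow([email])


def file_url(path):
    """Build the file:// form of a local path, as used instead of http://."""
    return Path(path).resolve().as_uri()
```

Typical use would be reading the converted text file into a string, calling extract_emails on it, and passing the result to write_email_csv. The file_url helper just shows how a local path turns into the file:// address mentioned above.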


Hmmmmm.... Interesting. That is an interesting solution. I wonder if there is a way to snapshot a Flash form to a PDF and do this method you suggested. That would be a solution to a problem I need to address.


You could grab the URL of the PDF and then feed it into the Adobe online conversion tool to convert it to an HTML or text file.

 

http://www.adobe.com/products/acrobat/access_onlinetools.html

 

If you convert to HTML, the content can be scraped :-)

 

I'd use proxies if you are doing more than one file in succession.

 

Andy

 

P.S. Welcome to the Ubot Underground Cary!
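On the HTML route suggested above, here is a rough idea of what "the content can be scraped" might look like if you handled the converted HTML yourself instead of loading it in UBot. It is a minimal Python sketch using only the standard library. The class and function names are invented for this example, and it assumes the converted page keeps addresses either as visible text or as mailto: links, which I can't confirm for any particular converter.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class EmailCollector(HTMLParser):
    """Collect visible text and mailto: targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.mailto = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop any "?subject=..." part and decode %-escapes.
                address = value[len("mailto:"):].split("?", 1)[0]
                self.mailto.append(unquote(address))

    def handle_data(self, data):
        # Note: this also picks up script/style text if the page has any.
        self.text_parts.append(data)


def html_to_text_and_mailtos(html):
    """Return (plain text, list of mailto addresses) for an HTML string."""
    parser = EmailCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(part.strip() for part in parser.text_parts if part.strip())
    return text, parser.mailto
```

The plain text that comes back can then be searched for addresses the same way as a converted text file, while the mailto list catches addresses that only appear inside links.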

  • 4 months later...

I am trying to download a PDF, but I'm unsuccessful. I can use the 'download file' command to save a file that ends up being a PDF, but when I try to open it, I get an error saying the file is not a valid PDF.

 

On the dialog to save the file for the download file command, I can name the file and choose where to save it, but I cannot specify that it should be saved as a PDF file type in the file type dropdown.

 

Is there a problem with MIME types, or am I doing something wrong? I would appreciate instructions on how to save a PDF from within the Adobe helper window that takes over the browser when viewing a PDF in the browser.

 

Thanks to anyone who can help!



Hmm...it seems there's no reason why it should not be working.

I just tested this with the download file command. Let me attach a picture.

[Attached screenshot: Neruda.jpg]

 

It saved fine when I browsed to the folder I wanted and then typed in "neruda.pdf" as the file name.

Try typing both the name and the extension, the way I saved mine, for example file.pdf.
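If the saved file still won't open after naming it with the .pdf extension, it can help to check what was actually downloaded. A real PDF begins with the signature %PDF- (readers generally accept it anywhere in the first 1024 bytes). One possible cause of the "not a valid PDF" error, which I can't confirm for this case, is that the server returned an HTML page instead of the document. A small Python sketch of that check follows; the function name and the three labels it returns are my own choices for illustration.

```python
def describe_download(path):
    """Guess what a downloaded file really is from its first bytes.

    Returns "pdf", "html", or "unknown" (labels chosen for this sketch).
    """
    with open(path, "rb") as handle:
        head = handle.read(1024)

    # PDF files carry the "%PDF-" signature at (or very near) the start.
    if b"%PDF-" in head:
        return "pdf"

    # An HTML page saved under a .pdf name usually starts with one of these.
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"

    return "unknown"
```

If this reports "html", the download step saved a web page (for example a login or viewer page) rather than the PDF itself, and the fix is in how the file is requested, not in the file name.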

  • 2 years later...

Hello,

 

I know it's over 2 years later that I'm waking this thread up. I am wondering, is it possible with version 4 to display a PDF in the browser area, just like in your screenshot?

 

jim

 

OK, I got an answer from support. Version 3 was based on the Internet Explorer core, while version 4 is not. So the answer is that it cannot display a PDF...

Edited by jimbourekas@yahoo.gr