UBot Underground

Hi all,

 

Trying to get my teeth into regex... Watched some great tutorials on YouTube and I'm starting to get the hang of it.

 

I'm working on one bit and have got stuck.

 

Here's what I want:

 

http://wildandslow.com/wp-content/uploads/2011/03/WILD_CRAB_APPLE_123FINAL.pdf

http://www.teagasc.ie/ruraldev/docs/factsheets/38_Apple%20Production.pdf

 

I need to get the PDF file name, i.e. everything between the last / and the end of .pdf

So in these two examples here's what I'd need: WILD_CRAB_APPLE_123FINAL.pdf and also 38_Apple%20Production.pdf

 

Here's the regex snippet I came up with: ([/])\w+(.pdf)

 

 

This works well for the first one but not the second one, and I guess it's because of the %20 in it. How can I get around this? I also need to allow for other symbols in the file name, such as - and +.
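
To see why the first attempt stops at the %, here is a minimal sketch in Python (not UBot script; it only illustrates how the pattern behaves, and it assumes UBot's regex flavour treats these tokens the same way, which is not confirmed in this thread). `\w` only covers letters, digits and underscore, so it cannot step over `%20`, and the unescaped `.` in `(.pdf)` matches any character rather than a literal dot. The pattern `[^/]+\.pdf$` and the helper name `file_name` are assumptions for illustration, not something posted here: the idea is to take every non-slash character up to a literal `.pdf` at the end of the URL.

```python
import re

# The first attempt from the post, unchanged.
FIRST_TRY = re.compile(r"([/])\w+(.pdf)")

# Assumed alternative: one or more non-slash characters, a literal dot,
# then "pdf" at the very end. IGNORECASE also accepts ".PDF".
LAST_SEGMENT = re.compile(r"[^/]+\.pdf$", re.IGNORECASE)

urls = [
    "http://wildandslow.com/wp-content/uploads/2011/03/WILD_CRAB_APPLE_123FINAL.pdf",
    "http://www.teagasc.ie/ruraldev/docs/factsheets/38_Apple%20Production.pdf",
]


def file_name(url):
    """Return the last path segment if it ends in .pdf, else None."""
    match = LAST_SEGMENT.search(url)
    return match.group(0) if match else None


for url in urls:
    first = FIRST_TRY.search(url)
    # Show what the original pattern finds next to the assumed alternative.
    print(first.group(0) if first else "no match", "|", file_name(url))
```

Because the character class is "anything except a slash", symbols such as -, +, % or parentheses in the file name need no special handling.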

 

Thanks in advance!

 

Kevin

Link to post
Share on other sites

This has got me a bit further:

 

([/])([A-Za-z _-]|%20)+\w+(.pdf)

 

Here's the full list of URLs I'm trying to gather:

 

http://skagit.wsu.edu/FAM/publications/apples%2003%20NEW.UPDATED.pdf
http://nysipm.cornell.edu/organic_guide/apples.pdf
http://www.cals.ncsu.edu/entomology/apiculture/pdfs/3.03%20copy.pdf
http://www.uspirg.org/sites/pirg/files/reports/Apples-to-Twinkies-web-vUS.pdf
http://www.worldtradelaw.net/reports/wtoab/japan-apples(ab).pdf
http://www.worldtradelaw.net/reports/wtopanels/japan-apples(panel).pdf
http://www.worldtradelaw.net/reports/wtopanelsfull/japan-apples(panel)(full).pdf
http://www.botany.wisc.edu/courses/botany_940/06CropEvol/papers/Harris%2602.pdf
http://fruit.wisc.edu/wp-content/uploads/2011/06/Watercore-of-Apple.pdf
http://fruit.wisc.edu/wp-content/uploads/2011/06/When-are-Apple-Ripe.pdf
http://theutahhouse.org/files/uploads/Preserving%20apples.pdf
http://fruit.cfans.umn.edu/apples/starch-iodine.PDF
http://sitemaker.umich.edu/bajacob/files/cheating.pdf
http://orchard.uvm.edu/uvmapple/hort/AppleHortBasics/Readings/fertilizing_apple_trees.pdf
http://orchard.uvm.edu/uvmapple/hort/AppleHortBasics/Readings/pgrs.pdf
http://web.econ.ku.dk/Nguyen/teaching/hummels%20skiba.pdf
http://ag.udel.edu/extension/horticulture/pdf/hg/hg-21.pdf
http://www.cals.uidaho.edu/edcomm/pdf/CIS/CIS1090.pdf
http://www.cals.uidaho.edu/edcomm/pdf/BUL/BUL0820.pdf
http://www.extension.uidaho.edu/mgse/Fact%20Sheets/Harvesting%20Apples.pdf
http://www.tlsbooks.com/appletreebook.pdf
http://pods.dasnr.okstate.edu/docushare/dsweb/Get/Document-1039/F-6210web.pdf
http://www.marinbehealthy.org/toolkits/Garden_of_Eatin_Toolkit/Food%20Based%20Modules/GoE_Apple_Module.pdf
http://ucce.ucdavis.edu/files/datastore/391-69.pdf
http://www.sigkdd.org/explorations/issues/12-1-2010-07/v12-1-p49-forman-sigkdd.pdf
http://ir.library.oregonstate.edu/xmlui/bitstream/handle/1957/17252/fs147.pdf
http://oregonstate.edu/dept/kbrec/sites/default/files/appleppt1.pdf
http://www.uaex.edu/Other_Areas/publications/PDF/FSA-7538.pdf
http://www.uaex.edu/Other_Areas/publications/pdf/FSA-6058.pdf
http://www.fns.usda.gov/fdd/facts/hhpfacts/New_HHPFacts/Fruits/HHFS_APPLES_FRESH_F510-515_Final.pdf
http://www.ba.ars.usda.gov/hb66/027apple.pdf
http://www.rma.usda.gov/fields/il_rso/2012/apple.pdf
http://www.dpi.nsw.gov.au/__data/assets/pdf_file/0010/120142/bitter-pit-apple.pdf
http://www.curriculumsupport.education.nsw.gov.au/secondary/science/assets/aifst/Experiments/apple_browning.pdf
http://www.dpi.nsw.gov.au/__data/assets/pdf_file/0016/40084/Watercore_of_apples-primefact49.pdf
http://www.dpi.nsw.gov.au/__data/assets/pdf_file/0004/227362/Growing-cider-apples.pdf
http://www.hort.purdue.edu/newcrop/pri/chapter.pdf
http://www.ipmcenters.org/cropprofiles/docs/vaapples.pdf
http://www.uky.edu/Ag/NewCrops/introsheets/apples.pdf
http://www.superteacherworksheets.com/reading-comp/1st-apple-poem.pdf
http://www.ag.ndsu.edu/pubs/plantsci/hortcrop/h1547.pdf
http://www.agintheclassroom.org/TeacherResources/AgMags/Apple%20Ag%20Mag%20_SmartBoard.pdf
http://www.jhortscib.com/isafruit/isa_pp034_041.pdf
http://www.harvestofthemonth.cdph.ca.gov/download/Fall/Apples/Apples_Fami.pdf
http://www.sde.idaho.gov/site/cnp/ffvp/fruit_veg/Apple.pdf
http://wusf.usf.edu/pdf/Cooks_Country/CC_BestBakedApples.pdf
http://www.alfalaval.com/industries/food-dairy-beverages/Documents/Juicy%20apples.pdf
http://www.celma.org/archives/temp/CELMA_TF_Apples_Pears(KR)009_CELMA_Guide_quality_criteria_LED_luminaires_performance_Sept2011_FINAL.pdf
http://www.gloucestershireorchardgroup.org.uk/native_apples_of_gloucestershire.pdf
http://catdir.loc.gov/catdir/samples/cam033/2002031549.pdf
http://www.spectralcameras.com/files/Applications/Renfu_Lu_-_Detection_of_bruises_in_apples_-_iet595-proof.pdf
http://dixie.ifas.ufl.edu/pdfs/gardening/apples.pdf
http://www.usitc.gov/publications/332/ITS_4.pdf
http://www.gardenworks.ca/sites/gardenworks/files/caresheets/apples.pdf
http://www.foodroutes.org/doclib/211/Apples.pdf
http://www.nelsonirrigation.com/media/accessories/Apple_PP_508.pdf
http://www.dfaofca.com/Downloadables/DRIED/APPLES.PDF
http://www.schoolnutritionandfitness.com/data/pdf/teacherCenter/Apples.pdf
http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/us_managing_bad_apple_100909.pdf

Link to post
Share on other sites

I'm not getting all the URLs I need. Here are the ones that regex snippet doesn't get:

 

http://www.worldtradelaw.net/reports/wtoab/japan-apples(ab).pdf
http://www.worldtradelaw.net/reports/wtopanels/japan-apples(panel).pdf
http://www.worldtradelaw.net/reports/wtopanelsfull/japan-apples(panel)(full).pdf
http://www.botany.wisc.edu/courses/botany_940/06CropEvol/papers/Harris%2602.pdf
http://skagit.wsu.edu/FAM/publications/apples%2003%20NEW.UPDATED.pdf
http://www.celma.org/archives/temp/CELMA_TF_Apples_Pears(KR)009_CELMA_Guide_quality_criteria_LED_luminaires_performance_Sept2011_FINAL.pdf
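
One way to see what these six have in common is to strip everything the character class accepts from each file name and look at what is left over. This is a minimal Python sketch (not UBot script); `unexpected_chars` is a hypothetical helper, and `ALLOWED` is just the class from the snippet above written out on its own.

```python
import re

# The second attempt (with the digit class added), unchanged.
ATTEMPT = re.compile(r"([/])([A-Za-z _-]|%20|[0-9])+\w+(.pdf)")

# Everything that attempt accepts inside a file name.
ALLOWED = re.compile(r"[A-Za-z0-9 _-]|%20")

missed = [
    "http://www.worldtradelaw.net/reports/wtoab/japan-apples(ab).pdf",
    "http://www.worldtradelaw.net/reports/wtopanels/japan-apples(panel).pdf",
    "http://www.worldtradelaw.net/reports/wtopanelsfull/japan-apples(panel)(full).pdf",
    "http://www.botany.wisc.edu/courses/botany_940/06CropEvol/papers/Harris%2602.pdf",
    "http://skagit.wsu.edu/FAM/publications/apples%2003%20NEW.UPDATED.pdf",
    "http://www.celma.org/archives/temp/CELMA_TF_Apples_Pears(KR)009_CELMA_Guide_quality_criteria_LED_luminaires_performance_Sept2011_FINAL.pdf",
]


def unexpected_chars(url):
    """Characters in the file name (minus its 4-character extension)
    that the attempted pattern has no way to match."""
    stem = url.rsplit("/", 1)[-1][:-4]  # assumes the URL ends in ".pdf"
    leftover = ALLOWED.sub("", stem)
    return sorted(set(leftover))


for url in missed:
    print(ATTEMPT.search(url) is not None, unexpected_chars(url))
```

The leftovers are parentheses, a bare % (from %26, which is not %20) and an extra dot inside the name, so each new symbol would need its own addition to the class. Matching "anything that is not a slash" avoids maintaining that list.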

 

Appreciate a bit of guidance here

 

Cheers!

Link to post
Share on other sites

A Sample snippet of the code I'm using:

 

clear list(%urls)
clear list(%urls2)
navigate("http://www.google.com/#hl=en&output=search&sclient=psy-ab&q=apples+filetype:pdf&oq=apples+filetype:pdf&gs_l=hp.3...1142.5691.0.5857.21.18.1.2.2.3.494.3874.0j10j2j0j5.17.0...0.0...1c.vp9scej1IPY&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.&fp=63fbb00a90edef7f&biw=1118&bih=930", "Wait")
add list to list(%urls, $scrape attribute(<href=w"*.pdf">, "href"), "Don\'t Delete", "Global")
add list to list(%urls2, $find regular expression(%urls, "([/])([A-Za-z _-]|%20|[0-9])+\\w+(.pdf)"), "Don\'t Delete", "Global")
save to file("{$special folder("Desktop")}\\noregex.txt", %urls)
save to file("{$special folder("Desktop")}\\regexadded.txt", %urls2)

Link to post
Share on other sites

hey man,

 

I only started learning today! It's quite easy to follow. By easy I mean getting a grasp of the basics... I'm clearly stuck here though on this issue.

 

 

Here are the videos I watched (thanks to this thread http://www.ubotstudio.com/forum/index.php?/topic/6489-regex-101-and-beyond/page__view__findpost__p__31259):

 

(three parts to it)

 

And this one:

(I downloaded the editpad pro trial as Frank suggests in that video - REALLY helpful.)

Link to post
Share on other sites

Hi,

 

Sample code:

set(#urllist, "http://skagit.wsu.edu/FAM/publications/apples%2003%20NEW.UPDATED.pdf
http://nysipm.cornell.edu/organic_guide/apples.pdf
http://www.cals.ncsu.edu/entomology/apiculture/pdfs/3.03%20copy.pdf
http://www.uspirg.org/sites/pirg/files/reports/Apples-to-Twinkies-web-vUS.pdf
http://www.worldtradelaw.net/reports/wtoab/japan-apples(ab).pdf
http://www.worldtradelaw.net/reports/wtopanels/japan-apples(panel).pdf
http://www.worldtradelaw.net/reports/wtopanelsfull/japan-apples(panel)(full).pdf
http://www.botany.wisc.edu/courses/botany_940/06CropEvol/papers/Harris%2602.pdf
http://fruit.wisc.edu/wp-content/uploads/2011/06/Watercore-of-Apple.pdf
http://fruit.wisc.edu/wp-content/uploads/2011/06/When-are-Apple-Ripe.pdf
http://theutahhouse.org/files/uploads/Preserving%20apples.pdf
http://fruit.cfans.umn.edu/apples/starch-iodine.PDF
http://sitemaker.umich.edu/bajacob/files/cheating.pdf
http://orchard.uvm.edu/uvmapple/hort/AppleHortBasics/Readings/fertilizing_apple_trees.pdf
http://orchard.uvm.edu/uvmapple/hort/AppleHortBasics/Readings/pgrs.pdf
http://web.econ.ku.dk/Nguyen/teaching/hummels%20skiba.pdf
http://ag.udel.edu/extension/horticulture/pdf/hg/hg-21.pdf
http://www.cals.uidaho.edu/edcomm/pdf/CIS/CIS1090.pdf
http://www.cals.uidaho.edu/edcomm/pdf/BUL/BUL0820.pdf
http://www.extension.uidaho.edu/mgse/Fact%20Sheets/Harvesting%20Apples.pdf
http://www.tlsbooks.com/appletreebook.pdf
http://pods.dasnr.okstate.edu/docushare/dsweb/Get/Document-1039/F-6210web.pdf
http://www.marinbehealthy.org/toolkits/Garden_of_Eatin_Toolkit/Food%20Based%20Modules/GoE_Apple_Module.pdf
http://ucce.ucdavis.edu/files/datastore/391-69.pdf
http://www.sigkdd.org/explorations/issues/12-1-2010-07/v12-1-p49-forman-sigkdd.pdf
http://ir.library.oregonstate.edu/xmlui/bitstream/handle/1957/17252/fs147.pdf
http://oregonstate.edu/dept/kbrec/sites/default/files/appleppt1.pdf
http://www.uaex.edu/Other_Areas/publications/PDF/FSA-7538.pdf
http://www.uaex.edu/Other_Areas/publications/pdf/FSA-6058.pdf
http://www.fns.usda.gov/fdd/facts/hhpfacts/New_HHPFacts/Fruits/HHFS_APPLES_FRESH_F510-515_Final.pdf
http://www.ba.ars.usda.gov/hb66/027apple.pdf
http://www.rma.usda.gov/fields/il_rso/2012/apple.pdf
http://www.dpi.nsw.gov.au/__data/assets/pdf_file/0010/120142/bitter-pit-apple.pdf
http://www.curriculumsupport.education.nsw.gov.au/secondary/science/assets/aifst/Experiments/apple_browning.pdf
http://www.dpi.nsw.gov.au/__data/assets/pdf_file/0016/40084/Watercore_of_apples-primefact49.pdf
http://www.dpi.nsw.gov.au/__data/assets/pdf_file/0004/227362/Growing-cider-apples.pdf
http://www.hort.purdue.edu/newcrop/pri/chapter.pdf
http://www.ipmcenters.org/cropprofiles/docs/vaapples.pdf
http://www.uky.edu/Ag/NewCrops/introsheets/apples.pdf
http://www.superteacherworksheets.com/reading-comp/1st-apple-poem.pdf
http://www.ag.ndsu.edu/pubs/plantsci/hortcrop/h1547.pdf
http://www.agintheclassroom.org/TeacherResources/AgMags/Apple%20Ag%20Mag%20_SmartBoard.pdf
http://www.jhortscib.com/isafruit/isa_pp034_041.pdf
http://www.harvestofthemonth.cdph.ca.gov/download/Fall/Apples/Apples_Fami.pdf
http://www.sde.idaho.gov/site/cnp/ffvp/fruit_veg/Apple.pdf
http://wusf.usf.edu/pdf/Cooks_Country/CC_BestBakedApples.pdf
http://www.alfalaval.com/industries/food-dairy-beverages/Documents/Juicy%20apples.pdf
http://www.celma.org/archives/temp/CELMA_TF_Apples_Pears(KR)009_CELMA_Guide_quality_criteria_LED_luminaires_performance_Sept2011_FINAL.pdf
http://www.gloucestershireorchardgroup.org.uk/native_apples_of_gloucestershire.pdf
http://catdir.loc.gov/catdir/samples/cam033/2002031549.pdf
http://www.spectralcameras.com/files/Applications/Renfu_Lu_-_Detection_of_bruises_in_apples_-_iet595-proof.pdf
http://dixie.ifas.ufl.edu/pdfs/gardening/apples.pdf
http://www.usitc.gov/publications/332/ITS_4.pdf
http://www.gardenworks.ca/sites/gardenworks/files/caresheets/apples.pdf
http://www.foodroutes.org/doclib/211/Apples.pdf
http://www.nelsonirrigation.com/media/accessories/Apple_PP_508.pdf
http://www.dfaofca.com/Downloadables/DRIED/APPLES.PDF
http://www.schoolnutritionandfitness.com/data/pdf/teacherCenter/Apples.pdf
http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/us_managing_bad_apple_100909.pdf", "Global")
clear list(%urls)
add list to list(%urls, $list from text(#urllist, $new line), "Delete", "Global")
loop($list total(%urls)) {
   if($comparison($list position(%urls), "<", $list total(%urls))) {
       then {
           set(#pdfurlitem, $next list item(%urls), "Global")
           set(#pdfbookregex, $replace regular expression(#pdfurlitem, ".*\\/", $nothing), "Global")
           clear list(%urlbreakdown)
           add list to list(%urlbreakdown, $list from text(#pdfurlitem, "/"), "Delete", "Global")
           set(#pdfbook, $list item(%urlbreakdown, $subtract($list total(%urlbreakdown), 1)), "Global")
           load html("<html>
<head></head>
<body>
PDFBook via list: {#pdfbook}
<br>PDFBook via Regex: {#pdfbookregex}
</body>
</html>")
           wait(3)
       }
       else {
       }
   }
}
clear list(%urlbreakdown)
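
For anyone who wants to check the two approaches outside UBot, here is a rough Python equivalent (a sketch, not a translation of the script above; the function names are made up for illustration). One function deletes everything up to and including the last slash with a regex replace, the other splits on / and takes the last item, and both should agree.

```python
import re


def name_via_regex(url):
    # Greedy ".*/" consumes everything up to and including the last slash,
    # so only the file name is left after the replacement.
    return re.sub(r".*/", "", url)


def name_via_split(url):
    parts = url.split("/")
    # Zero-based indexing: the last item sits at position total - 1.
    return parts[len(parts) - 1]


samples = [
    "http://skagit.wsu.edu/FAM/publications/apples%2003%20NEW.UPDATED.pdf",
    "http://www.worldtradelaw.net/reports/wtopanelsfull/japan-apples(panel)(full).pdf",
    "http://www.dfaofca.com/Downloadables/DRIED/APPLES.PDF",
]

for url in samples:
    print(name_via_regex(url), "|", name_via_split(url))
```

Neither function looks at the characters inside the file name, which is why the names with parentheses, dots and %26 come through unchanged.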

sample-url-pdf-001.ubot

 

Kevin

Link to post
Share on other sites

Hi Kevin

 

This code does pretty much all I need, apart from missing the few URLs I listed above:

 

clear list(%urls)
clear list(%urls2)
navigate("http://www.google.com/#hl=en&output=search&sclient=psy-ab&q=apples+filetype:pdf&oq=apples+filetype:pdf&gs_l=hp.3...1142.5691.0.5857.21.18.1.2.2.3.494.3874.0j10j2j0j5.17.0...0.0...1c.vp9scej1IPY&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.&fp=63fbb00a90edef7f&biw=1118&bih=930", "Wait")
add list to list(%urls, $scrape attribute(<href=w"*.pdf">, "href"), "Delete", "Global")
add list to list(%urls2, $find regular expression(%urls, "([/])([A-Za-z _-]|%20|[0-9])+\\w+(.pdf)"), "Delete", "Global")
save to file("C:\\Users\\Kevin\\Desktop\\regext.txt", %urls)
save to file("C:\\Users\\Kevin\\Desktop\\regext2.txt", %urls2)

 

 

 

I am just stuck with it missing a few of the 100 results (I suggest first going to google.com and setting your preferences to bring back 100 results).

 

Thanks,

Kevin

Link to post
Share on other sites

Why not make a function that clears a list and then does add list to list with list from text, passing the URL as the text and / as the delimiter?

This loads the URL as the content and splits it at every /.

Then grab the last list item, which is the file name. List positions start at 0, so the last item is at list total minus 1.

Link to post
Share on other sites

What I ultimately want to do is download those PDF files and then give each downloaded PDF the same name as its original file name.

 

TJ what would happen if the URL had more than one / in it?

 

Cheers

Link to post
Share on other sites

It wouldn't matter how many / slashes are in the URL, as you're picking up the last item of the breakdown list each time for the name.

 

set(#download url, "http://www.ubotstudio.com/resources/test.pdf", "Global")
clear list(%break down)
add list to list(%break down, $list from text(#download url, "/"), "Delete", "Global")
download file(#download url, "C:\\Users\\Tj Development\\Desktop\\{$list item(%break down, $subtract($list total(%break down), 1))}")
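
The same idea sketched in Python, for comparison only (this is not UBot script). It builds the destination path from the last part of the URL and does not download anything. The folder name "downloads", the example.com URL and the helper `destination_for` are assumptions for illustration. It uses `urlsplit` so that a query string after the file name, if there is one, does not end up in the saved name; none of the URLs in this thread have one.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def destination_for(url, folder):
    """Build 'folder/<original file name>' for a download URL."""
    # urlsplit separates the path from any ?query or #fragment part.
    path = urlsplit(url).path
    # .name is the last path component, however many slashes precede it.
    name = PurePosixPath(path).name
    return str(PurePosixPath(folder) / name)


print(destination_for("http://www.ubotstudio.com/resources/test.pdf", "downloads"))
# Hypothetical URL with a query string, to show it is left out of the name.
print(destination_for("http://example.com/files/report.pdf?download=1", "downloads"))
```

The name is kept exactly as it appears in the URL, so a file such as 38_Apple%20Production.pdf keeps its %20. The resulting path could then be passed to something like `urllib.request.urlretrieve(url, path)` to do the actual download.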

Link to post
Share on other sites

Hi Kevin,

 

I took your suggestion. I ran it on google search for 100 results.

 

With both of the methods I used to get the PDF book names, I got 100 PDF book name matches from the search result URLs.

 

Kevin

Link to post
Share on other sites


This actually worked really well for me, thanks TJ.

 

Guys thanks also for your solutions too, I appreciate it.

Link to post
Share on other sites
