Brutal 164 Posted October 2, 2013
Hi guys, I'm building a bot that scrapes a Google results page and puts all of the links on the page into a list, then lets the user provide a list of their own URLs and keywords. I want the bot to compare the two lists (the scraped URLs against the user-provided URLs) to see #1, whether any of the user URLs are in the scraped list, and #2, the position of each user URL within that list. I just can't seem to get a grasp on it.

So if the Google search creates a list like this:
1- domain.com
2- otherdoman.com
3- moredomain.com
4- dodoman.com
5- roroman.com

and the user has provided a list of domains like this:
1- udomain1.com
2- dodoman.com
3- udomain3.com

then the bot should find dodoman.com and note that it is currently in the #4 position. Can anyone help me get on track with this?
the_way 52 Posted October 2, 2013
All you need is a function that returns the position of a URL within a list of URLs. Build that function first, then pass each of the user's inputs into it as a parameter. That's one way to do it.
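A minimal sketch of the lookup function described above, written in Python rather than uBot script because the attached .ubot files are not shown in the thread. The name `find_positions` and the lowercase/strip normalization are assumptions for illustration, not part of anyone's posted solution. It uses the example lists from the first post.

```python
def find_positions(scraped_urls, user_urls):
    """Map each user URL to its 1-based rank in scraped_urls, or None if absent."""
    # Build the lookup once; setdefault keeps the first (highest-ranked)
    # occurrence if the same URL appears more than once.
    rank_of = {}
    for rank, url in enumerate(scraped_urls, start=1):
        rank_of.setdefault(url.strip().lower(), rank)
    return {url: rank_of.get(url.strip().lower()) for url in user_urls}


scraped = ["domain.com", "otherdoman.com", "moredomain.com", "dodoman.com", "roroman.com"]
wanted = ["udomain1.com", "dodoman.com", "udomain3.com"]

positions = find_positions(scraped, wanted)
print(positions)
# {'udomain1.com': None, 'dodoman.com': 4, 'udomain3.com': None}
```

This compares entries as exact strings, so it only works when both lists hold the same form of the URL (bare domains here).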
Brutal 164 Posted October 2, 2013
Thanks man, I appreciate it.
Gogetta 263 Posted October 2, 2013
Here, try this: match lists.ubot
Brutal 164 Posted October 3, 2013
GoGetta, you're a Rock Star and I appreciate the help! It's not working for me as is, but I'll probably be able to tweak it a bit to get it moving in my direction. I honestly didn't know where to even start on it. Thanks again.
Brutal 164 Posted October 3, 2013
GoGetta - If you happen to come back to this page, could you give me details on the code below?

add list to list(%list_b, $find regular expression(#list_b, "[a-zA-Z0-9\\-]*\\.[a-zA-Z]\{2,4\}"), "Delete", "Global")

This appears to be the sticking point... Nothing is being added into list b.
Gogetta 263 Posted October 3, 2013
Yeah, that was to match only the root domain when comparing to the next list item in a. I'll be the first to admit that I am not too good with regex. I used the regex because I wasn't too sure if you wanted to match a domain even if the URL was a subpage. If it doesn't matter, you don't need to use regex for this. But take a look at this: http://www.regular-expressions.info/wordboundaries.html

When I tested it with the example you provided above, it worked. Here it is again without using regex, but list b can't contain any subpages or it won't match the current a item.

match lists.ubot
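For comparison, here is a hedged Python sketch of matching on the domain even when the scraped URL is a subpage, without a regex. It is not a translation of the attached uBot script. `host_of` and `rank_by_host` are hypothetical helper names; `urlsplit` is the standard-library parser. It compares full host names with a leading "www." stripped, which is an assumption: a subdomain such as blog.example.com would not be treated as the same site as example.com.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    # urlsplit only fills in the host when a scheme is present, so bare
    # entries like 'dodoman.com' get one prepended first.
    if "://" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def rank_by_host(scraped_urls, user_url):
    """1-based position of the first scraped URL on the same host, else None."""
    target = host_of(user_url)
    for rank, url in enumerate(scraped_urls, start=1):
        if host_of(url) == target:
            return rank
    return None


scraped = [
    "http://www.domain.com/",
    "http://otherdoman.com/page",
    "http://www.dodoman.com/mypage.html",
]
rank = rank_by_host(scraped, "dodoman.com")
print(rank)  # 3
```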
HelloInsomnia 1103 Posted October 3, 2013
Can you provide some example URLs that you are getting from Google? If I can see how they're formatted, I can make the regex for you.
Brutal 164 Posted October 3, 2013
Well, it's just the whole URL (http://www.mydomain.com/mypage.html). Basically I'm creating a bot that allows the user to enter his URL, and if it's found in Google, it will list the position it's found in on the SERP.
HelloInsomnia 1103 Posted October 3, 2013
Give this a try (edit: improved it a bit now):

http\:\/\/(www\.|)[\-\.\;\:\%\&\=\+\$\,\w+@]+[a-zA-Z\.]{2,4}+(\/[\-\.\#\?\;\:\%\&\=\+\$\,\w+@]+|)

I don't have uBot open though, so while it should work, let me know if it doesn't.
Brutal 164 Posted October 3, 2013
Didn't work for me, but I might just be doing it wrong. I pasted your string directly into the find-regular-expression box and got an error. I can't begin to tell you guys how much I appreciate all of the help you're giving.
HelloInsomnia 1103 Posted October 3, 2013
Okay, here it is in uBot-friendly mode:

http\:\/\/[\-\+\.\;\:\%\&\=\$\,\w\@]+[a-zA-Z\.]{2,4}(\/[\-\+\.\#\?\;\:\%\&\=\$\,\w\@]+|)
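To see what the uBot-friendly pattern above captures, here is a small Python sketch that applies it unchanged with the standard `re` module. This only illustrates the pattern's behavior in Python; uBot's regex engine may differ. `extract_urls` and the sample string are made up for the example.

```python
import re

# HelloInsomnia's pattern, copied unchanged into a raw string.
URL_PATTERN = re.compile(
    r"http\:\/\/[\-\+\.\;\:\%\&\=\$\,\w\@]+[a-zA-Z\.]{2,4}"
    r"(\/[\-\+\.\#\?\;\:\%\&\=\$\,\w\@]+|)"
)


def extract_urls(text):
    """Return every substring of text that the pattern matches, in order."""
    # finditer + group(0) yields whole matches; findall would return only
    # the contents of the capturing group (the path part).
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


sample = "1 http://www.mydomain.com/mypage.html 2 http://dodoman.com"
urls = extract_urls(sample)
print(urls)
# ['http://www.mydomain.com/mypage.html', 'http://dodoman.com']
```

Two things to keep in mind: the pattern only matches http:// (not https://), and the path class does not include "/", so only the first path segment after the domain is captured.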
Brutal 164 Posted October 3, 2013
Wow! Perfect and smooth! Thanks so much for all of the help. You guys went above and beyond, and I deeply appreciate it.
Gogetta - Thanks so much.
HelloInsomnia - Thank you man... This was really kicking my butt.