UBot Underground

CHALLENGE: Create a snippet that Filters Lists by keyword in Phrase Match code for the benefit of the Ubot community



Hi Guys

 

I have noticed that there are many questions in the forum that deal with

1) list exceeded errors

2) removing words, urls, phrases from lists.

 

I have personally struggled with these basic list management challenges.

 

So I ran an experiment and wrote a code block as a sample to start with. My attempt failed dismally, so I am hoping the experts can add their valued input to this thread to get a working example that does what it is supposed to.

 

1) I wanted to manage the "list exceeded challenge"

2) I wanted my phrase match to be accurate.

 

The challenge here is to refine the code sample based on my findings, so that we have a standard block of code that can be used and modified.

 

In the experiment I wanted to remove from a list all urls that did not contain the phrase "content-nation", "content-curation", "curation".

 

In 4 examples I discovered that something weird was going on, and I am thinking I may not clearly understand what the CONTAINS command does. I would like some feedback from the pros on the findings and on possible enhancements to the code, so the members of the forum can use these blocks of code for their bots.

 

Hopefully we get some great samples and comments.

 

So the bot is attached and here is the source to look at:

 

ui text box("Search Parameter", #SearchStringDashed)
set(#SearchStringDashed, $trim(#SearchStringDashed), "Global")
set(#SearchStringDashed, $change text casing(#SearchStringDashed, "Lower Case"), "Global")
ui stat monitor("Original List count: ", #UrlsCountoriginal)
ui stat monitor("Processed List: ", $list total(%Urls))
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
comment("grab urls for questions")
add list to list(%Urls, $list from text(#Urls, $new line), "Delete", "Global")
set(#UrlsCountoriginal, $list total(%Urls), "Global")
set list position(%Urls, 0)
set(#FirststLevelQ, 0, "Global")
set(#FirststLevelQ_total, $list total(%Urls), "Global")
loop(#FirststLevelQ_total) {
   if($comparison(#FirststLevelQ_total, ">", #FirststLevelQ)) {
       then {
           if($contains($change text casing($list item(%Urls, #FirststLevelQ), "Lower Case"), #SearchStringDashed)) {
               then {
               }
               else {
                   remove from list(%Urls, #FirststLevelQ)
                   decrement(#FirststLevelQ)
                   decrement(#FirststLevelQ_total)
               }
           }
           increment(#FirststLevelQ)
           if($contains($list item(%Urls, #FirststLevelQ), "#")) {
               then {
                   remove from list(%Urls, #FirststLevelQ)
                   decrement(#FirststLevelQ)
                   decrement(#FirststLevelQ_total)
               }
               else {
               }
           }
           increment(#FirststLevelQ)
       }
       else {
       }
   }
}
save to file("{$special folder("Desktop")}\\{#SearchStringDashed}-results.txt", %Urls)
alert("Completed!")
stop script

 

In this experiment I am looking for urls that contain the keywords as a phrase match, not a broad match.

I have encoded 30 items in the list.

I have added random # to represent odd characters we may want to remove from the list as well.

The search parameter must use a - (dash) because the urls have dashes between the words, and we are looking for the urls that contain the phrase.

 

Experiment 1: search on "content-curation" - you can see the original list in source provided

 

It returned 19 results.

Findings:

 

1) all results had either "content" or "curation" - not desired result

2) all the "#" signs were removed - completed successfully

3) the following urls should most definitely not be present

 

/National-Geographic-1

/United-Nations

/The-National-band

/National-Public-Radio

/Washington-Nationals

 

Experiment 2: search on "curation-nation"

 

returned 14 results

 

1) all results had either "curation" or "Nation" - not desired result

2) all the "#" signs were removed - completed successfully

3) the entire list does not hold a url that has the phrase "curation-nation" in the url

 

Experiment 3: search on "curation"

 

returned 24 results

 

1) all results had either "curation" or "Nation" - not desired result

2) all the "#" signs were removed - completed successfully

3) the following do not hold a url that has the word "curation" in the url

 

/National-Geographic-1

/Washington-Nationals

/National-Football

/Who-curates-the-curators

/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum

/Nations

/Bling-Nation

 

Experiment 4: search on "blah-blah"

 

returned 15 results

 

1) no results should have been returned. 15 were returned which have no mention of the word "blah-blah" in the url.

 

/Can-content-curation-become-mainstream

/National-Geographic-1

/What-are-the-best-content-curation-tools-for-daily-use

/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011

/Data-Curation

/search?q=curation+nation&context_type=&context_id=

/Social-Curation

/Who-curates-the-curators

/search?q=curation&context_type=&context_id=

/Nations

/When-does-content-curation-become-content-creation

/search?q=content+curation&context_type=&context_id=

/United-Nations

/Live-Nation

/Content-Curator

 

It would be great if you guys who are seasoned programmers could look at this code and fill in the gaps so we can see the proper way to do this.

 

Monkey see - monkey do!! haha

 

Your feedback would be appreciated as I believe this will help many of us.

 

Thanks upfront

List-filter-by-keyword.ubot

content-curation-results.txt

curation-nation-results.txt

curation-results.txt

blah-blah-results.txt

Link to post
Share on other sites

No, not at all. I just wanted to let you know I was aware and planning on helping, that's all. So, in running and looking at the script, I am wondering if the keyword is being treated as a math problem due to the "-" sign. This could be done easily using the find regex command; however, that would eliminate the ability to use it with a user-inputted ui text box (as it would have to transform the input into proper regex strings).
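Transforming free text into a safe regex is usually a matter of escaping it before it goes into the pattern. The sketch below shows the idea in Python rather than UBot script; `phrase_pattern` is a hypothetical helper, and whether UBot offers an equivalent escaping function is not something this thread establishes.

```python
import re


def phrase_pattern(user_input):
    """Turn raw text-box input into a case-insensitive literal pattern."""
    phrase = user_input.strip()
    # re.escape neutralises characters such as "+", "?" and "."
    # so the input is matched literally, not interpreted as regex syntax.
    return re.compile(re.escape(phrase), re.IGNORECASE)


pattern = phrase_pattern(" content-curation ")
candidates = ["/Content-Curation-1", "/Data-Curation"]
hits = [u for u in candidates if pattern.search(u)]
print(hits)
```

Escaping matters most for input like `q=a+b`, where an unescaped `+` would be read as a quantifier instead of a plus sign.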

 

John

 

 

 

Link to post
Share on other sites

In the ideal situation the script will use a keyword in phrase match and return the urls.

As the urls use a "-" between the words and we need to find our keyword phrase in the url, I can create the #SearchStringDashed and have the - placed where needed in the keyword being searched within the script, without the need for the text box input.

I did the example to demonstrate the weird results I was seeing.

 

Could your regex use this variable #SearchStringDashed?

 

All we want to do is only keep the urls that have the keyword phrase "keyword-phrase" in the urls.

The rest of the urls can be removed.
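As a language-neutral illustration of that requirement, here is a minimal Python sketch (not UBot script) that builds a new list of matches instead of removing items from the list while walking it. `keep_phrase_matches` is a made-up name, the sample data is the list from the first post, and returning nothing for an empty search is an assumption of this sketch.

```python
SAMPLE = """/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#"""


def keep_phrase_matches(urls, phrase):
    """Return a new list holding only the urls that contain the phrase."""
    needle = phrase.strip().lower()
    if not needle:
        return []  # assumption: an empty search keeps nothing
    # Case-insensitive substring test; "#" lines never match a real phrase.
    return [u for u in urls if needle in u.lower()]


urls = SAMPLE.split("\n")
kept = keep_phrase_matches(urls, "content-curation")
print(len(kept))  # 13 when duplicates are kept
```

Because nothing is deleted from the list being walked, there is no index bookkeeping to get wrong.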

 

I would love to see what you do with regex

 

Thanks for the contribution

Link to post
Share on other sites

Here is a sample of what I have done to dynamically insert the dash into the keyword.

Would this work with your idea?

 

comment("Take seed keywords and build up list of urls")
set(#SearchString, "{$next list item(%Keywords)} ", "Global")
set(#SearchString, $trim(#SearchString), "Global")
set(#SearchString, $change text casing(#SearchString, "Lower Case"), "Global")
set(#SearchStringDashed, $replace(#SearchString, " ", "-"), "Global")
set(#SearchStringDashed, $trim(#SearchStringDashed), "Global")
set(#SearchStringDashed, $change text casing(#SearchStringDashed, "Lower Case"), "Global")
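For comparison, the same normalisation can be sketched in a few lines of Python (again, not UBot script). `to_dashed` is a hypothetical helper. One deliberate difference from the snippet above: runs of several spaces collapse into a single dash here, whereas replacing each space individually would produce one dash per space.

```python
def to_dashed(keyword):
    """Trim, lower-case, and join the words of a keyword with dashes."""
    # str.split() with no argument splits on any run of whitespace,
    # so leading, trailing, and repeated spaces all disappear.
    words = keyword.strip().lower().split()
    return "-".join(words)


print(to_dashed("  Content   Curation "))
```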

Link to post
Share on other sites

This gets 26 results, is that correct?

 

clear list(%Urls)
ui text box("Search Parameter", #SearchStringDashed)
set(#SearchStringDashed, $trim(#SearchStringDashed), "Global")
set(#SearchStringDashed, $change text casing(#SearchStringDashed, "Lower Case"), "Global")
ui stat monitor("Original List count: ", #UrlsCountoriginal)
ui stat monitor("Processed List: ", $list total(%Urls))
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
comment("grab urls for questions")
add list to list(%Urls, $list from text(#Urls, $new line), "Delete", "Global")
set(#UrlsCountoriginal, $list total(%Urls), "Global")
set list position(%Urls, 0)
set(#FirststLevelQ, 0, "Global")
set(#FirststLevelQ_total, $list total(%Urls), "Global")
loop(#FirststLevelQ_total) {
   if($comparison(#FirststLevelQ_total, ">", #FirststLevelQ)) {
       then {
           if($contains($change text casing($list item(%Urls, #FirststLevelQ), "Lower Case"), $find regular expression($list item(%Urls, #FirststLevelQ), ".*content-curation.*"))) {
               then {
               }
               else {
                   remove from list(%Urls, #FirststLevelQ)
                   decrement(#FirststLevelQ)
                   decrement(#FirststLevelQ_total)
               }
           }
           increment(#FirststLevelQ)
           if($contains($list item(%Urls, #FirststLevelQ), "#")) {
               then {
                   remove from list(%Urls, #FirststLevelQ)
                   decrement(#FirststLevelQ)
                   decrement(#FirststLevelQ_total)
               }
               else {
               }
           }
           increment(#FirststLevelQ)
       }
       else {
       }
   }
}
save to file("{$special folder("Desktop")}\\{#SearchStringDashed}-results.txt", %Urls)
alert("Completed!")
stop script

John

Link to post
Share on other sites

Hi John

There are still urls that are not required.

See on your desktop -results.txt

 

The following urls should be removed as they do not contain the phrase "content-curation"

 

/Curation

/Digital-Curation

/Social-Curation

/Data-Curation

/Store-Curation

/Who-curates-the-curators

/search?q=curation&context_type=&context_id=

/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum

/National-Geographic-1

/National-Football

/United-Nations

/Nations

/The-National-band

/Live-Nation

/National-Public-Radio

/Bling-Nation

/Washington-Nationals

/search?q=curation+nation&context_type=&context_id=

/Content-Curator

/search?q=content+curation&context_type=&context_id=

 

 

 

The bot should return 13 records to be correct.

It looks like maybe the command does not understand what we want to do.

In my tests I pressed the quotes button to force that it is a string that must be parsed, and you have done the same in your regex sample.

 

Here are the results that we should be left with. Lucky number 13

 

 

/Content-Curation-1

/Can-content-curation-become-mainstream

/Web-Content-Curation-Applications

/Web-Content-Curation-Startups

/Content-Curation-1

/Can-content-curation-become-mainstream

/Web-Content-Curation-Applications

/Web-Content-Curation-Startups

/What-are-the-best-content-curation-tools-for-daily-use

/When-does-content-curation-become-content-creation

/What-is-web-content-curation

/Web-Content-Curation/Can-you-make-money-curating

/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011

Link to post
Share on other sites

I have tried a few regexes that work in regexbuddy and deliver the correct result of 13 items found:

 

Ruby format & .NET: \b.*content-curation.*\b(?i)

 

However, in UBot it fails to deliver the 13 required urls.

 

$find regular expression($list item(%Urls, #FirststLevelQ), "\\b.*{#SearchStringDashed}.*\\b(?i)")

 

Wonder if this is a bug?
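One thing that may be worth ruling out before calling it a bug: in several regex flavors an inline `(?i)` only affects the part of the pattern that comes after it, so a trailing `(?i)` would leave the phrase itself case-sensitive. Whether that is what happens inside UBot is an assumption here, not something verified. The Python sketch below (not UBot script) only shows how much difference the case flag makes for one of the sample urls.

```python
import re

url = "/Web-Content-Curation-Startups"
phrase = "content-curation"

# No flag: the lower-case phrase does not match the mixed-case url.
case_sensitive = re.search(re.escape(phrase), url)

# Flag placed before the phrase: the match succeeds.
flag_first = re.search("(?i)" + re.escape(phrase), url)

# Same effect with the flag passed as an argument instead of inline.
flag_arg = re.search(re.escape(phrase), url, re.IGNORECASE)

print(case_sensitive, flag_first.group(0), flag_arg.group(0))
```

If the flag position turns out to be the cause, moving `(?i)` to the front of the pattern, or lower-casing both the url and the phrase before matching, would be the two obvious things to try.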

Link to post
Share on other sites

Has anyone ever had issues with words that have dashes in them: "keyword-phrase"?

 

How have any of you cleaned lists with phrase match keywords that have dashes in them?

 

It seems no one has an answer :(

Link to post
Share on other sites

Here's my take on a solution:

 

Note: I'm not sending the output anywhere, just displaying in the UI.

 

ui text box("Search Parameter", #SearchString)
ui stat monitor("Original List count: ", #UrlsCount)
ui stat monitor("Filtered List count: ", #FilteredUrlsCount)
ui stat monitor("RegexPassed", #RegexPassed)
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
set(#UrlsCount, $list total($list from text(#Urls, $new line)), "Global")
set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
set(#FilteredUrlsCount, $list total($list from text(#RegexPassed, $new line)), "Global")
alert("Completed!")
stop script
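The idea in the script above is to run one regex over the whole block of text and let it return the matching lines, rather than looping over a list. A rough Python equivalent (not UBot script) is sketched below; `matching_lines` is a made-up name, the search phrase is escaped as an extra precaution that the script above does not take, and the sketch assumes lines are separated by a plain "\n".

```python
import re


def matching_lines(text, phrase):
    """Return every whole line of text that contains the phrase."""
    # MULTILINE makes ^ and $ anchor at each line; "." never crosses
    # a newline, so each match is exactly one line.
    pattern = re.compile(
        r"^.*" + re.escape(phrase) + r".*$",
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern.findall(text)


text = "\n".join([
    "/Curation",
    "/Content-Curation-1",
    "#",
    "/What-is-web-content-curation",
    "/search?q=content+curation&context_type=&context_id=",
])
found = matching_lines(text, "content-curation")
print(len(found), found)
```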

Link to post
Share on other sites

Thought I'd do a quick speed test of the regex and while I was at it compare using javascript and Ubot native commands to count the number of lines in the blocks of text.

 

loop(1000) {
   set(#UrlsCount, $eval("\"{$replace regular expression(#Urls, $new line, "\\r\\n")}\".split(/\r\n/).length;"), "Global")
   set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
   set(#FilteredUrlsCount, $eval("\"{$replace regular expression(#RegexPassed, $new line, "\\r\\n")}\".split(/\r\n/).length;"), "Global")
}

 

Versus

 

loop(1000) {
   set(#UrlsCount, $list total($list from text(#Urls, $new line)), "Global")
   set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
   set(#FilteredUrlsCount, $list total($list from text(#RegexPassed, $new line)), "Global")
}

 

The Ubot native commands won. 1000 iterations in 130 secs using javascript versus 2.8 secs with Ubot native commands. :P
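For anyone who wants to repeat this kind of comparison outside UBot, a small timing harness is enough. The Python sketch below is only an illustration of the method: `time_it` is a hypothetical helper, the two line-counting approaches are stand-ins, and its timings say nothing about the UBot figures quoted above.

```python
import re
import time


def time_it(fn, repeats=1000):
    """Call fn repeatedly; return its last result and the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn()
    return result, time.perf_counter() - start


text = "\n".join(["/Content-Curation-1", "/Data-Curation", "#"] * 10)

# Two ways of counting lines: a plain string split and a regex split.
count_a, secs_a = time_it(lambda: len(text.split("\n")))
count_b, secs_b = time_it(lambda: len(re.split(r"\r?\n", text)))

print(f"str.split: {secs_a:.4f}s  re.split: {secs_b:.4f}s")
```

Checking that both approaches return the same count before comparing their speed avoids timing two things that are not actually equivalent.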

 

 

Here's the winning version. Maybe someone can improve on the execution speed.

 

set(#StartTime, $eval("new Date().getTime();"), "Global")
ui text box("Search Parameter", #SearchString)
ui stat monitor("Original List count: ", #UrlsCount)
ui stat monitor("Filtered List count: ", #FilteredUrlsCount)
ui stat monitor("RegexPassed", #RegexPassed)
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
loop(1000) {
   set(#UrlsCount, $list total($list from text(#Urls, $new line)), "Global")
   set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
   set(#FilteredUrlsCount, $list total($list from text(#RegexPassed, $new line)), "Global")
}
set(#EndTime, $eval("new Date().getTime();"), "Global")
alert("Completed in {$eval($divide($subtract(#EndTime, #StartTime), 1000))} secs")
stop script

Link to post
Share on other sites

Here ya go!

 

mdc101.ubot

 

Here is what you did wrong...imho.

 

If you wanted all of the URLs lower case then do that before adding to a List.

Remove items you don't want in the list before adding to the List.

If you know you are going to have duplicates and want them then make sure you have that advanced setting selected to keep the duplicates.
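Those three points can be sketched as a single pass that cleans each line before it ever reaches the list. This is a Python illustration (not UBot script, and not the contents of the attached bot); `build_list` and its `keep_duplicates` parameter are made-up names standing in for the advanced duplicate setting mentioned above.

```python
def build_list(raw_text, phrase, keep_duplicates=True):
    """Lower-case and filter each line first, then add it to the list."""
    needle = phrase.strip().lower()
    result = []
    seen = set()
    for line in raw_text.splitlines():
        item = line.strip().lower()       # lower-case before adding
        if not item or needle not in item:
            continue                      # drop unwanted items before adding
        if not keep_duplicates and item in seen:
            continue                      # optional duplicate removal
        seen.add(item)
        result.append(item)
    return result


raw = "/Content-Curation-1\n#\n/Nations\n/content-curation-1\n"
with_dupes = build_list(raw, "content-curation")
without_dupes = build_list(raw, "content-curation", keep_duplicates=False)
print(with_dupes, without_dupes)
```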

 

Anyway, I hope this helps you!

 

Buddy

Link to post
Share on other sites
