CHALLENGE: Create a snippet that Filters Lists by keyword in Phrase Match code for the benefit of the Ubot community

mdc101 · February 21, 2012

Hi Guys

I have noticed that there are many questions in the forum that deal with

1) list exceeded errors

2) removing words, urls, phrases from lists.

I have personally struggled with these basic list management challenges.

So I ran an experiment and wrote a sample code block as a starting point. My attempt failed dismally, so I am hoping the experts can add their valued input to this thread and help get to a working example that does what it is supposed to.

1) I wanted to manage the "list exceeded challenge"

2) I wanted my phrase match to be accurate.

The challenge here is to refine the code sample based on my findings, so that we have a standard block of code that can be used and modified.

In the experiment I wanted to remove all urls from a list that did not contain a given phrase, such as "content-curation", "curation-nation" or "curation".

In 4 experiments I discovered that something weird was going on, and I think I may not clearly understand what the CONTAINS command does. I would like some feedback from the pros on the findings and on possible enhancements to the code, so that members of the forum can use these blocks of code in their bots.

Hopefully we get some great samples and comments.

So the bot is attached and here is the source to look at:

ui text box("Search Parameter", #SearchStringDashed)
set(#SearchStringDashed, $trim(#SearchStringDashed), "Global")
set(#SearchStringDashed, $change text casing(#SearchStringDashed, "Lower Case"), "Global")
ui stat monitor("Original List count: ", #UrlsCountoriginal)
ui stat monitor("Processed List: ", $list total(%Urls))
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
comment("grab urls for questions")
add list to list(%Urls, $list from text(#Urls, $new line), "Delete", "Global")
set(#UrlsCountoriginal, $list total(%Urls), "Global")
set list position(%Urls, 0)
set(#FirststLevelQ, 0, "Global")
set(#FirststLevelQ_total, $list total(%Urls), "Global")
loop(#FirststLevelQ_total) {
   if($comparison(#FirststLevelQ_total, ">", #FirststLevelQ)) {
       then {
           if($contains($change text casing($list item(%Urls, #FirststLevelQ), "Lower Case"), #SearchStringDashed)) {
               then {
               }
               else {
                   remove from list(%Urls, #FirststLevelQ)
                   decrement(#FirststLevelQ)
                   decrement(#FirststLevelQ_total)
               }
           }
           increment(#FirststLevelQ)
           if($contains($list item(%Urls, #FirststLevelQ), "#")) {
               then {
                   remove from list(%Urls, #FirststLevelQ)
                   decrement(#FirststLevelQ)
                   decrement(#FirststLevelQ_total)
               }
               else {
               }
           }
           increment(#FirststLevelQ)
       }
       else {
       }
   }
}
save to file("{$special folder("Desktop")}\\{#SearchStringDashed}-results.txt", %Urls)
alert("Completed!")
stop script

In this experiment I am looking for urls that contain the keyword as a phrase match only, not as a broad match.

I have encoded 30 items in the list.

I have added random # to represent odd characters we may want to remove from the list as well.

The search parameter must use a - (dash), as the urls have dashes between the words and we are looking to find the urls where the phrase is in the url.
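
To make the phrase match versus broad match distinction concrete, here is a minimal sketch in Python (not UBot script). The function names phrase_match and broad_match are made up for illustration, and the sketch assumes that a phrase match means the whole dashed phrase appears somewhere in the url, ignoring case.

```python
def phrase_match(url, phrase):
    """True only if the whole dashed phrase occurs in the url (case-insensitive)."""
    return phrase.strip().lower() in url.lower()


def broad_match(url, phrase):
    """True if any single word of the dashed phrase occurs in the url."""
    words = [w for w in phrase.strip().lower().split("-") if w]
    return any(w in url.lower() for w in words)


urls = ["/Content-Curation-1", "/Data-Curation", "/United-Nations"]
phrase_hits = [u for u in urls if phrase_match(u, "content-curation")]
broad_hits = [u for u in urls if broad_match(u, "content-curation")]
print(phrase_hits)  # only the url that holds the full phrase
print(broad_hits)   # also includes urls that hold just one of the words
```

The goal in this thread is the first behaviour: "/Data-Curation" holds one of the words but not the phrase, so it should be dropped.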

Experiment 1: search on "content-curation" - you can see the original list in source provided

It returned 19 results.

Findings:

1) all results had either "content" or "curation" - not desired result

2) all the "#" signs were removed - completed successfully

3) the following urls should most definitely not be present

/National-Geographic-1

/United-Nations

/The-National-band

/National-Public-Radio

/Washington-Nationals

Experiment 2: search on "curation-nation"

returned 14 results

1) all results had either "curation" or "Nation" - not desired result

2) all the "#" signs were removed - completed successfully

3) the entire list does not hold a url that has the phrase "curation-nation" in the url

Experiment 3: search on "curation"

returned 24 results

1) all results had either "curation" or "Nation" - not desired result

2) all the "#" signs were removed - completed successfully

3) the following urls do not have the word "curation" in them

/National-Geographic-1

/Washington-Nationals

/National-Football

/Who-curates-the-curators

/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum

/Nations

/Bling-Nation

Experiment 4: search on "blah-blah"

returned 15 results

1) no results should have been returned. 15 were returned which have no mention of the word "blah-blah" in the url.

/Can-content-curation-become-mainstream

/National-Geographic-1

/What-are-the-best-content-curation-tools-for-daily-use

/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011

/Data-Curation

/search?q=curation+nation&context_type=&context_id=

/Social-Curation

/Who-curates-the-curators

/search?q=curation&context_type=&context_id=

/Nations

/When-does-content-curation-become-content-creation

/search?q=content+curation&context_type=&context_id=

/United-Nations

/Live-Nation

/Content-Curator
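
The 15 leftovers in Experiment 4 are half of the 30 items, which is what one would expect if the loop only tests every other item. A possible explanation, offered as a guess rather than a confirmed diagnosis: each pass of the loop calls increment(#FirststLevelQ) twice, and the "#" check looks at the item after the one that was just tested for the phrase. The Python model below (not UBot script) mimics that control flow. It assumes that removing an item shifts the later items down by one, and it adds a bounds guard at the point where UBot might raise a list exceeded error instead. The name filter_like_original is made up.

```python
def filter_like_original(urls, phrase):
    """Model of the posted loop: two increments per pass, '#' test on the next item."""
    items = list(urls)
    q = 0                      # plays the role of #FirststLevelQ
    total = len(items)         # plays the role of #FirststLevelQ_total
    for _ in range(len(urls)):
        if total > q:
            if phrase not in items[q].lower():
                del items[q]
                q -= 1
                total -= 1
            q += 1             # first increment
            # Guard added for the model; the posted code has no such check.
            if q < len(items) and "#" in items[q]:
                del items[q]
                q -= 1
                total -= 1
            q += 1             # second increment: the item at the old q is never phrase-tested
    return items


no_match = filter_like_original(["/a", "/b", "/c", "/d", "/e", "/f"], "blah-blah")
mixed = filter_like_original(
    ["/Content-Curation-1", "/Nations", "/United-Nations", "#", "/Live-Nation"],
    "content-curation",
)
print(no_match)  # every second item survives although nothing matches
print(mixed)     # "/Nations" survives because it was never tested for the phrase
```

If this model is right, the fix is to test each item exactly once, for example by doing both checks on the same index and incrementing once per pass.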

It would be great if you guys who are seasoned programmers could look at this code and fill in the gaps so we can see the proper way to do this.

Monkey see - monkey do!! haha

Your feedback would be appreciated as I believe this will help many of us.

Thanks upfront

List-filter-by-keyword.ubot

content-curation-results.txt

curation-nation-results.txt

curation-results.txt

blah-blah-results.txt

mdc101 · February 21, 2012

Is no one able to figure out or explain why I am getting all sorts of urls in the results instead of the exact phrase match?

JohnB · February 21, 2012

Hey mdc...

I apologize...I am doing 15 things at the moment, but I WILL get to this today...

John

mdc101 · February 21, 2012

Sorry John, I never meant to rush anyone. I saw a lot of views and no comments, so I figured most are in the same boat as me!!

JohnB · February 21, 2012

No, not at all. I just wanted to let you know I was aware and planning on helping, that's all. So, in running and looking at the script, I am wondering if the keyword is being treated as a math problem due to the "-" sign. This could be done easily using the find regex command; however, that would eliminate the ability to use it with a user-inputted ui text box (as it would have to transform the input into proper regex strings)...

John

mdc101 · February 21, 2012

In the ideal situation the script will use a keyword in phrase match and return the urls.

As the urls use a "-" between the words and we need to find our keyword phrase in the url, I can create the #SearchStringDashed variable and have the - placed where needed in the keyword being searched within the script, without the need for the text box input.

I did the example to demonstrate the weird results I was seeing.

Could your regex use this variable #SearchStringDashed?

All we want to do is only keep the urls that have the keyword phrase "keyword-phrase" in the urls.

The rest of the urls can be removed.

I would love to see what you do with regex

Thanks for the contribution

mdc101 · February 21, 2012

here is a sample of what I have done to dynamically insert the dash into the keyword...

Would this work with your idea?

comment("Take seed keywords and build up list of urls")
               set(#SearchString, "{$next list item(%Keywords)} ", "Global")
               set(#SearchString, $trim(#SearchString), "Global")
               set(#SearchString, $change text casing(#SearchString, "Lower Case"), "Global")
               set(#SearchStringDashed, $replace(#SearchString, " ", "-"), "Global")
               set(#SearchStringDashed, $trim(#SearchStringDashed), "Global")
               set(#SearchStringDashed, $change text casing(#SearchStringDashed, "Lower Case"), "Global")

JohnB · February 22, 2012

This gets 26 results, is that correct?

clear list(%Urls)
ui text box("Search Parameter", #SearchStringDashed)
set(#SearchStringDashed, $trim(#SearchStringDashed), "Global")
set(#SearchStringDashed, $change text casing(#SearchStringDashed, "Lower Case"), "Global")
ui stat monitor("Original List count: ", #UrlsCountoriginal)
ui stat monitor("Processed List: ", $list total(%Urls))
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
comment("grab urls for questions")
add list to list(%Urls, $list from text(#Urls, $new line), "Delete", "Global")
set(#UrlsCountoriginal, $list total(%Urls), "Global")
set list position(%Urls, 0)
set(#FirststLevelQ, 0, "Global")
set(#FirststLevelQ_total, $list total(%Urls), "Global")
loop(#FirststLevelQ_total) {
    if($comparison(#FirststLevelQ_total, ">", #FirststLevelQ)) {
        then {
            if($contains($change text casing($list item(%Urls, #FirststLevelQ), "Lower Case"), $find regular expression($list item(%Urls, #FirststLevelQ), ".*content-curation.*"))) {
                then {
                }
                else {
                    remove from list(%Urls, #FirststLevelQ)
                    decrement(#FirststLevelQ)
                    decrement(#FirststLevelQ_total)
                }
            }
            increment(#FirststLevelQ)
            if($contains($list item(%Urls, #FirststLevelQ), "#")) {
                then {
                    remove from list(%Urls, #FirststLevelQ)
                    decrement(#FirststLevelQ)
                    decrement(#FirststLevelQ_total)
                }
                else {
                }
            }
            increment(#FirststLevelQ)
        }
        else {
        }
    }
}
save to file("{$special folder("Desktop")}\\{#SearchStringDashed}-results.txt", %Urls)
alert("Completed!")
stop script

John
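
One possible reason this version still keeps unwanted urls, sketched in Python (not UBot script) and resting on three assumptions about UBot that the thread does not confirm: that $find regular expression is case-sensitive by default, that it returns empty text when nothing matches, and that $contains treats an empty search text as always present (as .NET's String.Contains does). The helper name keep_item is made up.

```python
import re


def keep_item(item, pattern=".*content-curation.*"):
    """Model of: contains(lowercased item, regex match found in the original item)."""
    match = re.search(pattern, item)           # case-sensitive, like the pattern as posted
    found = match.group(0) if match else ""    # assumption: no match gives empty text
    return found in item.lower()               # empty text is contained in any string


kept_nations = keep_item("/Nations")
kept_mixed_case = keep_item("/Can-content-curation-become-mainstream")
print(kept_nations)     # the url is kept although it does not match
print(kept_mixed_case)  # the url is dropped: the match keeps its capital "C", the haystack is lowercased
```

Under those assumptions the test is close to inverted: urls without the phrase pass, while matching urls with capital letters fail.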

mdc101 · February 22, 2012

Hi John

There are still urls that are not required.

See on your desktop -results.txt

The following urls should be removed as they do not contain the phrase "content-curation"

/Curation

/Digital-Curation

/Social-Curation

/Data-Curation

/Store-Curation

/Who-curates-the-curators

/search?q=curation&context_type=&context_id=

/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum

/National-Geographic-1

/National-Football

/United-Nations

/Nations

/The-National-band

/Live-Nation

/National-Public-Radio

/Bling-Nation

/Washington-Nationals

/search?q=curation+nation&context_type=&context_id=

/Content-Curator

/search?q=content+curation&context_type=&context_id=

The bot should return 13 records to be correct.

It looks like maybe the command does not understand what we want to do.

In my tests I pressed the quotes button to force it to be treated as a string that must be parsed, and you have done the same in the regex in your sample.

Here are the results that we should be left with. Lucky number 13

/Content-Curation-1

/Can-content-curation-become-mainstream

/Web-Content-Curation-Applications

/Web-Content-Curation-Startups

/Content-Curation-1

/Can-content-curation-become-mainstream

/Web-Content-Curation-Applications

/Web-Content-Curation-Startups

/What-are-the-best-content-curation-tools-for-daily-use

/When-does-content-curation-become-content-creation

/What-is-web-content-curation

/Web-Content-Curation/Can-you-make-money-curating

/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
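
For reference, a minimal sketch of one way to arrive at those 13 urls, written in Python (not UBot script). It walks the list from the last index down to the first, so removing an item never shifts an item that has not been checked yet, and it keeps duplicates. The name filter_in_place is made up, and the data is the url block from the first post.

```python
RAW_URLS = """/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#"""


def filter_in_place(items, phrase):
    """Remove every item that lacks the phrase, iterating from the end of the list."""
    needle = phrase.strip().lower()
    for index in range(len(items) - 1, -1, -1):
        if needle not in items[index].lower():
            del items[index]
    return items


urls = RAW_URLS.splitlines()
kept = filter_in_place(urls, "content-curation")
print(len(kept))  # 13
```

The "#" lines need no separate pass here, since they do not contain the phrase and are removed by the same test.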

mdc101 · February 22, 2012

I have tried a few regexs that work in regexbuddy and deliver the correct result of 13 items found:

Ruby format & .NET: \b.*content-curation.*\b(?i)

However in UBOT it fails to deliver the 13 required urls????

$find regular expression($list item(%Urls, #FirststLevelQ), "\\b.*{#SearchStringDashed}.*\\b(?i)")

Wonder if this is a bug?

mdc101 · February 22, 2012

hard coding the search phrase in does not work either

$find regular expression($list item(%Urls, #FirststLevelQ), "\\b.*content-curation.*\\b(?i)")
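
A note on the pattern itself, with a hedged sketch. In .NET regular expressions an inline option such as (?i) takes effect from the point where it appears, so at the very end of the pattern it changes nothing before it. Whether UBot's $find regular expression follows the .NET rule is an assumption here. A dash needs no special treatment outside a character class, so the "-" in the phrase should not be the problem in the regex. The Python sketch below (not UBot script) shows the same idea with the case-insensitive flag applied to the whole pattern and the user's text escaped; the name make_phrase_regex is made up.

```python
import re


def make_phrase_regex(phrase):
    """Build a case-insensitive regex that matches the phrase literally."""
    # re.escape protects against characters such as "+" or "?" typed by the user.
    return re.compile(re.escape(phrase.strip()), re.IGNORECASE)


regex = make_phrase_regex("content-curation")
urls = ["/Content-Curation-1", "/Content-Curator", "/search?q=content+curation"]
kept = [u for u in urls if regex.search(u)]
print(kept)  # only the url holding the dashed phrase
```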

mdc101 · February 23, 2012

Has anyone ever had issues with words that have dashes in them: "keyword-phrase"?

How have any of you cleaned lists with phrase match keywords that have dashes in them?

It seems no one has an answer

Chainsaw · February 23, 2012

Here's my take on a solution:

Note: I'm not sending the output anywhere, just displaying in the UI.

ui text box("Search Parameter", #SearchString)
ui stat monitor("Original List count: ", #UrlsCount)
ui stat monitor("Filtered List count: ", #FilteredUrlsCount)
ui stat monitor("RegexPassed", #RegexPassed)
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
set(#UrlsCount, $list total($list from text(#Urls, $new line)), "Global")
set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
set(#FilteredUrlsCount, $list total($list from text(#RegexPassed, $new line)), "Global")
alert("Completed!")
stop script
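
The idea in this version is to run one regex over the whole block of text and let it return the matching lines, instead of looping over a list and removing items. A rough Python equivalent (not UBot script) is sketched below. It relies on "." not matching a line break, so each match is a single line. In Python the case-insensitive flag is passed as an argument rather than placed mid-pattern. The name matching_lines is made up.

```python
import re


def matching_lines(text, phrase):
    """Return every line of text that contains the phrase, ignoring case."""
    pattern = re.compile(".*" + re.escape(phrase) + ".*", re.IGNORECASE)
    # strip() drops a stray carriage return when the text uses Windows line endings
    return [m.group(0).strip() for m in pattern.finditer(text)]


text = "/Curation\n/Content-Curation-1\n#\n/What-is-web-content-curation\n/Nations"
hits = matching_lines(text, "content-curation")
print(len(hits), hits)
```

One thing to watch with this approach: an empty search box would turn the pattern into one that matches every line, so it may be worth checking for an empty phrase first.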

Chainsaw · February 23, 2012

Thought I'd do a quick speed test of the regex and while I was at it compare using javascript and Ubot native commands to count the number of lines in the blocks of text.

loop(1000) {
   set(#UrlsCount, $eval("\"{$replace regular expression(#Urls, $new line, "\\r\\n")}\".split(/\r\n/).length;"), "Global")
   set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
   set(#FilteredUrlsCount, $eval("\"{$replace regular expression(#RegexPassed, $new line, "\\r\\n")}\".split(/\r\n/).length;"), "Global")
}

Versus

loop(1000) {
   set(#UrlsCount, $list total($list from text(#Urls, $new line)), "Global")
   set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
   set(#FilteredUrlsCount, $list total($list from text(#RegexPassed, $new line)), "Global")
}

The Ubot native commands won: 1000 iterations took 130 secs using javascript versus 2.8 secs with Ubot native commands.

Here's the winning version. Maybe someone can improve on the execution speed.

set(#StartTime, $eval("new Date().getTime();"), "Global")
ui text box("Search Parameter", #SearchString)
ui stat monitor("Original List count: ", #UrlsCount)
ui stat monitor("Filtered List count: ", #FilteredUrlsCount)
ui stat monitor("RegexPassed", #RegexPassed)
set(#Urls, "/Curation
/Digital-Curation
/Social-Curation
/Content-Curation-1
/Data-Curation
/Store-Curation
/Can-content-curation-become-mainstream
/Who-curates-the-curators
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/search?q=curation&context_type=&context_id=
#
#
/Why-isnt-the-National-Museum-of-the-American-Indian-more-like-the-Holocaust-Museum
/National-Geographic-1
/National-Football
/United-Nations
/Nations
/The-National-band
/Live-Nation
/National-Public-Radio
/Bling-Nation
/Washington-Nationals
/search?q=curation+nation&context_type=&context_id=
#
#
/Content-Curator
/Content-Curation-1
/Can-content-curation-become-mainstream
/Web-Content-Curation-Applications
/Web-Content-Curation-Startups
/What-are-the-best-content-curation-tools-for-daily-use
/When-does-content-curation-become-content-creation
/What-is-web-content-curation
/Web-Content-Curation/Can-you-make-money-curating
/Web-Content-Curation/Who-are-the-best-tech-content-curators-to-follow-in-2011
/search?q=content+curation&context_type=&context_id=
#
#", "Global")
loop(1000) {
   set(#UrlsCount, $list total($list from text(#Urls, $new line)), "Global")
   set(#RegexPassed, $find regular expression(#Urls, ".*(?i)({#SearchString}).*"), "Global")
   set(#FilteredUrlsCount, $list total($list from text(#RegexPassed, $new line)), "Global")
}
set(#EndTime, $eval("new Date().getTime();"), "Global")
alert("Completed in {$eval($divide($subtract(#EndTime, #StartTime), 1000))} secs")
stop script
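
Going by the figures quoted above, the native version is roughly 46 times faster (130 / 2.8). The same start and end timestamp pattern can be sketched in Python (not UBot script) for anyone who wants to time their own variant. The helper name time_loop is made up, and the work being timed is only a stand-in line count.

```python
import time


def time_loop(work, iterations=1000):
    """Run work() the given number of times and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        work()
    return time.perf_counter() - start


text = "/Curation\n/Content-Curation-1\n#"
elapsed = time_loop(lambda: len(text.splitlines()))
print(f"Completed in {elapsed:.3f} secs")
```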

UBotBuddy · February 23, 2012

Here ya go!

mdc101.ubot

Here is what you did wrong...imho.

If you wanted all of the URLs lower case then do that before adding to a List.

Remove items you don't want in the list before adding to the List.

If you know you are going to have duplicates and want them then make sure you have that advanced setting selected to keep the duplicates.
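
The attached bot is not shown in the thread, so the following is only a guess at the idea behind these three points, sketched in Python (not UBot script): clean the raw text first (lowercase it, drop the lines you do not want), build the list from what is left, keep duplicates, and only then filter. The name build_clean_list and the unwanted parameter are made up.

```python
def build_clean_list(raw_text, unwanted=("#",)):
    """Lowercase the text and drop blank or unwanted lines before building the list."""
    cleaned = []
    for line in raw_text.lower().splitlines():
        line = line.strip()
        if line and line not in unwanted:
            cleaned.append(line)  # duplicates are kept on purpose
    return cleaned


raw = "/Content-Curation-1\n#\n#\n/Nations\n/Content-Curation-1\n"
urls = build_clean_list(raw)
kept = [u for u in urls if "content-curation" in u]
print(urls)
print(kept)
```

Doing the cleanup before the list exists means no items are removed while the list is being walked, which sidesteps index bookkeeping altogether.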

Anyway, I hope this helps you!

Buddy

mdc101 · February 23, 2012

Thanks ChainSaw and UBotBuddy, these are brilliant examples
