Net66 Posted April 18, 2010

Hi,

Say you write a tool that scrapes an article database and posts the articles to your WordPress blog. The next time the bot runs for the same keywords, it will find new articles but also some it has found before. How can those be detected easily, so it doesn't try to post duplicate content?

Andy
meter Posted April 18, 2010

I can think of one way off the top of my head which might sound bad and complex, but really isn't (I have done something similar).

Create a sub that takes the article title as a parameter. Have that sub write the article title to a file called input.txt, then use the SHELL command to run an external program called checker.exe.

Now, code up checker.exe in any language you know; I prefer C#. Have checker.exe open input.txt and read the article title from it. Next, have it open a file called article_title_list.txt and check whether the title from input.txt is present in it. If it is, you have a duplicate, so write a 1 to a file called output.txt. If the title isn't present, append it to the list, save the new list, and write a 0 to output.txt. Then exit checker.exe.

Next, have Ubot read output.txt and check whether it contains a 1 or a 0. If it contains a 1, you have a duplicate, so set a flag to indicate that. If it contains a 0, the article is not a duplicate and you are good to go.

I use this technique (outsourcing functions to external programs and using text files as a communication pipeline) for a lot of more complicated stuff that I'd rather not handle in Ubot, such as captcha cracking. So I can attest that it works, and it isn't as complicated as it sounds if you organize your code correctly.

Cheers,
-m
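For anyone who wants to see the checker logic spelled out, here is a minimal sketch in Python rather than C#. The file names (input.txt, article_title_list.txt, output.txt) come from the post above; the function names `normalize`, `is_duplicate`, and `run_checker` are hypothetical, and the case and whitespace normalization is an added assumption, not something the post asks for.

```python
from pathlib import Path


def normalize(title: str) -> str:
    # Assumption (not in the original post): ignore case and extra
    # whitespace so trivially different copies of a title still match.
    return " ".join(title.split()).casefold()


def is_duplicate(title: str, list_path: Path) -> bool:
    """Return True if the title is already listed; otherwise record it and return False."""
    seen = set()
    if list_path.exists():
        lines = list_path.read_text(encoding="utf-8").splitlines()
        seen = {normalize(line) for line in lines}
    if normalize(title) in seen:
        return True
    # New title: append it on a single line so the list stays one title per line.
    with list_path.open("a", encoding="utf-8") as f:
        f.write(" ".join(title.split()) + "\n")
    return False


def run_checker(workdir: Path = Path(".")) -> None:
    # Read the title the bot wrote, then answer with 1 (duplicate) or 0 (new).
    title = (workdir / "input.txt").read_text(encoding="utf-8")
    flag = "1" if is_duplicate(title, workdir / "article_title_list.txt") else "0"
    (workdir / "output.txt").write_text(flag, encoding="utf-8")
```

To use this as a stand-in for checker.exe, call `run_checker()` at the bottom of the script and have the bot run the script in the folder that holds input.txt. Titles are compared exactly (apart from the normalization), so the same article published under a reworded title would not be caught.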
Net66 Posted April 19, 2010

Thanks.

Andy