UBot Underground

How do you avoid duplicate content?



Hi,

 

Say you write a tool that scrapes an article database and posts the articles to your WordPress blog. The next time the bot runs for the same keywords, it will find new articles but also some it has found before. How can those be detected easily, so it doesn't try to post duplicate content?

 

Andy


I can think of one way off the top of my head. It might sound bad and complex, but it really isn't (I have done something similar).

 

Create a sub that takes the article title as a parameter. Have that sub write the article title to a file called input.txt, then use the SHELL command to run an external program called checker.exe. Code up checker.exe in any language you know; I prefer C#. Have checker.exe open input.txt and read the article title from it. Next, have it open a file called article_title_list.txt and check whether the title from input.txt is present in it. If it is, you have a duplicate, so write a 1 to a file called output.txt. If the title isn't present in article_title_list.txt, append it to the list, save the new list, and write a 0 to output.txt. Then exit checker.exe. Back in UBot, read output.txt and check whether it contains a 1 or a 0. If it contains a 1, you have a duplicate, so set a flag to indicate that. If it contains a 0, the article is not a duplicate and you are good to go.
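To make the checker side concrete, here is a rough sketch of what such a program could look like, written in Python rather than C#. The file names come from the description above; the function names (normalize, is_duplicate, run_checker) are made up for illustration, and the case and whitespace normalization is an extra assumption on my part, not something the steps above require.

```python
from pathlib import Path


def normalize(title: str) -> str:
    """Assumption: treat titles as equal if they differ only in case or spacing."""
    return " ".join(title.split()).lower()


def is_duplicate(title: str, list_path: Path) -> bool:
    """Return True if the title is already listed; otherwise record it and return False."""
    key = normalize(title)
    seen = set()
    if list_path.exists():
        lines = list_path.read_text(encoding="utf-8").splitlines()
        seen = {normalize(line) for line in lines if line.strip()}
    if key in seen:
        return True
    # New title: append it so the next run recognizes it.
    with list_path.open("a", encoding="utf-8") as handle:
        handle.write(title.strip() + "\n")
    return False


def run_checker(input_path="input.txt",
                list_path="article_title_list.txt",
                output_path="output.txt") -> str:
    """Read the title from input_path and write "1" (duplicate) or "0" (new) to output_path."""
    title = Path(input_path).read_text(encoding="utf-8")
    flag = "1" if is_duplicate(title, Path(list_path)) else "0"
    Path(output_path).write_text(flag, encoding="utf-8")
    return flag
```

Calling run_checker() with no arguments at the end of the script would make it behave like the checker.exe described above: the bot writes input.txt, runs the script through SHELL, and then reads the 1 or 0 back from output.txt.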

 

I use this technique (outsourcing functions to external programs and using text files as a communications pipeline) for a lot of more complicated stuff that I would rather not handle in UBot, such as captcha cracking. So I can attest that it works, and it isn't as complicated as it sounds if you organize your code correctly.

 

Cheers,

 

-m
