Building An Instagram Hashtag Scraper [Tutorial]



#1 HelloInsomnia


Posted 11 July 2018 - 07:17 PM

In this tutorial we are going to build an Instagram hashtag scraper. The idea is to enter a hashtag and then have the scraper find other hashtags which appear in the descriptions of the photos. The scraper will count the number of times each hashtag appears. You can then take the top related hashtags and use them for scraping users, posts, etc.

 

First we need a query; mine is going to be lunch:

 

https://www.instagram.com/explore/tags/lunch/

 


 

We can accept a hashtag using a ui text box, and then when we run the program we can remove the spaces by replacing them with nothing. From what I can tell, IG does not like spaces, periods, etc. in the tag URLs, and users probably don’t use underscores if they don’t have to.

ui text box("Hashtag",#hashtag)
set(#inputHashtag,$replace(#hashtag," ",""),"Global")
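
If you want to sanity-check this step outside of UBot, here is a rough Python sketch of the same idea: strip the spaces and drop the result into the tag URL shown above. The function name build_tag_url is made up for this example and is not part of the bot.

```python
def build_tag_url(raw_hashtag: str) -> str:
    """Remove spaces from the input and build the explore/tags URL."""
    # Same idea as $replace(#hashtag," ",""): spaces are replaced with nothing.
    tag = raw_hashtag.replace(" ", "")
    return f"https://www.instagram.com/explore/tags/{tag}/"


print(build_tag_url("street food"))
# → https://www.instagram.com/explore/tags/streetfood/
```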


 

We can navigate to the tag page and then get more results. IG uses lazy loading, so we can use a bit of JavaScript to scroll to the bottom of the page in order to load more results. That is a handy bit of JS to save somewhere, or you can just Google it when you need it.

navigate("https://www.instagram.com/explore/tags/{#inputHashtag}","Wait")
loop(10) {
   run javascript("window.scrollTo(0,document.body.scrollHeight);")
   wait(2)
   wait for browser event("Everything Loaded","")
}

Now we need to scrape the post descriptions. It seems that IG stores these in the alt attribute of the image. We only care about descriptions which contain a hashtag, so we can target that by using a wildcard.

clear list(%descriptions)
add list to list(%descriptions,$scrape attribute(<alt=w"*#*">,"alt"),"Delete","Global")
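
To show what that wildcard scrape is doing, here is a small Python sketch that collects every alt attribute containing a # from a chunk of HTML and skips duplicates, similar to the "Delete" option above. The sample HTML and the AltCollector class are invented for illustration; the real page markup will look different.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect unique alt attribute values that contain a '#'."""

    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Keep only alt text with a hashtag, and skip duplicates.
            if name == "alt" and value and "#" in value:
                if value not in self.descriptions:
                    self.descriptions.append(value)


# Made-up sample markup, only for demonstration.
sample_html = """
<img src="a.jpg" alt="Best lunch ever #food #yummy">
<img src="b.jpg" alt="No tags here">
<img src="c.jpg" alt="Best lunch ever #food #yummy">
"""

collector = AltCollector()
collector.feed(sample_html)
print(collector.descriptions)
# → ['Best lunch ever #food #yummy']
```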


 

The next step is to get the hashtags out of all of the descriptions and into a list. There are several things we need to consider when we do this so let’s break it down bit by bit.

 

To actually grab the hashtags we can use a regular expression. I played around with this a little bit before deciding to use this regex:

\#\w+

Basically, that says to find a # and then grab all of the word characters after it. Word characters are letters, numbers and underscores, so the match stops at things like periods, another #, spaces, new lines, etc.
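
Here is a quick way to see where the match stops, sketched in Python's re module rather than UBot. The sample text is made up.

```python
import re

# '#' followed by one or more word characters (letters, digits, underscore).
HASHTAG_PATTERN = re.compile(r"#\w+")

# Made-up description text with a period, a line break and two tags run together.
text = "Lunch time! #food #Insta_Food2018 #yum.my\n#tasty#fresh"

found = HASHTAG_PATTERN.findall(text)
print(found)
# → ['#food', '#Insta_Food2018', '#yum', '#tasty', '#fresh']
```

Notice that #yum.my is cut off at the period, and #tasty#fresh comes out as two separate hashtags.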

 

Next, we need to change the text casing to all lowercase because we don’t want separate entries like #food and #Food. We can use the change text casing function for this job.

 

Finally, we need to be sure that we are not deleting duplicates in this list. We want to be able to count the number of times each hashtag shows up, and if we remove duplicates then we won’t be able to do this.

 

So this is how we end up grabbing the hashtags, changing them to be all lowercase and allowing duplicates in the list.

clear list(%hashtags)
add list to list(%hashtags,$find regular expression($change text casing(%descriptions,"Lower Case"),"\\#\\w+"),"Don\'t Delete","Global")
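
The same three decisions (regex, lowercase, keep duplicates) look like this in a short Python sketch. The function name extract_hashtags and the sample descriptions are just for this example.

```python
import re


def extract_hashtags(descriptions):
    """Return every hashtag from every description, lowercased, duplicates kept."""
    hashtags = []
    for description in descriptions:
        # Lowercase first so #Food and #food end up as the same entry.
        hashtags.extend(re.findall(r"#\w+", description.lower()))
    return hashtags


sample_descriptions = ["Great #Food today #lunch", "More #food please"]
print(extract_hashtags(sample_descriptions))
# → ['#food', '#lunch', '#food']
```

The list is deliberately not de-duplicated, because the duplicates are what we count in the next step.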


 

We need to count all of the hashtags in the list and there are a variety of ways to do this. I wanted to keep this super simple, only use free plugins if necessary - and no bot bank. So I decided that the best way to do this was the following:

 

  1. Create a table to store each hashtag and its number of occurrences
  2. Check to see if the hashtag is already in the table
  3. If the hashtag is in the table, skip it

 

We need to do a bit of setup before the loop. Because we are working with a table, we need to clear it and set a row variable.

clear table(&hashtagPopularity)
set(#row,0,"Global")

Next we loop for the list total of hashtags:

loop($list total(%hashtags)) {
   
}

And this part is the code that goes inside of the loop.

 

We set a variable to be the next hashtag in the list. This allows us to use the variable in multiple places without calling next list item more than once.

set(#nextHashtag,$next list item(%hashtags),"Global")

Next we check to see if the hashtag is in the table. We can use the TableCommands plugin, which comes with UBot, to get a list of all the items in column 0 (the first column, which contains all the hashtags).

set(#hashtagExists,$find regular expression($plugin function("TableCommands.dll", "$list from table", &hashtagPopularity, "Column", 0),"{#nextHashtag}(?=\\W|$)"),"Global")

We are using another regular expression and this time it’s a bit different.

 

It starts off with the #nextHashtag variable - so this could be #food for example.

 

Then we want to be sure that there is a non-word character after the hashtag, or that we are at the end of the text being searched. That is what (?=\W|$) means: it is a lookahead, so it checks what comes next without including it in the match. The reason we want this is that #food should match #food and only #food. We don’t want it to match #foodfriday or something like that. \W matches any non-word character, such as a space or a line break, and $ matches the end of the text, which covers the last item in the list.
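
Here is a small Python sketch of why the lookahead matters. One assumption to flag: I join the list into newline-separated text before searching, which is my guess at how a list is handed to the regex, not something confirmed here. The occurrences helper is made up for the example.

```python
import re


def occurrences(tag, hashtags, use_lookahead=True):
    """Count matches of tag in the list, with or without the (?=\\W|$) guard."""
    # Assumption: the list is searched as newline-separated text.
    text = "\n".join(hashtags)
    suffix = r"(?=\W|$)" if use_lookahead else ""
    return len(re.findall(re.escape(tag) + suffix, text))


tags = ["#food", "#foodfriday", "#food", "#foodie"]
print(occurrences("#food", tags))                       # → 2
print(occurrences("#food", tags, use_lookahead=False))  # → 4
```

Without the lookahead, #foodfriday and #foodie are counted as #food too, which would inflate the numbers.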

 

We can now run a check to see if the hashtag already exists in the table by dropping this comparison into an if command:

$comparison(#hashtagExists,"=","")

Basically, if the result is empty then the hashtag was not found in the table, so we can go on to count the number of occurrences it has in the list. Otherwise, we already did that for this hashtag and we can skip it.

 

At this point you may be wondering why we don’t just remove the hashtags that we have already counted. The reason is that we can just do a simple check instead, which can be much faster than other methods that involve list manipulation.

 

Inside of our if statement (in the then command) we can go ahead and use another list to easily count the number of occurrences of our hashtag. So we clear a list and reuse the same regex as before:

clear list(%hashtag)
add list to list(%hashtag,$find regular expression(%hashtags,"{#nextHashtag}(?=\\W|$)"),"Don\'t Delete","Global")

Oh and don’t delete duplicates of course or all your hashtags will have a count of 1 ;)

 

Now that the hard part is over, we just need to simply add our hashtag and number of occurrences to the table.

set table cell(&hashtagPopularity,#row,0,#nextHashtag)
set table cell(&hashtagPopularity,#row,1,$list total(%hashtag))


 

Oh and don’t forget to increment the row variable after so that on the next loop we won’t overwrite the same row data.

increment(#row)

And that is pretty much it. We can go ahead and save this as a CSV file to our desktop for now. This is done outside of the loop so we don’t save on each iteration.

save to file("{$special folder("Desktop")}\\hashtags.csv",&hashtagPopularity)
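
For comparison, here is roughly what turning the two-column table into CSV text looks like in Python. This sketch builds the CSV as a string instead of writing to the desktop, and table_to_csv is a made-up helper; the exact quoting UBot uses in its file may differ.

```python
import csv
import io


def table_to_csv(table):
    """Serialize rows of [hashtag, occurrences] as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Counts taken from the lunch test run in this post.
print(table_to_csv([["#food", 22], ["#instafood", 15]]))
```

To write a real file, you would pass an open file object to csv.writer in place of the StringIO buffer.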

I did a few test runs and here were the results for the top 10 of each (excluding the input hashtag).

 

Lunch

 

#food - 22

#instafood - 15

#yummy - 14

#dinner - 14

#tasty - 13

#delicious - 12

#foodporn - 11

#fresh - 9

#foodie - 9

#breakfast - 9

 

Dog

 

#puppy - 13
#dogsofinstagram - 11
#love - 9
#cute - 9
#dogs - 8
#instagood - 6
#instadog - 6
#pet - 5
#corgifeed - 4
#corgiaddict - 4

 

Travel

 

#photography - 12
#love - 10
#nature - 9
#adventure - 8
#photooftheday - 8
#travelgram - 7
#travelphotography - 6
#fun - 6
#happy - 6
#travelling - 6
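
The lists above are the top 10 by count with the input hashtag left out. The bot itself only saves the raw table, so the sorting happens afterwards. A minimal Python sketch of that last step, assuming the table is a list of [hashtag, count] rows (top_related is a made-up name, and the #lunch count below is invented just to have an input row to exclude):

```python
def top_related(table, input_tag, limit=10):
    """Return the highest-count rows, excluding the hashtag we searched for."""
    rows = [row for row in table if row[0] != input_tag]
    # Sort by the count column, largest first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# #food, #fresh and #instafood counts come from the lunch run; #lunch is made up.
sample_table = [["#lunch", 40], ["#food", 22], ["#fresh", 9], ["#instafood", 15]]
print(top_related(sample_table, "#lunch", limit=2))
# → [['#food', 22], ['#instafood', 15]]
```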

 

There are loads of ways to improve this basic example or build upon it.

 

Here’s the full code:

ui text box("Hashtag",#hashtag)
set(#inputHashtag,$replace(#hashtag," ",""),"Global")
navigate("https://www.instagram.com/explore/tags/{#inputHashtag}","Wait")
loop(10) {
   run javascript("window.scrollTo(0,document.body.scrollHeight);")
   wait(2)
   wait for browser event("Everything Loaded","")
}
clear list(%descriptions)
add list to list(%descriptions,$scrape attribute(<alt=w"*#*">,"alt"),"Delete","Global")
clear list(%hashtags)
add list to list(%hashtags,$find regular expression($change text casing(%descriptions,"Lower Case"),"\\#\\w+"),"Don\'t Delete","Global")
clear table(&hashtagPopularity)
set(#row,0,"Global")
loop($list total(%hashtags)) {
   set(#nextHashtag,$next list item(%hashtags),"Global")
   set(#hashtagExists,$find regular expression($plugin function("TableCommands.dll", "$list from table", &hashtagPopularity, "Column", 0),"{#nextHashtag}(?=\\W|$)"),"Global")
   if($comparison(#hashtagExists,"=","")) {
       then {
           clear list(%hashtag)
           add list to list(%hashtag,$find regular expression(%hashtags,"{#nextHashtag}(?=\\W|$)"),"Don\'t Delete","Global")
           set table cell(&hashtagPopularity,#row,0,#nextHashtag)
           set table cell(&hashtagPopularity,#row,1,$list total(%hashtag))
           increment(#row)
       }
       else {
       }
   }
}
save to file("{$special folder("Desktop")}\\hashtags.csv",&hashtagPopularity)


#2 BigEfromDaBX


Posted 13 July 2018 - 11:17 AM

You Rock :)



#3 HelloInsomnia


Posted 14 July 2018 - 10:33 AM

If you guys want to see anything else like this, feel free to suggest it.





