Building a simple Reddit scraper
- by Bazant Fundator
Let's say that I would like to make a collection of images from Reddit for my own amusement. I have run the code in my development environment, and it hasn't gone past the first page of posts (anything beyond that requires the "after" string from the JSON). Additionally, when I turn on the validation, the whole loop breaks if an item doesn't pass it, not just the current iteration.
I would be glad if you helped me understand the mistakes I made.
class Link
  include Mongoid::Document
  include Mongoid::Timestamps
  field :author, type: String
  field :url, type: String
  validates :url, uniqueness: true # no duplicates
end
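
On the validation symptom: in Mongoid, `create!` raises `Mongoid::Errors::Validations` when a document fails validation, and an exception that is not rescued unwinds out of the enclosing loop. That would explain why one rejected item ends the whole run instead of just the current iteration. Below is a minimal sketch of the general Ruby pattern of rescuing inside the iteration. `FakeStore`, `DuplicateError`, and `save_all` are hypothetical stand-ins so the example runs without a database; they are not part of the original code or of Mongoid.

```ruby
require 'set'

# Hypothetical stand-in for Mongoid::Errors::Validations.
class DuplicateError < StandardError; end

# Hypothetical in-memory stand-in for the Link collection.
class FakeStore
  attr_reader :urls

  def initialize
    @urls = Set.new
  end

  # Mimics a bang-style create: raises on a duplicate instead of returning quietly.
  def create!(url)
    raise DuplicateError, "duplicate url: #{url}" unless @urls.add?(url)
    url
  end
end

# Saves every URL it can; a rejected item skips only that iteration.
def save_all(store, urls)
  skipped = []
  urls.each do |url|
    begin
      store.create!(url)
    rescue DuplicateError
      skipped << url # record the failure and carry on with the next item
    end
  end
  skipped
end

store = FakeStore.new
skipped = save_all(store, %w[a.jpg b.jpg a.jpg c.jpg])
```

With the real model, another option would be the non-bang `Link.create`, which returns the document without raising; `persisted?` on the result then tells whether it was actually saved.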
def fetch(count, after)
  count_s = count.to_s # convert count to string
  link = "http://reddit.com/r/aww/.json?count=" + count_s + "&after=" + after # so it can be used here
  res = HTTParty.get(link) # GET request to the Reddit server
  json = JSON.parse(res.body) # parse the response
  if json['kind'] == "Listing" then # check if the retrieved item is a Listing
    for i in 1...(count) do # for each list item
      datum = json['data']['children'][i]['data'] # i-th element properties
      if datum['domain'].in?(["imgur.com", "i.imgur.com"]) then # fetch only imgur links
        Link.create!(author: datum['author'], url: datum['url']) # save to db
      end
    end
    count += 25
    fetch(count, json['data']['after']) # if it retrieved the right kind of object, move on to the next page
  end
end

fetch(25, " ") # run it
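
On the pagination symptom, a few things in the code above seem worth checking. The `1...(count)` range starts at index 1, so the first child is skipped, and once `count` grows to 50 it indexes past the end of a page's `children` array, where `children[i]` is `nil` and `nil['data']` raises. The `after` value is also `null` on the last page, so concatenating it into the URL would fail. Here is a minimal sketch of an alternative, assuming the page size is controlled by a `limit` query parameter and that iterating over `children` directly is acceptable. The helper names (`listing_url`, `imgur_links`, `each_page_links`) and the canned `pages` data are hypothetical; the fetcher is injected so the example runs offline.

```ruby
require 'json'
require 'uri'

IMGUR_DOMAINS = %w[imgur.com i.imgur.com].freeze

# Builds the listing URL; `after` is omitted on the first request.
def listing_url(after = nil, limit = 25)
  params = { limit: limit }
  params[:after] = after if after
  "http://reddit.com/r/aww/.json?#{URI.encode_www_form(params)}"
end

# Picks author/url pairs for imgur-hosted posts out of one parsed listing.
def imgur_links(listing)
  listing.fetch('data').fetch('children')
         .map { |child| child['data'] }
         .select { |post| IMGUR_DOMAINS.include?(post['domain']) }
         .map { |post| { author: post['author'], url: post['url'] } }
end

# Walks pages until `after` is nil or max_pages is reached.
# `fetcher` is any callable that takes a URL and returns the response body.
def each_page_links(fetcher, max_pages: 10)
  links = []
  after = nil
  max_pages.times do
    listing = JSON.parse(fetcher.call(listing_url(after)))
    break unless listing['kind'] == 'Listing'
    links.concat(imgur_links(listing))
    after = listing['data']['after']
    break if after.nil? # no further pages
  end
  links
end

# Canned two-page response (made-up data) so the sketch runs without network access.
first_url = listing_url
second_url = listing_url('t3_x')
pages = {
  first_url => { kind: 'Listing', data: { after: 't3_x', children: [
    { data: { domain: 'i.imgur.com', author: 'ann', url: 'http://i.imgur.com/1.jpg' } },
    { data: { domain: 'example.com', author: 'bob', url: 'http://example.com/2' } }
  ] } }.to_json,
  second_url => { kind: 'Listing', data: { after: nil, children: [
    { data: { domain: 'imgur.com', author: 'cy', url: 'http://imgur.com/3' } }
  ] } }.to_json
}

links = each_page_links(->(url) { pages.fetch(url) })
```

Against the live site, the fetcher could be something like `->(url) { HTTParty.get(url).body }`, with each returned hash handed to the `Link` model. A loop with a page cap is used here instead of recursion so the run has a clear stopping point.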