Building a simple Reddit scraper
Posted by Bazant Fundator on Stack Overflow
Published on 2013-06-25T10:19:23Z
Indexed on 2013/06/25 10:21 UTC
Let's say that I would like to make a collection of images from reddit for my own amusement. I ran the code below in my development environment, and it never got past the first page of posts (anything beyond that requires the after string from the JSON). Additionally, when I turn on the validation, the whole loop breaks if an item doesn't pass it, not just the current iteration.
I would be glad if you helped me understand the mistakes I made.
class Link
  include Mongoid::Document
  include Mongoid::Timestamps
  field :author, type: String
  field :url, type: String
  validates :url, uniqueness: true # no duplicates
end
def fetch(count, after)
  count_s = count.to_s # convert count to string
  link = "http://reddit.com/r/aww/.json?count=" + count_s + "&after=" + after # so it can be used there
  res = HTTParty.get(link) # GET request to the reddit server
  json = JSON.parse(res.body) # parse the response
  if json['kind'] == "Listing" then # check if the retrieved item is a Listing
    for i in 1...(count) do # for each list item
      datum = json['data']['children'][i]['data'] # i-th element properties
      if datum['domain'].in?(["imgur.com", "i.imgur.com"]) then # fetch only imgur links
        Link.create!(author: datum['author'], url: datum['url']) # save to db
      end
    end
    count += 25
    fetch(count, json['data']['after']) # if it retrieved the right kind of object, move on to the next page
  end
end

fetch(25, " ") # run it
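
For comparison, here is a minimal sketch of one way the scraper could be restructured. It is not a definitive fix. It rests on two likely causes of the behaviour described: the `for` loop indexes up to `count`, which grows past the number of children on a page (and starts at 1, skipping the first post), and `create!` raises when validation fails. The sketch iterates over the children that are actually present and follows the `after` token until it is nil. The names `imgur_links`, `collect_links`, `FETCH_AWW_PAGE` and the `max_pages` cap are hypothetical, the HTTPS `www.reddit.com` host is an assumption, and Ruby's standard `Net::HTTP` stands in for HTTParty so the example runs without gems.

```ruby
require 'json'
require 'net/http'
require 'uri'

IMGUR_DOMAINS = %w[imgur.com i.imgur.com].freeze

# Assumed page fetcher: GETs one page of the listing and parses the JSON.
# The host and the lack of a custom User-Agent are assumptions; the
# question used http://reddit.com with HTTParty.
FETCH_AWW_PAGE = lambda do |count, after|
  params = { count: count }
  params[:after] = after if after # omit `after` on the first page
  uri = URI('https://www.reddit.com/r/aww/.json')
  uri.query = URI.encode_www_form(params)
  JSON.parse(Net::HTTP.get(uri))
end

# Extract author/url pairs for imgur posts from one parsed Listing.
# Iterates over the children that exist instead of indexing up to `count`.
def imgur_links(listing)
  listing['data']['children']
    .map { |child| child['data'] }
    .select { |post| IMGUR_DOMAINS.include?(post['domain']) }
    .map { |post| { author: post['author'], url: post['url'] } }
end

# Walk the listing page by page, following the `after` token.
# `fetch_page` is any callable taking (count, after) and returning a Hash.
# `max_pages` is a hypothetical safety cap so the loop always terminates.
def collect_links(fetch_page, max_pages: 4)
  links = []
  after = nil
  count = 0
  max_pages.times do
    listing = fetch_page.call(count, after)
    break unless listing['kind'] == 'Listing'
    links.concat(imgur_links(listing))
    count += listing['data']['children'].size
    after = listing['data']['after']
    break if after.nil? # no further pages
  end
  links
end
```

With the Link model from the question, `collect_links(FETCH_AWW_PAGE).each { |attrs| Link.create(attrs) }` would store the results. `create` without the bang returns the unsaved document when validation fails, whereas `create!` raises `Mongoid::Errors::Validations`, which would explain the whole loop stopping at the first duplicate rather than just skipping it.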
© Stack Overflow or respective owner