Huge file in Clojure and Java heap space error

Posted by trzewiczek on Stack Overflow See other posts from Stack Overflow or by trzewiczek
Published on 2012-04-02T21:18:40Z Indexed on 2012/04/02 23:30 UTC
Read the original article Hit count: 301

Filed under:
|
|

I posted before on a huge XML file - it's a 287GB XML with Wikipedia dump I want ot put into CSV file (revisions authors and timestamps). I managed to do that till some point. Before I got the StackOverflow Error, but now after solving the first problem I get: java.lang.OutOfMemoryError: Java heap space error.

My code (partly taken from Justin Kramer answer) looks like that:

(defn process-pages
  [page]
  (let [title     (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (for [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n" )
              :append true)))))

(defn open-file
[file-name]
(let [rdr (BufferedReader. (FileReader. file-name))]
  (->> (:content (data.xml/parse rdr :coalescing false))
       (filter #(= :page (:tag %)))
       (map process-pages))))

I don't show article-title, revision-user and revision-title functions, because they just simply take data from a specific place in the page or revision hash. Anyone could help me with this - I'm really new in Clojure and don't get the problem.

© Stack Overflow or respective owner

Related posts about clojure

Related posts about heap