Ruby : UTF-8 IO

Posted by subtenante on Stack Overflow
Published on 2010-03-12T22:09:35Z Indexed on 2010/03/12 22:57 UTC


I am using Ruby 1.8.7.

I am trying to parse some text files containing Greek sentences, encoded in UTF-8.

(I can't paste sample files here because they are subject to copyright. They really are just Greek text encoded in UTF-8.)

For each file, I want to parse it, extract all the words, and list every word that has not appeared in an earlier file. All of this is saved to one big index file.

Here is my code:
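To make the goal concrete, the indexing step amounts to something like the following sketch. It works on in-memory strings rather than files, and the name new_words_per_lesson is made up for illustration; it is not a line-for-line copy of my script.

```ruby
# Hypothetical helper, for illustration only: for each lesson text,
# return the words that did not occur in any earlier lesson.
def new_words_per_lesson(lessons)
  seen = {}
  lessons.map do |text|
    words = text.split(/['·.;!:\s]+/u).reject { |w| w.empty? }
    words.select do |w|
      first_time = !seen.key?(w)
      seen[w] = true          # remember the word for later lessons
      first_time
    end
  end
end
```

Given two lessons that share a word, the shared word should be listed only under the first lesson.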

#!/usr/bin/ruby -KU

def prepare_line(l)
    l.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/u, "")
end

def tokenize(l)
    l.split /['·.;!:\s]+/u
end

$dict = {}
$cpt = 0
$out = File.new 'out.txt', 'w'

def lesson(file)
    $cpt = $cpt + 1
    file.readlines.each do |l|
        $out.puts l
        l = prepare_line l
        tokenize(l).each do |t|
            unless $dict[t]
                $dict[t] = $cpt
                $out.puts  "  #{t}\n"
            end
        end
    end
end

Dir.new('etc/').each do |filename|
    f = File.new("etc/#{filename}")
    unless File.directory? f
        lesson f
    end
end

Here is part of my output:

?@???†?†?????????? ?...[snip very long hangul/hanzi mishmash]... ????????†? ???N2 : ?e?te?? (2) µ???µa

(Note that the puts l call seems to work fine: its result is at the end of the output line above.)

Any idea what is wrong with my code?

(General comments about Ruby idioms I could use are very welcome; I'm really a beginner.)
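One thing that could be checked first is whether every input file really is well-formed UTF-8, since a stray file in another encoding might produce this kind of garbage. The following is only a sketch under that assumption: valid_utf8? and report_invalid_files are hypothetical helper names, not part of the script above. It relies on String#unpack('U*') raising ArgumentError when it meets a malformed UTF-8 sequence.

```ruby
# Hypothetical helper (not part of the script above): true when the raw
# bytes decode cleanly as UTF-8. unpack('U*') raises ArgumentError on a
# malformed sequence.
def valid_utf8?(bytes)
  bytes.unpack('U*')
  true
rescue ArgumentError
  false
end

# Hypothetical helper: list the regular files directly under dir whose
# contents are not well-formed UTF-8.
def report_invalid_files(dir)
  paths = Dir.glob(File.join(dir, '*')).select { |path| File.file?(path) }
  paths.reject do |path|
    valid_utf8?(File.open(path, 'rb') { |f| f.read })
  end
end

if __FILE__ == $0
  # 'etc' is the input directory used by the script above.
  report_invalid_files('etc').each { |path| puts "not valid UTF-8: #{path}" }
end
```

If this prints nothing, the inputs are at least decodable as UTF-8, and the problem is more likely in how the script reads, splits, or writes the text.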

© Stack Overflow or respective owner
