Ruby: UTF-8 IO
- by subtenante
I am using Ruby 1.8.7.
I am trying to parse some text files containing Greek sentences, encoded in UTF-8.
(I can't paste sample files here because they are subject to copyright, but they really are just Greek text encoded in UTF-8.)
For each file, I want to extract all the words and list every word that appears for the first time in that file. Everything is saved to one big index file.
Here is my code:
#!/usr/bin/ruby -KU

def prepare_line(l)
  l.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/u, "")
end

def tokenize(l)
  l.split /['·.;!:\s]+/u
end

$dict = {}
$cpt = 0
$out = File.new 'out.txt', 'w'

def lesson(file)
  $cpt = $cpt + 1
  file.readlines.each do |l|
    $out.puts l
    l = prepare_line l
    tokenize(l).each do |t|
      unless $dict[t]
        $dict[t] = $cpt
        $out.puts " #{t}\n"
      end
    end
  end
end

Dir.new('etc/').each do |filename|
  f = File.new("etc/#{filename}")
  unless File.directory? f
    lesson f
  end
end
Here is part of my output:
?@???†?†?????????? ?...[snip very long hangul/hanzi mishmash]... ????????†? ???N2 : ?e?te?? (2) µ???µa
(Note that the "puts l" part seems to work fine: the original line appears intact at the end of the output line shown above.)
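One way I could narrow this down is to check each line and token for malformed UTF-8 before writing it out. The following is only a sketch, not a diagnosis: the helper name valid_utf8? is hypothetical, and it relies on String#unpack('U*') raising ArgumentError when it meets a malformed UTF-8 sequence.

```ruby
# Hypothetical helper (not part of the script above): returns true when the
# string decodes cleanly as UTF-8, false when unpack hits a malformed sequence.
def valid_utf8?(s)
  s.unpack('U*')   # decodes UTF-8 into code points; raises on malformed input
  true
rescue ArgumentError
  false
end
```

Calling it on each l and each t inside lesson, and printing the file name whenever it returns false, would at least show which input produces the garbage.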
Any idea what is wrong with my code?
(General comments about Ruby idioms I could use are very welcome; I'm really a beginner.)
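On the idioms side, here is a minimal sketch of the same indexing logic without global variables, so it can be tried on in-memory lines before any file is involved. It assumes Ruby 1.9 or later (where strings carry their own encoding), so it is not tested against 1.8.7 and makes no claim about the cause of the garbled output. The name build_index and its return value are hypothetical, and unlike the script above it drops empty tokens.

```ruby
# Sketch only: assumes Ruby 1.9+ and UTF-8 source; build_index is a hypothetical name.

# Same cleanup regex as the original prepare_line.
def prepare_line(line)
  line.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/u, "")
end

# Same delimiters as the original tokenize; empty tokens are dropped
# (a deliberate change from the original).
def tokenize(line)
  line.split(/['·.;!:\s]+/u).reject { |t| t.empty? }
end

# lessons: an array of arrays of lines, one inner array per file.
# Returns [out, dict]: out holds the index lines in order, and dict maps
# each word to the 1-based number of the lesson where it first appeared.
def build_index(lessons)
  dict = {}
  out = []
  lessons.each_with_index do |lines, i|
    lines.each do |raw|
      out << raw.chomp
      tokenize(prepare_line(raw)).each do |word|
        next if dict.key?(word)
        dict[word] = i + 1
        out << " #{word}"
      end
    end
  end
  [out, dict]
end

# Possible usage (assumption: regular files only, read in name order):
#   paths = Dir.glob('etc/*').select { |p| File.file?(p) }.sort
#   out, _dict = build_index(paths.map { |p| File.readlines(p) })
#   File.open('out.txt', 'w') { |f| f.puts out }
```

Returning the lines instead of writing them directly keeps the word logic separate from the I/O, which makes it easier to see whether a problem comes from the parsing or from the files being read.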