Ruby : UTF-8 IO

Posted by subtenante on Stack Overflow
Published on 2010-03-12T22:09:35Z Indexed on 2010/03/12 22:57 UTC

Filed under: ruby | utf-8 | io

I am using Ruby 1.8.7.

I am trying to parse some text files containing Greek sentences, encoded in UTF-8.

(I can't paste sample files here because they are subject to copyright. They are really just Greek text encoded in UTF-8.)

For each file, I want to parse it, extract all the words, and list each word that appears for the first time in that file. All of this is saved to one big index file.

Here is my code:

#!/usr/bin/ruby -KU

def prepare_line(l)
    l.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/u, "")
end

def tokenize(l)
    l.split /['·.;!:\s]+/u
end

$dict = {}
$cpt = 0
$out = File.new 'out.txt', 'w'

def lesson(file)
    $cpt = $cpt + 1
    file.readlines.each do |l|
        $out.puts l
        l = prepare_line l
        tokenize(l).each do |t|
            unless $dict[t]
                $dict[t] = $cpt
                $out.puts  "  #{t}\n"
            end
        end
    end
end

Dir.new('etc/').each do |filename|
    f = File.new("etc/#{filename}")
    unless File.directory? f
        lesson f
    end
end

Here is part of my output:

?@???†?†?????????? ?...[snip very long hangul/hanzi mishmash]... ????????†? ???N2 : ?e?te?? (2) µ???µa

(Note that the $out.puts l part seems to work fine: the original line appears correctly at the end of the output line shown above.)
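One way to narrow this down is to check whether every file in the directory really contains valid UTF-8 before it reaches the tokenizer. The sketch below is only a diagnostic written under assumptions: it relies on String#valid_encoding?, which exists in Ruby 1.9 and later but not in 1.8.7, and the helper name utf8_report is made up for illustration.

```ruby
# Diagnostic sketch (assumes Ruby 1.9+; utf8_report is a hypothetical helper).
# Reads the raw bytes of a file and reports whether they form valid UTF-8
# and whether the file starts with a UTF-8 byte order mark (EF BB BF).
def utf8_report(path)
  bytes = File.binread(path)                          # raw bytes, no transcoding
  text  = bytes.dup.force_encoding(Encoding::UTF_8)   # reinterpret, do not convert
  {
    :valid_utf8 => text.valid_encoding?,
    :has_bom    => bytes[0, 3].bytes.to_a == [0xEF, 0xBB, 0xBF]
  }
end
```

Running utf8_report over each entry of etc/ would show whether the garbage comes from an input file that is not UTF-8 text, or from somewhere later in the processing or display.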

Any idea what is wrong with my code?

(General comments about Ruby idioms I could use are very welcome; I'm really a beginner.)
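On the idiom side, here is a minimal sketch of the same indexing logic without global variables. It is not a diagnosis of the garbled output. It assumes Ruby 1.9 or later (it uses the :encoding option when reading), and build_index together with its dir and out parameters are hypothetical names, not part of the script above. Unlike the original, it sorts the directory entries so that file numbers are deterministic, and it drops empty tokens.

```ruby
# encoding: utf-8
# Sketch only (assumes Ruby 1.9+; build_index is a hypothetical name).

def prepare_line(line)
  # Strip a leading "S12 :" / "T3 :" label, trailing whitespace,
  # and numbered markers such as "(2) ".
  line.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/, "")
end

def tokenize(line)
  # Split on punctuation and whitespace; discard empty tokens.
  line.split(/['·.;!:\s]+/).reject { |token| token.empty? }
end

# Copies every line of every regular file in dir to out, each followed by
# the words not seen in any earlier line or file. Returns a hash mapping
# each word to the number (starting at 1) of the file where it first appeared.
def build_index(dir, out)
  dict = {}
  paths = Dir.entries(dir).sort.map { |name| File.join(dir, name) }
  files = paths.select { |path| File.file?(path) }   # skips ".", ".." and subdirectories

  files.each_with_index do |path, i|
    File.foreach(path, :encoding => 'UTF-8') do |line|
      out.puts line
      tokenize(prepare_line(line)).each do |word|
        next if dict.key?(word)
        dict[word] = i + 1
        out.puts "  #{word}"
      end
    end
  end
  dict
end
```

It could be called as File.open('out.txt', 'w:UTF-8') { |out| build_index('etc', out) }. The block form of File.open closes the output file automatically, and passing the output stream and directory as arguments makes the function easy to try on a small test directory.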

