How to count the Chinese words in a file using regex in Perl?
Posted by Ivan on Stack Overflow
Published on 2011-01-06T03:19:41Z
Indexed on 2011/01/06 3:53 UTC
I tried the following Perl code to count the Chinese words in a file. It runs, but it does not give the right result. Any help is greatly appreciated.
The error message is:
Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
Total things = 125, valid words =
This suggests to me that the problem is the file format. The "Total things" count is 125, which is just the number of strings (the file has 125 lines). The strangest part is that my console displayed all the individual Chinese words correctly, without any problem. The utf8 pragma is in use.
#!/usr/bin/perl -w
use strict;
use utf8;
use Encode qw(encode);
use Encode::HanExtra;
my $input_file = "sample_file.txt";
my ($total, $valid);
my %count;
open (FILE, "< $input_file") or die "Can't open $input_file: $!";
while (<FILE>) {
    foreach (split) { #break $_ into words, assign each to $_ in turn
        $total++;
        next if /\W|^\d+/; #strange words skip the remainder of the loop
        $valid++;
        $count{$_}++; # count each separate word stored in a hash
        ## next comes here ##
    }
}
print "Total things = $total, valid words = $valid\n";
foreach my $word (sort keys %count) {
    print "$word \t was seen \t $count{$word} \t times.\n";
}
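One possible explanation for the undefined $valid, offered as a sketch rather than a confirmed diagnosis: `use utf8` only tells Perl that the script's own source is UTF-8, so lines read from FILE are still raw bytes. Under Perl's default matching rules, a byte above 0x7F in an undecoded string does not count as a word character, so every token matches /\W/ and is skipped. The helper name `describe` and the two-character test string are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the two characters U+4E2D U+6587, written as escapes
# so this sketch does not depend on the encoding of the source file.
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87";

# The same text after decoding the bytes into Perl characters.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: report the length of a string and whether the
# test used in the question (/\W/) would cause it to be skipped.
sub describe {
    my ($label, $s) = @_;
    my $verdict = $s =~ /\W/ ? 'skipped by /\W/' : 'kept';
    return sprintf "%s: length %d, %s", $label, length($s), $verdict;
}

print describe('raw bytes', $bytes), "\n";
print describe('decoded',   $chars), "\n";
```

If this is the cause, reading the file through a decoding layer, for example a three-argument open with '<:encoding(UTF-8)', should make the tokens behave as characters rather than bytes.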
##---Data----
sample_file.txt
??????,???????,????.??????.????:"?????????????,??????,????????.????????,?????????, ???????????.????????,???????????,??????,??????.???:`??,???????????.'?????, ??????????."??????,??????.????.???, ????????????,????,??????,?????????,??????????????. ????????,??????,???????????,????????,????????.????,????,???????, ??????????,??????,????????.??????.
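The sample appears to separate phrases with punctuation rather than spaces, so even correctly decoded tokens from split would usually contain a comma or period and still be rejected by /\W/. A minimal sketch of one alternative, assuming the input has already been decoded: pull out each run of Han characters with \p{Han} and count those. The function name `count_han_runs` and the demo string are hypothetical. Note that this counts punctuation-delimited runs, not linguistically segmented words; true Chinese word segmentation would need a dictionary-based tool.

```perl
use strict;
use warnings;

# Hypothetical helper: tally runs of Han characters in decoded lines.
# Returns the number of whitespace-separated tokens, the number of
# Han runs found, and a hash reference of per-run counts.
sub count_han_runs {
    my @lines = @_;
    my ($total, $valid) = (0, 0);    # start at 0 so neither is ever undef
    my %count;
    for my $line (@lines) {
        for my $token (split ' ', $line) {
            $total++;
            # Each maximal run of Han characters; punctuation is ignored.
            for my $run ($token =~ /(\p{Han}+)/g) {
                $valid++;
                $count{$run}++;
            }
        }
    }
    return ($total, $valid, \%count);
}

# Demo on an in-memory string: three Han runs spread over two tokens.
my $demo = "\x{4E2D}\x{6587},\x{6C49}\x{5B57}. \x{4E2D}\x{6587}";
my ($total, $valid, $count) = count_han_runs($demo);
print "Total things = $total, valid words = $valid\n";

# With a real file, the lines would need decoding on input, e.g.:
#   open my $fh, '<:encoding(UTF-8)', 'sample_file.txt'
#       or die "Can't open sample_file.txt: $!";
#   binmode STDOUT, ':encoding(UTF-8)';
#   my ($t, $v, $c) = count_han_runs(<$fh>);
```

Initializing both counters to 0 also avoids the "uninitialized value" warning when no token qualifies.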