Hi :) This may look like a long question, but it really is not. I cannot work out why my text, after processing, can no longer be read or edited. I used Python's ord() function to check whether the text contains any non-ASCII characters besides the ASCII ones, and I found quite a number of them.
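The kind of ord()-style check mentioned above can be sketched roughly as follows. This is only a minimal illustration in Python 3, not the exact script I ran; `find_non_ascii` and the sample string are made up for the example. It works on raw bytes, so it reports every byte value above 127 together with its offset.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte outside the ASCII range 0-127."""
    return [(offset, byte) for offset, byte in enumerate(bytearray(data)) if byte > 127]


# Illustrative sample only: "e with acute accent" becomes two bytes in UTF-8.
sample = "caf\u00e9 au lait".encode("utf-8")
for offset, byte in find_non_ascii(sample):
    print(offset, byte)
```

For a real file, the bytes would come from `open(path, "rb").read()` instead of the sample string.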
I strongly suspect that the cause is the original text itself (the INPUT).
Input file: just copy and paste it into a file named "acle5v1.txt".
The objective of the code below is to convert upper-case characters to lower case and to remove all punctuation, so that the words can be passed on for further processing (word alignment).
#include<iostream>
#include<fstream>
#include<ctype.h>
#include<cstring>
using namespace std;
ifstream fin2("acle5v1.txt");
ofstream fin3("acle5v1_op.txt");
ofstream fin4("chkcharadded.txt");
ofstream fin5("chkcharntadded.txt");
ofstream fin6("chkprintchar.txt");
ofstream fin7("chknonasci.txt");
ofstream fin8("nonprinchar.txt");
int main()
{
    char ch,ch1;
    fin2.seekg(0);
    fin3.seekp(0);
    int flag = 0;
    while(!fin2.eof())
    {
        ch1=ch;
        fin2.get(ch);
        if (isprint(ch))// if the character is printable
            flag = 1;
        if(flag)
        {
            fin6<<"Printable character:\t"<<ch<<"\t"<<(int)ch<<endl;
            flag = 0;
        }
        else
        {
            fin8<<"Non printable character caught:\t"<<ch<<"\t"<<int(ch)<<endl;
        }
        if( isalnum(ch) || ch == '@' || ch == ' ' )// checks for alpha numeric characters
        {
            fin4<<"char added: "<<ch<<"\tits ascii value: "<<int(ch)<<endl;
            if(isupper(ch))
            {
                //tolower(ch);
                fin3<<(char)tolower(ch);
            }
            else
            {
                fin3<<ch;
            }
        }
        else if( ( ch=='\t' || ch=='.' || ch==',' || ch=='#' || ch=='?' || ch=='!' || ch=='"' || ch != ';' || ch != ':') && ch1 != ' ' )
        {
            fin3<<' ';
        }
        else if( (ch=='\t' || ch=='.' || ch==',' || ch=='#' || ch=='?' || ch=='!' || ch=='"' || ch != ';' || ch != ':') && ch1 == ' ' )
        {
            //fin3<<" ';
        }
        else if( !(int(ch)>=0 && int(ch)<=127) )
        {
            fin5<<"Char of ascii within range not added: "<<ch<<"\tits ascii value: "<<int(ch)<<endl;
        }
        else
        {
            fin7<<"Non ascii character caught(could be a -ve value also)\t"<<ch<<int(ch)<<endl;
        }
    }
    return 0;
}
I also have similar code written in Python, and its output is again neither readable nor editable.
The Python code looks like this:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
input_file=sys.argv[1]
output_file=sys.argv[2]
list1=[]
f=open(input_file)
for line in f:
    line=line.strip()
    #line=line.rstrip('.')
    line=line.replace('.','')
    line=line.replace(',','')
    line=line.replace('#','')
    line=line.replace('?','')
    line=line.replace('!','')
    line=line.replace('"','')
    line=line.replace('?','')
    line=line.replace('|','')
    line = line.lower()
    list1.append(line)
f.close()
f1=open(output_file,'w')
f1.write(' '.join(list1))
f1.close()
The script takes the input and output file names at run time, like this:
python punc_remover.py acle5v1.txt acle5v1_op.txt
The output of this script is written to "acle5v1_op.txt".
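One thing worth trying is a variant of the script that states the file encoding explicitly instead of relying on the default. The sketch below assumes Python 3 and assumes the input is UTF-8, which I have not verified; `clean_line` and `clean_file` are names chosen only for illustration. It removes the same punctuation marks as the script above.

```python
# Same marks the script above strips: . , # ? ! " |
PUNCT_TABLE = str.maketrans("", "", '.,#?!"|')


def clean_line(line):
    """Strip surrounding whitespace, delete the listed punctuation, and lowercase."""
    return line.strip().translate(PUNCT_TABLE).lower()


def clean_file(input_path, output_path, encoding="utf-8"):
    """Clean every line of input_path and write the space-joined result to output_path."""
    # ASSUMPTION: the input is UTF-8. errors="replace" turns undecodable
    # bytes into U+FFFD instead of raising UnicodeDecodeError.
    with open(input_path, encoding=encoding, errors="replace") as src:
        cleaned = [clean_line(line) for line in src]
    with open(output_path, "w", encoding=encoding) as dst:
        dst.write(" ".join(cleaned))


print(clean_line('  Hello, World! "Quoted" #tag  '))
```

It could be called as `clean_file("acle5v1.txt", "acle5v1_op.txt")`. If the input is not actually UTF-8, the replacement characters in the output would at least make that visible.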
After this step, the output file is needed for further processing. This particular file, "acle5v1_op.txt", is the UNREADABLE and UNEDITABLE file that I cannot use for further processing. I need it for word alignment in NLP. I tried reading this output with the following program:
#include<iostream>
#include<fstream>
using namespace std;
ifstream fin1("acle5v1_op.txt");
ofstream fout1("chckread_acle5v1_op.txt");
ofstream fout2("chcknotread_acle5v1_op.txt");
int main()
{
    char ch;
    int flag = 0;
    long int r = 0; long int nr = 0;
    while(!(fin1))
    {
        fin1.get(ch);
        if(ch)
        {
            flag = 1;
        }
        if(flag)
        {
            fout1<<ch;
            flag = 0;
            r++;
        }
        else
        {
            fout2<<"Char not been able to be read from source file\n";
            nr++;
        }
    }
    cout<<"Number of characters able to be read: "<<r;
    cout<<endl<<"Number of characters not been able to be read: "<<nr;
    return 0;
}
This program writes each character out if it is readable and skips it otherwise, but I observed that both of its output files are blank. From this I concluded that the file "acle5v1_op.txt" is UNREADABLE and UNEDITABLE. Could you please help me deal with this problem?
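A byte-level count is another way to see what the output file actually contains, independent of the C++ reader above. This is a minimal sketch in Python 3; `count_bytes` and `inspect_file` are hypothetical helper names, and the split into three classes is just a choice made for illustration.

```python
def count_bytes(data):
    """Count printable ASCII, control/whitespace, and non-ASCII bytes in raw data."""
    counts = {"printable": 0, "control": 0, "non_ascii": 0}
    for byte in bytearray(data):
        if byte > 127:
            counts["non_ascii"] += 1   # part of a multi-byte or 8-bit encoding
        elif 32 <= byte <= 126:
            counts["printable"] += 1   # space through tilde
        else:
            counts["control"] += 1     # tab, newline, NUL, DEL, ...
    return counts


def inspect_file(path):
    """Read a file in binary mode, so no decoding can fail, and classify its bytes."""
    with open(path, "rb") as f:
        return count_bytes(f.read())


print(count_bytes(b"ab\n\xc3\xa9"))
```

Calling `inspect_file("acle5v1_op.txt")` would show whether the file is empty (all counts zero) or simply contains bytes outside the ASCII range.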
Some statistics on the original input file "acle5v1.txt": it has about 3441 lines and about 3 million characters.
Given the number of characters in the file, your editor may or may not be able to open it. I was able to open it in gedit on Fedora 10, which I am currently using. I mention this only to make clear that opening the file in an editor was not the issue, at least in my case.
Can I use a scripting language such as Python or Perl to deal with this problem? If yes, how? Please be specific, as I am a novice in both Perl and Python. Alternatively, could you tell me how to solve this problem in C++ itself? Thank you :) I am really looking forward to any help or guidance on how to approach this problem.