Search Results

Search found 5654 results on 227 pages for 'pdf manipulation'.

  • SharePoint OCR image files indexing

    Introduction

    This article describes how to set up indexing of image files (including TIFF, PDF, JPEG, BMP...) using OCR technology. The indexing described below relies on Microsoft IFilter technology and as such is not specific to SharePoint: it can be used with any product that uses Microsoft indexing (Microsoft Search, Desktop Search, SQL Server search) and, through plug-ins, with Google Desktop Search. I, however, use it with Microsoft Windows SharePoint Services 2003. For the other products, the registration may need to be slightly different.

    Background

    One of the projects I was working on required storage of old documents scanned into PDF files. A separate team of people was responsible for providing tags for a search engine so those image documents could be found. The whole process was clumsy, labor intensive, and error prone. That was what started me on my exploration path.

    OCR

    The first search I fired was for Open Source OCR products. Pretty quickly, I narrowed it down to Tesseract (http://code.google.com/p/tesseract-ocr/). Tesseract is an orphaned brainchild of HP, which worked on it from 1985 to 1995. Then it was released as Open Source, and now, if I understand it correctly, Google is working on it. With credentials like that, it's no wonder that Tesseract scores some of the highest marks for OCR recognition and accuracy. After downloading and struggling just a bit, I got Tesseract to work. The struggling part was that the home page claims that its base input format is a TIFF file. Maybe my TIFFs were bad, but I was able to get it to work only for BMP files.

    Image files conversion

    So now that I have an OCR that can convert BMP files into text, how do I get text out of the image PDF files? One more search, and I settled on ImageMagick (http://www.imagemagick.org/). This is another wonderful Open Source utility that can convert any file into an image. It did work out of the box, converting any TIFF file into a bitmap, but to get PDF files converted, it requires Ghostscript (http://mirror.cs.wisc.edu/pub/mirrors/ghost/GPL/gs864/gs864w32.exe).

    Dealing with text PDFs

    With that utility installed, I was cooking: I can convert any file (in particular PDF and TIFF) into a bitmap, and then I can extract the text out of the bitmap. The only consideration was to somehow treat PDF files containing text differently. After all, OCR is very computation intensive and somewhat error prone even with perfect image quality and resolution. So another quick search, and I have pdftotext (ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip). Thank God for Open Source! With these tools, I can pull text out of a PDF in the blink of an eye. I would get nothing for pure image PDFs, but I already have a solution for that!

    Batch process

    It took another 15 minutes to set up a batch script to automate the process:

      - Check the file extension.
      - If the file is a PDF file:
          - try to extract text out of it;
          - if there is more than a certain amount of text in the file - done!
          - if there is no text, convert the first page into a bitmap and run OCR on the bitmap.
      - For any other file type, convert the file into a bitmap and run OCR on the bitmap.

    Once you unzip the attached project, check out the bin\OCR.BAT file. It will create a temporary file in the directory where your source file is, with the same name + the '.txt' extension.
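
    The batch steps above can be sketched as a small planning function. This is only an illustration of the decision logic, not the contents of OCR.BAT: the function name plan_commands, the MIN_TEXT_CHARS threshold, and the temporary '.bmp' file name are assumptions made for the example. It builds the command lines for pdftotext, ImageMagick's convert, and tesseract without running them.

```python
from pathlib import Path

# Hypothetical threshold: below this many characters a PDF is treated as scanned.
MIN_TEXT_CHARS = 50


def plan_commands(source, extracted_chars=0, min_chars=MIN_TEXT_CHARS):
    """Return the external commands the batch process would run, in order.

    extracted_chars is the amount of text pdftotext produced for a PDF
    (0 for a pure-image PDF); it is ignored for other file types.
    """
    src = Path(source)
    txt = f"{src}.txt"   # same name + '.txt', as the article describes
    bmp = f"{src}.bmp"   # temporary bitmap; the name is an assumption
    commands = []

    if src.suffix.lower() == ".pdf":
        # Cheap path first: pull embedded text out of the PDF.
        commands.append(["pdftotext", str(src), txt])
        if extracted_chars >= min_chars:
            return commands          # enough text, no OCR needed
        page = f"{src}[0]"           # ImageMagick syntax for the first page only
    else:
        page = str(src)

    # Rasterize to a bitmap, then OCR it. Tesseract appends '.txt'
    # to the output base name it is given.
    commands.append(["convert", page, bmp])
    commands.append(["tesseract", bmp, str(src)])
    return commands
```

    Each returned list could be handed to subprocess.run; in a real run, extracted_chars would be measured from the pdftotext output before deciding whether to continue to OCR.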

  • Terminal non-responsive on load, can't enter anything until CTRL+C

    - by Silver Light
    Hello! I have an issue with the terminal in Ubuntu 10.04. When I launch it, it hangs, and I cannot do anything until I press CTRL+C. I cannot remember when this started. What can be wrong? It looks like the terminal is loading or processing something each time it starts. How can I diagnose and solve this problem?

    EDIT: Here are the contents of ~/.bashrc:

        # ~/.bashrc: executed by bash(1) for non-login shells.
        # see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
        # for examples

        # If not running interactively, don't do anything
        [ -z "$PS1" ] && return

        # don't put duplicate lines in the history. See bash(1) for more options
        # ... or force ignoredups and ignorespace
        HISTCONTROL=ignoredups:ignorespace

        # append to the history file, don't overwrite it
        shopt -s histappend

        # for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
        HISTSIZE=1000
        HISTFILESIZE=2000

        # check the window size after each command and, if necessary,
        # update the values of LINES and COLUMNS.
        shopt -s checkwinsize

        # make less more friendly for non-text input files, see lesspipe(1)
        [ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"

        # set variable identifying the chroot you work in (used in the prompt below)
        if [ -z "$debian_chroot" ] && [ -r /etc/debian_chroot ]; then
            debian_chroot=$(cat /etc/debian_chroot)
        fi

        # set a fancy prompt (non-color, unless we know we "want" color)
        case "$TERM" in
            xterm-color) color_prompt=yes;;
        esac

        # uncomment for a colored prompt, if the terminal has the capability; turned
        # off by default to not distract the user: the focus in a terminal window
        # should be on the output of commands, not on the prompt
        #force_color_prompt=yes

        if [ -n "$force_color_prompt" ]; then
            if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
                # We have color support; assume it's compliant with Ecma-48
                # (ISO/IEC-6429). (Lack of such support is extremely rare, and such
                # a case would tend to support setf rather than setaf.)
                color_prompt=yes
            else
                color_prompt=
            fi
        fi

        if [ "$color_prompt" = yes ]; then
            PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
        else
            PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
        fi
        unset color_prompt force_color_prompt

        # If this is an xterm set the title to user@host:dir
        case "$TERM" in
        xterm*|rxvt*)
            PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
            ;;
        *)
            ;;
        esac

        # enable color support of ls and also add handy aliases
        if [ -x /usr/bin/dircolors ]; then
            test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
            alias ls='ls --color=auto'
            #alias dir='dir --color=auto'
            #alias vdir='vdir --color=auto'

            alias grep='grep --color=auto'
            alias fgrep='fgrep --color=auto'
            alias egrep='egrep --color=auto'
        fi

        # some more ls aliases
        alias ll='ls -alF'
        alias la='ls -A'
        alias l='ls -CF'

        # Add an "alert" alias for long running commands. Use like so:
        #   sleep 10; alert
        alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

        # Alias definitions.
        # You may want to put all your additions into a separate file like
        # ~/.bash_aliases, instead of adding them here directly.
        # See /usr/share/doc/bash-doc/examples in the bash-doc package.

        if [ -f ~/.bash_aliases ]; then
            . ~/.bash_aliases
        fi

        # enable programmable completion features (you don't need to enable
        # this, if it's already enabled in /etc/bash.bashrc and /etc/profile
        # sources /etc/bash.bashrc).
        if [ -f /etc/bash_completion ] && ! shopt -oq posix; then
            . /etc/bash_completion
        fi

        # Source .profile
        if [ -f ~/.profile ]; then
            . ~/.profile
        fi

    Setting -x at the beginning showed me that it tries to repeat this without stopping:

        +++++++++++++++++++ '[' 'complete -f -X '\''!*.@(pdf|PDF)'\'' acroread gpdf xpdf' '!=' 'complete -f -X '\''!*.@(pdf|PDF)'\'' acroread gpdf xpdf' ']'
        +++++++++++++++++++ line='complete -f -X '\''!*.@(pdf|PDF)'\'' acroread gpdf xpdf'
        +++++++++++++++++++ line='complete -f -X '\''!*.@(pdf|PDF)'\'' acroread gpdf xpdf'
        +++++++++++++++++++ line=' acroread gpdf xpdf'
        +++++++++++++++++++ list=("${list[@]}" $line)
        +++++++++++++++++++ read line

  • Save Java frame as a Microsoft Word or PDF document?

    - by Jason
    I am working on a billing program - right now when you click the appropriate button it generates a frame that shows the various charges etc, basically an invoice. Is there a way to give the user an option of saving that frame as a document, either Microsoft Word, Microsoft Works or PDF?

  • Is it possible to generate an XSL-FO template from a PDF?

    - by Vihung
    Given a PDF document, is it possible to generate an XSL-FO (FOP) template? Obviously, this would be a one-time thing: the generated template would just be a starting point for creating a proper template that pulls in the appropriate data. For me, the ideal tool for doing so would be Java-based and executable from the command line or through an Ant task. Failing that, it would be something that runs on Linux and Mac OS X.

  • Is there a good free (preferably PDF) bash tutorial online?

    - by morpheous
    I find myself messing around with scripts a lot more than I used to, and my lack of knowledge in this area (and in Linux sysadmin/security in general) is becoming a hindrance. Can anyone recommend a good online resource for bash scripting/Linux admin? Preferably, it will be in PDF format, so I can copy it (as a single file) onto my PDA.

  • fPDF: how to strikeout/strikethrough justified text in multicell?

    - by SWilk
    Hi, I am generating a PDF with fPDF. I need to strike through a long text inside a MultiCell. The text is justified to both left and right, which is probably the source of the problem. Here is my code:

        //get the starting x and y of the cell to strikeout
        $strikeout_y_start = $pdf->GetY();
        $strikeout_x = $pdf->GetX();
        $strikeText = "Some text with no New Lines (\n), which is wrapped automatically, cause it is very very very very very very very very very very long long long long long long long long long long long long long long long long long long";

        //draw the text
        $pdf->MultiCell(180, 4, $strikeText);

        //get the y end of cell
        $strikeout_y_end = $pdf->GetY();
        $strikeout_y = $strikeout_y_start + 2;
        $strikeCount = 0;
        for ($strikeout_y; $strikeout_y < $strikeout_y_end - 4; $strikeout_y += 4) {
            $strikeCount++;
            //strike out the full width of all lines but last one - works OK
            $pdf->Line($strikeout_x, $strikeout_y, $strikeout_x + 180, $strikeout_y);
        }

        //this works, but gives incorrect results
        $width = $pdf->GetStringWidth($strikeText);
        $width = $width - $strikeCount * 180;
        //the line below will strike out some text, but not all the letters of last line
        $pdf->Line($strikeout_x, $strikeout_y, $strikeout_x + $width, $strikeout_y);

    The problem is that, because the text in the MultiCell is justified (and has to be), the spaces in the earlier lines are wider than GetStringWidth assumes, so GetStringWidth underestimates the full width of this text. As a result, only about 70% of the last line is struck out, and some letters at the end of it are not. Any ideas how to calculate the width of the last line in a MultiCell?
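
    One way to reason about the last line is to find which words end up on it and measure only those, since the final line of a justified paragraph is normally not stretched. The sketch below illustrates that idea with a simple greedy word wrap. It is not fPDF's actual wrapping code: last_line and fixed_width are hypothetical helpers, fixed_width stands in for a real font metric such as GetStringWidth, and cell margins and over-long words are ignored.

```python
def last_line(text, max_width, string_width):
    """Greedy word wrap; return the text of the final wrapped line."""
    line = ""
    for word in text.split():
        candidate = word if not line else line + " " + word
        if line and string_width(candidate) > max_width:
            line = word          # the word does not fit, so it starts a new line
        else:
            line = candidate
    return line


def fixed_width(s, char_width=2.0):
    # Stand-in for a real font metric: every character is char_width wide.
    return len(s) * char_width


text = "aaa bbb ccc ddd ee"
tail = last_line(text, max_width=16, string_width=fixed_width)
strike_width = fixed_width(tail)   # length of the strike-through for the last line
```

    With a real font metric in place of fixed_width, strike_width would replace the value obtained by subtracting full line widths from the width of the whole string.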

  • How to know if all the thread pool's threads are already done with their tasks?

    - by mcxiand
    I have an application that recurses through all folders in a given directory and looks for PDF files. If a PDF file is found, the application counts its pages using iTextSharp. I did this by using a thread to recursively scan all the folders for PDFs; whenever a PDF is found, it is queued to the thread pool. The code looks like this:

        //spawn a thread to handle the processing of pdf on each folder.
        var th = new Thread(() =>
        {
            pdfDirectories = Directory.GetDirectories(pdfPath);
            processDir(pdfDirectories);
        });
        th.Start();

        private void processDir(string[] dirs)
        {
            foreach (var dir in dirs)
            {
                pdfFiles = Directory.GetFiles(dir, "*.pdf");
                processFiles(pdfFiles);
                string[] newdir = Directory.GetDirectories(dir);
                processDir(newdir);
            }
        }

        private void processFiles(string[] files)
        {
            foreach (var pdf in files)
            {
                ThreadPoolHelper.QueueUserWorkItem(
                    new { path = pdf },
                    (data) => { processPDF(data.path); }
                );
            }
        }

    My problem is: how do I know that the thread pool's threads have finished processing all the queued items, so I can tell the user that the application is done with its intended task?
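
    The usual shape of a solution is to keep a handle for every queued work item and block until all of them have completed. The sketch below shows that pattern in Python with concurrent.futures, purely as an illustration and not as a translation of the C# above: process_pdf is a placeholder for the real page counting, and process_tree is a hypothetical name. In .NET the same idea is commonly expressed with a CountdownEvent or with Task.WaitAll.

```python
import os
from concurrent.futures import ThreadPoolExecutor, wait


def process_pdf(path):
    # Placeholder for the real work (for example, counting pages).
    return os.path.getsize(path)


def process_tree(root, workers=4):
    """Queue every PDF under root, then block until all work items finish."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if name.lower().endswith(".pdf"):
                    path = os.path.join(dirpath, name)
                    futures.append(pool.submit(process_pdf, path))
        wait(futures)  # returns only when every queued item is done
    return [f.result() for f in futures]
```

    Once process_tree returns, every queued item has finished, so that is the point at which the user can be told the job is complete. Unlike the original code, os.walk also visits files directly inside the root folder.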

  • Get filename from path

    - by Eric
    I am trying to parse the filename from paths. I have this:

        my $filepath = "/Users/Eric/Documents/foldername/filename.pdf";
        $filepath =~ m/^.*\\(.*[.].*)$/;
        print "Linux path:";
        print $1 . "\n\n";
        print "-------\n";

        my $filepath = "c:\\Windows\eric\filename.pdf";
        $filepath =~ m/^.*\\(.*[.].*)$/;
        print "Windows path:";
        print $1 . "\n\n";
        print "-------\n";

        my $filepath = "filename.pdf";
        $filepath =~ m/^.*\\(.*[.].*)$/;
        print "Without path:";
        print $1 . "\n\n";
        print "-------\n";

    But that returns:

        Linux path:
        -------
        Windows path:Windowsic ilename.pdf
        -------
        Without path:Windowsic ilename.pdf
        -------

    I am expecting this:

        Linux path: filename.pdf
        -------
        Windows path: filename.pdf
        -------
        Without path: filename.pdf
        -------

    Can somebody please point out what I am doing wrong? Thanks! :)
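
    Three things appear to combine in the output above. The pattern requires a literal backslash, so it cannot match the Unix path or the bare filename. A failed match leaves $1 holding the value from the previous successful match. And inside a double-quoted Perl string, \e and \f are escape sequences (escape and form feed), which garbles the Windows path. A pattern that simply takes the trailing run of non-separator characters avoids the first problem. The sketch below shows that pattern in Python as an illustration; the helper name filename is hypothetical.

```python
import re

# One or more characters that are neither '/' nor '\', anchored at the end.
LAST_COMPONENT = re.compile(r"[^/\\]+$")


def filename(path):
    """Return the last path component, or None if the path ends in a separator."""
    match = LAST_COMPONENT.search(path)
    return match.group(0) if match else None


print(filename("/Users/Eric/Documents/foldername/filename.pdf"))
print(filename(r"c:\Windows\eric\filename.pdf"))   # raw string keeps the backslashes
print(filename("filename.pdf"))
```

    The same character class works in a Perl regex, and checking whether the match succeeded before using the capture avoids printing a stale value.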

  • Prawn image position

    - by John
    I'm trying to lay out 6 images per page with Prawn in Ruby:

        case (idx % 6) # ugly
        when 0 : (pdf.start_new_page; pdf.image img, :position => :left, :vposition => :top, :width => 270)
        when 1 : pdf.image img, :position => :right, :vposition => :top, :width => 270
        when 2 : pdf.image img, :position => :left, :vposition => :center, :width => 270
        when 3 : pdf.image img, :position => :right, :vposition => :center, :width => 270
        when 4 : pdf.image img, :position => :left, :vposition => :bottom, :width => 270
        when 5 : pdf.image img, :position => :right, :vposition => :bottom, :width => 270
        end

    I'm not sure what I'm doing wrong, but it prints the first 3 images to the PDF, then creates a new page and prints the last three:

        Page 1: <img> <img> <blank> <blank> <blank> <blank>
        Page 2: <blank> <blank> <blank> <img> <img> <img>

    Any suggestions would help.
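
    One way to sidestep flow-based positioning is to compute explicit coordinates for each of the six slots and pass them to Prawn's image call through its :at => [x, y] option, which takes the top-left corner of the image. The arithmetic is sketched below in Python as an illustration. The function name slot_origin is hypothetical, and the 540 x 720 point usable area (US Letter with 36-point margins) is an assumption; only the 270-point image width comes from the question.

```python
def slot_origin(idx, page_width=540, page_height=720, cols=2, rows=3):
    """Top-left corner (x, y) of the grid cell for image number idx.

    Uses PDF-style coordinates: origin at the bottom-left, y grows upward.
    """
    slot = idx % (cols * rows)            # position within the current page
    col, row = slot % cols, slot // cols  # fill left to right, top to bottom
    x = col * page_width / cols
    y = page_height - row * page_height / rows
    return x, y


for idx in range(6):
    print(idx, slot_origin(idx))
```

    A new page would still be started whenever idx % 6 is 0, as in the original case statement.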

  • How to rename many URL-escaped (%XX) files to human-readable form

    - by F. Hauri
    I have downloaded a lot of files into one directory, but many of them are stored with URL-escaped filenames, containing percent signs followed by two hexadecimal characters, like:

        ls -ltr $HOME/Downloads/
        -rw-r--r-- 2 user user 13171425 24 nov 10:07 Swisscom%20Mobile%20Unlimited%20Kurzanleitung-%282011-05-12%29.pdf
        -rw-r--r-- 2 user user 1525794 24 nov 10:08 31010ENY-HUAWEI%20E173u-1%20HSPA%20USB%20Stick%20Quick%20Start-%28V100R001_01%2CEnglish%2CIndia-Reliance%2CC%2Ccolor%29.pdf
        ...

    All these names match the following form, with exactly 3 parts:

        Name of the object -( Revision, and/or Date, useless ... ). Extension

    My goal is to have one command that renames all these files to obtain:

        -rw-r--r-- 2 user user 13171425 24 nov 10:07 Swisscom_Mobile_Unlimited_Kurzanleitung.pdf
        -rw-r--r-- 2 user user 1525794 24 nov 10:08 31010ENY-HUAWEI_E173u-1_HSPA_USB_Stick_Quick_Start.pdf

    I've successfully done the job in pure bash with:

        urlunescape() {
            local srce="$1" done=false part1 newname ext
            while ! $done ;do
                part1="${srce%%%*}"
                newname="$part1\\x${srce:${#part1}+1:2}${srce:${#part1}+3}"
                [ "$part1" == "$srce" ] && done=true || srce="$newname"
            done
            newname="$(echo -e $srce)"
            ext=${newname##*.}
            newname="${newname%-(*}"
            echo ${newname// /_}.$ext
        }

        for file in *;do
            mv -i "$file" "$(urlunescape "$file")"
        done

        ls -ltr
        -rw-r--r-- 2 user user 13171425 24 nov 10:07 Swisscom_Mobile_Unlimited_Kurzanleitung.pdf
        -rw-r--r-- 2 user user 1525794 24 nov 10:08 31010ENY-HUAWEI_E173u-1_HSPA_USB_Stick_Quick_Start.pdf

    or using sed, tr, bash ... and sed:

        for file in *;do
            echo -e $( echo $file | sed 's/%\(..\)/\\x\1/g' ) |
                sed 's/-(.*\.\([^\.]*\)$/.\1/' |
                tr \ \\n _\\0 |
                xargs -0 mv -i "$file"
        done

        ls -ltr
        -rw-r--r-- 2 user user 13171425 24 nov 10:07 Swisscom_Mobile_Unlimited_Kurzanleitung.pdf
        -rw-r--r-- 2 user user 1525794 24 nov 10:08 31010ENY-HUAWEI_E173u-1_HSPA_USB_Stick_Quick_Start.pdf

    But I'm sure there must be a simpler and/or shorter way to do this.
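
    For comparison, the same three transformations (decode the %XX escapes, drop the trailing '-( ... )' part, replace spaces with underscores) can be sketched with Python's standard library. This is one possible shorter route rather than a drop-in replacement for the shell versions above; the helper names clean_name and rename_all are made up for the example, and rename_all skips a file instead of prompting when the target name already exists.

```python
import os
import re
from urllib.parse import unquote


def clean_name(escaped):
    """Decode %XX escapes, drop the trailing '-(...)' part, use underscores."""
    name = unquote(escaped)
    # Remove '-( ... )' when it sits directly before the extension.
    name = re.sub(r"-\(.*\)(\.[^.]*)$", r"\1", name)
    return name.replace(" ", "_")


def rename_all(directory):
    """Rename every entry in directory; never overwrite an existing file."""
    for entry in os.listdir(directory):
        target = clean_name(entry)
        dest = os.path.join(directory, target)
        if target != entry and not os.path.exists(dest):
            os.rename(os.path.join(directory, entry), dest)
```

    Calling rename_all on the download directory would apply clean_name to each entry in turn.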
