guiprep.pl  Preprocessing toolkit - Current version .40


guiprep.pl  (644k) Includes both a perl script version and a compiled Windows executable. **NOTE** Both require that you have either perl 5.8.0 or later installed or the perl runtime libraries (prl03.zip) installed. If you already have prl03 installed for guiguts, there is no need to download it again.


prl03.zip - (5773k) perl runtime libraries - contains a full complement of perl libraries with Tk804.026 installed to allow either perl scripts or the Gui* executables to run on your system. Installation instructions and information on compiling your own package on this page.

Written by Steve Schulze (thundergnat).

Also see my post processing toolkit - Guiguts.pl

Questions or comments? Leave a message in the Distributed Proofreaders forums or private message me as "thundergnat".

Portions of this script are derived from RTF::Tokenizer by Peter Sergeant.
For more information on the RTF file format, SEE: The_RTF_Cookbook by Sean M. Burke

The included pngcrush.exe is a Windows/DOS compiled version of pngcrush, a png file compression tool. It will losslessly reduce the size of png files. Most image creation programs do not optimally compress png files. Get the latest version of pngcrush.exe at sourceforge (make sure you get the executable unless you are planning to compile it yourself) or go to the pngcrush home page for more information. The version included with the script is the lowest common denominator version. If you have an MMX capable processor, a faster, MMX enabled version is available. Uncompress it and place it in the pngcrush directory in the guiprep folder. Make sure the included readme text file is named "README.txt" so the help button can find it. Some distributions I have seen have the help file named just "README".



This software has no guarantees as to its fitness to do this or any other task. Any damages to your computer, data, your mental health or anything else as a result of using this software are your problem and not mine. If not satisfied, your purchase price will be cheerfully refunded.

This program may be freely distributed, used, and modified. Reverse engineering is condoned and encouraged. If you come up with some really cool addition (or even just an idea) let me know, and it may be included in future releases. If you do reuse some of my code, I would appreciate you mentioning it in the comments of your script and dropping me a line to let me know...


This script requires a perl interpreter to run. The ActiveState perl interpreter is probably the most popular for Windows users (95, 98, 98se, ME, NT, 2K, XP). They also have versions available for Linux and Solaris. It's very functional and free. (They do ask that you register, but you can bypass the registration page without entering anything.) I personally would recommend the 5.8.0 distribution. Windows users should use the Microsoft Installer (MSI) version; it is very simple and automatic to set up. If you don't have Microsoft Installer, a link is included on the Activeperl download page.


What is it for?
What's new?
Setting up the text files:
with RTF Markup Extraction:
without RTF Markup Extraction:
without RTF Markup Extraction or Dehyphenization:
Using the script:
Select Options tab
Process Text tab
Search tab
Remove Headers tab
Change Directory tab
Program Prefs tab
FTP tab
Troubleshooting
Known Bugs
Changelog History


What is it for?

Given a set of rtf files output from an Optical Character Recognition (OCR) program, this tool will extract the text along with italic and bold markup from the .rtf files and save it as text, rejoin end-of-line hyphenated words, filter out bad and undesirable characters, check for common scannos* and check for zero byte files to help automate preparation of files for Distributed Proofreaders. If your .png files are in a directory named PNGS, it can rename the .png files into the upload format. It can also, if desired, run a png size optimizer on the files. You can queue up several projects and process them in a batch. It provides a mechanism to semi-automate header removal and provides hooks to link in your favorite text editor and image viewer to help check files. There is also a mini FTP client built in that automates uploading a project to the site.

*[A scanno is like a typo... only from a scanner instead of a typist.]



What's new?

Version .40 (644k) Argh. When I added the option to extract the small caps markup from the RTF files, I broke the handler for small caps if you WEREN'T extracting the markup. Fixed now.

Modified how the Processing functions display progress. They used to just print a dot to the screen for each page (file) that was completed. That worked fine as long as there weren't any problems. If there WAS a problem, it was extremely tedious to try to count the dots to figure out which file was causing it. Changed it to print an incremented counter mod 10. It will print the digits 123456789012345... and so on. That should make it much easier to figure out which file causes a problem when one occurs.
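The counter idea can be sketched in a few lines. This is an illustration in Python rather than the script's own Perl, and the function names are made up for the example:

```python
import sys

def progress_digit(count):
    """Return the last digit of a 1-based page counter as a string."""
    return str(count % 10)

def show_progress(pages, out=sys.stdout):
    """Write one digit per processed page: 1234567890123..."""
    for count, _page in enumerate(pages, start=1):
        # ... the real per-page work would happen here ...
        out.write(progress_digit(count))
        out.flush()
    out.write("\n")
```

If the output stops after "12345678901234", the 15th file is the one to look at.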

Fixed an obscure problem with code page handling during RTF extraction. Set it to have a reasonable default if it couldn't determine the codepage.

Tightened up a bunch of code in the font table and codepage handling code. Made it much more memory efficient (and probably faster, though negligibly so.)


History.


Setting up the text files:

There are two different dehyphenization routines. One works with a single set of files, the files with line breaks preserved, which is the format needed by the Distributed Proofreaders site. The other uses two sets of files, one set with line breaks and one set without. The two-set routine will yield better accuracy during dehyphenization at the expense of slightly longer processing time and more disk storage space. To do two-set dehyphenization, save the text from ABBYY FineReader (or possibly other OCR packages; it should work as long as they produce standard, well formed rtf files) two times in two different directories. Assuming you have a project directory named "PROJECT", under the project directory you will need two directories, "textw" and "textwo". "textw" stands for "text with line breaks" and "textwo" stands for "text without line breaks". If you are only going to do single-set dehyphenization, you only need to follow the instructions for the "textw" directory.

 with RTF Markup Extraction:

In ABBYY after all of your images are loaded and OCRed, select File => Save Text As;




A dialog box will pop up.

In the "textw" directory, save the text with the settings: Save as type Rich Text Format, Create a separate file for each page, Retain font and font size. On the RTF tab of the Formats Settings, check Keep page breaks and Keep line breaks and uncheck everything else. It doesn't matter what the File name is set to. The default is probably fine.






In the "textwo" directory, save the text with the settings: Save as type Rich Text Format, Create a separate file for each page, Retain font and font size. On the RTF tab of the Formats Settings, check Keep page breaks and Remove optional hyphens and uncheck everything else. Make sure the File name is set the same as in the textw directory.





without RTF Markup Extraction:

If you don't want to do markup extraction, (or your OCR package won't support RTF files) you can skip saving the files as RTFs and just save them as plain text files. Again, to do dehyphenization, you will need to save the files in two directories, textw and textwo.

Save the text with line breaks in textw. The ISO Latin-1 code page will give you pretty good results for English and most European languages. The site works with ISO Latin-1, so that may be the least problematic to fit into the character space used. Windows codepage 1252 should also work well since it overlaps Latin-1 very closely, and where it doesn't, the filter routine will convert characters that don't fall within Latin-1. This may actually yield better results than trying to force the OCR to fit the text into the Latin-1 character set.






The textwo directory should use all of the same settings except that Keep line breaks needs to be unchecked. Be sure to use the same code page and file names in both the textw and textwo directories.




At this point the script is used exactly the same way except you'll skip the Extract Markup routine.


 without RTF Markup Extraction or Dehyphenization:

If you are using a different OCR package that can't save as rtf or do automatic line rejoining, you may need to skip those two functions. Save the files in a directory named "text" using the same settings as for textw without RTF extraction above. Uncheck both Extract and Dehyphenate under the Process Text tab.





Using the script:

When you run the script, a Graphical User Interface will pop up allowing you to select options, select the working directory, and process the text files. One implication of this is that the script no longer NEEDS to run from the working directory. In fact, it will work better if you run it from the same directory each time and change to the working directory after it starts, because it saves all of the option settings in the directory it is started in, in a file named settings.rc (rc is a standard extension for resource files), and it looks for the scannos.rc file in the startup directory. The script remembers the last directory you were working in and reopens to that directory the next time you run it.

Select Options tab

The Select Options tab will allow you to adjust the markup used for italics and bold extraction and set the options you want the filter routine to run. The Save Settings button will save your markup and selections from session to session. The Default Markup button will change all the markup text back to defaults. If there is little or no bold in your text, you may want to disable bold extraction to cut down on false positives. The other settings are all options for the filter routine. See discussion below under Filtering for suggestions and explanations for different settings.

There are a few options having to do with batch processing.

Extract Bold Markup - If you don't have much bold text in your project you may want to disable this to cut down on false positives, especially for lower quality scans.

Insert cell delimiters in tables - If you have tables in your project, the script will try to keep the layout as much as it can. The cells usually will not come out exactly as in the original, so you can add markers, "|", between the cells to help the proofers align them.

Extract sub/superscript markup - Select whether to extract subscripts and superscripts while doing markup extraction.

Dehyphenate using German style hyphens; "=" - Option to dehyphenate German texts.

Header Removal - You can now select whether you want to run automatic header removal on your text files during batch processing. It will automatically remove the top line from every text file. THIS MAY POSSIBLY REMOVE LINES THAT SHOULDN'T BE REMOVED. USE WITH CARE. It is highly recommended that header removal be done in interactive mode if feasible.

Build a zip of the project files - The site promises to soon have the capability to upload the project files as a zip file, possibly through a web interface rather than FTP. This option will generate a zip archive containing all of the files in the "text" and "pngs" directories (or whatever you chose to name your image directory). It will be written to the project directory with the name of the project directory used as the name of the zip file.
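For readers who want to build the same kind of archive by hand, here is a rough Python sketch. It is not the script's own code: the function name is invented, and keeping the "text/" and "pngs/" folder names inside the archive is an assumption, since the layout inside the zip is not described above.

```python
import zipfile
from pathlib import Path

def build_project_zip(project_dir, image_dir="pngs"):
    """Zip the files in text/ and the image directory into <project>.zip."""
    project = Path(project_dir).resolve()
    zip_path = project / (project.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for sub in ("text", image_dir):
            folder = project / sub
            if not folder.is_dir():
                continue  # skip a directory that has not been created yet
            for path in sorted(folder.iterdir()):
                if path.is_file():
                    # Assumption: keep the sub-directory name in the archive.
                    archive.write(path, arcname=f"{sub}/{path.name}")
    return zip_path
```

Calling build_project_zip("PROJECT") would write PROJECT.zip into the PROJECT directory.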



Filtering options:


 
As of now, the pattern substitution/filtering functions the script will perform are:


• Remove extra (multiple) spaces in text. - Highly recommended. Makes all of the other filtering more effective. Default on.

• Convert Windows-1252 codepage glyphs 80-9F. - Highly recommended. Will need to be fixed eventually, may as well do it now. Default on.

• Remove spaces at end of line. - Recommended. Not a big deal either way but may make the proofers job easier. Will help later during rewrapping. Default on.

• Convert spaced hyphens to em dashes. - Recommended. Correct behavior for most texts. Not recommended for math texts. Default on.

• Convert multiple consecutive underscores to em dashes. - Recommended. Correct behavior for most texts. Default on.
 
• Remove spaces on either side of hyphens. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Not recommended for math texts. Default on.

• Convert double commas to a single double quote. - Recommended. Usually correct behavior. Default on.

• Remove spaces on either side of em dashes. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Not recommended for math texts. Default on.

• Remove space before periods. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Default on.

• Remove space before exclamation points. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Default on.

• Remove space before question marks. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Default on.

• Remove space before commas. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Default on.

• Remove space before semicolons. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Default on.

• Remove space after opening and before closing brackets. - Recommended. Easily automated formatting fix. Correct behavior most of the time. Default on.

• Strip space after start & before end doublequotes. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Default on.

• Ensure space before ellipses except after period. - Recommended. Easily automated formatting fix. Correct behavior most of the time. Default on.

• Convert two adjacent single quotes to a single double quote. - Highly recommended. Easily automated formatting fix. Correct behavior more than 99% of the time. Default on.

• Convert solitary 1 to I, if not at beginning of line, or if preceded by quotes. - Recommended. Depends on text. For vast majority does much more good than harm. Default on. *See note below.

• Convert solitary lowercase l to I if preceded by space or quotes. - Recommended. Depends on text. For vast majority does much more good than harm. Default on.

• Convert solitary 0 preceded by quotes to O. - Recommended. Depends on text. For vast majority does much more good than harm. *See note below. Default on.

• Convert vulgar fractions (¼,½, ¾) to "1/4", "1/2" and "3/4". - Your choice. Depends on book. Depends on your preference. Default on.

• Convert ² and ³ to "^2" and "^3". - Your choice. Depends on book. Depends on your preference. Default on.

• Convert £ to "Pounds". - Your choice. Depends on book. Depends on your preference. Default off. *See note below:

• Convert ¢ to "cents". - Your choice. Depends on book. Depends on your preference. Default off. *See note below:

• Convert § to "Section". - Your choice. Depends on book. Depends on your preference. Default off.

• Convert ° to "degrees". - Your choice. Depends on book. Depends on your preference. Default off.

• Convert forward slash (/) at a word end to comma apostrophe (,'). - Your choice. Depends on book. Depends on your preference. Default on. (Will ignore a slash that follows a less-than sign, as in "</".)

• Convert \v or \\ to w. - Your choice. Fairly common scanno.  Depends on your preference. Default on.

• Convert j, solitary or at the end of a word and not preceded by "a, e, n or u", to semicolon. - Your choice. Depends on book. Depends on your preference. Default on.

• Convert string 'tli' to 'th' if it is at the beginning of a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string 'tii' to 'th' if it is at the beginning of a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string 'wli' to 'wh' if it is at the beginning of a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string 'rn' to 'm' if it is at the beginning of a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string 'hl' to 'bl' if it is at the beginning of a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string 'hr' to 'br' if it is at the beginning of a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string 'rnp' to 'mp' in a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert vv at the beginning of a word to w - Recommended, default on.

• Convert !! at the beginning of a word to H - Recommended, default on.

• Convert initial X not followed by e to N - Also takes into account Roman Numerals, Recommended, default on.

• Convert ! inside a word to l - Recommended, default on.

• Convert '11 to 'll - Recommended, default on.

• Convert rnm in a word to mm - Recommended, default on.

• Convert string 'cb' to 'ch' in a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string 'gbt' to 'ght' in a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on.

• Convert string '[ai]hle' to '[ai]ble' in a word. - Very highly recommended for English texts, especially if you are going to run the Scanno check. Recommended with caution for non-English. Default on. [ai] means: either a or i.

• Convert cl at the end of a word to d - Recommended, default on.

• Convert pbt in a word to pht - Recommended, default on.

• Convert whole words string 'to he' to 'to be'. - Very highly recommended. Almost always correct behavior. Default on.

• Move punctuation outside of markup. - Highly recommended if you have extracted markup. Otherwise not. Default on.

• Remove empty lines from the top of the file. - Highly recommended. Easily automated formatting fix. Default on.

• Convert multiple consecutive empty lines to a single one. - Recommended. Usually correct behavior. Easy to fix if not. Default on.

• Remove empty lines from the bottom of the file. - Highly recommended. Easily automated formatting fix. Default on.

• If top line has nothing but digits, (page number) delete it. - Recommended. Up to your personal preference. Default on.

• If bottom line has nothing but digits, (page number) delete it. - Recommended. Up to your personal preference. Default on.

They are all selectable from the options page.
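To give a feel for what these pattern substitutions look like, here is a small Python sketch of a handful of the filters listed above. It is an illustration only: the regular expressions are my own approximations of the descriptions, not the expressions the script actually uses, and the real script applies many more rules in its own order.

```python
import re

# (pattern, replacement) pairs, applied in order. Each one approximates
# one of the filter descriptions above; none is taken from the script.
FILTERS = [
    (re.compile(r" {2,}"), " "),             # remove extra (multiple) spaces
    (re.compile(r" +$", re.MULTILINE), ""),  # remove spaces at end of line
    (re.compile(r" +([!?,;])"), r"\1"),      # remove space before ! ? , ;
    (re.compile(r"\btli"), "th"),            # 'tli' at start of a word -> 'th'
    (re.compile(r"\bvv"), "w"),              # 'vv' at start of a word -> 'w'
    (re.compile(r"rnp"), "mp"),              # 'rnp' in a word -> 'mp'
    (re.compile(r"\bto he\b"), "to be"),     # whole words 'to he' -> 'to be'
]

def apply_filters(text, filters=FILTERS):
    """Run every substitution over the text, one after another."""
    for pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return text
```

For example, apply_filters("tlie cat  vvent to he near tlie larnp , then left  ") gives "the cat went to be near the lamp, then left". Running the space cleanup first is what lets the later word-level patterns stay simple.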

The "improbable character combination" filters (tli, rn, wli, hl, hr, rnp, cb, gbt, [ai]hle) DEFINITELY should be run if you intend to run Fix Common Scannos. Those filters reduce the number of checks that need to be done by the scanno routine by 330 words, yet effectively add several thousand.

*After ad hoc testing of about 50 texts pulled from PG at random, solitary I is about 90 times more likely than solitary 1. If instances at the beginning of lines are ignored, it rises to about 150 times. Pretty good odds I think.

*Solitary 0 (with nothing but space on either side) is automatically converted to O. This is non-negotiable. Because of the way the dehyphenate subroutine works, if it encounters a solitary 0 in the text, it will delete the rest of the paragraph. I would rather have a few misconverted O's than deleted paragraphs. (It's not really the dehyphenate subroutine's fault, it's more just a consequence of perl's weak variable typing, but I digress.) This is not just my dehyphenate routine; aldarondo's has the same problem but doesn't trap it.

* £ to "Pounds" uses some intelligence when it converts. It will move the "Pounds" to after the number, i.e. £30 will become '30 Pounds', not 'Pounds 30'.

* ¢ converts to "cents" unless it follows a solitary 1, in which case it converts to "cent"
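The two currency notes can be sketched as follows. This is a Python approximation, not the script's code. The function names, the handling of a £ with no number after it, and the space inserted before "cent"/"cents" are all assumptions made for the example.

```python
import re

def convert_pounds(text):
    """Turn '£30' into '30 Pounds' by moving the word after the number."""
    text = re.sub(r"£\s*(\d[\d,.]*\d|\d)", r"\1 Pounds", text)
    # Assumption: a £ sign with no number after it becomes a bare 'Pounds'.
    return text.replace("£", "Pounds")

def convert_cents(text):
    """Turn '¢' into 'cents', or 'cent' when it follows a solitary 1."""
    # A solitary 1 is taken here to mean a 1 not preceded by a digit,
    # period or comma.
    text = re.sub(r"(?<![\d.,])1\s*¢", "1 cent", text)
    return re.sub(r"\s*¢", " cents", text)
```

With these definitions, convert_pounds("It cost £30 then.") returns "It cost 30 Pounds then." and convert_cents("1¢ and 25¢") returns "1 cent and 25 cents".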

Change Directory tab

You can select the directory you want to work in in the Change Directory tab. The top bar shows what directory is the "current directory". In general, to run the Extract and Dehyphen scripts you will need the "textw" and "textwo" directories visible in the Change To selection box. The other text processing routines need the text directory, which will be created by the dehyphen routine, if necessary. The png processing routines will need to see the pngs directory. Click on the directory name to move to that directory or on the " .. " to move up one level. All of the routines expect to run from the same directory. (The parent directory of pngs, textw, textwo and [eventually] text.) If you want to run the script in batch mode, select one or more directories containing the files to be processed in the right hand box. All of the batch functions work exactly like the interactive functions, they just allow you to queue a bunch of projects up and process them all with one command.

Remember, to do interactive processing you need to be IN the project directory, for batch processing you need to be ABOVE the project directory. Remove headers can only be done in interactive mode, so you will need to be IN the project directory to do it.



Process Text tab

Once your options are set up and you are set to the right directory, go to the Process Text tab. In this tab you can run the different routines on the text files. You can run individual routines, mix and match or select Do All Selected to run all the subroutines you select in one batch. Different routines have different prerequisites so you can't necessarily run the routines out of order and get good results.

The Extract Markup routine expects to find the directory "textw" (and optionally "textwo" if you are using the original dehyphenate routine) with rtf format files in them. It will extract the text and markup and put the extracted files in the same directory with a .txt extension.

The Dehyphenate routine expects to find the "textw" and optionally "textwo" directories with .txt files in them. Whether the .txt files are the result of the Extract routine or just .txt format files saved directly from ABBYY is immaterial. It will put the merged files into a directory named "text", creating it if it doesn't already exist. **WARNING: any files with a .txt extension in the "text" directory when Dehyphenate runs WILL BE DELETED. WITHOUT WARNING OR ASKING.**

The Rename, Filter, Correct Common Scannos and Fix Zero Byte routines all expect to find the "text" directory with .txt files in it. Again, the files may be from the Dehyphenate routine or may not.

Rename Png Files expects to find the "pngs" directory with your .png files in it. It will rename all of the .png files in the upload format.

Run Pngcrush expects to find the "pngs" directory with your .png files in it. It will run pngcrush on each file to optimize the compression and reduce the size. The default settings will reduce the palette to the minimum necessary. It does save the original files in a directory " _pngsback_" so you can easily recover them. If interrupted part way through, it will pick up where it left off the next time you start it. As a consequence, if you interrupt it, the pngs directory WILL NOT have all of the files in it. Make sure you have the same number of text and png files before you upload them.
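A quick way to make that last check outside the program is sketched below in Python. It assumes that each text file and its png share the same base name (001.txt and 001.png, say), which is my assumption about the upload format rather than something stated here, and the function names are invented.

```python
from pathlib import Path

def compare_pages(text_names, png_names):
    """Return (pages with no png, pages with no text), by base name."""
    text_pages = {Path(name).stem for name in text_names}
    png_pages = {Path(name).stem for name in png_names}
    return sorted(text_pages - png_pages), sorted(png_pages - text_pages)

def check_project(project_dir, image_dir="pngs"):
    """Compare text/*.txt against <image_dir>/*.png in a project directory."""
    project = Path(project_dir)
    text_names = [p.name for p in (project / "text").glob("*.txt")]
    png_names = [p.name for p in (project / image_dir).glob("*.png")]
    return compare_pages(text_names, png_names)
```

Two empty lists mean every page has both a text file and an image. Anything else names the pages that are missing a partner, which is more useful than a bare count after an interrupted pngcrush run.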

If you are going to run both Filter and Fix Common Scannos, it is highly recommended that you run Filter first, then Fix Common Scannos. Fix Common Scannos will check your files for over 3000 of the most common mis-scanned English words and correct them. It should be used with caution on non-English texts though. It probably won't hurt, but you should check a bunch of pages afterwards to be sure. (It probably won't help either.)

It is recommended that the Fix Zero Byte Files routine be run last, though the order is not really critical.

Convert to ISO-8859-1 NEEDS to be run on files for the original DP site but SHOULD NOT be run on files for DPEU.  This will transliterate any Greek characters and convert any other characters outside of Latin-1 to question marks. Hopefully the original DP site will be converting to Unicode in the near future and make this function unnecessary.
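The question mark part of that conversion is easy to picture with a short Python sketch. It covers only the replacement of characters outside Latin-1. The Greek transliteration is left out because the mapping the script uses is not described here, and the function name is invented.

```python
def to_latin1(text):
    """Replace every character outside ISO-8859-1 with a question mark."""
    # errors="replace" substitutes '?' for anything Latin-1 cannot encode.
    return text.encode("latin-1", errors="replace").decode("latin-1")
```

Accented Latin-1 letters such as é pass through untouched, while an em dash (U+2014) or a Greek alpha each come out as "?".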

In general, the routines should be run in top to bottom order. If you run them by selecting the routines you want to run, then pressing Do All Selected, they will automatically run in the optimal order.

The Start Processing and Interrupt Processing buttons will start and stop a processing job. If you have a batch queued up, it will run the batch. Otherwise, it will run in interactive mode in the current directory.

For batch processing, Start Processing will run Do All Selected on each project. You can select and deselect routines and Process Batch will follow your selections. If you really want to, you can change options and selections while the batch is in progress.... but you probably shouldn't. The small box in the lower left shows the status of the current batch.

? will pop a terse help message.

Clear Status Box will clear the messages from the status box.



Fix Common Scannos:

The scannos word list was pulled from the Distributed Proofreaders CVS site. There are approximately 3400 words in the scannos list (though the improbable letter combination filters make about 330 of them redundant). From the description in the scannos list header:

# Word list derived from Moby project data, cut for top 2000 frequency and word
# of 6 characters or less (to reduce size and assuming that longer words will
# be closely examined by the proofreaders). The resulting list was processed
# through perl scripts which generated scannos by replacement (see below).
# This result was then filtered to eliminate valid words from the generated
# "error" list (left side) to eliminate false positives.
#
# The common scannos from gutcheck and PRTK were then added, as well as some
# additional scannos provided by numerous DP proofreaders.
#
# The resulting list was then tested against just over 1 million words of raw
# OCR output provided by charlz. Further false positives were discovered and
# removed. The actual hit rate for this code is about 1 scanno detected per 30k
# words of input text. The actual accuracy rate against the corpus provided by
# charlz is: 2 false positives out of 122 scannos detected, or 98.3% accurate.
# Seems worthwhile to me. :)


If you come up with a mis-scanned word that you think should be in the scanno list, let me know. Words that are commonly mis-scanned for each other (like bad / had or and / arid) are NOT good additions. Those are better off in Big_Bills' stealth scannos list.
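A whole-word replacement of this kind can be sketched in Python as below. The three entries are made-up examples, not quotations from the real scannos list, and neither the format of scannos.rc nor the way the script treats capitalization is covered here.

```python
import re

# Illustrative entries only: error on the left, correction on the right.
SCANNOS = {"tbe": "the", "wbich": "which", "bave": "have"}

def fix_scannos(text, scannos=SCANNOS):
    """Replace each listed scanno when it appears as a whole word."""
    if not scannos:
        return text
    alternatives = "|".join(re.escape(word) for word in scannos)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: scannos[match.group(1)], text)
```

Here fix_scannos("tbe dog wbich barked, bathe") returns "the dog which barked, bathe". Matching on whole words only is what keeps the left side of the list free of valid words, and it is also why pairs like bad / had cannot be handled this way.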


Header Removal tab

When all of the processing routines have been performed, you can go to the Header Removal tab to delete the top lines of each text file, if desired. To remove headers in interactive mode, you will need to be IN the project directory. If remove headers is run in batch mode, it will automatically remove the top line of EVERY text file (unless the top line is the blank page markup) and then run the Fix Zero Byte Files routine to catch any emptied files.

Often the top line is a book or chapter title that will be removed anyway. This tool will help semi automate removing them. Press Get Headers. This will load a list of the top lines of each file. Select the ones you want to delete (probably easier to select all, then unselect the ones you don't want to delete) then press Remove Selected to write the changes to the affected files. If you like, you can Get Headers again to see if there are any others you would like to remove. Repeat as necessary. If the top line of a file is the blank page markup (from the select options tab) Remove Headers will not delete it, you will have to delete it manually if you want to remove it.
If you accidentally remove headers you didn't intend to, it will probably be easiest to go back to Process Text and run Dehyphenate, Rename, Filter, Fix Common Scannos and Fix Zero Byte Files again to regenerate the files, then rerun Remove Headers. It would not be necessary to rerun Extract, since those files are stored in a different directory. Alternatively, you may want to back up your text files with the Make Backups function on the Process Text tab before you run Remove Headers so you can revert easily if problems arise.
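The core of header removal is small enough to sketch. The Python below is an illustration, not the script's code. The function names are invented, and "[Blank Page]" is only a placeholder for whatever blank page markup you have set on the Select Options tab.

```python
def get_header(text):
    """Return the top line of a page, without its line ending."""
    return text.split("\n", 1)[0].rstrip("\r")

def remove_header(text, blank_page_markup="[Blank Page]"):
    """Drop the top line unless it is the blank page markup."""
    first, _newline, rest = text.partition("\n")
    if first.strip() == blank_page_markup:
        return text  # leave blank page markers alone
    return rest
```

A page whose only line is a header comes back as an empty string, which is why the batch version runs Fix Zero Byte Files afterwards.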

After you Get Headers, you can easily edit the file that the header is part of by double left clicking on the header to open your text editor. Set up your text editor on the Program Prefs tab. If you DO edit some of the files this way, remember to refresh the header list before running Remove Selected. You can also link an image viewer so you can compare image and text side by side (Much like the site! :-) ) Irfanview works really well for this. And it's free! XnView is another great free image viewer that works well. Invoke your image viewer by left then right clicking on a file header. **Will not work in Winprep.exe. Winprep cannot run external programs.**

 If you use Irfanview, for best results, set View->Display options to 'Fit only big images to window'.

If you use XnView, it's a little more complex. Go to Tools->Options->View and check 'Maximize view when open' and set 'Auto image size' to 'Fit image to window, large only.' Go to Tools->Options->Misc and check 'Remember last position/size'

*Caveat* There is a bug in the command line parsing in XnView. If you have a directory with a space in the name, in the path to XnView (like 'Program Files' for instance), it will fail with a 'File not found' error. As long as there are no directories with spaces in the name in the path, it will work fine. Irfanview and other image viewers I have tested don't have this problem.



Search tab

The Search tab has search and replace functions that will search through the text files, display the files containing the search term and allow you to modify them, if desired. This is a strictly interactive tab. It is handy for checking for project specific scanning errors or for synchronization errors during dehyphenization. (Search for '**'.) Maybe after you are done with all your processing, you decide that you shouldn't have done bold extraction after all. Just do a search and replace on <b> and </b>.
There are some options to do case insensitive searching or search for whole words only to narrow down what the search function will find.
When you perform a search, if the search text is found, the whole file it is in will be displayed in the text window with the found text highlighted and the cursor just before it. If the search text is not found in any of the remaining files, a dialog will pop up informing you.



The buttons are pretty self explanatory. The Save Open File button saves the text that is currently displayed in the window to the file, overwriting the original. Search looks for the next occurrence of the search term. If you already have a text file open and press Search, it will proceed with the search starting from the open file. Replace substitutes the Replacement Text for the Search Text in the window, and saves the file. To cancel an in progress search, change the Search Text, that will reset the file index counter to the beginning. Replace & Search (R & S) just combines the Replace and Search buttons into one function call. Replace All will call Replace and Search until all of the files have been searched. It will reset the file index counter to zero before it starts so if you are performing a manual search, get halfway through the files and then press Replace All, it will start over again at the first file.
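The search behavior described above (optional case insensitivity, optional whole-word matching, and a file index that the search resumes from) can be approximated in Python like this. The names and the list-of-pages data structure are assumptions for the example. The script itself works on the files in the text directory.

```python
import re

def build_pattern(term, ignore_case=False, whole_word=False):
    """Compile a literal search term with the two search options."""
    body = re.escape(term)
    if whole_word:
        # No word character may touch either end of the match.
        body = r"(?<!\w)" + body + r"(?!\w)"
    return re.compile(body, re.IGNORECASE if ignore_case else 0)

def find_next(pages, pattern, start=0):
    """pages is a list of (name, text). Return (index, name, match) or None."""
    for index in range(start, len(pages)):
        name, text = pages[index]
        match = pattern.search(text)
        if match:
            return index, name, match
    return None
```

A Replace All in this model is simply pattern.sub(replacement, text) on every page starting from index 0, which mirrors the way the real button resets the file index counter before it starts.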



Program Prefs tab

There is a Program Prefs tab where you can set some preferences which affect how the program looks and runs. You can change the color palette that the script uses, associate a text editor with the script to allow easy checking and editing of files while you are doing header removal, and associate an image viewer to do side-by-side comparisons with text.

The default palette is CornSilk2. I also like PeachPuff2, Bisque2, CadetBlue3 and Ivory3. Some truly painful ones are chartreuse1, IndianRed1, brown1 and DarkOrchid2. Ouch!

You can now specify what the name of the directory containing your png files is on this tab. Default is 'pngs'. Avoid using directory names with spaces in them.

Windows users will probably want to use wordpad, notepad or some equivalent as the text editor, and Irfanview, XnView or an equivalent as the image viewer.

The default locations for notepad and wordpad are:

Win 95, 98, 98SE, ME & XP:
C:\WINDOWS\NOTEPAD.EXE   
C:\Program Files\Accessories\WORDPAD.EXE

Win NT & 2K
C:\WINNT\NOTEPAD.EXE
C:\Program Files\Windows NT\Accessories\WORDPAD.EXE





FTP tab

There is an FTP client included which will help automate uploading the project to the Distributed Proofreaders FTP server.


A simple moderate featured FTP client. Suitable for uploading to DP and minor maintenance.

From left to right in rows....

Host name (Text Entry) - Defaults to pgdp01.archive.org
User name (Text Entry) - Get it from the Project Managers page. Will be saved from session to session if Save User & Password is checked.
Password (Text Entry) - Get it from the Project Managers page. Will be saved from session to session if Save User & Password is checked.
Home Directory (Text Entry) - Set a preferred home directory on the FTP server if desired. Will automatically change to that directory when you connect.

Connect To Host (Push Button) - Initiate FTP connection. Will fail if you have no internet connection. May take a while.
Disconnect (Push Button) - Break FTP connection.
Save Log File (Push Button) - Save a session log to a file.
Clear Log (Push Button) - Clear Session log.
? (Push Button) - A terse help file with a brief explanation of how to use the client.
Save User & Password (Check box) - Option to save User name and Password.

Session Log (Text Readout) - Commands and feedback issued during session.

Connection Status Box (Text Readout) - Connection monitor.

Build Batch (Push Button) - Make a standard batch. Adds all the .txt files in the text directory and all the .png files in the pngs directory.
Add a File  (Push Button) - Mostly to upload a few files instead of a standard batch. Adds a filename to the batch.
Zip Batch Files (Push Button) - Zip all of the batch files into one zip archive. New functionality on the site coming soon.
Clear Local List (Push Button) - Cancel batch before it is sent and clear batch list. Will interrupt batch in progress.
Send Files (Push Button) - Transfer all of the files in the batch list to the FTP host in binary mode.
Stop Transfer (Push Button) - Interrupt a batch transfer in progress. Uploads can be resumed. Downloads must be reinitialized.
Download (Push Button) - Select a file or directory on the remote server and press Download. A dialog will pop up to select a directory to download to. Alternate (file) - double left click.

Make New Directory (Push Button) - Make a directory on the remote host in the current directory using the directory name from the Directory Name text entry.
Directory Name (Text Entry) - Name to use when making a new directory on the remote server.
Chdir Sel (Push Button) - Select a directory on the remote server and press Chdir Sel to change to it. Alternate (directory) - double left click.
Chdir Up (Push Button) - Change directory on the remote server up one level. Alternate - double left click on double dot entry "..".
Rename (Push Button) - Select a file or directory on the remote server then push Rename to rename it.
Delete (Push Button) - Select a file or directory on the remote server then push Delete to delete it.

Remote Directory (Text Readout) - The directory you are currently browsing / working in on the remote host.

Local Listing (Text Readout) - List of files that will be uploaded when Send Files is pressed.
Remote Listing (Text Readout) - A listing of all of the files and directories in the current directory on the remote host.

The directory listing shows files prefixed by 'FILE - ' with the byte size after. Directories are prefixed by 'DIR -  ' (except the double dot entry ".." which is short hand for "parent directory").

To Change Directories on the remote host, double left click on a directory name in the remote directory listing.

To Download a file or directory, left click then right click on it. (Or just double left click a file name.) A dialog will pop up to select a download directory. When downloading multiple single files, don't try to start the next before the previous one is finished. It will cause problems with the script.

When downloading a directory, the script will make a directory with the same name as the remote directory that it is downloading in the local directory that you choose through the download dialog.
The directory download dialog may be somewhat confusing. Double left click on a directory name or drive name to change to it, or on ".." to go up one directory. When the target directory (where you want the new directory to be placed) is in the text box below, press OK to start the download. The script will create the new directory, if necessary, and download the files into it, overwriting without warning any same name files that may already be in it.
At the bottom of the dialog there is a filename filter box. By default, (filter left blank, equivalent to '.' [any character]) all of the files in the selected remote directory will be downloaded. If you only want the text files, put .txt in the filter box. If you only want the PNG files put .png. If you put a 2 in the filter box, you'll get all of the files that have a 2 in the name. (002.txt, 002.png, 012.txt, 012.png, 020.txt, 020.png.... etc.) It uses perl regular expressions to evaluate the pattern so you can build a much more complex matching filter if desired.
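
To make the filter behaviour concrete, here is a small Python sketch of the matching rule described above. It is only an illustration, not the script's code. The function name is made up, and Python's regular expressions stand in for perl's, which behave the same way for simple patterns like these.

```python
import re

def filter_filenames(names, pattern=""):
    """Keep the names that match the filter anywhere in the name."""
    regex = re.compile(pattern or ".")  # a blank filter behaves like '.'
    return [name for name in names if regex.search(name)]

names = ["001.txt", "001.png", "002.txt", "002.png", "012.png"]
print(filter_filenames(names, "2"))         # every name containing a 2
print(filter_filenames(names, r"\.txt$"))   # only names ending in .txt
```

Note that because the filter is a regular expression, a plain .txt means "any character followed by txt", so it would also match a name such as atxt.png. Writing \.txt$ restricts the match to a literal dot and the end of the name.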

To interrupt a batch download or upload in progress, press Stop Transfer. This will stop the batch transfer after the current file is finished. To stop immediately, press Disconnect.
 
To Delete a file or directory, select it (left click) then press Delete.
Directories do not need to be empty to be deleted. A dialog will pop up to ask for confirmation.
MAKE SURE YOU REALLY WANT TO DO THIS. IT CAN NOT BE UNDONE.

You can view files in the local list (if they are text or image files) by double left clicking on them.

You can remove one or several files from the local list by highlighting the file name(s) then double right clicking, or clear the whole list with the Clear Local List push button.

The script does directory caching to drastically speed up walking the tree. It does not save the cache when you close the program.




Troubleshooting:


The script tries to figure out whether it can run the way it expects and tries to warn you if it has a problem.

If it warns that it can't find files or a directory, you probably selected the wrong directory as a working directory, or you may be running in the wrong mode (batch instead of interactive or vice versa), or possibly you have incorrect options selected (running extract when you don't have rtf files, for instance). Remember, for interactive mode, the textw and textwo directories (and possibly text and pngs) should be visible in the change directory box. For batch mode, you need to select the parent directory of the textw and textwo directories.

Warnings about the scannos file are a result of a missing or corrupted scannos.rc file. If you edit the file, be sure to follow the format shown.

One thing that will produce odd results is feeding the script RTF files that contain 16 bit characters (Unicode or UTF-8). It is really designed to work with 8 bit characters using code page Windows 1252 or ISO Latin-1. If you end up with page after page of question marks, you are probably saving your RTF files as 16 bit characters.

If you somehow get the window set larger than your desktop and can't get to an edge to resize it, delete the settings.rc file in the startup directory. That will reset all of the settings to defaults, which will reduce the window to 640x480 pixels. Alternately, you can edit the settings.rc file with a text editor and remove the line that starts: $geometry = .
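
If you would rather not edit settings.rc by hand, the second fix can be sketched in a few lines of Python. This is a hypothetical helper, not part of guiprep. It assumes only what is stated above, namely that the saved window size lives on a line starting with $geometry = .

```python
def drop_geometry_line(settings_text):
    """Return the settings text without any line that starts with '$geometry ='."""
    kept = [
        line
        for line in settings_text.splitlines(keepends=True)
        if not line.lstrip().startswith("$geometry =")
    ]
    return "".join(kept)
```

Read settings.rc into a string, pass it through the function and write the result back (after making a backup copy). All other settings lines are left untouched.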


Known bugs and odd behavior:


When viewing a file through the Headers tab, you may have an unexpected or wrong file open up. Due to the way list boxes are handled under Tk, you need to specifically select (left click) an entry before you can act on it (right click). If you haven't selected an entry, either the last entry in the list or the previous selection is defaulted to. The actual mouse pointer position is ignored on right click.

Customized open and close markup markers are not sanity checked. The script will not check or care if you use inappropriate markers. For instance you can set both italics and bold to use the same markup, or, even worse, use a marker which will occur normally in the text. If you specify "the" and "and" or even "     " for your italics open and close markers, the script will uncomplainingly use them. Probably not a good idea.

If you switch away from the Process text tab while a batch or job is running, it will automatically cancel the job to prevent contamination of other directories. Each tab has its own peculiarities about where it needs to run and if you try to switch to one while another is processing, it could cause problems.

The FTP client blocks while it is waiting for a response from the FTP server. It appears that the program has locked up but it is just waiting for something to happen. If the connection is lost, it will return immediately. On a dial-up connection, for large transfers, (the initial directory listing, for instance) it can take between 30 - 60 seconds to respond. Be patient. It is working.



Changelog History


Version .39 (643k) Added option to extract small caps markup from the rtf during the extraction routine. Markup will be added as <sc> .. </sc> around the text that is marked as small caps in the RTF file. It doesn't do too bad, but there are problems trying to convert RTF markup (which is strictly presentational) into semantic sensitive markup.

Added an entry box to the Process Text tab where you can specify what number to start with when renaming the text and/or png files. By default it is set to 1, but if you want to offset the pages by 127, enter 127 in the box and the files will be renamed starting at 127. If you want to force four digit numbers even for texts that nominally would only need three (say an early volume of a multi-volume work), left pad the start number out to 4 places with zeros, e.g. 0001. Sorry, no negative numbers, no skipping numbers in the sequence after the start offset. If you don't like the offset you have, change it and rename again; filename collisions will be automatically avoided.
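
The numbering rule can be illustrated with a short Python sketch. It is not the script's renaming routine. The function name and the three digit minimum width are assumptions made for the example, and the collision handling mentioned above is left out.

```python
def renamed_files(count, start="1", extension=".txt", min_width=3):
    """List the new names for `count` files, numbered upward from `start`."""
    first = int(start)
    last = first + count - 1
    # A zero-padded start such as "0001" forces at least that many digits.
    width = max(min_width, len(start), len(str(last)))
    return [f"{number:0{width}d}{extension}" for number in range(first, last + 1)]
```

Under these assumptions, a start of 127 gives 127.txt, 128.txt and so on, while a start of 0001 gives 0001.txt, 0002.txt and so on.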

Modified file renaming routine to be able to deal with offset start points. Rewrote it to be more robust about avoiding filename collisions. As a side effect, I sped it up about two to three times as fast as it used to be.

Modified Search tab to be able to deal with file names that don't correspond to their index.

Twiddled with the layout of the options tab slightly. Mostly cosmetic changes.

Got tired of the default palette and changed it. Shouldn't affect most current users, only new users, and you can still change it to whatever you prefer.


Version .38 (642k) Added a whole bunch of tweaks suggested by lorax.

Tweaked "Remove garbage punctuation " regexes a bit. Broke apart the "Strip from front" and "Strip from end" regexes into separate options.

Modified Header Removal functions to not display pages where the only text is the "Blank Page" text string from the options page.

Fixed improper calling of nohyph.dict  loading function. Sigh.

Included a basic English nohyph.dict courtesy of lorax.

Tweaked quote handling a bit to try to intelligently resolve quote spacing a bit better.

Added function that will try to find and change the case of ALL CAPS words at the start of a chapter. It isn't very aggressive to prevent unwanted case changes, but it should help a little.

Fixed bug with Convert £ to "Pounds" option  where it would erroneously split numeric quantities at commas. E.G., £100,000 would become 100 Pounds ,000 rather than 100,000 Pounds. Note, this option is little used and somewhat discouraged, but it is available.
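
For the curious, the comma problem comes down to what the pattern accepts as part of the number. The Python sketch below shows one way to keep comma separated thousands groups together. It is an illustration only. The pattern and function name are made up here and are not the regex used in the script.

```python
import re

# A pound sign followed by digits, with optional comma-separated thousands groups.
POUND_AMOUNT = re.compile(r"£\s?(\d+(?:,\d{3})*)")

def pounds_to_words(text):
    """Move the currency after the number: '£100,000' becomes '100,000 Pounds'."""
    return POUND_AMOUNT.sub(r"\1 Pounds", text)
```

Because a comma is only accepted when exactly three digits follow it, a comma that is ordinary sentence punctuation (as in "£5, then left") stays outside the amount.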

Fiddled around with the "Move punctuation outside of markup" functions to avoid a few undesirable side effects. The most obnoxious of which was the <i</i>> problem.

Fixed a bug in the Extraction routine where if a page contained a table, any text after the table would have its spaces changed to non-breaking spaces. Normally this would be a non-issue since the filter routine changes all non-breaking space back to regular spaces, however, in rare instances they seemed to be slipping through.

Added an option to save two files during dehyphenization; hyphens.txt and dehyphen.txt. The hyphens.txt will contain all of the end-of-line hyphenated words that the script found during the dehyphenate routine where the words remained hyphenated. The dehyphen.txt will contain all of the words where a hyphen was removed. The script has been capable of generating these files for some time as a debugging aid, however it required editing the source to set a debugging flag. Since the addition of the nohyph.dict dictionary file though, these could be more useful to general users so I made the generation optional in the program. The files will be placed in the base directory of the project, (the directory that contains the textw, textwo, text and pngs directories.) They will be overwritten each time the dehyphenate routine is run.

Messed around with the layout of the options page a bit. The layout manager I was using was very automatic, but I didn't like the staggered columns of checkboxes.


Version .37 (638k) Fixed problem where guiprep would occasionally lock up while running Filter Files with the "Move punctuation outside of markup" selected.

Added an option for the  "Remove garbage punctuation at ends of line" to the options page. Made filter regex much more aggressive.

Tweaked a few other filters a bit.


Version .36 (638k) It's a veritable bug fest.

Fixed problem with semicolons being turned into question marks.  Stupidity errer :-( 

Think I finally fixed the problem with disappearing punctuation after hyphenated words. (Actually lorax spotted the error.)

Fixed some other mistakes I made while trying to implement dehyphenate code modifications submitted by lorax. The problems should not have caused any errors in the processed texts, though they limited the effectiveness of the dehyphenate routine a bit.

Added a new filter to the filter routine to try to clean up junk at the end of lines. Often, OCR will erroneously put a bunch of junk punctuation at the end of lines (typically where the page runs off into the gutter). This will try to detect and clean up the worst of it.

Was not able to replicate problem with emdash being rendered as â", so that hasn't been fixed yet if it is truly a problem.

Remembered to update version number this time.



Version .35 (638k) Phooey. Yet more bugs. (Well, bug fixes, one would hope.)

Fixed  bug where Filter function would lock up on certain files. Root cause was a regex to move punctuation outside of markup that had adverse reactions to characters outside of Latin-1.

Fixed a few warnings about printing wide (multi-byte UTF-8) characters.


Version .34 (637k) A few tweaks and bug fixes.

Added option to use an external file of words that are not hyphenated. If there is a file named nohyph.dict in the guiprep directory, it will be loaded and used to help determine which words should be dehyphenated during the dehyphenization routine. (Similar to Nicola's DPEU version.)

Fixed problem with the Convert to ISO-8859-1 routine that was causing some bizarre u <-> y substitutions.

Revised dehyphen routine to be a little more aggressive. Changed to aggressively lower false negatives without significantly raising false positives. Based on code sample by lorax.

Twiddled around with FTP routines a bit. Nothing substantial, most visible change is the "activity indicator". Used to just append vertical bars to the log, now just has a "spinning" line.


Version .33 (636k) Updated program to deal with Unicode files gracefully. Now works natively in UTF-8. Files for the original DP site NEED to be in ISO 8859-1 (Latin-1). There is an extra button on the Process Text tab, "Convert to ISO 8859-1". PLEASE down convert files for the original DP site. (At least until the UTF-8 mods get activated.) No such restrictions for DPEU. UTF-8 files are PREFERRED at DPEU. Note the Convert to ISO 8859-1 function will do transliteration of any Greek it finds. (It uses the guiguts beta code to denote accented characters.) Other characters outside of Latin-1 will be converted to question marks at this time. If I get some transliteration tables, I could make auto transliteration for other character sets too. I don't really want to spend lots of time on it though because hopefully, in the near future, DP will convert to UTF-8. A very large Thank You to Nikola Smolenski, one of the lead developers for the DPEU site, who worked out the bulk of the UTF-8 character extraction code.

Fixed problem with pngcrush under Win2000 and WinXP. It was easy enough, once I figured out what was causing the problem. The fix consisted mostly of downloading a version of pngcrush that works correctly under 32 bit Windows. Argh. Note for Win 95, 98 and ME users: the 32 bit version will not work correctly under DOS. The old version is still included as pngcrush16.exe. Rename pngcrush.exe to pngcrush32.exe and pngcrush16.exe to pngcrush.exe.

A few other small (and mostly invisible) tweaks.

Version .32 (550k) Fixed bug where if an italicized word was at the start of a line after a line that ended with a hyphen, the word would be removed during dehyphenization.

Modified guiprep to fix markup that closes at the end of a line to not leave the ending markup at the beginning of the next line.

Modified guiprep to use the spawn.pl spawning script for external programs instead of runner.pl for the same reasons I changed it in guiguts. More compact, and better Linux compatibility.

Added check for common italicized scholarly abbreviations to move markup outside of punctuation. (e.g., ibid., loc., cit., Ib., cf., op., et seq., viz., etc.)

Cut out 100k of extraneous images from the manual.


Version .31 (659k) Major update of the code to work with the Tk:804 series. Rewrote and updated user interface to work with the new unicode aware Tk. The basic operation is as near to identical to previous versions as I could make it. It uses the same layout, though button and font sizes are subtly different.

I have split apart the libraries from the executable version and am including the windows exe along with the perl script. The executable version uses the same prl03 perl runtime libraries as guiguts. If you already have prl03 (prl03.zip) for guiguts installed, there is no need to download it again.

Added unicode handling code to all of the functions. There was very basic unicode handling in the extract routines before, but all it would do was substitute question marks for any  unicode character outside the Latin-1 character space. Will now deal with unicode in all routines. **NOTE** The PGDP site is still not able to work with multi byte characters. If you have a unicode encoded text, you are better off putting it through DPEU.

Puttered around with FTP functions to try to get more accurate tracking of transfer rates and estimated times.

Worked on making things that SHOULD be impossible to do, harder to do accidentally.  :-\

Lots of little tweaks and tuning that are not worth mentioning individually but which added up to a substantial amount of time.

Played around with optionally marking up texts with questionable word markup as determined by ABBYY during OCR but, after messing with it a bit, have serious reservations about its usefulness, and have removed it again.


Version .30 (590k) Modified FTP reporting code, now reports on instantaneous and average speed of file transfers. Reports real throughput after overhead. Selectable readout in Kilobytes per second (KBps) or Kilobits per second (Kbps). Makes an estimate of seconds remaining to transfer the current file. Not going to be very accurate for small files.
Fixed problem where script would dump you in the wrong directory if processing was interrupted during  the scannos routine.
Made rename functions report file counts. Useful to check that you have the same number of text and image files.
When building a batch for FTP upload, the build routine will now check for and warn about zero byte files.
Changed Change Directory tab to use double click instead of single click to navigate. (Made it the same as the navigate function in the FTP window.)
When making a new directory on the FTP server, the script automatically issues a CHMOD 0777 command to set the permissions on the new directory.
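
The transfer readout added in version .30 (average speed in KBps or Kbps, plus seconds remaining) boils down to a little arithmetic, sketched here in Python. This is not the script's reporting code. The function name is made up, and the unit conventions (1024 bytes per kilobyte, 1000 bits per kilobit) are assumptions.

```python
def transfer_stats(bytes_sent, total_bytes, elapsed_seconds):
    """Return average speed in KBps and Kbps plus estimated seconds remaining."""
    if elapsed_seconds <= 0 or bytes_sent <= 0:
        return None  # not enough data for an estimate yet
    bytes_per_second = bytes_sent / elapsed_seconds
    return {
        "KBps": bytes_per_second / 1024,       # kilobytes per second (assumed 1024)
        "Kbps": bytes_per_second * 8 / 1000,   # kilobits per second (assumed 1000)
        "seconds_left": (total_bytes - bytes_sent) / bytes_per_second,
    }
```

The estimate divides the bytes still to send by the average rate so far, which is why it is unreliable for small files: the transfer is over before the average settles down.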


Version .29 (590k) Fixed "Change initial X not followed by e to N" to also ignore X followed by hyphen.
Tweaked a few more things on FTP tab. Added a "percentage done" on upload or download to status box.
Found and fixed bug where search window would add a blank line to the bottom of each file every time it was opened.
Ripped out the original two set dehyphenization function and wrote a new one based on the single set dehyphenization function. Actually both dehyphenization functions use the same code to perform the dehyphenization, they just use different dictionary building code. The new two set function has all of the robustness and flexibility of the single set, with accuracy as good as (potentially even better than, in fact) the original two set.
Found and fixed bug in dehyphenization where it was getting confused by italic markup (and likely bold too, though I didn't confirm that.)
Rewrote large portions of the logging and error reporting code to be much more compact and less error prone. Reduced script size by 10 percent in the process.
Added capability to use German style "=" instead of "-" as the hyphen symbol  for dehyphenization.
Removed some of the more problematic scannos from the scanno dictionary. "cf" => "of", "au"=>"an" and "dont"=>"don't".
Did a fair amount of updating to the manual.

Version .28 (601k) Fixed a few spelling errors in the user interface.
Made "Change initial X not followed by e to N" option not change Roman numerals. (Basically it will ignore an initial X followed by eEIVXDCML or space.)
Made "rnp" to "mp" fix ignore turnpike as a special case.
Tinkered around with the dehyphenate routine to try to figure out what could be causing the intermittent moving of whole lines instead of just word halves. Was not really able to find a specific fix. Was not able to make it fail on any of the texts I have. Still waiting on some sample files that show the symptom from someone, so I can try to track it down. Was not able to make it happen, even by downloading some images from the FTP server that have text files exhibiting the symptom and OCRing them myself. Oh well, if I can't duplicate it, I can't rectify it. I made a few changes that may help, but, as it worked for me both before and after the changes, it is difficult to tell whether they will be of any use.
Puttered around with the FTP client a bit. Added a preferred "Home" directory option as suggested by sjg1978. (Actually, adapted a working patch he submitted) Will automatically switch to this directory on the FTP server when you log on. Made the client a little more general purpose. Now able to save and recall different host names. User names, passwords and Home directories will be saved with the different host names (if that option is selected.) Status box has been moved down to just below the log window (to make room for the home directory box up on the top row) Status box now gives a lot more useful information during transfers. Actually keeps track of progress instead of just saying uploading/downloading.
Added ability to customize superscript markup. It still defaults to ^{xx} but can be changed to whatever you want. It is not sanity checked, so if you put markup like "<<<<KYpR%J>" "$$$$+=*", it will cheerfully use it without a second glance.
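
The Roman numeral exception for the "Change initial X not followed by e to N" option (version .28, extended to hyphens in .29) can be sketched as a single pattern. This Python version is an illustration and not the regex used in the script. Restricting the fix to an X that is followed by a letter is an assumption made here, and it covers the space and hyphen cases automatically.

```python
import re

# Word-initial capital X followed by a letter other than e, E or a Roman-numeral letter.
# A following space or hyphen is skipped automatically because neither is a letter.
INITIAL_X = re.compile(r"\bX(?=[A-Za-z])(?![eEIVXDCML])")

def fix_initial_x(text):
    """Change a misread initial X to N, leaving Roman numerals and Xe... words alone."""
    return INITIAL_X.sub("N", text)
```

So "Xot" becomes "Not", while "XIV", "Xerxes" and "X-ray" are left as they are.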

Version .27 (612k)
Added code to handle mouse wheel events in WinXP (and apparently some installations of Win 2K, though it always worked for me on my Win2K system).
Fixed problem where zip file name was being incorrectly added to the FTP batch.
Removed limitation on uploading into root directory.
Changed order of operations for changing / to ,' and change '' to " to catch some occurrences that were slipping through.
Modified "cb" fixing code to be a little less greedy. Will no longer "fix" Macbeth to Macheth
Made "Convert solitary 1 to I"  ignore a 1 followed by a full stop.
Added convert initial VV to W option.
Added convert initial !! to H option.
Added convert initial X not followed by e to N option.
Added convert ! in a word to l option.
Changed empty file handling code and average file size calculation to be more efficient based on suggestions by Elronse. (Thanks!)
Changed page switching code on search tab to automatically save the page file if you have made edits.
Changed Search page text window to have some undo capability. WILL ONLY UNDO CHANGES DONE TO A SINGLE PAGE. once you switch pages, the changes are written and the undo buffer is cleared.
Debated quite a bit about how best to implement the spaced double quotes repair option that papeters requested. Decided to make it universal rather than hard coding it for double quotes. Added two more "Alternate" replacement text fields with some more Replace and Replace & Search buttons beside the corresponding field. Now you can have up to three alternate replacement terms. The "Replace All" function uses the first alternate. Tried to make the button layout easy and quick to use with a mouse.
Changed the FTP tab password entry to be a little more secure. Will now keep your 5 year old nephew from figuring it out.  :roll:  Displays **** instead of the actual password.
Lots and lots of minor tune ups and enhancements to make it more user friendly. Too many to list (or remember).


In Version .26 ( K)
Added option to not extract sub/superscript from RTF files.
Fixed fcanno (Olde Englifh) routine to skip words that have a capitalized F at the beginning. For instance, Fire will not be changed to *ire, since the capital F is unambiguous.
Back ported some of the external program calling routines I developed for guiguts. Now all the external program calls will work in both guiprep and winprep.
Added "See Image" Button to search page. Allows you to easily compare text and image for the project pages.


In version .25 (601 k) Added function very similar to Jon Ingram's de-fcanno script, which he published in the developers forum. Ported from python to perl and integrated into the text processing page. Added a new button on text processing page "Fix Olde Englifh". This will comb through the text and replace any words spelled with long esses (f) with the modern English equivalent. (They are not really misspelled. The long s really is an s, it is just very, very close to looking like an f.) The script will preserve the case of the original word when it replaces it.
I based the de-fcanno function off of my scannos function, but as the fcannos dictionary was about 35 times the size of the dictionary used by the scannos function (and that wasn't any speed demon), running the fcannos function was nearly grinding my computer to a halt. I couldn't leave it like that, so I went back and optimized both functions a bit and sped them up by close to 2 orders of magnitude. (Found some really, really inefficient code in there....) Anyway, they are both pretty spritely now. After some experimentation, I decided not to use the Moby SINGLE.TXT word list to generate my dictionary. It was TOO complete. There were way too many extremely uncommon words that were getting pushed as replacements, generating way too many false positives. After some hunting around I settled on generating it from the 2of4brif.txt word list from the 12dicts-4.0.zip package available at Kevin's Word List Page. This was somewhat arbitrary, but it generated a much more reasonably sized list (23000 words instead of 132000) and seems to generate a lot fewer false positives in practice. It is heavily slanted toward British spellings as well, which fits in rather well with the period of most of the texts we are seeing. I've included the dictionary generation script in the distribution if you want to try others. It is named fwordgen.pl and requires perl to run. The name of the word list is hard coded. If you want to try different ones, you'll need to change the line -- open (WLIST, "<2of4brif.txt"); -- to have the name of your file instead of 2of4brif.txt. That will generate fcannos.bin, a serialized hash of words in the format needed by the script.
If you are planning to run both the scannos fix up and the Olde Englifh fixup routines, you should definitely run the scannos routine first. Do not run the scannos routine after the Olde Englifh routine; it will find lots of false positives.
Fixed a few other minor user interface bugs.
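
To give a rough idea of how a long s dictionary like the one described for version .25 can be generated from a word list and applied with case preserved, here is a Python sketch. It is a guess at the general technique, not a port of fwordgen.pl or of the script's routine, and all of the names in it are made up. It also ignores the later refinement that skips words starting with a capital F.

```python
import re
from itertools import product

def build_fcanno_table(words):
    """Map each f-for-s misreading of a word back to the word itself."""
    known = set(words)
    table = {}
    for word in words:
        # Positions of every 's' that the OCR could have read as 'f'.
        spots = [i for i, ch in enumerate(word) if ch == "s"]
        for choice in product("sf", repeat=len(spots)):
            letters = list(word)
            for i, ch in zip(spots, choice):
                letters[i] = ch
            variant = "".join(letters)
            # Skip variants that are real words themselves (same -> fame).
            if variant != word and variant not in known:
                table[variant] = word
    return table

def match_case(original, replacement):
    """Give the replacement the same capitalization as the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement

def fix_long_s(text, table):
    """Replace every word found in the table, preserving its case."""
    def swap(match):
        word = match.group(0)
        fixed = table.get(word.lower())
        return match_case(word, fixed) if fixed else word
    return re.sub(r"[A-Za-z]+", swap, text)
```

The check against the known word list is what keeps a larger dictionary from producing more false positives than it fixes: "fame" is never rewritten to "same" because "fame" is itself a word.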


In version .24
(383k) More user requests. Improved how script deals with tabular data. Optionally insert bar "|" surrounding each "cell" in a table and try to retain original table spacing as much as possible. Added automated markup for super and sub script text. Right now these are hard coded to be TEXish markup: caret-braces "^{X}" for superscript and underscore-braces "_{X}" for subscript. These may be made editable markup in a future version, similar to the bold and italics markup so different projects can use different styles.
Found and fixed bug with underscore handling in the filter routine that made it impossible to use an underscore for italics markup (the nominal Gutenberg standard).
Added new filter options  "Convert double commas to a double quote", "Remove space after doublequote if it is the first character on a line" and "Remove space before doublequote if it is the last character on a line". (Thanks for the suggestions, Curtis.)

In version .23
(376k)  Sigh... fixed bug on search page where an edited page wouldn't save unless you were in the midst of a search.
Poked around in the source of gutcheck and stole a few more checks for unlikely letter combinations - added to options page. (Thanks Jim!)
Fixed last thing keeping script from running under Linux, thanks to jneves for bug reports and feedback. Still not 100% functionality; external programs (text editor, image viewer, pngcrush) still are not functioning, but that's fairly minor. All of the internal routines should work now. There is essentially a built in text editor on the search page anyway, and you can run pngcrush as a separate program if desired.

In version .22
(374k) Added some more functionality to search tab. Now allows you to cycle through the text files or jump to a particular file without actually doing a search. Changed logic to automatically load the first file from the text directory when search tab is activated. Now caching the list of filenames between calls to the different search functions to generally speed up operation, especially for large numbers of files. Altered changed file save semantics slightly to better fit with the new functionality.
Added Zip function to batch upload in FTP client in anticipation of the option being available soon on the site. Automatically adds all the files in the upload batch to a zip file named the same as your working directory. Should make uploads a little faster since it does not constantly have to negotiate transfers with the FTP server for each file. Added option to build zip file during batch mode. Paves the way to make the FTP upload batchable along with the pre-processing.
Moved both new batch options to options page where they should have been originally.
Changed a few more things which were blocking Linux compatibility.
Trapped error which would sometimes result in the saved settings file being corrupted and losing your personalized settings.
Trapped bizarre behavior if italics or bold markup is extracted with a blank markup string.
Updated Manual.

In version .21
(350k)  Added a bunch of user requested items.
Tuned a few things in the newer dehyphenization routine. Deals better with spaced hyphens at end of line now.
You can now choose the directory name where your png files are stored. It is no longer hard coded to be "pngs". Change it on the Program Prefs tab.
Header Removal can now optionally be automated for batch processing. It will automatically remove the top line from every text file. THIS MAY POSSIBLY REMOVE LINES THAT SHOULDN'T BE REMOVED. USE WITH CARE. It is highly recommended that header removal be done in interactive mode if feasible.
The header removal function has been made a little smarter. It will no longer remove lines that contain the zero byte file text marker - [Blank page], by default.
If header removal is run in batch mode, it will automatically run the Fix Zero Byte Files routine after it finishes. In this case, it is not necessary to select it on the Process Text tab since that will only make it run twice.
There is a new tab with basic search & replace functions that you can run against the text files. Will automatically search through all of the text files. Useful for project specific spell checks that you'd like to run. Select Case Insensitive search or Whole Word search or combinations thereof to further narrow down the search target.
Disabled the "standard project directory name" check in the "make remote directory" function of the FTP client. Has become moot with recent changes to the site code.
Fixed a few inconsistencies in the FTP download logic.
Combed through code trying to reduce Linux incompatibilities. As far as I can tell without actually trying to run it, there are only three places where the code is Linux incompatible: the three external program hook subroutines - testart(), ivstart() & pngcrushstart() [text editor start, image viewer start and pngcrush start]. Need to get access to a Linux system to get them working. There may be others, but they are the ones I know about.
Went through most of the program, cleaned up code, improved commenting and indenting. Generally tried to make the program more maintainable. Updated manual.

In version .20
(353k) Major update. Added new dehyphenate routine. The original dehyphenate routine is still there and is far more comprehensive than the new one, but the new one has a huge advantage in that it only needs one set of text files and is not dependent on ABBYY FineReader's dehyphenization feature. The new routine builds a dictionary of all of the words in the text files that do not have a hyphen in them, then uses that dictionary to decide whether to remove the hyphen from a split word or not. It will rejoin hyphenated words whether it removes the hyphen or not. It will make a few educated guesses when it sees some very common prefixes or suffixes. The new routine looks for a set of text or RTF files in a "textw" directory. If there is also a "textwo" directory, the script will automatically use the original dehyphenate routine. Changed the original dehyphenate routine to automatically fall back to the text with line breaks if a threshold of synchronization errors (currently 3) is reached in any one file.
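The dictionary approach of the new dehyphenate routine can be sketched roughly as follows. This is a simplified illustration in Python, not the script's actual Perl code: the function names, the punctuation handling and the regular expression are assumptions, and the educated guesses for common prefixes and suffixes are left out.

```python
import re

def build_dictionary(pages):
    """Collect every word that contains no hyphen, lowercased."""
    words = set()
    for page in pages:
        for token in page.split():
            token = token.strip(".,;:!?\"'()[]").lower()
            if token and "-" not in token:
                words.add(token)
    return words

# A word split across a line end: first half, hyphen, newline, second half.
# Trailing punctuation stays with the word; the rest of the line moves down.
SPLIT = re.compile(r"(\w+)-\n(\w+)(\S*)[ \t]*\n?")

def rejoin(page, dictionary):
    """Rejoin split words, dropping the hyphen only for known words."""
    def fix(match):
        first, second, trailing = match.groups()
        joined = first + second
        if joined.lower() in dictionary:
            return joined + trailing + "\n"           # known word: drop hyphen
        return first + "-" + second + trailing + "\n"  # unknown: keep hyphen
    return SPLIT.sub(fix, page)

pages = ["The something was here.\nIt was some-\nthing else.\n",
         "A well-\nknown fact.\n"]
dictionary = build_dictionary(pages)
for page in pages:
    print(rejoin(page, dictionary))
```

In this sketch "some-" / "thing" becomes "something" because that word appears unhyphenated elsewhere in the text, while "well-" / "known" is rejoined as "well-known" because "wellknown" never appears.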
Added much better reporting of what is going on during filtering of "improbable letter combinations" and scanno replacement. Changed order that routines run in to make reporting more useful. (Moved rename text files to before any of the routines that do progress reporting so I could include a file name.) Changed button order to match. Added a button and logic to save a copy of the processing log to a file from the process text tab. Added buttons and logic to the process text tab to save and revert to backups of the text files.
Moved conversion of Windows codepage 1252 glyphs 80-9F (decimal 128-159) from the extract routine to the filter routine where it really belonged. Added option for it on Select Options tab.
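For readers unfamiliar with that byte range, here is a minimal Python sketch of such a conversion. The substitution table is an assumption chosen for illustration (plain-ASCII stand-ins for the most common typographic glyphs); the replacements the script actually uses are not listed here.

```python
# Hypothetical substitution table for a few CP1252 bytes in the 0x80-0x9F range.
CP1252_SUBSTITUTES = {
    0x85: "...",  # horizontal ellipsis
    0x91: "'",    # left single quotation mark
    0x92: "'",    # right single quotation mark
    0x93: '"',    # left double quotation mark
    0x94: '"',    # right double quotation mark
    0x96: "-",    # en dash
    0x97: "--",   # em dash
}

def convert_cp1252_glyphs(raw):
    """Replace CP1252 bytes 0x80-0x9F; decode all other bytes as Latin-1."""
    out = []
    for byte in raw:
        if 0x80 <= byte <= 0x9F:
            # Bytes without a table entry are decoded as CP1252; the few
            # undefined code points become the Unicode replacement character.
            fallback = bytes([byte]).decode("cp1252", errors="replace")
            out.append(CP1252_SUBSTITUTES.get(byte, fallback))
        else:
            out.append(bytes([byte]).decode("latin-1"))
    return "".join(out)

print(convert_cp1252_glyphs(b"\x93Hi\x94 \x97 ok\x85"))
```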
Made Remove Headers routine more tolerant of filenames with spaces in them.
When downloading a directory in the FTP client, it will now automatically make a directory in the selected local directory with the same name as the selected remote directory and download the files into that directory.
Added a file name filter to the FTP directory download dialog box. Default (blank) is 'download all files in directory'. If you want to download only the text files in a directory, put .txt in the filter box. For all of the PNG files put .png, etc. You can build more complex pattern matching filters too, if you like. It uses Perl regular expressions to evaluate the pattern, so don't use DOS wildcard expressions (*.*, *.txt, etc). Added some more word pairs to the scannos list.
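To show why a regular expression filter behaves differently from a DOS wildcard, here is a small sketch. It uses Python's re module, which treats these simple patterns the same way Perl does; the function name filter_names and the sample file names are made up for the example, and case-sensitive matching is an assumption.

```python
import re

def filter_names(names, pattern=""):
    """Keep names matching the pattern; a blank pattern keeps everything."""
    if not pattern:
        return list(names)
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

names = ["001.txt", "001.png", "notes_txt_old", "readme"]

print(filter_names(names))             # blank filter: all four names
print(filter_names(names, ".txt"))     # "." matches any character
print(filter_names(names, r"\.txt$"))  # stricter: literal dot, end of name

try:
    filter_names(names, "*.txt")       # DOS wildcard is not a valid regex
except re.error as err:
    print("bad pattern:", err)
```

Note that the plain .txt filter also picks up "notes_txt_old", because an unescaped dot matches any character. Anchoring the pattern as in the third call avoids that.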

In version .19: (354k) Fixed up a bunch of minor non-fatal errors (warnings). Changed default watchdog timer to allow longer subroutines to run without raising a fatal timeout exception. Was giving problems with some users. (Well, one specific user, but I'm sure it would crop up again sooner or later.) Made a few of the routines a little more robust/error resistant. The dehyphenate routine now marks the word in question with "**" when it gets a synchronization error. Added a few more word pairs to the common scannos list. Removed the check for double backslashes, no longer necessary after site update.

In version .18: (357k) Fixed pngcrush feedback mechanism to work consistently across Windows platforms. Changed it to work predictably no matter what your pngcrush option settings. Added capability to edit pngcrush command line options to the Program Prefs tab and changed default pngcrush settings to something a little more generic.
Tweaked a few of the markup filters to catch boundary conditions better. Fixed FTP client to understand directory names with spaces in them. Changed FTP directory download dialog box to a custom built one, a little easier to work with, I think. Added directory download list display. Changed default FTP host to pgdp01.archive.org. Changed client to allow editing host name. Tuned a bunch of the FTP functions to work more intuitively. Just does the right thing. Double clicking on a directory name on the remote server will change to that directory. Double clicking on a file name will download that file. Double clicking on a local file name will open a viewer for the file. Made all of the FTP routines less fragile.
Wrote modified FTP::put and FTP::get routines that won't block the calling Tk window, to replace the ones in the standard FTP module, which block Tk very badly. Updates at least once for every 10KB of upload or download. (You'll get a tick mark in the log box for every 10K of data transferred.)
Changed how external programs are invoked on the header removal page to be more consistent with other pages.
Fixed missing last drive problem under NT / 2K.
Changed some code in the script which caused problems under WinXP and perl 5.6.
Lots of code cleanup, added and formatted comments, removed some unused routines, made indenting style more uniform. Updated manual.

In version .17: (377k) Better resynchronization after error during Dehyphenization and better trapping of errors. Finally dehyphenization is as stable as I would like. In the worst case, it will use the text with line breaks as its fallback if there are too many errors. Provides more information on exactly what the problem is when a Dehyphenization error occurs. More efficient markup pattern matching in Filtering routine. Combined about 14 pattern matching searches down to 4. Reworked Pngcrush calling routine to be compatible with NT based Windows platforms. Provides more feedback during the pngcrush routine. Improved the FTP client drastically. Added buttons for Change directory, Download, Rename and Delete as alternatives to the arcane mouse button - key press combinations. Added Rename function. Works with both files and directories. Improved Download function to allow automatic batch downloading of all the files in a directory. Disabled floppy drive search on startup. Gets rid of the annoying "No Disk" acknowledgement in XP. Not really realistic that a project would be on a floppy anyway. Fixed problem with small caps text not being upper cased on some occasions. Updated Manual. Added history section. Miscellaneous bug fixes.

In version 16: (374k) Reworked Process Text tab layout. Combined Process Batch and Do All Selected buttons into one Start Processing button. Just does the right thing depending on mode. Added routine to run pngcrush on your png image files. Pngcrush is a png size optimizer. Most image generating programs are not particularly efficient about making the smallest possible lossless png file. Since the images are uploaded and downloaded 4 - 6 times during a project, it makes sense to make them as small as possible. Added pop up help buttons on most pages. Added download and remote delete functionality to FTP client. Updated Manual. Miscellaneous bug fixes.

In version 15: (319k) Added basic FTP client to help automatically upload preprocessed projects to site. Added hook to link in external Image viewer. Added routine to automatically rename png files in pngs directory under project. Changed help box to a button activated pop up window on Change Directory page to make more room for directory and batch listing boxes. Started putting version number in program title bar to make it easier to track. Updated Manual. Miscellaneous bug fixes.

In version 14: (202k) Improved the hooks for the external programs to run them non-blocking. (Able to run more than one at once without locking up guiprep.) No longer any reasonable expectation of Linux compatibility. Added some more filtering options. Fixed some race conditions. Script now remembers the window size and location from session to session. Added much better reporting on processing progress. Renamed guiprepe to winprep. Updated Manual. Other miscellaneous bug fixes.

In version 13: (202k) Added hook to link in external text editor so you can view files easily during Header Removal. Added more filtering options. Improved batch processing. Added Program Preferences tab to allow you to choose some settings that don't directly affect the text processing. Script will remember preference settings. Script now remembers the last directory you were working in and reopens to there. Modified Interrupt Processing to interrupt whether in batch OR interactive mode. Script will interrupt processing if you switch away from the processing window. Reworked layout to be usable down to VGA resolution. Debut of guiprepe (guiprep executable), a compiled Windows version of guiprep. Updated Manual. Miscellaneous bug fixes.

In version 12: (194k) Jon Ingram edition. Now does batching. Queue up several projects in a batch and run processing on them sequentially. Updated Manual.

In version 11: (193k) Added Check For Common Scannos routine & list. Checks for 3400 or so common scannos. Added lots of new filtering options for improbable letter combinations and others. Made Text Processing routines batchable with check boxes to select which ones to run. Updated Manual. Lots of bug fixes.

In version 10: (123k) First gui version. Made a gui interface to the prep.pl script to allow runtime option selection without huge command line lists. Renamed to guiprep.pl to reflect interface change. Linked hrtk.pl header removal tool into the script as a separate tab. Updated Manual. Created lots and lots of bugs.

In version 9: (0k) There was no version nine.

In version 8: (94k) Last command line version of prep.pl. Added basic header removal command line scripts and gui tool that implements them (hrtk.pl).