guiprep.pl
Preprocessing toolkit
- Current version .40
guiprep.pl
(644k) Includes both a perl script version and a compiled Windows
executable. **NOTE** Both require that you have either perl 5.8.0 or
later installed or the perl runtime libraries (prl03.zip)
installed. If you already have prl03 installed for guiguts, there is no
need to download it again.
prl03.zip
- (5773k) perl runtime libraries - contains a full complement of perl
libraries with Tk804.026
installed to allow either perl scripts or the Gui* executables to run
on your system. Installation instructions and information on compiling
your own package are on this page.
Written by Steve Schulze (thundergnat).
Also see my post processing toolkit - Guiguts.pl
Questions or comments? Leave a message in the Distributed Proofreaders
forums or private message me as "thundergnat".
Portions of this script are derived from
RTF::Tokenizer by Peter Sergeant.
For more information on the RTF file format, SEE: The_RTF_Cookbook
by Sean M. Burke
The included pngcrush.exe is a windows/dos compiled version of
pngcrush, a png file compression tool. It will losslessly reduce the
size of png files. Most image creation programs do not optimally
compress png files. Get the latest version of pngcrush.exe at sourceforge
(make sure you get the executable unless you are planning to compile it
yourself) or go to the pngcrush
home page for more information. The version included with the
script is the lowest common denominator version. If you have an MMX
capable processor, a faster, MMX enabled version is available.
Uncompress it and place it in the pngcrush directory in the guiprep
folder. Make sure the included readme text file is named "README.txt"
so the help button can find it. Some distributions I have seen have the
help file named just "README".
This software has no guarantees as to its fitness to do this or any
other task. Any damages to your computer, data, your mental health or
anything else as a result of using this software are your problem and
not mine. If not satisfied, your purchase price will be cheerfully
refunded.
This program may be freely distributed, used, and modified. Reverse
engineering is condoned and encouraged. If you come up with some really
cool addition (or even just an idea) let me know, and it may be
included in future releases. If you do reuse some of my code, I would
appreciate you mentioning it in the comments of your script and dropping me a line
to let me know...
This script requires a perl interpreter to run. The
ActiveState perl interpreter is probably the most popular for
Windows users (95, 98, 98se, ME, NT, 2K, XP). They also have
versions available for Linux and Solaris. It's very functional and
free. (They do ask that you register, but you can bypass the
registration page without entering anything.) I personally would
recommend the 5.8.0 distribution. Windows users should use the
Microsoft Installer (MSI) version; it is very simple and automatic to
set up. If you don't have Microsoft Installer, a link is included on
the Activeperl download page.
What is it for?
What's new?
Setting up the text files:
Using the script:
Troubleshooting
Known Bugs
Changelog History
What is it for?
Given a set of rtf files output from an Optical Character Recognition
(OCR) program, this tool will extract text and italic and bold markup
from the .rtf files and save it as text, rejoin end-of-line hyphenated
words, filter out bad and undesirable characters, check for common
scannos* and check for zero byte files to help automate preparation of
files for Distributed Proofreaders.
If your .png files are in a directory named PNGS, it can rename the
.png files into the upload format. It can also, if desired, run a png
size optimizer on the files. You can queue up several projects and
process them in a batch. It provides a mechanism to semi automate
header removal and provides hooks to link in your favorite text editor
and image viewer to help check files. There is also a mini FTP client
built in that automates uploading a project to the site.
*[A scanno is like a typo... only from a scanner instead of a typist.]
What's new?
Version .40 (644k)
Argh. When I added the option to extract the small caps markup from the
RTF files, I broke the handler for small caps if you WEREN'T extracting
the markup. Fixed now.
Modified how the Processing functions displayed progress. They used to
just print a dot to the screen for each page (file) that was completed.
That worked fine as long as there weren't any problems. If there WAS a
problem, it was extremely tedious to try to count the dots to figure
out which file was causing it. Changed it to print an incremented
counter mod 10. It will print the digits from 123456789012345... and so
on. That should make it much easier to figure out which file causes a
problem when one occurs.
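As a rough illustration of the new progress display (a Python sketch, not the script's actual Perl code; the function name is made up for the example), each completed file contributes the last digit of its running count:

```python
def progress_digits(file_count: int) -> str:
    """Return the progress string shown after `file_count` files are done."""
    # One digit per completed file: its 1-based index modulo 10.
    return "".join(str(i % 10) for i in range(1, file_count + 1))


if __name__ == "__main__":
    print(progress_digits(15))
```

If processing stops after the string "1234567890123", the 14th file is the one to look at, with no dot counting needed.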
Fixed an obscure problem with code page handling during RTF extraction.
Set it to have a reasonable default if it couldn't determine the
codepage.
Tightened up a bunch of code in the font table and codepage handling
code. Made it much more memory efficient (and probably faster, though
negligibly so.)
History.
Setting up the text files:
There are two different dehyphenization routines. One works with a
single set of files, the files with line breaks preserved; the format
needed by the Distributed Proofreaders site. The other will use two sets
of files, one set with line breaks and one set without. The two set
routine will yield better accuracy during dehyphenization at the expense
of slightly longer processing time and more disk storage space. To do two
set dehyphenization, save the text from ABBYY FineReader (or
possibly other OCR packages; it should work as long as they produce
standard, well formed rtf files) two times in two different directories.
Assuming you have a project directory named "PROJECT", under the project
directory you will need two directories, "textw" and "textwo". "textw"
stands for "text with line breaks" and "textwo" stands for "text without
line breaks". If you are only going to do single set dehyphenization,
you only need to follow the instructions for the "textw" directory.
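To give a feel for why the second set of files helps, here is a minimal Python sketch of the idea. It is NOT the script's actual routine; the function name and the matching rule are assumptions made for illustration. A word split by a hyphen at the end of a line in the "textw" text is looked up in the "textwo" text: if the joined form appears there, the hyphen was only a line-break hyphen and is dropped; otherwise the hyphen is kept.

```python
import re


def rejoin_hyphens(with_breaks: str, without_breaks: str) -> str:
    """Rejoin words hyphenated at line ends, using the second text as a reference."""
    # Every word (including hyphenated compounds) seen in the no-line-break text.
    known = set(re.findall(r"[\w-]+", without_breaks))
    lines = with_breaks.split("\n")
    for i in range(len(lines) - 1):
        match = re.search(r"(\w+)-$", lines[i])
        parts = lines[i + 1].split(" ", 1)
        if not match or not parts[0]:
            continue
        first_half = match.group(1)
        second = parts[0]                             # may carry punctuation
        second_word = re.match(r"\w*", second).group(0)
        if first_half + second_word in known:
            joined = first_half + second              # line-break hyphen: drop it
        else:
            joined = first_half + "-" + second        # real hyphen: keep it
        lines[i] = lines[i][:match.start()] + joined
        lines[i + 1] = parts[1] if len(parts) > 1 else ""
    return "\n".join(lines)


if __name__ == "__main__":
    textw = "The quick brown fox jum-\nped over the well-\nknown dog."
    textwo = "The quick brown fox jumped over the well-known dog."
    print(rejoin_hyphens(textw, textwo))
```

With only the "textw" set there is no reference to consult, so a routine has to guess whether "well-" plus "known" is one word or a hyphenated pair.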
with RTF Markup Extraction:
In ABBYY after all of your images are loaded and OCRed, select File
=> Save Text As;
A dialog box will pop up.
In the "textw" directory, save the text with the settings: Save as type
Rich Text Format, Create a separate file for each page, Retain font and
font size. On the RTF tab of the Formats Settings, check Keep page breaks
and Keep line breaks and uncheck everything else. It doesn't matter what
the File name is set to. The default is probably fine.
In the "textwo" directory, save the text with the settings: Save as
type Rich Text Format, Create a separate file for each page, Retain font
and font size. On the RTF tab of the Formats Settings, check Keep page
breaks and Remove optional hyphens and uncheck everything else. Make sure
the File name is set the same as in the textw directory.
without RTF Markup Extraction:
If you don't want to do markup extraction, (or your OCR package won't
support RTF files) you can skip saving the files as RTFs and just save
them as plain text files. Again, to do dehyphenization, you will need
to save the files in two directories, textw and textwo.
Save the text with line breaks in textw. The ISO Latin-1 code page will
give you pretty good results for English and most European languages.
The site works with ISO Latin-1 so that may be least problematic to fit
into the character space used. Windows codepage 1252 should also work
well since it overlaps Latin1 very closely and where it doesn't, the
filter routine will convert characters that don't fall within Latin1.
This may actually yield better results than trying to force the OCR to
fit the text into the Latin1 character set.
The textwo directory should use all of the same settings except that Keep line breaks needs to be
unchecked. Be sure to use the same code page and file names in both the
textw and textwo directories.
At this point the script is used exactly the same way except you'll
skip the Extract Markup routine.
without RTF Markup Extraction or Dehyphenization:
If you are using a different OCR package that can't save as rtf or do
automatic line rejoining, you may need to skip those two functions.
Save the files in a directory named "text" using the same settings as
for textw without RTF extraction above. Uncheck both Extract and
Dehyphenate under the Process Text tab.
Using the script:
When you run the script, a Graphical User Interface will pop up
allowing you to select options, select the working directory, and
process the text files. One implication of this is that the script no
longer NEEDS to run from the working directory. In fact, it will work
better if you run it from the same directory each time, changing to the
working directory after it starts, because it will save all of the
option settings in the directory it is started in, in a file named
settings.rc (rc is a standard extension for resource files), and it
will look for the scannos.rc file in the startup directory. The script
remembers the last directory you were working in and reopens to that
directory the next time you run it.
Select Options tab
The Select Options tab will
allow you to adjust the markup used for italics and bold extraction and
set the options you want the filter routine to run. The Save Settings button will save your
markup and selections from session to session. The Default Markup button will change
all the markup text back to defaults. If there is little or no bold in
your text, you may want to disable bold extraction to cut down on false
positives. The other settings are all options for the filter routine.
See discussion below under Filtering for suggestions and explanations
for different settings.
There are a few options having to do with batch processing.
Extract Bold Markup - If you
don't have much bold text in your project you may want to disable this
to cut down on false positives, especially for lower quality scans.
Insert cell delimiters in tables
- If you have tables in your project, the script will try to keep the
layout as much as it can. The cells usually will not come out exactly
as the original, so you can add markers, "|", between the cells to help
the proofers align them.
Extract sub/superscript markup
- Select whether to extract subscripts and superscripts while doing
markup extraction.
Dehyphenate using German style
hyphens; "=" - Option to dehyphenate German texts.
Header Removal - You can now
select whether you want to run automatic header removal on your text
files during batch processing. It will automatically remove the top
line from every text file. THIS MAY POSSIBLY REMOVE LINES THAT
SHOULDN'T BE REMOVED. USE WITH CARE. It is highly recommended that
header removal be done in interactive mode if feasible.
Build a zip of the project files
- The site promises to soon have the capability to upload the project
files as a zip file, possibly through a web interface rather than FTP.
This option will generate a zip archive containing all of the files in
the "text" and "pngs" directories (or whatever you chose to name your
image directory). It will be written to the project directory with the
name of the project directory used as the name of the zip file.
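For anyone who wants to build the same kind of archive by hand, here is a small Python sketch of what the option is described as doing. The function name is made up, and keeping the "text/" and "pngs/" prefixes inside the archive is an assumption; the layout the script really uses may differ.

```python
import zipfile
from pathlib import Path


def build_project_zip(project_dir, image_dir_name="pngs"):
    """Zip the text and image directories into <project>/<project name>.zip."""
    project = Path(project_dir).resolve()
    zip_path = project / f"{project.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for sub in ("text", image_dir_name):
            for path in sorted((project / sub).glob("*")):
                if path.is_file():
                    # Assumed layout: keep the directory name inside the zip.
                    archive.write(path, arcname=f"{sub}/{path.name}")
    return zip_path
```

Calling `build_project_zip("PROJECT")` would write PROJECT/PROJECT.zip, matching the naming rule described above.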
Filtering options:
As of now, the pattern substitution/filtering functions
the script will perform are:
• Remove extra (multiple)
spaces in text. - Highly recommended. Makes all of the other
filtering more effective. Default on.
• Convert Windows-1252
codepage glyphs 80-9F. - Highly recommended. Will need to be
fixed eventually, may as well do it now. Default on.
• Remove spaces at end of
line. - Recommended. Not a big deal either way but may make the
proofers job easier. Will help later during rewrapping. Default on.
• Convert spaced hyphens to
em dashes. - Recommended. Correct behavior for most texts. Not
recommended for math texts. Default on.
• Convert multiple
consecutive underscores to em dashes. - Recommended. Correct
behavior for most texts. Default on.
• Remove spaces on either
side of hyphens. - Highly recommended. Easily automated
formatting fix. Correct behavior more than 99% of the time. Not
recommended for math texts. Default on.
• Convert double commas to a
single double quote. - Recommended. Usually correct behavior.
Default on.
• Remove spaces on either
side of em dashes. - Highly recommended. Easily automated
formatting fix. Correct behavior more than 99% of the time. Not
recommended for math texts. Default on.
• Remove space before periods.
- Highly recommended. Easily automated formatting fix. Correct behavior
more than 99% of the time. Default on.
• Remove space before
exclamation points. - Highly recommended. Easily automated
formatting fix. Correct behavior more than 99% of the time. Default on.
• Remove space before
question marks. - Highly recommended. Easily automated
formatting fix. Correct behavior more than 99% of the time. Default on.
• Remove space before commas.
- Highly recommended. Easily automated formatting fix. Correct behavior
more than 99% of the time. Default on.
• Remove space before
semicolons. - Highly recommended. Easily automated formatting
fix. Correct behavior more than 99% of the time. Default on.
• Remove space after opening
and before closing brackets. - Recommended. Easily automated
formatting fix. Correct behavior most of the time. Default on.
• Strip space after start
& before end doublequotes. - Highly recommended. Easily
automated formatting fix. Correct behavior more than 99% of the time.
Default on.
• Ensure space before
ellipses except after period. - Recommended. Easily automated
formatting fix. Correct behavior most of the time. Default on.
• Convert two adjacent
single quotes to a single double quote. - Highly recommended.
Easily automated formatting fix. Correct behavior more than 99% of the
time. Default on.
• Convert solitary 1 to I,
if not at beginning of line, or if preceded by quotes. -
Recommended. Depends on text. For vast majority does much more good
than harm. Default on. *See note
• Convert solitary lowercase
l to I if preceded by space or quotes. - Recommended. Depends on
text. For vast majority does much more good than harm. Default on.
• Convert solitary 0
preceded by quotes to O. - Recommended. Depends on text. For
vast majority does much more good than harm. *See note below. Default
on.
• Convert vulgar fractions
(¼,½, ¾) to "1/4", "1/2" and "3/4". - Your
choice. Depends on book. Depends on your preference. Default on.
• Convert ² and ³ to "^2" and "^3". - Your choice.
Depends on book. Depends on your preference. Default on.
• Convert £ to
"Pounds". - Your choice. Depends on book. Depends on your
preference. Default off. *See note below:
• Convert ¢ to "cents".
- Your choice. Depends on book. Depends on your preference. Default
off. *See note below:
• Convert § to
"Section". - Your choice. Depends on book. Depends on your
preference. Default off.
• Convert ° to "degrees".
- Your choice. Depends on book. Depends on your preference. Default off.
• Convert forward slash (/)
at a word end to comma apostrophe(,'). - Your choice. Depends on
book. Depends on your preference. Default on. (Will ignore slash after
less than </)
• Convert \v or \\ to w.
- Your choice. Fairly common scanno. Depends on your preference.
Default on.
• Convert solitary j, or j at the
end of a word not preceded by "a, e, n or u", to semicolon. - Your
choice. Depends on book. Depends on your preference. Default on.
• Convert string 'tli' to
'th' if it is at the beginning of a word. - Very highly
recommended for English texts, especially if you are going to run the
Scanno check. Recommended with caution for non-English. Default on.
• Convert string 'tii' to
'th' if it is at the beginning of a word. - Very highly
recommended for English texts, especially if you are going to run the
Scanno check. Recommended with caution for non-English. Default on.
• Convert string 'wli' to
'wh' if it is at the beginning of a word. - Very highly
recommended for English texts, especially if you are going to run the
Scanno check. Recommended with caution for non-English. Default on.
• Convert string 'rn' to 'm'
if it is at the beginning of a word. - Very highly recommended
for English texts, especially if you are going to run the Scanno check.
Recommended with caution for non-English. Default on.
• Convert string 'hl' to
'bl' if it is at the beginning of a word. - Very highly
recommended for English texts, especially if you are going to run the
Scanno check. Recommended with caution for non-English. Default on.
• Convert string 'hr' to
'br' if it is at the beginning of a word. - Very highly
recommended for English texts, especially if you are going to run the
Scanno check. Recommended with caution for non-English. Default on.
• Convert string 'rnp' to
'mp' in a word. - Very highly recommended for English texts,
especially if you are going to run the Scanno check. Recommended with
caution for non-English. Default on.
• Convert vv at the
beginning of a word to w - Recommended, default on.
• Convert !! at the
beginning of a word to H - Recommended, default on.
• Convert initial X not
followed by e to N - Also takes into account Roman Numerals,
Recommended, default on.
• Convert ! inside a word to
l - Recommended, default on.
• Convert '11 to 'll
- Recommended, default on.
• Convert rnm in a word to mm
- Recommended, default on.
• Convert string 'cb' to
'ch' in a word. - Very highly recommended for English texts,
especially if you are going to run the Scanno check. Recommended with
caution for non-English. Default on.
• Convert string 'gbt' to
'ght' in a word. - Very highly recommended for English texts,
especially if you are going to run the Scanno check. Recommended with
caution for non-English. Default on.
• Convert string '[ai]hle'
to '[ai]ble' in a word. - Very highly recommended for English
texts, especially if you are going to run the Scanno check. Recommended
with caution for non-English. Default on. [ai] means: either a or i.
• Convert cl at the end of a
word to d - Recommended, default on.
• Convert pbt in a word to
pht - Recommended, default on.
• Convert whole words string
'to he' to 'to be'. - Very highly recommended. Almost always
correct behavior. Default on.
• Move punctuation outside
of markup. - Highly recommended if you have extracted markup.
Otherwise not. Default on.
• Remove empty lines from
the top of the file. - Highly recommended. Easily automated
formatting fix. Default on.
• Convert multiple
consecutive empty lines to a single. - Recommended. Usually
correct behavior. Easy to fix if not. Default on.
• Remove empty lines from
the bottom of the file. - Highly recommended. Easily automated
formatting fix. Default on.
• If top line has nothing
but digits, (page number) delete it. - Recommended. Up to your
personal preference. Default on.
• If bottom line has nothing
but digits, (page number) delete it. - Recommended. Up to your
personal preference. Default on.
They are all selectable from the options page.
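For readers curious how filters like these work, here is a short Python sketch of a handful of them written as ordered regular expression substitutions. It is only an illustration: the script itself is Perl, the patterns below are my own guesses at reasonable equivalents, and the `FILTERS` table and `apply_filters` name are invented for the example. It also ignores interactions between filters (ellipses, for instance).

```python
import re

# (description, pattern, replacement) applied in this order.
FILTERS = [
    ("Remove extra (multiple) spaces", re.compile(r" {2,}"), " "),
    ("Remove spaces at end of line", re.compile(r" +$", re.MULTILINE), ""),
    ("Remove space before . , ; ! ?", re.compile(r" +([.,;!?])"), r"\1"),
    ("Convert two adjacent single quotes", re.compile(r"''"), '"'),
    ("Convert 'tli' to 'th' at word start", re.compile(r"\btli"), "th"),
    ("Convert 'rnp' to 'mp' in a word", re.compile(r"(?<=\w)rnp"), "mp"),
]


def apply_filters(text, enabled=None):
    """Run every enabled filter over the text; `enabled=None` runs them all."""
    for name, pattern, replacement in FILTERS:
        if enabled is None or name in enabled:
            text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(apply_filters("''Stop !'' said tlie  carnp leader .  "))
```

Running the multiple-space filter first is what makes the later ones simpler, which is why that option is described as making all of the other filtering more effective.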
The "improbable character combination" filters (tli, tii, wli, rn, hl, hr,
rnp, cb, gbt, [ai]hle) DEFINITELY should be run if you intend to run Fix
Common Scannos. Those filters reduce the number of checks that need to be
done by the scanno routine by 330 words yet effectively add several thousand.
*After ad hoc testing of about 50 texts pulled from PG at random,
solitary I is about 90 times more likely than solitary 1. If instances
at the beginning of lines are ignored, it rises to about 150 times.
Pretty good odds I think.
*Solitary 0 (with nothing but space on either side) is automatically
converted to O. This is non-negotiable. Because of the way the
dehyphenate subroutine works, if it encounters a solitary 0 in the
text, it will delete the rest of the paragraph. I would rather have a
few misconverted O's than deleted paragraphs. (It's not really the
dehyphenate subroutine's fault, it's more just a consequence of Perl's
weak variable typing, but I digress.) This is not just my dehyphenate
routine; aldarondo's has the same problem but doesn't trap it.
* £ to "Pounds" uses some intelligence when it converts. It will
move the "Pounds" to after the number, i.e. £30 will become
'30 Pounds', not 'Pounds 30'.
* ¢ converts to "cents" unless it follows a solitary 1, in which
case it converts to "cent".
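The two currency notes can be sketched in Python roughly as follows. This is a guess at equivalent behavior rather than the script's own code: the function names are invented, and both the handling of a £ that is not followed by a number and the space inserted before "cents" are assumptions.

```python
import re


def convert_pounds(text):
    """'£30' becomes '30 Pounds'; a £ with no number after it becomes 'Pounds'."""
    text = re.sub(r"£(\d+(?:[.,]\d+)*)", r"\1 Pounds", text)
    return text.replace("£", "Pounds")  # assumed fallback for a bare sign


def convert_cents(text):
    """'¢' becomes 'cents', or 'cent' when it follows a solitary 1."""
    # A solitary 1 is a 1 that is not part of a longer number.
    text = re.sub(r"(?<![\d.,])1 ?¢", "1 cent", text)
    return re.sub(r" ?¢", " cents", text)


if __name__ == "__main__":
    print(convert_pounds("It cost £30."))
    print(convert_cents("1¢ or 25¢"))
```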
Change Directory tab
You can select the directory you want to work in on the Change Directory tab. The top bar
shows what directory is the "current directory". In general, to run the
Extract and Dehyphen scripts you will need the "textw" and "textwo"
directories visible in the Change To selection box. The other text
processing routines need the text directory, which will be created by
the dehyphen routine, if necessary. The png processing routines will
need to see the pngs directory. Click on the directory name to move to
that directory or on the " .. " to move up one level. All of the
routines expect to run from the same directory. (The parent directory
of pngs, textw, textwo and [eventually] text.) If you want to run the
script in batch mode, select one or more directories containing the
files to be processed in the right hand box. All of the batch functions
work exactly like the interactive functions, they just allow you to
queue a bunch of projects up and process them all with one command.
Remember, to do interactive processing you need to be IN the project
directory; for batch processing you need to be ABOVE the project
directory. Remove headers can only be done in interactive mode, so you
will need to be IN the project directory to do it.
Process Text tab
Once your options are set up and you are set to the right directory, go
to the Process Text tab. In
this tab you can run the different routines on the text files. You can
run individual routines, mix and match or select Do All Selected to run all the
subroutines you select in one batch. Different routines have different
prerequisites so you can't necessarily run the routines out of order
and get good results.
The Extract Markup routine
expects to find the directory "textw" (and optionally "textwo" if you
are using the original dehyphenate routine) with rtf format files in
them. It will extract the text and markup and put the extracted files
in the same directory with a .txt extension.
The Dehyphenate routine
expects to find the "textw" and optionally "textwo" directories with
.txt files in them. Whether the .txt files are the result of the Extract
routine or just .txt format files saved directly from ABBYY is
immaterial. It will put the merged files into a directory named "text",
creating it if it doesn't already exist. **WARNING: any files with a
.txt extension in the "text" directory when Dehyphenate runs WILL BE
DELETED. WITHOUT WARNING OR ASKING.**
The Rename, Filter, Correct Common Scannos and Fix Zero Byte routines all expect to
find the "text" directory with .txt files in it. Again, the files may be
from the Dehyphenate routine or may not.
Rename Png Files expects to
find the "pngs" directory with your .png files in it. It will rename
all of the .png files in the upload format.
Run Pngcrush expects to find
the "pngs" directory with your .png files in it. It will run pngcrush
on each file to optimize the compression and reduce the size. The
default settings will reduce the palette to the minimum
necessary. It does save the
original files in a directory " _pngsback_" so you can easily recover
them. If interrupted part way through, it will pick up where it left
off the next time you start it. As a consequence, if you interrupt it,
the pngs directory WILL NOT have all of the files in it. Make sure you
have the same number of text and png files before you upload them.
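A quick way to do that check outside the program is sketched below in Python. It assumes the text and png files share base names (001.txt next to 001.png), which the README implies after renaming but does not state outright; the function name and default directory names are only for the example.

```python
from pathlib import Path


def unmatched_pages(project_dir, text_dir="text", png_dir="pngs"):
    """Return (text pages with no png, png pages with no text)."""
    project = Path(project_dir)
    text_pages = {p.stem for p in (project / text_dir).glob("*.txt")}
    png_pages = {p.stem for p in (project / png_dir).glob("*.png")}
    return sorted(text_pages - png_pages), sorted(png_pages - text_pages)
```

Two empty lists mean every page has both files; anything else names the pages to look at before uploading.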
If you are going to run both Filter
and Fix Common Scannos, it is highly recommended that you run Filter first, then Fix Common Scannos. Fix Common Scannos will check your
files for over 3000 of the most common mis-scanned English words and
correct them. It should be used with caution on non-English texts
though. It probably won't
hurt but you should check a bunch of pages afterwards to be sure. (It
probably won't help either.)
It is recommended that the Fix Zero
Byte Files routine be run last, though the order is not really critical.
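For illustration, here is a Python sketch of a zero byte check. What the real routine writes into an empty file is not spelled out here; the sketch ASSUMES it fills the file with the blank page markup, and the "[Blank Page]" string and the function name are placeholders.

```python
from pathlib import Path


def fix_zero_byte_files(text_dir, blank_markup="[Blank Page]"):
    """Fill empty .txt files with the blank page markup; return their names."""
    fixed = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        if path.stat().st_size == 0:
            # Assumed behavior: an empty page gets the blank page markup.
            path.write_text(blank_markup + "\n", encoding="latin-1")
            fixed.append(path.name)
    return fixed
```

Running something like this last makes sense because earlier steps (filtering, header removal) are the ones that can leave a file empty.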
Convert to ISO-8859-1 NEEDS to
be run on files for the original DP site but SHOULD NOT be run on files
for DPEU. This will transliterate any Greek characters and
convert any other characters outside of Latin-1 to question marks.
Hopefully the original DP site will be converting to Unicode in the
near future and make this function unnecessary.
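The question mark behavior is easy to reproduce in Python, as a sketch of the idea only. The Greek transliteration step is left out, and the function name is invented.

```python
def to_latin1(text: str) -> str:
    """Replace every character outside ISO-8859-1 with a question mark."""
    # errors="replace" makes the encoder emit "?" for unencodable characters.
    return text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")


if __name__ == "__main__":
    print(to_latin1("café cœur"))
```

Here "é" survives because it is part of Latin-1, while "œ" is not and comes out as "?".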
In general, the routines should be run in top to bottom order. If you
run them by selecting the routines you want to run, then pressing Do All Selected, they will
automatically run in the optimal order.
The Start Processing and Interrupt Processing buttons will
start and stop the processing job. If you have a batch queued up, it will
run the batch. Otherwise, it will run in interactive mode in the
current directory.
For batch processing, Start
Processing will run Do All
Selected on each project. You can select and deselect routines
and Process Batch will follow
your selections. If you really
want to, you can change options and selections while the batch is in progress....
but you probably shouldn't. The small box in the lower left shows the
status of the current batch.
? will pop a terse help message.
Clear Status Box will clear the
messages from the status box.
Fix Common Scannos:
The scannos word list was pulled from the Distributed Proofreaders CVS
site. There are approximately 3400 words in the scannos list (though
the improbable letter combination filters make about 330 of them
redundant). From the description in the scannos list header:
# Word list derived from Moby project data, cut for top 2000 frequency and word
# of 6 characters or less (to reduce size and assuming that longer words will
# be closely examined by the proofreaders). The resulting list was processed
# through perl scripts which generated scannos by replacement (see below).
# This result was then filtered to eliminate valid words from the generated
# "error" list (left side) to eliminate false positives.
#
# The common scannos from gutcheck and PRTK were then added, as well as some
# additional scannos provided by numerous DP proofreaders.
#
# The resulting list was then tested against just over 1 million words of raw
# OCR output provided by charlz. Further false positives were discovered and
# removed. The actual hit rate for this code is about 1 scanno detected per 30k
# words of input text. The actual accuracy rate against the corpus provided by
# charlz is: 2 false positives out of 122 scannos detected, or 98.3% accurate.
# Seems worthwhile to me. :)
If you come up with a misscanned word that you think should be in the
scanno list, let me know. Words that commonly are misscanned for each
other (like bad / had or and / arid) are NOT good additions. Those are
better off in Big_Bills' stealth scannos list.
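As a rough picture of what a scanno list lookup amounts to, here is a Python sketch that replaces whole words from a small table. The table entries are made-up examples, the function name is invented, and nothing here reflects the real format of scannos.rc.

```python
import re


def fix_scannos(text, scannos):
    """Replace each whole-word scanno in `text` with its correction."""
    if not scannos:
        return text
    # Longest entries first so a longer scanno is never shadowed by a shorter one.
    words = sorted(scannos, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(lambda m: scannos[m.group(1)], text)


if __name__ == "__main__":
    table = {"tbe": "the", "wbich": "which"}   # hypothetical entries
    print(fix_scannos("tbe book wbich tbey read", table))
```

Only "error" strings that are never valid words belong in such a table, which is why pairs like bad / had are poor candidates: either side could be correct.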
Header Removal tab
When all of the processing routines have been performed, you can go to
the Header Removal tab to
delete the top lines of each text file, if desired. To remove headers
in interactive mode, you will need to be IN the project directory. If remove
headers is run in batch mode, it will automatically remove the top line
of EVERY text file (unless the top line is the blank page markup) and
then run the Fix Zero Byte Files routine to catch any emptied files.
Often the top line is a book or chapter title that will be removed
anyway. This tool will help semi automate removing them. Press Get Headers. This will load a list
of the top lines of each file. Select the ones you want to delete
(probably easier to select all,
then unselect the ones you don't want to delete) then press Remove Selected to write the changes
to the affected files. If you like, you can Get Headers again to see if there
are any others you would like to remove. Repeat as necessary. If the
top line of a file is the blank page markup (from the Select Options
tab), Remove Headers will not delete it; you will have
to delete it manually if you want to remove it.
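The Get Headers / Remove Selected cycle boils down to something like the Python sketch below. It is an approximation for illustration: the function names, the Latin-1 encoding choice and the "[Blank Page]" markup string are all assumptions, not taken from the script.

```python
from pathlib import Path


def get_headers(text_dir):
    """Map each .txt file name to its top line ('' for an empty file)."""
    headers = {}
    for path in sorted(Path(text_dir).glob("*.txt")):
        lines = path.read_text(encoding="latin-1").splitlines()
        headers[path.name] = lines[0] if lines else ""
    return headers


def remove_selected(text_dir, selected, blank_markup="[Blank Page]"):
    """Delete the top line of each selected file, sparing blank page markup."""
    for name in selected:
        path = Path(text_dir) / name
        lines = path.read_text(encoding="latin-1").splitlines(keepends=True)
        if lines and lines[0].strip() != blank_markup:
            path.write_text("".join(lines[1:]), encoding="latin-1")
```

Calling `get_headers` again after `remove_selected` shows the new top lines, which mirrors pressing Get Headers a second time.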
If you accidentally remove headers you didn't intend to, it will probably
be easiest to go back to Process Text
and run Dehyphenate, Rename, Filter, Fix Common Scannos and Fix Zero Byte Files again to
regenerate the files, then rerun Remove
Headers. It would not be necessary to rerun Extract, since those files are
stored in a different directory. Alternately, you may want to back up
your text files with the Make Backups
function on the Process Text
tab before you run Remove Headers
so you can revert easily if problems arise.
After you Get Headers, you can
easily edit the file that the header is part of by double left clicking
on the header to open your text editor. Set up your text editor on the Program Prefs tab. If you DO edit some of the files this way,
remember to refresh the header list before running Remove Selected. You can also link
an image viewer so you can compare image and text side by side (Much
like the site! :-) ) Irfanview
works really well for this. And it's free! XnView is another great free image
viewer that works well. Invoke your image viewer by left then right
clicking on a file header. **Will not
work in Winprep.exe. Winprep
cannot run external programs.**
If you use Irfanview, for
best results, set View->Display options to 'Fit only big images to
window'.
If you use XnView, it's a little
more complex. Go to Tools->Options->View and check 'Maximize view
when open' and set 'Auto image size' to 'Fit image to window, large
only.' Go to Tools->Options->Misc and check 'Remember last
position/size'
*Caveat* There is a bug in the
command line parsing in XnView.
If you have a directory with a space in the name, in the path to XnView
(like 'Program Files' for instance), it will fail with a 'File not
found' error. As long as there are no directories with spaces in the
name in the path, it will work fine. Irfanview
and other image viewers I have tested don't have this problem.
Search tab
The Search tab has search and
replace functions that will search through the text files and display
the files with the search term and allow you to modify them, if
desired. This is a strictly interactive tab. It is handy to check for
project specific scanning errors or to check up on synchronization
errors during dehyphenization. (Search for '**') Maybe after you are
done with all your processing, you decide that you shouldn't have done
bold extraction after all. Just do a search and replace on <b> and
</b>.
There are some options to do case insensitive searching or search for
whole words only to narrow down what the search function will find.
When you perform a search, if the search text is found, the whole file
it is in will be displayed in the text window with the found text
highlighted and the cursor just before it. If the search text is not
found in any of the remaining files, a dialog will pop up informing you.
The buttons are pretty self explanatory. The Save Open File button saves the text
that is currently displayed in the window to the file, overwriting the
original. Search looks for the
next occurrence of the search term. If you already have a text file
open and press Search, it will
proceed with the search starting from the open file. Replace substitutes the Replacement Text for the Search Text in the window, and saves
the file. To cancel an in progress search, change the Search Text, that will reset the
file index counter to the beginning. Replace
& Search (R & S)
just combines the Replace and Search buttons into one function
call. Replace All will call Replace and Search until all of the files have
been searched. It will reset the file index counter to zero before it
starts so if you are performing a manual search, get halfway through
the files and then press Replace All,
it will start over again at the first file.
Program Prefs tab
There is a Program Prefs tab
where you can set some preferences which affect how the program looks
and runs. You can change the color palette that the script uses, you can
associate a text editor with the script to allow easy checking and
editing of files while you are doing header removal and you can
associate an image viewer to do side-by-side comparisons with text.
The default palette is CornSilk2. I also like PeachPuff2, Bisque2,
CadetBlue3 and Ivory3. Some truly painful ones are chartreuse1,
IndianRed1, brown1 and DarkOrchid2. Ouch!
You can now specify what the name of the directory containing your png
files is on this tab. Default is 'pngs'. Avoid using directory names
with spaces in them.
Windows users will probably want to use wordpad, notepad or
some equivalent as the text editor, and Irfanview, XnView or an equivalent as the image
viewer.
The default locations for notepad and wordpad are:
Win 95, 98, 98SE, ME & XP:
C:\WINDOWS\NOTEPAD.EXE
C:\Program Files\Accessories\WORDPAD.EXE
Win NT & 2K
C:\WINNT\NOTEPAD.EXE
C:\Program Files\Windows NT\Accessories\WORDPAD.EXE
FTP tab
There is an FTP client included which will help automate uploading the
project to the Distributed Proofreaders FTP server.
It is a simple, moderately featured FTP client, suitable for uploading to DP and
minor maintenance.
From left to right in rows....
Host name (Text Entry) -
Defaults to pgdp01.archive.org
User name (Text Entry) - Get it
from the Project Managers page. Will be saved from session to session
if Save User & Password is checked.
Password (Text Entry) - Get it
from the Project Managers page. Will be saved from session to session
if Save User & Password is checked.
Home Directory (Text Entry) -
Set a preferred home directory on the FTP server if desired. Will
automatically change to that directory when you connect.
Connect To Host (Push Button) -
Initiate FTP connection. Will fail if you have no internet connection.
May take a while.
Disconnect (Push Button) -
Break FTP connection.
Save Log File (Push Button) -
Save a session log to a file.
Clear Log (Push Button) - Clear
Session log.
? (Push Button) - A terse help
file with a brief explanation of how to use the client.
Save User & Password (Check
box) - Option to save User name and Password.
Session Log (Text Readout) -
Commands and feedback issued during session.
Connection Status Box (Text
Readout) - Connection monitor.
Build Batch (Push Button) -
Make a standard batch. Adds all the .txt files in the text directory
and all the .png files in the pngs directory.
Add a File (Push Button)
- Mostly to upload a few files instead of a standard batch. Adds a
filename to the batch.
Zip Batch Files (Push Button) -
Zip all of the batch files into one zip archive. New functionality on
the site coming soon.
Clear Local List (Push Button)
- Cancel batch before it is sent and clear batch list. Will interrupt
batch in progress.
Send Files (Push Button) -
Transfer all of the files in the batch list to the FTP host in binary
mode.
Stop Transfer (Push Button) -
Interrupt a batch transfer in progress. Uploads can be resumed.
Downloads must be reinitialized.
Download (Push Button) - Select
a file or directory on the remote server and press Download. A dialog
will pop up to select a directory to download to. Alternate (file) -
double left click.
Make New Directory (Push
Button) - Make a directory on the remote host in the current directory
using the directory name from the Directory Name text entry.
Directory Name (Text Entry) -
Name to use when making a new directory on the remote server.
Chdir Sel (Push Button) -
Select a directory on the remote server and press Chdir Sel to change
to it. Alternate (directory) - double left click.
Chdir Up (Push Button) - Change
directory on the remote server up one level. Alternate - double left
click on double dot entry "..".
Rename (Push Button) - Select a
file or directory on the remote server then push Rename to rename it.
Delete (Push Button) - Select a
file or directory on the remote server then push Delete to delete it.
Remote Directory (Text Readout)
- The directory you are currently browsing / working in on the remote
host.
Local Listing (Text Readout) -
List of files that will be uploaded when Send Files is pressed.
Remote Listing (Text Readout) -
A listing of all of the files and directories in the current directory
on the remote host.
The directory listing shows files prefixed by 'FILE - ' with the byte
size after. Directories are prefixed by 'DIR - ' (except the
double dot entry ".." which is short hand for "parent directory").
To Change Directories on the
remote host, double left click on a directory name in the remote
directory listing.
To Download a file or
directory, left click then right click on it. (Or just double left click
a file name.) A dialog will pop up to select a download directory. When
downloading multiple single files, don't try to start the next before
the previous one is finished. It will cause problems with the script.
When downloading a directory, the script will make a directory with the
same name as the remote directory that it is downloading in the local
directory that you choose through the download dialog.
The directory download dialog may be somewhat confusing. Double left
click on a directory name or drive name to change to it, or on ".." to
go up one directory. When the target directory (where you want the new
directory to be placed) is in the text box below, press OK to start the
download. The script will create the new directory, if necessary, and
download the files into it, overwriting without warning any same name
files that may already be in it.
At the bottom of the dialog there is a filename filter box. By default,
(filter left blank, equivalent to '.' [any character]) all of the files
in the selected remote directory will be downloaded. If you only want
the text files, put .txt in
the filter box. If you only want the PNG files put .png. If you put a 2 in the filter box, you'll get all
of the files that have a 2 in the name. (002.txt, 002.png, 012.txt,
012.png, 020.txt, 020.png.... etc.) It uses perl regular expressions to
evaluate the pattern so you can build a much more complex matching
filter if desired.
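To make the filter behavior concrete, here is a small sketch in Python. It is only an illustration: the script itself uses Perl regular expressions, and the function name and sample file names are invented for the example. For simple patterns like these, Python's re module behaves the same way.

```python
import re

def filter_names(names, pattern=""):
    """Return the names that match the filter; a blank filter matches all."""
    regex = re.compile(pattern or ".")
    return [name for name in names if regex.search(name)]

names = ["002.txt", "002.png", "011.txt", "012.png", "notes.doc"]

everything = filter_names(names)              # blank filter, all five names
text_only = filter_names(names, ".txt")       # 002.txt and 011.txt
with_a_two = filter_names(names, "2")         # any name containing a 2
strict_png = filter_names(names, r"\.png$")   # names that end in .png
```

Note that an unescaped dot matches any character, so .txt would also match a name such as atxt.png. Escaping the dot and anchoring with $ (as in the last call) restricts the match to a real extension.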
To interrupt a batch download or upload in progress, press Stop Transfer. This will stop the
batch transfer after the current file is finished. To stop immediately,
press Disconnect.
To Delete a file or directory,
select it (left click) then press Delete.
Directories do not need to be empty to be deleted. A dialog will pop up
to ask for confirmation.
MAKE SURE YOU REALLY WANT TO DO THIS. IT CAN NOT BE UNDONE.
You can view files in the local list (if they are text or image
files) by double left clicking on them.
You can remove one or several files from the local list by highlighting
the file name(s) then double right clicking, or clear the list with the
Clear Local List push button.
The script does directory caching to drastically speed up walking the
tree. It does not save the cache when you close the program.
Troubleshooting:
The script tries to figure out whether it can run the way it expects
and tries to warn you if it has a problem.
If it warns that it can't find files or a directory, you probably
selected the wrong directory as a working directory or you may be
running in the wrong mode, (batch instead of interactive or vice versa)
or possibly you have incorrect options selected. (Running extract when
you don't have rtf files.) Remember, for interactive mode, the textw
and textwo directories (and possibly text and pngs) should be visible
in the change directory box. For batch mode, you need to select the parent directory of the textw and
textwo directories.
Warnings about the scannos file are a result of a missing or corrupted
scannos.rc file. If you edit the file, be sure to follow the format
shown.
One thing that will produce odd results is to feed the script RTF files
that contain 16 bit characters. (Unicode or UTF-8) It is really
designed to work with 8 bit characters using code page Windows 1252 or
ISO Latin-1. If you end up with page after page of question marks, you
probably are saving your RTF files as 16 bit characters.
If you somehow get the window set larger than your desktop and can't
get to an edge to resize it, delete the settings.rc file in the startup
directory. That will reset all of the settings to defaults, which will
reduce the window to 640x480 pixels. Alternately, you can edit the
settings.rc file with a text editor and remove the line that starts:
$geometry = .
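If you would rather not edit the file by hand, the same fix can be scripted. The following Python sketch is an illustration only; the function names are made up, and it assumes settings.rc is a plain text file with one setting per line, as the instructions above imply.

```python
def drop_geometry(lines):
    """Return the settings lines without any that start with '$geometry ='."""
    return [line for line in lines
            if not line.lstrip().startswith("$geometry =")]

def reset_window_size(path="settings.rc"):
    """Rewrite the settings file in place without its geometry line."""
    with open(path, encoding="latin-1") as handle:
        lines = handle.readlines()
    with open(path, "w", encoding="latin-1") as handle:
        handle.writelines(drop_geometry(lines))
```

It is worth copying settings.rc somewhere safe before running anything like this, since the file is overwritten.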
Known bugs and odd behavior:
When viewing a file through the Headers tab, you may have an unexpected
or wrong file open up. Due to the way list boxes are handled under Tk,
you need to specifically select (left click) an entry before you can
act on it (right click). If you haven't selected an entry, either the
last entry in the list or the previous selection is defaulted to. The
actual mouse pointer position is ignored on right click.
Customized open and close markup markers are not sanity checked. The
script will not check or care if you use inappropriate markers. For
instance you can set both italics and bold to use the same markup, or,
even worse, use a marker which will occur normally in the text. If you
specify "the" and "and" or even " " for your
italics open and close markers, the script will uncomplainingly use
them. Probably not a good idea.
If you switch away from the Process text tab while a batch or job is
running, it will automatically cancel the job to prevent contamination
of other directories. Each tab has its own peculiarities about where it
needs to run and if you try to switch to one while another is
processing, it could cause problems.
The FTP client blocks while it is waiting for a response from the FTP
server. It appears that the program has locked up but it is just
waiting for something to happen. If the connection is lost, it will
return immediately. On a dial-up connection, for large transfers, (the
initial directory listing, for instance) it can take between 30 - 60
seconds to respond. Be patient. It is
working.
Changelog History
Version .39 (643k)
Added option to extract small caps markup from the rtf during the
extraction routine. Markup will be added as <sc> .. </sc>
around the text that is marked as small caps in the RTF file. It
doesn't do too bad, but there are problems trying to convert RTF markup
(which is strictly presentational) into semantic sensitive markup.
Added an entry box to the Process Text tab where you can specify what
number to start with when renaming the text and/or png files. By
default it is set to 1, but if you want to offset the pages by 127,
enter 127 in the box and the files will be renamed starting at
127. If you want to force four digit numbers even for texts that
nominally would only need three (say an early volume of a multi-volume
work,) left pad the start number out to 4 places with zeros, e.g. 0001.
Sorry, no negative numbers, no skipping numbers in the sequence after
the start offset. If you don't like the offset you have, change it and
rename again, filename collisions will be automatically avoided.
Modified file renaming routine to be able to deal with offset start
points. Rewrote it to be more robust about avoiding filename
collisions. As a side effect, I sped it up about two to three times as
fast as it used to be.
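The numbering and collision rules described above can be sketched as follows. This is a Python illustration, not the script's actual routine: the function names, the temporary-name scheme and the three-digit minimum width are assumptions chosen to match the behavior described.

```python
def numbered_names(count, start="1", extension=".txt"):
    """Build `count` sequential file names beginning at `start`.

    Leading zeros in `start` set a minimum width, so "0001" forces four
    digits. Otherwise the width is at least three digits (an assumption)
    and grows to fit the largest number.
    """
    first = int(start)
    last = first + count - 1
    width = max(len(start), len(str(last)), 3)
    return [f"{n:0{width}d}{extension}" for n in range(first, last + 1)]

def rename_plan(old_names, new_names):
    """Two-pass plan: move everything to temporary names, then to final names.

    Going through temporary names means no file is ever renamed onto a
    name that another file in the batch still occupies.
    """
    temp = [f"__tmp_{i}__" for i in range(len(old_names))]
    return list(zip(old_names, temp)) + list(zip(temp, new_names))
```

For example, numbered_names(3, "127") gives 127.txt, 128.txt and 129.txt, while numbered_names(2, "0001") gives 0001.txt and 0002.txt. Each pair in the plan could then be handed to os.rename in order.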
Modified Search tab to be able to deal with file names that don't
correspond to their index.
Twiddled with the layout of the options tab slightly. Mostly cosmetic
changes.
Got tired of the default palette and changed it. Shouldn't affect most
current users, only new users, and you can still change it to whatever
you prefer.
Version .38 (642k)
Added a whole bunch of tweaks suggested by lorax.
Tweaked "Remove garbage punctuation " regexes a bit. Broke apart the
"Strip from front" and "Strip from end" regexes into separate options.
Modified Header Removal functions to not display pages where the only
text is the "Blank Page" text string from the options page.
Fixed improper calling of nohyph.dict loading function. Sigh.
Included a basic English nohyph.dict courtesy of lorax.
Tweaked quote handling a bit to try to intelligently resolve quote
spacing a bit better.
Added function that will try to find and change the case of ALL CAPS
words at the start of a chapter. It isn't very aggressive to prevent
unwanted case changes, but it should help a little.
Fixed bug with Convert £ to "Pounds" option where it would
erroneously split numeric quantities at commas. E.G., £100,000
would become 100 Pounds ,000 rather than 100,000 Pounds. Note, this
option is little used and somewhat discouraged, but it is available.
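The fix amounts to treating the comma groups as part of the number. A minimal Python sketch of that idea follows; the pattern and function name are illustrative assumptions, not the regex the script actually uses.

```python
import re

# A pound sign followed by digits, with optional comma groups and decimals.
POUNDS = re.compile(r"£(\d+(?:,\d{3})*(?:\.\d+)?)")

def convert_pounds(text):
    """Move the currency word after the whole quantity, commas included."""
    return POUNDS.sub(r"\1 Pounds", text)
```

A comma is only absorbed when exactly three digits follow it, so a comma that is ordinary punctuation after the amount is left where it is.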
Fiddled around with the "Move punctuation outside of markup" functions
to avoid a few undesirable side effects, the most obnoxious of which was
the <i</i>> problem.
Fixed a bug in the Extraction routine where if a page contained a
table, any text after the table would have its spaces changed to
non-breaking spaces. Normally this would be a non-issue since the
filter routine changes all non-breaking space back to regular spaces,
however, in rare instances they seemed to be slipping through.
Added an option to save two files during dehyphenization;
hyphens.txt and dehyphen.txt. The hyphens.txt will contain all of the
end-of-line hyphenated words that the script found during the
dehyphenate routine where the words remained hyphenated. The
dehyphen.txt will contain all of the words where a hyphen was removed.
The script has been capable of generating these files for some time as
a debugging aid, however it required editing the source to set a
debugging flag. Since the addition of the nohyph.dict dictionary file
though, these could be more useful to general users so I made the
generation optional in the program. The files will be placed in the
base directory of the project, (the directory that contains the textw,
textwo, text and pngs directories.) They will be overwritten each time
the dehyphenate routine is run.
Messed around with the layout of the options page a bit. The layout
manager I was using was very automatic, but I didn't like the staggered
columns of checkboxes.
Version .37 (638k)
Fixed problem where guiprep would occasionally lock up while running
Filter Files with the "Move punctuation outside of markup" selected.
Added an option for the "Remove garbage punctuation at ends of
line" to the
options page. Made filter regex much more aggressive.
Tweaked a few other filters a bit.
Version .36 (638k)
It's a veritable bug fest.
Fixed problem with semicolons being turned into question marks.
Stupidity errer :-(
Think I finally fixed the problem with disappearing punctuation after
hyphenated words. (Actually lorax spotted the error.)
Fixed some other mistakes I made while trying to implement dehyphenate
code modifications submitted by lorax. The problems should not have
caused any errors in the processed texts, though they limited the
effectiveness of the dehyphenate routine a bit.
Added a new filter to the filter routine to try to clean up junk at the
end of lines. Often, OCR will erroneously put a bunch of junk
punctuation at the end of lines, (typically where the page runs off into
the gutter.) This will try to detect and clean up the worst of it.
Was not able to replicate problem with emdash being rendered as
â", so that hasn't been fixed yet if it is truly a problem.
Remembered to update version number this time.
Version .35 (638k)
Phooey. Yet more bugs. (Well, bug fixes, one would hope.)
Fixed bug where Filter function would lock up on certain files.
Root cause was a regex to move punctuation outside of markup that had
adverse reactions to characters outside of Latin-1.
Fixed a few warnings about printing wide (multi-byte UTF-8) characters.
Version .34 (637k)
A few tweaks and bug fixes.
Added option to use an external file of words that are not hyphenated.
If there is a file named nohyph.dict in the guiprep directory, it will
be loaded and used to help determine which words should be dehyphenated
during the dehyphenization routine. (Similar to Nicola's DPEU version.)
Fixed problem with the Convert to ISO-8859-1 routine that was causing
some bizarre u <-> y substitutions.
Revised dehyphen routine to be a little more aggressive. Changed to
aggressively lower false negatives without significantly raising false
positives. Based on code sample by lorax.
Twiddled around with FTP routines a bit. Nothing substantial, most
visible change is the "activity indicator". Used to just append
vertical bars to the log, now just has a "spinning" line.
Version .33 (636k)
Updated program to deal with Unicode files gracefully. Now works
natively in UTF-8. Files for the original DP site NEED to be in ISO
8859-1 (Latin-1). There is an extra button on the Process Text tab,
"Convert to ISO 8859-1". PLEASE down convert files for the original DP
site. (At least until the UTF-8 mods get activated.) No such
restrictions for DPEU. UTF-8 files are PREFERRED at DPEU. Note the
Convert to ISO 8859-1 function will do transliteration of any Greek it
finds. (It uses the guiguts beta code to denote accented characters.)
Other characters outside of Latin-1 will be converted to question marks
at this time. If I get some transliteration tables, I could make auto
transliteration for other character sets too. I don't really want to
spend lots of time on it though because hopefully, in the near future,
DP will convert to UTF-8. A very large Thank You to Nikola Smolenski, one of the lead
developers for the DPEU site who worked out the bulk of the UTF-8
character extraction code.
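The question mark substitution is the standard fallback when text is encoded to a character set that lacks some of its characters. A short Python sketch shows the effect; it is an illustration only, the function name is invented, and it does not attempt the Greek transliteration step mentioned above.

```python
def to_latin1(text):
    """Encode to ISO 8859-1, turning anything outside Latin-1 into '?'."""
    return text.encode("latin-1", errors="replace")
```

Accented Western European letters survive the conversion because they exist in Latin-1, while characters such as Greek letters or long dashes come out as a single question mark each.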
Fixed problem with pngcrush under Win2000 and WinXP. It was easy
enough, once I figured out what was causing the problem. The fix
consisted mostly of downloading a version of pngcrush that works
correctly under 32 bit Windows. Argh. Note for Win 95, 98 and ME
users: the 32 bit version will not work correctly under DOS. The old
version is still included as pngcrush16.exe. Rename pngcrush.exe to
pngcrush32.exe and pngcrush16.exe to pngcrush.exe.
A few other small (and mostly invisible) tweaks.
Version .32 (550k)
Fixed bug where if an italicized word was at the start of a line after
a line that ended with a hyphen, the word would be removed during
dehyphenization.
Modified guiprep to fix markup that closes at the end of a line to not
leave the ending markup at the beginning of the next line.
Modified guiprep to use the spawn.pl spawning script for external
programs instead of runner.pl for the same reasons I changed it in
guiguts. More compact, and better Linux compatibility.
Added check for common italicized scholarly abbreviations to move
markup outside of punctuation. (e.g., ibid., loc., cit., Ib., cf., op.,
et seq., viz., etc.)
Cut out 100k of extraneous images from the manual.
Version .31 (659k)
Major update of the code to work with the Tk:804 series. Rewrote and
updated user interface to work with the new unicode aware Tk. The basic
operation is as near to identical to previous versions as I could make
it. It uses the same layout, though button and font sizes are subtly
different.
I have split apart the libraries from the executable version and am
including the windows exe along with the perl script. The executable
version uses the same prl03 perl runtime libraries as guiguts. If you
already have prl03 (prl03.zip)
for guiguts installed, there is no need to download it again.
Added unicode handling code to all of the functions. There was very
basic unicode handling in the extract routines before, but all it would
do was substitute question marks for any unicode character
outside the Latin-1 character space. Will now deal with unicode in all
routines. **NOTE** The PGDP site is still not able to work with multi
byte characters. If you have a unicode encoded text, you are better off
putting
it through DPEU.
Puttered around with FTP functions to try to get more accurate tracking
of transfer rates and estimated times.
Worked on making things that SHOULD be impossible to do, harder to do
accidentally. :-\
Lots of little tweaks and tuning that are not worth mentioning
individually but which added up to a substantial amount of time.
Played around with optionally marking up texts with questionable word
markup as determined by ABBYY during OCR but after messing with
it a bit, have serious reservations about its usefulness, and have
removed it again.
Version .30 (590k)
Modified FTP reporting code, now reports
on instantaneous and average speed of file transfers. Reports real
throughput after overhead. Selectable readout in Kilobytes per second
(KBps) or Kilobits per second (Kbps). Makes an estimate of seconds
remaining to transfer the current file. Not going to be very accurate
for small files.
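The arithmetic behind these readouts is simple enough to sketch. The Python below is an illustration, not the script's code: the function names are invented, and the unit conventions (1 kilobyte = 1024 bytes, 1 kilobit = 1000 bits) are assumptions, since the text above does not say which the program uses.

```python
def transfer_stats(bytes_done, total_bytes, elapsed, last_bytes, last_interval):
    """Return (average speed, instantaneous speed, seconds remaining).

    Speeds are in bytes per second. `last_bytes` is the amount moved
    during the most recent `last_interval` seconds.
    """
    average = bytes_done / elapsed
    instantaneous = last_bytes / last_interval
    remaining = (total_bytes - bytes_done) / average
    return average, instantaneous, remaining

def kilobytes_per_second(bytes_per_second):
    return bytes_per_second / 1024          # assumed: 1 KB = 1024 bytes

def kilobits_per_second(bytes_per_second):
    return bytes_per_second * 8 / 1000      # assumed: 1 Kb = 1000 bits
```

With a small file, elapsed is tiny and only a handful of samples exist, so the average (and therefore the estimate of time remaining) swings widely. That is why the estimate is unreliable for small files.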
Fixed problem where script would dump you in the wrong directory if
processing was interrupted during the scannos routine.
Made rename functions report file counts. Useful to check that you have
the same number of text and image files.
When building a batch for FTP upload, the build routine will now check
for and warn about zero byte files.
Changed Change Directory tab to use double click instead of single
click to navigate. (Made it the same as the navigate function in the
FTP window.)
When making a new directory on the FTP server, the script automatically
issues a CHMOD 0777 command to set the permissions on the new directory.
Version .29 (590k) Fixed "Change initial X not followed by e to N"
to also ignore X followed by hyphen.
Tweaked a few more things on FTP tab. Added a "percentage done" on
upload or download to status box.
Found and fixed bug where search window would add a blank line to the
bottom of each file every time it was opened.
Ripped out the original two set dehyphenization function and wrote a
new one based on the single set dehyphenization function. Actually both
dehyphenization functions use the same code to perform the
dehyphenization, they just use different dictionary building code. The
new two set function has all of the robustness and flexibility of the
single set, with accuracy as good as (potentially even better than)
the original two set.
Found and fixed bug in dehyphenization where it was getting confused by
italic markup (and likely bold too, though I didn't confirm that.)
Rewrote large portions of the logging and error reporting code to be
much more compact and less error prone. Reduced script size by 10
percent in the process.
Added capability to use German style "=" instead of "-" as the hyphen
symbol for dehyphenization.
Removed some of the more problematic scannos from the scanno
dictionary. "cf" => "of", "au"=>"an" and "dont"=>"don't".
Did a fair amount of updating to the manual.
Version .28 (601k) Fixed a few
spelling errors in the user interface.
Made "Change initial X not followed by e to N" option not change Roman
numerals. (Basically it will ignore an initial X followed by eEIVXDCML
or space.)
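The rule as described can be expressed as a single pattern with a negative lookahead. The following Python sketch is an illustration under that reading; the names are invented and the script's own Perl regex may differ.

```python
import re

# A word-initial X, unless the next character is one of eEIVXDCML or a space.
INITIAL_X = re.compile(r"\bX(?![eEIVXDCML ])")

def fix_initial_x(text):
    """Change a misread initial X to N, leaving Roman numerals alone."""
    return INITIAL_X.sub("N", text)
```

So "Xow" becomes "Now", while "Xerxes", "XIV", "XL" and a lone "X " are left untouched. Adding a hyphen to the excluded characters would give the version .29 behavior noted above.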
Made "rnp" to "mp" fix ignore turnpike as a special case.
Tinkered around with the dehyphenate routine to try to figure out what
could be causing the intermittent moving of whole lines instead of just
word halves. Was not really able to find a specific fix. Was not able
to make it fail on any of the texts I have. Still waiting on some
sample files that show the symptom from someone, so I can try to track
it down. Was not able to make it happen, even by downloading some
images from the FTP server that have text files exhibiting the symptom
and OCRing them myself. Oh well, if I can't duplicate it, I can't
rectify it. I made a few changes that may help, but, as it worked for
me both before and after the changes, it is difficult to tell whether
they will be of any use.
Puttered around with the FTP client a bit. Added a preferred "Home"
directory option as suggested by sjg1978. (Actually, adapted a working
patch he submitted) Will automatically switch to this directory on the
FTP server when you log on. Made the client a little more general
purpose. Now able to save and recall different host names. User names,
passwords and Home directories will be saved with the different host
names (if that option is selected.) Status box has been moved down to
just below the log window (to make room for the home directory box up
on the top row) Status box now gives a lot more useful information
during transfers. Actually keeps track of progress instead of just
saying uploading/downloading.
Added ability to customize superscript markup. It still defaults to
^{xx} but can be changed to whatever you want. It is not sanity
checked, so if you put markup like "<<<<KYpR%J>"
"$$$$+=*", it will cheerfully use it without a second glance.
Version .27 (612k) Added code to handle mouse wheel events in
WinXP (and apparently some installations of Win 2K, though it always
worked for me on my Win2K system).
Fixed problem where zip file name was being incorrectly added to the
FTP batch.
Removed limitation on uploading into root directory.
Changed order of operations for changing / to ,' and change '' to " to
catch some occurrences that were slipping through.
Modified "cb" fixing code to be a little less greedy. Will no longer
"fix" Macbeth to Macheth
Made "Convert solitary 1 to I" ignore a 1 followed by a full stop.
Added convert initial VV to W option.
Added convert initial !! to H option.
Added convert initial X not followed by e to N option.
Added convert ! in a word to l option.
Changed empty file handling code and average file size calculation to
be more efficient based on suggestions by Elronse.
(Thanks!)
Changed page switching code on search tab to automatically save the
page file if you have made edits.
Changed Search page text window to have some undo capability. WILL ONLY
UNDO CHANGES DONE TO A SINGLE PAGE. Once you switch pages, the changes
are written and the undo buffer is cleared.
Debated quite a bit about how best to implement the spaced double
quotes repair option that papeters requested. Decided to make it
universal rather than hard coding it for double quotes. Added two more
"Alternate" replacement text fields with some more Replace and Replace
& Search buttons beside the corresponding field. Now you can have
up to three alternate replacement terms. The "Replace All" function
uses the first alternate. Tried to make the button layout easy and
quick to use with a mouse.
Changed the FTP tab password entry to be a little more secure. Will now
keep your 5 year old nephew from figuring it out. :roll:
Displays **** instead of the actual password.
Lots and lots of minor tune ups and enhancements to make it more user
friendly. Too many to list (or remember).
In Version .26 ( K) Added option to not extract sub/superscript
from RTF files.
Fixed fcanno (Olde Englifh) routine to skip words that have a
capitalized F at the beginning. For instance, Fire will not be changed
to *ire, since the capital F is unambiguous.
Back ported some of the external program calling routines I developed
for guiguts. Now all the external program calls will work in both
guiprep and winprep
Added "See Image" Button to search page. Allows you to easily compare
text and image for the project pages.
In version .25 (601k) Added a
function very similar to the de-fcanno script Jon Ingram published in
the developers forum. Ported from python to perl and integrated into
the text processing page. Added a new button on text processing page
"Fix Olde Englifh". This will comb through the text and replace any
words spelled with long esses (f) with the modern English equivalent.
(They are not really misspelled. The long s really is an s, it is just very, very
close to looking like an f.) The script will preserve the case of the
original word when it replaces it.
I based the de-fcanno function off of my scannos function, but as
the fcannos dictionary was about 35 times the size of dictionary used
by the scannos function (and that
wasn't any speed demon,) running the fcannos function was nearly
grinding my computer to a halt. I couldn't leave it like that so I went
back and optimized both functions a bit and sped them up by close to 2
orders of magnitude. (found some really, really inefficient code in
there....) Anyway, they are both pretty spritely now. After some
experimentation, I decided not to use the Moby SINGLE.TXT word list to generate my
dictionary. It was TOO complete. There were way too many extremely
uncommon words that were getting pushed as replacements, generating way
too many false positives. After some hunting around I settled on
generating it from the 2of4brif.txt
word list from the 12dicts-4.0.zip
package available at Kevin's
Word List Page. This was somewhat arbitrary, but it generated a much
more reasonably sized list (23000 words instead of 132000) and seems
to generate a lot fewer false positives in practice. It is heavily
slanted toward British spellings as well, which fits in rather well
with the period of most of the texts we are seeing. I've included the
dictionary generation script in the distribution if you want to try
others. It is named fwordgen.pl and requires perl to run. The name of
the word list is hard coded. If you want to try different ones, you'll
need to change the line -- open (WLIST, "<2of4brif.txt"); -- to have
the name of your file instead of 2of4brif.txt. That will generate
fcannos.bin, a serialized hash of words in the format needed by the
script.
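To show the general idea of a long-s dictionary, here is a rough Python sketch. It is not a port of fwordgen.pl or of the script's replacement routine. How the misspellings are generated (every non-final s may be read as f), the function names and the tiny word list are all assumptions for illustration. The capital F exception follows the version .26 note above.

```python
import re
from itertools import product

def long_s_variants(word):
    """Spellings of `word` with one or more non-final 's' read as 'f'."""
    positions = [i for i, ch in enumerate(word[:-1]) if ch == "s"]
    variants = set()
    for choice in product((False, True), repeat=len(positions)):
        if not any(choice):
            continue
        letters = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                letters[pos] = "f"
        variants.add("".join(letters))
    return variants

def build_fcannos(words):
    """Map each misreading to its modern spelling, skipping real words."""
    known = set(words)
    table = {}
    for word in words:
        for variant in long_s_variants(word):
            if variant not in known:   # 'fame' stays 'fame', not 'same'
                table.setdefault(variant, word)
    return table

def fix_olde_englifh(text, table):
    """Replace known misreadings, preserving an initial capital."""
    def repl(match):
        word = match.group(0)
        if word[0] == "F":             # a capital F is unambiguous
            return word
        fixed = table.get(word.lower())
        if fixed is None:
            return word
        return fixed.capitalize() if word[0].isupper() else fixed
    return re.sub(r"[A-Za-z]+", repl, text)
```

The skip for variants that are themselves dictionary words shows why the size of the word list matters: the more obscure words it contains, the more collisions and doubtful replacements there are to deal with.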
If you are planning to run both the scannos fix up and the Olde Englifh
fixup routines, you should definitely run the scannos routine first. Do
not run the scannos routine after the Olde Englifh routine, it will
find lots of false positives.
Fixed a few other minor user interface bugs.
In version .24 (383k) More user requests. Improved how script
deals with tabular data. Optionally insert bar "|" surrounding each
"cell" in a table and try to retain original table spacing as much as
possible. Added automated markup for super and sub script text. Right
now these are hard coded to be TEXish markup: caret-braces "^{X}" for
superscript and underscore-braces "_{X}" for subscript. These may be
made editable markup in a future version, similar to the bold and
italics markup so different projects can use different styles.
Found and fixed bug with underscore handling in the filter
routine that made it impossible to use an underscore for italics markup
(the nominal Gutenberg standard).
Added new filter options "Convert double commas to a double
quote", "Remove space after doublequote if it is the first character on
a line" and "Remove space before doublequote if it is the last
character on a line". (Thanks for the suggestions, Curtis.)
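The three new filter options can each be written as a one-line substitution. This Python sketch shows one plausible reading of them; the function name and the exact patterns are assumptions, not the script's own regexes.

```python
import re

def fix_quotes(text):
    # Convert double commas to a double quote.
    text = text.replace(",,", '"')
    # Remove space after a double quote that is the first character on a line.
    text = re.sub(r'^" +', '"', text, flags=re.MULTILINE)
    # Remove space before a double quote that is the last character on a line.
    text = re.sub(r' +"$', '"', text, flags=re.MULTILINE)
    return text
```

The MULTILINE flag makes ^ and $ match at every line break, so the rules apply to each line of a page rather than only to the start and end of the whole file.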
In version .23 (376k) Sigh... fixed bug on search page
where an edited page wouldn't save unless you were in the midst of a
search.
Poked around in the source of gutcheck and stole a few more checks for
unlikely letter combinations - added to options page. (Thanks Jim!)
Fixed last thing keeping script from running under Linux, thanks to
jneves for bug reports and feedback. Still not 100% functional:
external programs (text editor, image viewer, pngcrush) still are not
functioning, but that's fairly minor. All of the internal routines
should work now. There is essentially a built in text editor on the
search page anyway, and you can run pngcrush as a separate program if
desired.
In version .22 (374k) Added some more functionality to search
tab. Now allows you to cycle through the text files or jump to a
particular file without actually doing a search. Changed logic to
automatically load the first file from the text directory when search
tab is activated. Now caching the list of filenames between calls to
the different search functions to generally speed up operation,
especially for large numbers of files. Altered the save semantics for
changed files slightly to better fit with the new functionality.
Added Zip function to batch upload in FTP client in anticipation of the
option being available soon on the site. Automatically adds all the
files in the upload batch to a zip file named the same as your working
directory. Should make uploads a little faster since it does not
constantly have to negotiate transfers with the FTP server for each
file. Added option to build zip file during batch mode. Paves the way
to make the FTP upload batchable along with the pre-processing.
Moved both new batch options to options page where they should have
been originally.
Changed a few more things which were blocking Linux compatibility.
Trapped error which would sometimes result in the saved settings file
being corrupted and losing your personalized settings.
Trapped bizarre behavior if italics or bold markup is extracted with a
blank markup string.
Updated Manual.
In version .21 (350k) Added a bunch of user requested
items.
Tuned a few things in the newer dehyphenization routine. Deals
better with spaced hyphens at end of line now.
You can now choose the directory name where your png files are stored.
It is no longer hard coded to be "pngs". Change it on the Program Prefs
tab.
Header Removal is now selectably automated for batch processing. It
will automatically remove the top line from every text file. THIS MAY
POSSIBLY REMOVE LINES THAT SHOULDN'T BE REMOVED. USE WITH CARE. It is
highly recommended that header removal be done in interactive mode if
feasible.
The header removal function has been made a little smarter. It will no
longer remove lines that contain the zero byte file text marker -
[Blank page], by default.
If header removal is run in batch mode, it will automatically run the
Fix Zero Byte Files routine after
it finishes. In this case, it is not necessary to select it on the
Process Text tab since that will only make it run twice.
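The batch header removal rule described above (drop the top line unless it holds the blank page marker) could be sketched as follows. The helper name remove_header is hypothetical, and the real routine works on files rather than on a list of lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from guiprep): remove the first line of a
# page unless it carries the zero byte file marker "[Blank page]".
sub remove_header {
    my (@lines) = @_;
    return @lines unless @lines;                      # empty page: nothing to do
    return @lines if $lines[0] =~ /\[Blank page\]/;   # keep the marker line
    shift @lines;                                     # drop the presumed header
    return @lines;
}

print join("\n", remove_header("CHAPTER I.   12", "It was a dark night.")), "\n";
```

This also shows why the batch option needs care: any first line that is not a header gets removed just the same.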
There is a new tab with basic search & replace functions that you
can run against the text files. Will automatically search through all
of the text files. Useful for project specific spell checks that you'd
like to run. Select Case Insensitive search or Whole Word search or
combinations thereof to further narrow down the search target.
Disabled the "standard project directory name" check in the "make
remote directory" function of the FTP client. Has become moot with
recent changes to the site code.
Fixed a few inconsistencies in the FTP download logic.
Combed through code trying to reduce Linux incompatibilities. As far as
I can tell without actually trying to run it, there are only three
places where the code is Linux incompatible: the three external program
hook subroutines - testart(), ivstart() & pngcrushstart() [text
editor start, image viewer start and pngcrush start]. Need to get access
to a Linux system to get them working. There may be others, but they
are the ones I know about.
Went through most of program, cleaned up code, improved commenting and
indenting. Generally tried to make program more maintainable. Updated
manual.
In version .20 (353k) Major update. Added new dehyphenate
routine. The original dehyphenate routine is still there and is far
more comprehensive than the new one, but the new one has a huge
advantage in that it only needs one set of text files and is not
dependent on Abbyy FineReader's dehyphenization feature. The new
routine builds a dictionary of all of the words in the text files that do not have a hyphen in them, then
uses that dictionary to decide whether to remove the hyphen from a
split word or not. It will rejoin hyphenated words whether it removes
the hyphen or not. It will make a few educated guesses when it sees
some very common prefixes or suffixes. The new routine looks for a set
of text or RTF files in a "textw" directory. If there is also a "textwo" directory, the
script will automatically use the original dehyphenate routine. Changed
original dehyphenate routine to automatically fall back to the text with
line breaks if a threshold of synchronization errors was reached (currently 3)
in any one file.
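The dictionary approach of the new dehyphenate routine can be sketched roughly as below. This is only an illustration under simplifying assumptions: the function names build_dictionary and rejoin are made up for the example, words are treated as plain ASCII letters, and the prefix/suffix guessing mentioned above is left out entirely.

```perl
use strict;
use warnings;

# Collect every word that has no hyphen attached to it (assumed definition:
# a run of ASCII letters not touching a hyphen or another word character).
sub build_dictionary {
    my (@lines) = @_;
    my %dict;
    for my $line (@lines) {
        while ($line =~ /(?<![\w-])([A-Za-z]+)(?![\w-])/g) {
            $dict{lc $1} = 1;
        }
    }
    return \%dict;
}

# Rejoin words split across line ends. The hyphen is removed only when the
# joined form was seen unhyphenated elsewhere; otherwise it is kept.
sub rejoin {
    my ($dict, @lines) = @_;
    my @out = @lines;
    for my $i (0 .. $#out - 1) {
        next unless $out[$i] =~ /([A-Za-z]+)-$/;
        my $head = $1;
        # Pull the second half (plus any trailing punctuation) off the next line.
        next unless $out[$i + 1] =~ s/^([A-Za-z]+)(\S*)\s*//;
        my ($tail, $punct) = ($1, $2);
        my $joined = $dict->{lc "$head$tail"} ? "$head$tail" : "$head-$tail";
        $out[$i] =~ s/[A-Za-z]+-$/$joined$punct/;
    }
    return @out;
}

my @page = ('The preprocessing step is a pre-', 'processing pass over well-', 'known text.');
print "$_\n" for rejoin(build_dictionary(@page), @page);
```

In this sample "pre-" / "processing" loses its hyphen because "preprocessing" appears elsewhere in the text, while "well-" / "known" is rejoined with the hyphen kept.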
Added much better reporting of what is going on during filtering of
"improbable letter combinations" and scanno replacement. Changed order
that routines run in to make reporting more useful. (Moved rename text
files to before any of the routines that do progress reporting so I
could include a file name.) Changed button order to match. Added a
button and logic to save a copy of the processing log to a file from
the process text tab. Added buttons and logic to the process text tab
to save and revert to backups of the text files.
Moved conversion of Windows codepage 1252 glyphs 80-9F (decimal
128-159) from the extract routine to the filter routine where it really
belonged. Added option for it on Select Options tab.
Made Remove Headers routine more tolerant of filenames with spaces in
them.
When downloading a directory in the FTP client, it will now
automatically make a directory in the selected local directory with the
same name as the selected remote directory and download the files into that directory.
Added a file name filter to the FTP directory download dialog box.
Default (blank) is 'download all files in directory'. If you want to
download only the text files in a directory, put .txt in the filter box. For all of
the PNG files put .png, etc.
You can build more complex pattern matching filters too, if you like.
It uses perl regular expressions to evaluate the pattern, so don't use DOS wildcard expressions
(*.*, *.txt, etc). Added some more word pairs to the scannos list.
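To show why the FTP download filter takes a perl regular expression rather than a DOS wildcard, here is a small sketch. The helper filter_names is hypothetical, and it assumes the filter is applied unanchored and case sensitively, which may not match the script exactly.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from guiprep): return the names matching a
# user-supplied filter. A blank filter selects every file.
sub filter_names {
    my ($filter, @names) = @_;
    return @names if !defined $filter || $filter eq '';
    my $re = eval { qr/$filter/ };    # compile once; traps malformed patterns
    die "Invalid filter pattern '$filter'\n" unless defined $re;
    return grep { $_ =~ $re } @names;
}

my @files = ('001.txt', '001.png', '002.txt', 'notes.txt.bak');

print join(' ', filter_names('.txt',   @files)), "\n";   # 001.txt 002.txt notes.txt.bak
print join(' ', filter_names('\.txt$', @files)), "\n";   # 001.txt 002.txt

# A DOS wildcard is not a valid regular expression: "*" has nothing to repeat.
eval { filter_names('*.txt', @files) };
print "rejected: $@" if $@;
```

The plain .txt filter works for typical projects, but the dot matches any character and the pattern is not tied to the end of the name; \.txt$ is the stricter form.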
In version .19: (354k)
Fixed up a bunch of minor non-fatal errors (warnings). Changed default
watchdog timer to allow longer subroutines to run without raising a
fatal timeout exception. Was giving problems with some users. (Well, one
specific user, but I'm sure it would crop up again sooner or later.)
Made a few of the routines a little more robust/error resistant. The
dehyphenate routine now marks the word in question with "**" when it
gets a synchronization error. Added a few more word pairs to the common
scannos list. Removed the check for double backslashes, no longer
necessary after site update.
In version .18: (357k)
Fixed pngcrush feedback mechanism to work consistently across windows
platforms. Changed it to work predictably no matter what your pngcrush
option settings. Added capability to edit pngcrush command line options
to the Program Prefs tab and changed default pngcrush settings to
something a little more generic.
Tweaked a few of the markup filters to catch boundary conditions
better. Fixed FTP client to understand directory names with spaces in
them. Changed FTP directory download dialog box to custom built one, a
little easier to work with, I think. Added directory download list
display. Changed default FTP host to pgdp01.archive.org. Changed client
to allow editing host name. Tuned a bunch of the FTP functions to work
more intuitively. Just does the right thing. Double clicking on a
directory name on the remote server will change to that directory.
Double clicking on a file name will download that file. Double clicking
on a local file name will open a viewer for the file. Made all of the
FTP routines less fragile.
Wrote modified FTP::put and FTP::get routines that won't block the
calling Tk window to replace the ones in the standard FTP module, which
block Tk very badly. Updates at least once for every 10KB of upload or
download. (You'll get a tick mark in the log box for every 10K of data
transferred).
Changed how external programs are invoked on the header removal page to
be more consistent with other pages.
Fixed missing last drive problem under NT / 2K.
Changed some code in the script which caused problems under WinXP and
perl 5.6.
Lots of code cleanup, added and formatted comments, removed some unused
routines, made indenting style more uniform. Updated manual.
In version .17: (377k) Better
resynchronization after error during Dehyphenization and better
trapping of errors. Finally dehyphenization is as stable as I would
like. In the worst case, it will use the text with line breaks as its
fall back if there are too many errors. Provides more information on
exactly what the problem is when a dehyphenization error occurs. More
efficient markup pattern matching in Filtering routine. Combined about
14 pattern matching searches down to 4. Reworked Pngcrush calling
routine to be compatible with NT based Windows platforms. Provide more
feedback during the pngcrush routine. Improved the FTP client
drastically. Added buttons for Change directory, Download, Rename and
Delete as alternatives to the arcane mouse button - key press
combinations. Added Rename function. Works with both files and
directories. Improved Download function to allow automatic batch
downloading of all the files in a directory. Disabled floppy drive
search on startup. Gets rid of the annoying "No Disk" prompt in XP. Not
really realistic that a project would be on a floppy anyway. Fixed
problem with small caps text not being upper cased on some occasions.
Updated Manual. Added history section. Miscellaneous bug fixes.
In version 16: (374k)
Reworked Process Text tab layout. Combined Process Batch and Do All
Selected button into one Start Processing button. Just does the right
thing depending on mode. Added routine to run pngcrush on your png
image files. Pngcrush is a png size optimizer. Most image generating
programs are not particularly efficient about making the smallest
possible lossless png file. Since the images are uploaded and
downloaded 4 - 6 times during a project, it makes sense to make it as
efficient as possible. Added pop up help buttons on most pages. Added
download and remote delete functionality to FTP client. Updated Manual.
Miscellaneous bug fixes.
In version 15: (319k) Added
basic FTP client to help automatically upload preprocessed projects to
site. Added hook to link in external Image viewer. Added routine to
automatically rename png files in pngs directory under project. Changed
help box to a button activated pop up window on Change Directory page
to make more room for directory and batch listing boxes. Started
putting version number in program title bar to make it easier to track.
Updated Manual. Miscellaneous bug fixes.
In version 14: (202k) Improved
the hooks for the external programs to run them non blocking. (Able to
run more than one at once without locking up guiprep) No longer any
reasonable expectation of Linux compatibility. Added some more
filtering options. Fixed some race conditions. Script now remembers the
window size and location from session to session. Added much better
reporting on processing progress. Renamed guiprepe to winprep. Updated
Manual. Other miscellaneous bug fixes.
In version 13: (202k) Added
hook to link in external text editor so you can view files easily
during Header Removal. Added more filtering options. Improved batch
processing. Added Program Preferences tab to allow you to choose some
settings that don't directly affect the text processing. Script will
remember preference settings. Script now remembers the last directory
you were working in and reopens to there. Modified Interrupt Processing
to interrupt whether in batch OR interactive mode. Script will
interrupt processing if you switch away from the processing window.
Reworked layout to be usable down to VGA resolution. Debut of guiprepe,
(guiprep executable) a compiled windows version of guiprep. Updated
Manual. Miscellaneous bug fixes.
In version 12: (194k) Jon
Ingram edition. Now does batching. Queue up several projects in a batch
and run processing on them sequentially. Updated Manual.
In version 11: (193k) Added
Check For Common Scannos routine & list. Checks for 3400 or so
common scannos. Added lots of new filtering options for improbable
letter combinations and others. Made Text Processing routines batchable
with check boxes to select which one to do. Updated Manual. Lots of bug
fixes.
In version 10: (123k) First gui
version. Made a gui interface to the prep.pl script to allow runtime
option selection without huge command line lists. Renamed to guiprep.pl
to reflect interface change. Linked hrtk.pl header removal tool into
the script as a separate tab. Updated Manual. Created lots and lots of
bugs.
In version 9: (0k) There was no
version nine.
In version 8: (94k) Last
command line version of prep.pl. Added basic header removal command
line scripts and gui tool that implements them (hrtk.pl).