
Indexer

3.3 (40zb)

by zweibieren
@physpics.com

 
Topics
Overview
Getting ready to start
Proper Nouns List
Starting Indexer
Indexer main window
Index Terms window
More about indexterms.txt
Criteria for a good index

Overview

Indexer displays the pages of a book and the index entries for each page. You use Indexer to add entries, delete entries, and create new terms that can be applied as entries to pages. Finally, the Create Index command combines the entries from all pages to produce the index.

[Figure: Indexer main window (blurred)]

Here is the Indexer main window with each text page filling a row; the columns are page number, text, and index entries. Yellow highlights the entries of the active page. The active page is always one of the pages visible on-screen. When entries in the yellow area are selected, they get a blue background. They can then be deleted by clicking the Remove Entry button, one of the buttons in the menu bar atop the window. A message announcing the removal appears in the message area at the bottom.

[Figure: Index Terms window, blurred]

Index terms are in a separate window like the one at the right. You can have multiple terms windows showing different parts of the list. A selected term is highlighted in blue. It will be inserted in the yellow area if the Add Entry button is clicked. The Create new term... button prompts for and adds a new term to the list. Type a term in the list window and the list scrolls to the first term starting with what you typed.

Terminology

A "term" is an item that may appear in the index. It will not appear if no page has an entry for it.

An "entry" is a term that has been chosen for a page. That term will appear in the index and will have the current page among its page numbers.

For instance, suppose "labor party" is a term and it is added as an entry for page 20. In the final index, the entry for "labor party" will list 20 as one of its pages.

A "project directory" is the directory with all files for generating an index for one book. The files include the list of all terms, one text file for each chapter, the index entries for each chapter, and the final output index file.

Output

The index can be output as text or HTML. Here's a sample that was generated in HTML.
labor force
    composition of, 67, 94, 96, 103, 131, 188n8–9, 191n10
    growth in, 103, 108, 110, 111, 117, 174, 177, 180
labor party, xi, 14, 15, 20, 21, 122, 124–127, 184n3
labor union
    membership, 4, 15, 16, 35, 36, 52, 57, 59, 74, 81, 93, 94, 159, 189n21, 194n19

The actual generated file starts:

<dl style="margin:0;">
<dt class='indexermainterm'>labor force</dt>
<dd class='indexersubterm'>composition of, 67, 94,
96, 103, 131, 188n8&ndash;9, 191n10</dd>

The format is defined in CSS. Each main term is of class indexermainterm and each subterm is of class indexersubterm. The appearance can be changed by adjusting the <style> declaration. The style for the above is the Indexer default:

	<style type="text/css">  	
		.indexermainterm { padding-left: 1em; text-indent: -1em; }  
		.indexersubterm { padding-left: 1em; text-indent: -1em; }
	</style>

Getting Ready to Use Indexer

Indexing is done in a project directory devoted to indexing one book. Before running Indexer, the project directory must contain the text of the book and a list of index terms. Notionally the text is in one xxxx.txt file per chapter (Chap1.txt, Chap2.txt, ...), but any subdivision of the book is acceptable, even the "subdivision" that is simply the entire book. The text of each page must follow a line starting with $@ and continuing with the page number. The list of terms must follow the format below and be in a file named indexterms.txt.

After running Indexer, there will be one xxxx-index.txt file for each chapter. The final index is generated to a file named index.txt or index.html.

Chapter text files: xxxx.txt

Each section of the book needs to be in the project directory as a text file. Each page of text must begin with a line containing  "$@" and the page number:
    $@1
          Chapter 1.
          Call me Ishmael. ...
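
To check that a chapter file is Indexer-ready before opening it, a short script can split it at the $@ lines and report the pages it finds. The following is only a sketch of the format described above, not Indexer's own reader; the function name split_pages is made up for illustration, and it assumes that "$@" sits at the very start of the line.

```python
def split_pages(text):
    """Split chapter text into {page number: page text} at "$@" lines.

    Page numbers are kept as strings so that roman numerals such as
    "xi" survive. Text before the first "$@" line is discarded.
    """
    pages = {}
    current = None
    for line in text.splitlines():
        if line.startswith("$@"):
            # Everything after "$@" on this line is the page number.
            current = line[2:].strip()
            pages[current] = []
        elif current is not None:
            pages[current].append(line)
    return {number: "\n".join(lines) for number, lines in pages.items()}


if __name__ == "__main__":
    sample = "$@1\n  Chapter 1.\n  Call me Ishmael.\n$@2\nSome years ago\n"
    for number, body in split_pages(sample).items():
        print(number, repr(body))
```

A chapter whose result is empty has no $@ lines at all and needs page number lines added.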

The initial .txt file for the book can be created by "Save as" from most word processors. In Microsoft Word, the option appears in the dialog box as "Plain Text (*.txt)". (If your word processor lacks this amenity, email me.) When the document contains special characters, MS Word will prompt you for an encoding. Choose "UTF-8" or "Unicode(UTF-8)." After creating the text file, break it up into sections and add page number lines with a text editor. Wordpad works well. Or emacs, if you have it.

Indexer does rudimentary formatting on text:

  • Paragraphs are created for empty lines or for lines beginning with whitespace.
  • A line beginning with digit-dot-digit is a heading.
  • A line beginning "Chapter digit" is a heading.

Headings are bold and centered.
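
The three rules can be read as a simple line classifier. The sketch below takes the rules literally (a single digit, a dot, a digit; the word "Chapter", a space, a digit) and tests them at the very start of the line. Both choices are assumptions, and classify_line is an illustrative name, not part of Indexer.

```python
import re

# Literal reading of the heading rules: "1.2 ..." or "Chapter 1 ...".
HEADING = re.compile(r"\d\.\d|Chapter \d")


def classify_line(line):
    """Return "heading", "paragraph" (starts a new paragraph), or "text"."""
    if HEADING.match(line):
        return "heading"
    if not line.strip() or line[0].isspace():
        return "paragraph"
    return "text"


if __name__ == "__main__":
    for sample in ["Chapter 1.", "1.2 Wages", "", "   Call me Ishmael.", "Some years ago"]:
        print(repr(sample), "->", classify_line(sample))
```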

Terms file: indexterms.txt

The lines of indexterms.txt mostly define index terms. The simplest form is
      phrase  WHITE  term
where WHITE is some combination of tabs and spaces. Since phrase and term can each have spaces, WHITE must be at least one tab or two spaces. More are okay.

The phrase is employed when Indexer scans a chapter text; it scans for instances of the phrase and where it finds one, inserts the corresponding term as an entry for the page. When inserting terms from the Index Terms window, only the term is employed.

Phrase words can contain only letters, hyphens, and apostrophes. Other characters are ignored. The phrase can be omitted, in which case the term is never automatically added to a page by the initial scan. If the phrase is left out, there must be leading white space, as in
     WHITE term

For narrower categories, index terms are often subdivided with subterms. An index term with a subterm is written in the form
    phrase WHITE term SPACE COLON SPACE subterm

The corresponding index entry will appear as
    term
       subterm  xx, xx, ... (page numbers)

See below for more about indexterms.txt.
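
If you build indexterms.txt by hand or from a spreadsheet, it can help to check each line against the three forms above. Here is a minimal sketch of such a check. It is not Indexer's parser; the names TermLine and parse_term_line are invented for the example, and a line that fits none of the forms is simply rejected.

```python
import re
from typing import NamedTuple, Optional

# WHITE: at least one tab or two spaces, then any further tabs or spaces.
WHITE = re.compile(r"(?:\t| {2})[ \t]*")


class TermLine(NamedTuple):
    phrase: Optional[str]   # None when the phrase is omitted
    term: str
    subterm: Optional[str]  # None when there is no " : subterm"


def parse_term_line(line):
    """Parse one term line; return None for a blank line."""
    text = line.rstrip("\r\n")
    if not text.strip():
        return None
    if text[0] in " \t":
        # Leading white space means the phrase was left out.
        phrase, rest = None, text.strip()
    else:
        parts = WHITE.split(text, maxsplit=1)
        if len(parts) != 2 or not parts[1].strip():
            raise ValueError("expected: phrase WHITE term, got %r" % text)
        phrase, rest = parts[0].strip(), parts[1].strip()
    term, sep, subterm = rest.partition(" : ")
    return TermLine(phrase, term.strip(), subterm.strip() if sep else None)


if __name__ == "__main__":
    print(parse_term_line("Levi  Levi, Margaret"))
    print(parse_term_line("interstate competition\tlabor costs, state : interstate competition"))
    print(parse_term_line("   United States of America"))
```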

Proper Nouns List

To get started building an index terms list it may be useful to have a list of the proper nouns that appear in the text. The installation includes a rudimentary program for sifting your text for proper nouns. No such program will be perfect and this one is a tad simplistic. Here are some of the phrases it extracted from one manuscript:

H. L. Mencken of the Baltimore Sun
Number
O'Hair's
OFA
Obama
Obama Justice Department
Obama and McCain
Obama and the Democratic Party
Obama's
Office of Faith-Based Initiatives
pro-Israel AIPAC
Roe v. Wade

Some of the principles the tool employs are these:

  • Generally, a proper noun is capitalized and a noun phrase is a sequence of proper nouns.
  • First words of sentences are ignored. A sentence ends with a period, question mark, or exclamation point. If a sentence begins "Barack Obama ... " only "Obama" would be extracted as a proper noun.
  • Noun phrases extend through capitalized words and through words that are articles, prepositions, and conjunctions. Sometimes this inadvertently combines two phrases as in "Obama and the Democratic Party."
  • Words with mixed-case like eTrade and eBay are considered proper nouns.
  • Apostrophes and dashes are accepted if contained within words.
  • Noun phrases do not contain other punctuation; not even commas. The phrase "Number" above follows a colon in the text.
  • A single capital letter followed by a period is treated as a proper noun. "v." is a noun as a special case.
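
To make a few of these principles concrete, here is a much-reduced sketch. It is not the code of the ProperNouns tool: it covers only capitalized and mixed-case words, skipped sentence-first words, connector words, and phrase breaks at punctuation, and it leaves out initials and the "v." special case. The CONNECTORS set is a small illustrative subset and every name in it is an assumption.

```python
import re

# Illustrative subset of lower-case words allowed inside a noun phrase.
CONNECTORS = {"a", "an", "the", "of", "and", "or", "in", "on", "for", "to"}

# A word (letters with inner apostrophes or hyphens), or one other character.
TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*|\S")


def is_proper(word):
    # Capitalized ("Obama") or mixed-case ("eBay") words count as proper.
    return any(ch.isupper() for ch in word)


def proper_noun_phrases(text):
    phrases = []
    current = []           # words of the phrase being built
    sentence_start = True  # the first word of a sentence is ignored

    def flush():
        # A phrase must not end on a connector word.
        while current and not is_proper(current[-1]):
            current.pop()
        if current:
            phrases.append(" ".join(current))
        current.clear()

    for tok in TOKEN.findall(text):
        if tok[0].isalpha():
            if sentence_start:
                sentence_start = False
            elif is_proper(tok):
                current.append(tok)
            elif current and tok in CONNECTORS:
                current.append(tok)
            else:
                flush()
        else:
            flush()  # any punctuation ends the phrase
            if tok in ".?!":
                sentence_start = True
    flush()
    return phrases


if __name__ == "__main__":
    sample = ("Yesterday Obama and the Democratic Party met. "
              "Barack Obama spoke about eBay, then left.")
    print(proper_noun_phrases(sample))
```

On the sample text the sketch reports "Obama and the Democratic Party" as one phrase, the same kind of inadvertent combination noted in the list above.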

Results will be best if the references section is NOT scanned with this tool. Author names are usually last-name-comma-first-name, which will be parsed as two names by this tool. I suggest emacs or Excel for processing references.

Running the Proper Nouns List tool

The document must be converted to text form as discussed above. Use UTF-8 if an encoding is necessary to represent all characters.

The tool is run from a command line. Navigate to the Indexer installation directory:

    cd /where/I/put/Indexer

Then give the command

    java -cp Indexer.jar com.physpics.indexer.ProperNouns

When execution begins, the tool will prompt for the name of the file to scan. The output will be written to ProperNounPhrases.txt in the current directory. (It will overwrite any existing file of that name.)

Indexer accepts some words in lower-case within noun phrases. The default list is all pronouns, articles, and prepositions. Words can be added to this list by putting them in a file called addXWords.txt, one word per line.

Starting Indexer

Start Indexer by double-clicking the Indexer.jar file in whatever directory you installed it. (See download instructions.) The download will also have created a shortcut in the same directory. You can click it to start Indexer. Or copy the icon to your desktop, another directory, or the start program menu and click it there.

Once the Indexer window is open, select Open Chapter from the File dropdown menu. The text file you select must be Indexer-ready; that is, it has $@ page number lines and lives in a directory with a file called indexterms.txt. See Getting Ready. On subsequent runs, Indexer will remember the directory and offer its files as options in the Open Chapter dialog box.

"Indexer" Main Window

The main Indexer window names the current file and directory in the title bar:
 
[Figure: the Indexer main window]
 

The three columns show the page number, the contents of the page, and the index terms that have been selected for that page. As the page was read in, Indexer scanned it for trigger phrases (as given in indexterms.txt). In the image above, the phrases "race to the bottom" and "slavery" resulted in index entries of the same names. "Interstate competition" and "Levi" resulted in "labor costs, state" with subhead "interstate competition" and in "Levi, Margaret." The term "United States of America" was added with the Add Entry command.

The phrase "labor costs" is red because that phrase is the trigger for two different index terms. Neither was automatically listed, so you need to review red phrases to see if any index terms should be added for that page. Selecting the entire red phrase makes the Index Terms window scroll to the alphabetically first term for that phrase. Selecting a blue phrase causes the selection to jump to the index entry made for that term.

The index entries on the "active" page are highlighted in yellow. Additions and removals of index entries occur there. When you scroll the text, Indexer makes one of the visible pages active and colors its entries section in yellow. As the text is scrolling you will see empty entry areas. That is because the text is not scanned for trigger phrases until the page is made active (and thus has a yellow area).

Command Buttons

The menu bar has four buttons for the commands of Indexer:
  • Click on an entry in the yellow area and click Remove Entry. The entry is removed.
  • Click on a term in the Index Terms window and click Add Entry. The term is added to the yellow area.
  • Click Create new index term ... and you will be prompted for an entry to be added to the Index Terms window.
  • Click Rescan page and the page will be rescanned for all trigger phrases. If any are found, new terms will be added at the bottom of the entries list.

Rescanning is usually unnecessary. Every time a page is made active it is scanned for terms that have been added since the last time the page entries were modified. However, once an entry has been deleted for a page, the only way to get it back is by selecting the entry in a terms window and using the Add Entry button.

Commands can be invoked from menus, and also from the keyboard:

    Command                      Keystrokes
    Add Entry                    Insert or Control-a, or double-click on term in Index Terms window
    Remove Entry                 Delete or Control-d
    Create new index term ...    Control-n
    Save entries                 Control-s

With Create new index term, you can add a new term or add a cross reference. For adding a term you will see three fields:

[Figure: Adding an index term]

The trigger phrase is one or more words; when a page is scanned, Indexer looks for these phrases. If one is found, its term is added to the entries for the page. The index entry is the main heading field together with the optional sub heading field. A new term is rejected if all three fields exactly match those of an existing term. Otherwise the term is added at its alphabetic location in the Index Terms window and immediately inserted in the indexterms.txt file. It is not added to the active page; to do so, press the Insert key.

Clicking the "Cross reference" tab at the top of the dialog box brings up the fields for entering a cross reference:
[Figure: Four fields for entering a cross-reference: the term where the reference will appear and the term it refers to.]

The "for nickname" term is the term in the index where this cross reference will appear; the "See ..." term is the one that is referred to. The "under" term might be NEA and the "See" term "National Education Association (NEA)". Then the index would have entries
National Education Association (NEA) 12, 20, 44
...
NEA. See National Education Association
(Note the special case for acronyms. The trailing instance of "(NEA)" is stripped from the entry for NEA, but appears in the other entry.)

The File Menu

Open Chapter - Prompts for a new chapter and opens it. The file must be a text file with extension .txt. Pages in the text must each be preceded with a line having $@xxx, where xxx is the page number. The directory for Chapter files is remembered from one editing session to the next.

Save Entries - For chapter xxx.txt, this command creates file xxx-index.txt and stores into it all the index entries. It remembers which entries you have deleted. The chapter is rescanned every time it becomes active, but deleted entries do not come back. Entries are saved automatically when you open another chapter, when you exit the program, and when a five-minute timer fires.

Create Index ... - You are prompted with a list of all the ...-index.txt files in the current directory. When you click "Index in text" or "Index in html", the checked files are read, the entries are sorted, and an index is created in index.txt or index.html, respectively. The html file can be edited with Microsoft Word to convert it to some other format. Or with emacs to modify line endings conveniently.

New Terms Window - A new instance of the Index Terms window is opened. All such windows look and behave alike, except that they may be scrolled differently and each may have its own set of selected entries. The selection is visible only when the window has the input focus.

Exit - Indexer saves any entries. For filename.txt, entries are saved to filename-index.txt. Entries are automatically saved when you switch to another file or exit the program. They are also saved every five minutes.

The Help Menu

About Indexer - Displays some mildly useful information, especially the current directory and file name. You should report the version number in error reports.

The bottom lines of the About window display the current directory and current file.

Help - Brings up a window displaying the ContextHelp file. As the mouse moves across the Indexer windows, the help window scrolls to describe what is under the mouse. F1 will also raise the ContextHelp window. In addition, it jumps the mouse to that window without changing the main window; thus you can explore the Context help.

Enter demo mode - In demo mode, Indexer works on a single built-in file and set of index terms. Creating an index shows it on the screen instead of saving it to a file.  Things to try:

  • Choose menu item File/CreateIndex and either html or text. See the nice index.
  • Click "shadow" at the bottom of the terms list window. It turns blue.
  • Click on a page in the main window. Its index entries turn yellow.
  • Click the Add Entry button at the top. "shadow" gets added to the entries in the yellow area.
  • Choose File/CreateIndex again. Now the index has an entry for "shadow" and a cross reference to it.
If you add "dark" as a term on some page(s), more cross references will appear. (Cross references do not appear unless the term they point at has associated entries.)

"Index Terms" Window

[Figure: Available Index Terms window]

Any term in the "Index Terms" window can be assigned to any page in the text. Scroll through the list. Select a term. It turns blue. Click the Add Entry button, and that term becomes an entry for the current page. Select two or more consecutive terms. They turn blue. Click the Add Entry button, and they all become entries for the page.

If you want a new term, use the Create new index term ... button. If you want another copy of the entire window, use New Terms Window in the File menu. The contents of the window are derived from indexterms.txt in the same directory as the open chapter.

More about indexterms.txt

Besides terms, indexterms.txt may contain blank and comment lines. Comments begin with "//". One comment line can have the form
    // title: title words ...
When the index is generated in html, this book title will appear as the page title for the html page.

The first book indexed had phrases for both New York and New York Times. This works because the longest phrase found is the one used. But "York Times" would not work; the text "New York Times" would be recognized as "New York" and not as an instance of "York Times".
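
The behavior just described is what a left-to-right scan gives when it takes the longest phrase at each position and then moves past it. The sketch below shows that idea only; it is not Indexer's scanner. Matching without regard to case and the name scan_page are assumptions, and a phrase that triggers two terms (a red phrase) is not modeled.

```python
import re

# Phrase words: letters, hyphens, apostrophes. Other characters are ignored.
WORD = re.compile(r"[A-Za-z'-]+")


def scan_page(text, phrase_to_term):
    """Return the terms triggered by text, in order of first appearance."""
    # Key each phrase by its tuple of words; lower-casing is an assumption.
    table = {tuple(WORD.findall(p.lower())): t for p, t in phrase_to_term.items()}
    longest = max((len(key) for key in table), default=0)
    words = WORD.findall(text.lower())
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            term = table.get(tuple(words[i:i + n]))
            if term is not None:
                if term not in found:
                    found.append(term)
                i += n  # the matched words are used up
                break
        else:
            i += 1
    return found


if __name__ == "__main__":
    text = "She read the New York Times daily."
    print(scan_page(text, {"New York": "New York", "New York Times": "New York Times"}))
    print(scan_page(text, {"New York": "New York", "York Times": "York Times"}))
```

With "New York" and "New York Times" as phrases, the sample sentence yields only "New York Times". With "New York" and "York Times", it yields only "New York", because "York" has been used up by the time "Times" is reached.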

As terms are added, they are appended to indexterms.txt. Preceding each new term is a comment like this

// 1308343209406 end of session Fri Jun 17 16:40:09 EDT 2011
Only the long number matters. It is the internal form for the time when the entry was created.
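
The long number in the sample comment is consistent with a count of milliseconds since 1 January 1970 UTC: decoded that way, it gives the same date and time the comment spells out. Assuming that holds for all such comments, a stamp can be decoded as in this sketch; session_time is an illustrative name.

```python
import re
from datetime import datetime, timedelta, timezone

STAMP = re.compile(r"//\s*(\d+)\s+end of session")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def session_time(comment_line):
    """Return the UTC time encoded in an end-of-session comment, or None."""
    match = STAMP.match(comment_line)
    if match is None:
        return None
    # Assumption: the number is milliseconds since the Unix epoch.
    return EPOCH + timedelta(milliseconds=int(match.group(1)))


if __name__ == "__main__":
    line = "// 1308343209406 end of session Fri Jun 17 16:40:09 EDT 2011"
    print(session_time(line).isoformat())
```

For the sample line the result is 20:40:09 UTC on 17 June 2011, which is 16:40:09 EDT.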

Cross reference entries

Cross references are index entries that direct the reader to look at other index entries. They appear in the index as "see ..." and "see also ...," as in
NYT. See New York Times
race/ethnicity
    home ownership and, 25n3
    equality, struggle for (see racial equality, struggle for)
    political party polarization, 102-3 (see also polarization, racial)
    See also Eastern Europeans; Asians.
These are incorporated in indexterms.txt with lines having the form
 index term .SEE. index term
where either index term may be just a main heading, or may be a main heading, " : ", and a sub heading.  For instance
NEA : members .SEE. National Education Association (NEA) : membership
which will generate in the index as
National Education Association (NEA)
    membership 23, 25, 167-71
NEA
    members (see National Education Association, membership)

Neither the term before nor the term after .SEE. can have an associated phrase. To assign a phrase, put in another line that gives the phrase and its index term.

For the indexterms.txt line "xxx .SEE. yyy", the Index Terms window will have a listing of yyy.
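
A cross reference line can be taken apart at the .SEE. marker and then at " : " on each side. The following sketch shows one way to do that when checking a terms file by hand; it is not Indexer's parser, and parse_see_line is a made-up name.

```python
def parse_see_line(line):
    """Split "term .SEE. term" into ((main, sub), (main, sub)).

    A missing sub heading is returned as None. Lines without ".SEE."
    return None.
    """
    left, marker, right = line.partition(".SEE.")
    if not marker:
        return None

    def split_term(text):
        main, _, sub = text.strip().partition(" : ")
        return main.strip(), (sub.strip() or None)

    return split_term(left), split_term(right)


if __name__ == "__main__":
    print(parse_see_line("NEA : members .SEE. National Education Association (NEA) : membership"))
    print(parse_see_line("NYT .SEE. New York Times"))
```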


http://www.asindexing.org/i4a/pages/index.cfm?pageid=3297
Indexing Evaluation Checklist

The Index is the KEY to the book

Is the index to your book or web site good enough for your readers?
Here are some helpful insights for ensuring an excellent index.

"An index is not an outline, nor is it a concordance.
It's an intelligently compiled list of topics covered in the work,
prepared with the reader's needs in mind."

Reader Appropriateness
  • Are the indexed terms appropriate for the intended audience? For example: "heart attack" in a book for the general public, "myocardial infarction" in a book for health professionals; "Taxus" in a work for botanists or horticulturalists, "Yew" in a work for home gardeners.
Main Headings
  • Are the main headings relevant to the needs of the reader? Are they pertinent, specific, comprehensive? Not too general yet not too narrow? Not inane or improbable?
  • Do main headings have no more than 5–7 locators (page references)? If more, they should be broken down into subheadings.
Subheadings
  • Are the subheadings useful? In the example below,
    a) the page ranges are extensive
    b) the subheading "problems with Republicans" may be too general
    Roosevelt, Franklin
       problems with Republicans, 1–32
  • Are subheadings concise, with the most important word at the beginning? For example, not:
    banks
       and relationship to Federal Reserve bank
    but
    banks
        Federal Reserve regulation
  • Unnecessary words and phrases like "concerning" and "relating to" and proliferation of prepositions and articles should be avoided. 
  • Is the number of subheadings about right? More than one column’s worth is probably too many. Are subheadings overanalyzed? Could they be combined? For example, could "dimensions" be substituted for "height," "width," and "length"? Or should some subheadings become main headings with their own subheadings instead?
  • Do subheadings have more than 5–7 locators? If more, they should either be broken down into sub-subheadings or be changed to main headings.
Double Postings
  • For the reader’s convenience, many subheadings should be double posted—that is, they should exist as main headings too. An example: "Cats: Siamese" and "Siamese cats." Has this been done? Double postings should, of course, have the same locators. Do they?
Locators (Page References)
  • Are the locators accurate? Check a sample of entries to see. Spot-check pagination for nonsense numbers where the hyphen or en dash may be missing, such as 18693 for 186-93. Check that elision (page ranges such as 186-93) is consistent.
  • When locators include roman numerals or volume numbers, does the typography make the usage clear?
Cross-References
  • Have see and see also cross-references been provided?
  • A see should direct the reader to a different term expressing the same concept, such as "Clemens, Samuel. See Twain, Mark" or "aerobics see exercise".
  • A see also should guide the reader from a complete entry to the related entries for more and different information. Examples: "Mammals: 81, 85, 105; see also names of individual mammals" "astronomy 12–14, 56, 68. See also galaxies; planets"
Length and Type
  • Is the index length adequate for the complexity of the book? An index should be 3–5% of the pages in the typical nonfiction book, perhaps 5–8% for a history or biography, and more (15–20%) for reference books.  
  • Is there a need for more than one type of index? For example, in addition to the usual subject index, perhaps a separate name or place index is called for. If so, is there one?
Format
  • Is the type large enough to be easily read? Do the index pages look open and not crowded?
  • Are the main headings and subheadings (and sub-subheadings if any) distinguished from each other?
  • Is the organization—whether alphabetical, chronological, or other—accurate, clear, and consistent?
  • When an entry’s subheadings "turn a page," that is, are continued from a right-hand page to a left-hand page, the main heading should be repeated, followed by the word continued in parentheses. Depending on the size of the pages, continued headings might be appropriate for continuations from left to right pages, or even from left to right columns. Are they present?
  • Preferences for punctuation between main headings and their subheadings and see and see also cross-references will vary from publisher to publisher. This discussion features several acceptable variants. The important thing is that the punctuation style be clear to the reader and consistent. Is it?
Courtesy of the Chicago/Great Lakes ASI Chapter