Long ago, discussions about exception handling in PDFBox concentrated
on "exposing internal state" and using a canned package
to handle exceptions. To me, it seemed the discussion missed a key point:
Each Exception is a hint that the code needs more thought.
Some data. For the then current version of PDFBox 2.0:
lines of code: 33352
number of "throw new" statements: 746
ratio: 2.2% of all lines are throw statements
(An additional 38 lines throw
Exceptions from variables instead of from new... . These are sterile exceptions; they contain no information describing the error.)
In contrast, the com.javax source code has
646455 lines
2650 "throw new"
0.4% throw statements
and JPedal, another pdf library, has
208265 lines
364 "throw new"
< 0.2% throw statements
PDFBox has a throw-statement-density of more than five times that of javax and ten times that of JPedal. Why the difference?
I suggest that the difference is in the maturity of the code. Exceptions are excellent tools when initially composing a program. The basic flow can be written without burdening it with a plethora of error handling code.
After the code is written and working, among the next steps is to reconsider each Exception to see if the error can be handled more gracefully. Some exceptions will certainly remain, but many can be logged and then ignored, repaired, or proxied. (For logging to be viable, PDFBox must adopt a default log examiner that both programmers and end-users can easily access or ignore without delving into the mysteries of log4j. log4j can remain, as long as there is a simple default configuration file that makes te log accessible to a PDFBox log reader application.)
Lets look at some classes. The number of "throw new" instances in the top ten PDFBox classes
36 pdfbox/pdmodel/edit/PDPageContentStream.java
32 xmpbox/xml/DomXmpParser.java
28 pdfbox/pdfparser/BaseParser.java
27 pdfbox/pdfparser/NonSequentialPDFParser.java
22 fontbox/afm/AFMParser.java
20 pdfbox/pdfwriter/COSWriter.java
20 pdfbox/preflight/content/StubOperator.java
18 pdfbox/pdmodel/font/PDFontDescriptorAFM.java
17 fontbox/ttf/CMAPEncodingEntry.java
17 pdfbox/pdfparser/ConformingPDFParser.java
That several of these are parsers suggests that PDFBox has parsers that abort at the first sign of trouble. If you have ever tried to use such a parser, you know the frustration of making a run and only fixing only one problem, over and over again.
PDPageContentStream.java
Most of the exceptions thrown in PDPageContentStream are for illegal values of inTextMode . For instance, calling drawString() while inTextMode is false or calling setLineDashPattern() while inTextMode is true . These would seem to be programmer errors. However, in many applications, as in LayerUtility , the content comes from some existing document beyond the programmer's control. Instead of an abrupt termination via exception, I would log the error and repair the stream by inserting the beginning or end of a text block as necessary. Instead of
if (inTextMode) throw new Exception ...
I would write
if (inTextMode) { endText(); LOG.warn("inventing an endText ... "); }
Then execution could continue and the client-programmer or end-user could check the log for errors.
DomXmpParser.java
This parser is for meta-data: title, that is, date, author, keywords, .... It seems extra galling to abort processing of a file just because the meta-data are incorrect. Since meta-data are mostly optional, the parser can log and skip the offending sections of the source file. Instead of
if (inner == null) throw new XmpParsingException( ...);
process inner ...
the program could say
if (inner == null) log.warn("...");
else { process inner ... }
BaseParser.java
Several of the exceptions in BaseParser look like this:
throw new IOException("expected='R' actual='"+ r +"' "+pdfSource);
This may be useful to the client-programmer, but what is the poor end-user to make of it? He or she may not have any idea that there are objects, let alone references to them, or indirect references indicated with R . And the information printed
for pdfSource is mostly inane. The message needs to be better thought out. Moreover, this aborts the parse in the middle when there may be further progress to be made by repairing or skipping the error. On initial examination it looks like this throw could be replaced with a repair that just inserts the "R ":
log.warn(...); pdfSource.unread(r);
Before doing this, some thought needs to be given to what kinds of tokens can be expected to follow this indirect reference and whether they can begin with the value of r .
BaseParser also twice aborts when a PushBackInputStream overflows. Then the hapless user reads:
Could not push back xxx bytes in order to reparse stream.
Try increasing push back buffer using system property
org.apache.pdfbox.baseParser.pushBackSize
Besides being incomprehensible to an ordinary user, this abort is not even necessary. PDFBox already overrides PushbackInputStream (note the subtle difference in capitalization). The override method could repair the situation by replacing the pushback buffer with a bigger one.
DateConverter.java
I have recently been involved in replacing the code for DateConverter.java . The old code threw an IOException when its argument string was not a recognizable date:
throw new IOException("Error converting date: " + text);
instead of throwing the exception we returned null, which had formerly been returned only if the argument was the empty string.
By returning a dummy value, the exception was proxied. Proxying--a form of repair--is possible when a method parses some text and returns an object for it. Null is a common proxy value, but methods may also return versions of the object they usually return. Perhaps an instance of the object with an error flag turned on.
Conclusion
Rather than argue over philosophy and tools, each instance of "throw new " should be examined and treated as will best fit the needs of programmer/clients and end-users.
|