I just made a discovery that will be of no interest to the non-technical folks out there.
If you use Java's built-in Scanner (as I've done hundreds of times) to read in a bunch of text, it turns out that if any of that text is not in the expected encoding, the Scanner silently treats it as unreadable: the decoding error is swallowed internally, hasNext() becomes false, and for all intents and purposes it looks like end of file. Here's the catch: this happens as soon as the Scanner reads the bad character into its buffer, *not* when your cursor catches up to it.
The way this manifests is that your data seems to be silently truncated for no apparent reason. If you look at the portion of the file where it stops, there appears to be nothing wrong there, and there isn't; the problem is somewhere in the next few hundred characters, wherever the bad byte happened to land in the Scanner's read-ahead buffer.
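If you want to actually detect this rather than just suspect it, the Scanner does keep the exception it swallowed: the documented ioException() method returns the last IOException thrown by the underlying source, and a decoding failure should surface there. A minimal sketch (the filename argument handling is just for illustration):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;

    public class ScannerTruncationDemo {
        public static void main(String[] args) throws FileNotFoundException {
            Scanner in = new Scanner(new File(args[0]), "UTF-8");
            int lines = 0;
            while (in.hasNextLine()) {
                in.nextLine();
                lines++;
            }
            // Looks like EOF -- but check whether the Scanner actually hit
            // a decoding error and quietly gave up partway through.
            if (in.ioException() != null) {
                System.err.println("stopped after " + lines
                        + " lines: " + in.ioException());
            }
            in.close();
        }
    }

Run that on a mostly-UTF-8 file with a stray byte in it and you should get the line count where it bailed, plus the MalformedInputException it never bothered to mention.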
The workaround is that, if you know what encoding your input uses (and you're sure there's no noise in it), you can specify it:

    Scanner in = new Scanner(new File(filename), "ISO-8859-1");

(similarly "UTF-8"). If you expect your data might be noisy and you don't have access to it in advance to clean it up, I'm not sure the Scanner constructors alone will do it, although you can get there by rolling your own Reader, as sketched below.
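Specifically, you can build a Reader around a CharsetDecoder configured to replace malformed input rather than report it, and hand that to the Scanner(Readable) constructor. A sketch, assuming UTF-8 data where each run of bad bytes should just become the U+FFFD replacement character:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.InputStreamReader;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.util.Scanner;

    public class NoisyScanner {
        public static Scanner openLenient(String filename)
                throws FileNotFoundException {
            // A decoder that substitutes U+FFFD for bad bytes instead of
            // throwing -- the throw is what the Scanner silently swallows.
            CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            return new Scanner(new BufferedReader(
                    new InputStreamReader(new FileInputStream(filename), dec)));
        }
    }

This reads everything, at the cost of mojibake wherever the noise was. For what it's worth, ISO-8859-1 "works" on noisy input for a similar reason: every possible byte value is a valid ISO-8859-1 character, so there's nothing for the decoder to choke on.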
That took a stupid amount of time to track down, though. "What do you mean, you're at the end of the file? I can see more data RIGHT THERE."
"When judging the relative merits of programming languages, some still seem to equate "the ease of programming" with the ease of making undetected mistakes." --Edsger Dijkstra
Posted by blahedo at 4:30pm on 2 Apr 2013