For document images, the most important task is to access the textual information they contain. An important preprocessing step, called binarization, segments the text pixels from the image background. This process can also be viewed as dividing the pixels of the document into two categories: foreground (text) and background.
Usually, to determine each pixel's label, thresholding techniques are applied. These techniques can be divided into global methods (which assign a single threshold to the whole image) and local methods (which assign different thresholds to different pixels). Local thresholding is usually the better choice, because for many degraded document images no single global threshold separates text from background everywhere.
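As a minimal sketch of the two families (not any particular method from the degraded-document literature), here is a plain-NumPy global threshold chosen by Otsu's criterion, and a Niblack-style local threshold of the form T(x,y) = mean + k·std over a window. The window size and k below are arbitrary illustrative choices, not tuned values.

```python
import numpy as np

def global_otsu_threshold(img):
    """Search the single threshold that maximizes between-class variance (Otsu)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)                       # cumulative pixel counts
    cum_mean = np.cumsum(hist * np.arange(256)) # cumulative intensity sums
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1]          # pixels below t (class 0)
        w1 = total - w0          # pixels at or above t (class 1)
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mean[t - 1] / w0
        mu1 = (cum_mean[-1] - cum_mean[t - 1]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def local_niblack(img, window=15, k=-0.2):
    """Niblack-style local threshold: T(x,y) = mean + k*std over a window.
    Returns a boolean mask where True marks (dark) foreground pixels."""
    pad = window // 2
    padded = np.pad(img.astype(float), pad, mode='reflect')
    out = np.zeros(img.shape, dtype=bool)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + window, j:j + window]
            t = patch.mean() + k * patch.std()
            out[i, j] = img[i, j] < t
    return out
```

On a clean synthetic page both agree, but the local rule keeps working when the background intensity drifts across the page, which is exactly the degraded-document case where a unique global threshold fails.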
Pixel intensity and image gradient/contrast are usually good features for threshold selection, and sometimes domain knowledge (such as text stroke information) is applied to produce a better result. However, it is still quite hard to find domain knowledge that can actually be incorporated. Think about how we humans recognize text strokes. The intensity (usually dark on a white background) and the contrast (which lets us separate the text from the background) are easy to identify. But is that enough? How do we recognize a character even when its strokes are broken? How do we separate text strokes from both the background and the noise? And even when characters are mixed together, we can still recognize them. The point is, we can do this because we know what text is and what characters are; even if we don't know French or Japanese, we can still point out the characters instead of treating them as noise.
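To make the contrast feature above concrete, here is one simple way it could be computed: a local max-min contrast over a small window, which responds strongly at stroke edges and stays near zero in flat background. This is only an illustrative sketch; the window size is an arbitrary assumption, not a recommendation from the text.

```python
import numpy as np

def local_contrast(img, window=3):
    """Local max-min contrast: (max - min) / (max + min) over a small window.
    High values mark sharp intensity transitions such as stroke edges."""
    pad = window // 2
    p = np.pad(img.astype(float), pad, mode='reflect')
    out = np.zeros(img.shape)
    eps = 1e-6  # avoid division by zero in all-black regions
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = p[i:i + window, j:j + window]
            out[i, j] = (patch.max() - patch.min()) / (patch.max() + patch.min() + eps)
    return out
```

The normalization by (max + min) makes the feature less sensitive to the absolute brightness level, which is useful when illumination varies across the page.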
What I believe is that there should be some common sense about what a character is, one that applies to all the languages in the world. But I don't know what it is, or how to program it.
So from my point of view, to really propose a method that can replace a human, we must view binarization as a learning problem. I don't know whether this is right or wrong~~~