I am trying to identify paragraphs of text in a .pdf document by first converting it into an image then using OpenCV. But I am getting bounding boxes on lines of text instead of paragraphs. How can I set some threshold or some other limit to get paragraphs instead of lines?
Here is the sample input image:
Here is the output I am getting for the above sample:
I am trying to get a single bounding box on the paragraph in the middle. I am using this code.
import cv2
import numpy as np

large = cv2.imread('sample image.png')
rgb = cv2.pyrDown(large)
small = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)

# kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
kernel = np.ones((5, 5), np.uint8)
grad = cv2.morphologyEx(small, cv2.MORPH_GRADIENT, kernel)
_, bw = cv2.threshold(grad, 0.0, 255.0, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1))
connected = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)

# Using RETR_EXTERNAL instead of RETR_CCOMP
contours, hierarchy = cv2.findContours(connected.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# For OpenCV 3.x, where findContours returns three values, comment the
# previous line and uncomment the following one:
# _, contours, hierarchy = cv2.findContours(connected.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

mask = np.zeros(bw.shape, dtype=np.uint8)
for idx in range(len(contours)):
    x, y, w, h = cv2.boundingRect(contours[idx])
    mask[y:y+h, x:x+w] = 0
    cv2.drawContours(mask, contours, idx, (255, 255, 255), -1)
    r = float(cv2.countNonZero(mask[y:y+h, x:x+w])) / (w * h)
    if r > 0.45 and w > 8 and h > 8:
        cv2.rectangle(rgb, (x, y), (x+w-1, y+h-1), (0, 255, 0), 2)

cv2.imshow('rects', rgb)
cv2.waitKey(0)
This is a classic use for dilate. Whenever you want to connect multiple items together, you can dilate them to join adjacent contours into a single contour. Here's a simple approach:
- Convert image to grayscale and Gaussian blur
- Otsu's threshold
- Dilate to connect adjacent words together
- Find contours and draw bounding rectangles
Otsu's threshold
Here's where the magic happens. We can assume that a paragraph is a section of words that are close together; to achieve this, we dilate to connect adjacent words.
Result
import cv2
import numpy as np

# Load image, grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (7,7), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Create rectangular structuring element and dilate
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
dilate = cv2.dilate(thresh, kernel, iterations=4)

# Find contours and draw rectangle
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 2)

cv2.imshow('thresh', thresh)
cv2.imshow('dilate', dilate)
cv2.imshow('image', image)
cv2.waitKey()