unstructured
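An open-source Python library for processing unstructured data: it extracts text content from PDF, Word, HTML, and similar files and converts it into a structured format.
(1) Install the dependencies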
pip install unstructured
Usage
text
from unstructured.partition.auto import partition
filename = "a.txt"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)
docx
from unstructured.partition.auto import partition
filename = "c.docx"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)
This requires installing the docx extra:
pip install "unstructured[docx]"
markdown
from unstructured.partition.auto import partition
filename = "README.md"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)
pdf
from unstructured.partition.auto import partition
filename = "aa.pdf"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)
or
from unstructured.partition.pdf import partition_pdf
filename = "aa.pdf"
docs = partition_pdf(filename=filename)
for doc in docs:
    print(doc.text)
This requires installing the pdf extra:
pip install "unstructured[pdf]"
Note:
The installation requires cmake:
sudo apt install cmake
or
sudo yum install cmake
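The partition calls above return a list of element objects rather than one string, and each element carries a category (for example Title or NarrativeText) alongside its text. The following is a minimal sketch of how that list might be summarized; summarize_elements and preview_chars are hypothetical names introduced here for illustration, and the helper only assumes that each element exposes .category and .text.

```python
from collections import Counter


def summarize_elements(elements, preview_chars=40):
    """Return (count per category, list of (category, text preview)).

    Hypothetical helper: it only relies on each element having
    .category and .text attributes.
    """
    counts = Counter(el.category for el in elements)
    # Skip elements with empty text so the previews stay readable.
    previews = [(el.category, el.text[:preview_chars]) for el in elements if el.text]
    return counts, previews


# Usage, assuming unstructured is installed and README.md exists:
#   from unstructured.partition.auto import partition
#   counts, previews = summarize_elements(partition(filename="README.md"))
#   print(counts.most_common())
```

Counting categories first is a quick way to check whether a file was split into meaningful pieces before printing every doc.text.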
(2) Deploy the service locally
https://github.com/Unstructured-IO/unstructured-api
docker pull downloads.unstructured.io/unstructured-io/unstructured-api:latest
Start the container:
docker run -p 9500:9500 -d --rm --name unstructured-api -e PORT=9500 downloads.unstructured.io/unstructured-io/unstructured-api:latest
Service URL
url = "http://localhost:9500/general/v0/general"
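As a rough sketch of how the locally deployed service could be called, the code below posts a file to the URL above using only the standard library. It assumes the service accepts a multipart/form-data upload in a form field named files and answers with a JSON list of elements that each have a "text" key. The helper names build_multipart, partition_via_api, and element_texts are hypothetical and not part of unstructured.

```python
import json
import mimetypes
import os
import urllib.request
import uuid

# Service URL from the section above; adjust host and port to your deployment.
API_URL = "http://localhost:9500/general/v0/general"


def build_multipart(field, filename, data):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def partition_via_api(path, url=API_URL):
    """POST a local file to the service and return the parsed JSON response."""
    with open(path, "rb") as fh:
        # Assumption: the upload field is named "files".
        body, content_type = build_multipart("files", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def element_texts(elements):
    """Collect the "text" value of each returned element, skipping empty ones."""
    return [el["text"] for el in elements if el.get("text")]
```

With the container running, element_texts(partition_via_api("aa.pdf")) should give roughly the same strings as printing doc.text in the local examples, without installing the pdf extra on the client machine.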