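Background

When developing business systems, we often need to read information out of images and enter it into the system. The work described here came from wanting to automate that data entry.
After comparing the Chinese-recognition accuracy of several OCR programs, I decided to build the feature on Microsoft OneNote.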

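Preparation

  1. Install OneNote 2010. (Note: the tools in Microsoft Office 2003 included a component package called "Microsoft Office Document Imaging"; later Office versions folded this functionality into OneNote.)
  2. Look for OneNote material online. There is very little of it, and the sample code that does exist is full of pitfalls.
  3. OneNote's image recognition feature is shown in the figure below: put an image on a tab, right-click it, and the command marked with the red box appears. This is the feature I need to call from code: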

onenote show

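Implementation logic

  1. Get the Base64 encoding of the image;
  2. Start OneNote and create a new page in an empty newfile.one;
  3. The new page has XML in a fixed format; write the image's Base64 encoding into the corresponding node;
  4. Once the node is updated, OneNote runs OCR automatically and puts the recognized text into a fixed node;
  5. Read the text out of that node;
  6. Permanently delete the current page (otherwise newfile.one keeps growing).

C# code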
using System;
using System.Configuration;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using Microsoft.Office.Interop.OneNote;

public class OrcImage
{
    private static readonly string tmpPath = AppDomain.CurrentDomain.BaseDirectory + "tmpPath/";
    private static readonly int waitTime = Convert.ToInt32(ConfigurationManager.AppSettings["WaitTime"]);

    private Tuple<string, int, int> GetBase64(string strImgPath)
    {
        return GetBase64(new FileInfo(strImgPath));
    }

    /// <summary>
    /// Gets the Base64 encoding of an image, together with its width and height.
    /// </summary>
    /// <param name="file">The image file.</param>
    /// <returns>Base64 data, width, height.</returns>
    private Tuple<string, int, int> GetBase64(FileInfo file)
    {
        using (MemoryStream ms = new MemoryStream())
        using (Bitmap bp = new Bitmap(file.FullName))
        {
            switch (file.Extension.ToLower())
            {
                case ".jpg":
                case ".jpeg":
                    bp.Save(ms, ImageFormat.Jpeg);
                    break;

                case ".gif":
                    bp.Save(ms, ImageFormat.Gif);
                    break;

                case ".bmp":
                    bp.Save(ms, ImageFormat.Bmp);
                    break;

                case ".tiff":
                    bp.Save(ms, ImageFormat.Tiff);
                    break;

                case ".png":
                    bp.Save(ms, ImageFormat.Png);
                    break;

                case ".emf":
                    bp.Save(ms, ImageFormat.Emf);
                    break;

                default:
                    return new Tuple<string, int, int>("Unsupported image format.", 0, 0);
            }
            // ToArray copies only the bytes written; GetBuffer would also return unused capacity.
            byte[] buffer = ms.ToArray();
            return new Tuple<string, int, int>(Convert.ToBase64String(buffer), bp.Width, bp.Height);
        }
    }

    public string Orc_Img(FileInfo fi)
    {
        // Insert the image into OneNote 2010 through the API that OneNote provides.
        var onenoteApp = new Microsoft.Office.Interop.OneNote.Application();

        string sectionID;
        onenoteApp.OpenHierarchy(tmpPath + "newfile.one", null, out sectionID, CreateFileType.cftSection);

        // Page IDs look like {A975EE72-19C3-4C80-9C0E-EDA576DAB5C6}{1}{B0}, i.e. {guid}{tab}{??}.
        string pageID;
        onenoteApp.CreateNewPage(sectionID, out pageID, NewPageStyle.npsBlankPageNoTitle);

        string notebookXml;
        onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);

        var doc = XDocument.Parse(notebookXml);
        var ns = doc.Root.Name.Namespace;
        // Look up the page that was just created rather than the first page of any open notebook.
        var pageNode = doc.Descendants(ns + "Page")
            .FirstOrDefault(p => (string)p.Attribute("ID") == pageID);
        if (pageNode != null)
        {
            // Read the ID only after the null check.
            var existingPageId = pageNode.Attribute("ID").Value;
            Tuple<string, int, int> imgInfo = this.GetBase64(fi);
            var page = new XDocument(new XElement(ns + "Page",
                new XElement(ns + "Outline",
                    new XElement(ns + "OEChildren",
                        new XElement(ns + "OE",
                            new XElement(ns + "Image",
                                new XAttribute("format", fi.Extension.Remove(0, 1)),
                                new XAttribute("originalPageNumber", "0"),
                                new XElement(ns + "Position",
                                    new XAttribute("x", "0"),
                                    new XAttribute("y", "0"),
                                    new XAttribute("z", "0")),
                                new XElement(ns + "Size",
                                    new XAttribute("width", imgInfo.Item2),
                                    new XAttribute("height", imgInfo.Item3)),
                                new XElement(ns + "Data", imgInfo.Item1)))))));
            page.Root.SetAttributeValue("ID", existingPageId);

            onenoteApp.UpdatePageContent(page.ToString(), DateTime.MinValue);

            // Sleep time in milliseconds. Larger images sleep longer so that OneNote can finish OCR.
            int fileSize = Convert.ToInt32(fi.Length / 1024 / 1024); // file size in MB
            System.Threading.Thread.Sleep(waitTime * (fileSize > 1 ? fileSize : 1)); // anything under 1 MB counts as 1 MB

            string pageXml;
            onenoteApp.GetPageContent(existingPageId, out pageXml, PageInfo.piBinaryData);

            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(pageXml);
            XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
            nsmgr.AddNamespace("one", ns.ToString());

            // The node is missing if OCR has not finished or found no text.
            XmlNode xmlNode = xmlDoc.SelectSingleNode("//one:Image//one:OCRText", nsmgr);
            string strRet = xmlNode != null ? xmlNode.InnerText : "No text recognized";

            // Permanently delete what was created (sectionID is passed, so the whole section goes).
            onenoteApp.DeleteHierarchy(sectionID, DateTime.MinValue, true);

            return strRet;
        }

        return "No text recognized";
    }
}
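
The fixed Thread.Sleep in the code above is only a guess at how long OCR will take. One possible alternative, sketched below, is to poll the page XML until an OCRText element shows up or a timeout expires. This assumes OneNote adds OCRText only once recognition has finished, which I have not verified. OcrPolling, WaitForOcrText and the fetchPageXml delegate are hypothetical names that are not in the original source; in the real program the delegate would wrap onenoteApp.GetPageContent.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Xml.Linq;

// Hypothetical helper, not part of the original OCRTools source.
public static class OcrPolling
{
    // Calls fetchPageXml repeatedly until the returned XML contains a non-empty
    // OCRText element or the timeout expires. Returns the text, or null on timeout.
    public static string WaitForOcrText(Func<string> fetchPageXml, TimeSpan timeout, TimeSpan interval)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            XDocument doc = XDocument.Parse(fetchPageXml());

            // Match on the local name so the sketch does not hard-code the OneNote namespace URI.
            XElement ocr = doc.Descendants().FirstOrDefault(e => e.Name.LocalName == "OCRText");
            if (ocr != null && !string.IsNullOrWhiteSpace(ocr.Value))
            {
                return ocr.Value;
            }

            if (DateTime.UtcNow >= deadline)
            {
                return null;
            }

            Thread.Sleep(interval);
        }
    }
}
```

A caller would pass a lambda that fetches the page XML for the page ID, plus a timeout and a polling interval, and treat a null result the same way as the "not recognized" case.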


XML format
/* XML format of an image in OneNote 2010
<one:Image format="" originalPageNumber="0" lastModifiedTime="" objectID="">
    <one:Position x="" y="" z=""/>
    <one:Size width="" height=""/>
    <one:Data>Base64</one:Data>
    // The tags below are generated automatically by OneNote 2010; do not set them from your program. The goal is to read the content of OCRText.
    <one:OCRData lang="en-US">
        <one:OCRText>
            <![CDATA[   text produced by OCR   ]]>
        </one:OCRText>
        <one:OCRToken startPos="0" region="0" line="0" x="4.251968383789062" y="3.685039281845092" width="31.18110275268555" height="7.370078563690185"/>
        <one:OCRToken startPos="7" region="0" line="0" x="39.40157318115234" y="3.685039281845092" width="13.32283401489258" height="8.78740119934082"/>
        <one:OCRToken startPos="12" region="0" line="1" x="4.251968383789062" y="17.85826683044434" width="23.52755928039551" height="6.803150177001953"/>
        <one:OCRToken startPos="18" region="0" line="1" x="32.031494140625" y="17.85826683044434" width="41.10236358642578" height="6.803150177001953"/>
        <one:OCRToken startPos="28" region="0" line="1" x="77.66928863525391" y="17.85826683044434" width="31.46456718444824" height="6.803150177001953"/>
        ................

</one:Image> */
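
Besides OCRText, the OCRToken elements can be useful when the position of a word matters. The sketch below splits the recognized text into tokens with LINQ to XML. It assumes that startPos is a character offset into OCRText, which the sample values above suggest but which I have not confirmed. OcrTokens, Token and Parse are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original OCRTools source.
public static class OcrTokens
{
    public sealed class Token
    {
        public string Text;
        public int Line;
        public double X;
        public double Y;
    }

    // Parses one Image element and returns its tokens in document order.
    public static List<Token> Parse(string imageXml)
    {
        var result = new List<Token>();
        XElement root = XElement.Parse(imageXml);

        // Local names are compared so that the namespace URI is not hard-coded.
        XElement ocrText = root.Descendants().FirstOrDefault(e => e.Name.LocalName == "OCRText");
        if (ocrText == null)
        {
            return result;
        }

        string text = ocrText.Value;
        List<XElement> tokens = root.Descendants().Where(e => e.Name.LocalName == "OCRToken").ToList();
        for (int i = 0; i < tokens.Count; i++)
        {
            // Assumption: a token runs from its startPos to the next token's startPos.
            int start = Math.Min((int)tokens[i].Attribute("startPos"), text.Length);
            int next = i + 1 < tokens.Count ? (int)tokens[i + 1].Attribute("startPos") : text.Length;
            int end = Math.Max(start, Math.Min(next, text.Length));

            result.Add(new Token
            {
                Text = text.Substring(start, end - start).Trim(),
                Line = (int)tokens[i].Attribute("line"),
                X = (double)tokens[i].Attribute("x"),
                Y = (double)tokens[i].Attribute("y")
            });
        }
        return result;
    }
}
```

Grouping the returned tokens by Line gives the text of each recognized line, which can make later pattern matching less sensitive to line breaks.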

/* ObjectID format
The representation of an object to be used for identification of objects on a page. Not unique through OneNote, but unique on the page and the hierarchy.
<xsd:simpleType name="ObjectID">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="{[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}}{[0-9]+}{[A-Z][0-9]+}" />
    </xsd:restriction>
</xsd:simpleType>
*/
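
The pattern in this schema can be turned into a .NET regular expression to sanity-check IDs before they are passed to OneNote. This is a minimal sketch: the braces have to be escaped, and the anchors are added because XSD patterns always match the whole string. ObjectIdFormat and IsValid are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original OCRTools source.
public static class ObjectIdFormat
{
    // .NET translation of the xsd:pattern for ObjectID: {guid}{digits}{letter + digits}.
    private static readonly Regex Pattern = new Regex(
        @"\A\{[a-fA-F0-9]{8}-(?:[a-fA-F0-9]{4}-){3}[a-fA-F0-9]{12}\}\{[0-9]+\}\{[A-Z][0-9]+\}\z");

    // Returns true when the whole string has the ObjectID shape.
    public static bool IsValid(string id)
    {
        return id != null && Pattern.IsMatch(id);
    }
}
```

The sample ID that appears in the code, {A975EE72-19C3-4C80-9C0E-EDA576DAB5C6}{1}{B0}, has this shape.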

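At present the feature works as a desktop application. The goal was for any system to be able to use OCR through a web service interface, but converting it to a web application ran into a problem that the little material available online did not solve. As far as I can tell, the programs that use OneNote's OCR today are all WinForms programs, and they start OneNote in the background while they run. My guess is that this is why it only works as a desktop program.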

warning

Retrieving the COM class factory for component with CLSID {D7FAC39E-7FF1-49AA-98CF-A1DDD316337E} failed due to the following error: 80070005 Access is denied. (Exception from HRESULT: 0x80070005 (E_ACCESSDENIED)).

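This error in the web application is a permissions problem. I tried to fix it the way one configures COM components such as Excel and Word, but I could never find a component with this ID in the DCOM configuration. If anyone knows the answer, please let me know. Thanks.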

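The program's output is shown below. Recognition quality is quite good; what remains is to extract the required information with regular expressions.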
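
As an illustration of that last step, here is a minimal sketch that pulls a date and an amount out of recognized text. The field names, the patterns and the sample layout they target are all hypothetical; real patterns depend on the documents being scanned. OCR output often contains stray spaces, so the patterns allow whitespace between parts.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original OCRTools source.
public static class OcrFieldExtractor
{
    // Year, month, day separated by "-", "/" or the Chinese 年/月 markers.
    private static readonly Regex DatePattern = new Regex(
        @"([0-9]{4})\s*[-/年]\s*([0-9]{1,2})\s*[-/月]\s*([0-9]{1,2})");

    // A label followed by an optional ASCII or full-width colon and a decimal number.
    private static readonly Regex AmountPattern = new Regex(
        @"(?:金额|Amount)\s*[::]?\s*([0-9]+(?:\.[0-9]{1,2})?)");

    // Returns the first date as yyyy-MM-dd, or null when none is found.
    public static string FindDate(string ocrText)
    {
        Match m = DatePattern.Match(ocrText);
        if (!m.Success)
        {
            return null;
        }
        return string.Format("{0}-{1}-{2}",
            m.Groups[1].Value,
            m.Groups[2].Value.PadLeft(2, '0'),
            m.Groups[3].Value.PadLeft(2, '0'));
    }

    // Returns the first labeled amount, or null when none is found.
    public static decimal? FindAmount(string ocrText)
    {
        Match m = AmountPattern.Match(ocrText);
        if (!m.Success)
        {
            return null;
        }
        return decimal.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
    }
}
```

Both methods return null when nothing matches, so the caller can tell a missing field apart from an empty one.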
test image

Source code: OCRTools