Posted by Catur Nugroho
How to read text from an XPS document using C#

This article is a personal note from the author. If the project is ever lost or forgotten, the code will be easy to find on this page. I have used this code for more than two years and it has really helped me.

I'm sorry, my English is not very good, so writing this article was a little difficult.

First, add references to the following .NET class libraries:
1. PresentationCore
2. PresentationFramework
3. ReachFramework
4. System
5. System.Core
6. System.Data
7. System.Data.DataSetExtensions
8. System.Deployment
9. System.Drawing
10. System.Runtime.Serialization
11. System.ServiceModel
12. System.Windows.Forms
13. System.Xml
14. System.Xml.Linq
15. WindowsBase

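Add these using directives at the top of your source file (the first two are usually there already):
using System;
using System.Collections.Generic;
using System.Threading;
using System.Windows.Xps.Packaging;
using System.Windows.Documents;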


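I usually run the function on a separate thread. The thread must be set to single-threaded apartment (STA) mode before it starts, because the WPF objects used here require it. You can use the following code:
Thread _thMain = new Thread(new ThreadStart(StepOne)); // StepOne is the method shown below.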
_thMain.IsBackground = true;
_thMain.SetApartmentState(ApartmentState.STA);
_thMain.Start();
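
The methods below are void, so the extracted text stays inside the worker thread. One possible way to get a result back to the caller is sketched here. StaRunner and Run are hypothetical names of mine, not part of .NET, and the sketch assumes Windows, because SetApartmentState is only supported there.

```csharp
using System;
using System.Threading;

public static class StaRunner
{
    // Hypothetical helper: runs 'work' on a new STA thread, waits for it
    // to finish, and returns its result to the calling thread.
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        Thread thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; } // keep the failure for the caller
        });

        thread.IsBackground = true;
        thread.SetApartmentState(ApartmentState.STA); // must be set before Start (Windows only)
        thread.Start();
        thread.Join(); // blocks the caller until the work is done

        if (error != null)
            throw new InvalidOperationException("The STA worker failed.", error);

        return result;
    }
}
```

If StepOne were changed to return its string, it could be called as string text = StaRunner.Run(() => StepOne());. Because Join blocks, this suits a console program or a background job better than a UI thread.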


The first way:
private void StepOne()
{
    List<string> lData = new List<string>();
    using (XpsDocument xpsDoc = new XpsDocument(@"C:\sample.xps", System.IO.FileAccess.Read))
    {
        FixedDocumentSequence docSeq = xpsDoc.GetFixedDocumentSequence();
        for (int pageNum = 0; pageNum < docSeq.DocumentPaginator.PageCount; pageNum++)
        {
            DocumentPage docPage = docSeq.DocumentPaginator.GetPage(pageNum);

            // Only the direct children of the page are visited;
            // Glyphs nested inside another element (for example a Canvas) are not read.
            foreach (System.Windows.UIElement uie in ((FixedPage)docPage.Visual).Children)
            {
                Glyphs glyphs = uie as Glyphs;
                if (glyphs != null)
                {
                    // Each Glyphs element holds one run of text.
                    lData.Add(glyphs.UnicodeString);
                }
            }
        }
    }

    // Join all runs into one string, skipping empty ones.
    string strText = string.Empty;
    foreach (string strItem in lData)
    {
        if (string.IsNullOrEmpty(strItem))
            continue;

        strText += strItem;
    }

    // strText now holds the text of the whole document; use it here.
}


The second way:
private void StepTwo()
{
    List<string> lData = new List<string>();
    using (XpsDocument xpsDoc = new XpsDocument(@"C:\sample.xps", System.IO.FileAccess.Read))
    {
        FixedDocumentSequence docSeq = xpsDoc.GetFixedDocumentSequence();
        for (int pageNum = 0; pageNum < docSeq.DocumentPaginator.PageCount; pageNum++)
        {
            DocumentPage docPage = docSeq.DocumentPaginator.GetPage(pageNum);
            foreach (System.Windows.UIElement uie in ((FixedPage)docPage.Visual).Children)
            {
                Glyphs glyphs = uie as Glyphs;
                if (glyphs == null)
                    continue;

                string strUnicode = glyphs.UnicodeString ?? string.Empty;

                // Indices holds ';'-separated entries such as "12,160;;45".
                // Empty entries are kept so that entry i still belongs to character i.
                // This assumes one entry per character (no cluster mappings).
                string[] astrIndices = (glyphs.Indices ?? string.Empty).Split(';');
                string strFormattedLine = "";
                for (int i = 0; i < strUnicode.Length; i++)
                {
                    strFormattedLine += strUnicode[i];

                    // The second field of an entry is the advance width.
                    string[] astrFields = i < astrIndices.Length ? astrIndices[i].Split(',') : new string[0];
                    double dAdvance;
                    if (astrFields.Length > 1
                        && double.TryParse(astrFields[1], System.Globalization.NumberStyles.Float,
                                           System.Globalization.CultureInfo.InvariantCulture, out dAdvance)
                        && dAdvance > 150
                        && strUnicode[i] != ' ')
                    {
                        // A wide advance means a visible gap, so add a space.
                        strFormattedLine += " ";
                    }
                }

                lData.Add(strFormattedLine);
            }
        }
    }

    // Join all runs into one string, skipping empty ones.
    string strText = string.Empty;
    foreach (string strItem in lData)
    {
        if (string.IsNullOrEmpty(strItem))
            continue;

        strText += strItem;
    }

    // strText now holds the text of the whole document; use it here.
}


The difference between the two ways is spacing.
Sometimes the characters in a line run together with no clear boundary between values, which makes it hard to pick out the data we want. The second way adds a space after every character whose advance width in Indices is greater than 150.
Example:
The first way:
444444444444444444444
The second way:
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
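
The spacing rule can also be tried without an XPS file. The following is a minimal sketch of that rule as a plain string function. GlyphSpacing and InsertSpaces are hypothetical names of mine, the threshold of 150 is the value used in this article, and the sketch assumes that Indices has one entry per character (no cluster mappings).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class GlyphSpacing
{
    // Hypothetical helper (not part of WPF): appends a space after every
    // non-space character whose advance width exceeds 'threshold'.
    public static string InsertSpaces(string unicode, string indices, double threshold)
    {
        if (string.IsNullOrEmpty(unicode))
            return string.Empty;

        // Keep empty entries so that entry i still lines up with character i.
        string[] entries = (indices ?? string.Empty).Split(';');
        StringBuilder line = new StringBuilder();

        for (int i = 0; i < unicode.Length; i++)
        {
            line.Append(unicode[i]);

            // An entry looks like "GlyphIndex,AdvanceWidth,..."; any field may be empty.
            string[] fields = i < entries.Length ? entries[i].Split(',') : new string[0];
            double advance;
            bool wide = fields.Length > 1
                && double.TryParse(fields[1], NumberStyles.Float, CultureInfo.InvariantCulture, out advance)
                && advance > threshold;

            if (wide && unicode[i] != ' ')
                line.Append(' ');
        }

        return line.ToString();
    }
}
```

For example, GlyphSpacing.InsertSpaces("444", "5,200;5,200;5,200", 150) puts a space after each 4, while a string with an empty Indices value is returned unchanged.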

I hope this article helps you work with text in XPS documents.

Good luck!
