Extracting Text from a PDF

10.10.2018


In C#, a few lines of code are enough to build a working PDF text reader.

To do so, add the NuGet package iTextSharp to the project.

A small Windows program using iTextSharp in C# and WPF

C#, WPF: PDF text reader

With iTextSharp

In this example, the PDF document shown on the right was read, and its extracted text was passed to the C# WPF application.

MainWindow

XAML code, MainWindow.xaml

<Window x:Class="PDF_TextReader.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:PDF_TextReader"
        mc:Ignorable="d"
        Title="MainWindow" Height="700" Width="800">
    <Grid>
        <Button x:Name="btnStart" Content="Read PDF" Click="btnStart_Click" HorizontalAlignment="Left" Margin="15,9,0,0" VerticalAlignment="Top" Width="86" Height="33"/>
        <TextBox x:Name="tbxFilename" Text="C:\_Daten\Desktop\VS_Projects\Office\PDF_TextReader\_Test_PDF\test_pdf_import.pdf"
                 Width="631" Height="27" Margin="115,12,0,0" TextWrapping="Wrap" VerticalAlignment="Top" HorizontalAlignment="Left" />
        <ScrollViewer Height="584" Margin="16,71,25.6,0" VerticalAlignment="Top">
            <TextBlock x:Name="lblPDF_Output" Text=""
                       TextWrapping="Wrap" HorizontalAlignment="Stretch" VerticalAlignment="Stretch" />
        </ScrollViewer>
    </Grid>
</Window>

C# code-behind of the window

PdfReader(filename) attaches the iTextSharp reader to a PDF document.

PdfReader pdf_Reader = new PdfReader(sFilename);

 

The C# call PdfTextExtractor.GetTextFromPage reads the complete text of one PDF page and returns it as a single string that includes the line-break characters.

Non-text content such as images, scans, and empty tables is skipped.

sText = PdfTextExtractor.GetTextFromPage(pdf_Reader, 1);
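
Because the page text arrives as one string with embedded line breaks, a common next step is to split it into individual lines. The following is only a minimal sketch in plain C#, assuming the string has already been extracted; the class name PageTextLines and the method SplitLines are hypothetical helpers and not part of iTextSharp.

```csharp
using System;
using System.Linq;

public static class PageTextLines
{
    // Splits extracted page text into trimmed, non-empty lines.
    // Accepts Windows ("\r\n"), Unix ("\n") and old Mac ("\r") line breaks.
    public static string[] SplitLines(string pageText)
    {
        if (string.IsNullOrEmpty(pageText))
        {
            return new string[0];
        }

        return pageText
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToArray();
    }
}
```

With the variable from the line above, the call would be string[] lines = PageTextLines.SplitLines(sText);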

 

 

Code in C#, .NET Framework 4.7

In MainWindow.xaml.cs

using System;
using System.Text;
using System.Windows;

//< using >
using iTextSharp.text.pdf;          //*iTextSharp
using iTextSharp.text.pdf.parser;   //*iTextSharp text reader
//</ using >

namespace PDF_TextReader
{
    /// <summary>
    /// Demo PDF reader: extracts the text of every page and shows it in the window.
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
        }

        private void btnStart_Click(object sender, RoutedEventArgs e)
        {
            //String sFilename = "C:\\_Daten\\Desktop\\VS_Projects\\Office\\PDF_TextReader\\_Test_PDF\\test_pdf_import_bank.pdf";
            String sFilename = tbxFilename.Text;

            //--< read file >--
            StringBuilder sbText = new StringBuilder();
            PdfReader pdf_Reader = new PdfReader(sFilename);
            try
            {
                // iTextSharp page numbers start at 1
                for (int i = 1; i <= pdf_Reader.NumberOfPages; i++)
                {
                    // AppendLine keeps the last line of one page apart from the first line of the next
                    sbText.AppendLine(PdfTextExtractor.GetTextFromPage(pdf_Reader, i));
                }
            }
            finally
            {
                pdf_Reader.Close();   // release the reader's resources
            }
            //--</ read file >--

            //MessageBox.Show(sbText.ToString());
            lblPDF_Output.Text = sbText.ToString();
        }
    }
}
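
The extraction loop does not depend on WPF, so it can also be moved into a small helper that returns the text page by page. The following is a sketch under assumptions, not a definitive implementation: it assumes the iTextSharp (iText 5) NuGet package is referenced, and the names PdfTextHelper and ReadPages are hypothetical.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextHelper
{
    // Returns one string per page of the given PDF file.
    // Throws FileNotFoundException before iTextSharp is touched if the file is missing.
    public static List<string> ReadPages(string filename)
    {
        if (!System.IO.File.Exists(filename))
        {
            throw new System.IO.FileNotFoundException("PDF file not found.", filename);
        }

        List<string> pages = new List<string>();
        PdfReader reader = new PdfReader(filename);
        try
        {
            // iTextSharp page numbers start at 1
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();   // release the reader's resources
        }
        return pages;
    }
}
```

In the click handler this could be used as lblPDF_Output.Text = string.Join(Environment.NewLine, PdfTextHelper.ReadPages(tbxFilename.Text)); keeping the pages separate also makes it possible to show or search a single page later.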

NuGet package: iTextSharp

The iTextSharp package must be added to the WPF project via NuGet.
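
One way to do this in Visual Studio is the Package Manager Console (Tools > NuGet Package Manager > Package Manager Console). The command below is a sketch that installs the latest published iTextSharp version into the default project; no version is pinned here because the original post does not name one.

```powershell
Install-Package iTextSharp
```

The same package can also be added through the "Manage NuGet Packages" dialog of the project.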

iTextSharp is free to use privately. As the package description below states, it is available under the AGPL and under a commercial license, so software that is offered for public sale generally needs the commercial license.

Description of iTextSharp:

NuGet package

iText is a PDF library that allows you to CREATE, ADAPT, INSPECT and MAINTAIN documents in the Portable Document Format (PDF), allowing you to add PDF functionality to your software projects with ease.  We even have documentation to help you get coding.

 

We have two currently supported versions: iText 5 and iText 7. Both are available under AGPL and Commercial license.

* iText 5 AGPL

* iText 7 community: https://www.nuget.org/packages/itext7/

iText 5 is a one solution library that is complex, but well documented to help you create your solutions.

iText 7 is a complete re-write of iText 5, allowing you to choose your adventure with add-ons, all based on a simple, modular code structure that is easy to use and well documented.

 

 

Both versions allow you to:

- Generate documents and reports based on data from an XML file or a database

- Create maps and books, exploiting numerous interactive features available in PDF

- Add bookmarks, page numbers, watermarks, and other features to existing PDF documents

- Split or concatenate pages from existing PDF files

- Fill out interactive forms

- Serve dynamically generated or manipulated PDF documents to a web browser

 

iText 7 includes pdfDebug, the first debugging tool that gives you a clear overview of your content streams and document structure as well as pdfCalligraph, allowing you to leverage advanced typography.

iText is available for Java, .NET in both versions, and Android and GAE for iText 5 only.

iTextSharp is the .NET port of iText 5.

 

Several iText engineers are actively supporting the project on StackOverflow: http://stackoverflow.com/questions/tagged/itext