Using DuplicateOverlappingTextProcessor in HOcrTextExporter #867

Open
KayTannee opened this issue Jul 9, 2024 · 0 comments

I'm trying to work out how I can use DuplicateOverlappingTextProcessor as part of the HOcrTextExporter process.

hocrTextExporter.Get() only accepts a Page as input, with the word extractor and page segmenter supplied in the constructor.

Whereas DuplicateOverlappingTextProcessor only returns a list of letters. There doesn't seem to be a defined way to get from one to the other.

I think the solution is to add an option to the word extractor options, and use it like so:

        var ops = new NearestNeighbourWordExtractorOptions();
        ops.DeduplicateOverlappingText = true;
        var wordExtractor = new NearestNeighbourWordExtractor(ops);
        HOcrTextExporter hocrTextExporter = new HOcrTextExporter(wordExtractor, DocstrumBoundingBoxes.Instance);
        string hocrtext = hocrTextExporter.Get(page, useHocrjs: true);

Having had a look, I think this only needs the two changes below, both to the one class. I'm not able to test the code at the moment though.

UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor
NearestNeighbourWordExtractor.cs

    /// <summary>
    /// Get the words.
    /// </summary>
    /// <param name="letters">The page's letters to group into <see cref="Word"/>s.</param>
    /// <returns>The <see cref="Word"/>s generated by the nearest neighbour method.</returns>
    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // #### Change 1
        // Remove overlapping duplicates before grouping letters into words.
        if (options.DeduplicateOverlappingText)
        {
            letters = DuplicateOverlappingTextProcessor.Get(letters);
        }

....

    /// <summary>
    /// Nearest neighbour word extractor options.
    /// </summary>
    public class NearestNeighbourWordExtractorOptions : IWordExtractorOptions
    {
        /// <summary>
        /// <inheritdoc/>
        /// Default value is -1.
        /// </summary>
        public int MaxDegreeOfParallelism { get; set; } = -1;

        // #### Change 2
        /// <summary>
        /// Uses DuplicateOverlappingTextProcessor to remove overlapping letters before GetWords.
        /// Default value is false.
        /// </summary>
        public bool DeduplicateOverlappingText { get; set; } = false;

Happy to hear if there's an existing alternative way of doing this.
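
One alternative that might work without changing PdfPig is to wrap the word extractor, since HOcrTextExporter takes an IWordExtractor in its constructor. The sketch below is untested. DeduplicatingWordExtractor is a hypothetical name, not part of the library, and it assumes that IWordExtractor.GetWords takes an IReadOnlyList<Letter> and that DuplicateOverlappingTextProcessor.Get accepts the page's letters and returns the deduplicated list.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Util;

// Hypothetical wrapper (not part of PdfPig): removes overlapping duplicate
// letters, then hands the result to whichever word extractor it wraps.
public sealed class DeduplicatingWordExtractor : IWordExtractor
{
    private readonly IWordExtractor inner;

    public DeduplicatingWordExtractor(IWordExtractor inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // Assumption: Get returns the letters with overlapping duplicates removed.
        IReadOnlyList<Letter> deduplicated = DuplicateOverlappingTextProcessor.Get(letters);
        return inner.GetWords(deduplicated);
    }
}
```

If that holds, passing new DeduplicatingWordExtractor(NearestNeighbourWordExtractor.Instance) as the first constructor argument of HOcrTextExporter, in place of the plain extractor, should give the same effect as the proposed option, with the rest of the calling code unchanged.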
