System.NullReferenceException at parser.parse in StreamTextExtractor.cs #150

Open
johnwnowlin opened this issue Feb 17, 2022 · 2 comments

@johnwnowlin

Tika is crashing on a PDF (which contains confidential information, so sorry, I can't post it) at line 30 of StreamTextExtractor.cs while attempting to extract text from the PDF.

var textExtractor = new TextExtractor();
var extraction = textExtractor.Extract(@"filename");

Exception details:
System.NullReferenceException
HResult=0x80004003
Message=Object reference not set to an instance of an object.
Source=TikaOnDotNet
StackTrace:
at org.apache.jempbox.impl.XMLUtil.getStringValue(Element node)

Oddly, even though this code is in a try/finally block, it throws an exception. If it would let me catch the exception, we could just ignore this file and keep going.

using (var inputStream = streamFactory(metadata))
{
    try
    {
        parser.parse(inputStream, handler, metadata, parseContext);
    }
    finally
    {
        inputStream.close();
    }
}

I can open the file in Adobe. I have also saved it as a new PDF, which fails too.

Is it possible to catch this error so the code can keep going?
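
For reference, a try/finally block only guarantees cleanup; it never handles the exception, so the error keeps propagating to the caller. An ordinary System.NullReferenceException should normally be catchable with a catch clause around the extraction call. Below is a minimal sketch of skipping files that fail. The `SafeExtraction` class, the `ExtractAll` method, and the `extract` delegate are hypothetical names for illustration and are not part of TikaOnDotNet; the delegate stands in for whatever extraction call is actually used.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of TikaOnDotNet.
public static class SafeExtraction
{
    // Runs `extract` on each path. A file that throws is added to `failed`
    // and skipped, so one bad document does not stop the whole batch.
    public static Dictionary<string, string> ExtractAll(
        IEnumerable<string> paths,
        Func<string, string> extract,
        List<string> failed)
    {
        var results = new Dictionary<string, string>();
        foreach (var path in paths)
        {
            try
            {
                results[path] = extract(path);
            }
            catch (Exception ex)
            {
                // A catch clause is what actually handles the exception;
                // try/finally alone would let it escape.
                failed.Add(path);
                Console.Error.WriteLine(
                    $"Skipping '{path}': {ex.GetType().Name}: {ex.Message}");
            }
        }
        return results;
    }
}
```

With TikaOnDotNet, the delegate could plausibly be something like `path => new TextExtractor().Extract(path).Text`, assuming the `Extract(string)` overload and the `Text` property of its result behave as in the library's README.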

@johnwnowlin
Author

The file causing the error came from a Konica copier and appears to be a TIFF wrapped in a PDF. I suspect this error is related to issues #145 and #142, only because Tika needs to extract information from a TIFF. I do not see how to add the optional dependencies to the .NET build to check whether that is the problem. Does anybody know how that is done?

@KevM
Owner

KevM commented Jul 14, 2022

It would be really nice to get an example that crashes so we could try to correct this issue in future releases.
