fix: cannot download document from tar gz archive #1135

Closed
mvanzalu opened this issue Jul 21, 2023 · 3 comments

@mvanzalu
Contributor

mvanzalu commented Jul 21, 2023

Describe the bug
Some documents cannot be downloaded: Datashare (DS) returns a 500 error (Tika is probably the cause).

How to reproduce

  1. Compress a PDF file into a tar.gz archive
  2. Index your archive
  3. Try to download the embedded PDF
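
For a regression test, the archive from step 1 could be built in memory instead of by hand. The following is only a sketch: the class and method names (`TarGzFixture`, `tarGz`, `gunzip`) are hypothetical and not part of Datashare or extract, the header is a hand-written minimal ustar header for one regular file, and the payload is placeholder bytes rather than a real PDF. A project that already depends on Apache Commons Compress would more likely use its tar support.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical test helper: builds a one-entry tar.gz archive in memory.
class TarGzFixture {
    private static final int BLOCK = 512; // tar is organised in 512-byte blocks

    /** Copies an ASCII string into the header at the given offset. */
    private static void put(byte[] header, int offset, String ascii) {
        byte[] bytes = ascii.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(bytes, 0, header, offset, bytes.length);
    }

    /** Returns a gzip-compressed ustar archive holding a single regular file. */
    static byte[] tarGz(String entryName, byte[] content) {
        if (entryName.length() > 99) {
            throw new IllegalArgumentException("name too long for this sketch");
        }
        byte[] header = new byte[BLOCK];
        put(header, 0, entryName);                                // file name
        put(header, 100, "0000644");                              // mode (octal)
        put(header, 108, "0000000");                              // uid
        put(header, 116, "0000000");                              // gid
        put(header, 124, String.format("%011o", content.length)); // size (octal)
        put(header, 136, String.format("%011o", 0));              // mtime
        put(header, 148, "        ");                             // checksum field counted as 8 spaces
        header[156] = '0';                                        // type flag: regular file
        put(header, 257, "ustar");                                // magic, NUL-terminated
        put(header, 263, "00");                                   // version
        int sum = 0;
        for (byte b : header) sum += b & 0xff;
        put(header, 148, String.format("%06o", sum));             // six octal digits...
        header[154] = 0;                                          // ...then NUL; the space at 155 stays

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(header);
            gz.write(content);
            gz.write(new byte[(BLOCK - content.length % BLOCK) % BLOCK]); // pad data to a full block
            gz.write(new byte[2 * BLOCK]);                                // end-of-archive marker
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses the gzip layer so the raw tar bytes can be inspected. */
    static byte[] gunzip(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] placeholder = "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);
        byte[] tar = gunzip(tarGz("doc.pdf", placeholder));
        System.out.println(tar.length); // prints 2048 (header + one data block + trailer)
    }
}
```
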
2023-07-21 12:52:24,696 [Worker: RequestDispatcher: Thread-36] ERROR Fluent - Unexpected error:                                                                                                            
org.icij.datashare.text.indexing.elasticsearch.ExtractException: extract error for embedded document in project foo / id : bar
 / routing_id : baz                                                               
        at org.icij.datashare.text.indexing.elasticsearch.SourceExtractor.getSource(SourceExtractor.java:70)                                                                                               
        at org.icij.datashare.web.DocumentResource.getPayload(DocumentResource.java:396)                                                                                                                   
        at org.icij.datashare.web.DocumentResource.getSourceFile(DocumentResource.java:85)                                                                                                                 
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)                                                                                                                  
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)                                                                                                
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at net.codestory.http.routes.ReflectionRoute.invoke(ReflectionRoute.java:83)
        at net.codestory.http.routes.ReflectionRoute.lambda$body$0(ReflectionRoute.java:45)
        at net.codestory.http.annotations.MethodAnnotations.apply(MethodAnnotations.java:48)
        at net.codestory.http.routes.ReflectionRoute.body(ReflectionRoute.java:40)
        at net.codestory.http.routes.RouteWithPattern.body(RouteWithPattern.java:56)
        at net.codestory.http.routes.Route.apply(Route.java:25)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$98caf044$1(RouteCollection.java:577)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593) 
        at org.icij.datashare.session.LocalUserFilter.otherUri(LocalUserFilter.java:35)
        at net.codestory.http.filters.auth.CookieAuthFilter.apply(CookieAuthFilter.java:78)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593) 
        at org.icij.datashare.web.IndexWaiterFilter.apply(IndexWaiterFilter.java:44)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593) 
        at org.icij.datashare.session.LocalUserFilter.otherUri(LocalUserFilter.java:35)
        at net.codestory.http.filters.auth.CookieAuthFilter.apply(CookieAuthFilter.java:78)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593) 
        at org.icij.datashare.web.IndexWaiterFilter.apply(IndexWaiterFilter.java:44)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593) 
        at org.icij.datashare.mode.CorsFilter.apply(CorsFilter.java:19)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.apply(RouteCollection.java:567)
        at net.codestory.http.AbstractWebServer.handleHttp(AbstractWebServer.java:152)
        at net.codestory.http.internal.SimpleServerWrapper.handle(SimpleServerWrapper.java:71)
        at org.simpleframework.http.socket.service.RouterContainer.handle(RouterContainer.java:106)
        at org.simpleframework.http.core.RequestDispatcher.dispatch(RequestDispatcher.java:121)
        at org.simpleframework.http.core.RequestDispatcher.run(RequestDispatcher.java:103)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@5e88ebb8
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)
        at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)
        at org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor.extract(EmbeddedDocumentMemoryExtractor.java:48)
        at org.icij.datashare.text.indexing.elasticsearch.SourceExtractor.getSource(SourceExtractor.java:66)
        ... 39 common frames omitted
Caused by: org.apache.commons.io.TaggedIOException: Resetting to invalid mark
        at org.apache.commons.io.input.TaggedInputStream.handleIOException(TaggedInputStream.java:114)
        at org.apache.commons.io.input.ProxyInputStream.reset(ProxyInputStream.java:169)
        at org.apache.tika.io.TikaInputStream.reset(TikaInputStream.java:800)

@mvanzalu
Contributor Author

It may be related to the format of the archive. For the same document:

  • in a zip archive: the download works
  • in a tar.gz/xz archive: the download fails

@pirhoo pirhoo moved this to Todo in Datashare - Sprint 17 Jul 21, 2023
@mvanzalu mvanzalu changed the title from "Cannot download document from archive" to "Cannot download document from tar gz archive" Jul 21, 2023
@mvanzalu
Contributor Author

A different stack trace appears when reproducing locally:

2023-07-21 21:04:59,878 [Worker: RequestDispatcher: Thread-30] INFO  SourceExtractor - extracting embedded document 4726...2b8b from root document /home/dev/Datashare/Data/foo.tar.gz
2023-07-21 21:05:14,343 [Worker: RequestDispatcher: Thread-30] ERROR Fluent - Unable to GET /api/local-datashare/documents/src/4726fde4827ce624eeaa8e052bac85b840ae88f78e1af9ea3089a8ff8a4591feb93ca08f11a343e051974e7ca60c2b8b
2023-07-21 21:05:14,344 [Worker: RequestDispatcher: Thread-30] ERROR Fluent - Unexpected error:
org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor$ContentNotFoundException: <4726fde4827ce624eeaa8e052bac85b840ae88f78e1af9ea3089a8ff8a4591feb93ca08f11a343e051974e7ca60c2b8b> embedded document not found in root document /home/dev/Datashare/Data/foo.tar.gz
        at org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor$DigestEmbeddedDocumentExtractor.lambda$getDocument$0(EmbeddedDocumentMemoryExtractor.java:107)
        at java.base/java.util.Optional.orElseThrow(Optional.java:408)
        at org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor$DigestEmbeddedDocumentExtractor.getDocument(EmbeddedDocumentMemoryExtractor.java:106)
        at org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor.extract(EmbeddedDocumentMemoryExtractor.java:50)
        at org.icij.datashare.text.indexing.elasticsearch.SourceExtractor.getSource(SourceExtractor.java:66)
        at org.icij.datashare.web.DocumentResource.getPayload(DocumentResource.java:396)
        at org.icij.datashare.web.DocumentResource.getSourceFile(DocumentResource.java:85)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at net.codestory.http.routes.ReflectionRoute.invoke(ReflectionRoute.java:83)
        at net.codestory.http.routes.ReflectionRoute.lambda$body$0(ReflectionRoute.java:45)
        at net.codestory.http.annotations.MethodAnnotations.apply(MethodAnnotations.java:48)
        at net.codestory.http.routes.ReflectionRoute.body(ReflectionRoute.java:40)
        at net.codestory.http.routes.RouteWithPattern.body(RouteWithPattern.java:56)
        at net.codestory.http.routes.Route.apply(Route.java:25)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$98caf044$1(RouteCollection.java:577)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593)
        at org.icij.datashare.session.LocalUserFilter.otherUri(LocalUserFilter.java:35)
        at net.codestory.http.filters.auth.CookieAuthFilter.apply(CookieAuthFilter.java:78)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593)
        at org.icij.datashare.web.IndexWaiterFilter.apply(IndexWaiterFilter.java:44)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593)
        at org.icij.datashare.session.LocalUserFilter.otherUri(LocalUserFilter.java:35)
        at net.codestory.http.filters.auth.CookieAuthFilter.apply(CookieAuthFilter.java:78)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593)
        at org.icij.datashare.web.IndexWaiterFilter.apply(IndexWaiterFilter.java:44)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.lambda$null$2339cd96$1(RouteCollection.java:593)
        at org.icij.datashare.mode.CorsFilter.apply(CorsFilter.java:19)
        at net.codestory.http.routes.RouteCollection.lambda$createContextToPayload$51719a14$1(RouteCollection.java:593)
        at net.codestory.http.routes.RouteCollection.apply(RouteCollection.java:567)
        at net.codestory.http.AbstractWebServer.handleHttp(AbstractWebServer.java:152)
        at net.codestory.http.internal.SimpleServerWrapper.handle(SimpleServerWrapper.java:71)
        at org.simpleframework.http.socket.service.RouterContainer.handle(RouterContainer.java:106)
        at org.simpleframework.http.core.RequestDispatcher.dispatch(RequestDispatcher.java:121)
        at org.simpleframework.http.core.RequestDispatcher.run(RequestDispatcher.java:103)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
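
The trace above ends in an `Optional.orElseThrow` inside a digest-based lookup, which suggests the embedded document is located by comparing digests and that no entry matched. A minimal sketch of that kind of lookup follows. Everything in it is hypothetical (`DigestLookupSketch`, `sha384Hex`, `find`) and it is not the extract implementation. The 96-character hexadecimal id in the log has the length of a SHA-384 digest; that it is one, and that it is computed from the raw bytes alone, are assumptions made here for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of a digest-keyed lookup; not the extract library's code.
class DigestLookupSketch {
    /** Hex-encoded SHA-384 of the given bytes (96 hex characters). */
    static String sha384Hex(byte[] content) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-384").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // not expected with a standard JDK
        }
    }

    /** Finds the embedded entry whose digest equals the requested id, if any. */
    static Optional<byte[]> find(Map<String, byte[]> embeddedByName, String wantedDigest) {
        return embeddedByName.values().stream()
                .filter(content -> sha384Hex(content).equals(wantedDigest))
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, byte[]> embedded = new LinkedHashMap<>();
        embedded.put("doc.pdf", "indexed bytes".getBytes(StandardCharsets.UTF_8));

        String idAtIndexTime = sha384Hex("indexed bytes".getBytes(StandardCharsets.UTF_8));
        String otherId = sha384Hex("different bytes".getBytes(StandardCharsets.UTF_8));

        System.out.println(find(embedded, idAtIndexTime).isPresent()); // prints true
        System.out.println(find(embedded, otherId).isPresent());       // prints false
    }
}
```

In a lookup shaped like this, any difference between the bytes hashed at index time and the bytes seen at download time yields an empty result, which a caller would then surface as a "not found" error.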

@mvanzalu mvanzalu moved this from Todo to In Progress in Datashare - Sprint 17 Jul 24, 2023
bamthomas added a commit to ICIJ/extract that referenced this issue Jul 26, 2023
…are#1135

Co-authored-by: Maxime Vanza Lutonda <mvanzalu@users.noreply.github.com>
bamthomas added a commit to ICIJ/extract that referenced this issue Jul 26, 2023
Co-authored-by: Maxime Vanza Lutonda <mvanzalu@users.noreply.github.com>
mvanzalu added a commit that referenced this issue Jul 26, 2023
@mvanzalu mvanzalu moved this from In Progress to Done in Datashare - Sprint 17 Jul 26, 2023
@mvanzalu mvanzalu moved this from Done to In Progress in Datashare - Sprint 17 Jul 27, 2023
@mvanzalu
Contributor Author

mvanzalu commented Jul 27, 2023

The first error is still happening. It turns out that resetting the TikaInputStream is what triggers it: https://github.com/ICIJ/extract/blob/1ecabee74e445e8226bc71489891d21a34410984/extract-lib/src/main/java/org/icij/extract/extractor/EmbeddedDocumentMemoryExtractor.java#L82
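
The message "Resetting to invalid mark" is the one `java.io.BufferedInputStream.reset()` throws when more bytes were read after `mark(readlimit)` than the stream could keep buffered. The sketch below reproduces that mechanism with plain JDK streams only. Whether TikaInputStream reaches exactly this path in the failing case is an assumption, and `InvalidMarkDemo` and `markReadReset` are hypothetical names used for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustration with JDK streams only; no Tika classes are involved.
class InvalidMarkDemo {
    /**
     * Marks the stream, reads up to toRead bytes one at a time, then resets.
     * Returns "ok" when reset succeeds, otherwise the IOException message.
     */
    static String markReadReset(int bufferSize, int readLimit, int toRead) {
        byte[] data = new byte[64];
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), bufferSize)) {
            in.mark(readLimit);
            for (int i = 0; i < toRead; i++) {
                if (in.read() < 0) break; // stop at end of stream
            }
            in.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Reading stays inside the buffer: the mark is still valid.
        System.out.println(markReadReset(8, 4, 2));  // prints "ok"
        // Reading runs past both the read limit and the buffer: the mark is dropped.
        System.out.println(markReadReset(8, 4, 32)); // prints "Resetting to invalid mark"
    }
}
```

A stream that is fully consumed between `mark` and `reset`, as a compressed archive typically is during parsing, would fall into the second case unless the read limit covers the whole content or a fresh stream is opened instead of resetting.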

mvanzalu added a commit that referenced this issue Jul 31, 2023
@pirhoo pirhoo changed the title from "Cannot download document from tar gz archive" to "fix: cannot download document from tar gz archive" Aug 3, 2023
@mvanzalu mvanzalu moved this from In Progress to Todo in Datashare - Sprint 17 Aug 7, 2023
@mvanzalu mvanzalu moved this from Todo to In Progress in Datashare - Sprint 17 Aug 28, 2023
mvanzalu added a commit that referenced this issue Aug 28, 2023
@mvanzalu mvanzalu moved this from In Progress to Done in Datashare - Sprint 17 Aug 29, 2023
mvanzalu added a commit that referenced this issue Sep 1, 2023
@mvanzalu mvanzalu closed this as completed Sep 5, 2023