[DISCUSSION] Drop Tika app and Solr Cell support? #14

Open
Tracked by #180
irnnr opened this issue Dec 16, 2015 · 6 comments

Comments

@irnnr
Collaborator

irnnr commented Dec 16, 2015

There might be some reasons to drop support for Tika app and Solr Cell:

  • Tika app is slow as it needs to boot the JVM for each invocation
  • Tika server, by contrast, is much faster as it stays running and waits for requests
  • Solr Cell does not support all the features provided by Tika app/server

If we were to decide to do that, it would also result in a new major version as it is a breaking change. Nothing is set in stone or even decided yet. We're just looking for opinions for now.

@timohund
Contributor

@irnnr I would at least propose dropping Solr Cell support, because a lot of the features are not supported by Solr Cell.

@dkd-friedrich
Member

Although Solr Cell doesn't support all features and the local Tika app/server performs better, I think we should keep Solr Cell support. There are a lot of TYPO3 installations where no local Java installation is available and which therefore depend on Solr Cell (e.g. solrfal & text extraction).

@LeoniePhiline

Tika App is very important for me, since I need to extract metadata of lots and lots of mp3 files (and pdfs, but these are smaller) in a TYPO3 installation.
Sending these gigabytes to the solr server for metadata extraction creates much more overhead, timeouts, headaches and delay than firing up the tika app.

Therefore, please keep Tika App support! :)

By the way, I also had to add another memory-expanding argument to the tika command: -Xmx512M, to avoid the Java VM error "Could not reserve enough space for object heap".
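As a rough illustration of the memory flag mentioned above, the sketch below builds a Tika app command line with an explicit heap limit. The function name `buildTikaCommand` and the example paths are hypothetical, and this is not the extension's actual code. `-Xmx` is the standard JVM maximum heap option and `-m` asks Tika app to output metadata only.

```php
<?php
// Hypothetical helper, not part of EXT:tika: builds a Tika app call with a heap limit.
function buildTikaCommand(string $jarPath, string $file, string $maxHeap = '512M'): string
{
    // Accept only plain JVM size values such as 512M or 1G.
    if (!preg_match('/^\d+[kKmMgG]?$/', $maxHeap)) {
        throw new InvalidArgumentException('Unexpected heap size: ' . $maxHeap);
    }

    return sprintf(
        'java -Xmx%s -jar %s -m %s',
        $maxHeap,
        escapeshellarg($jarPath),
        escapeshellarg($file)
    );
}

// The paths below are placeholders.
echo buildTikaCommand('/path/to/tika-app.jar', '/path/to/song.mp3'), PHP_EOL;
```

Validating the heap value before it reaches the shell keeps the flag configurable without opening the command to injection.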

@irnnr
Collaborator Author

irnnr commented Mar 2, 2017

@LeoniePhiline thanks for your input! Can you please open a separate issue for the memory flag so that it can be taken care of?

@AndreasA

AndreasA commented Mar 21, 2017

I also think Solr extraction should be kept, as that way one can use solrfal if the TYPO3 server itself has no Java (or one cannot install the Tika server, etc.).
Also, in most cases Solr will be enough and one doesn't have to set up and maintain a Tika server.

EDIT: However, maybe one could document the advantages and disadvantages of the various types (e.g. what doesn't work when using Solr Cell) in a .md file.

@LeoniePhiline

Right now I have a nice case where with the same site on some environments I can use the tika app (jar), but on other environments I need to switch extconf['extractor'] from 'jar' to 'solr'. Works quite well.

I only had to add $GLOBALS['TYPO3_CONF_VARS']['SYS']['FileInfo']['fileExtensionToMimeType']['mp3'] = 'audio/mpeg3'; for \ApacheSolrForTypo3\Tika\Service\Tika\SolrCellService::getSupportedMimeTypes() to match. (Although solr then returns the 'audio/mpeg' mimetype for the extracted mp3 file - so rather getSupportedMimeTypes() should be extended by adding 'audio/mpeg'.)
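A minimal sketch of how such a per-environment switch could look in a configuration file. It assumes the extension settings live as a serialized array under `$GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['tika']`, and it uses a hypothetical environment variable `TIKA_EXTRACTOR`; neither detail is confirmed in this thread.

```php
<?php
// Sketch for a per-environment configuration file (e.g. AdditionalConfiguration.php).
// Outside TYPO3 the array does not exist yet, so start from an empty one.
$GLOBALS['TYPO3_CONF_VARS'] = $GLOBALS['TYPO3_CONF_VARS'] ?? [];

// Assumption: the extension settings are stored as a serialized array.
$serialized = $GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['tika'] ?? '';
$tikaConf = $serialized !== ''
    ? unserialize($serialized, ['allowed_classes' => false])
    : [];
if (!is_array($tikaConf)) {
    $tikaConf = [];
}

// TIKA_EXTRACTOR is a hypothetical environment variable; fall back to the jar.
$extractor = getenv('TIKA_EXTRACTOR');
$tikaConf['extractor'] = in_array($extractor, ['jar', 'solr'], true) ? $extractor : 'jar';
$GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['tika'] = serialize($tikaConf);

// Mime type override quoted above, so that mp3 files match SolrCellService's list.
$GLOBALS['TYPO3_CONF_VARS']['SYS']['FileInfo']['fileExtensionToMimeType']['mp3'] = 'audio/mpeg3';
```

Only the two extractor values named in this comment, 'jar' and 'solr', are accepted here; anything else falls back to 'jar'.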

I also had to add a mapping of xmpDM:duration to (int)($value / 1000) in \ApacheSolrForTypo3\Tika\Service\Extractor\MetaDataExtractor::normalizeMetaData().
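The duration mapping could be sketched roughly as follows. This is a standalone illustration, not the actual body of `normalizeMetaData()`: the function name `normalizeDuration` and the target key `duration` are hypothetical, and the division by 1000 presumably converts milliseconds to seconds.

```php
<?php
// Hypothetical standalone version of the mapping; not the extension's real code.
function normalizeDuration(array $metaData): array
{
    $key = 'xmpDM:duration';

    // Only map values that are actually numeric (Tika returns them as strings).
    if (isset($metaData[$key]) && is_numeric($metaData[$key])) {
        // 'duration' is an assumed target field name.
        $metaData['duration'] = (int)($metaData[$key] / 1000);
    }

    return $metaData;
}

$normalized = normalizeDuration(['xmpDM:duration' => '215000']);
```

Entries without a numeric `xmpDM:duration` pass through unchanged, so the function is safe to call on metadata from non-audio files.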

EXT:extractor has nicely configurable metadata mapping (normalization) handling. No code change would be necessary there, but EXT:extractor does not support Solr Cell, only a local Tika app or Tika server.

5 participants