Spark NLP 5.2.1: Official support for Apache Spark 3.5, Introducing BGE annotator for Text Embeddings, ONNX support for DeBERTa Token and Sequence Classifications, and Question Answering task, new Databricks 14.x runtimes, Over 400 new state-of-the-art Transformer Models in ONNX, and bug fixes! #14114
maziyarpanahi announced in Announcement
📢 Overview
Spark NLP 5.2.1 🚀 comes with full compatibility with Spark/PySpark 3.5, brand new BGEEmbeddings to load BGE models for text embeddings, and new ONNX support for the DeBertaForTokenClassification, DeBertaForSequenceClassification, and DeBertaForQuestionAnswering annotators. Additionally, we've added over 400 state-of-the-art transformer models in ONNX format to ensure rapid inference for multi-class/multi-label classification models.

We're pleased to announce that our Models Hub now boasts 22,000+ free and truly open-source models & pipelines 🎉. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
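To make the new annotator concrete, here is a minimal PySpark sketch of how BGEEmbeddings might be wired into a pipeline. It assumes `spark-nlp==5.2.1` and `pyspark` are installed and that a default pretrained BGE model can be downloaded; the column names and the helper functions `build_bge_pipeline` and `embed` are illustrative, not part of the release.

```python
def build_bge_pipeline():
    """Assemble a two-stage pipeline: raw text -> document -> BGE embeddings."""
    # Imports are deferred so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BGEEmbeddings
    from sparknlp.base import DocumentAssembler

    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    )
    # Assumption: pretrained() with no arguments fetches the annotator's
    # default model; pass a model name and language to choose another one.
    embeddings = (
        BGEEmbeddings.pretrained()
        .setInputCols(["document"])
        .setOutputCol("embeddings")
    )
    return Pipeline(stages=[document_assembler, embeddings])


def embed(texts):
    """Return a DataFrame with one dense vector per input text."""
    import sparknlp

    spark = sparknlp.start()
    data = spark.createDataFrame([(t,) for t in texts], ["text"])
    model = build_bge_pipeline().fit(data)
    # Each row holds an array of annotations; take the vector of the first one.
    return model.transform(data).selectExpr(
        "text", "embeddings.embeddings[0] AS vector"
    )
```

Calling `embed(["Spark NLP now supports BGE models."]).show()` would start a local session and print the resulting vectors, which could then be written to a vector database or used for similarity search.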
🔥 New Features & Enhancements
- Full support for Apache Spark and PySpark 3.5, which comes with lots of improvements for Spark Connect: https://spark.apache.org/releases/spark-release-3-5-0.html#highlights
- BGEEmbeddings annotator for Spark NLP. This annotator enables the integration of BGE models, based on the BERT architecture, into Spark NLP. The BGEEmbeddings annotator is designed for generating dense vectors suitable for a variety of applications, including retrieval, classification, clustering, and semantic search. Additionally, it is compatible with vector databases used in Large Language Models (LLMs).
- ONNX support for the DeBertaForTokenClassification annotator
- ONNX support for the DeBertaForSequenceClassification annotator
- ONNX support for the DeBertaForQuestionAnswering annotator
- Importing models from the T5 family into Spark NLP with TensorFlow format
- Importing models from the T5 family into Spark NLP with ONNX format
- Importing models from the MarianNMT family into Spark NLP with ONNX format
🐛 Bug Fixes
- Fixed the DocumentTokenSplitter annotator failing to be saved and loaded in a Pipeline
- Fixed the DocumentCharacterTextSplitter annotator failing to be saved and loaded in a Pipeline
ℹ️ Known Issues
- ONNX models crash when they are used in Colab's T4 GPU runtime (#14109)
📓 New Notebooks
📖 Documentation
❤️ Community support
Installation
Python
# PyPI
pip install spark-nlp==5.2.1
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
GPU
Apple Silicon (M1 & M2)
AArch64
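As a rough sketch of how one of these packages could be pulled in from PySpark, the helper below builds the package coordinate for each variant and hands it to Spark through `spark.jars.packages`. The artifact names come from this announcement; the group id `com.johnsnowlabs.nlp`, the helper names, and the app name are assumptions made for illustration.

```python
# Artifact names as listed in this release; keys are illustrative labels.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}
GROUP_ID = "com.johnsnowlabs.nlp"  # assumption: Spark NLP's Maven group id
SCALA_VERSION = "2.12"             # Scala build stated in this release


def package_coordinate(flavor="cpu", version="5.2.1"):
    """Return the group:artifact_scala:version string for a package variant."""
    if flavor not in ARTIFACTS:
        raise ValueError(f"unknown flavor {flavor!r}; expected one of {sorted(ARTIFACTS)}")
    return f"{GROUP_ID}:{ARTIFACTS[flavor]}_{SCALA_VERSION}:{version}"


def start_session(flavor="cpu"):
    """Start a Spark session that resolves the chosen package at launch."""
    # Deferred import so the coordinate helper works without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("spark-nlp-5.2.1")
        .config("spark.jars.packages", package_coordinate(flavor))
        .getOrCreate()
    )
```

For example, `package_coordinate("gpu")` yields the string that would follow `--packages` on a `spark-shell`, `pyspark`, or `spark-submit` command line.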
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
spark-nlp-gpu:
spark-nlp-silicon:
spark-nlp-aarch64:
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.1.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.1.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.1.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.1.jar
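The four FAT JAR links above follow one naming pattern, which the short sketch below captures so that a script can select the right JAR for its hardware. The function name `fat_jar_url` and the flavor labels are hypothetical; the base URL and file names are taken from the links above.

```python
BASE_URL = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars"

# Suffix inserted after "spark-nlp" for each build listed above.
FLAVOR_SUFFIX = {"cpu": "", "gpu": "-gpu", "silicon": "-silicon", "aarch64": "-aarch64"}


def fat_jar_url(flavor="cpu", version="5.2.1"):
    """Build the download URL of the assembly JAR for a given build flavor."""
    if flavor not in FLAVOR_SUFFIX:
        raise ValueError(f"unknown flavor {flavor!r}; expected one of {sorted(FLAVOR_SUFFIX)}")
    return f"{BASE_URL}/spark-nlp{FLAVOR_SUFFIX[flavor]}-assembly-{version}.jar"
```

The resulting URL, or a local copy of the downloaded file, can be supplied to Spark through the `spark.jars` setting or the `--jars` command-line option when no package resolution is available.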
What's Changed
Full Changelog: 5.2.0...5.2.1
This discussion was created from the release Spark NLP 5.2.1: Official support for Apache Spark 3.5, Introducing BGE annotator for Text Embeddings, ONNX support for DeBERTa Token and Sequence Classifications, and Question Answering task, new Databricks 14.x runtimes, Over 400 new state-of-the-art Transformer Models in ONNX, and bug fixes!.