-
Notifications
You must be signed in to change notification settings - Fork 437
Vector datatype support #2634
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
base: main
Are you sure you want to change the base?
Vector datatype support #2634
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2634 +/- ##
============================================
- Coverage 51.59% 51.40% -0.20%
+ Complexity 3999 3994 -5
============================================
Files 147 148 +1
Lines 33706 33841 +135
Branches 5631 5648 +17
============================================
+ Hits 17391 17396 +5
- Misses 13866 13991 +125
- Partials 2449 2454 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…o decode/encode data, along with additional features like dimensionType, and dimensionCount.
…roved constructor parameter positioning for clarity, removed VectorInputStream
…nsionCount for vector, fixed datatype and tdsType mappings along with setting correct typeDefinition in Parameter class.
…ector handling to ensure insertion of vector data.
…a, and measured performance for max-length vectors at varying scales — 100, 1K, 10K, 100K, and 1M records.
… performance testing.
…ull insertion correctly, added test case to validate null insertion and test vectorsupport feature enabled. Removed getVector, setVector and updated existing setObject() along with adding helper functions in Vector class.
…getObject, and setObject APIs.
…CopyForBatchInsert and added support for vector datatype in SQLServerBulkCSVFileRecord along with updating strategy for Vector type in dtv.java
JDBCType.Category.TIMESTAMP, JDBCType.Category.NCHARACTER, JDBCType.Category.GUID)), | ||
|
||
VECTOR(SSType.Category.VECTOR, EnumSet.of(JDBCType.Category.BINARY, JDBCType.Category.LONG_BINARY, | ||
JDBCType.Category.CHARACTER, JDBCType.Category.LONG_CHARACTER, JDBCType.Category.BLOB, JDBCType.Category.VECTOR)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to clean getter, setter and update conversions in DataTypes class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add test cases to validate all data types in this list, including their conversions?
… for read, insert, bulkCopy, bulkCopyCSV operation and corrected typeInfo and Parameter.java along with adding test cases for BulkCopy
…ated insertion and retrieval of VECTOR data, and updated stored procedure and TVP definitions to use scale as dimensionType and precision as dimensionCount.
…CopyForBatchInsert along with updating getObject() when data is null
…d functions (UDFs); fixed previously failing test case.
… by returning normalised vector, global and local temp tables
…ta information for vector data type
… varbinary combinations -vector and varchar combinations
…ios along with adding test for setObject(getObject())
…peration is performed for mismatched source and destination tables
JDBCType.Category.TIME, JDBCType.Category.TIMESTAMP, JDBCType.Category.CHARACTER, | ||
JDBCType.Category.LONG_CHARACTER, JDBCType.Category.NCHARACTER, JDBCType.Category.LONG_NCHARACTER, | ||
JDBCType.Category.BINARY, JDBCType.Category.LONG_BINARY, JDBCType.Category.BLOB, JDBCType.Category.GUID, | ||
JDBCType.Category.SQL_VARIANT, JDBCType.Category.VECTOR)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add test cases to validate all data types in this list, including their conversions? It would also help to understand how this mapping is utilized in the execution flow.
SSType.Category.CHARACTER, SSType.Category.LONG_CHARACTER, SSType.Category.NCHARACTER, | ||
SSType.Category.LONG_NCHARACTER, SSType.Category.XML, SSType.Category.BINARY, | ||
SSType.Category.LONG_BINARY, SSType.Category.UDT, SSType.Category.TIMESTAMP, SSType.Category.GUID, | ||
SSType.Category.SQL_VARIANT, SSType.Category.VECTOR)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[As above] Can we add test cases to validate all data types in this list, including their conversions? It would also help to understand how this mapping is utilized in the execution flow.
} | ||
} | ||
break; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we write in more concise way:
case VECTOR:
byte[] bValue = (currentObject == null) ? null : (byte[]) currentObject;
writeShort((short) (bValue == null ? -1 : bValue.length));
if (bValue != null) {
writeBytes(bValue);
}
break;
writeShort((short) (8 + (scale * precision))); //maxLength | ||
byte scaleByte = (byte) (scale == 2 ? 0x01 : 0x00); | ||
writeByte((byte) scaleByte); //dimension type | ||
writeShort((short) -1); // actual len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the writeTVPColumnMetaData method contains duplicate code. If possible, consider moving the common logic to a utility method to improve reusability and maintainability.
} | ||
|
||
String vectorData = data[pair.getKey() - 1].trim(); | ||
microsoft.sql.Vector.VectorDimensionType vectorDimensionType = microsoft.sql.Vector.getVectorDimensionType(cm.scale); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to import the microsoft.sql.Vector package?
@@ -318,6 +318,21 @@ private void internalAddrow(JDBCType jdbcType, Object val, Object[] rowValues, | |||
internalAddrow(internalJDBCType, val, rowValues, pair); | |||
break; | |||
|
|||
case VECTOR: | |||
nValueLen = ((microsoft.sql.Vector) val).getActualLength(); | |||
int scale = ((microsoft.sql.Vector) val).getScale() == 0x01 ? 2 : 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move the hardcoded values 2 and 4 to constant variables with meaningful names?
typeInfo.displaySize = typeInfo.maxLength; | ||
typeInfo.ssType = SSType.VECTOR; | ||
int scaleByte = tdsReader.readUnsignedByte(); // Read the dimension type (scale) | ||
typeInfo.scale = (scaleByte == 0) ? 4 : 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the constant that will be created as suggested in the comment above.
public VectorDimensionType vectorType; | ||
public int dimensionCount; | ||
|
||
public float[] data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can those variables be made private?
return null; | ||
} | ||
if (bytes.length < 8) { | ||
throw new IllegalArgumentException("Byte array length must be at least 8 bytes."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the below error message in SQLServerResource.java
{"R_InvalidVectorLength", "Byte array length must be at least "{0}" bytes."}
throw new IllegalArgumentException("Byte array length must be at least 8 bytes."); | ||
} | ||
if (bytes.length % 4 != 0) { | ||
throw new IllegalArgumentException("Byte array length must be a multiple of 4."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the below error message in SQLServerResource.java
{"R_InvalidVectorLength", "Byte array length must be at least "{0}" bytes."}
The VECTOR data type is designed to store multi-dimensional arrays of numerical data, which are essential for various machine learning and data analysis applications. By introducing the VECTOR type, we aim to enhance the capabilities of SQL Server and Azure SQL Database, enabling users to efficiently store, retrieve, and manipulate vector data.
Vector Design doc -> https://microsoft.sharepoint.com/:w:/t/SQLAI-Projects/Ef632IF6mp1Mqj0OyKSrI1cBWNwxtM29eM31TZ3wUPgf4w?e=Ztgtq2&wdOrigin=LOOP-WEB.PREVIEW.NT&wduihid=ea3cf057-9d25-4ea5-bc4f-518a7ff343a7&web=1&ct=1741863974397