Releases: charles-r-earp/autograph
v0.2.1
v0.2.0
- Removed async traits and methods.
- Core functionality reimplemented in krnl:
  - Only targets Vulkan, more portable than Metal / DX12.
    - Metal is supported via MoltenVK.
  - GPGPU kernels implemented inline in Rust (see the saxpy sketch below):
    - Kernels can be defined in the same file, near where they are invoked.
    - Modules allow sharing code between host and device.
    - Kernel bindings are type safe, checked at compile time.
    - Simple iterator patterns can be implemented without unsafe.
    - Supports specialization constants provided at runtime.
    - DeviceInfo includes useful properties:
      - Max / default threads per group.
      - Max / min threads per subgroup.
    - With DebugPrintf, kernel panics produce errors on the host.
    - krnlc generates a device crate and invokes spirv-builder.
      - spirv-builder / spirv-tools are compiled once on install.
      - Significantly streamlines and accelerates workflow.
    - Kernels are compressed to reduce package and binary size.
  - Device operations readily execute:
    - Block until kernels / transfers can queue.
    - An operation can be queued while another is executing.
    - Reduced latency, better repeatability, reliability, and performance.
  - Device buffers can be copied by the host if host visible.
  - Large buffer copies are streamed rather than allocating a large temporary:
    - Reuses a few small buffers for transfers.
    - Overlaps host and device copies.
    - Performance significantly closer to CUDA.
    - Also streams between devices.
  - Device buffers can be i32::MAX bytes (~2 GB, up from 256 MB).
  - Scalar / ScalarBufferBase replaces Float / FloatBuffer:
    - Streamlined conversions between buffers.
  - Buffers can be sliced.
  - Supports wasm (without device feature).
- TensorBase and ScalarTensorBase implemented with krnl::BufferBase and krnl::ScalarBufferBase:
  - Streamlined conversions between tensor types.
- Host ops accelerated with rayon.
- Improved and streamlined device gemm kernel.
- Device sum and sum_axis use subgroup reductions for improved performance.
- Replaced Criterion trait with Accuracy / CrossEntropyLoss traits.
- ops::AddAssign implemented by Tensor and Variable.
- Implemented ndarray::linalg::Dot for Tensor and Variable.
- Direct convolution algorithm for better host performance.
- Removed learn::kmeans.
- Redesigned autograd:
  - Autograd replaced with VariableBuilder:
    - Nodes and edges applied when building a Variable.
    - Backward edges are simply f(output_grad) -> input_grad (see the sketch below).
    - Gradients are automatically accumulated.
  - Parameter and Variable are separate types (instead of VertexBase).
    - Parameters can be converted to Variables.
- Redesigned Layer trait:
  - for_each_parameter fn's instead of returning a Vec.
  - Cast layers to a ScalarType.
  - Removed enumeration of child layers.
- Redesigned Forward trait:
  - Generic over input and output type.
- Derive improvements:
  - Removed layer attribute.
  - Supports enums.
  - Fields can be skipped.
- Redesigned Optimizer trait:
  - Added learning rate.
  - Accepts a single parameter instead of a slice.
  - Parameter optimizer::State:
    - Can be serialized / deserialized with serde.
- Simplified Iris dataset.
- MNIST dataset:
  - Replaced downloader with curl.
  - Decompress in parallel with rayon.
MSRV: 1.70.0
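For illustration, here is a minimal inline-kernel sketch adapted from the krnl README; this is a sketch only, and exact macro attributes and builder methods may differ between krnl versions:

use krnl::{anyhow::Result, buffer::{Slice, SliceMut}, macros::module};

#[module]
mod kernels {
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    // An "item" kernel: the simple iterator pattern, no unsafe required.
    #[kernel]
    pub fn saxpy(alpha: f32, #[item] x: f32, #[item] y: &mut f32) {
        *y += alpha * x;
    }
}

// Host side: build the kernel for the slice's device and dispatch it.
fn saxpy(alpha: f32, x: Slice<f32>, y: SliceMut<f32>) -> Result<()> {
    kernels::saxpy::builder()?
        .build(y.device())?
        .dispatch(alpha, x, y)
}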
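The redesigned autograd's "backward edges are simply f(output_grad) -> input_grad" can be pictured as a closure from output gradient to input gradient. The BackwardEdge type below is purely illustrative and is not autograph's API:

// For y = w * x, the edge back to x scales the output gradient by w.
type BackwardEdge = Box<dyn Fn(&[f32]) -> Vec<f32> + Send + Sync>;

fn mul_edge(w: f32) -> BackwardEdge {
    Box::new(move |output_grad| output_grad.iter().map(|g| g * w).collect())
}

fn main() {
    let edge = mul_edge(3.0);
    // input_grad = output_grad * dy/dx = output_grad * w
    assert_eq!(edge(&[1.0, 2.0]), vec![3.0, 6.0]);
}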
autograph v0.1.1
Profiling
Profiling currently requires nightly and the "profile" feature. Set the AUTOGRAPH_PROFILE environment variable to 1 or True to produce a table of statistics for the compute passes that are executed.
AUTOGRAPH_PROFILE=1 cargo +nightly run --features profile
Rust GEMM
Improved performance on Neural Network MNIST example (Lenet5) by 5x.
- Implemented in Rust for u32, i32, f32
  - bf16 not yet implemented
- Unrolled loops with crunchy
- Work per thread (1x1, 2x2, 4x4) micro tiles (see the sketch below)
- SplitK variant (256) for small m or n and large k
  - Atomically accumulates with multiple work groups
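As a rough illustration of the 4x4 work-per-thread micro tiles, here is a CPU sketch (not the actual shader): each "thread" owns a 4x4 tile of C kept in registers, so every element of A and B loaded per k step is reused four times.

// Row-major A (m x k), B (k x n), C (m x n); m and n assumed multiples of 4.
fn gemm_micro_tile_4x4(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert!(m % 4 == 0 && n % 4 == 0);
    for row in (0..m).step_by(4) {
        for col in (0..n).step_by(4) {
            // 4x4 accumulator tile held "in registers".
            let mut acc = [[0.0f32; 4]; 4];
            for p in 0..k {
                // Load a 4x1 column of A and a 1x4 row of B once, reuse for all 16 products.
                let a_col: [f32; 4] = core::array::from_fn(|i| a[(row + i) * k + p]);
                let b_row: [f32; 4] = core::array::from_fn(|j| b[p * n + col + j]);
                for i in 0..4 {
                    for j in 0..4 {
                        acc[i][j] += a_col[i] * b_row[j];
                    }
                }
            }
            // Write the finished tile back to C.
            for i in 0..4 {
                for j in 0..4 {
                    c[(row + i) * n + col + j] = acc[i][j];
                }
            }
        }
    }
}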
Tensor
- Added Tensor::ones method.
Neural Networks
- Allowed SGD learning_rate = 1.0
- MeanPool
- Fixed correctness issues:
  - Cross Entropy Loss
  - Sum
- Test accuracy improved to ~99% on Neural Network MNIST example (Lenet5)
Examples
- Added shuffling of training batches
Benchmark
Added Neural Network Benchmark to compare performance with other libraries. Training is now ~2.7x slower than tch (NVIDIA GeForce GTX 1060 with Max-Q Design) with similar test accuracy.
+-----------+------------+---------------+-----------------------+----------------------------------+
| Library | Best Epoch | Best Accuracy | Time To Best Accuracy | Mean Epoch Time to Best Accuracy |
+===========+============+===============+=======================+==================================+
| autograph | 69 | 99.04% | 127.38s | 1.85s |
+-----------+------------+---------------+-----------------------+----------------------------------+
| tch | 32 | 99.12% | 22.03s | 688.31ms |
+-----------+------------+---------------+-----------------------+----------------------------------+
autograph v0.1.0
This is the first release of autograph rebuilt on SPIR-V compute shaders that can be compiled from Rust source with rust-gpu!
Compute Shaders
All computations are implemented in either Rust or GLSL (to be replaced by Rust), and this API is publicly exposed so that external crates can develop their own routines. Shader code targeting SPIR-V is portable and is compiled at runtime for devices supporting the Vulkan, Metal, and DX12 APIs.
Datasets
The library includes the MNIST and Iris datasets to make it easy to get started; these are used in the examples.
Machine Learning
High level traits like Train, Test, and Infer are provided to create a common interface for different algorithms.
KMeans
An implementation of the KMeans classifier, demonstrated in the examples.
Neural Networks
Networks can be constructed as a structure of Layers, including:
- Convolutions
- ReLU
- MaxPool
- Dense
Each of these layers implements the Layer and Forward traits, which can be derived to reduce boilerplate.
#[derive(Layer, Forward, Clone, Debug, Serialize, Deserialize)]
struct Lenet5 {
    #[autograph(layer)]
    conv1: Conv,
    #[autograph(layer)]
    relu1: Relu,
    #[autograph(layer)]
    pool1: MaxPool,
    #[autograph(layer)]
    conv2: Conv,
    #[autograph(layer)]
    relu2: Relu,
    #[autograph(layer)]
    pool2: MaxPool,
    #[autograph(layer)]
    dense1: Dense,
    #[autograph(layer)]
    relu3: Relu,
    #[autograph(layer)]
    dense2: Dense,
    #[autograph(layer)]
    relu4: Relu,
    #[autograph(layer)]
    dense3: Dense,
}
Similarly, backward ops can be defined using the Autograd and Backward traits, where Autograd can be derived in much the same way that Layer is.
#[derive(Autograd)]
struct DenseBackward {
    // Use vertex / optional_vertex for Variables and Parameters
    #[autograph(vertex)]
    input: Variable2,
    #[autograph(vertex)]
    weight: Parameter2,
    #[autograph(optional_vertex)]
    bias: Option<Parameter1>,
}
The intent is that users can write their own custom, modular layers and functions which can be defined from the high level down to custom shader code, all implemented in Rust.
Status
The crate is fairly minimal: implementations are missing for some data types, bf16 is not supported for convolutions and pooling layers, and many functions, such as matrix multiplication, are internal and not publicly exposed. Potential work items:
- Fully support bf16 in Neural Networks, with a nicer means to convert from f32 to bf16 and back for Variables and Parameters.
- Render the backward "graph" using petgraph for visualization and debugging purposes.
- Profiling tools for evaluating key functions / shaders and for improving the engine itself.
- Port GLSL to Rust; rust-gpu barriers are not yet working, and the need for code duplication, particularly for bf16, should be reduced.
- Improve performance, particularly the GEMM implementation.
- Implement more operations and algorithms:
  - MeanPool is implemented but backward is not yet working.
  - Binary ops like addition are easy but not yet implemented due to uncertainty over the API (with regard to Residual layers etc. with more than 2 inputs).
  - SGD with momentum is not yet implemented; implement other optimizers.
- Model parallelism is supported but not tested or optimized. Data parallelism is intended to override Layer::update() to perform an all-reduce (i.e. a mean) over the gradients for each parameter duplicated on several devices prior to the optimization step (a minimal sketch follows this list).
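A minimal sketch of that all-reduce, assuming each device's copy of a parameter's gradient is available as a Vec<f32> (illustrative only, not autograph's API):

// Average each gradient element across replicas and write the mean back to every
// replica, so the subsequent optimization step is identical on all devices.
fn all_reduce_mean(replica_grads: &mut [Vec<f32>]) {
    let n = replica_grads.len() as f32;
    let len = replica_grads[0].len();
    for i in 0..len {
        let mean = replica_grads.iter().map(|g| g[i]).sum::<f32>() / n;
        for g in replica_grads.iter_mut() {
            g[i] = mean;
        }
    }
}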
Contributors
Thank you to those who have contributed to the project!