Releases: charles-r-earp/autograph
v0.2.1
v0.2.0
- Removed async traits and methods.
- Core functionality reimplemented in krnl:
  - Only targets Vulkan, more portable than Metal / DX12.
    - Metal is supported via MoltenVK.
  - GPGPU kernels implemented inline in Rust (see the saxpy sketch below):
    - Kernels can be defined in the same file, near where they are invoked.
    - Modules allow sharing code between host and device.
    - Kernel bindings are type safe, checked at compile time.
    - Simple iterator patterns can be implemented without unsafe.
    - Supports specialization constants provided at runtime.
    - DeviceInfo includes useful properties:
      - Max / default threads per group.
      - Max / min threads per subgroup.
    - With DebugPrintf, kernel panics produce errors on the host.
    - krnlc generates a device crate and invokes spirv-builder.
      - spirv-builder / spirv-tools are compiled once on install.
      - Significantly streamlines and accelerates workflow.
    - Kernels are compressed to reduce package and binary size.
  - Device operations readily execute:
    - Block until kernels / transfers can queue.
    - An operation can be queued while another is executing.
    - Reduced latency, better repeatability, reliability, and performance.
  - Device buffers can be copied by the host if host visible.
  - Large buffer copies are streamed rather than allocating a large temporary:
    - Reuses a few small buffers for transfers.
    - Overlaps host and device copies.
    - Performance significantly closer to CUDA.
    - Also streams between devices.
  - Device buffers can be i32::MAX bytes (~2 GB, up from 256 MB).
  - Scalar / ScalarBufferBase replaces Float / FloatBuffer:
    - Streamlined conversions between buffers.
  - Buffers can be sliced.
  - Supports wasm (without device feature).
- TensorBase and ScalarTensorBase implemented with krnl::BufferBase and krnl::ScalarBufferBase:
  - Streamlined conversions between tensor types.
- Host ops accelerated with rayon.
- Improved and streamlined device gemm kernel.
- Device sum and sum_axis use subgroup reductions for improved performance.
- Replaced Criterion trait with Accuracy / CrossEntropyLoss traits.
- ops::AddAssign implemented by Tensor and Variable.
- Implemented ndarray::linalg::Dot for Tensor and Variable.
- Direct convolution algorithm for better host performance.
- Removed learn::kmeans.
- Redesigned autograd:
  - Autograd replaced with VariableBuilder:
    - Nodes and edges applied when building a Variable.
    - Backward edges are simply f(output_grad) -> input_grad (see the sketch below).
    - Gradients are automatically accumulated.
  - Parameter and Variable are separate types (instead of VertexBase).
    - Parameters can be converted to Variables.
- Redesigned Layer trait:
  - for_each_parameter fn's instead of returning a Vec.
  - Cast layers to a ScalarType.
  - Removed enumeration of child layers.
- Redesigned Forward trait:
  - Generic over input and output type.
- Derive improvements:
  - Removed layer attribute.
  - Supports enums.
  - Fields can be skipped.
- Redesigned Optimizer trait:
  - Added learning rate.
  - Accepts a single parameter instead of a slice.
  - Parameter optimizer::State:
    - Can be serialized / deserialized with serde.
- Simplified Iris dataset.
- MNIST dataset:
  - Replaced downloader with curl.
  - Decompress in parallel with rayon.
MSRV: 1.70.0
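For illustration, here is a minimal inline-kernel sketch adapted from the krnl README; this is a sketch only, and exact macro attributes and builder methods may differ between krnl versions:

use krnl::{anyhow::Result, buffer::{Slice, SliceMut}, macros::module};

#[module]
mod kernels {
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    // An "item" kernel: the simple iterator pattern, no unsafe required.
    #[kernel]
    pub fn saxpy(alpha: f32, #[item] x: f32, #[item] y: &mut f32) {
        *y += alpha * x;
    }
}

// Host side: build the kernel for the slice's device and dispatch it.
fn saxpy(alpha: f32, x: Slice<f32>, y: SliceMut<f32>) -> Result<()> {
    kernels::saxpy::builder()?
        .build(y.device())?
        .dispatch(alpha, x, y)
}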
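The redesigned autograd's "backward edges are simply f(output_grad) -> input_grad" can be pictured as a closure from output gradient to input gradient. The BackwardEdge type below is purely illustrative and is not autograph's API:

// For y = w * x, the edge back to x scales the output gradient by w.
type BackwardEdge = Box<dyn Fn(&[f32]) -> Vec<f32> + Send + Sync>;

fn mul_edge(w: f32) -> BackwardEdge {
    Box::new(move |output_grad| output_grad.iter().map(|g| g * w).collect())
}

fn main() {
    let edge = mul_edge(3.0);
    // input_grad = output_grad * dy/dx = output_grad * w
    assert_eq!(edge(&[1.0, 2.0]), vec![3.0, 6.0]);
}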
autograph v0.1.1
Profiling
Profiling currently requires nightly and the "profile" feature. Set the AUTOGRAPH_PROFILE environment variable to 1 or True to produce a table of statistics for the compute passes that are executed.
AUTOGRAPH_PROFILE=1 cargo +nightly run --features profile
Rust GEMM
Improved performance on Neural Network MNIST example (Lenet5) by 5x.
- Implemented in Rust for u32, i32, f32
  - bf16 not yet implemented
- Unrolled loops with crunchy
- Work per thread (1x1, 2x2, 4x4) micro tiles (see the sketch below)
- SplitK variant (256) for small m or n and large k
  - Atomically accumulates with multiple work groups
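As a rough illustration of the 4x4 work-per-thread micro tiles, here is a CPU sketch (not the actual shader): each "thread" owns a 4x4 tile of C kept in registers, so every element of A and B loaded per k step is reused four times.

// Row-major A (m x k), B (k x n), C (m x n); m and n assumed multiples of 4.
fn gemm_micro_tile_4x4(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert!(m % 4 == 0 && n % 4 == 0);
    for row in (0..m).step_by(4) {
        for col in (0..n).step_by(4) {
            // 4x4 accumulator tile held "in registers".
            let mut acc = [[0.0f32; 4]; 4];
            for p in 0..k {
                // Load a 4x1 column of A and a 1x4 row of B once, reuse for all 16 products.
                let a_col: [f32; 4] = core::array::from_fn(|i| a[(row + i) * k + p]);
                let b_row: [f32; 4] = core::array::from_fn(|j| b[p * n + col + j]);
                for i in 0..4 {
                    for j in 0..4 {
                        acc[i][j] += a_col[i] * b_row[j];
                    }
                }
            }
            // Write the finished tile back to C.
            for i in 0..4 {
                for j in 0..4 {
                    c[(row + i) * n + col + j] = acc[i][j];
                }
            }
        }
    }
}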
Tensor
- Added Tensor::ones method.
Neural Networks
- Allowed SGD learning_rate = 1.0
- MeanPool
- Fixed correctness issues:
  - Cross Entropy Loss
  - Sum
- Test accuracy improved to ~99% on Neural Network MNIST example (Lenet5)
Examples
- Added shuffling of training batches
Benchmark
Added Neural Network Benchmark to compare performance with other libraries. Training is now ~2.7x slower than tch (NVIDIA GeForce GTX 1060 with Max-Q Design) with similar test accuracy.
+-----------+------------+---------------+-----------------------+----------------------------------+
| Library | Best Epoch | Best Accuracy | Time To Best Accuracy | Mean Epoch Time to Best Accuracy |
+===========+============+===============+=======================+==================================+
| autograph | 69 | 99.04% | 127.38s | 1.85s |
+-----------+------------+---------------+-----------------------+----------------------------------+
| tch | 32 | 99.12% | 22.03s | 688.31ms |
+-----------+------------+---------------+-----------------------+----------------------------------+
autograph v0.1.0
This is the first release of autograph rebuilt on SPIR-V compute shaders that can be compiled from Rust source with rust-gpu!
Compute Shaders
All computations are implemented in either Rust or GLSL (to be replaced by Rust), and this API is publicly exposed so that external crates can develop their own routines. Shader code targeting SPIR-V is portable and is compiled at runtime for devices supporting the Vulkan, Metal, and DX12 APIs.
Datasets
The library includes the MNIST and Iris datasets to make it easy to get started; these are used in the examples.
Machine Learning
High level traits like Train, Test, and Infer are provided to create a common interface for different algorithms.
KMeans
An implementation of the KMeans classifier, demonstrated in the examples.
Neural Networks
Networks can be constructed as a structure of Layers, including:
- Convolutions
- ReLU
- MaxPool
- Dense
Each of these layers implements the Layer and Forward traits, which can be derived to reduce boilerplate.
#[derive(Layer, Forward, Clone, Debug, Serialize, Deserialize)]
struct Lenet5 {
    #[autograph(layer)]
    conv1: Conv,
    #[autograph(layer)]
    relu1: Relu,
    #[autograph(layer)]
    pool1: MaxPool,
    #[autograph(layer)]
    conv2: Conv,
    #[autograph(layer)]
    relu2: Relu,
    #[autograph(layer)]
    pool2: MaxPool,
    #[autograph(layer)]
    dense1: Dense,
    #[autograph(layer)]
    relu3: Relu,
    #[autograph(layer)]
    dense2: Dense,
    #[autograph(layer)]
    relu4: Relu,
    #[autograph(layer)]
    dense3: Dense,
}
Similarly, backward ops can be defined using the Autograd and Backward traits, where Autograd can be derived in much the same way that Layer is.
#[derive(Autograd)]
struct DenseBackward {
    // Use vertex / optional_vertex for Variables and Parameters
    #[autograph(vertex)]
    input: Variable2,
    #[autograph(vertex)]
    weight: Parameter2,
    #[autograph(optional_vertex)]
    bias: Option<Parameter1>,
}
The intent is that users can write their own custom, modular layers and functions which can be defined from the high level down to custom shader code, all implemented in Rust.
Status
The crate is fairly minimal: implementations are missing for some data types, bf16 is not supported for convolutions and pooling layers, and many functions, such as matrix multiplication, are internal and not publicly exposed. Potential work items:
- Fully support bf16 in Neural Networks, with a nicer means to convert from f32 to bf16 and back for Variables and Parameters.
- Render the backward "graph" using petgraph for visualization and debugging purposes.
- Profiling tools for evaluating key functions / shaders and for improving the engine itself.
- Port GLSL to Rust; rust-gpu barriers are not yet working, and the need for code duplication, particularly for bf16, should be reduced.
- Improve performance, particularly the GEMM implementation.
- Implement more operations and algorithms:
  - MeanPool is implemented but backward is not yet working.
  - Binary ops like addition are easy but not yet implemented due to uncertainty over the API (with regard to Residual layers etc. with more than 2 inputs).
  - SGD with momentum is not yet implemented; implement other optimizers.
- Model parallelism is supported but not tested or optimized. Data parallelism is intended to override Layer::update() to perform an all-reduce (i.e. a mean) over the gradients for each parameter duplicated on several devices prior to the optimization step (a minimal sketch follows this list).
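A minimal sketch of that all-reduce, assuming each device's copy of a parameter's gradient is available as a Vec<f32> (illustrative only, not autograph's API):

// Average each gradient element across replicas and write the mean back to every
// replica, so the subsequent optimization step is identical on all devices.
fn all_reduce_mean(replica_grads: &mut [Vec<f32>]) {
    let n = replica_grads.len() as f32;
    let len = replica_grads[0].len();
    for i in 0..len {
        let mean = replica_grads.iter().map(|g| g[i]).sum::<f32>() / n;
        for g in replica_grads.iter_mut() {
            g[i] = mean;
        }
    }
}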
Contributors
Thank you to those who have contributed to the project!