Crash on GPU in Roberta branch #727

Open
dlwh opened this issue Sep 12, 2024 · 0 comments
dlwh commented Sep 12, 2024

https://github.com/JulienDarve/levanter/tree/broken_gpu

F0912 11:59:01.941641  741969 shape_tree.cc:54] Check failed: result->children_start_id >= 0 (-1 vs. 0)
*** Check failure stack trace: ***
    @     0x7f0e2365c1d4  absl::lts_20230802::log_internal::LogMessage::SendToLog()
    @     0x7f0e2365c0d4  absl::lts_20230802::log_internal::LogMessage::Flush()
    @     0x7f0e2365c579  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7f0e2076c1f1  xla::internal::IndexTable::operator[]()
    @     0x7f0e204fd1d4  xla::HloDataflowAnalysis::GetValueSet()
    @     0x7f0e2025905c  xla::BufferAssignment::GetUniqueSlice()
    @     0x7f0e200667ce  xla::gpu::GetAllocationSlice()
    @     0x7f0e1f270b3b  xla::gpu::(anonymous namespace)::GetResultSlice()
    @     0x7f0e1f274217  xla::ShapeUtil::ForEachSubshapeWithStatus<>()::{lambda()#1}::operator()()
    @     0x7f0e1f274088  xla::ShapeUtil::ForEachMutableSubshapeWithStatusHelper<>()
    @     0x7f0e1f27410f  xla::ShapeUtil::ForEachMutableSubshapeWithStatusHelper<>()
    @     0x7f0e1f26bbed  xla::gpu::DynamicSliceFusion::Emit()
    @     0x7f0e1c0cc4a4  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x7f0e1c0d7135  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x7f0e1c0b9cbe  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f0e1be94d2a  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f0e1be75920  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f0e1be784fe  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f0e1be39982  xla::Service::BuildExecutable()
    @     0x7f0e1be01f55  xla::LocalService::CompileExecutables()
    @     0x7f0e1bdf5a14  xla::LocalClient::Compile()
    @     0x7f0e1bd9adcb  xla::PjRtStreamExecutorClient::CompileInternal()
    @     0x7f0e1bd9be7e  xla::PjRtStreamExecutorClient::Compile()
    
Stack (most recent call first):
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/compiler.py", line 260 in backend_compile
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/profiler.py", line 333 in wrapper
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/compiler.py", line 654 in _compile_and_write_cache
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/compiler.py", line 426 in compile_or_get_cached
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/interpreters/pxla.py", line 2639 in _cached_compilation
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/interpreters/pxla.py", line 2827 in from_hlo
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/interpreters/pxla.py", line 2313 in compile
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 1651 in _pjit_call_impl_python
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 1721 in call_impl_cache_miss
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 1739 in _pjit_call_impl
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/core.py", line 949 in process_primitive
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/core.py", line 443 in bind_with_trace
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/core.py", line 2782 in bind
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 190 in _python_pjit_helper
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/pjit.py", line 332 in cache_miss
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/jax/_src/traceback_util.py", line 180 in reraise_with_filtered_traceback
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/haliax/partitioning.py", line 337 in _call
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/equinox/_module.py", line 1078 in __call__
  File "/sailhome/jdarve/miniconda3/envs/levanter/lib/python3.10/site-packages/haliax/partitioning.py", line 261 in __call__
  File "/sailhome/jdarve/levanter/src/levanter/trainer.py", line 364 in train_step
  File "/sailhome/jdarve/levanter/src/levanter/trainer.py", line 380 in training_steps
  File "/sailhome/jdarve/levanter/src/levanter/trainer.py", line 397 in train
  File "/sailhome/jdarve/levanter/src/levanter/main/train_mlm.py", line 215 in main
  File "/sailhome/jdarve/levanter/src/levanter/config.py", line 84 in wrapper_inner
  File "/sailhome/jdarve/levanter/src/levanter/main/train_mlm.py", line 218 in <module>

Extension modules: jaxlib.cpu_feature_guard, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, zstandard.backend_c, pyarrow.lib, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, PIL._imaging, kiwisolver._cext, 
regex._regex (total: 86)
Aborted (core dumped)
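
The abort happens inside XLA's GPU backend while it compiles the `train_step` computation (`xla::gpu::DynamicSliceFusion::Emit`), so an HLO dump of that module would likely help narrow this down. A minimal sketch, assuming the run can be relaunched with a modified environment: it adds XLA's `--xla_dump_to` flag to `XLA_FLAGS` and keeps any flags already set. The helper name `env_with_xla_dump` and the dump directory are hypothetical, not part of levanter.

```python
import os


def env_with_xla_dump(dump_dir, base_env=None):
    """Return a copy of the environment with --xla_dump_to set in XLA_FLAGS."""
    env = dict(os.environ if base_env is None else base_env)
    # Keep flags that are already set, but drop any earlier --xla_dump_to
    # so the directory passed here is the one XLA uses.
    flags = [
        flag
        for flag in env.get("XLA_FLAGS", "").split()
        if not flag.startswith("--xla_dump_to=")
    ]
    flags.append(f"--xla_dump_to={dump_dir}")
    env["XLA_FLAGS"] = " ".join(flags)
    return env


if __name__ == "__main__":
    # "/tmp/xla_dump" is a placeholder directory chosen for illustration.
    env = env_with_xla_dump("/tmp/xla_dump")
    print(env["XLA_FLAGS"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when relaunching `src/levanter/main/train_mlm.py` with the same arguments as the failing run. The pre-optimization HLO should then land in the dump directory even though the process aborts later in compilation.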