
Optimize tokens throws seg fault #454

Open
tclements-usgs opened this issue Jan 22, 2025 · 4 comments
Labels
bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@tclements-usgs

🐛 Bug

Tokenizing a dataset for LLM pre-training using optimize with more than one worker leads to a segmentation fault when loading the dataset with a StreamingDataLoader and batch_size > 1. I think this might be a follow-on to #366, in that StreamingDataLoader errors are linked to the inputs to optimize.

To Reproduce

Steps to reproduce the behavior:

  1. Run optimize with item_loader=TokensLoader() and num_workers>=2
  2. Use batch_size>1 with StreamingDataLoader
  3. StreamingDataLoader throws a segmentation fault

Interestingly, streaming tokens works if:

  • num_workers>1 in optimize and batch_size==1 when loading.
  • num_workers<2 in optimize and batch_size>=1 when loading.

MWE that reproduces the error:

Code sample
import os 
import tempfile 

from litdata import optimize, TokensLoader, StreamingDataset, StreamingDataLoader
import torch 
from tqdm import tqdm 

def tokenize_fn(idx): 
    yield torch.randint(low=0, high=127, size=(8192,)) 

def main(output_dir, num_workers=0, batch_size=1):
    outputs = optimize(
            fn=tokenize_fn, 
            inputs=list(range(1000)),
            output_dir=output_dir,
            chunk_size=(2049 * 8012),
            item_loader=TokensLoader(),
            num_workers=num_workers,
    )
    print(os.listdir(output_dir))

    dataset = StreamingDataset(
        input_dir=output_dir,
        item_loader=TokensLoader(block_size=2049),
        shuffle=True,
        drop_last=True,
    )
    dataloader = StreamingDataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0, drop_last=True)

    # load data
    for data in tqdm(dataloader):
        pass
    
if __name__ == "__main__": 

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=0, batch_size=1) # works 

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=2, batch_size=1) # works 
    
    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=0, batch_size=8) # works 

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=2, batch_size=8) # creates a seg fault 

Here's the error I get after the segmentation fault:

UserWarning: resource_tracker: There appear to be 19 leaked semaphore objects to clean up at shutdown

which seems to point at a multiprocessing cleanup problem.
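Since the crash is a hard segfault, the Python traceback is lost. As a debugging tip (not part of the original report), running the MWE with the stdlib faulthandler enabled makes the interpreter dump each thread's Python stack on SIGSEGV, which usually narrows down the offending frame:

```python
# Debugging aid: dump Python tracebacks on fatal signals such as SIGSEGV.
# Equivalent to running the script with `python -X faulthandler ...`
# or setting the PYTHONFAULTHANDLER=1 environment variable.
import faulthandler

faulthandler.enable()
assert faulthandler.is_enabled()
```

Placing this at the top of the MWE should turn the silent crash into a per-thread traceback on stderr.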

Expected behavior

Multi-worker dataset creation should lead to smooth dataset streaming with all batch sizes.

Additional context

  • LitData Version: 0.2.36
  • PyTorch Version: 2.5.1
  • OS: Errored on macOS Sonoma and Ubuntu 22.04
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.12
@tclements-usgs tclements-usgs added bug Something isn't working help wanted Extra attention is needed labels Jan 22, 2025

Hi! thanks for your contribution!, great first issue!

@tchaton
Collaborator

tchaton commented Jan 22, 2025

Hey @tclements-usgs, if you want to try to debug it, it should be around here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/item_loader.py#L368. In theory, we should close the memmap; it seemed there were some more issues doing so, but that would be the right way to fix it.
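For anyone picking this up, here is a minimal, hypothetical sketch of deterministically closing a NumPy memmap. The file path and variable names are illustrative, not litdata's actual code, and NumPy exposes no public close() on memmap, so this goes through the private _mmap handle:

```python
import os
import tempfile

import numpy as np

# Create a small file to map (stands in for a litdata chunk file).
path = os.path.join(tempfile.mkdtemp(), "chunk.bin")
np.arange(10, dtype=np.int64).tofile(path)

arr = np.memmap(path, dtype=np.int64, mode="r")
value = int(arr[3])  # read while the mapping is alive

# Close the underlying mmap deterministically instead of relying on GC.
# Mappings leaked across forked dataloader workers are a classic source
# of shutdown-time crashes and resource_tracker warnings.
arr._mmap.close()
del arr
```

Note that closing the map while other views of the array are still alive would raise, which may be the "more issues" mentioned above.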

@tclements-usgs
Author

Great, thanks - I'll have a look!

@tchaton
Collaborator

tchaton commented Jan 22, 2025

Feel free to make a PR if you fix it ;)
