
Optimize tokens throws seg fault #454

Open
tclements-usgs opened this issue Jan 22, 2025 · 4 comments
Labels
bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@tclements-usgs

🐛 Bug

Tokenizing a dataset for LLM pre-training using optimize with more than one worker leads to a segmentation fault when loading the dataset with a StreamingDataLoader and batch_size > 1. I think this might be a follow-on to #366, in that StreamingDataLoader errors are linked to the inputs to optimize.

To Reproduce

Steps to reproduce the behavior:

  1. Run optimize with item_loader=TokensLoader() and num_workers>=2
  2. Use batch_size>1 with StreamingDataLoader
  3. StreamingDataLoader throws a segmentation fault

Interestingly, streaming tokens works if:

  • num_workers>1 in optimize and batch_size==1 when loading.
  • num_workers<2 in optimize and batch_size>=1 when loading.

MWE that reproduces the error:

Code sample
import os 
import tempfile 

from litdata import optimize, TokensLoader, StreamingDataset, StreamingDataLoader
import torch 
from tqdm import tqdm 

def tokenize_fn(idx): 
    yield torch.randint(low=0, high=127, size=(8192,)) 

def main(output_dir, num_workers=0, batch_size=1):
    outputs = optimize(
            fn=tokenize_fn, 
            inputs=list(range(1000)),
            output_dir=output_dir,
            chunk_size=(2049 * 8012),
            item_loader=TokensLoader(),
            num_workers=num_workers,
    )
    print(os.listdir(output_dir))

    dataset = StreamingDataset(
        input_dir=output_dir,
        item_loader=TokensLoader(block_size=2049),
        shuffle=True,
        drop_last=True,
    )
    dataloader = StreamingDataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0, drop_last=True)

    # load data
    for data in tqdm(dataloader):
        pass
    
if __name__ == "__main__": 

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=0, batch_size=1) # works 

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=2, batch_size=1) # works 
    
    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=0, batch_size=8) # works 

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=2, batch_size=8) # creates a seg fault 

Here's the error I get after the segmentation fault:

UserWarning: resource_tracker: There appear to be 19 leaked semaphore objects to clean up at shutdown

which seems to point at a multiprocessing cleanup problem.
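Since the crash is a hard segfault, the Python traceback is lost. As a debugging tip (not part of the original report), running the MWE with the stdlib faulthandler enabled makes the interpreter dump each thread's Python stack on SIGSEGV, which usually narrows down the offending frame:

```python
# Debugging aid: dump Python tracebacks on fatal signals such as SIGSEGV.
# Equivalent to running the script with `python -X faulthandler ...`
# or setting the PYTHONFAULTHANDLER=1 environment variable.
import faulthandler

faulthandler.enable()
assert faulthandler.is_enabled()
```

Placing this at the top of the MWE should turn the silent crash into a per-thread traceback on stderr.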

Expected behavior

Multi-worker dataset creation should lead to smooth dataset streaming with all batch sizes.

Additional context

  • LitData Version: 0.2.36
  • PyTorch Version: 2.5.1
  • OS: Errored on macOS Sonoma and Ubuntu 22.04
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.12
@tclements-usgs tclements-usgs added bug Something isn't working help wanted Extra attention is needed labels Jan 22, 2025

Hi! thanks for your contribution!, great first issue!

@tchaton
Collaborator

tchaton commented Jan 22, 2025

Hey @tclements-usgs, if you want to try to debug it, it should be around here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/item_loader.py#L368. In theory, we should close the memmap; it seemed there were some more issues doing so, but that would be the right way to fix it.
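For anyone picking this up, here is a minimal, hypothetical sketch of deterministically closing a NumPy memmap. The file path and variable names are illustrative, not litdata's actual code, and NumPy exposes no public close() on memmap, so this goes through the private _mmap handle:

```python
import os
import tempfile

import numpy as np

# Create a small file to map (stands in for a litdata chunk file).
path = os.path.join(tempfile.mkdtemp(), "chunk.bin")
np.arange(10, dtype=np.int64).tofile(path)

arr = np.memmap(path, dtype=np.int64, mode="r")
value = int(arr[3])  # read while the mapping is alive

# Close the underlying mmap deterministically instead of relying on GC.
# Mappings leaked across forked dataloader workers are a classic source
# of shutdown-time crashes and resource_tracker warnings.
arr._mmap.close()
del arr
```

Note that closing the map while other views of the array are still alive would raise, which may be the "more issues" mentioned above.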

@tclements-usgs
Author

Great, thanks - I'll have a look!

@tchaton
Collaborator

tchaton commented Jan 22, 2025

Feel free to make a PR if you fix it ;)
