guide to run the code #11

Open
Abolfazl-kr opened this issue Feb 12, 2024 · 2 comments

Comments

@Abolfazl-kr

Thanks for your effort. I'm a little confused about the process, so correct me if I'm wrong. First, we run block_expansion.py to create the expanded model. Then we clone the repository at https://github.com/hills-code/open-instruct.git@7c2b14d and run finetune_codealpaca.sh. Is this correct?

I also ran into some problems with this process in your repo:
1- After running block_expansion.py, a single 14.5 GB pytorch_model.bin file is created, with no pytorch_model.bin.index.json or any other files. However, the Hugging Face model has two shards plus all the extra files needed, such as pytorch_model.bin.index.json, special_tokens_map.json, generation_config.json, and config.json. How can we create them?

2- I want to pretrain the model on my own raw text. What should I do? My data is not one of the datasets you mention, such as SlimOrca and ....
How can I transform my dataset to work with your code?

@hills-code
Collaborator

  1. You do not need pytorch_model.bin.index.json. For the other necessary files, you can just copy them from the original base model.
  2. The code can load a dataset directly from the Hugging Face Hub using datasets.load_dataset('YOUR_DATASET'). However, if you want to do pretraining, you may need to revise the tokenize function, since it is written for SFT and masks the instruction labels during the process.
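
Point 1 above could be done with a small helper like the one below. This is only a sketch, not part of the repo: the function name `copy_aux_files`, the directory layout, and the `num_hidden_layers` argument are all assumptions made for illustration. It copies every non-weight file (tokenizer files, config.json, generation_config.json, and so on) from a local copy of the base model into the folder holding the expanded pytorch_model.bin. Since block expansion adds decoder layers, the copied config.json would presumably also need its `num_hidden_layers` field updated, so the helper optionally rewrites it.

```python
import json
import shutil
from pathlib import Path

# Weight files and their shard index are skipped: the expanded checkpoint
# already has its own weights, and a single-file checkpoint needs no index.
WEIGHT_SUFFIXES = (".bin", ".safetensors", ".index.json")


def copy_aux_files(base_dir, expanded_dir, num_hidden_layers=None):
    """Copy tokenizer/config files from base_dir into expanded_dir.

    Hypothetical helper. Returns the sorted names of the copied files.
    """
    base, out = Path(base_dir), Path(expanded_dir)
    out.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in base.iterdir():
        if not src.is_file() or src.name.endswith(WEIGHT_SUFFIXES):
            continue
        shutil.copy2(src, out / src.name)
        copied.append(src.name)

    # Assumption: the expansion changed the layer count, so the copied
    # config.json has to be updated to match the expanded checkpoint.
    config_path = out / "config.json"
    if num_hidden_layers is not None and config_path.exists():
        config = json.loads(config_path.read_text())
        config["num_hidden_layers"] = num_hidden_layers
        config_path.write_text(json.dumps(config, indent=2))

    return sorted(copied)
```

A call might look like `copy_aux_files("path/to/base_model", "path/to/expanded_model", num_hidden_layers=N)`, where both paths are placeholders and `N` is the layer count after expansion. If sharded weights are wanted anyway, loading the expanded model with transformers and calling `save_pretrained` with a `max_shard_size` value should write the shards together with the index file.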

@kiran-coditation

Hi @Abolfazl-kr, were you able to pretrain after block expansion? If so, could you please guide me through the process?
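
For pretraining on raw text, the change to the tokenize function mentioned earlier in the thread might look roughly like the sketch below. It is not the repo's code: the name `encode_for_pretraining`, the `text` field name, and the `max_seq_length` argument are assumptions. The idea is that an SFT tokenize function masks the instruction tokens in the labels, while for plain language-model pretraining every token is a target, so the labels are simply a copy of the input ids.

```python
def encode_for_pretraining(example, tokenizer, max_seq_length, text_field="text"):
    """Tokenize one raw-text example for causal language-model pretraining.

    Hypothetical replacement for an SFT-style tokenize function.
    `tokenizer` is assumed to behave like a Hugging Face tokenizer: calling
    it returns a mapping with "input_ids" and "attention_mask".
    """
    tokens = tokenizer(
        example[text_field],
        truncation=True,
        max_length=max_seq_length,
    )
    input_ids = list(tokens["input_ids"])
    return {
        "input_ids": input_ids,
        "attention_mask": list(tokens["attention_mask"]),
        # No instruction masking here: every token contributes to the loss.
        "labels": input_ids.copy(),
    }
```

With the datasets library, a plain text file can be loaded via `datasets.load_dataset("text", data_files="my_corpus.txt")` (the file name is a placeholder), which yields a `text` column. The function can then be applied with `dataset.map(lambda ex: encode_for_pretraining(ex, tokenizer, 2048), remove_columns=["text"])`, where 2048 is just an example length. Packing several documents into one sequence and appending an end-of-sequence token are common refinements that this sketch leaves out.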

3 participants