Below are some text-processing steps worth considering before an NLP task. It is not necessary to do all of them; which ones you apply depends on the task at hand (see the sketch after this list):
- String Manipulation using Regex
- Tokenization
- Stemming & Lemmatization
- Removing Stopwords
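For example, a quick preprocessing pass over one sentence could look like the following sketch. NLTK is my assumption here; the list above names the steps but no particular library.

```python
# A minimal preprocessing sketch, assuming the NLTK library.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The striped bats were hanging on their feet, eating 3 bugs!"

# 1. String manipulation with regex: lowercase and strip non-letters.
cleaned = re.sub(r"[^a-z\s]", "", text.lower())

# 2. Tokenization: split the cleaned string into word tokens.
tokens = nltk.word_tokenize(cleaned)

# 3. Stopword removal: drop very common words like "the" and "on".
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 4a. Stemming: crude suffix chopping (e.g. eating -> eat).
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# 4b. Lemmatization: dictionary-based base forms (e.g. feet -> foot).
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(stems)   # e.g. ['stripe', 'bat', 'hang', 'feet', 'eat', 'bug']
print(lemmas)  # e.g. ['striped', 'bat', 'hanging', 'foot', 'eating', 'bug']
```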
We need to represent language mathematically, i.e. given a corpus, you need to convert it into numerical form. This mathematical representation is called an embedding (or context) and the process is called representation learning. Why do this? Because computers understand only numbers, not text. We can do this in several ways:
- Via Sentence Embedding
- Via Word Embedding
- Via Character Embedding
- Via Subword Embedding (everyone uses this; see the sketch below)
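To make the subword idea concrete, here is a minimal sketch. The Hugging Face `transformers` library and the `bert-base-uncased` checkpoint are my assumptions; the text above names no specific tooling.

```python
# Subword tokenization sketch, assuming the Hugging Face `transformers`
# library and the `bert-base-uncased` checkpoint (WordPiece vocabulary).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare word is split into frequent subword pieces, so the vocabulary
# stays small while no word is ever truly out-of-vocabulary.
print(tokenizer.tokenize("embeddings"))
# e.g. ['em', '##bed', '##ding', '##s']

# Each piece maps to an integer id: this is the numerical form that
# the model's embedding layer actually consumes.
print(tokenizer.encode("embeddings", add_special_tokens=False))
```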
- With foundation models that can perform multiple tasks, you just need to do prompting to solve a single downstream task (see the sketch below).
- But prompting often does not work well; this is called the HALLUCINATION PROBLEM. The model sometimes gives wrong answers to prompted questions (in cases where such a task was not covered during the training of the multitask foundation model).
- To solve this hallucination problem you can fine-tune the foundation models for specific tasks. More about this here
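For instance, solving a translation task purely by prompting might look like the sketch below. The Hugging Face `transformers` library and the `google/flan-t5-small` checkpoint are my assumptions; the notes above do not name any specific model or library.

```python
# Zero-shot prompting sketch, assuming the Hugging Face `transformers`
# library and the instruction-tuned `google/flan-t5-small` checkpoint.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")

# No task-specific training happens here: the downstream task
# (translation) is specified entirely inside the prompt.
prompt = "Translate English to German: How old are you?"
print(generator(prompt)[0]["generated_text"])
# e.g. 'Wie alt sind Sie?'
```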
- Once OpenAI made ChatGPT, they found that if asked about harmful activities like ‘tell me techniques to make rat poison at home’, it would answer such questions too!! If provoked, it would also use curse words / …. Hence it was lacking HUMAN ETHICS and, in the wrong hands, could lead to bigger concerns. So researchers wanted to ALIGN the LLM outputs with human preferences.
- This was called the PREFERENCE PROBLEM
- Methods to solve the preference problem are called preference alignment. There are two ways to do so:
- Fine-tuning the LLM with human preferences using Reinforcement Learning – the RLHF Algorithm
- Fine-tuning the LLM with human preferences using Supervised Learning – the DPO Algorithm (a loss sketch follows this list)
- More information available here
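To make the DPO side concrete, below is a minimal sketch of the DPO loss in PyTorch. The framework choice and all variable names are my assumptions; the summed log-probabilities of the chosen and rejected answers under the policy and under a frozen reference model are assumed to be computed elsewhere.

```python
# Minimal DPO loss sketch in PyTorch (framework is my assumption).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: how far the policy has drifted from the
    # frozen reference model on each answer.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the chosen answer's reward above the rejected one's with a
    # sigmoid loss: plain supervised learning, no reward model or RL loop.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]),
                torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]),
                torch.tensor([-13.5, -9.2]))
print(loss)  # scalar; lower means the policy prefers the chosen answers
```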