Challenge 24 - Using transformer models to develop a search engine for datasets, charts, and documentation #9

EsperanzaCuartero · 2023-02-24T15:47:24Z

Challenge 24 - Using transformer models to develop a search engine for datasets, charts, and documentation

Stream 2 - Machine Learning for Earth Science

Goal

Develop a natural language search engine to improve the discoverability of ECMWF datasets, graphical products, and documentation.

Mentors and skills

Mentors: Sylvie Lamy-Thepaut, Baudouin Raoult, Helen Setchell, Myranda Uselton Shirk
Skills required:
- Python
- Machine Learning
- Data Science
- Possible extra: Confluence plugins and macros (Java, Velocity)

Note: Only nationals or residents from the ECMWF Member States and Co-operating States are eligible to participate (see Terms and Conditions).

Challenge description

It is difficult for users to find ECMWF data, whether they use external or internal searches. This remains true even though we have added Google structured data to our dataset pages, because our datasets have only limited content and metadata.

There is editorial inconsistency in our documentation and a lot of it! The data and charts content has recently been reviewed and rewritten, but it can still be difficult for users to find the documentation they need.

Data/System to use

Transformer machine learning model in Python
HuggingFace Python library
ECMWF chart discovery API (can be extended if needed)
ECMWF dataset API (in development)
ECMWF dataset DOIs (to investigate)
Confluence content API
A city database

Solution

An ML-based search engine presents users with a simple free text search box into which they can type natural language search terms and questions. The engine then shows a list of matching results, selected by the ML search system.

An example user search might be "what data do you have for Oslo rainfall in 1963?"
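A minimal sketch of how such a query could be ranked against catalogue entries by vector similarity. It is an illustration only: `embed` is a toy bag-of-words stand-in for a transformer encoder (a real system would presumably swap in sentence embeddings from a HuggingFace model), and the catalogue entries in `DOCUMENTS` are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a transformer encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, documents, top_k=3):
    """Return the top_k (score, title) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), title) for title, text in documents.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical catalogue entries, invented for illustration only.
DOCUMENTS = {
    "Historical rainfall data": "historical rainfall data for European cities such as Oslo",
    "Wind forecast charts": "charts of forecast wind speed and direction",
    "User documentation": "documentation on how to request and download data",
}

results = search("what data do you have for Oslo rainfall in 1963?", DOCUMENTS)
for score, title in results:
    print(f"{score:.2f}  {title}")
```

Because only `embed` knows how text becomes a vector, replacing it with a transformer model would leave `cosine` and `search` unchanged.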

Consideration should be given to users using other languages to search and read results.

It should be possible to weight results by, for example, population or proximity to ECMWF.
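One way this weighting could work, sketched under assumptions: when a place name is ambiguous, candidates from the city database are ordered by a score that rewards population and penalises distance from ECMWF. The scoring formula, the `w_pop` and `w_dist` parameters, the approximate coordinates of the Reading site, and the rough city figures are all illustrative choices, not part of the challenge.

```python
import math

# Approximate coordinates of ECMWF's Reading site (an assumption for this sketch).
ECMWF_LAT, ECMWF_LON = 51.42, -0.94


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def weight(city, w_pop=1.0, w_dist=1.0):
    """Hypothetical score: larger population raises it, larger distance lowers it."""
    pop_term = math.log10(city["population"] + 1)
    dist_km = haversine_km(city["lat"], city["lon"], ECMWF_LAT, ECMWF_LON)
    return w_pop * pop_term - w_dist * math.log10(dist_km + 1)


def rank_cities(candidates, **weights):
    """Order candidate cities from most to least likely intended."""
    return sorted(candidates, key=lambda c: weight(c, **weights), reverse=True)


# Rough, illustrative figures for two cities that share a name.
candidates = [
    {"name": "Paris", "country": "US", "population": 25_000, "lat": 33.66, "lon": -95.56},
    {"name": "Paris", "country": "FR", "population": 2_100_000, "lat": 48.86, "lon": 2.35},
]

ranked = rank_cities(candidates)
print([c["country"] for c in ranked])
```

Setting `w_dist=0` ranks by population alone, and `w_pop=0` ranks by proximity alone, so the two criteria can be tuned independently.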

A possible extra for this project - time permitting - could be to write a Confluence plugin or macro, with parameters for search scope.

Implementation. Possible milestones

Explore the functionalities of HuggingFace Transformers.
Explore ECMWF's datasets and charts and find their metadata.
Set up a test system.

We will devise a reference set of questions, based on popular real user enquiries, to test search results before and after implementation.
The search box could be used on the datasets search page, the chart search page, the chart browser search, and the support portal, and possibly as a replacement for the Confluence search features.
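The before-and-after comparison on a reference set of questions could be scored along these lines. This is a sketch under assumptions: the metrics (mean reciprocal rank and hit rate at `k`), the function names, and the tiny reference set with canned results are all hypothetical, chosen only to show the shape of such a test.

```python
def reciprocal_rank(results, relevant):
    """1/rank of the first relevant result, or 0.0 if none is returned."""
    for rank, item in enumerate(results, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(search_fn, reference_set, k=5):
    """Score a search function against (question, relevant_items) pairs."""
    if not reference_set:
        raise ValueError("reference_set must not be empty")
    rr_total, hits = 0.0, 0
    for question, relevant in reference_set:
        rr = reciprocal_rank(search_fn(question)[:k], relevant)
        rr_total += rr
        hits += rr > 0  # counts questions with at least one relevant hit in the top k
    n = len(reference_set)
    return {"mrr": rr_total / n, "hit_rate": hits / n}


# Hypothetical reference questions and the page each one should surface.
REFERENCE_SET = [
    ("rainfall data for Oslo", {"page-a"}),
    ("how do I download a chart", {"page-d"}),
]

# Canned results standing in for the existing search and the new engine.
BASELINE = {"rainfall data for Oslo": ["page-b", "page-a"],
            "how do I download a chart": ["page-c"]}
CANDIDATE = {"rainfall data for Oslo": ["page-a"],
             "how do I download a chart": ["page-x", "page-d"]}

before = evaluate(lambda q: BASELINE[q], REFERENCE_SET)
after = evaluate(lambda q: CANDIDATE[q], REFERENCE_SET)
print("before:", before)
print("after: ", after)
```

Running the same reference set through both search functions gives two comparable score dictionaries, which is the comparison the milestone describes.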

Additional comment

We hope to mentor this project in cooperation with Myranda Uselton Shirk at NOAA, who provided the following presentation from AMS, which greatly inspired this proposal.

Laudrup21 · 2023-03-27T10:07:53Z

Hi all, I am really interested in this project and want to work on it.

Laudrup21 · 2023-03-27T10:41:41Z

But I don't really understand the deliverable and the data that need to be used.

kiden · 2023-03-28T16:33:55Z

Thanks for your interest in our challenge!

The deliverable is as described in the 'Solution' section of the challenge - a free text search box into which a user can type a natural language search question or phrase. Results in the form of answers or links should appear on submission. Users should ideally be asked for feedback on the usefulness of the results.

The data would be a mix of our Confluence API (for documentation), our charts API, our data and parameters APIs, as well as the data from the search engine itself. During the course of the project we may discover other sources of data, maybe even external ones, that would improve the search results.

Existing searches to review for replacement with the new search engine are the ones mentioned under the 'Implementation. Possible milestones' section (see links in last paragraph). Our parameter search is also relevant.

We also recommend watching the presentation by one of the mentors, Myranda - the link is given in the 'Additional comment' section of the challenge.

We hope this provides enough information for you. Of course we are also interested in your ideas and don't want to be too prescriptive!

Laudrup21 · 2023-03-28T16:43:02Z

Okay, thank you. Could you give me an estimate of the time you expect me to work per week?

sylvielamythepaut · 2023-03-29T07:58:30Z

Hi,
Not easy to say, but previous participants mentioned half a day to one day a week. We will meet regularly (once a week) to discuss progress and the next phase. We really hope that this work will give us the opportunity to explore this area, and help us.

EsperanzaCuartero added the Stream 2 Machine Learning for Earth Sciences label Feb 24, 2023

EsperanzaCuartero assigned kiden, sylvielamythepaut, EsperanzaCuartero and trakasa Feb 24, 2023

EsperanzaCuartero changed the title ~~Challenge 9 - ML dataset, graphical product, and document search~~ Challenge 24 - ML dataset, graphical product, and document search Feb 27, 2023

EsperanzaCuartero assigned myrandaGoesToSpace Feb 28, 2023

EsperanzaCuartero changed the title ~~Challenge 24 - ML dataset, graphical product, and document search~~ Challenge 24 - Using transformer models to develop a search engine for datasets, charts, and documentation Mar 28, 2023

RubenRT7 mentioned this issue Feb 16, 2024

Challenge 24 - Knowledge Graph Generation for Enhanced Chatbot and Scientific Literature Synthesis ECMWFCode4Earth/challenges_2024#7

Open
