
Training Artificial Intelligence Experts

Episode Transcript

This is the Discovery Files podcast from the U.S. National Science Foundation.

Artificial intelligence technologies are accelerating research, expanding America's workforce and transforming society as AI becomes a larger part of the world around us.

Advancements in computer vision, natural language processing, and speech recognition have led to widespread use of learning based systems in AI that interact frequently with humans in highly personalized contexts.

Humans are using strategies such as reinforcement learning to train AI systems, and using AI systems to upskill the next generation workforce.

We're joined by Mingyi Hong, a professor of electrical and computer engineering at the University of Minnesota.

His optimization for artificial intelligence lab is designing solutions for problems in data science, machine learning, and AI.

Professor Hong, thank you so much for joining me today.

Thank you, Nathan.

I want to ask you about reinforcement learning.

What are some of the key components of this approach?

So reinforcement learning, abbreviated as RL, is a branch of machine learning where agents learn to make decisions by interacting with an environment.

So the key idea here is to learn through trial and error, guided by some kind of reward, right?

So therefore the key components here are: first, the agent, a learner, a decision maker who is trying to learn.

You're interacting with an environment.

Then the state, the current situation you're in, describing what the environment looks like and where you currently are.

And then the action: what can you do, a choice the agent can make at this particular time.

And the reward: after each action has been made, what are the incentives or penalties that I receive?

And then the policy, which is the strategy that the agent eventually learns.
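
To make those pieces concrete, here is a toy Python sketch of the loop he describes; every name and number in it is invented for illustration, and it is not code from the interview or from any particular RL library.

```python
import random

# Toy sketch of the pieces just described: agent, environment, state, action,
# reward, policy. Everything here is hypothetical and kept deliberately simple.

class LineWorld:
    """The environment: the agent starts at position 0 and tries to reach 5."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                        # action: move -1 or +1
        self.state = max(-5, min(5, self.state + action))
        reward = 1.0 if self.state == 5 else 0.0   # reward: incentive for reaching the goal
        return self.state, reward, self.state == 5

policy = {}  # the policy: the strategy the agent eventually learns (state -> action)

def choose_action(state):
    """Trial and error: usually follow the current policy, sometimes explore."""
    if state in policy and random.random() > 0.2:
        return policy[state]
    return random.choice([-1, 1])

env = LineWorld()
for episode in range(500):
    state, done, trajectory = env.reset(), False, []
    for _ in range(50):                            # cap the episode length
        action = choose_action(state)
        trajectory.append((state, action))
        state, reward, done = env.step(action)
        if done:
            break
    if done:                                       # the episode reached the goal:
        for s, a in trajectory:                    # reinforce the actions that got us there
            policy[s] = a
```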

What would be an example of a reward for a machine learning program?

So let me give you an example, teaching a robot to walk.

This is a very typical machine learning problem, where you imagine you have a four-legged robot that needs to learn how to walk across a room.

So the robot doesn't know how to move initially.

So it will just try out different sequences of movements.

And then sometimes some may fail, some may not.

But every time it moves closer to its destination without failing, it receives a positive reward, let's say plus one, or something positive, right?

If it tips over, it gets a penalty.

So as you're gathering this reward, your model is trying to adapt, trying to optimize, trying to increase the reward.

Oh, I see that if I do this, I get a better reward, then I should go towards that.

So over time, using this kind of approach, trial and error and then improving the model, the robot will learn which sequence of actions leads to higher reward, and then it will effectively teach itself to walk smoothly.

So this is an example of how rewards are defined and how the entire process is done through reinforcement learning.
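
One way the reward he describes could be written down, as a purely hypothetical sketch with made-up penalty values:

```python
def walking_reward(distance_before, distance_after, tipped_over):
    """Hypothetical reward for the four-legged robot example:
    positive for moving closer to the destination, a penalty for tipping over."""
    if tipped_over:
        return -10.0              # penalty for falling
    if distance_after < distance_before:
        return +1.0               # moved closer without failing
    return 0.0                    # no progress, no reward
```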

So you mentioned a robot there, but what are some of the applications that people might see around them that rely on reinforcement learning?

For example, we all use ChatGPT.

Chatbot fine tuning is a very, very important application for reinforcement learning.

Now here the agent is the chatbot we're trying to train.

So the chatbot may have seen a lot of text, maybe all the text on the internet.

So it knows how to speak English, or to speak Chinese or how to speak German.

Right.

So it knows the basic structure, but it needs to be aligned towards a particular way humans speak or a particular region, where some things can be said and some things are not appropriate, things like that.

So it needs to be aligned towards a certain population.

So then the first question is where the reward model is coming from.

For a given question you collect many, many answers.

These answers can come from humans or from the language model itself, and then you ask humans to annotate, to rank which one is the best, which one is good, which one is not acceptable, and so on.

And you collect thousands and thousands of data points like this.

Right.

And then use this data to train the reward model.

So now, with this reward model trained on human annotation data, we will use a reinforcement learning algorithm to train, to align, the language model itself.

Before this training, the language model should already know the basic structures of English, of Chinese, of German, whichever language you are working on.

However, it doesn't know what should be the appropriate answer to certain questions.

And then the language model will explore.

The reward model will tell you, hey, this is correct, this is not correct, this is okay.

And so on.

With this signal, the language model will be trained, sort of tuned, to gradually move towards those answers that have high reward.

So that's the process.
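
As a rough sketch of that pipeline, the comparison step might look like the snippet below. Here `reward_model` is a hypothetical callable that scores a question and answer pair, and the pairwise loss is the commonly used Bradley-Terry style objective for preference data, not necessarily the exact setup in Professor Hong's lab.

```python
import math

def preference_loss(reward_model, question, preferred_answer, rejected_answer):
    r_good = reward_model(question, preferred_answer)   # score for the human-preferred answer
    r_bad = reward_model(question, rejected_answer)     # score for the worse answer
    # Push the model to give the preferred answer the higher reward.
    return -math.log(1.0 / (1.0 + math.exp(-(r_good - r_bad))))
```

Once the reward model has been fit on many such human comparisons, an RL algorithm (commonly PPO in practice) updates the language model so that its answers score highly under that reward while staying close to its original behavior.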

Is there a difference in using RL in large language models versus in a chatbot?

So I think this is the most popular way of aligning or fine-tuning a chatbot, which is powered by a large language model.

There are even more advanced ways of using this.

For example, now let's forget about the chatbot, right?

So let's just look at what a large language model can be used to do.

So now people are talking about using language models to plan things, right?

So for example, as an agent, I tell the language model to book a flight for me.

So the language model needs to know, okay, here is the instruction.

And my first step is to go to this website.

And then to input this and that and then get my results.

And then look at my results and say, oh, this looks good, this flight looks good.

And I'll pick this and go to Delta maybe, or go to some air carrier's website, and then click and then purchase, things like that.

So it's a sequence of planning steps.

And then eventually we have a reward.

Okay.

You made the right decision, you made the wrong decision, and so on.

And then the question is how to plan this sequence.

Well, the language model can plan step by step, by correctly making a decision at each step.

It's also one very exciting direction where RL, reinforcement learning, can be used to train a language model.

Right.

So this is of course well beyond the chatbot.
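
A highly simplified sketch of such a planning loop follows; the three helper functions here are placeholder stubs standing in for a real language model call, a real website or tool interface, and a real success check.

```python
# Hypothetical sketch: the language model picks one step at a time, observes the
# result, and a reward scores the final outcome of the whole sequence.

def llm_choose_action(history):
    # Placeholder for a language-model call that proposes the next step.
    return "search flights" if len(history) < 3 else "purchase ticket"

def execute(action):
    # Placeholder for actually operating a website or tool.
    return "confirmation page" if action == "purchase ticket" else "search results"

def booking_succeeded(observation):
    return observation == "confirmation page"

def run_booking_agent(goal, max_steps=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = llm_choose_action(history)          # plan one step at a time
        observation = execute(action)                # act in the environment
        history.append(f"{action} -> {observation}")
        if booking_succeeded(observation):
            return history, 1.0                      # reward for the right sequence of decisions
    return history, -1.0                             # penalty if the task was never completed

steps, reward = run_booking_agent("Book a flight to Minneapolis")
```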

What are some of the limitations when you're getting RL into real-world applications?

Okay.

There are many, many limitations I think right now.

First of all, RL is very hard to train; it needs many millions of interactions to learn effectively.

So data is very hard to obtain.

So this is one thing.

The second is how to specify the reward.

So designing a good reward function is hard, and the reward is often misaligned with the true human intention.

Let me give you an example.

Right.

So a cleaning robot is rewarded based on how shiny the floor looks, for example.

So this is a reward function.

Then the robot starts to train on this, and at a certain point it discovers: the shinier the floor, the better reward I get.

And then it will dump water everywhere to create a shine, but not actually clean.

Just make it shine, right?

So this is happening because the agent simply tries to maximize the reward.

And the reward is not very well designed.

So this is sometimes called a reward hacking problem.

For complicated tasks, this sometimes happens.

So this is the second sort of challenge: how to design a better reward model.
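
As a toy illustration of that reward hacking example: if the reward only measures shine, dumping water everywhere maximizes it, and one hypothetical fix is to also penalize leftover dirt and wasted water. The functions and weights below are invented purely for illustration.

```python
def naive_reward(shine):
    # Easy to hack: the robot just has to make the floor shiny by any means.
    return shine

def better_reward(shine, dirt_remaining, water_used):
    # Hypothetical realignment: shine still counts, but leaving dirt behind or
    # wasting water now costs the agent reward (weights are arbitrary).
    return shine - 5.0 * dirt_remaining - 0.5 * water_used
```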

The third is how to make the model more robust.

Make it safer.

For example, take a chatbot.

A chatbot has learned so many things from what it has seen, right?

It has been trained.

So how do you prevent it from saying something that it is not supposed to say?

This is very hard to distinguish.

Maybe some of these answers can be very helpful.

It directly answers what the human user was asking.

For example, how to, I don’t know, how to make a bomb.

So then, for a helpful agent, if it learns to be very, very helpful, it would directly say something detailed on its own, which is obviously not appropriate.

And then the last thing I want to say is that sometimes, in many cases, the agent trained by RL lacks generalization, which means if you move to a new environment, the old policy may not work.

For example, a robot may be trained to learn how to pick up a red cup from a table.

If it is moved to a new environment where all the cups are blue, it may have already forgotten what to do.

This is where you run into those limitations of data.

It's like it might know to do this part of it, but if it's this, what happens next?

Yeah, exactly.

Or like social norms.

Like, maybe I wouldn't ask you about something like bomb reference.

Yeah, but, like, how does it know that it's not supposed to tell you how to do this?

Yeah, exactly.

Right.

So then, okay, one solution: obviously, give it a lot of data.

Oh, I need to consider all the different kinds of nuances, the different things I should say and should not say, this and that.

But again, you don't have that much data on this.

What are some of the challenges in training an AI agent or algorithm to be an expert?

Like, it seems like there could be a lot of complications in that.

Yeah.

So I just mentioned this data limitation.

So high-quality expert demonstrations are very rare, especially if we're not considering just conversation.

We consider something practical, something that really needs human demonstration.

So for example, you want to teach an agent to drive, automated driving.

So where is the data coming from?

You have to go out and collect it, probably take videos of how humans drive, and so on.

So this data, this amount of data, compared to the text data that has been available online, is very, very small.

Another important challenge here is really the ambiguity.

It's also related to how we define a reward model.

For example, from an expert pilot you want to learn how to fly, right?

So now you're observing, okay, what this pilot is doing.

But then they will make very subtle adjustments when landing the plane.

But you certainly don't know where this is coming from.

Is it because of the wind, because of habit, or because there is anticipation of some turbulence or something else?

Right.

So this is very hard for the agent to understand.

If you just look at data, it's hard to sort of learn the reward.

Also another important thing is edge cases.

So for example, we want to learn how to play a chess game.

You're observing a chess grandmaster who is sort of playing regularly.

However, there will be some moves that are so rare that they only make them once or twice in their entire lifetime.

But those are so important.

Maybe it helps win the critical game, right?

So those examples, those cases, only happen once or twice.

How should it learn, right?

There's again not enough data.

So those are sort of important examples.

But they happen so infrequently that it's very hard to learn as well.

So those are, I think, some of the main challenges, mostly related to training an AI agent to be an expert.

Kind of the quality of data and the context.

Yeah.

Context and yeah.

I had a conversation where we were looking at vision language models, and I was kind of wondering how that might relate to this.

Overall, the idea is very similar.

Now, what does a vision language model give you? So, going back to the very first definition of reinforcement learning, right?

So here, a vision language model, compared with a text-only language model, gives you more sort of context, right?

So your environment is richer.

You know what happens not only from the previous text exchange, your conversation, but you also know your actual position, where you are, and what the context is.

Right?

Also, the action space is much more complex for vision language models.

You're allowed to generate text, so you're allowed to reason.

Given a situation, the model is allowed to provide reasons for why it wants to do this, and so on.

It will also be allowed to click certain buttons, right?

So you'd be able to actually operate on some of the objects.

Right.

So this gives you a much more complex action space as well.

So the entire training process will be more complicated compared with a text-only language model.

So there's an opposite kind of concept here.

Inverse reinforcement learning.

What's the difference between inverse reinforcement learning and reinforcement learning?

First of all, let me say that they're actually not opposites of each other.

So inverse reinforcement learning is, let's say, one step beyond reinforcement learning.

Some of our earlier discussion pointed out that designing the reward model is important, and it's very hard.

So inverse reinforcement learning goes one step beyond RL.

So in RL we need to have a reward model that guides us.

For inverse RL you don't.

So instead of learning the policy from the reward, we learn the reward function itself by observing expert behavior.

So the key difference here is that in RL, given a reward, I give you a policy; in inverse RL, give me expert demonstrations.

You don't tell me what is good or bad, just demonstrate for me, and then I will learn the reward.

I will try to understand.

Oh, why do you do this? At this given point in time, you should do this, not that, and so on.

And then with this, I eventually get a good policy.

That's the sort of key difference between IRL and RL.
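
A schematic way to see that difference is sketched below; `train_policy` and `fit_reward` are passed in as placeholders for whatever algorithms would actually be used, so this only shows which quantity is given and which is learned in each setting.

```python
def reinforcement_learning(env, reward_function, train_policy):
    # RL: the reward is given; the policy is what we learn.
    return train_policy(env, reward_function)

def inverse_reinforcement_learning(env, expert_demos, fit_reward, train_policy):
    # Inverse RL: expert demonstrations are given; the reward is what we learn,
    # and a policy can then be trained against that recovered reward.
    reward_function = fit_reward(expert_demos)
    policy = train_policy(env, reward_function)
    return reward_function, policy
```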

Thinking about workforce training, once you've established an expert model, how hard is it to introduce that to a novice human?

It's a very, very good question actually.

This is related to some of the white papers we are writing.

I don't think there has been too much work on this.

So that's why we try to propose this.

If you want to train a human this way, it's better to integrate the entire process.

So suppose you have an expert, and then suppose you have a teacher, and then suppose you have a human.

Now what the expert can do is to provide feedback.

So what is good, what is bad?

Now the teacher is also an AI model.

So the teacher's goal is not necessarily to become an expert, because the eventual goal is to teach the human, right?

So it needs to be able to integrate the expert's feedback and then ask the right question for this particular human.

So it needs to understand what the human's level of expertise is at this particular point.

What should be the next set of questions that I should ask? And when I ask the next set of questions, or the next set of tasks, what is the human's actual performance?

And then I'll send that to the expert and ask, is this correct or not?

Right?

And based on this, I will summarize these interactions and provide feedback to the human, and then provide the next set of tasks.

I think a better way is to, again, integrate the entire process, and then have this sort of learning environment set up by training a teacher, an AI teacher model, that interacts with the expert model.

Probably that's a better way compared with directly interacting with an expert and saying, hey, how should I learn from it?
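
A very rough sketch of the integrated loop he outlines, with an AI teacher sitting between an expert model and the human learner, is given below; all of the names are hypothetical, and this is only one way the described interaction could be organized, not the design from the white papers he mentions.

```python
def training_round(teacher, expert, human, skill_level):
    # One round of the expert / teacher / human loop described above.
    task = teacher.next_task(skill_level)           # teacher picks a task for this learner
    answer = human.attempt(task)                    # the human tries it
    feedback = expert.evaluate(task, answer)        # the expert judges what is good or bad
    skill_level = teacher.update_estimate(skill_level, feedback)
    human.receive(teacher.summarize(feedback))      # teacher translates feedback for the human
    return skill_level
```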

I want to ask you about how NSF support has impacted your career.

So over the years, I have been receiving grants on topics related to designing optimization algorithms as well as for reinforcement learning.

So obviously you can see that optimization algorithms are a very important component when we talk about how to train a model, how to improve the model.

It all involves optimization algorithms.

And also, reinforcement learning techniques have been the cornerstone for training LLMs, and then going beyond that.

So these have been very crucial to support my lab's research, to recruit students, and for me to attend conferences.

And so on.

So I'm very grateful.

Thank you.

For my last question, I want to ask you about the future and what's coming up.

Where do you see your work going in the next few years?

Yeah, excellent question; I ask this question of myself many, many times.

So I think one of the most exciting frontiers now is to use reinforcement learning to transform large language models into autonomous agents or systems of agents, which could do multi-step reasoning and even solve scientific problems.

Right.

So it's sort of related to your previous question of what's beyond just the chatbot.

So I think these will be something really useful that will have a significant impact on society, on the scientific community, or even beyond that.

So, as we alluded to before, RL, reinforcement learning, will play a key role here.

Right?

Because here we're essentially training an agent how to do planning, how to perform very complicated tasks.

For example, one of the goals here is to ask if a language model can assist with, plan, or even execute a research workflow.

First, I can discuss it with a language model.

Hey, I want to do this.

I have this idea; is this good, is it bad?

And let's have a discussion, and then it will go on to do a literature review.

It will go on to do experiment design and analysis.

Then writing the paper, and so on.

Right.

So this entire process will of course have some human supervision.

But a language model itself, or a foundation model itself, maybe even going beyond a language model, it could be a vision language model as well, can conduct the process step by step, similar to the example I just mentioned before, booking travel and other things.

So, so similar flavor.

Right.

So it's no longer just a chatbot, but something that really helps everyone do something useful.

Special thanks to Mingyi Hong.

For the Discovery Files, I'm Nate Pottker.

You can watch video versions of these conversations on our YouTube channel by searching @NSFscience.

Please subscribe wherever you get podcasts and if you like our program, share with a friend and consider leaving a review.

Discover how the U.S. National Science Foundation is advancing research at NSF.gov.
