What different types of synthetic data are there?

Episode Transcript

Will

Hello, I'm Will Poynter.

Ray

And I'm Ray Poynter.

Will

And together we're the founders of ResearchWiseAI. Today we're going to talk to you about synthetic data. First of all, Ray, tell us what synthetic data is and what it can mean in market research.

Ray

So it's one of those things that is always problematic to define, but I'm settling down pretty much on this: it's data that has been created instead of collected. In the market research industry, normal data is collected by talking to people; it might also come from telemetry, from passive data collection, all sorts of things. But synthetic data has been created instead of collected.

Will

And I've been hearing a little bit about different types of synthetic data. Can you expand upon the different types of synthetic data and what people are talking about?

Ray

So in the world of market research there are broadly three types of synthetic data. The one that perhaps we're seeing the most is what we're calling augmented synthetic data. So I do a survey and I collect maybe a thousand interviews. I wanted to have 250 young men in that data set, but when I look at the data set I've only got 150, so I want to boost the sample up to the 250. So I create a hundred synthetic cases which are young men, which match the characteristics that I would have expected if I had collected 100 more people. That is a really common use that we're beginning to see around the industry: augmented synthetic data.

The next most common is personas. Augmented synthetic data is creating additional individuals; personas more typically try to create data that represents groups of people. So we might have a persona for brand loyalists, we might have a persona for trialists, we might have a persona for rural occasionals, or whatever it is, and try to create that information. Augmented synthetic data is mostly quantitative. Personas can be quantitative, but more typically they're qualitative, so brands can then have a conversation with these personas and say: if I were to do this, how might people respond?

And the third category is talked about in some of the journals, in some of the newspaper articles and in the trade press, but isn't really used very much yet, which is fully synthetic data. Instead of collecting survey data from people, we construct it entirely from the information we already have, and then we use that in the analysis process.
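
As an illustration of the boost Ray describes (150 young men collected, 250 wanted, 100 created), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `boost_subgroup` function, the `rating` field and the copy-and-perturb rule are hypothetical, and this naive approach is a stand-in, not the method any synthetic data provider actually uses. It only shows the bookkeeping: work out the shortfall, create that many cases that resemble the real ones, and flag them as created rather than collected.

```python
import random


def boost_subgroup(cases, target, seed=0):
    """Return enough synthetic cases to lift `cases` up to `target` rows.

    Naive stand-in for a real generator (an assumption, not an industry
    method): each synthetic case copies a randomly chosen real case and
    nudges its 1-5 rating by -1, 0 or +1, clipped to the scale.
    """
    rng = random.Random(seed)
    shortfall = max(0, target - len(cases))
    synthetic = []
    for _ in range(shortfall):
        donor = rng.choice(cases)
        rating = min(5, max(1, donor["rating"] + rng.choice([-1, 0, 1])))
        # Flag the row so created data stays distinguishable from collected data.
        synthetic.append({**donor, "rating": rating, "synthetic": True})
    return synthetic


# Hypothetical collected data: 150 young men, each with a 1-5 rating.
data_rng = random.Random(42)
real = [
    {"group": "young_men", "rating": data_rng.randint(1, 5), "synthetic": False}
    for _ in range(150)
]

created = boost_subgroup(real, target=250)
boosted = real + created
print(len(created), len(boosted))  # 100 250
```

Keeping the `synthetic` flag on every created row means any later analysis can be run with and without the created cases, which is one simple way to see how much a result depends on them.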

Will

Brilliant, thank you. A good, thorough answer there. And so, to continue that: why are we hearing so much about synthetic data? Is it all because it's new in AI, or has it got a really practical, useful application, or is it a bit of both?

Ray

It's mostly around two really attractive features: it's much faster and it's much cheaper. If we could create data without doing any surveys whatsoever, instead of projects taking days they would take minutes. You could imagine a situation where you were in a planning meeting at a client and they said, we really need to have some information on this. They could speak in natural language to the computer, and the computer could run some analyses, generate some synthetic data and come back with some opinions. Now, we're not at that point yet. But if we talk about augmented synthetic data, the most expensive and the most time-consuming part of the data is the last 10 percent, the last 20 percent. If that could be the synthetic part, then we speed the process up and we reduce the costs. So that is why people are super excited in this field.

Will

And of course we always like saving money and making things faster. But in the case of fully synthetic data and personas, is there also an argument around what is often the number one priority of lots of market research companies, which is data security? Is there a case that fully synthetic data can also help in that space?

Ray

Absolutely. Historically, one of the earliest uses of synthetic data was in the census. Before they released census information to researchers, they would add noise to it to protect the anonymity of the people who supplied that data. By replacing real bank data with synthetic bank data, by replacing very sensitive questions about people, their lives, their beliefs and their behaviors with synthetic data, you could potentially protect those people. That is of real interest to a relatively small group of people, but an important group of people. So we could see that that would be part of the process there.

Will

That makes a lot of sense. Just to pivot slightly on this: one area where we're seeing synthetic data used is software development. One of the biggest challenges we've often had is that we develop software to work on real people's data, but we often don't want to expose the real
people's data to the entire development team, which could be hundreds of engineers who have no real reason to know about your personal data. So we have a big job creating fake data to test with, which is often not similar enough to the real data. And so synthetic data has started to come in as a way of composing completely fake users who interact with apps in fairly realistic ways, to then test those apps.

Ray

So I'm smiling a lot here, because "fake" is one of those words that crops up in the debate around synthetic data in the research industry, and it's normally used by people that don't approve of synthetic data. I draw a big distinction between fake, when it's used without telling somebody this has been created, and synthetic, where you tell people: this data has been created, and here are the characteristics. And it's quite interesting when you look at where people come from. People who come from the arts generally have a fairly bad view of the word synthetic, so they will talk about synthetic love, synthetic art and so on. When you get to music, however, synthetic music has got its lovers as well as its haters. Move to science and take something like insulin: real insulin is made from the pancreas of dead pigs and other animals, and it's got lots of impurities. Synthetic insulin is the good stuff; that's the one you want. So it's often interesting when this word fake crops up in the discussion.

Will

But while we're moving on to negatives: we've heard lots of good things about synthetic data. What are some of the concerns from the industry?

Ray

Well, the biggest one is that it won't work, and that's a really big concern. And the second one is: how do you know whether it's working or not? Perhaps we'll dive into that in a moment. But just because it worked last time doesn't mean it will work next time. Just because it worked when we tested toothpaste doesn't mean that it'll work when we test cars. Just because it worked
when we were estimating what young men would have said doesn't mean that it'll work when we want to estimate young women, or something like this. So everything is based on experiential information. There's no deep philosophy of why it should work, and that is concerning.

Will

And you said the most common form of synthetic data in the industry at the moment is augmented synthetic data. Can you expand upon that?

Ray

Yeah, sure. So we do a study and we don't have as many final cases as we want. Historically we might have used weighting: we've only got 150 young men, we wanted 250, so we up-weight them. There are some problems with that; it tends to behave in a clunky way. Synthetic data holds out the possibility of that data behaving more like 250 data points would behave. They won't replicate exactly what the 250 would be, but they will behave more like 250 than 150 up-weighted.

If we think about tracking studies, we collect one twelfth of the data every month, and that will often understate changes in the real marketplace, because you look at moving averages and you look at the last year's data. If something quite dramatic is happening this month, we make it harder to see with traditional methods of using information. If we can use synthetic data to amplify the most recent data, we can look at it in greater depth. So that is where we're seeing synthetic data coming into play: not having to collect the last 10 percent, perhaps with the hard-to-reach people. So we're doing that and it's pulling in there, but that is where this discussion about how you test it comes into play.

Will

Yes, I see that. Is it fair to say that augmented synthetic data is a potential improvement upon weighting as a way to fill that gap, to simulate, effectively, the shortage of data?

Ray

I would say that's a very fair comment. It's a potential improvement. It doesn't deal with the problem
entirely, and it needs to make sure that it is an improvement, because it's adding an extra uncertainty. It's something that people can understand a bit less, whereas weighting is very straightforward. But it should be, when done correctly, in many cases an improvement on weighting.

Will

And you mentioned a couple of times the importance of validating this efficacy and whether it is going in the right direction. How are people going about that problem right now?

Ray

The good players, and there are quite a few of them, have a pretty straightforward route, but it's got a flaw to it. They will go to a client and say: give me your research projects from last year. You've got a study there with 1,200 people. Let's remove 200 of them, put them to one side, don't give them to me. I will then use your 1,000 to generate the 1,200, and we will compare the 1,200 with your original 1,200 to see how good the fit is. And people who have published studies generally show it's pretty good.

However, there isn't a good definition of how similar the data will be. Are we talking about the distribution of data? We know the significance testing would be different. Are we talking about a point estimate, trying to estimate how many people are left-handed, or are we working with scales, such as a five-point agree-disagree scale? So we really need a bit more precision, a bit more definition about what similar means. You've used synthetic data to recreate something you already had; how do you then say how similar it is? We know what it wants to end up with: did I make the same business decision? Fundamentally, that is what it comes down to. So the statisticians will play around with validation techniques, and that's great, but at the end of the day, would I make the same business decision with the synthetic data that I would have made with the real data?

Will

It makes me worry about a future where data sets have been augmented and then fed into another data set that's been augmented, that's fed into
another data set that's been augmented, and you start to wonder how much true data there is at the bottom of the stack.

Ray

Absolutely. I mean, this is definitely one of the concerns. We talk about cannibalization, where AI is trained on AI-produced products. It's a real issue with images: evidently 50 percent of the images on the internet were generated by AI, so the AI systems are now learning from their own stuff. If we had that with synthetic data, then a level of drift is one of the things that could certainly happen. And if we don't keep collecting new data, then it's hard to see how synthetic data can be useful in five years' time or ten years' time, if there has been no new data collected through that process.

Will

Moving on from augmented data to personas, the more qual-friendly synthetic data: can you tell us a bit more about personas?

Ray

Yes, absolutely. So there's some really quite interesting work, and I'll give a bit of a shout-out to Signoi, who are doing quite a bit in this one. They will work with the client. They will have access to an LLM, but they will also upload lots and lots and lots of data from that client to it. There will be segmentation work the client has done, there will be brand tracking information, all sorts of things like that. And then they'll talk about: OK, what are the typical types of customer, and what have they said in the focus groups and the transcripts? So, right, I've got that on board; now I can have conversations between the brand manager or the advertising manager and these personas.

And here the validation process gets even harder. If I employed two qualitative researchers from different backgrounds, who had never met each other, and I gave them the same exercise to go and do, they would both come back with a useful, true interpretation, but it would be different. And likewise we wouldn't expect the personas generated by one set of exercises to replicate what any specific human would have
created. So now this basic test of "is it useful?" becomes even more important, and it's probably going to be something like a new version of the Turing test, where you're trying to say: does this persona come up with the answers it should to key questions? So if we are asking about cars, it should say that German cars are safer than cars from another category, or...

Will

Or are better built, have got better engineering.

Ray

Yes, it should probably talk about better engineering. We know that those things should be in there, and the testing becomes a whole little checklist of: does it say the right things to the right questions? For those of you who have seen the old Blade Runner movie, Harrison Ford has a whole set of questions which give evidence that somebody is a droid or android, and it's probably going to be a bit like that when we try to test these personas. And remember, they only have to be useful; they don't have to be statistically correct. So that is part of this process. If what they do is unlock the creativity of the brand manager, then there is merit in that in itself. So it's going to be quite an interesting area. I'm seeing quite a few people use personas in things like brainstorming, where there isn't a right and a wrong answer. What they want is to generate lots of new ideas, new thoughts, and then the human in the loop is going to take the ones that are going to be the winners.

Will

Hearing you talk about personas, is there a clear distinction here between persona synthetic data and augmented and fully synthetic data? It seems that the one is creating data sets where the other is creating an interactive entity. But also, is a persona not closer to fully synthetic, in that it is the AI that's creating the responses, whereas augmented has used AI to extrapolate from responses? So personas are closer to fully synthetic than to augmented synthetic.

Ray

I think that's certainly true. You could have a
situation where you collected a thousand interviews and you then used additional information to enrich them, so these became real hubs with a synthetic outer, to allow you to have conversations with them. Some people are creating a thousand personas and then giving them survey questions. So at the margin it blurs. What I'm talking about is a generalization: augmented synthetic data adds a small amount of quantitative data to an original set; personas, generally speaking, create entities that represent groups of people, where you can have ongoing conversations with them; and fully synthetic data is normally in the context of quantitative, but it certainly embraces personas as well.

Will

And moving on to what I suspect could be the most controversial of the three in the market research industry: fully synthetic data sets. Firstly, are people now using these in what I would call production, in market research for real rather than experimentation?

Ray

A small number of companies and buyers are. There was an interesting paper published a little while ago, talking about car brands and coming up with stuff that replicated the real world quite well, and you can see why in that case it would. But actually most clients are still nervous about synthetic data, so very few clients are buying the fully synthetic option. I suspect it will be a long time, which in my book is about two years, before we see people being more willing to go into that fully synthetic mode. And anyway, I suspect it's a blind alley.

Will

Do you want to elaborate on that?

Ray

Absolutely. So what we're trying to do with fully synthetic data in particular is we say: OK, we're research people; we like having survey questions, which we give to real people; they send them back; we run analysis; and then we answer the original business questions. We have a business question, we design a survey, the survey gets asked of people, the people give us answers, we then run analyses, we then combine the
business question with the analysis and we get an insight. And maybe we could use the computer to generate the people to answer the surveys. But if we can do that, we can take the business question directly to the AI, and we really don't need to replicate this mind-bogglingly tedious process of asking survey questions. Once we get to the level where the AI is really on top of understanding the data about people and manipulating the data about people, I don't think we'll need to replicate the survey-question and statistical-analysis part of what we're doing.

Will

Yes, I can see that; that makes a lot of sense. This is a common theme we've seen in the last two and a half years, since ChatGPT 3.5 exploded onto the scene: using modern AI to replicate manual and human steps in a process where many steps could be omitted altogether or replaced. And so yes, generating fully synthetic data sets to then go through a set of processes which are designed around the nature of human data collection: I can see how that may be a dead end. While you're making predictions about the future, my last question is: what else would you predict about synthetic data in the near and medium future?

Ray

With synthetic data, there is a chance it will get it wrong. With any form of market research, there is a chance it will get it wrong. The question is how much bigger the chance is. What we are going to see, and what we've already seen, is some buyers of research saying: I will switch my risk over and I will start doing this, because it's faster and it's cheaper. And so I think some of those are going to get some real advantages moving through. And we are going to see the regulators coming along to try to make a certain set of guidances and questions that help buyers do this more safely. But over the next couple of years, I see personas growing a lot, and I see augmented synthetic data growing quite a lot. The idea that you can just talk to the computer: we
can already do this with LLMs. Pretend that you're a young African-American living in New York, you're interested in sports, you're doing this; and it will draw on the knowledge that it's learned and the literature that it's read to give you a set of answers that would be massively better than Ray Poynter's knowledge of what an African-American in New York would say. But by really adding additional data collected from young African-American men into that data set, plus the rest of it, you don't necessarily need personas, which are kind of static; they could be made at run time from all the information we've got. I want you to act now as this person. Now I want you to react as a couple of retired people who've moved to Florida, and we want to talk about diet and the rest of it. And I can see that next step of personas dealing with a lot of cases where brands need ideas, information and suggestions with a reasonable probability of being right.

Will

Sounds like a lot of potential. So that's everything that we've got on synthetic data today. Thank you for answering all of my questions, Ray, and thank you to everyone at home for joining us. Please join us again next week for our next episode of Talking AI.