I only half-joke when I say: "Data has carried the grunt work over the years and AI has gotten the glory".
Poor Data has been cleansed, transformed, normalized, wrangled, obfuscated, and prepared in order to serve its AI master.
On top of all that, it has been called names - How many times a week do you hear the term "Dirty data"? 😊
At the latest TGI-AI event, we gathered Data AI enthusiasts from Xoogler.co and MIT Alumni Startups (MITAS) to explore how AI can pay back and serve Data.
Thanks to Christopher and Shuja for enabling this forum, to my partner-in-crime Kushagra, to Laura for her support, and to my co-panelist Jayanth for such a lively and enriching discussion.
Here's a short video excerpt of our discussion. We'll post the entire video in the comments.
Jayanth and I tackled questions like:
1. What do businesses do with Data? Is Data a first-class citizen or does it just exist for feeding the AI machine?
2. What does Data go through before it’s used?
3. If the internet disrupted distribution and Generative AI is disrupting creation, the creation of which data analysis assets is Generative AI disrupting?
4. Just like you can have conversational analytics, can you have conversational transformations?
5. Why is there such a strong interest in building data assistants? Aren’t BI and Data Science tools good enough?
6. What is hard about building a data assistant that can answer business questions?
7. How far are we from the magical moment when we can leave all our messy data where it is, and just start asking questions?
This virtual-only event was well attended over Zoom. It included Xooglers, MITAS members and friends like: Rohit, Imama, Aditya, Peter, Sergio, Meghna, Feihong, Ankita, Rikako, Mehak, Dhruv, Al, Heather, Sanjay, Anya, Bhavani, and Brian.
At StepFunction, Tej, Aditya, Rajesh, Puneet, Ankit, Chandani, Bijit, Tushar and team really embody the mission of Data readiness powered by AI. It was great to see at the recent Databricks Data + AI Summit that many industry players also share our mission.
Stay tuned for next month's event which should be both in-person and online.
So Jayanth, why don't you introduce yourself and tell us what you do? What is your day job?

First of all, hello everyone. Such a delight to be here and meet this wonderful community. My name is Jayanth Mysore. I am a cofounder of a company that's still in stealth mode, and what we are working on is a product that seeks to, the way I like to think about it, dispel all notions of fear and uncertainty from getting insights from data for users in lines of business, right? And we believe that this is the full depth of the problem, the reason why data is not used. And you'll see I'm not using AI anywhere. Really, data is the entity with all the beauty and the insight, and how do you get it all out? We believe the challenges are usability, reach, understandability, a bunch of those characteristics. So we are working on building an assistant that will help address all of those gaps. That's what I've been doing now for about a year and a few months. Prior to this, I used to lead product management and marketing at an excellent BI startup called Sigma Computing, prior to which I was at Google for many years. I had the opportunity to work on Google Analytics, Google Maps, Enterprise Search at G Suite, and graph search for the Knowledge Graph. I keep saying that Google Maps is the most beautiful dashboard I ever built in my life, as part of a much larger team, of course. So that's a bit about me.

Awesome. And you guys have seen me at these events before; I lead these events. My name is Navneet Singh. I'm a cofounder and CEO of a company called StepFunction, stepfunction.ai, and we are a customer growth app that takes SaaS companies and growth operators from raw data to recommendations. Growth recommendations are, for example, which of their customers are at risk of churn, which of their customers can be upsold, and what they can be upsold, to grow their revenues from their existing customers.
And as you can imagine, going from raw data to business recommendations involves a lot, and we'll get into some of that. I gave a bit of a humor-filled intro, that I feel bad for data getting beat up and AI getting the glory. But let's get into the details. And like I mentioned in my chat, feel free to raise your hand at any time. Jayanth, what do businesses do with data? I mean, is data even a first-class citizen, or is its only purpose in life to serve AI and feed the AI machine?

Of course data has a lot of utility in and of itself. Its job is not just to feed the AI machine, right? I think there are two broad ways in which data gets used, and this is just to set the context for everybody who might not be working with data on a day-to-day basis in a business context. First, everybody in the line of business uses some set of applications that are core to their work. If you are in sales, for example, you might be using Salesforce for your typical pipeline management and for going through the entire workflow of managing customers and leads and all of that. You might be using Gong for recording calls, whatever it is. So one big role of data is to draw insights from those applications. People like to think of them as data silos, but the point is that each of those actually contains a lot of really good insights. The second way in which data gets used very heavily is through the construction of the entire factory, I call it the data factory, that takes data from all of these and creates more integrated, holistic views for the business to consume insight at a slightly higher level of the business, across these silos, if you will, right? So these are the two broad ways.
And consider the path that data takes from each of these to the consumer. The consumer is basically the same: it's somebody working in a business somewhere. But the path that data takes is very different, and the ways in which data can be used in these two paths are different. What I think is super exciting for us in the current time frame is that this latter path used to be an extraordinarily expensive, labor-intensive path for mining data, and it has become both radically cheaper and even more radically easier to scale compute. Because of that, a whole bunch of exciting opportunities await those who can dive in and figure out how to power up this analytical compute for companies at scale.

Awesome. So let's get into what data goes through, right? I was going to ask you: why don't you educate everybody here, all of us, on what needs to happen to the data before it can actually be consumed by somebody in the business, right?

So, you know, I wasn't joking when I said that data is twisted and wrangled in so many different ways, right? Think about it: businesses have raw data, and it sits in silos. So first it needs to be collated and ingested, cleaned up, and quality checked. Missing values need to be imputed. If there's any PII, that needs to be obfuscated. Then it needs to be reconciled, right? Jayanth Enterprises might be called Jayanth Inc. in one place, Jayanth Enterprises in another place, and something else in some other place, and there might be a spelling mistake. It needs to be reconciled and normalized. There are different ways it needs to be mapped and normalized. It then needs to be prepared, even if it's not being used in AI, right? It needs to be prepared even for reports and all those things. So poor data goes through a lot, right?
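To make the reconciliation step above concrete, here is a minimal Python sketch of grouping spelling variants of one company name. It is an illustration only, not the pipeline either panelist described: the suffix list, the 0.85 similarity threshold, and the function names `normalize`, `same_entity`, and `reconcile` are all assumptions, and it uses only the standard library's `difflib` for fuzzy matching.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore; a real list would be longer.
SUFFIXES = {"inc", "incorporated", "llc", "ltd", "corp", "corporation", "co", "enterprises"}

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    kept = [t for t in tokens if t not in SUFFIXES]
    # Fall back to all tokens if the name consisted only of suffixes.
    return " ".join(kept or tokens)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as one entity when their normalized forms are similar enough."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def reconcile(names):
    """Greedy grouping: each name joins the first group whose first member matches."""
    groups = []
    for name in names:
        for group in groups:
            if same_entity(group[0], name):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

With this sketch, `reconcile(["Jayanth Enterprises", "Jayanth Inc.", "Jayant Enterprises", "Acme Corp"])` puts the first three names in one group and leaves the made-up "Acme Corp" on its own. Greedy grouping keeps the example short; it depends on input order and would not scale to large tables.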
And what I'm excited about is all this work that we've been doing over the years and have been dreading: "Oh my God, we need 10 data engineers to spend 6 months preparing the data, cleaning it up, and all that." Today that process is, like Jayanth said, a lot shorter and a lot easier, thanks to advances in generative AI and related fields. Our goal is to really dig in and see how each one of these things is radically helped by AI. Does that make sense, guys?

So let's get into how generative AI is disrupting this. If the internet disrupted distribution, and we were saying generative AI is disrupting creation, then creation of exactly what data analysis assets, Jayanth, is generative AI disrupting?

Great question, kind of fundamental to this whole discussion, I think. In a nutshell, it's disrupting creation of assets throughout the entire data factory, as they call it. If you were to dissect the data factory, as you pointed out, there are all of these different jobs that need to be done, and the pipeline, the factory, resembles a classic assembly line almost, right? You go from data that's represented in application schema to business schema, and then to transformations that are more and more custom, closer and closer to the point of consumption. You've got more and more assets getting created. On the one hand, what's happening with generative AI techniques is that you can apply the techniques to basically automate the creation of all of these assets. Obviously this is not going to be 100% automation, but it's a dramatic reduction in the amount of time, and when I say dramatic, we're talking orders of magnitude, right? It's not just a multiple.
So that's one. The other area, one that I'm very close to in my day-to-day and when I talk to my customers and so on, is the disruption in the last stage, right? Once all of the models have been created, if you want to get insights from this data, what's getting disrupted is not, for example, the BI tool or the data science tool, but actually the effort it takes to create the output. Recently I read about the model potentially going from software as a service to result as a service, and that's really what's going on. If you ask why the dashboard was getting created, the answer was that it was so expensive to deliver one insight at a time that you would rather create a massive data product, if you will, and provide all the controls in there, one that has a very long half-life. The cost is what made that the product people had to consume. What's happening now is that this entire process has been disrupted, or will be disrupted. I don't think we have yet seen the first evidence of a product that does this really well. Hopefully our company will be one of those that does it. But the idea here is: if insights could be generated literally in less than a minute, why would you go through the effort of even creating a dashboard, right? Why would you go through this fixed cost of creating any of these? So that's one of the big areas, where the labor and the effort involved in creating dashboards has itself come into question. You can replace dashboards with data science outputs as well. All of that is, I think, getting dramatically shrunk in terms of effort.

Excellent. G, did you have a question? I don't see your full name, it just says G. Anyway, if not, I will mute you. And that's a good insight, Jayanth, thanks. But let's get into the details of the end-to-end pipeline of what data goes through, and I'd love the community's questions and feedback. Let's start with the very first thing: once we ingest data, we need to see whether we have quality data, right? People have worked on this for a long time, using a lot of mechanisms, mostly manual, and it's been frustrating. But what I personally see now, and what my team uses, is AI: for example, generative LLMs trained on lots and lots of similar data, even that particular company's data, and then able to detect anomalies and other data quality issues in a new data set. And that's just from the quality check and monitoring perspective; we haven't even gotten to correcting the quality issues. I'm very thankful for the new techniques, and I see a lot changing. Do others as well, Jayanth and others? Do people see what I mean by just detecting outliers: having a distribution and seeing something that completely falls off, or knowing, for example, that revenue shouldn't be negative, or something like that? Just inherently knowing, without somebody telling you these rules, that these values have to be positive, or that revenue cannot be a string, for example. This is knowledge that LLMs either already have or can easily gain by fine-tuning. Do people see that?

I think what I've seen, Navneet, when I talk to the data analysts at some of my customers, is that many of them are going to be exploring tools that allow you to do things like this, and also open source efforts that allow you to dramatically simplify the effort.
Because one characterizing feature about data overall, or I would not say data, I would say the humans of data, the people who are involved in working in this data factory and even consuming from it, is, I would actually say, fear and apprehension, rooted in many different things. Most data folks are super worried that the reports they're giving out are wrong, not because they are careless, and in fact they fear that they will get blamed for it, but because somewhere something in the pipeline breaks, right? And all of a sudden, techniques that seemed modern and state-of-the-art till just about a year ago suddenly seem like, "Oh crap, this whole thing can happen in real time. I can actually detect things as and when they happen and go fix them," right?
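The checks mentioned in the discussion (revenue should not be negative, revenue cannot be a string, a value should not fall far off the distribution) can also be written as explicit rules. The sketch below is a hand-written baseline for comparison, assuming a plain list of revenue values; it is not the LLM-based detection the panelists describe, and the function name `check_revenue` and the 3.5 cutoff are assumptions. Outliers are flagged with a median-absolute-deviation score, which is less distorted by a single extreme value than a mean-based z-score.

```python
from statistics import median

def check_revenue(values, mad_cutoff=3.5):
    """Return (index, value, issue) tuples for values that fail basic quality checks."""
    issues = []
    numeric = []
    for i, v in enumerate(values):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            issues.append((i, v, "not numeric"))
        elif v < 0:
            issues.append((i, v, "negative"))
        else:
            numeric.append((i, v))

    # Outlier check on the remaining valid values only.
    if len(numeric) >= 3:
        med = median(v for _, v in numeric)
        mad = median(abs(v - med) for _, v in numeric)
        if mad > 0:
            for i, v in numeric:
                # 0.6745 rescales the MAD so the score is comparable to a z-score.
                if 0.6745 * abs(v - med) / mad > mad_cutoff:
                    issues.append((i, v, "outlier"))
    return issues
```

For the made-up input `[100, 120, 110, "n/a", -5, 105, 10000, 115]`, this flags `"n/a"` as not numeric, `-5` as negative, and `10000` as an outlier. Every rule here had to be spelled out by hand, which is exactly the effort the speakers expect AI to reduce.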
Thank you, Navneet Singh and Kushagra Shrivastava, for the opportunity to be a part of this discussion. We are living in the hobbyist days of inventing in a technology space that is full of brilliant ideas and promises. Some of these are going to create discontinuities in how data integrates into work life, bringing with it the clarity and confidence that truth alone can. Data is not dirty - data is the truth.
🤖 When you look at stalled or unsuccessful AI/ML projects, it’s tempting to conclude that the problem is around the model-building process.
But 9 times out of 10 the problem is poor data quality.
In this post, Matthew Kelliher-Gibson, MBA, shares best practices for solving it at the source, and details how superior data quality enables data science teams to do their best work.
Go deeper 👇
https://bit.ly/49NxqtH
Welcome to our first Feature Friday (pretend it's Friday)! Every week, we'll highlight a feature that makes Roe AI special in the world of unstructured data.
At Roe AI, we're redefining how you handle unstructured data. By merging SQL with the power of AI agents, we've created Roe SQL—your gateway to indexing, searching, and analyzing multimodal data seamlessly, no matter the scale.
This week, we'll show you how to create a semantic search index across columns of your data tables (video below). Let’s say you have a large table of YC companies from a batch. There’s a column called `description` that describes what each company does, and you want to find companies specializing in a specific area. Import your table, create a search index on the `description` column, type your query, and voilà—the top results appear instantly.
How is our feature different from others like Pinecone? Flexibility. With Roe SQL, you can perform complex hybrid searches in SQL that filter data via custom metadata. For example, we can classify each company using its description, keep only the AI companies, and search within that refined list, all in a few lines of Roe SQL. The possibilities are endless, but we ensure you have the power to grasp them.
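The filter-then-search pattern described above can be sketched in plain Python. This is not Roe SQL and does not use any Roe AI API; it is a toy illustration in which a bag-of-words cosine score stands in for real semantic (embedding) similarity, and the company rows, the `category` field, and the function names are made up for the example.

```python
import math
from collections import Counter

def vectorize(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def hybrid_search(rows, query, category, top_k=2):
    """Filter rows by a metadata field first, then rank the survivors by similarity."""
    q = vectorize(query)
    candidates = [r for r in rows if r["category"] == category]
    ranked = sorted(candidates, key=lambda r: cosine(q, vectorize(r["description"])), reverse=True)
    return [r["name"] for r in ranked[:top_k]]

# Made-up rows standing in for a table of companies.
companies = [
    {"name": "Alpha", "category": "ai", "description": "ai agents for customer support"},
    {"name": "Beta", "category": "fintech", "description": "payments for customer support teams"},
    {"name": "Gamma", "category": "ai", "description": "ai models for drug discovery"},
]
```

Here `hybrid_search(companies, "customer support", "ai", top_k=1)` returns `["Alpha"]`: "Beta" also mentions customer support but is removed by the metadata filter before ranking.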
Join our Slack workspace to see how Roe AI can accelerate your unstructured data management!
https://lnkd.in/eyexKHFb #unstructured #data #ai
Insightful discussion on generative AI use cases and the critical role of data quality! At Fundbox, we're experiencing firsthand how high-quality data enables better decision-making. Monte Carlo has been instrumental in streamlining our data processes. Looking forward to further exploring the intersection of AI and data quality with fellow professionals! #AI #DataQuality #MonteCarlo
🤖 Feeling overwhelmed by data? You’ll want to check out this article from Forbes exploring how generative AI might be the solution we've been waiting for. A great read for anyone dealing with #DataManagement and analytics!
Is Databricks AI/BI the future of business intelligence? In our episode 'BI in the age of AI' I predicted that the future of business intelligence work was building and maintaining the overall system of metrics and certified queries/content as opposed to what we've been doing the last decade, which is primarily building charts.
Less than a year later, AI/BI launches as a deeply integrated, AI driven business intelligence solution for Databricks customers. In it, BI professionals maintain a library of certified queries and report content for use by multi-modal AI systems to answer business user questions primarily via natural language instead of human authored dashboards. At least, that's the vision.
It's almost like they watched the episode and said, 'Holy crap let's build that!'
Of course it's early days and the BI capabilities of AI/BI are by all reports very rudimentary, but directionally I think Databricks is correct. This is where business intelligence is headed.
We'll be doing a deep dive on this concept and Databricks' AI/BI announcement tomorrow at 12PM EDT. Bring your questions! Link in the comments.
#databricks #aibi #businessintelligence #analytics
CEO of StepFunction.ai, ex-Google Eng Lead, ex-MIT AI lab
Please find the full video here: https://youtu.be/3ZbvhxtF0I0