In today’s column, I am going to identify and explain the momentous pairing of both generative AI and data science. These two realms are each monumental in their own respective ways, thus they are worthy of rapt attention on a standalone basis individually. On top of that, when you connect the dots and bring them together as a working partnership, you have to admire and anticipate big changes that will arise, especially as the two fields collaboratively reinvent data strategies all told.
This is entirely tangible and real-world, not merely something abstract or abstruse.
The recent extensive release of the ChatGPT Code Interpreter by AI maker OpenAI showcases the merging of generative AI and data science vividly and excitingly (readers might recall that the ChatGPT Code Interpreter was earlier this year made available on an alpha basis and I briefly discussed the topic at that time, see the link here). With the wider rollout now taking place, be ready to witness a bonanza of data science adoption across the board. Modern data scientists and even data neophytes are going to eagerly and earnestly make use of generative AI to do the bulk of their heavy lifting for them when it comes to analyzing and leveraging data.
I’ll momentarily describe how the ChatGPT Code Interpreter works and what you can do with it. Please be aware that this isn’t the only combo in town. You can easily see the writing on the wall that other generative AI apps are going to incorporate data science capabilities too (some are already doing so). Soon enough, a plethora of options will exist as to which generative AI you might opt to use and which set of data science tools and techniques you might want to utilize. Keep your eyes on Bard (Google), Claude 2 (Anthropic), GPT-4 (OpenAI), and the many other generative AI apps available.
You could say that we are entering the best of times when it comes to data science. But like the famous line, sometimes the best of times perchance coincides with the worst of times. Here’s why that fits in with these latest advances.
If just about everyone and anyone will be able to use AI to undertake data science activities, the question arises whether we will be awash in nonsensical data science or at least amateurish data science that is replete with errors, falsehoods, biases, glitches, AI hallucinations, and other undesirable maladies. You see, from an AI Ethics perspective and likely an AI Law viewpoint too, we are opening the door to data science at scale. This could lead to an explosion of really awful data science that misleads, misguides, and furthers the tidal wave of societal misinformation and disinformation.
Some believe or hope that state-of-the-art generative AI will be good enough to keep us from all getting swamped in lousy or even purposeful wrongdoing data science. The idea is that AI will aid us in avoiding shooting our own feet. There are recommended AI Ethics data science precepts and considerations afoot. Lawmakers and regulators are also bound to be drawn into the generative AI and data science coupling. You can expect that new AI Laws will be proposed and established to deal with the potential anything-goes wild west of AI-powered data science.
One compelling argument being made right now is that the combining of generative AI and data science is going to effectively democratize data science. Whereas the case today might be that to do prudent and intelligible data science you need sufficiently honed skills and sometimes costly training, the fact that generative AI can demonstrably reduce the barrier to entry will open data science to all. No longer will data science be confined to those with specialized capabilities in hand. Data science will be at the fingertips of everyone.
The conundrum again arises whether we are going to overly rely on AI. Will people be lulled into letting the AI do the data analyses for them and ergo fail to double-check or use humankind’s common sense to ascertain whether the results are viable and practical? This seems unfortunately to be highly plausible, perhaps highly likely. Anyone that has ever used a generative AI app knows how easy it is to fall into a slumber and assume that the AI has perfectly figured things out for you. That’s a bad mental trap and one that I’ve repeatedly forewarned we need to be extremely mindful of, see my discussions at the link here and the link here, just to name a few.
Okay, so we have the coupling of generative AI and data science, which bears some wonderment and also a degree of controversy. We need to also take into account data strategies.
Entities such as companies and governmental organizations devise data strategies that are intended to leverage and maximize their beneficial use of data. Those data strategies ought to also be vigilant for the downsides of mishandling and misreporting associated with data. Whatever data strategies an entity might currently have, if any, will need to be rethought and likely rejiggered in light of the powerful twofer associated with generative AI and data science.
Whew, that’s some rather intense thinking.
Take a brief breather.
Here’s how I will proceed next. First, I will unpack the big picture of what data strategies are all about. We can then dive into the nature of data science and the nature of generative AI, along with their coupling. This will establish a framework and context for then exploring the advent of tools such as the ChatGPT Code Interpreter and what it can currently undertake.
At the conclusion, I’ll have a few final remarks to make about the need to try and keep things on the up and up, aiming to avoid the pitfalls and dark abyss that regrettably go along with this march forward in the burgeoning world of generative AI and data science around us.
Buckle up and get ready for quite a ride.
Clarifying What Data Strategies Are All About
You might hear or read from time to time that a top leader at some entity has declared that their organization has devised a miraculous data strategy. This especially became popularized a few years ago as Big Data and large-scale data warehouses first emerged. Whereas in the past organizations tended to overlook how data ought to be thoughtfully managed, the cost and ease of collecting and using data increasingly prodded organizations into realizing that they can no longer afford a haphazard approach to their data efforts.
Consider these three examples of the meaning associated with having a data strategy:
- Example of the meaning of Data Strategy for a business entity: “A data strategy is a highly dynamic process employed to support the acquisition, organization, analysis, and delivery of data in support of business objectives” (posted online in the Gartner glossary on Information Technology (IT) terminology, retrieved July 2023).
- Another example of the meaning of Data Strategy for businesses: “A strategy outlines the initiatives and actions that you believe will drive your desired business outcomes. The purpose of a data strategy is to enable your organization to achieve its mission and objectives using data – giving you a competitive advantage” (excerpted from Google’s posted online White Paper entitled “Three Pillars For Building A Modern Data Strategy”, retrieved July 2023).
- Meaning of Data Strategy for the Federal government: “The mission of the Federal Data Strategy is to fully leverage the value of federal data for mission, service, and the public good by guiding the Federal Government in practicing ethical governance, conscious design, and a learning culture” (posted online by the U.S. Office of Management and Budget (OMB), the CDO Council, and the General Services Administration (GSA), retrieved July 2023).
The realization that a data strategy is essential to all organizations has today become a seemingly obvious conception. Without a top-level data strategy, the odds are that an organization of any substance will become befuddled as they allow data to come and go, carelessly. The value of the data will not be realized. Worse still, the data can undercut the mission of the organization and get the entity into deep trouble as a result of using tainted data or employing data in disavowable or illegal ways.
Data strategies have risen to such a level of prominence that some argue persuasively that overarching mission-oriented strategies of an organization are shaped by their devised data strategy. You can construe this as a two-way street, namely that the top-level organization strategy drives the data strategy, and simultaneously the data strategy can drive the top-level organizational strategy.
In a data science article entitled “Big Data Dreams: A Framework For Corporate Strategy” by Matthew J. Mazzei and David Noble (Business Horizons, 2017), the authors emphasize the dramatic impact that data strategy has had on organizational strategy overall:
- “We are witness to a movement in practice that has begun to unravel much of the known strategic management theory developed over the last 40 years by eviscerating traditional value chains and competitive forces. The uses for data are shifting as collected data helps to determine what markets to explore and how consumer trends are changing, and the data can drive these determinations in real-time. We are seeing firms take on non-traditional markets, leveraging their data and analytic resources – in conjuncture with massive amounts of human and financial capital – to upend traditional barriers to entry.”
- “The ultimate goal of big data movers and innovators is to build greater knowledge and dynamic capabilities and to apply the benefits of big data analytics in a way that creates unique and sustainable competitive advantage through the development of diverse ecosystems and data flows. Through these advances in the consumption and application of big data, competition and competitive forces are being redefined.”
Many organizations have been hampered by getting themselves into what we now know to be commonplace data science traps. One such circumstance is the classic multiple versions of the truth (MVOT) dilemma.
Here’s how that goes.
An organization lacks a cohesive data strategy. Various elements of the organization each opt to create data, collect data, modify data, interpret data, and otherwise make use of data in whatever manner they so choose. When it comes time to figure out what is going on at the organization, there is no readily sensible way to coalesce the data into a logically consistent comprehensive whole. Little of the data is able to be reconciled with other data from throughout the organization. The values and meaning of the data roam all over the map.
This results in having multiple versions of the truth. One part of the organization claims that things are going tremendously well. Another part of the organization bemoans that things are going poorly. Upon looking at their respective data, each seems to be making a bona fide claim. They each in a sense have their own truth to tell. The problem is that it becomes onerous if not impossible to reconcile these disparate so-called truths.
Top leaders find themselves behind the eight ball. They want to have a single source of truth (SSOT), rather than having to contend with the debatable and irreconcilable MVOT or multiple versions of the truth. Furthermore, the aim is to get the entire organization aligned on data by ensuring that a single source of truth is the guiding light for all data aspects.
Here’s a noteworthy comment about SSOT and MVOT as mentioned in the Harvard Business Review (HBR):
- “A sound data strategy requires that the data contained in a company’s single source of truth (SSOT) is of high quality, granular, and standardized, and that multiple versions of the truth (MVOTs) are carefully controlled and derived from the same SSOT. This necessitates good governance for both data and technology” (source is “What’s Your Data Strategy?” by Leandro DalleMule and Thomas H. Davenport, Harvard Business Review, May-June 2017).
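To make the SSOT and MVOT distinction concrete, here is a minimal sketch in Python. The order records, field names, and the two revenue definitions are all hypothetical and invented purely for illustration. The point is that two departments can legitimately report different figures, yet because both are derived from the same underlying records, the gap between them can be fully explained.

```python
# Hypothetical order records serving as the single source of truth (SSOT).
SSOT_ORDERS = [
    {"order_id": 1, "region": "East", "amount": 120.0, "refunded": False},
    {"order_id": 2, "region": "East", "amount": 80.0, "refunded": True},
    {"order_id": 3, "region": "West", "amount": 200.0, "refunded": False},
]


def gross_revenue(orders):
    """Sales department's version of the truth: count every order."""
    return sum(order["amount"] for order in orders)


def net_revenue(orders):
    """Finance department's version of the truth: exclude refunded orders."""
    return sum(order["amount"] for order in orders if not order["refunded"])


sales_view = gross_revenue(SSOT_ORDERS)
finance_view = net_revenue(SSOT_ORDERS)
refunds = sum(order["amount"] for order in SSOT_ORDERS if order["refunded"])

# The two views differ, but the difference is fully accounted for by refunds
# because both views were computed from the same records.
print(sales_view, finance_view, sales_view - refunds == finance_view)
```

If each department instead kept its own separately edited copy of the orders, there would be no guarantee that the difference could be traced back to anything at all, which is the MVOT dilemma in a nutshell.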
An added insight is that you can think of a data strategy as a kind of sports-related conception, akin to playing football and making sure that you have both a strong offense and a robust defense, as it were.
Here’s how the same HBR article described the data defense and data offense roles:
- “Data defense and offense are differentiated by distinct business objectives and the activities designed to address them. Data defense is about minimizing downside risk. Activities include ensuring compliance with regulations (such as rules governing data privacy and the integrity of financial reports), using analytics to detect and limit fraud, and building systems to prevent theft. Defensive efforts also ensure the integrity of data flowing through a company’s internal systems by identifying, standardizing, and governing authoritative data sources, such as fundamental customer and supplier information or sales data, in a ‘single source of truth.’ Data offense focuses on supporting business objectives such as increasing revenue, profitability, and customer satisfaction. It typically includes activities that generate customer insights (data analysis and modeling, for example) or integrate disparate customer and market data to support managerial decision making through, for instance, interactive dashboards” (ibid).
That covers some keystones associated with data strategy.
Next, let’s consider the coupling of generative AI and data science. This will allow a subsequent exploration of how the pairing is reinventing or reinvigorating data strategies.
Generative AI And Data Science Become Close Partners
Let’s dive into three crucial perspectives in this particular sequence:
- (1) Foundations of Generative AI. Generative AI and what can and cannot be done with this latest advance in AI.
- (2) Foundations of Data Science. Data science realm and what data science consists of.
- (3) Pairing of Generative AI and Data Science. Generative AI and data science paired up and the synergies that arise accordingly.
I will first do a quick overview of generative AI. If you are already versed in generative AI, perhaps do a fast skim on this portion.
Foundations Of Generative AI
Generative AI is the latest and hottest form of AI and has caught our collective devout attention for being seemingly fluent in undertaking online interactive dialoguing and producing essays that appear to be composed by the human hand. In brief, generative AI makes use of complex mathematical and computational pattern-matching that can mimic human compositions by having been data-trained on the text and other content found on the Internet. For my detailed elaboration on how this works see the link here.
The usual approach to using ChatGPT or any other similar generative AI such as Bard (Google), Claude 2 (Anthropic), GPT-4 (OpenAI), etc. is to engage in an interactive dialogue or conversation with the AI. Doing so is admittedly a bit amazing and at times startling at the seemingly fluent nature of those AI-fostered discussions that can occur. The reaction by many people is that surely this might be an indication that today’s AI is reaching a point of sentience.
On a vital sidebar, please know that neither today’s generative AI nor any other type of AI is currently sentient. I mention this because there is a slew of blaring headlines that proclaim AI as being sentient or at least on the verge of being so. This is just not true. The generative AI of today, which admittedly seems startlingly capable of generating essays and interactive dialogues as though by the hand of a human, is using computational and mathematical means. No sentience lurks within.
There are numerous overall concerns about generative AI.
For example, you might be aware that generative AI can produce outputs that contain errors, have biases, contain falsehoods, incur glitches, and concoct seemingly believable yet utterly fictitious facts (this latter facet is termed AI hallucinations, which is another lousy and misleading name that anthropomorphizes AI, see my elaboration at the link here). A person using generative AI can be fooled into believing generative AI due to the aura of competence and confidence that comes across in how the essays or interactions are worded. The bottom line is that you need to always be on your guard and remain doubtful of what is being outputted. Make sure to double-check anything that generative AI emits. Better to be safe than sorry, as they say.
Into all of this comes a slew of AI Ethics and AI Law considerations.
There are ongoing efforts to imbue Ethical AI principles into the development and fielding of AI apps. A growing contingent of concerned and earnest AI ethicists are trying to ensure that efforts to devise and adopt AI take into account a view of doing AI For Good and averting AI For Bad. Likewise, there are proposed new AI laws that are being bandied around as potential solutions to keep AI endeavors from going amok on human rights and the like. For my ongoing coverage of AI Ethics and AI Law, see the link here and the link here.
The development and promulgation of Ethical AI precepts are being pursued to hopefully prevent society from falling into a myriad of AI-induced traps. For my coverage of the UN AI Ethics principles as devised and supported by nearly 200 countries via the efforts of UNESCO, see the link here. In a similar vein, new AI laws are being explored to try and keep AI on an even keel. One of the latest takes consists of a proposed AI Bill of Rights that the U.S. White House recently released to identify human rights in an age of AI, see the link here. It takes a village to keep AI and AI developers on a rightful path and deter the purposeful or accidental underhanded efforts that might undercut society.
That quick rundown of what’s up with generative AI should hopefully put us all in the same mindset and allow me to next dive into the topic of data science.
Foundations Of Data Science
Without seeming to be smarmy or vacuous, one could plainly state that data science is a field of study and a form of practice that entails using a science-based systematic approach to all facets of data. I realize that might seem lacking as a definition and you might be wanting a bit more meat on the bones.
A recent article in the Communications of the ACM (CACM) entitled “Data Science – A Systematic Treatment” by M. Tamer Ozsu (article posted online July 2023), provided a handy overview of data science and included these various reported definitions of data science:
- “Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets” (cited source is MacKay and Oldford, “Scientific Method, Statistical Method and the Speed of Light”, Statistical Science, 2000).
- “Systematic study of the organization and use of digital data for research discoveries, decision-making, and a data-driven economy” (cited source is the National Consortium for Data Science (NCDS), Ahalt, “Why Data Science?”, 2013).
- “Data science is a data-based approach to problem-solving by analyzing and exploring large volumes of possibly multi-modal data, extracting knowledge and insight from it, and using information for better decision-making. It involves the process of collecting, preparing, managing, analyzing, explaining, and disseminating the data and analysis results” (as directly proffered by Ozsu in the cited CACM article).
To garner a sense of what data science materially consists of, these data science building blocks were specified in the CACM article:
- Data Engineering. Includes big data management, data preparation, etc.
- Data Analytics. Includes exploring data and data mining, building models and algorithms such as machine learning, and data preparation.
- Data Protection. Includes security for data science, data privacy, and so on.
- Data Ethics. Impact on individuals, organizations, and society, ethical and normative concerns, bias in data, algorithmic bias, and regulatory issues.
Many tend to think of data science from a life cycle viewpoint.
The idea is that you start with a need or basis for considering data and then proceed through a series of iterative steps involving data engineering, data storage, data preparation, data analysis, data reporting for deployment and dissemination, etc. Numerous data science methodologies articulate the details of the data science life cycle and provide detailed checklists and guidance when undertaking a data science effort.
I trust that this gives you a semblance of what data science consists of.
We now get to the truly thrilling part, pairing up generative AI and data science.
Pairing Of Generative AI And Data Science
Two peas in a pod.
We have today’s generative AI that can seemingly fluently via mathematical and computational pattern-matching do all kinds of natural language narratives and interactions. You might use generative AI to help analyze your career aspirations and what job you should next pursue (see my coverage at the link here) or maybe use generative AI to analyze a knotty problem at work, and so on. There are lots and lots of analyses you can use generative AI to aid with.
Aha, you might be thinking, we ought to give due consideration to using generative AI to aid in performing analyses associated with data, especially performing the myriad of data science tasks that are customarily needed when systematically dealing with data.
Yes, congrats, you hit the proverbial nail on the head.
This makes a lot of sense for numerous reasons. A data scientist has to examine data to see what the data is and what kinds of data issues might exist. There is undoubtedly an in-depth analysis that needs to be undertaken. A data scientist might transform data and look for patterns. Score that as another kind of analysis. After searching for patterns, a data scientist might pull together data and make various displays or portrayals, along with interpreting and reporting on what the visualizations indicate. Once again, this involves various analyses.
Let’s put generative AI into these data science analysis tasks and see how it does.
Before I leap into the instance of the ChatGPT Code Interpreter as an example of using generative AI for these types of data science analyses, I’d like to cover the elephant in the room about the pairing of generative AI and data science.
Are you ready?
Some claim that data science is a subset of AI.
Drop the mic.
They fervently argue that AI makes use of data science, such as when collecting the voluminous data needed to data-train the AI. Likewise, the contention is that devising the AI pattern-matching and developing the large language model (LLM) or machine learning (ML) data structures involve doing data science and data mining (DM). As such, the viewpoint is that data science squarely and inarguably falls within the purview of AI.
Hogwash.
I counterargue that this is a false commingling.
Just because the field of AI, including generative AI, opts to make use of data science does not ergo allow us to conclude that data science is therefore a subset of AI. Please know that I don’t want to get into a blistering debate about this herein, but I do freely acknowledge that others make such a bold claim.
To me, this is confounding two otherwise separate domains. They have an intersection, for sure, but this doesn’t make either one a subset of the other.
I was pleased to see that the CACM article addressed the elephant in the room too:
- “Data science is not a subfield of ML/DM nor is it synonymous with these disciplines. More broadly, data science is not a subtopic of AI—a common claim originating from confusion on boundaries. AI and data science are conceptually different fields that overlap when ML/DM techniques are used in data analytics but otherwise have their own broader concerns. The broader scope of data science is discussed in this article, highlighting its constituents that are not part of AI. Conversely, there are topics in AI, such as agents, robotics, automated programming, and others, that are not within the scope of data science. Thus, AI and data science are related, but one does not encompass the other” (ibid).
Moving on, another essential side note worth covering consists of the wording arrangement involved in these two realms. Here’s an intriguing question to contemplate:
- Should we refer to the pairing as that of generative AI and data science, or should the wording be in the other sequence of stating that it is data science and generative AI?
You might be tempted to shrug your shoulders and say that the difference is negligible and that either order is acceptable.
Here’s my approach.
I generally try to refer to generative AI and data science when I am discussing the application of generative AI to data science endeavors. You might construe this as Generative AI -> Data Science, as though I am declaring that we are applying generative AI to data science. Now then, when I am referring to the use of data science to aid in devising generative AI, I tend to mention this as data science and generative AI. This could be construed as Data Science -> Generative AI, implying that data science is being applied to the crafting of generative AI.
This latter reference is also sometimes named as being data-centric AI, such as this research article posits:
- “Data-centric AI encompasses methods and tools to systematically characterize, evaluate, and monitor the underlying data used to train and evaluate models. At the ML pipeline level, this means that the considerations at each stage should be informed in a data-driven manner. We term this a data-centric lens. Since data is the fuel for any ML system, we should keep a sharp focus on the data, yet rather than ignoring the model, we should leverage the data-driven insights as feedback to systematically improve the model” (source is “DC-Check: A Data-Centric AI Checklist To Guide The Development Of Reliable Machine Learning Systems”, Nabeel Seedat, Fergus Imrie, and Mihaela van der Schaar, posted online November 2022).
My rule of thumb is this:
- (a) Generative AI and Data Science. Wording generally but not always implies that generative AI is being applied to data science, expressed more clearly as Generative AI -> Data Science.
- (b) Data Science and Generative AI. Wording generally but not always implies that data science is being applied to generative AI, expressed more clearly as Data Science -> Generative AI.
I am also fine with allowing the catchphrases in either sequence, such that it simply suggests that the two fields or domains are paired up. No need to be a stickler on the ordering question, it just isn’t worth the drawn-out drama.
When applying data science to generative AI (Data Science -> Generative AI), the data-centric AI approach imbues a data scientist’s perspective to consider the life cycle of data that goes into formulating an AI system. Per the above-cited article by Seedat and others, they suggest that AI developers should be asking these data-centric AI questions when crafting AI systems:
- Q1: “How did you select, collect or curate your dataset?”
- Q2: “What data cleaning and/or pre-processing, if any, has been performed?”
- Q3: “Has data quality been assessed?”
- Q4: “Have you considered synthetic data?”
- Q5: “Have you conducted a model architecture and hyperparameter search?”
- Q6: “Does the training data match the anticipated use?”
- Q7: “Are there different data subsets or subgroups of interest?”
- Q8: “Is the data noisy, either in features or labels?”
- Q9: “How has the dataset been split for model training and validation?”
- Q10: “How has the model been evaluated (e.g. metrics & stress tests)?”
- Q11: “Are you monitoring your model?”
- Q12: “Do you have mechanisms in place to address data shifts?”
- Q13: “Have you incorporated tools to engender model trust?”
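A couple of those questions lend themselves to a quick illustration. Here is a rough sketch in Python of what answering Q9 (how the dataset is split) and Q12 (detecting data shifts) might look like in the simplest possible form. The function names, the 20 percent validation fraction, and the crude drift measure are my own assumptions for illustration and are not drawn from the DC-Check article.

```python
import random
import statistics


def split_dataset(rows, validation_fraction=0.2, seed=0):
    """Q9 sketch: shuffle reproducibly, then hold out a validation slice."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def mean_shift_in_std_units(training_values, live_values):
    """Q12 sketch: how far the live mean has drifted from the training mean,
    measured in training standard deviations (a deliberately crude signal)."""
    training_mean = statistics.mean(training_values)
    training_std = statistics.pstdev(training_values)
    difference = abs(statistics.mean(live_values) - training_mean)
    if training_std == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / training_std


train, validation = split_dataset(range(10))
print(len(train), len(validation))
print(mean_shift_in_std_units([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```

Real-world projects would typically use stratified splits and proper statistical tests for drift; the sketch only shows that these checklist questions translate into concrete, checkable steps.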
I’ve previously extensively covered the data-centric AI and data science applied to AI wranglings associated with developing AI apps, such as at the link here and the link here.
ChatGPT Code Interpreter As A Data Science Pairing Exemplar
Let’s now explore the particulars of generative AI as applied to data science (Generative AI -> Data Science).
I’ll begin with a brief introduction to how and why generative AI is being bolstered by adding the capability to generate programming code. This will intricately tie into the application of generative AI to data science. It is, shall we say, the secret sauce.
Turn the clock back a few months or so.
For those of you that had earlier attempted to use generative AI for doing data science, you probably right away hit a wall that was difficult to overcome. The wall was that most generative AI didn’t play well with numbers. The text-oriented generative AI was preoccupied with words and wordsmithing, rather than being able to do calculations and readily deal with numbers.
I remember that when contemporary generative AI first rose to fame last year, some tried to use generative AI for exceedingly simple calculations that went bust. For example, you could potentially have instructed generative AI to add together one plus one and get an answer such as the number three. This seemed shocking. How could such a seemingly advanced AI system not properly calculate the simplest of additions?
The eye-opening realization was that the generative AI wasn’t necessarily doing calculations in a usual mathematical way. Instead, the use of words was being relied upon. Envision this. Search the Internet for the words entailing adding together one plus one. The chances are that out there on the Internet there are instances whereby people have written that the answer is three. They might do so for fun. They might do so to make a point, for example, that maybe you get something as added value when you add two items together, thus you get a bonus that comes to the amount of three.
Generative AI that was data-trained on words or text found throughout the Internet would potentially have found those patterns of wording. Therefore, when you asked in natural language to add together one plus one, the prior data training might have led the generative AI to have landed on the word three. This is sensible in the sense that it was wording that had been found during the initial data training.
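A toy illustration might help here. The following Python sketch is emphatically not how a large language model works internally; it is a simple frequency lookup over a tiny, deliberately skewed set of sentences that I made up. It merely shows how picking the most commonly seen wording, rather than doing arithmetic, can land on the wrong answer.

```python
from collections import Counter

# Hypothetical, deliberately skewed "training text" for illustration only.
CORPUS = [
    "one plus one is three",
    "one plus one is three",
    "one plus one is two",
]


def most_common_continuation(prompt, corpus):
    """Return the word that most often follows the prompt in the corpus."""
    counts = Counter()
    for sentence in corpus:
        if sentence.startswith(prompt + " "):
            remaining_words = sentence[len(prompt) + 1:].split()
            if remaining_words:
                counts[remaining_words[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


# No arithmetic happens here; the answer is whatever wording was seen most.
print(most_common_continuation("one plus one is", CORPUS))  # prints: three
```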
AI makers often stepped in when these seeming oddities arose and would adjust the generative AI so that it would no longer appear to be adrift on simple calculations. They might for example adjust the weights or guide the generative AI via RLHF (reinforcement learning from human feedback) toward answering two rather than answering three when presented with the one plus one question.
The crux is that the generative AI did not necessarily have any immediate means to do calculations per se.
This was soon remedied by allowing for APIs (application programming interfaces) that could connect the generative AI to other apps, see my coverage at the link here. When someone asks the generative AI to calculate one plus one, the question is fed into a calculator-oriented app outside of the generative AI. The calculator app does the needed calculation and returns the result to the generative AI. The generative AI then displays the result to you. You wouldn’t necessarily realize that the outside app had been used to get the result for you.
One potential constraint about using a calculator app is that it can likely only do calculations as predefined for that specific app. Suppose the calculator app wasn’t devised to do square roots. The generative AI would then be unable to get an answer to a question about a square root that someone might have entered into a prompt.
All of this is my way of bringing you to the advent of something quite spectacular that has both plusses and minuses. The big reveal is that AI makers have been steadily advancing generative AI and are now reasonably able to get generative AI to generate programming code.
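The calculator handoff and its built-in limitation can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the calculator_tool function, the set of supported operations, and the answer_with_tool wrapper are invented for illustration and do not represent any actual vendor API.

```python
import operator

# Hypothetical outside calculator app with a fixed, predefined set of operations.
SUPPORTED_OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def calculator_tool(operation, *operands):
    """Perform a predefined operation, or refuse if it was never built in."""
    if operation not in SUPPORTED_OPERATIONS:
        raise ValueError(f"calculator does not support '{operation}'")
    return SUPPORTED_OPERATIONS[operation](*operands)


def answer_with_tool(operation, *operands):
    """Stand-in for the generative AI passing a calculation to the outside app
    and wording the returned result for the user."""
    try:
        return f"The answer is {calculator_tool(operation, *operands)}."
    except ValueError as error:
        return f"Sorry, I cannot compute that: {error}."


print(answer_with_tool("add", 1, 1))   # prints: The answer is 2.
print(answer_with_tool("sqrt", 16))    # the tool was never given square roots
```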
Here’s the deal.
Imagine if we were able to get generative AI to be able to produce programming code, akin to the kind of programming that software engineers and programmers are able to compose. The generative AI could potentially then generate code of a desired nature as needed for whatever problem is at hand. For instance, if a user enters a prompt asking for square roots, the generative AI could generate programming code to do so, and then invoke or run the code correspondingly. The sky is the limit as to whatever programming code can be concocted.
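A rough Python sketch of that generate-then-run idea might look like the following. Please note the assumptions: the `generated_code` string is something I wrote by hand as a stand-in for what a generative AI might produce, and `run_generated_code` is a hypothetical helper rather than anything OpenAI has published. Also, running code in a separate namespace is not a security sandbox, so take this as an illustration of the flow only.

```python
# Hand-written stand-in for code a generative AI might produce for the
# prompt "what is the square root of 16?" (not real model output).
generated_code = """
import math

def solve():
    return math.sqrt(16)

result = solve()
"""


def run_generated_code(source):
    """Execute generated source in its own namespace and return `result`.

    A separate namespace is NOT a security sandbox; a real deployment
    would isolate execution at the process or container level.
    """
    namespace = {}
    exec(source, namespace)
    return namespace.get("result")


print(run_generated_code(generated_code))
```

Unlike a calculator app with a fixed menu of operations, nothing here had to anticipate square roots in advance. Whatever the generated code can express, the runner can attempt.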
I stated that this is spectacular because it opens up immense possibilities for what generative AI can do. Rather than being relatively constrained, the generative AI can potentially produce programming code to perform whatever function or calculations might be needed. You have taken the generative AI and given it a general-purpose programming capability that widens tremendously what the generative AI can accomplish.
And that takes us to the door of data science. We can now eagerly walk in that door and see what is inside. I’ll specifically explore the ChatGPT Code Interpreter, but as mentioned do realize that other generative AI apps are doing likewise.
OpenAI’s ChatGPT can be augmented by using an OpenAI plug-in known as Code Interpreter (currently available for ChatGPT Plus subscribers). The generative AI app can generate code that is then run or executed in what is said to be a sandboxed, firewalled execution environment. The basis for constraining where the code is run would be that if the generated code goes awry, you want to try and confine what it can get away with. The type of code currently being generated is based on the popular programming language Python.
Of the multitude of ways that this programming code-producing capability can be used, our interest in this discussion is in the arena of generating code to perform data science tasks. The good news is that the generative AI takes care of the programming specifics for you. You do not need to know how to write programs. You do not need to know the programming language Python. All that you need to know is how to use your everyday natural language to express what you want the generative AI to do when it comes to tackling data science tasks.
For example, you might begin a data science conversation with ChatGPT by importing a data file that you have available. You could then express in everyday language that you want the AI app to examine the data and tell you what it finds.
Assuming that the data file is readable, the AI app will likely examine the data and then indicate what it has found. For example, suppose the data consists of basketball teams and stats about each of the basketball players. The AI app might come back and tell you that the data consists of eight basketball teams. Furthermore, there are stats such as shooting percentage, time played, and other factors that are given for each basketball player.
At that juncture, you can ask the AI app to do a deeper analysis and discern whether the data might contain any issues or questionable values. The AI app might reply that one basketball player has zero time played and yet has various points scored, which seems like an inherent contradiction. You can ask the AI app what it advises be done about the data anomaly, or you can tell the AI app what you want it to do to correct or remove the data.
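Behind the scenes, the kind of code that gets generated for such an anomaly check could be as simple as this Python sketch. The sample rows and the column names (`minutes_played`, `points_scored`) are invented by me for illustration and are not real player data, nor is this the actual code that any particular AI app emits.

```python
import csv
import io

# Invented sample rows for illustration only; not real player data.
SAMPLE_CSV = """player,team,minutes_played,points_scored
Player A,Team 1,30,18
Player B,Team 1,0,12
Player C,Team 2,25,9
Player D,Team 2,0,0
"""


def find_contradictions(csv_text):
    """Return players who scored points despite zero minutes played."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        minutes = float(row["minutes_played"])
        points = float(row["points_scored"])
        if minutes == 0 and points > 0:
            flagged.append(row["player"])
    return flagged


print(find_contradictions(SAMPLE_CSV))
```

Player B gets flagged because of the zero minutes paired with twelve points. Player D is left alone since zero minutes and zero points are consistent with each other.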
Once the data seems to have been sufficiently initially reviewed, you could then ask the AI app to do a statistical analysis of the relationship between the height of each player and their propensity to score baskets. The AI app is likely to suggest several statistical methods and inform you as to which might be most promising. You could then tell it which one to run or ask it to decide for you.
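One statistical method that might be suggested for the height-versus-scoring question is a Pearson correlation. Here is a minimal Python sketch, under the assumption that this is the method chosen; the heights and scoring figures are numbers I invented for illustration, and the function name `pearson_correlation` is my own.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return covariance / math.sqrt(spread_x * spread_y)


# Invented figures for illustration only; not real player data.
heights_cm = [185, 190, 198, 203, 211]
points_per_game = [8.0, 12.5, 9.0, 15.0, 14.0]

r = pearson_correlation(heights_cm, points_per_game)
print(f"Correlation between height and scoring: {r:.2f}")
```

A value near +1 or -1 suggests a strong linear relationship, while a value near zero suggests little linear relationship. With only a handful of players, you would want to be cautious about reading much into the number.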
The AI app might then not only perform the statistical analysis but also provide a narrative describing and interpreting what the stats indicate. You could then ask the AI app to make bar charts, area charts, tree maps, pie charts, histograms, heat maps, box plots, scatter plots, line plots, and other types of visualizations. Along with those visualizations, you can request essays that depict what the graphed data illustrates and why it is significant to consider.
Notice that throughout the data science session, you did not need to perform any arcane commands or need to know how to produce the graphs. You merely expressed your preferences in ordinary language. That’s the beauty of using generative AI in this front-end manner. You interact with the generative AI. It generates the needed code, executes it in the sandbox, and then shows the results to you, interacting seamlessly with you along the way.
I hope that you can now grasp how this potentially opens up data science and makes data science available on a scaled-up basis. This also hopefully elucidates why some assert that the use of generative AI for data science is going to democratize data science. The user of the generative AI presumably doesn’t need to know one iota about data science. They simply describe generally what they want to have done, and the generative AI proceeds. Indeed, if the user isn’t sure of what kind of data science effort to undertake, they can ask the generative AI to make suggestions or recommendations. The user doesn’t have to be the active director and can instead be a passive receiver that just nudges along the generative AI.
Good Or Bad Tidings When Pairing Generative AI And Data Science
Is this pairing up a good thing or a scary thing?
The good aspects are perhaps self-evident.
The downside is that someone who has no idea whatsoever about how to do data science is suddenly and immediately able to assume that they can do so. They are likely to cheerfully rely upon generative AI. One issue is that if the generative AI messes up, or misstates something, the neophyte is not going to realize what has happened. They might decide to take the AI-generated data science analysis, toss it into a convenient email, and send it around to the rest of the organization as though it is entirely accurate and immaculate.
We also return to the single version of the truth (SVOT) enigma. What data did the user opt to import into the generative AI? The prevalence of garbage-in garbage-out (GIGO) can rear its ugly head. The user might have gotten some sloppy or outdated data. The generative AI won’t especially realize this. Meanwhile, the most elegant of data science analyses are performed, doing so on data that is neither correct nor proper.
Yikes, we are scaling up and democratizing GIGO, some might loudly lament.
I am only touching upon the tip of the iceberg. There’s a lot more to be considered. For example, suppose the generative AI misinterprets what the user says they want to have occurred. The user might assume that the analysis comports fully with their request even though it doesn’t.
Another example would be that the code generated by the generative AI turns out to be wrong and miscalculates things or sets up the visualizations based on parameters that misleadingly portray the data. You cannot blindly assume that the generative AI will produce viable code. The code might be executable, but that doesn’t mean the code is doing the right things. Etc.
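Here is a small Python sketch of code that runs without complaint and yet gives a misleading answer. The shot records are invented by me for illustration, and both function names are hypothetical. The first function averages each player’s shooting percentage, which ignores how many shots each player took. The second pools all the shots, which is usually what a team shooting percentage means.

```python
# Invented shot records for illustration: (shots_made, shots_attempted) per player.
shots = [(1, 2), (45, 100), (9, 10)]


def team_percentage_naive(records):
    """Runs fine, but averages per-player percentages and ignores shot volume."""
    per_player = [made / attempted for made, attempted in records]
    return sum(per_player) / len(per_player) * 100


def team_percentage_pooled(records):
    """Pools all shots first, so high-volume shooters count proportionally."""
    total_made = sum(made for made, _ in records)
    total_attempted = sum(attempted for _, attempted in records)
    return total_made / total_attempted * 100


naive = team_percentage_naive(shots)
pooled = team_percentage_pooled(shots)
print(f"Naive average of percentages: {naive:.1f}%")
print(f"Pooled team percentage:       {pooled:.1f}%")
```

On these made-up numbers, the naive version overstates the team’s shooting by roughly 12 percentage points. No error message appears, and a neophyte would have no obvious cue that anything is amiss.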
Conclusion
Some have wondered whether we need human data scientists at all, going forward.
If a neophyte can use a generative AI to perform data science work, maybe we no longer need data scientists per se. Think of the potential cost savings. Get rid of your expensive data scientists and just tell the rest of the organization they are now anointed as data scientists. That was easy.
Maybe we can take that a step further. Do we need humans to run the generative AI that is undertaking the data scientist activities? One viewpoint is that we can make this autonomous, akin to taking the human driver out of cars by devising an autonomously driven vehicle. We could aim to have autonomous artificial data scientists, namely generative AI or some allied kind of AI that does all the data science work, and no human intervention is required at all.
Those are some pretty big leaps of logic, especially since we are only now dipping our toes into the waters of generative AI as applied to data science. A lot of dominos would need to fall one after another to reach those lofty heights, and that outcome is currently a bit beyond the horizon. Let’s not count our chickens before they are hatched.
Returning to the data strategy topic, the application of generative AI to data science does require that you take a serious relook at your existing data strategies.
Consider these pressing questions that pertain to your data strategies already underway:
- Who in your organization is going to aid in overseeing how generative AI is used for data science purposes?
- What kind of data governance do you have that will encompass the proper use of generative AI for data science and seek to curtail or mitigate improper uses?
- Are there sufficient data protections concerning generative AI gaining access to organizational data that might otherwise have strict privacy and confidentiality stipulations?
- Which personnel in your organization will be greenlighted to use generative AI for data science?
- For those that aren’t permitted to use generative AI for data science, how will you detect when they do so, and what ramifications will there be for any such illicit use?
- How will you communicate throughout the organization about the tradeoffs and uses and misuses that can arise as a result of using generative AI for data science purposes?
- Etc.
If your data strategy has been sitting on a shelf, which is probably not a good practice, you now have an impetus to rethink and reinvent it.
The emergence of generative AI for data science is a wake-up call. You can use this wake-up call to rationalize why it is now timely to reexamine the data strategy of the organization. That alone provides a bona fide and useful net result from the generative AI and data science boon that is about to occur. You are at least spurred to revisit your data strategy and ensure that it is keeping up with the times.
You might want to identify some qualified data scientists in your organization that can start utilizing generative AI to explore how it will best fit your firm. Do not just toss them at this. Make sure they are aware of the AI Ethics and AI Law concerns that go into using generative AI, see for example my coverage at the link here.
I would also suggest that you do some prototype or pilot efforts, and that you include selected personnel who aren’t data scientists. You want to determine how they too are likely to make use of generative AI for data science activities. Make sure that this is done in a secure generative AI environment and that you aren’t letting the horse out of the barn as you try things out.
Tie together those efforts into a tidy bow by making sure to think of this on a mission-oriented organizational strategy basis. Data strategy can drive organizational strategy, while likewise, organizational strategy should drive data strategy. The new kid on the block, generative AI, must inextricably be blended into all of this.
Generative AI is going to assuredly drive data strategy, which can drive organizational strategy. Beyond the realm of data science, you can also assert that generative AI in general is going to be driving organizational strategy all told.
A final remark for now.
As Charles Dickens so aptly put things in the remarkable A Tale Of Two Cities: “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.”
Take my advice and without delay get onto updating your data strategy in light of generative AI. Your goal will be to maximize the age of wisdom and the season of light for attaining the best of times, along with minimizing or eliminating if feasible the season of darkness and a winter of despair.
You can do it.