Making Data Work

In today’s largely cloud-native world, access to real-world data is critical for organizations to derive meaning that drives innovation.

Yet most organizations still lack adequate tools to safely share, collaborate, and build with data. What’s more, many businesses lack the engineering resources to develop workarounds in the form of anonymized or synthetic datasets that can be used to run powerful tests.

Ali Golshan, the CEO and co-founder of Gretel.ai, says that dynamic has put the majority of businesses and organizations at a disadvantage when it comes to innovation, which is exactly where Gretel.ai comes in. The company, which was founded in 2019, aims to be the “GitHub of data,” stepping in to provide tools for data similar to the way the open platform effectively democratized building with code. Greylock has been an investor since 2020, and Sridhar Ramaswamy is on the company’s board.

“You have better tools, you get better data, you have more data, you can build better tools,” says Golshan. “Considering the immersive world we’re venturing into beyond the traditional web 2.0, and more sensors, IoT data collection with different types of methods, we felt that the larger community had the right to be able to not only have as great a set of tools, but to also be able to share and collaborate with that data.”

Gretel.ai works with a wide range of organizations across industry sectors, including gaming platforms and healthcare and biotech organizations such as genomic sequencing giant Illumina. Initially developed for heavily data-focused personas such as ML/AI engineers and data scientists, the platform became generally available for any user in February.

Golshan sat down with Greylock head of content and editorial Heather Mack on the Greymatter podcast to discuss the current data-powered business landscape and how Gretel.ai is enabling innovation. You can listen to the podcast at the link below, on YouTube, or wherever you get your podcasts.

Episode Transcript

Heather Mack:
Ali, thanks so much for joining us today on Greymatter.

Ali Golshan:
Yeah, thanks for having me.

HM:
So, Gretel has been called the GitHub of data, what exactly does that mean?

AG:
Sure. So, for a few decades, one of the trends that we saw was the community working on removing the compute bottleneck. And removing the compute bottleneck really consisted of many different things.

Obviously, the compute nature itself is accessing the resources. And this is where we saw emergence and usage from cloud native technologies, microservices, CI/CD, and the trend towards more developer tooling.

And GitHub, we feel, was a huge milestone in how engineers, developers, and scientists collaborated with each other. Before that, it was very difficult to have these repos and to collaborate and work with code. And democratization of building with code was really a major part of what gave rise to applications and services. And it became a huge driving force behind compute and the proliferation of it.

We think that the same bottleneck now applies to data. Essentially, we built an ecosystem around usage of data, higher velocity of development, and applications relying more and more on ML/AI for personalization and recommendation. And the problem with data is, it has higher entropy. It’s more bound by ethics, by privacy, by regulation, compliance, or just internal policies.

So, how you unlock and remove that bottleneck for data is really what we set out to do. And a lot of it really has to do with our own background. But the bottleneck for data is really the main problem that we’ve set out to solve here.

HM:
How did you identify this issue and decide to work on it? And then, once you figured that out, how did you know that it was something that could carry an entire company and maybe, not just a single product somewhere else?

AG:
So, there’s myself and my two co-founders: Alex Watson, who’s our CPO and runs product and applied research, and John Myers, who’s our CTO and runs engineering. We’ve known each other 10, 15 years, and we’ve had an interesting path. All three of us came out of the intelligence community. And then, we started our own companies. And those companies were eventually acquired by bigger companies.

What we have seen is both sides of the coin: where we are starting – where we have this lack of data, lack of access, inability to validate things or test things. And then the flip side of it – which is what it’s like to work in organizations where you have access to an abundance of data for testing, validation, and acceleration of R&D. But then, combining that data with extremely effective tooling. So, a level of tooling that is inaccessible by the majority of the community.

Having come from that background, what we looked at and saw was, more and more, this is becoming a self-fulfilling prophecy: you have better tools, you get better data, you have more data, you can build better tools. And as a result, a lot of the larger companies are building walled gardens around their data. And they’re creating this massive advantage for themselves with their data.

And considering the immersive world we’re venturing into beyond the traditional web 2.0, and more sensors, IoT data collection with different types of methods, we felt that the larger community had the right to be able to not only have as great a set of tools, but to also be able to share and collaborate with that data. Because collectively, there’s a lot more data available to us as a community than to individual companies.

So, while we can’t unleash data for everyone, what we set out to do was actually build the tools that these larger companies have access to. And we can make it accessible and easy to use for everyone. So, you didn’t have to have multiple PhDs to use differential privacy or you didn’t have to have PhDs to build entire pipelines.

And that was really the genesis of Gretel.

HM:
Mm-hmm. And so, you founded Gretel in 2019. What were some of the early conversations like with early investors, or early people who you were getting to help build the product, the first foundations of the company?

AG:
Yeah. So, obviously, this is where Greylock played a huge part, with Sridhar Ramaswamy, our investor, who joined us from the seed round. And a lot of the conversations we had early on really revolved around three areas, which was, if you need to remove the data bottleneck, what are the main questions users and customers need to answer?

So, when we started the company, we wrote the initial lines of code in late 2019. We released our first public beta in September of 2020. So, only nine months after we started. And very quickly, we realized there are three questions users were asking us, which is, How good is the quality of my data? As a result of that, what use cases can I actually power? And then, finally, How private is this data? Meaning, Can I share this with my team? Can I share this with my whole company? Or can I actually publish this on the web and collaborate with others?

This is an area where we spent a ton of time talking to users in our community. This is actually why we decided to open source our core libraries for synthetics and some of the work we do on NLP. Because the other thing we found that was very difficult for users to answer was, How do you assess the quality of the AI tooling or algorithms you’re building?

So, our view was, “Well, let’s be transparent about it. Let’s have the community and the larger ecosystem validate that for us.” So, based on those questions, we set out to really build tools that help create fast and easy access to safe data. And that was the genesis of it, which is, we released our beta in September of 2020.

And then, we actually released our second beta, which was really our GA preview, a year later, last summer. And at that point, we had a few thousand users on the platform. We had a full-featured product that was open to all users using GitHub and G-Suite to sign in. And at that point, we decided to start orienting a little bit more around use cases.

And what we saw was that the data bottleneck problem really led into three tiers of problems that map themselves into use cases, which is, one, I have a dataset that only a subset of my teams can use. So, how do I create larger access to that? Two, it was, I have a dataset, but only a small portion of that dataset is relevant to me. If I were to train ML/AI models, or run this for my data science use cases, I end up building huge biases in my answers, in my decision. So, how do I boost that underrepresented dataset without really having to invest massively in teams that are doing that themselves?

And then the third one was less of a technological challenge and more of an economic one for some customers, which was, rather than building an entire infrastructure for data collection or acquisition, sanitizing that data, and making it usable, How can I just generate data that I can use for testing before going out there and deciding what data I should even have?

So, [this is] the data-from-nothing problem, really. And this is where we really, heavily focused on synthetic data, which is, How do you create high quality synthetic data? How do you add visibility into that? So, we can answer the questions of quality and, as a result, what use cases you can power, and give a quantified view of privacy. So, what does it mean to have differential privacy in your training versus not from a quality and privacy standpoint?

And finally, some of the more recent work we’ve been doing, which is, as a user, I just want to share a schema and say, “Generate data for me, I don’t have that data, I haven’t collected that data.”

And we think that’s a very fundamental piece of this whole problem, which is, if we can move users away from their first answer of, “Let’s always just collect as much data as we can and we’ll figure it out later,” towards, “Let’s use safe data that is accessible to us, and then build from there,” I think that could be a huge fork in the road for overall work that we can help. So, that’s the progression we’ve gone through in the last year.

HM:
What are the advantages of using synthetic data?

AG:
In a lot of cases, synthetic data can actually end up yielding better results than raw data. And we can get into why that is in more detail, but at a very high level, if you have missing columns or fields, or you haven’t properly classified or labeled your data, you can end up having raw data training sets that are biased or actually lower quality than a synthetic dataset that has gone through proper labeling, transformations, and synthetics.

Our CPO, Alex, just gave a talk on this particular topic at the NVIDIA Conference, where he demonstrated, based on some of our research that we published openly, how synthetic data actually does yield better results in a lot of cases. And it goes back to the tooling approach: better labeling means better transforms, better transforms mean better privacy. Better privacy and labeling means better synthetic data compared to raw data.

So, that’s an area we’re very excited about. We know it will probably take a lot of education and market evangelism and advocacy around that. But, again, we have an ambitious goal about how we want to bring privacy as an enabler and accelerant to data.

HM:
How did you decide, or how did you go about figuring out how to build for all those different users? And who were some of those early adopters that helped you figure out how you would build this product roadmap?

AG:
Yeah. So, one of the companies that was actually very helpful to us from the very early stages was Riot Games. And they presented a unique challenge, and part of it was, they were a multinational company using multiple cloud environments.

But more importantly, they were a cloud native and product-led company where users were front and center. They needed to build trust and transparency with their users. But at the same time, they needed to figure out how to expand into markets that were already saturated. So, they were great.

Illumina is obviously another one, which we’ve written about, that has been tremendously helpful for us to work with. And there’s a number of others. But being a privacy company, we try not to talk about our customers too openly.

But I can iterate on the use cases, which is, what we found was, the market was and still is somewhat fragmented from a maturity standpoint. There are some that are just starting on this journey. And as we mentioned, there are some that are highly sophisticated in this area.

One of the first decisions we made was, we need to build a toolkit that very mature customers can work with. And at the same time, customers early in their journey can actually leverage that. So, as we looked at the data problems, we said, “Well, what are a few pieces that every company needs in place to be able to have high quality data that is private, shareable, and can be used for higher, production-grade use cases?” And what we found was that a lot of customers were doing a lot of manual or, for example, scripted labeling or classification of their data.

So, one of the tools that we ended up building is our classification and labeling, which is to remove that notion of manual labeling and classification, whether it’s structured or unstructured data, because that’s a very complicated problem. And while there are a lot of companies working in this particular space, as part of a comprehensive data toolkit, we still felt like we needed to invest in that, because most companies haven’t made their investments there.

But that’s not where we’re trying to differentiate. We just think labeling and classification using advanced NLP and NER technology is just good hygiene to have as part of your data toolkit. So that was one part. The second part of the problem and use cases we saw was that a lot of engineers, developers, especially product engineers, were trying to gain quick access to data from production, so they can accelerate R&D and build better products, better experiences, A/B testing.

And that’s where we found transformations to be highly valuable. How do I identify data? How do I ensure that I am encrypting, tokenizing, or anonymizing production data for lower-level environments? That was another set of use cases that we saw for customers that were starting to become more data oriented, more cloud native oriented, and then wanted to rely more on data to drive their businesses.
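To make the transformation step described here more concrete, the following is a minimal Python sketch of tokenizing sensitive fields before production data is copied into a lower-level environment. It is an illustration only and is not drawn from Gretel’s product or libraries: the field names, the `SECRET_KEY` value, and the helper names `tokenize` and `transform_record` are all assumptions made for this example, and keyed hashing (HMAC-SHA256) is just one of several possible tokenization approaches.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real key would come from a key manager.
SECRET_KEY = b"example-only-key"

# Hypothetical set of field names treated as sensitive in this sketch.
SENSITIVE_FIELDS = {"email", "name"}


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic, keyed token for a sensitive value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep a short prefix of the digest so tokens stay readable in test data.
    return "tok_" + digest[:16]


def transform_record(record: dict) -> dict:
    """Copy a record, replacing sensitive fields with tokens."""
    return {
        field: tokenize(str(value)) if field in SENSITIVE_FIELDS else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    production_row = {"name": "Ada Lovelace", "email": "ada@example.com", "plan": "pro"}
    print(transform_record(production_row))
```

Because the token is deterministic for a given key, the same customer maps to the same token across tables, so joins still work in the test environment while the original values stay out of it.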

This is where we saw a lot of correlated patterns, especially from the top of the material company markets, those who established their businesses. And now, their growth strategy is really around, How do I use my data as a differentiator and expand myself? And then, the last one, really, from a use case standpoint was, I want to be able to move, freely share, or collaborate with data that is high quality and reminiscent of my raw data.

And this is where our synthetic data came in. And synthetic data for us is a deep neural net language model that trains on your existing data. Since then, we’ve actually expanded to using GANs as well, because we deal with text or time series or tabular data. Actually, recently, we just expanded our synthetics into image data as well, based on work we’re doing with some technology and health sciences companies.

But coming back to your question, we started with each of these as individual tools that can be consumed individually if you need to, because you’re just doing labeling or just transforms. And since then, we’ve actually rolled out more of a platform approach, which is, if you are trying to generate the highest quality synthetic data for tabular data, well, first you want to make sure your data is properly labeled and classified, then you might want to actually encrypt or anonymize some outliers in your dataset so they’re not standing out and reversible, and then synthesize it.

So, we’ve gone from this tooling – which is still available – to a more cohesive platform approach based on the use cases, because the common theme we saw from users was, they wanted to automate away as much complexity from end-to-end workflows as possible, and remove as much of that overhead burden of managing and maintaining systems as they can. This is also why you can consume our product completely as a SaaS product, or just deploy the compute components as containers in your environment.

So, our entire mission has been, How do we make things as easy as possible to be consumed individually or fully end-to-end automated? But it was really those individual use cases that actually had us settle on this toolkit approach. That then allowed us to be a more horizontal application. So, rather than focusing just on fintech or health sciences, we found a way we can mix and match these tools and can actually solve some higher order problems.

HM:
Fascinating. And then, last month, you released your GA product. And so, who would be able to use that? And I’m wondering, what was a possible use case? Who could use it now that you maybe didn’t even think of when you first started the company?

AG:
That’s a really great question. So, we initially started to settle on the personas of data engineers and developers focused on data. So, ML/AI engineers, or data scientists, and that was a very tight scope.

But one of the things we’ve done is, if you, for example, log into our dashboard, you see our dashboard is actually organized very similarly to GitHub: the same way you have different code repos, we have data repos.

So, you can sign in and see who’s working with what data in your company, and what data you actually have access to. And you can just drag and drop a JSON or CSV file of data and automatically classify or transform or synthesize it, with configurations automatically picked for you and orchestration automatically managed for you by our cloud.

As a result of that, one of the really interesting patterns we’ve seen is individuals or smaller companies, startups, SMBs, generally, that don’t have the resources to invest in those teams coming and wanting to use a low-code, no-code approach to data engineering, for synthetics, labeling, or transforms, using our product.

A great example is, we had a researcher from Ohio who was working on a very niche, specific dataset. And he’s not a data engineer, and he actually wanted to transform and synthesize it, so he can share it with some hospitals. We actually had a financial trader that was working on a very specific vertical and wanted to see how her data mapped to some more geographically specific grey areas.

So, the point is that the focus on UX, UI, and ease of use has actually demonstrated to us that we have a much larger and broader set of folks we can go after. And that was, eventually, what we always wanted to get to: How do we make working with data so easy and safe that everybody can use it?

Eventually, it doesn’t just have to be Pandas DataFrames, or CSVs, or JSON, or streams that you’re using. What if you can just take simple files, upload and synthesize and transform them, so you can actually collaborate through that? So, that is the longer term path we’re taking. But we’re very excited to see the emergence of that, actually, in our datasets now.

HM:
Right. And very specifically, since you launched, a lot has happened that has reiterated the role of data in healthcare decision making, and how are you working on that?

AG:
Yeah, that’s a great question.

So, health sciences has generally been a large area we’ve been working on because of that collaboration bottleneck with data. So, what we have found is, during the COVID pandemic, for a lot of health sciences, pharma, and hospitals, a lot of traditional red tape was removed just because they needed to move faster and make decisions. And as a result, they actually saw what faster work with data can really mean to them, from trials to innovation, to even larger collaboration.

One of the areas that we can talk about is, for example, the work we’ve been doing with Illumina, which is the world’s largest genomics company. And the work we’re doing with them is, building synthetic datasets of genotype and phenotype, which are some of the most complex and challenging data in the world, hundreds of thousands of columns, millions of rows, to be able to demonstrate that.

Another example is actually the work we ended up doing with University of California Irvine, where their problem was not necessarily collaboration, but they were trying to collaborate with other groups on improving the detection of heart disease in female patients. And what they had was an order of magnitude more sample datasets for male patients versus female patients. As a result, treatments were always yielding towards male patients. So, in that area, we actually demonstrated this functionality we have called autocomplete for data, where the system automatically learns relevant datasets and boosts them, so you can improve that.
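The imbalance described here, where one group has roughly ten times more samples than another, can be illustrated with a small Python sketch. This is not Gretel’s “autocomplete for data” feature, which is described as learning from the data and generating new records. As a simpler stand-in, the sketch uses plain random oversampling (duplicating existing minority rows with replacement) to show what balancing the classes means. The function name `oversample_minority` and the `sex` field are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    if not rows:
        return []
    rng = random.Random(seed)
    # Group rows by their class label.
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


if __name__ == "__main__":
    # Toy dataset with a 10:1 imbalance between two groups.
    patients = [{"sex": "M"} for _ in range(10)] + [{"sex": "F"}]
    balanced = oversample_minority(patients, "sex")
    print(Counter(row["sex"] for row in balanced))
```

Duplicating rows adds no new information, which is the limitation a generative approach aims to address by producing new, statistically similar records instead of copies.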

And now, we’re working with a number of other institutions around these very similar areas. But what we’re finding is that removing bias, higher quality, and privacy is really critical to health sciences. And we’re starting to see that branch into other areas, but health sciences, and generally, healthcare itself has been a big area of investment for us, especially because it’s always really nice to do some good as a result of the work you’re doing.

HM:
I’m curious, once you start working with a research institution, or a healthcare institution, what’s the onboarding effort of working with you? Does it take a lot of training?

AG:
We built our entire product to essentially be able to operate as a self-serve platform for engineers and developers and scientists. We’ve invested as much as we can in automating away complexity. And as you can imagine, a lot of the work that comes with labeling your data, transforming it, properly determining configurations, and model training, and then orchestrating an entire environment to run those training models and produce models out is a very difficult process for most companies.

So, what we didn’t want to do was build a product that was focused on the top 5% of the market where they could throw hundreds of engineers at it. What we actually want to do is bring this to the masses. So, if somebody is even an individual analyst and wants to share and synthesize their data, they can do that. So, the majority of our users actually just go to our website, Gretel.ai, sign up using a G-Suite, GitHub, or SSO account, and they can drop directly into the console and start synthesizing, classifying, or transforming data.

Now, to the question you asked, there are a number of strategic partners or customers we work with, because beyond being a customer, they are potentially a gateway to a much larger opportunity for us.

So, Illumina is a great one, right? They are doing some phenomenal work around bringing synthetic data for genomics, to the larger market for hospitals, for researchers, for smaller institutions, so they can benefit from that research. We’re doing the same thing with one of the largest cloud providers, some of the other service providers around compute. And actually, in gaming as well.

So, in those scenarios, what we are finding is, it’s the typical aspects of the market maturity, where you have some who are very much bleeding edge, and have some very specific use cases. In those areas, we tend to engage with them on more hands-on cases, because it’s a great learning lesson for applied research, engineering and product teams. And it helps us prioritize a little bit more our product roadmap with better anticipation as to what are the key movers for some of the more large markets, especially in financial technologies, and health sciences, gaming, and just general technology.

So, a very small subset of our customers and partners we engage with directly. The majority tend to start with the open source, move into the free tier, and then graduate up from there. But then, we do have an enterprise tier where a lot of larger companies engage with us on a dedicated basis.

HM:
Very cool. These use cases are really fascinating. Any more like very specific ones you can share?

AG:
Yeah. So, yeah, there’s actually a couple of really interesting use cases that we can talk about. One is, we can’t name the company, but it’s a Fortune 10 company that we’re working with, where what they are trying to do is combine time series data with seasonality of that data.

So, as an example, they’re trying to take commerce data as it relates to a particular region of the world and forecast what that same buying pattern would look like in a completely remote area at a different time of the year. And this is an area where we were thinking about how a combination of synthetic data can be very helpful.

However, we ended up pulling up that particular roadmap item when we actually ended up releasing a blog about it. And the blog we ended up writing was [a question of] How do I train my synthetic model on images of a city, and then the flow of traffic, which is more tabular and time series, and then be able to use that feature learning to train it on different images of cities and actually predict the same models of traffic or flow? And as we gradually fine-tuned that, we demonstrated that you can then actually apply that to other types of variables.

The variables just become interchangeable. We can take those out. And one of the things we ended up doing was actually, predicting commerce patterns in Japan based on data we did not have, but publicly available data that we could crawl, anonymize, inject differential privacy, and then make a very large prediction around it. So, that was very interesting for us.

One of the pharma companies that we’re working with is doing a lot of research on skin cancer. What they wanted to do was generate variations of what skin cancer could look like versus other types of anomalies or discolorations on skin and answer how do you tease those apart.

Now, that’s one side of the problem. What they wanted to be able to do was, can we take this image information and actually combine it with doctors’ notes, which are mostly time series or tabular or even free text data. And say, “Can you pull features or context from this data, and actually be able to give it some context or priming of what this image should look like?”

So, treating each one by itself is very difficult: making a decision on an outcome just on notes or just on an image.

So, this is an area we’re very excited about, because it applies to many things. We’re actually working with a different company who’s using a similar mindset to say, “What if I wanted to generate fake receipts and be able to train my models on it?”

So, combining textual data with image data and making accurate predictions on top of it, whether it’s for health detection or for financial use cases, is another area that we’ve been heavily focused on. On the fintech side of things, we’ve seen that it is actually related a lot to your point of detecting black swan events, and what variables in here have the potential to move the needle in unexpected ways, because history is bound to repeat itself. It’s just making sure you understand the correlations and patterns in history, and how they repeat themselves and under what conditions.

And then, obviously, on the other side of it in commerce, it has been very helpful for us. So, those are two interesting areas we have seen emerge, and we think are vastly growing areas.

And if I were to step back, I think there’s three layers that we classify this into. There’s generally synthetic data itself, and there’s a lot of companies working on synthetic data, whether it’s textual, image, visual, or audio.

But then, there’s two additional steps that we have taken, which we think are very exciting. And this is, again, to your previous question, why we chose to work with some of these really bleeding edge companies that are pushing the needs for data. One was, How do you create visibility and true usability for data? It’s not just that I have synthetics, it’s understanding Does this synthetic not only match my original data, but what can I do with it? How can I make it operational as quickly as possible without really needing teams?

And then, building very comprehensive reporting around that synthetic [data] that gives your scientists that deep visibility into correlations and statistical anomalies, but then gives higher level decision makers confidence that this is something we can bet on.

And then, the third one, which we just mentioned, which is How can Gretel help me generate data if I don’t have any data at all? Without me going through the economic barriers of collecting data and building an entire infrastructure?

That third tier is something we’re very excited about, because that ultimately not only solves the data bottleneck, but solves a lot of major privacy bottlenecks as well.

So, this is becoming a very interesting year. And obviously, we’re excited to see a lot of other companies contribute to this space as well.

HM:
So, it being 2022, I’d be remiss if we didn’t at least mention Web3. So, are there any applications for Gretel and building out this next generation of the internet as we’re talking about it in the Web3 lens?

AG:
Yeah. So, when we set out with our toolkit approach, we were very conscious not to build verticals. But one of the areas where we have recently, since our GA, started to see some uptake is around Web3 companies. And they fit into two different areas. One is more around companies that are trying to bring financial services to Web3. And they need to do better forecasting or be able to do testing around particular demographics.

And as you can imagine, there is extremely sparse, low quality data or a lack of data for that particular industry. I mean, it’s blazing a trail, right? There is not a lot of data there. So, we’ve started to see an emergence of companies that are coming to us and saying, “Hey, we know we have extremely small subsets, or there’s public datasets that are very sparse. Can you train and generate models for us where we can then generate unlimited amounts of high quality data to make some better predictions?”

That’s one area we’ve seen. Another very specific area that I’m finding very interesting is actually a category of Web3 gaming companies that have come to us. And I’ll use one specific example without naming the company, which is, this company is focused on a particular type of racing, and they build it on the blockchain. And what they do is, they have a production chain and a test chain.

And it’s a tricky thing, because they just can’t take production data and load it into their test chain because it’s a public chain. But at the same time, they need to be able to anonymize parts of it – synthesize parts of it – and still have the same distributions. Because what they’re trying to replicate is 90% of the same data. But then, 10% of the data that is being generated by their users is very much generated through pseudo random generators, because it’s meant to be new NFTs, or new models for their games.

And those, our synthetic actually, have to learn how the distributions of those look like, and then regenerate them in their test network. So, they’re absolutely, fascinating use cases. Frankly, Web3 was not something I was digging in deep, until I got exposed to it in the last six months because of some of the use cases. But we are starting to see a swell from a lot of Web3 companies because of either their data needs or some replication, and their test chains and test networks based on data that is mostly randomly generated, so they can make better predictions as to what could be coming, in next cases for them.
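Editor's illustration: the idea of learning what a distribution looks like and regenerating it for a test network can be sketched in a few lines. This is not Gretel's method or API. It is a deliberately simple stand-in that fits an empirical frequency table to one categorical attribute and resamples from it. The function names, the "rarity tier" attribute, and the sample values are all hypothetical.

```python
import random
from collections import Counter


def fit_empirical_distribution(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def sample_synthetic(distribution, n, seed=None):
    """Draw n new values whose expected frequencies match the fitted ones."""
    rng = random.Random(seed)  # seeded so a test run is reproducible
    categories = list(distribution)
    weights = [distribution[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


# Hypothetical production attribute (an item rarity tier); not real data.
production = ["common"] * 90 + ["rare"] * 9 + ["legendary"] * 1

dist = fit_empirical_distribution(production)
test_chain_values = sample_synthetic(dist, n=1000, seed=7)
```

The generated values never copy a specific production record. They only follow the same category frequencies, which is the property the test chain needs in this simplified picture. A real generative model would also capture relationships between fields, which this sketch ignores.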

HM:
Very cool.

That’s all really interesting. And I’m wondering, what are areas that you’re just starting to think about, where you haven’t quite started building products or haven’t really targeted potential users yet, but you’re thinking, here’s an application but we’re just not quite sure yet?

AG:
Yeah. So, at the very highest level, one thing that we’re thinking about is, How do we become the single platform for all types of synthetic data, but not all types of synthetic data in a vertical or in isolation? But as we mentioned, How do I take synthetic data that is high enough quality as tabular, combine it with image, combine it with visuals, and make accurate predictions to be able to train ML and AI systems on it?

So, really, the next 18 to 24 months – and a lot of the areas that we haven’t built but are venturing into – is becoming the single platform for all types of synthetic data, that then you can actually make predictions on top of combining that synthetic data. And being able to build reporting on top of that, that says, “Here is a prediction that your ML or AI models made on your raw data. Here is a prediction that your models made on our synthetic data.”

And then, being able to give the users concrete reporting as to what is the delta? What is the efficacy of you making your decisions and predictions on synthetics versus raw data? We believe that is a very key piece of enabling and accelerating the community’s adoption of synthetic data, because the main misperception around this space is, raw data will always yield better results for me compared to synthetic data.

So, being a one stop shop for all synthetic data types is a big goal that we have on our plate for the next 24 months.
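Editor's illustration: the raw-versus-synthetic reporting described above comes down to scoring two sets of predictions against the same held-out labels and reporting the difference. The sketch below assumes accuracy as the metric and uses made-up predictions. The function names and numbers are hypothetical and do not represent Gretel's reporting.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def efficacy_report(raw_preds, synth_preds, labels):
    """Compare a model trained on raw data with one trained on synthetic data."""
    raw_acc = accuracy(raw_preds, labels)
    synth_acc = accuracy(synth_preds, labels)
    return {
        "raw_accuracy": raw_acc,
        "synthetic_accuracy": synth_acc,
        "delta": synth_acc - raw_acc,  # negative means synthetic scored lower
    }


# Made-up held-out labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
raw_model_preds = [1, 0, 1, 1, 0, 1, 0, 1]        # model trained on raw data
synthetic_model_preds = [1, 0, 1, 0, 0, 1, 0, 1]  # model trained on synthetic data

report = efficacy_report(raw_model_preds, synthetic_model_preds, labels)
```

A delta close to zero is the kind of evidence that would counter the assumption that raw data always yields better results.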


HM:
Yeah, I would imagine that taking away some of the negative connotations of the word “synthetic” is one of the hardest things.

AG:
Yes, yes.

HM:
… especially for expanded use cases. And then, you did just touch on this, but are there any other things we should talk about that you’re focusing on over the next few years?

AG:
Yeah, so there are a number of areas that are tangential to what we would consider to be synthetic, or generally a toolkit for building better data. So, on the areas we’re very heavily focused on: if you look at our mission, our mission is creating fast and easy access to safe data. Now, the fast and easy, there’s a lot packed in there.

So, one of the areas we’re heavily focused on is, how do we make Gretel another tool in the ecosystem of tools for developers, but a very easy tool to use, and a substantially easier tool to integrate with other services.

At the end of the day, Gretel is a set of APIs that are data in, data out. So, we want to be a bump in the line in your wire, whether you’re streaming data, pointing to services like S3 or other storage systems, warehouses, or data lakes, or, for example, just passing structured data to it.

So, as part of that, we are taking a huge initiative and not only building open source, but what we call connectors. So, one-button connectors to S3 buckets. We actually recently did a workshop on how Gretel, using a couple of lines of code, can now integrate into Apache Airflow and create an instant streaming synthetic data pipeline for you.

So, usability is a huge part of our investment. We understand, again, most teams, most companies don’t have the resources to work with overly complex data sets. So, how do you make that as easy as possible to consume?

When we talk about a platform, our vision of the platform is anything that automates away complexity for you. That means building better, easier integrations and simple, one-button connectors. And then stitching to the rest of the community tools is a big part of our investment over the next year to year and a half.

We feel like the last two years have really been around that problem of product-market fit and validation of use cases. And now, we have a really good understanding of what services or tools are just upstream or downstream from us. But the stitching to the other tooling is a big part of it. And we take a lot of onus on being a good citizen of the tools community for the developers.

HM:
Right. And where are you today, actually, as a company? Who makes up Gretel?

AG:
Sure. So, the company overall right now is 40 people. We are a fully distributed and remote-first company.

So far, just to maintain a little bit easier working relationship with peers, we’ve managed to keep that within just North America, across three time zones, so we try to stay only between the Pacific and Eastern time zones just to make it easier for teams to work together. But right now, our team is spread between the US and Canada.

Our goal is to roughly double in size this year, and hopefully, double again next year. We’re investing very heavily across the board. Actually, just about a month ago, we brought on our sales and customer success teams because of that rush and swell we have on that side of things.

And then, the other area we’re investing very heavily in is our talent team. We have a four-person, in-house talent team that is only going to grow. And the people side is very vital to us. One of our co-founders… actually, on the board for us is Laszlo Bock, who was the SVP of Operations and Human Resources at Google. So, obviously, he and Sridhar go way back there. And we feel like they present two really great sides of it: on one side, Sridhar, who’s heavily focused on our community, on our product and technology, on feedback; and Laszlo, giving us very deep insights about how to build a people-friendly organization.

So, a lot of our efforts, and frankly, a lot of my time goes towards, How do we make it a better environment for people? How do we create psychological safety for our team, and create an environment of survival of the fittest ideas? And one of the things that is a top-of-mind thing that all of our leadership and team brings up is, I think everybody is so focused on hyper scaling and growing and hiring.

We’re trying to take a very systematic approach where our teams are not spending all their time interviewing, and we’re not having to onboard 10, 15 people every month or two, to allow that culture to organically grow as we’re adding to it. The great thing about it is, it actually can be solved via product. You need substantially fewer resources on the people side of things if you are very thoughtful and insightful on how you build.

Build things that are easy to use, automated, scalable, don’t require a lot of professionals or management people to run it. And this is really a testament to what our team has done. I’ve just been lucky to work with some of the best people we’ve managed to find. And yeah, it’s a fun ride. It’s very different than my first two companies. I was joking with my two co-founders that it’s nice to see what the other side of the coin looks like, when not everything is always on fire.

HM:
So, as you said, you are rapidly expanding. You had your Series B at the end of last year, you just released your general platform, you’re looking to hire across almost every department. So, more specifically, who are you looking for right now?

AG:
Yeah. So, thanks for bringing that up. So, yeah, we raised our Series B, it was a $50 million Series B last October. And a lot of that is really for our commercial go-to-market. So, we’re hiring across the board, but just to focus on a few areas: Engineering and Applied Research are two areas we’re very heavily focused on.

So, our Applied Research team is actually an entire independent organization. We’re investing quite heavily into that. We really care about the applied part of that, because a lot of that work is again, removing complexity and putting it into easy-to-use ways for our users. And then, the other area we’re very heavily focused on hiring right now is our marketing team.

So, in our marketing, we’re hiring for growth, we’re hiring developer advocates, a lot of developer relations. We think of marketing really as market intelligence and community work, not the traditional view of marketing. And then, the other part of it is investing a lot in customer success and solutions. So, those are the three main areas that we’re heavily trying to scale over the next six to nine months as we expand into the market.

HM:
Across North America?

AG:
Yes.

HM:
Right.

AG:
We officially have a Canadian Gretel entity now too, so we can officially hire in Canada, not as just consultants anymore.

HM:
Congratulations. Awesome.

AG:
Thank you.

HM:
Great.

Okay, fascinating. I learned a ton today, as I’m sure our audience did as well. And Ali, thank you so much for joining me on Greymatter today.

AG:
Thank you very much, Heather. Thanks for having me.
