Subscribe to Blog via Email
I’m in sunny Phoenix this week at the Data Platforms 2017 Conference and looking forward to a break in the heat when I return to Colorado this evening.
As this event is for big data, I expected to present on how big data could benefit from virtualization, but was surprised to find that I didn’t have a lot of luck finding customers utilizing us for this reason, (yet). As I’ve discussed in previous presentations, I was aware of what a “swiss army knife” virtualization is, resolving numerous issues, across a myriad of environments, yet often unidentified.
To find my use case, I went out to the web and found a great article, “The Case for Flat Files in Big Data Projects“, by Time.com interactive graphics editor, Chris Wilson. The discussion surrounds the use of data created as part of ACA and used for another article, “How Much Money Does Your Doctor Get From Medical Companies“. The data in the interactive graphs that are part of the article is publicly available from cms.gov and the Chris discusses the challenges created by it and how they overcame it.
Upon reading it, there was a clear explanation of why Chris’ team did what they did to consume the data in a way that complimented their skill set. It resonated with anyone who works in any IT shop and how we end up with technical choices that we’re left to justify later on. While observing conversations at the conference this week, I lost count of how often I accepted the fact that there wasn’t a “hadoop” shop or a “hive” shop, but everyone had a group of varied solutions that resulted in their environment and if you didn’t have it, don’t count it out- Pig, Python, Kafka or others could show up tomorrow.
This results in a more open and accepting technical landscape, which I, a “legacy data store” technologist, was welcome. When I explained my upcoming talk and my experience, no one turned up their nose at any technology I admitted to having an interest in.
With the use case found online, was also the data. As part of the policies in the ACA, cms.gov site, (The Center for Medicare and Medicaid) you can get access to all of this valuable data and it can offer incredible insight into these valuable programs. The Time article only focuses on the payments to doctors from medical companies, but the data that is collected, separated out by area and then zipped up by year, is quite extensive, but as noted by a third article, as anticipated as the data was, it was cumbersome and difficult to use.
I proceeded to take this use case and imagine it as part of an agile environment, with this data as part of the pipeline to produce revenue by providing information to the consumer. What would be required and how could virtualization not only enhance what Chris Wilson’s team had built, but how could the entire pipeline benefit from Delphix’s “swiss army knife” approach?
There were four areas that I focused to solve and eliminate bottlenecks that I either experienced or foresaw an organization experiencing when having this data as part of their environment.
Each of the files, compressed were just over 500M and uncompressed, 15-18G. This took about over 4 minutes per file to transfer to a host and could add up to considerable space.
I used Delphix vFile to virtualize files. This means that there is a single, “gold copy” host of the files at the Delphix engine and then there’s an NFS Mount that “projects” the file access to each target, which can be used for unique copies to as many development, test and reporting copies.
If a refresh is required, then the parent is refreshed and an automated refresh to all the “children” can be performed. Changes can be made at the child level and if catastrophic, Delphix would allow for easy recovery, allowing for data version control, not just code version control throughout the development and testing process.
Its a pretty cool feature and one that is very valuable to the big data arena. I heard countless stories of how often, due to lack of storage, data scientists and developers were taking subsets of data to test and then once to production, find out that their code wouldn’t complete or fail when presented against the full data. Having the ability to have the FULL files without taking up more space for multiple environments would be incredibly beneficial and shorten development cycles.
Most big data shops are still dependent on legacy data stores. These are legendary roadblocks due to their size, complexity and demands for refreshes. I proposed that those be virtualized so that each developer could have a copy and instant refresh without storage demands to again, ease development deadline pressures and allow for full access of data towards the development success.
Most people know we mask relational databases, but did you know we have Agile Data Masking for flat files? If these files are going to be pushed to non-production systems, especially with as much as we’re starting to hear about GDPR, (General Data Protection Regulations) from the EU in the US now, shouldn’t we mask outside of the database?
What kind of files can be masked?
Thats a pretty considerable and cool list. The ability to go in and mask data from flat files is a HUGE benefit to big data folks. Many of them were looking at file security from the permissions and encryption level, so the ability to render the data benign to risk is a fantastic improvement.
The last step is in simple recognition that big data is complex and consists of a ton of moving parts. To acknowledge how much is often home built, open source, consisting of legacy data stores, flat files, application and other dependent tiers, should be expected.
Delphix has the ability to create templates of our sources, (aka dSources) which is nothing more than creating a container. In my use case enhancement, I took all of these legacy data stores, applications, (including any Ajax code) flat files and then create a template from it for simple refreshes, deployments via jenkins, Chef jobs or other DevOps automation. The ability to then take these templates and deploy them to the cloud would make a migration from on-prem to the cloud a simpler process or from one cloud vendor to another.
The end story is that this use case could be any big data shop or start up in the world today. So many of these companies are hindered by data and Delphix virtualization could easily let their data move at the speed of business.
I want to thank Data Platforms 2017 and all the people who were so receptive of my talk. If you’d like access to the slide deck, it’s been uploaded to Slideshare. I had a great time in Phoenix and hope I can come back soon!
I’ve been involved in two data masking projects in my time as a database administrator. One was to mask and secure credit card numbers and the other was to protect personally identifiable information, (PII) for a demographics company. I remember the pain, but it was better than what could have happened if we hadn’t protected customer data….
Times have changed and now, as part of a company that has a serious market focus on data masking, my role has time allocated to research on data protection, data masking and understanding the technical requirements.
The percentage of companies that contain data that SHOULD be masked is much higher than most would think.
The amount of data that should be masked vs. is masked can be quite different. There was a great study done by the Ponemon Instititue, (that says Ponemon, you Pokemon Go freaks…:)) that showed 23% of data was masked to some level and 45% of data was significantly masked by 2014. This still left over 30% of data at risk.
We also don’t think very clearly about how and what to protect. We often silo our security- The network administrators secure the network. The server administrators secure the host, but doesn’t concern themselves with the application or the database and the DBA may be securing the database, but the application that’s accessing it, may be open to accessing data that shouldn’t be available to those involved. We won’t even start about what George in accounting is doing.
We need to change from thinking just of disk encryption and start thinking about data encryption and application encryption with key data stores that protect all of the data- the goal of the entire project. It’s not like we’re going to see people running out of a building with a server, but seriously, it doesn’t just happen in the movies and people have stolen drives/jump or even print outs of spreadsheets drives with incredibly important data residing on it.
As I’ve been learning what is essential to masking data properly, along with what makes our product superior, is that it identifies potential data that should be masked, along with ongoing audits to ensure that data doesn’t become vulnerable over time.
This can be the largest consumption of resources in any data masking project, so I was really impressed with this area of Delphix data masking. Its really easy to use, so if you don’t understand the ins and outs to DBMS_CRYPTO or unfamiliar with the java.utilRANDOM syntax, no worries, Delphix product makes it really easy to mask data and has a centralized key store to manage everything.
It doesn’t matter if the environment is on-premise or in the cloud. Delphix, like a number of companies these days, understands that hybrid management is a requirement, so efficient masking and ensuring that at no point is sensitive data at risk is essential.
How many data breaches do we need to hear about to make us all pay more attention to this? Security topics at conferences are diminished vs. when I started to attend less than a decade ago, so I know it wasn’t that long ago it appeared to be more important to us and yet it seems to be more important of an issue.
Research was also performed that found only 7-19% of companies actually knew where all their sensitive data was located. That’s over 80% sensitive data vulnerable to a breach. I don’t know about the rest of you, but upon finishing up on that little bit of research, I understood why many feel better about not knowing and why its better just to accept this and address masking needs to ensure we’re not one of the vulnerable ones.
Automated solutions to discover vulnerable data can significantly reduce risks and reduce the demands on those that often manage the data, but don’t know what the data is for. I’ve always said that the best DBAs know the data, but how much can we really understand it and do our jobs? It’s often the users that understand it, but may not comprehend the technical requirements to safeguard it. Automated solutions removes that skill requirement from having to exist in human form, allowing us all to do our jobs better. I thought it was really cool that our data masking tool considers this and takes this pressure off of us, letting the tool do the heavy lifting.
Along with a myriad of database platforms, we also know that people are bound and determined to export data to Excel, MS Access and other flat file formats resulting in more vulnerabilities that seem out of our control. Delphix data masking tool considers this and supports many of these applications, as well. George, the new smarty-pants in accounting wrote out his own XML pull of customers and credit card numbers? No problem, we got you covered… 🙂
So now, along with telling you how to automate a script to email George to change his password from “1234” in production, I can now make recommendations on how to keep him from having the ability to print out a spreadsheet with all the customer’s credit card numbers on it and leave it on the printer…:)
Happy Monday, everyone!