Subscribe to Blog via Email
I’m in sunny Phoenix this week at the Data Platforms 2017 Conference and looking forward to a break in the heat when I return to Colorado this evening.
As this event is for big data, I expected to present on how big data could benefit from virtualization, but was surprised to find that I didn’t have a lot of luck finding customers utilizing us for this reason, (yet). As I’ve discussed in previous presentations, I was aware of what a “swiss army knife” virtualization is, resolving numerous issues, across a myriad of environments, yet often unidentified.
To find my use case, I went out to the web and found a great article, “The Case for Flat Files in Big Data Projects“, by Time.com interactive graphics editor, Chris Wilson. The discussion surrounds the use of data created as part of ACA and used for another article, “How Much Money Does Your Doctor Get From Medical Companies“. The data in the interactive graphs that are part of the article is publicly available from cms.gov and the Chris discusses the challenges created by it and how they overcame it.
Upon reading it, there was a clear explanation of why Chris’ team did what they did to consume the data in a way that complimented their skill set. It resonated with anyone who works in any IT shop and how we end up with technical choices that we’re left to justify later on. While observing conversations at the conference this week, I lost count of how often I accepted the fact that there wasn’t a “hadoop” shop or a “hive” shop, but everyone had a group of varied solutions that resulted in their environment and if you didn’t have it, don’t count it out- Pig, Python, Kafka or others could show up tomorrow.
This results in a more open and accepting technical landscape, which I, a “legacy data store” technologist, was welcome. When I explained my upcoming talk and my experience, no one turned up their nose at any technology I admitted to having an interest in.
With the use case found online, was also the data. As part of the policies in the ACA, cms.gov site, (The Center for Medicare and Medicaid) you can get access to all of this valuable data and it can offer incredible insight into these valuable programs. The Time article only focuses on the payments to doctors from medical companies, but the data that is collected, separated out by area and then zipped up by year, is quite extensive, but as noted by a third article, as anticipated as the data was, it was cumbersome and difficult to use.
I proceeded to take this use case and imagine it as part of an agile environment, with this data as part of the pipeline to produce revenue by providing information to the consumer. What would be required and how could virtualization not only enhance what Chris Wilson’s team had built, but how could the entire pipeline benefit from Delphix’s “swiss army knife” approach?
There were four areas that I focused to solve and eliminate bottlenecks that I either experienced or foresaw an organization experiencing when having this data as part of their environment.
Each of the files, compressed were just over 500M and uncompressed, 15-18G. This took about over 4 minutes per file to transfer to a host and could add up to considerable space.
I used Delphix vFile to virtualize files. This means that there is a single, “gold copy” host of the files at the Delphix engine and then there’s an NFS Mount that “projects” the file access to each target, which can be used for unique copies to as many development, test and reporting copies.
If a refresh is required, then the parent is refreshed and an automated refresh to all the “children” can be performed. Changes can be made at the child level and if catastrophic, Delphix would allow for easy recovery, allowing for data version control, not just code version control throughout the development and testing process.
Its a pretty cool feature and one that is very valuable to the big data arena. I heard countless stories of how often, due to lack of storage, data scientists and developers were taking subsets of data to test and then once to production, find out that their code wouldn’t complete or fail when presented against the full data. Having the ability to have the FULL files without taking up more space for multiple environments would be incredibly beneficial and shorten development cycles.
Most big data shops are still dependent on legacy data stores. These are legendary roadblocks due to their size, complexity and demands for refreshes. I proposed that those be virtualized so that each developer could have a copy and instant refresh without storage demands to again, ease development deadline pressures and allow for full access of data towards the development success.
Most people know we mask relational databases, but did you know we have Agile Data Masking for flat files? If these files are going to be pushed to non-production systems, especially with as much as we’re starting to hear about GDPR, (General Data Protection Regulations) from the EU in the US now, shouldn’t we mask outside of the database?
What kind of files can be masked?
Thats a pretty considerable and cool list. The ability to go in and mask data from flat files is a HUGE benefit to big data folks. Many of them were looking at file security from the permissions and encryption level, so the ability to render the data benign to risk is a fantastic improvement.
The last step is in simple recognition that big data is complex and consists of a ton of moving parts. To acknowledge how much is often home built, open source, consisting of legacy data stores, flat files, application and other dependent tiers, should be expected.
Delphix has the ability to create templates of our sources, (aka dSources) which is nothing more than creating a container. In my use case enhancement, I took all of these legacy data stores, applications, (including any Ajax code) flat files and then create a template from it for simple refreshes, deployments via jenkins, Chef jobs or other DevOps automation. The ability to then take these templates and deploy them to the cloud would make a migration from on-prem to the cloud a simpler process or from one cloud vendor to another.
The end story is that this use case could be any big data shop or start up in the world today. So many of these companies are hindered by data and Delphix virtualization could easily let their data move at the speed of business.
I want to thank Data Platforms 2017 and all the people who were so receptive of my talk. If you’d like access to the slide deck, it’s been uploaded to Slideshare. I had a great time in Phoenix and hope I can come back soon!
I was in a COE, (Center of Excellence) meeting yesterday and someone asked me, “Kellyn, is your blog correct? Are you really speaking at a Blockchain event??” Yeah, I’m all over the technical map these days and you know what?
I love the variety of technology, the diversity of attendance and the differences in how the conferences are managed. Now that last one might seem odd and you might think that they’d all be similar, but its surprising how different they really are.
Today I’m going to talk about an aspect of conferences that’s very near to my heart, which is networking via events. For women in technology, there are some unique challenges for us when it comes to networking. Men have concerns about approaching women to network- such as fearful of accusations of inappropriate interaction and women have the challenge that a lot of networking opportunities occur outside of the workplace and in social situations that we may not be comfortable in. No matter who you are, no matter what your intentions, there’s a lot of wariness and in the end, women often just lose out when it comes to building their network. I’ve been able to breach this pretty successfully, but I have seen where it’s backfired and have found myself on more than one occasion defending both genders who’ve ended up on the losing side of the situation.
With that said, conferences and other professional events can assist with helping us geeks build our networks and it’s not all about networking events. I noticed a while back that the SQL Server community appeared to be more networked among their members. I believe part of this is due to the long history of their event software and some of its features.
Using the SQL Pass website, specifically the local user group event management software- notice that its all centralized. Unlike the significantly independent Oracle user groups, SQL Server user groups are able to use a centralized repository for their event management, speaker portal, scheduling, etc. It’s not to say that there aren’t any events outside of Pass Summit and SQL Saturdays, there’s actually a ton, but this was the portal for the regional user groups, creating the spoke that bridged out to the larger community.
Outside of submitting my abstract proposals to as many SQL Saturdays worldwide from one portal, I also can maintain one speaker biography, information about my blog, Twitter, Linkedin and other social media in this one location.
The second benefit of this simplicity, is that these biographies and profiles “feed” the conference schedules and event sites. You have a central location for management, but hundreds of event sites where different members can connect. After abstracts have been approved and the schedule built, I can easily go into an event’s schedule and click on each speaker biography and choose to connect with anyone listed who has entered their social media information in their global profile.
Using my profile as an example, you’ll notice the social media icons under my title are available with a simple click of the mouse:
This gives me both an easy way to network with my fellow speakers, but also an excuse to network with them! I can click on each one of the social media buttons and choose to follow each of the speakers on Twitter and connect with them on Linkedin. I send a note with the Linkedin connection telling the speaker that we’re both speaking at the event and due to this, I’d like to add them to my network.
As you can join as many regional and virtual user groups as you like, (and your Pass membership is free) I joined the three in Colorado, (Denver, Boulder and Colorado Springs.) Each one of those offers the ability to also connect with the board members using a similar method, (now going to use Todd and David as my examples from the Denver SQL Server user group.)
The Oracle user groups have embraced adding twitter links to most speaker bios and some board groups, but I know for RMOUG, many still hesitated or aren’t using social media to the extent they could. I can’t stress enough how impressed I am when I see events incorporate Linkedin and Twitter into their speaker and management profiles, knowing the value they bring to technical careers, networks and the community.
Although the SQL Server community is a good example, they aren’t the only ones. I’m also speaking at new events on emergent technologies, like Data Platforms 2017. I’ll be polite and expose my own profile page, but I’m told I’m easy to find in the sea of male speakers… 🙂 Along with my picture, bio and session information, there are links to my social media connections, allowing people to connect with me:
Yes, the Bizzabo software, (same software package that RMOUG will be using for our 2018 conference, along with a few other Oracle events this coming year) is aesthetically appealing, but more importantly, it incorporates important networking features that in the past just weren’t as essential as they are in today’s business world.
I first learned the networking tactic of connecting with people I was speaking with from Jeff Smith and I think its a great skill that everyone should take advantage of, no matter if you’re speaking or just attending. For women, I think it’s essential to your career to take advantage of opportunities to network outside of the traditional ways we’ve been taught in the past and this is just one more way to work around that glass ceiling.