Ubuntu Data Engineering and Analytics

Switching to Ubuntu on My Data Engineering and Analytics Workstation

Several years back, I installed the Ubuntu Linux environment on my primary data engineering and analytics workstation. At the time, I wanted to see how the architecture that runs my data servers would function as my data desktop. That installation kickstarted a rewarding trip headlong down the Linux rabbit hole.

Fast forward to today, and I’m still committed: I spend the majority of my time in my Linux environment. Granted, I do need to dip into Windows and Mac OS from time to time, but most of the apps I use are either browser-based or support Linux installations.

So, if you’ve ever considered taking a similar plunge, keep reading. I think you’ll like what you see.

What is “Linux” Anyway?

Linux isn’t an operating system per se. The colloquial term “Linux” is really just a catch-all for the various open-source environments built atop the Linux kernel. Though these environments have very little mainstream desktop operating system market penetration, they power the overwhelming majority of servers and nearly all other computing segments.

“Linux” is also very much “a movement.” Collectively, it is the world’s largest open-source software project, with 13,594 developers from at least 1,340 companies contributing to the Linux kernel since 2005 alone. The source code is open for anyone to run, modify, or redistribute as they wish under the terms of the GNU General Public License.

That flexibility and the massive community behind Linux have enabled the development of a variety of Linux flavors (called distros). If you have a niche or a need, there’s a Linux distro for you! The unparalleled stability and capabilities of Linux systems make them the go-to solution for professionals and hobbyists alike. Linux is not a peasant in the computing kingdom; it is the king.

Linux powers the vast majority of major commercial computing. Source: The Linux Foundation

Picking Ubuntu as My Data Engineering and Analytics Workstation Distro

When I decided to make an OS shift on my data engineering and analytics workstation, I was already familiar with Ubuntu because of its widespread use as a server operating system. I soon discovered, however, that there were many, many other flavors to choose from.

There were very accessible distros like Ubuntu, Manjaro, and Mint on one end of the spectrum and more challenging distros like Arch and Gentoo on the other. Each obviously had pros and cons, and, despite what the Arch cult will tell you on Reddit, there was no “one distro to rule them all.” That’s part of the beauty of Linux environments. There is so much room to customize, improve, and craft!

Ultimately, I knew I wanted a distro that (at least) met the following conditions:

A well-supported and well-maintained base and a large, supportive community. 

I tried to stick to stable distros with active, engaged developers.

Ample troubleshooting resources

These could be official docs or even tutorials and instructions provided by other users. Leaps into Linux are best done with helpful resources on hand.

Transparency

The developers needed to be upfront about what’s under the hood. Fortunately, this is a key philosophy of every Linux distro I could find. Be sure that your distro is reputable, trustworthy, and follows the Four Essential Freedoms. You should have the freedom to:

  • Run the program, for any purpose.
  • Study how the program works and adapt it to your needs.
  • Redistribute copies so you can help others.
  • Improve the program and release your improvements to the public, so that everyone benefits.

Considerate of CPU and RAM resources

I wanted to avoid any distros that would tax my system unnecessarily. That was particularly important because I would rather have those resources running my analytics programs instead of running the OS itself. A good Linux distro should not require you to always have the latest hardware.

Could run the software I needed it to run

This was an absolute necessity. If I wasn’t able to use my tools (or find comparable ones), the costs would outweigh the benefits.

Would expose areas for growth

I take learning my craft seriously, and a good distro would enable that growth and challenge me.

As you know, in the end, I opted for Ubuntu on my data engineering and analytics workstation. It is a very well-established distro with very few barriers to entry. If this is your first Linux distro, I highly recommend considering Ubuntu. You should feel right at home with it if you’re used to working in environments like Mac OS.

If you’re a little more confident in your Linux abilities, take a look at Manjaro XFCE. That was my runner-up choice. It’s very lightweight and has an excellent, intuitive desktop environment. I actually have a hard drive running it at the moment, and I love what I’ve seen so far.

An example of a gorgeous XFCE customization by u/addy-fe. With Linux, you can tailor your desktop environment to meet your exact specifications. Source.

Why Ubuntu Meets my Data Engineering and Analytics Workstation Needs and Why I Recommend It

Most of you reading this are probably considering moving to Linux from Windows or Mac OS. As such, I’m recommending a user-friendly and forgiving Linux distro that will make that transition smooth and enjoyable. Of course, this distro is also great for experienced Linux users.

If you’re looking for more of a challenge and are already familiar with Linux, I have two other recommendations. There is Manjaro XFCE for those who still want a casual desktop experience with plenty of room for customization, and Arch Linux for the die-hards who want to build their desktop environment one piece at a time from the Terminal.

The Majority of My Data Engineering and Analytics Tools Happily Run on Ubuntu

This whole undertaking would have been a non-starter if my tools didn’t run on Ubuntu. I use the following tools in my daily work, and they all play well with Linux.

  • Google Chrome and Mozilla Firefox for web browsing
  • JetBrains products (DataGrip, PyCharm, WebStorm) | local installation
  • | local installation
  • Postman | local installation
  • Altair GraphQL Client | local installation
  • G-Suite products (Sheets, Apps Script, Google Cloud) | browser-based
  • Amazon Web Services (EC2, RDS, CloudFront, S3) | browser-based
  • Custom software I have written myself

There are, of course, some tools (e.g., Microsoft Excel and Adobe Creative Cloud) that still require me to dip into my Windows or Mac OS environment, but I don’t need to use those all the time.

Ubuntu is very stable, backed by a dedicated organization, and “just works”

The whole Ubuntu project is run by an organization called Canonical. It has more than 500 employees across 39 countries working to ensure that Ubuntu is stable and remains a relevant modern computing solution. They deliver releases every six months and regular LTS releases for enterprise production use. They also provide security updates and facilitate interactions between members of the community.

Ubuntu does have a slightly polarized image among Linux users. Some see Canonical as a corporate overlord, a force for evil. Others (myself included) appreciate the stability and accessibility of the platform and see Canonical’s involvement as a net positive.

Best of all, Ubuntu just works. Making the switch from Windows meant that I was able to spend dramatically less time on headaches like trying to get dependencies/packages/libraries working together and more time doing the work I needed to do.

Ubuntu is free, open-source, and easy to manage

Ubuntu is free to download and free to use. There isn’t some “premium” version of Ubuntu that will pounce on you later. It’s all there and all free. You’re also able to modify it substantially because you have almost total control over the operating system. This can, however, create problems if you mistakenly change something critical.

On the flip side, when something does go wrong, you can generally fix it quickly. Ubuntu’s user base has produced plenty of well-crafted tutorials and instructions to get you back on your feet when you stumble. Also, because of the attentiveness of the Linux development community in general, bugs typically get squashed in short order. All this translates to less time waiting for help desk techs and more time getting work done.

Ubuntu is lightweight and fast

You can run an entire Ubuntu desktop environment on a 4GB bootable USB flash drive. This compact, streamlined architecture means you can run the latest version of Ubuntu on older hardware. If you have an old desktop or laptop lying around, try installing Ubuntu and enjoy your newfound fast, performant Linux machine.

The bootable USB is also an excellent option for those wanting to “test drive” Ubuntu without committing to a full installation. You have the whole environment on the stick and can get a feel for the fit.

A screengrab from the Developer Tools section of the Ubuntu Software app. The software here is free and open-source.

Ubuntu encouraged me to become more familiar with my machine and prepared me for other distros like Manjaro XFCE, Arch, and Amazon Linux AMI

Even a distro as polished as Ubuntu still had a very indie feel. There were subtle hints at every turn suggesting so much potential under the surface waiting to be tapped.

I had the ability to interface closely with my operating system in Ubuntu in ways you don’t get working with Windows or Mac OS. I was encouraged to get under the hood and found myself losing my prior fear of The Terminal. Ultimately, with my newfound confidence and skills, I was able to explore other distros like Manjaro XFCE and Arch. I was also prepared to work more effectively with the Amazon Linux AMI instances I used in my work.

The Terminal and Bash

Automation and semi-automation are the keys to the data kingdom. Anecdotally, we all know that we work way too long on repetitive processes as data professionals. You need to be able to automate substantial amounts of work if you’re going to keep your sanity in the face of that pressure. If you can automate enough, you might even be able to actually make progress instead of just keeping pace 😉.

Linux systems (like Ubuntu) liberally facilitate automation through the Terminal and Bash scripting. (For those unfamiliar with the Terminal, it’s analogous to the Command Prompt in Windows, and it’s integral to the Linux experience.)
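
To make that concrete, here is a minimal sketch of the kind of small chore Bash handles well: reporting the line count of every CSV file in a folder. The function name count_csv_lines is my own placeholder rather than a standard command, and the sketch assumes your files end in .csv.

```shell
# count_csv_lines is a hypothetical helper name, not a standard command.
# Usage: count_csv_lines [directory]   (defaults to the current directory)
count_csv_lines() {
  local dir="${1:-.}"
  local file
  for file in "$dir"/*.csv; do
    # Skip the unexpanded glob pattern when no CSV files are present.
    [ -e "$file" ] || continue
    # Redirecting into wc prints only the number; tr strips any padding.
    printf '%s\t%s\n' "$(basename "$file")" "$(wc -l < "$file" | tr -d ' ')"
  done
}
```

Calling count_csv_lines on a folder of your choice prints one line per file: the file name, a tab, and the number of lines. The count includes the header row.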

I’m not advocating you throw out your Python, R scripts, etc. Don’t do that! You should think of Bash as just another useful tool in your toolbox.

One area where Bash excels is probing CSV files (even large ones). The simple head command, for instance, will print the first ten lines of a CSV file by default so you can get a feel for things. If you need to know the row count, you just use cat FILE_NAME | wc -l (the result counts every line, so subtract one for the header). Want to figure out the data types of each column? Using csvstat, a tool from the csvkit suite, you can generate a detailed summary of the data features within a CSV.
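
As a rough sketch of that first look, the helper below combines head, tail, and wc to list the header columns by position and count the data rows. The name csv_peek is hypothetical, and the sketch assumes a simple CSV with a single header line and no commas or line breaks inside quoted fields.

```shell
# csv_peek is a hypothetical helper name, not a standard tool.
# It lists the header columns by position and counts the data rows.
csv_peek() {
  local file="$1"
  # Split the header line on commas and number each column name.
  head -n 1 "$file" | tr ',' '\n' | nl
  # Skip the header line, then count the remaining lines.
  printf 'data rows: %s\n' "$(tail -n +2 "$file" | wc -l | tr -d ' ')"
}
```

Running csv_peek on a file gives you a numbered list of column names followed by the data row count, which is often enough to decide whether a fuller profile is worth running.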


Seeing csvstat in action

The example below uses the popular mtcars dataset familiar to those who use R. The technique, however, works on much larger datasets.

$ head mtcars.csv
model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
$ csvstat mtcars.csv
 1. "model"

 Type of data:          Text
 Contains null values:  False
 Unique values:         32
 Longest value:         19 characters
 Most common values:    Mazda RX4 (1x)
                        Mazda RX4 Wag (1x)
                        Datsun 710 (1x)
                        Hornet 4 Drive (1x)
                        Hornet Sportabout (1x)

 2. "mpg"

 Type of data:          Number
 Contains null values:  False
 Unique values:         25
 Smallest value:        10.4
 Largest value:         33.9
 Sum:                   642.9
 Mean:                  20.091
 Median:                19.2
 StDev:                 6.027
 Most common values:    21 (2x)
                        22.8 (2x)
                        21.4 (2x)
                        19.2 (2x)
                        15.2 (2x)

 3. "cyl"

 Type of data:          Number
 Contains null values:  False
 Unique values:         3
 Smallest value:        4
 Largest value:         8
 Sum:                   198
 Mean:                  6.188
 Median:                6
 StDev:                 1.786
 Most common values:    8 (14x)
                        4 (11x)
                        6 (7x)

 4. "disp"

 Type of data:          Number
 Contains null values:  False
 Unique values:         27
 Smallest value:        71.1
 Largest value:         472
 Sum:                   7,383.1
 Mean:                  230.722
 Median:                196.3
 StDev:                 123.939
 Most common values:    275.8 (3x)
                        160 (2x)
                        360 (2x)
                        167.6 (2x)
                        108 (1x)

 5. "hp"

 Type of data:          Number
 Contains null values:  False
 Unique values:         22
 Smallest value:        52
 Largest value:         335
 Sum:                   4,694
 Mean:                  146.688
 Median:                123
 StDev:                 68.563
 Most common values:    110 (3x)
                        175 (3x)
                        180 (3x)
                        245 (2x)
                        123 (2x)

 6. "drat"

 Type of data:          Number
 Contains null values:  False
 Unique values:         22
 Smallest value:        2.76
 Largest value:         4.93
 Sum:                   115.09
 Mean:                  3.597
 Median:                3.695
 StDev:                 0.535
 Most common values:    3.92 (3x)
                        3.07 (3x)
                        3.9 (2x)
                        3.08 (2x)
                        3.15 (2x)

 7. "wt"

 Type of data:          Number
 Contains null values:  False
 Unique values:         29
 Smallest value:        1.513
 Largest value:         5.424
 Sum:                   102.952
 Mean:                  3.217
 Median:                3.325
 StDev:                 0.978
 Most common values:    3.44 (3x)
                        3.57 (2x)
                        2.62 (1x)
                        2.875 (1x)
                        2.32 (1x)

 8. "qsec"

 Type of data:          Number
 Contains null values:  False
 Unique values:         30
 Smallest value:        14.5
 Largest value:         22.9
 Sum:                   571.16
 Mean:                  17.849
 Median:                17.71
 StDev:                 1.787
 Most common values:    17.02 (2x)
                        18.9 (2x)
                        16.46 (1x)
                        18.61 (1x)
                        19.44 (1x)

 9. "vs"

 Type of data:          Boolean
 Contains null values:  False
 Unique values:         2
 Most common values:    False (18x)
                        True (14x)

 10. "am"

 Type of data:          Boolean
 Contains null values:  False
 Unique values:         2
 Most common values:    False (19x)
                        True (13x)

 11. "gear"

 Type of data:          Number
 Contains null values:  False
 Unique values:         3
 Smallest value:        3
 Largest value:         5
 Sum:                   118
 Mean:                  3.688
 Median:                4
 StDev:                 0.738
 Most common values:    3 (15x)
                        4 (12x)
                        5 (5x)

 12. "carb"

 Type of data:          Number
 Contains null values:  False
 Unique values:         6
 Smallest value:        1
 Largest value:         8
 Sum:                   90
 Mean:                  2.812
 Median:                2
 StDev:                 1.615
 Most common values:    4 (10x)
                        2 (10x)
                        1 (7x)
                        3 (3x)
                        6 (1x)

Row count: 32

You may think that’s all good and fine, but how about manipulating files? For example, one of the more frustrating things about working with US government datasets is that they love to use delimited formats outside of “normal” tab-separated or comma-separated values.

Let’s suppose you need to convert 8 million pipe-delimited (|) records to comma-delimited (,) records. You can’t do this in a spreadsheet. There are too many rows. It’s also a pain to have to boot up a whole program like R just to convert a file. So what can you do?

You can actually do this in the Linux terminal. The command you’d use is sed (short for stream editor). Here’s what it looks like:

sed -i 's/^/"/;s/|/","/g;s/$/"/' yourFileName

Easy! You just plug that in, sit back, and let sed stream through those millions of rows. Because the -i flag edits the file in place, run it on a copy if you want to keep the original.
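
If editing the file in place makes you nervous, a slightly more cautious variant is sketched below. Treat it as a sketch under assumptions rather than a drop-in replacement: the helper name pipe_to_csv is hypothetical, and I’ve added one substitution that doubles any double quotes already in the data so the quoted output stays valid CSV.

```shell
# pipe_to_csv is a hypothetical helper name. It reads a pipe-delimited file
# and writes a quoted, comma-delimited copy, leaving the original untouched.
# Usage: pipe_to_csv SOURCE_FILE DEST_FILE
pipe_to_csv() {
  local src="$1"
  local dest="$2"
  # 1. Double any existing double quotes (an addition to the command above).
  # 2. Open a quote at the start of each line.
  # 3. Turn every pipe into a closing quote, a comma, and an opening quote.
  # 4. Close the quote at the end of each line.
  # Without -i, sed streams the result to a new file instead of
  # overwriting the source.
  sed 's/"/""/g;s/^/"/;s/|/","/g;s/$/"/' "$src" > "$dest"
}

# To preview the result on a few rows first:
#   head -n 5 yourFileName | sed 's/"/""/g;s/^/"/;s/|/","/g;s/$/"/'
```

Writing to a separate file costs extra disk space, but it means a mistake in the pattern never touches your source data.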

I can develop my utilities in an environment similar to my production environment

As mentioned, Linux-based systems are the primary workhorses of back-end data computing worldwide. For example, popular virtual computing solutions like AWS EC2 support and recommend Linux operating environments. I use such EC2 instances in my work all the time to run custom data utilities. The ability to develop those utilities in the same environment as the runtime environment is immensely helpful.

Getting familiar with Linux systems on your own data workstation first will also shorten the learning curve if you’re new to Linux cloud solutions!
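
One quick way to check how closely your workstation matches a server is to compare the distro name and version on each. The sketch below reads the standard /etc/os-release file; the helper name describe_os is hypothetical, and it assumes the system provides that file (most modern distros do).

```shell
# describe_os is a hypothetical helper name. It prints the distro name and
# version recorded in an os-release file (default: /etc/os-release).
describe_os() (
  # The parentheses run the body in a subshell, so the sourced variables
  # (NAME, VERSION_ID, and so on) do not leak into the calling shell.
  . "${1:-/etc/os-release}"
  # Some rolling-release distros omit VERSION_ID, so fall back to "unknown".
  printf '%s %s\n' "$NAME" "${VERSION_ID:-unknown}"
)
```

Run describe_os on your workstation and again on the remote instance, then compare the two lines.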

Ubuntu is a commonly used AMI in AWS EC2 instances

Give it a try!

A Linux-based operating system like Ubuntu on your data engineering and analytics workstation is a natural step to take if you’re a data professional. It will bring your development closer to your probable production environment and will inherently push you to become a better data developer. You will also find ample avenues for customization that will benefit both your UI/UX and your productivity.

If you have any questions, look for me @xyzologyblog on Twitter or get in touch here!
