By now, TabPy should be installed and connected to your Tableau Desktop v10.1+. This post is focused on the fun part – running Python code directly in Tableau and visualizing outputs!
Python code is inserted using Tableau’s calculated fields so incorporating Python into existing data analysis is straight-forward and familiar for Tableau users. Four functions are available within Tableau Desktop depending on the output of your Python code block.
SCRIPT_BOOL – returns boolean result(s) (TRUE/FALSE)
SCRIPT_INT – returns integer result(s) (e.g. -1, 0, 2)
SCRIPT_REAL – returns real result(s) (e.g. -.25, 0, 3/4)
SCRIPT_STR – returns string result(s) (e.g. “red”, “Wisconsin”)
All four functions follow a similar input format within Tableau’s calculated fields – insert Python code block in quotation marks and then pass in one or more arguments (i.e. fields) from your Tableau data source. I’ll focus on two good examples in this post using Tableau’s provided Superstore data set. For those unfamiliar with the Superstore data, it is customer-level sales data of a fictitious US-based business that sells furniture, office supplies, and technology products. It’s a clean data model that works well to quickly test out Tableau features.
To get a basic understanding of calculation syntax, I built a simple loop in Python to check the sum of profit for each U.S. state and return a Boolean list of whether it was greater than or less than zero. Note that using Python to do this task is overkill as Tableau has a lot more efficient, simpler ways to do this profitability check.
The calculated field starts with SCRIPT_BOOL in this example because I want to return a Python list filled with Boolean types (True or False). The first parameter needed is the actual Python code block – surrounded by quotation marks. My code block creates an empty Python list called ‘newList’ and then loops through each SUM([Profit]) supplied in another list to check if greater than 0.
The text “_arg1” in the Python code block tells Tableau that I will supply the argument (i.e. data field) at the end of my Tableau function. It’s a placeholder essentially. The second parameter of Tableau’s calculated field describes what data to send for _arg1 – in this case, SUM([Profit]). It’s important to understand that these functions only take aggregate inputs. In turn, Tableau is expecting a single column with either one row or as many rows as you passed. The error ‘Invalid JSON’ will often mean that Python’s ‘return’ result doesn’t match the format Tableau is expecting.
As an example, I want to see what profitability looked like in each state.
Tableau first checks the context of the view before passing the SUM([Profit]) list to Python. By adding state, Python receives a list with 49 elements (-2 from Alaska/Hawaii, +1 for D.C.). The beginning of the list in Python looks like [5787, -3428, 4009, 76381, …]. The calculated field loops and build a new list with 49 elements looking like [True, False, True, True, …]. I could easily add in city or zip code-level data and TabPy will recalculate. Being able to change the context of the view with Tableau’s click-and-drag interface is the biggest selling point for using languages like Python or R directly in the tool (discussed more below).
The final example shows off a much more useful scenario for showing off Python in Tableau – correlation coefficients or the linear dependence of two variables. Leveraging an example provided in TabPy’s documentation, calculating the relationship between sales and profits would look like the following:
This calculated field differs from the prior example in that:
SCRIPT_REAL was used because the final result will be a decimal between -1 and 1 based on positive/negative correlation.
Two arguments/inputs are passed (_arg1 and _arg2); both SUM([Sales]) and SUM([Profit]) are referenced respectively.
Python’s NumPy library is imported and used (showcasing a small appetizer of what’s possible using external libraries)
Product category and customer segment were used in my final view to create a neat 3×3 grid to individually analyze relationships between sales and profit. Adding customer names to ‘Details’ brings in more granularity so that each circle represents a customer’s purchase in that particular product category. Tableau then uses the context of the view AND the calculated field’s addressing/partitioning to determine the level in which the Python code is run. In this example, product category and customer segment are on the view. The ‘Correlation Coefficient’ calculated field (a.k.a. Python results) can be changed to calculate at the level of customer name.
In this example, TabPy is now called 9 times to calculate the correlation coefficient for the 3×3 matrix in the view at the level of customer. Without changing the default table calculation of including product category, customer segment, and customer name, the displayed coefficient would be for the entire displayed dataset. While not wrong, the above screenshot is a lot more powerful for testing data hypotheses.
TabPy vs. Preprocessing
TabPy’s release provides a ton of new opportunities to use Python’s powerful libraries directly in Tableau. It’s great to see Tableau continue to provide API support outside of the product. With that said, most organizations I’ve worked with have already been exporting their Python results for visualization in Tableau for awhile. I personally think that pre-processing data through Python scripts outside of Tableau will continue being the norm for a few reasons.
Debugging is helpful. Unless you have a solid understanding of what you want to test directly in Tableau, working within Tableau’s calculated fields gives little options to quickly test/debug code as you incrementally add functionality. Tableau provides error messages, but it’s still clunky development at times
Data science teams likely already have robust environments. Tableau has some great selling points for code maintenance/sharing with the new Python integration. I agree that this is a huge benefit to teams looking to get more into Python use, but most teams I’ve encountered already have fairly stable environments with quality infrastructure to share scripts across the team.
More export opportunities. Having Python results exists outside of Tableau also makes it easier to pass along results to additional Python modules or other tools as needed. Most teams I’ve worked with already have nightly processes set up to import Python output into super-charged Tableau data extracts and that process seems to work well as-is. TabPy would require the users to re-run the analysis in TabPy when the workbook is opened. I haven’t been able to test how caching the result would work on a Tableau Server deployment yet.
In conclusion, I still believe that TabPy has a great place in the Tableau ecosystem. The ability to change the resulting data from Python based on the view’s context is a huge reason why I’d consider using it for real-time data exploration. I think TabPy is also a great step in promoting data science capabilities within Python and R; organizations are noticing the benefits and making investments in using data as a competitive advantage beyond day-to-day operational reporting. I’m really excited to see where these capabilities go next!