Notes from Industry

About a year ago I suggested two peer review processes for data science projects, outlined a structure for the process — including separate review of the research phase and the model design and implementation — and positioned it within the wider scope of a data science project flow (as it is practiced in startup companies). The framework also included a list of topics, pitfalls and questions that should be reviewed.

It was not just a mental exercise. The post was written with the intent of introducing formal peer review processes into the workflow of one of the data science teams I…

The Problem

Ever had a Python project that uses a tool or package that is configured by environment variables to authenticate with some service? When that is the case, we often have our project locally configured to work against the actual “working” service environment, usually using the credentials of our personal user on that service (for example, the S3 of our AWS research account).

However, when we write integration tests for the same project, we might want them to work against an actual environment of that service (and not against a mock of it), but still not against a production…
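To make the situation concrete, here is a minimal sketch of temporarily swapping credential variables for the duration of a test run. It is an illustration only, not necessarily the approach this post goes on to describe: `temp_environ` is a hypothetical helper and the credential values are placeholders. pytest users can get a similar effect with the built-in `monkeypatch.setenv` fixture method.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_environ(**overrides):
    """Temporarily set environment variables, restoring the old values on exit."""
    # Remember the current value (or None if unset) of every variable we touch.
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = old  # put the original value back


# Hypothetical placeholder values standing in for a dedicated test account.
with temp_environ(AWS_ACCESS_KEY_ID="test-key-id", AWS_SECRET_ACCESS_KEY="test-secret"):
    # Code running here sees the test credentials instead of the local ones.
    print(os.environ["AWS_ACCESS_KEY_ID"])
```

Once the `with` block exits, the locally configured values are back in place, so running the tests does not disturb the everyday working setup.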

Making your work more error-proof using peer scrutiny

Peer review is an important part of any creative activity. It is used in research — both inside and outside academia — to ensure the correctness of results, adherence to the scientific method and quality of output. In engineering it is used to provide outside scrutiny and to catch costly errors early on in the process of technology development. Everywhere it is used to improve decision making.

Those of us working in the tech industry are familiar with one particular and very important instance of peer review — the code review process. If you’ve ever received a review you should…

A concise review of the major approaches.

The question of what event caused another, or what brought about a certain change in a phenomenon, is a common one. Examples include whether a drug caused an improvement in some medical condition (versus the placebo effect, additional hospital visits, etc.), tracking down the cause for a malfunction in an assembly line or determining what caused an upsurge in a website’s traffic.

While a naive interpretation of the problem may suggest simple approaches, like equating causality with high correlation or inferring the degree to which x causes y from the degree of x’s goodness as a predictor of y…

A review of notable literature on the topic

Word embedding — the mapping of words into numerical vector spaces — has proved to be an incredibly important method for natural language processing (NLP) tasks in recent years, enabling various machine learning models that rely on vector representation as input to enjoy richer representations of text input. These representations preserve more semantic and syntactic information on words, leading to improved performance in almost every imaginable NLP task.

Both the novel idea itself and its tremendous impact have led researchers to consider the problem of how to provide this boon of richer vector representations to larger units of texts —…

A practical guide to packaging Python code

Say you have a nice piece of Python code: a couple of small related functions, or perhaps even a medium-sized module with a few hundred lines of code. And say that you end up using that piece of code time and again; maybe you keep copying it into different projects or repositories, or you keep importing it from some dedicated utility-code folder you’ve set up in a specific path.

It’s natural — we all keep accumulating these small personal tools as we code, and Python probably enables and encourages this more than the average programming language — and it feels…

Stationarity is an important concept in time series analysis. For a concise (but thorough) introduction to the topic, and the reasons that make it important, take a look at my previous blog post on the topic. Without reiterating too much, suffice it to say that:

  1. Stationarity means that the statistical properties of a time series (or rather the process generating it) do not change over time.
  2. Stationarity is important because many useful analytical tools and statistical tests and models rely on it.
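
As an informal illustration of the first point, the sketch below compares the mean and variance of the first and second halves of two simulated series. This is an eyeball check under assumed toy data, not a formal stationarity test; `half_stats` and both series are hypothetical examples made up for this illustration.

```python
import random
import statistics


def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (
        (statistics.fmean(first), statistics.pvariance(first)),
        (statistics.fmean(second), statistics.pvariance(second)),
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# White noise: independent draws from one fixed distribution (stationary).
white_noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Noise around a linear trend: the mean drifts over time (not stationary).
trended = [0.01 * t + rng.gauss(0.0, 1.0) for t in range(n)]

for name, series in [("white noise", white_noise), ("trended", trended)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

For the white noise series the two halves should look alike, while the trended series shows a clear shift in the mean between halves, which is the kind of change over time that stationarity rules out.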

As such, the ability to determine whether a time series is stationary is important. Rather than…

Testing open-source Python on several operating systems

Say you have an open source Python project or package you are maintaining. You probably want to test it on the major Python versions that are currently in wide use. You definitely should. In some cases you might also need to test it on different operating systems. I’ll discuss both scenarios, and suggest a way to do just that, in this post.

For the sake of this post I’m going to assume you are:

  1. Hosting your open source project on GitHub.
  2. Using pytest to test your code.
  3. Checking for code coverage.
  4. Submitting coverage stats to the free…

A review of the concept and types of stationarity

This post is meant to provide a concise but comprehensive overview of the concept of stationarity and of the different types of stationarity defined in academic literature dealing with time series analysis.

Future posts will aim to provide similarly concise overviews of detection of non-stationarity in time series data and of the different ways to transform non-stationary time series into stationary ones.¹

Why is stationarity important?

Before diving into formal definitions of stationarity, and the related concepts upon which it builds, it is worth considering why the concept of stationarity has become important in time series analysis and its various applications.

In the most…

A data scientist’s take on our process

I was recently asked by a startup I’m consulting for (BigPanda) to give my opinion about the structure and flow of data science projects, which made me think about what makes them unique. Both managers and the different teams in a startup might find the differences between a data science project and a software development one unintuitive and confusing. If not stated and accounted for explicitly, these fundamental differences might cause misunderstanding and clashes between the data scientist and her peers.

Likewise, researchers coming from academia (or highly research-oriented industry research groups) might have their own challenges when arriving at a…

Shay Palachy

Data Science consultant.
