We are working on a new product where one of the key features is web scraping. Python is great for that, but everything we build here uses Elixir and friends. So, we could have a problem, right? Well, maybe not. This post explains what we did to make them work together.

To be clear, we know there are several ways to build state-of-the-art web scrapers in Elixir, but we already have a Python solution that works. We also have other priorities for the product and limited time to build it. Every second counts. 🐻

1. Where do I add the Python code?

In the priv folder. According to the docs, this folder is the place to put all resources that are needed in production, but are not directly part of the source code. We think this is our case, so we’ve put the script there. You may have a different opinion, so feel free to change it as you see fit.

2. How to run it?

We tried two ways. Both have pros and cons, so use your judgment to decide what’s best for your use case. Important: If you expect to get something back from the script, you need to print the result from the Python code in order to capture it in Elixir.
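
To make that concrete, here is a minimal sketch of what the Python side could look like. It is an assumption for illustration, not our actual scraper: it reads a JSON string from the first command-line argument and prints a JSON result to stdout. The build_result helper and the fields it returns are hypothetical; the name, age, and dream keys match the ones used in the Elixir examples in this post.

```python
import json
import sys


def build_result(payload: dict) -> dict:
    """Hypothetical helper: turn the decoded arguments into the result."""
    return {
        "greeting": f"Hello, {payload['name']}!",
        "age_next_year": payload["age"] + 1,
        "dream": payload["dream"],
    }


def main(argv: list[str]) -> int:
    # argv[1] is the JSON string passed on the command line by the caller.
    payload = json.loads(argv[1])
    # Only what is printed to stdout can be captured on the Elixir side.
    print(json.dumps(build_result(payload)))
    return 0


# Only run when an argument was actually given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

Printing JSON instead of free-form text keeps the output easy to decode on the Elixir side, for example with Jason.decode!/1.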

This is the first one:

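defmodule MyModule do
  # Absolute path to priv/python_code inside the :my_app application.
  @python_dir Application.app_dir(:my_app, ["priv", "python_code"])

  def run_script(args) do
    # NOTE: the JSON is wrapped in single quotes for bash, so a value that
    # contains a single quote will break the command.
    case System.cmd(
           "bash",
           [
             "-c",
             "python3.11 code.py '#{prepare_args_for_script(args)}'"
           ],
           cd: @python_dir,
           env: [{"PYTHONPATH", System.find_executable("python3.11")}]
         ) do
      # Exit status 0: return whatever the script printed.
      {result, 0} ->
        result

      # Any other exit status is treated as an error.
      {reason, _} ->
        {:error, reason}
    end
  end

  # Encodes the arguments as a JSON string for the script.
  defp prepare_args_for_script(args) do
    Jason.encode!(%{
      name: args.my_name,
      age: args.my_age,
      dream: args.my_dream
    })
  end
end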
  • System.cmd/3 will execute the command with the given args;
  • We can send additional arguments to the script. In our case, we send a few values as a single JSON string, so we need to encode them first. This is the reason why prepare_args_for_script/1 exists;
  • cd sets the directory the command runs in, which is where the Python code lives;
  • env sets environment variables for the script.

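The command runs when the function is called, and System.cmd/3 blocks until the script finishes. It returns a tuple with the collected output and the exit status. In our case, we expect to get a result from the script, so we match on that tuple in a case...do block.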
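
Since the case block branches on the exit status, the script has to cooperate. The sketch below shows one way to do it, under our own assumptions: the run function and its error message are hypothetical. It exits with 0 on success and 1 on bad input, and it prints the error message to stdout on purpose, because System.cmd/3 does not capture stderr unless you pass stderr_to_stdout: true.

```python
import json
import sys


def run(raw_args: str) -> tuple[int, str]:
    """Return (exit_status, text_for_stdout) for the given JSON argument."""
    try:
        payload = json.loads(raw_args)
        name = payload["name"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Non-zero status: the Elixir side falls into the {reason, _} clause.
        return 1, f"invalid arguments: {exc!r}"
    # Zero status: the Elixir side falls into the {result, 0} clause.
    return 0, json.dumps({"greeting": f"Hello, {name}!"})


# Only run when an argument was actually given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    status, output = run(sys.argv[1])
    print(output)
    sys.exit(status)
```

With this shape, a call with broken JSON ends up as {:error, reason} in Elixir, where reason is the printed message.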

The second way is to use Ports. It works like this:

defmodule MyServer do
  use GenServer
  require Logger

  @python_dir Application.app_dir(:my_app, ["priv", "python_code"])

  def start_link(_args) do
    GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  end

  # Asks the server to start the script and returns immediately.
  def run_external_script(args) do
    GenServer.cast(__MODULE__, {:run_external_script, args})
  end

  @impl true
  def init(_args) do
    # Trap exits so that a closing port arrives as an {:EXIT, port, reason} message.
    Process.flag(:trap_exit, true)

    {:ok, nil}
  end

  @impl true
  def handle_cast({:run_external_script, args}, _state) do
    python_executable = System.find_executable("python3.11")

    port =
      Port.open({:spawn_executable, System.find_executable("bash")}, [
        :binary,
        {:args,
         [
           "-c",
           "python3.11 code.py '#{prepare_args_for_script(args)}'"
         ]},
        {:cd, ~c"#{@python_dir}"},
        {:env, [{~c"PYTHONPATH", ~c"#{python_executable}"}]}
      ])

    # Keep the port as the server state.
    {:noreply, port}
  end

  @impl true
  def handle_info({_port, {:data, _result}}, state) do
    # _result is the output of the script
    # do whatever thing you want to do with it
    {:noreply, state}
  end

  @impl true
  def handle_info({:EXIT, _port, _reason}, _state) do
    Logger.warning("Script exited.")

    {:noreply, nil}
  end

  # Encodes the arguments as a JSON string for the script.
  defp prepare_args_for_script(args) do
    Jason.encode!(%{
      name: args.my_name,
      age: args.my_age,
      dream: args.my_dream
    })
  end
end
  • This version uses a GenServer to invoke the script;
  • We open a port to run the script and save it in the GenServer state;
  • Each time new data is available from the script, the handle_info({port, {:data, result}}, state) callback is invoked;
  • When opening a port, we had to use System.find_executable/1 to locate bash. This is because :spawn_executable does not search the PATH, so it needs the full path to the executable;
  • Since it uses Erlang Ports, we use the ~c sigil for both the :cd and :env options. The options are passed on to Erlang, which expects charlists (Erlang strings) rather than Elixir strings. In Elixir, charlists are written with the ~c sigil or single quotes, while double quotes create binaries.

The great thing about the second version is that the code runs asynchronously, but it is more complex. Again, use your judgment to decide which version is best for your use case.

See you in the next post!