We are working on a new product where one of the key features is web scraping. Python is great for that, but everything we build here uses Elixir and friends. So, we could have a problem, right? Well, maybe not. This post explains what we did to make them work together.
It is important to highlight that we know there are several ways to build state-of-the-art web scrapers in Elixir, but we already have a solution in Python that works. We also have other priorities for the product and limited time to build it. Every second counts. 🐻
1. Where do I add the Python code?
In the priv folder. According to the docs, this folder is the place to put all resources that are needed in production but are not directly part of the source code. We think this is our case, so we’ve put the script there, under priv/python_code. You may have a different opinion, so feel free to change it as you see fit.
2. How to run it?
We tried two ways. Both have pros and cons, so use your judgment to decide what’s best for your use case. Important: if you expect to get something back from the script, you need to print the result from the Python code in order to capture it in Elixir.
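To make the "print the result" rule concrete, here is a minimal sketch of what the Python side could look like. It is not our real scraper: `build_result` and `render_output` are hypothetical names, and the only things taken from this post are the script receiving one JSON argument and printing its answer to stdout.

```python
import json
import sys


def build_result(params: dict) -> dict:
    """Hypothetical stand-in for the real scraping logic."""
    return {
        "greeting": f"Hello, {params.get('name', 'stranger')}!",
        "fields_received": sorted(params),
    }


def render_output(argv: list[str]) -> str:
    # argv[1] is the JSON string that the Elixir side passes on the command line.
    # Falling back to an empty dict is an assumption made for this sketch.
    params = json.loads(argv[1]) if len(argv) > 1 else {}
    return json.dumps(build_result(params))


def main(argv: list[str]) -> None:
    # Whatever is printed to stdout is what Elixir receives as the result.
    print(render_output(argv))


if __name__ == "__main__":
    main(sys.argv)
```

Printing JSON rather than free text is a design choice, not a requirement: it lets the Elixir side decode the output with the same JSON library it used to encode the arguments.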
This is the first one:
    defmodule MyModule do
      @python_dir Application.app_dir(:my_app, ["priv", "python_code"])

      def run_script(args) do
        case System.cmd(
               "bash",
               [
                 "-c",
                 "python3.11 code.py '#{prepare_args_for_script(args)}'"
               ],
               cd: @python_dir,
               env: [{"PYTHONPATH", System.find_executable("python3.11")}]
             ) do
          # Exit status 0: return what the script printed
          {result, 0} ->
            result

          # Any other exit status: treat the output as the error reason
          {reason, _} ->
            {:error, reason}
        end
      end

      # Encodes the arguments as JSON so they travel as a single string
      defp prepare_args_for_script(args) do
        Jason.encode!(%{
          name: args.my_name,
          age: args.my_age,
          dream: args.my_dream
        })
      end
    end
- System.cmd/3 will execute the command with the given args;
- We can send additional arguments to the script. In our case, we are sending a JSON with a few of them, so we need to encode it first. This is why prepare_args_for_script/1 exists;
- cd sets the directory the command runs in, which is where the Python code lives;
- env sets environment variables for the script.
The command runs when the function is called. System.cmd/3 returns a tuple with the script's output and its exit status, so we wrap the call in a case...do block to tell a successful run (exit status 0) from a failed one.
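For the case...do block to be useful, the script has to cooperate by exiting with a non-zero status when something goes wrong. The sketch below shows one way to do that. The function names, the error messages and the choice of exit codes 1 and 2 are all assumptions made for illustration. The one fact it leans on is that System.cmd/3 captures only stdout unless the stderr_to_stdout option is set, so the error text is printed to stdout here.

```python
import json


def run(argv: list[str]) -> tuple[str, int]:
    """Return (text_to_print, exit_code) for the given command-line arguments."""
    if len(argv) < 2:
        return "missing JSON argument", 2
    try:
        params = json.loads(argv[1])
    except json.JSONDecodeError as exc:
        return f"invalid JSON argument: {exc.msg}", 2
    if not isinstance(params, dict) or "name" not in params:
        return "missing required field: name", 1
    return json.dumps({"ok": True, "name": params["name"]}), 0


def main(argv: list[str]) -> int:
    text, exit_code = run(argv)
    # Printed to stdout even on failure, so the Elixir side can capture
    # this text as the error reason.
    print(text)
    return exit_code


# At the bottom of the real script you would hand the code to the OS:
#     import sys
#     sys.exit(main(sys.argv))
```

With this in place, a run with valid arguments exits with status 0, and any other status marks the printed text as an error message.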
The second way is to use Ports. It works like this:
    defmodule MyServer do
      use GenServer
      require Logger

      @python_dir Application.app_dir(:my_app, ["priv", "python_code"])

      def start_link(_args) do
        GenServer.start_link(__MODULE__, nil, name: __MODULE__)
      end

      # `args` is a map with :my_name, :my_age and :my_dream
      def run_external_script(args) do
        GenServer.cast(__MODULE__, {:run_external_script, args})
      end

      @impl true
      def init(_args) do
        # Trap exits so we receive an {:EXIT, port, reason} message when the port closes
        Process.flag(:trap_exit, true)
        {:ok, nil}
      end

      @impl true
      def handle_cast({:run_external_script, args}, _state) do
        python_executable = System.find_executable("python3.11")

        port =
          Port.open({:spawn_executable, System.find_executable("bash")}, [
            :binary,
            {:args,
             [
               "-c",
               "python3.11 code.py '#{prepare_args_for_script(args)}'"
             ]},
            {:cd, ~c"#{@python_dir}"},
            {:env, [{~c"PYTHONPATH", ~c"#{python_executable}"}]}
          ])

        {:noreply, port}
      end

      @impl true
      def handle_info({_port, {:data, _result}}, state) do
        # _result is the output of the script
        # do whatever you want with it
        {:noreply, state}
      end

      @impl true
      def handle_info({:EXIT, _port, _reason}, _state) do
        Logger.warning("Script exited.")
        {:noreply, nil}
      end

      # Same helper as in the first version
      defp prepare_args_for_script(args) do
        Jason.encode!(%{name: args.my_name, age: args.my_age, dream: args.my_dream})
      end
    end
- This version uses a GenServer to invoke the script;
- We open a port to run the script and save it in the GenServer state;
- Each time new data is available from the script, the handle_info({port, {:data, result}}, state) callback is invoked;
- When opening a port, we had to use System.find_executable/1 to get the full path to bash. This is because :spawn_executable does not search the PATH, so it needs the full path to the executable;
- Since it uses Erlang Ports, we need to use the ~c sigil for both the :cd and :env options. This is because these options expect Erlang strings (charlists), which use single quotes, and not Elixir strings, which use double quotes.
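One detail on the Python side matters more with ports than with System.cmd/3: when stdout is not a terminal, Python buffers it, so the data may reach the port only when the script exits unless you flush. The sketch below writes one JSON document per line and flushes after each one. The `emit` function and the sample records are hypothetical and only illustrate the idea.

```python
import json
import sys
from typing import Iterable, TextIO


def emit(records: Iterable[dict], out: TextIO = sys.stdout) -> int:
    """Write one JSON document per line, flushing after each one."""
    count = 0
    for record in records:
        out.write(json.dumps(record) + "\n")
        # Without flush(), the output can sit in Python's buffer until the
        # script ends, and the port would get everything in one late batch.
        out.flush()
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical scraped pages; a real scraper would yield them as it goes.
    emit({"page": n, "status": "done"} for n in range(1, 4))
```

Keep in mind that a port opened with only :binary delivers raw chunks, which are not guaranteed to line up with the lines the script wrote, so the Elixir side may still need to buffer and split on newlines.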
The great thing about the second version is that the code runs asynchronously, but it is more complex. Again, use your judgment to decide which version is best for your use case.
See you in the next post!