Supplement
Some checks failed
Build wheels / build (ubuntu-latest, 3.11) (push) Has been cancelled
Build wheels / build (ubuntu-latest, 3.12) (push) Has been cancelled
Build wheels / build (ubuntu-latest, 3.13) (push) Has been cancelled
Tests / check (push) Has been cancelled
Tests / build (ubuntu-latest, 3.11) (push) Has been cancelled
Tests / build (ubuntu-latest, 3.12) (push) Has been cancelled
Tests / build (ubuntu-latest, 3.13) (push) Has been cancelled
This commit is contained in:
@@ -1,254 +1,53 @@
|
|||||||
# Contest Runners
|
# TN
|
||||||
|
|
||||||
This directory contains two self-contained contest entrypoints:
|
|
||||||
|
|
||||||
- `tools/tn_contest_runner.py`: general tensor-network path search and contraction.
|
|
||||||
- `tools/mps_contest_runner.py`: Vidal/MPS multi-node expectation runner.
|
|
||||||
|
|
||||||
Both scripts keep circuit and observable definitions inside the script so a
|
|
||||||
contest case can be edited in one place.
|
|
||||||
|
|
||||||
## Environment
|
|
||||||
|
|
||||||
Run commands from the repository root:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd /home/yx/qibotn
|
# from the qibotn directory
|
||||||
```
|
I_MPI_FABRICS=shm:ofi \
|
||||||
|
I_MPI_OFI_PROVIDER=tcp \
|
||||||
|
FI_PROVIDER=tcp \
|
||||||
|
CASE=main1 \
|
||||||
|
OBSERVABLES=long_z_string \
|
||||||
|
NQUBITS=34 \
|
||||||
|
NLAYERS=20 \
|
||||||
|
TORCH_THREADS=48 \
|
||||||
|
SEARCH_REPEATS=2048 \
|
||||||
|
SEARCH_TIME=300 \
|
||||||
|
SCHEDULER_HOST=10.20.1.103 \
|
||||||
|
WORKER_HOSTS="10.20.1.103 10.20.6.101" \
|
||||||
|
DASK_ADDRESS="tcp://10.20.1.103:8786" \
|
||||||
|
NWORKERS=84 \
|
||||||
|
NTHREADS=1 \
|
||||||
|
MPIEXEC_FULL="mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2" \
|
||||||
|
tools/run_tn_dask_mpi_all.sh
|
||||||
|
|
||||||
For Intel MPI on two nodes, use the known working style:
|
# run the contraction (contract) step on its own
|
||||||
|
|
||||||
```bash
|
I_MPI_FABRICS=shm:ofi \
|
||||||
mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 ...
|
I_MPI_OFI_PROVIDER=tcp \
|
||||||
```
|
FI_PROVIDER=tcp \
|
||||||
|
mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
|
||||||
Set `TCM_ENABLE=1` for CPU runs:
|
.venv/bin/python -u tools/tn_contest_runner.py contract \
|
||||||
|
|
||||||
```bash
|
|
||||||
export TCM_ENABLE=1
|
|
||||||
```
|
|
||||||
|
|
||||||
## TN Workflow
|
|
||||||
|
|
||||||
List built-in TN contest cases:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python -u tools/tn_contest_runner.py list
|
|
||||||
```
|
|
||||||
|
|
||||||
TN path search uses dask by default. Without `--dask-address`, the script starts
|
|
||||||
a local dask cluster. For multiple servers, start one scheduler and workers
|
|
||||||
with the helper script, then pass the scheduler address to the search command.
|
|
||||||
|
|
||||||
Start the default two-node dask cluster:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd /home/yx/qibotn
|
|
||||||
tools/manage_tn_dask_cluster.sh start
|
|
||||||
```
|
|
||||||
|
|
||||||
Check status:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd /home/yx/qibotn
|
|
||||||
tools/manage_tn_dask_cluster.sh status
|
|
||||||
```
|
|
||||||
|
|
||||||
Stop the cluster:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd /home/yx/qibotn
|
|
||||||
tools/manage_tn_dask_cluster.sh stop
|
|
||||||
```
|
|
||||||
|
|
||||||
The helper defaults are:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
SCHEDULER_HOST=10.20.1.103
|
|
||||||
WORKER_HOSTS="10.20.1.103 10.20.1.102"
|
|
||||||
NWORKERS=48
|
|
||||||
NTHREADS=1
|
|
||||||
ROOT_DIR=/home/yx/qibotn
|
|
||||||
PYTHON_BIN=.venv/bin/python
|
|
||||||
DASK_WORKER_TTL="24 hours"
|
|
||||||
DASK_TICK_LIMIT="30 minutes"
|
|
||||||
DASK_LOST_WORKER_TIMEOUT="30 minutes"
|
|
||||||
```
|
|
||||||
|
|
||||||
Override them inline if needed:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
WORKER_HOSTS="10.20.1.103 10.20.1.102" NWORKERS=48 \
|
|
||||||
tools/manage_tn_dask_cluster.sh restart
|
|
||||||
```
|
|
||||||
|
|
||||||
Check that both nodes are connected by adding `--tn-debug-trials` to a small
|
|
||||||
search. The output should include `qibotn_dask_workers` with both hosts.
|
|
||||||
|
|
||||||
`tools/tn_contest_runner.py search` stops the external dask cluster after the
|
|
||||||
search phase by default. Pass `--keep-dask` if you want to reuse the same dask
|
|
||||||
cluster for several searches.
|
|
||||||
|
|
||||||
Use enough trials to fill the cluster. With the default two-node setup there are
|
|
||||||
96 worker slots, so `--tn-search-repeats` should be at least 96. The contest
|
|
||||||
runner default is 2048.
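
The 96-slot figure follows from the helper defaults: `NWORKERS` worker processes per host, each with `NTHREADS` threads. A minimal sketch of that arithmetic, assuming `NWORKERS` is a per-host count (which the 96-slot figure for two hosts implies); `dask_worker_slots` is an illustrative name, not a function in the repository:

```python
def dask_worker_slots(worker_hosts: str, nworkers: int, nthreads: int) -> int:
    """Total concurrent trial slots, assuming NWORKERS is a per-host count."""
    hosts = worker_hosts.split()
    return len(hosts) * nworkers * nthreads


# Defaults quoted in the text: two hosts, 48 workers each, one thread per worker.
slots = dask_worker_slots("10.20.1.103 10.20.1.102", nworkers=48, nthreads=1)
print(slots)  # 96
```

A `--tn-search-repeats` value below this number would leave some workers without a trial.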
|
|
||||||
|
|
||||||
Cotengra trials are CPU-bound and can hold the Python GIL long enough for dask
|
|
||||||
to report `Event loop was unresponsive`. Dask defaults are much more aggressive:
|
|
||||||
`scheduler.worker-ttl=5 minutes`, `admin.tick.limit=3s`, and
|
|
||||||
`deploy.lost-worker-timeout=15s`. The helper script raises these limits so
|
|
||||||
workers are not killed by dask during search. The intended timeout is
|
|
||||||
`--tn-search-time`; after that, the runner stops the external dask cluster.
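
One common way to raise such limits is through dask's `DASK_*` environment variables, in which nesting is written with double underscores and hyphens become underscores. The sketch below only derives the variable names for the three settings, paired with the helper defaults listed earlier. Whether `manage_tn_dask_cluster.sh` applies them this way is an assumption, and `dask_env_name` is an illustrative helper, not a dask API:

```python
def dask_env_name(config_key: str) -> str:
    """Map a dask config key to its DASK_* environment variable name."""
    return "DASK_" + config_key.upper().replace(".", "__").replace("-", "_")


# Values are the helper defaults quoted earlier. The "distributed." prefix is
# the full config path of the abbreviated key names used in the text.
limits = {
    "distributed.scheduler.worker-ttl": "24 hours",
    "distributed.admin.tick.limit": "30 minutes",
    "distributed.deploy.lost-worker-timeout": "30 minutes",
}

for key, value in limits.items():
    print(f'{dask_env_name(key)}="{value}"')
```

The printed lines have the form of shell assignments that could be exported before starting the scheduler and workers.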
|
|
||||||
|
|
||||||
Small correctness check against statevector:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python -u tools/tn_contest_runner.py validate \
|
|
||||||
--case main1 \
|
|
||||||
--nqubits 8 \
|
|
||||||
--nlayers 2 \
|
|
||||||
--torch-threads 4 \
|
|
||||||
--tn-search-repeats 8 \
|
|
||||||
--tn-search-time 5
|
|
||||||
```
|
|
||||||
|
|
||||||
Search and save contraction trees:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
TCM_ENABLE=1 python -u tools/tn_contest_runner.py search \
|
|
||||||
--case main1 \
|
|
||||||
--torch-threads 48 \
|
|
||||||
--dtype complex64 \
|
|
||||||
--dask-address tcp://10.20.1.103:8786 \
|
|
||||||
--tn-search-repeats 2048 \
|
|
||||||
--tn-search-time 300
|
|
||||||
```
|
|
||||||
|
|
||||||
Contract using the saved tree on one node:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
TCM_ENABLE=1 mpirun -np 2 python -u tools/tn_contest_runner.py contract \
|
|
||||||
--mpi \
|
--mpi \
|
||||||
--case main1 \
|
--case main1 \
|
||||||
|
--nqubits 34 \
|
||||||
|
--nlayers 20 \
|
||||||
|
--observables long_z_string \
|
||||||
|
--tree-dir trees/contest_tn \
|
||||||
--torch-threads 48 \
|
--torch-threads 48 \
|
||||||
--dtype complex64
|
--dtype complex64
|
||||||
```
|
```
|
||||||
|
|
||||||
Contract using the saved tree on two nodes:
|
# MPS
|
||||||
|
|
||||||
```bash
|
|
||||||
TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
|
|
||||||
python -u tools/tn_contest_runner.py contract \
|
|
||||||
--mpi \
|
|
||||||
--case main1 \
|
|
||||||
--torch-threads 48 \
|
|
||||||
--dtype complex64
|
|
||||||
```
|
```
|
||||||
|
cd /home/yx/qibotn
|
||||||
|
|
||||||
Run search and contract in one command:
|
I_MPI_FABRICS=shm:ofi \
|
||||||
|
I_MPI_OFI_PROVIDER=tcp \
|
||||||
```bash
|
FI_PROVIDER=tcp \
|
||||||
TCM_ENABLE=1 python -u tools/tn_contest_runner.py all \
|
MPIEXEC_FULL="mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2" \
|
||||||
--case main1 \
|
TORCH_THREADS=48 \
|
||||||
--torch-threads 48 \
|
OBS_FILTER=ring_xz \
|
||||||
--dtype complex64 \
|
MAIN1_NQ=128 \
|
||||||
--dask-address tcp://10.20.1.103:8786 \
|
MAIN1_LAYERS=24 \
|
||||||
--tn-search-repeats 2048 \
|
MAIN1_BOND=1024 \
|
||||||
--tn-search-time 300
|
tools/run_vidal_mpi_contest_cases.sh main1
|
||||||
```
|
```
|
||||||
|
|
||||||
Run only selected observables:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python -u tools/tn_contest_runner.py search \
|
|
||||||
--case main2 \
|
|
||||||
--observables open_zz
|
|
||||||
```
|
|
||||||
|
|
||||||
Tree files are written to `trees/contest_tn/` by default. The tree filename
|
|
||||||
contains case, observable, qubit count, layer count, and target slice count.
|
|
||||||
If any of these change, search again.
|
|
||||||
|
|
||||||
Edit TN contest cases in `tools/tn_contest_runner.py`:
|
|
||||||
|
|
||||||
- `CASES`: case name, circuit kind, observable list, default scale.
|
|
||||||
- `build_circuit`: circuit definitions.
|
|
||||||
- `pauli_sum_observable`: observable definitions.
|
|
||||||
|
|
||||||
## MPS Workflow
|
|
||||||
|
|
||||||
List built-in Vidal/MPS contest cases:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python -u tools/mps_contest_runner.py list
|
|
||||||
```
|
|
||||||
|
|
||||||
Small correctness check against statevector:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
mpirun -np 2 python -u tools/mps_contest_runner.py validate \
|
|
||||||
--case main1 \
|
|
||||||
--nqubits 8 \
|
|
||||||
--nlayers 2 \
|
|
||||||
--bond 64 \
|
|
||||||
--torch-threads 4
|
|
||||||
```
|
|
||||||
|
|
||||||
Run one MPS case on one node:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
TCM_ENABLE=1 mpirun -np 2 python -u tools/mps_contest_runner.py run \
|
|
||||||
--case main1 \
|
|
||||||
--torch-threads 48
|
|
||||||
```
|
|
||||||
|
|
||||||
Run one MPS case on two nodes:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
|
|
||||||
python -u tools/mps_contest_runner.py run \
|
|
||||||
--case main1 \
|
|
||||||
--torch-threads 48
|
|
||||||
```
|
|
||||||
|
|
||||||
Run only one observable:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
|
|
||||||
python -u tools/mps_contest_runner.py run \
|
|
||||||
--case main1 \
|
|
||||||
--observables ring_xz \
|
|
||||||
--torch-threads 48
|
|
||||||
```
|
|
||||||
|
|
||||||
Override scale:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
|
|
||||||
python -u tools/mps_contest_runner.py run \
|
|
||||||
--case main1 \
|
|
||||||
--nqubits 128 \
|
|
||||||
--nlayers 24 \
|
|
||||||
--bond 1024 \
|
|
||||||
--torch-threads 48
|
|
||||||
```
|
|
||||||
|
|
||||||
Edit MPS contest cases in `tools/mps_contest_runner.py`:
|
|
||||||
|
|
||||||
- `CASES`: case name, circuit kind, observable list, default scale and bond.
|
|
||||||
- `build_circuit`: circuit definitions.
|
|
||||||
- `observable`: observable definitions, including dense local terms.
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- TN uses path search plus contraction. Reuse tree files only for the exact same
|
|
||||||
circuit, observable, qubit count, layer count, seed, and slicing setup.
|
|
||||||
- TN path search defaults to dask. Use `--tn-search-backend processpool` only
|
|
||||||
for fallback/debugging.
|
|
||||||
- Prefer the default `--tn-target-size 4294967296` memory target. Do not force
|
|
||||||
`--tn-target-slices` unless you have already verified that cotengra can find
|
|
||||||
valid trees for that exact setting.
|
|
||||||
- MPS/Vidal does not use contraction-tree search. It runs the circuit directly
|
|
||||||
and reports `trunc_sum` and `trunc_max`.
|
|
||||||
- Default TN contraction is the stable torch/quimb path. Do not pass
|
|
||||||
`--tn-contract-implementation cpp` for contest runs.
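
To put the default memory target in the notes above into perspective, the sketch below converts it to GiB. It assumes the value counts tensor elements of the largest intermediate (the convention cotengra uses for its slicing `target_size`); if the runner interprets the number differently, the figures do not apply. `target_size_gib` is an illustrative name:

```python
import math

BYTES_PER_ELEMENT = {"complex64": 8, "complex128": 16}


def target_size_gib(target_size: int, dtype: str) -> float:
    """Footprint of the largest intermediate in GiB, assuming target_size counts elements."""
    return target_size * BYTES_PER_ELEMENT[dtype] / 2**30


default_target = 4294967296
print(math.log2(default_target))                       # 32.0
print(target_size_gib(default_target, "complex64"))    # 32.0
print(target_size_gib(default_target, "complex128"))   # 64.0
```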
|
|
||||||
@@ -41,11 +41,19 @@ def _bind_numa_node(rank):
     Returns the NUMA domain that was selected, or ``None`` if the binding
     could not be determined.
     """
+    current_affinity = os.sched_getaffinity(0)
+    online_cpus = set(range(os.cpu_count() or 1))
+    if current_affinity and current_affinity != online_cpus:
+        # MPI launchers such as Intel MPI often pin local ranks correctly
+        # before Python starts. Do not narrow that placement further.
+        return None
+
     local_rank = rank
     for name in (
         "OMPI_COMM_WORLD_LOCAL_RANK",
         "MV2_COMM_WORLD_LOCAL_RANK",
         "MPI_LOCALRANKID",
+        "I_MPI_LOCAL_RANK",
         "SLURM_LOCALID",
     ):
         try:
@@ -54,13 +62,27 @@ def _bind_numa_node(rank):
         except (KeyError, ValueError):
             pass
 
-    domain = local_rank % 2
-    cpulist = f"/sys/devices/system/node/node{domain}/cpulist"
+    domains = _available_numa_domains()
+    if not domains:
+        return None
+
+    local_size = _local_world_size()
+    assigned_domains = domains[local_rank::local_size]
+    if not assigned_domains:
+        assigned_domains = [domains[local_rank % len(domains)]]
+
+    domain = assigned_domains[0]
+    cpus = set()
+    for selected in assigned_domains:
+        cpulist = f"/sys/devices/system/node/node{selected}/cpulist"
+        try:
+            cpus.update(_parse_cpu_list(open(cpulist, encoding="utf-8").read().strip()))
+        except (FileNotFoundError, OSError):
+            pass
     try:
-        cpus = _parse_cpu_list(open(cpulist, encoding="utf-8").read().strip())
         if cpus:
             os.sched_setaffinity(0, cpus)
-    except (FileNotFoundError, OSError):
+    except OSError:
         pass
 
     try:
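
The strided slice `domains[local_rank::local_size]` in the hunk above spreads the node's NUMA domains across its local ranks, and the modulo fallback covers the case of more ranks than domains. The sketch below isolates that selection as a pure function so it can be tried without `/sys` or MPI; `assign_domains` is an illustrative name, not a function in the patch:

```python
def assign_domains(domains, local_rank, local_size):
    """Pick NUMA domains for one local rank, mirroring the slicing in the patch."""
    assigned = domains[local_rank::local_size]
    if not assigned:
        # More local ranks than domains: fall back to round-robin sharing.
        assigned = [domains[local_rank % len(domains)]]
    return assigned


# Four domains, two ranks per node: each rank gets two domains.
print(assign_domains([0, 1, 2, 3], 0, 2))  # [0, 2]
print(assign_domains([0, 1, 2, 3], 1, 2))  # [1, 3]
# Two domains, four ranks per node: ranks 2 and 3 share domains 0 and 1.
print(assign_domains([0, 1], 2, 4))        # [0]
print(assign_domains([0, 1], 3, 4))        # [1]
```

When no local-size variable is set, the patch uses a local size of 1, so rank 0 is assigned every domain.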
@@ -76,6 +98,38 @@ def _bind_numa_node(rank):
     return domain
 
 
+def _available_numa_domains():
+    nodes = []
+    base = Path("/sys/devices/system/node")
+    try:
+        for path in base.glob("node[0-9]*"):
+            try:
+                nodes.append(int(path.name[4:]))
+            except ValueError:
+                pass
+    except OSError:
+        return []
+    return sorted(nodes)
+
+
+def _local_world_size():
+    for name in (
+        "OMPI_COMM_WORLD_LOCAL_SIZE",
+        "MV2_COMM_WORLD_LOCAL_SIZE",
+        "MPI_LOCALNRANKS",
+        "I_MPI_LOCAL_SIZE",
+        "SLURM_NTASKS_PER_NODE",
+    ):
+        value = os.environ.get(name)
+        if not value:
+            continue
+        try:
+            return max(1, int(str(value).split("(", 1)[0]))
+        except ValueError:
+            pass
+    return 1
+
+
 def _parse_cpu_list(text):
     cpus = set()
     for item in text.split(","):
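
The `split("(", 1)[0]` step in `_local_world_size` tolerates values that carry a parenthesised suffix; Slurm, for example, can report per-node task counts in a compressed form such as `2(x3)`. The sketch below isolates that parsing rule. `parse_local_size` is an illustrative name, and returning `None` on failure stands in for the patch moving on to the next environment variable:

```python
def parse_local_size(value):
    """Parse a per-node rank count, tolerating a '(xN)' style suffix."""
    try:
        return max(1, int(str(value).split("(", 1)[0]))
    except ValueError:
        return None


print(parse_local_size("2"))      # 2
print(parse_local_size("2(x3)"))  # 2
print(parse_local_size("0"))      # 1
print(parse_local_size("auto"))   # None
```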
@@ -745,6 +745,12 @@ def _contract_mpi(
     is_torch = backend == "torch"
     nslices = int(getattr(tree, "multiplicity", 1))
     stats = SlicedContractStats(rank, size, nslices, 0, assignment)
+    nslices_by_rank = comm.allgather(nslices)
+    if len(set(nslices_by_rank)) != 1:
+        raise RuntimeError(
+            "Inconsistent contraction tree slices across MPI ranks: "
+            f"{nslices_by_rank}. Ensure all nodes load the same tree file."
+        )
 
     if not set(getattr(tree, "sliced_inds", ())).isdisjoint(set(getattr(tree, "output", ()))):
         raise NotImplementedError(
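
Because `allgather` returns the same list on every rank, each rank reaches the same verdict in the check above, so all ranks raise together instead of some of them blocking in a later collective. The sketch below reproduces the test on a plain list standing in for the gathered result, so it runs without mpi4py; `check_uniform` is an illustrative name, not a function in the patch:

```python
def check_uniform(values_by_rank):
    """Raise if the gathered per-rank values are not all identical."""
    if len(set(values_by_rank)) != 1:
        raise RuntimeError(
            f"Inconsistent contraction tree slices across MPI ranks: {values_by_rank}"
        )
    return values_by_rank[0]


print(check_uniform([64, 64, 64, 64]))  # 64

try:
    # One rank loaded a different (for example stale) tree file.
    check_uniform([64, 64, 32, 64])
except RuntimeError as exc:
    print(exc)
```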
@@ -5,7 +5,7 @@ set -euo pipefail
 #
 # Defaults target two servers:
 # scheduler: 10.20.1.103:8786
-# workers: 10.20.1.103, 10.20.1.102
+# workers: 10.20.1.103, 10.20.6.101
 #
 # Usage:
 # tools/manage_tn_dask_cluster.sh start
@@ -14,7 +14,7 @@ set -euo pipefail
 #
 # Common overrides:
 # SCHEDULER_HOST=10.20.1.103
-# WORKER_HOSTS="10.20.1.103 10.20.1.102"
+# WORKER_HOSTS="10.20.1.103 10.20.6.101"
 # NWORKERS=48
 # NTHREADS=1
 # ROOT_DIR=/home/yx/qibotn
@@ -25,8 +25,8 @@ PYTHON_BIN="${PYTHON_BIN:-.venv/bin/python}"
 SCHEDULER_HOST="${SCHEDULER_HOST:-10.20.1.103}"
 SCHEDULER_PORT="${SCHEDULER_PORT:-8786}"
 DASHBOARD_ADDRESS="${DASHBOARD_ADDRESS:-:8787}"
-WORKER_HOSTS="${WORKER_HOSTS:-10.20.1.103 10.20.1.102}"
-NWORKERS="${NWORKERS:-48}"
+WORKER_HOSTS="${WORKER_HOSTS:-10.20.1.103 10.20.6.101}"
+NWORKERS="${NWORKERS:-84}"
 NTHREADS="${NTHREADS:-1}"
 MEMORY_LIMIT="${MEMORY_LIMIT:-0}"
 LOCAL_DIRECTORY="${LOCAL_DIRECTORY:-/tmp/qibotn-dask}"
93
tools/run_tn_dask_mpi_all.sh
Executable file
@@ -0,0 +1,93 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$ROOT_DIR"
+
+CASE="${CASE:-main1}"
+OBSERVABLES="${OBSERVABLES:-long_z_string}"
+NQUBITS="${NQUBITS:-34}"
+NLAYERS="${NLAYERS:-20}"
+TORCH_THREADS="${TORCH_THREADS:-48}"
+SEARCH_REPEATS="${SEARCH_REPEATS:-2048}"
+SEARCH_TIME="${SEARCH_TIME:-300}"
+TN_TARGET_SIZE="${TN_TARGET_SIZE:-8589934592}"
+TN_TARGET_SLICES="${TN_TARGET_SLICES:-}"
+
+PYTHON_BIN="${PYTHON_BIN:-.venv/bin/python}"
+DTYPE="${DTYPE:-complex64}"
+TREE_DIR="${TREE_DIR:-trees/contest_tn}"
+DASK_ADDRESS="${DASK_ADDRESS:-tcp://10.20.1.103:8786}"
+MPIEXEC_FULL="${MPIEXEC_FULL:-mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2}"
+SYNC_TREES="${SYNC_TREES:-1}"
+SYNC_HOSTS="${SYNC_HOSTS:-${WORKER_HOSTS:-}}"
+SSH_BIN="${SSH_BIN:-ssh}"
+
+export TCM_ENABLE="${TCM_ENABLE:-1}"
+
+tn_slice_args=(--tn-target-size "$TN_TARGET_SIZE")
+if [[ -n "$TN_TARGET_SLICES" ]]; then
+    tn_slice_args+=(--tn-target-slices "$TN_TARGET_SLICES")
+fi
+
+is_local_host() {
+    local host="$1"
+    [[ "$host" == "localhost" || "$host" == "127.0.0.1" ]] && return 0
+    [[ "$host" == "$(hostname)" ]] && return 0
+    [[ "$host" == "$(hostname -f 2>/dev/null || true)" ]] && return 0
+    hostname -I 2>/dev/null | tr ' ' '\n' | grep -qx "$host"
+}
+
+sync_trees_to_hosts() {
+    [[ "$SYNC_TREES" == "1" ]] || return 0
+    [[ -n "$SYNC_HOSTS" ]] || return 0
+
+    local src_dir="$TREE_DIR"
+    local dst_dir="$TREE_DIR"
+    if [[ "$TREE_DIR" != /* ]]; then
+        src_dir="$ROOT_DIR/$TREE_DIR"
+        dst_dir="$ROOT_DIR/$TREE_DIR"
+    fi
+
+    for host in $SYNC_HOSTS; do
+        is_local_host "$host" && continue
+        echo "Sync tree dir to $host:$dst_dir"
+        "$SSH_BIN" "$host" "mkdir -p $(printf '%q' "$dst_dir")"
+        if command -v rsync >/dev/null 2>&1; then
+            rsync -a "$src_dir/" "$host:$dst_dir/"
+        else
+            scp -q "$src_dir"/*.pkl "$host:$dst_dir/"
+        fi
+    done
+}
+
+tools/manage_tn_dask_cluster.sh start
+
+echo "Search with dask: $DASK_ADDRESS"
+"$PYTHON_BIN" -u tools/tn_contest_runner.py search \
+    --case "$CASE" \
+    --nqubits "$NQUBITS" \
+    --nlayers "$NLAYERS" \
+    --observables $OBSERVABLES \
+    --tree-dir "$TREE_DIR" \
+    --dask-address "$DASK_ADDRESS" \
+    --torch-threads "$TORCH_THREADS" \
+    --dtype "$DTYPE" \
+    --tn-search-repeats "$SEARCH_REPEATS" \
+    --tn-search-time "$SEARCH_TIME" \
+    "${tn_slice_args[@]}"
+
+sync_trees_to_hosts
+
+echo "Contract with MPI: $MPIEXEC_FULL"
+read -r -a mpi_prefix <<< "$MPIEXEC_FULL"
+"${mpi_prefix[@]}" "$PYTHON_BIN" -u tools/tn_contest_runner.py contract \
+    --mpi \
+    --case "$CASE" \
+    --nqubits "$NQUBITS" \
+    --nlayers "$NLAYERS" \
+    --observables $OBSERVABLES \
+    --tree-dir "$TREE_DIR" \
+    --torch-threads "$TORCH_THREADS" \
+    --dtype "$DTYPE" \
+    "${tn_slice_args[@]}"
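
The script above turns the `MPIEXEC_FULL` string into an argument array with `read -r -a`, which splits on whitespace only, and then prefixes the contract command with it. The sketch below shows the same idea in Python using `shlex.split`, which additionally honours shell-style quoting; `build_contract_command` is an illustrative name, not part of the repository:

```python
import shlex


def build_contract_command(mpiexec_full, python_bin, runner_args):
    """Prefix the contract invocation with the launcher words from MPIEXEC_FULL."""
    launcher = shlex.split(mpiexec_full)
    runner = [python_bin, "-u", "tools/tn_contest_runner.py", "contract"]
    return launcher + runner + list(runner_args)


cmd = build_contract_command(
    "mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2",
    ".venv/bin/python",
    ["--mpi", "--case", "main1"],
)
print(cmd[:3])  # ['mpirun', '-np', '4']
print(len(cmd))  # 14
```

Keeping the launcher as a list rather than a single string means no word of it is re-interpreted by a shell when the command is executed.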
@@ -199,7 +199,7 @@ def build_parallel_opts(args, tree_file=None, search_only=False):
         "search_workers": args.tn_search_workers or args.torch_threads,
         "max_repeats": args.tn_search_repeats,
         "max_time": args.tn_search_time,
-        "print_stats": not args.no_tn_stats,
+        "print_stats": False,
     }
     if args.tn_search_backend is not None:
         opts["search_backend"] = args.tn_search_backend
@@ -303,7 +303,7 @@ def run_one(args, case_name, obs_name, mode):
             f"failed_trials={search_stats.get('failed_trials', 'na')} "
             f"requested_trials={search_stats.get('requested_trials', 'na')} "
             f"best_score={search_stats.get('best_score', float('nan')):.6g} "
-            f"slices={cost.get('slices')} "
+            f"slices={cost.get('nslices')} "
             f"log10_flops={cost.get('log10_flops', float('nan')):.3f} "
             f"log10_write={cost.get('log10_write', float('nan')):.3f} "
             f"log2_size={cost.get('log2_size', float('nan')):.3f} "
@@ -337,6 +337,11 @@ def apply_case_defaults(args):
 def stop_dask_cluster(args):
     if args.keep_dask or args.tn_search_backend != "dask" or not args.dask_address:
         return
+    if args.mpi:
+        from mpi4py import MPI
+
+        if MPI.COMM_WORLD.Get_rank() != 0:
+            return
     script = ROOT / "tools" / "manage_tn_dask_cluster.sh"
     if not script.exists():
         print(f"dask_stop_skipped reason=missing_script path={script}", flush=True)
Binary file not shown.